This is an abbreviated and updated version of a presentation from the Graphwise Graph AI Summit 2025 by Ben Gardner, R&D lead for Data Mesh and Semantic Infrastructure at AstraZeneca.
Our understanding of disease evolves constantly, and with it, our ability to treat patients effectively. A century ago, cancer was considered a single disease. Then we refined our understanding of cancers of specific organs – cancer of the lung, for instance. Later, we distinguished between different cell types, like non-small cell lung cancer. Today, we examine individual mutational profiles of cancers and develop medicines targeting specific subpopulations.
This evolution creates a fundamental challenge: as our understanding grows more precise, we need to understand patients participating in clinical studies in increasingly defined ways. We need to identify the patient and the exact subpopulation they represent within a disease. This precision enables more targeted treatment and better success rates.
But there’s a problem – our data has always been captured in verticals supporting specific processes. What precision medicine requires is horizontal analysis across those silos. That’s why we turned to knowledge graph technology.
Building Scientific Intelligence: a use case-driven approach
At AstraZeneca, we built a tool called Scientific Intelligence to address this challenge. Our approach from the start has been to remain use case-driven. With the enormous volume of data available both internally and publicly, trying to “boil the ocean” simply doesn’t work. We recognized early that science evolves, questions evolve, and everything must be modeled on what is being asked and what the data shows us – not what we could theoretically model.
Scientists ask complex questions: “Find me subjects who participated in studies where the indication was non-small cell lung cancer, where the drug was Tagrisso, where they had adverse event X, with CT scans of their lung, but with a genetic profile of Y.” These queries span many different data modalities.
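In graph terms, a question like that is a single conjunction of constraints that happens to span clinical, safety, imaging, and genomic records. A toy sketch in plain Python (every field name and value below is invented for illustration, not AstraZeneca’s actual schema) shows the shape of the filter:

```python
# Hypothetical toy records standing in for linked study/subject/sample data.
# All field names and values are invented for illustration.
subjects = [
    {"id": "S1", "study_indication": "NSCLC", "drug": "Tagrisso",
     "adverse_events": {"rash"}, "imaging": {"lung CT"}, "mutations": {"EGFR T790M"}},
    {"id": "S2", "study_indication": "NSCLC", "drug": "Tagrisso",
     "adverse_events": {"nausea"}, "imaging": {"chest X-ray"}, "mutations": {"KRAS G12C"}},
]

def matches(subj, indication, drug, adverse_event, image_type, mutation):
    """One conjunction of constraints spanning several data modalities."""
    return (subj["study_indication"] == indication
            and subj["drug"] == drug
            and adverse_event in subj["adverse_events"]
            and image_type in subj["imaging"]
            and mutation in subj["mutations"])

hits = [s["id"] for s in subjects
        if matches(s, "NSCLC", "Tagrisso", "rash", "lung CT", "EGFR T790M")]
print(hits)  # -> ['S1']
```

The hard part in practice is not the conjunction itself but the fact that each clause lives in a different vertical silo; the knowledge graph is what makes them joinable at all.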
Our strategy leverages AstraZeneca’s data mesh architecture, where platforms aggregate and manage data around specific disciplines. We have clinical data in SDTM format, omics data including gene expression and RNA levels, imaging data covering both medical images like CT scans and digital pathology slides, and, critically, sample information about specimens from trial subjects. In principle, we serialize all of this into what we call a “knowledge map” – we use that term with our customers because we are driving navigation of the space. We then surface this knowledge graph through a front end, enabling relatively easy exploration.
The goal is simple: help people find patients or samples matching their profile, then submit compliant requests for data access. Since we work with some of the most ethically and privacy-sensitive data the company holds, we operate in a very compliant fashion. We generally show information about what happened and observations made, rather than the actual observations themselves.
From studies to individual observations
The knowledge graph centers on a few major nodes: the clinical study with everything we can say about it, the subjects who participated, the samples taken from those subjects, and the observations made on those samples or subjects.
We provide summary statistics around studies by indication and drug. For individual studies, we offer a 360-degree view including the title, drug, indication, status, number of patients recruited, milestones completed, and links to critical documents like the clinical study protocol. Moving to the subject level, we have connections radiating outward. We can provide summary views showing the number of adverse events or total lab tests performed, but also drill into individual subject demographics, adverse events, and lab tests. This is where researchers really start mining subpopulations and specifying the exact group they want to examine.
At the sample level, we display inventory information – what’s still available to order, what type of sample was taken (plasma, blood, biopsy), where in the body it originated. Finally, at the individual observation level, we can detail the number and types of images available, tumor types observed, and stains performed. We can get remarkably granular, starting at the clinical study and drilling down through subject, sample, and omics data.
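The study → subject → sample → observation spine described above can be sketched as a minimal data model, with summary statistics rolled up from the leaves of the graph (class and field names here are illustrative, not the production schema):

```python
from dataclasses import dataclass, field

@dataclass
class Observation:
    kind: str          # e.g. "lab test", "adverse event", "CT image"
    detail: str

@dataclass
class Sample:
    sample_type: str   # plasma, blood, biopsy, ...
    body_site: str
    available: bool    # inventory: still available to order?
    observations: list[Observation] = field(default_factory=list)

@dataclass
class Subject:
    subject_id: str
    samples: list[Sample] = field(default_factory=list)
    observations: list[Observation] = field(default_factory=list)

@dataclass
class Study:
    title: str
    drug: str
    indication: str
    subjects: list[Subject] = field(default_factory=list)

def adverse_event_count(study: Study) -> int:
    """A study-level summary statistic rolled up from subject observations."""
    return sum(1 for subj in study.subjects
                 for obs in subj.observations
                 if obs.kind == "adverse event")

study = Study("Example trial", "Tagrisso", "NSCLC",
              subjects=[Subject("S1", observations=[Observation("adverse event", "rash")])])
print(adverse_event_count(study))  # -> 1
```

The same traversal pattern supports both directions described above: summary views aggregate upward from observations, while drill-down simply follows the edges from study to subject to sample.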
The user interface presents dashboards for clinical studies showing where trials run globally and facets for filtering – by therapeutic area, indication, disease, drugs used, and development phase. Clicking facets builds queries with constraints. A researcher might specify: “Show me only oncology studies where non-small cell lung cancer was the indication, which are phase three.” This continuously narrows the result set. The interface includes an abstraction of the graph where circles represent nodes connected to clinical studies, allowing navigation across the graph to other dashboards covering subjects, samples, and observations.
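Mechanically, each clicked facet adds one more constraint, so the result set can only narrow. A hypothetical sketch of that accumulation (study records and facet names invented):

```python
# Invented study records standing in for the faceted dashboard's backing data.
studies = [
    {"id": "A", "ta": "Oncology", "indication": "NSCLC", "phase": 3},
    {"id": "B", "ta": "Oncology", "indication": "NSCLC", "phase": 2},
    {"id": "C", "ta": "Respiratory", "indication": "Asthma", "phase": 3},
]

# Each clicked facet appends one (field, value) constraint.
facets = [("ta", "Oncology"), ("indication", "NSCLC"), ("phase", 3)]

# Apply constraints in order; the result set narrows monotonically.
result = studies
for key, value in facets:
    result = [s for s in result if s[key] == value]

print([s["id"] for s in result])  # -> ['A']
```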
Unlocking unknown knowns
But the real value becomes clear only through examples. A data scientist recently approached us wanting to build a predictive model for recovery in chemotherapy patients. Some patients experience white blood cell count drops requiring them to stop chemotherapy. Predicting this allows putting them on recovery therapy so they can take additional rounds – a massive impact on cancer survival chances.
The data scientist worked with a particular drug product team and knew of two or three studies likely to hold relevant subjects, but worried about not having enough subjects in the subpopulation to train the machine learning model. Using Scientific Intelligence, we identified five additional studies they weren’t aware of that contained the right population. This gave them confidence to request access to those studies, knowing the number of subjects would be sufficient for their model.
“This was hugely enabling because it [the complex search process] has previously been a very manual process and it could take weeks to do. We went from weeks to minutes.” – Ben Gardner, R&D lead for Data Mesh and Semantic Infrastructure at AstraZeneca
For study design, researchers often need to know variance in liver tests or blood pressure for particular subpopulations – analysis that used to take weeks now takes hours. We can provide the data, though they still complete the final statistical analysis.
Landscape reviews represent another powerful application. These enable therapeutic areas to understand what data and samples they have available based on different subpopulations. This has proven transformative in oncology, changing how teams prosecute their drug programs.
These reviews used to take months to build and went out of date as soon as new data came in – a continuous “painting the Forth Bridge” exercise. Now we build queries that update in real time. Every time new data becomes available, they get automatic updates.
Teaching people to think differently
An interesting challenge emerged with our scientists. Most have no experience querying knowledge graphs. When we said, “We’ve pulled these data sources together into a graph and you can ask really interesting questions,” their first response was: “I don’t know what I can ask or what I should ask, because I’ve never been able to ask this sort of question before.”
This created a chicken-and-egg scenario: because they knew they couldn’t ask these questions previously, they had never bothered thinking about them. So now we are doing extensive education to help people think differently and recognize new possibilities.
Another problem is that the questions that generate efficiency savings aren’t asked daily by each group. That data scientist asking about the predictive model for white blood cell drops in chemotherapy treatments might ask that question once every six months. By the time they need it again, they’ve forgotten how to use the system. We need better ways of enabling people to interact with and query the data.
We can build beautiful, complicated data structures, but the challenge is making them navigable and consumable for complex questions. I firmly believe we need a return of librarians. When I did my PhD, librarians helped me navigate a large building full of information with a non-intuitive indexing strategy. We need the equivalent for these complicated knowledge graphs we are building.
Scaling through pragmatism
Managing over 20 different data sources and modalities to create that unified picture is no small task. Initially, we handled everything from data capture through cleanup and curation, serializing into data types, integrating into the graph, and exposing through Scientific Intelligence. We could do it, but we burned out. The team size needed to continue evolving and handling this diversity far exceeded our funding capacity.
We asked three simple questions:
- How can we maximize our engineering skills to focus semantic engineers on what’s important while getting others to do standardization?
- How do we better manage the graph ecosystem – updating infrastructure, orchestrating the smorgasbord of tooling, evolving from virtualized to materialized data?
- How do we democratize the graph to people, making it useful not just through the Scientific Intelligence UI but through other data forms?
The smart answer was to accept that classic SQL-type data cleanup – using dbt and Snowflake with SQL-trained data engineers – could accomplish significant work. We push the model into that data flow upfront – minting URIs, applying controlled vocabularies, flattening data – all in SQL. This becomes a data product we can serialize into the knowledge graph.
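As a rough sketch of that upstream step – using stdlib sqlite3 in place of Snowflake and dbt, with an invented table layout, URI scheme, and vocabulary – URIs can be minted and controlled-vocabulary terms resolved entirely in SQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Invented raw source table, as a data engineer might receive it.
CREATE TABLE raw_samples (sample_id TEXT, sample_type TEXT);
INSERT INTO raw_samples VALUES ('001', 'Plasma'), ('002', 'whole blood');

-- Controlled vocabulary materialized as a flat table SQL engineers can join.
CREATE TABLE vocab_sample_type (label TEXT, uri TEXT);
INSERT INTO vocab_sample_type VALUES
  ('plasma', 'http://example.org/vocab/sample-type/plasma'),
  ('whole blood', 'http://example.org/vocab/sample-type/whole-blood');
""")

# Mint sample URIs and resolve free-text labels to vocabulary URIs, all in SQL.
rows = conn.execute("""
SELECT 'http://example.org/sample/' || r.sample_id AS sample_uri,
       v.uri AS sample_type_uri
FROM raw_samples r
JOIN vocab_sample_type v ON lower(r.sample_type) = v.label
ORDER BY sample_uri
""").fetchall()

for row in rows:
    print(row)
```

The output rows already carry the URIs the graph expects, so serializing this data product into RDF later becomes a mechanical mapping rather than a curation exercise.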
We chose Corporate Memory by eccenca, backed by GraphDB as the quad store, because it allowed a single vendor to manage the whole stack and suite of tooling. Now we just deploy images rather than patching and maintaining many different pieces. Our semantic engineers can focus on semantics, not infrastructure maintenance. It also provided nice integration opportunities with the established ecosystem of data consumption tools. Our Scientific Intelligence interface is built with Discover from OntoForce.
Putting the graph into the data
We initially thought we needed to transform all data and build it into a graph – essentially creating a graph across the entire enterprise. What we learned from our experience with virtualization upfront, serialization into the graph, and using controlled vocabularies changed our perspective fundamentally.
We now materialize controlled vocabularies and make them available not just to our data pipeline but to everyone else’s. We drive common patterns from the reference model and URI patterns we establish. What we realized is that we have pivoted. While we still put data into our knowledge graph for Scientific Intelligence, from an enterprise perspective, we are really putting the graph through artifacts like controlled vocabularies into the data.
This lets us scale differently. We provide controlled vocabularies – built as taxonomies – then flatten them into Snowflake tables for easy consumption by data pipelines. This means data products being produced already have the URIs we want to use in our graph, making them easy to consume. This represents our major learning and pivot.
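Flattening a taxonomy can be as simple as a depth-first walk that emits one row per term with its full path and a minted URI, ready to load into a flat table (the vocabulary and URI scheme below are invented):

```python
# A toy taxonomy: term -> children. Labels and URIs are illustrative only.
taxonomy = {
    "cancer": ["lung cancer"],
    "lung cancer": ["non-small cell lung cancer", "small cell lung cancer"],
    "non-small cell lung cancer": [],
    "small cell lung cancer": [],
}

def flatten(term, path=()):
    """Yield (label, full path, minted URI) rows, depth first."""
    path = path + (term,)
    uri = "http://example.org/vocab/" + term.replace(" ", "-")
    yield term, " > ".join(path), uri
    for child in taxonomy.get(term, []):
        yield from flatten(child, path)

rows = list(flatten("cancer"))
for label, path, uri in rows:
    print(label, "|", path)
```

Each row is a plain string triple, so the whole vocabulary can sit in a Snowflake-style table; the hierarchy is preserved in the path column while consumers who only need labels and URIs can ignore it.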
Key takeaways
Evolving science has driven us to understand diseases with greater and greater granularity, creating new data demands and resulting in complicated data graphs. But we must help people navigate these complex spaces. Focus on generating common patterns. Look for opportunities to help others do hard work for you, enabling you to focus limited skills on truly important process elements.
There’s no shame in using traditional approaches where they make sense. Most importantly, invest in data interoperability. That means building taxonomies, flattening them, and making them available as simple flat lists for SQL engineers to consume – it makes a huge difference. Put editorial governance in place and hide the semantics, because most of the world doesn’t need the semantics but needs the artifacts we create to make their data FAIR.
On ROI, you can calculate time saved and multiply it across multiple people and use cases. We looked at frameworks for the increasing value of data as you join different modalities together. With about 350 studies and 750,000 subjects, as we add genetic sequencing information, samples, and imaging data, you get multipliers – ten times as you bring two modalities together, a hundred times with a third, three hundred times with the next. The potential value becomes enormous. It’s about use cases that enable drawing down against that potential value.
The question about hiding semantics isn’t about semantics being somehow toxic – it’s about avoiding confusion. It’s not a religious war of SQL versus semantics; it’s about blending tools together. At AstraZeneca, 95% of the IT organization knows nothing about graphs or semantics. There’s no point teaching 95% of the organization to work with SPARQL, SKOS, and the RDF stack. We have a specialist group working with this technology stack, and we work out how to enable the artifacts to be consumed by the rest of the organization in data forms they’re familiar with, using skills they already have.