How Graphwise helped a top-5 global pharmaceutical company transform a fragmented preclinical in vivo data landscape into a coherent, semantically rich and queryable knowledge graph by engineering a production platform with FAIR principles built into every layer of its architecture.
The FAIR data principles — Findable, Accessible, Interoperable, Reusable — have become a widely accepted framework for research data management. But articulating the principles is the easy part. Engineering a production platform that embeds them into every layer of the architecture is a different challenge entirely. In this blog post, we describe how Graphwise technology helped a top-5 global pharmaceutical company address that problem — and what the resulting architecture looks like in practice.
Like many large research organizations, our client was struggling with a fragmented preclinical data landscape. Critical study information was siloed across systems, stored in non-harmonized formats, and nearly impossible to query in a meaningful way. To solve this, their team set out to build an internal FAIR in vivo data platform. Engineered from the ground up with FAIR principles baked into every layer of the architecture, the platform is now transforming how in vivo data is captured, connected, and reused across the company’s drug discovery process.
The problem: a fragmented in vivo data landscape
The core challenge the internal FAIR in vivo data platform was built to solve was not just data volume — it was data coherence. Information about study designs, animals, biospecimens, and assay results lived in disconnected systems owned by different teams, with no shared terminology and no consistent structure. Assembling a complete picture of even a single study meant piecing together a puzzle with missing parts.
The deeper issue, however, was context. For data to be truly reusable — by a scientist, or increasingly by an algorithm — it must carry unambiguous meaning. Simply knowing that “10 mice of strain X were treated with compound Y” is insufficient. Which batch of the compound? How was it formulated? Which lobe of the liver was collected, and from which specific animal? Without this level of contextual fidelity, the data’s value diminishes rapidly. The vision for the platform was therefore not just to centralize data, but to preserve its full scientific context programmatically.
What the platform does: unifying the in vivo lifecycle
The internal FAIR in vivo data platform functions as a data registration and sharing platform that captures the complete lifecycle of in vivo data. It covers everything — from study design and in-life animal activities, through to all measurements and assay results derived from animals or their biospecimens. It standardizes terminologies across labs and integrates with the broader ecosystem of tools already in use: LIMS systems as source data providers, ELN platforms, digital pathology systems, and omics data pipelines.
Downstream, the platform feeds into visualization and analysis tools such as Spotfire and D360, while also providing project-based access management to ensure data is both accessible and appropriately secured. The result is that a scientist no longer has to hunt through disconnected files or systems. They can ask complex cross-study questions and receive a structured, enriched answer — ready for analysis.
FAIR by design: four technical pillars
The FAIR principles were not treated as a post-hoc compliance checklist — they were implemented as specific technical solutions across four pillars. First, rich metadata: comprehensive attribute sets were defined to capture all contextual information for each object type. Second, controlled vocabularies: enforced use of standardized terminologies to harmonize values across labs and systems. Third, Globally Unique Persistent and Resolvable Identifiers (GUPRIs) for every digital object. Fourth, a formal semantic model that defines not just the objects themselves, but the relationships between them.
The GUPRI infrastructure deserves particular attention. Using the company’s internal identifier service, each application registers a unique namespace, and GUPRIs are constructed from namespace plus local identifier. The critical feature is persistence: the service resolves GUPRIs to their current location, and if data migrates to a new system, only the redirect rule needs updating. The identifier itself remains valid indefinitely, ensuring data links never break across system migrations or reorganizations.
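To make the mechanics concrete, here is a minimal sketch of how namespace-based identifiers and redirect-style resolution work. The resolver URL, namespace names, and redirect table are hypothetical stand-ins — the real identifier service is internal to the company.

```python
# Minimal sketch of GUPRI construction and resolution.
# All URLs and namespaces below are illustrative, not the real service.

ID_SERVICE = "https://id.example-pharma.com"  # hypothetical resolver base URL

def build_gupri(namespace: str, local_id: str) -> str:
    """A GUPRI is the resolver base plus the application's registered
    namespace plus a local identifier."""
    return f"{ID_SERVICE}/{namespace}/{local_id}"

# The resolver maps each namespace to the system currently hosting the data.
redirects = {"ars": "https://ars.internal/animals"}

def resolve(gupri: str) -> str:
    """Follow the redirect rule to the data's current location."""
    _, namespace, local_id = gupri.rsplit("/", 2)
    return f"{redirects[namespace]}/{local_id}"

animal = build_gupri("ars", "A-000123")
print(resolve(animal))   # resolves to the current ARS location

# After a system migration, only the redirect rule changes;
# the GUPRI itself, and every link that uses it, stays valid.
redirects["ars"] = "https://new-lims.internal/animals"
print(resolve(animal))   # now resolves to the new system
```

The point of the sketch is the last two lines: migration is a one-line change to the redirect table, and no identifier embedded in any downstream dataset ever needs to be rewritten.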
The semantic model and FAIR API
The backbone of the platform is its semantic model, which covers distinct but connected domains — including the Study Registration System (SRS) and Animal Registration System (ARS) — browsable via a dedicated model browser. Each domain model is paired with a corresponding FAIR API that exposes data as JSON-LD, making it machine-readable and semantically rich by default.
The FAIR API goes beyond a standard REST API in several meaningful ways. Every object carries a GUPRI in the @id field, served from the company’s internal identifier service, making each entity globally findable. The @context links the JSON directly to its definition in the semantic model, ensuring any consumer — human or machine — understands the precise meaning of every field. Relationships to other objects are also expressed as GUPRIs, enabling the construction of a connected graph from the ground up. As a deliberate design choice, the official identifier and preferred label are exposed for all terminology concepts, reinforcing both human readability and machine interoperability.
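A response shaped along these lines might look like the following sketch. Every URL, context name, and field here is an illustrative assumption, not the platform's actual schema — the structure (@id as a GUPRI, @context pointing at the semantic model, relationships as GUPRIs, terminology concepts carrying both identifier and preferred label) is what matters.

```python
import json

# Illustrative JSON-LD payload for a biospecimen, shaped as described above.
# All URLs and field names are hypothetical examples.
biospecimen = {
    "@context": "https://models.example-pharma.com/biospecimen.jsonld",
    "@id": "https://id.example-pharma.com/bss/BS-000042",  # this object's GUPRI
    "@type": "Biospecimen",
    # relationships to other objects are themselves GUPRIs
    "derivedFromAnimal": "https://id.example-pharma.com/ars/A-000123",
    "tissue": {
        # terminology concepts expose the official identifier and preferred label
        "identifier": "UBERON:0001115",
        "preferredLabel": "left lobe of liver",
    },
}

print(json.dumps(biospecimen, indent=2))
```

Because the relationship values are GUPRIs, a consumer can dereference `derivedFromAnimal` and land on the animal record without any system-specific lookup logic.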
Platform architecture: microservices, Kafka, and GraphDB
The platform architecture is built around four registration microservices covering study, study design, animal, and biospecimen data. Producers interact with these systems either through the platform UI or directly via the FAIR APIs. Metadata flows from these services via Kafka messages into a central metadata store built on Graphwise GraphDB, where it is harmonized into connected semantic objects. A GraphQL API is then exposed over the metadata store to enable flexible, efficient querying of the full connected metadata network.
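The ingestion step can be sketched as follows. The Kafka consumer loop and host names are placeholders; the one concrete detail is GraphDB's standard bulk-load endpoint, `POST /repositories/{repo}/statements`, which accepts JSON-LD when sent with the `application/ld+json` content type.

```python
import json

# Sketch of loading one Kafka message (a JSON-LD record) into GraphDB.
# Host and repository names are hypothetical.
GRAPHDB = "http://graphdb.internal:7200"
REPO = "invivo-metadata"

def ingest(message_value: bytes) -> tuple:
    """Build the HTTP request that would load one message into GraphDB."""
    url = f"{GRAPHDB}/repositories/{REPO}/statements"
    headers = {"Content-Type": "application/ld+json"}
    return url, headers, message_value

# In production this would sit inside a Kafka consumer loop; here, one message:
msg = json.dumps({"@id": "https://id.example-pharma.com/ars/A-000123"}).encode()
url, headers, body = ingest(msg)
print(url)
```

In production the request would be issued by the consumer for every message, so the metadata store stays continuously in sync with the registration services.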
Result data follows a separate path into a relational Result Data Store — an Oracle-based system previously used for in vitro data and reused here pragmatically. Both stores are made available to downstream consumers through a unified API gateway, with direct JDBC connections also supported. The choice to use Postgres for the new registration microservices was equally pragmatic: the team had limited experience with graph databases at the start, and the semantic layer in GraphDB effectively absorbs the complexity of the connected data model regardless of what sits upstream.
The knowledge graph is assembled by exploiting the GUPRI-based connectivity already embedded in the data. When GraphDB ingests JSON-LD records, it reads the GUPRIs referencing other entities — such as the specific animal a biospecimen was derived from — and automatically creates the corresponding RDF triples. The result is a single traversable graph spanning studies, designs, animals, and biospecimens, queryable as one connected network rather than a set of joined tables.
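A toy illustration of that assembly step: walking one record's fields and turning each GUPRI reference into an edge. A real ingestion would use a proper JSON-LD processor (as GraphDB does); the record below and its field names are hypothetical.

```python
# Toy illustration: GUPRI references in a JSON-LD record become graph edges.
record = {
    "@id": "https://id.example.com/bss/BS-000042",
    "@type": "Biospecimen",
    "derivedFromAnimal": "https://id.example.com/ars/A-000123",
    "collectedInStudy": "https://id.example.com/srs/S-2024-017",
}

def to_triples(rec):
    """Emit (subject, predicate, object) triples for one record."""
    subject = rec["@id"]
    triples = [(subject, "rdf:type", rec["@type"])]
    for key, value in rec.items():
        if key.startswith("@"):
            continue  # @id and @type already handled
        # GUPRI-valued fields become links to other nodes in the graph
        triples.append((subject, key, value))
    return triples

for s, p, o in to_triples(record):
    print(s, p, o)
```

Because every referenced entity was registered with its own GUPRI, these edges land on existing nodes, and the biospecimen, its source animal, and the study connect into one traversable graph.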
The API layer: from Semantic Objects to GraphQL
To expose the metadata store, the team used the Graphwise Semantic Objects platform. The GraphQL schema is assembled by combining the four core semantic models and includes project-based authorization rules derived from the user’s access token. The Semantic Objects workbench allows power users to visually explore entities and properties, auto-generating GraphQL queries that can then be parameterized and executed — all traversing the full relationship graph in a single query.
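A query of the kind the workbench can auto-generate might look like this sketch. The type and field names are illustrative assumptions, not the platform's real schema; the point is that one GraphQL request traverses study, animal, and biospecimen in a single round trip.

```python
# Hypothetical GraphQL query against the metadata store; type and field
# names are illustrative, not the platform's actual schema.
QUERY = """
query BiospecimensForStudy($studyId: ID!) {
  study(id: $studyId) {
    title
    animals {
      id
      strain
      biospecimens {
        id
        tissue { preferredLabel }
      }
    }
  }
}
"""

# A client would POST this payload to the GraphQL endpoint; the user's
# access token carries the project-based authorization rules.
payload = {"query": QUERY, "variables": {"studyId": "S-2024-017"}}
print(payload["variables"])
```

Answering the same question against the source systems would require joins across several services; here the relationship graph does the traversal.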
For simpler consumption patterns, the metadata store API is also exposed through the company’s internal API gateway, with predefined parameterized endpoints. A downstream system can call a straightforward RESTful endpoint with a study ID and receive a clean JSON-LD response — no knowledge of the underlying graph required. The gateway also handles API versioning, policy enforcement, and — currently via a workaround — the injection of JSON-LD context into responses. The company is in active discussions with Graphwise to have this handled natively by the Semantic Objects framework.
Looking ahead, the team is evaluating the new native GraphQL support in GraphDB 11, which allows the GraphQL schema to be generated automatically from ontologies and SHACL shapes, eliminating what is currently a manual update process and further tightening the coupling between the semantic model and the query layer.
Value delivered and the road ahead
The value the internal FAIR in vivo data platform delivers is framed as a pyramid of progressively more complex scientific capability. At the base is simple findability — locating study plans, storing result files in a controlled way. Above that, reliable data linkage: tracing a biospecimen back to its source animal and full treatment history. The current focus sits at levels three and four: combining pharmacokinetic/pharmacodynamic (PK/PD) results and other study data to answer concrete research questions about compound efficacy. The long-term ambition — the apex of the pyramid — is generating novel cross-study insights, including the construction of virtual control groups and compound repurposing opportunities.
One of the most powerful opportunities a knowledge graph unlocks is democratizing access to data — making it available to everyone regardless of their technical expertise. Rather than requiring scientists to write queries, the goal is to let them ask questions in plain natural language. This is an active area of development, and several promising approaches are emerging: Graphwise is already moving in this direction with a native “talk to your graph” feature built into GraphDB.
One example of what this looks like in practice is Graph Talk. This is a Graphwise service that connects a large language model to a GraphDB knowledge graph, translates plain-language scientific questions into precise SPARQL queries, and returns clear, cited answers. Currently, the global pharmaceutical company is evaluating adapting this approach for the animal study registration system. It is a compelling demonstration of what becomes possible when the underlying data is structured, connected, and semantically rich: a complex database transformed into an interactive scientific partner.
To wrap it up
This Graphwise solution demonstrates what becomes possible when FAIR principles are treated as architectural requirements rather than documentation guidelines. By grounding the entire platform in persistent identifiers, a formal semantic model, controlled vocabularies, and a machine-readable API layer, our client has turned a fragmented collection of in vivo data sources into a coherent, queryable, and increasingly intelligent scientific asset.
For data engineers and architects working in complex research environments, the resulting stack (Kafka-driven ingestion, GraphDB as the semantic core, Graphwise Semantic Objects for API exposure, and a GUPRI-based identity layer) offers a concrete, production-tested reference architecture for FAIR by design. The foundation is in place; the most exciting science it will enable is still ahead.
Want to learn more about how a semantic layer can put a similar foundation in place for your enterprise?