
We conducted a pilot using Graphwise GraphDB’s Talk To Your Graph (TTYG)¹ to establish a semantic layer with three main objectives. The first was to construct the layer as a virtual solution, demonstrating its potential to support smaller data lakes by lowering total cost of ownership (TCO). Alternatively, it could serve as a jumpstart for delivering early value in large-scale data factory implementations. The second goal was to address some of our own internal data challenges — challenges that are common across many organizations. Finally, we aimed to establish a foundational piece of semantic capability that would contribute to building a digital knowledge model of our staff and their work.
In summary, we believe that capturing “the mind of many” in digital form makes knowledge more widely available. If knowledge exists only in someone’s head, you need to find the person who has it, and the larger an organization becomes, the harder it is to find them through word-of-mouth networking. Brooks captured this in The Mythical Man-Month with his formula for intercommunication channels.
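To restate Brooks’s point with the formula itself: among n people there are n(n − 1)/2 potential pairwise communication channels, so a team of 10 has 45 channels while an organization of 1,000 has nearly half a million. A digital knowledge network keeps the cost of finding expertise from growing along that curve.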
Bringing Our Own Data Challenges to the Table
We at EPAM have deep expertise in data and analytics. We also have a substantial semantic capability in our Business Information vertical where we work with companies whose revenue comes from selling knowledge to industries and horizontals. This pilot drew on both of these areas of expertise.
Like many of our clients, we have gathered most of our important operational data into the lake of our data factory and we create data products that we use to run our business daily, quarterly, and strategically.
The data and analytics challenges we face are common to many companies. We rely heavily on deep knowledge of our data sources and their variations. Core business terminology is often poorly defined, leading to inconsistencies in how it’s mapped across various data sources. Our teams depend on report developers and ELT (extract, load, and transform) specialists, and the logic required to generate analytic data products is frequently complex and difficult to maintain.
At the same time, we must keep up with the constant growth and evolution of data, shifting sources, and changes in the external landscape, including client needs. Managing governance across both our operational systems and the data factory adds another layer of complexity. And, last but not least, we aim to stay light and lean, maintaining flexibility without sacrificing control. Amid all of this, the staff working on our internal projects often rotates depending on client priorities, which creates the risk of gaps in our institutional knowledge.
The following are some examples of common questions we ask of our data:
- How many billable positions were assigned last year across the whole company?
- How many positions fall in the “focused demand” demand type?
- What is the average time to staff a Java position by region?
- How many times on average do positions change planned start dates?
- How do specific attributes of positions correlate with their staffing duration?
- How does the skill mix of “positions ending” (people rolling off assignments) compare to “focused demand”?
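To make this concrete, here is a minimal sketch of how the first question might translate into SPARQL against the virtual layer. The prefix, class, and property names are illustrative assumptions, not our actual ontology:

```sparql
PREFIX ex:  <https://example.org/ontology#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# Hypothetical: count distinct billable positions assigned last year,
# assuming "last year" resolves to 2024. The class ex:Position and the
# properties ex:isBillable and ex:assignedOn are invented for illustration.
SELECT (COUNT(DISTINCT ?position) AS ?billablePositions)
WHERE {
  ?position a ex:Position ;
            ex:isBillable true ;
            ex:assignedOn ?date .
  FILTER (?date >= "2024-01-01"^^xsd:date && ?date < "2025-01-01"^^xsd:date)
}
```

The point of the semantic layer, of course, is that users ask the natural language question at the top of this list rather than writing queries like this one.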
The Road to Building a Digital Knowledge Network
Most of the core data in our lake today is tabular. It represents activity in our operational systems and key entities like clients, staff, skills, projects, technologies, contracts, and commercials. Almost by definition, this type of data is backward looking, though we can extrapolate from trends and leading indicators.
Our most important assets are our employees. We are a knowledge network. Growth, size, and scale put strains on that network. There are limits to how many relationships people can manage. Having a digital knowledge network greatly expands reach when seeking people with specific domain expertise.
At the same time, digital knowledge overwhelmingly depends on language artifacts and semantic understanding: artifacts that are sometimes referred to as ‘unstructured’ relative to typical tabular data. In reality, they are highly structured and very readable to humans; it’s just that human business, technical, or scientific language is too complex for traditional data processing. (We have made gains with the most recent wave of large language model (LLM) artificial intelligence.)
The next step in our roadmap is to model our domain knowledge related to industry and horizontal functional domains. This ontology is the framework for uniting our data with the language artifacts we bring into our lake. We can then employ semantic processing techniques to produce a digital knowledge network that reflects the real network stored in the heads of our people.
To simplify, we needed a semantic interaction layer on top, and for this pilot we wanted to evaluate a virtual approach.
We already have the semantic processing capability for the layer underneath, thanks to our Content Value Framework, semantic processing accelerators, and decades of experience working with information companies. That’s why we wanted to run a pilot and prove value and ROI before making a larger investment in changing business processes to acquire language artifacts, bring them into the lake, and process them through semantic pipelines.
Building and Evaluating Our Semantic Proof of Concept
Our approach to the virtual semantic layer was based on several key assumptions. First, we believed it would help mature our current data factory and enhance how data was used across the organization. Second, we expected it to provide the necessary foundation for enabling the interactive aspects of digital knowledge. Finally, by taking a virtual approach, we aimed to evaluate how TTYG could meet smaller-scale needs while also jumpstarting larger initiatives through early value delivery and iterative methods.
Functional overview
The major activities we conducted focused on building a strong foundation for evaluating our proof of concept. We began by selecting key natural language questions from our top 100 user inquiries to guide our evaluation. Next, we modeled core business terminology into ontology concepts, which was an efficient way to formally describe fine-grained business logic and different aspects of the meaning of the data. Then we created a sandbox environment using GraphDB and TTYG, which also provided us with SPARQL access.
To support flexibility for our developers and testers, we added a GraphQL service, allowing them to work with the interface they preferred. We then mapped data via SOML (Semantic Objects Modeling Language) to lower-level ontology classes and connected the sandbox to our data factory and our internal instance of DIAL (our LLM/GenAI orchestration platform available to clients). Additionally, we defined relationships between mapping classes, attributes, constraints, and conditions within the upper classes of business terms. For each question, we identified the expected states across our process flow, covering GraphQL, SPARQL, and output formats. The entire effort was delivered through a cycle of running, testing, iterating, and evaluating.
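As an illustration of the business-term modeling step, here is a minimal sketch, with invented names, of how an upper-level business term such as “focused demand” could be expressed as a classification rule over a lower-level mapping class:

```sparql
PREFIX ex: <https://example.org/ontology#>

# Hypothetical rule: classify mapped positions into the upper-level
# business-term class ex:FocusedDemandPosition. The class and property
# names are invented; our actual conditions were more fine-grained.
CONSTRUCT {
  ?position a ex:FocusedDemandPosition .
}
WHERE {
  ?position a ex:Position ;
            ex:demandType "focused demand" .
}
```

Encoding a condition like this once, in the ontology, rather than re-implementing it in every ELT pipeline is what keeps a business term consistent across queries and retrieval methods.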
Watch our video for a more detailed description of the process.
Evaluating the Pilot: Insights and Challenges
Our pilot surfaced a number of valuable insights, both in terms of what worked well and where challenges remain.
Benefits
On the positive side, we found that data virtualization on a limited scale worked effectively and TTYG proved to be a straightforward and capable toolset for this purpose. It allowed us to avoid duplicating data, ELT operations, or creating yet another overlapping data repository. It also enabled substituting higher-level business term classes for traditional ELT pipelines. All this reduced the overall complexity and lowered the TCO.
The TTYG interface provided a delightful user experience, leveraging OpenAI Assistants for natural language interaction. Our use of ontology to define strict business concepts, and in this way instruct the GenAI, significantly boosted the quality of those interactions and bridged tabular data with language artifacts and semantic processing.
Finally, from a query language standpoint, both GraphQL and SPARQL brought distinct advantages, making it easy to balance work between them. GraphQL offered adequate abstraction, lightweight query capability, easy integration with web APIs, and growing popularity, while SPARQL provided more maturity, richer capabilities, and decoupling from UI constraints.
Challenges
During the pilot, we also encountered important caveats. We confirmed the concern that performance becomes an issue when scaling virtualization. We think that a physical graph will be necessary to support a full digital knowledge network — especially one that captures semantic value from language artifacts. For organizations aiming to use ‘unstructured’ content as an analytical subject rather than just as a source for data extraction, virtualization should be seen as an interim technique.
At scale, both reasoning and retrieval-augmented generation (RAG) are likely to be challenging under virtualization, whether using TTYG or other virtual layers available from the broader data community and data vendors. Ultimately, most organizations will need to adopt physical graphs, silver data products, or a combination of both. While this shift comes with added costs, it also opens up opportunities for expanding the scope and insights of the virtual layer if the components are kept in the platform. Benefits from insights and delivery channels may also offset the hit to TCO.
Finally, our internal demos revealed one of the most persistent challenges: language ambiguity. Just because stakeholders use their “own familiar terms” doesn’t mean the data challenges disappear. People often use the same words to mean different things, and even domain experts can disagree. One example from our pilot sparked immediate debate: “27k can’t be right — we had over 50k staff assigned to billable positions last year.” This highlighted the nuance in two seemingly similar questions: “How many billable positions were assigned last year?” versus “How many position assignments were billable last year?”
We should note that similar confusion is likely even if a human expert is asked to write a database query and produce such a report. The short question simply leaves too much room for interpretation, and GenAI cannot be expected to guess the precise intended meaning better than a human. TTYG got the correct answer, thanks to the semantic modeling of our ontologist; half of our experts did not. Their quick reactions relied on familiarity-based processing in short-term memory (STM) before deeper recollection from long-term memory (LTM) could kick in. This is another good argument for a semantic layer that resolves such ambiguity in a quick, predictable, and explainable manner.
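In query terms, the two readings differ in what gets counted. A hedged sketch, again with invented names (two separate queries shown together):

```sparql
PREFIX ex: <https://example.org/ontology#>

# Reading 1: "How many billable positions were assigned last year?"
# Counts distinct positions; a position re-staffed three times counts once.
SELECT (COUNT(DISTINCT ?position) AS ?positions)
WHERE {
  ?assignment a ex:Assignment ;
              ex:ofPosition ?position ;
              ex:assignmentYear 2024 .
  ?position ex:isBillable true .
}

# Reading 2: "How many position assignments were billable last year?"
# Counts assignment events; the same position can contribute many times.
SELECT (COUNT(?assignment) AS ?assignments)
WHERE {
  ?assignment a ex:Assignment ;
              ex:ofPosition ?position ;
              ex:assignmentYear 2024 .
  ?position ex:isBillable true .
}
```

A formally modeled ontology pins each business term to exactly one of these readings, which is why TTYG could answer consistently where people hesitated.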
Wrapping It All Up
As we explored the potential of a semantic layer through our pilot with TTYG, one thing became clear: technology is only part of the equation. The rise of natural language interfaces and tools like TTYG introduces a new skillset into the enterprise — one that blends technical fluency with linguistic precision. That’s why roles like Prompt Engineer are emerging, raising important questions around change management, staffing, and TCO.
Maybe not everyone will become a prompt expert, but power users will continue to play a key role, and demand for GenAI-savvy talent is likely to grow before it levels out. Just as with previous waves of innovation, new technology doesn’t just reshape jobs — it creates entirely new ones. As semantic interactions mature, organizations will need to rethink how they equip their teams, balance expertise, and evolve their digital capabilities to stay ahead.
Footnotes
- ¹ TTYG offers four retrieval methods: vector, semantic similarity (graph embedding), SPARQL, and full-text search (FTS), which is why it is also referred to as Quadro