Join us to learn how we empower users to sift through a large pool of data with the help of a trusted assistant – the GraphDB 10.8 Talk To Your Graph customized agent.

Introduction
Whether you are in the position of Sherlock Holmes, a data analyst, or a business manager, it’s always useful to augment your view of the available data to derive better insights. An intelligent assistant is a handy way to reveal the knowledge locked in a tangled hairball of data. And when your data is neatly structured and interlinked in a knowledge graph, it lays a solid foundation for that assistant to provide reliable answers rather than ones pulled out of nowhere.
This blog post will cover a specific use case in the fact-checking domain. It will illustrate how users with varying levels of technical knowledge, particularly the less tech-savvy ones, can benefit from the Graphwise GraphDB-based approach to retrieval augmented generation (RAG), underpinned by large language model (LLM) agents.
LLM assistants that understand graphs
With GraphDB 10.8, Graphwise delivers its own flavor of RAG out of the box – Graph RAG. It facilitates access to complex datasets through natural, context-rich conversations with the help of customizable, intelligent agents. This way, non-technical users no longer depend on data management staff for data retrieval and analysis tasks. Now, they can easily look not only for answers but also for clarifications while checking the supporting information. This gives them confidence in the provided answers.
The chatbot agents can leverage various query methods, enabling them to respond to a wide spectrum of inquiries. Depending on the complexity of the task and the way the data is represented, the techniques differ (the first of them is sketched in code right after this list):
- SPARQL for structured data
- graph embedding-based vector similarity search for focused, open-ended questions
- more complex query endpoints for broader open-ended inquiries
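As a minimal sketch of the first route, here is how a tool might send a SPARQL query to a GraphDB repository over its standard HTTP endpoint. The local URL and the repository name dbkf are assumptions for illustration; the query itself uses only the Schema.org terms discussed later in this post.

```python
import requests

# Hypothetical local GraphDB instance; the repository name "dbkf" is an assumption.
ENDPOINT = "http://localhost:7200/repositories/dbkf"

def run_sparql(query: str) -> list[dict]:
    """Send a SPARQL SELECT query to GraphDB and return the result bindings."""
    response = requests.get(
        ENDPOINT,
        params={"query": query},
        headers={"Accept": "application/sparql-results+json"},
    )
    response.raise_for_status()
    return response.json()["results"]["bindings"]

# Example: count the claim reviews in the repository.
print(run_sparql("""
    PREFIX schema: <http://schema.org/>
    SELECT (COUNT(?review) AS ?count)
    WHERE { ?review a schema:ClaimReview }
"""))
```

We will reuse this run_sparql() helper in the query sketches later in the post.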
Even more exciting is that users can set up multiple agents and tailor them to their own data and domain-specific needs. To illustrate the compelling interaction with a chatbot customized for a particular domain, we will dive into a use case focused on content that verifies mis- and disinformation statements.
The use case: a collection of fact-checking articles
As a partner in the EC-funded project vera.ai, which aims to equip verification professionals with novel and trustworthy artificial intelligence (AI) tools, the Graphwise team develops the Database of Known Fakes (DBKF). This is a GraphDB-powered system that gathers fact-checking content (also called debunks or debunking articles) and enriches it with meaningful metadata and other information. Thanks to the connections in the graph between the source articles and the enrichments, the data can be efficiently retrieved for further analysis.
Throughout the rest of the blog post, we will use the terms debunk, debunking article, and fact-check interchangeably. By these we mean the combination of a claim (the misleading or false statement) and a claim review (a more extensive piece of text that evaluates the claim’s truthfulness). With the latest extensions to the data schema, DBKF can now also process other textual formats such as news articles and social media posts.
Let’s elaborate on the enrichments in DBKF, as they largely set the possible avenues to explore the data:
- Appearance and evidence identification – appearance links lead to the original source of the claim that is fact-checked and evidence links point to documents supporting the conclusion about the claim’s veracity. A dedicated service in DBKF detects these automatically. If the corresponding platforms (usually social media) allow access, it then attempts to extract all the available metadata. This includes a short description of the claim, the publication date and time of the claim’s appearance, a URL to an archived version of the appearance content, and so on. In addition, the text of the appearance post is subsequently analyzed to enable search based on identified concepts.
- Metadata extraction – the metadata is extracted on the level of the whole debunking article and includes the associated organization that produced it, the author, and the publication date.
- Image and video extraction – all image and video links, present in both the text of the debunking article and the appearance/evidence posts, are extracted and indexed by a vera.ai partner service to enable visual similarity searches.
- Language detection – the system uses the language tag of the claim’s review to automatically determine and assign a relevant language to the whole debunk.
- Concept enrichment – a state-of-the-art multilingual entity linking pipeline (with links to Wikidata identifiers) performs text analysis over debunking articles and appearance/evidence posts to identify mentioned concepts such as people, organizations, locations, and generic terms of interest.
- Event type identification – a transformer-based model extracts “events” (the textual representation of “what happened”) with their accompanying arguments from English-language debunks and appearance/evidence posts. This model was developed within the scope of another EC-funded project in the disinformation domain – VIGILANT.
- Narrative clustering – this is a semantic text similarity-based approach to grouping claims into clusters. It’s a way to identify similar stories/narratives across languages and analyze how they are modified and spread over time. Research into LLM-based cluster naming and method extension with visual similarity is also done within the VIGILANT project.
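To make the last enrichment more tangible, here is a minimal sketch of similarity-based claim clustering, assuming an off-the-shelf multilingual sentence-embedding model and an arbitrary distance threshold – the production pipeline behind DBKF’s narrative clusters is considerably more elaborate.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering

# Example claims taken from later in this post; the model choice is an assumption.
claims = [
    "Video shows Narendra Modi refusing to shake hands with Joe Biden at G7 summit",
    "2018 image is falsely shared as Turkey President refusing to shake hands with Macron",
    "Does This Photo Show Recep Tayyip Erdogan Refusing To Shake Hands With Emmanuel Macron?",
]

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
embeddings = model.encode(claims, normalize_embeddings=True)

# Group claims whose embeddings fall within a cosine-distance threshold;
# each resulting label corresponds to one candidate narrative cluster.
clustering = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.6, metric="cosine", linkage="average"
)
for label, claim in zip(clustering.fit_predict(embeddings), claims):
    print(label, claim)
```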
Leveraging these enrichments, DBKF enables users to navigate the content through a variety of searches such as full-text search and faceted filters (by the basic metadata, by concept, and by event type). More sophisticated similarity-based searches include narrative clustering and visual similarity (supported by a third-party service for image/video indexing). With so many capabilities to choose from, human analysts greatly benefit from an AI assistant that can abstract away this complexity and figure out the best way to get to the desired output. But before our LLM agent could reach this level of abstraction, it had to familiarize itself with two important aspects: the structure of the data and the available tools for its exploration.
Data and tools
When we configured our chatbot to answer questions about the DBKF data, we provided it with precise explanations of the available information. This also included a detailed structure of the data to help the agent navigate through it. The DBKF data model is built on a well-established ontology, Schema.org, which has been judiciously adapted and extended in multiple ways.
As illustrated in the high-level DBKF data model diagram below, the starting point is the various “Text Documents”. These include disinformation claims and claim reviews, news articles, and all kinds of social media activity such as posts and comments. In addition, we have the “social graph” of “Agents”, which includes authors, publishers, communication channels, and distribution platforms. “Enrichments” are the results of AI algorithms analyzing the data and building metadata: linking concepts (word-level enrichments), identifying event mentions (sentence-level enrichments), and clustering documents with reused content and repeated themes (document-level enrichments). “Media Objects” (texts, images, videos, and audio files) are contained in the “Text Documents” and are tied to “Reports” (the level where the various tools record their analysis results).
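To make the model concrete, the toy snippet below instantiates a claim, its review, and a contained media object with plain Schema.org terms via rdflib. DBKF’s actual extensions and resource URIs are not shown in this post, so the placeholder namespace and exact property choices here are our assumptions.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

SCHEMA = Namespace("http://schema.org/")
EX = Namespace("http://example.org/")  # placeholder namespace, not DBKF's

g = Graph()
g.bind("schema", SCHEMA)

# A false claim, the review that debunks it, and the agent who authored the review.
g.add((EX.claim1, RDF.type, SCHEMA.Claim))
g.add((EX.claim1, SCHEMA.text, Literal("Video shows X refusing to shake hands with Y")))
g.add((EX.review1, RDF.type, SCHEMA.ClaimReview))
g.add((EX.review1, SCHEMA.itemReviewed, EX.claim1))
g.add((EX.review1, SCHEMA.author, EX.factchecker1))

# A media object contained in the claim's source document.
g.add((EX.image1, RDF.type, SCHEMA.ImageObject))
g.add((EX.claim1, SCHEMA.image, EX.image1))

print(g.serialize(format="turtle"))
```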
Now let’s equip our AI assistant with specialized tools to handle inquiries with varying levels of complexity. Our toolbox includes the following (a sketch of how such tools might be declared to a function-calling agent follows the list):
- Elastic faceted full-text search powers the main search function of DBKF. It supports filtering by both text and facets, which correspond to all major types of enrichments. This way it combines structured and unstructured knowledge.
- Autocomplete links mentions of entities to their objects in the knowledge graph.
- SPARQL endpoint provides full SPARQL query access to the knowledge graph.
- Semantic text vector similarity uses a custom-selected model to index all texts in the knowledge graph and make them discoverable – the equivalent of traditional vector RAG over texts.
- Visual similarity searches for similar images across DBKF content – the visual counterpart of text RAG.
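Here is one way such a toolbox might be declared to a function-calling LLM. The names and parameter schemas are purely illustrative – this is not GraphDB’s Talk To Your Graph configuration format.

```python
# Illustrative tool declarations; the names and parameters are our own invention.
TOOLS = [
    {
        "name": "fts_search",
        "description": "Elastic faceted full-text search over DBKF, combining "
                       "free text with facets for the major enrichment types.",
        "parameters": {"query": "string", "facets": "object"},
    },
    {
        "name": "autocomplete_iri",
        "description": "Resolve an entity mention to its object in the knowledge graph.",
        "parameters": {"mention": "string"},
    },
    {
        "name": "sparql_select",
        "description": "Run a SPARQL SELECT query against the DBKF repository.",
        "parameters": {"query": "string"},
    },
    {
        "name": "text_similarity",
        "description": "Semantic vector similarity search over indexed texts "
                       "(classic vector RAG).",
        "parameters": {"text": "string", "top_k": "integer"},
    },
    {
        "name": "visual_similarity",
        "description": "Find visually similar images and videos in DBKF content.",
        "parameters": {"image_url": "string", "top_k": "integer"},
    },
]
```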
Conversations with the chatbot
To achieve quality output, our AI assistant strongly depends on clear instructions, a well-defined ontology, and the right tools. Behind the scenes, the customized LLM agent identifies on its own the tools required to collect the necessary information. It also attempts to fix issues or try alternative methods – either on its own initiative or following the instructions of the human user.
Let’s see what a conversation with our AI assistant looks like when we ask it to dig into the data and explain its contents. Building on the out-of-the-box capabilities of the GraphDB 10.8 Talk To Your Graph agent, the DBKF chatbot can talk about what it knows (what the underlying dataset is about). It can also respond to traditional RAG questions and to questions requiring aggregations over the data in the knowledge graph. But now that we’ve added custom tools and detailed instructions, our tailored assistant has become much more powerful and can answer really complex queries over the data.
What the chatbot knows
The agent uses the contents of its instructions and the dataset ontology to determine what it can reasonably answer. At this stage, it still doesn’t use the contents of the graph. The important point is that the AI assistant guides the user without hallucinating: it can clarify how it reached its results, and its responses can be traced and verified.
Conversation on what the chatbot knows:
Q: What kinds of things do you know?
A: I have access to a dataset that focuses on disinformation claims encountered online and the journalistic debunking articles that counter these claims. The dataset is structured around false claims and factual claim reviews, following an extended version of the ClaimReview schema. It includes information about:
1. **Claims**: False information or disinformation that has been identified.
2. **Claim Reviews**: Articles or documents that debunk or fact-check the false claims.
If you need more information about any of these, feel free to ask!
Answers within the text
Here is an example of a rather simple query that can be answered with a traditional RAG approach. In this case, the customized AI assistant uses a straightforward semantic similarity vector search to find texts associated with claims similar to our query of “refusing to shake hands” (this step is sketched in code after the conversation).
Conversation on a query within the text:
Q: What claims mention refusing to shake hands?
A: Here are some claims from the dataset that mention “refusing to shake hands”:
1. **Claim**: “2018 image is falsely shared as Turkey President refusing to shake hands with Macron”
2. **Claim**: “Does This Photo Show Recep Tayyip Erdogan Refusing To Shake Hands With Emmanuel Macron?”
3. **Claim**: “Video shows Narendra Modi refusing to shake hands with Joe Biden at G7 summit”
4. **Claim**: “Video shows Arvind Kejriwal refusing to shake hands with Rahul Gandhi during Hemant Soren’s oath taking ceremony”
If you need more information about any of these claims, feel free to ask!
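Behind an answer like this sits a ranking step along the following lines – a minimal sketch that assumes the claim texts have already been fetched from the graph and that the same embedding model is used at indexing and query time (the model name and the relevance threshold are assumptions):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Claim texts as they would come back from the knowledge graph.
claims = [
    "2018 image is falsely shared as Turkey President refusing to shake hands with Macron",
    "Video shows Narendra Modi refusing to shake hands with Joe Biden at G7 summit",
    "Roman Tide Baths in Malta Defy Sea Level Rise Amidst Climate Change",
]
claim_vectors = model.encode(claims, normalize_embeddings=True)
query_vector = model.encode("refusing to shake hands", normalize_embeddings=True)

# Dot product of normalized vectors equals cosine similarity.
scores = claim_vectors @ query_vector
for score, claim in sorted(zip(scores, claims), reverse=True):
    if score > 0.4:  # arbitrary relevance cut-off
        print(f"{score:.2f}  {claim}")
```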
Answers spanning the whole dataset
We can also ask for help with queries where the answer requires working with large sections of the dataset. These types of aggregations (such as large lists, counts, groupings, comparisons, filtering, and ordering by metadata) are all questions a typical RAG approach can’t handle. The LLM agent will use its SPARQL query tool and knowledge of the ontology to collect and organize the information needed for the answer (one plausible shape for such a query is sketched after the conversation).
Conversation on a query over the whole dataset:
Q: Can you tell me how many reviews mention climate change? I want the count broken down by language of the text.
A: Here is the count of claim reviews that have been annotated with the concept of “climate change,” broken down by the language of the text:
- English (en): 103 reviews
- German (de): 63 reviews
- Latvian (lv): 5 reviews
- Bulgarian (bg): 7 reviews
- Italian (it): 14 reviews
- French (fr): 9 reviews
- Korean (ko): 2 reviews
- Czech (cs): 2 reviews
- Greek (el): 3 reviews
- Spanish (es): 2 reviews
- Polish (pl): 11 reviews
- Finnish (fi): 1 review
- Hungarian (hu): 1 review
- Romanian (ro): 2 reviews
If you have any further questions or need additional information, feel free to ask!
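One plausible shape for the agent-generated query behind this answer is shown below, reusing the run_sparql() helper sketched earlier. The property linking reviews to Wikidata concepts is a placeholder, since DBKF’s exact extension vocabulary is not shown in this post.

```python
# wd:Q125928 is Wikidata's identifier for "climate change";
# ex:mentionsConcept stands in for DBKF's real annotation property.
query = """
PREFIX schema: <http://schema.org/>
PREFIX wd:     <http://www.wikidata.org/entity/>
PREFIX ex:     <http://example.org/dbkf#>

SELECT ?lang (COUNT(DISTINCT ?review) AS ?count)
WHERE {
  ?review a schema:ClaimReview ;
          ex:mentionsConcept wd:Q125928 ;
          schema:inLanguage ?lang .
}
GROUP BY ?lang
ORDER BY DESC(?count)
"""
print(run_sparql(query))
```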
Complex questions
Complex questions require running a series of tool queries to collect and build up the knowledge needed for the answer. The GraphDB 10.8 agent will attempt to answer these types of queries out of the box. However, we can make performance more consistent and reliable by investing time in better instructions on how to use individual tools and by developing custom tools.
To reach this level of complexity, a developer might take an iterative approach: ask the AI assistant complex questions, check how it interacts with the knowledge graph for information, and guide it to improve where necessary. In the example below, we have asked our tailored agent to write queries that use both a Uniform Resource Identifier (URI) and a literal filter (a concept and a date, respectively), combined with custom ordering (by date and by size) – all over nodes that are 4 hops apart in the knowledge graph (a sketch of such a query follows the conversation).
Conversation on a complex query:
Q: What is the most recent narrative cluster about climate change?
A: The most recent narrative cluster about climate change is titled “Roman Tide Baths in Malta Defy Sea Level Rise Amidst Climate Change.” It was last updated on 2024-08-05. The URI for this cluster is http://weverify.eu/resource/…32f1.
If you have any more questions or need further information, feel free to ask!
Q: What is the largest cluster from the last year talking about climate change?
A: The largest cluster from the last year talking about climate change is titled “Impact of Climate Change on Annual Lives Saved.” It contains 2 claims. The URI for this cluster is http://weverify.eu/resource/…c2e6.
If you have any more questions or need further information, feel free to ask!
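A sketch of what the second query might look like is given below: a concept URI filter plus a literal date filter, with the clusters ordered by size. As before, the cluster-related class and properties are placeholders for DBKF’s vocabulary, which is not shown in this post, and the cut-off date is just an example.

```python
query = """
PREFIX schema: <http://schema.org/>
PREFIX wd:     <http://www.wikidata.org/entity/>
PREFIX ex:     <http://example.org/dbkf#>
PREFIX xsd:    <http://www.w3.org/2001/XMLSchema#>

SELECT ?cluster ?title (COUNT(?claim) AS ?size)
WHERE {
  ?cluster a ex:NarrativeCluster ;           # placeholder class
           schema:name ?title ;
           schema:dateModified ?updated .
  ?claim ex:memberOfCluster ?cluster ;       # placeholder properties
         ex:mentionsConcept wd:Q125928 .     # "climate change"
  FILTER (?updated >= "2023-08-05"^^xsd:date)
}
GROUP BY ?cluster ?title
ORDER BY DESC(?size)
LIMIT 1
"""
print(run_sparql(query))  # run_sparql() as sketched earlier
```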
Wrapping it up
The DBKF chatbot demonstrates a specific LLM agent application in the domain of disinformation-related content. While GraphDB 10.8 provides an out-of-the-box Graph RAG solution, we have shown that customizing the AI agent with specialized tools and better instructions can further hone functionality and improve performance.
Chatbots backed by knowledge graphs can be a great benefit for various groups of stakeholders. Regular, non-expert users don’t need any technical background to access the knowledge they want and can ask basic, free-form questions. Professional analysts and researchers can ask very sophisticated questions over large swaths of data, with support for iterating on and breaking down complex tasks. Thanks to the chatbot’s transparent error analysis, data engineers can develop and extend datasets and identify shortcomings in modeling or issues with data quality.
Above all, LLM agents grounded with graph knowledge ensure the factuality, explainability, transparency, and data provenance of the output.
DBKF and the customized LLM assistant are developed with funding from the Horizon Europe research and innovation programme project vera.ai under grant agreement No 101070093. Work on some of the underlying tools, as stated explicitly in the text of the blog post, has received funding from the Horizon Europe research and innovation programme project VIGILANT under grant agreement No 101073921. Views and opinions expressed are, however, those of the author only and do not necessarily reflect those of the European Union or the European Research Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.