Discover how Generative AI can be applied to Named Entity Recognition in diverse, domain-specific contexts, as we compare top competing models on accuracy and cost to assess their viability for efficient knowledge graph construction.
Introduction
In the real world, obtaining high-quality annotated data remains a challenge. Generative AI (GenAI) models, such as GPT-4, offer a promising alternative, potentially reducing the dependency on labor-intensive annotation. At Graphwise, we aim to make knowledge graph construction faster and more cost-effective, so we explored how GenAI could automate several stages of the graph-building pipeline.
Named Entity Recognition (NER) is a foundational step in knowledge extraction and a critical task for knowledge graph construction. By identifying and categorizing entities like names, locations, and organizations within unstructured text, NER creates a foundation for organizing data into meaningful relationships, enabling the creation of structured, domain-specific knowledge graphs. Beyond knowledge graph building, NER supports use cases such as natural language querying (NLQ), where accurate entity recognition improves search accuracy and user experience. Understanding its importance, we investigated how GenAI performs on NER, especially in diverse and domain-specific contexts.
This blog post summarizes our findings, focusing on NER as a key first step in knowledge extraction. Our goal is to test whether GenAI can handle diverse domains effectively and determine whether it is a viable tool for domain-specific graph-building tasks.
Data
In Natural Language Processing (NLP), domain-specific knowledge plays a crucial role in the accuracy of tasks like NER. While annotating general datasets can be challenging, annotating domain-specific corpora often requires experts (like doctors or lawyers), significantly increasing time and cost.
To evaluate GenAI’s versatility, we used two annotated corpora:
- AIDA: A Reuters news corpus annotated for entities like people, organizations, locations, and miscellaneous.
- BioRED: A biomedical corpus with entities like genes, diseases, variants, species, cell lines, and chemicals, requiring deeper domain knowledge.
These corpora allowed us to assess GenAI’s performance across general and specialized domains.
Prompting
The quality of GenAI outputs is heavily influenced by how prompts are formulated. Inspired by recent work on prompt engineering for NER [1][2], we broke the prompt into several modules. Through iterative experimentation, we incrementally added new modules to refine the prompts. Here is the progression we experimented with (a sketch of how these modules can be assembled into a single prompt follows the list):
- Generic Prompting: Basic queries with generic NER instructions.
You are a named entity recognition system. Your goal is to extract entities with types: Person, Organisation, Location.
Extract them in the format entity_text, entity_type.
- Few-shot Chain-of-Thought (CoT) Prompting: More detailed instructions, including few-shot examples with CoT reasoning explaining why a certain output is correct or incorrect.
Atanas Kiryakov|Person|true|In this context, Atanas is a person as he is mentioned as a CEO
Ontotext|Organisation|true|Ontotext is a company working with knowledge graphs
EKG|Location|false|EKG stands for enterprise knowledge graph therefore it is not a location
- Annotation guidelines: Additional instructions derived directly from the corpus annotation guidelines, including entity type definitions.
Person: An individual with a unique identity, often described by attributes like name, role, title, and relationships.
Organisation: An entity comprising a group of people with a specific purpose, such as businesses, non-profits, or governmental bodies.
Location: A geographical area or place, which can be physical or abstract, such as cities, countries, landmarks, or regions.
- Error Analysis Iteration: Post-prompt evaluation to identify and address common mistakes through targeted prompt modifications.
If there is a title for the name, include it in the entity, e.g. Mr. Smith not Smith.
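To make the modular structure concrete, here is a minimal sketch of how such a prompt could be assembled and sent to an LLM. The module texts, helper names, and default model name are illustrative assumptions, not the exact prompts or configuration used in our experiments.

```python
# A minimal sketch of assembling the prompt modules and calling an LLM.
# The module texts, helper names, and default model are illustrative,
# not the exact prompts or configuration used in our experiments.
from openai import OpenAI

GENERIC = (
    "You are a named entity recognition system. "
    "Your goal is to extract entities with types: Person, Organisation, Location. "
    "Extract them in the format entity_text, entity_type."
)
FEW_SHOT_COT = (
    "Examples (entity|type|correct|reasoning):\n"
    "Atanas Kiryakov|Person|true|Mentioned as a CEO, hence a person\n"
    "EKG|Location|false|EKG stands for enterprise knowledge graph, so it is not a location"
)
GUIDELINES = (
    "Person: an individual with a unique identity.\n"
    "Organisation: a group of people with a specific purpose.\n"
    "Location: a geographical area or place, physical or abstract."
)
ERROR_FIXES = "If there is a title for the name, include it in the entity, e.g. Mr. Smith not Smith."


def build_prompt(*modules: str) -> str:
    """Concatenate the selected prompt modules into a single system prompt."""
    return "\n\n".join(modules)


def extract_entities(text: str, system_prompt: str, model: str = "gpt-4o") -> str:
    """Send the document to the LLM and return its raw entity listing."""
    client = OpenAI()  # expects OPENAI_API_KEY in the environment
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content


# The incremental progression described above: each step reuses the previous modules.
prompt_v1 = build_prompt(GENERIC)
prompt_v2 = build_prompt(GENERIC, FEW_SHOT_COT)
prompt_v3 = build_prompt(GENERIC, FEW_SHOT_COT, GUIDELINES)
prompt_v4 = build_prompt(GENERIC, FEW_SHOT_COT, GUIDELINES, ERROR_FIXES)
```

Keeping each module as a separate building block makes it easy to ablate them one at a time, which is how the prompt variants in the tables below were produced.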
Results
The tables below present the cost, speed, and quality evaluation for the different prompts on AIDA and BioRED. We benchmarked GPT-4o [3] and Llama-3.1-70b-Instruct (via Databricks) against state-of-the-art (SOTA) NER models: BioLinkBERT (trained on BioRED) and BERT (trained on AIDA). The goal of the experiments was to compare the quality, latency, and cost of the two popular large language models (LLMs) across domains. The experiments were run five times to account for the non-deterministic nature of LLM outputs, and the averaged results are presented below.
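For reference, here is a minimal sketch of the entity-level scoring and five-run averaging behind the numbers. The matching criterion (exact match on entity text and type) and the per-run values are illustrative assumptions; our internal evaluation may differ in details such as span offsets.

```python
# A sketch of entity-level scoring: exact match on (entity text, entity type).
# The matching criterion and the per-run numbers are illustrative assumptions.
from statistics import mean, stdev


def prf1(gold: set[tuple[str, str]], predicted: set[tuple[str, str]]) -> tuple[float, float, float]:
    """Precision, recall, and F1 over sets of (entity_text, entity_type) pairs."""
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Average over repeated runs to smooth out the non-deterministic LLM outputs.
f1_per_run = [62.1, 63.4, 61.8, 62.9, 62.6]  # placeholder values for five runs
print(f"F1: {mean(f1_per_run):.1f} ± {stdev(f1_per_run):.1f}")
```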
BioRED performance
| Prompt | Model | P | R | F1 | Cost | Latency |
|---|---|---|---|---|---|---|
| Generic prompt | GPT-4o | 72 | 35 | 47.8 | $0.003 | 2.9 sec |
| Generic prompt | Llama | 87.4 | 43.1 | 57.7 | $0.001 | 3.62 sec |
| CoT prompt | GPT-4o | 72 | 37 | 48.9 | $0.017 | 5.85 sec |
| CoT prompt | Llama | 74.7 | 50.1 | 60 | $0.004 | 10.4 sec |
| CoT prompt and guidelines | GPT-4o | 70.2 | 40.8 | 51.6 | $0.028 | 8.95 sec |
| CoT prompt and guidelines | Llama | 74.7 | 56.9 | 64.6 | $0.007 | 34.4 sec |
| CoT, guidelines and error analysis iteration | GPT-4o | 74.2 | 54.1 | 62.6 | $0.030 | 8.79 sec |
| CoT, guidelines and error analysis iteration | Llama | 74.7 | 59.3 | 66.2 | $0.007 | 33.1 sec |
| | BioLinkBERT fine-tuned | 88.4 | 89.9 | 89.1 | Annotation cost | |
AIDA performance
| Prompt | Model | P | R | F1 | Cost | Latency |
|---|---|---|---|---|---|---|
| Generic prompt | GPT-4o | 90.4 | 54.8 | 68.2 | $0.004 | 2.57 sec |
| Generic prompt | Llama | 87.2 | 59.5 | 70.7 | $0.001 | 2.92 sec |
| CoT prompt | GPT-4o | 86.3 | 69.1 | 76.8 | $0.018 | 5.4 sec |
| CoT prompt | Llama | 84.8 | 69 | 76.1 | $0.005 | 9.37 sec |
| CoT prompt and guidelines | GPT-4o | 85.8 | 70.7 | 77.5 | $0.020 | 9.16 sec |
| CoT prompt and guidelines | Llama | 83.2 | 65.2 | 73.1 | $0.007 | 11.8 sec |
| CoT, guidelines and error analysis iteration | GPT-4o | 86.8 | 73.1 | 79.4 | $0.020 | 9.2 sec |
| CoT, guidelines and error analysis iteration | Llama | 87.2 | 64.9 | 74.5 | $0.006 | 10.3 sec |
| | BERT fine-tuned | 95.3 | 95 | 95.1 | Annotation cost | |
While more elaborate prompts generally yield better results, they also increase inference time and cost, so striking the right balance depends on the application's requirements. For BioRED, Llama yielded consistently better results, while for the common-domain task the two models were quite close to each other. We also experimented with prompt optimization tools; however, these experiments did not yield promising results: in many cases, the optimizers removed crucial entity-specific information and oversimplified the prompts.
A breakdown by entity type reveals a clear pattern: while GPT-4o's precision is often close to SOTA for many entity types, its recall consistently lags behind. The same holds for Llama. This trend suggests that while GenAI is good at identifying high-confidence entities, it misses a significant number of relevant cases. This limitation has important implications for its use in automated knowledge graph construction.
In the general domain (AIDA), the standard deviations are relatively small across most entity types, indicating more consistent LLM performance for common-domain entities like Person, Organisation, and Location. This reflects the models' confidence and generalization capability for widely used, well-represented categories. The standard deviations are significantly larger in the biomedical domain (BioRED), especially for Cell lines, Species, and Variants, which can be attributed to the rarity and specificity of those entity types.
What if LLMs have learned the benchmarks?
There are rising concerns about the validity of LLM evaluation on open benchmarks, as many of them have been used to train the models [4]. To ensure that our evaluation is robust to unseen data, we created a common-domain silver corpus from the 250 most recent documents in the Ontotext NOW corpus. The documents were trimmed to reflect the text length distribution of the AIDA corpus and then annotated with the BERT model trained on AIDA.
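A sketch of how such a silver corpus can be produced with Hugging Face transformers is shown below. The public CoNLL-03 model identifier stands in for our internal AIDA-trained BERT, and the trimming heuristic is a simplification of matching AIDA's length distribution; both are assumptions for illustration.

```python
# A sketch of silver-corpus creation: trim documents to a target length and
# annotate them with a fine-tuned BERT NER model. The public CoNLL-03 model
# below stands in for our internal AIDA-trained BERT; the trimming heuristic
# is a simplification of matching AIDA's length distribution.
from transformers import pipeline

ner = pipeline(
    "ner",
    model="dbmdz/bert-large-cased-finetuned-conll03-english",
    aggregation_strategy="simple",  # merge word pieces into whole entities
)


def trim(text: str, max_chars: int = 2000) -> str:
    """Cut the document at the last sentence boundary before max_chars."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    return cut[: cut.rfind(".") + 1] or cut


documents = ["OpenAI released ChatGPT, drawing attention from regulators in Brussels."]
silver_corpus = [{"text": trim(doc), "entities": ner(trim(doc))} for doc in documents]
print(silver_corpus[0]["entities"])  # e.g. [{'entity_group': 'ORG', 'word': 'OpenAI', ...}, ...]
```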
Results on NOW corpus
| Prompt | Model | P | R | F1 | F1 on AIDA |
|---|---|---|---|---|---|
| Generic prompt | GPT-4o | 77 | 61.5 | 68.4 | 68.2 |
| Generic prompt | Llama | 80 | 57 | 66.8 | 70.7 |
| CoT prompt | GPT-4o | 78.9 | 64.2 | 70.8 | 76.8 |
| CoT prompt | Llama | 76.9 | 65.6 | 70.8 | 76.1 |
| CoT prompt and guidelines | GPT-4o | 77.1 | 66.4 | 71.3 | 77.5 |
| CoT prompt and guidelines | Llama | 74.7 | 68.1 | 71.2 | 73.1 |
We ran the LLMs with the same prompts and configurations to evaluate their robustness to the new data. As the table shows, the results are quite close to those obtained on AIDA. The numbers are slightly lower due to lower precision, which can be explained by the silver-standard nature of the corpus. Manual error analysis showed that the LLMs were better at extracting new entities unfamiliar to BERT, such as "ChatGPT" or "OpenAI". Overall, this evaluation gives us confidence that the results are not an artifact of well-known benchmark data that might have leaked into the models' training sets.
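The manual error analysis boils down to comparing, per document, the entity set produced by the LLM with the one produced by the AIDA-trained BERT. A minimal sketch follows; the example entities are illustrative, not taken from the corpus.

```python
# A sketch of the comparison behind the manual error analysis: which entities
# does the LLM find that the AIDA-trained BERT misses, and vice versa?
# The example entities below are illustrative.
def diff_entities(llm: set[tuple[str, str]], bert: set[tuple[str, str]]) -> None:
    print("Only the LLM:", llm - bert)
    print("Only BERT:   ", bert - llm)
    print("Both:        ", llm & bert)


diff_entities(
    llm={("OpenAI", "Organisation"), ("ChatGPT", "Miscellaneous"), ("Brussels", "Location")},
    bert={("Brussels", "Location")},
)
```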
What does this give us?
Given these findings, a natural application for GenAI in information extraction is as an annotation assistant:
- LLMs are close to humans on specific entity types: We've seen that for common and clearly described entities like Person, Organisation, and Disease, LLMs come close to human annotators and can speed up the annotation process.
- LLMs are robust to unseen texts: We've checked that the results hold even for the most recent texts, which were not used for LLM training.
- There is no silver bullet: LLMs still need human validation, and there does not seem to be a single best model, as Llama-70b and GPT-4o perform differently on different tasks.
You can use Ontotext Metadata Studio (OMDS) to integrate any NER model and apply it to your documents to extract the entities you are interested in. OMDS even comes with a built-in NER model that uses LLMs and the techniques from this blog post to provide zero-shot tagging for specific domains and entity types. You can combine human and automatic tagging to bootstrap manual annotation by domain experts.
OMDS enables several workflows for managing Named Entity Recognition (NER) tasks. It supports the evaluation of NER models through annotation by subject matter experts and quality benchmarking of automatic tagging against human annotations. This helps track model performance over time, identify problematic scenarios, and easily adjust prompts to improve quality.
By combining automatic and manual tagging, OMDS allows for the efficient preparation of high-quality training datasets for fine-tuning models in specific domains. Additionally, it facilitates incremental tagging of live documents, where you can upload and automatically tag content, indexing results for downstream applications. This makes OMDS a centralized hub for all your text processing needs.
Ready to supercharge your knowledge graph building?
1. Ashok, D., & Lipton, Z. C. (2023). PromptNER: Prompting for Named Entity Recognition. arXiv preprint arXiv:2305.15444.
2. Hu, Y., Chen, Q., Du, J., Peng, X., Keloth, V. K., Zuo, X., Zhou, Y., Li, Z., Jiang, X., Lu, Z., Roberts, K., & Xu, H. (2024). Improving large language models for clinical named entity recognition via prompt engineering. Journal of the American Medical Informatics Association, 31(9), 1812–1820. https://doi.org/10.1093/jamia/ocad259
3. The "gpt-4o-2024-05-13" checkpoint.
4. Deng, C., Zhao, Y., Heng, Y., Li, Y., Cao, J., Tang, X., & Cohan, A. (2024). Unveiling the spectrum of data contamination in language models: A survey from detection to remediation. arXiv preprint arXiv:2406.14644.