Introduction to retrieval-augmented generation (RAG). Why data is the new gold while content is still king
Once ChatGPT took the world by storm, the legal industry was one of the first professional fields to begin reaping its benefits. Why was that? Tapping into the large corpus of text found in documents, contracts, cases, and other primary and secondary source material was tantamount to striking gold in the Generative AI (GenAI) race for legal research, writing, and more mundane legal tasks.
The use of GenAI backed by domain-specific retrieval-augmented generation (RAG) enables a rich level of nuance and expertise for specialized fields such as law. This is the single biggest differentiator for a more trustworthy and professional-grade legal AI assistant.
Context and other AI terms
Important concepts and terms need to be defined before even getting to the significance and implications of RAG’s application in legal work.
Artificial Intelligence (AI)
AI is the broad field of developing computer systems capable of human-like intelligence. At its core, it is the idea that computers can simulate human intelligence: the computer supplies the artificial side, while algorithms imitate the reasoning and decision-making that make up the intelligence.
Logical behavior, decisions, and rules are at the core of human intelligence. This is where it gets interesting for attorneys and legal professionals since they need to follow a set of rules and make logical decisions in order to practice law effectively.
Similarly, AI algorithms are designed to follow logical rules and make decisions based on data and patterns. This allows AI to perform tasks that would typically require human intelligence, such as problem-solving, learning, and decision-making.
Machine learning (ML)
Machine learning is a subset of AI that involves training models to make predictions or decisions from data rather than from explicitly programmed rules. This allows a system to continuously learn and improve from the data it ingests and to make more accurate decisions, much as humans learn from experience.
Now to supervised vs. unsupervised learning. Supervised learning involves providing the model with labeled data and desired outcomes, while unsupervised learning involves letting the model identify patterns and make decisions on its own. In other words, supervised learning is about predicting known values or labels from examples; unsupervised learning is about making sense of raw, unlabeled data.
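To make the distinction concrete, here is a minimal sketch in Python using scikit-learn on tiny, made-up data (the features, numbers, and labels are purely illustrative): a supervised classifier learns from labeled examples, while an unsupervised clustering algorithm groups unlabeled data on its own.

```python
# A minimal illustration of supervised vs. unsupervised learning using
# scikit-learn on tiny, made-up data (document length, citation count).
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Supervised: each example carries a label (1 = relevant, 0 = not relevant),
# and the model learns to map features to that label.
X_labeled = [[120, 3], [450, 12], [80, 1], [600, 20]]
y_labels = [0, 1, 0, 1]
classifier = LogisticRegression().fit(X_labeled, y_labels)
print(classifier.predict([[500, 15]]))  # predicted label for unseen data

# Unsupervised: the same kind of data with no labels at all; the algorithm
# simply groups similar points into clusters and we interpret the result.
X_unlabeled = [[120, 3], [450, 12], [80, 1], [600, 20]]
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(X_unlabeled)
print(clusters)  # cluster assignments discovered from the raw data
```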
Both approaches have their own advantages and are used in different scenarios depending on the type of data and desired outcome. We can see how and why this affects the legal profession and the debates around legal tasks and even work-product.
Deep learning
Deep learning is a subfield of machine learning that involves training artificial neural networks with many layers (deep neural networks) to perform tasks such as image and speech recognition.
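As a rough sketch of what "many layers" means (assuming PyTorch; the layer sizes here are arbitrary), a deep network is simply several learned transformations stacked on top of one another:

```python
# A minimal multi-layer ("deep") neural network in PyTorch.
# The layer sizes are arbitrary; real image or speech models are far larger.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(32, 64),   # layer 1: a learned transformation of the input
    nn.ReLU(),           # non-linearity between layers
    nn.Linear(64, 64),   # layer 2
    nn.ReLU(),
    nn.Linear(64, 2),    # output layer, e.g. scores for two classes
)

x = torch.randn(1, 32)   # one example with 32 input features
print(model(x))          # raw scores produced by stacking the layers
```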
Natural Language Processing (NLP)
Natural Language Processing (NLP) enables computers to comprehend, interpret, and create human language, bridging the gap between computers and human communication.
Generative AI (GenAI)
GenAI refers to models that produce open-ended outputs such as images, text, and audio, as opposed to discriminative AI, which handles classification and regression. Machine learning lets a system understand and learn from data; GenAI uses that learning to create. GenAI is therefore the technology, built on the broader terms defined above, that can generate human-like text, images, and even video.
Large Language Models (LLM)
LLMs form the bridge between our queries and the text a GenAI system generates. While traditional LLMs deal only with text inputs, more recent multimodal models can handle multiple input types such as images, video, and audio.
LLMs are trained on vast amounts of text and typically contain billions of trainable parameters. They are built with deep learning and machine learning techniques and, at their core, consist of complicated formulas that predict the next word.
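To illustrate what "predicting the next word" looks like in practice, here is a small sketch using the open-source Hugging Face transformers library and the publicly available GPT-2 model, chosen purely for illustration and unrelated to any product discussed here:

```python
# Illustrative next-token prediction with an open-source model (GPT-2).
# This only sketches the general mechanism, not any product's model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The court held that the contract was"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits            # a score for every vocabulary token

probs = torch.softmax(logits[0, -1], dim=-1)   # probabilities for the next token
top = torch.topk(probs, k=5)
for p, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id))!r}: {p.item():.3f}")
```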
Overall, LLMs have the potential to greatly assist with legal tasks by automating and expediting processes that would typically require human effort. However, it is important to carefully consider the training data and potential biases and hallucinations when implementing LLMs in legal settings.
Retrieval-augmented generation (RAG)
Finally, RAG is the retrieval and processing of raw, domain-specific data that truly sets a professional-grade LLM apart from others. This process is called grounding, in which an LLM is augmented with industry-specific data that is not part of the development process for mass-market LLMs.
Rather than having the LLM answer a question from its own memory, the system first retrieves relevant documents from a search engine and then passes those documents to the LLM as inputs to ground the answer. For the legal field, this means gathering and preprocessing legal documents and then tuning the pipeline for specific tasks such as contract analysis or legal document summarization. This allows for a more efficient and accurate analysis of legal documents, leading to potential time and cost savings for legal professionals.
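In schematic Python, the core RAG loop looks something like the sketch below. The search_engine and llm objects are hypothetical placeholders for whatever retrieval system and language model a given product actually uses; the point is the order of operations: retrieve first, then generate from what was retrieved.

```python
# A schematic RAG loop. `search_engine` and `llm` are hypothetical
# placeholders, not references to any real product's API.

def answer_with_rag(question: str, search_engine, llm, top_k: int = 5) -> str:
    # 1. Retrieve: find the most relevant documents for the question.
    documents = search_engine.search(question, limit=top_k)

    # 2. Augment: build a prompt containing the retrieved passages, so the
    #    model answers from them rather than from its memory alone.
    context = "\n\n".join(doc.text for doc in documents)
    prompt = (
        "Answer the question using ONLY the sources below. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

    # 3. Generate: the LLM grounds its answer in the retrieved content.
    return llm.generate(prompt)
```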
RAG is therefore widely used to minimize hallucinations. However, the effectiveness of RAG responses is based on the quality of the raw data and the ability to fine-tune the model for specific tasks.
Retrieval of gold-standard legal content
Content is indeed still king, and data is the new gold. However, it’s the quality of that data and how it’s managed that becomes the critical factor.
There are many legal research tools on the market using LLMs. However, legal tech specialists need to ask the right questions and understand exactly which content sources a tool's LLM retrieves for every query and what data it was trained on.
As a side effect of their training, LLMs often tend to please, and if they don’t know the answer offhand, they may make something up in an attempt to be helpful. RAG can mitigate this by providing useful context to help answer questions, similar to an open-book quiz, thereby grounding an LLM’s answer and reducing the risk of hallucinations.
Lead Applied Scientist, TR Labs
Westlaw content
Westlaw has always been the standard for legal research because of its content and proprietary editorial enhancements like the West Key Number System and KeyCite.
In the West reporting system, an attorney-editor reviews each published case, identifies the points of law it addresses, and summarizes them. These summaries, known as headnotes, are placed at the beginning of the case and typically consist of a single paragraph.
Each headnote is assigned a specific topic and key number, and they are organized in multi-volume books called Digests. These Digests act as subject indexes for the case law found in West reporters. It’s important to note that headnotes are editorial aids and do not serve as legal authority themselves.
For instance, AI-Assisted Research on Westlaw Precision focuses the LLM on the actual language of cases, statutes, and regulations. It doesn't ask the LLM to generate text based solely on the question asked, but rather on the content it retrieves. It finds the very best cases, statutes, and regulations to address the question, as well as the very best portions of those cases, statutes, and regulations.
Studies have shown that poor retrieval and/or bad context can be just as bad as, or worse than, relying on an LLM’s internal memory — just as a law student using an outdated textbook will give wrong legal answers, an LLM using RAG without good sources will generate unreliable content. That’s why the Westlaw and CoCounsel GenAI solutions are so dependable — they are backed by the largest and most comprehensive legal libraries available.
Lead Applied Scientist, TR Labs
Practical Law content
Practical Law provides trusted guidance, checklists and forms that help attorneys practice law effectively, efficiently, and ultimately with less risk.
The reliability of this data is dependent on the people responsible for its labeling, structure, and annotations.
The team of over 650 expert legal editors is highly qualified; its members have practiced at the world’s leading law firms, corporate law departments, and government agencies. Their full-time job is to create and maintain timely, reliable, and accurate resources so that attorneys have a strong starting point on their legal matters, whether those involve new legal realities, legislative changes, or relevant practice areas.
Whether it’s a crisis, an unfamiliar matter, or an ever-evolving issue, they provide comprehensive insight and answers to your “how do I” questions.
RAG reliability
Yet even the gold standard of the most trusted and reliable sources of legal content falls short without a robust testing and benchmarking process. In other words, what constitutes a thoughtful and methodical process to ensure more trust, reliability, and accuracy?
Testing benchmarks
Supervised machine learning trains models on carefully labeled data with known outcomes. This enables the fine-tuning of generated responses to the standard lawyers need in their legal research and drafting.
This is precisely why a Thomson Reuters benchmarking and evaluation team stressed the importance of both retrieval and generation components in legal AI systems.
For a RAG-based system, this means ensuring that the initial document retrieval is accurate and relevant, as it directly impacts the quality of the generated output. Legal tech specialists should therefore thoroughly analyze and weigh the benefits of both RAG and LLM components when considering legal-specific GenAI assistants.
For example, the Search a Database skill first uses various non-LLM-based search systems to retrieve relevant documents before the LLM synthesizes an answer. If the initial retrieval process is substandard, the LLM’s performance will be compromised.
Head of CoCounsel, Thomson Reuters
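As a generic illustration of what a non-LLM-based retrieval step can look like (a sketch only, not the actual CoCounsel implementation), a classic keyword-ranking function such as BM25 scores documents against a query before any language model is involved. The example below assumes the open-source rank_bm25 package and made-up documents:

```python
# Illustrative non-LLM retrieval with BM25 (rank_bm25 package).
# The documents and query are made up; this is not any product's pipeline.
from rank_bm25 import BM25Okapi

documents = [
    "The statute of limitations for breach of contract is four years.",
    "Employment agreements often include non-compete clauses.",
    "A force majeure clause excuses performance during unforeseen events.",
]
tokenized_docs = [doc.lower().split() for doc in documents]
bm25 = BM25Okapi(tokenized_docs)

query = "how long to sue for breach of contract".lower().split()
scores = bm25.get_scores(query)          # one relevance score per document
best = max(range(len(documents)), key=lambda i: scores[i])
print(documents[best])  # the passage an LLM would then be asked to work from
```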
CoCounsel’s Trust Team recognizes the subjective nature of legal tasks and the variability in what constitutes a correct answer. Their response is not only to test and benchmark, but also to release performance statistics and sample tests.
The testing process aims to simulate real attorney tasks. These tests are based on insights, customer feedback, and secondary sources. An attorney tester manually completes the test to establish an “ideal response,” which is peer-reviewed.
This ideal response sets the benchmark for passing scores. The Trust Team then uses CoCounsel’s skills to perform the task, generating a “model response” which is compared to the ideal response. Differences are assessed to determine if they render the output incomplete, incorrect, or misleading.
Tests can be failed for various reasons, even if the answer isn’t outright wrong, especially for subjective skills like summarization. Evaluation instructions are developed so that automated, LLM-based evaluation aligns with the judgments of human reviewers.
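A heavily simplified sketch of such a benchmarking harness is shown below. The test case, the run_skill function, and the grading rule are hypothetical stand-ins; in practice, the comparison between the model response and the attorney-written ideal response is far more nuanced and is partly performed by human reviewers, as described above.

```python
# A hypothetical, heavily simplified benchmarking harness.
# `run_skill` and `grade` stand in for a real system's skill execution and
# its human/LLM-based grading of model responses against ideal responses.
from dataclasses import dataclass

@dataclass
class TestCase:
    task: str            # the simulated attorney task
    ideal_response: str  # attorney-written, peer-reviewed benchmark answer

def grade(model_response: str, ideal_response: str) -> bool:
    # Placeholder grading rule: in reality, differences are assessed to see
    # whether they make the output incomplete, incorrect, or misleading.
    return ideal_response.lower() in model_response.lower()

def run_benchmark(test_cases, run_skill):
    passed = sum(
        grade(run_skill(case.task), case.ideal_response) for case in test_cases
    )
    return passed / len(test_cases)  # pass rate across the dataset

# Example usage with a trivial fake skill:
cases = [TestCase("Summarize clause 4.", "termination for convenience")]
print(run_benchmark(cases, run_skill=lambda task: "Clause 4 allows termination for convenience."))
```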
Four core skills were tested and below are some of the results.
| Skill | Pass Rate¹ |
|---|---|
| Extract Contract Data | 98.8% |
| Review Documents | 96.6% |
| Search a Database | 95.6% |
| Summarize | 90.6% |
¹ Derived from datasets ranging from 89 to 98 test cases. Zero hallucinations were identified among the failing tests. Please note that the Summarize skill is inherently subjective: two attorneys may disagree on how correct an answer is. As such, it is important to highlight that tests in this skill’s dataset can be failed for many reasons, including when the answer omits a detail that the tester considers key. This does not mean the answer is outright wrong.
The team is committed to continually monitoring and refining the skills tests by manually reviewing failure cases from the automated tests and spot-checking passing samples to ensure the automated evaluation aligns with human judgment.
The entire process underscores the importance of transparency in building trust with users. See the full Legal AI benchmarking results for further reference.