# Welcome to dsRAG
dsRAG is a retrieval engine for unstructured data. It is especially good at handling challenging queries over dense text, like financial reports, legal documents, and academic papers. dsRAG achieves substantially higher accuracy than vanilla RAG baselines on complex open-book question answering tasks. On one especially challenging benchmark, FinanceBench, dsRAG answers 96.6% of questions accurately, while the vanilla RAG baseline gets only 32% correct.
## Key Methods
dsRAG uses three key methods to improve performance over vanilla RAG systems:
### 1. Semantic Sectioning
Semantic sectioning uses an LLM to break a document into sections. It works by annotating the document with line numbers and then prompting an LLM to identify the starting and ending lines of each "semantically cohesive section." Sections should be anywhere from a few paragraphs to a few pages long, and they are broken into smaller chunks if needed. The LLM is also prompted to generate a descriptive title for each section. These section titles are used in the contextual chunk headers created by AutoContext, giving the ranking models (embeddings and reranker) additional context and enabling better retrieval.
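Here is a minimal sketch of that loop. It is illustrative only: the `call_llm` helper is a stand-in for whatever chat-completion call you use, and the JSON response format is an assumption for the example, not dsRAG's internal API.

```python
import json

def call_llm(prompt: str) -> str:
    """Stand-in for any chat-completion call; returns the model's text reply."""
    raise NotImplementedError

def semantic_sectioning(document: str) -> list[dict]:
    # Annotate each line with its number so the LLM can cite boundaries.
    lines = document.splitlines()
    numbered = "\n".join(f"[{i}] {line}" for i, line in enumerate(lines))
    prompt = (
        "The document below has line numbers in brackets. Split it into "
        "semantically cohesive sections (a few paragraphs to a few pages each) "
        'and reply with JSON: [{"title": str, "start_line": int, "end_line": int}]\n\n'
        + numbered
    )
    sections = json.loads(call_llm(prompt))
    # Materialize each section's text from the line ranges the LLM returned.
    for section in sections:
        section["text"] = "\n".join(
            lines[section["start_line"] : section["end_line"] + 1]
        )
    return sections
```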
### 2. AutoContext
AutoContext creates contextual chunk headers that contain document-level and section-level context, and prepends those headers to the chunks before embedding them. This gives the embeddings a much more accurate and complete representation of the content and meaning of the text. In our testing, this feature leads to a dramatic improvement in retrieval quality. Beyond increasing the rate at which the correct information is retrieved, AutoContext also substantially reduces how often irrelevant results appear, which in turn reduces how often the LLM misinterprets a piece of text in downstream chat and generation applications.
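Conceptually, the text that gets embedded looks something like the sketch below. The header format and the `embed_fn` parameter are assumptions for illustration, not dsRAG's exact implementation.

```python
def build_chunk_header(doc_title: str, doc_summary: str, section_title: str) -> str:
    # Document-level and section-level context for a single chunk.
    return (
        f"Document: {doc_title}\n"
        f"Document summary: {doc_summary}\n"
        f"Section: {section_title}"
    )

def embed_chunk(chunk_text: str, header: str, embed_fn) -> list[float]:
    # The header is prepended before embedding, so the resulting vector
    # reflects the chunk's context, not just its raw text.
    return embed_fn(header + "\n\n" + chunk_text)
```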
### 3. Relevant Segment Extraction (RSE)
Relevant Segment Extraction (RSE) is a query-time post-processing step that takes clusters of relevant chunks and intelligently combines them into longer sections of text that we call segments. These segments provide better context to the LLM than any individual chunk can. For simple factual questions, the answer is usually contained in a single chunk, but for more complex questions, the answer usually spans a longer section of text. The goal of RSE is to intelligently identify the section(s) of text that provide the most relevant information, without being constrained to fixed-length chunks.
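One simple way to sketch that selection step (an illustration, not dsRAG's exact algorithm): subtract an irrelevance penalty from each chunk's relevance score so that weak chunks contribute negative value, then take the contiguous run of chunks with the maximum total value, which is a classic maximum-subarray scan. The `irrelevance_penalty` value here is an arbitrary assumption.

```python
def best_segment(scores: list[float], irrelevance_penalty: float = 0.2) -> tuple[int, int]:
    """Return (start, end) chunk indices of the highest-value contiguous segment.

    Chunks scoring below the penalty contribute negative value, so the segment
    naturally stops growing when it hits a stretch of irrelevant text.
    """
    values = [s - irrelevance_penalty for s in scores]
    best_sum, best_range = float("-inf"), (0, 1)
    current_sum, current_start = 0.0, 0
    for i, v in enumerate(values):  # Kadane's maximum-subarray scan
        if current_sum <= 0:
            current_sum, current_start = v, i
        else:
            current_sum += v
        if current_sum > best_sum:
            best_sum, best_range = current_sum, (current_start, i + 1)
    return best_range
```

For example, with per-chunk scores `[0.1, 0.05, 0.9, 0.8, 0.0]` this returns `(2, 4)`: chunks 2 and 3 are merged into one segment rather than returned as separate fixed-size results.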
## Quick Example
```python
from dsrag.create_kb import create_kb_from_file

# Create a knowledge base from a file
file_path = "path/to/your/document.pdf"
kb_id = "my_knowledge_base"
kb = create_kb_from_file(kb_id, file_path)

# Query the knowledge base
search_queries = ["What are the main topics covered?"]
results = kb.query(search_queries)
for segment in results:
    print(segment)
```
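Each result is a segment assembled by RSE, so a single printed result may span several adjacent chunks of the source document rather than one fixed-size chunk.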
## Evaluation Results
### FinanceBench
On FinanceBench, which uses a corpus of hundreds of 10-Ks and 10-Qs with challenging queries that often require combining multiple pieces of information:
- Baseline retrieval pipeline: 32% accuracy
- dsRAG (with default parameters and Claude 3.5 Sonnet): 96.6% accuracy
### KITE Benchmark
On the KITE benchmark, which includes diverse datasets (AI papers, company 10-Ks, company handbooks, and Supreme Court opinions), dsRAG shows significant improvements:
| Dataset | Top-k | RSE | CCH+Top-k | CCH+RSE |
|---|---|---|---|---|
| AI Papers | 4.5 | 7.9 | 4.7 | 7.9 |
| BVP Cloud | 2.6 | 4.4 | 6.3 | 7.8 |
| Sourcegraph | 5.7 | 6.6 | 5.8 | 9.4 |
| Supreme Court Opinions | 6.1 | 8.0 | 7.4 | 8.5 |
| **Average** | 4.72 | 6.73 | 6.04 | 8.42 |
## Getting Started
Check out our Quick Start Guide to begin using dsRAG in your projects.
## Community and Support
- Join our Discord for community support
- Fill out our use case form if using dsRAG in production
- Need professional help? Contact our team