Knowledge Bases
A knowledge base in dsRAG is a searchable collection of documents that can be queried to find relevant information. The KnowledgeBase class handles document processing, storage, and retrieval.
Creating a Knowledge Base
To create a knowledge base:
from dsrag.knowledge_base import KnowledgeBase
# Create a basic knowledge base
kb = KnowledgeBase(
    kb_id="my_kb",
    title="Product Documentation",
    description="Technical documentation for XYZ product"
)
# Or with custom configuration
# (custom_embedder, custom_reranker, custom_vector_db, and custom_chunk_db
# are placeholders for component instances you have created beforehand)
kb = KnowledgeBase(
    kb_id="my_kb",
    storage_directory="path/to/storage",  # Where to store KB data
    embedding_model=custom_embedder,  # Custom embedding model
    reranker=custom_reranker,  # Custom reranking model
    vector_db=custom_vector_db,  # Custom vector database
    chunk_db=custom_chunk_db  # Custom chunk database
)
Adding Documents
Documents can be added from text or files:
# Add from text
kb.add_document(
    doc_id="intro-guide",
    text="This is the introduction guide...",
    document_title="Introduction Guide",
    metadata={"type": "guide", "version": "1.0"}
)
# Add from file
kb.add_document(
    doc_id="user-manual",
    file_path="path/to/manual.pdf",
    metadata={"type": "manual", "department": "engineering"}
)
# Add with advanced configuration
kb.add_document(
    doc_id="technical-spec",
    file_path="path/to/spec.pdf",
    file_parsing_config={
        "use_vlm": True,  # Use a vision language model for PDFs
        "always_save_page_images": True  # Save page images for visual content
    },
    chunking_config={
        "chunk_size": 800,  # Characters per chunk
        "min_length_for_chunking": 2000  # Minimum length to chunk
    },
    auto_context_config={
        "use_generated_title": True,  # Generate a title if none is provided
        "get_document_summary": True  # Generate a document summary
    }
)
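To make the two chunking_config options above more concrete, here is a minimal sketch of fixed-size character chunking with a minimum-length threshold. It is not dsRAG's chunking code: the function name chunk_text is hypothetical, and the sketch assumes that chunk_size is an upper bound in characters and that text shorter than min_length_for_chunking is kept as a single chunk.

```python
def chunk_text(text: str, chunk_size: int = 800,
               min_length_for_chunking: int = 2000) -> list[str]:
    """Split text into pieces of at most chunk_size characters.

    Assumption for illustration: text shorter than min_length_for_chunking
    is returned unsplit as a single chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if len(text) < min_length_for_chunking:
        return [text]
    # Plain fixed-width slicing; a real chunker would respect sentence
    # or section boundaries instead of cutting mid-word.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


short_doc = "x" * 1500   # below the threshold, stays in one piece
long_doc = "x" * 2500    # above the threshold, gets split

print(len(chunk_text(short_doc)))               # 1
print([len(c) for c in chunk_text(long_doc)])   # [800, 800, 800, 100]
```

Under these assumptions, raising min_length_for_chunking keeps more short documents intact, while lowering chunk_size produces more, smaller chunks for long ones.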
Querying the Knowledge Base
Search the knowledge base for relevant information:
# Simple query
results = kb.query(
    search_queries=["How to configure the system?"]
)
# Advanced query with filtering and parameters
results = kb.query(
    search_queries=[
        "System configuration steps",
        "Configuration prerequisites"
    ],
    metadata_filter={
        "field": "doc_id",
        "operator": "equals",
        "value": "user-manual"  # Matches the doc_id used when the document was added
    },
    rse_params="precise",  # Use preset RSE parameters
    return_mode="text"  # Return text content
)
# Process results
for segment in results:
    print(f"Document: {segment['doc_id']}")
    print(f"Pages: {segment['segment_page_start']} - {segment['segment_page_end']}")
    print(f"Content: {segment['content']}")
    print(f"Relevance: {segment['score']}")
RSE Parameters
The Relevant Segment Extraction (RSE) system can be tuned using different parameter presets:
- "balanced": Default preset balancing precision and comprehensiveness
- "precise": Favors shorter, more focused segments
- "comprehensive": Returns longer segments with more context
Or configure custom RSE parameters:
results = kb.query(
    search_queries=["system requirements"],
    rse_params={
        "max_length": 5,  # Maximum length of a single segment (in chunks)
        "overall_max_length": 20,  # Total length limit across all segments (in chunks)
        "minimum_value": 0.5,  # Minimum relevance score
        "irrelevant_chunk_penalty": 0.2  # Penalty for irrelevant chunks in a segment; a higher penalty leads to shorter segments
    }
)
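To build intuition for how these four parameters interact, the following is a simplified sketch of segment selection, not dsRAG's actual RSE implementation. It assumes each chunk has a relevance score between 0 and 1, that a chunk's value is its relevance minus irrelevant_chunk_penalty, and that segments are picked greedily by total value. The function name best_segments and the sample scores are hypothetical.

```python
def best_segments(relevance, max_length=5, overall_max_length=20,
                  minimum_value=0.5, irrelevant_chunk_penalty=0.2):
    """Greedily pick non-overlapping chunk ranges with the highest total value.

    Returns a list of (start, end) index pairs; end is exclusive.
    """
    # Chunks scoring below the penalty get a negative value, so including
    # them in a segment lowers that segment's total.
    values = [r - irrelevant_chunk_penalty for r in relevance]
    taken = [False] * len(values)
    segments = []
    total_length = 0
    while total_length < overall_max_length:
        best = None  # (value, start, end)
        for start in range(len(values)):
            running = 0.0
            last_end = min(start + max_length, len(values))
            for end in range(start + 1, last_end + 1):
                if taken[end - 1]:
                    break  # segments may not overlap
                if total_length + (end - start) > overall_max_length:
                    break  # would exceed the overall budget
                running += values[end - 1]
                if best is None or running > best[0]:
                    best = (running, start, end)
        if best is None or best[0] < minimum_value:
            break  # nothing left that is valuable enough
        _, start, end = best
        for i in range(start, end):
            taken[i] = True
        segments.append((start, end))
        total_length += end - start
    return segments


# Hypothetical per-chunk relevance scores for one document
scores = [0.1, 0.8, 0.3, 0.9, 0.1, 0.0, 0.75, 0.1]

print(best_segments(scores))
# [(1, 4), (6, 7)]
print(best_segments(scores, irrelevant_chunk_penalty=0.6, minimum_value=0.25))
# [(3, 4)]
```

In the first call, the mildly relevant chunk 2 is cheap enough to bridge chunks 1 and 3 into one segment. With the higher penalty in the second call, bridging costs more than it gains, so only the single best chunk survives.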
Metadata Query Filters
Certain vector DBs support metadata filtering when running a query (currently only ChromaDB). This gives you more control over which documents are searched. A common use case is asking questions about a single document in a knowledge base, in which case you supply the doc_id as a metadata filter.
The metadata filter should be a dictionary with the following structure:
metadata_filter = {
    "field": "doc_id",  # The metadata field to filter on
    "operator": "equals",  # The comparison operator
    "value": "doc123"  # The value to compare against
}
Supported operators:
- equals
- not_equals
- in
- not_in
- greater_than
- less_than
- greater_than_equals
- less_than_equals
For operators that take multiple values (in and not_in), the value should be a list where all items are of the same type (string, integer, or float).
Example usage:
# Query a specific document
results = kb.query(
    search_queries=["system requirements"],
    metadata_filter={
        "field": "doc_id",
        "operator": "equals",
        "value": "technical_spec_v1"
    }
)
# Query documents from multiple departments
results = kb.query(
    search_queries=["security protocols"],
    metadata_filter={
        "field": "department",
        "operator": "in",
        "value": ["security", "compliance"]
    }
)
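The filter structure and operator list described above can be checked before a query is sent. The sketch below validates a filter dictionary and evaluates it against plain metadata dictionaries, which is handy for testing filters locally. It is an illustration only: validate_filter and matches are hypothetical helpers, not dsRAG or ChromaDB functions, and treating a missing field as a non-match is an assumption.

```python
OPERATORS = {
    "equals": lambda actual, expected: actual == expected,
    "not_equals": lambda actual, expected: actual != expected,
    "in": lambda actual, expected: actual in expected,
    "not_in": lambda actual, expected: actual not in expected,
    "greater_than": lambda actual, expected: actual > expected,
    "less_than": lambda actual, expected: actual < expected,
    "greater_than_equals": lambda actual, expected: actual >= expected,
    "less_than_equals": lambda actual, expected: actual <= expected,
}


def validate_filter(metadata_filter):
    """Raise ValueError if the filter does not follow the documented shape."""
    missing = {"field", "operator", "value"} - set(metadata_filter)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    operator = metadata_filter["operator"]
    value = metadata_filter["value"]
    if operator not in OPERATORS:
        raise ValueError(f"unsupported operator: {operator!r}")
    if operator in ("in", "not_in"):
        if not isinstance(value, list):
            raise ValueError(f"{operator!r} requires a list value")
        item_types = {type(item) for item in value}
        if len(item_types) > 1 or not item_types <= {str, int, float}:
            raise ValueError("list items must share one type: str, int, or float")


def matches(metadata, metadata_filter):
    """Return True if a metadata dict satisfies the filter."""
    validate_filter(metadata_filter)
    field = metadata_filter["field"]
    if field not in metadata:
        return False  # assumption: a missing field never matches
    compare = OPERATORS[metadata_filter["operator"]]
    return compare(metadata[field], metadata_filter["value"])


# Illustrative metadata records (not real knowledge base contents)
docs = [
    {"doc_id": "a", "department": "security"},
    {"doc_id": "b", "department": "engineering"},
    {"doc_id": "c", "department": "compliance"},
]
department_filter = {
    "field": "department",
    "operator": "in",
    "value": ["security", "compliance"],
}

print([d["doc_id"] for d in docs if matches(d, department_filter)])
# ['a', 'c']
```

Validating up front turns a misspelled operator or a mixed-type list into an immediate error rather than an empty result set.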