# Chat
The Chat functionality in dsRAG provides a powerful way to interact with your knowledge bases through a conversational interface. It handles message history, knowledge base searching, and citation tracking automatically.
## Overview
The chat system works by:
- Maintaining a chat thread with message history
- Automatically generating relevant search queries based on user input
- Searching knowledge bases for relevant information
- Generating responses with citations to source materials
## Defining Where to Store Chat History

Chat threads in dsRAG need to be persisted somewhere so that conversations can continue across multiple interactions. You'll need an implementation of the `ChatThreadDB` class to handle this storage: either one of the included implementations described below, or your own, which lets you store chat threads in whatever database or storage system works best for your application.

The `ChatThreadDB` interface defines methods for:
- Creating new chat threads
- Retrieving existing threads
- Adding interactions to threads
- Managing thread metadata and configuration
You'll need to initialize your chat thread storage before creating any new chat threads or retrieving responses.
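If you write your own backend instead of using one of the included implementations below, it might look roughly like this minimal, non-persistent sketch. The base-class import path and method names here are assumptions inferred from the responsibilities listed above; check the actual `ChatThreadDB` base class for the exact interface.

```python
# NOTE: the import path and method names below are assumptions -- verify them
# against the ChatThreadDB base class before implementing.
from dsrag.database.chat_thread.db import ChatThreadDB

class InMemoryChatThreadDB(ChatThreadDB):
    """Toy backend that keeps chat threads in a dict (lost on restart)."""

    def __init__(self):
        self.threads = {}

    def create_chat_thread(self, chat_thread_params: dict) -> dict:
        # Store the thread's configuration (kb_ids, model, etc.) under its ID
        self.threads[chat_thread_params["id"]] = {
            "params": chat_thread_params,
            "interactions": [],
        }
        return chat_thread_params

    def get_chat_thread(self, thread_id: str) -> dict:
        # Retrieve an existing thread, including its full interaction history
        return self.threads[thread_id]

    def add_interaction(self, thread_id: str, interaction: dict) -> dict:
        # Append one user-input/model-response exchange to the history
        self.threads[thread_id]["interactions"].append(interaction)
        return interaction
```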
Two implementations of the `ChatThreadDB` interface are included:

- `BasicChatThreadDB`: A basic implementation that stores chat threads in a JSON file
- `SQLiteChatThreadDB`: A SQLite implementation that stores chat threads in a SQLite database
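For example, to use the JSON-file-backed implementation (the import path and `storage_directory` argument here are assumptions, following the pattern of the SQLite example below; verify them against the class definition):

```python
from dsrag.database.chat_thread.basic_db import BasicChatThreadDB  # path assumed

# storage_directory is an assumed argument -- check the constructor
chat_thread_db = BasicChatThreadDB(storage_directory="~/dsRAG")
```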
## Creating a Chat Thread
To start a conversation, first create a chat thread:
```python
from dsrag.chat.chat import create_new_chat_thread
from dsrag.database.chat_thread.sqlite_db import SQLiteChatThreadDB

# Configure chat parameters
chat_params = {
    "kb_ids": ["my_knowledge_base"],  # List of knowledge base IDs to use
    "model": "gpt-4o",  # LLM model to use
    "temperature": 0.2,  # Response creativity (0.0-1.0)
    "system_message": "You are a helpful assistant specialized in technical documentation",
    "target_output_length": "medium"  # "short", "medium", or "long"
}

# Initialize the chat thread database (SQLite in this case)
chat_thread_db = SQLiteChatThreadDB()

# Create the thread
thread_id = create_new_chat_thread(chat_params, chat_thread_db)
```
## Getting Responses
Once you have a thread, you can send messages and get responses:
```python
from dsrag.chat.chat import get_chat_thread_response
from dsrag.chat.chat_types import ChatResponseInput
from dsrag.knowledge_base import KnowledgeBase

# Create input with an optional metadata filter
response_input = ChatResponseInput(
    user_input="What are the key features of XYZ product?",
    metadata_filter={
        "field": "doc_id",
        "operator": "equals",
        "value": "user_manual"
    }
)

# Create the knowledge base instances
knowledge_bases = {
    "my_knowledge_base": KnowledgeBase(kb_id="my_knowledge_base")
}

# Get a response
response = get_chat_thread_response(
    thread_id=thread_id,
    get_response_input=response_input,
    chat_thread_db=chat_thread_db,
    knowledge_bases=knowledge_bases  # Dictionary of your knowledge base instances
)

# Access the response content and citations
print(response["model_response"]["content"])
for citation in response["model_response"]["citations"]:
    print(f"Source: {citation['doc_id']}, Page: {citation['page_number']}")
```
## Chat Thread Parameters
The chat thread parameters dictionary supports several configuration options:
- `kb_ids`: List of knowledge base IDs to search
- `model`: LLM model to use (e.g., "gpt-4")
- `temperature`: Controls response randomness (0.0-1.0)
- `system_message`: Custom instructions for the LLM
- `auto_query_model`: Model to use for generating search queries
- `auto_query_guidance`: Custom guidance for query generation
- `target_output_length`: Desired response length ("short", "medium", or "long")
- `max_chat_history_tokens`: Maximum number of tokens to keep in chat history
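For instance, a configuration that also customizes query generation and history trimming might look like this (the parameter names are from the list above; the specific values are illustrative):

```python
chat_params = {
    "kb_ids": ["my_knowledge_base"],
    "model": "gpt-4o",
    "temperature": 0.2,
    "system_message": "You are a helpful assistant specialized in technical documentation",
    "auto_query_model": "gpt-4o-mini",  # smaller model for generating search queries
    "auto_query_guidance": "Prefer queries that name the specific product or feature being asked about",
    "target_output_length": "medium",
    "max_chat_history_tokens": 8000,  # older messages beyond this budget are dropped
}

thread_id = create_new_chat_thread(chat_params, chat_thread_db)
```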
## Response Structure
Chat responses include:
- User input with timestamp
- Model response with citations
- Search queries used
- Relevant segments found
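Put together, a response is a dictionary shaped roughly like the sketch below. The `model_response` keys are the ones used in the examples on this page; the remaining field names are assumptions based on the list above, so verify them against the actual return value.

```python
response = {
    "user_input": {  # field name assumed
        "content": "What are the key features of XYZ product?",
        "timestamp": "2025-01-01T12:00:00Z",
    },
    "model_response": {
        "content": "XYZ product offers ...",
        "citations": [
            {"doc_id": "user_manual", "page_number": 12},
        ],
    },
    "search_queries": ["XYZ product key features"],  # field name assumed
    "relevant_segments": [],  # field name assumed; holds the retrieved segments
}
```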
## Streaming Responses
dsRAG supports streaming responses, which allows you to receive and display partial responses as they're being generated. This provides a better user experience, especially for longer responses.
### How to Use Streaming

To use streaming, simply add the `stream=True` parameter when calling `get_chat_thread_response`:
```python
# Get a streaming response
for partial_response in get_chat_thread_response(
    thread_id=thread_id,
    get_response_input=response_input,
    chat_thread_db=chat_thread_db,
    knowledge_bases=knowledge_bases,
    stream=True  # Enable streaming
):
    # Each partial_response contains the cumulative response so far
    current_content = partial_response["model_response"]["content"]

    # Update your UI with the new content
    display_incremental_response(current_content)

    # Citations may be populated as they become available
    if partial_response["model_response"]["citations"]:
        display_citations(partial_response["model_response"]["citations"])
```
### Streaming Behavior Notes
- When streaming is enabled, `get_chat_thread_response` returns a generator that yields partial responses (see the sketch after this list)
- Each partial response has the same structure as a complete response
- The final response is automatically saved to the chat thread database
- All supported LLM providers (OpenAI, Anthropic, Gemini) work with streaming
- Structured outputs with citations are supported in streaming mode
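Because each yielded response is cumulative and the final one is complete (and already persisted), a consumer that doesn't need incremental display can simply exhaust the generator, as in this sketch:

```python
# Keep only the last (complete) response from the stream
final_response = None
for partial in get_chat_thread_response(
    thread_id=thread_id,
    get_response_input=response_input,
    chat_thread_db=chat_thread_db,
    knowledge_bases=knowledge_bases,
    stream=True
):
    final_response = partial

print(final_response["model_response"]["content"])
```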
### Example: Displaying Streaming Responses in a CLI
```python
from dsrag.chat.chat import get_chat_thread_response
from dsrag.chat.chat_types import ChatResponseInput

def display_streaming_response(thread_id, user_input, chat_db, knowledge_bases):
    """Display a streaming response in the terminal."""
    response_input = ChatResponseInput(user_input=user_input)

    print("AI: ", end="", flush=True)

    # Track previously seen content so we only print the new tokens
    previous_content = ""

    for partial in get_chat_thread_response(
        thread_id=thread_id,
        get_response_input=response_input,
        chat_thread_db=chat_db,
        knowledge_bases=knowledge_bases,
        stream=True
    ):
        # content may be None early in the stream
        current = partial["model_response"]["content"] or ""

        # Display only the new content
        if len(current) > len(previous_content):
            new_content = current[len(previous_content):]
            print(new_content, end="", flush=True)
            previous_content = current

    # Print citations at the end (partial now holds the final, complete response)
    if partial["model_response"]["citations"]:
        print("\n\nSources:")
        for citation in partial["model_response"]["citations"]:
            print(f"- {citation['doc_id']}")
```
## Best Practices
- Set an appropriate `target_output_length` based on your use case
- Use `system_message` to guide the LLM's behavior
- Configure `max_chat_history_tokens` based on your needs
- Use metadata filters to focus searches on relevant documents
- Monitor and adjust `temperature` based on desired response creativity
- Enable streaming for a better user experience with longer responses
- Handle potential `None` values in streaming partial responses