Introduction
In this post, I’ll take you behind the scenes of how I built a RAG (Retrieval-Augmented Generation) chatbot for a customer knowledge base. This system goes far beyond a simple Q&A bot: it’s a fully automated data pipeline that crawls, processes, and semantically indexes thousands of technical support articles to deliver precise, context-aware answers in real time.
What makes this project stand out:
- Production-ready: Robust error handling, detailed logging, and real-time monitoring
- Scalable: Modular, fault-tolerant architecture with resumable pipelines
- Intelligent: Enhanced semantic search with relevance and confidence scoring
- Domain-optimized: Fine-tuned for technical support and infrastructure knowledge
RAG Architecture Overview
Let’s start with the high-level architecture. A RAG system combines the power of large language models with domain-specific knowledge retrieval:

Query Processing Flow:
1. User Query → The system receives a natural language question from the user.
2. Query Processing → Preprocesses and refines the query (e.g., removing noise, normalizing text) to improve understanding.
3. Semantic Search → Converts the processed query into embeddings and searches the vector database for related concepts.
4. Vector Database (ChromaDB) → Stores precomputed semantic embeddings of all knowledge base articles for fast and accurate retrieval.
5. Retrieved Documents → Fetches the most relevant document chunks based on semantic similarity scores.
6. Context Preparation → Assembles the retrieved chunks into a coherent context block, ensuring relevance and completeness.
7. LLM Generation → Uses OpenAI GPT-4 to generate a context-aware, human-like response.
8. Response with Sources → Returns a structured answer with source references for transparency and traceability.
Knowledge Base Processing Flow:
1. Knowledge Base → The system starts with existing documentation, FAQs, and support articles.
2. Web Crawler → Automatically crawls and extracts content from knowledge base pages.
3. Document Processor → Cleans, formats, and structures raw HTML or Markdown into standardized text.
4. Text Chunking → Splits large documents into semantic chunks, optimized for retrieval and embedding generation.
5. Embedding Generation → Creates high-dimensional vector embeddings for each chunk using OpenAI’s embedding model.
6. Vector Database (ChromaDB) → Stores all embeddings and metadata, enabling fast and intelligent semantic search during query time.
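The ingestion flow above can be sketched as a single orchestration function. This is a minimal illustration of how the five stages compose; the names (`run_ingestion`, `Chunk`) are hypothetical stand-ins, not the project’s actual API:

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Chunk:
    """A piece of text plus the metadata needed at query time."""
    text: str
    source_url: str


def run_ingestion(
    crawl: Callable[[], List[dict]],           # Web Crawler
    clean: Callable[[str], str],               # Document Processor
    split: Callable[[str], List[str]],         # Text Chunking
    embed: Callable[[Sequence[str]], list],    # Embedding Generation
    store: Callable[[List[Chunk], list], int]  # Vector Database
) -> int:
    """Wire the five pipeline stages together; return the number of chunks stored."""
    chunks: List[Chunk] = []
    for page in crawl():
        text = clean(page["html"])
        for piece in split(text):
            chunks.append(Chunk(text=piece, source_url=page["url"]))
    vectors = embed([c.text for c in chunks])
    return store(chunks, vectors)
```

Passing each stage in as a callable keeps the pipeline testable with stubs and makes individual stages swappable (e.g., replacing the embedding model) without touching the orchestration.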
Core Components
- Data Pipeline: Crawls and processes knowledge base content
- Vector Database: Stores semantic embeddings for fast retrieval
- RAG Engine: Orchestrates search and generation
- Chat Interface: User-friendly Streamlit frontend
Data Pipeline Implementation:
The data pipeline is the foundation of our RAG system. It’s designed to be robust, resumable, and production-ready.

Web Crawler (Data Collection)
- Categories → Articles: Automatically discovers and categorizes content from the knowledge base, ensuring comprehensive coverage across all support topics.
- Content Extraction: Parses and extracts clean text, metadata, and structured data from each page for downstream processing.
- Rate Limiting: Adheres to site policies with configurable crawl delays to prevent overload and ensure responsible scraping.
- Error Handling: Implements robust retry logic and failure recovery, allowing the crawler to resume seamlessly after interruptions.
Document Processor (Content Processing)
- Text Chunking: Splits processed documents into semantic chunks (typically 1200 characters with a 250-character overlap) for optimal retrieval granularity.
- Metadata Extraction: Captures titles, categories, URLs, and content types to maintain contextual awareness during search.
- Embedding Generation: Uses Sentence Transformers to create vector embeddings, enabling high-quality semantic representation of each chunk.
- Quality Validation: Performs automated checks to ensure chunk relevance, completeness, and consistency before indexing.
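The quality-validation step can be as simple as a predicate run before indexing. A minimal sketch of what such a check might look like; the thresholds and the function name `is_valid_chunk` are illustrative assumptions:

```python
def is_valid_chunk(text: str, min_words: int = 20, max_chars: int = 2000) -> bool:
    """Reject chunks that are too short, too long, or mostly markup noise."""
    words = text.split()
    if not (min_words <= len(words) and len(text) <= max_chars):
        return False
    # Heuristic: chunks dominated by non-alphabetic characters are usually
    # leftover navigation, tables of symbols, or markup rather than prose.
    alpha = sum(ch.isalpha() for ch in text)
    return alpha / max(len(text), 1) >= 0.5
```

Filtering out such fragments before embedding keeps the index smaller and prevents low-signal chunks from surfacing in search results.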
Vector Database (Storage & Indexing)
- ChromaDB Storage: Provides persistent storage for embeddings and metadata in a scalable, high-performance vector database.
- Index Creation: Builds optimized similarity indexes for low-latency semantic search and retrieval.
- Search Optimization: Tunes internal parameters for precision, recall, and performance balance in large-scale environments.
- Backup & Recovery: Supports data persistence, incremental backups, and restoration to maintain operational reliability.
Web Crawler Implementation:
The web crawler discovers knowledge base categories hierarchically and extracts article content, with rate limiting and retry logic built in:
import logging
import re
from typing import List, Tuple
from urllib.parse import urljoin

from bs4 import BeautifulSoup

logger = logging.getLogger(__name__)


class WebCrawler:
    """Hierarchical web crawler with rate limiting and retry logic."""

    def __init__(self, config: CrawlingConfig):
        self.config = config
        self.session = self._create_session()
        self.stats = CrawlingStats()

    def run_full_crawl(self) -> Tuple[List[CategoryData], List[ArticleData]]:
        """Run the complete crawling pipeline."""
        logger.info("Starting hierarchical web crawling")

        # Level 1: Extract categories
        categories = self._crawl_level_1()
        logger.info(f"Found {len(categories)} categories")

        # Level 2: Extract articles from categories
        articles = self._crawl_level_2(categories)
        logger.info(f"Found {len(articles)} articles")

        return categories, articles

    def _crawl_level_1(self) -> List[CategoryData]:
        """Extract main categories from the knowledge base."""
        categories = []
        for seed_url in self.config.seed_urls:
            try:
                response = self._make_request(seed_url)
                soup = BeautifulSoup(response.content, 'html.parser')

                # Extract category links
                category_links = soup.find_all('a', href=re.compile(r'/kb/'))
                for link in category_links:
                    category = CategoryData(
                        title=link.get_text().strip(),
                        url=urljoin(seed_url, link['href']),
                        level=1
                    )
                    categories.append(category)
            except Exception as e:
                logger.error(f"Failed to crawl {seed_url}: {e}")
        return categories

    def _crawl_level_2(self, categories: List[CategoryData]) -> List[ArticleData]:
        """Extract articles directly from category pages."""
        articles = []
        for category in categories:
            try:
                response = self._make_request(category.url)
                soup = BeautifulSoup(response.content, 'html.parser')

                # Extract article content directly from the category page
                article_content = self._extract_article_content(soup)
                if article_content:
                    article = ArticleData(
                        title=category.title,
                        url=category.url,
                        category=category.title,
                        content=article_content
                    )
                    articles.append(article)
            except Exception as e:
                logger.error(f"Failed to process category {category.title}: {e}")
        return articles
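The crawler’s `_create_session` and `_make_request` helpers aren’t listed above. The retry-with-backoff and crawl-delay policy they implement could be sketched like this; the function name and parameters are hypothetical, and the `sleep` argument is injectable purely so the policy can be tested without waiting:

```python
import time


def fetch_with_retry(fetch, url, max_retries=3, backoff=1.0,
                     crawl_delay=0.5, sleep=time.sleep):
    """Call fetch(url), retrying with exponential backoff on failure.

    `fetch` is whatever performs the HTTP GET (e.g., a requests.Session.get
    bound method); a polite delay is inserted after each successful request.
    """
    last_error = None
    for attempt in range(max_retries):
        try:
            response = fetch(url)
            sleep(crawl_delay)  # rate limiting between successful requests
            return response
        except Exception as e:
            last_error = e
            # Exponential backoff: 1s, 2s, 4s, ... before the next attempt
            sleep(backoff * (2 ** attempt))
    raise RuntimeError(f"Giving up on {url} after {max_retries} attempts") from last_error
```

Keeping the policy in one helper means every request in the crawler inherits the same rate-limiting and recovery behavior.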
Document Processing Pipeline:
from datetime import datetime
from typing import Any, Dict, List

from langchain.text_splitter import RecursiveCharacterTextSplitter
from tqdm import tqdm


class DocumentProcessor:
    """Intelligent document processor with semantic chunking."""

    def __init__(self, config: EmbeddingConfig):
        self.config = config
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=config.chunk_size,
            chunk_overlap=config.chunk_overlap,
            separators=["\n\n", "\n", ". ", " ", ""]
        )

    def process_articles_batch(self, articles: List[ArticleData]) -> List[ContentChunk]:
        """Process articles in batches for efficiency."""
        chunks = []
        for article in tqdm(articles, desc="Processing articles"):
            try:
                article_chunks = self._process_single_article(article)
                chunks.extend(article_chunks)
            except Exception as e:
                logger.error(f"Failed to process article {article.title}: {e}")
        return chunks

    def _process_single_article(self, article: ArticleData) -> List[ContentChunk]:
        """Process a single article into semantic chunks."""
        # Clean and preprocess content
        cleaned_content = self._clean_content(article.content)

        # Split into chunks
        text_chunks = self.text_splitter.split_text(cleaned_content)

        chunks = []
        for i, chunk_text in enumerate(text_chunks):
            # Extract metadata
            metadata = self._extract_metadata(article, chunk_text, i)

            # Create content chunk
            chunk = ContentChunk(
                id=self._generate_chunk_id(article.url, i),
                content=chunk_text,
                source_url=article.url,
                category=article.category,
                article_title=article.title,
                chunk_type=self._classify_chunk_type(chunk_text),
                metadata=metadata
            )
            chunks.append(chunk)
        return chunks

    def _extract_metadata(self, article: ArticleData, chunk_text: str, index: int) -> Dict[str, Any]:
        """Extract rich metadata from chunk content."""
        return {
            "word_count": len(chunk_text.split()),
            "char_count": len(chunk_text),
            "section_title": self._extract_section_title(chunk_text),
            "key_terms": self._extract_key_terms(chunk_text),
            "chunk_index": index,
            "last_updated": datetime.now().isoformat()
        }
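The `_classify_chunk_type` helper that tags each chunk isn’t shown above. A simple keyword heuristic along these lines would do the job; the specific categories and patterns here are illustrative assumptions, not the project’s actual rules:

```python
import re


def classify_chunk_type(text: str) -> str:
    """Rough keyword heuristic for tagging a chunk's content type."""
    lowered = text.lower()
    # Error vocabulary usually signals a troubleshooting article
    if re.search(r"\b(error|failed|troubleshoot|cannot|unable)\b", lowered):
        return "troubleshooting"
    # Numbered lines ("1. ...", "2) ...") usually signal a procedure
    if re.search(r"^\s*\d+[.)]\s", text, flags=re.MULTILINE) or "step" in lowered:
        return "procedure"
    # Short question-bearing chunks are often FAQ entries
    if "?" in text and len(text) < 400:
        return "faq"
    return "general"
```

These labels later enable type-filtered retrieval (see `search_by_chunk_type`), so even a coarse heuristic pays for itself.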
Vector Database & Semantic Search
The heart of our RAG system is the vector database that enables semantic search. We use ChromaDB for its performance and ease of use.
ChromaDB Implementation
import logging
from typing import Any, Dict, List

import chromadb
from sentence_transformers import SentenceTransformer

logger = logging.getLogger(__name__)


class ChromaDBManager:
    """ChromaDB manager for vector operations and semantic search."""

    def __init__(self, config: DatabaseConfig):
        self.config = config
        self.client = chromadb.PersistentClient(path=config.chromadb_path)
        self.collection = self._get_or_create_collection()
        self.embedding_model = SentenceTransformer(config.embedding_model)

    def add_chunks_to_database(self, chunks: List[ContentChunk]) -> bool:
        """Add processed chunks to the vector database."""
        try:
            # Prepare data for ChromaDB
            documents = [chunk.content for chunk in chunks]
            metadatas = [chunk.metadata for chunk in chunks]
            ids = [chunk.id for chunk in chunks]

            # Generate embeddings
            embeddings = self.embedding_model.encode(documents)

            # Add to collection
            self.collection.add(
                documents=documents,
                metadatas=metadatas,
                ids=ids,
                embeddings=embeddings.tolist()
            )
            logger.info(f"Added {len(chunks)} chunks to database")
            return True
        except Exception as e:
            logger.error(f"Failed to add chunks to database: {e}")
            return False

    def search_similar(self, query: str, n_results: int = 5,
                       similarity_threshold: float = 0.5) -> List[Dict[str, Any]]:
        """Perform semantic similarity search."""
        try:
            # Generate query embedding
            query_embedding = self.embedding_model.encode([query])[0].tolist()

            # Search ChromaDB
            results = self.collection.query(
                query_embeddings=[query_embedding],
                n_results=n_results,
                include=["documents", "metadatas", "distances"]
            )

            # Process results
            documents = []
            if results["documents"] and results["documents"][0]:
                for i, doc in enumerate(results["documents"][0]):
                    metadata = results["metadatas"][0][i] if results["metadatas"][0] else {}
                    distance = results["distances"][0][i] if results["distances"][0] else 1.0

                    # Convert distance to a similarity score (assumes cosine distance)
                    similarity = 1 - distance
                    if similarity >= similarity_threshold:
                        documents.append({
                            "content": doc,
                            "metadata": metadata,
                            "similarity": similarity,
                            "distance": distance
                        })
            return documents
        except Exception as e:
            logger.error(f"Search failed: {e}")
            return []
Advanced Search Features
Our implementation includes several advanced search capabilities:
def search_by_category(self, category: str, n_results: int = 10) -> List[Dict[str, Any]]:
    """Search within a specific category."""
    return self.collection.query(
        query_texts=[""],
        n_results=n_results,
        where={"category": category},
        include=["documents", "metadatas", "distances"]
    )

def search_by_chunk_type(self, chunk_type: str, n_results: int = 10) -> List[Dict[str, Any]]:
    """Search for specific types of content (procedures, troubleshooting, etc.)."""
    return self.collection.query(
        query_texts=[""],
        n_results=n_results,
        where={"chunk_type": chunk_type},
        include=["documents", "metadatas", "distances"]
    )

def get_database_stats(self) -> Dict[str, Any]:
    """Get comprehensive database statistics."""
    try:
        total_docs = self.collection.count()

        # Get sample metadata for analysis
        sample_results = self.collection.get(limit=min(1000, total_docs), include=["metadatas"])

        categories = {}
        chunk_types = {}
        if sample_results["metadatas"]:
            for metadata in sample_results["metadatas"]:
                category = metadata.get("category", "Unknown")
                chunk_type = metadata.get("chunk_type", "Unknown")
                categories[category] = categories.get(category, 0) + 1
                chunk_types[chunk_type] = chunk_types.get(chunk_type, 0) + 1

        return {
            "total_documents": total_docs,
            "categories": categories,
            "chunk_types": chunk_types,
            "collection_name": self.config.collection_name
        }
    except Exception as e:
        logger.error(f"Failed to get database stats: {e}")
        return {"error": str(e)}
RAG Engine Implementation:
The RAG engine is the orchestrator that combines retrieval and generation. Here’s how it works:
RAG Engine – Query Processing & Retrieval

Query Processing (User Input Understanding)
- Query Preprocessing: Cleans and normalizes the user’s input by removing noise, punctuation, and stop words for better semantic clarity.
- Embedding Generation: Converts the processed query into a vector representation using Sentence Transformers, aligning it with the same embedding space as the knowledge base.
- Query Enhancement: Expands the query dynamically with synonyms and related terms, improving recall and capturing broader semantic intent.
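The preprocessing and enhancement steps can be sketched in a few lines; the synonym table and function names below are illustrative assumptions, not the project’s actual vocabulary:

```python
import re

# Hypothetical domain synonym table used for query expansion.
SYNONYMS = {
    "vm": ["virtual machine"],
    "k8s": ["kubernetes"],
    "login": ["sign in", "authentication"],
}


def preprocess_query(query: str) -> str:
    """Lowercase, strip punctuation noise, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s?-]", " ", query.lower())
    return re.sub(r"\s+", " ", cleaned).strip()


def expand_query(query: str) -> str:
    """Append known synonyms so the embedding captures related phrasing."""
    terms = []
    for word in query.split():
        terms.extend(SYNONYMS.get(word, []))
    return query if not terms else f"{query} {' '.join(terms)}"
```

Because the expanded query is what gets embedded, appending synonyms nudges the query vector toward documents that use different wording for the same concept.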
Document Retrieval (Semantic Search & Ranking)
- Vector Search: Performs cosine similarity search in ChromaDB to find documents most aligned with the query embedding.
- Similarity Filtering: Applies a minimum similarity threshold to eliminate low-relevance results.
- Relevance Ranking: Sorts retrieved chunks by semantic relevance, ensuring the most meaningful matches appear first.
- Document Reranking: Refines the order further based on content quality, metadata, and contextual richness for more accurate answers.
RAG Engine – Context & Generation

Context Preparation (Building the Foundation)
- Context Assembly: Merges the most relevant retrieved document chunks into a coherent, unified context block for the LLM.
- Length Optimization: Dynamically truncates or prioritizes content to fit within GPT-4’s token limits while retaining maximum relevance.
- Source Attribution: Maintains traceability by tracking which documents and URLs contributed to the final response.
- Context Validation: Performs automated checks to ensure context consistency, completeness, and relevance before sending to the model.
LLM Generation (Intelligent Response Creation)
- Prompt Engineering: Constructs precise and adaptive prompts to guide the LLM toward domain-specific, factual responses.
- Response Generation: Leverages OpenAI GPT-4 to generate context-aware, human-like answers grounded in retrieved knowledge.
- Confidence Scoring: Evaluates response confidence using retrieval metrics and model feedback to measure reliability.
- Quality Assessment: Validates response accuracy, completeness, and alignment with verified knowledge base content.
Response Formatting (Output Structuring & Delivery)
- Source Formatting: Generates clear citations and hyperlinks to original knowledge base articles.
- Response Validation: Runs final checks on structure, correctness, and content formatting before output.
- Output Structuring: Organizes the final response into a user-friendly format—typically a summarized answer followed by sources.
- Final Review: Conducts one last quality assurance pass to ensure clarity, trustworthiness, and readiness for delivery.
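The source-formatting step can be sketched as a small function that deduplicates chunks by URL and keeps the similarity score for display. This is a plausible shape for the engine’s `_format_sources` helper, not its actual code; the field names are assumptions:

```python
from typing import Any, Dict, List


def format_sources(documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Turn retrieved chunks into a deduplicated, display-ready source list."""
    sources, seen = [], set()
    for doc in documents:
        meta = doc.get("metadata", {})
        url = meta.get("source_url", "")
        if url in seen:  # several chunks can come from the same article
            continue
        seen.add(url)
        sources.append({
            "title": meta.get("article_title", "Untitled"),
            "url": url,
            "similarity": round(doc.get("similarity", 0.0), 3),
        })
    return sources
```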
Core RAG Engine
import logging
import time
from typing import Any, Dict, Optional

logger = logging.getLogger(__name__)


class RAGEngine:
    """Core RAG engine for document retrieval and response generation."""

    def __init__(self, config: ChatbotConfig):
        self.config = config
        self.chromadb_client = None
        self.collection = None
        self.embedding_model = None
        self.llm = None
        self._initialize_components()

    def generate_response(self, query: str, filters: Optional[Dict[str, Any]] = None,
                          similarity_threshold: Optional[float] = None,
                          max_results: Optional[int] = None) -> Dict[str, Any]:
        """Generate a response using RAG."""
        start_time = time.time()
        try:
            # Step 1: Retrieve relevant documents
            retrieved_docs = self._retrieve_documents(
                query, filters, similarity_threshold, max_results
            )

            # Step 2: Prepare context
            context = self._prepare_context(retrieved_docs)

            # Step 3: Generate response
            response = self._generate_llm_response(query, context)

            # Step 4: Calculate confidence
            confidence = self._calculate_response_confidence(query, retrieved_docs, response)

            # Step 5: Format sources
            sources = self._format_sources(retrieved_docs)

            return {
                "response": response,
                "confidence": confidence,
                "sources": sources,
                "metadata": {
                    "query": query,
                    "response_time": time.time() - start_time,
                    "num_sources": len(sources)
                }
            }
        except Exception as e:
            logger.error(f"Error generating response: {e}")
            return {
                "response": self.config.fallback_response,
                "confidence": 0.0,
                "sources": [],
                "metadata": {"error": str(e)}
            }
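The `_calculate_response_confidence` helper used in Step 4 isn’t listed. One simple possibility is to blend the mean retrieval similarity with how well the retrieved documents cover the query’s terms; this sketch ignores the generated response for simplicity, and the weights and formula are assumptions:

```python
from typing import Any, Dict, List


def calculate_response_confidence(query: str, documents: List[Dict[str, Any]]) -> float:
    """Blend mean retrieval similarity with query-term coverage (0.0-1.0)."""
    if not documents:
        return 0.0
    mean_similarity = sum(d["similarity"] for d in documents) / len(documents)

    # Fraction of substantive query terms that appear somewhere in the sources
    query_terms = {t for t in query.lower().split() if len(t) > 3}
    corpus = " ".join(d["content"].lower() for d in documents)
    coverage = (
        sum(term in corpus for term in query_terms) / len(query_terms)
        if query_terms else 0.0
    )
    # Weight retrieval quality more heavily than lexical coverage.
    return round(0.7 * mean_similarity + 0.3 * coverage, 3)
```

Exposing this score in the response lets the UI flag low-confidence answers instead of presenting every generation with equal authority.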
Advanced Retrieval with Reranking
def _retrieve_documents(self, query: str, filters: Optional[Dict[str, Any]] = None,
                        similarity_threshold: Optional[float] = None,
                        max_results: Optional[int] = None) -> List[Dict[str, Any]]:
    """Retrieve relevant documents with advanced filtering."""
    # Use provided parameters or fall back to config defaults
    # ("is not None" so an explicit 0.0 threshold is still honored)
    sim_threshold = (similarity_threshold if similarity_threshold is not None
                     else self.config.similarity_threshold)
    max_sources = max_results if max_results is not None else self.config.max_sources

    # Preprocess query for better retrieval
    processed_query = self._preprocess_query(query)

    # Generate query embedding
    query_embedding = self.embedding_model.encode([processed_query])[0].tolist()

    # Prepare filters for ChromaDB
    where_filter = self._prepare_chromadb_filters(filters, processed_query)

    # Query ChromaDB
    results = self.collection.query(
        query_embeddings=[query_embedding],
        n_results=min(max_sources * 2, 20),  # Get more for reranking
        where=where_filter,
        include=["documents", "metadatas", "distances"]
    )

    # Process and filter results
    documents = []
    if results["documents"] and results["documents"][0]:
        for i, doc in enumerate(results["documents"][0]):
            metadata = results["metadatas"][0][i] if results["metadatas"][0] else {}
            distance = results["distances"][0][i] if results["distances"][0] else 1.0
            similarity = 1 - distance
            if similarity >= sim_threshold:
                documents.append({
                    "content": doc,
                    "metadata": metadata,
                    "similarity": similarity,
                    "distance": distance,
                    "relevance_score": self._calculate_relevance_score(query, doc, metadata)
                })

    # Rerank documents for better relevance
    if documents:
        documents = self._rerank_documents(query, documents)
    return documents[:max_sources]
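The `_calculate_relevance_score` and `_rerank_documents` helpers aren’t shown. A sketch of the idea, combining vector similarity with simple lexical and metadata signals; the weights and title-bonus heuristic are illustrative assumptions:

```python
from typing import Any, Dict, List


def relevance_score(query: str, content: str, metadata: Dict[str, Any]) -> float:
    """Lexical/metadata signal to complement vector similarity."""
    query_terms = set(query.lower().split())
    content_terms = set(content.lower().split())
    overlap = len(query_terms & content_terms) / max(len(query_terms), 1)
    # Small boost when the query literally mentions the article's title.
    title = str(metadata.get("article_title", "")).lower()
    title_bonus = 0.2 if title and title in query.lower() else 0.0
    return min(overlap + title_bonus, 1.0)


def rerank_documents(query: str, documents: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Order documents by a weighted mix of similarity and lexical relevance."""
    return sorted(
        documents,
        key=lambda d: (0.6 * d["similarity"]
                       + 0.4 * relevance_score(query, d["content"], d.get("metadata", {}))),
        reverse=True,
    )
```

Fetching roughly twice as many candidates as needed (as the code above does) gives this second-stage ranking room to promote documents the pure vector search slightly underrated.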
Intelligent Context Preparation
def _prepare_context(self, documents: List[Dict[str, Any]]) -> str:
    """Prepare context string from retrieved documents."""
    if not documents:
        return ""

    context_parts = []
    total_length = 0
    for i, doc in enumerate(documents, 1):
        title = doc["metadata"].get("article_title", f"Document {i}")
        content = self._clean_text(doc["content"])
        doc_context = f"Source {i} - {title}:\n{content}\n"

        # Check if adding this document would exceed max context length
        if total_length + len(doc_context) > self.config.max_context_length:
            # Try to truncate the document
            remaining_length = (self.config.max_context_length - total_length -
                                len(f"Source {i} - {title}:\n\n"))
            if remaining_length > 100:  # Only add if we have reasonable space
                truncated_content = content[:remaining_length] + "..."
                doc_context = f"Source {i} - {title}:\n{truncated_content}\n"
                context_parts.append(doc_context)
            break

        context_parts.append(doc_context)
        total_length += len(doc_context)
    return "\n".join(context_parts)
LLM Integration with Structured Prompts
from langchain_core.messages import HumanMessage, SystemMessage


def _generate_llm_response(self, query: str, context: str) -> str:
    """Generate response using the LLM with structured prompts."""
    try:
        system_message = SystemMessage(content=self.config.system_prompt)
        human_prompt = f"""Based on the following context from the customer's knowledge base,
please provide a helpful, accurate, and structured answer to the user's question.

Context:
{context}

Question: {query}

Your response must be well-structured:
1. Begin with a direct answer to the main question
2. For technical processes or error fixes, use numbered steps that are clear and actionable
3. Include relevant details from the context, organized logically
4. When referencing information, cite your sources by mentioning "Source 1", "Source 2", etc.
5. For error messages, explain what causes the error and specific solutions
6. If helpful, use clear section headings (##) to organize complex responses

If the context doesn't contain sufficient information, clearly state this limitation
before providing what information you do have. DO NOT fabricate information not found in the context.
"""
        human_message = HumanMessage(content=human_prompt)

        # Generate response
        response = self.llm.invoke([system_message, human_message])
        return response.content.strip()
    except Exception as e:
        logger.error(f"Error generating LLM response: {str(e)}")
        return self.config.fallback_response
Production Considerations
Building a production-ready RAG system requires careful attention to several key areas:
Production Deployment Architecture

Docker Containerization
- Streamlit App Container (Light Blue): Hosts the web interface for user interactions, allowing real-time query input and response visualization.
- Pipeline Container (Green): Manages data processing tasks, including crawling, document processing, and knowledge base updates.
- ChromaDB Volume (Orange): Provides persistent storage for vector embeddings and metadata, ensuring semantic data remains intact across restarts.
- Data Volume (Purple): Stores raw and processed data, maintaining a versioned record of crawled and transformed content.
- Logs Volume (Pink): Centralizes logging and monitoring output, simplifying debugging and performance tracking.
External Service Integration
- OpenAI API: Powers the LLM generation and response refinement, enabling GPT-4 to produce accurate and contextually grounded answers.
- Knowledge Base Website: Serves as the primary content source for the crawler, feeding up-to-date articles into the data pipeline for continuous learning.
Production Benefits
- Scalability: Supports horizontal scaling—multiple containers can run in parallel to handle increased load or faster data ingestion.
- Isolation: Maintains clean separation between the application layer and the data pipeline for better security and maintainability.
- Persistence: Ensures that data and embeddings remain intact even when containers are rebuilt or restarted.
- Monitoring: Enables centralized logging, health checks, and container-level monitoring for real-time operational insight.
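The container-and-volume layout described above maps naturally onto a Compose file. The following is a hedged sketch only: service names, the pipeline command, and mount paths are assumptions, not the project’s actual configuration:

```yaml
services:
  app:                          # Streamlit App Container
    build: .
    ports: ["8501:8501"]
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    volumes:
      - chroma_db:/app/chroma_db   # ChromaDB Volume
      - data:/app/data             # Data Volume
      - logs:/app/logs             # Logs Volume

  pipeline:                     # Pipeline Container
    build: .
    command: python run_pipeline.py   # hypothetical entry point
    volumes:
      - chroma_db:/app/chroma_db
      - data:/app/data
      - logs:/app/logs

volumes:
  chroma_db:
  data:
  logs:
```

Sharing the named volumes between the two services is what lets the pipeline write embeddings that the app reads, while each container can still be rebuilt or scaled independently.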
Docker Deployment
# Optimized production image on a slim Python base
FROM python:3.13-slim AS base

ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONUNBUFFERED=1 \
    PIP_NO_CACHE_DIR=1

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    build-essential \
    curl \
    git \
    libjpeg62-turbo-dev \
    zlib1g-dev \
    libpng-dev \
    libxml2-dev \
    libxslt1-dev \
    && rm -rf /var/lib/apt/lists/*
WORKDIR /app
# Copy requirements and install dependencies
COPY requirements.txt ./
RUN pip install --upgrade pip && pip install -r requirements.txt
# Copy application code
COPY . .
# Create necessary directories
RUN mkdir -p logs exports data/raw data/processed chroma_db
# Expose Streamlit port
EXPOSE 8501
# Configure Streamlit for container deployment
RUN mkdir -p ~/.streamlit && \
printf "[server]\nheadless = true\nenableCORS = false\nenableXsrfProtection = false\nport = 8501\naddress = 0.0.0.0\n" > ~/.streamlit/config.toml
# Default command
CMD ["streamlit", "run", "app.py", "--server.address=0.0.0.0", "--server.port=8501"]
Data Sequence Flow

Step-by-Step User Interaction Flow
1. User Query Submission: The user enters a question through the Streamlit web interface.
2. Query Processing: The Streamlit app forwards the query to the RAG Engine, which preprocesses and converts it into an embedding.
3. Semantic Search: The RAG Engine performs a semantic search in the vector database (ChromaDB) to find contextually similar documents.
4. Document Retrieval: The vector database returns the most relevant document chunks based on cosine similarity.
5. Context Preparation: The RAG Engine assembles these chunks into a coherent, high-relevance context for the LLM.
6. LLM Generation: OpenAI GPT-4 generates a context-aware response, grounded in the retrieved knowledge base content.
7. Response Formatting: The RAG Engine formats the output, adding source citations and URLs for transparency.
8. User Display: The Streamlit interface displays the final answer with clickable source links for easy reference.
Performance Characteristics
- Query Processing: ~50–100 ms (query cleaning, embedding generation, and enhancement)
- Vector Search: ~100–200 ms (semantic similarity search in ChromaDB)
- LLM Generation: ~2–5 seconds (response generation via OpenAI API)
- Total Response Time: ~3–6 seconds end-to-end, depending on query complexity and API latency.
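Per-stage figures like these are easiest to gather with a small timing helper around each step. A minimal sketch; the helper itself (`stage`, `timings`) is hypothetical instrumentation, not part of the engine:

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def stage(name):
    """Record the wall-clock duration of a pipeline stage in `timings` (ms)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = (time.perf_counter() - start) * 1000.0

# Usage inside generate_response, for example:
# with stage("vector_search"):
#     retrieved_docs = self._retrieve_documents(query)
```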
Conclusion
Building a production-ready Retrieval-Augmented Generation (RAG) system is far more than connecting a language model to a database — it’s about designing a reliable, scalable, and intelligent ecosystem where every component plays a critical role.
A successful system integrates:
- Robust Data Pipeline: Automated crawling, cleaning, and structured storage of domain knowledge.
- Advanced Vector Search: Semantic retrieval with intelligent relevance and confidence scoring.
- Intelligent Generation: Context-aware responses that cite their sources transparently.
- Production-Grade Features: Comprehensive error handling, monitoring, and scalability for real-world reliability.
The real key lies in treating each stage — from data collection to LLM response generation — as a production service with proper observability, fault tolerance, and performance optimization. A modular architecture not only simplifies maintenance but also enables smooth scaling as data and traffic grow.


