Generative AI SQL Chatbot
An LLM-powered chatbot for natural language database queries with extensive observability.
Features
- Multiple interaction methods (RAG, TAG)
- LLM provider selection (OpenAI, Claude)
- Intent classification (details in the Classifier README)
- Vector search with PGVector
- Langfuse Analytics
- Conversation memory (until browser refresh)
- Docker-based deployment
Prerequisites
- Docker and Docker Compose
- Python 3.9+
- OpenAI API key
- Anthropic API key
- Langfuse account (optional)
Installation
- Clone the repository and navigate to the directory:
git clone https://github.com/geoffgin/GenAI-SQL-Chatbot.git
cd GenAI-SQL-Chatbot
- Configure environment variables: copy .env.example to .env and fill in your API keys and configuration values.
- Build and start the Docker services (one-off):
make run
- After the initial build, start the services with:
make up
- Or run the application in developer mode:
make dev
- Shut down the application:
make down
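Before building the services, it can help to confirm that the .env file from the configuration step has the API keys filled in. The following is a minimal sketch, not part of the project. The variable names OPENAI_API_KEY and ANTHROPIC_API_KEY are assumptions (they are the names the OpenAI and Anthropic SDKs read by default), and the names in .env.example may differ. The helpers parse_env_file and missing_keys are hypothetical.

```python
import os

# Assumed variable names; check .env.example for the names this project uses.
REQUIRED_KEYS = ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")


def parse_env_file(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    if os.path.exists(".env"):
        with open(".env", encoding="utf-8") as handle:
            missing = missing_keys(parse_env_file(handle.read()))
        print("Missing keys:", missing or "none")
    else:
        print("No .env file found; copy .env.example to .env first.")
```

Run it from the repository root. An empty value such as `ANTHROPIC_API_KEY=` is reported as missing.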
Chatbot Usage
Go to http://localhost:8501 for the main chatbot interface.
- Select your preferred interaction method (RAG, TAG)
- Choose an LLM provider (OpenAI or Claude)
- Start asking questions about your database
When not running in dev mode, go to http://localhost:3000 for the Langfuse interface.
Architecture
- Frontend: Streamlit
- Document Parsing: Docling
- Vector Database: PostgreSQL with pgvector
- Observability: Langfuse
- LLM Framework: LlamaIndex
- Container Orchestration: Docker Compose
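To illustrate what the vector database contributes, the sketch below ranks a few text snippets by cosine similarity to a query vector. It is a pure-Python stand-in and does not show this project's code: in the real stack the ranking would be done by pgvector inside PostgreSQL (pgvector exposes cosine distance through its `<=>` operator), and the embeddings would come from an embedding model. The snippets, the 3-dimensional vectors, and the helper names are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, documents, k=2):
    """Return the k document texts most similar to the query vector."""
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vec, doc[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy embeddings chosen by hand; real ones come from an embedding model.
documents = [
    ("Album table: AlbumId, Title, ArtistId", [0.9, 0.1, 0.0]),
    ("Invoice table: InvoiceId, CustomerId, Total", [0.0, 0.2, 0.9]),
    ("Artist table: ArtistId, Name", [0.8, 0.3, 0.1]),
]

print(top_k([1.0, 0.2, 0.0], documents))
```

With these toy vectors, the Album and Artist snippets rank above the Invoice snippet, which is the kind of context a RAG prompt would then include.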
Paper References
- RAG (Retrieval-Augmented Generation): Paper by Facebook AI
- TAG (Table-Augmented Generation): Paper by UC Berkeley & Stanford University
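The TAG approach can be summarized as three steps: query synthesis, query execution, and answer generation. The sketch below walks through those steps against an in-memory SQLite table. It is an illustration under stated assumptions, not this project's implementation: both LLM calls are replaced by hypothetical stub functions (synthesize_query and generate_answer), SQLite stands in for PostgreSQL, and the three sample rows are toy data shaped like Chinook's Artist table.

```python
import sqlite3


def synthesize_query(question):
    """Stub for the LLM call that turns a question into SQL (hypothetical)."""
    # A real system would prompt an LLM with the schema and the question.
    return "SELECT Name FROM Artist ORDER BY Name LIMIT 3"


def execute_query(conn, sql):
    """Run the synthesized SQL and return the result rows."""
    return conn.execute(sql).fetchall()


def generate_answer(question, rows):
    """Stub for the LLM call that phrases the rows as an answer (hypothetical)."""
    names = ", ".join(row[0] for row in rows)
    return f"Q: {question} A: {names}"


def tag_answer(conn, question):
    """Chain the three TAG steps: synthesize, execute, generate."""
    sql = synthesize_query(question)
    rows = execute_query(conn, sql)
    return generate_answer(question, rows)


# Toy table for the example only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Artist (ArtistId INTEGER PRIMARY KEY, Name TEXT)")
conn.executemany(
    "INSERT INTO Artist (Name) VALUES (?)",
    [("AC/DC",), ("Accept",), ("Aerosmith",)],
)

print(tag_answer(conn, "Which artists are in the store?"))
```

Keeping the three steps as separate functions makes it straightforward to trace each one individually in an observability tool.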
Data Source Statement
This project uses the Chinook database, a sample media store database, for development and testing. It can be adapted to other enterprise or domain-specific databases.
- Chinook Database:
  - Ownership: Maintained by lerocha
  - Licenses and Use: The Chinook Database allows use, distribution, and modification without any warranty of any kind.
  - Access: Available on GitHub at Chinook Database
- Intent Classifier Data:
  - GretelAI Synthetic Text-to-SQL:
    - Ownership: Gretel.ai
    - Licenses and Use: Licensed under the Apache License 2.0, permitting use, distribution, and modification with proper attribution.
    - Access: Available on Hugging Face at GretelAI Synthetic Text-to-SQL
  - Factoid WebQuestions Dataset:
    - Ownership: WebQuestions (Berant et al., 2013, CC-BY)
    - Licenses and Use: Distributed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license, allowing sharing and adaptation with appropriate credit.
    - Access: Available on GitHub at Factoid WebQuestions Dataset
Evaluation Framework
The evaluation framework is located under the eval sub-folder.