From 523871caf97d214db95d0374853edb5f060eb341 Mon Sep 17 00:00:00 2001
From: "hasnain.ahmed"
Date: Wed, 31 Dec 2025 01:37:51 +0500
Subject: [PATCH] readme know contains MiniLM-l6-v2 model docs

---
 README.md | 255 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 251 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index 86d859e..b559173 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,254 @@
-to run this project follow the following steps.
+# Islamic Duas Semantic Search API
 
+A FastAPI-based semantic search engine for Islamic duas (prayers) that uses vector embeddings to find relevant prayers based on natural language queries. The system performs semantic similarity search on dua tags using sentence transformers and PostgreSQL with pgvector extension.
+
+## Features
+
+- **Semantic Search**: Find duas using natural language queries (e.g., "protection from evil", "morning prayers")
+- **Vector Embeddings**: Uses `sentence-transformers/all-MiniLM-L6-v2` for high-quality embeddings
+- **PostgreSQL + pgvector**: Efficient vector similarity search in PostgreSQL
+- **RESTful API**: FastAPI-powered endpoints with automatic OpenAPI documentation
+- **Multi-language Support**: Returns duas in Arabic, transliteration, English translation, Urdu, and Roman Urdu
+- **Metadata Filtering**: Access category, occasion, source, and tags information
+- **CORS Enabled**: Ready for frontend integration
+
+## Technology Stack
+
+- **FastAPI**: Modern, fast web framework for building APIs
+- **LangChain**: Framework for working with embeddings and vector stores
+- **Sentence Transformers**: State-of-the-art text embedding models
+- **PostgreSQL + pgvector**: Vector database for similarity search
+- **Pydantic**: Data validation and settings management
+- **python-dotenv**: Environment variable management
+
+## Prerequisites
+
+- Python 3.13 or higher
+- PostgreSQL with pgvector extension installed
+- UV package manager (recommended) or pip
+
+## Installation
+
+### 1. Clone the repository
+
+```bash
+git clone
+cd semantic_search
+```
+
+### 2. Install dependencies
+
+Using UV (recommended):
+```bash
 uv sync
-source .venv/bin/activate
-uvicorn main:app --reload --port=8899
+```
 
-the fastapi will be server at localhost port 8899, you may go to the localhost:8899/docs, to try the GET search API, and further API docs.
\ No newline at end of file
+Using pip:
+```bash
+pip install -r requirements.txt
+```
+
+### 3. Activate virtual environment
+
+```bash
+source .venv/bin/activate
+```
+
+## Configuration
+
+### Environment Variables
+
+Create a `.env` file in the project root with the following variables:
+
+```env
+CONNECTION_STRING=postgresql+psycopg2://username:password@localhost:5432/database_name
+COLLECTION_NAME=duas_embeddings
+```
+
+**Environment Variables Explained:**
+- `CONNECTION_STRING`: PostgreSQL connection string with pgvector extension
+- `COLLECTION_NAME`: Name of the collection/table to store embeddings
+
+### Database Setup
+
+1. Install pgvector extension in PostgreSQL:
+```sql
+CREATE EXTENSION vector;
+```
+
+2. Ensure you have a PostgreSQL database created and accessible with the credentials in your `.env` file.
+
+## Data Preparation
+
+### Initial Setup: Generate Embeddings
+
+Before running the API, you need to generate embeddings from your duas data:
+
+```bash
+python generate_dua_tags_embedding.py
+```
+
+This script:
+- Reads duas from `duas_directus_published.json`
+- Generates vector embeddings from dua tags
+- Stores embeddings in PostgreSQL with pgvector
+- Preserves all metadata (Arabic text, translation, category, etc.)
+
+**Note**: Ensure `duas_directus_published.json` exists in the project root before running this script.
+
+## Running the Application
+
+### Start the FastAPI server
+
+```bash
+uvicorn main:app --reload --port=8899
+```
+
+Or simply:
+```bash
+python main.py
+```
+
+The API will be available at: `http://localhost:8899`
+
+### Access API Documentation
+
+FastAPI provides automatic interactive API documentation:
+
+- **Swagger UI**: http://localhost:8899/docs
+- **ReDoc**: http://localhost:8899/redoc
+
+## API Endpoints
+
+### Health Check
+- **GET** `/` - Root endpoint with API information
+- **GET** `/health` - Health check with database connection status
+
+### Search
+- **GET** `/search?query={query}&k={number}` - Search duas using GET request
+  - **Parameters**:
+    - `query` (required): Search query (e.g., "protection from evil")
+    - `k` (optional, default=5): Number of results to return (1-50)
+
+### Metadata
+- **GET** `/categories` - Get all unique categories from the duas collection
+
+### Example Request
+
+```bash
+curl "http://localhost:8899/search?query=protection%20from%20evil&k=5"
+```
+
+### Example Response
+
+```json
+{
+  "query": "protection from evil",
+  "results_count": 5,
+  "results": [
+    {
+      "id": "123",
+      "arabic": "أَعُوذُ بِكَلِمَاتِ اللَّهِ التَّامَّاتِ",
+      "transliteration": "A'udhu bikalimatillahit-tammati",
+      "translation": "I seek refuge in the perfect words of Allah",
+      "urdu": "میں اللہ کے کامل کلمات کی پناہ چاہتا ہوں",
+      "romanUrdu": "Main Allah ke kamil kalimat ki panah chahta hoon",
+      "category": "Protection",
+      "occasion": "General",
+      "source": "Sahih Muslim",
+      "tags": ["protection", "evil", "refuge"],
+      "similarity_score": 0.8542
+    }
+  ]
+}
+```
+
+## Project Structure
+
+```
+semantic_search/
+├── main.py                          # FastAPI application with API endpoints
+├── generate_dua_tags_embedding.py   # Script to generate and store embeddings
+├── duas_query.py                    # Helper script for testing queries
+├── duas_directus_published.json     # Source data file with duas
+├── pyproject.toml                   # Project dependencies and metadata
+├── requirements.txt                 # Python dependencies
+├── .env                             # Environment variables (not committed)
+├── README.md                        # This file
+└── .venv/                           # Virtual environment
+```
+
+## Key Files Explained
+
+### main.py:14-40
+Main FastAPI application with:
+- API endpoints for semantic search
+- Health check endpoints
+- CORS middleware configuration
+- Vector store initialization
+
+### generate_dua_tags_embedding.py:22-54
+Embedding generation script that:
+- Loads duas from JSON file
+- Creates embeddings from tags only
+- Stores full metadata in vector database
+
+### duas_query.py:21-41
+Helper script for testing search functionality programmatically
+
+## Development
+
+### Testing Search Locally
+
+Use `duas_query.py` for quick testing:
+
+```bash
+python duas_query.py
+```
+
+Modify the query and k parameters in the script to test different searches.
+
+### Adding New Duas
+
+1. Add new duas to `duas_directus_published.json`
+2. Run `python generate_dua_tags_embedding.py` to regenerate embeddings
+3. Restart the API server
+
+## Production Considerations
+
+- Update CORS settings in main.py:21-27 to restrict allowed origins
+- Use environment-specific connection strings
+- Consider caching for the `/categories` endpoint
+- Implement rate limiting for API endpoints
+- Add authentication/authorization if needed
+- Use a process manager like Gunicorn with Uvicorn workers
+
+## Troubleshooting
+
+### Database Connection Issues
+- Verify PostgreSQL is running
+- Check CONNECTION_STRING in `.env` file
+- Ensure pgvector extension is installed
+
+### Empty Results
+- Verify embeddings were generated successfully
+- Check if `duas_directus_published.json` has data
+- Ensure COLLECTION_NAME matches in all files
+
+### Port Already in Use
+Change the port in the uvicorn command:
+```bash
+uvicorn main:app --reload --port=8080
+```
+
+## License
+
+[Add your license here]
+
+## Contributing
+
+[Add contribution guidelines here]
+
+## Contact
+
+[Add contact information here]
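
Note (not part of the patch): the `GET /search` endpoint documented in the README above could be called from Python roughly as follows. This is a minimal sketch, not code from the repository. The helper names `build_search_url`, `summarize`, and `search` are hypothetical, the response fields are taken from the README's example response, and the client-side check on `k` simply mirrors the documented 1-50 range (whether the server enforces it is an assumption).

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen


def build_search_url(base_url: str, query: str, k: int = 5) -> str:
    """Build the GET /search URL with a percent-encoded query string."""
    if not query.strip():
        raise ValueError("query must not be empty")
    # Mirrors the documented range for k; server-side enforcement is assumed.
    if not 1 <= k <= 50:
        raise ValueError("k must be between 1 and 50")
    # quote_via=quote encodes spaces as %20, matching the README's curl example.
    params = urlencode({"query": query, "k": k}, quote_via=quote)
    return f"{base_url.rstrip('/')}/search?{params}"


def summarize(payload: dict) -> list[tuple[str, float]]:
    """Extract (translation, similarity_score) pairs from a search response."""
    return [
        (item["translation"], item["similarity_score"])
        for item in payload.get("results", [])
    ]


def search(base_url: str, query: str, k: int = 5, timeout: float = 10.0) -> dict:
    """Send the request to a running server and decode the JSON body."""
    with urlopen(build_search_url(base_url, query, k), timeout=timeout) as resp:
        return json.load(resp)
```

With the server started on port 8899 as described in the README, `summarize(search("http://localhost:8899", "protection from evil"))` would list each returned translation alongside its similarity score.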