# Ragex Configuration Guide

This document covers all configuration options for Ragex, including embedding models, caching, and performance tuning.

## Table of Contents

- [Embedding Models](#embedding-models)
- [Configuration Methods](#configuration-methods)
- [Available Models](#available-models)
- [Model Selection Guide](#model-selection-guide)
- [AI Features Configuration](#ai-features-configuration)
- [Auto-Analyze Directories](#auto-analyze-directories)
- [Cache Configuration](#cache-configuration)
- [Migration Guide](#migration-guide)
- [Performance Tuning](#performance-tuning)

## Embedding Models

Ragex supports multiple embedding models for semantic code search. The model choice affects:

- **Quality**: Accuracy of semantic search results
- **Speed**: Embedding generation and search time
- **Memory**: RAM required to load the model
- **Dimensions**: Vector size (impacts storage and similarity computation)

### Default Model

By default, Ragex uses **all-MiniLM-L6-v2**:

- ✅ Fast inference (384 dimensions)
- ✅ Small model size (~90MB)
- ✅ Good quality for general-purpose search
- ✅ Suitable for small to medium codebases

## Configuration Methods

### 1. Via `config/config.exs` (Recommended)

```elixir
import Config

# Set embedding model
config :ragex, :embedding_model, :all_minilm_l6_v2

# Available options:
#   :all_minilm_l6_v2        (default)
#   :all_mpnet_base_v2       (high quality)
#   :codebert_base           (code-specific)
#   :paraphrase_multilingual (multilingual)
```

### 2. Via Environment Variable

```bash
export RAGEX_EMBEDDING_MODEL=codebert_base
mix run --no-halt
```

This overrides the `config.exs` setting.

### 3. Checking Current Configuration

```bash
mix ragex.embeddings.migrate --check
```

Output example:

```
Checking embedding model status...
✓ Configured Model: all-MiniLM-L6-v2
  ID: all_minilm_l6_v2
  Dimensions: 384
  Type: sentence_transformer
  Repository: sentence-transformers/all-MiniLM-L6-v2

✓ No embeddings stored yet

Available Models:
  • all_minilm_l6_v2 (current)
    all-MiniLM-L6-v2 - 384 dims
  • all_mpnet_base_v2
    all-mpnet-base-v2 - 768 dims
  • codebert_base
    CodeBERT Base - 768 dims
  • paraphrase_multilingual
    paraphrase-multilingual-MiniLM-L12-v2 - 384 dims
```

## Available Models

### 1. all-MiniLM-L6-v2 (Default)

**Model ID:** `:all_minilm_l6_v2`

**Specifications:**

- **Dimensions:** 384
- **Max tokens:** 256
- **Type:** Sentence transformer
- **Model size:** ~90MB

**Best for:**

- ✅ General-purpose semantic search
- ✅ Small to medium codebases (<10k entities)
- ✅ Fast inference requirements
- ✅ Limited memory environments

**Performance:**

- Embedding generation: ~50ms per entity
- Memory usage: ~400MB (model + runtime)
- Quality: Good for most use cases

**Configuration:**

```elixir
config :ragex, :embedding_model, :all_minilm_l6_v2
```

---

### 2. all-mpnet-base-v2 (High Quality)

**Model ID:** `:all_mpnet_base_v2`

**Specifications:**

- **Dimensions:** 768
- **Max tokens:** 384
- **Type:** Sentence transformer
- **Model size:** ~420MB

**Best for:**

- ✅ Large codebases requiring high accuracy
- ✅ Deep semantic understanding
- ✅ When quality matters more than speed
- ✅ Complex domain-specific terminology

**Performance:**

- Embedding generation: ~100ms per entity
- Memory usage: ~800MB (model + runtime)
- Quality: Excellent semantic understanding

**Trade-offs:**

- ⚠️ 2x slower than all-MiniLM-L6-v2
- ⚠️ 2x the memory
- ⚠️ 2x larger embeddings (storage)

**Configuration:**

```elixir
config :ragex, :embedding_model, :all_mpnet_base_v2
```

---

### 3. CodeBERT Base (Code-Specific)

**Model ID:** `:codebert_base`

**Specifications:**

- **Dimensions:** 768
- **Max tokens:** 512
- **Type:** Code model
- **Model size:** ~500MB

**Best for:**

- ✅ Code similarity tasks
- ✅ Programming-specific queries
- ✅ Multi-language codebases
- ✅ API discovery and documentation search

**Performance:**

- Embedding generation: ~120ms per entity
- Memory usage: ~900MB (model + runtime)
- Quality: Optimized for code understanding

**Special features:**

- Pre-trained on code and natural language
- Better understanding of programming concepts
- Good for finding similar code patterns

**Configuration:**

```elixir
config :ragex, :embedding_model, :codebert_base
```

---

### 4. paraphrase-multilingual-MiniLM-L12-v2 (Multilingual)

**Model ID:** `:paraphrase_multilingual`

**Specifications:**

- **Dimensions:** 384
- **Max tokens:** 128
- **Type:** Multilingual
- **Model size:** ~110MB

**Best for:**

- ✅ International teams
- ✅ Non-English documentation
- ✅ Multilingual codebases (50+ languages)
- ✅ Mixed-language comments/docs

**Performance:**

- Embedding generation: ~60ms per entity
- Memory usage: ~450MB (model + runtime)
- Quality: Good for multilingual content

**Supported languages:** Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish, and 37+ more

**Configuration:**

```elixir
config :ragex, :embedding_model, :paraphrase_multilingual
```

## Model Selection Guide

### Decision Tree

```
Do you have multilingual code/docs?
├─ YES → paraphrase_multilingual
└─ NO  → Continue...

Is your codebase primarily code-focused?
├─ YES → codebert_base
└─ NO  → Continue...

Do you need maximum quality?
├─ YES → all_mpnet_base_v2
└─ NO  → all_minilm_l6_v2 (default)
```

### Use Case Recommendations

| Use Case | Recommended Model | Why |
|----------|-------------------|-----|
| **Startup/Small Project** | all_minilm_l6_v2 | Fast, lightweight, good enough |
| **Enterprise/Large Codebase** | all_mpnet_base_v2 | Best quality, worth the cost |
| **Code-heavy (APIs, Libraries)** | codebert_base | Trained specifically on code |
| **International Team** | paraphrase_multilingual | Multi-language support |
| **Limited Memory (<4GB)** | all_minilm_l6_v2 | Smallest footprint |
| **Quality-Critical** | all_mpnet_base_v2 | Highest accuracy |

### Dimension Compatibility

Models with the **same dimensions** can share embeddings:

**384-dimensional models (compatible):**

- all_minilm_l6_v2
- paraphrase_multilingual

**768-dimensional models (compatible):**

- all_mpnet_base_v2
- codebert_base

You can switch between compatible models without regenerating embeddings!

## AI Features Configuration

Ragex includes AI-powered features for enhanced code analysis (Phases A, B, C).

### Master Switch

```elixir
config :ragex, :ai,
  enabled: true,                  # Master switch for all AI features
  default_provider: :deepseek_r1  # or :openai, :anthropic, :ollama
```

### Feature Flags

```elixir
config :ragex, :ai_features,
  # Phase B - High-Priority Features
  validation_error_explanation: true,
  refactor_preview_commentary: true,

  # Phase C - Analysis Features
  dead_code_refinement: true,
  duplication_semantic_analysis: true,
  dependency_insights: true
```

### Feature-Specific Options

Each feature has optimized defaults for temperature and token limits:

| Feature | Temperature | Max Tokens | Cache TTL |
|---------|-------------|------------|-----------|
| Validation Error Explanation | 0.3 | 300 | 7 days |
| Refactor Preview Commentary | 0.7 | 500 | 1 hour |
| Dead Code Refinement | 0.6 | 400 | 7 days |
| Duplication Semantic Analysis | 0.5 | 600 | 3 days |
| Dependency Insights | 0.6 | 700 | 6 hours |

### Runtime Override

All features support runtime configuration overrides:

```elixir
# Disable AI globally but enable it for a specific analysis
{:ok, dead} = DeadCode.find_unused_exports(ai_refine: true)

# Enable AI globally but disable it for a specific analysis
{:ok, clones} = Duplication.detect_in_files(files, ai_analyze: false)

# Override the AI provider
{:ok, insights} = AIInsights.analyze_coupling(data, provider: :openai)
```

### Graceful Degradation

- All AI features are opt-in and disabled by default
- When disabled or unavailable, features gracefully return the original results
- No failures or crashes if the AI provider is unavailable
- Caching reduces API calls by 40-60%

### API Keys

AI features require API keys for external providers:

```bash
# DeepSeek (recommended)
export DEEPSEEK_API_KEY="sk-..."

# OpenAI
export OPENAI_API_KEY="sk-..."

# Anthropic
export ANTHROPIC_API_KEY="sk-..."

# Ollama (local, no key needed)
export OLLAMA_HOST="http://localhost:11434"
```

### Documentation

See the phase completion documents for detailed usage:

- `stuff/phases/PHASE_A_AI_FEATURES_FOUNDATION.md`
- `stuff/phases/PHASE_B_AI_FEATURES_COMPLETE.md`
- `stuff/phases/PHASE_C_AI_ANALYSIS_COMPLETE.md`

## Auto-Analyze Directories

### Overview

Ragex can automatically scan and index configured directories when the application starts.
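Conceptually, the startup pass just walks the configured list, expands each path, and hands it to the analyzer, logging failures instead of crashing. A self-contained sketch of that loop, assuming a hypothetical `AutoAnalyzeSketch` module with an `analyze/1` stub (illustrative only, not Ragex internals):

```elixir
defmodule AutoAnalyzeSketch do
  # Hypothetical startup hook: read the configured directories and
  # analyze each one in turn, logging failures without crashing.
  def run do
    :ragex
    |> Application.get_env(:auto_analyze_dirs, [])
    |> Enum.each(fn dir ->
      expanded = Path.expand(dir)  # resolves "~" and relative paths
      case analyze(expanded) do
        {:ok, count} -> IO.puts("Analyzed #{expanded}: #{count} files")
        {:error, reason} -> IO.warn("Skipping #{expanded}: #{inspect(reason)}")
      end
    end)
  end

  # Stand-in for the real analyzer; always succeeds with zero files.
  defp analyze(_dir), do: {:ok, 0}
end
```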
This is useful for:

- Pre-loading commonly used projects into the knowledge graph
- Ensuring fresh analysis on server startup
- CI/CD pipelines that need indexed code immediately
- Development workflows where you work on multiple codebases

### Configuration

**In `config/config.exs`:**

```elixir
# Auto-analyze directories on startup
config :ragex, :auto_analyze_dirs, [
  "/opt/Proyectos/MyProject",
  "/home/user/code/important-lib",
  "~/workspace/api-server"
]
```

**Default:** `[]` (no automatic analysis)

### Behavior

1. **Startup Phase**: Analysis runs during the `:auto_analyze` start phase
2. **Non-blocking**: The server starts before analysis completes
3. **Logging**: Progress is logged to stderr (visible in logs)
4. **Error Handling**: Failures are logged as warnings; the server continues
5. **Sequential**: Multiple directories are analyzed one after another

### Example Output

```
Auto-analyzing 2 configured directories...
Analyzing directory: /opt/Proyectos/MyProject
Successfully analyzed /opt/Proyectos/MyProject: 45 files (2 skipped, 0 errors)
Analyzing directory: /home/user/code/important-lib
Successfully analyzed /home/user/code/important-lib: 23 files (0 skipped, 0 errors)
Auto-analysis complete
```

### Performance Considerations

- **Large codebases**: Initial analysis can take 30-60 seconds per 1,000 files
- **Embeddings**: Generation adds ~50ms per entity (enable caching to speed up subsequent starts)
- **Memory**: Each analyzed file adds ~10KB to ETS tables
- **Recommendation**: Limit auto-analysis to 3-5 active projects and use incremental updates

### Environment-Specific Configuration

**Development:**

```elixir
# config/dev.exs
import Config

config :ragex, :auto_analyze_dirs, [
  ".",             # Current project
  "../shared-lib"  # Related library
]
```

**Production:**

```elixir
# config/prod.exs
import Config

config :ragex, :auto_analyze_dirs, [
  "/app/src",                  # Main application
  "/app/vendor/critical-deps"  # Important dependencies
]
```

**CI/CD:**

```elixir
# config/ci.exs
import Config

# Analyze the entire codebase for comprehensive checks
config :ragex, :auto_analyze_dirs, [
  System.get_env("CI_PROJECT_DIR", ".")
]
```

### Disabling Auto-Analysis

Set the list to empty:

```elixir
config :ragex, :auto_analyze_dirs, []
```

Or override via the environment:

```bash
export RAGEX_AUTO_ANALYZE="false"
```

### Combining with File Watcher

Auto-analysis and file watching work together:

1. **Startup**: Auto-analysis performs a full scan of the configured directories
2. **Runtime**: The file watcher tracks changes (incremental updates)
3. **Result**: An always up-to-date knowledge graph

### Troubleshooting

**Issue:** "Failed to analyze directory"

**Solutions:**

- Check that the path exists and is readable
- Verify sufficient disk space for the cache
- Check file permissions
- Look for syntax errors in source files

**Issue:** Startup takes too long

**Solutions:**

- Reduce the number of configured directories
- Enable the embedding cache (see Cache Configuration)
- Use file watching for incremental updates instead
- Consider analyzing on demand via MCP tools

## Cache Configuration

### Enable/Disable Cache

```elixir
config :ragex, :cache,
  enabled: true,                       # Set to false to disable caching
  dir: Path.expand("~/.cache/ragex"),  # Cache directory
  max_age_days: 30                     # Auto-cleanup after 30 days
```

### Cache Location

Default: `~/.cache/ragex/embeddings/.ets`

Custom location:

```elixir
config :ragex, :cache,
  enabled: true,
  dir: "/custom/path/to/cache"
```

### Cache Management Commands

```bash
# Show cache statistics
mix ragex.cache.stats

# Clear all caches
mix ragex.cache.clear

# Clear caches older than 7 days
mix ragex.cache.clear --older-than 7
```

## Migration Guide

### Switching Models

#### Scenario 1: Compatible Models (Same Dimensions)

**Example:** all_minilm_l6_v2 → paraphrase_multilingual (both 384 dims)

**Steps:**

1. Update `config/config.exs`:

   ```elixir
   config :ragex, :embedding_model, :paraphrase_multilingual
   ```

2. Restart the server:

   ```bash
   # Kill the existing process, then restart
   mix run --no-halt
   ```

3. ✅ Done! Existing embeddings still work.
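The reason same-dimension models are interchangeable at the storage level is that similarity is computed pairwise over vectors of equal length. A standalone illustration of this constraint (a generic cosine-similarity sketch, not Ragex's actual implementation):

```elixir
defmodule SimilaritySketch do
  # Cosine similarity is only defined for vectors of equal length,
  # which is why 384-dim and 768-dim embeddings cannot be mixed.
  def cosine_similarity(a, b) when length(a) == length(b) do
    dot = Enum.zip(a, b) |> Enum.map(fn {x, y} -> x * y end) |> Enum.sum()
    norm = fn v -> :math.sqrt(Enum.sum(Enum.map(v, &(&1 * &1)))) end
    dot / (norm.(a) * norm.(b))
  end
end
```

Any model emitting 384-dimensional vectors can score against stored 384-dimensional embeddings; a 768-dimensional vector would fail the length guard above, which is exactly the incompatible case covered next.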
---

#### Scenario 2: Incompatible Models (Different Dimensions)

**Example:** all_minilm_l6_v2 (384) → all_mpnet_base_v2 (768)

**Steps:**

1. Check the current status:

   ```bash
   mix ragex.embeddings.migrate --check
   ```

2. Clear existing embeddings:

   ```bash
   # Stop the server.
   # Embeddings are in-memory and will be cleared on restart.
   ```

3. Update `config/config.exs`:

   ```elixir
   config :ragex, :embedding_model, :all_mpnet_base_v2
   ```

4. Restart and re-analyze:

   ```bash
   mix run --no-halt
   # Then analyze your codebase via MCP tools
   ```

---

### Using the Migration Tool

#### Check Status

```bash
mix ragex.embeddings.migrate --check
```

#### Plan Migration

```bash
mix ragex.embeddings.migrate --model codebert_base
```

This checks compatibility and provides instructions.

#### Force Migration

```bash
mix ragex.embeddings.migrate --model codebert_base --force
```

## Performance Tuning

### Memory Optimization

**For systems with limited memory (<4GB):**

1. Use the lightweight model:

   ```elixir
   config :ragex, :embedding_model, :all_minilm_l6_v2
   ```

2. Limit the batch size (in the Bumblebee adapter):

   ```elixir
   compile: [batch_size: 16, sequence_length: 256]  # Reduced from 32
   ```

3. Disable the cache if needed:

   ```elixir
   config :ragex, :cache, enabled: false
   ```

---

### Speed Optimization

**For faster embedding generation:**

1. Use the fastest model:

   ```elixir
   config :ragex, :embedding_model, :all_minilm_l6_v2
   ```

2. Reduce the sequence length:

   ```elixir
   compile: [batch_size: 32, sequence_length: 256]  # Reduced from 512
   ```

3. Enable the EXLA compiler (if not already enabled):
   - Ensure the `exla` dependency is included
   - The first run will compile (slow); subsequent runs are fast

---

### Quality Optimization

**For the best search quality:**

1. Use the high-quality model:

   ```elixir
   config :ragex, :embedding_model, :all_mpnet_base_v2
   ```

2. Generate embeddings for all entities. In the `analyze_file` MCP tool, always set:

   ```json
   { "generate_embeddings": true }
   ```

3. Use longer text descriptions:
   - Include more context in function/module docs
   - Better descriptions = better embeddings

---

## Environment-Specific Configuration

### Development

```elixir
# config/dev.exs
import Config

config :ragex, :embedding_model, :all_minilm_l6_v2  # Fast for dev
config :ragex, :cache, enabled: true                # Cache for quick restarts
```

### Production

```elixir
# config/prod.exs
import Config

config :ragex, :embedding_model, :all_mpnet_base_v2     # Quality for prod
config :ragex, :cache, enabled: true, max_age_days: 90  # Long-lived cache
```

### Testing

```elixir
# config/test.exs
import Config

config :ragex, :embedding_model, :all_minilm_l6_v2  # Fast tests
config :ragex, :cache, enabled: false               # No cache, for isolation
```

---

## Troubleshooting

### Model Won't Load

**Symptom:** "Failed to load Bumblebee model"

**Solutions:**

1. Check the internet connection (models are downloaded on first use)
2. Verify disk space (~500MB needed)
3. Check cache directory permissions: `~/.cache/huggingface/`
4. Try clearing the HuggingFace cache:

   ```bash
   rm -rf ~/.cache/huggingface/
   ```

---

### Dimension Mismatch Error

**Symptom:** "Dimension mismatch: expected 384, got 768"

**Solution:**

```bash
mix ragex.embeddings.migrate --check
# Follow the instructions to clear embeddings,
# then restart with the new model
```

---

### Out of Memory

**Symptom:** The server crashes or freezes during embedding generation

**Solutions:**

1. Switch to a smaller model:

   ```elixir
   config :ragex, :embedding_model, :all_minilm_l6_v2
   ```

2. Reduce the batch size in code:

   ```elixir
   compile: [batch_size: 8, sequence_length: 256]
   ```

3. Increase system swap space

---

## Advanced Configuration

### Custom Model (Advanced)

To add a custom model, edit `lib/ragex/embeddings/registry.ex`:

```elixir
custom_model: %{
  id: :custom_model,
  name: "Custom Model",
  repo: "organization/model-name",
  dimensions: 512,
  max_tokens: 256,
  description: "My custom embedding model",
  type: :sentence_transformer,
  recommended_for: ["custom use case"]
}
```

Then configure it:

```elixir
config :ragex, :embedding_model, :custom_model
```

---

## Summary

**Quick Start (Default):**

```elixir
# config/config.exs
config :ragex, :embedding_model, :all_minilm_l6_v2
```

**For Best Quality:**

```elixir
config :ragex, :embedding_model, :all_mpnet_base_v2
```

**For Code-Specific Search:**

```elixir
config :ragex, :embedding_model, :codebert_base
```

**For Multilingual Content:**

```elixir
config :ragex, :embedding_model, :paraphrase_multilingual
```

**Check Status Anytime:**

```bash
mix ragex.embeddings.migrate --check
```

---

## References

- [Sentence Transformers Documentation](https://www.sbert.net/)
- [HuggingFace Models](https://huggingface.co/models)
- [CodeBERT Paper](https://arxiv.org/abs/2002.08155)
- [Bumblebee Library](https://hexdocs.pm/bumblebee/)
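As a closing appendix, the options documented above can be gathered into one `config/config.exs`. This is a sketch using only the settings discussed in this guide; the values shown are illustrative defaults, not requirements:

```elixir
import Config

# Embedding model (see "Available Models")
config :ragex, :embedding_model, :all_minilm_l6_v2

# Embedding cache (see "Cache Configuration")
config :ragex, :cache,
  enabled: true,
  dir: Path.expand("~/.cache/ragex"),
  max_age_days: 30

# AI features are opt-in and disabled by default (see "AI Features Configuration");
# flip `enabled` to true and export the provider's API key to use them.
config :ragex, :ai,
  enabled: false,
  default_provider: :deepseek_r1

# Directories indexed on startup (see "Auto-Analyze Directories")
config :ragex, :auto_analyze_dirs, []
```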