# Ragex Code Analysis Guide

Comprehensive guide to Ragex's code analysis capabilities powered by Metastatic and semantic embeddings.

## Table of Contents

1. [Overview](#overview)
2. [Analysis Approaches](#analysis-approaches)
3. [Code Duplication Detection](#code-duplication-detection)
4. [Dead Code Detection](#dead-code-detection)
5. [Dependency Analysis](#dependency-analysis)
6. [Impact Analysis](#impact-analysis)
7. [MCP Tools Reference](#mcp-tools-reference)
8. [Best Practices](#best-practices)
9. [Troubleshooting](#troubleshooting)

## Overview

Ragex provides advanced code analysis capabilities through two complementary approaches:

1. **AST-Based Analysis** - Precise structural analysis via Metastatic
2. **Embedding-Based Analysis** - Semantic similarity via ML embeddings

All analysis features are accessible via MCP tools and can be integrated into your development workflow.

### Supported Languages

- Elixir (.ex, .exs)
- Erlang (.erl, .hrl)
- Python (.py)
- JavaScript/TypeScript (.js, .ts)
- Ruby (.rb)
- Haskell (.hs)

## Analysis Approaches

### AST-Based Analysis (Metastatic)

**Advantages:**
- Precise structural matching
- Language-aware analysis
- Detects subtle code patterns
- No training required

**Use Cases:**
- Exact and near-exact code duplication
- Dead code detection (unreachable code)
- Structural similarity analysis

### Embedding-Based Analysis

**Advantages:**
- Semantic understanding
- Cross-language similarity
- Finds conceptually similar code
- Works with comments and documentation

**Use Cases:**
- Finding semantically similar functions
- Code smell detection
- Refactoring opportunities
- Cross-project similarity

## Code Duplication Detection

Ragex detects four types of code clones using Metastatic's AST comparison:

### Clone Types

#### Type I: Exact Clones

Identical code with only whitespace/comment differences.
```elixir
# File 1
defmodule A do
  def calculate(x, y) do
    x + y * 2
  end
end

# File 2 (Type I clone)
defmodule A do
  def calculate(x, y) do
    x + y * 2
  end
end
```

#### Type II: Renamed Clones

Same structure with different identifiers.

```elixir
# File 1
defmodule A do
  def process(data, options) do
    Map.put(data, :result, options.value)
  end
end

# File 2 (Type II clone)
defmodule A do
  def process(input, config) do
    Map.put(input, :result, config.value)
  end
end
```

#### Type III: Near-Miss Clones

Similar structure with minor modifications.

```elixir
# File 1
defmodule A do
  def process(x) do
    result = x * 10
    result + 100
  end
end

# File 2 (Type III clone)
defmodule A do
  def process(x) do
    result = x * 10
    result + 200  # Different constant
  end
end
```

#### Type IV: Semantic Clones

Different syntax, same behavior.

```elixir
# File 1
def sum_list(items) do
  Enum.reduce(items, 0, &+/2)
end

# File 2 (Type IV clone)
def sum_list(items) do
  items |> Enum.sum()
end
```

### API Usage

#### Detect Duplicates Between Two Files

```elixir
alias Ragex.Analysis.Duplication

# Basic usage
{:ok, result} = Duplication.detect_between_files("lib/a.ex", "lib/b.ex")

if result.duplicate? do
  IO.puts("Found #{result.clone_type} clone")
  IO.puts("Similarity: #{result.similarity_score}")
end

# With options
{:ok, result} = Duplication.detect_between_files(
  "lib/a.ex",
  "lib/b.ex",
  threshold: 0.9  # Stricter matching
)
```

#### Detect Duplicates Across Multiple Files

```elixir
files = ["lib/a.ex", "lib/b.ex", "lib/c.ex"]
{:ok, clones} = Duplication.detect_in_files(files)

Enum.each(clones, fn clone ->
  IO.puts("#{clone.file1} <-> #{clone.file2}")
  IO.puts(" Type: #{clone.clone_type}")
  IO.puts(" Similarity: #{clone.similarity}")
end)
```

#### Scan Directory for Duplicates

```elixir
# Recursive scan with defaults
{:ok, clones} = Duplication.detect_in_directory("lib/")

# Custom options
{:ok, clones} = Duplication.detect_in_directory("lib/",
  recursive: true,
  threshold: 0.8,
  exclude_patterns: ["_build", "deps", ".git", "test"]
)

IO.puts("Found #{length(clones)} duplicate pairs")
```

#### Embedding-Based Similarity

```elixir
# Find similar functions using embeddings
{:ok, similar} = Duplication.find_similar_functions(
  threshold: 0.95,  # High similarity
  limit: 20,
  node_type: :function
)

Enum.each(similar, fn pair ->
  IO.puts("#{inspect(pair.function1)} ~ #{inspect(pair.function2)}")
  IO.puts(" Similarity: #{pair.similarity}")
  IO.puts(" Method: #{pair.method}")  # :embedding
end)
```

#### Generate Comprehensive Report

```elixir
{:ok, report} = Duplication.generate_report("lib/",
  include_embeddings: true,
  threshold: 0.8
)

IO.puts(report.summary)
IO.puts("AST clones: #{report.ast_clones.total}")
IO.puts("Embedding similar: #{report.embedding_similar.total}")

# Access detailed data
report.ast_clones.by_type        # %{type_i: 5, type_ii: 3, ...}
report.ast_clones.pairs          # List of clone pairs
report.embedding_similar.pairs   # List of similar pairs
```

### MCP Tools

#### `find_duplicates`

Detect duplicates using AST-based analysis.
```json
{
  "name": "find_duplicates",
  "arguments": {
    "mode": "directory",
    "path": "lib/",
    "threshold": 0.8,
    "format": "detailed"
  }
}
```

**Modes:**
- `"directory"` - Scan entire directory
- `"files"` - Compare specific files (provide `file1` and `file2`)

**Formats:**
- `"summary"` - Brief overview
- `"detailed"` - Full clone information
- `"json"` - Machine-readable JSON

#### `find_similar_code`

Find semantically similar code using embeddings.

```json
{
  "name": "find_similar_code",
  "arguments": {
    "threshold": 0.95,
    "limit": 20,
    "format": "summary"
  }
}
```

## Dead Code Detection

Ragex provides two types of dead code detection:

### 1. Interprocedural (Graph-Based)

Detects unused functions by analyzing the call graph.

```elixir
alias Ragex.Analysis.DeadCode

# Find unused public functions
{:ok, unused_exports} = DeadCode.find_unused_exports()
# Returns: [{:module, ModuleName, :function_name, arity}, ...]

# Find unused private functions
{:ok, unused_private} = DeadCode.find_unused_private()

# Find unused modules
{:ok, unused_modules} = DeadCode.find_unused_modules()

# Generate removal suggestions
{:ok, suggestions} = DeadCode.removal_suggestions(confidence_threshold: 0.8)
```

### 2. Intraprocedural (AST-Based via Metastatic)

Detects unreachable code patterns within functions.

```elixir
# Analyze single file
{:ok, patterns} = DeadCode.analyze_file("lib/my_module.ex")

Enum.each(patterns, fn pattern ->
  IO.puts("#{pattern.type}: Line #{pattern.line}")
  IO.puts(" #{pattern.description}")
end)

# Analyze directory
{:ok, results} = DeadCode.analyze_files("lib/")
# Returns: Map of file paths to dead code patterns
```

**Detected Patterns:**
- Unreachable code after `return`
- Constant conditions (always true/false)
- Unused variables
- Dead branches

### MCP Tools

#### `find_dead_code`

Graph-based unused function detection.
```json
{
  "name": "find_dead_code",
  "arguments": {
    "confidence_threshold": 0.8,
    "include_private": true,
    "format": "detailed"
  }
}
```

#### `analyze_dead_code_patterns`

AST-based unreachable code detection.

```json
{
  "name": "analyze_dead_code_patterns",
  "arguments": {
    "path": "lib/my_module.ex",
    "format": "json"
  }
}
```

## Dependency Analysis

Analyze module dependencies and coupling.

### Finding Circular Dependencies

```elixir
alias Ragex.Analysis.DependencyGraph

# Find all circular dependencies
{:ok, cycles} = DependencyGraph.find_cycles()

Enum.each(cycles, fn cycle ->
  IO.puts("Cycle: #{inspect(cycle)}")
end)
```

### Coupling Metrics

```elixir
# Calculate coupling for a module
metrics = DependencyGraph.coupling_metrics(MyModule)

IO.puts("Afferent coupling: #{metrics.afferent}")  # Incoming deps
IO.puts("Efferent coupling: #{metrics.efferent}")  # Outgoing deps
IO.puts("Instability: #{metrics.instability}")     # 0.0 to 1.0
```

**Instability** = efferent / (afferent + efferent)
- 0.0 = Stable (many dependents, few dependencies)
- 1.0 = Unstable (few dependents, many dependencies)

### Finding God Modules

```elixir
# Modules with high coupling
{:ok, god_modules} = DependencyGraph.find_god_modules(threshold: 10)
```

### MCP Tools

#### `analyze_dependencies`

```json
{
  "name": "analyze_dependencies",
  "arguments": {
    "module": "MyModule",
    "include_transitive": true
  }
}
```

#### `find_circular_dependencies`

```json
{
  "name": "find_circular_dependencies",
  "arguments": {
    "min_cycle_length": 2
  }
}
```

#### `coupling_report`

```json
{
  "name": "coupling_report",
  "arguments": {
    "format": "json",
    "sort_by": "instability"
  }
}
```

## Impact Analysis

Predict the impact of code changes before making them using graph traversal and metrics.

### Overview

Impact Analysis answers critical questions:
- Which code will be affected by this change?
- Which tests need to run?
- How risky is this refactoring?
- How much effort will this take?

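
The graph traversal behind these questions can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Ragex's implementation: the `ImpactSketch` module, its `affected/3` function, and the shape of the `callers` map are all hypothetical. The sketch walks a reverse call graph breadth-first, up to a maximum depth, and collects every transitive caller of a target.

```elixir
defmodule ImpactSketch do
  @moduledoc "Hypothetical illustration only; not part of Ragex."

  # `callers` maps each function to the functions that call it directly.
  # Returns the set of direct and transitive callers within `max_depth` hops.
  def affected(callers, target, max_depth \\ 5) do
    callers
    |> walk([target], MapSet.new([target]), max_depth)
    |> MapSet.delete(target)
  end

  # Stop when the depth budget is used up or no new callers remain.
  defp walk(_callers, _frontier, seen, 0), do: seen
  defp walk(_callers, [], seen, _depth), do: seen

  defp walk(callers, frontier, seen, depth) do
    # Callers of the current frontier that have not been visited yet.
    next =
      frontier
      |> Enum.flat_map(&Map.get(callers, &1, []))
      |> Enum.uniq()
      |> Enum.reject(&MapSet.member?(seen, &1))

    walk(callers, next, MapSet.union(seen, MapSet.new(next)), depth - 1)
  end
end

# Toy reverse call graph: B.b and D.d call A.a, and C.c calls B.b.
callers = %{
  {:function, A, :a, 0} => [{:function, B, :b, 0}, {:function, D, :d, 0}],
  {:function, B, :b, 0} => [{:function, C, :c, 0}]
}

affected = ImpactSketch.affected(callers, {:function, A, :a, 0})
IO.puts("Total affected: #{MapSet.size(affected)}")  # prints "Total affected: 3"
```

Passing a smaller `max_depth` narrows the result to nearer callers; a depth of 1 yields only the direct callers. The visited set also keeps the walk from looping on circular call chains.
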

**Key Features:**
- Graph-based call chain analysis
- Risk scoring (importance + coupling + complexity)
- Effort estimation for refactoring operations
- Test discovery

### Analyzing Change Impact

```elixir
alias Ragex.Analysis.Impact

# Analyze impact of changing a function
{:ok, analysis} = Impact.analyze_change({:function, MyModule, :process, 2})

IO.puts("Direct callers: #{length(analysis.direct_callers)}")
IO.puts("Total affected: #{analysis.affected_count}")
IO.puts("Risk score: #{analysis.risk_score}")
IO.puts("Importance: #{analysis.importance}")

# Show recommendations
Enum.each(analysis.recommendations, &IO.puts/1)
```

**Parameters:**
- `depth` - Maximum traversal depth (default: 5)
- `include_tests` - Include test files in analysis (default: true)
- `exclude_modules` - Modules to exclude from traversal

**Returns:**
- `target` - The node being analyzed
- `direct_callers` - Functions that directly call this
- `all_affected` - All reachable callers (transitive)
- `affected_count` - Total number of affected nodes
- `risk_score` - Overall risk (0.0 to 1.0)
- `importance` - PageRank-based importance
- `recommendations` - Actionable advice

### Finding Affected Tests

```elixir
# Find tests that will be affected by changing this function
{:ok, tests} = Impact.find_affected_tests({:function, MyModule, :process, 2})

IO.puts("#{length(tests)} tests affected")

Enum.each(tests, fn {:function, module, name, arity} ->
  IO.puts(" - #{module}.#{name}/#{arity}")
end)
```

**Custom Test Patterns:**

```elixir
# Support non-standard test naming (e.g., specs)
{:ok, tests} = Impact.find_affected_tests(
  {:function, MyModule, :process, 2},
  test_patterns: ["Spec", "Test", "_test"]
)
```

### Estimating Refactoring Effort

```elixir
# Estimate effort for rename operation
{:ok, estimate} = Impact.estimate_effort(
  :rename_function,
  {:function, MyModule, :old_name, 2}
)

IO.puts("Operation: #{estimate.operation}")
IO.puts("Changes needed: #{estimate.estimated_changes} locations")
IO.puts("Complexity: #{estimate.complexity}")
IO.puts("Time estimate: #{estimate.estimated_time}")

# Review risks
IO.puts("\nRisks:")
Enum.each(estimate.risks, fn risk ->
  IO.puts(" - #{risk}")
end)

# Review recommendations
IO.puts("\nRecommendations:")
Enum.each(estimate.recommendations, fn rec ->
  IO.puts(" - #{rec}")
end)
```

**Supported Operations:**
- `:rename_function` - Rename a function
- `:rename_module` - Rename a module
- `:extract_function` - Extract code into new function
- `:inline_function` - Inline a function
- `:move_function` - Move function to another module
- `:change_signature` - Change function signature

**Complexity Levels:**
- `:low` - < 5 affected locations (< 30 min)
- `:medium` - 5-20 locations (30 min - 2 hours)
- `:high` - 20-50 locations (2-4 hours)
- `:very_high` - 50+ locations (1+ day)

### Risk Assessment

```elixir
# Calculate risk score for a change
{:ok, risk} = Impact.risk_score({:function, MyModule, :critical_fn, 1})

IO.puts("Target: #{inspect(risk.target)}")
IO.puts("Overall risk: #{risk.overall} (#{risk.level})")
IO.puts("\nComponents:")
IO.puts(" Importance: #{risk.importance}")  # PageRank
IO.puts(" Coupling: #{risk.coupling}")      # Edges
IO.puts(" Complexity: #{risk.complexity}")  # Code metrics
```

**Risk Levels:**
- `:low` - Overall < 0.3 (safe to change)
- `:medium` - 0.3 ≤ Overall < 0.6 (needs review)
- `:high` - 0.6 ≤ Overall < 0.8 (risky, comprehensive testing)
- `:critical` - Overall ≥ 0.8 (very risky, plan carefully)

**Risk Components:**
1. **Importance** - Based on PageRank (how central in the call graph)
2. **Coupling** - Number of incoming/outgoing edges (normalized)
3. **Complexity** - Code complexity metrics (if available)

### MCP Tools

#### `analyze_impact`

Analyze the impact of changing a function or module.
```json
{
  "name": "analyze_impact",
  "arguments": {
    "target": "MyModule.process/2",
    "depth": 5,
    "include_tests": true,
    "format": "detailed"
  }
}
```

**Target Formats:**
- `"Module.function/arity"` - Specific function
- `"Module"` - Entire module

#### `estimate_refactoring_effort`

Estimate effort for a refactoring operation.

```json
{
  "name": "estimate_refactoring_effort",
  "arguments": {
    "operation": "rename_function",
    "target": "MyModule.old_name/2",
    "format": "summary"
  }
}
```

**Operations:** `rename_function`, `rename_module`, `extract_function`, `inline_function`, `move_function`, `change_signature`

#### `risk_assessment`

Calculate risk score for a change.

```json
{
  "name": "risk_assessment",
  "arguments": {
    "target": "MyModule.critical/1",
    "format": "detailed"
  }
}
```

### Workflow Example

**Before Refactoring:**

```elixir
# Step 1: Analyze impact
{:ok, impact} = Impact.analyze_change({:function, MyModule, :old_name, 2})

if impact.affected_count > 20 do
  IO.puts("Warning: Large impact (#{impact.affected_count} locations)")
end

# Step 2: Find affected tests
{:ok, tests} = Impact.find_affected_tests({:function, MyModule, :old_name, 2})
IO.puts("Tests to run: #{length(tests)}")

# Step 3: Estimate effort
{:ok, estimate} = Impact.estimate_effort(
  :rename_function,
  {:function, MyModule, :old_name, 2}
)
IO.puts("Estimated time: #{estimate.estimated_time}")

# Step 4: Assess risk
{:ok, risk} = Impact.risk_score({:function, MyModule, :old_name, 2})

case risk.level do
  :low -> IO.puts("✓ Safe to proceed")
  :medium -> IO.puts("⚠ Review carefully")
  :high -> IO.puts("⚠ High risk - thorough testing required")
  :critical -> IO.puts("❌ Critical risk - consider alternative approach")
end

# Step 5: Proceed with refactoring if acceptable
if risk.level in [:low, :medium] do
  # Run refactoring
  # Run affected tests
  # Commit changes
end
```

### Best Practices

1. **Always analyze before refactoring** - Know the scope of changes
2. **Check risk levels** - Don't proceed with critical-risk changes without planning
3. **Run affected tests** - Use test discovery to optimize CI time
4. **Review transitive callers** - Indirect impacts can be significant
5. **Consider alternatives** - High-risk operations may have safer approaches
6. **Document high-impact changes** - Leave notes for future maintainers
7. **Use depth wisely** - Deep traversal (depth > 10) can be expensive
8. **Exclude test files for production impact** - Use `include_tests: false`

### Limitations

**Current limitations:**
- Dynamic function calls (apply, send) not fully tracked
- Macros may not be accurately analyzed
- Cross-module dependencies require full analysis
- Complexity metrics require quality analysis to be run first

**Workarounds:**
- Run comprehensive analysis before impact analysis
- Manually review dynamic call sites
- Use conservative estimates for macro-heavy code
- Lower confidence scores indicate potential dynamic usage

## MCP Tools Reference

### Summary of All Analysis Tools

| Tool | Purpose | Analysis Type |
|------|---------|---------------|
| `find_duplicates` | Code duplication detection | AST (Metastatic) |
| `find_similar_code` | Semantic similarity | Embedding |
| `find_dead_code` | Unused functions | Graph |
| `analyze_dead_code_patterns` | Unreachable code | AST (Metastatic) |
| `analyze_dependencies` | Module dependencies | Graph |
| `find_circular_dependencies` | Circular deps | Graph |
| `coupling_report` | Coupling metrics | Graph |
| `analyze_impact` | Change impact analysis | Graph |
| `estimate_refactoring_effort` | Effort estimation | Graph + Metrics |
| `risk_assessment` | Risk scoring | Graph + PageRank |

### Common Parameters

**Formats:**
- `"summary"` - Brief, human-readable
- `"detailed"` - Complete information
- `"json"` - Machine-readable JSON

**Thresholds:**
- Duplication: 0.8-0.95 (higher = stricter)
- Similarity: 0.9-0.99 (higher = more similar)
- Confidence: 0.7-0.9 (higher = more certain)

## Best Practices

### Duplication Detection

1. **Start with high thresholds** (0.9+) to find obvious duplicates
2. **Lower gradually** to find near-misses
3. **Review Type II/III clones carefully** - they may be intentional
4. **Use embedding-based search** for conceptual similarity
5. **Exclude build artifacts** - always exclude `_build`, `deps`, etc.

### Dead Code Detection

1. **Check confidence scores** - low confidence may indicate dynamic calls
2. **Review entry points** - callbacks, GenServer handlers, etc. may not show up in the call graph
3. **Combine both approaches** - graph-based for unused functions, AST-based for unreachable code
4. **Run regularly** - integrate into your CI/CD pipeline
5. **Keep a whitelist** of intentionally unused functions (e.g., API compatibility)

### Dependency Analysis

1. **Monitor instability** - high-instability modules are risky to change
2. **Break circular dependencies** - they indicate poor separation of concerns
3. **Watch for God modules** - high coupling suggests a need for refactoring
4. **Track trends over time** - coupling should decrease as code improves

### Performance Tips

1. **Use incremental analysis** - only analyze changed files
2. **Exclude test directories** for production analysis
3. **Limit depth** for transitive dependency analysis
4. **Cache results** - Ragex automatically caches embeddings
5. **Run in parallel** - analysis operations are concurrent-safe

## Troubleshooting

### No Duplicates Found (Expected Some)

**Possible causes:**
- Threshold too high - try lowering to 0.7-0.8
- Files not in supported languages - check file extensions
- Structural differences too large - use embedding-based similarity

**Solutions:**

```elixir
# Try lower threshold
{:ok, clones} = Duplication.detect_in_directory("lib/", threshold: 0.7)

# Or use embedding-based similarity
{:ok, similar} = Duplication.find_similar_functions(threshold: 0.85)
```

### Too Many False Positives

**Possible causes:**
- Threshold too low
- Structural patterns common in the language (e.g., GenServer boilerplate)
- Short functions with similar structure

**Solutions:**

```elixir
# Increase threshold
{:ok, clones} = Duplication.detect_in_directory("lib/", threshold: 0.95)

# Filter by minimum size
clones
|> Enum.filter(fn clone ->
  clone.details.locations
  |> Enum.any?(fn loc -> loc.lines > 5 end)
end)
```

### Dead Code False Positives

**Possible causes:**
- Dynamic function calls (`apply/3`, `__MODULE__`)
- Reflection usage
- Entry points not in call graph (callbacks, tests)

**Solutions:**
1. Check confidence scores - low confidence = likely dynamic
2. Maintain a whitelist of known entry points
3. Review before deletion

### Parse Errors

**Possible causes:**
- Invalid syntax in source files
- Unsupported language features
- Missing language parser

**Solutions:**

```elixir
# Check logs for specific parse errors
# Ragex logs warnings for unparseable files

# Exclude problematic files
{:ok, clones} = Duplication.detect_in_directory("lib/",
  exclude_patterns: ["problem_file.ex"]
)
```

### Performance Issues

**Symptoms:**
- Slow analysis on large codebases
- Memory usage spikes

**Solutions:**
1. Analyze incrementally (changed files only)
2. Exclude large generated files
3. Use streaming for large result sets
4. Increase system resources

```elixir
# Analyze only changed files
changed_files = ["lib/a.ex", "lib/b.ex"]
{:ok, clones} = Duplication.detect_in_files(changed_files)
```

## Integration Examples

### CI/CD Pipeline

```bash
#!/bin/bash
# detect_issues.sh

# Find duplicates
echo "Checking for code duplication..."
mix ragex.analyze.duplicates --threshold 0.9 --format json > duplicates.json

# Find dead code
echo "Checking for dead code..."
mix ragex.analyze.dead_code --confidence 0.8 --format json > dead_code.json

# Check for circular dependencies
echo "Checking for circular dependencies..."
mix ragex.analyze.cycles --format json > cycles.json

# Fail if issues found
# (-s tests for a non-empty file, so this assumes each task writes nothing when it finds no issues)
if [ -s duplicates.json ] || [ -s dead_code.json ] || [ -s cycles.json ]; then
  echo "Code quality issues detected!"
  exit 1
fi
```

### Pre-commit Hook

```bash
#!/bin/bash
# .git/hooks/pre-commit

# Get staged Elixir files (.ex and .exs)
STAGED_FILES=$(git diff --cached --name-only --diff-filter=ACM | grep -E '\.exs?$')

if [ -n "$STAGED_FILES" ]; then
  echo "Checking staged files for duplication..."
  # $STAGED_FILES is left unquoted on purpose so each path becomes its own argument
  mix ragex.analyze.duplicates --files $STAGED_FILES --threshold 0.95
fi
```

### Interactive Analysis

```elixir
# In IEx
alias Ragex.Analysis.Duplication

# Generate report
{:ok, report} = Duplication.generate_report("lib/")

# Display summary
IO.puts(report.summary)

# Investigate specific clones
report.ast_clones.pairs
|> Enum.filter(&(&1.clone_type == :type_i))
|> Enum.each(fn clone ->
  IO.puts("\n#{clone.file1} <-> #{clone.file2}")
  IO.puts(" #{clone.details.summary}")
end)
```

## Further Reading

- [Metastatic Documentation](https://github.com/oeditus/metastatic)

---

**Version:** Ragex 0.2.0  
**Last Updated:** January 2026  
**Status:** Production Ready