Ragex.Analysis.Duplication (Ragex v0.10.1)

Code duplication detection using two complementary approaches.

Primary: AST-Based Detection (via Metastatic)

Delegates to Metastatic.Analysis.Duplication for precise clone detection:

  • Type I: Exact clones (identical AST)
  • Type II: Renamed clones (same structure, different identifiers)
  • Type III: Near-miss clones (similar structure with modifications)
  • Type IV: Semantic clones (different syntax, same behavior)

Works across different programming languages by comparing MetaAST representations.
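
To make the difference between Type I and Type II clones concrete, here is a minimal sketch that works on Elixir's own quoted AST rather than on MetaAST. The CloneSketch module and its functions are hypothetical and are not part of Ragex or Metastatic; the sketch only illustrates the idea of comparing structure after stripping metadata (Type I) and after collapsing identifiers (Type II).

```elixir
defmodule CloneSketch do
  @moduledoc false
  # Hypothetical illustration only; not the Metastatic implementation.

  # Drop metadata (line numbers etc.) so formatting differences do not matter.
  def strip_meta(ast) do
    Macro.prewalk(ast, fn
      {form, _meta, args} -> {form, [], args}
      other -> other
    end)
  end

  # Additionally collapse every variable into one placeholder so that renamed
  # identifiers compare equal. This is deliberately coarse: it does not check
  # that the renaming is consistent.
  def normalize(ast) do
    Macro.prewalk(ast, fn
      {name, _meta, ctx} when is_atom(name) and is_atom(ctx) -> {:var, [], nil}
      {form, _meta, args} -> {form, [], args}
      other -> other
    end)
  end

  # Classify two source snippets as :type_i, :type_ii, or :none.
  def classify(src1, src2) do
    a = Code.string_to_quoted!(src1)
    b = Code.string_to_quoted!(src2)

    cond do
      strip_meta(a) == strip_meta(b) -> :type_i
      normalize(a) == normalize(b) -> :type_ii
      true -> :none
    end
  end
end

CloneSketch.classify("def add(x, y), do: x + y", "def add(a, b), do: a + b")
# => :type_ii
```

Type III and Type IV detection need a similarity measure or semantic reasoning and are beyond what this sketch covers.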

Secondary: Embedding-Based Detection

Uses existing semantic embeddings to find similar functions:

  • Semantic similarity measured by cosine similarity between embeddings
  • Configurable similarity threshold (default: 0.95)
  • Complements AST-based detection
  • Useful for finding "code smells" and refactoring opportunities
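
The threshold check can be pictured with a small, self-contained sketch. The CosineSketch module below is hypothetical and operates on plain lists of floats; how Ragex stores and compares its embeddings internally is not shown in this documentation, so treat this only as an illustration of cosine similarity against a 0.95 threshold.

```elixir
defmodule CosineSketch do
  @moduledoc false
  # Hypothetical illustration only; not the Ragex implementation.

  # Cosine similarity of two equal-length vectors; 0.0 if either has zero norm.
  def cosine_similarity(a, b) when length(a) == length(b) do
    dot = a |> Enum.zip(b) |> Enum.reduce(0.0, fn {x, y}, acc -> acc + x * y end)
    denom = norm(a) * norm(b)
    if denom == 0.0, do: 0.0, else: dot / denom
  end

  # True when the similarity reaches the threshold (default mirrors the documented 0.95).
  def similar?(a, b, threshold \\ 0.95), do: cosine_similarity(a, b) >= threshold

  # Euclidean norm of a vector.
  defp norm(v), do: v |> Enum.reduce(0.0, fn x, acc -> acc + x * x end) |> :math.sqrt()
end

CosineSketch.similar?([1.0, 0.0], [1.0, 0.1])
# => true
CosineSketch.similar?([1.0, 0.0], [1.0, 1.0])
# => false
```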

Usage

alias Ragex.Analysis.Duplication

# AST-based detection (via Metastatic)
{:ok, clones} = Duplication.detect_in_files(["lib/a.ex", "lib/b.ex"])

# Embedding-based detection
{:ok, similar} = Duplication.find_similar_functions(threshold: 0.95)

# Detect in directory
{:ok, results} = Duplication.detect_in_directory("lib/")

Summary

Functions

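detect_between_files(file1_path, file2_path, opts \\ [])

Detects duplicates between two files using Metastatic's AST comparison.

detect_in_directory(directory, opts \\ [])

Detects duplicates in all supported files within a directory.

detect_in_files(file_paths, opts \\ [])

Detects duplicates across multiple files.

find_duplicates(path, opts \\ [])

Finds code duplicates in a directory.

find_similar_functions(opts \\ [])

Finds similar functions using semantic embeddings.

generate_report(directory, opts \\ [])

Generates a duplication report for a project.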

Types

clone_pair()

@type clone_pair() :: %{
  file1: String.t(),
  file2: String.t(),
  clone_type: clone_type(),
  similarity: float(),
  details: map()
}

clone_type()

@type clone_type() :: :type_i | :type_ii | :type_iii | :type_iv

function_ref()

@type function_ref() :: {:function, module(), atom(), non_neg_integer()}

similar_pair()

@type similar_pair() :: %{
  function1: function_ref(),
  function2: function_ref(),
  similarity: float(),
  method: :embedding | :ast
}

Functions

detect_between_files(file1_path, file2_path, opts \\ [])

@spec detect_between_files(String.t(), String.t(), keyword()) ::
  {:ok, Metastatic.Analysis.Duplication.Result.t()} | {:error, term()}

Detects duplicates between two files using Metastatic's AST comparison.

Parameters

  • file1_path - Path to first file
  • file2_path - Path to second file
  • opts - Keyword list of options
    • :threshold - Similarity threshold for Type III (default: 0.8)
    • :min_tokens - Minimum tokens for detection (default: 5)
    • :cross_language - Enable cross-language detection (default: true)

Returns

  • {:ok, result} - Metastatic.Analysis.Duplication.Result struct
  • {:error, reason} - Error if analysis fails

Examples

{:ok, result} = Duplication.detect_between_files("lib/a.ex", "lib/b.ex")
if result.duplicate? do
  IO.puts("Found #{result.clone_type} clone")
end

detect_in_directory(directory, opts \\ [])

@spec detect_in_directory(
  String.t(),
  keyword()
) :: {:ok, [clone_pair()]} | {:error, term()}

Detects duplicates in all supported files within a directory.

Scans the directory (recursively by default) for supported file types and detects duplicates among them using Metastatic's AST comparison.

Parameters

  • directory - Path to directory
  • opts - Keyword list of options
    • :recursive - Recursively scan subdirectories (default: true)
    • :threshold - Similarity threshold (default: 0.8)
    • :exclude_patterns - List of patterns to exclude (default: ["_build", "deps", ".git"])

Returns

  • {:ok, [clone_pair]} - List of detected clone pairs
  • {:error, reason} - Error if analysis fails

Examples

{:ok, clones} = Duplication.detect_in_directory("lib/")
IO.puts("Found #{length(clones)} duplicate pairs")
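
The effect of :exclude_patterns can be sketched as a simple path filter. The ExcludeSketch module is hypothetical, and it assumes a pattern excludes a path when it equals one of the path's segments; the documentation does not state how Ragex actually matches patterns, so the real behavior may differ (for example, substring or glob matching).

```elixir
defmodule ExcludeSketch do
  @moduledoc false
  # Hypothetical illustration only; the matching rule is an assumption.

  # Mirrors the documented default for :exclude_patterns.
  @default_excludes ["_build", "deps", ".git"]

  # Drop every path that has a segment equal to one of the patterns.
  def reject_excluded(paths, patterns \\ @default_excludes) do
    Enum.reject(paths, fn path ->
      path |> Path.split() |> Enum.any?(&(&1 in patterns))
    end)
  end
end

ExcludeSketch.reject_excluded(["lib/a.ex", "deps/foo/lib/foo.ex", "_build/dev/lib/x.ex"])
# => ["lib/a.ex"]
```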

detect_in_files(file_paths, opts \\ [])

@spec detect_in_files(
  [String.t()],
  keyword()
) :: {:ok, [clone_pair()]} | {:error, term()}

Detects duplicates across multiple files.

Returns a list of clone pairs found across the provided files.

Parameters

  • file_paths - List of file paths to analyze
  • opts - Keyword list of options. Accepts the same options as detect_between_files/3, plus:
    • :ai_analyze - Use AI for semantic analysis (default: from config)

Returns

  • {:ok, [clone_pair]} - List of detected clone pairs
  • {:error, reason} - Error if analysis fails

Examples

{:ok, clones} = Duplication.detect_in_files(["lib/a.ex", "lib/b.ex", "lib/c.ex"])
Enum.each(clones, fn clone ->
  IO.puts("#{clone.file1} <-> #{clone.file2}: #{clone.clone_type}")
end)
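
Because each result is a plain clone_pair() map, the list can be summarized with standard Enum functions. The sketch below uses hand-written sample data shaped like clone_pair(); the CloneSummary module and the sample values are hypothetical and are not produced by Ragex.

```elixir
defmodule CloneSummary do
  @moduledoc false
  # Hypothetical helper for summarizing a list of clone_pair() maps.

  # Count clone pairs per clone type, e.g. %{type_i: 1, type_iii: 2}.
  def by_type(clones), do: Enum.frequencies_by(clones, & &1.clone_type)

  # Keep only the pairs in which the given file takes part.
  def involving(clones, file) do
    Enum.filter(clones, &(&1.file1 == file or &1.file2 == file))
  end
end

# Sample data only; real values come from Duplication.detect_in_files/2.
clones = [
  %{file1: "lib/a.ex", file2: "lib/b.ex", clone_type: :type_i, similarity: 1.0, details: %{}},
  %{file1: "lib/a.ex", file2: "lib/c.ex", clone_type: :type_iii, similarity: 0.86, details: %{}},
  %{file1: "lib/b.ex", file2: "lib/c.ex", clone_type: :type_iii, similarity: 0.91, details: %{}}
]

CloneSummary.by_type(clones)
# => %{type_i: 1, type_iii: 2}
```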

find_duplicates(path, opts \\ [])

@spec find_duplicates(
  String.t(),
  keyword()
) :: {:ok, [map()]} | {:error, term()}

Finds code duplicates in a directory.

Alias for detect_in_directory/2. Provided for API consistency with mix tasks.

Examples

{:ok, duplicates} = Duplication.find_duplicates("lib/", threshold: 0.85)

find_similar_functions(opts \\ [])

@spec find_similar_functions(keyword()) :: {:ok, [similar_pair()]} | {:error, term()}

Finds similar functions using semantic embeddings.

Complements AST-based detection by using cosine similarity on function embeddings to find semantically similar code.

Parameters

  • opts - Keyword list of options
    • :threshold - Similarity threshold (0.0-1.0, default: 0.95)
    • :limit - Maximum number of pairs to return (default: 100)
    • :node_type - Type of node to compare (default: :function)

Returns

  • {:ok, [similar_pair]} - List of similar function pairs
  • {:error, reason} - Error if analysis fails

Examples

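{:ok, similar} = Duplication.find_similar_functions(threshold: 0.95)

# Render a function_ref() tuple as "Module.name/arity"
format_function = fn {:function, mod, name, arity} -> "#{inspect(mod)}.#{name}/#{arity}" end

Enum.each(similar, fn pair ->
  IO.puts("#{format_function.(pair.function1)} ~ #{format_function.(pair.function2)}")
  IO.puts("  Similarity: #{pair.similarity}")
end)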
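
How :threshold and :limit interact can be sketched on a list of similar_pair() maps. The SimilarSketch module is hypothetical, and it assumes pairs are ranked by similarity before the limit is applied; the documentation does not say in which order Ragex returns or truncates results.

```elixir
defmodule SimilarSketch do
  @moduledoc false
  # Hypothetical illustration only; ordering before truncation is an assumption.

  # Keep pairs at or above the threshold, highest similarity first, capped at the limit.
  def top_pairs(pairs, opts \\ []) do
    threshold = Keyword.get(opts, :threshold, 0.95)
    limit = Keyword.get(opts, :limit, 100)

    pairs
    |> Enum.filter(&(&1.similarity >= threshold))
    |> Enum.sort_by(& &1.similarity, :desc)
    |> Enum.take(limit)
  end
end

# Sample data only; module and function names are made up.
pairs = [
  %{function1: {:function, MyApp.A, :run, 1}, function2: {:function, MyApp.B, :run, 1},
    similarity: 0.90, method: :embedding},
  %{function1: {:function, MyApp.A, :load, 2}, function2: {:function, MyApp.C, :load, 2},
    similarity: 0.99, method: :embedding}
]

pairs |> SimilarSketch.top_pairs() |> Enum.map(& &1.similarity)
# => [0.99]
```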

generate_report(directory, opts \\ [])

@spec generate_report(
  String.t(),
  keyword()
) :: {:ok, map()} | {:error, term()}

Generates a duplication report for a project.

Combines both AST-based and embedding-based detection to provide a comprehensive view of code duplication.

Parameters

  • directory - Path to project directory
  • opts - Keyword list of options
    • :include_embeddings - Include embedding-based results (default: true)
    • :format - Output format, one of :summary, :detailed, or :json (default: :summary)

Returns

  • {:ok, report} - Duplication report map
  • {:error, reason} - Error if analysis fails