Benchmark Evaluation


Benchmark results for PTC-Lisp code generation across different models.

Results Summary (v0.9.0)

| Model | Tests | Runs | Pass Rate | Avg Attempts | Duration |
|---|---|---|---|---|---|
| Claude Haiku 4.5 | 30 | 30 | 99.7% (897/900) | 1.24 | 50.7m |
| Gemini 3.1 Flash Lite Preview | 30 | 30 | 99.4% (895/900) | 1.22 | 46.1m |

Configuration: 30 tests across 5 difficulty levels, schema data mode, March 2026.

Both Haiku 4.5 and Flash Lite are small, inexpensive models. The high pass rates demonstrate that PTC-Lisp generation does not require large or expensive models.
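The pass rates in the summary follow directly from the test and run counts. A minimal sketch of that arithmetic, assuming each model ran every one of the 30 tests 30 times (the module name `PassRateSketch` is illustrative and not part of the benchmark tooling):

```elixir
defmodule PassRateSketch do
  # Percentage of passing executions, rounded to one decimal place.
  def pass_rate(passed, total), do: Float.round(passed / total * 100, 1)
end

tests = 30
runs = 30
# 30 tests x 30 runs = 900 executions per model
total = tests * runs

# Passed counts copied from the results summary.
results = [
  {"Claude Haiku 4.5", 897},
  {"Gemini 3.1 Flash Lite Preview", 895}
]

for {model, passed} <- results do
  rate = PassRateSketch.pass_rate(passed, total)
  IO.puts("#{model}: #{rate}% (#{total - passed} failed)")
end
```

This reproduces the 99.7% and 99.4% figures, with 3 and 5 failed executions respectively.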

Test Suite

The benchmark uses 30 tests organized into 5 categories:

| Category | Tests | Turn Limit | Description |
|---|---|---|---|
| Basic (Level 1) | 1-4 | 1 | count, filter, sum, avg |
| Intermediate (Level 2) | 5-8 | 1 | boolean fields, numeric comparison, AND logic, extremes |
| Advanced (Level 3) | 9-13 | 1 | top-N, OR logic, cross-dataset joins |
| Lisp-specific | 14-15 | 3 | group-by with aggregation, map destructuring |
| Multi-turn | 16-30 | 2-6 | tool calls, temporal analysis, optimization, exploration, plan execution |

Tests 1-15 are single-shot (one turn, no recovery). Tests 16-30 allow multi-turn interaction — the model can explore, recover from errors, and refine its approach.

What the Numbers Show

Multi-turn attempts are mostly by design, not errors

An average of 1.22-1.24 attempts per test does not mean that 22-24% of tests fail on the first try. Multi-turn tests (16-30) are designed to require multiple turns: the model searches, inspects results, then returns an answer. This is the REPL pattern working as intended.

Breaking this down by test category:

  • Single-shot tests (1-15): near-100% first-attempt success rate across both models. These are pure data transformation — the model writes one correct program on the first try.
  • Multi-turn tests (16-30): these naturally use 2-3 turns because the model needs to call tools, inspect results, then return an answer. Multiple turns here are the expected workflow, not a failure.
  • Genuine recovery: a small percentage of attempts fail due to code errors (unsupported interop methods, type mismatches). The runtime provides clear feedback and the model self-corrects on the next turn.
  • Unrecoverable (0.3-0.6%): the few executions that still failed after retries (3 of 900 for Haiku 4.5, 5 of 900 for Flash Lite).

Unrecoverable failures are LLM reasoning errors

Across both models (1,800 total test executions), 8 executions ended as FAIL. All 8 were reasoning errors, where the generated code ran successfully but produced the wrong answer:

| Failure Type | Count | Example |
|---|---|---|
| Hallucinated values | 3 | Returned a made-up document ID instead of extracting from tool results |
| Wrong field lookup | 3 | Searched :content for "ergonomics" when the word was in :topics |
| Guessed instead of checking | 2 | Printed results with println but then guessed the answer |
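As a quick consistency check, the failure counts can be tallied against the total number of executions. This is only a sketch of the arithmetic, using the counts listed above and assuming 2 models x 30 tests x 30 runs:

```elixir
# Failure counts copied from the breakdown above.
failures = [
  {"Hallucinated values", 3},
  {"Wrong field lookup", 3},
  {"Guessed instead of checking", 2}
]

total_failures = failures |> Enum.map(fn {_type, count} -> count end) |> Enum.sum()

# 2 models x 30 tests x 30 runs = 1800 executions
executions = 2 * 30 * 30

overall = Float.round(total_failures / executions * 100, 2)
IO.puts("#{total_failures} failures in #{executions} executions (#{overall}%)")

for {type, count} <- failures do
  share = Float.round(count / total_failures * 100, 1)
  IO.puts("  #{type}: #{count} (#{share}% of failures)")
end
```

The total of 8 matches the 3 and 5 failed executions implied by the per-model pass counts, and the combined rate of roughly 0.44% sits inside the 0.3-0.6% range quoted earlier.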

Recoverable errors — ones the model self-corrected after runtime feedback — did include language-related issues: unsupported Java interop methods (.substring, .contains on non-string types), nested #() anonymous functions, and type mismatches. These are real PTC-Lisp limitations that the multi-turn loop compensates for.

Recovery works

When the model writes code that fails at runtime (unsupported Java interop, type mismatches, nested anonymous functions), PTC-Runner returns a clear error message. The model then corrects its approach on the next turn. This recovery succeeds in nearly all cases — only the reasoning failures (where the code runs but produces the wrong answer) are unrecoverable.

Common recovered errors:

  • Using .substring or .contains on non-string types (switches to subs or some)
  • Nested #() anonymous functions (switches to (fn [...] ...))
  • Arithmetic on nil values (adds default values)
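The recovery pattern described here can be sketched as a small generate, run, retry loop. This is a hypothetical illustration, not PTC-Runner's actual implementation or API: `RecoveryLoopSketch`, the `generate` and `execute` callbacks, and the error message text are all stand-ins invented for the example.

```elixir
defmodule RecoveryLoopSketch do
  # generate: fn feedback -> program end  (stands in for the LLM call;
  #           feedback is nil on the first turn, an error message afterwards)
  # execute:  fn program -> {:ok, result} | {:error, message} end
  #           (stands in for the runtime)
  def run(generate, execute, max_turns) when max_turns >= 1 do
    loop(generate, execute, nil, 1, max_turns)
  end

  # Turn budget exhausted: give up and report the last error seen.
  defp loop(_generate, _execute, last_error, turn, max_turns) when turn > max_turns do
    {:error, :max_turns_exceeded, last_error}
  end

  defp loop(generate, execute, feedback, turn, max_turns) do
    program = generate.(feedback)

    case execute.(program) do
      {:ok, result} -> {:ok, result, turn}
      # Feed the runtime error back so the next attempt can correct it.
      {:error, message} -> loop(generate, execute, message, turn + 1, max_turns)
    end
  end
end

# Stub "model": the first attempt nests #(), the retry switches to (fn [...] ...).
generate = fn
  nil -> "(map #(filter #(> % 1) %) xs)"
  _feedback -> "(map (fn [x] (filter #(> % 1) x)) xs)"
end

# Stub "runtime": rejects the nested #() form and accepts anything else.
execute = fn program ->
  if String.contains?(program, "#(filter #(") do
    {:error, "nested #() is not supported"}
  else
    {:ok, :done}
  end
end

IO.inspect(RecoveryLoopSketch.run(generate, execute, 5))
# {:ok, :done, 2}
```

With a budget of a single turn the same stubs return `{:error, :max_turns_exceeded, "nested #() is not supported"}`, which mirrors why single-shot tests leave no room for recovery.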

Hardest Tests

The tests that cause the most failures and retries:

| Test | Challenge | Why It's Hard |
|---|---|---|
| #20: Find certification reimbursement policy | Search returns decoy results | Must fetch and compare content, not trust first match |
| #23: Which document mentions 'ergonomics'? | Answer requires inspecting fetched content | Model must check the right field, not guess from titles |
| #17: Find policy covering two topics | Multi-step search and intersection | Model must search, analyze, and narrow results |

These are all multi-turn tool-calling tasks requiring the model to resist premature answers and actually verify its findings.

Improving Reliability

1. Turn Limits

For complex queries, allow more iterations:

SubAgent.run(agent, context, max_turns: 8)  # default is 5

2. Prompt Customization

The base prompt includes common mistakes to avoid. Domain-specific examples can further improve reliability. See SubAgent Advanced.

3. Language Improvements (Ongoing)

Some retries stem from models expecting Clojure functions or Java interop methods that PTC-Lisp doesn't support. We add commonly-expected functions and interop methods as they are identified through benchmark analysis.

Running Benchmarks

cd demo

# Run benchmark with reports (30 tests, default model)
mix lisp --test --runs=5 --report

# Specific model
mix lisp --test --model=haiku --runs=30

# Verbose output to debug failures
mix lisp --test --model=haiku -v

Further Reading