
rrlmgraphbench (development version)

Documentation

  • README.md: updated “eight retrieval strategies” → “ten”; added graph_rag_agentic and random_k rows to the strategy comparison table; corrected example console output to show Strategies: 10. Closes #78.
  • vignettes/benchmark_report.Rmd: corrected schedule description from “every Monday” to “four times daily (03:00, 09:00, 15:00, and 21:00 UTC)”, matching the actual run-benchmark.yml cron triggers. Closes #77.

CI / Security

  • All GitHub Actions workflows are now pinned to full commit SHA digests instead of mutable @v4/@v2 tags. Closes #79.
  • R-CMD-check.yaml: added { os: windows-latest, r: "oldrel-1" } matrix entry for parity with the rrlmgraph CI matrix (rrlmgraph#88). Closes #80.

Bug fixes

  • run_full_benchmark(): graph_rag_agentic and random_k are now included in the default strategies vector so they run in every benchmark by default (#71).
  • score_response(): eval(parse(text=response_code)) now executes in a sandboxed environment where system, system2, unlink, file.remove, shell, Sys.setenv, writeLines, and write are replaced with no-ops, blocking adversarial shell-escape from LLM-generated code (#73).
  • mcp_build_ellmer_tools() / run_single(): agentic strategy now records every node name the LLM retrieves via get_node_info, find_callers, and find_callees; /10 is back-filled after chat$chat() returns instead of always being NA (#74).
  • bm25_retrieve() / term_overlap_retrieve(): token budget is now enforced with .bench_estimate_tokens() (1 token ≈ 3.5 chars, matching rrlmgraph’s ceiling(nchar / 3.5)) instead of the 4-char heuristic. This eliminates the ~14 % systematic over-allocation that inflated these baselines relative to graph-RAG strategies (#75).
  • run_single(): per-turn token vectors are normalised to length 2 before Reduce("+"), preventing R’s recycling rule from corrupting the cumulative token count when a turn reports only one value (#76).
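The recycling hazard behind the last fix can be sketched in a few lines (toy values; the actual per-turn bookkeeping in run_single() is more involved):

```r
# Two turns of (input, output) token counts; the second turn reported
# only one value, as some providers do.
turns <- list(c(input = 120, output = 45), c(input = 80))

# Naive accumulation: R recycles the length-1 vector across both slots,
# so the lone value 80 is added to the output count as well.
naive <- Reduce(`+`, turns)                # c(input = 200, output = 125)

# Fix: normalise every turn to length 2 (padding with 0) before summing.
pad2 <- function(x) { length(x) <- 2; x[is.na(x)] <- 0; x }
fixed <- Reduce(`+`, lapply(turns, pad2))  # c(input = 200, output = 45)
```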

CI / workflow

  • run-benchmark.yml: Ollama install step now pins v0.18.2 and verifies the installer SHA256 before execution, replacing the unsafe curl | sh pattern that allowed arbitrary code execution from the Ollama CDN (#72).

rrlmgraphbench 0.1.3

Dependencies

  • Requires rrlmgraph >= 0.1.3 to pick up the task_trace_weight cold-start fix (#78): spoke nodes no longer score above min_relevance when the weight is uninitialized, preventing context inflation in the tfidf/MCP strategies.

New strategies

  • graph_rag_agentic: agentic strategy — LLM drives MCP tool calls via find_callers, find_callees, list_functions, and search_nodes; no fixed graph seed (#51).
  • graph_rag_tfidf_noseed: query-only seeding — TF-IDF graph traversal using the query itself as the seed instead of a caller-specified node, for tasks where the seed node is unknown at runtime (#54).

Other changes

  • All rrlmgraph_* strategy names renamed to graph_rag_* for clarity (#50).
  • Budget cap added to bm25_retrieve and term_overlap_retrieve so they never exceed budget_tokens (#53).
  • Added 24 hard multi-hop evaluation task JSON files (#52).

rrlmgraphbench 0.1.2

CI / workflow

  • run-benchmark.yml now runs daily (was weekly) with a version-bump gate:
    • A cheap check-version job fetches the remote rrlmgraph DESCRIPTION and compares the Version: field to inst/last-benchmarked-rrlmgraph-version.txt.
    • On scheduled runs, the heavy benchmark job is skipped (and all its ~8 min of Node / Ollama / R setup) when the version has not changed.
    • Manual workflow_dispatch always bypasses the gate. A force_run boolean input is also provided for explicit overrides.
    • If the remote version cannot be fetched (network failure), the gate defaults to running (fail-open) so a version bump is never silently missed.
    • The version stamp file is written after a successful benchmark run only. A failed run does not advance the stamp, so the next scheduled run retries.
    • Benchmark commit messages now include the rrlmgraph version for traceability (e.g. chore: update benchmark results [rrlmgraph 0.1.2]).
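The gate reduces to a small predicate; here is a hedged R sketch (the function name and shape are illustrative, not the workflow's actual code):

```r
# Decide whether the heavy benchmark job should run.
# NA means the remote rrlmgraph DESCRIPTION could not be fetched.
should_run <- function(remote_version, last_benchmarked, manual = FALSE) {
  if (manual) return(TRUE)                      # workflow_dispatch bypasses the gate
  if (is.na(remote_version)) return(TRUE)       # fail open: never miss a bump
  !identical(remote_version, last_benchmarked)  # run only on a version change
}

should_run("0.1.3", "0.1.2")        # TRUE  - version bumped
should_run("0.1.2", "0.1.2")        # FALSE - skip the ~8 min of setup
should_run(NA_character_, "0.1.2")  # TRUE  - fetch failed, fail open
```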

Bug fixes

  • mcp_read_response(): fixed two bugs that silently disabled the rrlmgraph_mcp strategy in every benchmark run:
    1. processx::process$poll_io() returns keys "output" and "error" on all platforms, but the code was testing ready[["stdout"]], throwing "subscript out of bounds" in R and causing the outer tryCatch in mcp_start_server() to return NULL, which made build_context() return empty context for every MCP task.
    2. The initialize JSON-RPC request sent "capabilities":[] (an empty JSON array) instead of "capabilities":{} (an empty object). The Zod schema in the MCP SDK rejects arrays, so prior to this fix the server returned a -32603 error during the handshake. The same fix is applied to the notifications/initialized params field. Together these two bugs caused rrlmgraph_mcp scores to equal no_context (mean ~0.689) in the n=2 benchmark results. (bench#30)
  • run_full_benchmark(): the rrlmgraph_mcp strategy now starts a fresh MCP server per task rather than one global server per run. The previous design called mcp_start_server() with projects_dir (the parent of all task project directories) as project_path, but better-sqlite3 could not create graph.sqlite there on read-only CI paths, causing a silent crash before the JSON-RPC initialize handshake completed. Each task now: (1) exports its TF-IDF graph to a temporary SQLite file via rrlmgraph::export_to_sqlite(), (2) starts an MCP server with the correct per-task --project-path and --db-path, and (3) kills the server and deletes the temp file after all trials complete. Fixes the rrlmgraph_mcp strategy being silently dropped from all CI benchmark runs. (bench#30)
  • run_full_benchmark() workflow: bumped n_trials from 1 to 2 so the paired Wilcoxon test has 60 task-pairs instead of 30, making it possible to reach statistical significance (p < 0.05).
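The first mcp_read_response() bug is easy to reproduce in isolation: `[[` on an atomic vector errors on a missing name rather than returning NULL. A minimal sketch, with a toy vector standing in for the poll result:

```r
# processx poll results are keyed "output"/"error"; there is no "stdout" key.
ready <- c(output = "ready", error = "nopush")

ok <- ready[["output"]]                   # "ready" - the correct key
caught <- tryCatch(ready[["stdout"]],     # the buggy lookup
                   error = function(e) conditionMessage(e))
caught                                    # "subscript out of bounds"
```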

Improvements

  • mcp_start_server(): added db_path parameter. When supplied, --db-path <db_path> is passed to the Node.js process, allowing the caller to point the MCP server at an existing SQLite export rather than the default <project_path>/.rrlmgraph/graph.sqlite location.

rrlmgraphbench 0.1.1

Bug fixes

  • run_single(): LLM responses are now stripped of markdown code fences (```r ... ```) before scoring. Without this fix parse(text = response_code) failed on every GPT-4.1-mini response, causing syntax_valid = FALSE and runs_without_error = FALSE for all 146 non-NA benchmark rows; scores were consequently ~0.06 instead of the true ~0.3+ range. Adds a new internal helper strip_code_fences(). (bench#36)
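A minimal sketch of such a helper (the package's actual strip_code_fences() may differ):

```r
# Remove a leading ```lang fence and a trailing ``` fence, if present.
strip_code_fences <- function(x) {
  x <- sub("^\\s*```[A-Za-z]*\\s*\\n?", "", x)
  sub("\\n?```\\s*$", "", x)
}

strip_code_fences("```r\nsummary(mtcars)\n```")  # "summary(mtcars)"
strip_code_fences("1 + 1")                       # unfenced input is unchanged
```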

Improvements

  • run_full_benchmark(): new strategies parameter (character vector, defaults to five strategies: rrlmgraph_tfidf, full_files, term_overlap, bm25_retrieval, no_context). The previous hardcoded list of six non-Ollama strategies produced 180 LLM calls per benchmark run, exhausting the GitHub Models free-tier quota (~150 req/day) and leaving tasks 026-030 as NA in every CI run. The new default of five strategies yields exactly 150 calls (30 tasks × 5), staying within the free-tier limit. Callers can override strategies to run any subset, including "random_k" when a higher quota is available.

rrlmgraphbench 0.1.0

First release.

Bug fixes

  • run-benchmark.yml CI workflow: added models: read permission so that GITHUB_TOKEN can call the GitHub Models inference API. Without it every ellmer::chat_github() call returned an empty string and all scores were a degenerate 0.6 (#2).
  • Removed [skip ci] from the auto-commit message so that pkgdown rebuilds the benchmark report vignette after every results update.
  • LLM call failures in run_single() now emit a message() (was warning()) so they are visible in CI logs.

Original first-release notes

  • Task corpus: 15 coding tasks across three fixture projects (mini data-science script, medium Shiny app, small R package), covering function-modification, bug-diagnosis, new-feature, refactoring, and documentation categories.
  • Ground-truth solutions committed under inst/ground_truth/solutions/.
  • run_full_benchmark() — evaluate six retrieval strategies against every task.
  • compute_benchmark_statistics() — summary table, TER, pairwise Welch t-tests, Cohen’s d, Bonferroni correction, NDCG.
  • count_hallucinations() — detect invented functions, invalid arguments, and wrong-namespace references in LLM-generated R code.
  • Vignette: benchmark_report — reproducible benchmark report using precomputed results.
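The statistics named above can be reproduced on toy data; this sketch uses base R only and is not the package's implementation:

```r
set.seed(1)
a <- rnorm(30, mean = 0.70, sd = 0.10)  # e.g. scores for a graph-RAG strategy
b <- rnorm(30, mean = 0.60, sd = 0.10)  # e.g. scores for no_context

p <- t.test(a, b)$p.value               # Welch t-test (unequal variances) by default
p_bonf <- min(p * 3, 1)                 # Bonferroni across, say, 3 pairwise comparisons
cohens_d <- (mean(a) - mean(b)) / sqrt((var(a) + var(b)) / 2)  # pooled-SD effect size
```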