# rrlmgraphbench (development version)
## Documentation

- `README.md`: updated “eight retrieval strategies” → “ten”; added `graph_rag_agentic` and `random_k` rows to the strategy comparison table; corrected the example console output to show `Strategies: 10`. Closes #78.
- `vignettes/benchmark_report.Rmd`: corrected the schedule description from “every Monday” to “four times daily (03:00, 09:00, 15:00, and 21:00 UTC)”, matching the actual `run-benchmark.yml` cron triggers. Closes #77.
## Bug fixes
- `run_full_benchmark()`: `graph_rag_agentic` and `random_k` are now included in the default `strategies` vector, so they run in every benchmark by default (#71).
- `score_response()`: `eval(parse(text = response_code))` now executes in a sandboxed environment where `system`, `system2`, `unlink`, `file.remove`, `shell`, `Sys.setenv`, `writeLines`, and `write` are replaced with no-ops, blocking adversarial shell escapes from LLM-generated code (#73).
- `mcp_build_ellmer_tools()` / `run_single()`: the agentic strategy now records every node name the LLM retrieves via `get_node_info`, `find_callers`, and `find_callees`; NDCG@5/10 is back-filled after `chat$chat()` returns instead of always being `NA` (#74).
- `bm25_retrieve()` / `term_overlap_retrieve()`: the token budget is now enforced with `.bench_estimate_tokens()` (1 token ≈ 3.5 characters, matching rrlmgraph’s `ceiling(nchar / 3.5)`) instead of the 4-characters-per-token heuristic. This eliminates the ~14 % systematic over-allocation that inflated these baselines relative to the graph-RAG strategies (#75).
- `run_single()`: per-turn token vectors are normalised to length 2 before `Reduce("+")`, preventing R’s recycling rule from corrupting the cumulative token count when a turn reports only one value (#76).
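The sandboxing idea behind the `score_response()` fix can be sketched as follows. This is illustrative only, not the package's actual implementation: untrusted code is evaluated in an environment where dangerous functions are shadowed by no-ops, so ordinary name lookup hits the harmless binding first.

```r
# Illustrative sketch (not the package's real sandbox): shadow dangerous
# functions with a no-op in a dedicated evaluation environment.
noop <- function(...) invisible(NULL)
blocked <- c("system", "system2", "unlink", "file.remove",
             "shell", "Sys.setenv", "writeLines", "write")
sandbox <- new.env(parent = globalenv())
for (fn in blocked) assign(fn, noop, envir = sandbox)

# A shell escape in generated code now resolves to the no-op and does nothing:
eval(parse(text = 'system("touch /tmp/pwned")'), envir = sandbox)
```

Note that shadowing alone does not block namespace-qualified calls such as `base::system()`, which is one reason sandboxing LLM output typically needs additional measures beyond this sketch.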
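The 3.5-characters-per-token heuristic used in the `bm25_retrieve()` / `term_overlap_retrieve()` fix is simple enough to sketch. `.bench_estimate_tokens()` itself is internal; this merely mirrors its stated formula:

```r
# Mirrors the stated heuristic: 1 token is roughly 3.5 characters.
estimate_tokens <- function(text) ceiling(nchar(text) / 3.5)

estimate_tokens("library(dplyr)")  # 14 characters -> 4 tokens
```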
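The recycling hazard fixed in `run_single()` is easy to demonstrate; `pad2()` below is a hypothetical stand-in for the normalisation step, not the package's code:

```r
# One turn reported both input and output tokens; the next reported only input.
turns <- list(c(input = 120, output = 45), c(input = 80))

# Naive summing recycles the length-1 vector: c(120, 45) + 80 == c(200, 125),
# silently double-counting the second turn's 80 tokens.

# Normalising every vector to length 2 first avoids the recycling:
pad2 <- function(v) { length(v) <- 2; v[is.na(v)] <- 0; v }
totals <- Reduce(`+`, lapply(turns, pad2))
totals  # input 200, output 45
```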
## CI / workflow

- `run-benchmark.yml`: the Ollama install step now pins `v0.18.2` and verifies the installer’s SHA256 before execution, replacing the unsafe `curl | sh` pattern that allowed arbitrary code execution from the Ollama CDN (#72).
# rrlmgraphbench 0.1.3
## Dependencies

- Requires `rrlmgraph >= 0.1.3` to pick up the `task_trace_weight` cold-start fix (#78): spoke nodes no longer score above `min_relevance` when the weight is uninitialized, preventing context inflation in the tfidf/MCP strategies.
## New strategies

- `graph_rag_agentic`: agentic strategy in which the LLM drives MCP tool calls via `find_callers`, `find_callees`, `list_functions`, and `search_nodes`; no fixed graph seed (#51).
- `graph_rag_tfidf_noseed`: query-only seeding: TF-IDF graph traversal using the query itself as the seed instead of a caller-specified node, for tasks where the seed node is unknown at runtime (#54).
# rrlmgraphbench (development version)
## CI / workflow

- `run-benchmark.yml` now runs daily (was weekly) with a version-bump gate:
  - A cheap `check-version` job fetches the remote rrlmgraph `DESCRIPTION` and compares its `Version:` field to `inst/last-benchmarked-rrlmgraph-version.txt`.
  - On scheduled runs, the heavy `benchmark` job (with its ~8 min of Node / Ollama / R setup) is skipped when the version has not changed.
  - Manual `workflow_dispatch` always bypasses the gate. A `force_run` boolean input is also provided for explicit overrides.
  - If the remote version cannot be fetched (e.g. a network failure), the gate defaults to running (fail-open) so a version bump is never silently missed.
- The version stamp file is written only after a successful benchmark run. A failed run does not advance the stamp, so the next scheduled run retries.
- Benchmark commit messages now include the rrlmgraph version for traceability (e.g. `chore: update benchmark results [rrlmgraph 0.1.2]`).
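The gate's comparison can be sketched in R. This is illustrative only: the real check runs in the workflow's `check-version` job, and in practice the `DESCRIPTION` is fetched from the rrlmgraph repository rather than built from a string as here:

```r
# Illustrative sketch of the version-bump gate. In CI the DESCRIPTION is
# fetched over the network; a fetch failure would leave remote_version NA.
desc <- read.dcf(textConnection("Package: rrlmgraph\nVersion: 0.1.3"))
remote_version <- unname(desc[1, "Version"])

stamp <- "0.1.2"  # contents of inst/last-benchmarked-rrlmgraph-version.txt

# Fail-open: run when the fetch failed OR the version changed.
should_run <- is.na(remote_version) || !identical(remote_version, stamp)
should_run  # TRUE: the version bumped, so the benchmark runs
```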
## Bug fixes

- `mcp_read_response()`: fixed two bugs that silently disabled the `rrlmgraph_mcp` strategy in every benchmark run:
  - `processx::process$poll_io()` returns the keys `"output"` and `"error"` on all platforms, but the code was testing `ready[["stdout"]]`, throwing `"subscript out of bounds"` in R and causing the outer `tryCatch` in `mcp_start_server()` to return `NULL`, which made `build_context()` return empty context for every MCP task.
  - The `initialize` JSON-RPC request sent `"capabilities": []` (an empty JSON array) instead of `"capabilities": {}` (an empty object). The Zod schema in the MCP SDK rejects arrays, so the server returned a `-32603` error. The same fix is applied to the `params` field of `notifications/initialized`.

  Together these two bugs caused `rrlmgraph_mcp` scores to equal `no_context` (mean ~0.689) in the n=2 benchmark results. (bench#30)
- `run_full_benchmark()`: the `rrlmgraph_mcp` strategy now starts a fresh MCP server per task rather than one global server per run. The previous design called `mcp_start_server()` with `projects_dir` (the parent of all task project directories) as `project_path`, but `better-sqlite3` could not create `graph.sqlite` there on read-only CI paths, causing a silent crash before the JSON-RPC initialize handshake completed. Each task now (1) exports its TF-IDF graph to a temporary SQLite file via `rrlmgraph::export_to_sqlite()`, (2) starts an MCP server with the correct per-task `--project-path` and `--db-path`, and (3) kills the server and deletes the temp file after all trials complete. Fixes the `rrlmgraph_mcp` strategy being silently dropped from all CI benchmark runs. (bench#30)
- `run_full_benchmark()` workflow: bumped `n_trials` from 1 to 2 so the paired Wilcoxon test has 60 task-pairs instead of 30, making it possible to reach statistical significance (p < 0.05).
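The array-versus-object distinction behind the `capabilities` fix is easy to reproduce from R. Using jsonlite here for illustration (an assumption; the package's actual serialisation path may differ): an empty unnamed list serialises to `[]`, while an empty *named* list serialises to `{}`:

```r
library(jsonlite)

# An empty unnamed list becomes a JSON array, which the MCP SDK's
# Zod schema rejects for "capabilities":
toJSON(list(capabilities = list()))
# {"capabilities":[]}

# An empty *named* list becomes a JSON object, which the
# initialize handshake expects:
toJSON(list(capabilities = setNames(list(), character(0))))
# {"capabilities":{}}
```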
## Improvements

- `mcp_start_server()`: added a `db_path` parameter. When supplied, `--db-path <db_path>` is passed to the Node.js process, allowing the caller to point the MCP server at an existing SQLite export rather than the default `<project_path>/.rrlmgraph/graph.sqlite` location.
# rrlmgraphbench 0.1.1
## Bug fixes

- `run_single()`: LLM responses are now stripped of markdown code fences (```` ```r ... ``` ````) before scoring. Without this fix, `parse(text = response_code)` failed on every GPT-4.1-mini response, causing `syntax_valid = FALSE` and `runs_without_error = FALSE` for all 146 non-`NA` benchmark rows; scores were consequently ~0.06 instead of the true ~0.3+ range. Adds a new internal helper, `strip_code_fences()`. (bench#36)
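A minimal sketch of what such a helper can look like (the package's actual `strip_code_fences()` may differ): drop a leading fence line such as ```` ```r ```` and a trailing ```` ``` ```` line, leaving the body for `parse()`:

```r
# Sketch of fence stripping: remove an opening ```<lang> line and a
# closing ``` line from a single response string.
strip_code_fences <- function(x) {
  x <- sub("^```[A-Za-z]*\\s*\\n", "", x)
  sub("\\n?```\\s*$", "", x)
}

strip_code_fences("```r\nmean(x)\n```")  # "mean(x)"
```

Responses without fences pass through unchanged, so the helper is safe to apply unconditionally before scoring.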
## Improvements

- `run_full_benchmark()`: new `strategies` parameter (character vector; defaults to five strategies: `rrlmgraph_tfidf`, `full_files`, `term_overlap`, `bm25_retrieval`, `no_context`). The previous hardcoded list of six non-Ollama strategies produced 180 LLM calls per benchmark run, exhausting the GitHub Models free-tier quota (~150 requests/day) and leaving tasks 026-030 as `NA` in every CI run. The new default of five strategies yields exactly 150 calls (30 tasks × 5), staying within the free-tier limit. Callers can override `strategies` to run any subset, including `"random_k"` when a higher quota is available.
# rrlmgraphbench 0.1.0

First release.
## Bug fixes

- `run-benchmark.yml` CI workflow: added the `models: read` permission so that `GITHUB_TOKEN` can call the GitHub Models inference API. Without it, every `ellmer::chat_github()` call returned an empty string and all scores were a degenerate 0.6 (#2).
- Removed `[skip ci]` from the auto-commit message so that pkgdown rebuilds the benchmark report vignette after every results update.
- LLM call failures in `run_single()` now emit a `message()` (was `warning()`) so they are visible in CI logs.
## Original first-release notes

- Task corpus: 15 coding tasks across three fixture projects (mini data-science script, medium Shiny app, small R package), covering function-modification, bug-diagnosis, new-feature, refactoring, and documentation categories.
- Ground-truth solutions committed under `inst/ground_truth/solutions/`.
- `run_full_benchmark()`: evaluate six retrieval strategies against every task.
- `compute_benchmark_statistics()`: summary table, TER, pairwise Welch t-tests, Cohen’s d, Bonferroni correction, NDCG.
- `count_hallucinations()`: detect invented functions, invalid arguments, and wrong-namespace references in LLM-generated R code.
- Vignette: `benchmark_report`, a reproducible benchmark report using precomputed results.