The goal of rrlmgraphbench is to provide an objective, reproducible benchmark that measures whether rrlmgraph context retrieval actually helps LLMs solve real R coding tasks — and by how much, compared to simpler baselines. It runs a standard set of coding tasks against ten retrieval strategies, scores each LLM response, and reports mean score, 95 % confidence intervals, token efficiency, and hallucination rates per strategy. Results are regenerated automatically four times daily (03:00, 09:00, 15:00, and 21:00 UTC) and committed back to the repository so the GitHub Pages site always shows live data.
## Retrieval strategies compared
| Strategy | Description |
|---|---|
| `graph_rag_tfidf` | Graph-RAG: graph traversal with TF-IDF embeddings |
| `graph_rag_tfidf_noseed` | Graph-RAG with TF-IDF, no seed node (query-only seeding) |
| `graph_rag_ollama` | Graph-RAG: graph traversal with Ollama embeddings |
| `graph_rag_mcp` | Graph-RAG via MCP server: graph traversal over JSON-RPC |
| `full_files` | Entire source files dumped verbatim (baseline) |
| `term_overlap` | Term-presence keyword retrieval (no graph) |
| `bm25_retrieval` | BM25 keyword retrieval (no graph) |
| `graph_rag_agentic` | Graph-RAG with LLM-driven tool-call traversal (agentic) |
| `random_k` | Random subset of k nodes (chance baseline) |
| `no_context` | No context supplied |
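To make the non-graph baselines concrete, here is a minimal BM25 scorer in base R. This is an illustrative sketch only, not the package's `bm25_retrieval` implementation; the corpus, function name, and parameter defaults are assumptions.

```r
# Minimal BM25 scorer over a toy corpus (illustrative sketch; NOT the
# package's bm25_retrieval implementation).
bm25_score <- function(query_terms, docs, k1 = 1.2, b = 0.75) {
  toks <- lapply(docs, function(d) strsplit(tolower(d), "\\W+")[[1]])
  n_docs <- length(docs)
  avg_len <- mean(lengths(toks))
  # Inverse document frequency for each query term.
  idf <- sapply(query_terms, function(t) {
    df <- sum(vapply(toks, function(d) t %in% d, logical(1)))
    log((n_docs - df + 0.5) / (df + 0.5) + 1)
  })
  # BM25 score of each document against the query.
  sapply(toks, function(d) {
    tf <- sapply(query_terms, function(t) sum(d == t))
    sum(idf * tf * (k1 + 1) /
          (tf + k1 * (1 - b + b * length(d) / avg_len)))
  })
}

docs <- c(
  "filter rows of a data frame with dplyr",
  "fit a linear model with lm",
  "filter and summarise grouped data"
)
scores <- bm25_score(c("filter", "data"), docs)
which.max(scores)  # the short third document matches both terms and ranks first
```

Length normalisation (the `b` term) is why the third document beats the first even though both contain both query terms.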
## Installation
You can install the development version of rrlmgraphbench from GitHub with:
```r
# install.packages("pak")
pak::pak("davidrsch/rrlmgraph-bench")
```

## Example
### Example 1: Run the full benchmark
`run_full_benchmark()` evaluates all ten strategies across every task in the built-in task suite using an LLM of your choice. The `"github"` provider uses `GITHUB_PAT` (automatically set in GitHub Actions), so no extra secret is needed.
```r
library(rrlmgraphbench)

# Full run: GitHub Models, 3 trials per task x strategy pair.
results <- run_full_benchmark(
  output_path = "results/benchmark_results.rds",
  n_trials = 3L,
  llm_provider = "github",
  llm_model = "gpt-4.1-mini"
)
#> ── rrlmgraphbench ──────────────────────────────────────────────────────────────
#> ℹ Tasks : 54 (30 × standard, 24 × hard)
#> ℹ Strategies: 10
#> ℹ Trials : 3 per strategy × task
#> ───────────────────────────────────────────────────────────────────────────────
#> ✔ [1/90] graph_rag_tfidf × task_001_fm_mini_ds trial 1 (score 0.8, 1 234 tok, 2.3 s)
#> ✔ [2/90] graph_rag_tfidf × task_001_fm_mini_ds trial 2 (score 0.8, 1 198 tok, 2.1 s)
#> ...
#> ✔ [90/90] random_k × task_015_doc_rpkg trial 3 (score 0.4, 892 tok, 1.8 s)
#> ✔ Results saved to results/benchmark_results.rds (90 rows × 16 cols)
```
```r
# Quick integration check without calling an LLM (returns dummy 0.5 scores).
dry <- run_full_benchmark(
  output_path = tempfile(fileext = ".rds"),
  .dry_run = TRUE
)
nrow(dry)
#> [1] 90
```

### Example 2: Compute and inspect benchmark statistics
`compute_benchmark_statistics()` aggregates per-trial scores into a summary table with 95% confidence intervals, Token Efficiency Ratio (TER), and pairwise Welch t-tests with Bonferroni correction and Cohen's d.
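These statistics reduce to standard textbook formulas. The sketch below shows them hand-rolled in base R on toy data; it is illustrative only, not the package's internal code, and the two score vectors are invented.

```r
# Toy per-trial scores for two hypothetical strategies.
a <- c(0.9, 0.8, 0.85, 0.95, 0.7)   # e.g. a graph-RAG strategy
b <- c(0.5, 0.6, 0.4, 0.55, 0.45)   # e.g. no_context

# 95% confidence interval for the mean of `a`, via the t distribution.
se <- sd(a) / sqrt(length(a))
ci <- mean(a) + c(-1, 1) * qt(0.975, df = length(a) - 1) * se

# Welch two-sample t-test (unequal variances) between the strategies.
welch <- t.test(a, b)

# Cohen's d with a pooled standard deviation.
sp <- sqrt(((length(a) - 1) * var(a) + (length(b) - 1) * var(b)) /
             (length(a) + length(b) - 2))
d <- (mean(a) - mean(b)) / sp

# Bonferroni correction: multiply p by the number of pairwise comparisons.
m <- choose(10, 2)                  # 45 strategy pairs for 10 strategies
p_bonf <- min(1, welch$p.value * m)
```

The package computes these for every strategy pair automatically: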
```r
stats <- compute_benchmark_statistics(results)

# Per-strategy summary (ordered by mean score).
stats$summary[order(-stats$summary$mean_score),
              c("strategy", "n", "mean_score", "ci_lo_95", "ci_hi_95",
                "mean_total_tokens", "hallucination_rate")]
#> strategy n mean_score ci_lo_95 ci_hi_95 mean_total_tokens hallucination_rate
#> full_files 270 0.869 0.833 0.905 18 274 0.072
#> graph_rag_ollama 270 0.862 0.826 0.898 4 103 0.054
#> graph_rag_tfidf 270 0.859 0.822 0.896 3 821 0.041
#> term_overlap 270 0.855 0.819 0.891 2 934 0.058
#> bm25_retrieval 270 0.850 0.813 0.887 6 440 0.078
#> graph_rag_mcp 270 0.836 0.798 0.874 5 127 0.069
#> graph_rag_tfidf_noseed 270 0.834 0.796 0.872 3 651 0.046
#> no_context 270 0.821 0.782 0.860 187 0.098

# Token Efficiency Ratio: score per token relative to full_files.
# TER > 1 means better score at lower token cost.
stats$ter
#> graph_rag_tfidf graph_rag_ollama bm25_retrieval random_k no_context
#> 3.544 2.953 0.977 2.517 NA
```

For pairwise statistical significance:
```r
pw <- stats$pairwise
pw[pw$p_bonferroni < 0.05,
   c("strategy_1", "strategy_2", "cohens_d", "p_bonferroni", "sig")]
#> strategy_1 strategy_2 cohens_d p_bonferroni sig
#> graph_rag_tfidf no_context 1.821 <0.001 ***
#> graph_rag_ollama no_context 1.643 <0.001 ***
#> graph_rag_tfidf full_files 0.423 0.031 *
```

### Example 3: Detect hallucinations in LLM responses
`count_hallucinations()` inspects generated R code for invented function names, invalid argument names, and wrong-namespace calls. The benchmark calls this automatically; you can also use it to audit any LLM-generated snippet.
```r
# Invented function and wrong package namespace in the same snippet.
code <- '
df <- dplyr::filtrate(mtcars, cyl == 6)   # "filtrate" does not exist in dplyr
result <- xgboost::xgb_train(df)          # "xgb_train" is not exported by xgboost
'
count_hallucinations(code)
#> [[1]]
#> [[1]]$type
#> [1] "wrong_namespace"
#> [[1]]$fn
#> [1] "dplyr::filtrate"
#> [[1]]$detail
#> [1] "'filtrate' is not exported by the 'dplyr' package"
#>
#> [[2]]
#> [[2]]$type
#> [1] "wrong_namespace"
#> [[2]]$fn
#> [1] "xgboost::xgb_train"
#> [[2]]$detail
#> [1] "'xgb_train' is not exported by the 'xgboost' package"
```
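A wrong-namespace check of this kind can be approximated by comparing the called name against the package's export list. This is a sketch of the idea, not the package's actual detector, and it assumes the queried package's namespace is available:

```r
# Flag pkg::fn calls whose `fn` is not actually exported by `pkg`.
# Sketch only: assumes the package namespace can be loaded.
is_wrong_namespace <- function(pkg, fn) {
  !fn %in% getNamespaceExports(pkg)
}

is_wrong_namespace("stats", "lm_fit")   # TRUE: "lm_fit" is invented
is_wrong_namespace("stats", "lm")       # FALSE: "lm" is a real export
```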
```r
# Pass a graph to also trust project-internal functions.
g <- rrlmgraph::build_rrlm_graph("path/to/mypkg")
count_hallucinations(code, graph = g)
```

## Task difficulty levels
Tasks are split into two difficulty tiers:

- **Standard** (tasks 001–030, `difficulty: "standard"`): each task provides a `seed_node` that anchors graph traversal to the most relevant entry point. Scored via `ast_diff`: structural similarity to a reference solution patch.
- **Hard** (tasks 031–054, `difficulty: "hard"`): `seed_node` is intentionally `null`. Graph-RAG strategies must rely on query-only traversal with no anchor, making retrieval harder by design. Scored via `node_presence`: whether the LLM's response mentions the required function names.
This split is deliberate: standard tasks measure retrieval precision when an entry point is known; hard tasks measure how well graph-RAG recovers with only a natural-language query.
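The `node_presence` idea can be pictured as the fraction of required function names that appear in the response. The sketch below is an assumed simplification; the package's actual scorer (and the example response and node names) may differ.

```r
# Fraction of required node names mentioned in an LLM response.
# Illustrative sketch of node_presence-style scoring, not the package's code.
node_presence <- function(response, required) {
  mean(vapply(required, function(fn) grepl(fn, response, fixed = TRUE),
              logical(1)))
}

resp <- "Use dplyr::filter() then summarise() on the grouped data."
node_presence(resp, c("filter", "summarise", "mutate"))
#> [1] 0.6666667
```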
## Learn more
- Benchmark Report — live results from the latest automated run
- Reference — full function documentation
- rrlmgraph — the package being benchmarked