The goal of rrlmgraphbench is to provide an objective, reproducible benchmark that measures whether rrlmgraph context retrieval actually helps LLMs solve real R coding tasks, and by how much, compared to simpler baselines. It runs a standard suite of coding tasks against ten retrieval strategies, scores each LLM response, and reports the mean score, 95% confidence interval, token efficiency, and hallucination rate per strategy. Results are regenerated automatically four times daily (03:00, 09:00, 15:00, and 21:00 UTC) and committed back to the repository, so the GitHub Pages site always shows live data.

Retrieval strategies compared

Strategy                 Description
graph_rag_tfidf          Graph-RAG: graph traversal with TF-IDF embeddings
graph_rag_tfidf_noseed   Graph-RAG with TF-IDF, no seed node (query-only seeding)
graph_rag_ollama         Graph-RAG: graph traversal with Ollama embeddings
graph_rag_mcp            Graph-RAG via MCP server: graph traversal over JSON-RPC
full_files               Entire source files dumped verbatim (baseline)
term_overlap             Term-presence keyword retrieval (no graph)
bm25_retrieval           BM25 keyword retrieval (no graph)
graph_rag_agentic        Graph-RAG with LLM-driven tool-call traversal (agentic)
random_k                 Random subset of k nodes (chance baseline)
no_context               No context supplied

Installation

You can install the development version of rrlmgraphbench from GitHub with:

# install.packages("pak")
pak::pak("davidrsch/rrlmgraph-bench")

Example

Example 1: Run the full benchmark

run_full_benchmark() evaluates all ten strategies across every task in the built-in task suite using an LLM of your choice. The "github" provider uses GITHUB_PAT (automatically set in GitHub Actions) — no extra secret needed.

library(rrlmgraphbench)

# Full run: GitHub Models, 3 trials per task x strategy pair.
results <- run_full_benchmark(
  output_path  = "results/benchmark_results.rds",
  n_trials     = 3L,
  llm_provider = "github",
  llm_model    = "gpt-4.1-mini"
)
#> ── rrlmgraphbench ──────────────────────────────────────────────────────────────
#> ℹ Tasks   : 54  (30 × standard, 24 × hard)
#> ℹ Strategies: 10
#> ℹ Trials  : 3  per strategy × task
#> ───────────────────────────────────────────────────────────────────────────────
#> ✔ [1/1620]  graph_rag_tfidf  ×  task_001_fm_mini_ds  trial 1  (score 0.8, 1 234 tok, 2.3 s)
#> ✔ [2/1620]  graph_rag_tfidf  ×  task_001_fm_mini_ds  trial 2  (score 0.8, 1 198 tok, 2.1 s)
#> ...
#> ✔ [1620/1620] random_k  ×  task_015_doc_rpkg  trial 3  (score 0.4,  892 tok, 1.8 s)
#> ✔ Results saved to results/benchmark_results.rds  (1620 rows × 16 cols)

# Quick integration check without calling an LLM (returns dummy 0.5 scores).
dry <- run_full_benchmark(
  output_path = tempfile(fileext = ".rds"),
  .dry_run    = TRUE
)
nrow(dry)
#> [1] 1620

Example 2: Compute and inspect benchmark statistics

compute_benchmark_statistics() aggregates per-trial scores into a summary table with 95% confidence intervals, the Token Efficiency Ratio (TER), and pairwise Welch t-tests with Bonferroni correction and Cohen’s d effect sizes.

stats <- compute_benchmark_statistics(results)

# Per-strategy summary (ordered by mean score).
stats$summary[order(-stats$summary$mean_score),
              c("strategy", "n", "mean_score", "ci_lo_95", "ci_hi_95",
                "mean_total_tokens", "hallucination_rate")]
#>                strategy   n mean_score ci_lo_95 ci_hi_95 mean_total_tokens hallucination_rate
#>              full_files 162      0.869    0.833    0.905            18 274              0.072
#>        graph_rag_ollama 162      0.862    0.826    0.898             4 103              0.054
#>         graph_rag_tfidf 162      0.859    0.822    0.896             3 821              0.041
#>            term_overlap 162      0.855    0.819    0.891             2 934              0.058
#>          bm25_retrieval 162      0.850    0.813    0.887             6 440              0.078
#>           graph_rag_mcp 162      0.836    0.798    0.874             5 127              0.069
#>  graph_rag_tfidf_noseed 162      0.834    0.796    0.872             3 651              0.046
#>              no_context 162      0.821    0.782    0.860               187              0.098
#> ...
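The ci_lo_95 / ci_hi_95 columns can be reproduced by hand. A minimal sketch, assuming a standard t-based interval on a strategy's per-trial scores (the package's exact method may differ, e.g. it could bootstrap):

```r
# Hypothetical t-based 95% confidence interval on per-trial scores;
# the real compute_benchmark_statistics() may use another method.
ci95 <- function(scores) {
  m  <- mean(scores)
  se <- sd(scores) / sqrt(length(scores))
  q  <- qt(0.975, df = length(scores) - 1)
  c(mean_score = m, ci_lo_95 = m - q * se, ci_hi_95 = m + q * se)
}

set.seed(1)
round(ci95(rbeta(162, 8, 1.2)), 3)  # simulated scores, not real results
```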

# Token Efficiency Ratio: score per token relative to full_files.
# TER > 1 means a higher score per token than full_files.
stats$ter
#>  graph_rag_tfidf graph_rag_ollama   bm25_retrieval         random_k       no_context
#>            3.544            2.953            0.977            2.517               NA
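The exact TER formula is not shown here; a plausible sketch, assuming TER is score-per-token normalised by the full_files baseline:

```r
# Hypothetical Token Efficiency Ratio: score per token, relative to the
# full_files baseline. The package's actual formula may differ (e.g. it
# may aggregate per task before averaging).
ter <- function(mean_score, mean_tokens, baseline_score, baseline_tokens) {
  (mean_score / mean_tokens) / (baseline_score / baseline_tokens)
}

# full_files is its own baseline, so its TER is exactly 1:
ter(0.869, 18274, 0.869, 18274)
#> [1] 1
```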

For pairwise statistical significance:

pw <- stats$pairwise
pw[pw$p_bonferroni < 0.05,
   c("strategy_1", "strategy_2", "cohens_d", "p_bonferroni", "sig")]
#>         strategy_1    strategy_2 cohens_d p_bonferroni sig
#>  graph_rag_tfidf    no_context    1.821       <0.001  ***
#> graph_rag_ollama    no_context    1.643       <0.001  ***
#>  graph_rag_tfidf    full_files    0.423        0.031    *
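The pairwise machinery can be illustrated in base R. A sketch, not the package internals: Welch's t.test() plus a Bonferroni-scaled p-value and Cohen's d from the pooled SD:

```r
# Illustrative pairwise comparison of two strategies' per-trial scores
# (not the package's internal code).
compare_strategies <- function(a, b, n_comparisons = 1) {
  tt <- t.test(a, b)                        # Welch's t-test (var.equal = FALSE)
  sd_pooled <- sqrt((var(a) + var(b)) / 2)  # pooled SD for Cohen's d
  list(
    cohens_d     = (mean(a) - mean(b)) / sd_pooled,
    p_bonferroni = min(tt$p.value * n_comparisons, 1)
  )
}

set.seed(42)
with_ctx <- rnorm(162, mean = 0.86, sd = 0.15)  # simulated, not real data
no_ctx   <- rnorm(162, mean = 0.82, sd = 0.15)
compare_strategies(with_ctx, no_ctx, n_comparisons = 45)  # choose(10, 2) pairs
```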

Example 3: Detect hallucinations in LLM responses

count_hallucinations() inspects generated R code for invented function names, invalid argument names, and wrong-namespace calls. The benchmark calls this automatically; you can also use it to audit any LLM-generated snippet.

# Two invented function names called via real package namespaces.
code <- '
  df <- dplyr::filtrate(mtcars, cyl == 6)   # "filtrate" does not exist in dplyr
  result <- xgboost::xgb_train(df)          # "xgb_train" is not exported by xgboost
'

count_hallucinations(code)
#> [[1]]
#> [[1]]$type
#> [1] "wrong_namespace"
#> [[1]]$fn
#> [1] "dplyr::filtrate"
#> [[1]]$detail
#> [1] "'filtrate' is not exported by the 'dplyr' package"
#>
#> [[2]]
#> [[2]]$type
#> [1] "wrong_namespace"
#> [[2]]$fn
#> [1] "xgboost::xgb_train"
#> [[2]]$detail
#> [1] "'xgb_train' is not exported by the 'xgboost' package"

# Pass a graph to also trust project-internal functions.
g <- rrlmgraph::build_rrlm_graph("path/to/mypkg")
count_hallucinations(code, graph = g)
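A rough idea of how a wrong-namespace check can work, using base R's getNamespaceExports(); the real count_hallucinations() is likely more involved (it also parses the code and validates argument names):

```r
# Hypothetical sketch of one hallucination check: is fn actually
# exported by pkg? (Assumes pkg is installed locally.)
check_export <- function(pkg, fn) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(sprintf("package '%s' is not installed", pkg))
  }
  if (!fn %in% getNamespaceExports(pkg)) {
    return(sprintf("'%s' is not exported by the '%s' package", fn, pkg))
  }
  NULL  # looks legitimate
}

check_export("stats", "t.test")   # NULL: a real export
check_export("stats", "t_test")   # flags the invented name
```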

Task difficulty levels

Tasks are split into two difficulty tiers:

  • Standard (tasks 001–030, difficulty: "standard"): each task provides a seed_node that anchors graph traversal at the most relevant entry point. Scored via ast_diff: structural similarity to a reference solution patch.

  • Hard (tasks 031–054, difficulty: "hard"): seed_node is intentionally NULL. Graph-RAG strategies must rely on query-only traversal with no anchor, making retrieval harder by design. Scored via node_presence: whether the LLM’s response mentions the required function names.

This split is deliberate: standard tasks measure retrieval precision when an entry point is known; hard tasks measure how well graph-RAG recovers with only a natural-language query.
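For intuition, node_presence scoring could be as simple as the fraction of required names that appear in the response; a sketch under that assumption (the shipped scorer may weight or fuzzy-match names):

```r
# Hypothetical node_presence score: share of required function names
# mentioned anywhere in the LLM response.
node_presence <- function(response, required_fns) {
  hits <- vapply(required_fns,
                 function(fn) grepl(fn, response, fixed = TRUE),
                 logical(1))
  mean(hits)
}

node_presence(
  "Use dplyr::mutate() and then summarise() the result.",
  c("mutate", "summarise", "across")
)
#> [1] 0.6666667
```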

Learn more