
Evaluates retrieval strategies across every task in tasks_dir using n_trials independent trials each, and persists the combined results to output_path. By default all ten strategies are run.

Usage

run_full_benchmark(
  tasks_dir = system.file("tasks", package = "rrlmgraphbench"),
  projects_dir = system.file("projects", package = "rrlmgraphbench"),
  output_path,
  n_trials = 3L,
  llm_provider = c("github", "openai", "anthropic", "ollama"),
  llm_model = NULL,
  seed = 42L,
  rate_limit_delay = 6,
  strategies = c("graph_rag_tfidf", "graph_rag_tfidf_noseed", "graph_rag_ollama",
    "full_files", "term_overlap", "bm25_retrieval", "no_context", "graph_rag_mcp",
    "graph_rag_agentic", "random_k"),
  resume = FALSE,
  mcp_server_dir = NULL,
  max_new_tasks = NULL,
  .dry_run = FALSE
)

Arguments

tasks_dir

Path to the directory containing task JSON files (default: system.file("tasks", package = "rrlmgraphbench")).

projects_dir

Path to the directory containing benchmark project source trees (default: system.file("projects", package = "rrlmgraphbench")).

output_path

File path where the resulting data.frame is saved as an RDS file. Parent directories are created if needed.

n_trials

Integer(1). Number of independent trials per (task, strategy) pair. Defaults to 3L.

llm_provider

Character(1). LLM provider passed to ellmer. One of "github" (default), "openai", "anthropic", "ollama".

llm_model

Character(1) or NULL. Model name. When NULL a sensible per-provider default is used: "gpt-4o-mini" for "github" and "openai", "claude-3-5-haiku-latest" for "anthropic", "llama3.2" for "ollama".

seed

Integer(1). Random seed passed to base::set.seed() before any stochastic operations. Defaults to 42L.

rate_limit_delay

Numeric(1). Seconds to wait between LLM API calls to avoid rate-limit errors. Defaults to 6.

strategies

Character vector. Subset of strategies to run. Defaults to all ten strategies. Useful for reducing the total number of LLM API calls when the provider enforces a daily request quota (e.g. GitHub Models free tier allows ~210 requests/day; with 54 tasks and all 10 strategies that is 540 calls per trial). Ollama and MCP/agentic strategies are silently skipped when their prerequisites are unavailable.

resume

Logical(1). When TRUE, check for an existing partial checkpoint file (output_path with _partial suffix) and skip any (task, strategy, trial) combinations already recorded there. Useful when a previous run was interrupted by a daily rate-limit quota wall. Defaults to FALSE.

mcp_server_dir

Character(1) or NULL. Path to the rrlmgraph-mcp package directory containing a built dist/index.js. When NULL (default), the environment variable RRLMGRAPH_MCP_DIR is consulted. Required when "graph_rag_mcp" is included in strategies; the strategy is silently dropped (with a warning) if no path is found or Node.js is not installed.

max_new_tasks

Integer(1) or NULL. Maximum number of new tasks (tasks that have at least one unseen (strategy, trial) combination) to process in this run. When NULL (default) all tasks are processed. Useful when the available API quota is known in advance: set max_new_tasks = floor(remaining_requests / n_strategies) and combine with resume = TRUE so tomorrow's run continues where today's left off.

.dry_run

Logical(1). When TRUE the LLM is not called; dummy scores of 0.5 are returned. Useful for integration tests.
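
The quota-related arguments above (strategies, resume, max_new_tasks) are designed to be combined. A minimal sketch of a quota-limited, resumable run, using the formula documented under max_new_tasks (the quota figure and strategy subset are illustrative, not recommendations):

```r
# Fit a run under a daily quota of ~210 requests (illustrative figure).
# Run only three cheap baselines and derive max_new_tasks from the
# documented formula floor(remaining_requests / n_strategies).
strategies <- c("no_context", "term_overlap", "bm25_retrieval")
remaining_requests <- 210L
max_new_tasks <- floor(remaining_requests / length(strategies))

results <- run_full_benchmark(
  output_path   = "inst/results/benchmark_results.rds",
  strategies    = strategies,
  resume        = TRUE,          # skip combinations already in the _partial checkpoint
  max_new_tasks = max_new_tasks
)
```

Re-running the same call on subsequent days continues where the previous run left off, since resume = TRUE consults the checkpoint file.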

Value

A data.frame (saved to output_path and also returned invisibly) with one row per trial, containing columns:

task_id

Character.

strategy

Character.

trial

Integer.

score

Numeric in [0, 1].

context_tokens

Integer. API-reported input token count when available; falls back to tokenizers::count_words() or nchar/4.

response_tokens

Integer. API-reported output token count; same fallback chain as context_tokens.

total_tokens

Integer.

latency_sec

Numeric.

hallucination_count

Integer.

hallucination_details

List column (character vectors).

syntax_valid

Logical.

runs_without_error

Logical.

graph_retrieved_n

Integer. Number of graph nodes retrieved by graph-RAG strategies; 0L for non-graph strategies (bm25_retrieval, full_files, term_overlap, no_context).

ndcg5

Numeric. NDCG@5 against ground_truth_nodes for graph-RAG strategies; NA_real_ for non-graph strategies.
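
Once the returned data.frame is available, per-strategy summaries follow the usual base-R patterns. A minimal sketch using only the columns documented above (the output path is the one from the Examples section):

```r
# Mean score, latency, and token usage per strategy.
results <- readRDS("inst/results/benchmark_results.rds")
summary_df <- aggregate(
  cbind(score, latency_sec, total_tokens) ~ strategy,
  data = results,
  FUN  = mean
)
# Rank strategies by mean score, best first.
summary_df[order(-summary_df$score), ]
```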

Details

Strategies (all supported values for the strategies argument)

Label                    Description
graph_rag_tfidf          Graph-RAG: graph traversal with TF-IDF node embeddings
graph_rag_tfidf_noseed   Graph-RAG with TF-IDF embeddings but no seed node (query-only seeding)
graph_rag_ollama         Graph-RAG: graph traversal with Ollama-backed embeddings
graph_rag_mcp            Graph-RAG via MCP server: graph traversal via stdio JSON-RPC
graph_rag_agentic        Agentic graph navigation via MCP tools (no fixed seed)
full_files               Dump every source file in full (baseline)
term_overlap             Simple term-presence keyword retrieval (no graph)
bm25_retrieval           True BM25 retrieval (IDF-weighted, length-normalised)
no_context               No context provided to the LLM
random_k                 k randomly sampled code chunks

LLM calls are issued sequentially via ellmer. A progress message is emitted after each (task, strategy) combination, together with a rolling time estimate.

Authentication

"github" (default)

Uses GITHUB_PAT / GITHUB_TOKEN. In GitHub Actions this is set automatically as secrets.GITHUB_TOKEN – no extra secret needed.

"openai"

Requires OPENAI_API_KEY.

"anthropic"

Requires ANTHROPIC_API_KEY.

"ollama"

No key needed (local daemon).
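
For interactive use outside CI, the provider keys above can be set per session before calling the benchmark. A hedged sketch (the key strings are placeholders, not real credentials):

```r
# Set the key matching the chosen llm_provider; values here are placeholders.
Sys.setenv(ANTHROPIC_API_KEY = "my-anthropic-key")   # for llm_provider = "anthropic"
# Sys.setenv(OPENAI_API_KEY  = "my-openai-key")      # for llm_provider = "openai"
# No key is needed for llm_provider = "ollama" (local daemon).
```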

Examples

if (FALSE) { # \dontrun{
# Uses GitHub Models (GITHUB_TOKEN auto-set in Actions -- no secret needed)
results <- run_full_benchmark(
  output_path = "inst/results/benchmark_results.rds",
  n_trials    = 3L
)
head(results)
} # }