
Evaluates retrieval strategies across every task in tasks_dir using n_trials independent trials each, and persists the combined results to output_path. By default all ten strategies are run.

Usage

run_full_benchmark(
  tasks_dir = system.file("tasks", package = "rrlmgraphbench"),
  projects_dir = system.file("projects", package = "rrlmgraphbench"),
  output_path,
  n_trials = 3L,
  llm_provider = c("github", "openai", "anthropic", "ollama"),
  llm_model = NULL,
  seed = 42L,
  rate_limit_delay = 6,
  strategies = c("graph_rag_tfidf", "graph_rag_tfidf_noseed", "graph_rag_ollama",
    "full_files", "term_overlap", "bm25_retrieval", "no_context", "graph_rag_mcp",
    "graph_rag_agentic", "random_k"),
  resume = FALSE,
  mcp_server_dir = NULL,
  max_new_tasks = NULL,
  .dry_run = FALSE
)

Arguments

tasks_dir

Path to the directory containing task JSON files (default: system.file("tasks", package = "rrlmgraphbench")).

projects_dir

Path to the directory containing benchmark project source trees (default: system.file("projects", package = "rrlmgraphbench")).

output_path

File path where the resulting data.frame is saved as an RDS file. Parent directories are created if needed.

n_trials

Integer(1). Number of independent trials per (task, strategy) pair. Defaults to 3L.

llm_provider

Character(1). LLM provider passed to ellmer. One of "github" (default), "openai", "anthropic", "ollama".

llm_model

Character(1) or NULL. Model name. When NULL a sensible per-provider default is used: "gpt-4o-mini" for "github" and "openai", "claude-3-5-haiku-latest" for "anthropic", "llama3.2" for "ollama".

seed

Integer(1). Random seed passed to base::set.seed() before any stochastic operations. Defaults to 42L.

rate_limit_delay

Numeric(1). Seconds to wait between LLM API calls to avoid rate-limit errors. Defaults to 6.

strategies

Character vector. Subset of strategies to run. Defaults to all ten strategies. Useful for reducing the total number of LLM API calls when the provider enforces a daily request quota (e.g. GitHub Models free tier allows ~210 requests/day; with 54 tasks and all 10 strategies that is 540 calls per trial). Ollama and MCP/agentic strategies are silently skipped when their prerequisites are unavailable.

resume

Logical(1). When TRUE, check for an existing partial checkpoint file (output_path with _partial suffix) and skip any (task, strategy, trial) combinations already recorded there. Useful when a previous run was interrupted by a daily rate-limit quota wall. Defaults to FALSE.

mcp_server_dir

Character(1) or NULL. Path to the rrlmgraph-mcp package directory containing a built dist/index.js. When NULL (default), the environment variable RRLMGRAPH_MCP_DIR is consulted. Required when "graph_rag_mcp" is included in strategies; the strategy is silently dropped (with a warning) if no path is found or Node.js is not installed.

max_new_tasks

Integer(1) or NULL. Maximum number of new tasks (tasks that have at least one unseen (strategy, trial) combination) to process in this run. When NULL (default) all tasks are processed. Useful when the available API quota is known in advance: set max_new_tasks = floor(remaining_requests / n_strategies) and combine with resume = TRUE so tomorrow's run continues where today's left off.

.dry_run

Logical(1). When TRUE the LLM is not called; dummy scores of 0.5 are returned. Useful for integration tests.
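
The quota-related arguments above (strategies, resume, max_new_tasks) are designed to be combined. A minimal sketch of a quota-limited, resumable run, using the formula documented under max_new_tasks (the quota figure and strategy subset are illustrative, not recommendations):

```r
# Fit a run under a daily quota of ~210 requests (illustrative figure).
# Run only three cheap baselines and derive max_new_tasks from the
# documented formula floor(remaining_requests / n_strategies).
strategies <- c("no_context", "term_overlap", "bm25_retrieval")
remaining_requests <- 210L
max_new_tasks <- floor(remaining_requests / length(strategies))

results <- run_full_benchmark(
  output_path   = "inst/results/benchmark_results.rds",
  strategies    = strategies,
  resume        = TRUE,          # skip combinations already in the _partial checkpoint
  max_new_tasks = max_new_tasks
)
```

Re-running the same call on subsequent days continues where the previous run left off, since resume = TRUE consults the checkpoint file.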

Value

A data.frame (saved to output_path and also returned invisibly) with one row per trial, containing columns:

task_id

Character.

strategy

Character.

trial

Integer.

score

Numeric in [0, 1].

context_tokens

Integer. API-reported input token count when available; falls back to tokenizers::count_words() or nchar/4.

response_tokens

Integer. API-reported output token count; same fallback chain as context_tokens.

total_tokens

Integer.

latency_sec

Numeric.

hallucination_count

Integer.

hallucination_details

List column (character vectors).

syntax_valid

Logical.

runs_without_error

Logical.

graph_retrieved_n

Integer. Number of graph nodes retrieved by graph-RAG strategies; 0L for non-graph strategies (bm25_retrieval, full_files, term_overlap, no_context).

ndcg5

Numeric. NDCG@5 against ground_truth_nodes for graph-RAG strategies; NA_real_ for non-graph strategies.
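
Once the returned data.frame is available, per-strategy summaries follow the usual base-R patterns. A minimal sketch using only the columns documented above (the output path is the one from the Examples section):

```r
# Mean score, latency, and token usage per strategy.
results <- readRDS("inst/results/benchmark_results.rds")
summary_df <- aggregate(
  cbind(score, latency_sec, total_tokens) ~ strategy,
  data = results,
  FUN  = mean
)
# Rank strategies by mean score, best first.
summary_df[order(-summary_df$score), ]
```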

Details

Strategies (all supported values for the strategies argument)

Label                    Description
graph_rag_tfidf          Graph-RAG: graph traversal with TF-IDF node embeddings
graph_rag_tfidf_noseed   Graph-RAG with TF-IDF embeddings but no seed node (query-only seeding)
graph_rag_ollama         Graph-RAG: graph traversal with Ollama-backed embeddings
graph_rag_mcp            Graph-RAG via MCP server: graph traversal via stdio JSON-RPC
graph_rag_agentic        Agentic graph navigation via MCP tools (no fixed seed)
full_files               Dump every source file in full (baseline)
term_overlap             Simple term-presence keyword retrieval (no graph)
bm25_retrieval           True BM25 retrieval (IDF-weighted, length-normalised)
no_context               No context provided to the LLM
random_k                 k randomly sampled code chunks

LLM calls are issued sequentially via ellmer. A progress message is emitted after each (task, strategy) combination, together with a rolling time estimate.

Authentication

"github" (default)

Uses GITHUB_PAT / GITHUB_TOKEN. In GitHub Actions this is set automatically as secrets.GITHUB_TOKEN – no extra secret needed.

"openai"

Requires OPENAI_API_KEY.

"anthropic"

Requires ANTHROPIC_API_KEY.

"ollama"

No key needed (local daemon).
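
For interactive use outside CI, the provider keys above can be set per session before calling the benchmark. A hedged sketch (the key strings are placeholders, not real credentials):

```r
# Set the key matching the chosen llm_provider; values here are placeholders.
Sys.setenv(ANTHROPIC_API_KEY = "my-anthropic-key")   # for llm_provider = "anthropic"
# Sys.setenv(OPENAI_API_KEY  = "my-openai-key")      # for llm_provider = "openai"
# No key is needed for llm_provider = "ollama" (local daemon).
```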

Examples

if (FALSE) { # \dontrun{
# Uses GitHub Models (GITHUB_TOKEN auto-set in Actions -- no secret needed)
results <- run_full_benchmark(
  output_path = "inst/results/benchmark_results.rds",
  n_trials    = 3L
)
head(results)
} # }