Evaluates retrieval strategies across every task in tasks_dir
using n_trials independent trials each, and persists the combined
results to output_path. By default all ten strategies are run.
Usage
run_full_benchmark(
  tasks_dir = system.file("tasks", package = "rrlmgraphbench"),
  projects_dir = system.file("projects", package = "rrlmgraphbench"),
  output_path,
  n_trials = 3L,
  llm_provider = c("github", "openai", "anthropic", "ollama"),
  llm_model = NULL,
  seed = 42L,
  rate_limit_delay = 6,
  strategies = c("graph_rag_tfidf", "graph_rag_tfidf_noseed", "graph_rag_ollama",
    "full_files", "term_overlap", "bm25_retrieval", "no_context", "graph_rag_mcp",
    "graph_rag_agentic", "random_k"),
  resume = FALSE,
  mcp_server_dir = NULL,
  max_new_tasks = NULL,
  .dry_run = FALSE
)
Arguments
- tasks_dir
Path to the directory containing task JSON files (default: system.file("tasks", package = "rrlmgraphbench")).
- projects_dir
Path to the directory containing benchmark project source trees (default: system.file("projects", package = "rrlmgraphbench")).
- output_path
File path where the resulting data.frame is saved as an RDS file. Parent directories are created if needed.
- n_trials
Integer(1). Number of independent trials per task x strategy pair. Defaults to 3L.
- llm_provider
Character(1). LLM provider passed to ellmer. One of "github" (default), "openai", "anthropic", "ollama".
- llm_model
Character(1) or NULL. Model name. When NULL, a sensible per-provider default is used: "gpt-4o-mini" for "github" and "openai", "claude-3-5-haiku-latest" for "anthropic", "llama3.2" for "ollama".
- seed
Integer(1). Random seed passed to base::set.seed() before any stochastic operations. Defaults to 42L.
- rate_limit_delay
Numeric(1). Seconds to wait between LLM API calls to avoid rate-limit errors. Defaults to 6.
- strategies
Character vector. Subset of strategies to run. Defaults to all ten strategies. Useful for reducing the total number of LLM API calls when the provider enforces a daily request quota (e.g. the GitHub Models free tier allows ~210 requests/day, whereas 54 tasks with all 10 strategies amount to 540 calls per trial). Ollama and MCP/agentic strategies are skipped when their prerequisites are unavailable.
- resume
Logical(1). When TRUE, check for an existing partial checkpoint file (output_path with a _partial suffix) and skip any (task, strategy, trial) combinations already recorded there. Useful when a previous run was interrupted by a daily rate-limit quota. Defaults to FALSE.
- mcp_server_dir
Character(1) or NULL. Path to the rrlmgraph-mcp package directory containing a built dist/index.js. When NULL (default), the environment variable RRLMGRAPH_MCP_DIR is consulted. Required when "graph_rag_mcp" is included in strategies; the strategy is dropped with a warning if no path is found or Node.js is not installed.
- max_new_tasks
Integer(1) or NULL. Maximum number of new tasks (tasks that have at least one unseen (strategy, trial) combination) to process in this run. When NULL (default), all tasks are processed. Useful when the available API quota is known in advance: set max_new_tasks = floor(remaining_requests / n_strategies) and combine with resume = TRUE so that tomorrow's run continues where today's left off.
- .dry_run
Logical(1). When TRUE, the LLM is not called; dummy scores of 0.5 are returned. Useful for integration tests.
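The quota arithmetic described for max_new_tasks can be sketched as a small helper. This is only an illustration: plan_max_new_tasks() is a hypothetical function, not part of rrlmgraphbench, and it assumes one LLM request per (task, strategy, trial) combination, which this page does not state explicitly. With n_trials = 1 it reduces to the documented rule of thumb floor(remaining_requests / n_strategies).

```r
# Hypothetical planning helper (not part of rrlmgraphbench).
# Assumption: each (task, strategy, trial) combination costs one LLM request.
plan_max_new_tasks <- function(remaining_requests, n_strategies, n_trials = 1L) {
  stopifnot(remaining_requests >= 0, n_strategies >= 1, n_trials >= 1)
  as.integer(floor(remaining_requests / (n_strategies * n_trials)))
}

# Documented rule of thumb: ~210 requests/day, all ten strategies
plan_max_new_tasks(210, 10)                 # 21
# Same quota, but accounting for the default n_trials = 3L
plan_max_new_tasks(210, 10, n_trials = 3L)  # 7

if (FALSE) {  # not run: needs rrlmgraphbench and API access
  run_full_benchmark(
    output_path = "inst/results/benchmark_results.rds",
    resume = TRUE,
    max_new_tasks = plan_max_new_tasks(210, 10, n_trials = 3L)
  )
}
```

Repeating the same call on the following day with resume = TRUE should then pick up the next batch of unseen tasks.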
Value
A data.frame (saved to output_path and also returned
invisibly) with one row per trial, containing columns:
- task_id
Character.
- strategy
Character.
- trial
Integer.
- score
Numeric in [0, 1].
- context_tokens
Integer. API-reported input token count when available; falls back to tokenizers::count_words() or nchar / 4.
- response_tokens
Integer. API-reported output token count; same fallback chain as context_tokens.
- total_tokens
Integer.
- latency_sec
Numeric.
- hallucination_count
Integer.
- hallucination_details
List column (character vectors).
- syntax_valid
Logical.
- runs_without_error
Logical.
- graph_retrieved_n
Integer. Number of graph nodes retrieved by graph-RAG strategies; 0L for non-graph strategies (bm25_retrieval, full_files, term_overlap, no_context).
- ndcg5
Numeric. NDCG@5 against ground_truth_nodes for graph-RAG strategies; NA_real_ for non-graph strategies.
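To make the ndcg5 column more concrete, the following is a minimal sketch of NDCG@k with binary relevance and a log2 position discount. It is an assumption about the general technique, not the package's actual implementation: ndcg_at_k() and the node names are hypothetical, and the real scoring may use graded relevance or a different normalisation.

```r
# Hypothetical helper (not part of rrlmgraphbench): binary-relevance NDCG@k.
# retrieved:    character vector of node names in ranked order
# ground_truth: character vector of relevant node names
ndcg_at_k <- function(retrieved, ground_truth, k = 5L) {
  if (length(ground_truth) == 0L) return(NA_real_)  # undefined without ground truth
  top <- head(unique(retrieved), k)
  if (length(top) == 0L) return(0)
  gains <- as.numeric(top %in% ground_truth)        # 1 if relevant, else 0
  dcg <- sum(gains / log2(seq_along(gains) + 1))    # discounted cumulative gain
  n_ideal <- min(k, length(unique(ground_truth)))
  idcg <- sum(1 / log2(seq_len(n_ideal) + 1))       # best achievable DCG
  dcg / idcg
}

# Perfect ranking: both relevant nodes appear first
ndcg_at_k(c("pkg::load_data", "pkg::clean_data"),
          c("pkg::load_data", "pkg::clean_data"))
# The single relevant node appears at rank 2
ndcg_at_k(c("pkg::plot_it", "pkg::load_data"), "pkg::load_data")
```

A score of 1 means every relevant node was ranked as high as possible within the top k; lower values reflect relevant nodes that are missing or ranked late.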
Details
Strategies (all supported values for the strategies argument)
| Label | Description |
|---|---|
| graph_rag_tfidf | Graph-RAG: graph traversal with TF-IDF node embeddings |
| graph_rag_tfidf_noseed | Graph-RAG with TF-IDF embeddings but no seed node (query-only seeding) |
| graph_rag_ollama | Graph-RAG: graph traversal with Ollama-backed embeddings |
| graph_rag_mcp | Graph-RAG via MCP server: graph traversal via stdio JSON-RPC |
| graph_rag_agentic | Agentic graph navigation via MCP tools (no fixed seed) |
| full_files | Dump every source file in full (baseline) |
| term_overlap | Simple term-presence keyword retrieval (no graph) |
| bm25_retrieval | True BM25 retrieval: IDF-weighted, length-normalised |
| no_context | No context provided to the LLM |
| random_k | k randomly sampled code chunks |
LLM calls are issued sequentially via ellmer. A progress message is emitted after each task x strategy combination together with a rolling time estimate.
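Because calls are sequential and separated by rate_limit_delay, the pauses alone give a rough lower bound on run time. The sketch below is illustrative only: estimate_min_runtime() is a hypothetical helper, it assumes one LLM call per (task, strategy, trial) combination, and it ignores model latency, retries, and any extra calls an agentic strategy might make. The figure of 54 tasks is taken from the example in the strategies argument.

```r
# Hypothetical helper (not part of rrlmgraphbench).
# Assumption: one LLM call per (task, strategy, trial), each followed by
# a pause of rate_limit_delay seconds.
estimate_min_runtime <- function(n_tasks, n_strategies,
                                 n_trials = 3L, rate_limit_delay = 6) {
  n_calls <- n_tasks * n_strategies * n_trials
  list(
    n_calls   = n_calls,
    min_hours = n_calls * rate_limit_delay / 3600  # pauses only, no latency
  )
}

est <- estimate_min_runtime(n_tasks = 54, n_strategies = 10)
est$n_calls    # 1620
est$min_hours  # 2.7
```

Passing a smaller strategies vector or a lower n_trials shrinks both numbers proportionally.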
Authentication
- "github" (default)
Uses GITHUB_PAT / GITHUB_TOKEN. In GitHub Actions this is set automatically as secrets.GITHUB_TOKEN, so no extra secret is needed.
- "openai"
Requires OPENAI_API_KEY.
- "anthropic"
Requires ANTHROPIC_API_KEY.
- "ollama"
No key needed (local daemon).
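A pre-flight check along these lines can catch a missing credential before any task is processed. This is a sketch under assumptions: has_credentials() is a hypothetical helper, not part of rrlmgraphbench, and it only tests whether the environment variables named in this section are non-empty, not whether the keys are valid.

```r
# Hypothetical pre-flight check (not part of rrlmgraphbench).
# Environment variable names are those listed in the Authentication section.
has_credentials <- function(provider = c("github", "openai", "anthropic", "ollama")) {
  provider <- match.arg(provider)
  vars <- switch(provider,
    github    = c("GITHUB_PAT", "GITHUB_TOKEN"),
    openai    = "OPENAI_API_KEY",
    anthropic = "ANTHROPIC_API_KEY",
    ollama    = character(0)           # local daemon, no key needed
  )
  if (length(vars) == 0L) return(TRUE)
  any(nzchar(Sys.getenv(vars)))        # TRUE if at least one variable is set
}

if (!has_credentials("github")) {
  message("Set GITHUB_PAT or GITHUB_TOKEN before running the benchmark.")
}
```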
Examples
if (FALSE) { # \dontrun{
  # Uses GitHub Models (GITHUB_TOKEN auto-set in Actions -- no secret needed)
  results <- run_full_benchmark(
    output_path = "inst/results/benchmark_results.rds",
    n_trials = 3L
  )
  head(results)
} # }
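Once a run has finished, the returned data.frame can be summarised per strategy with base R. The sketch below uses a small made-up data.frame with a subset of the documented columns so that it runs without the package; the scores and token counts are invented for illustration and are not benchmark results.

```r
# Stand-in for the real output; with an actual run, use the returned object
# or readRDS(output_path) instead. All values below are made up.
results <- data.frame(
  strategy     = rep(c("graph_rag_tfidf", "no_context"), each = 3),
  trial        = rep(1:3, times = 2),
  score        = c(0.8, 0.6, 0.7, 0.2, 0.4, 0.3),
  total_tokens = c(1200L, 1100L, 1300L, 300L, 320L, 310L)
)

# Mean score and mean token usage per strategy
summary_by_strategy <- aggregate(
  cbind(score, total_tokens) ~ strategy,
  data = results,
  FUN  = mean
)

# Best-scoring strategy first
summary_by_strategy[order(-summary_by_strategy$score), ]
```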