Compute benchmark statistics from a full results data frame
Source: R/compute_statistics.R (compute_benchmark_statistics.Rd)
Aggregates per-trial results produced by run_full_benchmark() into
a comprehensive report containing:
- Summary table (mean, SD, 95% CI, hallucination rate) per strategy.
- Token Efficiency Ratio (TER) relative to the "full_files" baseline.
- Pairwise two-sample Welch t-tests with Cohen's d and Bonferroni-corrected p-values.
- Mean Normalized Discounted Cumulative Gain (NDCG) where relevance ranks are available.
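The pairwise comparison step can be illustrated with base R. The following is a minimal sketch, not the package's implementation: the toy scores are made up, "summary_only" is a hypothetical strategy label, and Cohen's d is computed with a pooled standard deviation, which is one common definition and may differ from the one used internally.

```r
# Toy per-trial scores for three strategies (made-up numbers;
# "summary_only" is a hypothetical label used only for illustration).
set.seed(42)
scores <- data.frame(
  strategy = rep(c("full_files", "bm25_retrieval", "summary_only"), each = 20),
  score    = c(stats::rnorm(20, 0.70, 0.10),
               stats::rnorm(20, 0.75, 0.10),
               stats::rnorm(20, 0.60, 0.10))
)

# Cohen's d using a pooled standard deviation (assumed definition).
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * stats::var(x) + (ny - 1) * stats::var(y)) /
                      (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

# Every unordered pair of strategies.
strategy_pairs <- utils::combn(unique(scores$strategy), 2, simplify = FALSE)

pairwise <- do.call(rbind, lapply(strategy_pairs, function(p) {
  x <- scores$score[scores$strategy == p[1]]
  y <- scores$score[scores$strategy == p[2]]
  tt <- stats::t.test(x, y, var.equal = FALSE)  # Welch two-sample t-test
  data.frame(
    strategy_a = p[1],
    strategy_b = p[2],
    t          = unname(tt$statistic),
    p_value    = tt$p.value,
    cohens_d   = cohens_d(x, y)
  )
}))

# Bonferroni correction across all pairwise comparisons.
pairwise$p_bonferroni <- stats::p.adjust(pairwise$p_value, method = "bonferroni")
pairwise
```

With three strategies there are three comparisons, so each Bonferroni-corrected p-value is the raw p-value multiplied by three and capped at 1.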
Arguments
- all_results
  A data.frame produced by run_full_benchmark(). Required columns:
  - strategy: Character. Strategy label.
  - score: Numeric in [0, 1]. Task score.
  - total_tokens: Integer. Total tokens consumed.
  - hallucination_count: Integer.
  Optional columns for NDCG: rank, relevant.
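For readers who want to check an input before passing it in, the sketch below builds a small data frame with the required columns and validates it. The values are invented for illustration, and the checks are an assumption about what a reasonable pre-flight validation could look like, not a description of what the function itself enforces.

```r
# A toy input with the four required columns (made-up values).
all_results <- data.frame(
  strategy            = rep(c("full_files", "bm25_retrieval"), each = 3),
  score               = c(0.80, 0.70, 0.90, 0.60, 0.75, 0.65),
  total_tokens        = c(12000L, 11500L, 12500L, 3000L, 3200L, 2900L),
  hallucination_count = c(0L, 1L, 0L, 1L, 0L, 2L),
  stringsAsFactors    = FALSE
)

# Hypothetical pre-flight checks mirroring the documented column types.
required <- c("strategy", "score", "total_tokens", "hallucination_count")
missing_cols <- setdiff(required, names(all_results))
stopifnot(
  length(missing_cols) == 0,
  is.character(all_results$strategy),
  is.numeric(all_results$score),
  all(all_results$score >= 0 & all_results$score <= 1),
  is.integer(all_results$total_tokens),
  is.integer(all_results$hallucination_count)
)

# NDCG needs both optional columns to be present.
has_ndcg_cols <- all(c("rank", "relevant") %in% names(all_results))
has_ndcg_cols
```

Because this toy input has no rank or relevant column, has_ndcg_cols is FALSE, which corresponds to the case where no NDCG can be computed.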
Value
A list with the following elements:
- summary: data.frame with one row per strategy.
- ter: Named numeric vector. TER values; NA for the baseline strategy.
- pairwise: data.frame of pairwise Welch t-test results.
- ndcg: Named numeric, or NULL if rank data are absent.
- wilcoxon: data.frame of one-sided paired Wilcoxon signed-rank tests comparing each strategy against "bm25_retrieval" on a per-task basis (mean score across trials per task). Columns: strategy, reference, V (test statistic), p_value, effect_r (rank-biserial correlation in [-1, 1]), n_pairs, wins, ties, losses. NULL if "bm25_retrieval" is absent or the task_id column is missing.
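The per-task Wilcoxon comparison can be approximated in base R as follows. This is a rough sketch under stated assumptions: the toy data are simulated, the trial column name is hypothetical, the one-sided direction is assumed to be "greater" (strategy beats the reference), and the rank-biserial correlation uses the matched-pairs form (W+ minus W-) divided by their sum. The package may differ on any of these points.

```r
# Simulated per-trial scores: 12 tasks, 3 trials, two strategies.
# The "trial" column name is an assumption made for this sketch.
set.seed(1)
toy <- expand.grid(
  task_id  = paste0("task_", 1:12),
  trial    = 1:3,
  strategy = c("bm25_retrieval", "full_files"),
  stringsAsFactors = FALSE
)
centre <- ifelse(toy$strategy == "full_files", 0.75, 0.65)
toy$score <- pmin(pmax(stats::rnorm(nrow(toy), centre, 0.10), 0), 1)

# Mean score across trials per task: a tasks-by-strategies matrix.
per_task <- tapply(toy$score, list(toy$task_id, toy$strategy), mean)
x   <- per_task[, "full_files"]
ref <- per_task[, "bm25_retrieval"]
d   <- x - ref

# One-sided paired signed-rank test (direction is an assumption).
wt <- stats::wilcox.test(x, ref, paired = TRUE, alternative = "greater")

# V is the sum of positive ranks; zero differences are dropped by the test,
# so the total rank sum is n(n + 1) / 2 over the non-zero differences.
V <- unname(wt$statistic)
n_nonzero <- sum(d != 0)
effect_r <- 2 * V / (n_nonzero * (n_nonzero + 1) / 2) - 1

wilcoxon_row <- data.frame(
  strategy  = "full_files",
  reference = "bm25_retrieval",
  V         = V,
  p_value   = wt$p.value,
  effect_r  = effect_r,
  n_pairs   = length(d),
  wins      = sum(d > 0),
  ties      = sum(d == 0),
  losses    = sum(d < 0)
)
wilcoxon_row
```

An effect_r of 1 would mean every non-tied task favours the strategy over the reference, and -1 the reverse.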
Details
When a strategy has fewer than 30 observations, normality is tested
with stats::shapiro.test(). If the test's p-value is below 0.05,
bootstrap 95% confidence intervals (5,000 resamples) are used instead
of the normal-approximation CI.
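The decision rule described here can be sketched as a small helper. The function name ci_95 is hypothetical, and the percentile bootstrap of the mean is an assumption, since the bootstrap variant is not specified; treat this as an illustration of the rule rather than the package's code.

```r
# Hypothetical helper: 95% CI for the mean, switching to a bootstrap
# when the sample is small (< 30) and Shapiro-Wilk rejects normality.
ci_95 <- function(x, n_boot = 5000) {
  n <- length(x)
  # shapiro.test() needs 3 to 5000 values that are not all identical.
  use_bootstrap <- n < 30 && stats::shapiro.test(x)$p.value < 0.05
  if (use_bootstrap) {
    # Percentile bootstrap of the mean (assumed variant).
    boot_means <- replicate(n_boot, mean(sample(x, replace = TRUE)))
    ci <- stats::quantile(boot_means, c(0.025, 0.975), names = FALSE)
    method <- "bootstrap"
  } else {
    # Normal-approximation interval around the sample mean.
    half_width <- stats::qnorm(0.975) * stats::sd(x) / sqrt(n)
    ci <- c(mean(x) - half_width, mean(x) + half_width)
    method <- "normal"
  }
  list(lower = ci[1], upper = ci[2], method = method)
}

# A small, heavily skewed sample (one large outlier) takes the bootstrap path.
set.seed(123)
skewed <- c(0.01, 0.02, 0.02, 0.03, 0.03, 0.04, 0.05, 0.05, 0.06, 0.95)
ci_skewed <- ci_95(skewed)

# A sample with 30 or more observations skips the normality test entirely.
large <- seq(0.1, 0.9, length.out = 40)
ci_large <- ci_95(large)
```

Note that under this rule a small sample that passes the Shapiro-Wilk test still gets the normal-approximation interval.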
Examples
if (FALSE) { # \dontrun{
  results <- run_full_benchmark("inst/tasks", "inst/projects",
                                tempfile(fileext = ".rds"))
  stats <- compute_benchmark_statistics(results)
  stats$summary
} # }