
The goal of rrlmgraph is to give large language models the right context about your R project — not the whole codebase. It builds a typed knowledge graph that encodes functions, files, packages, and tests as nodes, then performs a relevance-guided traversal to extract a token-budgeted context window for every LLM query. Compared to pasting entire source files, rrlmgraph reduces hallucinations by grounding responses in verified code and cuts token consumption by orders of magnitude.

Installation

You can install the development version of rrlmgraph from GitHub with:

# install.packages("pak")
pak::pak("davidrsch/rrlmgraph")

Example

Example 1: Build a knowledge graph for your R project

build_rrlm_graph() parses every R source file in your project, extracts function definitions, resolves cross-file calls and package imports, computes TF-IDF embeddings for each node, and assembles everything into an igraph object annotated with PageRank scores.

library(rrlmgraph)

# Point it at any R project — package, Shiny app, or script directory.
# verbose = TRUE shows the build progress messages.
g <- build_rrlm_graph("path/to/mypkg", verbose = TRUE)
#> ✔ Detected R package: mypkg
#> ✔ Parsed 14 source files — 38 function nodes, 6 package nodes, 9 test files
#> ✔ Built CALLS (52), IMPORTS (31), TESTS (9), SEMANTIC (17) edges
#> ✔ TF-IDF embeddings computed (vocab size: 1 204)
#> ✔ PageRank computed
#> ✔ Graph cached at path/to/mypkg/.rrlmgraph/

# The object is a plain igraph, so all igraph tools work.
igraph::vcount(g)
#> [1] 53
igraph::ecount(g)
#> [1] 109

# A typed summary is available via print/summary.
g
#> <rrlm_graph>  mypkg 0.1.0
#>   Nodes : 53  (38 function, 6 package, 9 testfile)
#>   Edges : 109 (52 CALLS, 31 IMPORTS, 9 TESTS, 17 SEMANTIC)
#>   Embed : tfidf  (vocab 1 204)
#>   Cached: path/to/mypkg/.rrlmgraph/graph.rds
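Because the result is a plain igraph object, you can inspect node metadata directly. A minimal sketch, assuming vertex attributes named `type` and `pagerank` — the actual attribute names may differ, so check `igraph::vertex_attr_names(g)` first:

```r
# Hypothetical attribute names ("type", "pagerank") -- verify with
# igraph::vertex_attr_names(g) before relying on them.
library(igraph)

# Count nodes by kind, e.g. function / package / testfile.
table(V(g)$type)

# The five most central nodes by PageRank.
head(sort(V(g)$pagerank, decreasing = TRUE), 5)
```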

Example 2: Query context for a coding task

query_context() performs a relevance-guided breadth-first search from the most informative seed node. The token budget is a hard constraint — the returned context string is guaranteed to fit within it.

# Ask for the most relevant code for a specific task.
ctx <- query_context(
  g,
  query        = "How does the model fitting pipeline handle missing values?",
  budget_tokens = 2000L
)

# ctx is a list with class "rrlm_context".
ctx$seed_node
#> [1] "preprocess_data"

ctx$nodes
#> [1] "preprocess_data" "impute_missing"  "fit_model"       "validate_inputs"
#> [5] "load_data"

ctx$tokens_used
#> [1] 1847

# The assembled context string is ready to paste into any LLM prompt.
cat(ctx$context_string)
#> # rrlm_graph Context
#> # Project: mypkg | R 4.5.0 | ~1847 tokens
#> # Query: How does the model fitting pipeline handle missing values?
#>
#> ## CORE FUNCTIONS
#> ---
#> ### preprocess_data
#> preprocess_data <- function(df, strategy = c("median", "knn", "drop")) {
#>   strategy <- match.arg(strategy)
#>   df <- validate_inputs(df)
#>   impute_missing(df, strategy = strategy)
#> }
#>
#> ## SUPPORTING FUNCTIONS
#> ---
#> ### impute_missing
#> impute_missing(df, strategy)
#> Imputes missing values using the selected strategy.
#> ...

You can inspect the relevance scores to understand why each node was selected:

sort(ctx$relevance_scores, decreasing = TRUE)
#>  preprocess_data   impute_missing        fit_model validate_inputs
#>        0.8341           0.7129            0.6803            0.5512
#>    load_data
#>        0.3204
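The budget guarantee can be pictured as a greedy pass over relevance-ranked nodes: keep adding the next most relevant node's snippet as long as it still fits. This is an illustrative base-R sketch with made-up scores and token costs, not the package's actual traversal:

```r
# Toy relevance scores and per-node token costs (made-up numbers).
scores <- c(preprocess_data = 0.83, impute_missing = 0.71,
            fit_model = 0.68, validate_inputs = 0.55, load_data = 0.32)
costs  <- c(preprocess_data = 600, impute_missing = 450,
            fit_model = 700, validate_inputs = 300, load_data = 400)
budget <- 2000L

selected <- character(0)
used <- 0L
for (node in names(sort(scores, decreasing = TRUE))) {
  # Skip any node whose snippet would push us past the budget.
  if (used + costs[[node]] <= budget) {
    selected <- c(selected, node)
    used <- used + costs[[node]]
  }
}
selected  # the nodes that fit
used      # total tokens consumed; never exceeds `budget`
```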

Example 3: Chat with an LLM grounded in project code

chat_with_context() wraps query_context() and sends a grounded prompt to an LLM via the ellmer package. The model is instructed to answer only from the provided code snippets and to cite node names. It supports OpenAI, Anthropic, GitHub Models, and Ollama as providers.

# GitHub Models — uses GITHUB_PAT, no extra secret needed.
answer <- chat_with_context(
  g,
  message       = "Refactor preprocess_data() to support a 'mode' imputation strategy.",
  provider      = "github",
  budget_tokens = 3000L
)
cat(answer)
#> Based on `preprocess_data` (data_pipeline.R) and `impute_missing`
#> (impute.R), here is a minimal refactor:
#>
#> ```r
#> preprocess_data <- function(df,
#>                              strategy = c("median", "knn", "drop", "mode")) {
#>   strategy <- match.arg(strategy)
#>   df <- validate_inputs(df)
#>   impute_missing(df, strategy = strategy)
#> }
#> ```
#>
#> `impute_missing()` already dispatches on `strategy` via a switch statement
#> (impute.R line 34), so adding `"mode"` there is the only other change needed.

# Ollama (local, no API key required).
# answer <- chat_with_context(g, "...", provider = "ollama", model = "llama3.2")
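If you want full control over the prompt, you can also skip chat_with_context() and pass ctx$context_string to ellmer yourself. A hedged sketch using ellmer's chat_openai() constructor; the system-prompt wording here is our own, not the package's:

```r
library(ellmer)

# Extract a token-budgeted context as in Example 2.
ctx <- query_context(g,
  query         = "How does the pipeline handle missing values?",
  budget_tokens = 2000L
)

# Ground the model by placing the extracted code in the system prompt.
chat <- chat_openai(
  system_prompt = paste(
    "Answer only from the R code below and cite node names.",
    ctx$context_string,
    sep = "\n\n"
  )
)
chat$chat("How does the pipeline handle missing values?")
```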

Example 4: Export to SQLite for the rrlmgraph-mcp server

export_to_sqlite() writes the full graph — nodes with body text, embeddings, and PageRank, edges with types and weights, and the TF-IDF vocabulary — to a single SQLite file that the rrlmgraph-mcp TypeScript server reads directly.

# Export once after building (or rebuilding) the graph.
export_to_sqlite(g, db_path = "path/to/mypkg/.rrlmgraph/graph.sqlite")
#> ✔ Graph exported to path/to/mypkg/.rrlmgraph/graph.sqlite

# Then point the MCP server at the file.
# In ~/.cursor/mcp.json (or equivalent):
# {
#   "mcpServers": {
#     "rrlmgraph": {
#       "command": "node",
#       "args": [
#         "/path/to/rrlmgraph-mcp/dist/index.js",
#         "--db-path", "path/to/mypkg/.rrlmgraph/graph.sqlite"
#       ]
#     }
#   }
# }
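You can also inspect the exported file from R with DBI and RSQLite. A minimal sketch — the table and column names below ("nodes", "name", "pagerank") are assumptions about the schema, so use dbListTables() to see the real layout:

```r
# Assumed schema: a "nodes" table with "name" and "pagerank" columns.
# dbListTables(con) and dbListFields(con, "nodes") reveal the actual one.
library(DBI)

con <- dbConnect(RSQLite::SQLite(), "path/to/mypkg/.rrlmgraph/graph.sqlite")
dbListTables(con)
dbGetQuery(con,
  "SELECT name, pagerank FROM nodes ORDER BY pagerank DESC LIMIT 5")
dbDisconnect(con)
```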

The SQLite file is self-contained and can be committed alongside your project so teammates benefit from graph context without running R.

Learn more