Architecture
What happens when you run larkx index: parsing, graph building, and storage.
Pipeline overview
larkx runs a fixed 6-stage pipeline:
- Walker: recursively finds source files, respecting .gitignore, .claudeignore, and security exclusions in .larkx/config.json (see the sketch after this list)
- Parser: extracts functions, classes, imports, exports, and call edges via tree-sitter (with a regex fallback if native bindings are unavailable)
- Graph builder: creates typed nodes (file, function, class) and edges (contains, imports, calls, inherits, tested_by)
- Incremental indexer: SHA-256 hashes each file; unchanged files reuse cached parse results, so reindexing is near-instant
- Storage: everything is saved as JSON in .larkx/ (added to .gitignore automatically)
- MCP server: exposes 6 tools that read the saved graph and return compact text
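The first stage is plain directory recursion. Here is a minimal sketch of the Walker idea in TypeScript; the hard-coded ignore list and extension filter stand in for the real .gitignore/.claudeignore and config.json handling, and none of this is larkx's actual code:

```ts
// Illustrative walker, not larkx's actual implementation.
import { readdirSync } from "node:fs";
import { join } from "node:path";

const IGNORED = new Set(["node_modules", ".git", ".larkx", "dist"]); // simplified ignore list
const SOURCE = /\.(ts|tsx|js|jsx|py|go)$/;                           // simplified language filter

function walk(dir: string, out: string[] = []): string[] {
  for (const entry of readdirSync(dir, { withFileTypes: true })) {
    if (IGNORED.has(entry.name)) continue;
    const full = join(dir, entry.name);
    if (entry.isDirectory()) walk(full, out);       // recurse into subdirectories
    else if (SOURCE.test(entry.name)) out.push(full); // keep only supported source files
  }
  return out;
}
```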
What gets stored
.larkx/
├── index.json # Per-file symbol data (functions, classes, imports, signatures)
├── graph.json # Nodes + edges (the actual graph structure)
├── summaries.json # Optional AI-generated one-line file summaries
├── meta.json # Counts, language stats, last-indexed timestamp, file hashes
└── config.json # Security exclusions + user-defined entry points

Total size for a 500-file project is typically 1-3 MB. The folder is added to .gitignore automatically; it's a build artifact, not source.
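The file hashes in meta.json are what make reindexing cheap: a file whose SHA-256 matches the last run keeps its cached parse result. A sketch of that check, with an assumed meta.json field layout:

```ts
// Illustrative incremental check; the "hashes" field name is an assumption about meta.json.
import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";

function sha256(path: string): string {
  return createHash("sha256").update(readFileSync(path)).digest("hex");
}

function needsReparse(path: string, meta: { hashes: Record<string, string> }): boolean {
  // Unchanged hash -> reuse the cached parse result from index.json instead of re-parsing.
  return meta.hashes[path] !== sha256(path);
}

const meta = JSON.parse(readFileSync(".larkx/meta.json", "utf8"));
console.log(needsReparse("src/auth/login.ts", meta));
```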
The graph format
Each node has a type, file path, name, optional line number, and language:
{
"id": "src/auth/login.ts::validateJWT",
"type": "function",
"file": "src/auth/login.ts",
"name": "validateJWT",
"line": 6,
"lang": "typescript"
}

Each edge is a typed, directed relationship:
{
"from": "src/auth/login.ts",
"to": "src/utils/crypto.ts",
"type": "imports"
}

Edge types: contains, imports, calls, inherits, tested_by.
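Written as TypeScript types, the two record shapes above look roughly like this (type names are illustrative, not larkx's internals):

```ts
// Sketch of the graph record shapes shown above; names are illustrative.
type NodeType = "file" | "function" | "class";
type EdgeType = "contains" | "imports" | "calls" | "inherits" | "tested_by";

interface GraphNode {
  id: string;      // e.g. "src/auth/login.ts::validateJWT"
  type: NodeType;
  file: string;    // containing file path
  name: string;
  line?: number;   // optional: file-level nodes have no line
  lang?: string;   // e.g. "typescript"
}

interface GraphEdge {
  from: string;    // node id (a bare file path for file-level edges)
  to: string;
  type: EdgeType;
}

interface Graph {
  nodes: GraphNode[];
  edges: GraphEdge[];
}
```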
The MCP layer
When you run larkx mcp, it starts a Model Context Protocol server over stdio. AI clients (Claude Code, Cursor, etc.) spawn this process and send JSON-RPC requests. Six tools are registered. Each tool reads the cached graph (no re-parsing) and returns a compact plain-text response with lines kept under roughly 200 characters.
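The registration side is thin because the heavy work happened at index time. A minimal sketch of one such tool, assuming the official TypeScript MCP SDK (@modelcontextprotocol/sdk); the tool name, index.json layout, and response text are illustrative:

```ts
// Sketch of a stdio MCP server exposing one graph-reading tool; not larkx's actual code.
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { readFileSync } from "node:fs";
import { z } from "zod";

const server = new McpServer({ name: "larkx", version: "0.0.0" });

// Hypothetical tool: return one file's cached symbols as compact text, no re-parsing.
server.tool("get_file_summary", { path: z.string() }, async ({ path }) => {
  const index = JSON.parse(readFileSync(".larkx/index.json", "utf8"));
  const entry = index[path]; // assumed layout: index.json keyed by file path
  const text = entry
    ? `${path}: ${entry.functions.map((f: { name: string }) => f.name).join(", ")}`
    : `No index entry for ${path}`;
  return { content: [{ type: "text" as const, text }] };
});

await server.connect(new StdioServerTransport());
```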
Token-efficient serialization
Files are serialized as one line each:
src/auth/login.ts[ts]: validateJWT@6, hashPassword@10, LoginService@14 | +../utils/crypto, +../db/users
src/auth/middleware.ts[ts]: authMiddleware@11, rateLimiter@20 | +./login
src/utils/crypto.ts[ts]: sha256@3, generateToken@7 |

The notation is:
- path[lang]: the file path plus a short language tag
- symbols: comma-separated symbol names, each with an @line suffix
- | +imports: imports follow the pipe, each prefixed with +
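A sketch of the serializer under an assumed per-file record shape (field names are illustrative, not larkx's real schema):

```ts
// Illustrative one-line serializer; the FileRecord shape is assumed.
interface SymbolEntry { name: string; line: number; }
interface FileRecord { path: string; lang: string; symbols: SymbolEntry[]; imports: string[]; }

function serializeFile(f: FileRecord): string {
  const symbols = f.symbols.map((s) => `${s.name}@${s.line}`).join(", ");
  const imports = f.imports.map((i) => `+${i}`).join(", ");
  return `${f.path}[${f.lang}]: ${symbols} | ${imports}`;
}

// serializeFile({
//   path: "src/auth/login.ts", lang: "ts",
//   symbols: [{ name: "validateJWT", line: 6 }, { name: "hashPassword", line: 10 }],
//   imports: ["../utils/crypto", "../db/users"],
// })
// -> "src/auth/login.ts[ts]: validateJWT@6, hashPassword@10 | +../utils/crypto, +../db/users"
```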
Reachability for dead code
The dead code detector does BFS from entry points across the graph. Entry points are determined by three layers:
- Universal patterns: index.*, main.*, app.*, cli/index.*, test files, story files, config files
- Framework patterns: auto-detected from package.json (Next.js routes, NestJS modules, SvelteKit pages, Express routes, etc.)
- User entry points: anything in config.json under entryPoints
Anything unreachable from any entry is genuinely dead, not just "no incoming imports".
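The traversal itself is ordinary breadth-first search over the edge list. A sketch using the Graph types sketched earlier; treating every edge type as propagating reachability is a simplifying assumption:

```ts
// BFS reachability sketch; assumes the Graph/GraphNode shapes from the earlier sketch.
function findDeadNodes(graph: Graph, entryIds: string[]): GraphNode[] {
  // Adjacency list over outgoing edges.
  const out = new Map<string, string[]>();
  for (const e of graph.edges) {
    const targets = out.get(e.from) ?? [];
    targets.push(e.to);
    out.set(e.from, targets);
  }

  // BFS from every entry point at once.
  const reachable = new Set<string>(entryIds);
  const queue = [...entryIds];
  while (queue.length > 0) {
    const id = queue.shift()!;
    for (const next of out.get(id) ?? []) {
      if (!reachable.has(next)) {
        reachable.add(next);
        queue.push(next);
      }
    }
  }

  // Unreachable from every entry point = reported as dead.
  return graph.nodes.filter((n) => !reachable.has(n.id));
}
```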
The browser UI
larkx serve launches an Express server on port 2911 that exposes the same graph data over HTTP. The frontend is a D3-based force graph (Graph view) plus a Module view that renders each file as a container with its functions inside. Both support search, folder clustering, dead code highlighting, and "Open in VS Code" deep links.
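The HTTP side can stay small because it only re-serves the cached JSON. A sketch with assumed route names (larkx's actual endpoints and static-asset handling are not shown here):

```ts
// Illustrative Express server re-serving the cached graph; routes and setup simplified.
import express from "express";
import { readFileSync } from "node:fs";

const app = express();

app.get("/api/graph", (_req, res) => {
  // No re-parsing: hand the saved graph straight to the D3 frontend.
  res.json(JSON.parse(readFileSync(".larkx/graph.json", "utf8")));
});

app.listen(2911, () => console.log("larkx UI at http://localhost:2911"));
```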
What's not in the graph
- Source code content (only structure)
- Comments or documentation strings
- Variables or type definitions
- Files matching exclude patterns (security defaults)
- Files in unsupported languages
This is intentional. The graph is a map, not a copy. When the AI needs deeper detail, it calls get_file_summary or reads the file directly.