Architecture
What happens when you run larkx index: parsing, graph building, and storage.
Pipeline overview
larkx runs a fixed 6-stage pipeline:
- Walker: recursively finds source files, respecting .gitignore, .claudeignore, and security exclusions in .larkx/config.json (see the sketch after this list)
- Parser: extracts functions, classes, imports, exports, and call edges via tree-sitter (with a regex fallback if native bindings are unavailable)
- Graph builder: creates typed nodes (file, function, class) and edges (contains, imports, calls, inherits, tested_by)
- Incremental indexer: SHA-256 hashes each file; unchanged files reuse cached parse results, so reindexing is near-instant
- Storage: everything is saved as JSON in .larkx/ (added to .gitignore automatically)
- MCP server: exposes 6 tools that read the saved graph and return compact text
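The first stage is plain directory recursion. Here is a minimal sketch of the Walker idea in TypeScript; the hard-coded ignore list and extension filter stand in for the real .gitignore/.claudeignore and config.json handling, and none of this is larkx's actual code:

```ts
// Illustrative walker, not larkx's actual implementation.
import { readdirSync } from "node:fs";
import { join } from "node:path";

const IGNORED = new Set(["node_modules", ".git", ".larkx", "dist"]); // simplified ignore list
const SOURCE = /\.(ts|tsx|js|jsx|py|go)$/;                           // simplified language filter

function walk(dir: string, out: string[] = []): string[] {
  for (const entry of readdirSync(dir, { withFileTypes: true })) {
    if (IGNORED.has(entry.name)) continue;
    const full = join(dir, entry.name);
    if (entry.isDirectory()) walk(full, out);       // recurse into subdirectories
    else if (SOURCE.test(entry.name)) out.push(full); // keep only supported source files
  }
  return out;
}
```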
What gets stored
.larkx/
├── index.json # Per-file symbol data (functions, classes, imports, signatures)
├── graph.json # Nodes + edges (the actual graph structure)
├── summaries.json # Optional AI-generated one-line file summaries
├── meta.json # Counts, language stats, last-indexed timestamp, file hashes
└── config.json # Security exclusions + user-defined entry points

Total size for a 500-file project is typically 1-3 MB. The folder is added to .gitignore automatically; it's a build artifact, not source.
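The file hashes in meta.json are what make reindexing cheap: a file whose SHA-256 matches the last run keeps its cached parse result. A sketch of that check, with an assumed meta.json field layout:

```ts
// Illustrative incremental check; the "hashes" field name is an assumption about meta.json.
import { createHash } from "node:crypto";
import { readFileSync } from "node:fs";

function sha256(path: string): string {
  return createHash("sha256").update(readFileSync(path)).digest("hex");
}

function needsReparse(path: string, meta: { hashes: Record<string, string> }): boolean {
  // Unchanged hash -> reuse the cached parse result from index.json instead of re-parsing.
  return meta.hashes[path] !== sha256(path);
}

const meta = JSON.parse(readFileSync(".larkx/meta.json", "utf8"));
console.log(needsReparse("src/auth/login.ts", meta));
```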
The graph format
Each node has a type, file path, name, optional line number, and language:
{
"id": "src/auth/login.ts::validateJWT",
"type": "function",
"file": "src/auth/login.ts",
"name": "validateJWT",
"line": 6,
"lang": "typescript"
}

Each edge is a typed, directed relationship:
{
"from": "src/auth/login.ts",
"to": "src/utils/crypto.ts",
"type": "imports"
}

Edge types: contains, imports, calls, inherits, tested_by.
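Written as TypeScript types, the two record shapes above look roughly like this (type names are illustrative, not larkx's internals):

```ts
// Sketch of the graph record shapes shown above; names are illustrative.
type NodeType = "file" | "function" | "class";
type EdgeType = "contains" | "imports" | "calls" | "inherits" | "tested_by";

interface GraphNode {
  id: string;      // e.g. "src/auth/login.ts::validateJWT"
  type: NodeType;
  file: string;    // containing file path
  name: string;
  line?: number;   // optional: file-level nodes have no line
  lang?: string;   // e.g. "typescript"
}

interface GraphEdge {
  from: string;    // node id (a bare file path for file-level edges)
  to: string;
  type: EdgeType;
}

interface Graph {
  nodes: GraphNode[];
  edges: GraphEdge[];
}
```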
The MCP layer
When you run larkx mcp, it starts a Model Context Protocol server over stdio. AI clients (Claude Code, Cursor, etc.) spawn this process and send JSON-RPC requests. Six tools are registered. Each tool reads the cached graph (no re-parsing) and returns a compact plain-text response with lines kept under roughly 200 characters.
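The registration side is thin because the heavy work happened at index time. A minimal sketch of one such tool, assuming the official TypeScript MCP SDK (@modelcontextprotocol/sdk); the tool name, index.json layout, and response text are illustrative:

```ts
// Sketch of a stdio MCP server exposing one graph-reading tool; not larkx's actual code.
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { readFileSync } from "node:fs";
import { z } from "zod";

const server = new McpServer({ name: "larkx", version: "0.0.0" });

// Hypothetical tool: return one file's cached symbols as compact text, no re-parsing.
server.tool("get_file_summary", { path: z.string() }, async ({ path }) => {
  const index = JSON.parse(readFileSync(".larkx/index.json", "utf8"));
  const entry = index[path]; // assumed layout: index.json keyed by file path
  const text = entry
    ? `${path}: ${entry.functions.map((f: { name: string }) => f.name).join(", ")}`
    : `No index entry for ${path}`;
  return { content: [{ type: "text" as const, text }] };
});

await server.connect(new StdioServerTransport());
```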
Token-efficient serialization
Files are serialized as one line each:
src/auth/login.ts[ts]: validateJWT@6, hashPassword@10, LoginService@14 | +../utils/crypto, +../db/users
src/auth/middleware.ts[ts]: authMiddleware@11, rateLimiter@20 | +./login
src/utils/crypto.ts[ts]: sha256@3, generateToken@7 |

The notation is:
- path[lang]: the file path plus a short language tag
- symbols: comma-separated symbol names, each with an @line suffix
- | +imports: imports follow the pipe, each prefixed with +
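A sketch of the serializer under an assumed per-file record shape (field names are illustrative, not larkx's real schema):

```ts
// Illustrative one-line serializer; the FileRecord shape is assumed.
interface SymbolEntry { name: string; line: number; }
interface FileRecord { path: string; lang: string; symbols: SymbolEntry[]; imports: string[]; }

function serializeFile(f: FileRecord): string {
  const symbols = f.symbols.map((s) => `${s.name}@${s.line}`).join(", ");
  const imports = f.imports.map((i) => `+${i}`).join(", ");
  return `${f.path}[${f.lang}]: ${symbols} | ${imports}`;
}

// serializeFile({
//   path: "src/auth/login.ts", lang: "ts",
//   symbols: [{ name: "validateJWT", line: 6 }, { name: "hashPassword", line: 10 }],
//   imports: ["../utils/crypto", "../db/users"],
// })
// -> "src/auth/login.ts[ts]: validateJWT@6, hashPassword@10 | +../utils/crypto, +../db/users"
```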
Reachability for dead code
The dead code detector does BFS from entry points across the graph. Entry points are determined by three layers:
- Universal patterns: index.*, main.*, app.*, cli/index.*, test files, story files, config files
- Framework patterns: auto-detected from package.json (Next.js routes, NestJS modules, SvelteKit pages, Express routes, etc.)
- User entry points: anything in config.json under entryPoints
Anything unreachable from any entry is genuinely dead, not just "no incoming imports".
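The traversal itself is ordinary breadth-first search over the edge list. A sketch using the Graph types sketched earlier; treating every edge type as propagating reachability is a simplifying assumption:

```ts
// BFS reachability sketch; assumes the Graph/GraphNode shapes from the earlier sketch.
function findDeadNodes(graph: Graph, entryIds: string[]): GraphNode[] {
  // Adjacency list over outgoing edges.
  const out = new Map<string, string[]>();
  for (const e of graph.edges) {
    const targets = out.get(e.from) ?? [];
    targets.push(e.to);
    out.set(e.from, targets);
  }

  // BFS from every entry point at once.
  const reachable = new Set<string>(entryIds);
  const queue = [...entryIds];
  while (queue.length > 0) {
    const id = queue.shift()!;
    for (const next of out.get(id) ?? []) {
      if (!reachable.has(next)) {
        reachable.add(next);
        queue.push(next);
      }
    }
  }

  // Unreachable from every entry point = reported as dead.
  return graph.nodes.filter((n) => !reachable.has(n.id));
}
```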
The browser UI
larkx serve launches an Express server on port 2911 that exposes the same graph data over HTTP. The frontend is a D3-based force graph (Graph view) plus a Module view that renders each file as a container with its functions inside. Both support search, folder clustering, dead code highlighting, and "Open in VS Code" deep links.
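The HTTP side can stay small because it only re-serves the cached JSON. A sketch with assumed route names (larkx's actual endpoints and static-asset handling are not shown here):

```ts
// Illustrative Express server re-serving the cached graph; routes and setup simplified.
import express from "express";
import { readFileSync } from "node:fs";

const app = express();

app.get("/api/graph", (_req, res) => {
  // No re-parsing: hand the saved graph straight to the D3 frontend.
  res.json(JSON.parse(readFileSync(".larkx/graph.json", "utf8")));
});

app.listen(2911, () => console.log("larkx UI at http://localhost:2911"));
```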
What's not in the graph
- Source code content (only structure)
- Comments or documentation strings
- Variables or type definitions
- Files matching exclude patterns (security defaults)
- Files in unsupported languages
This is intentional. The graph is a map, not a copy. When the AI needs deeper detail, it calls get_file_summary or reads the file directly.