Graphify: Building a Knowledge Graph of Your Codebase for AI Assistants

Stanley Ulili
Updated on May 4, 2026

Graphify is a command-line tool that analyzes a codebase and produces a structured knowledge graph. Instead of providing raw source files to an AI assistant, you provide the knowledge graph as context. The AI can then query a structured map of relationships rather than re-reading the entire codebase on each request.

The problem with raw directory context

When an AI agent receives a folder of source files, it has no inherent understanding of how the files relate to each other. For every query, it reads through the relevant files from scratch to build a mental model of the project. For a question like "how is user authentication handled?", the agent may need to read through routing files, middleware, models, and utilities to piece together the answer, consuming tens of thousands of tokens per query.

Terminal window showing an AI agent consuming over 3,200 tokens and taking more than a minute to process a single request

Beyond cost, this approach produces unreliable results. Without a structured understanding of relationships, the AI relies on text similarity to infer connections between components. It may miss subtle dependencies, use deprecated functions, or misattribute behavior to the wrong module.

How Graphify works

Graphify uses a two-stage analysis pipeline to build its knowledge graph.

Structural analysis uses Tree-sitter to parse source files and build an Abstract Syntax Tree for each one. This extracts explicit, verifiable relationships: which functions are defined, which modules are imported, which classes inherit from others.
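Graphify's structural pass uses Tree-sitter so it can parse many languages. For a single-language illustration, Python's standard-library ast module can extract the same kinds of explicit facts. The helper below is a hypothetical sketch of the idea, not Graphify's implementation:

```python
import ast

def extract_structure(source: str) -> dict:
    """Walk a module's syntax tree and collect explicit, verifiable facts:
    defined functions, imported modules, and class inheritance."""
    tree = ast.parse(source)
    facts = {"functions": [], "imports": [], "inherits": []}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            facts["functions"].append(node.name)
        elif isinstance(node, ast.Import):
            facts["imports"].extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            facts["imports"].append(node.module)
        elif isinstance(node, ast.ClassDef):
            for base in node.bases:
                if isinstance(base, ast.Name):
                    facts["inherits"].append((node.name, base.id))
    return facts

# Toy module using hypothetical names from the example project.
facts = extract_structure(
    "import os\n"
    "from typing import Optional\n"
    "class AuthUser(BaseModel):\n"
    "    pass\n"
    "def get_current_user():\n"
    "    pass\n"
)
```

Every fact this pass produces is verifiable against the source, which is why Graphify can tag such relationships as EXTRACTED rather than inferred.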

Semantic analysis uses an LLM to extract meaning and intent from the project. It reads code comments, Markdown documentation (README.md, design decision documents), SVG architecture diagrams, and optionally audio, video, and PDF files. This links high-level architectural decisions documented in Markdown to the specific functions that implement them.

The structural and semantic data are merged into a single knowledge graph where files, functions, and concepts are nodes, and relationships are edges tagged with confidence scores: EXTRACTED for explicit connections and INFERRED for semantic links. A clustering step identifies communities of related code, automatically detecting logical components like "Authentication Module" or "Task Service Logic."
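A minimal sketch of what the merged graph might look like, using hypothetical node and edge names and an adjacency map keyed by source node. EXTRACTED edges stand in for the structural pass, INFERRED edges for the semantic pass:

```python
from collections import defaultdict

# Illustrative data only; these node and relation names are assumptions
# based on the example project, not Graphify's actual schema.
structural = [
    ("auth.py", "get_current_user", "DEFINES", "EXTRACTED"),
    ("tasks_router.py", "get_current_user", "CALLS", "EXTRACTED"),
]
semantic = [
    ("README.md", "auth.py", "DESCRIBES", "INFERRED"),
]

def merge(*edge_sets):
    """Merge edge lists into one adjacency map keyed by source node.
    Each edge keeps its confidence tag so consumers can tell
    verifiable facts apart from semantic guesses."""
    graph = defaultdict(list)
    for edges in edge_sets:
        for src, dst, relation, confidence in edges:
            graph[src].append((dst, relation, confidence))
    return dict(graph)

graph = merge(structural, semantic)
```

Keeping the confidence tag on every edge is the design choice that matters here: downstream consumers (including the AI assistant) can weight EXTRACTED and INFERRED relationships differently.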

All processing runs locally. Source code never leaves the machine.

Installation and usage

The package name on PyPI uses two y's:

```
pip install graphifyy
```

With the package installed, run the analysis from the project root:

```
graphify
```

Terminal showing the graphify command being executed, kicking off the analysis process

Graphify logs its progress: checking the cache for unchanged files, dispatching AST extraction for each code file, running semantic extraction sub-agents on documentation and images, and reporting the token cost of the analysis. When complete, a graphify-out directory appears in the project root.

Output files

GRAPH_REPORT.md

This is a structured Markdown summary of the entire project, designed to be used as AI context.

View of the GRAPH_REPORT.md file highlighting key summary statistics and the list of detected "Community Hubs"

The report includes:

  • Summary: node count (files, functions, concepts), edge count (relationships), and community count (clusters)
  • Community Hubs: the logical components Graphify discovered, with links to each cluster's detail section
  • God Nodes: the most highly connected elements, ordered by connection count, identifying the core abstractions in the codebase
  • Surprising Connections: inferred relationships that may not be obvious from reading the code, such as a README.md whose described concepts map semantically to a specific module
  • Hyperedges: group relationships showing how entire communities connect to each other
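Ranking by connection count is straightforward once the graph exists. Assuming an undirected edge list with hypothetical node names, a simple degree tally yields the god-node ordering:

```python
from collections import Counter

# Hypothetical edges; each pair counts toward both endpoints' degree.
edges = [
    ("auth.py", "get_current_user"),
    ("auth.py", "AuthUser"),
    ("tasks_router.py", "get_current_user"),
    ("models.py", "AuthUser"),
]

degree = Counter()
for a, b in edges:
    degree[a] += 1
    degree[b] += 1

# God nodes: the most highly connected elements first.
god_nodes = degree.most_common()
```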

graph.html

An interactive force-directed graph of the entire project, viewable in any browser.

Interactive knowledge graph displayed in a browser showing a complex web of color-coded nodes and connections

Each node represents a project element: a file, class, function, or concept from a document. Edges represent relationships. Clicking a node opens an information panel showing the node's type, source file, degree (number of connections), and its direct neighbors. A community sidebar lets you isolate and view only the nodes belonging to a specific cluster.
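The clusters the sidebar isolates come from the clustering step described earlier. The article does not say which algorithm Graphify uses, but the idea can be illustrated with a naive connected-components pass over an undirected edge list (real community detection, such as Louvain, subdivides connected graphs further):

```python
def connected_components(edges):
    """Group nodes into clusters via union-find over an undirected
    edge list. A naive stand-in for real community detection."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())

# Hypothetical edges forming two disjoint groups of related code.
components = connected_components([
    ("auth.py", "get_current_user"),
    ("auth.py", "AuthUser"),
    ("tasks.py", "create_task"),
])
```

Here the first three nodes end up in one cluster and the task-related pair in another, which is exactly the kind of grouping the community sidebar exposes.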

Zoomed-in view of the interactive graph with the auth.py node selected, highlighting its relationships with the get_current_user function and the AuthUser model

Context comparison

For a query like "how is user authentication handled in this project?", the difference between the two approaches is stark.

Without Graphify, the agent reads auth.py, tasks_router.py, models.py, main.py, and potentially more to build its answer, consuming roughly 14,000 tokens with significant latency. The answer is stitched together probabilistically from text similarity.

With Graphify, the agent receives GRAPH_REPORT.md instead of the source files. It finds the "Authentication Module" community, the AuthUser god node, and its explicit connections to get_current_user. Token consumption drops to a few hundred and the response is near-instantaneous, derived from documented relationships rather than text similarity.

Graphify caches the analysis per file. Re-running after changing one file only re-processes that file, keeping the knowledge graph cheap to maintain.
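The per-file cache can be pictured as a content-hash map: a file is re-processed only when its hash no longer matches the cached entry. The sketch below illustrates that assumption with hypothetical file contents; it is not Graphify's actual cache format:

```python
import hashlib

def file_digest(content: bytes) -> str:
    """Stable fingerprint of a file's content."""
    return hashlib.sha256(content).hexdigest()

def files_to_reprocess(files: dict, cache: dict) -> list:
    """Return only the files whose content hash differs from the
    cached digest, updating the cache as a side effect."""
    stale = []
    for path, content in files.items():
        digest = file_digest(content)
        if cache.get(path) != digest:
            stale.append(path)
            cache[path] = digest
    return stale

cache = {}
files = {
    "auth.py": b"def get_current_user(): ...",
    "models.py": b"class AuthUser: ...",
}
first = files_to_reprocess(files, cache)   # cold cache: everything is stale
files["auth.py"] = b"def get_current_user(user): ..."
second = files_to_reprocess(files, cache)  # only the edited file is stale
```

On the second run only auth.py is re-analyzed, which is what keeps incremental updates cheap.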

Tradeoffs

The initial run on a large project consumes tokens proportional to the amount of documentation and code being analyzed. This is an upfront cost that is amortized over subsequent queries.

Semantic inference is not infallible. Graphify mitigates this by labeling each relationship with a confidence level, so inferred connections are distinguishable from extracted ones.

Detailed view of the GRAPH_REPORT.md showing Hyperedges and confidence scores for findings

For very small projects or single scripts, the overhead of running the analysis is unnecessary. The tool is most useful when the codebase is large enough that an AI assistant struggles to hold the full context in a single query.

Final thoughts

Graphify addresses the right problem. The bottleneck in AI-assisted development on existing codebases is not code generation; it is context retrieval. By doing the structural and semantic work once and persisting it in a reusable form, it converts repeated expensive re-reads into cheap lookups against a pre-built map.

The GRAPH_REPORT.md doubles as useful documentation for human developers, particularly for onboarding or for understanding unfamiliar legacy code. The interactive graph.html makes the architecture of a codebase navigable in a way that static documentation rarely achieves.

The project and further documentation are available at github.com/graphifyy.


This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.