Caveman: Reducing LLM Output Tokens by up to 75% with a Prompt Skill
Caveman is a prompt skill that rewrites how an LLM structures its output. Instead of conversational paragraphs, the model produces dense, fact-based statements with articles dropped, hedging removed, and arrows used for causality instead of full sentences. The claimed output token reduction is up to 75%, though real-world results vary by query type.
The verbosity problem
LLMs are trained to be conversational. A response to "how does authentication work in this app?" might open with "Here's how auth works in this app" and spend several sentences framing the context before delivering the actual facts. Every word in that response is a token, and API pricing for both input and output is calculated in tokens.
For developers running many queries or building chat applications, this conversational padding adds up. The goal with Caveman is to have the model function as a pure information retrieval engine rather than a conversational partner.
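To get a feel for the scale, here is a back-of-the-envelope cost model. All prices and token counts below are hypothetical, not measured:

```python
# Rough cost model for conversational padding (all numbers hypothetical).
# Assume $15 per million output tokens and ~60 tokens of framing per reply.

PRICE_PER_OUTPUT_TOKEN = 15 / 1_000_000  # USD, hypothetical rate
PADDING_TOKENS_PER_REPLY = 60            # "Here's how..." style framing

def padding_cost(queries_per_day: int, days: int = 30) -> float:
    """Dollars spent purely on conversational scaffolding."""
    return queries_per_day * days * PADDING_TOKENS_PER_REPLY * PRICE_PER_OUTPUT_TOKEN

# A team running 5,000 queries/day pays for 9M padding tokens a month.
monthly = padding_cost(5_000)
```

Sixty tokens of framing is trivial in a single reply, but at thousands of queries a day it becomes millions of billed tokens a month.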
Before Caveman, a standard Claude response to a question about an auth system might read:
Here's how auth works in this app. This is a simulated authentication system — no backend, no passwords, no real security. It exists to demonstrate Better Stack RUM user tracking...
After Caveman, the same question produces something closer to:

App auth = simulated. No backend, no passwords, no real security. Exists to demo Better Stack RUM user tracking...

The technical content is identical. The conversational scaffolding is gone.
Why brevity may also improve accuracy
A research paper titled "Brevity Constraints Reverse Performance Hierarchies in Language Models" found that constraining large models to brief responses improved accuracy by up to 26 percentage points on certain benchmarks. The researchers identified the cause as "spontaneous scale-dependent verbosity": larger models tend to over-elaborate their reasoning, which introduces errors and steers them away from correct answers. Forcing brevity can remove the mechanism that causes these deviations.
This suggests Caveman is not just a cost optimization. It is also a mitigation for a documented failure mode in large models.
Token economics
Output savings
Benchmark tests across 10 prompts compared three conditions: a standard baseline, a simple "be concise" prompt, and Caveman.
Caveman achieved a 45% reduction in output tokens compared to the baseline and a 39% reduction compared to simply asking the model to be concise.
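The reduction figures follow directly from per-run token averages. A sketch with hypothetical averages chosen to be consistent with the reported percentages:

```python
def reduction_pct(baseline_tokens: int, treated_tokens: int) -> float:
    """Percent fewer output tokens relative to a reference prompt."""
    return 100 * (baseline_tokens - treated_tokens) / baseline_tokens

# Hypothetical per-run output token averages (not the benchmark's raw data):
baseline, concise, caveman = 1000, 902, 550

vs_baseline = reduction_pct(baseline, caveman)  # ~45
vs_concise = reduction_pct(concise, caveman)    # ~39
```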
Input cost and single-shot penalty
The Caveman skill is a detailed system prompt. Sending it adds a significant number of tokens to each initial request. In the benchmark, the input cost for a Caveman session was around 4 cents versus a fraction of a cent for a bare baseline prompt. For a single isolated query, Caveman was approximately 10% more expensive than baseline once both input and output are accounted for.
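The penalty can be reproduced with rough numbers. The USD figures below are hypothetical, chosen only to mirror the benchmark's proportions:

```python
# Hypothetical per-call costs (USD) illustrating the single-shot penalty.
BASELINE_INPUT, CAVEMAN_INPUT = 0.002, 0.040   # the skill prompt is ~20x larger
BASELINE_OUTPUT = 0.069                        # verbose reply
CAVEMAN_OUTPUT = BASELINE_OUTPUT * 0.55        # ~45% fewer output tokens

baseline_total = BASELINE_INPUT + BASELINE_OUTPUT
caveman_total = CAVEMAN_INPUT + CAVEMAN_OUTPUT
penalty = caveman_total / baseline_total - 1   # ~10% more expensive
```

The output savings on one call are smaller than the extra input cost of shipping the whole skill prompt, so a lone query loses money.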
How caching changes the math
In conversational sessions, modern LLM APIs cache the initial system prompt. On subsequent turns in the same session, the Caveman skill is not re-billed at full price. The high initial input cost is amortized over the conversation, and the consistent output token savings dominate. Factoring in prompt caching, the total cost savings across a multi-turn session were approximately 39%.
Caveman is most cost-effective in interactive chat sessions and agents with multi-step reasoning, not in single one-off API calls.
Installation and setup
The skill is installed via the Vercel Skills package.
Once installed, it can be activated with /caveman in a compatible chat interface, or loaded as a system prompt in code.
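Loading it as a system prompt in code amounts to passing the skill.md text in the request's system field. A minimal sketch of the payload; the model name and question are illustrative, and actually sending it requires a client such as the Anthropic SDK:

```python
# Stand-in for the contents of skill.md:
skill = "RULES: drop articles. No pleasantries. [thing] [action] [reason]."

# Request payload for any chat API that accepts a separate system prompt.
request = {
    "model": "claude-sonnet-4-20250514",  # illustrative model name
    "max_tokens": 512,
    "system": skill,  # the Caveman skill text as the system prompt
    "messages": [{"role": "user", "content": "How does auth work in this app?"}],
}
# With the Anthropic SDK this would be sent via client.messages.create(**request).
```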
The skill.md structure
The skill.md file is the actual system prompt the model receives. It contains:
Rules: Explicit negative constraints. Drop articles ("a," "an," "the"), pleasantries ("Sure, I'd be happy to help"), hedging ("likely," "probably"), and verbose synonyms ("big" instead of "extensive," "fix" instead of "implement a solution for").
Pattern: A defined output structure: [thing] [action] [reason]. [next step]. This enforces a direct logical flow.
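Applied to a hypothetical build failure (the variable name is invented), the pattern reads:

```
Build fails -> missing env var. Add BETTER_STACK_KEY to .env.
```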
Intensity levels:
lite: Removes filler and hedging, keeps articles and full sentences. Tight but readable.
full: The default. Drops articles and writes in sentence fragments. Classic Caveman style.
ultra: Abbreviates common words (db for database), strips conjunctions, uses -> for causality throughout.
Wenyan mode: Uses Classical Chinese (文言文) for maximum information density per character. Not practical for most use cases but demonstrates the theoretical ceiling of token compression.
Related skills in the ecosystem
caveman-commit generates Git commit messages in Conventional Commits format. Messages are terse and exact, keeping version history scannable.
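An illustrative message in that style, for an invented change:

```
fix(auth): reject expired session tokens in middleware
```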
caveman-review produces one-line code review comments per finding: location, problem, and suggested fix without preamble.
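A hypothetical finding in that format:

```
auth.ts:42: missing expiry check -> stale sessions stay valid. Fix: validate exp claim.
```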
caveman-compress rewrites a natural-language document (such as a README.md or preferences file) in Caveman style. The compressed version can be fed back to an LLM as context in future sessions, reducing input tokens for prompts that rely on that document.
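A minimal sketch of the payoff, using word count as a crude stand-in for a real tokenizer; the sentences are invented examples:

```python
# Invented before/after pair for a preferences-style document.
original = (
    "This project uses a simulated authentication system. There is no backend, "
    "there are no passwords, and there is no real security."
)
compressed = "Auth = simulated. No backend, no passwords, no real security."

def approx_tokens(text: str) -> int:
    """Word count as a rough proxy; real measurement needs the provider's tokenizer."""
    return len(text.split())

saved = 1 - approx_tokens(compressed) / approx_tokens(original)  # roughly half
```

In practice the compressed document is prepended to future prompts, so the saving applies to input tokens on every request that includes it.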
Final thoughts
Caveman is most useful in two scenarios: interactive chat sessions, where prompt caching makes the economics favorable, and agent pipelines, where the model reasons across multiple steps and verbose intermediate output accumulates cost. For single-shot queries it adds more input cost than it saves on output, so it is worth confirming the expected usage pattern before reaching for it.
The accuracy angle from the brevity research paper is the more surprising argument for it. If enforcing conciseness reduces over-elaboration errors in large models, it has value beyond cost optimization. Whether that effect holds for a given task and model is worth testing empirically rather than assuming.
The repository and full skill documentation are available at github.com/juliusbrussee/caveman.