Caveman: Reducing LLM Output Tokens by up to 75% with a Prompt Skill
Caveman is a prompt skill that rewrites how an LLM structures its output. Instead of conversational paragraphs, the model produces dense, fact-based statements with articles dropped, hedging removed, and arrows used for causality instead of full sentences. The claimed output token reduction is up to 75%, though real-world results vary by query type.
The verbosity problem
LLMs are trained to be conversational. A response to "how does authentication work in this app?" might open with "Here's how auth works in this app" and spend several sentences framing the context before delivering the actual facts. Every word in that response is a token, and API pricing for both input and output is calculated in tokens.
For developers running many queries or building chat applications, this conversational padding adds up. The goal with Caveman is to have the model function as a pure information retrieval engine rather than a conversational partner.
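To get a feel for the scale, here is a back-of-the-envelope cost model. All prices and token counts below are hypothetical, not measured:

```python
# Rough cost model for conversational padding (all numbers hypothetical).
# Assume $15 per million output tokens and ~60 tokens of framing per reply.

PRICE_PER_OUTPUT_TOKEN = 15 / 1_000_000  # USD, hypothetical rate
PADDING_TOKENS_PER_REPLY = 60            # "Here's how..." style framing

def padding_cost(queries_per_day: int, days: int = 30) -> float:
    """Dollars spent purely on conversational scaffolding."""
    return queries_per_day * days * PADDING_TOKENS_PER_REPLY * PRICE_PER_OUTPUT_TOKEN

# A team running 5,000 queries/day pays for 9M padding tokens a month.
monthly = padding_cost(5_000)
```

Sixty tokens of framing is trivial in a single reply, but at thousands of queries a day it becomes millions of billed tokens a month.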
Before Caveman, a standard Claude response to a question about an auth system might read:
Here's how auth works in this app. This is a simulated authentication system — no backend, no passwords, no real security. It exists to demonstrate Better Stack RUM user tracking...
After Caveman, the same question produces something closer to:

App auth = simulated. No backend, no passwords, no real security. Exists to demo Better Stack RUM user tracking...

The technical content is identical. The conversational scaffolding is gone.
Why brevity may also improve accuracy
A research paper titled "Brevity Constraints Reverse Performance Hierarchies in Language Models" found that constraining large models to brief responses improved accuracy by up to 26 percentage points on certain benchmarks. The researchers identified the cause as "spontaneous scale-dependent verbosity": larger models tend to over-elaborate their reasoning, which introduces errors and steers them away from correct answers. Forcing brevity can remove the mechanism that causes these deviations.
This suggests Caveman is not just a cost optimization. It is also a mitigation for a documented failure mode in large models.
Token economics
Output savings
Benchmark tests across 10 prompts compared three conditions: a standard baseline, a simple "be concise" prompt, and Caveman.
Caveman achieved a 45% reduction in output tokens compared to the baseline and a 39% reduction compared to simply asking the model to be concise.
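The reduction figures follow directly from per-run token averages. A sketch with hypothetical averages chosen to be consistent with the reported percentages:

```python
def reduction_pct(baseline_tokens: int, treated_tokens: int) -> float:
    """Percent fewer output tokens relative to a reference prompt."""
    return 100 * (baseline_tokens - treated_tokens) / baseline_tokens

# Hypothetical per-run output token averages (not the benchmark's raw data):
baseline, concise, caveman = 1000, 902, 550

vs_baseline = reduction_pct(baseline, caveman)  # ~45
vs_concise = reduction_pct(concise, caveman)    # ~39
```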
Input cost and single-shot penalty
The Caveman skill is a detailed system prompt. Sending it adds a significant number of tokens to each initial request. In the benchmark, the input cost for a Caveman session was around 4 cents versus a fraction of a cent for a bare baseline prompt. For a single isolated query, Caveman was approximately 10% more expensive than baseline once both input and output are accounted for.
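The penalty can be reproduced with rough numbers. The USD figures below are hypothetical, chosen only to mirror the benchmark's proportions:

```python
# Hypothetical per-call costs (USD) illustrating the single-shot penalty.
BASELINE_INPUT, CAVEMAN_INPUT = 0.002, 0.040   # the skill prompt is ~20x larger
BASELINE_OUTPUT = 0.069                        # verbose reply
CAVEMAN_OUTPUT = BASELINE_OUTPUT * 0.55        # ~45% fewer output tokens

baseline_total = BASELINE_INPUT + BASELINE_OUTPUT
caveman_total = CAVEMAN_INPUT + CAVEMAN_OUTPUT
penalty = caveman_total / baseline_total - 1   # ~10% more expensive
```

The output savings on one call are smaller than the extra input cost of shipping the whole skill prompt, so a lone query loses money.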
How caching changes the math
In conversational sessions, modern LLM APIs cache the initial system prompt. On subsequent turns in the same session, the Caveman skill is not re-billed at full price. The high initial input cost is amortized over the conversation, and the consistent output token savings dominate. Factoring in prompt caching, the total cost savings across a multi-turn session were approximately 39%.
Caveman is most cost-effective in interactive chat sessions and agents with multi-step reasoning, not in single one-off API calls.
Installation and setup
The skill is installed via the Vercel Skills package.
Once installed, it can be activated with /caveman in a compatible chat interface, or loaded as a system prompt in code.
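Loading it as a system prompt in code amounts to passing the skill.md text in the request's system field. A minimal sketch of the payload; the model name and question are illustrative, and actually sending it requires a client such as the Anthropic SDK:

```python
# Stand-in for the contents of skill.md:
skill = "RULES: drop articles. No pleasantries. [thing] [action] [reason]."

# Request payload for any chat API that accepts a separate system prompt.
request = {
    "model": "claude-sonnet-4-20250514",  # illustrative model name
    "max_tokens": 512,
    "system": skill,  # the Caveman skill text as the system prompt
    "messages": [{"role": "user", "content": "How does auth work in this app?"}],
}
# With the Anthropic SDK this would be sent via client.messages.create(**request).
```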
The skill.md structure
The skill.md file is the actual system prompt the model receives. It contains:
Rules: Explicit negative constraints. Drop articles ("a," "an," "the"), pleasantries ("Sure, I'd be happy to help"), hedging ("likely," "probably"), and verbose synonyms ("big" instead of "extensive," "fix" instead of "implement a solution for").
Pattern: A defined output structure: [thing] [action] [reason]. [next step]. This enforces a direct logical flow.
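Applied to a hypothetical build failure (the variable name is invented), the pattern reads:

```
Build fails -> missing env var. Add BETTER_STACK_KEY to .env.
```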
Intensity levels:
lite: Removes filler and hedging, keeps articles and full sentences. Tight but readable.
full: The default. Drops articles and writes in sentence fragments. Classic Caveman style.
ultra: Abbreviates common words (db for database), strips conjunctions, uses -> for causality throughout.
Wenyan mode: Uses Classical Chinese (文言文) for maximum information density per character. Not practical for most use cases but demonstrates the theoretical ceiling of token compression.
Related skills in the ecosystem
caveman-commit generates Git commit messages in Conventional Commits format. Messages are terse and exact, keeping version history scannable.
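An illustrative message in that style, for an invented change:

```
fix(auth): reject expired session tokens in middleware
```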
caveman-review produces one-line code review comments per finding: location, problem, and suggested fix without preamble.
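A hypothetical finding in that format:

```
auth.ts:42: missing expiry check -> stale sessions stay valid. Fix: validate exp claim.
```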
caveman-compress rewrites a natural-language document (such as a README.md or preferences file) in Caveman style. The compressed version can be fed back to an LLM as context in future sessions, reducing input tokens for prompts that rely on that document.
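A minimal sketch of the payoff, using word count as a crude stand-in for a real tokenizer; the sentences are invented examples:

```python
# Invented before/after pair for a preferences-style document.
original = (
    "This project uses a simulated authentication system. There is no backend, "
    "there are no passwords, and there is no real security."
)
compressed = "Auth = simulated. No backend, no passwords, no real security."

def approx_tokens(text: str) -> int:
    """Word count as a rough proxy; real measurement needs the provider's tokenizer."""
    return len(text.split())

saved = 1 - approx_tokens(compressed) / approx_tokens(original)  # roughly half
```

In practice the compressed document is prepended to future prompts, so the saving applies to input tokens on every request that includes it.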
Final thoughts
Caveman is most useful in two scenarios: interactive chat sessions, where prompt caching makes the economics favorable, and agent pipelines, where the model reasons across multiple steps and verbose intermediate output accumulates cost. For single-shot queries it adds more input cost than it saves on output, so it is worth confirming the expected usage pattern before reaching for it.
The accuracy angle from the brevity research paper is the more surprising argument for it. If enforcing conciseness reduces over-elaboration errors in large models, it has value beyond cost optimization. Whether that effect holds for a given task and model is worth testing empirically rather than assuming.
The repository and full skill documentation are available at github.com/juliusbrussee/caveman.