The Agent Reading Test: Benchmarking AI with Canary Tokens
The Agent Reading Test is a benchmark created by Dachary Carey to measure how accurately AI agents retrieve and process web content. It works by embedding unique strings called canary tokens at locations that an agent's fetch pipeline can only reach if it successfully handles a specific technical challenge. The test consists of ten pages, each targeting a distinct failure mode, and produces a score out of 20 points with a per-test breakdown showing exactly where each agent succeeds or struggles.
Why AI agents fail to read web pages accurately
When an AI agent fetches a URL, it does not experience the page the way a browser does. A browser executes HTML, CSS, and JavaScript to produce the fully rendered visual experience. An agent's fetch pipeline often works with raw or partially processed source code, which means several common web patterns can silently obstruct its access to content.
Client-side rendering and single-page applications load a minimal HTML shell on initial fetch. All actual content is injected by JavaScript after page load. An agent that cannot execute JavaScript sees an empty shell and may conclude the page has no content.
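This failure mode is easy to reproduce without a network or a browser. The sketch below (using a hypothetical shell document, not the actual test page) strips tags the way a naive, non-JavaScript pipeline might, and recovers no readable text at all:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text the way a naive, non-JS pipeline might."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = False  # True while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

# A minimal SPA shell: all content arrives later via app.js.
shell = """<html><body>
<div id="app"></div>
<script src="app.js"></script>
</body></html>"""

parser = TextExtractor()
parser.feed(shell)
print(parser.chunks)  # [] -- an agent without a JS engine sees nothing
```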
Boilerplate burial occurs when thousands of lines of inline CSS precede the article content. An agent with a limited context window for its initial fetch may exhaust that window on the CSS and truncate the page before reaching the actual text.
Tabbed and interactive content hides information in secondary tabs, accordions, or other components. A pipeline that scrapes only the initially visible tab misses everything in the others.
Complex redirects and content negotiation can cause an agent to access the wrong content version or fail to follow a redirect to the final destination, depending on how its pipeline handles HTTP headers.
These are not hypothetical edge cases. They affect a large portion of the modern web and cause agents to produce confident-sounding summaries based on incomplete information.
Canary tokens
The test's core mechanism is the canary token, a unique string embedded at a specific location on each test page. The name references the coal mine canary: a sensitive indicator of whether a dangerous condition exists. If an agent reports a token, it proves the pipeline successfully navigated that specific challenge. If the token is missing from the report, the failure mode is confirmed.
Each of the ten test pages places tokens at locations that require overcoming a specific obstacle, such as reaching the end of a 150,000-character document, loading content injected by JavaScript, or requesting the Markdown version of a page rather than the HTML version.
The ten reading challenges
Truncation places canary tokens at the 10K, 40K, 75K, 100K, and 130K character marks of a 150,000-character document. The test reveals the exact character limit at which a pipeline stops processing content.
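The mechanism can be sketched offline. This toy mirrors the truncation test's layout with made-up token names and an assumed 50K-character fetch limit; which tokens survive reveals where the cutoff sits:

```python
# Hypothetical document mirroring the truncation test's layout:
# canary tokens at the 10K, 40K, 75K, 100K, and 130K character marks.
positions = [10_000, 40_000, 75_000, 100_000, 130_000]
doc = list("x" * 150_000)
for pos in positions:
    token = f"CANARY-{pos // 1000}K"
    doc[pos:pos + len(token)] = token  # overwrite in place, same length
doc = "".join(doc)

# A pipeline that truncates at, say, 50K characters (an assumed limit)
# keeps only the earliest tokens -- exposing exactly where it stops.
fetched = doc[:50_000]
found = [f"CANARY-{p // 1000}K" for p in positions
         if f"CANARY-{p // 1000}K" in fetched]
print(found)  # ['CANARY-10K', 'CANARY-40K']
```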
Boilerplate burial puts meaningful content after 80,000 characters of inline CSS. It tests whether a pipeline distinguishes navigable content from styling boilerplate or gives up before reaching the text.
SPA shell serves an empty HTML shell. The canary token only appears after JavaScript executes. This test directly assesses whether the pipeline can render client-side content at all.
Tabbed content organizes content into eight language tabs (Python, JavaScript, Ruby, and others) with canary tokens in tabs 1, 4, and 8. It measures whether a pipeline reads all serialized tab content or only the default visible tab.
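The distinction between "serialized in the HTML" and "initially visible" is what this test probes. In the toy markup below (hypothetical labels and token names, not the test's actual markup), all tabs are present in the source, but a scrape that keeps only unhidden elements sees just the first:

```python
import re

# Toy markup: every tab is serialized in the HTML, but only tab 1 is visible.
html = """
<div class="tab" data-tab="1">Python example CANARY-TAB-1</div>
<div class="tab" data-tab="4" style="display:none">JavaScript example CANARY-TAB-4</div>
<div class="tab" data-tab="8" style="display:none">Ruby example CANARY-TAB-8</div>
"""

# A visible-only scrape matches just the divs without a display:none style.
visible_only = re.findall(r'<div class="tab" data-tab="\d+">(.*?)</div>', html)
# A pipeline reading the full serialized source finds every token.
all_tabs = re.findall(r'CANARY-TAB-\d+', html)
print(visible_only)  # ['Python example CANARY-TAB-1']
print(all_tabs)      # ['CANARY-TAB-1', 'CANARY-TAB-4', 'CANARY-TAB-8']
```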
Soft 404 returns an HTTP 200 status with a "Page not found" message body. It tests whether an agent recognizes the semantic meaning of an error page or blindly trusts the HTTP status code.
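Detecting a soft 404 requires comparing what the status code claims with what the body says. A minimal heuristic might look like the following; the phrase list is illustrative, not the test's actual wording:

```python
def looks_like_soft_404(status: int, body: str) -> bool:
    """Flag responses whose status says OK but whose body says otherwise.
    The phrase list is a stand-in; real detectors use richer signals."""
    error_phrases = ("page not found", "does not exist", "no longer available")
    return status == 200 and any(p in body.lower() for p in error_phrases)

print(looks_like_soft_404(200, "<h1>Page Not Found</h1>"))  # True: soft 404
print(looks_like_soft_404(200, "<h1>Welcome</h1>"))         # False: real content
print(looks_like_soft_404(404, "Page not found"))           # False: honest 404
```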
Broken code fence contains unclosed Markdown code blocks. A canary token appears after the broken fence. Markdown parsers that interpret everything after an unclosed fence as code will swallow the token along with the rest of the page content.
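The swallowing behavior follows directly from how fence state is tracked. This naive splitter (a sketch of the common toggle-on-every-fence approach, not any particular parser) classifies everything after the unclosed fence as code:

```python
def split_prose_and_code(markdown: str):
    """Naive fence tracking: toggle state on every ``` line, as many
    simple Markdown parsers do. An unclosed fence never toggles back."""
    prose, code, in_fence = [], [], False
    for line in markdown.splitlines():
        if line.strip().startswith("```"):
            in_fence = not in_fence
            continue
        (code if in_fence else prose).append(line)
    return prose, code

page = "Intro text\n```python\nprint('hi')\n# fence never closed\nCANARY-FENCE appears here"
prose, code = split_prose_and_code(page)
print("CANARY-FENCE appears here" in code)  # True: the token was swallowed as code
```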
Content negotiation makes the canary token available only in the Markdown version of the page, not the HTML version. It tests whether the pipeline sends appropriate Accept headers to request the format that contains the relevant content.
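The server-side half of content negotiation can be sketched as a function of the request's Accept header. This toy (illustrative bodies and token name; the real test's logic is not published) serves the token only when `text/markdown` is requested:

```python
def negotiate(accept_header: str) -> str:
    """Toy server-side negotiation: serve Markdown only when the client's
    Accept header asks for it. Bodies and token are illustrative."""
    if "text/markdown" in accept_header:
        return "# Docs\n\nCANARY-MD-TOKEN"
    return "<html><body><h1>Docs</h1></body></html>"

html_body = negotiate("text/html")
md_body = negotiate("text/markdown, text/html;q=0.9")
print("CANARY-MD-TOKEN" in html_body)  # False: the HTML version lacks the token
print("CANARY-MD-TOKEN" in md_body)    # True: only the Markdown version has it
```

A pipeline passes only if its requests carry an Accept header listing `text/markdown`; a client that always sends browser-style defaults never sees the token.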
Cross-host redirect issues a 301 redirect to a different hostname. The canary token is on the destination page. Many automated pipelines refuse to follow cross-host redirects for security reasons, causing them to miss the final content.
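The refusal behavior can be modeled without a network. This sketch walks a table of canned responses (hypothetical hostnames and token) and stops at a cross-host hop unless explicitly allowed:

```python
from urllib.parse import urlparse

def follow(url, responses, allow_cross_host=False):
    """Walk a chain of (status, payload) responses, refusing a 301 that
    changes hostname unless allowed. Purely illustrative."""
    while True:
        status, payload = responses[url]
        if status != 301:
            return payload  # final content
        if (urlparse(payload).hostname != urlparse(url).hostname
                and not allow_cross_host):
            return None  # refused: the destination page is never fetched
        url = payload

responses = {
    "https://a.example/page": (301, "https://b.example/final"),
    "https://b.example/final": (200, "CANARY-REDIRECT"),
}
print(follow("https://a.example/page", responses))                         # None
print(follow("https://a.example/page", responses, allow_cross_host=True))  # CANARY-REDIRECT
```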
Header quality contains three sections with identical generic headers (Step 1, Step 2, Step 3) documenting different cloud platforms (AWS, GCP, Azure). Canary tokens are in each section. It tests whether an agent can use body content to understand context rather than relying on structural headers.
Content start buries article content after substantial navigation chrome such as sidebars and header menus. It tests whether the pipeline correctly identifies where the main content begins.
Score inflation
A complication the test's creator discovered is score inflation: an agent reports finding a canary token through a workaround rather than because its primary pipeline actually handled the challenge.
In the cross-host redirect test, an agent whose pipeline failed to follow the redirect automatically might notice the redirect in the HTTP response header, manually issue a second request to the target URL, find the canary, and report it as found. The point is claimed but the pipeline's redirect behavior was never actually tested.
In the SPA shell test, an agent that cannot render JavaScript might inspect the HTML source, locate the <script src="app.js"> tag, fetch the JavaScript file directly, find the canary string in the source code, and report success. The page content was never rendered; the agent found the token through source inspection.
Both workarounds demonstrate resourcefulness, but they mask actual pipeline weaknesses. A score achieved partly through workarounds overstates the pipeline's real capability for typical browsing tasks. The test was later updated to include instructions that discourage this behavior, but the phenomenon remains worth understanding when interpreting results.
Running the test
Starting the test
Navigate to https://agentreadingtest.com/start/ and give an agent with web browsing access the following prompt:
Go to https://agentreadingtest.com/start/ and follow the instructions.
The agent reads the instructions on that page, which direct it to visit each of the ten test pages and a results page, collecting all canary tokens it can find.
Extracting the canary list
After the agent completes all pages, its output will include a consolidated comma-separated list of canary strings. The documentation questions and page summaries in the output can be ignored; the canary list is what gets scored.
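Before scoring, the list may need light cleanup, since agents vary in spacing and occasionally repeat a token. A small normalization pass (hypothetical token names shown) handles both:

```python
# Hypothetical agent output: inconsistent spacing and one duplicate.
raw = "CANARY-10K, CANARY-40K, CANARY-MD-TOKEN,CANARY-TAB-1, CANARY-10K"

tokens = []
for item in raw.split(","):
    item = item.strip()
    if item and item not in tokens:  # trim whitespace, drop duplicates
        tokens.append(item)

print(", ".join(tokens))  # CANARY-10K, CANARY-40K, CANARY-MD-TOKEN, CANARY-TAB-1
```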
Scoring
Copy the canary string list and navigate to the Score Your Results section on the Agent Reading Test site.
Paste the list and submit. The site calculates a score out of 20 and shows a per-test breakdown indicating which challenges passed and which failed.
Interpreting results
A demonstration with the Kimi 2.5 agent produced a score of 13 out of 20. The breakdown revealed failures in the tabbed content test (tokens in later tabs were not found) and the content negotiation test (the Markdown version was not requested).
The per-test breakdown is more useful than the total score. It identifies specific structural patterns the agent cannot handle, which directly affects which types of web pages it can reliably read for research or summarization tasks.
Final thoughts
The Agent Reading Test provides a diagnostic picture of where an agent's fetch pipeline breaks down. For developers or teams relying on agents for web-based research, documentation reading, or content extraction, the breakdown view identifies which types of content to treat with skepticism or to verify through alternative means.
The benchmark and scoring tool are available at agentreadingtest.com.