Do AI Crawlers Actually Read Your llms.txt? I Checked My Own Server Logs
Key Takeaway: In May 2026, AI crawlers made 2,078 verified requests to tomdahne.com. Of those, 639 fetched robots.txt and 239 fetched sitemap.xml, but only 2 touched any AI-discovery convention file such as llms.txt or ai-discovery.json. The major AI crawlers read your site through robots.txt, sitemaps, and your HTML, not through llms.txt. The convention files are fetched almost entirely by SEO crawlers and AI-readiness audit tools.
There is a confident claim doing the rounds: put an llms.txt file on your site and the AI engines will read it, understand your content better, and cite you more often. It is repeated in dozens of guides, sold as the next big lever for "AI visibility," and bolted onto plugins and SaaS waitlists everywhere.
I build SEO and AI-discovery tools for a living, so I had a way to test the claim that most people writing about it do not have: I pulled my own raw server logs for May 2026 and counted. This was not a survey or a vendor's marketing deck, but the actual access log for my own site: every request, every crawler, every status code. The answer surprised me, and it points in the opposite direction to the hype. It also happens to be good news, just not the good news everyone is selling.
What I measured, and how
The log was a standard Apache combined-format access log for the full month: a little under 31,000 requests once I trimmed it to clean calendar days. The first job in any honest log read is to throw out the noise that would flatter the numbers. I removed my own IP, which alone accounted for about a fifth of all traffic — that is me hammering the admin console, testing pages, and running my own tools against the site. I also stripped out every request from my own AI Discovery Kit's user agent, because a tool I wrote crawling my own files and then counting those fetches as "AI crawler interest" would be circular nonsense. That distinction matters more than it sounds, and I will come back to it.
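A minimal sketch of that clean-up step, assuming the standard Apache combined log layout. It is not the script used for this analysis, and the IP address and user-agent token are hypothetical placeholders:

```python
import re

# Apache "combined" format:
# host ident user [time] "request" status bytes "referer" "user-agent"
COMBINED = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

OWN_IPS = {"203.0.113.7"}         # hypothetical: your own address(es)
OWN_AGENTS = ("MyDiscoveryKit",)  # hypothetical: your own tool's UA token


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not parse."""
    match = COMBINED.match(line)
    if match is None:
        return None
    record = match.groupdict()
    parts = record["request"].split()
    # The request field looks like: GET /path HTTP/1.1
    record["path"] = parts[1] if len(parts) >= 2 else ""
    return record


def is_third_party(record):
    """True when the request came from neither your own IP nor your own tooling."""
    if record["ip"] in OWN_IPS:
        return False
    agent = record["agent"].lower()
    return not any(token.lower() in agent for token in OWN_AGENTS)


def third_party_records(lines):
    """Yield parsed records with self-generated traffic removed."""
    for line in lines:
        record = parse_line(line)
        if record is not None and is_third_party(record):
            yield record
```

Lines that do not match the pattern, including any with escaped quotes inside the user-agent field, are skipped here rather than repaired.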
What remained, after removing me and my own tooling, was genuine third-party traffic: real search and AI crawlers, SEO backlink bots, monitoring services, scanners, and human visitors. From that, I isolated every request made by a verified AI crawler — ClaudeBot, GPTBot, OAI-SearchBot, ChatGPT-User, PerplexityBot, Bytespider, Applebot, Google's AI agents, CCBot, and the rest — and looked at exactly what they asked for.
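As an illustration of the isolation step, here is a sketch that labels a request by the crawler token in its user-agent string. The token list covers only the crawlers named in this article. Substring matching on its own does not verify anything, because a user agent is trivially spoofed; confirming the source against each vendor's published IP ranges or reverse DNS is a separate step that this sketch does not attempt.

```python
# Tokens taken from the crawler names mentioned in this article.
AI_CRAWLER_TOKENS = (
    "ClaudeBot",
    "Claude-User",
    "GPTBot",
    "OAI-SearchBot",
    "ChatGPT-User",
    "PerplexityBot",
    "Bytespider",
    "Applebot",
    "CCBot",
)


def ai_crawler_name(user_agent):
    """Return the matching crawler token, or None for non-AI traffic."""
    ua = user_agent.lower()
    for token in AI_CRAWLER_TOKENS:
        if token.lower() in ua:
            return token
    return None
```

A call such as `ai_crawler_name("Mozilla/5.0 (compatible; AhrefsBot/7.0)")` returns `None`, which keeps SEO backlink bots out of the AI-crawler count.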
The headline number
In May, those AI crawlers made 2,078 requests to my site. Here is where they went:
- 639 requests to robots.txt. Led by ClaudeBot (241) and OAI-SearchBot (240), then Applebot, Bytespider, CCBot, Claude-User and PerplexityBot. They check it constantly.
- 239 requests to sitemap.xml. The crawlers use it to discover what pages exist.
- 1,437 requests to ordinary pages and assets — my guides, product pages, stylesheets and scripts. In other words, they crawled the actual HTML.
- 2 requests to any AI-discovery convention file. Two. Out of 2,078. Both were CCBot, and CCBot is a general web-archive crawler, not an answer engine.
Read that last line again. The files the entire "AI visibility" industry is telling you to create — llms.txt, ai-discovery.json, knowledge-graph.json, ai-sitemap.xml — were fetched by genuine AI crawlers a grand total of two times in a month, against more than two thousand requests to the files those same crawlers actually use. That is not a rounding error in favour of the convention files. It is a rounding error against them.
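For anyone who wants to reproduce this breakdown on their own logs, the bucketing can be sketched as below. The convention-file names are the four listed above; the function and bucket names are illustrative. Measured against the 2,078 total, the quoted counts work out to roughly 30.8% for robots.txt, 11.5% for sitemap.xml, and about 0.1% for the convention files.

```python
from collections import Counter
from urllib.parse import urlsplit

CONVENTION_FILES = {
    "llms.txt",
    "ai-discovery.json",
    "knowledge-graph.json",
    "ai-sitemap.xml",
}


def bucket(path):
    """Classify a requested path by its final filename."""
    name = urlsplit(path).path.rsplit("/", 1)[-1].lower()
    if name in CONVENTION_FILES:
        return "convention"
    if name == "robots.txt":
        return "robots"
    if name == "sitemap.xml":
        return "sitemap"
    return "page_or_asset"


def tally(paths):
    """Count requests per bucket."""
    return Counter(bucket(p) for p in paths)


def share(count, total):
    """Percentage of total, guarding against an empty log."""
    return 100.0 * count / total if total else 0.0
```

Matching on the exact filename rather than a suffix keeps ai-sitemap.xml from being miscounted as an ordinary sitemap fetch.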
So who is fetching the convention files?
The files do get requested — just not by the crawlers the marketing implies. When I looked at who fetched llms.txt specifically, the list was SeznamBot, Ahrefs, SERanking, WebCEO, and a dedicated AI-readiness scanner called CitabilityBot, plus a single CCBot hit. ai-discovery.json was much the same story, topped by Bingbot, then Ahrefs, Seznam, Semrush and the audit scanners. Not one fetch from ClaudeBot, GPTBot, OAI-SearchBot or PerplexityBot across either file, in either the May data or, when I checked back, the April data before it.
So the convention files are being read by two groups: ordinary SEO crawlers that sweep every URL on a site whether it is useful to them or not, and a small but real category of tools whose entire job is to scan for these files and grade you on whether you have them. That second group matters, and I will get to why. But neither group is "an AI engine reading your curated content map so it can cite you better." That specific promise simply is not happening on my site.
Note: This is one site's data, and my traffic is modest. But the pattern lines up with much larger studies. SE Ranking analysed around 300,000 domains and found roughly 10% adoption of llms.txt with no measurable correlation to AI citations. One AI-visibility firm, Limy, monitored over 500 million AI-bot visits across 90 days and found only 408 that targeted llms.txt directly. My "2 out of 2,078" is the same near-zero ratio, just at a scale I can verify line by line.
Why this is structural, not a phase
It would be easy to assume this is early days and the crawlers will catch up. I do not think they will, at least not on the current trajectory, and the reason is worth understanding because it changes what you should spend your time on.
Google, OpenAI and Anthropic have spent enormous sums building pipelines that parse HTML, strip away navigation and boilerplate, and extract the meaningful content of a page reliably and at scale. That machinery already works. An llms.txt file asks them to do something different: fetch an extra file, trust a site owner's hand-written summary of their own pages, and then verify it against the real HTML anyway because a self-declared content map is trivially easy to game. That is more steps, more latency, more cost, and more spam risk, in exchange for information the crawler can already get directly from the page. From the crawler's point of view there is no reason to switch.
For access control, the legitimate question of "which AI bots may use my content", there has been a working convention since 1994, formalised by the IETF as RFC 9309 in 2022, and it is robots.txt. The major AI crawlers respect it, and OpenAI's own documentation points site owners to robots.txt, not llms.txt, for crawler control. On the standards side, llms.txt remains a community convention with no backing from any recognised standards body; an IETF effort has been discussed but has not materialised. Google has been blunt on the record: in July 2025 it confirmed it does not support llms.txt and is not planning to, with one of its search engineers comparing the file to the long-discredited keywords meta tag.
The one genuine, growing use case is developer tooling. AI coding assistants such as Cursor and GitHub Copilot do fetch llms.txt in real time to pull the right documentation pages with less wasted context, and Anthropic ships an llms.txt for its own docs largely for that IDE-agent purpose. If your site is API documentation that a coding agent pulls live, the file earns its place. If your site is a normal business, blog, or product site, that audience mostly is not yours.
The part that is genuinely good news
If the story stopped at "the convention files do nothing," that would be deflating. It does not. The same logs that debunk the llms.txt pitch contain a much more useful, and more honest, finding: the AI crawlers are reading my site heavily, and the volume is climbing.
Total verified AI-crawler traffic rose from about 1,455 requests in April to 2,078 in May — up roughly 43% in a single month. Within that, the standout was ChatGPT-User, which more than doubled, from 88 to 225. That user agent is not generic background crawling; it fires when a person asks ChatGPT something and the model goes and fetches a page in real time to answer. GPTBot and OAI-SearchBot were both up as well. The engines are ingesting my content — they are just doing it through the front door, by crawling the HTML, after discovering it via robots.txt and the sitemap.
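The growth figures are plain percentage changes, and they can be checked from the counts quoted above:

```python
def pct_change(before, after):
    """Percentage change from one period to the next."""
    return 100.0 * (after - before) / before


total_growth = round(pct_change(1455, 2078), 1)   # 42.8
chatgpt_user_growth = round(pct_change(88, 225), 1)  # 155.7
```

An increase of about 156% for ChatGPT-User corresponds to a little over 2.5 times the April volume.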
And it converts. Buried in the same log is a single, unambiguous referral: a real human who arrived on my paid Site Auditor page directly from a ChatGPT answer, then loaded the screenshots and watched the demo video. One visitor is not a trend, but it is proof of the mechanism. An AI engine cited my page, a person followed the citation, and that person engaged. That is the channel that actually works, and notice that it had nothing to do with llms.txt and everything to do with crawlable content and clean markup.
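Finding referrals like this in a log can be sketched as follows. The host list and the `utm_source` check are assumptions about how such visits typically show up, not a description of how this particular referral was identified, so adjust them to what your own logs contain.

```python
from urllib.parse import parse_qs, urlsplit

# Assumption: hosts an AI-answer referral might arrive from.
AI_REFERRER_HOSTS = {"chatgpt.com", "chat.openai.com"}


def is_ai_referral(referer, path=""):
    """True if the Referer header or a utm_source parameter names an AI host."""
    host = (urlsplit(referer).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    if host in AI_REFERRER_HOSTS:
        return True
    # Some links carry the source in the landing URL instead of the Referer.
    sources = parse_qs(urlsplit(path).query).get("utm_source", [])
    return any(value.lower() in AI_REFERRER_HOSTS for value in sources)
```

Because crawlers rarely send a Referer header, a hit here generally points to a person following a link rather than a bot fetching a page.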
So should you bother with the files at all?
Yes — but for the right reasons, and with the right expectations. There are two honest justifications for publishing AI-discovery files in 2026, and neither is "AI engines will read them and cite you."
The first is the audit ecosystem. Remember that second group of fetchers — the AI-readiness scanners like CitabilityBot. A growing number of tools, scorecards and consultants now grade sites on whether they have these files. If a prospect, a partner, or a procurement team runs your domain through one of those scanners, having a well-formed ai-discovery.json and a tidy llms.txt is the difference between a green check and a red cross on a report you do not control. That is reputational, not technical, but it is real.
The second is cheap insurance. These files are small, static, and safe. They cost nothing to serve and they cannot hurt you if they are well-formed. If the convention is ever standardised and the major engines do start consuming it, the sites that already publish clean files are forward-compatible on day one. You are not betting the farm; you are leaving a light on. The mistake is not publishing the files — it is reorganising your whole strategy around a promise the data does not support.
Tip: Spend your effort where the logs point. The signals the AI crawlers genuinely consume are your robots.txt AI-bot rules, your sitemap, and your structured data. Get those right first, treat the discovery files as a static checkbox, and put the time you save into content worth citing.
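One way to sanity-check robots.txt AI-bot rules before uploading them is Python's standard-library parser. The policy below (block GPTBot, allow OAI-SearchBot and ClaudeBot, keep everyone out of a hypothetical /admin/ path) and the example.com URLs are purely illustrative, not a recommendation for any particular site.

```python
from urllib.robotparser import RobotFileParser

# Illustrative policy only; choose your own rules per crawler.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
"""


def build_parser(text):
    """Parse robots.txt content held in a string, without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = build_parser(ROBOTS_TXT)
gptbot_allowed = rules.can_fetch("GPTBot", "https://example.com/guides/")        # False
search_allowed = rules.can_fetch("OAI-SearchBot", "https://example.com/guides/")  # True
admin_allowed = rules.can_fetch("SomeOtherBot", "https://example.com/admin/x")    # False
```

Real crawlers may interpret edge cases differently from this parser, so treat the result as a syntax and intent check rather than a guarantee of crawler behaviour.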
What I actually recommend
Set up the things the crawlers use: explicit AI-bot rules in robots.txt so you control who may train on and cite your content, a clean sitemap so they can find every page, and accurate schema.org structured data so that when they do parse your HTML — which is what they are doing — they parse it correctly. Then generate the discovery files as a one-time, low-effort checkbox, review them, and upload them. Do not build a strategy on top of them.
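For the structured-data step, a small sketch that emits a schema.org Article block as JSON-LD. The author name, date, and URL are placeholders, and the property set is a bare minimum chosen for illustration; the right type and properties depend on the page.

```python
import json


def article_jsonld(headline, author_name, date_published, url):
    """Build a <script> tag containing schema.org Article markup."""
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "author": {"@type": "Person", "name": author_name},
        "datePublished": date_published,
        "mainEntityOfPage": url,
    }
    # Escape "</" so a value can never close the script tag early.
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + payload + "\n</script>"


snippet = article_jsonld(
    headline="Do AI Crawlers Actually Read Your llms.txt?",
    author_name="Example Author",       # placeholder
    date_published="2026-06-01",        # placeholder
    url="https://example.com/guides/llms-txt-server-logs/",  # placeholder
)
```

The resulting tag goes in the page's HTML, which is the part the logs show the crawlers actually fetching.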
That mix is exactly why I built Tom's AI Discovery Kit. It is a free, portable Windows tool that crawls your site, checks which discovery and control files you already have, scores your AI-readiness, and generates a clean package: robots.txt AI rules, a sitemap, a structured data starter, and the discovery files, each with an honest note on what it is actually worth. No account, no email, nothing uploaded; it runs locally and the files are yours to review before they go anywhere near your server.
If the audit turns up deeper issues, such as broken structure, missing schema, or crawl problems the AI engines will trip over too, that is where my paid tool earns its keep. Tom's Site Auditor runs a full SEO audit of 30-plus checks with fix guidance for every issue, free for 7 days. Start with the free kit, and step up only if the scan tells you there is something worth fixing.