Tom's AI Discovery Kit is a free Windows tool that crawls your site, scores its AI readiness, and generates the discovery and crawler-control files that actually matter — robots.txt AI rules, sitemaps, structured data, and the llms.txt family included. No account, no cloud, nothing uploaded.
Scan a site and see exactly which AI-discovery and crawler-control files it has, which it is missing, and a single readiness score.
It audits the signals AI crawlers genuinely consume, scores where you stand, and generates a clean, reviewable package you can upload as-is.
A light crawl checks your site against the AI-discovery and crawler-control endpoints and gives you one clear percentage — so you know where you stand before you change anything.
Probes the standard endpoints — robots.txt AI rules, sitemap.xml, ai-discovery.json, llms.txt and the rest — and tells you exactly which you already have, which are missing, and which are misconfigured.
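To make the audit concrete, here is a minimal Python sketch of an endpoint probe and a readiness percentage. It is an illustration, not the kit's actual code: the endpoint list, the mapping of 401/403 to "blocked", and the unweighted score are all assumptions, and `probe`, `classify_status` and `readiness_percent` are hypothetical names.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Optional
from urllib.parse import urljoin

# Assumed endpoint list for illustration; the kit's real list may differ.
ENDPOINTS = [
    "/robots.txt",
    "/sitemap.xml",
    "/ai-discovery.json",
    "/llms.txt",
    "/.well-known/tdmrep.json",
]


def http_status(url: str) -> Optional[int]:
    """Return the HTTP status for a HEAD request, or None if unreachable."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx and 5xx responses still carry a status code
    except OSError:
        return None  # DNS failure, refused connection, timeout


def classify_status(code: Optional[int]) -> str:
    """Map a status code to found / blocked / missing (assumed mapping)."""
    if code is not None and 200 <= code < 300:
        return "found"
    if code in (401, 403):
        return "blocked"
    return "missing"


def probe(base_url: str,
          fetch_status: Callable[[str], Optional[int]] = http_status) -> Dict[str, str]:
    """Check every endpoint and return {path: status label}."""
    return {
        path: classify_status(fetch_status(urljoin(base_url, path)))
        for path in ENDPOINTS
    }


def readiness_percent(statuses: Dict[str, str]) -> int:
    """Unweighted share of endpoints found, as a whole percentage."""
    if not statuses:
        return 0
    found = sum(1 for label in statuses.values() if label == "found")
    return round(100 * found / len(statuses))
```

Passing the fetcher in as a parameter keeps the sketch testable without touching the network; calling `probe("https://example.com")` with the default would issue real HEAD requests.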
Generates explicit allow/deny rules for the AI crawlers that actually respect robots.txt — GPTBot, ClaudeBot, Google-Extended, PerplexityBot and more. Merges into an existing file additively, never overwriting your own blocks.
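An additive merge can be sketched in a few lines of Python. This is one possible approach under stated assumptions, not the kit's implementation: `merge_ai_rules` is a hypothetical helper that appends a block only for crawlers your file does not already name, so any block you wrote yourself is left untouched.

```python
import re
from typing import Sequence, Set

# Crawler names taken from the text above; extend as needed.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "Google-Extended", "PerplexityBot"]


def existing_agents(robots_txt: str) -> Set[str]:
    """Return lower-cased user-agent names already declared in the file."""
    agents = set()
    for line in robots_txt.splitlines():
        # Commented-out lines start with '#', so they never match here.
        match = re.match(r"\s*user-agent\s*:\s*(\S+)", line, re.IGNORECASE)
        if match:
            agents.add(match.group(1).lower())
    return agents


def merge_ai_rules(robots_txt: str, allow: bool,
                   crawlers: Sequence[str] = AI_CRAWLERS) -> str:
    """Append allow/deny blocks for crawlers the file does not mention yet."""
    known = existing_agents(robots_txt)
    directive = "Allow: /" if allow else "Disallow: /"
    blocks = [
        f"User-agent: {name}\n{directive}"
        for name in crawlers
        if name.lower() not in known
    ]
    if not blocks:
        return robots_txt  # nothing to add, return the file unchanged
    base = robots_txt.rstrip("\n")
    parts = ([base] if base else []) + blocks
    return "\n\n".join(parts) + "\n"
```

Running the merge twice gives the same result as running it once, which makes the output safe to diff against your live robots.txt.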
Builds a schema.org structured-data starter and a knowledge-graph.json from what it finds on your pages — the markup AI engines actually parse when they read your HTML.
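For a sense of what a schema.org starter looks like, here is a hedged Python sketch that emits a JSON-LD script tag with an Organization and a WebSite node. The function name, the `#org` and `#website` identifiers and the choice of types are illustrative assumptions; the layout of the kit's own knowledge-graph.json is not documented here and is not reproduced.

```python
import json


def jsonld_starter(site_name: str, site_url: str, org_name: str) -> str:
    """Build a JSON-LD <script> block linking a WebSite to its publisher."""
    org_id = site_url + "#org"          # hypothetical identifier scheme
    data = {
        "@context": "https://schema.org",
        "@graph": [
            {
                "@type": "Organization",
                "@id": org_id,
                "name": org_name,
                "url": site_url,
            },
            {
                "@type": "WebSite",
                "@id": site_url + "#website",
                "name": site_name,
                "url": site_url,
                "publisher": {"@id": org_id},
            },
        ],
    }
    body = json.dumps(data, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"
```

The returned string is meant to be pasted into a page's head once you have checked the names and URLs.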
Declares your text-and-data-mining position in a standard tdmrep.json, tied to your training choice so robots.txt and tdmrep agree. Your stance, stated unambiguously.
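Keeping the two files in agreement means deriving both from one setting. The Python sketch below shows the idea under assumptions: `tdm_stance` is a hypothetical helper, GPTBot stands in for whichever training crawlers you list, and the tdmrep.json shape (a JSON array of rules with `location` and `tdm-reservation`, where 1 means rights are reserved) follows my reading of the TDM Reservation Protocol, so check it against the specification before relying on it.

```python
import json
from typing import Tuple


def tdm_stance(allow_training: bool) -> Tuple[str, str]:
    """Derive tdmrep.json content and a robots.txt block from one choice."""
    # tdm-reservation: 1 reserves text-and-data-mining rights, 0 does not.
    reservation = 0 if allow_training else 1
    tdmrep = json.dumps(
        [{"location": "/", "tdm-reservation": reservation}],
        indent=2,
    )
    directive = "Allow: /" if allow_training else "Disallow: /"
    # GPTBot is only an example; repeat the block per training crawler.
    robots_block = f"User-agent: GPTBot\n{directive}\n"
    return tdmrep, robots_block
```

Because both outputs come from the same boolean, they cannot drift apart the way two hand-edited files can.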
Produces a clean sitemap.xml and an ai-sitemap.xml from the crawl, with per-URL last-modified dates. The sitemap is the discovery file AI crawlers fetch most after robots.txt.
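A standard sitemap with per-URL dates is simple to build. The Python sketch below uses only the standard library and the sitemaps.org namespace; `build_sitemap` is a hypothetical name, dates are assumed to arrive as YYYY-MM-DD strings, and the format of ai-sitemap.xml is not covered because it is not described here.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, Optional, Tuple

NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages: Iterable[Tuple[str, Optional[str]]]) -> str:
    """Build sitemap XML from (url, last_modified) pairs.

    last_modified is a YYYY-MM-DD string, or None to omit <lastmod>.
    """
    urlset = ET.Element("urlset", xmlns=NS)
    for loc, lastmod in pages:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        if lastmod:
            ET.SubElement(url, "lastmod").text = lastmod
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body + "\n"
```

ElementTree escapes characters such as `&` in URLs for you, which is the usual source of invalid hand-written sitemaps.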
Generates llms.txt, llms-lite.txt, ai-discovery.json and ai-discovery.md too. They are included as forward-looking extras, with a plain note on what current adoption data actually shows.
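For readers who have not seen the convention, an llms.txt file is plain Markdown: a title, a one-line summary in a blockquote, then sections of links. The Python sketch below follows that commonly described layout; `build_llms_txt` is a hypothetical helper and the exact output of the kit may differ.

```python
from typing import Dict, List, Tuple


def build_llms_txt(site_name: str, summary: str,
                   sections: Dict[str, List[Tuple[str, str, str]]]) -> str:
    """Render an llms.txt body from sections of (title, url, note) links."""
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            entry = f"- [{title}]({url})"
            if note:
                entry += f": {note}"  # the note is optional
            lines.append(entry)
        lines.append("")
    # End with exactly one trailing newline.
    return "\n".join(lines).rstrip("\n") + "\n"
```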
Labels every page it finds — product, article, guide, utility, legal, or index — so the generated files describe your site accurately instead of treating every URL the same.
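One simple way to label pages is by keywords in the URL path. The Python sketch below is a rough illustration only: the keyword lists, the rule order and the fallback label are my assumptions, and the kit may well use page content or other signals instead.

```python
from urllib.parse import urlparse

# Assumed keyword rules, checked in order; first match wins.
RULES = [
    ("legal", ("privacy", "terms", "cookie", "legal")),
    ("product", ("product", "shop", "store", "download")),
    ("guide", ("guide", "tutorial", "how-to", "docs")),
    ("article", ("blog", "news", "article", "post")),
    ("utility", ("contact", "search", "login", "cart")),
]


def classify_page(url: str) -> str:
    """Return one of: product, article, guide, utility, legal, index."""
    path = urlparse(url).path.lower().strip("/")
    if path == "":
        return "index"  # the site root
    segments = path.split("/")
    for label, keywords in RULES:
        if any(keyword in segment
               for segment in segments
               for keyword in keywords):
            return label
    return "article"  # assumed fallback for unmatched content pages
```

A substring match like this will mislabel some URLs, for example a blog post with "product" in its slug, so treat the labels as a first pass to review.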
The crawl runs from your machine and the package is written to a folder next to the EXE. No account, no telemetry, nothing uploaded. The only network traffic is the crawl of the site you point it at.
An honest word on llms.txt. Server-log evidence, including my own, shows the major AI crawlers reach sites through robots.txt, sitemaps, and your HTML, and rarely fetch the llms.txt-style files. This kit leads with the signals that work and includes the convention files as cheap, static insurance, not as a magic visibility lever.
Three tabs take you from a URL to an upload-ready package.
The first tab after a scan. Every discovery and control file is marked found, missing, or blocked, alongside your AI-readiness score and a breakdown of the pages found by type — product, article, guide, utility, legal, or index.
Every page the crawl reached, with its title, classification, word count, and the schema.org types detected on it. Sort by any column to see how your site is structured and where the schema gaps are.
Confirm your details, set your AI training, citation and summarisation permissions, then generate. One click writes the full package and a zip, with a README.txt and an upload-map.txt telling you exactly where each file goes on your server.
Four steps from a cold URL to a reviewed, upload-ready set of files.
Point it at your site. A light crawl maps your pages and checks which discovery and control files already exist.
Read your AI-readiness score and the file-by-file status. Now you know what's missing instead of guessing.
Confirm your details and set your training, citation and summarisation permissions — all on one screen.
One click produces the package and zip. Review the files, back up anything you're replacing, and upload.
Every file is plain text or JSON, safe to read, and yours to review before it goes anywhere near your server.
| File | What it is |
|---|---|
| robots.txt.NEW | AI-crawler allow/deny rules; compare with your existing robots.txt and back up first |
| sitemap.xml | Standard sitemap from the crawl; review against any CMS/plugin sitemap first |
| ai-sitemap.xml | AI-oriented sitemap, ready to upload |
| ai-discovery.json | Machine-readable site summary and permission declarations |
| ai-discovery.md | Human-readable companion; review then upload |
| knowledge-graph.json | Structured-data starter; add relationships manually |
| llms-lite.txt | Curated content map for the llms.txt convention |
| tdmrep.json | Text-and-data-mining reservation; goes in .well-known/ |
| README.txt & upload-map.txt | Deployment guide and a file-to-server-path reference |
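The file-to-server-path idea behind upload-map.txt can be sketched in Python. Only the .well-known/ location for tdmrep.json comes from the table above; placing everything else at the site root and dropping the .NEW suffix on upload are my assumptions, and `server_path` is a hypothetical name.

```python
from pathlib import PurePosixPath

# Files that belong under /.well-known/ (from the table above).
WELL_KNOWN = {"tdmrep.json"}


def server_path(filename: str) -> str:
    """Return the assumed destination path on the web server."""
    # Assumption: robots.txt.NEW is renamed to robots.txt after review.
    name = filename[:-4] if filename.endswith(".NEW") else filename
    folder = "/.well-known" if name in WELL_KNOWN else "/"
    return str(PurePosixPath(folder) / name)
```

Check the generated upload-map.txt for the real destinations before you copy anything to your server.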
Built to the same rules as every other tool on this site: offline-first, zero dependencies, single portable EXE.
Windows 10 and 11, x64. Built with C++17 and the Win32 API. No MFC, no Qt, no frameworks, no .NET.
No database. Settings live in an INI file next to the EXE; the generated package is written to an output\ folder beside it.
WinHTTP for the crawl and endpoint checks. No third-party HTTP libraries, no telemetry of any kind.
Owner-drawn Win32 UI with a dark theme, DPI-aware. Consistent look without external UI toolkits.
No installation step. Unzip and run. Delete the folder to uninstall. Nothing is written outside it except the files you choose to upload.
Free for personal and commercial use. Source not distributed. No warranties.
Single ZIP. No installer, no account, no subscription. Unzip and run.