Crawl any site, score its AI readiness, and generate the discovery and crawler-control files that matter — entirely offline, nothing uploaded.
This walkthrough takes you from first launch to an upload-ready package. Each step has a full reference further down the page.
Run AIDiscoveryKit.exe. It runs as a single portable executable with no dependencies; nothing is written outside its own folder.
Enter your site URL (for example, https://yoursite.com) and run the scan.
Generate the package. The kit writes it to an output\<domain>\ folder next to the EXE, plus a zip. Review the files, back up anything you're replacing, and upload.
Start with a moderate page limit. You can always re-scan with a higher limit if the site is larger than the crawl reached; the status bar tells you when a limit stopped the crawl.
The kit is a single-window application. From top to bottom you'll find the scan controls, the results area, and the generate controls, with a menu bar above and a status bar below.
The scan controls hold the URL field and the crawl options — depth, page limit, and crawl delay. The results area shows the ADP status panel, your AI-readiness score, and the list of pages found with their classifications. The generate controls let you enter your site details and AI permissions and produce the package. The status bar at the bottom reports progress and flags conditions such as a page limit being reached. A banner links through to Tom's Site Auditor for a deeper SEO audit when you want one.
Enter a URL and choose your crawl options. The scan is deliberately light — it is not a full SEO audit — and is designed to map your structure and check your AI-discovery files quickly.
Depth controls how many links deep the crawl follows from the starting page (default 3). Page limit caps how many pages the crawl will visit: choose from 100, 250, 500, 1000, 2000, or No limit (default 500). When a real limit stops the crawl before the site is exhausted, the status bar says so, so that the page count isn't mistaken for the whole site. Crawl delay inserts a pause between requests so that a polite crawl doesn't hammer the server.
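To see how the three options interact, here is a minimal sketch of a breadth-first crawl over a toy in-memory site. It is not the kit's own code: the `crawl` function, the `get_links` callback, and the `SITE` map are all hypothetical stand-ins for real HTTP fetching.

```python
import time
from collections import deque

def crawl(start, get_links, max_depth=3, page_limit=500, delay=0.0):
    """Breadth-first crawl. Returns (visited_urls, limit_reached)."""
    visited = []
    seen = {start}                      # URLs already queued, so none is visited twice
    queue = deque([(start, 0)])
    limit_reached = False
    while queue:
        if page_limit is not None and len(visited) >= page_limit:
            limit_reached = True        # pages were still waiting when the cap was hit
            break
        url, depth = queue.popleft()
        visited.append(url)
        if depth < max_depth:           # only follow links while within the depth budget
            for link in get_links(url):
                if link not in seen:
                    seen.add(link)
                    queue.append((link, depth + 1))
        if delay and queue:
            time.sleep(delay)           # polite pause before the next request
    return visited, limit_reached

# Hypothetical site map: each path lists the links found on that page.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/"],
    "/b": ["/b/1"],
    "/a/1": ["/a/1/x"],
    "/b/1": [],
    "/a/1/x": [],
}

def get_links(url):
    return SITE.get(url, [])

pages, stopped_by_limit = crawl("/", get_links, max_depth=3, page_limit=2)
```

In this sketch `limit_reached` is only set when pages remain in the queue, which mirrors the distinction the status bar draws between a limit stopping the crawl and the site simply running out of pages.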
For each page it visits, the kit records the title, meta description, H1, headings, detected schema.org types, an approximate word count, and the links it finds. Pages are de-duplicated so the same URL isn't counted twice.
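As an illustration of what a per-page record can look like, the sketch below pulls a title, meta description, H1, links, and a rough word count out of HTML using Python's standard `html.parser`. The `PageRecord` class and the `dedupe` rule (drop the fragment and any trailing slash) are assumptions made for this example, not a description of how the kit parses or de-duplicates pages.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag

class PageRecord(HTMLParser):
    """Collects a few page-level fields while the HTML is parsed."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.h1 = ""
        self.links = []
        self.words = 0          # approximate: counts all text outside <title>
        self._in = None         # tag whose text is currently being captured

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag in ("title", "h1"):
            self._in = tag
        elif tag == "meta" and (a.get("name") or "").lower() == "description":
            self.description = a.get("content") or ""
        elif tag == "a" and a.get("href"):
            self.links.append(a["href"])

    def handle_endtag(self, tag):
        if tag == self._in:
            self._in = None

    def handle_data(self, data):
        if self._in == "title":
            self.title += data
            return
        if self._in == "h1":
            self.h1 += data
        self.words += len(data.split())

def dedupe(urls):
    """Keep the first occurrence of each URL, ignoring fragment and trailing slash."""
    seen, unique = set(), []
    for url in urls:
        key = urldefrag(url).url.rstrip("/")
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique

HTML = """<html><head><title>Widgets</title>
<meta name="description" content="All about widgets."></head>
<body><h1>Our Widgets</h1><p>Three small words.</p><a href="/buy">Buy now</a></body></html>"""

page = PageRecord()
page.feed(HTML)
```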
Alongside the crawl, the kit probes the standard AI-discovery and crawler-control endpoints to see what the site already publishes:
| Endpoint | Purpose |
|---|---|
| robots.txt | AI-crawler allow/deny rules: the file the major AI crawlers actually read |
| sitemap.xml | Standard page-discovery sitemap |
| ai-sitemap.xml | AI-oriented sitemap variant |
| llms.txt, llms-lite.txt | Curated content-map convention files |
| ai-discovery.json (root and /.well-known/) | Machine-readable site summary and permissions |
| ai-discovery.md | Human-readable discovery companion |
| knowledge-graph.json | Structured-data graph |
| tdmrep.json (/.well-known/) | Text-and-data-mining reservation |
When the scan completes, three things tell you where the site stands.
Each checked endpoint is marked as Found, Missing, or Blocked / unreadable. "Blocked / unreadable" is important and distinct from "Missing": it means the request did not return a clean result — the server may be protecting the file, returning an error, or already hosting a file the kit could not read. In that case you should treat the corresponding generated file as something to merge, not blindly replace.
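The three-way distinction can be pictured as a simple mapping from the HTTP result of a probe. This is a sketch under assumptions: the kit's exact rules are not documented here, and the choice of which status codes count as Missing, and of treating an empty 200 response as unreadable, is illustrative only.

```python
def classify_endpoint(status, body=""):
    """Map a probe result to one of the three labels.

    status is the HTTP status code, or None if the request failed outright.
    """
    if status == 200 and body.strip():
        return "Found"
    if status in (404, 410):
        return "Missing"                 # the server clearly says the file is not there
    return "Blocked / unreadable"        # 403, 5xx, timeouts, empty bodies, and so on

results = {
    "robots.txt": classify_endpoint(200, "User-agent: *\nDisallow:"),
    "llms.txt": classify_endpoint(404),
    "ai-discovery.json": classify_endpoint(403),
    "sitemap.xml": classify_endpoint(None),
}
```

Anything that is neither a clean success nor a clean "not found" falls into the third bucket, which is why that label calls for a merge rather than a replacement.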
The score is a single percentage summarising how many of the checked files exist and are in good shape. It is a quick orientation number, not a grade from an authority — use it to see progress as you add the files that are missing.
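As a rough mental model only, a score of this kind can be computed as the share of checked endpoints that came back Found. The kit's real formula, including any weighting or "good shape" checks, is not specified here, so the `readiness_score` function below is a hypothetical simplification.

```python
def readiness_score(statuses):
    """Percentage of checked endpoints labelled 'Found' (unweighted assumption)."""
    if not statuses:
        return 0
    found = sum(1 for label in statuses.values() if label == "Found")
    return round(100 * found / len(statuses))

score = readiness_score({
    "robots.txt": "Found",
    "sitemap.xml": "Found",
    "llms.txt": "Missing",
    "tdmrep.json": "Found",
})
```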
Every page the crawl found is labelled by type — product, article, guide, utility, legal, or index — based on its path, schema, and content. This lets the generated files describe your site accurately instead of treating every URL the same. You can sort the page list by column to review the classifications.
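To make "based on its path, schema, and content" concrete, here is one way such a classifier could be written. Every rule in it (the path fragments, the schema.org types, the 200-word threshold) is an assumption chosen for illustration; the kit's actual heuristics may differ.

```python
def classify_page(path, schema_types=(), word_count=0):
    """Return one of: product, article, guide, utility, legal, index."""
    p = path.lower()
    types = {t.lower() for t in schema_types}

    if "product" in types or "/product" in p or "/shop" in p:
        return "product"
    if any(key in p for key in ("/privacy", "/terms", "/legal", "/cookies")):
        return "legal"
    if "howto" in types or "/guide" in p or "/docs" in p:
        return "guide"
    if types & {"article", "blogposting", "newsarticle"} or "/blog/" in p:
        return "article"
    # Home page, or a thin page whose path ends in a slash (likely a listing).
    if p in ("", "/") or (p.endswith("/") and word_count < 200):
        return "index"
    return "utility"

labels = [
    classify_page("/products/widget", ["Product"]),
    classify_page("/privacy-policy"),
    classify_page("/guides/setup"),
    classify_page("/blog/hello", ["BlogPosting"], 800),
    classify_page("/"),
    classify_page("/contact"),
]
```

The order of the checks matters in a rule list like this: earlier rules win, so schema evidence for a product overrides a path that merely looks like a blog post.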
Before generating, the kit asks for the details a crawl can't determine and lets you set your AI permissions. Where it can, it pre-fills fields from what the crawl found (for example, a site description from the homepage meta description).
Provide your name or business name, a one-line site description, and a contact. These appear
in the generated ai-discovery.json and the human-readable companion so the files
describe a real owner.
Three permission choices control what the generated files declare and enforce:
| Permission | Effect |
|---|---|
| Allow AI training | Governs machine ingest — drives the robots.txt AI-bot rules and the tdmrep.json reservation. Turning it off produces files that ask crawlers not to train on your content. |
| Allow AI citation | Declared in ai-discovery.json and llms-lite.txt. A stated preference, not an enforced block. |
| Allow AI summarisation | Declared in the discovery files alongside citation. Also a stated preference. |
Training and the two declared permissions work differently on purpose — see The Permission Model below. In short: training is enforced through robots.txt and tdmrep, while citation and summarisation are declared in the discovery files rather than technically enforced.
Click the generate button and the kit writes the complete package to an
output\<domain>\ folder next to the EXE, plus a .zip of the
same. Nothing is uploaded anywhere — the files sit on your machine until you choose to publish
them.
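The folder-plus-zip layout is easy to reproduce if you ever need to rebuild or script a similar package yourself. The sketch below is an assumption-laden illustration: `write_package` is a hypothetical helper, and the zip's name and location (a `<domain>.zip` beside the domain folder) are guesses, not documented behaviour.

```python
import shutil
from pathlib import Path
from urllib.parse import urlparse

def write_package(base_dir, site_url, files):
    """Write files to <base_dir>/output/<domain>/ and zip that folder.

    files maps file names to their text content.
    Returns (output_folder, zip_path).
    """
    domain = urlparse(site_url).hostname
    out_dir = Path(base_dir) / "output" / domain
    out_dir.mkdir(parents=True, exist_ok=True)
    for name, text in files.items():
        (out_dir / name).write_text(text, encoding="utf-8")
    # The archive is created beside the folder, so it never contains itself.
    archive = shutil.make_archive(str(out_dir), "zip", root_dir=str(out_dir))
    return out_dir, Path(archive)
```

Everything happens on the local filesystem, which matches the point above: nothing leaves your machine until you upload it.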
The package always includes a README.txt deployment guide and an
upload-map.txt that tells you exactly where each file belongs on your server.
Read those two first.
Every file is plain text or JSON — safe to open, read, and edit before you upload it.
| File | What it is & what to do |
|---|---|
| robots.txt.NEW | AI-crawler allow/deny rules. Compare with your existing robots.txt and back up first; see Uploading below. |
| sitemap.xml | Standard sitemap from the crawl. Review against any sitemap your CMS or plugin already generates before replacing. |
| ai-sitemap.xml | AI-oriented sitemap. Ready to upload. |
| ai-discovery.json | Machine-readable site summary and permission declarations. Ready to upload. |
| ai-discovery.md | Human-readable companion. Review, then upload. |
| knowledge-graph.json | Structured-data starter. Add relationships manually for richer output. |
| llms-lite.txt | Curated content map for the llms.txt convention. Review, then upload. |
| tdmrep.json | Text-and-data-mining reservation. Goes in your .well-known/ directory. Review before upload. |
| README.txt | Deployment guide for the package. |
| upload-map.txt | File-to-server-path reference. |
Follow the upload-map.txt for exact destinations. Most files go in your site
root; tdmrep.json goes in the /.well-known/ directory. A few points
matter:
The kit generates robots.txt.NEW rather than overwriting your file, because your
existing robots.txt may contain rules you set deliberately. If you already have a robots.txt,
back it up, then merge the new AI-bot rules into it — adding the bots you're
missing without disturbing your own blocks. If the scan reported your robots.txt as "Blocked /
unreadable," assume a file already exists and merge rather than replace.
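If you prefer to script the merge, the idea can be sketched as follows: keep the existing file untouched and append only those generated groups whose user-agents it does not already mention. This is a simplified illustration, not the kit's behaviour. The `merge_ai_rules` helper is hypothetical, and it assumes groups are separated by blank lines, which is a common convention rather than a requirement.

```python
import re

AGENT = re.compile(r"(?im)^\s*user-agent:\s*(\S+)")

def merge_ai_rules(existing, generated):
    """Append generated robots.txt groups for bots the existing file lacks."""
    have = {name.lower() for name in AGENT.findall(existing)}
    additions = []
    for group in re.split(r"\n\s*\n", generated.strip()):
        agents = AGENT.findall(group)
        # Skip a group if ANY of its bots already has rules; leave that for hand review.
        if agents and not any(a.lower() in have for a in agents):
            additions.append(group.strip())
    if not additions:
        return existing
    return existing.rstrip("\n") + "\n\n" + "\n\n".join(additions) + "\n"

existing = "User-agent: *\nDisallow: /admin/\n\nUser-agent: GPTBot\nDisallow: /private/\n"
generated = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: CCBot\nDisallow: /\n"
merged = merge_ai_rules(existing, generated)
```

Here the existing GPTBot rule is left exactly as you wrote it, and only the CCBot group is added. Review the result by eye before uploading, since a script cannot know which of your existing blocks were deliberate.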
If your CMS or a plugin already produces a sitemap, reconcile it with the generated one rather than running two competing sitemaps. One canonical sitemap is better than two that disagree.
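A quick way to see where two sitemaps disagree is to compare the sets of URLs they list. The sketch below uses Python's standard `xml.etree.ElementTree` and the sitemaps.org namespace; the helper names and sample URLs are made up for the example.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text):
    """Return the set of <loc> URLs in a sitemap document."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.findall(".//sm:loc", NS) if loc.text}

def compare_sitemaps(existing_xml, generated_xml):
    """List URLs that appear in only one of the two sitemaps."""
    old, new = sitemap_urls(existing_xml), sitemap_urls(generated_xml)
    return {"only_existing": sorted(old - new), "only_generated": sorted(new - old)}

def make_sitemap(urls):
    """Build a tiny sitemap string for the demo below."""
    entries = "".join(f"<url><loc>{u}</loc></url>" for u in urls)
    return f'<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">{entries}</urlset>'

diff = compare_sitemaps(
    make_sitemap(["https://yoursite.com/", "https://yoursite.com/a"]),
    make_sitemap(["https://yoursite.com/", "https://yoursite.com/b"]),
)
```

URLs that appear only in the generated sitemap may be pages your CMS omits on purpose, and URLs only in the existing one may lie beyond the crawl's depth or page limit, so treat the difference as a review list rather than an error report.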
Always back up robots.txt before replacing it. An incorrect robots.txt can
block search and AI crawlers from your whole site. When in doubt, merge by hand and keep the
original safe.
It is worth understanding why training is treated differently from citation and summarisation, because the files reflect that distinction.
Training governs machine ingest, and it is expressed through mechanisms crawlers
respect. When you disallow training, the kit writes robots.txt AI-bot rules that ask
the training crawlers to stay out, and sets the tdmrep.json reservation
accordingly. These are the levers that actually influence crawler behaviour today.
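For a sense of what these two levers look like on disk, the sketch below renders robots.txt groups and a tdmrep.json body from a single allow/disallow choice. It is an illustration under assumptions: the bot list is a small sample of well-known AI crawler tokens rather than the kit's list, the `"/*"` location pattern is one way to cover the whole site, and the kit's generated output may be structured differently. The `tdm-reservation` values follow the TDM Reservation Protocol, where 1 means rights are reserved.

```python
import json

# Illustrative sample only; the kit's own bot list is not reproduced here.
AI_TRAINING_BOTS = ["GPTBot", "CCBot", "Google-Extended", "ClaudeBot"]

def robots_ai_rules(allow_training):
    """One robots.txt group per bot, allowing or disallowing the whole site."""
    rule = "Allow: /" if allow_training else "Disallow: /"
    return "\n\n".join(f"User-agent: {bot}\n{rule}" for bot in AI_TRAINING_BOTS) + "\n"

def tdmrep_json(allow_training):
    """tdmrep.json body: tdm-reservation 1 reserves TDM rights, 0 does not."""
    rules = [{"location": "/*", "tdm-reservation": 0 if allow_training else 1}]
    return json.dumps(rules, indent=2)

robots_text = robots_ai_rules(allow_training=False)
tdmrep_text = tdmrep_json(allow_training=False)
```

Both outputs remain requests rather than locks: they work only to the extent that a crawler chooses to honour them.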
Citation and summarisation are declared, not enforced. There is no
widely-respected technical mechanism to permit citation while forbidding summarisation, so the
kit records your preference in ai-discovery.json and llms-lite.txt
as a clear, machine-readable statement of intent. Treat them as declarations of your position
rather than hard controls.
The kit keeps its configuration in an aidk.ini file next to the executable.
Settings include a debug-logging toggle; when enabled, the kit writes a detailed
debug.log next to the EXE that records the crawl, the endpoint checks, and any
errors — useful when reporting an issue. The crawl delay is set directly in the scan controls
rather than buried in settings.
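If you want to check the debug toggle from a script, an INI file can be read with Python's standard `configparser`. The section name `settings` and key `debug_log` below are hypothetical placeholders, since the layout of aidk.ini is not documented here; open your own file to see the real names.

```python
import configparser

def debug_logging_enabled(ini_text, section="settings", key="debug_log"):
    """Return True if the (hypothetical) debug toggle is switched on."""
    cfg = configparser.ConfigParser()
    cfg.read_string(ini_text)
    # fallback covers a missing section or key, e.g. a fresh or deleted ini file
    return cfg.getboolean(section, key, fallback=False)

enabled = debug_logging_enabled("[settings]\ndebug_log = true\n")
```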
The settings dialog is mouse-driven — click the buttons to confirm or cancel rather than relying on Enter or Tab.
The kit generates the full llms.txt family — llms.txt,
llms-lite.txt, ai-discovery.json, ai-discovery.md — but
it's worth being clear-eyed about what they do. Server-log evidence, including from this
site, shows the major AI crawlers reach sites through robots.txt, sitemaps, and your HTML, and
rarely fetch the llms.txt-style files. Large studies put adoption around 10% with no measurable
citation lift, and Google has said on the record it doesn't use llms.txt.
So treat these files as cheap, static insurance — they cost nothing to publish, they make you forward-compatible if adoption shifts, and AI-readiness audit tools do check for them — but put your real effort into the robots.txt rules, sitemaps, and structured data the crawlers genuinely consume. That's why the kit leads with those.
Delete aidk.ini next to the executable and restart. This resets settings and
window state to defaults — especially helpful after upgrading from an earlier build.
If a scan finds fewer pages than you expected, check three things: the page limit (raise it if the status bar said a limit was reached), the crawl depth (increase it to follow links further), and whether the site is blocking the crawl (a robots.txt result of "Blocked / unreadable" or a homepage that needs JavaScript to render its links can both reduce what a light crawl can see).
A "Blocked / unreadable" result means the request didn't return a clean result: the file may already exist behind a protection layer, or the server returned an error. Don't assume the file is absent; back up and merge rather than replace.
Enable debug logging in settings to write a detailed debug.log next to the EXE.
It records the crawl steps, endpoint checks, and errors — useful for diagnosing unexpected
behaviour or reporting an issue.
A few accepted trade-offs come with the zero-dependency, portable design:
The kit is single-threaded, so the window may feel less responsive while a scan is in progress. The crawl is deliberately light and follows ordinary links, so pages reachable only through scripts or unusual link formats may not be discovered. The structured-data and discovery output is a strong starting point, not a hand-tuned knowledge graph — review and enrich it before relying on it. These are intentional choices in favour of a single, portable executable with no external dependencies.