Tom's AI Discovery Kit · tomdahne.com / ai-discovery-kit · User Guide

Tom's AI Discovery Kit

Crawl any site, score its AI readiness, and generate the discovery and crawler-control files that matter — entirely offline, nothing uploaded.

v1 · Free · Portable EXE · Zero Dependencies · Offline-First

Quick Start

This walkthrough takes you from first launch to an upload-ready package. Each step has a full reference further down the page.

  1. Download and run. There is no installer. Extract the zip and double-click AIDiscoveryKit.exe. It runs as a single portable executable with no dependencies — nothing is written outside its own folder.
  2. Enter a URL. Type or paste the address of the site you want to check into the URL field at the top (for example https://yoursite.com).
  3. Set your crawl options and scan. Choose a crawl depth, a page limit, and a crawl delay, then start the scan. The kit performs a light crawl of the site and probes the standard AI-discovery endpoints.
  4. Read your results. When the scan finishes you'll see the ADP status panel (which files exist, which are missing), your AI-readiness score, and a breakdown of the pages found by type.
  5. Enter your details and permissions. Fill in the fields the crawler can't infer — your name or business, a one-line description, contact — and set whether you allow AI training, citation, and summarisation.
  6. Generate the package. Click the generate button. The kit writes the full set of files to an output\<domain>\ folder next to the EXE, plus a zip. Review the files, back up anything you're replacing, and upload.

Tip

Start with a moderate page limit. You can always re-scan with a higher limit if the site is larger than the crawl reached — the status bar tells you when a limit stopped the crawl.

Application Layout

The kit is a single-window application. From top to bottom you'll find the scan controls, the results area, and the generate controls, with a menu bar above and a status bar below.

The scan controls hold the URL field and the crawl options — depth, page limit, and crawl delay. The results area shows the ADP status panel, your AI-readiness score, and the list of pages found with their classifications. The generate controls let you enter your site details and AI permissions and produce the package. The status bar at the bottom reports progress and flags conditions such as a page limit being reached. A banner links through to Tom's Site Auditor for a deeper SEO audit when you want one.

Scanning a Site

Enter a URL and choose your crawl options. The scan is deliberately light — it is not a full SEO audit — and is designed to map your structure and check your AI-discovery files quickly.

Crawl options

Depth controls how many links deep the crawl follows from the starting page (default 3). Page limit caps how many pages the crawl will visit: choose from 100, 250, 500, 1000, 2000, or No limit (default 500). When a numeric limit stops the crawl before the site is exhausted, the status bar says so, so the page count isn't mistaken for the whole site. Crawl delay inserts a pause between requests so the crawl stays polite and doesn't hammer the server.
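
How the three options interact can be sketched in a few lines of Python. This is a minimal illustration, not the kit's own code: the function name light_crawl, the get_links callback, and the breadth-first order are all assumptions made for the example.

```python
import time
from collections import deque
from typing import Callable, Iterable, List, Optional, Tuple


def light_crawl(
    start: str,
    get_links: Callable[[str], Iterable[str]],  # hypothetical: returns the links found on a page
    max_depth: int = 3,
    page_limit: Optional[int] = 500,  # None stands in for "No limit"
    delay_seconds: float = 0.0,
) -> Tuple[List[str], bool]:
    """Breadth-first crawl. Returns (pages visited, True if the page limit stopped it)."""
    visited: List[str] = []
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        if page_limit is not None and len(visited) >= page_limit:
            return visited, True  # pages were still queued when the cap was hit
        url, depth = queue.popleft()
        visited.append(url)
        if depth < max_depth:
            for link in get_links(url):
                if link not in seen:  # each URL is queued at most once
                    seen.add(link)
                    queue.append((link, depth + 1))
        if delay_seconds and queue:
            time.sleep(delay_seconds)  # polite pause before the next request
    return visited, False
```

The second return value plays the role of the status-bar message: it is True only when pages were still waiting in the queue at the moment the limit was reached.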

What the crawl collects

For each page it visits, the kit records the title, meta description, H1, headings, detected schema.org types, an approximate word count, and the links it finds. Pages are de-duplicated so the same URL isn't counted twice.
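
To make the collected fields concrete, here is a small sketch using Python's standard html.parser module. The class name PageSummary and the helper summarise are hypothetical, the word count is deliberately rough, and schema.org detection is left out; the kit's actual extraction may differ.

```python
from html.parser import HTMLParser


class PageSummary(HTMLParser):
    """Collects title, meta description, first H1, links, and a rough word count."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.h1 = ""
        self.links = []
        self.word_count = 0
        self._current = None  # tag whose text is currently being captured

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._current = "title"
        elif tag == "h1" and not self.h1:  # keep only the first H1
            self._current = "h1"
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current == "title":
            self.title += data
            return
        if self._current == "h1":
            self.h1 += data
        self.word_count += len(data.split())  # approximate: counts all visible text


def summarise(html: str) -> PageSummary:
    parser = PageSummary()
    parser.feed(html)
    parser.close()
    parser.title = parser.title.strip()
    parser.h1 = parser.h1.strip()
    return parser
```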

The AI-discovery endpoint checks

Alongside the crawl, the kit probes the standard AI-discovery and crawler-control endpoints to see what the site already publishes:

| Endpoint | Purpose |
|---|---|
| robots.txt | AI-crawler allow/deny rules; the file the major AI crawlers actually read |
| sitemap.xml | Standard page-discovery sitemap |
| ai-sitemap.xml | AI-oriented sitemap variant |
| llms.txt, llms-lite.txt | Curated content-map convention files |
| ai-discovery.json (root and /.well-known/) | Machine-readable site summary and permissions |
| ai-discovery.md | Human-readable discovery companion |
| knowledge-graph.json | Structured-data graph |
| tdmrep.json (/.well-known/) | Text-and-data-mining reservation |
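
As a rough illustration, the endpoint list above can be turned into absolute URLs for any site with urljoin from the standard library. The names ENDPOINT_PATHS and endpoint_urls are made up for this sketch, and placing the files at the site root (except where the table says /.well-known/) is an assumption.

```python
from urllib.parse import urljoin

# Paths taken from the table above; root placement is assumed unless stated.
ENDPOINT_PATHS = [
    "/robots.txt",
    "/sitemap.xml",
    "/ai-sitemap.xml",
    "/llms.txt",
    "/llms-lite.txt",
    "/ai-discovery.json",
    "/.well-known/ai-discovery.json",
    "/ai-discovery.md",
    "/knowledge-graph.json",
    "/.well-known/tdmrep.json",
]


def endpoint_urls(site_url: str) -> list:
    """Resolve every endpoint path against the site's origin."""
    return [urljoin(site_url, path) for path in ENDPOINT_PATHS]
```

Because every path starts with a slash, the result is the same whichever page of the site you pass in.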

Reading the Results

When the scan completes, three things tell you where the site stands.

ADP status panel

Each checked endpoint is marked as Found, Missing, or Blocked / unreadable. "Blocked / unreadable" is important and distinct from "Missing": it means the request did not return a clean result — the server may be protecting the file, returning an error, or already hosting a file the kit could not read. In that case you should treat the corresponding generated file as something to merge, not blindly replace.
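
One way to picture the three states is as a mapping from the HTTP response. The rules below are an assumption for illustration only (the kit's exact criteria are not documented here): a 404 or 410 counts as Missing, a 200 with a non-empty body counts as Found, and everything else, including a failed request, counts as Blocked / unreadable.

```python
from typing import Optional


def endpoint_status(http_status: Optional[int], body: Optional[str]) -> str:
    """Map a response to a panel label. http_status is None when the request failed."""
    if http_status in (404, 410):
        return "Missing"  # the server says plainly that nothing is there
    if http_status == 200 and body and body.strip():
        return "Found"
    # 403, 5xx, timeouts, empty bodies: something may exist, but it could not be read
    return "Blocked / unreadable"
```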

AI-readiness score

The score is a single percentage summarising how many of the checked files exist and are in good shape. It is a quick orientation number, not a grade from an authority — use it to see progress as you add the files that are missing.
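
For orientation only, a score of this kind can be approximated as the share of checked endpoints that were found. This sketch assumes a plain ratio and ignores the "in good shape" part, so it will not necessarily match the number the kit shows.

```python
def readiness_score(statuses: dict) -> int:
    """Percentage of checked endpoints whose status is 'Found' (simplified assumption)."""
    if not statuses:
        return 0
    found = sum(1 for status in statuses.values() if status == "Found")
    return round(100 * found / len(statuses))
```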

Page classification

Every page the crawl found is labelled by type — product, article, guide, utility, legal, or index — based on its path, schema, and content. This lets the generated files describe your site accurately instead of treating every URL the same. You can sort the page list by column to review the classifications.
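
A classifier along these lines might look like the sketch below. The path fragments, the schema.org mappings, and the fallback to "index" are all invented for the example; the kit's real rules also weigh page content and are not reproduced here.

```python
from urllib.parse import urlparse

# Hypothetical hints: schema.org type (lowercased) -> label
SCHEMA_HINTS = {"product": "product", "article": "article",
                "blogposting": "article", "howto": "guide"}

# Hypothetical hints: label -> path fragments, checked in order
PATH_HINTS = [
    ("legal", ("/privacy", "/terms", "/cookies", "/legal")),
    ("product", ("/product", "/shop/")),
    ("guide", ("/guide", "/docs", "/how-to")),
    ("article", ("/blog/", "/news/", "/article")),
    ("utility", ("/contact", "/login", "/search", "/cart")),
]


def classify_page(url: str, schema_types=()) -> str:
    """Label a page from its schema.org types first, then from its path."""
    for schema_type in schema_types:
        label = SCHEMA_HINTS.get(schema_type.lower())
        if label:
            return label
    path = urlparse(url).path.lower()
    for label, fragments in PATH_HINTS:
        if any(fragment in path for fragment in fragments):
            return label
    return "index"  # fallback for home and listing pages
```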

Configuring the Package

Before generating, the kit asks for the details a crawl can't determine and lets you set your AI permissions. Where it can, it pre-fills fields from what the crawl found (for example, a site description from the homepage meta description).

Your details

Provide your name or business name, a one-line site description, and a contact. These appear in the generated ai-discovery.json and the human-readable companion so the files describe a real owner.

AI permissions

Three permission choices control what the generated files declare and enforce:

| Permission | Effect |
|---|---|
| Allow AI training | Governs machine ingest: drives the robots.txt AI-bot rules and the tdmrep.json reservation. Turning it off produces files that ask crawlers not to train on your content. |
| Allow AI citation | Declared in ai-discovery.json and llms-lite.txt. A stated preference, not an enforced block. |
| Allow AI summarisation | Declared in the discovery files alongside citation. Also a stated preference. |

Note

Training and the two declared permissions work differently on purpose — see The Permission Model below. In short: training is enforced through robots.txt and tdmrep, while citation and summarisation are declared in the discovery files rather than technically enforced.

Generating the Package

Click the generate button and the kit writes the complete package to an output\<domain>\ folder next to the EXE, plus a .zip of the same. Nothing is uploaded anywhere — the files sit on your machine until you choose to publish them.

The package always includes a README.txt deployment guide and an upload-map.txt that tells you exactly where each file belongs on your server. Read those two first.

The Generated Files

Every file is plain text or JSON — safe to open, read, and edit before you upload it.

| File | What it is & what to do |
|---|---|
| robots.txt.NEW | AI-crawler allow/deny rules. Compare with your existing robots.txt and back up first (see Uploading below). |
| sitemap.xml | Standard sitemap from the crawl. Review against any sitemap your CMS or plugin already generates before replacing. |
| ai-sitemap.xml | AI-oriented sitemap. Ready to upload. |
| ai-discovery.json | Machine-readable site summary and permission declarations. Ready to upload. |
| ai-discovery.md | Human-readable companion. Review, then upload. |
| knowledge-graph.json | Structured-data starter. Add relationships manually for richer output. |
| llms-lite.txt | Curated content map for the llms.txt convention. Review, then upload. |
| tdmrep.json | Text-and-data-mining reservation. Goes in your .well-known/ directory. Review before upload. |
| README.txt | Deployment guide for the package. |
| upload-map.txt | File-to-server-path reference. |
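
If you want to confirm a package is complete before uploading, a short script can compare the output folder against the file list above. This is an optional sketch, not part of the kit; the function name missing_from_package is made up, and it searches subfolders too because the exact folder layout is not assumed.

```python
from pathlib import Path

# File names taken from the table above.
EXPECTED_FILES = [
    "robots.txt.NEW", "sitemap.xml", "ai-sitemap.xml", "ai-discovery.json",
    "ai-discovery.md", "knowledge-graph.json", "llms-lite.txt", "tdmrep.json",
    "README.txt", "upload-map.txt",
]


def missing_from_package(folder) -> list:
    """Return the expected file names that are not present anywhere under folder."""
    present = {path.name for path in Path(folder).rglob("*") if path.is_file()}
    return [name for name in EXPECTED_FILES if name not in present]
```

An empty list means every file from the table is present somewhere under the folder you pointed it at.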

Uploading to Your Server

Follow the upload-map.txt for exact destinations. Most files go in your site root; tdmrep.json goes in the /.well-known/ directory. A few points matter:

robots.txt — merge, don't blindly replace

The kit generates robots.txt.NEW rather than overwriting your file, because your existing robots.txt may contain rules you set deliberately. If you already have a robots.txt, back it up, then merge the new AI-bot rules into it — adding the bots you're missing without disturbing your own blocks. If the scan reported your robots.txt as "Blocked / unreadable," assume a file already exists and merge rather than replace.
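
The merge described above can be sketched as: keep the existing file exactly as it is, and append only the generated groups whose user agents it does not already mention. This is a simplified illustration that assumes the groups in the generated file are separated by blank lines; the helper names are hypothetical, and you should still review the result by hand.

```python
import re

USER_AGENT_LINE = re.compile(r"(?im)^[ \t]*user-agent[ \t]*:[ \t]*(\S+)")


def user_agents(robots_text: str) -> set:
    """Lowercased user-agent tokens named anywhere in the text."""
    return {match.group(1).lower() for match in USER_AGENT_LINE.finditer(robots_text)}


def merge_ai_rules(existing: str, generated: str) -> str:
    """Append generated groups for bots the existing robots.txt does not cover."""
    known = user_agents(existing)
    additions = []
    for group in re.split(r"\n\s*\n", generated.strip()):
        agents = user_agents(group)
        if agents and not (agents & known):  # skip bots you already have rules for
            additions.append(group.strip())
    if not additions:
        return existing
    return existing.rstrip("\n") + "\n\n" + "\n\n".join(additions) + "\n"
```

Rules you already wrote for a bot are left untouched, so a deliberate block is never overwritten by a generated one.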

Sitemaps

If your CMS or a plugin already produces a sitemap, reconcile it with the generated one rather than running two competing sitemaps. One canonical sitemap is better than two that disagree.
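
To see where two sitemaps disagree, you can compare their URL sets. The sketch below uses the standard xml.etree module and the sitemaps.org namespace; the function names are made up for the example, and sitemap index files are not handled.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text: str) -> set:
    """All <loc> values in a sitemap document."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text}


def compare_sitemaps(existing_xml: str, generated_xml: str) -> dict:
    """URLs that appear in only one of the two sitemaps."""
    existing = sitemap_urls(existing_xml)
    generated = sitemap_urls(generated_xml)
    return {
        "only_in_existing": sorted(existing - generated),
        "only_in_generated": sorted(generated - existing),
    }
```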

Important

Always back up robots.txt before replacing it. An incorrect robots.txt can block search and AI crawlers from your whole site. When in doubt, merge by hand and keep the original safe.

The Permission Model

It is worth understanding why training is treated differently from citation and summarisation, because the files reflect that distinction.

Training governs machine ingest, and it is expressed through mechanisms crawlers respect. When you disallow training, the kit writes robots.txt AI-bot rules that ask the training crawlers to stay out, and sets the tdmrep.json reservation accordingly. These are the levers that actually influence crawler behaviour today.

Citation and summarisation are declared, not enforced. There is no widely-respected technical mechanism to permit citation while forbidding summarisation, so the kit records your preference in ai-discovery.json and llms-lite.txt as a clear, machine-readable statement of intent. Treat them as declarations of your position rather than hard controls.
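
The training half of this model can be illustrated with a short sketch. The bot list is a small sample of publicly documented AI crawler tokens, not the kit's actual list, and the function names are hypothetical. The tdmrep.json shape follows the TDM Reservation Protocol as commonly published (an array of location rules with tdm-reservation set to 1 for reserved, 0 for not reserved); the "/*" location pattern is an assumption, so check the specification before relying on it.

```python
import json

# Sample of documented AI crawler tokens; illustrative, not the kit's list.
TRAINING_BOTS = ["GPTBot", "ClaudeBot", "CCBot", "Google-Extended"]


def robots_ai_rules(allow_training: bool) -> str:
    """One robots.txt group per bot, allowing or disallowing the whole site."""
    rule = "Allow: /" if allow_training else "Disallow: /"
    groups = ["User-agent: {}\n{}".format(bot, rule) for bot in TRAINING_BOTS]
    return "\n\n".join(groups) + "\n"


def tdmrep_document(allow_training: bool) -> str:
    """tdmrep.json content: reservation 1 means text-and-data-mining rights are reserved."""
    rules = [{"location": "/*", "tdm-reservation": 0 if allow_training else 1}]
    return json.dumps(rules, indent=2)
```

Citation and summarisation have no equivalent lever here, which is why they appear only as declarations in the discovery files.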

Settings

The kit keeps its configuration in an aidk.ini file next to the executable. Settings include a debug-logging toggle; when enabled, the kit writes a detailed debug.log next to the EXE that records the crawl, the endpoint checks, and any errors — useful when reporting an issue. The crawl delay is set directly in the scan controls rather than buried in settings.

Note

The settings dialog is mouse-driven — click the buttons to confirm or cancel rather than relying on Enter or Tab.

An Honest Word on llms.txt

The kit generates the full llms.txt family — llms.txt, llms-lite.txt, ai-discovery.json, ai-discovery.md — but it's worth being clear-eyed about what they do. Server-log evidence, including from this site, shows the major AI crawlers reach sites through robots.txt, sitemaps, and your HTML, and rarely fetch the llms.txt-style files. Large studies put adoption around 10% with no measurable citation lift, and Google has said on the record it doesn't use llms.txt.

So treat these files as cheap, static insurance — they cost nothing to publish, they make you forward-compatible if adoption shifts, and AI-readiness audit tools do check for them — but put your real effort into the robots.txt rules, sitemaps, and structured data the crawlers genuinely consume. That's why the kit leads with those.

Troubleshooting

The window looks wrong or controls are misaligned

Delete aidk.ini next to the executable and restart. This resets settings and window state to defaults — especially helpful after upgrading from an earlier build.

The scan found very few pages

Check three things: the page limit (raise it if the status bar said a limit was reached), the crawl depth (increase it to follow links further), and whether the site is blocking the crawl (a robots.txt result of "Blocked / unreadable" or a homepage that needs JavaScript to render its links can both reduce what a light crawl can see).

A file shows "Blocked / unreadable" instead of "Missing"

This means the request didn't return a clean result — the file may already exist behind a protection layer, or the server returned an error. Don't assume the file is absent; back up and merge rather than replace.

Debug logging

Enable debug logging in settings to write a detailed debug.log next to the EXE. It records the crawl steps, endpoint checks, and errors — useful for diagnosing unexpected behaviour or reporting an issue.

Known Limitations

A few accepted trade-offs come with the zero-dependency, portable design:

The kit is single-threaded, so the window may feel less responsive while a scan is in progress. The crawl is deliberately light and follows ordinary links, so pages reachable only through scripts or unusual link formats may not be discovered. The structured-data and discovery output is a strong starting point, not a hand-tuned knowledge graph — review and enrich it before relying on it. These are intentional choices in favour of a single, portable executable with no external dependencies.