Key Takeaway: To make a site discoverable to AI search engines in 2026, prioritise the signals AI crawlers actually consume: explicit AI-bot rules in robots.txt, a complete sitemap.xml, and accurate schema.org structured data. AI-discovery files such as llms.txt and ai-discovery.json are worth adding as low-cost extras, but server-log evidence shows the major AI crawlers rarely fetch them. Control and structure come first; the convention files are a checkbox.

Last updated: May 2026

How to Make Your Site Discoverable to AI Search in 2026 (The Parts That Actually Work)

Half the advice on "AI visibility" right now is people guessing. The other half is people selling you a file. This guide is neither. It is a practical, ordered checklist of what actually moves the needle when AI engines crawl, ingest, and cite your site — based on what those crawlers genuinely request when they visit, not on what a plugin vendor wishes they requested.

The short version, before the detail: the AI crawlers reliably read three things — your robots.txt, your sitemap.xml, and your page HTML. Everything that works flows from getting those three right. The fashionable AI-discovery files sit at the bottom of the priority list for a reason I will explain. Let us work top to bottom.

First, understand what an AI crawler actually does

When an AI engine encounters your site, the sequence is unglamorous and entirely familiar. It requests robots.txt to find out whether it is allowed in and what it may do. It requests sitemap.xml to discover which pages exist. Then it crawls the HTML of those pages and runs them through a content-extraction pipeline that strips navigation and boilerplate and pulls out the meaningful text and any structured data. That extracted, structured content is what ends up informing answers and citations.

Notice what is not in that sequence: a detour to fetch a separate hand-written summary file. The engines already have a reliable way to read your content — your actual pages — so the highest-leverage work is making those pages, and the two files that gate access to them, as clean and clear as possible. That is the whole strategy in one paragraph. The rest is execution.

Step 1: Control access with explicit AI-bot rules in robots.txt

This is the single most important file for AI discovery, and it is the one most sites get wrong by leaving it on autopilot. robots.txt is the standard the major AI crawlers genuinely respect: GPTBot, ClaudeBot, PerplexityBot and the rest all check it, and check it often, and Google reads its Google-Extended rules from the same file (Google-Extended is a robots.txt control token rather than a separate crawler). It is where you make an explicit, deliberate decision about which AI bots may access your content and for what.

The decision splits into two questions that are easy to conflate. The first is training: do you want AI models to ingest your content into their training data? The second is real-time citation: do you want AI search and answer products to fetch your pages live and cite them when users ask relevant questions? These are different, and the bots that do them are different. A blanket block keeps you out of AI answers entirely, which for most businesses is the opposite of what they want. A blanket allow opts you into everything. The right answer is usually deliberate: allow the citation and search agents that send you traffic, and make a conscious choice about the training crawlers.

In practice that means adding named User-agent blocks for the specific AI crawlers rather than relying on a single wildcard. A wildcard rule is a blunt instrument; named rules let you allow the agent that drives a ChatGPT citation while still controlling a training-only crawler. Whatever you decide, decide it on purpose — the default of "whatever the host generated years ago" is not a decision.
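As a rough illustration of named rules versus a wildcard, here is a minimal sketch that holds one possible policy as a string and checks it with Python's standard-library robots parser. The policy itself is an assumption for the example, not a recommendation: it allows two search and citation agents (OAI-SearchBot and PerplexityBot), blocks GPTBot as a training crawler, and leaves everyone else on a default rule. Real crawlers may interpret edge cases differently from Python's parser, so treat this as a sanity check only.

```python
from urllib.robotparser import RobotFileParser

# Example policy only (an assumption for illustration): named blocks for
# specific AI agents, plus a wildcard default for everything else.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""


def can_fetch(agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    for agent in ("OAI-SearchBot", "GPTBot", "SomeOtherBot"):
        print(agent, can_fetch(agent, "https://example.com/guide"))
```

Running a check like this before you upload a changed file is a cheap way to confirm that each named agent gets the answer you intended.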

Note: If your site already has a robots.txt with custom rules, never blindly overwrite it. Back it up, then merge new AI-bot rules into it — adding the bots you are missing without disturbing blocks you set deliberately. Replacing the whole file is how people accidentally deindex themselves.
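One way to merge rather than overwrite is sketched below. The function name merge_ai_rules and its behaviour are my own illustration, not a standard tool: it appends a block only for agents the existing file does not already name, and leaves every existing line untouched. Back up the original file before writing the result anywhere.

```python
def merge_ai_rules(existing: str, new_blocks: dict[str, list[str]]) -> str:
    """Append a block for each agent not already named; never edit existing rules."""
    # Collect the agents the current file already has rules for.
    named = {
        line.split(":", 1)[1].strip().lower()
        for line in existing.splitlines()
        if line.strip().lower().startswith("user-agent:")
    }
    additions = []
    for agent, rules in new_blocks.items():
        if agent.lower() in named:
            continue  # a rule set deliberately in the existing file wins
        additions.append("\n".join([f"User-agent: {agent}", *rules]))
    if not additions:
        return existing
    return existing.rstrip("\n") + "\n\n" + "\n\n".join(additions) + "\n"


if __name__ == "__main__":
    current = "User-agent: *\nDisallow: /private/\n\nUser-agent: GPTBot\nDisallow: /\n"
    print(merge_ai_rules(current, {"GPTBot": ["Allow: /"], "ClaudeBot": ["Allow: /"]}))
```

In the usage example the existing GPTBot block is kept as it was and only the missing ClaudeBot block is added at the end.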

Step 2: Give them a complete, current sitemap.xml

If robots.txt is the door, the sitemap is the floor plan. AI crawlers fetch it to discover every page worth reading, and a good sitemap means nothing important gets missed. This is not glamorous and it is not new — search engines have used sitemaps for two decades — but the AI crawlers lean on it just as heavily, and an incomplete or stale sitemap quietly costs you coverage.

Three things make a sitemap pull its weight. It should list every page you actually want discovered, and nothing you do not — no redirect chains, no dead URLs, no thin pages you would be embarrassed to be cited from. It should carry accurate lastmod dates so crawlers know what has changed and can prioritise re-fetching it. And there should be one canonical version, not three competing sitemaps generated by different plugins fighting each other. If you run a CMS that auto-generates a sitemap, check what it is actually producing before you add a second one; two conflicting sitemaps are worse than one good one.
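To make those three properties concrete, here is a minimal sketch that builds a single sitemap from a mapping of URL to last-modified date, using only the standard library. The build_sitemap name and the example URLs are assumptions for illustration. Keeping the pages in a dictionary means each URL can appear only once, and every entry carries a lastmod value.

```python
import xml.etree.ElementTree as ET
from datetime import date

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages: dict[str, date]) -> str:
    """Build one sitemap: each URL listed once, each with a lastmod date."""
    ET.register_namespace("", SITEMAP_NS)  # serialise with a default namespace
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for loc, modified in sorted(pages.items()):
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
        ET.SubElement(url, f"{{{SITEMAP_NS}}}lastmod").text = modified.isoformat()
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    print(build_sitemap({
        "https://example.com/": date(2026, 5, 1),
        "https://example.com/guide": date(2026, 4, 2),
    }))
```

The dates should come from real modification times in your CMS or file system; a lastmod that changes on every build tells crawlers nothing.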

Step 3: Make your HTML legible with structured data

This is where the real, durable wins live, and it is the step most "AI visibility" content skips because it is less exciting than a magic file. Since the AI engines read your actual HTML, the quality of that HTML — and especially the structured data embedded in it — directly shapes how well they understand and represent you.

Schema.org structured data, embedded as JSON-LD, is how you tell a machine unambiguously what a page is: this is a software product with this name, price, and operating system; this is an article by this author published on this date; this is an organisation with these contact details and this logo. When an extraction pipeline parses a page that carries clean schema, it does not have to guess. It reads the facts you have declared and can represent them accurately in an answer. When the schema is missing, malformed, or contradicts the visible page, the engine falls back to guessing — and guesses get you misquoted or skipped.
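The sketch below shows roughly what "reading the facts you have declared" can look like. It is my own simplified stand-in, not any engine's actual pipeline: it pulls JSON-LD blocks out of a page with the standard-library HTML parser and silently drops any block that fails to parse, which is the point at which a real system would have to fall back to guessing.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> blocks."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # malformed schema yields no declared facts


def extract_jsonld(html: str) -> list:
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items
```

Running extract_jsonld over your own key pages is a quick way to see whether the schema you think you publish is actually parseable.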

For most sites the high-value schema types are straightforward: Organization so the engines know who you are, SoftwareApplication or Product on your product pages, Article on your content, and BreadcrumbList so the structure is clear. The goal is not to stuff every page with every type; it is to make sure the pages that matter declare, accurately, what they are. Accuracy beats volume — schema that disagrees with the visible content is worse than none.
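As a starting point, a JSON-LD block for one of those types can be generated from a plain dictionary. The helper below is a minimal sketch with a hypothetical name (jsonld_script) and placeholder values; the property names shown (name, url, logo) are standard schema.org Organization properties, but which ones you include should mirror what the visible page actually says.

```python
import json


def jsonld_script(data: dict) -> str:
    """Wrap a schema.org object in a JSON-LD <script> element."""
    payload = json.dumps({"@context": "https://schema.org", **data}, indent=2)
    # Escape "</" so a string value cannot close the script element early.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


if __name__ == "__main__":
    # Placeholder values: replace with the facts shown on the page itself.
    print(jsonld_script({
        "@type": "Organization",
        "name": "Example Ltd",
        "url": "https://example.com",
        "logo": "https://example.com/logo.png",
    }))
```

Generating the block from the same data source that renders the visible page is one way to keep the two from drifting apart.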

Step 4: Declare your training preferences with tdmrep

If you care about the training question specifically — whether models may ingest your content for training as opposed to citing it — there is a cleaner machine-readable signal than burying it in robots.txt comments. The Text and Data Mining Reservation Protocol, published as a tdmrep.json file in your .well-known/ directory, lets you declare a reservation against text-and-data-mining use in a structured, standardised way.

This is an honest "declare it" signal rather than a hard enforcement mechanism — it states your position clearly for the crawlers and tools that honour it. Pair it with your robots.txt training rules so the two agree, and you have made your stance unambiguous to anyone, machine or human, who checks.

Step 5: Add the AI-discovery files — as a checkbox, not a strategy

Now we reach llms.txt, ai-discovery.json, knowledge-graph.json, ai-sitemap.xml and the rest of the convention files. You have probably been told these are the centre of AI discovery. They are not, and it is worth being clear-eyed about why before you spend real time on them.

Server-log evidence — including my own, which I have written up separately — consistently shows that the major AI crawlers almost never fetch these files. In a recent month of my own traffic, verified AI crawlers made over two thousand requests to my site and touched any of these convention files exactly twice, while hitting robots.txt and the sitemap hundreds of times. Large-scale studies agree: adoption sits around 10% of sites, with no measurable correlation to AI citations, and the engines that do crawl simply read the HTML instead. Google has stated on the record that it does not use llms.txt. The files are read mostly by SEO crawlers and by AI-readiness audit tools.
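If you want to run the same comparison on your own logs, a minimal sketch follows. It assumes the common Combined Log Format and matches crawlers by user-agent substring only, so it does not verify that a request really came from the named crawler (user-agent strings can be spoofed, and proper verification means checking published IP ranges). The agent list and the tally function are illustrative choices, not a complete inventory.

```python
import re
from collections import Counter

AI_AGENTS = ("GPTBot", "ClaudeBot", "PerplexityBot", "OAI-SearchBot")
CONVENTION_FILES = {"/llms.txt", "/ai-discovery.json", "/knowledge-graph.json", "/ai-sitemap.xml"}
CONTROL_FILES = {"/robots.txt", "/sitemap.xml"}

# Combined Log Format: ... "GET /path HTTP/1.1" status size "referer" "user-agent"
LINE_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def tally(log_lines) -> Counter:
    """Count AI-crawler requests to convention files, control files, and pages."""
    counts = Counter()
    for line in log_lines:
        match = LINE_RE.search(line)
        if not match:
            continue
        agent = match["ua"].lower()
        if not any(name.lower() in agent for name in AI_AGENTS):
            continue  # not one of the AI crawlers we are counting
        path = match["path"].split("?", 1)[0]
        if path in CONVENTION_FILES:
            counts["convention"] += 1
        elif path in CONTROL_FILES:
            counts["control"] += 1
        else:
            counts["pages"] += 1
    return counts
```

Feed it the lines of an access log, for example with open("access.log") as a file object, and compare the convention count with the control and pages counts for your own site.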

So why include them at all? Two honest reasons. First, a growing set of audit tools and scorecards grade sites on whether they have these files, so a clean set is reputational cover if someone runs your domain through one. Second, they are small, static, and safe: cheap insurance that makes you forward-compatible if the convention is ever standardised and the engines start consuming it. Generate them, review them, upload them, and move on. Having the files is not the mistake; reorganising your strategy around them is, when the data says steps 1 to 4 are doing the actual work.
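To show how small such a file is, here is a minimal sketch that writes an llms.txt in the layout the llms.txt proposal describes, as I understand it: a title heading, a one-line summary as a blockquote, then sections of links. The build_llms_txt helper and the example content are hypothetical.

```python
def build_llms_txt(
    title: str, summary: str, sections: dict[str, list[tuple[str, str]]]
) -> str:
    """Render a minimal llms.txt: title, blockquote summary, link sections."""
    lines = [f"# {title}", "", f"> {summary}"]
    for heading, links in sections.items():
        lines += ["", f"## {heading}", ""]
        lines += [f"- [{name}]({url})" for name, url in links]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_llms_txt(
        "Example Ltd",
        "Guides and product documentation for Example Ltd.",
        {"Guides": [("Getting started", "https://example.com/start")]},
    ))
```

That is the whole job: a few lines of static text, reviewed once and uploaded.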

Tip: Do steps 1 to 3 properly even if you skip everything else. Controlled robots.txt access, a clean sitemap, and accurate structured data together make up roughly 90% of practical AI discoverability. The rest is polish.

Step 6: Publish content worth citing

All the technical plumbing above does one thing: it makes sure that when an AI engine reads your site, it reads it correctly and is allowed to. It does not create anything for the engine to cite. That part is still on you. The referrals I see from AI engines land on substantial, specific, well-structured pages — guides that answer a real question completely, product pages with clear facts, content that is genuinely the best answer to something. Thin or generic pages get crawled and ignored. The plumbing earns its value only when there is something good flowing through it.

Putting it together without doing it by hand

Working through all six steps manually — auditing your existing robots rules, checking your sitemap for dead URLs, validating schema across your pages, generating the discovery files correctly — is fiddly and easy to get subtly wrong. That is precisely the job I built Tom's AI Discovery Kit to do. It is a free, portable Windows tool that crawls your site, checks which control and discovery files you already have, scores your AI-readiness, and generates a clean, reviewable package: robots.txt AI-bot rules, a sitemap, a structured-data starter, the tdmrep declaration, and the discovery files — each with an honest note on what it is actually worth. No account, no email, nothing uploaded; everything runs locally and stays on your machine until you choose to publish it.

The kit tells you where you stand and sets up the parts that work. If the scan surfaces deeper problems (broken structure, missing or contradictory schema, crawl issues that trip up search engines and AI engines alike), that is where a full audit pays off. Tom's Site Auditor runs a complete SEO audit of more than 30 checks, with specific fix guidance for every issue it finds, free for 7 days. Start free with the kit, and only step up if the scan says there is something worth fixing.