1. Getting Started
Tom's Site Auditor is a standalone Windows desktop application that crawls your website and checks every page for SEO issues, broken links, missing metadata, content quality problems, and more. It generates a detailed HTML report you can open in any browser.
1.1 System Requirements
Windows 10 or Windows 11. No installation required: the application runs as a single .exe file. No internet connection is needed for the application itself, but an internet connection is required to crawl websites.
1.2 Installation
Download the .zip file from tomdahne.com/site-auditor and extract it to any folder. Double-click TomsSiteAuditor.exe to launch. The app stores its settings and scan data in a 'data' folder next to the executable.
1.3 First Launch
On first launch you will see the main window with a URL input field at the top, scan settings in the middle, and a log output window at the bottom. Enter a website URL and click Start to begin your first scan.
2. The Main Interface
The main window is divided into several areas. At the top is the URL bar where you enter the website address to scan. Below that are the scan configuration controls. The bottom half contains the live log output that shows progress during a scan.
2.1 URL Input
Enter the full URL of the website you want to scan, for example: https://example.com. You can also enter just the domain name (example.com) and the app will automatically add https:// for you. The URL must contain a valid domain with a recognized top-level domain (TLD).
2.2 Max Pages
Controls the maximum number of HTML pages the crawler will process. Default is 100. You can set this up to 10,000. Non-HTML resources (images, PDFs, zip files) are discovered but do not count toward this limit. For a quick check of a small site, 20-50 pages is usually sufficient.
2.3 Max Depth
Controls the maximum crawl depth (clicks from the start URL). Default is 0, which means unlimited depth. Setting a depth limit (e.g., 3) tells the crawler to only follow links up to that many clicks from the homepage. This is useful for large sites where you only want to audit the main navigation pages without crawling into deep archive or blog sections.
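The depth limit behaves like a standard breadth-first traversal of the link graph. The sketch below illustrates the idea in Python — the function name and the sample link graph are purely illustrative, not the app's actual code:

```python
from collections import deque

def crawl_with_depth_limit(start, links, max_depth=0):
    """Breadth-first traversal of a link graph.
    A max_depth of 0 means unlimited, matching the app's default."""
    visited = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        depth = visited[page]
        if max_depth and depth >= max_depth:
            continue  # at the limit: keep the page but follow none of its links
        for target in links.get(page, ()):
            if target not in visited:
                visited[target] = depth + 1
                queue.append(target)
    return visited

# Hypothetical site structure (not from a real crawl):
site = {
    "/": ["/about", "/blog"],
    "/blog": ["/blog/post-1"],
    "/blog/post-1": ["/blog/post-1/comments"],
}
print(crawl_with_depth_limit("/", site, max_depth=2))
```

With max_depth=2, the deeply nested comments page is discovered by the post but never crawled — exactly the "skip the deep archive" behavior described above.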
2.4 Start / Stop Buttons
Click Start to begin a scan. The button changes to Stop while a scan is running. Clicking Stop will gracefully halt the crawl and still generate a report with whatever pages were scanned so far.
2.5 Log Output Window
The log window at the bottom shows real-time progress: each page being fetched, issues found, external link checks, and timing information. After a scan completes, the log shows a summary of total pages scanned and issues found.
2.6 Navigation
The left sidebar provides navigation between the main tabs: Overview, Pages, Issues, Diagnostics, Keywords, Export & Import, and Help. During a scan only the Overview tab is accessible. After a scan completes, all tabs become active.
The Help tab shows quick-reference documentation within the app. At the top of the Help tab is an About / License button that opens the About dialog, where you can view your license status, machine code, and activate your copy with an unlock key.
3. Scan Settings Explained
3.1 Scan Type
Controls how thorough the scan is. Four modes are available:
| Scan Type | Checks Run | Best For |
|---|---|---|
| Lite Scan | Core checks only: missing titles, missing meta descriptions, missing H1, HTTP errors | Quick health check, large sites |
| Normal Scan | Core + Structural: adds duplicate detection, canonical tags, word count, image alt text, multiple H1 | Regular audits |
| Deep Scan | All checks: heading hierarchy, content quality, page depth, link analysis, TF-IDF recommendations, OpenGraph, redirects, images, and more | Full audit before launch or major changes |
| Custom Scan | You choose exactly which check groups to enable via the Edit Groups dialog | Targeted audits, focusing on specific areas |
Tip: Deep mode runs all check groups and produces the most comprehensive report. Custom mode lets you pick and choose — for example, run only Core SEO + TF-IDF Recommendations to find linking opportunities without the noise of every other check.
3.2 Crawl Delay
Controls the pause between page fetches. This affects how polite the crawler is to the target server.
| Preset | Delay | Use Case |
|---|---|---|
| Fast | 100ms | Your own sites or local servers |
| Normal | 150ms (default) | Most websites |
| Slow | 500ms | Shared hosting, rate-limited servers |
Tip: If you get many timeout errors or HTTP 429 responses, switch to Slow delay.
3.3 Deterministic Mode
When enabled, the crawler processes discovered URLs in sorted alphabetical order instead of the order they were found. This means running the same scan twice will produce identical results, making it easier to compare scans over time. Recommended to keep ON for consistent, repeatable audits.
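The difference between the two modes amounts to how the crawl frontier is ordered. A minimal sketch (illustrative only):

```python
def crawl_order(discovered, deterministic=True):
    """Return the order in which newly discovered URLs are queued.
    With deterministic mode on, the frontier is sorted alphabetically,
    so a repeat scan of the same site visits pages in the same order."""
    return sorted(discovered) if deterministic else list(discovered)

found = ["/zebra", "/about", "/contact"]  # order in which links were parsed
print(crawl_order(found))                 # deterministic: alphabetical
print(crawl_order(found, False))          # non-deterministic: as discovered
```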
3.4 Check External Links
When enabled, after crawling all internal pages the app will check every unique external URL found on your site to see if it is reachable. Each external link is tested with an HTTP request and classified into one of four buckets: Confirmed broken, Blocked (bot-blocked), Temporary failure, or Uncertain. Up to 200 external links are checked per scan.
Tip: External link checking can add significant time to a scan (30 seconds to several minutes) depending on how many external links your site has and how fast those servers respond.
3.5 Skip Preflight
Before starting a scan, the app performs a preflight check to verify the domain exists (DNS lookup) and the server responds. For common TLDs like .com, .org, .net, the DNS check is skipped. For uncommon TLDs, a DNS check runs first to catch typos. Check this box to skip all preflight checks entirely and go straight to crawling.
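The DNS portion of preflight can be sketched as follows. Note the exact set of TLDs the app treats as "common" is an assumption here; the manual only names .com, .org, and .net as examples:

```python
import socket
from urllib.parse import urlparse

# Assumed set — the manual names these three as examples of common TLDs.
COMMON_TLDS = {"com", "org", "net"}

def preflight_dns(url):
    """Run the DNS portion of the preflight check.
    Returns True when the domain resolves or the check is skipped."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    if tld in COMMON_TLDS:
        return True  # DNS lookup skipped entirely for common TLDs
    try:
        socket.getaddrinfo(host, None)  # does the domain resolve at all?
        return True
    except socket.gaierror:
        return False  # likely a typo in the domain
```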
4. Running Your First Scan
Step-by-Step
- Enter your website URL in the URL field (e.g., https://yoursite.com).
- Set Max Pages to a reasonable number. Start with 20-50 for a quick test.
- Choose your Scan Type. Normal is a good default for most sites.
- Leave Crawl Delay on Normal unless you know the server is slow.
- Click Start.
- Watch the log output. You will see each page being fetched with its HTTP status code.
- When the scan completes, you can export an HTML report.
- Click View Report to open the HTML report in your default browser.
What Happens During a Scan
When you click Start, the button is immediately disabled to prevent accidental double-clicks. The app then performs a preflight check: DNS resolution and an HTTP request to verify the server responds. If preflight fails, the Start button is re-enabled so you can try again. Once preflight passes, the crawler fetches the robots.txt file (if Respect Robots is on) to learn which areas of the site should not be crawled. It then fetches the start URL, parses it for links, and follows those links to discover more pages. Each page is checked for SEO issues as it is crawled. After all pages are crawled, finalize checks run to detect cross-page issues like duplicate titles, orphan pages, and anchor text quality.
5. Understanding the Results
5.1 Overview Tab
After a scan completes, the Overview tab shows a summary: total pages scanned, total issues found, and a breakdown by severity (Critical, Warning, Info). The Pages tab shows all crawled pages with their HTTP status, title, word count, score, and other data. The Issues tab lists every issue found.
5.2 Severity Levels
| Severity | Meaning |
|---|---|
| Critical | Must fix. These are serious problems that directly harm your SEO: missing titles, missing H1 headings, broken pages (4xx/5xx errors). |
| Warning | Should fix. These are significant issues that may hurt your rankings: missing meta descriptions, duplicate content, missing canonical tags, multiple H1 tags. |
| Info | Consider fixing. These are opportunities for improvement: short titles, low word count, missing OpenGraph tags, slow responses, image optimization. |
5.3 Pages Tab
The Pages tab lists every page the crawler found. Columns include the URL, HTTP status code, page title, H1 heading, word count, response time, page score, page rating, priority, and a recommended changes tip. Click any column header to sort. Use the checkboxes to select pages for bulk actions.
The toolbar above the Pages list provides:
- Select All / Unselect All — Toggle selection checkboxes for all visible rows.
- Copy — Copy selected rows to the clipboard as plain text or CSV.
- Hide Selected — Persistently hide selected pages. Hidden pages are excluded from sitemap exports. See section 5.8 for details.
- Unhide All — Restore all hidden pages (with confirmation). Only appears when pages are hidden.
- Show Hidden — Checkbox that toggles visibility of hidden pages. When checked, hidden pages reappear with greyed-out text so you can review or restore them.
5.4 Issues Tab
The Issues tab lists every issue detected during the scan. Each row shows the affected URL, the issue type, severity, a human-readable message explaining the problem, and a suggested fix. Issues can be sorted by any column.
5.5 Issue Clustering
When multiple pages share the same issue (e.g., 15 pages all missing a meta description), they are automatically grouped into a cluster. The cluster header row shows the issue type and the number of affected pages. Double-click a cluster to expand it and see every individual URL. Double-click again to collapse.
The Expand All / Collapse All button toggles every cluster open or closed at once. When you expand clusters, all columns automatically resize to fit the full URLs and content.
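The grouping described above is a straightforward bucket-by-type operation. A minimal sketch (the data shapes are illustrative, not the app's internals):

```python
from collections import defaultdict

def cluster_issues(issues):
    """Group issues that share the same type, as the Issues tab does.
    `issues` is a list of (url, issue_type) pairs; the result maps each
    type to the list of affected URLs (the cluster's child rows)."""
    clusters = defaultdict(list)
    for url, issue_type in issues:
        clusters[issue_type].append(url)
    return dict(clusters)

sample = [
    ("/a", "Missing Meta Description"),
    ("/b", "Missing Meta Description"),
    ("/c", "Missing Title"),
]
print(cluster_issues(sample))
```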
5.6 Issues Toolbar
The toolbar above the Issues list provides bulk actions:
- Select All / Unselect All — Toggle selection checkboxes for all visible rows.
- Expand All / Collapse All — Open or close every cluster at once.
- Copy — Copy selected rows to the clipboard as plain text or CSV. The clipboard output is clean: any visual indicators used for clustering (arrows, expand icons) are stripped so the data pastes cleanly into spreadsheets or text editors.
- Hide Selected — Temporarily hide selected rows from the list. Hiding a cluster header also hides all its child rows. Use Undo to bring hidden rows back.
- Suppress ▼ — Mark selected issues as reviewed. Choose a reason from the dropdown: Ignore, Acceptable, False Positive, or Fixed Externally. Suppressed issues are excluded from scores, reports, and diagnostics. See section 5.7 for full details.
- Restore — Restore suppressed issues back to active status. Only visible when "Show Suppressed" is checked and suppressed rows are selected.
- Show Suppressed — Checkbox that toggles visibility of suppressed issues. When unchecked (default), suppressed issues are hidden from the list.
- Severity Filters (Critical / Warning / Info) — Three checkboxes that filter the issue list by severity. All are checked by default. Uncheck one to temporarily hide that severity level from view — for example, uncheck Info to focus only on Critical and Warning issues.
5.7 Issue Suppression
Not every issue found by the auditor needs action. Some are intentional, some are false positives, and some have been fixed outside the tool. Suppression lets you mark these issues as reviewed so they stop cluttering your results.
5.7.1 How to Suppress Issues
- Select one or more issues on the Issues tab (use checkboxes or Select All).
- Click Suppress ▼ and choose a reason:
| Option | When to Use |
|---|---|
| Ignore | Not relevant to this audit (e.g., a staging page you don't care about). |
| Acceptable | Reviewed and OK as-is (e.g., a deliberately short title). |
| False Positive | The issue doesn't actually apply (e.g., the page is correctly noindexed). |
| Fixed Externally | Already handled outside this tool (e.g., you fixed it in your CMS). |
All four options have the same effect: the issue is excluded from scores, reports, diagnostics, and issue counts. The reason you choose is recorded and shown in the Status column for your reference.
5.7.2 How to Restore Suppressed Issues
- Check the Show Suppressed checkbox in the toolbar. Suppressed issues reappear with their status displayed.
- Select the issues you want to restore.
- Click Restore. The issues return to active status and are included in all counts and reports again.
5.7.3 What Suppression Affects
- Site Score — Recalculated without suppressed issues (score goes up).
- Overview Cards — Critical/Warning/Info counts exclude suppressed issues.
- Diagnostics — Pie chart, category health, score histogram, and Fix These First panel all filter out suppressed issues.
- HTML Report — Exported reports reflect suppression-filtered counts and scores.
- CSV Export — Suppressed issues are not included in CSV exports.
- Scan History — Issue counts and comparison deltas use filtered numbers.
5.7.4 Persistence
Suppressions are saved per-scan in a suppression.json file inside the scan folder. They persist across app restarts and are preserved when importing scans. Raw scan data (pages.ndjson, issues.ndjson) is never modified — suppression is a non-destructive overlay.
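Because suppression is a non-destructive overlay, it can be pictured as a filter applied at read time. The sketch below illustrates the idea — the JSON schema and field names are assumptions; the manual only states that a suppression.json lives in the scan folder and that the raw NDJSON files are never modified:

```python
import json
import os
import tempfile

def save_suppressions(scan_folder, suppressions):
    """Write the suppression overlay beside the raw scan data."""
    path = os.path.join(scan_folder, "suppression.json")
    with open(path, "w", encoding="utf-8") as f:
        json.dump(suppressions, f, indent=2)
    return path

def active_issues(issues, suppressions):
    """Apply the overlay: filter out suppressed issues at read time,
    leaving the underlying issue records untouched."""
    suppressed = {(s["url"], s["type"]) for s in suppressions}
    return [i for i in issues if (i["url"], i["type"]) not in suppressed]
```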
Tip: Suppress issues you've reviewed but can't or won't fix. This cleans up your results and makes the remaining active issues more actionable. Your site score will reflect only the issues that actually matter to you.
5.8 Page Hiding (Sitemap Exclusion)
The Pages tab lets you permanently hide pages you want to exclude from sitemap exports. Unlike the temporary "Hide Selected" on the Issues tab, page hiding is persistent — hidden pages stay hidden across app restarts and scan reloads.
5.8.1 How to Hide Pages
- Go to the Pages tab and select one or more pages using the checkboxes.
- Click Hide Selected. The pages disappear from the list.
- The status bar shows how many pages are excluded from the sitemap.
5.8.2 What Hiding Affects
- Sitemap Export — Hidden pages are excluded from the exported sitemap.xml. Pages with noindex meta tags are also automatically excluded.
- Pages Tab — Hidden pages are not shown by default. Check "Show Hidden" to see them greyed out.
5.8.3 What Hiding Does NOT Affect
- Issues Tab — Issues for hidden pages are not affected. Page hiding and issue suppression are independent.
- HTML Report — Hidden pages still appear in the "All Pages" table of the HTML report.
- Scores — Page scores and site score are not affected by hiding.
5.8.4 Restoring Hidden Pages
Click Unhide All on the Pages toolbar. A confirmation dialog shows how many pages will be restored. After restoring, the pages reappear in the list and will be included in future sitemap exports.
5.8.5 Persistence
Hidden pages are saved per-scan in a hidden_pages.json file inside the scan folder. They persist across app restarts and are preserved when importing scans. Starting a new scan clears all page hides.
Tip: Use page hiding to exclude utility pages, login screens, or test pages from your sitemap without deleting them from the crawl results. This gives you a clean sitemap while keeping the full audit data intact.
6. The HTML Report
After each scan, you can export an HTML report, which is generated in the scan folder. This is a self-contained file you can open in any browser, share with colleagues, or archive for comparison.
6.1 Report Sections
The report contains the following sections:
- Audit Summary - A short client-ready summary at the top of the report with site score, issue counts, and key findings.
- Fix These First - A priority list of the top issues to address, sorted by impact.
- Issues by Type - All issues grouped by type with severity badges, explanations of why each matters, and suggested fixes.
- Internal Links - Link graph showing inbound and outbound link counts per page, sorted by weakest-linked first. Flags orphan pages (zero inlinks) and weakly-linked pages. Available in Deep and Custom scans with Internal Linking enabled.
- Anchor Text - Anchor text distribution for each page, showing what text is used in links pointing to it. Flags generic anchors ("click here") and over-optimized anchors (same text used too often). Available in Deep and Custom scans with Anchor Text enabled.
- Link Recommendations - TF-IDF content similarity analysis that identifies pages with related content that are not yet linked to each other. Shows similarity percentage and suggested link text. Available in Deep and Custom scans with TF-IDF Recommendations enabled.
- Page Depth - Crawl depth (clicks from homepage) for each page. Flags pages that are buried too deep in the site structure. Available in Deep and Custom scans with Page Depth enabled.
- Heading Structure - Heading hierarchy (H1 → H2 → H3...) for each page, highlighting skipped levels. Available in Deep and Custom scans with Structural enabled.
- All Pages - A table of every crawled page with: URL, Status code, Page title, Page response time, Page score, Page rating, Tip for fixing issues.
- External Links - Results of external link checking if enabled, showing each URL's status and classification.
6.2 Re-Exporting Reports
You can re-export the HTML report at any time — for example, after suppressing some issues to get a cleaner report. Re-exporting overwrites the previous report file in the scan folder. There is only ever one report per scan (no duplicate copies are created).
Suppressed issues are excluded from the exported report. The issue counts, site score, and Fix These First panel in the report all reflect the current suppression state at the time of export.
6.3 CSV Export
In addition to the HTML report, you can export scan data as CSV files for use in spreadsheets. The Pages CSV contains one row per page with all metadata columns. The Issues CSV contains one row per issue.
6.4 Sitemap Export
Click the Sitemap button on the Export & Import tab to generate a sitemap.xml file in the scan folder. The sitemap includes all HTML pages that returned HTTP 200, excluding pages with a noindex meta tag and any pages you have hidden on the Pages tab (see section 5.8). You can re-export the sitemap at any time after changing which pages are hidden.
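The sitemap filtering rules above can be sketched in a few lines of Python. The page-record field names here are assumptions for illustration, not the app's internal format:

```python
from xml.sax.saxutils import escape

def build_sitemap(pages, hidden):
    """Build a sitemap.xml string following the rules above: only
    pages that returned HTTP 200, minus noindex pages and anything
    on the user's hidden list."""
    urls = [
        p["url"] for p in pages
        if p["status"] == 200 and not p.get("noindex") and p["url"] not in hidden
    ]
    entries = "\n".join(f"  <url><loc>{escape(u)}</loc></url>" for u in urls)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        + entries + "\n</urlset>"
    )

pages = [
    {"url": "https://example.com/", "status": 200},
    {"url": "https://example.com/login", "status": 200},           # hidden by user
    {"url": "https://example.com/draft", "status": 200, "noindex": True},
    {"url": "https://example.com/gone", "status": 404},
]
print(build_sitemap(pages, hidden={"https://example.com/login"}))
```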
7. Issue Reference (A-Z)
This section documents every issue type that Tom's Site Auditor can detect.
7.1 Critical Issues
These must be fixed. They directly harm your search engine rankings and user experience.
| Issue | Tier | What It Means | How to Fix |
|---|---|---|---|
| Missing Title | Lite+ | The page has no <title> tag in the <head> section. | Add a unique, descriptive title element (30-60 characters). |
| Missing H1 | Lite+ | The page has no H1 heading. | Add a single, descriptive H1 that summarizes the page content. |
| HTTP 4xx Error | Lite+ | The page returned an HTTP 400-499 error (e.g., 404 Not Found). | Fix or remove links pointing to this page. |
| HTTP 5xx Error | Lite+ | The page returned an HTTP 500-599 error (server error). | Check your server configuration and error logs. |
7.2 Warning Issues
These should be fixed. They may hurt your rankings or indicate structural problems.
| Issue | Tier | What It Means | How to Fix |
|---|---|---|---|
| Missing Meta Description | Lite+ | No meta description tag found. | Add a compelling 150-160 character meta description. |
| Duplicate Title | Normal+ | Two or more pages share the same title. | Give each page a unique title. |
| Duplicate Meta Desc. | Normal+ | Two or more pages share the same meta description. | Write unique meta descriptions for each page. |
| Multiple H1 | Normal+ | The page has more than one H1 heading. | Use a single H1 per page for clear SEO structure. |
| Missing Canonical | Normal+ | No canonical URL tag found. | Add a rel="canonical" link to specify the preferred URL. |
| Orphan Page | Deep | The page has zero inbound internal links. | Add internal links from relevant pages. |
| Very Slow Response | Deep | The server took over 5 seconds to respond. | Investigate server performance, caching, and page complexity. |
| Long Redirect Chain | Deep | The page goes through 3 or more redirects. | Update links to point directly to the final URL. |
7.3 Info Issues
These are suggestions for improvement. Fixing them can improve rankings and user experience.
| Issue | Tier | How to Fix |
|---|---|---|
| Title Too Short | Deep | Aim for 30-60 characters for optimal search display. |
| Low Word Count | Normal+ | Consider adding more substantive content. |
| Thin Content | Deep | Add substantive content or verify this is intentionally minimal. |
| Missing OpenGraph | Deep | Add og:title, og:description, og:image for social sharing. |
| No Index Page | Deep | Remove noindex if this page should be indexed. |
| Long URL | Deep | Use shorter, cleaner URLs for better usability. |
| Image Missing Alt | Normal+ | Add descriptive alt attributes for accessibility and SEO. |
| Generic Anchor Text | Deep | Use descriptive anchor text that describes the target page. |
| Over-Optimized Anchor | Deep | Vary anchor text across linking pages for a natural profile. |
| Deep Page | Deep | Add shortcut links to reduce click depth for important content. |
| Skipped Heading Level | Deep | Fix the heading hierarchy to use sequential levels (H1 → H2 → H3) without gaps. |
| Weakly Linked Page | Deep | Add more internal links from relevant pages. Only 1 inbound link makes a page fragile. |
| Broken External Link | Deep | Remove or replace the broken outbound link. |
| Slow Response | Deep | Investigate server performance. Response took over 2 seconds. |
| Missing Image Dimensions | Deep | Add width and height attributes to prevent layout shift. |
| Missing Lazy Loading | Deep | Add loading="lazy" to below-the-fold images for faster page loads. |
| Link Recommendation | Deep | TF-IDF analysis found topically similar pages with no link between them. Consider adding an internal link using the suggested anchor text. |
8. Scan Modes: Lite / Normal / Deep / Custom
8.1 Lite Scan
Lite scan runs only the Core SEO check group. This catches the most critical problems: missing titles, missing H1 headings, missing meta descriptions, and HTTP errors (4xx/5xx). It is the fastest scan mode and is ideal for large sites where you just want to catch the worst problems, or for a quick sanity check.
8.2 Normal Scan
Normal scan runs Core SEO + Structural checks. In addition to everything in Lite, it adds: duplicate title detection, duplicate meta description detection, multiple H1 detection, missing canonical tags, low word count, and image alt text checking. This is a good default for regular audits.
8.3 Deep Scan
Deep scan runs all check groups: Core SEO, Structural, Content Quality, Images, Internal Linking, Anchor Text, Page Depth, External Links, Performance, and TF-IDF Recommendations. This produces the most comprehensive report including link graph analysis, anchor text distribution, content similarity recommendations, heading structure, and page depth analysis. Use this for the most thorough audit before a site launch or major redesign.
8.4 Custom Scan
Custom scan lets you pick exactly which check groups to enable. Select "Custom Scan" from the dropdown, then click the "Edit Groups..." button to open the Custom Scan Settings dialog. The 10 available check groups are:
| Check Group | What It Checks | Dependencies |
|---|---|---|
| Core SEO | Missing titles, meta descriptions, H1 tags, HTTP errors, broken internal links | Always enabled (required) |
| Structural | Duplicate titles/descriptions, multiple H1, missing canonical, heading hierarchy | None |
| Content Quality | Thin content, low word count, long URLs, missing OpenGraph tags, noindex detection | None |
| Images | Missing alt text, missing dimensions, missing lazy loading | None |
| Internal Linking | Orphan pages, weakly-linked pages, link graph population | None |
| Anchor Text | Generic anchor text ("click here"), over-optimized anchors | Requires Internal Linking |
| Page Depth | Deep pages (4+ clicks from homepage) | Requires Internal Linking |
| External Links | Broken outbound links (HTTP checks on external URLs) | None |
| Performance | Slow responses (>2s), very slow responses (>5s), redirect chains | None |
| TF-IDF Recommendations | Content similarity analysis to find linking opportunities between related pages | Requires Internal Linking |
Tip: Groups marked with a * in the dialog require Internal Linking. If you check one of these, Internal Linking is automatically enabled. If you uncheck Internal Linking, all dependent groups are unchecked too.
The Custom Scan dialog also includes a separate Keyword Analysis checkbox. When checked, keyword extraction runs alongside your selected check groups during the crawl. See section 11 for full details on keyword features.
8.5 Scan Settings Persistence
Your Custom scan group selections are saved to a JSON file (data/custom_scan_profile.json) and persist across app restarts. You can reset to Deep defaults or Normal defaults using the buttons at the bottom of the dialog. The selected scan type (Lite, Normal, Deep, or Custom) is also saved and restored when you relaunch the app, so you do not need to re-select it each time.
8.6 External Links Checkbox Sync
The "Check external links" checkbox on the main interface stays synchronized with the External Links group in your Custom profile. When you switch to Custom scan mode, the checkbox reflects your profile. When you toggle the checkbox while in Custom mode, the profile updates. Switching back to Lite/Normal/Deep restores the checkbox to its previous state.
9. External Link Checking
When the "Check External Links" option is enabled, the app tests every unique external URL found on your site after the crawl completes. This helps you find broken outbound links that hurt user experience and credibility.
9.1 How It Works
Each external URL is fetched using an HTTP GET request with a Range header (to minimize bandwidth). If the first attempt fails, a retry is made after a 2-second delay. A 250ms pause is inserted between checks to avoid triggering rate limits on external servers. Up to 200 external URLs are checked per scan.
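The check loop described above — a ranged GET, one retry after a pause, a politeness gap between requests, and a hard cap — can be sketched like this. The `fetch` parameter is injectable here purely so the sketch can be exercised without a network; the real app's internals are not shown:

```python
import time
import urllib.request

def check_external(urls, fetch=None, delay=0.25, retry_delay=2.0, limit=200):
    """Check external URLs as the manual describes: a GET with a Range
    header to limit bandwidth, one retry after a 2-second wait, a 250ms
    pause between checks, capped at 200 URLs per scan."""
    if fetch is None:
        def fetch(url):
            req = urllib.request.Request(url, headers={"Range": "bytes=0-1023"})
            with urllib.request.urlopen(req, timeout=10) as resp:
                return resp.status
    results = {}
    for url in urls[:limit]:
        try:
            results[url] = fetch(url)
        except Exception:
            time.sleep(retry_delay)          # single retry after a pause
            try:
                results[url] = fetch(url)
            except Exception as exc:
                results[url] = repr(exc)
        time.sleep(delay)                    # politeness gap between checks
    return results
```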
9.2 Classification Buckets
| Classification | Meaning | Action |
|---|---|---|
| Confirmed | HTTP 404/410 or DNS failure. The link is definitively broken. | Remove or replace the link. |
| Blocked | HTTP 403 or 429. The server is blocking automated requests. | Verify manually in a browser. Likely works for real users. |
| Temporary | HTTP 5xx, timeout, or connection error. | Recheck later. The server may have been temporarily down. |
| Uncertain | Redirect loop or other unclassifiable failure. | Verify manually in a browser. |
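The table above maps onto a simple decision ladder. A sketch of that classification (the `error` labels are hypothetical names for non-HTTP failures, not the app's actual values):

```python
def classify_external(status=None, error=None):
    """Map an external link check result onto the four buckets above."""
    if error == "dns" or status in (404, 410):
        return "Confirmed"    # definitively broken
    if status in (403, 429):
        return "Blocked"      # server rejects automated requests
    if error in ("timeout", "connection") or (status is not None and 500 <= status < 600):
        return "Temporary"    # may recover; recheck later
    return "Uncertain"        # redirect loop or unclassifiable failure
```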
10. TF-IDF Link Recommendations
One of the most powerful features is the TF-IDF content similarity analysis. It automatically discovers pages on your site that cover similar topics but are not yet linked to each other — revealing internal linking opportunities you might have missed.
10.1 How It Works
After all pages are crawled, the app extracts the visible text from each page, removes stop words (common words like "the", "and", "is"), and builds a term frequency matrix. It then computes TF-IDF (Term Frequency–Inverse Document Frequency) scores that weight words by how important they are to each page relative to the whole site. Finally, it computes cosine similarity between every pair of pages to find which pages are topically related.
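The pipeline described above — tokenize, drop stop words, weight by TF-IDF, compare with cosine similarity — can be sketched in plain Python. The stop list and sample texts are tiny stand-ins; this is the general technique, not the app's implementation:

```python
import math
from collections import Counter

STOP = {"the", "and", "is", "a", "of", "to", "in"}  # tiny stand-in stop list

def tfidf_vectors(docs):
    """Turn page texts into TF-IDF vectors: term frequency weighted by
    how rare the term is across the whole site."""
    tokenized = [[w for w in d.lower().split() if w not in STOP] for d in docs]
    df = Counter(w for doc in tokenized for w in set(doc))
    n = len(docs)
    return [
        {w: (c / len(doc)) * math.log(n / df[w]) for w, c in Counter(doc).items()}
        for doc in tokenized
    ]

def cosine(a, b):
    """Cosine similarity between two sparse vectors (0 = unrelated, 1 = identical)."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

pages = [
    "coffee brewing guide for beginners",
    "best coffee brewing methods compared",
    "contact us page",
]
vecs = tfidf_vectors(pages)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

The two coffee pages share weighted terms and score well above the unrelated contact page — the same signal the app uses to surface linking opportunities.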
10.2 How Recommendations Are Generated
For each pair of pages with a similarity score above 40%, the app checks the link graph to see if a direct link already exists between them. If page A and page B are similar but neither links to the other, both directions (A → B and B → A) are recommended. If A already links to B but B doesn't link back, only the B → A direction is recommended.
Each recommendation includes:
- Source page - The page that should add a link.
- Target page - The page to link to.
- Similarity score - How topically related the pages are (40-100%).
- Suggested anchor text - Key terms shared between the pages that make good link text.
10.3 Reading the Results
In the HTML report, the Link Recommendations section shows each recommendation as a card with the source and target pages, the similarity percentage, and highlighted phrases you can use as anchor text. Higher similarity percentages indicate a stronger topical connection. Recommendations are limited to the single best match per source page to keep the list actionable.
Tip: Pages with 70%+ similarity that aren't already linked are your highest-value opportunities. These are pages your visitors would naturally want to navigate between.
10.4 Requirements
TF-IDF analysis requires the Internal Linking check group to be enabled (for the link graph that filters out already-linked pages). In a Deep scan this is always active. In a Custom scan, checking TF-IDF Recommendations will automatically enable Internal Linking. Pages need at least 5 unique terms (after stop word removal) to be eligible for analysis.
11. Keyword Analysis
The Keywords tab provides a built-in keyword research and tracking system. It extracts keyword phrases from your crawled pages, lets you research new terms via autocomplete APIs, fetch search volume data, and track your search engine positions over time. All keyword data is stored in a per-domain SQLite database that persists across scans.
11.1 Enabling Keyword Extraction
To extract keywords during a scan, select Custom Scan and check Keyword Analysis in the Edit Groups dialog. When enabled, the crawler extracts 2–5 word phrases from every page's title, H1, meta description, headings, body text, image alt text, and URL slug. After the scan completes, the top 250 keyword candidates (ranked by prominence) are saved to the keyword database.
You can also click the Scan Keywords button on the Keywords tab to extract keywords from scan data already in memory, without running a new crawl. This uses titles, headings, meta descriptions, and URL slugs (body text is not available for post-scan extraction).
11.2 Prominence Scoring
Each keyword receives a prominence score based on where it appears on your pages. Keywords in titles score highest, followed by H1 headings, meta descriptions, URL slugs, subheadings, and body text. The score is aggregated across all pages — a keyword in the title of 3 pages scores higher than one mentioned once in the body.
| Source | Weight | Label |
|---|---|---|
| Title | 10.0 | T |
| H1 | 8.0 | H1 |
| Meta description | 7.0 | M |
| URL slug | 6.0 | U |
| H2–H6 headings | 4.0 | H2 |
| Body text | 1.0 | B |
The Found In column shows compact labels (e.g., "T H1 M B") indicating which sources contain the keyword. This tells you at a glance whether a keyword is structurally emphasised or just mentioned in passing.
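Using the weights from the table, the aggregation can be pictured as a simple weighted sum across sightings. Treating the score as a plain sum of weights is an assumption here; the manual only says the score is aggregated across pages:

```python
# Source labels keyed to the weights in the table above.
WEIGHTS = {"T": 10.0, "H1": 8.0, "M": 7.0, "U": 6.0, "H2": 4.0, "B": 1.0}

def prominence(sightings):
    """Aggregate a keyword's prominence across pages.
    `sightings` is a list of (page, source_label) pairs; the score is
    the sum of the weights, one entry per place the keyword appears."""
    return sum(WEIGHTS[label] for _, label in sightings)

# A keyword in three page titles outscores one mentioned once in body text:
in_titles = [("/a", "T"), ("/b", "T"), ("/c", "T")]
in_body = [("/a", "B")]
print(prominence(in_titles), prominence(in_body))
```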
11.3 Keyword States
Every keyword has one of three states that reflect your review workflow:
- Found — Discovered by the scanner. Raw material you haven't reviewed yet.
- Kept — You've reviewed it and decided it matters. Right-click → Keep.
- Tracked — You're actively monitoring your ranking for this term. Right-click → Track Position.
States are one-directional: Found → Kept → Tracked. Keywords persist until you delete them. Deleting a keyword adds it to the block list so it won't reappear on future scans.
11.4 Keyword Types and Products
Each keyword can be classified as Primary (core target keywords) or Supporting (secondary keywords that support content strategy). You can also assign keywords to Products — user-defined categories you manage via the Products dialog. This lets you filter and group keywords by product line or content area.
11.5 Block List
When you delete a keyword, it's added to the block list. Blocked terms are permanently skipped by future scans — they never reappear. Click the Block List button to view all blocked terms and optionally unblock any you want to allow back in.
Tip: Use the block list aggressively after your first scan. Many of the 250 candidates will be navigation terms, footer text, or boilerplate. Blocking junk early means your next scan fills those slots with real keyword candidates instead.
11.6 Keyword Research Panel
Click the Research button to switch to the research panel. Enter a seed keyword and optional modifiers (comma-separated), then click Fetch to pull autocomplete suggestions from Google, DuckDuckGo, and YouTube. These are real queries people are typing into search engines.
Select the suggestions you want and click Save Selected to add them to your keyword database. A dialog lets you assign a type and product before saving. Duplicate terms already in your list are automatically skipped.
11.7 Search Volume Data
If you have a Keywords Everywhere API key, you can fetch search volume, CPC, competition, and trend data for your keywords. Click the API Key button to enter your key (stored securely using Windows DPAPI encryption — never saved in plain text). Then select keywords and click Get Volume to fetch data in batches.
Use the Country dropdown on the filter row to choose which region's volume data to fetch. Supported countries: United States, United Kingdom, Canada, Australia, New Zealand, South Africa, India, and Global. Your selection is saved between sessions.
Volume, CPC, Competition, and Trend columns appear in the keyword list. The Trend column compares the average search volume of the last 3 months against the prior 3 months and shows an indicator: ↑ +12% (rising), ↓ -8% (declining), or → 0% (stable). Sort by Trend to find keywords with growing search demand.
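The trend comparison above boils down to averaging the last three monthly volumes against the prior three. This sketch assumes the volume data arrives as a list of monthly figures, oldest first; the exact data shape the app uses is not documented.

```python
def trend(monthly_volumes):
    """Compare the average of the last 3 months against the prior 3
    and return an indicator like '↑ +12%', '↓ -8%', or '→ 0%'."""
    recent = monthly_volumes[-3:]
    prior = monthly_volumes[-6:-3]
    prior_avg = sum(prior) / 3
    recent_avg = sum(recent) / 3
    if prior_avg == 0:
        return "→ 0%"  # no baseline to compare against
    pct = round((recent_avg - prior_avg) / prior_avg * 100)
    if pct > 0:
        return f"↑ +{pct}%"
    if pct < 0:
        return f"↓ {pct}%"
    return "→ 0%"

trend([100, 100, 100, 112, 112, 112])  # → "↑ +12%"
```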
11.8 Position Tracking
For keywords in the Tracked state, you can record your search engine ranking position. Right-click → Update Position to open a dialog where you can enter your position (1–100) or mark "Not Ranking" for each of five search engines: Google, Bing, DuckDuckGo, Yahoo, and Brave.
The Position column shows your current ranking with trend arrows: ↑ (improved since last check), ↓ (declined), – (unchanged). Right-click → View History to see the full timeline of position changes for a keyword.
Save & Next: After entering positions, click Save & Next instead of Save. The app saves your entries, automatically opens search results for the next tracked keyword in your list, scrolls the listview to highlight it, and reopens the position dialog — so you can work through all your tracked keywords in one sitting without closing and reopening anything.
11.9 Filtering and Search
The Keywords tab toolbar provides filters for State (All / Found / Kept / Tracked), Type (All / Primary / Supporting), and Product. A search box lets you filter by keyword text. All filters work together — for example, you can show only Tracked Primary keywords for a specific product.
11.10 Export
Click Export CSV to save all visible keywords (respecting current filters) as a CSV file. The export includes 17 columns covering all keyword data: term, state, type, product, article, notes, count, pages, prominence, found-in labels, top URL, volume, CPC, competition, position, engine, and date.
11.11 Other Features
- Double-click a keyword to search for it in your configured search engines (opens in your default browser).
- Right-click → Edit Notes to add free-text notes to any keyword (e.g., "write comparison article for this term").
- Right-click → Edit Article to link a keyword to a target page or article title.
- Right-click → View All Pages to see which URLs the keyword was found on and how many times.
- Select All / Unselect All for bulk operations.
- Copy places the selected keyword terms on the clipboard, one per line.
- Ctrl+A selects all checkboxes. Delete key deletes selected keywords.
11.12 How the Keyword Database Works
Each domain gets its own SQLite database file stored in data/keywords/ (e.g., tomdahne.com.keywords.db). The database persists across scans and app restarts. New scans add new "Found" keywords without removing or modifying your existing Kept and Tracked terms — prominence scores and page counts are updated to reflect the current state of the site.
The database only opens when you take an explicit keyword action (scan, research, import, or history load). It does not auto-load from a leftover URL when the app restarts.
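The "add without clobbering" behavior described above maps naturally onto a SQLite upsert. This is a minimal sketch of that idea — the table name, column names, and state values are assumptions, not the app's actual schema.

```python
import sqlite3

def record_scan_result(db_path, term, prominence, page_count):
    """Insert a newly found keyword, or refresh metrics on an existing one
    without touching its state (kept/tracked survive new scans)."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS keywords (
        term TEXT PRIMARY KEY,
        state TEXT NOT NULL DEFAULT 'found',
        prominence REAL,
        pages INTEGER)""")
    # New terms arrive as 'found' (the column default); existing rows keep
    # their state and only get fresh prominence and page counts.
    con.execute("""INSERT INTO keywords (term, prominence, pages)
        VALUES (?, ?, ?)
        ON CONFLICT(term) DO UPDATE SET
            prominence = excluded.prominence,
            pages = excluded.pages""",
        (term, prominence, page_count))
    con.commit()
    con.close()
```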
Tip: Build your keyword database over time through repeated scan-block-research cycles. After a few months, you'll have a complete map of your site's relationship with search — what you're strong on, what you're missing, and what's worth targeting next.
12. Advanced Settings
12.1 Respect Robots.txt
When enabled (default), the crawler fetches the site's robots.txt file and obeys its Disallow and Allow rules. This means some pages may be skipped if the site owner has restricted crawler access. Disable this only if you own the site and want to scan areas normally blocked from crawlers.
12.2 Robots.txt Override
When both Respect Robots and Override are enabled, the app fetches and logs the robots.txt rules but does not enforce them. This lets you see what would be blocked while still crawling everything.
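The respect/override interplay can be sketched with the standard-library robots.txt parser. The function and variable names here are illustrative; the app's real crawler is not this code.

```python
from urllib import robotparser

def allowed(rp, url, respect_robots, override):
    """Decide whether to fetch a URL given the two settings above.
    rp is a pre-loaded robotparser.RobotFileParser."""
    verdict = rp.can_fetch("TomsSiteAuditor", url)
    if respect_robots and not verdict:
        if override:
            # Log what would have been blocked, but crawl anyway.
            print(f"robots.txt would block {url} (override: crawling anyway)")
            return True
        return False
    return True
```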
12.3 Crawl Source Mode
Controls how the crawler discovers pages:
- Normal (discover links) - Starts from the URL and follows internal links. This is the default.
- Sitemap seeds + discover - Fetches the sitemap.xml first to seed the URL queue, then also follows discovered links.
- Sitemap only - Only crawls URLs found in the sitemap. Does not follow links on pages.
12.4 Strip Tracking Parameters
When enabled, common tracking parameters like utm_source, utm_medium, utm_campaign, fbclid, gclid, etc. are stripped from discovered URLs before crawling. This prevents the same page from being crawled multiple times with different tracking parameters.
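Stripping works by normalizing each discovered URL before it enters the queue, so `?utm_source=x` and the bare URL collapse to one entry. A minimal sketch, assuming a parameter list like the one above (the app's full list is not documented):

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Assumed tracking-parameter set; the app may strip more than these.
TRACKING = {"utm_source", "utm_medium", "utm_campaign",
            "utm_term", "utm_content", "fbclid", "gclid"}

def strip_tracking(url):
    """Remove known tracking parameters, keeping all other query args."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in TRACKING]
    return urlunsplit(parts._replace(query=urlencode(kept)))

# Both variants normalize to the same URL and are crawled once:
strip_tracking("https://example.com/page?utm_source=x&id=5")
# → "https://example.com/page?id=5"
```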
12.5 Custom User-Agent
Allows you to set a custom User-Agent string for the crawler's HTTP requests. By default the crawler identifies as TomsSiteAuditor/3.0.0. Some sites may block or serve different content based on the User-Agent. Leave blank to use the default.
12.6 Allow Insecure TLS
When enabled, the crawler will accept invalid or expired SSL certificates. This is useful for scanning development or staging servers that use self-signed certificates. Do not enable this for production sites.
13. Scan History and Comparisons
13.1 Scan History
Every scan is saved to a timestamped folder inside the data/scans directory. The folder name includes the domain and timestamp, e.g., example.com-scan_2026-02-08_143022. You can revisit any past scan by selecting it from the scan history dropdown.
13.2 Comparing Scans
After completing a scan, if a previous scan of the same domain exists, you can compare the two. The comparison shows changes in site score, issue counts, and pages scanned. Issue counts in comparisons reflect suppression-filtered totals — if you suppressed issues in either scan, the comparison uses the filtered numbers for an accurate picture of progress.
Tip: Use Deterministic Mode for the most reliable comparisons. It ensures pages are crawled in the same order, making differences more meaningful.
14. Managing Scan Data
Over time, scan folders accumulate in the data/scans directory. Rather than manually deleting folders in your file manager, you can manage scan data directly from the Export & Import tab.
14.1 Deleting Old Scans
The "Manage scan data" row on the Export & Import tab provides a dropdown with age-based presets:
- All Scans — Deletes every scan folder.
- Older than 24 hours
- Older than 7 days
- Older than 30 days (default selection)
- Older than 90 days
- Custom (days)... — Enter any number of days in the text field that appears.
Click Delete Scans to check the data folder for matching scans. A confirmation dialog shows exactly how many scan folders match and their total size on disk. Nothing is deleted until you confirm.
14.2 What Gets Deleted
Deleting scans removes the scan folders (containing exported reports, CSV files, and raw scan data) and cleans up matching entries in scan_history.json. The URL history dropdown is refreshed automatically. If the currently loaded scan is among those deleted, the UI resets to a clean state.
14.3 What Is NOT Deleted
Your settings, license, custom scan profiles, and the data folder itself are never touched. Only scan result folders are affected.
14.4 How Age Is Determined
Scan age is determined by the timestamp embedded in the folder name (e.g., example.com-scan_2026-02-08_143022), not by the filesystem modified date. This is reliable even if files are copied or moved.
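Parsing the age out of a folder name like the example above is straightforward. This sketch assumes the `_YYYY-MM-DD_HHMMSS` suffix pattern shown in the example; the app's actual parsing code may differ.

```python
import re
from datetime import datetime

# Matches the timestamp suffix in names like
# "example.com-scan_2026-02-08_143022".
FOLDER_RE = re.compile(r"_(\d{4}-\d{2}-\d{2})_(\d{6})$")

def scan_age_days(folder_name, now):
    """Age in whole days based on the embedded timestamp,
    or None if the name does not look like a scan folder."""
    m = FOLDER_RE.search(folder_name)
    if not m:
        return None
    stamp = datetime.strptime(m.group(1) + m.group(2), "%Y-%m-%d%H%M%S")
    return (now - stamp).days
```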
Tip: You cannot delete scans while a scan is running. Finish or stop the current scan first.
15. Debug Logging
15.1 Enabling Debug Logging
Check the Debug Logging checkbox before starting a scan. This writes a detailed log file (debug.log) to the data folder next to the executable. The log captures millisecond-precision timestamps for every operation.
15.2 What the Debug Log Contains
- Scan start/end markers with total elapsed time
- App version, scan type, crawl delay, and all active settings
- Robots.txt fetch timing (per-attempt: first try vs retry)
- Every page fetch with HTTP status, response time, content type, and redirect count
- Page parse results: title, H1 count, word count, link counts, image counts, depth
- Every issue detected as it happens (type, severity, affected URL)
- Finalize check timing (per-check and total duration)
- TF-IDF analysis: eligible pages, vocabulary size, per-page term counts, pairwise similarity computation time, and each recommendation with similarity percentage
- Custom scan: lists all enabled check groups at scan start
- External link check progress (per-URL with HTTP status, timing, and classification)
- HTML/CSV report generation timing
15.3 Viewing the Debug Log
The View Debug Log button on the Export & Import tab opens debug.log in your default text editor (usually Notepad). The app temporarily releases its file handle so the editor can open the file. Logging resumes automatically on the next scan or log write.
15.4 Deleting the Debug Log
The Delete Debug Log button removes the debug.log file after a confirmation prompt. The current file size is shown next to the buttons so you can see how much space it is using. A new log file is created automatically the next time debug logging is active during a scan.
Tip: The debug log has a 50 MB size cap with automatic rotation. If the file reaches 50 MB, it is renamed to debug.log.old and a fresh log starts. You generally do not need to delete it manually unless you want to reclaim disk space.
16. Troubleshooting
16.1 Scan Takes a Long Time to Start
In v3.0.0, the app performs a fast IPv4-only DNS probe during preflight, which avoids the 10-15 second delays that affected earlier versions on some Windows systems. If you still experience slow startup, check the log output for clues. Common causes:
- Slow DNS resolution: If the log shows a DNS probe time over 2 seconds, see section 16.6 below for an optional Windows-level fix.
- Slow robots.txt fetch: Usually a server-side issue. The retry mechanism handles it automatically.
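An IPv4-only probe like the one described above can be sketched with the standard socket API: forcing `AF_INET` avoids the IPv6 and multicast name-resolution paths that can stall on some Windows systems. The function name and use of port 443 are illustrative.

```python
import socket
import time

def dns_probe(host):
    """Resolve a hostname over IPv4 only and return the elapsed seconds."""
    start = time.monotonic()
    socket.getaddrinfo(host, 443, family=socket.AF_INET,
                       type=socket.SOCK_STREAM)
    return time.monotonic() - start
```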
16.2 Preflight Check Fails
If you see "Preflight failed" when starting a scan, the app could not reach the website. Check that you can access the site in your browser. If the site works in a browser but not in the app, try checking "Skip Preflight" to bypass the check.
16.3 Site Returns 0 Pages
If the site is reachable but the scan crawls 0 pages, the homepage likely returned a non-200 status. Common causes: the site requires authentication (HTTP 401), the site blocks bots (HTTP 403), or the site uses JavaScript rendering exclusively.
16.4 Too Many Issues Found
If a Deep scan produces hundreds of Info-level issues, this is normal for larger sites. Focus on Critical issues first, then Warning, then Info. Alternatively, use a Custom scan to focus on specific check groups.
16.5 External Link Check Is Slow
External link checking tests each URL sequentially with a 15-second timeout per URL and a 250ms delay between checks. If your site has 100+ external links, this can take several minutes. This is normal.
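The pacing described above (sequential checks, 15-second timeout, 250 ms pause) looks roughly like this sketch. The HEAD-request approach and the ok/broken classification are assumptions; the app's real checker may classify statuses differently.

```python
import time
from urllib.request import Request, urlopen
from urllib.error import HTTPError, URLError

def check_external_links(urls):
    """Check each URL in turn; returns (url, status, classification)."""
    results = []
    for url in urls:
        try:
            req = Request(url, method="HEAD")
            with urlopen(req, timeout=15) as resp:  # 15 s per-URL timeout
                status = resp.status
        except HTTPError as e:
            status = e.code
        except (URLError, OSError):
            status = None            # unreachable or timed out
        results.append((url, status, "ok" if status == 200 else "broken"))
        time.sleep(0.25)             # 250 ms pause between checks
    return results
```

With 100+ links at up to 15 seconds each plus the pause, a run of several minutes is expected behavior, not a hang.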
16.6 Slow DNS Resolution (Windows 11)
As of v3.0.0, the app uses an IPv4-only DNS probe that avoids the Windows LLMNR/Smart Name Resolution stalls that caused 10-15 second delays in earlier versions. Most users will not need any system changes.
If you still see slow DNS in the log (over 2 seconds), or if other applications on your system are also experiencing slow DNS, you can optionally disable Windows Smart Name Resolution system-wide. Run the following commands in an Administrator PowerShell:
```powershell
# Create the DNSClient policy key (harmless if it already exists)
New-Item -Path "HKLM:\SOFTWARE\Policies\Microsoft\Windows NT\DNSClient" -Force
# Disable Smart Multi-Homed Name Resolution
Set-ItemProperty -Path "HKLM:\SOFTWARE\Policies\Microsoft\Windows NT\DNSClient" -Name DisableSmartNameResolution -Value 1 -Type DWord -Force
# Disable LLMNR multicast name resolution
Set-ItemProperty -Path "HKLM:\SOFTWARE\Policies\Microsoft\Windows NT\DNSClient" -Name EnableMulticast -Value 0 -Type DWord -Force
```
Restart your computer for the changes to take effect. Note that Windows may reset these settings during Group Policy refresh cycles. If the settings revert, create a scheduled task to reapply them at startup.
17. Tips for Best Results
- Start with Normal scan. It catches the most important issues without overwhelming you with detail.
- Fix Critical issues first. Missing titles and H1 headings have the biggest impact on search rankings.
- Use Deep scan for a full audit. It runs all checks including TF-IDF link recommendations, anchor text analysis, and page depth tracking.
- Use Custom scan to focus. If you only care about internal linking opportunities, select Core SEO + TF-IDF Recommendations. The report will only show relevant sections.
- Use Deterministic Mode. This makes scan results reproducible, essential for tracking progress.
- Scan regularly. Run a scan after every significant content change or site update.
- Check external links periodically. A quarterly check of your outbound links is good practice.
- Act on TF-IDF recommendations. Pages with 70%+ similarity that aren't linked are easy wins for improving internal navigation and SEO.
- Read the HTML report in a browser. It's more readable than the in-app tables and includes clickable links.
- Use ignore rules for irrelevant sections. Exclude /api/ or /admin/ sections to focus on user-facing content.
- Enable debug logging when reporting bugs. The debug.log file contains everything needed to diagnose issues. Use the View button on the Export & Import tab to open it without leaving the app.
- Suppress issues you've reviewed. Mark intentional, acceptable, or already-fixed issues as suppressed. Your site score and reports update to reflect only the issues that matter. You can always restore them later.
- Re-export reports after suppressing. The HTML report reflects current suppression state. Suppress the noise, re-export, and share a cleaner report with clients or colleagues.
- Keep your scan history. Old scans provide a valuable audit trail showing how your site's health has changed.
- Clean up old scans periodically. Use the Manage Scan Data controls on the Export & Import tab to delete scans older than 30 or 90 days and free up disk space.
- Use Expand All to review clusters. When the Issues tab shows clustered issues, click Expand All to see every affected URL at once, then copy to a spreadsheet for tracking.
- Hide pages from your sitemap. Use Hide Selected on the Pages tab to exclude utility pages, login screens, or test pages from sitemap exports without removing them from your audit data.
- Filter issues by severity. Use the Critical / Warning / Info checkboxes on the Issues tab to focus on one severity level at a time. Uncheck Info to cut through the noise and tackle the important stuff first.
- Use the block list aggressively. After your first keyword scan, block the junk terms (navigation words, footer text, boilerplate). Those slots fill with real candidates on your next scan.
- Scan regularly to build your keyword database. Each scan adds new keyword candidates without removing your existing Kept and Tracked terms. Over time, you build a complete picture of your site's keyword landscape.
- Use Research to find gaps. Enter your pillar topics as seed keywords in the Research panel. The autocomplete suggestions show what people actually search for — compare against what your site already covers.
- Track positions for your money keywords. Set your most important terms to Tracked and update positions monthly. The trend arrows show at a glance whether your efforts are working.
- Use the About / License dialog. Click the About / License button on the Help tab at any time to view your license status, machine code, or activate your copy.