Tom's Site Auditor – Instructions

Offline Website Scanner for Windows 10/11

This guide explains how to use Tom's Site Auditor to scan websites, detect SEO issues, generate reports, compare scans over time, and troubleshoot common problems.

Offline-first: All processing stays on your PC.
Portable: Single EXE, no installer needed.
Deterministic: Repeatable scans for reliable comparisons.
25+ checks: SEO, links, tags, structure, and more.

1) What this app does

Tom's Site Auditor crawls a website starting from a URL you provide, discovers pages via links, runs 25+ deterministic health checks, and generates a self-contained HTML report with scores, prioritized issues, and fix guidance.

What it checks
  • Missing or duplicate page titles
  • Missing meta descriptions
  • Missing or multiple H1 tags
  • Broken internal links (4xx/5xx)
  • Broken external links (optional)
  • Missing image alt text
  • Missing canonical tags
  • Missing OpenGraph tags
  • Noindex/nofollow detection
  • Low word count pages
  • Redirect chains
  • And more...
What it is NOT
  • Not a cloud service or SaaS
  • Not a keyword research tool
  • Not a ranking predictor
  • Not a vulnerability / pentest scanner
  • Not dependent on external APIs
  • No account required
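Example (illustrative): to make the idea of a heuristic check concrete, here is a minimal Python sketch of how a missing-title, multiple-H1, and missing-meta-description check could be detected in raw HTML. This is a sketch only, not the app's actual parser.

# Minimal sketch of heuristic HTML checks (illustrative only, not the
# app's actual code). Flags a missing <title>, multiple <h1> tags, and
# a missing meta description using simple regular expressions.
import re

def check_page(html: str) -> list[str]:
    issues = []
    if not re.search(r"<title[^>]*>\s*\S", html, re.IGNORECASE | re.DOTALL):
        issues.append("Missing or empty page title")
    if len(re.findall(r"<h1\b", html, re.IGNORECASE)) > 1:
        issues.append("Multiple H1 tags")
    if not re.search(r'<meta[^>]+name=["\']description["\']', html, re.IGNORECASE):
        issues.append("Missing meta description")
    return issues

print(check_page("<html><h1>A</h1><h1>B</h1></html>"))
# ['Missing or empty page title', 'Multiple H1 tags', 'Missing meta description']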

2) System requirements

Required
  • Windows 10 or Windows 11 (64-bit)
  • Internet connection (to crawl websites)
  • Enough disk space for scan data (~1–50 MB per scan)
Recommended
  • 8 GB RAM for large scans (1,000+ pages)
  • SSD for faster scan storage
  • Modern multi-core CPU for responsiveness
Note: The app itself is a single portable EXE with zero external dependencies. No .NET, no Java, no runtimes needed.

3) First-time setup

  1. Extract the ZIP
    Right-click the downloaded ZIP → Extract All. Do not run the app from inside the ZIP.
  2. Keep the folder structure intact
    The app creates a data folder next to the EXE for scan storage. Don't move files around after extraction.
  3. Run the app
    Double-click TomsSiteAuditor.exe. No installer or admin rights needed.
Windows SmartScreen: If you see "Windows protected your PC", click More info → Run anyway. This happens because the app is new and not yet widely distributed.

4) Quick start

  1. Enter a URL
    Paste a website address (e.g. https://example.com) into the URL field on the Overview tab.
  2. Set max pages
    Start with a small number like 50–100 for your first scan to see how it works.
  3. Click Start
    The crawler will begin discovering pages, checking links, and collecting data.
  4. Review results
    Switch between the Pages, Issues, and Diagnostics tabs as the scan runs. When complete, export an HTML report.
Tip: For your very first scan, try scanning your own website or a small site you're familiar with. This makes it easier to verify the results make sense.

5) Overview tab

The Overview tab is where you configure and launch scans. It shows the URL input, scan settings, and live status during a crawl.

Configuration
  • Start URL — the website to crawl
  • Max pages — hard cap to prevent runaway crawls
  • Crawl delay — time between requests (Normal/Slow)
  • Deterministic mode — stable crawl order for comparison
  • External link checking — on/off toggle
Live status
  • Pages discovered vs. pages scanned
  • Current URL being fetched
  • Error/warning counts
  • Log panel with crawl messages
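Example (illustrative): a rough sketch of how the max pages cap and crawl delay typically interact during a crawl. The function and parameter names here are made up for illustration; this is not the app's real scheduler.

# Illustrative crawl loop showing how "max pages" and "crawl delay"
# interact; names and timings are assumptions, not the app's settings.
import time
import urllib.request

def crawl(start_url: str, max_pages: int = 50, delay_seconds: float = 1.0):
    queue, seen, results = [start_url], {start_url}, []
    while queue and len(results) < max_pages:   # hard cap prevents runaway crawls
        url = queue.pop(0)
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                results.append((url, resp.status))
                # ...parse HTML, extract links, add unseen ones to `queue`...
        except Exception as exc:
            results.append((url, f"error: {exc}"))
        time.sleep(delay_seconds)               # "Normal"/"Slow" presets = longer delays
    return results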

6) Pages tab

Lists every page discovered during the crawl. Each row shows the URL, HTTP status code, page title, meta description presence, word count, and other extracted signals.

Tip: Click any URL in the list to open it in your default browser. The list uses virtual scrolling and can handle thousands of pages efficiently.

7) Issues tab

Shows all detected problems grouped by type and severity. Each issue includes:

  • What the issue is — clear description
  • Why it matters — impact on site health/SEO
  • How to fix it — actionable guidance
  • Affected URL(s) — which pages are affected
Prioritization: The "Fix These First" section highlights the highest-impact issues to address before anything else.
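Example (illustrative): an issue record can be pictured as a small structure with those four fields plus a severity used for prioritization. The field names below are hypothetical, not the app's actual schema.

# Hypothetical shape of a single issue record; field names are
# illustrative only.
from dataclasses import dataclass, field

@dataclass
class Issue:
    what: str        # clear description of the problem
    why: str         # impact on site health / SEO
    fix: str         # actionable guidance
    urls: list[str] = field(default_factory=list)  # affected pages
    severity: str = "warning"                      # used for prioritization

example = Issue(
    what="Missing meta description",
    why="Search engines may generate a poor snippet for this page",
    fix="Add a unique <meta name=\"description\"> of roughly 50-160 characters",
    urls=["https://example.com/about"],
)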

8) Diagnostics tab

Shows aggregate scan statistics including total pages scanned, failed pages, redirect counts, response code distribution, rate limiting events (429s), and scan timing.

Note: Diagnostics data is useful for understanding how the crawl went — for example, a high 429 count means the server was rate-limiting you. Try increasing the crawl delay.
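Example (illustrative): the kind of aggregation shown on this tab amounts to tallying response codes across fetched pages. The (url, status) data shape below is assumed for the sketch.

# Tally response codes from crawl results and flag rate limiting.
from collections import Counter

results = [("https://example.com/", 200),
           ("https://example.com/a", 200),
           ("https://example.com/b", 429),
           ("https://example.com/c", 404)]

codes = Counter(status for _, status in results)
print(codes)                      # Counter({200: 2, 429: 1, 404: 1})
if codes.get(429, 0) > 0:
    print("Server is rate-limiting you; try a slower crawl delay")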

9) Reports & exports

After a scan completes, you can export results in multiple formats:

HTML Report
  • Self-contained file with inlined CSS
  • Site health score and label
  • "Fix These First" priority section
  • Issues by type with fix guidance
  • Full pages table
  • Open in any browser, email to clients
Other exports
  • CSV — pages.csv and issues.csv for spreadsheets
  • XML Sitemap — generated from discovered pages
  • NDJSON — raw scan data for advanced use
Tip: The HTML report is designed to be shared with non-technical clients. The score, priority list, and plain-language guidance make it easy to understand without SEO expertise.
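Example (illustrative): a sitemap export is conceptually just the list of discovered pages wrapped in sitemap XML. A rough sketch of the idea, not the app's exact output:

# Build a simple XML sitemap from a list of discovered page URLs.
from xml.sax.saxutils import escape

def build_sitemap(urls: list[str]) -> str:
    entries = "\n".join(f"  <url><loc>{escape(u)}</loc></url>" for u in sorted(set(urls)))
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            f"{entries}\n</urlset>\n")

print(build_sitemap(["https://example.com/", "https://example.com/about"]))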

10) Scan history & comparison

All scans are saved locally in the data/scans/ folder. You can:

  • Browse and reload previous scans
  • Compare two scans to see what changed
  • Detect regressions (new issues that appeared since the last scan)
  • Export reports from any saved scan
Deterministic mode + scan comparison is the recommended workflow for ongoing monitoring. Run periodic scans with the same settings and compare results to catch regressions early.
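Example (illustrative): if you prefer to script the comparison yourself from the CSV exports, a regression check can be as simple as diffing (URL, issue) pairs from two issues.csv files. The column names and file paths below are assumptions; match them to the actual CSV headers.

# Sketch of a regression check between two exported issues.csv files.
import csv

def load_issue_keys(path: str) -> set[tuple[str, str]]:
    with open(path, newline="", encoding="utf-8") as f:
        return {(row["url"], row["issue"]) for row in csv.DictReader(f)}

old = load_issue_keys("scan_old/issues.csv")    # hypothetical file paths
new = load_issue_keys("scan_new/issues.csv")

for url, issue in sorted(new - old):
    print(f"REGRESSION: {issue} on {url}")      # issues that appeared since the last scan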

11) Advanced settings

Crawl delay presets

Controls how fast the crawler makes requests. Use "Slow" if you're getting rate-limited (429 errors) or if you want to be polite to smaller servers.

Deterministic mode

When enabled, the crawler processes URLs in a stable sorted order. This means repeated scans of the same site with the same settings will produce the same crawl order, making comparison results more meaningful.
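The underlying idea is simply that a sorted frontier does not depend on the order in which links happen to be discovered. A tiny illustration (not the app's scheduling code):

discovered = ["https://example.com/blog", "https://example.com/", "https://example.com/about"]
print(sorted(discovered))
# ['https://example.com/', 'https://example.com/about', 'https://example.com/blog']
# The output is the same regardless of which link was found first, so two
# scans of an unchanged site visit pages in the same sequence.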

External link checking

When enabled, the crawler validates outbound links to other domains (checking for broken external links). This increases scan time but gives a more complete picture.

Max pages

Sets a hard cap on how many pages to crawl. For large sites, start with 500–1000 and increase as needed. The scan will stop cleanly when the limit is reached and note it in the report.

URL ignore rules

Pattern-based rules to exclude certain URLs from crawling (e.g. admin paths, login pages, infinite calendars).
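Example (illustrative): conceptually, an ignore rule is a pattern matched against each discovered URL before it is queued. A minimal sketch using glob-style patterns; the patterns and matching syntax here are illustrative, not necessarily what the app's ignore-rule field accepts.

# Glob-style URL ignore rules (illustrative patterns).
from fnmatch import fnmatch

IGNORE_PATTERNS = ["*/wp-admin/*", "*/login*", "*/calendar/*"]

def should_skip(url: str) -> bool:
    return any(fnmatch(url, pattern) for pattern in IGNORE_PATTERNS)

print(should_skip("https://example.com/wp-admin/options.php"))  # True
print(should_skip("https://example.com/blog/post-1"))           # False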

Robots.txt

The crawler can optionally respect robots.txt rules. This is configurable per scan.
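Example (illustrative): if you want to check by hand what a site's robots.txt allows, Python's standard library can answer the same kind of question the crawler asks when this option is enabled.

# Check whether a URL is allowed by robots.txt.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()                                                      # fetch and parse robots.txt
print(rp.can_fetch("*", "https://example.com/private/page"))   # True/False per the rules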

TLS certificate validation

By default, the crawler allows connections to sites with expired or self-signed certificates (common for SEO tools that need to audit all sites). You can enable strict mode if you prefer to fail on bad certificates.
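Example (illustrative): the difference between lenient and strict validation looks like this when expressed with Python's standard library. The app exposes this as a simple setting; the sketch only shows what each mode tolerates.

# Strict vs. lenient TLS validation (illustrative).
import ssl
import urllib.request

strict_ctx = ssl.create_default_context()      # fails on expired/self-signed certificates

lenient_ctx = ssl.create_default_context()
lenient_ctx.check_hostname = False
lenient_ctx.verify_mode = ssl.CERT_NONE        # accepts any certificate

urllib.request.urlopen("https://expired.badssl.com/", context=lenient_ctx)    # succeeds
# urllib.request.urlopen("https://expired.badssl.com/", context=strict_ctx)   # raises SSLError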


12) Known limitations (current version)

  • HTML parsing: The parser is heuristic/string-based, not a full browser DOM. Some edge cases with complex JavaScript-rendered content may produce false positives/negatives.
  • JavaScript-heavy sites: The crawler fetches raw HTML and does not execute JavaScript. Single-page applications (SPAs) may not be fully crawlable.
  • Memory usage: All pages and issues are stored in RAM during a scan. Very large sites (10,000+ pages) will use significant memory.
  • External link checking: Adds scan time and is subject to network variability (timeouts, rate limiting by third-party servers).
  • Deterministic mode: Guarantees stable crawl ordering, but cannot guarantee identical results if the site's content or server behavior changes between scans.

13) Troubleshooting & support

If you're running into problems, try these quick fixes before submitting a support request:
  • Close and reopen the app
  • Re-extract the ZIP (corrupted extract can cause issues)
  • Make sure you are not running the app from inside the ZIP
  • Try scanning a different website to confirm if the issue is site-specific
  • Check that you have an internet connection (the app needs to reach the target site)

Common issues

Problem: "The app won't open" or "Windows protected your PC"
  • Click More info → Run anyway
  • Ensure the ZIP is extracted first
  • If your antivirus quarantined files, restore them and add the app folder as an exception
Problem: Scan seems stuck or very slow
  • The target site may be rate-limiting you — increase the crawl delay
  • Disable external link checking for faster scans
  • Reduce max pages to start with a smaller scan
  • Check the Diagnostics tab for 429 (rate limited) responses
Problem: Report shows unexpected issues
  • Some issues may be false positives due to JavaScript-rendered content
  • Verify by opening the affected URL in a browser and viewing source
  • If the reported issue turns out to be a false positive, please report it as a bug
Problem: Settings don't persist between sessions
  • Make sure the app has write access to its data folder
  • Don't run the app from a read-only location

How to submit a support request (copy/paste template)

Please copy this template and fill it out:

Tom's Site Auditor Support Request

1) App Version: (Shown in the About section)
2) Windows Version: (Windows 10/11 + edition if known)
3) CPU + RAM: (Example: Ryzen 5, 16GB)
4) What URL were you scanning?
5) What settings were you using? (Max pages, crawl delay, deterministic mode, etc.)
6) What did you expect to happen?
7) What actually happened? (Error message? Crash? Wrong results? Hang?)
8) Steps to reproduce:
   1.
   2.
   3.
9) Any screenshots or exported reports: (Attach if available)

Best possible bug report: App version + URL + settings used + exact steps + a screenshot of any error or unexpected behavior.