This page describes the bot identified by the user-agent string ttek2-bot/1.0 (+https://ttek2.com/about/crawler; contact@ttek2.com). It is what publishers will see in their access logs when this site indexes their content.

What it does

The crawler walks an explicit allowlist of ~80 technology-publication and forum hosts. It fetches HTML article pages, RSS/Atom feeds, and sitemaps. Extracted content is classified for tech relevance, deduplicated, and stored in a local SQLite/FTS5 index that powers the search and topics sections of this site. We do not republish article bodies; only titles, short snippets, and links back to the source are surfaced to readers.

Etiquette

  • robots.txt is honored for the user-agent above and for the wildcard User-agent: *. Disallowed paths are not fetched, even if discovered through outlinks.
  • Crawl-delay directives override our default per-host pacing when they are higher.
  • Per-domain caps limit fetches to a configurable rolling 24-hour ceiling per host (default 200/day; lower for smaller sites).
  • HTTP 429 / 503 / 509 responses trigger an exponential backoff (5m → 15m → 1h → 4h → 24h) on the affected host, automatically lifted once the host responds normally again.
  • Conditional GET (If-None-Match / If-Modified-Since) is sent on every refetch, so unchanged pages return 304 and avoid bandwidth waste.
  • noindex and noai meta tags are obeyed; pages carrying either are dropped without indexing.
  • Paywalls / authentication walls are not bypassed. We do not solve CAPTCHAs, follow login redirects, or use shared credentials.
  • No JavaScript execution. Pages that require JS to render their main content are skipped, or covered through the publisher's RSS feed or sitemap if one is available.
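
As an illustration of the backoff rule in the list above, here is a minimal Python sketch. It is not taken from the crawler's source (which is PHP); the class name HostBackoff, the choice to escalate one step per throttling response, and the choice to treat every status other than 429 / 503 / 509 as "responding normally" are assumptions made for the example.

```python
from dataclasses import dataclass

# Backoff steps from the list above, in seconds: 5m, 15m, 1h, 4h, 24h.
BACKOFF_STEPS = [5 * 60, 15 * 60, 60 * 60, 4 * 60 * 60, 24 * 60 * 60]
THROTTLE_STATUSES = {429, 503, 509}


@dataclass
class HostBackoff:
    """Hypothetical per-host backoff state (illustrative, not the real class)."""

    level: int = -1             # -1 means no backoff is active
    blocked_until: float = 0.0  # timestamp before which the host is not fetched

    def record(self, status: int, now: float) -> None:
        if status in THROTTLE_STATUSES:
            # Escalate one step per throttling response, capped at the 24h step.
            self.level = min(self.level + 1, len(BACKOFF_STEPS) - 1)
            self.blocked_until = now + BACKOFF_STEPS[self.level]
        else:
            # Assumption: any other status counts as a normal response
            # and lifts the backoff.
            self.level = -1
            self.blocked_until = 0.0

    def may_fetch(self, now: float) -> bool:
        return now >= self.blocked_until
```

With this sketch, a first 429 at time t blocks the host until t + 300 seconds, a second throttling response extends the block to 15 minutes from that point, and a single normal response resets the state.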

How to opt out

If you would like your site removed from the index, you can:

  1. Block via robots.txt. Add the following block; we will stop fetching within 24 hours of the directive going live:

       User-agent: ttek2-bot
       Disallow: /

  2. Email us at contact@ttek2.com with the host(s) you want removed. Removal is permanent (we add the host to a deny-list so future re-discovery via outlinks does not re-add it). Already-indexed documents from that host are removed in the next scheduled re-index pass (within 24 hours).
  3. Specific URL takedowns are also handled at the same address. Please include the URL(s).
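
If you want to check the robots.txt block from the first option before deploying it, the following Python sketch may help. It uses the standard library's urllib.robotparser, which is not the parser the crawler itself uses, so treat the result as a sanity check rather than a guarantee. The helper name is_blocked and the example.com URL are placeholders.

```python
from urllib.robotparser import RobotFileParser

# The opt-out block from the list above.
ROBOTS_TXT = """\
User-agent: ttek2-bot
Disallow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt disallows user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/articles/1"  # placeholder URL
    print(is_blocked(ROBOTS_TXT, "ttek2-bot", url))  # True
    print(is_blocked(ROBOTS_TXT, "otherbot", url))   # False
```

The second call shows that the block targets only this crawler; other user agents are unaffected unless your robots.txt also has rules for them.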

If a request is urgent (e.g. content that should never have been published), put "urgent" in the subject line of your email and we will process it the same day.

What we do NOT do

  • We do not train AI models on crawled article bodies.
  • We do not redistribute the full text or images of articles; only titles, short snippets (~30 words), and a link back to the source are exposed to readers.
  • We do not run third-party advertising on indexed content. The site has no ad network.
  • We do not sell access to the index. There is no public scraping API.
  • We do not impersonate browsers to evade anti-bot measures. The user-agent string above is what every fetch sends, with one documented exception: for a small allowlist of hosts behind aggressive Cloudflare anti-bot rules (configured per host in crawler.json), the bot identifies with a stock Firefox UA. This is not done to bypass robots.txt; those hosts' robots.txt is still honored.

Volume and cadence

  • Average: 1,000–4,000 article fetches per day across the full allowlist.
  • Peak per-host: 200/day (can be configured lower).
  • Default per-host crawl delay: 2 seconds.
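
To show how the per-host numbers above interact, here is a minimal Python sketch of a limiter that combines the 2-second default delay, a higher robots.txt Crawl-delay when present, and the rolling 24-hour cap. The class name HostPacer and its parameters are hypothetical; the actual crawler is written in PHP and may be structured differently.

```python
from collections import deque

DAY_SECONDS = 24 * 60 * 60


class HostPacer:
    """Hypothetical per-host limiter: crawl delay plus rolling 24-hour cap."""

    def __init__(self, default_delay=2.0, daily_cap=200, crawl_delay=None):
        # A robots.txt Crawl-delay only wins when it is higher than the default.
        self.delay = max(default_delay, crawl_delay or 0.0)
        self.daily_cap = daily_cap
        self.fetch_times = deque()  # timestamps of fetches in the last 24 hours

    def try_fetch(self, now: float) -> bool:
        """Record a fetch at time `now` and return True if it is allowed."""
        # Drop timestamps that have left the rolling 24-hour window.
        while self.fetch_times and now - self.fetch_times[0] >= DAY_SECONDS:
            self.fetch_times.popleft()
        if len(self.fetch_times) >= self.daily_cap:
            return False  # daily ceiling reached for this host
        if self.fetch_times and now - self.fetch_times[-1] < self.delay:
            return False  # too soon after the previous fetch
        self.fetch_times.append(now)
        return True
```

A 2-second delay alone would permit 86,400 / 2 = 43,200 fetches per day, so with the defaults the 200/day cap is the limit that actually binds; the delay mainly spreads fetches out within a burst.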

Source code

The crawler is implemented in core/CrawlFrontier.php, core/HtmlFetcher.php, core/RobotsCache.php, and core/CrawlerPipeline.php. The orchestrator that schedules batches is admin/crawler-orchestrator.php. Our allowlist and per-host overrides live in content/config/crawler.json.

Contact

  • Removal / takedown / questions: contact@ttek2.com
  • Security: same address — please use "security" in the subject line.