Crawler

Last updated March 21, 2026

SynthLink's crawlers periodically collect data from public sources and convert it into normalized document records ready for LLM enrichment and API delivery. Each crawler handles a different external source, but all follow the same contract — fetch, filter, normalize, and upsert into the shared documents table.

Overview

The crawler layer has three core responsibilities: standardizing different external data formats into a single documents table schema, deduplicating by URL while tracking the most recent observation time, and making stored documents reliably available to the downstream enrichment pipeline and the read-only API.

Every document produced by a crawler contains the same base fields regardless of source.

Document fields produced by all crawlers
{
  "title":          string,   // extracted from source
  "url":            string,   // canonical URL after normalization
  "summary":        string,   // raw excerpt — not LLM-generated
  "content":        string,   // full body if available, else null
  "source":         string,   // logical source identifier
  "content_source": string,   // ingestion method: rss | detail | api
  "created_at":     string    // first seen (ISO 8601)
}

source identifies which logical source a document belongs to. content_source describes how the content was actually obtained — whether only the RSS summary was stored, whether the detail page was fetched and parsed, or whether the content came directly from a structured API response.
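In a TypeScript worker codebase, the base shape can be captured in one interface shared by all crawlers. The field names come from the schema above; the interface itself and the example values are a sketch, not the actual type definition:

```typescript
// Shared shape of every record a crawler upserts into the documents table.
// The union values for content_source mirror the three ingestion methods.
interface CrawledDocument {
  title: string;           // extracted from the source
  url: string;             // canonical URL after normalization (dedup key)
  summary: string;         // raw excerpt, never LLM-generated
  content: string | null;  // full body when available
  source: string;          // logical source id, e.g. "hn"
  content_source: "rss" | "detail" | "api";
  created_at: string;      // first-seen timestamp, ISO 8601
}

// Example: an API-sourced Hacker News story with no fetched body.
const doc: CrawledDocument = {
  title: "Show HN: Example",
  url: "https://news.ycombinator.com/item?id=1",
  summary: "Show HN: Example — 120 points, 45 comments",
  content: null,
  source: "hn",
  content_source: "api",
  created_at: new Date().toISOString(),
};
```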

Data collection

SynthLink uses three collection strategies depending on the source.

Feed-based collection

OpenAI, NASA, and arXiv crawlers read RSS or Atom feeds first, extracting titles, links, and summaries from each entry. OpenAI and NASA go a step further — after parsing the feed, they fetch the detail page HTML and attempt to extract longer body text from the article or main region. If the detail page yields insufficient content, the crawler falls back to the RSS summary rather than discarding the document.
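The fallback behavior can be sketched as follows. extractMainText and MIN_BODY_LENGTH are illustrative stand-ins for the crawlers' real parsing helpers and thresholds, and the injectable fetchImpl parameter exists only to make the sketch testable:

```typescript
// Illustrative minimum body length before a detail page is trusted.
const MIN_BODY_LENGTH = 300;

function extractMainText(html: string): string {
  // Crude stand-in: strip tags and collapse whitespace. A real crawler
  // would target the article or main region of the page.
  return html.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim();
}

async function resolveContent(
  detailUrl: string,
  fetchImpl: typeof fetch = fetch,
): Promise<{ content: string | null; content_source: "rss" | "detail" }> {
  try {
    const res = await fetchImpl(detailUrl);
    if (res.ok) {
      const body = extractMainText(await res.text());
      if (body.length >= MIN_BODY_LENGTH) {
        return { content: body, content_source: "detail" };
      }
    }
  } catch {
    // A detail-page failure must not discard the document.
  }
  // Fall back to the RSS summary rather than dropping the item.
  return { content: null, content_source: "rss" };
}
```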

API-based collection

GitHub, NVD, and Hacker News crawlers call public APIs directly. Because responses are already structured, no HTML parsing is needed. Instead, relevant fields are composed into a summary and content. For example, GitHub composes a summary from the repository description, star count, fork count, primary language, and topics. NVD structures the CVE description, CVSS score, KEV status, and reference links. Hacker News combines the story title, points, comment count, author, and story text.
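The GitHub case can be sketched like this. The input field names mirror the GitHub repository API; the separator and exact wording of the composed summary are assumptions:

```typescript
// Fields of interest from a GitHub repository API response.
interface RepoFields {
  description: string | null;
  stargazers_count: number;
  forks_count: number;
  language: string | null;
  topics: string[];
}

// Compose a human-readable summary from structured fields — no HTML
// parsing needed, since the API response is already structured.
function composeGithubSummary(repo: RepoFields): string {
  const parts = [
    repo.description ?? "No description",
    `★ ${repo.stargazers_count}`,
    `forks: ${repo.forks_count}`,
  ];
  if (repo.language) parts.push(`language: ${repo.language}`);
  if (repo.topics.length > 0) parts.push(`topics: ${repo.topics.join(", ")}`);
  return parts.join(" · ");
}
```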

Hybrid collection

OpenAI and NASA use a hybrid approach — feed parsing for item discovery and detail page fetching for content enrichment. This separation means new items are detected quickly via the feed, and detail page failures do not block the entire crawl cycle.

Normalization and filtering

Crawlers do not store raw content as-is. Two steps run before every upsert.

URL normalization

Each crawler produces a canonical URL used as the deduplication key. Common transformations include removing tracking parameters (utm_source etc.), stripping trailing slashes, and normalizing domain variants. arXiv normalizes export.arxiv.org URLs and strips version suffixes to produce a stable arxiv.org/abs/... form. Hacker News substitutes the HN item URL when no external link is present.
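A minimal normalizeUrl covering the transformations above might look like this (the utm_ prefix rule is the only tracking-parameter rule shown; real crawlers may strip more):

```typescript
// Produce the canonical URL used as the deduplication key.
function normalizeUrl(raw: string): string {
  const u = new URL(raw);
  // Drop common tracking parameters.
  for (const key of [...u.searchParams.keys()]) {
    if (key.startsWith("utm_")) u.searchParams.delete(key);
  }
  // Normalize arXiv domain variants and strip version suffixes.
  if (u.hostname === "export.arxiv.org") u.hostname = "arxiv.org";
  if (u.hostname === "arxiv.org") {
    u.pathname = u.pathname.replace(/v\d+$/, "");
  }
  // Strip trailing slash, but keep the root path.
  if (u.pathname.length > 1 && u.pathname.endsWith("/")) {
    u.pathname = u.pathname.slice(0, -1);
  }
  return u.toString();
}
```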

Quality filtering

Each crawler applies minimum quality thresholds before writing a record. OpenAI and NASA require a minimum body length and summary length — documents that fail both checks are discarded. arXiv rejects abstracts that are too short. GitHub, NVD, and Hacker News apply filters based on star count, CVSS score, story score, and time range respectively. The system is designed to expose only information worth surfacing through the API, not everything that was fetched.
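A sketch of the length-based gate used by the OpenAI and NASA crawlers, with illustrative threshold values:

```typescript
// Illustrative minimum-quality thresholds; each real crawler defines its own.
const THRESHOLDS = {
  minSummaryLength: 80,
  minBodyLength: 300,
};

// OpenAI/NASA-style gate: keep the item if either the body or the
// summary clears its threshold; discard only when both checks fail.
function passesLengthFilter(summary: string, body: string | null): boolean {
  if (body !== null && body.length >= THRESHOLDS.minBodyLength) return true;
  return summary.length >= THRESHOLDS.minSummaryLength;
}
```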

Data flow

After a crawler writes to the documents table, the data moves through two more stages before reaching external consumers.

Full pipeline
external source
  → crawler (upsert into documents)
  → insight-worker (LLM enrichment → upsert into insights)
  → public API (/api/v1/documents, /api/v1/insights, /api/v1/combined)

The insight-worker picks up documents that do not yet have an insight, or insight records in a retryable failed state. It uses content as the LLM input when available, falling back to summary. The model output is parsed into llm_summary, keywords, tags, and category, then written to the insights table.

Upserts use url as the conflict key. If a document with the same URL already exists, the existing record is preserved. created_at is never modified after initial insertion.
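Assuming a SQLite-compatible store, the upsert can be sketched as a single statement; ON CONFLICT(url) DO NOTHING leaves the existing row, and therefore its created_at, untouched:

```typescript
// Sketch of the upsert with url as the conflict key. Because the
// conflict action is DO NOTHING, created_at is never rewritten after
// the first insert.
const UPSERT_SQL = `
  INSERT INTO documents
    (title, url, summary, content, source, content_source, created_at)
  VALUES (?, ?, ?, ?, ?, ?, ?)
  ON CONFLICT(url) DO NOTHING
`;
```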

Automation

Each crawler implements both scheduled() and fetch() handlers, supporting scheduled runs and HTTP-triggered manual runs without any code changes.
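A minimal dual-entry worker module might look like this; runCrawl and the Env binding are placeholders, not the real implementation:

```typescript
interface Env {
  DB: unknown; // e.g. a database binding
}

async function runCrawl(_env: Env): Promise<number> {
  // fetch → filter → normalize → upsert; returns the processed count.
  return 0; // placeholder for the real crawl loop
}

const worker = {
  // Invoked by the platform's cron trigger.
  async scheduled(_event: unknown, env: Env): Promise<void> {
    await runCrawl(env);
  },
  // Invoked by an HTTP request, enabling manual runs without code changes.
  async fetch(_req: Request, env: Env): Promise<Response> {
    const processed = await runCrawl(env);
    return new Response(JSON.stringify({ processed }), {
      headers: { "content-type": "application/json" },
    });
  },
};

export default worker;
```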

Schedules are managed at two levels. Some crawlers define a Cloudflare cron trigger in wrangler.toml. Others are registered with an external scheduler and tracked in the crawlers table alongside metadata such as trigger_url, cron_schedule, and cron_enabled. In practice, the active schedule for any given crawler may differ from what is defined in the repository — always refer to the Status page for the latest run history.

Source           Interval               Schedule method
openai_news      every 12h              External scheduler
nasa_news        every 24h              External scheduler
github_trending  every 6h               Wrangler cron
arxiv            every 12h              Wrangler cron
hn               every 3h               Wrangler cron
nvd              Configured externally  External scheduler

Failure handling

Crawlers are designed with the expectation that external requests will fail. Each worker retries up to three times with exponential backoff. Transient errors (429, 5xx) are retried; NVD additionally respects the Retry-After header. OpenAI and NASA continue with the RSS summary if detail page fetching fails, rather than discarding the item.
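The retry policy can be sketched as follows; the base delay and the injectable fetchImpl are illustrative:

```typescript
// Exponential backoff: attempt 0 → 1s, 1 → 2s, 2 → 4s (base is illustrative).
function backoffMs(attempt: number, baseMs = 1000): number {
  return baseMs * 2 ** attempt;
}

// Only transient statuses are worth retrying.
function isRetryable(status: number): boolean {
  return status === 429 || status >= 500;
}

async function fetchWithRetry(
  url: string,
  maxRetries = 3,
  fetchImpl: typeof fetch = fetch,
): Promise<Response> {
  let lastErr: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      const res = await fetchImpl(url);
      if (res.ok || !isRetryable(res.status)) return res;
      // Prefer the server's Retry-After (seconds) when present, as for NVD.
      const retryAfter = Number(res.headers.get("retry-after"));
      const delay = retryAfter > 0 ? retryAfter * 1000 : backoffMs(attempt);
      await new Promise((r) => setTimeout(r, delay));
    } catch (err) {
      lastErr = err;
      await new Promise((r) => setTimeout(r, backoffMs(attempt)));
    }
  }
  throw lastErr ?? new Error(`retries exhausted for ${url}`);
}
```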

The insight-worker applies a separate retry policy for LLM enrichment. Failed insight records with retry_count < 3 are reprocessed with delays of 1, 5, and 15 minutes between attempts. Errors classified as non-retryable — such as missing_openrouter_api_key, empty_source_text, or missing_document — are marked permanently failed without further retries.
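A sketch of the enrichment retry schedule; the non-retryable codes are taken from the text above, while any other error-code names are hypothetical:

```typescript
// Fixed delays between enrichment attempts, in minutes.
const RETRY_DELAYS_MIN = [1, 5, 15];

// Errors that can never succeed on retry.
const NON_RETRYABLE = new Set([
  "missing_openrouter_api_key",
  "empty_source_text",
  "missing_document",
]);

// Returns the delay before the next attempt, or null to mark the
// insight permanently failed (non-retryable error or retry_count >= 3).
function nextRetryDelayMin(retryCount: number, errorCode: string): number | null {
  if (NON_RETRYABLE.has(errorCode)) return null;
  if (retryCount >= RETRY_DELAYS_MIN.length) return null;
  return RETRY_DELAYS_MIN[retryCount];
}
```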

Note: A failed crawl cycle does not affect existing documents. Records already in the database remain accessible. The main observable effect is that no new documents from that source will appear until the next successful run.

Observability

Every crawler writes its run result to the worker_runs table — worker name, success flag, number of processed records, and error message if applicable. This is the data source for the crawler history shown on the Status page.

The insight-worker additionally writes to integrity_checks after each run — recording the count of orphan insights, duplicate URLs, and completed insights with an empty llm_summary. These values are also visible on the Status page.
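The three checks can be sketched as SQL counts; the column names (document_id, status, llm_summary) follow the descriptions above, but the exact schema is an assumption:

```typescript
// Integrity queries run by the insight-worker after each cycle,
// assuming SQLite-style SQL over the documents and insights tables.
const INTEGRITY_QUERIES = {
  // Insights whose document no longer exists.
  orphan_insights: `
    SELECT COUNT(*) FROM insights i
    LEFT JOIN documents d ON d.id = i.document_id
    WHERE d.id IS NULL`,
  // URLs appearing more than once despite the upsert key.
  duplicate_urls: `
    SELECT COUNT(*) FROM (
      SELECT url FROM documents GROUP BY url HAVING COUNT(*) > 1)`,
  // Completed insights with no generated summary.
  empty_summaries: `
    SELECT COUNT(*) FROM insights
    WHERE status = 'completed'
      AND (llm_summary IS NULL OR llm_summary = '')`,
};
```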

Table             Written by                     Contents
worker_runs       All crawlers + insight-worker  Run result, processed count, error
integrity_checks  insight-worker                 Orphan count, duplicate count, null summary count
status_snapshots  Internal status API            Aggregated health snapshot for the Status page

Sources

Each source section below follows the same format — input method, filter criteria, URL normalization, and operational notes.

OpenAI News

openai_news · every 12h

Input

RSS feed → detail page HTML

content_source

rss + detail

Filter

Minimum summary and body length thresholds

URL normalization

Tracking parameters removed

Falls back to RSS summary if detail page fetch fails or body is too short.

NASA Science

nasa_news · every 24h

Input

RSS feed → detail page HTML

content_source

rss + detail

Filter

Minimum summary and body length thresholds

URL normalization

Trailing slashes removed

Falls back to RSS summary if detail page fetch fails.

arXiv Papers

arxiv · every 12h

Input

Atom feed

content_source

rss

Filter

Abstract minimum length threshold

URL normalization

Normalized to arxiv.org/abs/... format, version suffix removed

Summary is the abstract text. No detail page fetch.

Hacker News

hn · every 3h

Input

HN public API (top stories)

content_source

api

Filter

Minimum score threshold; falls back to HN item URL if no external link

URL normalization

External link preferred; HN item URL as fallback

Summary is composed from story title, points, comment count, author, and story text.

NVD CVE Feed

nvd · Configured externally

Input

NVD REST API v2

content_source

api

Filter

Time range filter; respects retry-after header on 429

URL normalization

Canonical CVE URL (nvd.nist.gov/vuln/detail/...)

Summary includes CVE description, CVSS score, KEV status, and reference links.
