Crawler

Last updated March 21, 2026

SynthLink's crawlers periodically collect data from public sources and convert it into normalized document records ready for LLM enrichment and API delivery. Each crawler handles a different external source, but all follow the same contract — fetch, filter, normalize, and upsert into the shared documents table.

Overview

The crawler layer has three core responsibilities: standardizing different external data formats into a single documents table schema, deduplicating by URL while tracking the most recent observation time, and making stored documents reliably available to the downstream enrichment pipeline and the read-only API.

Every document produced by a crawler contains the same base fields regardless of source.

Document fields produced by all crawlers
{
  "title":          string,   // extracted from source
  "url":            string,   // canonical URL after normalization
  "summary":        string,   // raw excerpt — not LLM-generated
  "content":        string,   // full body if available, else null
  "source":         string,   // logical source identifier
  "content_source": string,   // ingestion method: rss | detail | api
  "created_at":     string    // first seen (ISO 8601)
}

source identifies which logical source a document belongs to. content_source describes how the content was actually obtained — whether only the RSS summary was stored, whether the detail page was fetched and parsed, or whether the content came directly from a structured API response.
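In a TypeScript worker codebase, the base shape can be captured in one interface shared by all crawlers. The field names come from the schema above; the interface itself and the example values are a sketch, not the actual type definition:

```typescript
// Shared shape of every record a crawler upserts into the documents table.
// The union values for content_source mirror the three ingestion methods.
interface CrawledDocument {
  title: string;           // extracted from the source
  url: string;             // canonical URL after normalization (dedup key)
  summary: string;         // raw excerpt, never LLM-generated
  content: string | null;  // full body when available
  source: string;          // logical source id, e.g. "hn"
  content_source: "rss" | "detail" | "api";
  created_at: string;      // first-seen timestamp, ISO 8601
}

// Example: an API-sourced Hacker News story with no fetched body.
const doc: CrawledDocument = {
  title: "Show HN: Example",
  url: "https://news.ycombinator.com/item?id=1",
  summary: "Show HN: Example — 120 points, 45 comments",
  content: null,
  source: "hn",
  content_source: "api",
  created_at: new Date().toISOString(),
};
```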

Data collection

SynthLink uses three collection strategies depending on the source.

Feed-based collection

OpenAI, NASA, and arXiv crawlers read RSS or Atom feeds first, extracting titles, links, and summaries from each entry. OpenAI and NASA go a step further — after parsing the feed, they fetch the detail page HTML and attempt to extract longer body text from the article or main region. If the detail page yields insufficient content, the crawler falls back to the RSS summary rather than discarding the document.
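The fallback behavior can be sketched as follows. extractMainText and MIN_BODY_LENGTH are illustrative stand-ins for the crawlers' real parsing helpers and thresholds, and the injectable fetchImpl parameter exists only to make the sketch testable:

```typescript
// Illustrative minimum body length before a detail page is trusted.
const MIN_BODY_LENGTH = 300;

function extractMainText(html: string): string {
  // Crude stand-in: strip tags and collapse whitespace. A real crawler
  // would target the article or main region of the page.
  return html.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim();
}

async function resolveContent(
  detailUrl: string,
  fetchImpl: typeof fetch = fetch,
): Promise<{ content: string | null; content_source: "rss" | "detail" }> {
  try {
    const res = await fetchImpl(detailUrl);
    if (res.ok) {
      const body = extractMainText(await res.text());
      if (body.length >= MIN_BODY_LENGTH) {
        return { content: body, content_source: "detail" };
      }
    }
  } catch {
    // A detail-page failure must not discard the document.
  }
  // Fall back to the RSS summary rather than dropping the item.
  return { content: null, content_source: "rss" };
}
```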

API-based collection

GitHub, NVD, and Hacker News crawlers call public APIs directly. Because responses are already structured, no HTML parsing is needed. Instead, relevant fields are composed into a summary and content. For example, GitHub composes a summary from the repository description, star count, fork count, primary language, and topics. NVD structures the CVE description, CVSS score, KEV status, and reference links. Hacker News combines the story title, points, comment count, author, and story text.
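The GitHub case can be sketched like this. The input field names mirror the GitHub repository API; the separator and exact wording of the composed summary are assumptions:

```typescript
// Fields of interest from a GitHub repository API response.
interface RepoFields {
  description: string | null;
  stargazers_count: number;
  forks_count: number;
  language: string | null;
  topics: string[];
}

// Compose a human-readable summary from structured fields — no HTML
// parsing needed, since the API response is already structured.
function composeGithubSummary(repo: RepoFields): string {
  const parts = [
    repo.description ?? "No description",
    `★ ${repo.stargazers_count}`,
    `forks: ${repo.forks_count}`,
  ];
  if (repo.language) parts.push(`language: ${repo.language}`);
  if (repo.topics.length > 0) parts.push(`topics: ${repo.topics.join(", ")}`);
  return parts.join(" · ");
}
```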

Hybrid collection

OpenAI and NASA use a hybrid approach — feed parsing for item discovery and detail page fetching for content enrichment. This separation means new items are detected quickly via the feed, and detail page failures do not block the entire crawl cycle.

Normalization and filtering

Crawlers do not store raw content as-is. Two steps run before every upsert.

URL normalization

Each crawler produces a canonical URL used as the deduplication key. Common transformations include removing tracking parameters (utm_source etc.), stripping trailing slashes, and normalizing domain variants. arXiv normalizes export.arxiv.org URLs and strips version suffixes to produce a stable arxiv.org/abs/... form. Hacker News substitutes the HN item URL when no external link is present.
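A minimal normalizeUrl covering the transformations above might look like this (the utm_ prefix rule is the only tracking-parameter rule shown; real crawlers may strip more):

```typescript
// Produce the canonical URL used as the deduplication key.
function normalizeUrl(raw: string): string {
  const u = new URL(raw);
  // Drop common tracking parameters.
  for (const key of [...u.searchParams.keys()]) {
    if (key.startsWith("utm_")) u.searchParams.delete(key);
  }
  // Normalize arXiv domain variants and strip version suffixes.
  if (u.hostname === "export.arxiv.org") u.hostname = "arxiv.org";
  if (u.hostname === "arxiv.org") {
    u.pathname = u.pathname.replace(/v\d+$/, "");
  }
  // Strip trailing slash, but keep the root path.
  if (u.pathname.length > 1 && u.pathname.endsWith("/")) {
    u.pathname = u.pathname.slice(0, -1);
  }
  return u.toString();
}
```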

Quality filtering

Each crawler applies minimum quality thresholds before writing a record. OpenAI and NASA require a minimum body length and summary length — documents that fail both checks are discarded. arXiv rejects abstracts that are too short. GitHub, NVD, and Hacker News apply filters based on star count, CVSS score, story score, and time range respectively. The system is designed to expose only information worth surfacing through the API, not everything that was fetched.
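A sketch of the length-based gate used by the OpenAI and NASA crawlers, with illustrative threshold values:

```typescript
// Illustrative minimum-quality thresholds; each real crawler defines its own.
const THRESHOLDS = {
  minSummaryLength: 80,
  minBodyLength: 300,
};

// OpenAI/NASA-style gate: keep the item if either the body or the
// summary clears its threshold; discard only when both checks fail.
function passesLengthFilter(summary: string, body: string | null): boolean {
  if (body !== null && body.length >= THRESHOLDS.minBodyLength) return true;
  return summary.length >= THRESHOLDS.minSummaryLength;
}
```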

Data flow

After a crawler writes to the documents table, the data moves through two more stages before reaching external consumers.

Full pipeline
external source
  → crawler (upsert into documents)
  → insight-worker (LLM enrichment → upsert into insights)
  → public API (/api/v1/documents, /api/v1/insights, /api/v1/combined)

The insight-worker picks up documents that do not yet have an insight, or insight records in a retryable failed state. It uses content as the LLM input when available, falling back to summary. The model output is parsed into llm_summary, keywords, tags, and category, then written to the insights table.

Upserts use url as the conflict key. If a document with the same URL already exists, the existing record is preserved. created_at is never modified after initial insertion.
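Assuming a SQLite-compatible store, the upsert can be sketched as a single statement; ON CONFLICT(url) DO NOTHING leaves the existing row, and therefore its created_at, untouched:

```typescript
// Sketch of the upsert with url as the conflict key. Because the
// conflict action is DO NOTHING, created_at is never rewritten after
// the first insert.
const UPSERT_SQL = `
  INSERT INTO documents
    (title, url, summary, content, source, content_source, created_at)
  VALUES (?, ?, ?, ?, ?, ?, ?)
  ON CONFLICT(url) DO NOTHING
`;
```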

Automation

Each crawler implements both scheduled() and fetch() handlers, supporting scheduled runs and HTTP-triggered manual runs without any code changes.
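A minimal dual-entry worker module might look like this; runCrawl and the Env binding are placeholders, not the real implementation:

```typescript
interface Env {
  DB: unknown; // e.g. a database binding
}

async function runCrawl(_env: Env): Promise<number> {
  // fetch → filter → normalize → upsert; returns the processed count.
  return 0; // placeholder for the real crawl loop
}

const worker = {
  // Invoked by the platform's cron trigger.
  async scheduled(_event: unknown, env: Env): Promise<void> {
    await runCrawl(env);
  },
  // Invoked by an HTTP request, enabling manual runs without code changes.
  async fetch(_req: Request, env: Env): Promise<Response> {
    const processed = await runCrawl(env);
    return new Response(JSON.stringify({ processed }), {
      headers: { "content-type": "application/json" },
    });
  },
};

export default worker;
```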

Schedules are managed at two levels. Some crawlers define a Cloudflare cron trigger in wrangler.toml. Others are registered with an external scheduler and tracked in the crawlers table alongside metadata such as trigger_url, cron_schedule, and cron_enabled. In practice, the active schedule for any given crawler may differ from what is defined in the repository — always refer to the Status page for the latest run history.

Source           Interval               Schedule method
openai_news      every 12h              External scheduler
nasa_news        every 24h              External scheduler
github_trending  every 6h               Wrangler cron
arxiv            every 12h              Wrangler cron
hn               every 3h               Wrangler cron
nvd              Configured externally  External scheduler

Failure handling

Crawlers are designed with the expectation that external requests will fail. Each worker retries up to three times with exponential backoff. Transient errors (429, 5xx) are retried; NVD additionally respects the Retry-After header. OpenAI and NASA continue with the RSS summary if detail page fetching fails, rather than discarding the item.
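The retry policy can be sketched as follows; the base delay and the injectable fetchImpl are illustrative:

```typescript
// Exponential backoff: attempt 0 → 1s, 1 → 2s, 2 → 4s (base is illustrative).
function backoffMs(attempt: number, baseMs = 1000): number {
  return baseMs * 2 ** attempt;
}

// Only transient statuses are worth retrying.
function isRetryable(status: number): boolean {
  return status === 429 || status >= 500;
}

async function fetchWithRetry(
  url: string,
  maxRetries = 3,
  fetchImpl: typeof fetch = fetch,
): Promise<Response> {
  let lastErr: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      const res = await fetchImpl(url);
      if (res.ok || !isRetryable(res.status)) return res;
      // Prefer the server's Retry-After (seconds) when present, as for NVD.
      const retryAfter = Number(res.headers.get("retry-after"));
      const delay = retryAfter > 0 ? retryAfter * 1000 : backoffMs(attempt);
      await new Promise((r) => setTimeout(r, delay));
    } catch (err) {
      lastErr = err;
      await new Promise((r) => setTimeout(r, backoffMs(attempt)));
    }
  }
  throw lastErr ?? new Error(`retries exhausted for ${url}`);
}
```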

The insight-worker applies a separate retry policy for LLM enrichment. Failed insight records with retry_count < 3 are reprocessed with delays of 1, 5, and 15 minutes between attempts. Errors classified as non-retryable — such as missing_openrouter_api_key, empty_source_text, or missing_document — are marked permanently failed without further retries.
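A sketch of the enrichment retry schedule; the non-retryable codes are taken from the text above, while any other error-code names are hypothetical:

```typescript
// Fixed delays between enrichment attempts, in minutes.
const RETRY_DELAYS_MIN = [1, 5, 15];

// Errors that can never succeed on retry.
const NON_RETRYABLE = new Set([
  "missing_openrouter_api_key",
  "empty_source_text",
  "missing_document",
]);

// Returns the delay before the next attempt, or null to mark the
// insight permanently failed (non-retryable error or retry_count >= 3).
function nextRetryDelayMin(retryCount: number, errorCode: string): number | null {
  if (NON_RETRYABLE.has(errorCode)) return null;
  if (retryCount >= RETRY_DELAYS_MIN.length) return null;
  return RETRY_DELAYS_MIN[retryCount];
}
```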

Note: A failed crawl cycle does not affect existing documents. Records already in the database remain accessible. The main observable effect is that no new documents from that source will appear until the next successful run.

Observability

Every crawler writes its run result to the worker_runs table — worker name, success flag, number of processed records, and error message if applicable. This is the data source for the crawler history shown on the Status page.

The insight-worker additionally writes to integrity_checks after each run — recording the count of orphan insights, duplicate URLs, and completed insights with an empty llm_summary. These values are also visible on the Status page.
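The three checks can be sketched as SQL counts; the column names (document_id, status, llm_summary) follow the descriptions above, but the exact schema is an assumption:

```typescript
// Integrity queries run by the insight-worker after each cycle,
// assuming SQLite-style SQL over the documents and insights tables.
const INTEGRITY_QUERIES = {
  // Insights whose document no longer exists.
  orphan_insights: `
    SELECT COUNT(*) FROM insights i
    LEFT JOIN documents d ON d.id = i.document_id
    WHERE d.id IS NULL`,
  // URLs appearing more than once despite the upsert key.
  duplicate_urls: `
    SELECT COUNT(*) FROM (
      SELECT url FROM documents GROUP BY url HAVING COUNT(*) > 1)`,
  // Completed insights with no generated summary.
  empty_summaries: `
    SELECT COUNT(*) FROM insights
    WHERE status = 'completed'
      AND (llm_summary IS NULL OR llm_summary = '')`,
};
```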

Table             Written by                     Contents
worker_runs       All crawlers + insight-worker  Run result, processed count, error
integrity_checks  insight-worker                 Orphan count, duplicate count, null summary count
status_snapshots  Internal status API            Aggregated health snapshot for the Status page

Sources

Each source section below follows the same format — input method, filter criteria, URL normalization, and operational notes.

OpenAI News

openai_news · every 12h

Input

RSS feed → detail page HTML

content_source

rss + detail

Filter

Minimum summary and body length thresholds

URL normalization

Tracking parameters removed

Falls back to RSS summary if detail page fetch fails or body is too short.

NASA Science

nasa_news · every 24h

Input

RSS feed → detail page HTML

content_source

rss + detail

Filter

Minimum summary and body length thresholds

URL normalization

Trailing slashes removed

Falls back to RSS summary if detail page fetch fails.

arXiv Papers

arxiv · every 12h

Input

Atom feed

content_source

rss

Filter

Abstract minimum length threshold

URL normalization

Normalized to arxiv.org/abs/... format, version suffix removed

Summary is the abstract text. No detail page fetch.

Hacker News

hn · every 3h

Input

HN public API (top stories)

content_source

api

Filter

Minimum score threshold; falls back to HN item URL if no external link

URL normalization

External link preferred; HN item URL as fallback

Summary is composed from story title, points, comment count, author, and story text.

NVD CVE Feed

nvd · Configured externally

Input

NVD REST API v2

content_source

api

Filter

Time range filter; respects retry-after header on 429

URL normalization

Canonical CVE URL (nvd.nist.gov/vuln/detail/...)

Summary includes CVE description, CVSS score, KEV status, and reference links.
