MASTASCUSA HOLDINGS
Live system Last refresh · 44d ago

Mastascusa Holdings · Case study

Nine websites in. One map out.
Every six hours, on the dot.

This is the live map reverse-engineered: every box, every decision, every line of math. Plain-English up top, the engineering chops underneath. Same pipeline pointed at your data is a thing I sell.

Items right now: 4,232
Sources: 9
Cadence: 6h
Window: 2026-04-29 → 2026-05-01

Numbers above pulled from kb-preview.json at build time — refreshed every six hours by the pipeline you're about to read about.

Plain English

A baby could follow this:

  1. Computer reads nine AI websites every six hours.

  2. It remembers what it's already seen so it never shows the same thing twice.

  3. It turns each headline into a dot on a map. Headlines about the same idea sit next to each other.

  4. It saves the map. The website updates. You're looking at it.

Engineering version

Five stages. Each one is a deliberate choice.

01

Ingest — Read nine websites.

feedparser pulls RSS from 8 publishers. Anthropic has no feed, so httpx walks the sitemap and regex-extracts the title and meta description from each URL under /news, /research, and /engineering. Hacker News comes in through the Algolia search API, filtered to stories with points ≥ 100 that also match a keyword regex.
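A minimal sketch of the two non-RSS paths, assuming the page exposes a standard `<title>` tag and a `name="description"` meta tag, and that each Hacker News hit is a dict with `title` and `points` fields. The regexes, the keyword list, and the function names are illustrative guesses, not the pipeline's actual patterns.

```python
import re

# Hypothetical patterns; the pipeline's real regexes are not shown in this write-up.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
# Assumes the attribute order name=... then content=..., with matching quotes.
META_DESC_RE = re.compile(
    r'<meta\s+name=["\']description["\']\s+content=(["\'])(.*?)\1',
    re.IGNORECASE | re.DOTALL,
)
# Hypothetical keyword list for the Hacker News gate.
KEYWORD_RE = re.compile(r"\b(ai|llm|gpt|claude|transformer)\b", re.IGNORECASE)
MIN_POINTS = 100


def extract_title_and_description(html):
    """Pull the <title> text and the description meta tag out of raw HTML."""
    title = TITLE_RE.search(html)
    desc = META_DESC_RE.search(html)
    return (
        title.group(1).strip() if title else "",
        desc.group(2).strip() if desc else "",
    )


def keep_hn_hit(hit):
    """Keep a Hacker News hit only if it clears the score and keyword gates."""
    points = hit.get("points") or 0
    title = hit.get("title") or ""
    return points >= MIN_POINTS and bool(KEYWORD_RE.search(title))
```

Regex extraction like this is brittle against unusual markup, which is acceptable when the target is a single known site; a general crawler would want a real HTML parser.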

02

Fingerprint — Give every item a unique ID.

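SHA-256 of the URL. Collisions are effectively impossible at this volume: about 10⁻⁷⁷ for any single pair of URLs, and still only around 10⁻⁷⁰ across every pair in a corpus of ~4,000 items. The fingerprint is the SQLite primary key, so the pipeline is idempotent: run it 100 times in a row and it stores exactly one row per item.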
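As a rough sketch, the fingerprint can be computed with the standard library alone. This version hashes the URL string exactly as given; whether the real pipeline normalizes URLs first (trailing slashes, tracking parameters) is not stated, so treat that as an open assumption. The `collision_bound` helper is hypothetical and only illustrates the birthday-bound arithmetic.

```python
import hashlib


def fingerprint(url):
    """Return the SHA-256 hex digest of a URL, usable as a primary key."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


def collision_bound(n_items):
    """Birthday upper bound on any collision among n_items 256-bit hashes."""
    pairs = n_items * (n_items - 1) // 2
    return pairs / 2**256
```

The same URL always yields the same 64-character key, which is what lets the database reject repeats without comparing titles or content.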

03

Embed — Turn each title into a vector of numbers.

TF-IDF on title + summary (1–2 gram, 20k features, sublinear TF), then TruncatedSVD down to 64 dimensions. Cached to .npz keyed by fingerprint — re-runs only embed the deltas, so the 6-hour cadence stays cheap.
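A sketch of what this stage could look like in scikit-learn, using the settings named above. The function name `embed` and the fixed seed are assumptions, and the .npz cache is left out to keep the example short.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer


def embed(texts, n_components=64, seed=0):
    """TF-IDF over 1- and 2-grams, then TruncatedSVD to a dense matrix."""
    vectorizer = TfidfVectorizer(
        ngram_range=(1, 2),    # unigrams and bigrams
        max_features=20_000,   # cap the vocabulary at 20k terms
        sublinear_tf=True,     # use 1 + log(tf) instead of raw counts
    )
    tfidf = vectorizer.fit_transform(texts)
    # Cap the component count so tiny demo corpora still work.
    k = max(1, min(n_components, tfidf.shape[0] - 1, tfidf.shape[1] - 1))
    svd = TruncatedSVD(n_components=k, random_state=seed)
    return svd.fit_transform(tfidf)
```

On a real corpus of a few thousand items, `embed(texts)` would return one 64-dimensional row per item; on a handful of strings it quietly falls back to fewer dimensions.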

04

Project — Squish those vectors into a 2-D map.

Mean-center, take the top-2 principal components via SVD, scale by the 99th-percentile absolute value, clip to [-1, 1]. The map you see is literally the 2-D projection that preserves the most variance in the corpus.
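The projection steps above can be sketched in a few lines of NumPy. One assumption to flag: this version uses a single scale factor for both axes, which keeps the aspect ratio of the cloud; the original may scale each axis separately. The name `project_2d` is illustrative.

```python
import numpy as np


def project_2d(vectors):
    """Project row vectors onto their top-2 principal components in [-1, 1]."""
    X = np.asarray(vectors, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:2].T
    # Scale by the 99th percentile so a few outliers don't shrink the map.
    scale = np.percentile(np.abs(coords), 99)
    if scale > 0:
        coords = coords / scale
    # Anything beyond the 99th percentile is pinned to the border.
    return np.clip(coords, -1.0, 1.0)
```

Scaling by a percentile rather than the maximum means the top 1% of coordinate values land on the edge of the map instead of dictating its zoom level.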

05

Publish — Save it. Push to GitHub. Site rebuilds.

Two outputs: a chronological markdown archive grouped by ISO week, and a JSON snapshot embedded in this site. The publish step is a git commit on the website repo — Vercel detects it and redeploys in ~30 seconds. The page you're reading is the proof.
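For the markdown archive, grouping by ISO week needs nothing beyond the standard library. This is a sketch under assumptions: the tuple layout, the heading format, and the name `render_archive` are all hypothetical, since the real archive layout is not shown here.

```python
from collections import defaultdict
from datetime import datetime


def render_archive(items):
    """Render (published, title, url) tuples as markdown, newest week first."""
    by_week = defaultdict(list)
    for published, title, url in items:
        iso = published.isocalendar()  # (ISO year, ISO week, ISO weekday)
        by_week[(iso[0], iso[1])].append((published, title, url))

    lines = []
    for year, week in sorted(by_week, reverse=True):
        lines.append(f"## {year}-W{week:02d}")
        for published, title, url in sorted(by_week[(year, week)], reverse=True):
            lines.append(f"- {published:%Y-%m-%d} [{title}]({url})")
        lines.append("")
    return "\n".join(lines)
```

Keying on the ISO year rather than the calendar year matters around New Year, when the last days of December can belong to week 1 of the following year.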

Receipts

The dedup, in nine lines.

The whole pipeline is ~290 lines of Python. This is the load-bearing chunk — the reason re-running the crawler doesn't corrupt the database.

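# Idempotent ingest. Every item's primary key is SHA-256(url).
# Run this 1× or 1000× and the table is identical.
from datetime import datetime, timezone

def store_new(conn, items):
    """Insert unseen items and return them; known fingerprints are skipped."""
    now = datetime.now(timezone.utc).isoformat()  # insertion timestamp (UTC assumed)
    new = []
    for it in items:
        # Skip anything whose fingerprint is already stored.
        if conn.execute(
            "SELECT 1 FROM items WHERE fingerprint=?",
            (it.fingerprint,),
        ).fetchone():
            continue
        conn.execute(
            "INSERT INTO items VALUES (?,?,?,?,?,?,?)",
            (it.fingerprint, it.source, it.title, it.url,
             it.published.isoformat(), it.summary, now),
        )
        new.append(it)
    conn.commit()
    return new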
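A self-contained way to see the idempotence for yourself, as a sketch rather than the pipeline's own code: the schema below is a guess (seven columns to match the INSERT above, with hypothetical column names), and it leans on the primary key plus `INSERT OR IGNORE` instead of the explicit SELECT.

```python
import sqlite3

# Hypothetical schema: seven columns to match the INSERT; names are guesses.
SCHEMA = """
CREATE TABLE IF NOT EXISTS items (
    fingerprint TEXT PRIMARY KEY,
    source TEXT, title TEXT, url TEXT,
    published TEXT, summary TEXT, first_seen TEXT
)
"""


def store_rows(conn, rows):
    """Insert 7-tuples, letting the primary key reject known fingerprints."""
    inserted = 0
    for row in rows:
        cur = conn.execute(
            "INSERT OR IGNORE INTO items VALUES (?,?,?,?,?,?,?)", row
        )
        inserted += cur.rowcount  # 1 if stored, 0 if ignored as a duplicate
    conn.commit()
    return inserted


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
row = ("abc123", "example", "A title", "https://example.com/a",
       "2026-04-29T00:00:00", "A summary", "2026-04-29T06:00:00")
first = store_rows(conn, [row])   # new row is stored
second = store_rows(conn, [row])  # same fingerprint is ignored
total = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
```

Either approach gives the same guarantee; the explicit SELECT makes it easy to return the list of new items, while `INSERT OR IGNORE` pushes the check into the database.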

Decisions I'll defend on a whiteboard

Every choice has a "why."

Why SHA-256, not a softer dedup?
Title-based dedup is fragile (publishers retitle posts mid-flight). URL-fingerprint is bit-exact and cheap. Boring decision, but boring is correct here.
Why TF-IDF + SVD instead of a giant transformer?
For 4,000 short documents the variance is in the vocabulary, not the deep semantics. TF-IDF + SVD takes 800 ms cold; sentence-transformers takes a minute and adds nothing the eye can see in 2-D. Pick the smallest model that solves the problem.
Why SQLite, not a real database?
One file. No server. No connection pool. Backups are `cp`. For a single-writer pipeline this is the right answer. Switching to Postgres would buy zero capability and cost one ops headache.
Why publish via git push?
It's already the source of truth for the site. Routing the data through it means the website state is reproducible from a single commit hash — no separate "freshness pipeline" to monitor.

Stack

Ingestion

Python · feedparser · httpx · regex

Storage

SQLite (one file, single-writer)

Embeddings

scikit-learn · TF-IDF · TruncatedSVD · .npz cache

Projection

NumPy SVD, 2-D principal components

Scheduling

Windows Task Scheduler · 6-hour interval

Publish

git push → Vercel autodeploy → Astro static render

DIY

Want to run it yourself?

The whole thing is one Python file, one SQLite DB, and a batch script. No Docker. No cloud account. Five minutes from install to first crawl.

# What it looks like end-to-end:
python -m venv .venv && .venv\Scripts\activate
pip install -r requirements.txt
python crawler.py
# → SQLite + markdown archive populated. Done.

In crawler.py, point ARXIV_FEEDS and BLOG_FEEDS at any RSS feeds and the same pipeline reads whatever you give it. The dedup, embeddings, and publish loop are all source-agnostic.

Email me for the source →

For your organization

Want this on your data?

Research literature. Contract repositories. Customer call transcripts. Regulatory filings. Internal wikis. The architecture above is source-agnostic — I swap the adapters, you get the map.