How We Keep a Benchmark Site Honest: Collecting and Monitoring Web Data Without Writing Scrapers

The data pipeline behind BenchLM — how we extract pricing tables, model specs, and competitor leaderboards, and monitor them for changes, using no-code scraping (Browse AI) instead of a fleet of brittle custom scrapers.

Glevd · Published June 19, 2026 · 13 min read

A benchmark site looks like a set of leaderboards. Underneath, it is a data pipeline. Prices change, new models ship most weeks, providers quietly edit a context-window number in their docs, and a competitor adds a row to their own table. The leaderboard is the easy part. Keeping it true — every number current, every source traceable — is the actual job.

This post documents how we run that pipeline on BenchLM, and the part that surprises people: we do not maintain a fleet of custom scrapers. Most of the collection and monitoring runs through a no-code layer, and a small amount of our own code does the parts that are genuinely specific to us. If you want the short version of the tooling, we use Browse AI for no-code extraction and change-monitoring — it turns a page into structured rows or an API without us writing or babysitting selectors. (Partner link — it never affects our rankings.) The rest of this post is the workflow around it, which matters far more than any single tool.

This is the practitioner version, not the thought-leadership version. Everything here is something we actually run.

First, the uncomfortable truth: web data rots

The default mental model for web data is "fetch it once, store it, move on." That model is wrong for anything that has to stay accurate. Three things happen to every page you depend on:

  1. The numbers change. Pricing moves. A model's advertised context window gets revised. A "preview" becomes "GA" and the rate limits shift. If your copy of that number is three weeks old, you are now publishing something false.
  2. The page changes shape. A redesign moves the price from a table into a pricing card. A class name you depended on disappears. Your scraper does not error — it silently returns the wrong cell, or nothing, and you do not notice until a reader emails you.
  3. The access changes. A page that was open last month is behind a bot wall this month, or rate-limits you after the third request, or renders entirely client-side now.

Each of these is survivable on its own. The problem is the maintenance tax when you have dozens of sources and they each break on their own schedule. The bottleneck in a data site is almost never extraction — it is the ongoing cost of keeping extraction working. That is the lens for everything below.

Two jobs that look like one

The single most useful distinction we make internally is between extraction and monitoring. They feel like the same task and they are not.

  • Extraction is "give me the structured contents of this page now." A pricing table becomes rows of {provider, model, input_price, output_price, context}. A spec sheet becomes fields. You run it on demand or on a schedule, and you get a dataset back.
  • Monitoring is "watch this page and tell me when a value I care about changes." You do not want the data on a timer — you want an event the moment the number moves, so you can react before your readers notice the stale figure.

Conflating the two leads to the classic homegrown failure: a cron job that re-scrapes everything every hour, produces a giant diff nobody reads, and either spams you with noise or lulls you into ignoring it. The fix is to use each for what it is good at. Extraction builds and refreshes the dataset. Monitoring decides when a refresh is even worth running.

A quick decision guide we use:

  • Need the full contents of a page, structured, on a predictable cadence? Extract on a schedule.
  • Need to know the instant one specific value changes, with everything else ignored? Monitor that value.
  • Have a page that changes rarely but matters enormously when it does (a flagship pricing page)? Monitor for change, then trigger an extract — the best of both, and the cheapest.
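
The third pattern, monitor first and extract only on change, can be sketched in a few lines. This is a minimal illustration rather than our production code: `Watch`, `read_value`, and `extract_page` are hypothetical names, and the two callables stand in for whatever your monitoring and extraction tool actually exposes.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Watch:
    """One monitored value on one page (illustrative structure)."""
    url: str
    field: str
    last_seen: Optional[str] = None


def check_and_maybe_extract(
    watch: Watch,
    read_value: Callable[[str, str], str],
    extract_page: Callable[[str], list],
) -> Optional[list]:
    """Run the cheap check; run the full extract only when the value moved."""
    current = read_value(watch.url, watch.field)
    if current == watch.last_seen:
        return None  # nothing changed, so skip the expensive extract
    watch.last_seen = current
    return extract_page(watch.url)
```

The first call always extracts, because `last_seen` starts empty. After that, the full extract runs only when the watched value differs from the last one recorded.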

Our pipeline, end to end

Here is the shape of the whole thing, source to site:

source pages → no-code robots → raw structured rows → validation → our data store → site build

Each arrow is a place something can go wrong, so each one earns its own attention.

Source pages. We keep an explicit registry of every page we depend on, with the URL, what we pull from it, how often, and who owns it if it breaks. This list is boring and it is the most valuable artifact in the pipeline. You cannot monitor what you have not written down.
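
A registry like this does not need to be more than a list of records plus a lint pass. The sketch below assumes a Python representation; the `Source` fields mirror the four things named above (URL, what we pull, how often, who owns it), while the example URLs, owner names, and the `registry_problems` helper are made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Source:
    url: str             # the page we depend on
    fields: tuple        # what we pull from it
    cadence: timedelta   # how often we expect fresh data
    owner: str           # who gets pinged when it breaks


# Hypothetical entries; real registries will be longer and duller.
REGISTRY = [
    Source("https://example.com/pricing",
           ("model", "input_price", "output_price", "context"),
           timedelta(days=7), "dana"),
    Source("https://example.com/docs/limits",
           ("model", "rate_limit"),
           timedelta(days=30), "sam"),
]


def registry_problems(registry):
    """List duplicate URLs, missing owners, and empty field lists."""
    problems = []
    counts = Counter(s.url for s in registry)
    for url, n in counts.items():
        if n > 1:
            problems.append(f"duplicate url: {url}")
    for s in registry:
        if not s.owner.strip():
            problems.append(f"no owner: {s.url}")
        if not s.fields:
            problems.append(f"no fields: {s.url}")
    return problems
```

Running `registry_problems(REGISTRY)` in CI is one cheap way to keep the list honest as it grows.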

No-code robots. This is where the work that used to be custom scrapers now lives. Instead of writing selectors, we point a robot at a page and show it the fields by clicking them — this price, that model name, this context number. The tool figures out the structure, handles pagination and "load more" lists, and follows through to detail pages when a row links out. Crucially, when a page redesign happens, fixing the robot is a re-train by clicking, not a debugging session in a codebase. That is the maintenance tax collapsing from hours to minutes.

We schedule these to run on a cadence that matches each source — and, for the high-stakes pages, we let a monitor decide when to run them at all. The output is a clean table or an API endpoint we can pull from. (Browse AI is the tool we use for this layer; the pattern is what matters, not the brand. Partner link — it never affects our rankings.)

Validation. Nothing from a robot touches the live site without passing a schema check first. More on this below, because it is the step everyone skips and the step that separates a toy from production.

Our data store and site build. Once data is validated and normalized into our own schema, it flows into the static data the site builds from. From the reader's perspective this is invisible — they just see a number that happens to be current and a "last verified" date that happens to be real.

The honest scope note: the no-code layer does collection and monitoring. Our own code still does the parts that are genuinely ours — the benchmark math, deduplication, matching a scraped "GPT-style" label to the canonical model identity in our database, and the editorial judgment about what is even worth tracking. Outsourcing the brittle part lets the small amount of code we do own be the part with actual leverage.
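
To make the label-matching part concrete, here is one simple way it could work: normalize the scraped label, then look it up in an alias table, and hand anything unknown to a person. This is an assumption-laden sketch, not a description of our database. The `ALIASES` table, the model names in it, and both function names are hypothetical.

```python
import re
from typing import Optional

# Hypothetical alias table: normalized scraped label -> canonical model id.
ALIASES = {
    "example model 1 large": "example-model-1-large",
    "examplemodel 1 l": "example-model-1-large",
    "example model 1 mini": "example-model-1-mini",
}


def normalize_label(label: str) -> str:
    """Lowercase, turn runs of punctuation or whitespace into one space."""
    spaced = re.sub(r"[^a-z0-9]+", " ", label.lower())
    return spaced.strip()


def canonical_id(label: str) -> Optional[str]:
    """Return the canonical id, or None so a human can review the label."""
    return ALIASES.get(normalize_label(label))
```

Returning `None` instead of guessing is deliberate: an unmatched label is a review task, not something to fuzzy-match straight into a published row.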

Using this to build data for RAG and fine-tuning

Everything above is framed around keeping a site fresh, but the exact same pipeline is how you build a dataset to feed a model, which is probably why many of you are here.

The instinct for RAG or fine-tuning is to grab raw HTML and figure it out later. Resist it. Raw HTML is mostly navigation, ads, cookie banners, and markup noise; embedding that wastes tokens and pollutes retrieval. Structured extraction is a cleaning step disguised as a collection step. When a robot returns {question, answer, source, date} instead of a 200KB HTML blob, you have already done most of the work that makes retrieval good.

The workflow we would use for a knowledge base:

  1. Collect with scheduled robots, returning structured fields rather than pages.
  2. Dedupe aggressively — the web is full of near-identical copies, and duplicates quietly bias both retrieval and fine-tuning.
  3. Structure and tag each record with its source and a timestamp so you can filter stale or low-trust entries later.
  4. Chunk along the structure you already have (one Q&A, one spec, one row) instead of blindly splitting on token count.
  5. Embed the clean, chunked records.
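
Steps 2 through 4 can be sketched without any libraries. The example assumes records shaped like the {question, answer, source, date} rows mentioned above; `fingerprint` and `build_chunks` are illustrative names. Note the limitation: hashing normalized text only catches copies that differ in case or whitespace. True near-duplicate detection (shingling, MinHash, embedding similarity) is a separate technique and is not shown here.

```python
import hashlib
import re
from datetime import datetime, timezone


def fingerprint(text: str) -> str:
    """Hash of the text with case and whitespace differences removed."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def build_chunks(records, collected_at: datetime):
    """Dedupe, tag with source and timestamp, and emit one chunk per Q&A."""
    seen = set()
    chunks = []
    for rec in records:
        body = f"Q: {rec['question']}\nA: {rec['answer']}"
        key = fingerprint(body)
        if key in seen:
            continue  # duplicate of a record we already kept
        seen.add(key)
        chunks.append({
            "id": key[:16],
            "text": body,
            "source": rec["source"],
            "collected_at": collected_at.isoformat(),
        })
    return chunks
```

Each chunk is one structural unit (a single Q&A), so there is no token-count splitting to tune, and the `source` and `collected_at` tags travel with the text into whatever embedding step comes next.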

A licensing and ethics aside that is not optional: scrape what you are allowed to. Respect robots.txt and terms of service, prefer public and first-party data, do not collect personal data you have no basis to hold, and do not hammer a server because a tool makes it easy to. For a project whose entire value is being trustworthy, the ethical bar sits above the legal one. When in doubt, attribute and link rather than silently absorb.

Monitoring as an early-warning system

The monitoring half of the pipeline is what turns a data site from reactive to proactive. A few of the watches we actually run:

  • Provider pricing pages. A price change is the highest-value event we can catch, because being first to reflect it accurately is exactly the kind of freshness that gets a page cited rather than skipped.
  • Competitor and upstream leaderboards. When a new model row appears somewhere, it is a signal we should be looking at that model too.
  • Spec and limits pages. Context windows, rate limits, deprecation notices — the quiet edits that never get a launch tweet but change what the right answer is.

The mechanics matter as much as the targets. A change event flows: alert → human verifies → row gets updated → site rebuilds. We deliberately keep a human in that loop for anything that gets published. The monitor's job is to catch the change fast; a person's job is to confirm it is real and not a transient A/B test or a typo on the source's side.

The failure mode to design against is alert fatigue. Two rules keep it sane:

  • Watch values, not whole pages. "The page changed" fires on every cookie-banner tweak. "The output price changed" fires when something you care about moved.
  • Set thresholds. Not every diff deserves a ping. A one-cent rounding change can wait for the weekly review; a 40% price drop should wake someone up.
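
The threshold rule is easy to express as a small routing function. The 10% cut-off below is an assumed value chosen for illustration, not a number we are recommending, and `classify_price_change` is a hypothetical name.

```python
def classify_price_change(old: float, new: float,
                          page_threshold: float = 0.10) -> str:
    """Route a price change to 'none', 'weekly_review', or 'page'."""
    if old <= 0:
        raise ValueError("old price must be positive")
    if new == old:
        return "none"
    relative = abs(new - old) / old
    if relative >= page_threshold:
        return "page"           # large move: alert a person now
    return "weekly_review"      # small move: batch it for the weekly pass
```

With these defaults, a one-cent change on a 3.00 price lands in the weekly review, while a drop from 10.00 to 6.00 (40%) pages someone.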

Done well, monitoring means the embarrassing email — "your price for X is wrong" — simply stops arriving, because you already knew and already fixed it.

Pitfalls and the guardrails that fix them

Things break. Here is where, and what we do about it.

Anti-bot walls. Some pages genuinely do not want to be read by machines, and that is their right. When a source fights extraction, the answer is not an arms race — it is to find a first-party source (an official API, a docs export, a partner feed) or to accept a manual update for that one field. A pipeline that depends on defeating bot protection is a pipeline that will break the week you are on vacation.

Schema drift. This is the silent killer. The robot keeps returning data, but the shape shifted — a price column now includes the currency symbol, a number is suddenly a string, a field is occasionally null. The guardrail is a validation layer between extraction and your store that enforces types, ranges, and required fields, and quarantines anything that fails instead of publishing it. A price that parses to zero or to five figures is rejected, not displayed.
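
A minimal version of that validation layer might look like the following. It assumes rows shaped like the {provider, model, input_price, output_price, context} example earlier in this post; the function names and the exact price ceiling are illustrative choices that encode the "zero or five figures is rejected" rule.

```python
from decimal import Decimal, InvalidOperation

REQUIRED = ("provider", "model", "input_price", "output_price", "context")
PRICE_MAX = Decimal("10000")  # assumed ceiling: five-figure prices are rejected


def parse_price(raw) -> Decimal:
    """Turn '$3.00' or 3 into a Decimal, or raise ValueError."""
    cleaned = str(raw).replace("$", "").replace(",", "").strip()
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        raise ValueError(f"price is not a number: {raw!r}")
    if not value.is_finite() or not (Decimal("0") < value < PRICE_MAX):
        raise ValueError(f"price out of range: {raw!r}")
    return value


def parse_context(raw) -> int:
    """Turn '128,000' or 128000 into a positive int, or raise ValueError."""
    value = int(str(raw).replace(",", "").strip())
    if value <= 0:
        raise ValueError(f"context out of range: {raw!r}")
    return value


def validate_rows(rows):
    """Split rows into (accepted, quarantined); nothing invalid is published."""
    accepted, quarantined = [], []
    for row in rows:
        try:
            missing = [k for k in REQUIRED if row.get(k) in (None, "")]
            if missing:
                raise ValueError(f"missing fields: {missing}")
            accepted.append({
                "provider": str(row["provider"]).strip(),
                "model": str(row["model"]).strip(),
                "input_price": parse_price(row["input_price"]),
                "output_price": parse_price(row["output_price"]),
                "context": parse_context(row["context"]),
            })
        except ValueError as err:
            quarantined.append({"row": row, "reason": str(err)})
    return accepted, quarantined
```

Quarantined rows keep the original payload and a reason string, so whoever reviews them can tell a source-side typo from a page redesign at a glance.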

Monitoring your monitors. A robot that silently stops running looks identical to a page that simply stopped changing. We treat "no data from this source in longer than its expected cadence" as its own alert. The absence of news is not good news in a data pipeline.
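
The staleness check is a comparison between each source's expected cadence and the last time it delivered anything. In this sketch the `grace` multiplier (1.5x the cadence before alerting) is an assumed default, and `stale_sources` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def stale_sources(last_seen: dict, cadence: dict, now: datetime,
                  grace: float = 1.5):
    """Return source keys that are overdue or have never reported."""
    stale = []
    for key, expected in cadence.items():
        seen = last_seen.get(key)
        if seen is None or now - seen > expected * grace:
            stale.append(key)
    return sorted(stale)
```

A source that has never reported counts as stale too, which catches robots that were registered but never actually scheduled.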

Cost discipline. Every run and every monitor costs something: money, rate-limit budget, goodwill with the source. Match frequency to the real rate of change. Polling a monthly-updated page every hour is roughly 720 runs a month, nearly all of them wasted, and a faster path to getting blocked. Event-driven beats time-driven almost every time.
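
The arithmetic behind that hourly-polling figure is worth making explicit, since it is the same calculation you would run for every source in the registry. The helper name and the 30-day month are assumptions for illustration.

```python
def runs_per_month(interval_hours: float, days: int = 30) -> int:
    """Number of scheduled runs in a month at a fixed polling interval."""
    return int(days * 24 / interval_hours)


hourly = runs_per_month(1)          # 720 runs in a 30-day month
weekly = runs_per_month(24 * 7)     # 4 runs in a 30-day month
wasted_if_monthly = hourly - 1      # 719 runs that see no change
```

For a page that really changes about once a month, hourly polling yields one useful run out of 720; a weekly schedule plus a change alert covers the same ground in a handful.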

Provenance. Every record carries a source URL and a timestamp. This is what lets us put an honest "last verified" date on a page, trace any published number back to where it came from, and show the receipts when someone asks. For a benchmark site, traceability is not a nice-to-have — it is the product.
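
In code, provenance is just two extra fields that are never optional. The record shape and the `last_verified_label` helper below are a sketch under our own assumptions; in particular, labelling a page with the oldest timestamp among its records is one conservative design choice, not the only reasonable one.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Record:
    model: str
    field: str
    value: str
    source_url: str        # where the number came from
    verified_at: datetime  # when we last confirmed it (timezone-aware UTC)


def last_verified_label(records) -> str:
    """Label a page by its oldest record so it never overstates freshness."""
    oldest = min(r.verified_at for r in records)
    return f"Last verified {oldest.date().isoformat()}"
```

Using the oldest timestamp means one stale row drags the page's date back, which is the behavior you want if the date is meant to be a promise rather than decoration.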

A 30-minute starter setup

If you want to feel this work rather than read about it, here is the smallest version that delivers real value, and you can build it in half an hour:

  1. Pick one high-value page. A single provider's pricing page is perfect — it matters, it changes, and it is structured.
  2. Train a robot on the fields. Point a no-code tool at the page and click the cells you want: model name, input price, output price, context window. No selectors, no code.
  3. Schedule it weekly and add a change alert. Weekly is plenty for pricing; the alert is what makes it useful between runs.
  4. Send the output to a spreadsheet. Now you have a living, dated table instead of a screenshot from last month.
  5. Add one validation check. Before you trust a row, assert the prices are numbers in a sane range. Reject anything that is not. This one check is what makes the difference between "a neat demo" and "data I will actually publish."

That is the whole loop in miniature — collect, schedule, watch, validate — and it scales by repeating it per source, not by writing more code. (Browse AI is where we would build steps 2–4; the validation in step 5 stays your code. Partner link — it never affects our rankings.)

What we deliberately keep manual

Automation is not the goal; accuracy is. There are places where a human is simply better, and we leave them alone on purpose.

Editorial judgment about what to track. A robot can tell you a new model exists. It cannot tell you whether it is worth a row on a leaderboard, whether its self-reported scores are credible, or whether a benchmark has been contaminated. Those are calls we make by hand, and we would not want it any other way.

First publication of a big change. When a flagship price moves or a major model lands, the monitor fires and a person looks before anything goes live. The cost of being wrong on a headline number is far higher than the cost of being five minutes slower than a fully automated competitor.

Resolving conflicts between sources. When two pages disagree — and they do — no amount of scraping resolves it. Someone has to decide which source is authoritative, or hold the number until it is clear. Pretending automation can adjudicate truth is how you end up confidently publishing the wrong figure.

The rule of thumb: automate collection and detection, keep judgment and publication human. The machine finds the change; the person owns the claim.

The pipeline is the moat

It is tempting to think the value of a benchmark site is the ranking. It is not — anyone can publish a table. The value is the invisible machinery that keeps that table true week after week without a team of ten: the registry of sources, the no-code robots absorbing the maintenance tax, the monitors catching changes before readers do, and the validation layer refusing to publish a number that does not make sense.

The reframing that has served us well: extraction is cheap, maintenance is expensive, and trust is the actual product. No-code scraping and monitoring is how a small team pays the maintenance tax without drowning in it — and how the leaderboard stays the easy part.

If you are building anything data-driven — a benchmark, a price tracker, a knowledge base, a RAG app — start with the boring registry of sources, separate extraction from monitoring, and put a validation step between the web and whatever you publish. The tooling is interchangeable; the discipline is not.

Some links in this post are affiliate links. They never affect our rankings or what we choose to track.
