The data pipeline behind BenchLM — how we extract pricing tables, model specs, and competitor leaderboards, and monitor them for changes, using no-code scraping (Browse AI) instead of a fleet of brittle custom scrapers.
A benchmark site looks like a set of leaderboards. Underneath, it is a data pipeline. Prices change, new models ship most weeks, providers quietly edit a context-window number in their docs, and a competitor adds a row to their own table. The leaderboard is the easy part. Keeping it true — every number current, every source traceable — is the actual job.
This post documents how we run that pipeline on BenchLM, and the part that surprises people: we do not maintain a fleet of custom scrapers. Most of the collection and monitoring runs through a no-code layer, and a small amount of our own code does the parts that are genuinely specific to us. If you want the short version of the tooling, we use Browse AI for no-code extraction and change-monitoring — it turns a page into structured rows or an API without us writing or babysitting selectors. (Partner link — it never affects our rankings.) The rest of this post is the workflow around it, which matters far more than any single tool.
This is the practitioner version, not the thought-leadership version. Everything here is something we actually run.
The default mental model for web data is "fetch it once, store it, move on." That model is wrong for anything that has to stay accurate. Three things happen to every page you depend on:
Each of these is survivable on its own. The problem is the maintenance tax when you have dozens of sources and they each break on their own schedule. The bottleneck in a data site is almost never extraction — it is the ongoing cost of keeping extraction working. That is the lens for everything below.
The single most useful distinction we make internally is between extraction and monitoring. They feel like the same task and they are not.
{provider, model, input_price, output_price, context}. A spec sheet becomes fields. You run it on demand or on a schedule, and you get a dataset back.
Conflating the two leads to the classic homegrown failure: a cron job that re-scrapes everything every hour, produces a giant diff nobody reads, and either spams you with noise or lulls you into ignoring it. The fix is to use each for what it is good at. Extraction builds and refreshes the dataset. Monitoring decides when a refresh is even worth running.
A quick decision guide we use:
Here is the shape of the whole thing, source to site:
source pages → no-code robots → raw structured rows → validation → our data store → site build
Each arrow is a place something can go wrong, so each one earns its own attention.
Source pages. We keep an explicit registry of every page we depend on, with the URL, what we pull from it, how often, and who owns it if it breaks. This list is boring and it is the most valuable artifact in the pipeline. You cannot monitor what you have not written down.
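A registry like this can be as small as a list of typed records. The sketch below assumes the four attributes named above (URL, fields, cadence, owner); the class name, example URL, and owner value are placeholders for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    url: str                  # page we depend on
    fields: tuple[str, ...]   # what we pull from it
    cadence_hours: int        # how often we expect to check it
    owner: str                # who fixes it when it breaks


# Placeholder entry; a real registry would list every source page.
REGISTRY = [
    Source(
        url="https://example.com/pricing",
        fields=("provider", "model", "input_price", "output_price", "context"),
        cadence_hours=24,
        owner="data-team",
    ),
]


def unowned(registry: list[Source]) -> list[str]:
    """URLs that nobody is responsible for, which is worth failing a build on."""
    return [source.url for source in registry if not source.owner.strip()]
```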
No-code robots. This is where the work that used to be custom scrapers now lives. Instead of writing selectors, we point a robot at a page and show it the fields by clicking them — this price, that model name, this context number. The tool figures out the structure, handles pagination and "load more" lists, and follows through to detail pages when a row links out. Crucially, when a page redesign happens, fixing the robot is a re-train by clicking, not a debugging session in a codebase. That is the maintenance tax collapsing from hours to minutes.
We schedule these to run on a cadence that matches each source — and, for the high-stakes pages, we let a monitor decide when to run them at all. The output is a clean table or an API endpoint we can pull from. (Browse AI is the tool we use for this layer; the pattern is what matters, not the brand. Partner link — it never affects our rankings.)
Validation. Nothing from a robot touches the live site without passing a schema check first. More on this below, because it is the step everyone skips and the step that separates a toy from production.
Our data store and site build. Once data is validated and normalized into our own schema, it flows into the static data the site builds from. From the reader's perspective this is invisible — they just see a number that happens to be current and a "last verified" date that happens to be real.
The honest scope note: the no-code layer does collection and monitoring. Our own code still does the parts that are genuinely ours — the benchmark math, deduplication, matching a scraped "GPT-style" label to the canonical model identity in our database, and the editorial judgment about what is even worth tracking. Outsourcing the brittle part lets the small amount of code we do own be the part with actual leverage.
Everything above is framed around keeping a site fresh, but the exact same pipeline is how you build a dataset to feed a model — which is probably why more of you are here.
The instinct for RAG or fine-tuning is to grab raw HTML and figure it out later. Resist it. Raw HTML is mostly navigation, ads, cookie banners, and markup noise; embedding that wastes tokens and pollutes retrieval. Structured extraction is a cleaning step disguised as a collection step. When a robot returns {question, answer, source, date} instead of a 200KB HTML blob, you have already done most of the work that makes retrieval good.
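As a rough illustration of why the structured record is already most of the work, the sketch below turns one {question, answer, source, date} record into a retrieval chunk with its provenance kept as metadata. The chunk shape is an assumption; adapt it to whatever your vector store expects.

```python
def to_chunk(record: dict) -> dict:
    """Build one retrieval chunk from a structured record."""
    text = f"Q: {record['question'].strip()}\nA: {record['answer'].strip()}"
    return {
        "text": text,  # the only part that gets embedded
        "metadata": {"source": record["source"], "date": record["date"]},
    }
```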
The workflow we would use for a knowledge base:
A licensing and ethics aside that is not optional: scrape what you are allowed to. Respect robots.txt and terms of service, prefer public and first-party data, do not collect personal data you have no basis to hold, and do not hammer a server because a tool makes it easy to. For a project whose entire value is being trustworthy, the ethical bar sits above the legal one. When in doubt, attribute and link rather than silently absorb.
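For the robots.txt part specifically, Python's standard library can answer "may this user agent fetch this URL" from the file's contents. This is a minimal sketch: the user-agent string and URLs are placeholders, fetching the robots.txt file itself is left out, and terms of service still need a human read.

```python
from urllib.robotparser import RobotFileParser


def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against already-downloaded robots.txt contents."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```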
The monitoring half of the pipeline is what turns a data site from reactive to proactive. A few of the watches we actually run:
The mechanics matter as much as the targets. A change event flows: alert → human verifies → row gets updated → site rebuilds. We deliberately keep a human in that loop for anything that gets published. The monitor's job is to catch the change fast; a person's job is to confirm it is real and not a transient A/B test or a typo on the source's side.
The failure mode to design against is alert fatigue. Two rules keep it sane:
Done well, monitoring means the embarrassing email — "your price for X is wrong" — simply stops arriving, because you already knew and already fixed it.
Things break. Here is where, and what we do about it.
Anti-bot walls. Some pages genuinely do not want to be read by machines, and that is their right. When a source fights extraction, the answer is not an arms race — it is to find a first-party source (an official API, a docs export, a partner feed) or to accept a manual update for that one field. A pipeline that depends on defeating bot protection is a pipeline that will break the week you are on vacation.
Schema drift. This is the silent killer. The robot keeps returning data, but the shape shifted — a price column now includes the currency symbol, a number is suddenly a string, a field is occasionally null. The guardrail is a validation layer between extraction and your store that enforces types, ranges, and required fields, and quarantines anything that fails instead of publishing it. A price that parses to zero or to five figures is rejected, not displayed.
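A validation layer of this kind can be sketched in a few lines. The field names come from the row shape used earlier in this post; the upper price bound is an assumption standing in for "five figures", and the function names are hypothetical.

```python
REQUIRED = ("provider", "model", "input_price", "output_price", "context")
MAX_PRICE = 10_000  # assumption: a five-figure price is treated as a parse error


def validate(row: dict) -> list[str]:
    """Return a list of problems; an empty list means the row may be stored."""
    errors = [f"missing {name}" for name in REQUIRED if row.get(name) in (None, "")]
    for name in ("input_price", "output_price"):
        value = row.get(name)
        if value in (None, ""):
            continue  # already reported as missing
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            errors.append(f"{name} is not numeric: {value!r}")
        elif not 0 < value < MAX_PRICE:
            errors.append(f"{name} out of range: {value}")
    return errors


def partition(rows: list[dict]) -> tuple[list[dict], list[tuple[dict, list[str]]]]:
    """Split rows into accepted ones and quarantined ones with their reasons."""
    accepted, quarantined = [], []
    for row in rows:
        errors = validate(row)
        if errors:
            quarantined.append((row, errors))
        else:
            accepted.append(row)
    return accepted, quarantined
```

Keeping the reasons next to each quarantined row means whoever reviews the queue can see at a glance whether the source changed shape or the robot needs retraining.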
Monitoring your monitors. A robot that silently stops running looks identical to a page that simply stopped changing. We treat "no data from this source in longer than its expected cadence" as its own alert. The absence of news is not good news in a data pipeline.
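The staleness check itself is a single comparison. In this sketch the grace factor of 1.5 is an arbitrary assumption to avoid alerting on a run that is merely a little late.

```python
from datetime import datetime, timedelta, timezone


def is_stale(last_seen: datetime, cadence: timedelta, now: datetime,
             grace: float = 1.5) -> bool:
    """True when a source has been silent for longer than its cadence allows."""
    return now - last_seen > cadence * grace
```

For example, with a 24-hour cadence the alert fires once a source has been silent for more than 36 hours.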
Cost discipline. Every run and every monitor costs something: money, rate-limit budget, goodwill with the source. Match frequency to the real rate of change. Polling a monthly-updated page every hour is roughly 720 runs a month (24 × 30), nearly all of them wasted, and a faster path to getting blocked. Event-driven beats time-driven almost every time.
Provenance. Every record carries a source URL and a timestamp. This is what lets us put an honest "last verified" date on a page, trace any published number back to where it came from, and show the receipts when someone asks. For a benchmark site, traceability is not a nice-to-have — it is the product.
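In data terms, provenance just means the value never travels without its source and timestamp. The record shape and label format below are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Record:
    field: str
    value: float
    source_url: str        # where the number came from
    verified_at: datetime  # when it was last confirmed against the source


def last_verified_label(record: Record) -> str:
    """Human-readable label derived from the record, never typed by hand."""
    return f"Last verified {record.verified_at.date().isoformat()} ({record.source_url})"
```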
If you want to feel this work rather than read about it, here is the smallest version that delivers real value, and you can build it in half an hour:
That is the whole loop in miniature — collect, schedule, watch, validate — and it scales by repeating it per source, not by writing more code. (Browse AI is where we would build steps 2–4; the validation in step 5 stays your code. Partner link — it never affects our rankings.)
Automation is not the goal; accuracy is. There are places where a human is simply better, and we leave them alone on purpose.
Editorial judgment about what to track. A robot can tell you a new model exists. It cannot tell you whether it is worth a row on a leaderboard, whether its self-reported scores are credible, or whether a benchmark has been contaminated. Those are calls we make by hand, and we would not want it any other way.
First publication of a big change. When a flagship price moves or a major model lands, the monitor fires and a person looks before anything goes live. The cost of being wrong on a headline number is far higher than the cost of being five minutes slower than a fully automated competitor.
Resolving conflicts between sources. When two pages disagree — and they do — no amount of scraping resolves it. Someone has to decide which source is authoritative, or hold the number until it is clear. Pretending automation can adjudicate truth is how you end up confidently publishing the wrong figure.
The rule of thumb: automate collection and detection, keep judgment and publication human. The machine finds the change; the person owns the claim.
It is tempting to think the value of a benchmark site is the ranking. It is not — anyone can publish a table. The value is the invisible machinery that keeps that table true week after week without a team of ten: the registry of sources, the no-code robots absorbing the maintenance tax, the monitors catching changes before readers do, and the validation layer refusing to publish a number that does not make sense.
The reframing that has served us well: extraction is cheap, maintenance is expensive, and trust is the actual product. No-code scraping and monitoring is how a small team pays the maintenance tax without drowning in it — and how the leaderboard stays the easy part.
If you are building anything data-driven — a benchmark, a price tracker, a knowledge base, a RAG app — start with the boring registry of sources, separate extraction from monitoring, and put a validation step between the web and whatever you publish. The tooling is interchangeable; the discipline is not.
Some links in this post are affiliate links. They never affect our rankings or what we choose to track.