A practical guide to deploying AI apps and LLM-powered products — the model layer vs. the app layer, what your host must support (streaming, functions, secrets), and the exact setup we use to run BenchLM.
You picked a model. That's the part this site exists for — the leaderboards told you what's strong at what, the cost calculator told you what it'll cost per month. Now you have a working prototype on localhost and the actual question: how do you get this thing in front of users?
Here's the part most "deploy AI" content gets wrong: it answers a different question than the one you have. Search for AI deployment and you'll drown in Kubernetes tutorials, inference servers, and GPU autoscaling guides — the machinery for running model weights. If you're calling a model API (and you almost certainly are), none of that is your problem. Your problem is deploying the app around the model, and that's a much more solvable one.
This is the practitioner's guide to that problem: the model-layer/app-layer split, what your host actually needs to support, the three architectures that work, the deployment we run for real, where RAG fits, and the handful of pitfalls that break AI apps in their first week of production.
Every AI product has two layers, and they have completely different deployment stories.
The model layer is where inference happens — GPU memory holding weights, an inference server batching requests, autoscaling that has to deal with cold starts measured in tens of seconds, quantization decisions, KV-cache management. If you use OpenAI, Anthropic, Google, Mistral, DeepSeek, or any inference provider serving open-weight models, this layer is their problem. You interact with it through an HTTPS endpoint and an API key, and the hardest infrastructure problems in AI are somebody else's pager.
The app layer is everything your users actually touch: the frontend, auth, your data, and — critically — the server-side routes that hold your API key and relay requests to the model. This is a web application. It deploys like a web application. The fact that one of its upstream dependencies happens to be a language model changes exactly two things about hosting it (streaming and timeouts — we'll get there), and nothing else.
The decision rule is short:
Roughly speaking: if you don't have a concrete reason to hold the weights (compliance, unit economics at serious sustained volume, latency control, air-gapped environments), the API is the right call and your deployment just got an order of magnitude simpler. The teams that get this backwards spend their first quarter building inference infrastructure for a product that doesn't have users yet.
Strip away the hype and a 2026 AI app has a boring, repeatable shape:
The first rule of the app layer, and the one that gets violated constantly: the model call never happens in the browser. Anything shipped to the client is public — including your API key. A key scraped from a JS bundle becomes someone else's free inference until your card gets declined; there are scrapers that do nothing but crawl deployed bundles looking for exactly this. Every model call routes through a function you control, where the key lives in an environment variable and where you can rate-limit, log, and cap spend.
The second rule follows from the first: because all model traffic flows through your functions, those functions are your control point for everything — usage metering, abuse prevention, prompt versioning, model fallbacks. Treat that thin proxy layer as a real component, not glue code.
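To make "control point" concrete, here is a minimal sketch of two checks that proxy layer might run before any model call: a per-user rate limit and a server-side clamp on generation parameters. The function names (`allowRequest`, `clampRequest`) and the limits are hypothetical, and the in-memory counter is an assumption for illustration only; serverless instances don't share memory, so a real deployment would likely back this with a shared store.

```javascript
// Hypothetical limits; tune them to your own product.
const WINDOW_MS = 60_000;     // one-minute window
const MAX_REQUESTS = 20;      // requests per user per window
const MAX_TOKENS_CAP = 1024;  // hard ceiling on max_tokens per request

// In-memory counters keyed by user id. This only illustrates the logic:
// separate function instances would each have their own Map.
const windows = new Map();

// Fixed-window rate limit: returns true if this request may proceed.
function allowRequest(userId, now = Date.now()) {
  const entry = windows.get(userId);
  if (!entry || now - entry.start >= WINDOW_MS) {
    windows.set(userId, { start: now, count: 1 });
    return true;
  }
  if (entry.count >= MAX_REQUESTS) return false;
  entry.count += 1;
  return true;
}

// Never trust client-supplied generation parameters: clamp them server-side.
function clampRequest(body) {
  const requested = Number.isInteger(body.max_tokens)
    ? body.max_tokens
    : MAX_TOKENS_CAP;
  return {
    messages: Array.isArray(body.messages) ? body.messages : [],
    max_tokens: Math.min(Math.max(requested, 1), MAX_TOKENS_CAP),
  };
}
```

In a handler, a rejected `allowRequest` would typically map to a 429 response, and the output of `clampRequest` is what gets forwarded upstream instead of the raw client body.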
Not every web host can run this shape well. Before you commit, verify five things — in order of how painful they are to discover late:
1. End-to-end streaming. This is the sneaky one. Your function can stream perfectly and the platform's proxy layer can still buffer the whole response before releasing it — turning your live token stream into a long blank pause followed by a wall of text. Streaming has to survive every hop: function runtime, CDN, edge network. Some platforms support it natively, some support it only in specific function types, some buffer silently and document it in a footnote. Test it with a real slow generation on the deployed URL, not a hello-world on localhost.
2. Function timeouts that match generation reality. A long completion from a large reasoning model can run well past 10 seconds — extended thinking modes can run minutes. Know your platform's synchronous function limit, and know what the escape hatch is: streaming responses (which keep the connection alive as chunks flow) or background functions for jobs measured in minutes. If the platform's answer is "requests cap at 10 seconds, no exceptions," it cannot host a serious LLM app, full stop.
3. Server-side secrets. Environment variables scoped to functions, not baked into the client bundle. Table stakes, but check how preview and branch deploys handle secrets too — you want your production key in exactly one deploy context, and low-limit test keys everywhere else. A platform that copies all env vars into every PR preview is quietly multiplying your attack surface by the number of open branches.
4. Preview deploys. AI apps need more iteration than most software — prompts are code that you tune by feel, and "does this feel better?" is a question you answer by sharing a link, not by describing a diff. A URL per branch or PR, with the functions actually running against a test key, is the difference between "tweak the prompt, push, share the link" and a staging-server bottleneck that makes prompt iteration a scheduled event.
5. A free tier that covers real usage. The app layer should cost you nothing while you find out whether anyone wants the product. Your variable cost should be model tokens, not hosting. Any platform that charges meaningfully before you have traffic is charging you for the privilege of being early.
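The streaming check in item 1 can be scripted rather than eyeballed. The sketch below times when chunks arrive from a streamed response body and applies a rough heuristic for "something is buffering." The 90% threshold and the function names are assumptions for illustration, not a standard; a very short completion can legitimately arrive in one chunk, so run it against a deliberately long generation.

```javascript
// Reads a streamed body and records when chunks arrive.
// `body` is a web ReadableStream, e.g. `response.body` from fetch().
// Pass the timestamp taken just before fetch() as `start` so that time
// spent waiting for response headers is counted too.
async function timeStream(body, start = Date.now()) {
  const reader = body.getReader();
  let firstChunkMs = null;
  let chunks = 0;
  while (true) {
    const { done } = await reader.read();
    if (done) break;
    if (firstChunkMs === null) firstChunkMs = Date.now() - start;
    chunks += 1;
  }
  return { firstChunkMs, totalMs: Date.now() - start, chunks };
}

// Heuristic (an assumption): if everything arrived in one chunk, or the
// first chunk only showed up in the last 10% of the total time, some hop
// between the function and the client is probably buffering.
function looksBuffered({ firstChunkMs, totalMs, chunks }) {
  if (firstChunkMs === null) return true; // nothing arrived at all
  return chunks <= 1 || firstChunkMs > totalMs * 0.9;
}

// Usage against a deployed URL (hypothetical URL and payload):
//   const start = Date.now();
//   const res = await fetch("https://your-site.example/api/chat", {
//     method: "POST",
//     headers: { "content-type": "application/json" },
//     body: JSON.stringify({ messages: [{ role: "user", content: "Write 500 words." }] }),
//   });
//   console.log(looksBuffered(await timeStream(res.body, start)));
```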
Within the app-layer world, you have three real choices. All three deploy to the same class of platform.
Static frontend + serverless functions. The frontend is fully pre-built at deploy time; every dynamic AI interaction goes through a function. This is the simplest, cheapest, and fastest-to-first-byte option, and it's what a chat interface, a document analyzer, or a generation tool actually needs. It's also what BenchLM itself runs — hundreds of statically generated pages, functions for the dynamic edges.
Server-rendered app (SSR). Next.js or SvelteKit rendering pages on demand, with AI calls in route handlers. Choose this when the page content itself is personalized or model-generated per request — not just when your framework's marketing suggests it. SSR adds a server hop to every page view; take that cost only where you use it.
Edge functions for the latency-sensitive path. Edge runtimes put your proxy code in the region closest to the user. For LLM apps this matters less than people assume — generation time dwarfs network time — but it matters for the first token and for lightweight pre-processing (auth checks, rate limiting, routing between models). A sensible pattern: rate-limit and authenticate at the edge, do the actual model call in a standard function.
If in doubt, start with the first option. Every AI product I've seen ship fast was static-plus-functions; the ones that stalled were the ones that reached for heavy infrastructure before they had a user to serve with it.
BenchLM is itself an AI-adjacent app deployed the way this post describes — static-generated pages, serverless functions for the dynamic parts (OG image generation, newsletter signup, API endpoints), all deployed from a Git repo on every push. We run it on Netlify, and it checks every box on the list above: functions with streaming support, edge functions when we want lower latency, per-branch preview deploys with scoped environment variables, and a free tier that carried this site well past its first traffic spikes. (Partner link — it never affects our rankings.)
Here's the minimal shape of an AI endpoint as a Netlify Function — a complete, deployable model proxy with streaming:
// netlify/functions/chat.mjs
export default async (req) => {
  const { messages } = await req.json();
  const upstream = await fetch("https://api.anthropic.com/v1/messages", {
    method: "POST",
    headers: {
      "x-api-key": process.env.ANTHROPIC_API_KEY, // set in the Netlify UI, never in code
      "anthropic-version": "2023-06-01",
      "content-type": "application/json",
    },
    body: JSON.stringify({
      model: "claude-sonnet-5",
      max_tokens: 1024,
      stream: true,
      messages,
    }),
  });
  // Forward the token stream straight through to the browser.
  return new Response(upstream.body, {
    headers: { "content-type": "text/event-stream" },
  });
};
export const config = { path: "/api/chat" };
That's the whole trick. The browser calls /api/chat on your own domain, the key never leaves the server, and tokens render as they arrive. Swap the endpoint and headers for OpenAI, Mistral, or an inference provider serving open-weight models — the shape is identical. The deployment workflow around it:
Set ANTHROPIC_API_KEY (or your provider's equivalent) as an environment variable. Scope the production key to the production context; give previews a separate key with a low spend limit so a leaked preview URL can't hurt you.
Time from working localhost prototype to production URL, done honestly: under an hour, most of which is reading your own code for hardcoded keys.
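For the browser half of that exchange, here is a minimal sketch of consuming the stream that /api/chat relays. It assumes the proxy forwards Anthropic's server-sent events unchanged, with text arriving in `content_block_delta` events that carry a `text_delta`; other providers use different event shapes, so the parsing would need adjusting. The names `createDeltaParser` and `streamChat` are hypothetical.

```javascript
// Incremental parser: feed it decoded text chunks and it returns any text
// deltas completed so far. Network chunks can split an event anywhere,
// so incomplete data stays in the buffer until the rest arrives.
function createDeltaParser() {
  let buffer = "";
  return function push(chunk) {
    buffer += chunk;
    const events = buffer.split("\n\n"); // SSE events end with a blank line
    buffer = events.pop();               // last piece may be incomplete
    const deltas = [];
    for (const event of events) {
      for (const line of event.split("\n")) {
        if (!line.startsWith("data:")) continue;
        let data;
        try {
          data = JSON.parse(line.slice(5));
        } catch {
          continue; // skip anything that isn't JSON
        }
        if (data.type === "content_block_delta" && data.delta?.type === "text_delta") {
          deltas.push(data.delta.text);
        }
      }
    }
    return deltas;
  };
}

// Browser-side usage: POST to your own proxy and hand text to the UI
// as it arrives. `onText` is whatever appends to your chat view.
async function streamChat(messages, onText) {
  const res = await fetch("/api/chat", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ messages }),
  });
  if (!res.ok || !res.body) throw new Error(`chat request failed: ${res.status}`);
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  const push = createDeltaParser();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    for (const text of push(decoder.decode(value, { stream: true }))) onText(text);
  }
}
```

Keeping the parser separate from the fetch loop makes it easy to unit-test against recorded event text without a network call.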
The most common "but my app is more complicated" objection is retrieval-augmented generation, and it's the objection that dissolves fastest under inspection. A RAG app adds exactly three pieces, and all three fit the same architecture:
If someone tells you RAG means you've outgrown serverless, they're describing 2023.
Requests that finish in seconds are the easy case. Two workloads don't fit it:
Batch work: re-embedding a corpus, generating summaries for a thousand documents, nightly evaluation runs. These belong in background functions (most platforms offer a variant with a timeout measured in minutes, not seconds) triggered on a schedule or by an event. The pattern to avoid is the clever one: chaining synchronous functions to dodge timeouts. It works until it doesn't, and it fails invisibly.
Agent runs: multi-step tool-using sessions that might run for minutes and need to survive their own failures. The honest 2026 answer is that a long agent run wants a queue and a worker, which is one more managed service (Upstash QStash, Inngest, Trigger.dev and similar all speak "call my function later" natively). The function receives a job, does one agent step, persists state, and enqueues the next step. Each step stays inside serverless limits; the run as a whole can go as long as it needs. Short agent loops (a handful of tool calls) fit fine in a single streaming or background function; don't add the queue until a real run actually hits a real limit.
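The receive, step, persist, enqueue loop can be sketched in a few lines. Everything here is a stand-in: the `Map` plays the role of a durable store, the array plays the role of a managed queue, and `runStep` is a hypothetical function that makes one model or tool call. None of it is the API of QStash, Inngest, or Trigger.dev.

```javascript
// Hypothetical in-memory stand-ins for a durable store and a managed queue.
const store = new Map(); // runId -> { step, history, done }
const queue = [];        // pending job payloads

function enqueue(job) {
  queue.push(job);
}

// One invocation handles exactly one agent step, so each invocation stays
// inside serverless time limits however long the whole run takes.
// `runStep(history)` must resolve to { output, finished }.
async function handleJob(job, runStep, maxSteps = 20) {
  const state = store.get(job.runId) ?? { step: 0, history: [], done: false };
  if (state.done) return state; // duplicate delivery: nothing left to do

  const result = await runStep(state.history);
  state.history.push(result.output);
  state.step += 1;
  // Hard cap on steps as a guard against runaway loops (and runaway spend).
  state.done = result.finished || state.step >= maxSteps;

  store.set(job.runId, state); // persist before scheduling the next step
  if (!state.done) enqueue({ runId: job.runId });
  return state;
}
```

Persisting before enqueuing means a crash between the two leaves a saved state that can be resumed, rather than a queued job pointing at state that was never written.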
Notice what's still not on the list: your own servers.
Having watched a lot of these ship (and shipped a few), the failure modes are remarkably consistent:
1. Buffered streaming. Covered above, worth repeating: test streaming on the deployed URL, not localhost. If the first token doesn't render until the last token is generated, some layer is buffering — and your users' first impression is a 25-second blank screen.
2. Timeout kills mid-generation. Long completions die at the platform's sync limit and the user sees a truncated answer or a 502. Fix: stream (the open connection keeps the function alive on most platforms), or move long jobs to background functions and poll or push the result.
3. The leaked key. VITE_- and NEXT_PUBLIC_-prefixed variables ship to the browser by design — that's what the prefix means. If your provider key ever had one of those prefixes, even briefly on a preview deploy, rotate it today. Set provider-side spend caps as the backstop for the leak you haven't noticed yet.
4. Unbounded spend. Token costs scale with usage patterns you don't control once the URL is public. One user pasting entire books into your chat box, or one scraper hammering the endpoint overnight, can cost more in a weekend than hosting costs in a year. Rate limits per user, max_tokens caps per request, provider spend limits per key — all three, before launch, not after the invoice.
5. No fallback for provider incidents. Model APIs have bad days; every provider's status page has a history tab for a reason. A try/catch that degrades to a smaller or alternate model — the leaderboard is useful for picking a fallback in the same capability class — keeps your product up while a provider is down. Even a canned "generation is degraded right now" state beats a raw 500.
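For item 5, a rough sketch of the try/catch fallback, assuming a hypothetical `callModel(model)` function that returns a fetch-style `Response`. The policy shown (fall through on network errors, 429s, and 5xx; stop on other 4xx) is one reasonable choice rather than the only one.

```javascript
// A canned degraded state the UI can render instead of a raw 500.
function degradedResponse() {
  return new Response(
    JSON.stringify({
      error: "generation_degraded",
      message: "Generation is degraded right now. Please retry shortly.",
    }),
    { status: 503, headers: { "content-type": "application/json" } },
  );
}

// Try each model in order of preference. `callModel` is hypothetical:
// it takes a model name and resolves to a fetch-style Response.
async function withFallback(models, callModel) {
  let lastError = null;
  for (const model of models) {
    try {
      const res = await callModel(model);
      if (res.ok) return { model, res };
      // Other 4xx responses are treated here as a problem with our own
      // request, which a different model is unlikely to fix.
      if (res.status < 500 && res.status !== 429) return { model, res };
      lastError = new Error(`${model} responded ${res.status}`);
    } catch (err) {
      lastError = err; // network failure or timeout
    }
  }
  return { model: null, res: degradedResponse(), error: lastError };
}
```

The returned `model` tells the caller which model actually answered, which is worth recording so that degraded periods show up in your logs.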
A sixth, quieter one: shipping with no observability. Log every model call server-side — model, token counts, latency, truncated prompt hash — from day one. When a user reports "it gave me a weird answer yesterday," that log is the difference between a fix and a shrug. Your function proxy is the natural place for it, which is one more reason the browser-calls-the-API shortcut costs more than it saves.
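As a sketch of what one such log line might contain: the helper below builds a structured entry with the fields listed above. The hash is a non-cryptographic FNV-1a, chosen here only because it needs no imports; it is fine for correlating reports with log lines but not for anything security-sensitive, where SHA-256 would be the better fit. The `usage.input_tokens` and `usage.output_tokens` field names follow Anthropic's response shape and are an assumption for other providers.

```javascript
// 32-bit FNV-1a over UTF-16 code units, returned as 8 hex characters.
// Lets you match a user report to a log line without storing prompt text.
function promptHash(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// Builds one structured log entry per model call.
// `startedAt` and `finishedAt` are millisecond timestamps (Date.now()).
function buildLogEntry({ model, prompt, usage, startedAt, finishedAt }) {
  return {
    ts: new Date(finishedAt).toISOString(),
    model,
    inputTokens: usage?.input_tokens ?? null,
    outputTokens: usage?.output_tokens ?? null,
    latencyMs: finishedAt - startedAt,
    promptHash: promptHash(prompt),
  };
}

// In the proxy function, after the call completes:
//   console.log(JSON.stringify(buildLogEntry({ model, prompt, usage, startedAt, finishedAt })));
```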
Sometimes self-hosting is right: open-weight models have closed most of the capability gap in several categories, and at sustained high volume the per-token math can flip in favor of rented GPUs — run your own numbers in the self-host calculator before trusting anyone's blog post, including this one. Privacy and compliance can also force the issue regardless of economics.
When that's your path, here is the part that surprises people: the app-layer guidance above doesn't change at all. Your frontend and functions still deploy exactly as described; the only difference is that the endpoint your function calls is your own inference server on a GPU box instead of a provider's API. The two layers stay cleanly separated — which is precisely why it pays to build them that way from day one. You can start on a hosted API this week, prove the product, and swap the model layer later by changing one URL in one function.
Pick the model with our leaderboards, estimate the tokens with the cost calculator, then ship the thing. The gap between "works on my machine" and "works on a URL" has never been smaller.