How to Deploy an AI App in 2026: From Model Pick to Production URL

A practical guide to deploying AI apps and LLM-powered products — the model layer vs. the app layer, what your host must support (streaming, functions, secrets), and the exact setup we use to run BenchLM.

Glevd·Published July 2, 2026·15 min read

You picked a model. That's the part this site exists for — the leaderboards told you what's strong at what, the cost calculator told you what it'll cost per month. Now you have a working prototype on localhost and the actual question: how do you get this thing in front of users?

Here's the part most "deploy AI" content gets wrong: it answers a different question than the one you have. Search for AI deployment and you'll drown in Kubernetes tutorials, inference servers, and GPU autoscaling guides — the machinery for running model weights. If you're calling a model API (and you almost certainly are), none of that is your problem. Your problem is deploying the app around the model, and that's a much more solvable one.

This is the practitioner's guide to that problem: the model-layer/app-layer split, what your host actually needs to support, the three architectures that work, the deployment we run for real, where RAG fits, and the handful of pitfalls that break AI apps in their first week of production.

The distinction that sorts everything: model layer vs. app layer

Every AI product has two layers, and they have completely different deployment stories.

The model layer is where inference happens — GPU memory holding weights, an inference server batching requests, autoscaling that has to deal with cold starts measured in tens of seconds, quantization decisions, KV-cache management. If you use OpenAI, Anthropic, Google, Mistral, DeepSeek, or any inference provider serving open-weight models, this layer is their problem. You interact with it through an HTTPS endpoint and an API key, and the hardest infrastructure problems in AI are somebody else's pager.

The app layer is everything your users actually touch: the frontend, auth, your data, and — critically — the server-side routes that hold your API key and relay requests to the model. This is a web application. It deploys like a web application. The fact that one of its upstream dependencies happens to be a language model changes exactly two things about hosting it (streaming and timeouts — we'll get there), and nothing else.

The decision rule is short:

  • Calling a model API? You are deploying an app layer only. Any modern web platform can host it; the rest of this post is your checklist.
  • Self-hosting open-weight model weights? You are also deploying a model layer, which means renting GPUs. That's a real option with real economics — we built a self-host cost calculator precisely because the break-even math is unintuitive — but it's a cost/privacy/control decision you make deliberately, not a default you drift into.

Roughly speaking: if you don't have a concrete reason to hold the weights (compliance, unit economics at serious sustained volume, latency control, air-gapped environments), the API is the right call and your deployment just got an order of magnitude simpler. The teams that get this backwards spend their first quarter building inference infrastructure for a product that doesn't have users yet.

What an AI app actually looks like in production

Strip away the hype and a 2026 AI app has a boring, repeatable shape:

  1. A frontend — static pages or a server-rendered framework (Next.js, Astro, SvelteKit, plain Vite + React). Nothing AI-specific about deploying this. It goes to a CDN like every other frontend since 2018.
  2. Server-side functions — the routes that read the API key from the environment, call the model, and return results. This is the only architecturally AI-ish part of the stack, and it's maybe forty lines of code.
  3. A streaming path — users will not stare at a spinner for 20 seconds while a long generation completes. Tokens must flow to the browser as they're produced, and every hop between the model and the user has to cooperate.
  4. Sometimes: a data layer — a vector store for RAG, a database for chat history, session state. All of it available as managed services your functions call.
  5. Sometimes: background work — embedding pipelines, batch jobs, agent runs that outlive a request/response cycle.

The first rule of the app layer, and the one that gets violated constantly: the model call never happens in the browser. Anything shipped to the client is public — including your API key. A key scraped from a JS bundle becomes someone else's free inference until your card gets declined; there are scrapers that do nothing but crawl deployed bundles looking for exactly this. Every model call routes through a function you control, where the key lives in an environment variable and where you can rate-limit, log, and cap spend.

The second rule follows from the first: because all model traffic flows through your functions, those functions are your control point for everything — usage metering, abuse prevention, prompt versioning, model fallbacks. Treat that thin proxy layer as a real component, not glue code.
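To make the "control point" idea concrete, here is a minimal sketch of one such control: a fixed-window rate limit that the proxy function consults before calling the model. The names `createRateLimiter` and `allow` are hypothetical, and the in-memory `Map` is an assumption that only holds within a single warm function instance; on a real serverless platform the counters would likely need to live in a shared store.

```javascript
// Hypothetical fixed-window limiter: at most `limit` requests per `windowMs` per key.
// In-memory only: state is lost on cold start and is not shared across instances.
function createRateLimiter({ limit, windowMs }) {
  const windows = new Map(); // key -> { start, count }

  return function allow(key, now = Date.now()) {
    const current = windows.get(key);
    // No window yet, or the old one expired: start a fresh window.
    if (!current || now - current.start >= windowMs) {
      windows.set(key, { start: now, count: 1 });
      return true;
    }
    if (current.count >= limit) return false;
    current.count += 1;
    return true;
  };
}

// Usage inside a proxy function (sketch):
const allow = createRateLimiter({ limit: 20, windowMs: 60000 });
// if (!allow(clientIp)) return new Response("Too many requests", { status: 429 });
```

The key can be an IP address or a user ID; the important part is that the check runs in the function, before any tokens are spent.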

The hosting checklist: what your platform must support

Not every web host can run this shape well. Before you commit, verify five things — in order of how painful they are to discover late:

1. End-to-end streaming. This is the sneaky one. Your function can stream perfectly and the platform's proxy layer can still buffer the whole response before releasing it — turning your live token stream into a long blank pause followed by a wall of text. Streaming has to survive every hop: function runtime, CDN, edge network. Some platforms support it natively, some support it only in specific function types, some buffer silently and document it in a footnote. Test it with a real slow generation on the deployed URL, not a hello-world on localhost.

2. Function timeouts that match generation reality. A long completion from a large reasoning model can run well past 10 seconds — extended thinking modes can run minutes. Know your platform's synchronous function limit, and know what the escape hatch is: streaming responses (which keep the connection alive as chunks flow) or background functions for jobs measured in minutes. If the platform's answer is "requests cap at 10 seconds, no exceptions," it cannot host a serious LLM app, full stop.

3. Server-side secrets. Environment variables scoped to functions, not baked into the client bundle. Table stakes, but check how preview and branch deploys handle secrets too — you want your production key in exactly one deploy context, and low-limit test keys everywhere else. A platform that copies all env vars into every PR preview is quietly multiplying your attack surface by the number of open branches.

4. Preview deploys. AI apps need more iteration than most software — prompts are code that you tune by feel, and "does this feel better?" is a question you answer by sharing a link, not by describing a diff. A URL per branch or PR, with the functions actually running against a test key, is the difference between "tweak the prompt, push, share the link" and a staging-server bottleneck that makes prompt iteration a scheduled event.

5. A free tier that covers real usage. The app layer should cost you nothing while you find out whether anyone wants the product. Your variable cost should be model tokens, not hosting. Any platform that charges meaningfully before you have traffic is charging you for the privilege of being early.
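The streaming check in item 1 is easy to script. The sketch below records when each chunk arrives from a deployed endpoint and flags responses whose first chunk only shows up near the end. `measureStream` and `looksBuffered` are hypothetical names, and the 0.8 threshold is an arbitrary assumption, not a standard.

```javascript
// Reads a streamed response and records when each chunk arrives.
async function measureStream(url, body) {
  const started = Date.now();
  const res = await fetch(url, {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(body),
  });
  if (!res.ok || !res.body) throw new Error(`Request failed: ${res.status}`);

  const reader = res.body.getReader();
  const arrivals = []; // ms since request start, one entry per chunk
  for (;;) {
    const { done } = await reader.read();
    if (done) break;
    arrivals.push(Date.now() - started);
  }
  return { arrivals, totalMs: Date.now() - started };
}

// Heuristic with an assumed threshold: if the first chunk arrives only near
// the end of the response, some layer probably buffered the stream.
function looksBuffered({ arrivals, totalMs }, threshold = 0.8) {
  if (arrivals.length === 0) return true;
  if (totalMs <= 0) return false;
  return arrivals[0] / totalMs >= threshold;
}
```

Run it against the deployed URL with a prompt that forces a long generation; a short reply finishes too quickly to tell the two cases apart.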

Three architectures that work (and when to use each)

Within the app-layer world, you have three real choices. All three deploy to the same class of platform.

Static frontend + serverless functions. The frontend is fully pre-built at deploy time; every dynamic AI interaction goes through a function. This is the simplest, cheapest, and fastest-to-first-byte option, and it's what a chat interface, a document analyzer, or a generation tool actually needs. It's also what BenchLM itself runs — hundreds of statically generated pages, functions for the dynamic edges.

Server-rendered app (SSR). Next.js or SvelteKit rendering pages on demand, with AI calls in route handlers. Choose this when the page content itself is personalized or model-generated per request — not just when your framework's marketing suggests it. SSR adds a server hop to every page view; take that cost only where you use it.

Edge functions for the latency-sensitive path. Edge runtimes put your proxy code in the region closest to the user. For LLM apps this matters less than people assume — generation time dwarfs network time — but it matters for the first token and for lightweight pre-processing (auth checks, rate limiting, routing between models). A sensible pattern: rate-limit and authenticate at the edge, do the actual model call in a standard function.

If in doubt, start with the first option. Every AI product I've seen ship fast was static-plus-functions; the ones that stalled were the ones that reached for heavy infrastructure before they had a user to serve with it.
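The edge pattern mentioned above (cheap checks at the edge, model call in a standard function) can be sketched roughly like this. It assumes the edge runtime hands your handler a `context` object exposing the client `ip` and a `next()` that continues to the origin function; `createEdgeGuard` and `isAllowed` are hypothetical names, and the header check is a placeholder for real authentication.

```javascript
// Hypothetical edge guard: cheap checks near the user, model call left to a standard function.
// Assumes the runtime passes a `context` with `ip` and `next()`.
function createEdgeGuard(isAllowed) {
  return async function edgeGuard(request, context) {
    // Placeholder auth check: only verifies that a credential was sent at all.
    if (!request.headers.get("authorization")) {
      return new Response("Unauthorized", { status: 401 });
    }
    if (!isAllowed(context.ip)) {
      return new Response("Too many requests", { status: 429 });
    }
    return context.next(); // hand off to the function that calls the model
  };
}
```

Rejected requests never reach the function that holds the provider key, so they cost nothing in tokens.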

The setup we actually run

BenchLM is itself an AI-adjacent app deployed the way this post describes — static-generated pages, serverless functions for the dynamic parts (OG image generation, newsletter signup, API endpoints), all deployed from a Git repo on every push. We run it on Netlify, and it checks every box on the list above: functions with streaming support, edge functions when we want lower latency, per-branch preview deploys with scoped environment variables, and a free tier that carried this site well past its first traffic spikes. (Partner link — it never affects our rankings.)

Here's the minimal shape of an AI endpoint as a Netlify Function — a complete, deployable model proxy with streaming:

// netlify/functions/chat.mjs
export default async (req) => {
  const { messages } = await req.json();

  const upstream = await fetch("https://api.anthropic.com/v1/messages", {
    method: "POST",
    headers: {
      "x-api-key": process.env.ANTHROPIC_API_KEY, // set in the Netlify UI, never in code
      "anthropic-version": "2023-06-01",
      "content-type": "application/json",
    },
    body: JSON.stringify({
      model: "claude-sonnet-5",
      max_tokens: 1024,
      stream: true,
      messages,
    }),
  });

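  // Surface upstream failures instead of relabeling an error body as a stream.
  if (!upstream.ok || !upstream.body) {
    return new Response(JSON.stringify({ error: "Upstream model request failed" }), {
      status: 502,
      headers: { "content-type": "application/json" },
    });
  }

  // Forward the token stream straight through to the browser.
  return new Response(upstream.body, {
    headers: { "content-type": "text/event-stream" },
  });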
};

export const config = { path: "/api/chat" };

That's the whole trick. The browser calls /api/chat on your own domain, the key never leaves the server, and tokens render as they arrive. Swap the endpoint and headers for OpenAI, Mistral, or an inference provider serving open-weight models — the shape is identical. The deployment workflow around it:

  1. Connect the repo. Push to Git, connect the repo in the dashboard, and the framework is auto-detected — build command and publish directory included for Next.js, Astro, SvelteKit, and friends. No Dockerfile, no YAML.
  2. Set your secrets. Add ANTHROPIC_API_KEY (or your provider's equivalent) as an environment variable. Scope the production key to the production context; give previews a separate key with a low spend limit so a leaked preview URL can't hurt you.
  3. Push. Every push builds and deploys; every pull request gets its own preview URL with functions live. Prompt-tuning becomes "push branch, send link, get a yes/no in five minutes."
  4. Add guardrails before you share the URL. Rate-limit the function (by IP or user), set a max token budget per request, and set a hard spend cap on the provider side. An AI endpoint without limits is a public faucet wired to your credit card.
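For the per-request budget in step 4, a sketch like the following could sit at the top of the function, before the upstream call. The `LIMITS` values and the `applyBudget` name are illustrative assumptions; pick numbers that match your own product and pricing.

```javascript
// Hypothetical per-request budget: caps output tokens and rejects oversized input.
const LIMITS = { maxOutputTokens: 1024, maxInputChars: 20000, maxMessages: 40 };

function applyBudget(payload, limits = LIMITS) {
  const messages = Array.isArray(payload?.messages) ? payload.messages : [];
  if (messages.length === 0 || messages.length > limits.maxMessages) {
    return { ok: false, reason: "bad message count" };
  }
  // Keep the sketch simple: only plain-string message content is accepted.
  if (!messages.every((m) => typeof m?.content === "string")) {
    return { ok: false, reason: "unsupported content" };
  }
  const inputChars = messages.reduce((sum, m) => sum + m.content.length, 0);
  if (inputChars > limits.maxInputChars) {
    return { ok: false, reason: "input too large" };
  }
  // Never trust a client-supplied max_tokens; clamp it to the server's cap.
  const requested = Number.isInteger(payload.max_tokens)
    ? payload.max_tokens
    : limits.maxOutputTokens;
  const maxTokens = Math.min(Math.max(requested, 1), limits.maxOutputTokens);
  return { ok: true, body: { messages, max_tokens: maxTokens } };
}
```

Character counts are only a rough proxy for tokens, which is why the provider-side spend cap still matters as the final backstop.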

Time from working localhost prototype to production URL, done honestly: under an hour, most of which is reading your own code for hardcoded keys.
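On the browser side, the proxy above forwards the provider's server-sent events untouched, so the client still has to pull the text out of them. Here is a minimal sketch, assuming Anthropic-style `content_block_delta` events with `text_delta` payloads and newline-delimited events; `extractText` and `streamChat` are hypothetical helper names, and other providers use different event shapes.

```javascript
// Pulls text deltas out of a buffer of server-sent events.
// Returns the extracted text plus any trailing partial event to carry over.
function extractText(buffer) {
  const events = buffer.split("\n\n");
  const rest = events.pop(); // last piece may be an incomplete event
  let text = "";
  for (const event of events) {
    for (const line of event.split("\n")) {
      if (!line.startsWith("data:")) continue;
      let data;
      try {
        data = JSON.parse(line.slice(5));
      } catch {
        continue; // ignore data lines that are not JSON
      }
      if (data.type === "content_block_delta" && data.delta?.type === "text_delta") {
        text += data.delta.text;
      }
    }
  }
  return { text, rest };
}

// Browser-side usage sketch: call onText with new text as chunks arrive.
async function streamChat(messages, onText) {
  const res = await fetch("/api/chat", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ messages }),
  });
  if (!res.ok || !res.body) throw new Error(`Chat request failed: ${res.status}`);

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const { text, rest } = extractText(buffer);
    buffer = rest;
    if (text) onText(text);
  }
}
```

Carrying `rest` over between reads matters because network chunks do not line up with event boundaries; an event can arrive split across two reads.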

RAG doesn't change the story — it adds one service

The most common "but my app is more complicated" objection is retrieval-augmented generation, and it's the objection that dissolves fastest under inspection. A RAG app adds exactly three pieces, and all three fit the same architecture:

  • The vector store is a managed service — Pinecone, Supabase pgvector, Upstash Vector, Turso, take your pick. Your functions query it over HTTPS with a key from the environment, exactly like the model API. You don't host it, the same way you don't host the model.
  • The query path — embed the user's question, retrieve the nearest chunks, stuff them into the prompt — is a few extra awaits inside the same function that was already calling the model. Latency budget: one embedding call plus one vector query, both fast relative to generation.
  • The ingestion pipeline — chunking and embedding your corpus — is the one genuinely new piece, and it's a background job, not a service. Run it as a scheduled function for periodic refreshes, or trigger it on content changes. (If your corpus lives on other people's websites, that's an extraction problem — we wrote up how we run that pipeline for BenchLM's own data.)
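The query path described above can be sketched in a few lines. `embed`, `search`, and `complete` are injected stand-ins for your embedding API, vector store client, and model call; they are assumptions for illustration, not any particular library's API, and the prompt wording is just one plausible template.

```javascript
// Hypothetical RAG query path. `embed`, `search`, and `complete` are injected
// stand-ins for your embedding API, vector store client, and model call.
function buildPrompt(question, chunks) {
  const context = chunks.map((c, i) => `[${i + 1}] ${c.text}`).join("\n\n");
  return (
    "Answer using only the context below. If the answer is not there, say so.\n\n" +
    `Context:\n${context}\n\nQuestion: ${question}`
  );
}

async function answerWithRag(question, { embed, search, complete, topK = 4 }) {
  const vector = await embed(question);           // one embedding call
  const chunks = await search(vector, topK);      // one vector query
  return complete(buildPrompt(question, chunks)); // then the usual model call
}
```

Injecting the three dependencies keeps the function testable without network access and makes it easy to swap the vector store later.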

If someone tells you RAG means you've outgrown serverless, they're describing 2023.

Background jobs and agents: the long-running problem

Requests that finish in seconds are the easy case. Two workloads don't fit it:

Batch work — re-embedding a corpus, generating summaries for a thousand documents, nightly evaluation runs. These belong in background functions (most platforms offer a variant with a timeout measured in minutes, not seconds) triggered on a schedule or by an event. The pattern to avoid is the clever one: chaining synchronous functions to dodge timeouts. It works until it doesn't, and it fails invisibly.

Agent runs — multi-step tool-using sessions that might run for minutes and need to survive their own failures. The honest 2026 answer is that a long agent run wants a queue and a worker, which is one managed service more (Upstash QStash, Inngest, Trigger.dev and similar all speak "call my function later" natively). The function receives a job, does one agent step, persists state, enqueues the next step. Each step stays inside serverless limits; the run as a whole can go as long as it needs. Short agent loops — a handful of tool calls — fit fine in a single streaming background function; don't add the queue until a real run actually hits a real limit.
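The "one step per invocation" pattern looks roughly like the sketch below. `store`, `enqueue`, and `runStep` are hypothetical stand-ins for your database, queue client, and a single model or tool call; the step check assumes the queue may deliver a job more than once, which many queues do.

```javascript
// Hypothetical single-step agent handler. `store`, `enqueue`, and `runStep`
// stand in for your database, queue client, and one model/tool call.
async function handleAgentJob(job, { store, enqueue, runStep, maxSteps = 20 }) {
  const state = (await store.load(job.runId)) ?? { step: 0, history: [], done: false };
  if (state.done) return state;              // run already finished
  if (job.step !== state.step) return state; // stale or duplicate delivery

  const result = await runStep(state.history);
  const next = {
    step: state.step + 1,
    history: [...state.history, result.output],
    done: result.done || state.step + 1 >= maxSteps, // hard cap on run length
  };
  await store.save(job.runId, next);         // persist before scheduling more work
  if (!next.done) await enqueue({ runId: job.runId, step: next.step });
  return next;
}
```

Because state is saved before the next job is enqueued, a crash between steps leaves the run resumable rather than lost.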

Notice what's still not on the list: your own servers.

The five ways AI apps break in production

Having watched a lot of these ship (and shipped a few), the failure modes are remarkably consistent:

1. Buffered streaming. Covered above, worth repeating: test streaming on the deployed URL, not localhost. If the first token doesn't render until the last token is generated, some layer is buffering — and your users' first impression is a 25-second blank screen.

2. Timeout kills mid-generation. Long completions die at the platform's sync limit and the user sees a truncated answer or a 502. Fix: stream (the open connection keeps the function alive on most platforms), or move long jobs to background functions and poll or push the result.

3. The leaked key. VITE_- and NEXT_PUBLIC_-prefixed variables ship to the browser by design — that's what the prefix means. If your provider key ever had one of those prefixes, even briefly on a preview deploy, rotate it today. Set provider-side spend caps as the backstop for the leak you haven't noticed yet.

4. Unbounded spend. Token costs scale with usage patterns you don't control once the URL is public. One user pasting entire books into your chat box, or one scraper hammering the endpoint overnight, can cost more in a weekend than hosting costs in a year. Rate limits per user, max_tokens caps per request, provider spend limits per key — all three, before launch, not after the invoice.

5. No fallback for provider incidents. Model APIs have bad days; every provider's status page has a history tab for a reason. A try/catch that degrades to a smaller or alternate model — the leaderboard is useful for picking a fallback in the same capability class — keeps your product up while a provider is down. Even a canned "generation is degraded right now" state beats a raw 500.
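The fallback in item 5 can be as small as the sketch below: try each model call in order and return a degraded state if all of them fail. `completeWithFallback` is a hypothetical name, and each `call` is assumed to be a function that performs one provider request and resolves to text.

```javascript
// Hypothetical fallback chain: try each model call in order, degrade gracefully.
async function completeWithFallback(
  attempts,
  degradedMessage = "Generation is degraded right now. Please try again shortly."
) {
  const errors = [];
  for (const { name, call } of attempts) {
    try {
      return { ok: true, model: name, text: await call() };
    } catch (err) {
      // Keep the failure for logging, then move on to the next option.
      errors.push({ model: name, message: String(err?.message ?? err) });
    }
  }
  return { ok: false, text: degradedMessage, errors };
}
```

This shape works cleanly for non-streaming calls; with streaming, a failure after the first tokens have been sent is harder to recover from and needs its own handling.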

A sixth, quieter one: shipping with no observability. Log every model call server-side from day one: model, token counts, latency, and a truncated hash of the prompt. When a user reports "it gave me a weird answer yesterday," that log is the difference between a fix and a shrug. Your function proxy is the natural place for it, which is one more reason the browser-calls-the-API shortcut costs more than it saves.
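A log entry along those lines might be built like this. The field names, the `usage` shape (`input_tokens` and `output_tokens`), and the helper names are assumptions for illustration; the hash is FNV-1a, a small non-cryptographic hash that is fine for correlating log lines but not for anything security-sensitive.

```javascript
// FNV-1a (32-bit) over UTF-16 code units: a small non-cryptographic hash,
// used here only to correlate log lines without storing the prompt itself.
function shortHash(text) {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h.toString(16).padStart(8, "0");
}

// Hypothetical log entry: one structured line per model call.
function buildLogEntry({ model, prompt, usage, startedAt, finishedAt }) {
  return {
    ts: new Date(finishedAt).toISOString(),
    model,
    promptHash: shortHash(prompt),
    inputTokens: usage?.input_tokens ?? null,
    outputTokens: usage?.output_tokens ?? null,
    latencyMs: finishedAt - startedAt,
  };
}

// Usage sketch: console.log(JSON.stringify(buildLogEntry({ ... })));
```

Logging a hash instead of the prompt keeps user text out of your logs while still letting you match a complaint to a specific call.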

When you do need the model layer

Sometimes self-hosting is right: open-weight models have closed most of the capability gap in several categories, and at sustained high volume the per-token math can flip in favor of rented GPUs — run your own numbers in the self-host calculator before trusting anyone's blog post, including this one. Privacy and compliance can also force the issue regardless of economics.

When that's your path, here is the part that surprises people: the app-layer guidance above doesn't change at all. Your frontend and functions still deploy exactly as described; the only difference is that the endpoint your function calls is your own inference server on a GPU box instead of a provider's API. The two layers stay cleanly separated — which is precisely why it pays to build them that way from day one. You can start on a hosted API this week, prove the product, and swap the model layer later by changing one URL in one function.
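One way to keep that swap cheap is to read the model's location from configuration rather than hardcoding it. The sketch below assumes both the hosted provider and the self-hosted server speak an OpenAI-style chat completions API, which many inference servers offer; the variable names `MODEL_BASE_URL`, `MODEL_API_KEY`, and `MODEL_NAME` are hypothetical. If the two sides use different request formats, the swap also means adapting the request body, not just the URL.

```javascript
// Hypothetical env-driven upstream: the function reads where the model lives from config.
// Assumes the target speaks an OpenAI-style chat completions API.
function buildUpstreamRequest(env, messages) {
  const baseUrl = (env.MODEL_BASE_URL ?? "").replace(/\/+$/, ""); // drop trailing slashes
  if (!baseUrl) throw new Error("MODEL_BASE_URL is not set");

  const headers = { "content-type": "application/json" };
  // A private inference box may not need a key; a hosted provider will.
  if (env.MODEL_API_KEY) headers.authorization = `Bearer ${env.MODEL_API_KEY}`;

  return {
    url: `${baseUrl}/chat/completions`,
    init: {
      method: "POST",
      headers,
      body: JSON.stringify({ model: env.MODEL_NAME, stream: true, messages }),
    },
  };
}

// Usage sketch inside the proxy function:
// const { url, init } = buildUpstreamRequest(process.env, messages);
// const upstream = await fetch(url, init);
```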

The short version

  • You're almost certainly deploying an app, not a model. That's a standard web deployment with a few special requirements: streaming that survives every hop, timeouts that fit long generations, and a server-side home for your API key.
  • Vet your host for the five checklist items: streaming, timeouts, secrets, previews, free tier.
  • Start static-plus-functions; add SSR, edge, queues, and vector stores only when a real requirement shows up — each is one managed service, not a re-architecture.
  • Route every model call through a function. Cap everything. Log everything. Test streaming on the real URL.
  • We run this exact architecture on Netlify — free tier to start, and your prototype can be a production URL this afternoon. (Partner link — it never affects our rankings.)

Pick the model with our leaderboards, estimate the tokens with the cost calculator, then ship the thing. The gap between "works on my machine" and "works on a URL" has never been smaller.
