Build vs Buy: Vendor Strategy for AI Features
When to use APIs vs build your own. An interactive guide to the build-vs-buy decision for AI capabilities.

This post is part of the Mental Models for Production AI series, which explores the mental frameworks needed to evaluate, build, operate, and improve AI-powered features—focusing on practical decision-making.
The default advice for AI features — "just use an API" — is likely right for most teams. The vendor ecosystem is mature, pricing has collapsed (GPT-4-class performance now runs at ~$0.40 per million tokens, down from $30+ per million in early GPT-4 pricing), and the customization ceiling keeps rising with fine-tuning APIs, function calling, and structured outputs.
But defaults have ceilings. At some point you might hit customization limits, data sensitivity requirements, cost-at-scale pressure, or vendor dependency anxiety. When that happens, "just use an API" stops being a helpful answer.
In an earlier post in the series, we looked at data readiness. This post takes the next architectural question: given your data and your constraints, how do you decide whether to buy vendor AI capabilities, build your own, or — most commonly — combine both?
Here's the reasoning path I walk through when making this decision.
The "Just Use an API" Starting Point
This tree starts from a clear default: begin with vendor APIs. The Applied LLMs guidance — "No GPUs before PMF" — captures the reasoning well. Until you've validated product-market fit (PMF), investing in self-hosted infrastructure is premature optimization.
A few things up front:
The decision isn't permanent. A common progression goes: prototype with APIs, optimize with a custom layer (RAG, guardrails, orchestration), then evaluate whether you've hit a genuine wall that requires building more. Teams probably move through these phases over months or years, and some might never need to move past phase two. Each transition takes real engineering effort — it's a progression, not an escalator. Teams that already operate ML infrastructure may reasonably start further down the stack.
Enterprise IP lives in the "last mile." Your competitive advantage typically isn't in running a model — it's in your retrieval pipeline, your evaluation harness, your domain-specific prompts, and your orchestration logic. One industry write-up argues most enterprises end up in a "blend" — pairing vendor platforms with custom "last mile" work on prompts, retrieval, orchestration, and domain evals.
Multi-vendor is the norm. An a16z survey of 100 enterprise CIOs found that 37% now use five or more models, up from 29% the prior year. Model differentiation by use case is the primary driver — different models excel at different tasks.
The four gates below help you figure out where your team should be on the Buy → Hybrid → Build spectrum.

The full decision tree. Most paths exit toward Buy or Hybrid. The Build path requires passing all four gates.
How to Read This Tree
Four gates, evaluated in order. Each gate has paths that either continue deeper into the tree or exit to a recommendation.
Every terminal node includes a "what this means in practice" note — what you'll actually need to build or buy if you land there.
Three highlighted paths:
Buy (simplest path) — vendor APIs with a custom application layer
Hybrid (most common in production) — vendor API for inference, custom everything else
Build (narrowest path) — self-hosted models with full infrastructure ownership
To make this concrete, I'll use a running example: an AI-powered support assistant that handles customer support tickets using an internal knowledge base and conversation history. This sits in a common gray area — the feature isn't the core product, but customer data is sensitive, and the team wants control over response quality.
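The four gates read naturally as a short-circuiting function. Here's a sketch of the tree as code — the `Team` fields and function names are my own labels for the gates, not part of any vendor tooling:

```python
from dataclasses import dataclass

@dataclass
class Team:
    ai_is_differentiator: bool     # Gate 1
    hit_vendor_ceiling: bool       # Gate 2
    can_staff_mlops: bool          # Gate 3
    data_must_stay_in_house: bool  # Gate 4

def recommend(team: Team) -> str:
    """Walk the four gates in order; most paths exit early to Buy or Hybrid."""
    if not team.ai_is_differentiator:
        return "Buy"     # Gate 1: rent inference, invest in the custom layer
    if not team.hit_vendor_ceiling:
        return "Hybrid"  # Gate 2: vendor API + custom orchestration
    if not team.can_staff_mlops:
        return "Buy"     # Gate 3: buy anyway, hire or upskill first
    if team.data_must_stay_in_house:
        return "Build"   # Gate 4 YES: self-host or on-prem
    return "Hybrid"      # Gate 4 NO: vendor API + custom layer

print(recommend(Team(True, True, True, True)))    # → Build
print(recommend(Team(True, False, False, False))) # → Hybrid
```

Note how narrow the Build exit is: it's the only outcome that requires every gate to go one particular way.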
Gate 1: Is AI Core to Your Product's Differentiation?
Does this AI capability define what makes your product different from competitors?
I put this first because differentiation is the strongest signal for infrastructure investment. If the AI feature is a productivity multiplier — internal tooling, support automation, content suggestions — the calculus is different than if AI is the product itself.
NO → Buy. Use APIs.
Most AI features are productivity multipliers, not differentiators. The product's competitive advantage comes from domain expertise, user experience, data network effects, or distribution — and the AI feature enhances those things.
When that's the case, the rational move is to rent inference and invest engineering time where it actually builds moat: better evaluation, better retrieval, better domain-specific tooling.
What this means in practice: Vendor API for inference. Invest in the custom layer — prompt engineering, RAG for domain knowledge, guardrails, orchestration, monitoring. The model is a commodity input; the system around it is your IP.
YES → Continue to Gate 2
The AI capability is the product or a significant part of the moat — a recommendation engine, a search ranking system, a specialized analysis tool. Custom infrastructure may be warranted, but there are more gates to evaluate.
Running example
The support assistant automates ticket handling, but it isn't the product — the SaaS platform is. The AI feature makes support faster, not different. NO → lean toward Buy. But other factors might pull toward Hybrid, so let's continue through the tree.

The Buy path. Most teams probably land here — AI isn't the differentiator, so rent inference and invest in the custom layer.
Gate 2: Do You Need Deep Customization the Vendor Can't Provide?
Have you hit the ceiling of what vendor APIs can do — even with prompt engineering, fine-tuning APIs, and custom RAG?
I check this because the customization ceiling has risen substantially. Fine-tuning APIs, function calling, prompt caching, structured outputs, multi-modal inputs — the range of what you can accomplish without managing infrastructure keeps expanding. You should have genuinely hit that ceiling before building your own.
NO → Start with vendor, build later if needed
A common pattern: teams overestimate how much customization they need before they've exhausted what vendor APIs can do. The four-phase progression applies:
Prototype — external APIs to validate the concept
Optimize — add RAG, improve prompts, add evaluation
Evaluate — have you actually hit the ceiling?
Scale — only now consider infrastructure changes
Fine-tuning through vendor APIs (OpenAI, together.ai) offers a middle ground — customize model behavior without managing infrastructure. Prompt caching reduces input costs by up to 90% on cache hits (the exact discount varies by provider), making API-heavy workloads more viable at scale.
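To see what the caching claim does to a bill, here's a rough blended-cost calculation. The prices and hit rate are illustrative placeholders, not any vendor's actual rates:

```python
def monthly_input_cost(tokens_per_month: float, price_per_mtok: float,
                       cache_hit_rate: float, cached_discount: float = 0.90) -> float:
    """Blended input-token cost with prompt caching.

    cached_discount=0.90 models a ~90% price reduction on cache hits;
    actual discounts vary by provider.
    """
    cached = tokens_per_month * cache_hit_rate
    uncached = tokens_per_month - cached
    mtok = 1_000_000
    return (uncached + cached * (1 - cached_discount)) * price_per_mtok / mtok

# 500M input tokens/month at a hypothetical $3.00/MTok, 80% cache hit rate
full = monthly_input_cost(500e6, 3.00, 0.0)    # no caching: $1,500
cached = monthly_input_cost(500e6, 3.00, 0.8)  # with caching: $420
print(full, cached)
```

An 80% hit rate — plausible for workloads with a large shared system prompt or knowledge-base context — cuts the input bill by almost three-quarters in this sketch.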
What this means in practice: Vendor API + custom orchestration layer. This is the Hybrid path — the model is rented, everything around it is yours. Most production systems live here.
YES → Continue to Gate 3
You need capabilities vendors don't offer: custom model architectures, specialized inference patterns, modifications beyond what fine-tuning APIs support. Or the vendor's roadmap doesn't align with yours — you need features they aren't prioritizing.
Running example
Standard APIs with fine-tuning can handle ticket classification and response generation. Prompt engineering covers the tone requirements. We haven't hit the customization ceiling. NO → Hybrid (vendor API + custom layer).
Gate 3: Can Your Team Actually Maintain This?
Do you have — or can you hire — the people to build, deploy, and operate custom AI infrastructure?
I treat this as a hard gate because self-hosted AI requires a different kind of operational commitment than using APIs. GPU management, model updates, monitoring, retraining pipelines, incident response, and on-call coverage add up. The human layer is often the single largest line item.
NO → Buy anyway. Hire or upskill first.
The staffing reality is stark. One estimate of hidden self-hosting costs puts the numbers at:
Initial deployment: 2 engineers for 4 months (~$200K in engineering cost)
Ongoing operations: $12,000-15,000/month for on-call coverage and optimization (~1-1.25 FTE)
GPU compute: $8,000-12,000/month for adequate capacity
For enterprise-scale self-hosted deployments, one analysis estimates 10+ full-time AI-ops staff, with annual payroll that can exceed infrastructure spend.
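Rolling the ongoing figures above into an annual run rate (using that one source's estimates, not independent numbers):

```python
# Ongoing monthly cost ranges from the estimate above (USD)
ops_monthly = (12_000, 15_000)  # on-call coverage and optimization
gpu_monthly = (8_000, 12_000)   # adequate GPU capacity

low = (ops_monthly[0] + gpu_monthly[0]) * 12   # $240,000/year
high = (ops_monthly[1] + gpu_monthly[1]) * 12  # $324,000/year
print(f"${low:,} - ${high:,} per year, before the ~$200K initial build")
```

That $240K-324K/year figure is the floor for the Build column in the comparison table later in this post — and it excludes the initial deployment cost.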
On the API side, a small team of software engineers can manage integration, prompt engineering, and evaluation without specialized MLOps expertise.
What this means in practice: Stay on vendor APIs. Use the engineering time to build better evals, better prompts, and better domain-specific tooling. Revisit when the team grows or when you can hire MLOps capacity.
YES → Continue to Gate 4
You have MLOps, DevOps, and security expertise — or a concrete plan to hire it. You may already operate GPU infrastructure for other workloads, which reduces the incremental cost of adding AI inference.
Running example
The team has 3 backend engineers and no MLOps experience. The cost of building ML infrastructure expertise exceeds the cost of API bills. NO → Buy.
Gate 4: Is Your Data Too Sensitive for Third-Party APIs?
Does your data sensitivity require that no customer or user data leaves your infrastructure?
This is the final gate because data sensitivity, while it exists on a spectrum, tends to be binary at the decision level — either your regulatory and contractual requirements allow third-party processing (with appropriate agreements), or they don't.
YES → Build. Self-host or on-prem.
Several regulatory frameworks create hard requirements:
HIPAA: Any third-party LLM provider handling protected health information must sign a Business Associate Agreement (BAA). Organizations should evaluate whether vendor-hosted or private deployments better fit their compliance posture.
PCI DSS: LLMs must never store, process, or transmit cardholder data unless explicitly validated for PCI compliance.
Data sovereignty: Some jurisdictions require data to remain within geographic borders. International frameworks like GDPR add their own constraints — vendor assurances that satisfy US requirements may not satisfy EU or APAC regulators.
Zero-data-retention endpoints are available from major providers (OpenAI, Anthropic), and vendors don't use API data for training by default. But some organizations have contractual obligations that prohibit third-party processing entirely — no amount of vendor assurance changes a contractual prohibition.
Once you've decided to build, know what you're signing up for. An academic cost-benefit analysis of on-premise LLM deployment found that medium-scale enterprises processing 10-50 million tokens per month represent the "sweet spot" for self-hosting, with break-even periods ranging from 3.8 to 34 months depending on the models and providers being compared. Smaller models (sub-30B parameters) can reach break-even within 3 months. Large models compared against aggressively priced providers can take 5-9 years.
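A simple break-even sketch makes the range above less abstract. The inputs here are illustrative assumptions, not figures from the cited analysis:

```python
def breakeven_months(upfront: float, selfhost_monthly: float,
                     api_monthly: float) -> float:
    """Months until cumulative self-hosting cost drops below API spend.

    Returns float('inf') if the API is cheaper on an ongoing basis —
    in that case self-hosting never breaks even.
    """
    savings = api_monthly - selfhost_monthly
    if savings <= 0:
        return float("inf")
    return upfront / savings

# Hypothetical: $200K setup, $22K/month to run, vs $45K/month in API bills
print(breakeven_months(200_000, 22_000, 45_000))  # ≈ 8.7 months
```

Notice how sensitive the result is to the denominator: halve the monthly savings and break-even doubles, which is why the published range spans months to years depending on which model and which provider price you compare against.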
What this means in practice: Self-host open-source models (Llama, Mistral, or similar) on your infrastructure. Budget for GPU compute ($8,000-12,000/month), staffing ($12,000-15,000/month ongoing), and the initial ~$200K engineering investment to stand it up. The cost is real, but the compliance risk of not doing it may be higher.

The Build path. The narrowest route — requires passing all four gates.
NO → Hybrid: Vendor API + custom layer
You've evaluated all four gates and landed in the most common production pattern. The model is rented, the system around it is yours.
This means:
Vendor API for inference, with appropriate Data Processing Agreements and security review
Custom retrieval layer (RAG) for domain knowledge
Custom guardrails and orchestration
Custom evaluation and monitoring pipeline
An LLM gateway for multi-vendor routing, failover, caching, and cost optimization
This integration work isn't trivial — wiring together vendor APIs with custom retrieval, guardrails, and monitoring requires real engineering effort and ongoing maintenance. But the complexity is in the application layer, where your team already has expertise, rather than in infrastructure you'd need to learn from scratch.
What this means in practice: The model is a utility. Your engineering investment goes into the system that wraps it — and that system is where your differentiation lives.
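A minimal sketch of the request flow on this path — every function here is a placeholder for a component you'd build or buy, not a specific library:

```python
def answer_ticket(ticket: str, retrieve, moderate, call_llm, log_eval) -> str:
    """Hybrid-path request flow: the model is rented, everything else is yours.

    retrieve/moderate/call_llm/log_eval are injected so the vendor API stays
    swappable behind a gateway; each is a component you own.
    """
    if not moderate(ticket):                 # custom guardrails (input side)
        return "This ticket needs human review."
    context = retrieve(ticket, top_k=5)      # custom RAG over the knowledge base
    prompt = f"Context:\n{context}\n\nTicket:\n{ticket}\n\nDraft a reply."
    reply = call_llm(prompt)                 # vendor API behind an LLM gateway
    if not moderate(reply):                  # custom guardrails (output side)
        return "This ticket needs human review."
    log_eval(ticket, context, reply)         # custom evaluation and monitoring
    return reply
```

The one line that touches the vendor is `call_llm`; the other four hooks are the custom layer the rest of this post keeps pointing at.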

The Hybrid path. The most common production pattern — vendor API for inference, custom everything else.
Running example
Customer conversation data goes through the API. A DPA is in place with the vendor, zero data retention is enabled, and SOC 2 compliance is verified. The data sensitivity is real but manageable with appropriate vendor agreements. NO → Hybrid. Vendor API + custom retrieval and orchestration.
The Path Most Teams Actually Walk
Most teams probably exit at Gate 1 or Gate 2. The support assistant, the internal search tool, the content suggestion feature — these are all Buy or Hybrid decisions.
Most AI features are components within a larger product, not the product itself. The competitive moat comes from what you build around the model — the data pipeline, the evaluation harness, the user experience — not from running inference.
The Hybrid path (vendor API + custom layer) is the dominant production pattern. The a16z survey data backs this up: enterprises are adopting multi-model strategies and investing in the orchestration layer while renting inference from multiple vendors.
Pure Build — self-hosted everything — is the narrowest path through this tree. It requires genuine data sensitivity constraints and a team capable of operating the infrastructure. When those conditions align, building is the right call. When they don't, premature infrastructure investment pulls engineering time away from the application layer where most of the value lives.
Secondary Factors That Shift the Decision
The four gates cover the primary decision. These secondary factors can shift where you land within a path.
Cost at Scale
API costs scale linearly with usage. Self-hosting has fixed infrastructure costs that amortize over volume. The crossover depends on several variables:
| Factor | Threshold |
|---|---|
| Token volume crossover | ~2M tokens/day |
| Conversation volume | ~8,000+ conversations/day |
| GPU utilization for break-even (7B model) | 50%+ utilization |
| GPU utilization for break-even (13B model) | 10%+ utilization |
| Annual SaaS spend threshold | >$500K/year in API spend |
GPU utilization is the critical variable that most back-of-envelope calculations miss. A GPU running at 10% load turns self-hosting from cost-competitive into more expensive than premium APIs. If your traffic is bursty — high during business hours, quiet at night — the utilization math gets harder.
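The utilization effect falls out of simple arithmetic — fixed monthly cost divided by tokens actually served. All figures here are illustrative:

```python
def selfhost_cost_per_mtok(gpu_monthly: float, peak_tokens_per_sec: float,
                           utilization: float) -> float:
    """Effective self-hosted cost per million tokens.

    gpu_monthly: fixed GPU cost per month (USD)
    peak_tokens_per_sec: throughput at full load
    utilization: average fraction of peak actually used (0-1)
    """
    seconds_per_month = 30 * 24 * 3600
    tokens_served = peak_tokens_per_sec * utilization * seconds_per_month
    return gpu_monthly / tokens_served * 1_000_000

# Hypothetical $10K/month of GPUs sustaining 2,000 tok/s at peak:
print(selfhost_cost_per_mtok(10_000, 2_000, 0.50))  # ≈ $3.86/MTok
print(selfhost_cost_per_mtok(10_000, 2_000, 0.10))  # ≈ $19.29/MTok
```

Same hardware, same bill — a 5x drop in utilization is a 5x increase in effective cost per token, which is the whole crossover argument in one line.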
One more consideration: the Jevons Paradox for AI costs. As inference costs drop, total spending tends to increase because cheaper tokens unlock new use cases. Cost projections that assume stable usage patterns may underestimate actual spend.
Vendor Lock-In
Lock-in anxiety is common but often misplaced. The real lock-in isn't in the API contract — it's in the prompts, evaluation sets, and fine-tuning data you've built for a specific model. Migrating prompts across models takes more time than swapping an API endpoint.
Mitigation strategies:
Abstraction layers between your application and the model API
Multi-model strategy with an LLM gateway for routing and failover
Data portability — keep logs and telemetry in open formats (e.g., OTel, Parquet)
Contract protections — data portability clauses and code access provisions
One additional factor: vendor models get deprecated. When a model version you depend on gets sunset, you're migrating whether you planned to or not. The mitigations above reduce that switching cost without requiring you to build everything yourself.
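A thin abstraction layer is usually enough to keep that switching cost bounded. Here's a sketch of the routing/failover core — the `Provider` protocol and class names are my own; real gateways add caching, rate limiting, and cost tracking on top:

```python
from typing import Protocol

class Provider(Protocol):
    def complete(self, prompt: str) -> str: ...

class Gateway:
    """Routes requests to providers in priority order, failing over on error.

    Your application code depends only on Gateway.complete(), so swapping
    or reordering vendors is a config change, not a rewrite.
    """
    def __init__(self, providers: list[Provider]):
        self.providers = providers

    def complete(self, prompt: str) -> str:
        last_error: Exception | None = None
        for provider in self.providers:   # try in priority order
            try:
                return provider.complete(prompt)
            except Exception as exc:
                last_error = exc          # fall through to the next provider
        raise RuntimeError("all providers failed") from last_error
```

The same seam also handles the deprecation problem: when a model version sunsets, you re-point one provider entry and re-run your evals, rather than chasing call sites through the codebase.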
Speed to Market
In fast-moving markets, a 6-month delay building infrastructure can mean lost market share. The progression — start with Buy, optimize, then evaluate building — lets you ship quickly and iterate toward the right long-term architecture.
What Each Path Means You'll Build
| Dimension | Buy | Hybrid | Build |
|---|---|---|---|
| Model | Vendor API | Vendor API (possibly multiple) | Self-hosted (Llama, Mistral, etc.) |
| Retrieval | Basic or none | Custom RAG pipeline | Custom RAG pipeline |
| Orchestration | Simple API calls | Custom orchestration layer | Custom orchestration layer |
| Guardrails | Vendor-provided | Custom + vendor | Custom |
| Monitoring | Basic vendor metrics | Custom eval pipeline | Custom eval + infra monitoring |
| Team needed | Software engineers | Software engineers + eval expertise | Software + MLOps + DevOps |
| Typical annual cost | API spend + eng time | API spend + more eng time | $240K-324K+ infra + larger team |
The key insight: regardless of which path you choose, you're building something custom. Even on the Buy path, the application layer — prompts, evaluation, UX, integration — is yours. The question is how far down the stack your custom layer extends.
Revisit When Things Change
This tree produces a snapshot, not a permanent decision.
Revisit when:
Your scale changes — traffic growth may shift the cost-at-scale math
Your team grows — new MLOps hires may make the Build path viable
Regulations change — new compliance requirements may force data locality
Vendor capabilities shift — new API features may raise the customization ceiling further
Model pricing shifts — continued price drops change the break-even calculation
The right answer for your team today may not be the right answer in six months. The four-gate structure gives you a repeatable way to re-evaluate when conditions change.
The next post in the series looks at the different mindsets needed for building AI features versus running them in production — because where you land on this tree changes what kind of engineering work you're signing up for.
Further Reading
What We've Learned From A Year of Building with LLMs — "No GPUs before PMF" and system-first thinking (Eugene Yan, Hamel Husain, et al.)
How 100 Enterprise CIOs Are Building and Buying Gen AI in 2025 — a16z survey on multi-model adoption and enterprise AI budgets
A Cost-Benefit Analysis of On-Premise LLM Deployment — Academic analysis of 54 deployment scenarios with break-even timelines
Patterns for Building LLM-based Systems & Products — Eugene Yan's seven patterns that define the "custom layer"
Inference Unit Economics: The True Cost Per Million Tokens — GPU utilization as the critical cost variable
Self-Hosting LLMs: Hidden Costs You're Missing — The staffing and operational costs most teams underestimate