Three Questions I’d Want Answered Before Building Any AI Feature
Here's the three-question framework I use to decide whether an AI feature is worth building.

Before any code is written, these three questions set the trajectory for everything else. This is the reasoning path I walk through when evaluating whether to build an AI feature, and how I'd apply it to something real.
The 95% That Fail
Here's a number that stopped me: according to research on AI project outcomes, 80-95% of AI projects fail to deliver business value. Not "underperform expectations"—fail.
The surprising part? It's rarely the technology. The constraint isn't that models aren't good enough. It's organizational design: misaligned incentives, an avoidance of measurement, and the belief that tools can compensate for broken processes.
AI failures tend to happen before anyone writes a line of code. The trajectory is set earlier—in the questions we ask, or don't ask.
I wouldn't want to start building until I'd worked through three questions:
Should we build this? (Is AI the right tool?)
How will it fail? (What does wrong look like?)
Can we afford it? (Cost at scale)
To make this concrete, I'll walk through each question using a real example: building an AI-powered support ticket classifier.

Three questions framework: Should we build? How will it fail? Can we afford it?
Question 1: Should We Build This?
The question that filters out the most projects
Is AI the right tool?
The temptation is strong. You see a problem, and your mind jumps to "AI can do this." Maybe it can. But the counter-question is more useful: can a rules-based system do this 80% as well?
Deterministic logic wins when:
Patterns are known and enumerable
Classifications are simple and stable
You need 100% predictability
AI actually adds value when:
There's genuine ambiguity in the input
Natural language understanding matters
Patterns are too subtle or numerous to encode manually
The way I'd think about it: if an engineer could write a switch statement that handles 80% of cases, maybe start there. AI is for the remaining complexity that resists simple rules.
Applying to the example
Let's say we're building a support ticket classifier. The current state: manual routing takes 2 minutes per ticket. Someone reads the ticket, decides which team should handle it, and assigns it.
The alternative: keyword-based routing with an escalation queue. If the ticket contains "billing," route to Finance. If it contains "crash" or "error," route to Engineering. Everything else goes to a human triage queue.
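That keyword baseline fits in a few lines. A sketch, with the team names and keyword lists as illustrative placeholders (a real system would have many more):

```python
# Minimal keyword-based router with a human escalation queue.
# Keywords match the example above; a real table would be far longer.
KEYWORD_ROUTES = {
    "finance": ["billing"],
    "engineering": ["crash", "error"],
}

def route_ticket(text: str) -> str:
    """Return the team for a ticket, or 'triage' for the human queue."""
    lowered = text.lower()
    for team, keywords in KEYWORD_ROUTES.items():
        if any(kw in lowered for kw in keywords):
            return team
    return "triage"  # everything else goes to human triage
```

Note that `route_ticket("I was charged twice")` lands in the triage queue, which is exactly the limitation discussed next.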
Where does AI add value? Understanding intent beyond keywords. A customer might write "I was charged twice" without ever saying "billing." They might describe a symptom without using technical terms. AI can understand the meaning, not just pattern-match on words.
The strategic filter
IBM's CIO asks a useful question: "Does this customer interaction touch a core differentiator?"
If AI is table stakes—something every competitor has—consider buying instead of building. If it's differentiating—something that could give you a real edge—the investment might be worth it.
For our ticket classifier: support routing probably isn't a differentiator. We'd likely start with an off-the-shelf solution or a simple rules-based system, and only build custom AI if we hit a ceiling.
Question 2: How Will It Fail?
The question that determines whether you're ready to ship
Why this question matters
Here's the mindset shift that took me a while to internalize: you can't "fix" LLMs. They're probabilistic by nature. The goal is to build systems that work well despite failures.
A chatbot might give brilliant answers 95% of the time and hallucinate wildly 5% of the time. As one AI PM put it: "A traditional PM looks at that 5% and panics, trying to squash it like a bug. An AI PM knows that managing that uncertainty is the entire job."
The failure taxonomy
Research on LLM production challenges identifies eight critical failure modes:
Hallucination — Making stuff up
Non-determinism — Different answers each time
Reasoning failures — Logic errors
Context limits — Running out of space
Latency/cost spikes — Being slow or expensive
Prompt injection — Getting hacked via malicious inputs
Data leakage — Privacy violations
Bias — Unfairness in outputs
Each of these can manifest in your feature. The question is: which ones matter for your use case?
Real-world failure case studies
A fast food restaurant's voice AI was deployed to 500+ drive-throughs. In one incident, a customer ordered "18,000 cups of water," crashing the system. The AI struggled with accents, background noise, and edge cases, forcing constant staff intervention—the opposite of its purpose.
An airline's chatbot gave a passenger incorrect refund information. The company refused to honor it, arguing the chatbot was a separate entity. A tribunal disagreed: The airline is responsible for all information on its website, including chatbot responses.
A big tech company's hiring AI was trained on resumes from past applicants—who were predominantly male. The system learned to penalize resumes containing words like "women's" and downgraded graduates of all-women's colleges. The company discontinued the program, but not before it had been used to evaluate real candidates.
The pattern: failures cascade downstream. Bad input leads to bad classification leads to bad routing leads to angry customer leads to churn.

Failures don't stay contained—they cascade downstream
Applying to the example
For our ticket classifier, let's map the failure modes:
| Failure Mode | What happens | Impact |
|---|---|---|
| Miscategorization | Ticket goes to wrong team | Delayed resolution, frustrated customer |
| Low confidence | Everything escalates to humans | Defeats the purpose |
| Bias | Certain segments get worse routing | Equity issues, potential legal exposure |
The key question: what's the cost of a wrong classification?
For ticket routing, the cost is relatively low. A misrouted ticket gets reassigned. It's annoying, not catastrophic. Compare that to the airline chatbot, where a wrong answer created legal liability.
Mitigation strategies:
Confidence thresholds: only auto-route when confidence is high
Human review queue for edge cases
Regular audits for bias in routing patterns
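The first two mitigations can be sketched together. A minimal example, assuming the classifier reports a confidence score alongside its category; the 0.85 threshold is a placeholder you'd tune against real data:

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.85  # illustrative; tune against labeled tickets

@dataclass
class Classification:
    team: str
    confidence: float  # 0.0 to 1.0, as reported by the model

def dispatch(ticket_id: str, result: Classification) -> str:
    """Auto-route only when the model is confident; otherwise escalate."""
    if result.confidence >= CONFIDENCE_THRESHOLD:
        return f"auto-routed ticket {ticket_id} to {result.team}"
    return f"queued ticket {ticket_id} for human review"
```

Logging every `dispatch` decision also gives you the audit trail you need for the third mitigation: periodically compare routing outcomes across customer segments.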
⚠️ Warning: What makes me most cautious about AI features? The failures that look like successes. The ticket gets routed, no errors in the logs, but it went to the wrong team and the customer churned. You only find out weeks later.
Question 3: Can We Afford It?
The question that survives contact with production
Token economics 101
Every API call costs money. LLMs charge per token (roughly: word pieces). A few key concepts:
Input tokens are cheaper than output tokens (you're not generating them)
Model selection has massive cost implications
Hidden costs add up: context window stuffing, retry loops, logging
The model multipliers are dramatic:
| Model | Relative Cost |
|---|---|
| GPT-3.5 | 1x |
| GPT-4o | 5x |
| GPT-4 Turbo | 10x |
| GPT-4 | 20x |
| DeepSeek | 0.035x |
Choosing the right model for the task is often the biggest cost lever you have.
The 10× scale calculation
Every AI feature should ship with a back-of-the-envelope cost model. Mine looks like this:
Daily cost = (requests × input_tokens × input_rate + requests × output_tokens × output_rate) / 1,000,000

Two important notes:
Input and output tokens are priced differently
Small per-request costs become real money at scale
Before committing to any AI feature, I ask: what happens at 10× scale?
Example:
10,000 users × 2 conversations/day × 1,000 tokens × $10/M = 20M tokens/day = $200/day
At 10× scale (100,000 users):
$2,000/day
~$60,000/month
That number doesn't mean "don't ship." It means "know what you're buying."
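The formula above drops straight into code. A sketch that reproduces the example's numbers; note the example blends input and output into a single 1,000-token figure at $10/M, so everything is passed as input here:

```python
def daily_cost(requests: int, input_tokens: int, output_tokens: int,
               input_rate: float, output_rate: float) -> float:
    """Daily API cost in dollars; rates are $ per 1M tokens."""
    return (requests * input_tokens * input_rate
            + requests * output_tokens * output_rate) / 1_000_000

# 10,000 users x 2 conversations/day = 20,000 requests.
base = daily_cost(requests=20_000, input_tokens=1_000, output_tokens=0,
                  input_rate=10.0, output_rate=0.0)
print(base)       # 200.0  -> $200/day
print(base * 10)  # 2000.0 -> $2,000/day at 10x scale
```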
Applying this to the ticket classifier
Let's run the numbers concretely.
Assumptions
Tickets per day: 500 (5,000 at 10×)
Tokens per ticket:
~200 input (ticket text + prompt)
~50 output (category + confidence)
Model: GPT-4o-mini
Pricing (early 2026):
Input: $0.15 / 1M tokens
Output: $0.60 / 1M tokens
Cost with GPT-4o-mini
Per ticket:
Input: 200 × $0.15 / 1M = $0.00003
Output: 50 × $0.60 / 1M = $0.00003
Total: ~$0.00006 per ticket
At current scale (500/day):
~$0.03/day
~$1/month
At 10× scale (5,000/day):
~$0.30/day
~$9/month
Effectively free.
What if we used GPT-4 instead?
Using standard GPT-4 pricing:
Input: $30 / 1M
Output: $60 / 1M
Per ticket:
Input: ~$0.006
Output: ~$0.003
Total: ~$0.009 per ticket
At 10× scale (5,000/day):
~$45/day
~$1,350/month
Now it's noticeable—but still cheap relative to the value.
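Both per-ticket figures fall out of the same arithmetic. A quick sanity check in code, using the token counts and pricing assumptions stated above:

```python
def per_ticket_cost(input_tokens: int, output_tokens: int,
                    input_rate: float, output_rate: float) -> float:
    """Cost of one classification in dollars; rates are $ per 1M tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# 200 input + 50 output tokens per ticket, per the assumptions above.
mini = per_ticket_cost(200, 50, 0.15, 0.60)   # GPT-4o-mini: ~$0.00006
gpt4 = per_ticket_cost(200, 50, 30.0, 60.0)   # standard GPT-4: ~$0.009

print(gpt4 * 5_000)  # ~$45/day at 10x scale (5,000 tickets)
```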
The real question: ROI
Each ticket classification saves ~2 minutes of manual work.
At 5,000 tickets/day:
~166 hours of human labor saved every day
Even at $1,350/month for GPT-4, the ROI is overwhelming.
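A back-of-the-envelope version of that ROI claim. The $25/hour loaded labor rate is my assumption for illustration; substitute your own:

```python
# Rough ROI check: labor saved vs. API spend.
tickets_per_day = 5_000
minutes_saved_per_ticket = 2
hourly_rate = 25.0  # assumed loaded labor rate; adjust for your org

hours_saved_per_day = tickets_per_day * minutes_saved_per_ticket / 60
monthly_labor_savings = hours_saved_per_day * hourly_rate * 30

print(round(hours_saved_per_day, 1))  # ~166.7 hours/day
print(round(monthly_labor_savings))   # ~$125,000/month vs. ~$1,350 in API cost
```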
The takeaway isn't "always use the biggest model." It's know the real cost, then decide intentionally.
Cost optimization levers
If costs do become a concern, you have options:
Model selection — Can a smaller model handle this task?
Caching — Same questions often have same answers
Batching — Classify multiple tickets in one call
Prompt compression — Remove unnecessary context
Rate limit handling — At high volume, you'll hit API rate limits; build in retries with exponential backoff
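The last lever is worth sketching, since retry logic is easy to get wrong. A minimal backoff helper; the `RuntimeError` here stands in for whatever rate-limit exception your SDK actually raises:

```python
import random
import time

def call_with_backoff(fn, max_retries: int = 5):
    """Retry a rate-limited call with exponential backoff plus jitter.

    `fn` is any zero-argument callable that raises on a rate-limit error.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except RuntimeError:  # substitute your SDK's rate-limit exception
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            delay = (2 ** attempt) + random.random()  # 1s, 2s, 4s... + jitter
            time.sleep(delay)
```

The jitter matters: without it, a fleet of workers that got rate-limited together will all retry together and get rate-limited again.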
💡 Tip: Tools like Helicone and PricePerToken help you compare costs across 300+ models. Worth checking before you commit to a provider.
Putting It Together
Think of these three questions as gates, not steps:
When all three have good answers → Proceed with confidence
When one answer is shaky → Investigate before building
When answers are bad → Save yourself months of wasted effort
The ticket classifier verdict
Let's apply the framework:
| Question | Answer | Confidence |
|---|---|---|
| Q1: Should we build? | Yes—AI adds value (intent beyond keywords) | High |
| Q2: How will it fail? | Failures are recoverable (wrong queue, not legal liability) | Medium |
| Q3: Can we afford it? | Cost is minimal at our scale | High |
Decision: Worth building, with:
Confidence thresholds for auto-routing
Human review queue for low-confidence cases
Regular bias audits
This framework is a thinking tool. The goal is to surface the questions that would bite you later—before spending three months building something that can't ship.
What Comes Next
These three questions form the Decision Layer in a larger mental model for production AI. (If you want the full picture, start with Post 1: A Mental Model for Production AI.) They're the first gate—the questions that determine whether you should even start.
In the next post, I'll turn this framework into an interactive decision tree: a tool you can walk through step-by-step to answer "Should we use AI here?"
Further Reading
What 2025 Taught Us About Data, AI, and Decision-Making — Why 95% of AI projects fail
8 LLM Production Challenges — Practical failure taxonomy
LLM Cost Estimation Guide — Deep dive on token economics