
Three Questions I’d Want Answered Before Building Any AI Feature

Here's the three-question framework I use to decide whether an AI feature is worth building.

Before any code gets written, these three questions set the trajectory for everything else. Below is the reasoning path I walk through when evaluating an AI feature, and how I'd apply it to something real.

The 95% That Fail

Here's a number that stopped me: according to research on AI project outcomes, 80-95% of AI projects fail to deliver business value. Not "underperform expectations"—fail.

The surprising part? It's rarely the technology. The constraint isn't that models aren't good enough. It's organizational design: misaligned incentives, avoided measurement, and the belief that tools compensate for broken processes.

AI failures tend to happen before anyone writes a line of code. The trajectory is set earlier—in the questions we ask, or don't ask.

I wouldn't want to start building until I'd worked through three questions:

  1. Should we build this? (Is AI the right tool?)

  2. How will it fail? (What does wrong look like?)

  3. Can we afford it? (Cost at scale)

To make this concrete, I'll walk through each question using a real example: building an AI-powered support ticket classifier.

Three questions framework: Should we build? How will it fail? Can we afford it?

Question 1: Should We Build This?

The question that filters out the most projects

Is AI the right tool?

The temptation is strong. You see a problem, and your mind jumps to "AI can do this." Maybe it can. But the counter-question is more useful: can a rules-based system do this 80% as well?

Deterministic logic wins when:

  • Patterns are known and enumerable

  • Classifications are simple and stable

  • You need 100% predictability

AI actually adds value when:

  • There's genuine ambiguity in the input

  • Natural language understanding matters

  • Patterns are too subtle or numerous to encode manually

The way I'd think about it: if an engineer could write a switch statement that handles 80% of cases, maybe start there. AI is for the remaining complexity that resists simple rules.

Applying to the example

Let's say we're building a support ticket classifier. The current state: manual routing takes 2 minutes per ticket. Someone reads the ticket, decides which team should handle it, and assigns it.

The alternative: keyword-based routing with an escalation queue. If the ticket contains "billing," route to Finance. If it contains "crash" or "error," route to Engineering. Everything else goes to a human triage queue.
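That rules-based alternative is only a few lines of code. A minimal sketch (the team names and keywords are illustrative, not a real routing config):

```python
def route_ticket(text: str) -> str:
    """Keyword-based routing with a human escalation queue (illustrative rules)."""
    lowered = text.lower()
    if "billing" in lowered:
        return "finance"
    if "crash" in lowered or "error" in lowered:
        return "engineering"
    return "triage"  # everything else goes to a human
```

Note what it misses: `route_ticket("I was charged twice")` lands in the triage queue, because the customer never said "billing." That gap is exactly where AI earns its keep.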

Where does AI add value? Understanding intent beyond keywords. A customer might write "I was charged twice" without ever saying "billing." They might describe a symptom without using technical terms. AI can understand the meaning, not just pattern-match on words.

The strategic filter

IBM's CIO asks a useful question: "Does this customer interaction touch a core differentiator?"

If AI is table stakes—something every competitor has—consider buying instead of building. If it's differentiating—something that could give you a real edge—the investment might be worth it.

For our ticket classifier: support routing probably isn't a differentiator. We'd likely start with an off-the-shelf solution or a simple rules-based system, and only build custom AI if we hit a ceiling.

Question 2: How Will It Fail?

The question that determines whether you're ready to ship

Why this question matters

Here's the mindset shift that took me a while to internalize: you can't "fix" LLMs. They're probabilistic by nature. The goal is to build systems that work well despite failures.

A chatbot might give brilliant answers 95% of the time and hallucinate wildly 5% of the time. As one AI PM put it: "A traditional PM looks at that 5% and panics, trying to squash it like a bug. An AI PM knows that managing that uncertainty is the entire job."

The failure taxonomy

Research on LLM production challenges identifies eight critical failure modes:

  1. Hallucination — Making stuff up

  2. Non-determinism — Different answers each time

  3. Reasoning failures — Logic errors

  4. Context limits — Running out of space

  5. Latency/cost spikes — Being slow or expensive

  6. Prompt injection — Getting hacked via malicious inputs

  7. Data leakage — Privacy violations

  8. Bias — Unfairness in outputs

Each of these can manifest in your feature. The question is: which ones matter for your use case?

Real-world failure case studies

A fast food restaurant's voice AI was deployed to 500+ drive-throughs. In one incident, a customer ordered "18,000 cups of water," crashing the system. The AI struggled with accents, background noise, and edge cases, forcing constant staff intervention—the opposite of its purpose.

An airline's chatbot gave a passenger incorrect refund information. The company refused to honor it, arguing the chatbot was a separate entity. A tribunal disagreed: The airline is responsible for all information on its website, including chatbot responses.

A big tech company's hiring AI was trained on resumes from past applicants—who were predominantly male. The system learned to penalize resumes containing words like "women's" and downgraded graduates of all-women's colleges. The company discontinued the program, but not before it had been used to evaluate real candidates.

The pattern: failures cascade downstream. Bad input leads to bad classification leads to bad routing leads to angry customer leads to churn.

Failure cascade diagram showing input → model → output → impact

Failures don't stay contained—they cascade downstream

Applying to the example

For our ticket classifier, let's map the failure modes:

| Failure Mode | What happens | Impact |
| --- | --- | --- |
| Miscategorization | Ticket goes to wrong team | Delayed resolution, frustrated customer |
| Low confidence | Everything escalates to humans | Defeats the purpose |
| Bias | Certain segments get worse routing | Equity issues, potential legal exposure |

The key question: what's the cost of a wrong classification?

For ticket routing, the cost is relatively low. A misrouted ticket gets reassigned. It's annoying, not catastrophic. Compare that to the airline chatbot, where a wrong answer created legal liability.

Mitigation strategies:

  • Confidence thresholds: only auto-route when confidence is high

  • Human review queue for edge cases

  • Regular audits for bias in routing patterns
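The first two mitigations can live in a single dispatch function. A sketch, assuming the classifier returns a label plus a confidence score; the 0.85 threshold is a made-up starting point you would tune against audit data:

```python
CONFIDENCE_THRESHOLD = 0.85  # illustrative; tune against real routing outcomes

def dispatch(label: str, confidence: float) -> str:
    """Auto-route only when the model is confident; otherwise escalate to a human."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return label          # auto-route to the predicted team
    return "human_review"     # low-confidence edge case goes to the review queue
```

The threshold is also your dial for the "failures that look like successes" problem: lowering it trades human workload for fewer silent misroutes.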

⚠️ Warning: What makes me most cautious about AI features? The failures that look like successes. The ticket gets routed, no errors in the logs, but it went to the wrong team and the customer churned. You only find out weeks later.

Question 3: Can We Afford It?

The question that survives contact with production

Token economics 101

Every API call costs money. LLMs charge per token (roughly: word pieces). A few key concepts:

  • Input tokens are cheaper than output tokens (you're not generating them)

  • Model selection has massive cost implications

  • Hidden costs add up: context window stuffing, retry loops, logging

The model multipliers are dramatic:

| Model | Relative Cost |
| --- | --- |
| GPT-3.5 | 1x |
| GPT-4o | 5x |
| GPT-4 Turbo | 10x |
| GPT-4 | 20x |
| DeepSeek | 0.035x |

Choosing the right model for the task is often the biggest cost lever you have.

The 10× scale calculation

Every AI feature should ship with a back-of-the-envelope cost model. Mine looks like this:

Daily cost = (requests × input_tokens × input_rate + requests × output_tokens × output_rate) / 1,000,000

Two important notes:

  • Input and output tokens are priced differently

  • Small per-request costs become real money at scale

Before committing to any AI feature, I ask: what happens at 10× scale?

Example:

10,000 users × 2 conversations/day × 1,000 tokens = 20M tokens/day. At $10 per million tokens, that's $200/day.

At 10× scale (100,000 users):

  • $2,000/day

  • ~$60,000/month

That number doesn't mean "don't ship." It means "know what you're buying."
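The back-of-the-envelope formula translates directly into code. A minimal sketch, using the numbers from the example above (the blended $10/M rate is a simplification that folds everything into the input side):

```python
def daily_cost(requests: int, input_tokens: int, output_tokens: int,
               input_rate: float, output_rate: float) -> float:
    """Daily API spend in dollars; rates are dollars per 1M tokens."""
    return (requests * input_tokens * input_rate
            + requests * output_tokens * output_rate) / 1_000_000

# 10,000 users x 2 conversations = 20,000 requests/day, ~1,000 tokens each
print(daily_cost(20_000, 1_000, 0, 10.0, 0.0))   # 200.0 per day
print(daily_cost(200_000, 1_000, 0, 10.0, 0.0))  # 2000.0 per day at 10x
```

Keeping this as a function makes the 10× question trivial to re-ask whenever traffic assumptions or model prices change.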

Applying this to the ticket classifier

Let's run the numbers concretely.

Assumptions

  • Tickets per day: 500 (5,000 at 10×)

  • Tokens per ticket:

    • ~200 input (ticket text + prompt)

    • ~50 output (category + confidence)

  • Model: GPT-4o-mini

  • Pricing (early 2026):

    • Input: $0.15 / 1M tokens

    • Output: $0.60 / 1M tokens

Cost with GPT-4o-mini

Per ticket:

  • Input: 200 × $0.15 / 1M = $0.00003

  • Output: 50 × $0.60 / 1M = $0.00003

  • Total: ~$0.00006 per ticket

At current scale (500/day):

  • ~$0.03/day

  • ~$1/month

At 10× scale (5,000/day):

  • ~$0.30/day

  • ~$9/month

Effectively free.

What if we used GPT-4 instead?

Using standard GPT-4 pricing:

  • Input: $30 / 1M

  • Output: $60 / 1M

Per ticket:

  • Input: ~$0.006

  • Output: ~$0.003

  • Total: ~$0.009 per ticket

At 10× scale (5,000/day):

  • ~$45/day

  • ~$1,350/month

Now it's noticeable—but still cheap relative to the value.
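Running both models through the same arithmetic makes the 150× gap concrete. A sketch using the prices assumed above (always check a provider's current rates before relying on these figures):

```python
def cost_per_ticket(input_tokens: int, output_tokens: int,
                    input_rate: float, output_rate: float) -> float:
    """Cost of one classification in dollars; rates are dollars per 1M tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# 200 input + 50 output tokens per ticket
mini = cost_per_ticket(200, 50, 0.15, 0.60)   # GPT-4o-mini (assumed pricing)
gpt4 = cost_per_ticket(200, 50, 30.0, 60.0)   # GPT-4 (assumed pricing)

print(f"per ticket: ${mini:.5f} vs ${gpt4:.3f}")
print(f"per month at 5,000/day: ${mini * 5000 * 30:.0f} vs ${gpt4 * 5000 * 30:.0f}")
```

The monthly figures land at roughly $9 versus $1,350, matching the hand calculations above.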

The real question: ROI

Each ticket classification saves ~2 minutes of manual work.

At 5,000 tickets/day:

  • ~166 hours of human labor saved every day

Even at $1,350/month for GPT-4, the ROI is overwhelming.

The takeaway isn't "always use the biggest model." It's know the real cost, then decide intentionally.

Cost optimization levers

If costs do become a concern, you have options:

  1. Model selection — Can a smaller model handle this task?

  2. Caching — Same questions often have same answers

  3. Batching — Classify multiple tickets in one call

  4. Prompt compression — Remove unnecessary context

  5. Rate limit handling — At high volume, you'll hit API rate limits; build in retries with exponential backoff
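Lever 5 is worth a sketch. Assuming a call that can raise a rate-limit error (here a generic exception; real SDKs expose their own error types), a retry wrapper with exponential backoff and jitter looks like this:

```python
import random
import time

def with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0):
    """Retry fn with exponential backoff plus jitter; re-raise after max_retries."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            # waits 1s, 2s, 4s, ... plus jitter so concurrent clients desynchronize
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

In production you would catch only the provider's rate-limit exception, not bare `Exception`, so that genuine bugs still fail fast.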

💡 Tip: Tools like Helicone and PricePerToken help you compare costs across 300+ models. Worth checking before you commit to a provider.

Putting It Together

Think of these three questions as gates, not steps:

  • When all three have good answers → Proceed with confidence

  • When one answer is shaky → Investigate before building

  • When answers are bad → Save yourself months of wasted effort

The ticket classifier verdict

Let's apply the framework:

| Question | Answer | Confidence |
| --- | --- | --- |
| Q1: Should we build? | Yes—AI adds value (intent beyond keywords) | High |
| Q2: How will it fail? | Failures are recoverable (wrong queue, not legal liability) | Medium |
| Q3: Can we afford it? | Cost is minimal at our scale | High |

Decision: Worth building, with:

  • Confidence thresholds for auto-routing

  • Human review queue for low-confidence cases

  • Regular bias audits

This framework is a thinking tool. The goal is to surface the questions that would bite you later—before spending three months building something that can't ship.

What Comes Next

These three questions form the Decision Layer in a larger mental model for production AI. (If you want the full picture, start with Post 1: A Mental Model for Production AI.) They're the first gate—the questions that determine whether you should even start.

In the next post, I'll turn this framework into an interactive decision tree: a tool you can walk through step-by-step to answer "Should we use AI here?"
