
Three Questions I’d Want Answered Before Building Any AI Feature

Here's the three-question framework I use to decide whether an AI feature is worth building.

Before any code gets written, these three questions set the trajectory for everything else. Below is the reasoning path I walk through when evaluating an AI feature, and how I'd apply it to something real.

The 95% That Fail

Here's a number that stopped me: according to research on AI project outcomes, 80-95% of AI projects fail to deliver business value. Not "underperform expectations"—fail.

The surprising part? It's rarely the technology. The constraint isn't that models aren't good enough. It's organizational design: misaligned incentives, avoided measurement, and the belief that tools compensate for broken processes.

AI failures tend to happen before anyone writes a line of code. The trajectory is set earlier—in the questions we ask, or don't ask.

I wouldn't want to start building until I'd worked through three questions:

  1. Should we build this? (Is AI the right tool?)

  2. How will it fail? (What does wrong look like?)

  3. Can we afford it? (Cost at scale)

To make this concrete, I'll walk through each question using a real example: building an AI-powered support ticket classifier.

Three questions framework: Should we build? How will it fail? Can we afford it?

Question 1: Should We Build This?

The question that filters out the most projects

Is AI the right tool?

The temptation is strong. You see a problem, and your mind jumps to "AI can do this." Maybe it can. But the counter-question is more useful: can a rules-based system do this 80% as well?

Deterministic logic wins when:

  • Patterns are known and enumerable

  • Classifications are simple and stable

  • You need 100% predictability

AI actually adds value when:

  • There's genuine ambiguity in the input

  • Natural language understanding matters

  • Patterns are too subtle or numerous to encode manually

The way I'd think about it: if an engineer could write a switch statement that handles 80% of cases, maybe start there. AI is for the remaining complexity that resists simple rules.

Applying to the example

Let's say we're building a support ticket classifier. The current state: manual routing takes 2 minutes per ticket. Someone reads the ticket, decides which team should handle it, and assigns it.

The alternative: keyword-based routing with an escalation queue. If the ticket contains "billing," route to Finance. If it contains "crash" or "error," route to Engineering. Everything else goes to a human triage queue.
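That rules-based alternative is only a few lines of code. A minimal sketch (the team names and keywords are illustrative, not a real routing config):

```python
def route_ticket(text: str) -> str:
    """Keyword-based routing with a human escalation queue (illustrative rules)."""
    lowered = text.lower()
    if "billing" in lowered:
        return "finance"
    if "crash" in lowered or "error" in lowered:
        return "engineering"
    return "triage"  # everything else goes to a human
```

Note what it misses: `route_ticket("I was charged twice")` lands in the triage queue, because the customer never said "billing." That gap is exactly where AI earns its keep.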

Where does AI add value? Understanding intent beyond keywords. A customer might write "I was charged twice" without ever saying "billing." They might describe a symptom without using technical terms. AI can understand the meaning, not just pattern-match on words.

The strategic filter

IBM's CIO asks a useful question: "Does this customer interaction touch a core differentiator?"

If AI is table stakes—something every competitor has—consider buying instead of building. If it's differentiating—something that could give you a real edge—the investment might be worth it.

For our ticket classifier: support routing probably isn't a differentiator. We'd likely start with an off-the-shelf solution or a simple rules-based system, and only build custom AI if we hit a ceiling.

Question 2: How Will It Fail?

The question that determines whether you're ready to ship

Why this question matters

Here's the mindset shift that took me a while to internalize: you can't "fix" LLMs. They're probabilistic by nature. The goal is to build systems that work well despite failures.

A chatbot might give brilliant answers 95% of the time and hallucinate wildly 5% of the time. As one AI PM put it: "A traditional PM looks at that 5% and panics, trying to squash it like a bug. An AI PM knows that managing that uncertainty is the entire job."

The failure taxonomy

Research on LLM production challenges identifies eight critical failure modes:

  1. Hallucination — Making stuff up

  2. Non-determinism — Different answers each time

  3. Reasoning failures — Logic errors

  4. Context limits — Running out of space

  5. Latency/cost spikes — Being slow or expensive

  6. Prompt injection — Getting hacked via malicious inputs

  7. Data leakage — Privacy violations

  8. Bias — Unfairness in outputs

Each of these can manifest in your feature. The question is: which ones matter for your use case?

Real-world failure case studies

A fast food restaurant's voice AI was deployed to 500+ drive-throughs. In one incident, a customer ordered "18,000 cups of water," crashing the system. The AI struggled with accents, background noise, and edge cases, forcing constant staff intervention—the opposite of its purpose.

An airline's chatbot gave a passenger incorrect refund information. The company refused to honor it, arguing the chatbot was a separate entity. A tribunal disagreed: The airline is responsible for all information on its website, including chatbot responses.

A big tech company's hiring AI was trained on resumes from past applicants—who were predominantly male. The system learned to penalize resumes containing words like "women's" and downgraded graduates of all-women's colleges. The company discontinued the program, but not before it had been used to evaluate real candidates.

The pattern: failures cascade downstream. Bad input leads to bad classification leads to bad routing leads to angry customer leads to churn.

Failure cascade diagram showing input → model → output → impact

Failures don't stay contained—they cascade downstream

Applying to the example

For our ticket classifier, let's map the failure modes:

| Failure Mode | What happens | Impact |
| --- | --- | --- |
| Miscategorization | Ticket goes to wrong team | Delayed resolution, frustrated customer |
| Low confidence | Everything escalates to humans | Defeats the purpose |
| Bias | Certain segments get worse routing | Equity issues, potential legal exposure |

The key question: what's the cost of a wrong classification?

For ticket routing, the cost is relatively low. A misrouted ticket gets reassigned. It's annoying, not catastrophic. Compare that to the airline chatbot, where a wrong answer created legal liability.

Mitigation strategies:

  • Confidence thresholds: only auto-route when confidence is high

  • Human review queue for edge cases

  • Regular audits for bias in routing patterns
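The first two mitigations can live in a single dispatch function. A sketch, assuming the classifier returns a label plus a confidence score; the 0.85 threshold is a made-up starting point you would tune against audit data:

```python
CONFIDENCE_THRESHOLD = 0.85  # illustrative; tune against real routing outcomes

def dispatch(label: str, confidence: float) -> str:
    """Auto-route only when the model is confident; otherwise escalate to a human."""
    if confidence >= CONFIDENCE_THRESHOLD:
        return label          # auto-route to the predicted team
    return "human_review"     # low-confidence edge case goes to the review queue
```

The threshold is also your dial for the "failures that look like successes" problem: lowering it trades human workload for fewer silent misroutes.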

⚠️ Warning: What makes me most cautious about AI features? The failures that look like successes. The ticket gets routed, no errors in the logs, but it went to the wrong team and the customer churned. You only find out weeks later.

Question 3: Can We Afford It?

The question that survives contact with production

Token economics 101

Every API call costs money. LLMs charge per token (roughly: word pieces). A few key concepts:

  • Input tokens are cheaper than output tokens (you're not generating them)

  • Model selection has massive cost implications

  • Hidden costs add up: context window stuffing, retry loops, logging

The model multipliers are dramatic:

| Model | Relative Cost |
| --- | --- |
| GPT-3.5 | 1x |
| GPT-4o | 5x |
| GPT-4 Turbo | 10x |
| GPT-4 | 20x |
| DeepSeek | 0.035x |

Choosing the right model for the task is often the biggest cost lever you have.

The 10× scale calculation

Every AI feature should ship with a back-of-the-envelope cost model. Mine looks like this:

Daily cost = (requests × input_tokens × input_rate + requests × output_tokens × output_rate) / 1,000,000

Two important notes:

  • Input and output tokens are priced differently

  • Small per-request costs become real money at scale

Before committing to any AI feature, I ask: what happens at 10× scale?

Example:

10,000 users × 2 conversations/day × 1,000 tokens = 20M tokens/day. At $10 per million tokens, that's $200/day.

At 10× scale (100,000 users):

  • $2,000/day

  • ~$60,000/month

That number doesn't mean "don't ship." It means "know what you're buying."
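The back-of-the-envelope formula translates directly into code. A minimal sketch, using the numbers from the example above (the blended $10/M rate is a simplification that folds everything into the input side):

```python
def daily_cost(requests: int, input_tokens: int, output_tokens: int,
               input_rate: float, output_rate: float) -> float:
    """Daily API spend in dollars; rates are dollars per 1M tokens."""
    return (requests * input_tokens * input_rate
            + requests * output_tokens * output_rate) / 1_000_000

# 10,000 users x 2 conversations = 20,000 requests/day, ~1,000 tokens each
print(daily_cost(20_000, 1_000, 0, 10.0, 0.0))   # 200.0 per day
print(daily_cost(200_000, 1_000, 0, 10.0, 0.0))  # 2000.0 per day at 10x
```

Keeping this as a function makes the 10× question trivial to re-ask whenever traffic assumptions or model prices change.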

Applying this to the ticket classifier

Let's run the numbers concretely.

Assumptions

  • Tickets per day: 500 (5,000 at 10×)

  • Tokens per ticket:

    • ~200 input (ticket text + prompt)

    • ~50 output (category + confidence)

  • Model: GPT-4o-mini

  • Pricing (early 2026):

    • Input: $0.15 / 1M tokens

    • Output: $0.60 / 1M tokens

Cost with GPT-4o-mini

Per ticket:

  • Input: 200 × $0.15 / 1M = $0.00003

  • Output: 50 × $0.60 / 1M = $0.00003

  • Total: ~$0.00006 per ticket

At current scale (500/day):

  • ~$0.03/day

  • ~$1/month

At 10× scale (5,000/day):

  • ~$0.30/day

  • ~$9/month

Effectively free.

What if we used GPT-4 instead?

Using standard GPT-4 pricing:

  • Input: $30 / 1M

  • Output: $60 / 1M

Per ticket:

  • Input: ~$0.006

  • Output: ~$0.003

  • Total: ~$0.009 per ticket

At 10× scale (5,000/day):

  • ~$45/day

  • ~$1,350/month

Now it's noticeable—but still cheap relative to the value.
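Running both models through the same arithmetic makes the 150× gap concrete. A sketch using the prices assumed above (always check a provider's current rates before relying on these figures):

```python
def cost_per_ticket(input_tokens: int, output_tokens: int,
                    input_rate: float, output_rate: float) -> float:
    """Cost of one classification in dollars; rates are dollars per 1M tokens."""
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000

# 200 input + 50 output tokens per ticket
mini = cost_per_ticket(200, 50, 0.15, 0.60)   # GPT-4o-mini (assumed pricing)
gpt4 = cost_per_ticket(200, 50, 30.0, 60.0)   # GPT-4 (assumed pricing)

print(f"per ticket: ${mini:.5f} vs ${gpt4:.3f}")
print(f"per month at 5,000/day: ${mini * 5000 * 30:.0f} vs ${gpt4 * 5000 * 30:.0f}")
```

The monthly figures land at roughly $9 versus $1,350, matching the hand calculations above.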

The real question: ROI

Each ticket classification saves ~2 minutes of manual work.

At 5,000 tickets/day:

  • ~166 hours of human labor saved every day

Even at $1,350/month for GPT-4, the ROI is overwhelming.

The takeaway isn't "always use the biggest model." It's know the real cost, then decide intentionally.

Cost optimization levers

If costs do become a concern, you have options:

  1. Model selection — Can a smaller model handle this task?

  2. Caching — Same questions often have same answers

  3. Batching — Classify multiple tickets in one call

  4. Prompt compression — Remove unnecessary context

  5. Rate limit handling — At high volume, you'll hit API rate limits; build in retries with exponential backoff
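Lever 5 is worth a sketch. Assuming a call that can raise a rate-limit error (here a generic exception; real SDKs expose their own error types), a retry wrapper with exponential backoff and jitter looks like this:

```python
import random
import time

def with_backoff(fn, max_retries: int = 5, base_delay: float = 1.0):
    """Retry fn with exponential backoff plus jitter; re-raise after max_retries."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            # waits 1s, 2s, 4s, ... plus jitter so concurrent clients desynchronize
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

In production you would catch only the provider's rate-limit exception, not bare `Exception`, so that genuine bugs still fail fast.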

💡 Tip: Tools like Helicone and PricePerToken help you compare costs across 300+ models. Worth checking before you commit to a provider.

Putting It Together

Think of these three questions as gates, not steps:

  • When all three have good answers → Proceed with confidence

  • When one answer is shaky → Investigate before building

  • When answers are bad → Save yourself months of wasted effort

The ticket classifier verdict

Let's apply the framework:

| Question | Answer | Confidence |
| --- | --- | --- |
| Q1: Should we build? | Yes—AI adds value (intent beyond keywords) | High |
| Q2: How will it fail? | Failures are recoverable (wrong queue, not legal liability) | Medium |
| Q3: Can we afford it? | Cost is minimal at our scale | High |

Decision: Worth building, with:

  • Confidence thresholds for auto-routing

  • Human review queue for low-confidence cases

  • Regular bias audits

This framework is a thinking tool. The goal is to surface the questions that would bite you later—before spending three months building something that can't ship.

What Comes Next

These three questions form the Decision Layer in a larger mental model for production AI. (If you want the full picture, start with Post 1: A Mental Model for Production AI.) They're the first gate—the questions that determine whether you should even start.

In the next post, I'll turn this framework into an interactive decision tree: a tool you can walk through step-by-step to answer "Should we use AI here?"
