Should We Use AI Here?

An interactive tool to determine if AI is the right choice for a given problem.

In the previous post, we established three questions to ask before building any AI feature. The first — Should we build this? — is the one
that filters out the most projects.

But "should we build this?" is still abstract. It's hard to use in a planning meeting. What I wanted was something more concrete: a series of gates I could walk through with a team, where each step has a clear yes or no, and every "no" leads to
a specific next action.

That's what this post is — a decision tree with four gates. Walk through them in order. The whole thing takes about fifteen minutes with a whiteboard.

Why a Decision Tree?

Without a structured way to evaluate AI features, I've seen teams land in one of two places: over-adoption (building AI because it's exciting) or under-adoption (avoiding AI because it feels risky). Both tend to be costly.

Google's Rules of Machine Learning puts it well: "Don't be afraid to launch a product without machine learning." Their Rule #1 is essentially a permission slip to start simple.
A heuristic that gets you 50% of the way there is a legitimate starting point.

The decision tree below takes that same energy and makes it systematic. Four questions, in order. Each one narrows the path toward a responsible "yes" — or redirects to a constructive "no."

[Figure: The "Should We Use AI?" decision tree with four gates: problem definition, error tolerance, data readiness, and cost at scale. Each gate has a YES path (continue) and a NO path (specific action).]

How to Read This Tree

A few things to keep in mind before walking through it:

The gates are sequential. Start at Gate 1. If you pass, move to Gate 2. Don't skip ahead — earlier gates catch issues that make later gates irrelevant.

Every "no" leads to a concrete action. "Don't use AI" is never the full answer. Each NO path tells you what to do instead, or what to fix before coming back.

Passing all four gates doesn't mean "ship it." It means proceed with detection and fallbacks built in. More on that at the end.

This is repeatable. Conditions change — data improves, costs drop, problem scope evolves. Walk the tree again when they do.

To make this concrete, I'll use the same running example from Post 2: an AI-powered support ticket classifier that routes incoming tickets to the right team.

Gate 1: Is the Problem Well-Defined with Clear Success Criteria?

Can you write down what "good" looks like? Can you measure it?

This sounds basic, but it filters out a surprising number of proposals. Anthropic's own documentation lists three prerequisites before any prompt engineering
work: a clear definition of success criteria, ways to empirically test against those criteria, and a first draft prompt. If the AI provider puts "define success criteria" as prerequisite #1, that's a signal worth taking seriously.

What YES looks like

"Classify support tickets into 8 categories with >85% accuracy, measured against a human-labeled test set of 500 tickets."

That's specific. It has a metric (accuracy), a threshold (85%), and a measurement method (comparison against human labels). The right threshold depends on the cost of errors — higher for medical triage, lower for internal routing. (For
imbalanced categories, consider macro-F1 or cost-weighted metrics instead of raw accuracy.) You could build an evaluation suite around it.
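
As a minimal sketch of what that evaluation suite could look like, here is an accuracy check against the labeled set. classify_ticket() is a hypothetical stand-in for whatever model or prompt you end up using, and the test set is assumed to be a list of (ticket_text, true_team) pairs:

```python
# A minimal accuracy check against the human-labeled test set.
# classify_ticket() is a hypothetical stand-in for whatever model or prompt
# you end up using; labeled_examples is a list of (ticket_text, true_team) pairs.

def evaluate(labeled_examples, classify_ticket, threshold=0.85):
    correct = sum(
        classify_ticket(text) == true_team
        for text, true_team in labeled_examples
    )
    accuracy = correct / len(labeled_examples)
    print(f"Accuracy: {accuracy:.1%} on {len(labeled_examples)} examples")
    return accuracy >= threshold

# Usage: gate the rollout on the 85% threshold from the success criteria.
# passed = evaluate(test_set, classify_ticket)
```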

What NO looks like

"Use AI to make our product smarter." Or: "Add AI to the onboarding flow." These are directions, not problems. There's no way to know if you've succeeded, which means there's no way to know if you've failed — and that's worse.

The NO path

Don't use AI. Define the problem first. Write down what success looks like in a sentence that includes a metric. Build a small evaluation set (even 50-100 labeled examples). Then come back to this gate.

This is a valuable outcome. A well-defined problem statement improves every engineering decision downstream, whether you end up using AI or not.

Running example

For our ticket classifier: we can define success as "route tickets to the correct team >85% of the time, measured by comparing AI routing to human routing on a labeled sample of 500 historical tickets." Gate 1: PASS.

Gate 2: Can You Tolerate Probabilistic / Wrong Answers?

What happens when the AI is wrong? Is that an inconvenience or a crisis?

AI outputs are inherently probabilistic. The same input can produce different outputs — and even with deterministic settings, behavior can shift when providers update models. The model will sometimes be confidently wrong. As [Simon Willison](https://simonwillison.net/2024/Dec/31/llms-in-2024/) puts it, the core skill is "working with tech that is both inherently unreliable and incredibly powerful."

Some domains handle that well. Others can't.

When deterministic logic tends to win

  • All scenarios are known and enumerable — tax brackets, payment rules, compliance checks

  • 100% predictability is required — audit trails, regulatory reporting

  • Speed/latency is critical — a lookup table returns in microseconds; an API call takes hundreds of milliseconds

  • Failures are too rare to generate training data — as one industrial monitoring study found, rule-based approaches proved more practical than ML for well-defined problems where failure data was
    insufficient to train on

When AI tends to add value

  • Genuine ambiguity in input — natural language where intent matters more than keywords

  • Patterns too subtle to encode manually — fraud signals across hundreds of features

  • Heuristics hitting a complexity ceiling — Google's Rule #3 notes that a complex heuristic is unmaintainable; that's when ML becomes the simpler choice

Eugene Yan's research shows that simpler approaches frequently match or beat complex ML — tree-based models outperformed deep neural networks on 45 tabular datasets, and greedy algorithms exceeded
graph neural networks on combinatorial problems. The question to ask: given the cost of complexity, is the improvement worth it?

The NO path

Don't use AI. Use deterministic logic. Rules, switch statements, lookup tables, regex — these are well-proven tools that are faster, cheaper, and easier to debug. Capital One's engineering blog describes several hybrid patterns for combining rules with ML — using rule outputs as ML features, or using ML confidence scores to trigger rule-based fallbacks. Separately, some teams run ML in shadow mode against a
rules baseline before fully switching over.
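
To make the deterministic option concrete, here is a minimal sketch of keyword-based routing with a deterministic fallback queue. The keywords and team names are invented for illustration, not taken from any of the sources above:

```python
# A rules-first router: deterministic, fast, and easy to debug.
# Keywords and team names are illustrative only.

ROUTING_RULES = [
    ({"invoice", "refund", "charge"}, "billing"),
    ({"password", "login", "2fa"}, "account-security"),
    ({"crash", "error", "bug"}, "engineering"),
]

def route_ticket(text: str) -> str:
    words = set(text.lower().split())
    for keywords, team in ROUTING_RULES:
        if words & keywords:           # any keyword hit wins
            return team
    return "general-triage"            # deterministic fallback

print(route_ticket("I was charged twice on my last invoice"))  # billing
```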

Running example

For our ticket classifier: a misrouted ticket goes to the wrong team and gets re-routed. That's a 2-minute delay, not a compliance violation or patient safety issue. The consequence of being wrong is manageable. Gate 2: PASS.

💡 Counter-example: A medical triage system where wrong classification could delay critical care. Gate 2 would FAIL here — use rules-based routing with human review for high-acuity cases.

Gate 3: Do You Have Quality Training / Eval Data?

Do you have data to train or evaluate the AI — and is it actually good?

This gate catches more projects than any other. Gartner predicts that through 2026, organizations will abandon 60% of AI projects
that lack AI-ready data — making data readiness the most common single point of failure. And as reported by Informatica's CDO Insights 2025 survey, data quality and readiness is the #1 obstacle to AI success, cited by 43% of respondents.

The data readiness test

Four questions I'd want answered before proceeding:

  1. Can you label 100 examples right now? Not hypothetically — actually sit down and do it. If you can't produce labeled examples, you can't evaluate the system.

  2. Do those examples cover edge cases? Happy-path data produces happy-path models. The interesting failures happen at the boundaries. If two humans can't agree on labels for the same example, your taxonomy may need revision before you build anything (the sketch after this list shows one way to quantify that agreement).

  3. Is the data format stable? If the schema changes every quarter, your training data has a shelf life. That's not disqualifying, but it means ongoing maintenance.

  4. Are there privacy, consent, or data residency concerns? Using customer data for AI training may require consent you don't have. PII in training data creates exposure. And if the data can't leave your network — due to regulation, customer
    contracts, or data residency requirements — that constrains whether you can use external API endpoints at all. Better to surface this now.
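
Here is one way to quantify the agreement point from question 2: a small sketch that computes percent agreement and Cohen's kappa for two people who labeled the same sample independently. The example labels are made up:

```python
from collections import Counter

def agreement_report(labels_a, labels_b):
    """Percent agreement and Cohen's kappa for two annotators labeling the
    same examples (labels_a[i] and labels_b[i] refer to the same ticket)."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: how often the two would agree by luck alone,
    # given each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a | freq_b)

    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0
    return p_o, kappa

p_o, kappa = agreement_report(
    ["billing", "bug", "billing", "account"],
    ["billing", "bug", "account", "account"],
)
print(f"agreement={p_o:.0%}, kappa={kappa:.2f}")  # agreement=75%, kappa=0.64
```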

The NO path

Don't use AI yet. Fix data first.

This is a productive outcome, not a failure. As one analysis of enterprise AI patterns put it, programs that succeed with AI often earmark 50-70% of their
timeline and budget for data readiness — extraction, normalization, labeling, governance. If your data isn't ready, spending time on it has a higher return than spending time on models.

Running example

For our ticket classifier: we have 50,000 historical tickets with human-assigned team labels. A quick audit reveals some category overlap and roughly 5% miscategorization, but the bulk is clean and representative. We can label 100 edge cases
this week. Gate 3: PASS (with a note to clean up the 5% before training).

Gate 4: Can You Afford the Cost at 10x Current Scale?

Not "can you afford it today" — can you afford it when usage grows 10x?

LLM costs scale linearly with volume, but as one analysis found, the total cost is often significantly higher than single-call estimates suggest, because of hidden multipliers:

  • Multi-step agents: If your workflow makes 15 LLM calls to handle one user query, your per-query cost is 15x what single-call testing suggests.

  • RAG context injection: Retrieval-augmented queries can add thousands of extra tokens per request (2,000-4,000 is common), depending on chunk sizes and retrieval depth.

  • System prompt overhead: A 2,000-token system prompt gets sent on every API call. At 1 million calls, that's 2 billion tokens consumed by instructions alone.

  • Output token premium: Output tokens are often priced several times higher than input tokens (for example, GPT-4o charges $2.50/M input vs $10/M output — a 4x ratio). A chatbot that generates verbose
    responses costs far more than the input pricing implies.

The 10x test

Take your current estimated daily volume. Multiply by 10. Calculate the monthly cost including the multipliers above. Is that number sustainable for the business value the feature provides?

Model choice matters enormously here — the same workload can differ by an order of magnitude or more depending on model choice. Running a feature on GPT-4o-mini vs GPT-4o can be the difference between viable and unsustainable.
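
Here is a rough sketch of that 10x math in code. The token counts and per-million-token prices are placeholders; substitute your measured prompt sizes and your provider's current price sheet before trusting the output:

```python
# Back-of-envelope 10x cost check, with the hidden multipliers made explicit.
# Token counts and prices below are placeholders, not quoted provider rates.

def monthly_cost_at_10x(daily_queries, calls_per_query,
                        input_tokens_per_call, output_tokens_per_call,
                        price_in_per_m, price_out_per_m, days=30):
    cost_per_call = (
        (input_tokens_per_call / 1e6) * price_in_per_m
        + (output_tokens_per_call / 1e6) * price_out_per_m
    )
    return daily_queries * 10 * calls_per_query * cost_per_call * days

# Single-call classifier: ~2,000-token system prompt + ~500-token ticket in,
# ~50 tokens out, at two hypothetical model tiers.
budget  = monthly_cost_at_10x(500, 1, 2500, 50, price_in_per_m=0.15, price_out_per_m=0.60)
premium = monthly_cost_at_10x(500, 1, 2500, 50, price_in_per_m=2.50, price_out_per_m=10.00)
print(f"budget tier: ~${budget:,.0f}/month   premium tier: ~${premium:,.0f}/month")
```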

The NO path

Two options:

  1. Don't use AI at this model tier. Drop to a cheaper model. For many classification tasks, smaller models perform comparably at a fraction of the cost.

  2. Add strict cost controls. Budget caps, request throttling, intelligent routing (cheap model for easy cases, expensive model for hard ones), prompt caching, and batch processing for non-real-time work.

The NO path here doesn't mean "AI is too expensive." It means the current cost structure needs adjustment before scaling.

Running example

For our ticket classifier: current volume is 500 tickets/day. At 10x, that's 5,000 tickets/day.

| Model | Cost per ticket | Daily (10x) | Monthly (10x) |
| --- | --- | --- | --- |
| GPT-4o-mini | ~$0.001 | $5 | ~$150 |
| GPT-4o | ~$0.01 | $50 | ~$1,500 |

Assumes ~500 input tokens + ~50 output tokens per classification call, single pass, no retrieval.

Even at the premium tier, $1,500/month is manageable for a feature that eliminates manual routing. At the budget tier, it's almost negligible. Gate 4: PASS.

You Passed All Four Gates. Now What?

The ticket classifier passed all four gates. That's a green light to build — but with conditions.

Passing the tree doesn't mean "ship it and move on." It means proceed with detection and fallbacks built in. Specifically:

Automated quality monitoring. Run evals continuously, not just at launch. As Simon Willison notes, "writing good automated evals for LLM-powered systems is the skill that's most needed
to build useful applications."

Human review sampling. Sample a percentage of AI decisions for human review. The percentage depends on error cost, regulatory requirements, and how much the input distribution shifts over time — our ticket classifier might sample 5%, while a
medical system would sample much more.
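
A minimal sketch of that sampling step, assuming a hypothetical review queue; the 5% rate mirrors the ticket classifier example above:

```python
import random

REVIEW_RATE = 0.05  # 5% for the ticket classifier; tune to error cost and drift

def maybe_queue_for_review(ticket_id, ai_decision, review_queue):
    """Randomly sample a fraction of AI decisions for human review.
    review_queue is a stand-in for whatever review tooling you actually use."""
    if random.random() < REVIEW_RATE:
        review_queue.append((ticket_id, ai_decision))

# Usage sketch with an invented ticket ID:
queue = []
maybe_queue_for_review("TICKET-123", "billing", queue)
```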

Graceful degradation. What happens when the AI service is down or responding slowly? For our classifier, the fallback is simple: tickets go to a general triage queue, just like they did before AI.
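
And a minimal sketch of that degradation path. classify_with_ai() is a hypothetical model call, stubbed here to simulate an outage; the point is that any failure lands the ticket in the same general triage queue it would have hit before AI:

```python
# Graceful degradation: any failure or slow response drops the ticket into
# the general triage queue, exactly as tickets were handled pre-AI.

GENERAL_TRIAGE = "general-triage"

def classify_with_ai(text: str, timeout: float) -> str:
    raise TimeoutError("model endpoint unavailable")  # stand-in for a real SDK call

def route_with_fallback(ticket_text: str, timeout_s: float = 2.0) -> str:
    try:
        team = classify_with_ai(ticket_text, timeout=timeout_s)
        return team or GENERAL_TRIAGE      # empty answer also falls back
    except Exception:                      # timeouts, outages, malformed responses
        return GENERAL_TRIAGE

print(route_with_fallback("App crashes when I upload a file"))  # general-triage
```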

Cost alerting. You did the 10x math. Now set alerts at 5x so you're not surprised. Monitor cost-per-ticket alongside accuracy.

The Real Value Is in the "No" Paths

Here's what I think is underappreciated about this tree: the "no" paths are often more valuable than the "yes" path.

| Gate | NO Path Action | What It Prevents |
| --- | --- | --- |
| 1 | Define the problem first | Building something you can't evaluate |
| 2 | Use deterministic logic | Unnecessary complexity and unreliability |
| 3 | Fix data first | Gartner-predicted 60% project abandonment from data gaps |
| 4 | Add cost controls | Unsustainable scaling costs |

Each "no" is a specific, constructive action that improves the project regardless of whether you eventually use AI. "Define the problem first" makes every downstream decision better. "Fix data first" addresses what multiple studies identify as
the primary cause of AI project failure. These aren't consolation prizes — they're the outcomes that reduce the odds of joining what Gartner forecasts as 30% of GenAI projects abandoned after proof of concept.

And the tree is repeatable. Data that wasn't ready six months ago might be ready now. Costs that were prohibitive last year might have dropped. Walk it again when conditions change.

What's Next

This post operationalized Post 2's first question — "Should we build this?" — into something you can bring to a planning meeting. Next time someone says "we should use AI for this," walk the tree.

The next post in the series goes deeper on what happens when Gate 3 says "fix data first." Data readiness gets its own framework, because "do you have quality data?" deserves a more nuanced answer than yes or no.
