5 Questions with Sanjin Bićanić: Turning AI from Promise into Real Business Value

March 26, 2026
Sanjin Bićanić

Ahead of this year’s QED Conference, where he joins the speaker lineup, we spoke with Sanjin Bićanić, Partner at Bain & Company and a seasoned expert in large‑scale AI implementation. Sanjin brings a rare blend of experience to the table: he began his career as a software engineer in Silicon Valley and today operates at the intersection of strategy, technology, and hands‑on product development.

While many organizations are still struggling to move from experimentation to real production deployment, Sanjin breaks down common misconceptions, explains why AI initiatives often stall, and shares a practical framework for identifying use cases with the potential to reach near‑perfect reliability. We also explore organizational barriers, the “March of Nines,” autonomous agents, the challenge of underspecified tasks, and the invisible scaffolding that actually determines whether AI can scale.

Here are the five questions we had for him.

1) You started your career as a software engineer in Silicon Valley, and today you operate at the intersection of strategy and hands-on technical development. How does that technical foundation shape the way you approach AI strategy? Where do you see the biggest disconnect between executives and AI teams?

The biggest misconception is that AI deployments fail to create value because companies pick the wrong use cases. In my experience, the use cases are usually fine; it’s how they’re deployed that creates the biggest issues. My technical foundation helps me in two ways. First, I ask questions about things I know are likely traps that slow us down (e.g., do we have the APIs we need and can we use them, do we have this specific type of data, etc.). Second, in all technical work it’s the edge cases that get you, so I make sure we build slowly and incrementally and don’t try to blitz-scale a proven pilot, because AI typically has 100x more edge cases than traditional software.

2) Many companies are moving from AI experimentation to production. What separates organizations that successfully scale AI from those that remain stuck in pilots? In your experience, are the primary barriers technical, organizational, or cultural? And how decisive is the role of the C-suite in AI adoption?

First, I’d argue companies aren’t stopping enough pilots. Once a company commits to a use case, it either becomes a huge success or, more often, becomes a zombie that continues to consume valuable talent, time, and energy for far too long. Those scarce resources would be better deployed elsewhere.

Usually when a use case is stuck in a pilot, it’s because it isn’t working; it’s really rare to see a pilot that works not go to production. Pilots typically fail because they solve too small a part of a workflow, don’t create enough value without a process redesign, or simply don’t work. When they don’t work, it’s usually because they’re missing context: the data they need to be successful isn’t systematically collected, and that is something that is usually knowable early in pilot selection.

3) In a recent podcast, you referenced Andrej Karpathy’s “March of Nines” idea, where increasing reliability (e.g., from 90% to 99% to 99.9%) often requires a roughly constant investment, say six months, for each additional “nine.” You also suggested that the key is identifying a slice of the problem where near-perfect performance is achievable early. How do you practically identify that initial wedge? What signals tell you “this can reach 99.9% reliability” versus “this will be a long tail of edge cases”?

It’s usually a two-step process. Step 1: does this pass the sniff test of something AI could do? Does it use capabilities LLMs have, and can we give the LLM the same (or nearly the same) context a human has? The answer to both needs to be yes to pass. Step 2: try to get to 90% with, say, 200 test cases. If you can get there in 1-2 months, it’s a very good sign; if not, you probably have a hard problem. It’s OK to do this quick prototype offline and with a small team (e.g., 2-3 people). The cost of building a production agent is typically much more than just the agent itself (e.g., API integrations, observability, guardrails, multiple environments), so compared to the full bill for an autonomous agent, a quick 4-8 week POC that tests the most unknown part (can it do the task accurately?) is a good way to de-risk it.
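The "90% on ~200 test cases" check amounts to a small offline evaluation harness. A minimal sketch, with the caveat that everything here is illustrative: `run_agent` is a hypothetical stand-in for whatever prototype is being tested, and real test cases would be ~200 human-vetted input/expected pairs rather than the toy examples below.

```python
# Minimal offline eval harness for the "get to 90% on ~200 test cases" check.
# `run_agent` is a hypothetical placeholder for the prototype under test.

def run_agent(task: str) -> str:
    """Placeholder agent: in practice this would call an LLM."""
    return task.strip().lower()

def evaluate(test_cases: list[tuple[str, str]], threshold: float = 0.90) -> tuple[float, bool]:
    """Score the agent on (input, expected) pairs against a pass threshold."""
    passed = sum(1 for task, expected in test_cases if run_agent(task) == expected)
    accuracy = passed / len(test_cases)
    return accuracy, accuracy >= threshold

# Toy test set: the third case fails, so accuracy is 2/3 and the
# 90% gate is not cleared.
cases = [("  Hello ", "hello"), ("WORLD", "world"), ("Foo", "bar")]
acc, ok = evaluate(cases)
```

The point of keeping the harness this simple is that it can run entirely offline with a 2-3 person team, before any production scaffolding exists.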

4) You’ve also pointed out that a major driver of unreliability is that real-world tasks are frequently underspecified – requiring the agent to apply something like common sense. What’s your current best approach to handling ambiguity without overfitting? Specifically, what belongs in example sets, what belongs in prompts and context engineering, and where do generator–evaluator (or multi-agent) patterns materially improve outcomes?

This is a tough problem, and there really isn’t an easy way to deal with it. What we typically do is look for large clusters of failures that look similar. Then we try to figure out the principle behind the failure and address it in the base prompt. Adding too many examples or too many idiosyncratic rules tends to help initially but eventually collapses under its own weight. To figure out the principle, we usually ask the LLM for the reasoning behind any decision it makes, and that reasoning usually provides an excellent clue to the root cause of the failure. The good news is that each new generation of LLMs is getting better at making common-sense decisions, so the AI labs are helping us out somewhat too.
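The cluster-then-generalize loop described above could be sketched as follows. This is a toy illustration, not Bain's tooling: the failure records, the keyword-based grouping, and the cluster labels are all assumptions; in practice the grouping might use embeddings or manual review, but the shape of the loop is the same.

```python
from collections import defaultdict

# Hypothetical failure records. Each carries the model's own stated
# reasoning, which is often the best clue to the root cause.
failures = [
    {"case": 12, "reasoning": "assumed currency was USD"},
    {"case": 47, "reasoning": "assumed currency was EUR"},
    {"case": 81, "reasoning": "missing shipping address"},
]

def cluster_failures(records: list[dict], keywords: list[str]) -> dict[str, list[int]]:
    """Group failure cases by the first keyword found in the model's reasoning."""
    clusters: dict[str, list[int]] = defaultdict(list)
    for rec in records:
        label = next((k for k in keywords if k in rec["reasoning"]), "other")
        clusters[label].append(rec["case"])
    return dict(clusters)

clusters = cluster_failures(failures, ["currency", "address"])
# A large cluster (here, "currency") points at a principle to encode in the
# base prompt, e.g. "never assume a currency; look it up in the order data",
# instead of adding one idiosyncratic rule per failed case.
```

The design choice worth noting is that the fix targets the cluster's shared principle in the base prompt, not the individual examples, which is what keeps the prompt from collapsing under accumulated special cases.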

5) We’ve been discussing the “scaffolding tax” with many enterprise clients lately: as organizations move beyond ideation and pilots into adoption, scaling becomes imperative, and the supporting infrastructure often outweighs the agent itself. In your view, what are the first two or three scaffolding components you would prioritize because they de-risk everything that follows (e.g., evaluation and QA, observability, simulated environments, tool and API reliability, security guardrails)?

Evaluation comes first, usually in a few layers: starting with human-vetted test cases, then moving on to large historical runs, and quickly moving into simulated environments (although the simulation doesn’t need to be perfect). Observability is typically second. We plan for tool and API reliability and for security guardrails from the beginning, but they’re not the first things we build. That said, the world is complicated; this is a rule of thumb, not a hard recommendation on priorities.
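The layered evaluation sequence above can be sketched as an ordered series of gates an agent must clear before promotion. The layer names and accuracy thresholds below are illustrative assumptions, not a prescribed framework; the only idea taken from the answer is the ordering of the layers.

```python
# Illustrative sketch of layered evaluation gates, in the order described:
# human-vetted tests, then historical replays, then simulation.
# Layer names and thresholds are assumptions for the example.
LAYERS = [
    ("human_vetted_tests", 0.90),     # small, hand-checked cases first
    ("historical_replay", 0.95),      # large runs over past real data
    ("simulated_environment", 0.99),  # simulation need not be perfect
]

def last_layer_cleared(scores: dict[str, float]) -> str:
    """Return the deepest layer whose gate the agent cleared, in strict order."""
    cleared = "none"
    for layer, gate in LAYERS:
        if scores.get(layer, 0.0) >= gate:
            cleared = layer
        else:
            break  # gates are sequential: a miss stops further promotion
    return cleared
```

Making the gates strictly sequential mirrors the answer's point: cheap human-vetted tests de-risk the expensive later layers, so there is no value in running a big simulation for an agent that fails the first 200 hand-checked cases.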