Evaluating AI Tools for Your Team

AI in Practice · Intermediate · 45 min read · Published 2026-02-18 · Last reviewed 2026-03-05 · AI Primer

Your inbox is full of AI pitches. Your LinkedIn feed is full of demos. Your CEO is full of enthusiasm. And you, the person who will actually have to make this work, are full of reasonable doubt.

You are right to be cautious. Boston Consulting Group's 2024 survey of 1,000 C-suite executives across 59 countries found that 74% of companies have yet to show tangible value from their AI investments. McKinsey's 2025 State of AI report classified only about 6% of organisations as "AI high performers." A widely discussed MIT study put the failure rate for generative AI pilots at 95%, though that figure deserves some scrutiny (more on that later). Meanwhile, Gartner projects that worldwide AI spending will reach $2.5 trillion in 2026, up from $1.5 trillion in 2025.

The gap between investment and returns is one of the largest misallocations of corporate resources in recent memory. It also means that rigorous evaluation of AI tools, the boring, methodical, unsexy work of separating what genuinely helps from what merely impresses in a demo, is not optional. It is the highest-leverage activity available to any team considering an AI purchase.

This guide is designed to help you do that work well. It draws on findings from consulting firms, academic institutions, government bodies, and practitioners, more than 60 sources published between 2023 and 2026, including peer-reviewed research from Science, The Quarterly Journal of Economics, and MIS Quarterly, alongside industry research from Gartner, McKinsey, BCG, RAND, and Forrester. It is written for professionals and managers who need to make informed decisions about AI tools without a data science background.

A note before we begin: the evidence in this guide reflects the state of things as of early 2026. AI capabilities are changing rapidly, but the frameworks, failure patterns, and evaluation principles described here are built on durable insights about how organisations adopt technology. They will age better than any specific tool recommendation.

Five layers of assessment

No single framework dominates AI tool evaluation, but a clear taxonomy has emerged. Understanding it prevents the common mistake of applying the wrong framework to the wrong question. There are five distinct layers, and most teams need elements of at least three.

The first layer is organisational readiness. Before you evaluate any tool, you need to evaluate whether your organisation is prepared to use one effectively. This sounds obvious. It is routinely skipped. BCG's "Build for the Future" assessment evaluates organisations across 41 dimensions and classifies them into four tiers. Only 4–5% qualify as leaders. Cisco's AI Readiness Index uses six pillars (strategy, infrastructure, data, governance, talent, and culture) and found that only 15% of organisations have networks fully ready for AI. Microsoft offers a free assessment covering seven pillars, from business strategy to model management. Deloitte's AI Data Readiness assessment focuses specifically on whether your data is fit for purpose.

The message across all readiness frameworks is consistent: if your data is messy, your processes are undefined, and your team hasn't been prepared, even an excellent AI tool will fail. Readiness assessment should precede vendor conversations, not accompany them.

The second layer is vendor and product evaluation. This is where most teams start, and often where they should not. Gartner's Critical Capabilities methodology scores products on a 1–5 scale across 14 or more capabilities per use case, complementing its well-known Magic Quadrant. Forrester's Wave framework uses 21+ criteria across current offering, strategy, and market presence. These analyst frameworks are rigorous but expensive, and they evaluate vendors rather than your specific use case. They tell you who is strong in the market. They do not tell you whether the market's strongest product solves your particular problem.

The third layer is risk and governance. The NIST AI Risk Management Framework, released in January 2023, organises risk management into four functions (Govern, Map, Measure, and Manage) with a companion Generative AI Profile (NIST AI 600-1) that adds 12 GenAI-specific risk categories. ISO/IEC 42001:2023 is the world's first certifiable AI management system standard, with 38 specific controls. For most teams, these frameworks feel abstract until you need them, at which point they are invaluable. The practical takeaway is that any AI tool you adopt should be assessable against these standards, and any vendor that cannot articulate how their product aligns with them has not thought seriously about enterprise deployment.

The fourth layer is technical evaluation, assessing model and system performance. Stanford's HELM framework (Holistic Evaluation of Language Models) scores across seven metrics including accuracy, robustness, fairness, and efficiency. McKinsey's 2026 Agentic AI Evaluation Framework introduces a three-layer model for evaluating increasingly autonomous AI systems. These frameworks matter most when you are comparing competing products that claim similar capabilities but differ in their underlying approach.

The fifth layer is practitioner scorecards, the hands-on templates that translate all of the above into something your team can actually use. These typically cover six to ten weighted categories and produce a comparable score across vendors. They are most useful when you have narrowed your shortlist to two or three options.

A meta-analysis by Adnan Masood, PhD, benchmarking frameworks from McKinsey, BCG, Deloitte, Gartner, and ISO, identifies ten evaluation dimensions that appear consistently across all major frameworks: strategic fit, value realisation, adoption depth, time-to-impact, model performance, risk and governance, data readiness, scalability, security and privacy, and people and change. If your evaluation process covers all ten, you are being thorough. If it covers fewer than six, you have blind spots.
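Practitioner scorecards like these reduce to simple arithmetic once the dimensions and weights are agreed. The sketch below shows one way to combine 1–5 ratings across the ten dimensions above into a single comparable score; the weights and the two vendor profiles are illustrative assumptions, not recommendations.

```python
# Minimal weighted-scorecard sketch. The ten dimensions come from the
# meta-analysis above; the weights and vendor ratings are illustrative
# placeholders to be replaced with your own.

DIMENSIONS = {
    "strategic_fit": 0.15,
    "value_realisation": 0.15,
    "adoption_depth": 0.10,
    "time_to_impact": 0.10,
    "model_performance": 0.10,
    "risk_and_governance": 0.10,
    "data_readiness": 0.10,
    "scalability": 0.05,
    "security_and_privacy": 0.10,
    "people_and_change": 0.05,
}

def weighted_score(ratings: dict[str, int]) -> float:
    """Combine 1-5 ratings into a single comparable score."""
    missing = set(DIMENSIONS) - set(ratings)
    if missing:
        raise ValueError(f"Unrated dimensions (blind spots): {missing}")
    return sum(DIMENSIONS[d] * ratings[d] for d in DIMENSIONS)

vendor_a = {d: 3 for d in DIMENSIONS} | {"data_readiness": 5, "people_and_change": 4}
vendor_b = {d: 4 for d in DIMENSIONS} | {"model_performance": 5, "people_and_change": 2}
print(f"Vendor A: {weighted_score(vendor_a):.2f}  Vendor B: {weighted_score(vendor_b):.2f}")
```

The scoring itself is trivial; the value is in forcing the team to agree on weights before vendor demos begin, and in surfacing any dimension nobody has rated.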

Why AI projects fail, and why the reasons are not what you expect

The failure statistics are sobering, but they require careful interpretation. Different studies measure different things, and the headline numbers obscure important nuance.

The most methodologically rigorous failure analysis comes from RAND Corporation (2024), based on interviews with 65 experienced data scientists and engineers. Authored by James Ryseff, Brandon De Bruhl, and Sydne Newberry, the study identified five root causes of AI project failure, in order of frequency: misunderstanding or miscommunicating the problem the AI is meant to solve; lack of adequate training data; a technology-first bias ("chasing shiny objects" rather than solving a defined business problem); inadequate infrastructure; and attempting problems that are too difficult for current AI capabilities.

The most-cited finding is that 80% or more of AI projects fail, a figure the RAND report helped popularise. BCG's comparable figure is the 74% of companies yet to show tangible value. Gartner reported that only 48% of AI projects make it into production, with an average of eight months from prototype to deployment. S&P Global found that 42% of companies abandoned most of their AI initiatives in 2025, up sharply from 17% the year before.

The MIT study claiming 95% failure deserves particular scrutiny. Led by Aditya Challapally at MIT NANDA, it found that 95% of generative AI pilots fail to deliver measurable profit-and-loss impact, based on 150 interviews, a survey of 350 employees, and analysis of 300 public deployments. Paul Roetzer of the Marketing AI Institute described the methodology as insufficient for the headline claim. UC Berkeley's David Gallacher argues the study reflects a measurement failure rather than an AI failure as such: it used a six-month P&L window, far too narrow for a transformational technology. The study did find something genuinely useful, though: back-office automation produces the highest ROI, while sales and marketing pilots (where most budgets flow) produce the lowest. That finding is worth more than the headline.

What matters for evaluation is not the precise failure rate but the pattern of why things go wrong. Across all major studies, the most common reasons cluster into three categories.

The first is unclear problem definition. This is cited as the single most fundamental failure mode by both Gartner and RAND. Teams adopt AI tools without a clear understanding of what business problem they are solving, what success looks like, or how the tool fits into existing workflows. The RAND report describes this as "misunderstanding what problem needs to be solved" and calls it the most common anti-pattern across all the projects they studied. The implication for evaluation: before you assess any tool, you need a written statement of the problem it will solve, the metric by which you will measure success, and the threshold at which you will consider the investment justified.

The second is data quality. The Informatica CDO Insights 2025 survey found that 43% of organisations cite data quality as their top AI obstacle. A Qlik survey of 500 data professionals found that 81% say their company still has significant data quality issues. Gartner projects that 60% of AI projects unsupported by AI-ready data will be abandoned through 2026. The implication: data readiness assessment is not an optional step in tool evaluation. It is a prerequisite.

The third, and this is the one that should reshape your entire evaluation process, is people and organisational resistance. The Cloud Security Alliance estimates that 70–80% of AI project failures stem from lack of user adoption, not technical shortcomings. EY's 2024 survey found that 75% of employees lack confidence in using AI. A Kyndryl 2025 report found that 45% of CEOs say employees are resistant or openly hostile to AI adoption. The implication: any evaluation process that focuses primarily on technical capability and ignores adoption feasibility is evaluating the wrong thing.

The 70% rule: the most important finding in this research

BCG's analysis of what separates organisations that succeed with AI from those that don't produced a finding so important it deserves its own section. They call it the 10-20-70 principle: among successful AI adopters, 10% of effort and resources go to algorithms, 20% go to technology and data, and 70% go to people, processes, and cultural transformation.

This is not an aspirational ratio. It is an empirical observation of what winners actually do. Organisations that invert it, spending most of their budget on technology, consistently underperform.

McKinsey's 2025 data reinforces this from a different angle. Across 25 attributes tested for their effect on EBIT impact from generative AI, workflow redesign had the single biggest effect, bigger than model selection, technical architecture, data quality, or any other factor. Their conclusion is blunt: bolting AI onto existing processes will not produce meaningful results. The work that matters is redesigning how people actually do their jobs.

This has direct implications for how you evaluate AI tools. The typical evaluation process spends 80% of its time on product features and 20% on everything else. The evidence suggests this should be inverted. When comparing two tools, the one that fits more naturally into your team's existing workflows, or that comes with better implementation support, training resources, and change management guidance, is likely to outperform the one with more impressive technical specifications.

The MIT NANDA study found the same pattern from yet another angle: vendor-led, domain-specific, workflow-integrated solutions succeed at nearly twice the rate of generic approaches. Morgan Stanley's AI assistant reached 98% adoption among wealth management teams, but only after rigorous quality validation, guardrails, and deep workflow integration. The technology mattered far less than how it was woven into daily practice.

Designing pilots that produce genuine evidence

The gap between AI experimentation and production value, commonly called "pilot purgatory," is one of the most persistent challenges in enterprise AI. For every 33 AI prototypes a company builds, only about 4 reach production, according to IDC/Lenovo research. McKinsey's 2025 survey found that roughly two-thirds of organisations remain stuck in "experiment or pilot" mode.

The problem is not that pilots fail technically. It is that they succeed technically while failing to prove business value. Or worse, they prove nothing at all because they were designed to confirm a decision that had already been made.

The distinction between proof of concept, pilot, and production matters enormously. A proof of concept asks "Can this technology work?" using clean, curated data in an isolated environment over days or weeks. A pilot asks "Does this work for our business?" using real-world but limited-scope data over 6–16 weeks. Production asks "Is this delivering sustained value?" using messy, constantly changing data streams indefinitely. The crucial trap is that pilots, in practice, often run on cleaner and more static data than production will ever face, so a technically successful pilot may prove the technology works without proving that value exists. These are fundamentally different questions, and confusing them is where enormous sums of money go to die.

Optimal pilot duration clusters around 8–16 weeks across most practitioner recommendations, broken into five phases: preparation (2–4 weeks), implementation (1–2 weeks), stability testing (2–4 weeks), feature expansion (2–3 weeks), and evaluation (1–2 weeks). Some practitioners advocate leaner six-week pilots run as a series of two-week sprints. A critical finding: large enterprises take an average of nine months to scale AI solutions, while mid-market firms manage it in 90 days, suggesting that organisational complexity, not technical difficulty, is the primary bottleneck.

Organisations that conduct structured pilots before full deployment report 60% fewer issues during company-wide rollouts and achieve nearly twice the employee adoption rates. The structure matters as much as the technology being tested.

To avoid confirmation bias (the most common pilot pathology), the evidence points to several practices worth adopting. First, define success criteria before starting the pilot, explicitly documenting what results will trigger scaling, iteration, or termination. If you cannot state your kill criteria in advance, you are not running a pilot; you are running a demonstration. Second, track business metrics (user adoption, time to value, behaviour change) alongside technical metrics (accuracy, latency, throughput). A tool that scores 95% accuracy but that nobody uses has delivered zero value. Third, run AI alongside existing processes for direct comparison, a parallel-systems approach that makes the AI's contribution visible and measurable. Fourth, plan for production-ready architecture from day one rather than building prototypes that accumulate technical debt you will later regret.
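The first of these practices can be made mechanical: write the thresholds down as data before the pilot begins, and let them, rather than a post-hoc narrative, decide the outcome. A minimal sketch, in which the metric names and thresholds are purely illustrative assumptions:

```python
# Pre-registered pilot criteria, written down before the pilot starts.
# Metric names and thresholds are illustrative assumptions; pick the ones
# that match your own problem statement.

SUCCESS_CRITERIA = {               # scale only if ALL of these are met
    "task_time_reduction": 0.20,   # >= 20% faster than the parallel baseline
    "weekly_active_users": 0.60,   # >= 60% of pilot group using it weekly
    "error_rate_vs_baseline": 1.0, # no worse than the existing process
}
KILL_CRITERIA = {                  # stop if ANY of these are met
    "weekly_active_users": 0.25,   # adoption collapsed
    "error_rate_vs_baseline": 1.5, # 50% more errors than baseline
}

def pilot_decision(results: dict[str, float]) -> str:
    if (results["weekly_active_users"] < KILL_CRITERIA["weekly_active_users"]
            or results["error_rate_vs_baseline"] > KILL_CRITERIA["error_rate_vs_baseline"]):
        return "terminate"
    if (results["task_time_reduction"] >= SUCCESS_CRITERIA["task_time_reduction"]
            and results["weekly_active_users"] >= SUCCESS_CRITERIA["weekly_active_users"]
            and results["error_rate_vs_baseline"] <= SUCCESS_CRITERIA["error_rate_vs_baseline"]):
        return "scale"
    return "iterate"

# Example: faster and accurate, but adoption fell short of the scale threshold.
print(pilot_decision({"task_time_reduction": 0.28,
                      "weekly_active_users": 0.55,
                      "error_rate_vs_baseline": 0.9}))   # -> "iterate"
```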

The MIT NANDA study's most actionable finding was this: the strongest predictor of whether a pilot transitions to production is whether it was designed with production constraints from the beginning. Pilots that are set up as standalone experiments (separate data, separate workflows, separate teams) almost never cross the chasm. Pilots embedded in real workflows from the start succeed far more often.

Measuring return on investment without fooling yourself

ROI measurement for AI is genuinely difficult, and most organisations do it badly. Research compiled by Glean found that 85% of organisations misestimate AI project costs by more than 10%, and 24% miss forecasts by more than 50%. Hidden expenses can inflate total AI ownership costs by 200–400% compared to initial vendor quotes.

The headline ROI figures vary so dramatically by source that they are almost useless without context. A Microsoft-sponsored IDC study reported average generative AI ROI of $3.70 per $1 invested, with top performers reaching $10.30; the sponsorship is worth keeping in mind. The IBM Institute for Business Value found a more modest 5.9% ROI on enterprise-wide AI, with 47% seeing positive returns, 33% breaking even, and 14% experiencing negative returns. McKinsey's 2025 survey found only 39% of organisations report measurable EBIT impact from AI. Most organisations achieve satisfactory ROI within 2–4 years, far longer than the typical 7–12 month technology payback period that finance teams are accustomed to.

UC Berkeley SCET's Multi-Dimensional Framework, developed by David Gallacher in September 2025, is in our view the most intellectually honest approach. It measures across five dimensions: efficiency metrics (time saved, processes automated), quality metrics (error reduction, customer satisfaction), capability metrics (new tasks enabled, skill amplification), strategic metrics (competitive advantage, innovation), and human metrics (employee satisfaction, retention, learning velocity). Gallacher's central argument, that traditional P&L ROI is the wrong metric for transformational AI and that organisations need multi-dimensional measurement over multi-year horizons, is supported by the broader evidence base.

Brynjolfsson, Rock, and Syverson's "Productivity J-Curve" concept, published in the American Economic Journal: Macroeconomics (2021), provides important theoretical grounding. Intangible investments in learning, organisational restructuring, and new processes may initially lead to stagnant or declining measured productivity, followed by a take-off phase. This explains why AI's benefits may be delayed as organisations build necessary complementary capital, and why premature ROI measurement consistently underestimates long-term value. It also explains why vendors promising immediate returns should be viewed with scepticism.

Where most AI investments go wrong is total cost of ownership. The major cost categories that evaluators consistently underestimate are: data preparation and cleaning (20–30% of total project costs and the most frequently underestimated category); implementation and integration with legacy systems (which can increase costs by 40–60%); training and change management (10–15% of implementation budget); ongoing maintenance ($30K–$50K per year minimum for enterprise tools); and talent costs, with specialised AI engineers commanding $200K–$500K+ in compensation and data engineering absorbing 25–40% of total AI spend.
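Once the hidden categories are named, the arithmetic is easy to run. The sketch below rough-cuts a three-year total cost of ownership; every figure is an illustrative assumption drawn loosely from the ranges above and should be replaced with your own estimates.

```python
# Rough three-year total-cost-of-ownership sketch. Every figure is an
# illustrative assumption; replace with your own estimates.

YEARS = 3

costs = {
    "licensing_per_year":         80_000,
    "data_preparation_one_off":  200_000,   # often 20-30% of total project cost
    "integration_one_off":       150_000,   # legacy-system work
    "training_change_mgmt":       80_000,   # rollout, training, change management
    "maintenance_per_year":       50_000,   # monitoring, updates, support
    "internal_staff_per_year":   120_000,   # fractional engineering/analyst time
}

one_off = (costs["data_preparation_one_off"]
           + costs["integration_one_off"]
           + costs["training_change_mgmt"])
recurring = (costs["licensing_per_year"]
             + costs["maintenance_per_year"]
             + costs["internal_staff_per_year"]) * YEARS

tco = one_off + recurring
print(f"3-year TCO: ${tco:,.0f}")
print(f"Licensing share of TCO: {costs['licensing_per_year'] * YEARS / tco:.0%}")
```

With these placeholder figures, licensing comes out at roughly a fifth of total cost, which is consistent with the 15–25% share discussed later in this guide; the point of the exercise is to make the other four fifths visible in the business case.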

The most common ROI calculation errors include measuring too early (six-month windows miss compound value), tracking adoption and usage instead of actual productivity or business outcomes, ignoring cost avoidance (not replacing departed employees doesn't show up as traditional ROI), confusing "faster" with "better," and the attribution challenge (when humans and AI share workflows, isolating AI's contribution requires deliberate tagging of each step as machine-generated, human-verified, or human-enhanced). If you cannot attribute the output, you cannot measure the value. Build attribution into your pilot design from the start.
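Attribution does not require sophisticated tooling; it requires tagging each step as the work happens rather than reconstructing it afterwards. A minimal sketch, in which the step names and time figures are hypothetical:

```python
# Minimal attribution log: tag each workflow step as it happens so AI's
# contribution can be separated later. Step names and times are examples.

from dataclasses import dataclass
from collections import Counter

@dataclass
class Step:
    name: str
    origin: str        # "machine_generated" | "human_verified" | "human_enhanced"
    minutes_spent: float

workflow = [
    Step("draft summary", "machine_generated", 2),
    Step("check figures", "human_verified", 10),
    Step("rewrite recommendation", "human_enhanced", 25),
]

time_by_origin = Counter()
for step in workflow:
    time_by_origin[step.origin] += step.minutes_spent

total = sum(time_by_origin.values())
for origin, minutes in time_by_origin.items():
    print(f"{origin}: {minutes:.0f} min ({minutes / total:.0%})")
```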

Vendor evaluation: separating substance from spectacle

"AI washing" (the deliberate or negligent exaggeration of AI capabilities in product marketing) has become so pervasive that regulators have begun enforcement. A 2019 MMC Ventures study found that 40% of startups advertised as "AI companies" did not have genuine AI at their core. Gartner estimated in mid-2025 that only about 130 of the thousands of self-described "agentic AI" vendors are genuine; the rest have rebranded chatbots and robotic process automation tools. In perhaps the most dramatic example, Builder.ai, which had raised $445 million, filed for bankruptcy in May 2025 after it was revealed that its "AI-powered" development platform was largely powered by hundreds of offshore human developers.

This is the environment in which you are evaluating tools. Healthy scepticism is not pessimism; it is professionalism.

Red flags that should trigger deeper scrutiny include vague accuracy claims presented without benchmarks or specific methodologies; refusal to explain training data sources or model architecture; demo environments that don't use your actual data or real-world conditions; vendor lock-in tactics through proprietary formats, long-term contracts, or data portability barriers; absence of formal AI policy documentation (which practitioner guides call "an immediate red flag"); and evasiveness when asked direct technical questions.

One particularly revealing question, suggested by Forrester analyst Mike Gualtieri: "How do you monitor model drift?" Code runs as written, but AI models decay in performance over time as the data they encounter diverges from their training data. A vendor without a clear monitoring strategy hasn't thought about production reality. They have built a demo, not a product.
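For readers who want to see what a concrete answer to that question can look like, the sketch below compares the distribution of a production feature (or of model scores) against a training-time reference using the population stability index. The 0.2 alert threshold is a common rule of thumb rather than a standard, and the data here is synthetic.

```python
# Minimal drift check: compare production data against a training-time
# reference using the population stability index (PSI). The 0.2 alert
# threshold is a common rule of thumb, not a universal standard.

import numpy as np

def psi(reference: np.ndarray, production: np.ndarray, bins: int = 10) -> float:
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_pct = np.histogram(reference, bins=edges)[0] / len(reference)
    prod_pct = np.histogram(production, bins=edges)[0] / len(production)
    # Clip to avoid division by zero / log(0) on empty bins
    ref_pct = np.clip(ref_pct, 1e-6, None)
    prod_pct = np.clip(prod_pct, 1e-6, None)
    return float(np.sum((prod_pct - ref_pct) * np.log(prod_pct / ref_pct)))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 5_000)    # data the model was trained on
production = rng.normal(0.4, 1.2, 5_000)   # what it sees six months later

score = psi(reference, production)
print(f"PSI = {score:.3f} -> {'investigate drift' if score > 0.2 else 'stable'}")
```

A vendor does not need to use this particular statistic, but they should be able to describe something equivalent: a reference window, a comparison cadence, a threshold, and what happens when it trips.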

Green flags include transparent model and system cards documenting intended use and limitations; published performance benchmarks with methodology; clear data handling policies specifying storage, processing, protection, and compliance; third-party certifications (SOC 2 Type II, ISO 27001, ISO 42001); transparent pricing models without hidden consumption-based surcharges; open APIs using universal protocols that avoid lock-in; and willingness to provide trial periods with clear success criteria. Gartner's 2024 data found that companies with formal AI risk frameworks reported 35% fewer AI-related incidents.

Regulatory enforcement is accelerating. The FTC launched "Operation AI Comply" in September 2024 with five simultaneous enforcement actions, and has continued under the current administration with bipartisan support. Cases have targeted companies making unsupported claims about AI-powered legal services, facial recognition, and conversational sales agents. The SEC settled its first AI-washing cases in March 2024 and charged Presto Automation in January 2025 in its first action against a public company.

The existence of regulatory enforcement is itself useful information for evaluators. It means vendors face real consequences for misrepresentation, which creates some discipline in marketing claims. It also means that if a vendor's claims seem too good to be true, there is now a precedent for asking pointed questions, and for expecting honest answers.

FairNow's 10-Question Vendor Questionnaire, organised across data privacy, model performance, compliance, and support, provides a practical starting template. The OWASP Vendor Evaluation Criteria for AI Red Teaming Providers adds technical depth for security-focused assessments. Both are freely available and worth adapting to your context.

Data readiness: the prerequisite most organisations skip

If there is a single theme running through the failure literature, it is this: organisations that skip data readiness assessment and jump straight to tool selection regret it. Every time.

The numbers are consistent across sources. Seventy-three percent of enterprises report data quality as their primary AI adoption barrier. The Informatica CDO Insights 2025 survey found that 43% of respondents cite data quality as the single biggest challenge. A Qlik survey of 500 data professionals found that 81% acknowledge their organisation still has significant data quality issues. Gartner projects that 60% of AI projects unsupported by AI-ready data will be abandoned through 2026.

The widely cited "80/20 rule" (that 80% of data science time goes to data preparation) is somewhat overstated. The Anaconda 2020 survey found data scientists spend approximately 45% of time on preparation (26% cleaning, 19% loading). The realistic range is 45–60% of effort, which is still substantial enough that data preparation costs consume 20–30% of total AI project budgets and represent the most frequently underestimated line item in procurement.

Integration with existing systems is a serious barrier. Seventy percent of software in Fortune 500 companies is over two decades old. Only 29% of enterprise applications are integrated despite organisations averaging 897 applications. The architectural incompatibility between non-deterministic AI systems and deterministic legacy systems creates fundamental challenges that cannot be solved with middleware alone.

Cisco's AI Readiness Index found that 76% of organisations successfully scaling AI have fully centralised data, compared to only 19% of organisations overall. Organisations with strong data integration achieve 10.3x returns on AI investments versus 3.7x for those with poor integration. These are not marginal differences. They suggest that investment in data infrastructure may deliver higher returns than investment in AI tools themselves.

A practical data readiness checklist, derived from multiple frameworks, covers: a centralised data repository or data lake; data quality monitoring and profiling tools; data classification and sensitivity labelling; defined data ownership and stewardship roles; automated data cleansing pipelines; privacy-preserving mechanisms (anonymisation, pseudonymisation); retention and deletion policies; and audit trail capabilities. If your organisation lacks more than three of these, the evidence strongly suggests addressing them before signing any AI vendor contract.
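The checklist translates directly into a gap count you can put in front of a steering group. A trivial sketch, with placeholder True/False values to be replaced by your own assessment:

```python
# Data-readiness gap count, mirroring the checklist above. Mark each item
# True/False for your own organisation; the values below are placeholders.

checklist = {
    "centralised data repository or data lake": True,
    "data quality monitoring and profiling": False,
    "classification and sensitivity labelling": True,
    "defined data ownership and stewardship roles": False,
    "automated data cleansing pipelines": False,
    "privacy-preserving mechanisms": True,
    "retention and deletion policies": True,
    "audit trail capability": False,
}

gaps = [item for item, in_place in checklist.items() if not in_place]
print(f"{len(gaps)} gaps: {gaps}")
if len(gaps) > 3:
    print("Evidence suggests closing these before signing any AI vendor contract.")
```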

Change management determines whether tools actually get used

The most counterintuitive finding across all the research is this: the primary reason AI tools fail is not technical inadequacy but human and organisational resistance. The Cloud Security Alliance's estimate that 70–80% of AI project failures stem from lack of user adoption, not technical shortcomings, is consistent with every other major finding.

The adoption numbers are more nuanced than most headlines suggest. While 78% of businesses now report using AI for at least one function (up from 55% in 2023), and 75% of global knowledge workers say they use AI tools regularly, daily use stands at only about 10% of the U.S. workforce according to Gallup's Q3 2025 data. A CEOWORLD magazine survey from February 2026 cut even deeper: while 55% of workers use AI at least weekly, 85% don't use it in any way that generates business value. They prompt for isolated, low-value tasks that don't move core KPIs. Usage is not the same as value creation, and most organisations measure the former while hoping for the latter.

Shadow AI (employees using unsanctioned AI tools) has become endemic. Forty-nine percent of workers admit to using AI tools without employer approval. Seventy-eight percent of AI users bring their own tools through personal accounts. Sixty-eight percent of security leaders themselves use unauthorised tools. Shadow AI breaches add $670,000 to average breach costs and take longer to detect. The Samsung ChatGPT incident of March 2023, in which engineers entered proprietary source code, confidential chip data, and meeting transcripts into ChatGPT within 20 days of being given access, remains the canonical cautionary tale. Samsung confirmed the data became part of the training set and was impossible to retrieve.

Resistance patterns cluster around four themes. Fear of job loss affects 75% of employees according to EY. Trust issues run deep: 40% of employees find AI helpful but unreliable. Training gaps are severe: 52% receive only basic training, 20% get almost none, only 6% feel very comfortable using AI, and two-thirds of AI users are self-taught. And change fatigue from repeated technology rollouts breeds cynicism: 51% of employees say tech rollouts frequently create internal chaos.

The research on overcoming resistance points to one finding above all others. Gallup's data on manager support is worth sitting with: employees whose managers actively support AI use are 8.8 times more likely to agree AI helps them do their best work. But only 28% of employees strongly agree their manager provides that support. This is a leverage point. McKinsey found that the most enthusiastic AI adopters are millennial managers (ages 35–44), suggesting that middle management, traditionally the layer where organisational change goes to die, may actually be the layer where AI adoption lives or dies.

Academic research adds useful nuance. Dietvorst, Simmons, and Massey's "Algorithm Aversion" study (Journal of Experimental Psychology: General, 2015) demonstrated that people become especially averse to algorithms after seeing them make a single error, even when the algorithm outperforms humans overall. Their follow-up (Management Science, 2018) found something actionable: giving people even slight control over an algorithm's output, even when modifications were severely restricted, considerably increased willingness to use it. This has direct implications for how AI tools should be configured during rollout. Tools that position themselves as providing recommendations that humans approve will see higher adoption than tools that position themselves as making decisions that humans oversee.

For evaluation purposes, the change management evidence means that you should assess not only what a tool does but how adoptable it is. Does the vendor provide onboarding and training resources? Is the tool designed to integrate into existing workflows, or does it require a parallel process? Does it provide visible explanations for its outputs? Can users modify or override its recommendations? These are not secondary concerns. They are, according to the evidence, the primary determinants of whether your investment will produce value.

Security, privacy, and regulation

AI tools introduce security risks qualitatively different from traditional software. IBM's 2025 Cost of a Data Breach Report found that 13% of organisations reported breaches involving AI models or applications, and 97% of those lacked proper AI access controls. AI-specific data breaches average $4.80 million per incident and take 290 days to detect and fix, significantly longer than the overall average.

For any team evaluating AI tools, the security and privacy assessment should cover several non-negotiable areas.

Data handling and training. The most critical question: is customer data used for model training, and can your organisation opt out? Many AI tools default to including user inputs in their training data. For any tool that will process sensitive business information, this default is unacceptable. Confirm data retention policies, whether retention periods are configurable, and what happens to your data if the contract is terminated.

Compliance certifications. At minimum, enterprise AI tools should hold SOC 2 Type II certification (demonstrating security controls have been tested over time) and comply with relevant data protection regulations. For organisations operating in the EU, the AI Act's compliance timeline is now in effect: prohibited practices became enforceable in February 2025, and the deadline for high-risk AI systems (including those used in hiring, credit scoring, and medical diagnostics) is August 2, 2026, with penalties reaching up to 7% of annual global turnover. Healthcare organisations face additional HIPAA requirements, including Business Associate Agreements for any vendor handling protected health information.

Infrastructure security. Does the vendor support data residency requirements? Can your organisation bring its own encryption keys? What role-based access controls exist? How are models protected against adversarial attacks, prompt injection, and data poisoning? A comprehensive vendor risk questionnaire should cover all of these, and any vendor that hesitates to answer them clearly is not ready for enterprise deployment.

The shadow AI risk. Beyond vendor-specific security, your evaluation process should consider the shadow AI problem. If the sanctioned tool you select is difficult to use, slow, or poorly integrated into workflows, employees will route around it using consumer AI tools, exposing the organisation to exactly the data leakage risks you were trying to avoid. The best security strategy is not just a secure tool but an adopted tool.

What the academic research reveals about AI and productivity

Beyond the consulting surveys, peer-reviewed academic research provides important grounding for evaluation decisions. Three studies on AI productivity deserve particular attention, because they challenge assumptions that often drive tool procurement.

Brynjolfsson, Li, and Raymond's "Generative AI at Work" (The Quarterly Journal of Economics, 2025) is the largest field study of AI productivity effects to date. Studying 5,179 customer support agents, they found that AI access increased productivity by 14–15% on average. But the distribution was highly uneven: novice workers saw a 34% improvement, while experienced staff saw minimal impact. The AI essentially compressed the skill distribution: it brought low performers closer to the median without pushing high performers much further. For evaluation purposes, this means you should think carefully about where in your team AI will add value. If your use case involves experienced professionals doing expert-level work, the gains may be modest. If it involves large numbers of less experienced staff doing routine tasks, the gains may be substantial.

Noy and Zhang's experiment with 453 professionals (Science, 2023) found that ChatGPT reduced task completion time by 40% and increased output quality by 18%. Again, the largest gains went to lower-ability workers. This is consistent with the Brynjolfsson findings and suggests a general pattern: current AI tools are better at raising floors than lifting ceilings.

The most nuanced study comes from Dell'Acqua and colleagues at Harvard Business School (Working Paper, 2023), who studied 758 BCG consultants and introduced the concept of the "jagged technological frontier." For tasks within AI's capability frontier, consultants completed 12.2% more tasks, 25.1% faster, at 40% higher quality. But for tasks outside the frontier (tasks that appeared similar but required capabilities the AI lacked), AI users performed 19 percentage points worse than non-AI users. The reason: consultants trusted the AI's output and failed to catch errors on tasks where the AI was confidently wrong.

This may be the most important finding for anyone evaluating AI tools. Every AI system has a jagged frontier, a boundary between what it can do well and what it cannot, where the boundary is not smooth or predictable but irregular and often invisible. The evaluation challenge is not just "does this tool work?" but "for which tasks does this tool work, and can my team tell the difference?" If the answer to the second question is no, the tool may actually reduce the quality of your team's output while appearing to improve it.

For evaluation, this means that testing should deliberately include tasks where you expect the AI to fail, not just tasks where you expect it to succeed. If the tool fails gracefully (flagging uncertainty, declining to answer, or providing visible confidence indicators) that is a far better sign than if it fails silently, producing plausible-sounding but incorrect output with the same confidence as correct output.
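One way to operationalise this is a small frontier-probing test set: tasks you expect the tool to handle, paired with tasks you expect it to miss, graded by humans on how it fails. The sketch below assumes a hypothetical ask_tool function standing in for whatever API or interface you are evaluating; nothing here reflects any particular vendor.

```python
# Sketch of a frontier-probing test set: pair tasks you expect the tool to
# handle with tasks you expect it to fail, and record HOW it fails.
# `ask_tool` is a hypothetical stand-in for the tool under evaluation.

from dataclasses import dataclass

@dataclass
class TestCase:
    prompt: str
    expected_in_frontier: bool   # do we expect the tool to handle this well?

def ask_tool(prompt: str) -> str:
    raise NotImplementedError("Replace with a call to the tool under evaluation")

cases = [
    TestCase("Summarise this two-page policy memo.", True),
    TestCase("Reconcile these two conflicting spreadsheets line by line.", False),
    TestCase("Cite the exact clause in our 2019 supplier contract.", False),
]

for case in cases:
    try:
        answer = ask_tool(case.prompt)
    except NotImplementedError:
        answer = "<not wired up yet>"
    # Human reviewers then judge: did the tool flag uncertainty, decline,
    # or answer confidently? Silent confident failure on the out-of-frontier
    # cases is the red flag described above.
    print(f"[expected {'in' if case.expected_in_frontier else 'OUT of'} frontier] "
          f"{case.prompt}\n  -> {answer}\n")
```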

Case studies: what success and failure actually look like

Abstract principles become concrete through real examples. Several case studies from 2021–2025 illuminate the evaluation challenge from different angles.

JPMorgan Chase offers probably the most successful example of large-scale enterprise AI deployment. With an $18 billion annual technology budget and $1.3 billion dedicated to AI in 2024, the bank has deployed AI across more than 450 use cases. Its LLM Suite platform reached 200,000 internal users in eight months through an opt-in strategy. The COiN system for commercial loan analysis reduced review time from weeks to minutes and freed 360,000 lawyer hours annually. The critical success factors map directly onto the research findings: internal-first rollout with clear KPIs and test-control groups, data readiness as a precondition rather than an afterthought, and substantial investment in AI culture including Python training for staff. JPMorgan did not buy AI tools and hope for the best. They rebuilt processes around the tools and invested heavily in adoption.

IBM Watson Health remains the definitive cautionary tale and deserves careful study by anyone evaluating AI tools. IBM spent over $5 billion on acquisitions and built a 7,000-person division. The marketing was extraordinary. Watson would, IBM suggested, transform cancer treatment. The reality was different. Training data from a single Manhattan hospital did not generalise to diverse patient populations globally. Watson struggled with unstructured clinical data, abbreviations, and medical jargon. The MD Anderson Cancer Center partnership exceeded $62 million before being shelved after an audit. IBM eventually sold the unit in 2022 for approximately $1 billion, an 80% loss on its investment. The lesson: impressive demonstrations on curated data are not evidence of production readiness. The question to ask is not "does it work?" but "does it work on our data, with our edge cases, at our scale?"

Zillow Offers provides a different and equally instructive failure. Zillow's AI pricing algorithm, the Zestimate, was used to purchase homes at prices the algorithm predicted could yield a profit on resale. The algorithm couldn't account for unstructured factors (neighbourhood dynamics, school quality, property condition) that experienced human appraisers routinely consider. The critical error came with "Project Ketchup" in 2021, which removed human oversight by preventing pricing experts from modifying the algorithm's valuations. The company lost over $500 million, laid off 2,000 employees, and lost $9 billion in market capitalisation. Notably, competitors Opendoor and Offerpad, with better human-in-the-loop guardrails, weathered the same market conditions. The lesson is not that AI pricing doesn't work. It is that AI pricing without human oversight in a domain with significant unstructured variables is a recipe for catastrophic error.

The pattern across all major case studies is consistent: every significant failure involved removing or undervaluing human judgement; every success maintained human-in-the-loop oversight. This should be a central criterion in any tool evaluation. How does the tool position the human-AI relationship? Does it augment human decision-making, or does it attempt to replace it? Does it make human oversight easy or difficult? Does it provide the information humans need to catch errors, or does it present outputs as final answers?

Build versus buy: the decision most teams face

At some point during evaluation, someone on your team will suggest building a custom solution rather than purchasing one. This is a decision worth taking seriously, but the evidence is increasingly clear about when each approach is appropriate.

The cost differentials are substantial. Custom AI development typically costs 3–5x more upfront than purchasing existing solutions ($500K–$2M+ versus $100K–$400K), and 65% of total software costs occur after original deployment. AI talent commands a 30–50% premium above traditional IT roles. Time to deploy runs 6–12 months for builds versus 2–4 months for purchases.

The market is resolving this question decisively. In 2025, 76% of enterprise AI use cases were purchased rather than built (Menlo Ventures), up from 53% in 2024. The MIT study found that purchased AI tools from specialised vendors succeed approximately 67% of the time versus about 33% for internal builds. Organisations scoring low on readiness factors achieve 3x better outcomes with purchased solutions than custom development.

The emerging consensus favours what practitioners call "Buy, Boost, or Build." "Buy" means purchasing off-the-shelf vendor solutions: fast and simple, but limited differentiation. "Boost" means purchasing a vendor's model and enhancing it with proprietary data through fine-tuning or retrieval-augmented generation, more customisation but requiring strong data governance. "Build" means full internal development: complete control and competitive differentiation, but expensive and difficult.

For most teams, the practical decision rests on five factors: strategic importance (is the AI capability a competitive differentiator or an operational efficiency tool?), data sensitivity (do regulatory or competitive concerns preclude sharing data with vendors?), time-to-value urgency, internal talent availability, and total cost of ownership over three to five years. If the capability is a differentiator and you have the talent, build. If it's operational and your readiness is low, buy. If it's somewhere in between, boost. The important thing is to make the decision deliberately rather than defaulting to whatever your most enthusiastic engineer recommends.
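A quick way to keep that five-factor decision honest is to rough out the multi-year cost of each path. The sketch below uses midpoints of the ranges quoted above; all figures are illustrative assumptions, not forecasts.

```python
# Rough build-vs-buy comparison over a five-year horizon, using midpoints
# of the ranges quoted above. All figures are illustrative assumptions.

YEARS = 5

build = {
    "upfront": 1_250_000,    # midpoint of the $500K-$2M+ custom-build range
    "annual_run": 400_000,   # maintenance plus specialised AI talent
    "months_to_value": 9,    # within the 6-12 month build timeline
}
buy = {
    "upfront": 250_000,      # midpoint of the $100K-$400K purchase range
    "annual_run": 160_000,   # licences, maintenance, integration upkeep
    "months_to_value": 3,    # within the 2-4 month deployment timeline
}

for label, option in (("Build", build), ("Buy", buy)):
    total = option["upfront"] + option["annual_run"] * YEARS
    print(f"{label}: ${total:,.0f} over {YEARS} years, "
          f"first value after ~{option['months_to_value']} months")
```

The output is not a decision, but it anchors the conversation in total cost of ownership rather than in the upfront price alone.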

A market entering the Trough of Disillusionment, and why that favours careful buyers

One final piece of context for your evaluation: where AI sits on the adoption curve right now.

Generative AI moved from the Peak of Inflated Expectations to the Trough of Disillusionment on Gartner's 2025 Hype Cycle, the most significant positioning shift for the category. Gartner's John-David Lovelock noted in January 2026 that because AI is in the Trough throughout 2026, it will most often be sold by incumbent software providers rather than purchased as part of ambitious new initiatives.

AI startup failure rates are approximately 90%, significantly higher than the roughly 70% rate for traditional tech startups. Trust in AI companies dropped from 61% to 53% globally in 2024, with U.S. trust declining 15 points to just 35%. U.S. Census Bureau data provides a reality check on adoption surveys: only 9.7% of U.S. firms report actually using AI, up from 3.7% in late 2023 but far below the 78% figure that comes from surveys of large, tech-forward companies.

This environment is actually favourable for careful buyers. Vendor desperation creates leverage for negotiation. The shakeout of overhyped startups is clearing the field of bad options. The shift toward buying proven solutions suggests the market is maturing. And organisations that evaluate rigorously now, while many competitors are still recovering from failed pilots and abandoned initiatives, have an unusual window to adopt tools that genuinely work, from vendors that will actually survive.

The Trough of Disillusionment is, despite its name, when the best purchasing decisions get made. It is the moment when hype recedes and evidence becomes visible. For any team willing to do the patient, methodical work of genuine evaluation (defining the problem before selecting the tool, assessing readiness before signing the contract, planning for adoption before deploying the technology) it is an excellent time to buy.

Key takeaways

The evidence across more than 60 sources, spanning consulting firms, academic institutions, and practitioner experience, converges on a set of principles that should guide any AI tool evaluation.

Assess your organisation before assessing any tool. Readiness in data quality, governance, skills, and change management capacity predicts success more reliably than any product feature. If your data is not ready, no tool will rescue you. If your team is not prepared, no technology will be adopted.

Spend 70% of your evaluation effort on people and process, not product features. BCG's 10-20-70 principle and McKinsey's finding that workflow redesign has the single biggest effect on AI ROI should fundamentally reshape how you allocate evaluation time. The tool that fits your workflows is worth more than the tool that wins benchmarks.

Design pilots to produce evidence, not to confirm decisions. Define success and kill criteria before you start. Test with real data, in real workflows, with real users. Include tasks where you expect the AI to fail. Plan for production constraints from day one.

Account for the full cost. Licensing is typically 15–25% of total ownership cost. Data preparation, integration, training, maintenance, and talent represent the rest. If your business case only accounts for the vendor's quoted price, it is not a business case.

Insist on human-in-the-loop design. Every major AI failure in the case study literature (IBM Watson Health, Zillow Offers, UnitedHealthcare's nH Predict) involved removing or undervaluing human judgement. Every success maintained it. This should be a non-negotiable evaluation criterion.

Use the Trough of Disillusionment to your advantage. The current market favours patient, rigorous buyers. Vendors need your business, proven solutions are distinguishing themselves from hype, and organisations that adopt well now will have a meaningful head start.

Evaluating AI tools is not glamorous. It lacks the excitement of a product launch or the drama of a vendor keynote. But it is, according to every credible source we reviewed, the highest-leverage activity available to any team considering an AI investment. The organisations that do it well will capture genuine value. The ones that skip it will join the 74%.

Further reading

Frameworks and standards

NIST AI Risk Management Framework (AI RMF 1.0) — The primary U.S. guidance for AI risk management, organised around four core functions.

ISO/IEC 42001:2023 — The world's first certifiable AI management system standard.

Stanford HELM — Holistic Evaluation of Language Models, a comprehensive technical evaluation framework.

Research

RAND: The Root Causes of Failure for AI Projects — The most methodologically rigorous analysis of why AI projects fail.

Dell'Acqua et al.: Navigating the Jagged Technological Frontier — The landmark study of 758 BCG consultants that revealed AI's uneven capability boundary.

Brynjolfsson, Rock, and Syverson: The Productivity J-Curve — The theoretical framework explaining delayed AI productivity gains.

Industry reports

BCG: Where's the Value in AI? (October 2024) — The survey of 1,000 C-suite executives that found 74% struggling to capture AI value.

McKinsey: The State of AI in 2025 — The annual survey tracking AI adoption, investment, and organisational readiness.

UC Berkeley: Beyond ROI — Are We Using the Wrong Metric? — David Gallacher's argument for multi-dimensional AI measurement.

Practical tools

FairNow AI Vendor Questionnaire — A structured set of questions for vendor evaluation covering privacy, performance, and compliance.

Cisco AI Readiness Index — A six-pillar assessment of organisational AI readiness.

Microsoft AI Readiness Assessment — A free seven-pillar assessment covering strategy through infrastructure.

