Our approach to evaluating AI is seriously skewed


When it comes to AI adoption, trust matters far more than technical prowess. Image credit: Getty Images


There’s a peculiar irony in how we evaluate artificial intelligence: We’ve created systems to mimic and enhance human capabilities, yet we measure their success using metrics that capture everything except what makes them truly valuable to humans.

The tech industry’s dashboards overflow with impressive numbers on AI: processing speeds, parameter counts, benchmark scores, user growth rates. Silicon Valley’s greatest minds tweak algorithms endlessly to nudge these metrics higher. But in this maze of measurements, we’ve lost sight of a fundamental truth: The most sophisticated AI in the world is worthless if it doesn’t meaningfully improve human lives.

Consider the story of early search engines. Before Google, companies competed fiercely on the sheer number of web pages indexed. Yet Google prevailed not because it had the biggest database, but because it understood something deeper about human behavior—that relevance and trustworthiness matter more than raw quantity.

AI that builds trust

Today’s AI landscape feels remarkably similar, with companies racing to build bigger models while potentially missing the more nuanced elements of human-centered design that actually drive adoption and impact.

The path to better AI evaluation begins with trust. Emerging research demonstrates that users engage more deeply and persistently with AI systems that clearly explain their reasoning, even when those systems occasionally falter. This makes intuitive sense—trust, whether in technology or humans, grows from transparency and reliability rather than pure performance metrics.

Yet trust is merely the foundation. The most effective AI systems forge genuine emotional connections with users by demonstrating true understanding of human psychology. The research reveals a compelling pattern: When AI systems adapt to users’ psychological needs rather than simply executing tasks, they become integral parts of people’s daily lives. This isn’t about programming superficial friendliness—it’s about creating systems that genuinely comprehend and respond to the human experience.

Trust matters more than technical prowess when it comes to AI adoption. A groundbreaking AI chatbot study of nearly 1,100 consumers found that people are willing to forgive service failures and maintain brand loyalty not based on how quickly an AI resolves their problem, but on whether they trust the system trying to help them.

AI that gets you

The researchers discovered three key elements that build this trust: First, the AI needs to demonstrate a genuine ability to understand and address the issue. Second, it needs to show benevolence—a sincere desire to help. Third, it must maintain integrity through consistent, honest interactions. When AI chatbots embodied these qualities, customers were significantly more likely to forgive service problems and less likely to complain to others about their experience.

How do you make an AI system trustworthy? The study found that simple things make a big difference: anthropomorphizing the AI, programming it to express empathy through its responses (“I understand how frustrating this must be”), and being transparent about data privacy. In one telling example, a customer dealing with a delayed delivery was more likely to remain loyal when a chatbot named Russell acknowledged their frustration and clearly explained both the problem and solution, compared to an unnamed bot that just stated facts.

This insight challenges the common assumption that AI just needs to be fast and accurate. In health care, financial services, and customer support, the most successful generative AI systems aren’t necessarily the most sophisticated—they’re the ones that build genuine rapport with users. They take time to explain their reasoning, acknowledge concerns, and demonstrate consistent value for the user’s needs.

And yet traditional metrics don’t always capture these crucial dimensions of performance. We need frameworks that evaluate AI systems not just on their technical proficiency, but on their ability to create psychological safety, build genuine rapport, and most importantly, help users achieve their goals.

New AI metrics

At Cleo, where we’re focused on improving financial health through an AI assistant, we’re exploring these new measurements. This might mean measuring factors like user trust and the depth and quality of user engagement, as well as looking at entire conversational journeys. It’s important for us to understand if Cleo, our AI financial assistant, can help a user with what they are trying to achieve with any given interaction.

A more nuanced evaluation framework doesn’t mean abandoning performance metrics—they remain vital indicators of commercial and technical success. But they need to be balanced with deeper measures of human impact. That’s not always easy. One of the challenges with these metrics is their subjectivity. That means reasonable humans can disagree on what good looks like. Still, they are worth pursuing.
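To make that balance concrete, here is a minimal sketch of how such a scorecard might be assembled. Every field name, weight, and signal in it (the post-conversation trust rating, the goal-achieved flag, the rapport proxy) is a hypothetical assumption for illustration, not a description of Cleo's actual instrumentation.

```python
from dataclasses import dataclass


@dataclass
class ConversationOutcome:
    """One completed conversation between a user and an AI assistant."""
    latency_seconds: float   # conventional performance signal
    resolved: bool           # did the system finish the task it was asked to do?
    goal_achieved: bool      # hypothetical: did the user reach their own stated goal?
    trust_rating: float      # hypothetical: post-conversation survey score, scaled 0-1
    rapport_signals: int     # hypothetical: count of explanations/acknowledgements given


def human_impact_score(outcome: ConversationOutcome,
                       weights: tuple[float, float, float] = (0.4, 0.4, 0.2)) -> float:
    """Blend goal completion, reported trust, and a rapport proxy into one 0-1 score.

    The weights are illustrative guesses; in practice they would be tuned against
    retention or satisfaction data rather than fixed up front.
    """
    w_goal, w_trust, w_rapport = weights
    rapport = min(outcome.rapport_signals / 3, 1.0)  # cap the crude rapport proxy at 1.0
    return w_goal * outcome.goal_achieved + w_trust * outcome.trust_rating + w_rapport * rapport


def balanced_scorecard(conversations: list[ConversationOutcome]) -> dict[str, float]:
    """Report technical performance and human impact side by side rather than merged."""
    n = len(conversations)  # assumes a non-empty list of conversations
    return {
        "avg_latency_seconds": sum(c.latency_seconds for c in conversations) / n,
        "resolution_rate": sum(c.resolved for c in conversations) / n,
        "avg_human_impact": sum(human_impact_score(c) for c in conversations) / n,
    }
```

The deliberate design choice in this sketch is that the technical figures and the human-impact score are reported side by side rather than collapsed into a single number, mirroring the point above: performance metrics aren't abandoned, only balanced against deeper measures of human impact.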

As AI becomes more deeply woven into the fabric of daily life, the companies that understand this shift will be the ones that succeed. The metrics that got us here won’t be sufficient for where we’re going. It’s time to start measuring what truly matters: not just how well AI performs, but how well it helps humans thrive.

Fernanda Dobal is a product director at Cleo, where she is responsible for the company’s AI and chatbot products.

The opinions expressed in Fortune.com commentary pieces are solely the views of their authors and do not necessarily reflect the opinions and beliefs of Fortune.
