Blackpearl unveils GTM-Bench for AI sales evaluation

Tue, 23rd Jun 2026

Blackpearl Group has launched GTM-Bench, a benchmark for measuring AI systems in sales and prospecting workflows. The tool evaluates whether AI agents create commercial value.

It tests leading systems from OpenAI, Anthropic, Google, DeepSeek and others on real-world go-to-market tasks using both public and proprietary data. Results showed that four of six leading AI sales agents produced negative overall scores, suggesting poor results can outweigh any useful output.

The project focuses on what Blackpearl calls buyer and seller coherence: whether an AI system can understand what a seller offers, identify likely buyers and return prospect records that are relevant and backed by evidence. It covers 72 tasks, 11 task types and 15 market categories, built from 59,881 prospecting queries.

Seven systems were tested in total, including six general-purpose models from major AI developers and Blackpearl's own Pearl Engine RTSA system, which it described as purpose-built for go-to-market work.

Outcome over volume

The scoring system rewards a good lead with +1 and penalises a bad lead with -1. Blackpearl said this is meant to reflect the commercial cost of low-quality prospecting, including wasted sales time, budget and clutter in customer relationship management systems.
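As a rough illustration of the +1/-1 rule described above, the Python sketch below sums a net score over a list of leads. The function name and the example lists are hypothetical and are not taken from the GTM-Bench code. The published net scores are fractional, so the benchmark presumably applies weighting or averaging that is not described here.

```python
from typing import Iterable


def net_score(lead_is_good: Iterable[bool]) -> int:
    """Return +1 for every good lead and -1 for every bad lead, summed."""
    return sum(1 if good else -1 for good in lead_is_good)


# Hypothetical example data: a short, precise list versus a long, noisy one.
precise = [True] * 40 + [False] * 10    # 50 records, 80% good
noisy = [True] * 300 + [False] * 700    # 1,000 records, 30% good

print(net_score(precise))  # 30
print(net_score(noisy))    # -400
```

Under this rule, the noisy list contains far more good leads than the precise one yet still ends up with a negative score, which is the volume-versus-outcome effect the benchmark is meant to expose.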

In one headline finding, an AI agent generated 6,342 prospect records for a single task. Blackpearl argued this illustrates a wider problem in AI sales tools, where volume is often treated as a sign of success even when the underlying records are weak.

Nick Lissette, Chief Executive Officer of Blackpearl, said the benchmark was designed to shift attention from activity to outcomes. "The AI industry has become obsessed with output. It has spent far less time measuring outcomes. Poor-quality agentic AI doesn't simply fail to find opportunities - it empowers agents to consume budgets, waste sales hours, pollute CRM systems and send organisations chasing customers who were never likely to buy. Put bluntly, the research shows that bad AI may be worse than no AI at all," Lissette said.

Across 432 agent traces, the stronger systems cast a wide net before narrowing results using evidence, according to Blackpearl. Weaker systems returned large numbers of records with less discipline in filtering and verification.

Mixed results

The published figures show a wide spread in performance. Blackpearl's RTSA recorded a net score of +26,615.6, while GPT-5.5 scored +4,040.9 when given access to Blackpearl's proprietary data and +1,015.4 when restricted to publicly available web evidence.

Even so, no model led every category. GPT-5.5 outperformed Blackpearl RTSA in several markets, including healthcare, recruiting, industrial and real estate, while Blackpearl's system was weaker in public sector and sustainability tasks.

That variation suggests businesses cannot assume one AI system will perform best across every sales environment. The findings also point to the importance of testing models against specific commercial tasks rather than relying on general claims or headline productivity figures.

Lissette said the pattern reflects a broader shift towards specialised AI systems in industry. "This is a common pattern across Vertical AI, which is where very specific AI models, agents and systems are created to answer the needs of a particular industry vertical. The best known examples of this are Harvey for legal AI, Cursor for coding AI and Tempus health AI - all of these have built AI systems on top of leading foundational models to achieve outstanding results in their field. Blackpearl is doing the same for go-to-market AI," he said.

Data and design

Blackpearl also used the benchmark to examine the value of proprietary data. GPT-5.5's score improved almost fourfold when it had access to Blackpearl's internal go-to-market data rather than relying only on public web sources.

That gain was still well below the result achieved by Blackpearl's own system in the same data environment, which Blackpearl said shows that data alone does not explain the gap. Instead, it argues that the design of task-specific AI agents has a large effect on sales outcomes.

"Put simply... if you add great data to foundational models, you get results that are four times better. But then if you go further and put go-to-market vertical AI on top of that, you get a further six times better results. When you combine the two, the results are twenty-six times better," Lissette said.
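The multiples in the quote can be checked against the net scores reported earlier in the article. The short Python calculation below is an illustrative sketch: the variable names are assumptions, and only the three scores come from the published figures.

```python
# Net scores as reported in the article.
public_web = 1015.4     # GPT-5.5, public web evidence only
proprietary = 4040.9    # GPT-5.5 with Blackpearl's proprietary data
rtsa = 26615.6          # Blackpearl RTSA in the same data environment

data_gain = proprietary / public_web    # effect of adding proprietary data
design_gain = rtsa / proprietary        # effect of the purpose-built system
combined_gain = rtsa / public_web       # both effects together

print(f"data gain:     {data_gain:.2f}x")      # 3.98x
print(f"design gain:   {design_gain:.2f}x")    # 6.59x
print(f"combined gain: {combined_gain:.2f}x")  # 26.21x
```

On these figures the data gain is about 4.0x, the design gain about 6.6x and the combined gain about 26.2x, so the quoted multiples are rounded.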

Blackpearl has made the benchmark's methodology, code, tasks and results public, including the evaluation system and run artefacts. It said that level of disclosure was necessary given that it both developed the benchmark and performed strongly in the testing.

Max Polaczuk, Vice President of AI at Blackpearl, addressed that point directly. "Our answer is transparency. Every task, every line of evaluation code and every run artefact is public. Anyone can re-run the experiments and challenge the findings. We hope people do, because that's how benchmarks improve," Polaczuk said.

Image: Nick Lissette and Max Polaczuk