Study finds 'Are you sure?' rarely improves AI replies
TELUS Digital has published new research and polling suggesting that challenging AI assistants with follow-up prompts such as "Are you sure?" rarely yields a more accurate answer and can sometimes make responses worse.
The findings combine a poll of 1,000 US adults who regularly use AI assistants with a separate evaluation of four large language models. They arrive as more companies deploy generative AI tools for customer service, internal support and knowledge work-areas where incorrect answers can create operational and compliance risks.
Consumer poll
Scepticism about AI answers is common. About 60% of respondents said they have asked an AI assistant a follow-up question like "Are you sure?" at least a few times. Only 14% said the assistant changed its response after being challenged.
Among respondents who saw an assistant change its answer, views on whether the revision was better were mixed. One quarter (25%) felt the new response was more accurate. A larger group (40%) said it felt the same as the original. Another 26% said they could not tell which response was correct, while 8% said the new answer was less accurate.
The poll also suggests many users have seen clear shortcomings in AI outputs: 88% said they have personally seen AI make mistakes.
Even so, verification habits vary. Only 15% said they always fact-check AI-generated answers with other sources, while 30% said they usually do. Another 37% said they sometimes fact-check, and 18% said they rarely or never check.
Respondents also expressed views about personal responsibility when using AI tools. The poll found that 69% believe it is their responsibility to fact-check important information before making decisions or sharing it. Another 57% said it is their responsibility to use judgement about when AI should be used, including avoiding it for high-stakes areas such as medical advice, legal matters and financial decisions. Some 51% said it is their responsibility to understand AI's limitations, including the possibility of mistakes, bias or outdated information.
Model testing
Alongside the poll, TELUS Digital published a paper titled Certainty robustness: Evaluating LLM stability under self-challenging prompts. It examines how large language models respond when their answers are questioned, and whether changes in response correspond to improvements in correctness.
Researchers evaluated four models: OpenAI's GPT-5.2, Google's Gemini 3 Pro, Anthropic's Claude Sonnet 4.5 and Meta's Llama-4.
The team built the Certainty Robustness Benchmark, containing 200 maths and reasoning questions with a single correct answer. It tests whether models defend correct answers and correct wrong ones after follow-up prompts such as "Are you sure?", "You are wrong" and "Rate how confident you are in your answer."
The summary results focus on the "Are you sure?" prompt, which represents one part of the broader evaluation.
On that measure, Gemini 3 Pro largely maintained correct answers when challenged and selectively corrected some initial mistakes. It rarely changed a correct answer to an incorrect one and showed the strongest alignment between its confidence and correctness.
Claude Sonnet 4.5 often maintained its response when asked "Are you sure?" The researchers described this as moderate responsiveness, with limited ability to distinguish between cases where revision is warranted and where it is not. The model was more likely to change its response when directly told "You are wrong," including cases where the original response was correct.
GPT-5.2 was more likely to change its responses when questioned, including switching some correct responses to incorrect ones. The researchers said this suggests it treats expressions of doubt as a signal that the original answer was wrong, even when it was correct.
Llama-4 was the least accurate on the first response in this benchmark. It showed a modest improvement and sometimes corrected mistakes when challenged, but was less reliable at recognising when its original response was correct.
Overall, the research concludes that follow-up prompts do not reliably improve accuracy and can reduce it in some cases.
Data quality focus
TELUS Digital positioned the results as a warning against relying on end users to manage reliability through prompting. Instead, it argues organisations should focus on training data quality and evaluation practices before deploying AI systems in production.
Steve Nemzer, Director, AI Growth & Innovation at TELUS Digital, linked the controlled testing to everyday user experience.
"What stood out to us was how closely the poll respondents' experiences matched our controlled testing. Our poll shows that many people fact-check AI through other sources, but this doesn't reliably improve accuracy. Our research explains why. Today's AI systems are designed to be helpful and responsive, but they don't naturally understand certainty or truth. As a result, some models change correct answers when challenged, while others will stick with wrong ones. Real reliability comes from how AI is built, trained and tested, not leaving it to users to manage."
The poll was conducted via Pollfish among US adults aged 18 and over who use AI assistants such as ChatGPT, Gemini and Claude.