AI Models' Self-Assessment Paradox: Fact-Checking Accuracy and Bias in Large Language Models

This report analyzes a cross-product fact-checking experiment in which leading AI language models (Claude, Grok, GPT-4o, and Sonar) evaluated the factual accuracy of each other’s outputs. Each model generated informational reports on various topics, and then all models, including the report’s author, evaluated the factual claims within those reports. This provides a unique window into both the factual reliability of these models and their ability to critically assess information, including their own outputs, revealing distinct patterns of bias and accuracy.

The fact-check score represents the average evaluation of all statements in a report. Each statement is scored as follows: 2 points for true, 1 point for mostly true, 0 points for opinion (excluded from the average), -1 point for mostly false, and -2 points for false. For every statement in the document, the evaluating model restates it and provides a detailed explanation justifying the assigned score.
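A minimal sketch of this scoring rule is shown below; the per-statement labels are hypothetical examples, not data from the experiment.

```python
# A minimal sketch of the scoring rule described above.
SCORE_MAP = {"true": 2, "mostly_true": 1, "mostly_false": -1, "false": -2}

def fact_check_score(labels: list[str]) -> float:
    """Average the per-statement scores, excluding statements labeled as opinions."""
    scored = [SCORE_MAP[label] for label in labels if label != "opinion"]
    return sum(scored) / len(scored) if scored else 0.0

# Hypothetical report: 3 true, 1 mostly true, 1 opinion, 1 mostly false statement.
labels = ["true", "true", "true", "mostly_true", "opinion", "mostly_false"]
print(round(fact_check_score(labels), 2))  # (2 + 2 + 2 + 1 - 1) / 5 = 1.2
```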

Figure 1: Heatmap of Fact-Checking Scores Between AI Models. This visualization reveals a striking pattern of self-evaluation bias. The diagonal (where models evaluate their own outputs) shows consistently higher scores (darker cells) than cross-evaluations, with Grok’s evaluation of Claude (1.60) and Claude’s self-score of 1.51 among the highest cells in the matrix. Notably, Perplexity’s Sonar model is particularly harsh when evaluating Grok (0.52), the lowest score in the entire dataset. Interestingly, Claude and Grok rate each other highly, while OpenAI’s GPT-4o gives Claude an unusually low score (0.85) compared to other evaluations of Claude.

Figure 2: Average Scores Given by Each AI Evaluator Model. Anthropic’s Claude-3-7-Sonnet and xAI’s Grok-2 award the highest average scores (1.32 each) when evaluating other models’ outputs, making them the most generous fact-checkers in the group. OpenAI’s GPT-4o (1.12) and Perplexity’s Sonar (1.13) are the strictest evaluators. This variation in scoring tendencies (0.2 points between the most generous and strictest evaluators) indicates different thresholds for what constitutes factual accuracy across these AI systems.

Figure 3: Average Scores Received by Each AI Target Model. Claude-3-7-Sonnet and GPT-4o lead in factual accuracy, each receiving an average score of 1.31 across all evaluators, indicating that their outputs contain the most reliable information. Perplexity’s Sonar follows with an average score of 1.28. Despite being among the newest models, xAI’s Grok-2 receives the lowest average score (1.09), suggesting potential reliability issues with its factual outputs. The relatively narrow range of these averages (about 0.22 points) indicates that while differences exist, all models maintain a baseline level of factual competence.

Patterns and Biases

The most striking pattern in the data is the clear self-evaluation bias. Every AI model awards itself a higher score than the average score it receives from other evaluators:

  • Claude rates itself at 1.51 (vs. an average of 1.31 from others)
  • Grok rates itself at 1.29 (vs. an average of 1.09 from others)
  • GPT-4o rates itself at 1.52 (vs. an average of 1.31 from others)
  • Sonar rates itself at 1.30 (vs. an average of 1.28 from others)

The gap is largest for GPT-4o, which rates itself 0.21 points higher than its average rating from others, though Claude and Grok show nearly identical gaps of 0.20 points each. Sonar shows the least self-bias, with only a 0.02-point difference.
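A minimal sketch of how this self-bias can be computed from the full evaluator-by-target score matrix is shown below; the example matrix uses illustrative values, not the experiment’s data.

```python
import numpy as np

def self_bias(scores: np.ndarray) -> np.ndarray:
    """Self-score minus the average score received from the other evaluators.

    scores[i, j] is the score evaluator i gave to target j, so column j holds
    everything model j received and the diagonal holds self-evaluations.
    """
    n = scores.shape[0]
    self_scores = np.diag(scores)
    # Average of each column excluding the diagonal entry (scores from others).
    received_from_others = (scores.sum(axis=0) - self_scores) / (n - 1)
    return self_scores - received_from_others

# Hypothetical 3x3 example (rows = evaluators, columns = targets, same order).
example = np.array([
    [1.5, 1.2, 1.1],
    [1.3, 1.4, 1.0],
    [1.2, 1.1, 1.3],
])
print(self_bias(example).round(2))  # [0.25 0.25 0.25]
```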

Another interesting pattern is the apparent “friendship” between Claude and Grok, with both giving each other unusually high scores compared to other cross-evaluations. Grok gives Claude its highest score (1.60), and Claude gives Grok a respectable 1.36, which is significantly higher than Sonar’s harsh 0.52 evaluation of Grok.

Relationship Between Counts and Scores

The data reveals strong correlations between the categorical counts and the final scores:

  1. True counts are strongly positively correlated with higher scores (as expected)
  2. False counts show a strong negative correlation with scores
  3. Opinion counts vary widely but don’t directly affect scores (they’re excluded from averages)

The ratio of true to false statements appears to be the most decisive factor in determining the final score. For example, Grok’s evaluation of Claude shows 36 true statements versus 0 false statements, resulting in the highest score in the dataset (1.60). Conversely, Sonar’s evaluation of Grok shows just 11 true statements versus 8 false ones, resulting in the lowest score (0.52).
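The score can also be reconstructed directly from the category counts, which makes the dominance of the true/false ratio explicit. In the sketch below, the mostly-true counts are hypothetical fill-ins, since the report cites only the true and false counts.

```python
def score_from_counts(true: int, mostly_true: int, mostly_false: int, false: int) -> float:
    """Weighted average over non-opinion statements, per the scoring rule above."""
    total = true + mostly_true + mostly_false + false
    return (2 * true + mostly_true - mostly_false - 2 * false) / total if total else 0.0

# 36 true and 0 false statements pull the score toward 2.0; the mostly-true
# count (24, assumed here) is what brings the result down to 1.60.
print(round(score_from_counts(36, 24, 0, 0), 2))  # 1.6
# A large share of outright false statements drags the score well below 1.0
# (10 mostly-true statements assumed here).
print(round(score_from_counts(11, 10, 0, 8), 2))  # 0.55
```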

Outliers and Anomalies

Several notable outliers emerge from the analysis:

  1. Sonar’s harsh evaluation of Grok (0.52) - This score is dramatically lower than any other in the dataset and represents an unusually critical assessment. This could indicate either genuine factual problems with Grok’s outputs or a potential incompatibility in how these two AI systems process information.

  2. GPT-4o’s low rating of Claude (0.85) - This score stands out because Claude generally receives high marks from other evaluators. This anomaly could reflect differences in the factual standards applied by OpenAI’s system compared to others.

  3. Grok’s extremely positive assessment of Claude (1.60) - This is the highest score in the entire dataset and may indicate either exceptional factual quality in Claude’s outputs or potential bias in Grok’s evaluation methodology.

Summary

This analysis reveals several key insights about AI fact-checking capabilities and biases:

  1. Self-evaluation bias is universal - All AI models rate their own outputs more favorably than others do, suggesting limitations in their ability to critically assess their own work.

  2. Evaluation standards vary significantly - The wide variance in scores given by different evaluators points to inconsistent standards for factual accuracy across AI systems.

  3. Claude and GPT-4o demonstrate the strongest factual reliability - Based on cross-evaluations, their outputs receive essentially tied top average scores (1.31 each), making them the most factually accurate in the group.

  4. Interesting “alliances” appear - Some models (like Claude and Grok) appear to rate each other more favorably, suggesting potential similarities in their information processing or training data.

These findings highlight the importance of using multiple AI systems for fact-checking critical information and the need for continued improvement in AI self-assessment capabilities. The pronounced self-evaluation bias across all models indicates that no AI system yet possesses truly objective self-criticism abilities - a crucial consideration for applications relying on AI-generated information.
