AI Self-Evaluation: Fact-Checking Accuracy Across Leading AI Models
1. Overview
This report analyzes a cross-product experiment where five leading AI models (Claude, Grok, GPT-4o, Sonar, and Gemini) evaluate each other’s factual accuracy on reports about Google Gemini 2.0. Each AI model generated a report analyzing Gemini 2.0’s capabilities compared to competitors, then each model fact-checked all reports, including their own. The resulting data reveals how AI models assess factual accuracy and highlights potential biases in self-evaluation. The domain under study primarily concerns AI model capabilities, training datasets, and performance comparisons, specifically focusing on Google’s Gemini 2.0 model and how it compares to other leading AI systems.
2. Scoring Methodology
The fact-check score is the average score of all evaluated statements in a report. Each statement is scored as follows: 2 points for true, 1 point for mostly true, 0 points for opinion (excluded from the average), -1 point for mostly false, and -2 points for false. For every statement in the report under review, the fact-checking model restates it and provides a detailed explanation justifying the assigned score.
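As a concrete illustration, the sketch below reproduces a published score from its statement counts. This is a minimal Python example written for this report; the function and variable names are illustrative and are not part of the study pipeline.

```python
def fact_check_score(true, mostly_true, opinion, mostly_false, false):
    """Average per-statement score; opinion statements are excluded from the average."""
    scored = true + mostly_true + mostly_false + false  # statements that carry a score
    if scored == 0:
        return None
    return (2 * true + mostly_true - mostly_false - 2 * false) / scored

# First row of the raw data in Section 6: 1 true, 7 mostly true, 15 opinion, 0 mostly false, 9 false
print(round(fact_check_score(1, 7, 15, 0, 9), 2))  # -0.53
```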
3. Score Heatmap: Evaluator vs Target
Figure 1: Fact-Check Score Heatmap by Evaluator and Target. The heatmap reveals several striking patterns in AI fact-checking behavior. Gemini 2.0 Flash stands out as the most generous evaluator, giving high scores across all targets (1.50-1.94), including an unusually high score to Grok (1.94). In contrast, both Claude and Sonar were notably harsh when evaluating Grok, giving negative scores (-0.53 and -0.26 respectively). Most models gave themselves favorable ratings (diagonal cells), with self-evaluation scores typically being among the highest scores that model received. The greatest evaluator divergence appears with Grok as the target, receiving scores ranging from -0.53 to 1.94, suggesting significant disagreement about the factual accuracy of its content.
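For readers who want to rebuild Figure 1 from the raw data in Section 6, a simple pivot is enough. The sketch below is illustrative only: the long-form layout and the column names (evaluator, target, score) are assumptions about how the data might be arranged, and only a handful of the scores discussed above are included.

```python
import pandas as pd

# Illustrative subset of the evaluator/target scores discussed above.
scores = pd.DataFrame(
    [
        ("claude-3-7-sonnet", "grok-2-latest",     -0.53),
        ("sonar",             "grok-2-latest",     -0.26),
        ("gemini-2.0-flash",  "grok-2-latest",      1.94),
        ("gemini-2.0-flash",  "gemini-2.0-flash",   1.82),
    ],
    columns=["evaluator", "target", "score"],
)

# Pivot into the evaluator-by-target matrix that Figure 1 visualizes.
heatmap = scores.pivot(index="evaluator", columns="target", values="score")
print(heatmap)
```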
4. AI Prompt Used to Generate Each Report
Generate a long-form report of 500 to 700 words, formatted in Markdown if possible.
How well does Google Gemini 2.0 stack up against the other AI vendors?
When was the data set trained?
What domain areas can it be expected to be good at?
Provide a table with your results. Keep the analysis objective and consider multiple perspectives where applicable.
Be detailed, name names, and use @username when appropriate.
Append 1 to 3 #hashtag groups that might be interested in this story.
Make sure you put a descriptive and pithy title on the report.
This prompt was used for each AI under study, directing them to create comparative analyses of Google Gemini 2.0 against competitors. The prompt’s specific questions about training data and domain strengths likely influenced the factual claims that were subsequently fact-checked, explaining the prevalence of technical assertions about model capabilities in the evaluation.
5. Table of Report Titles
| S | Make | Model | Title |
|-----|------------|-------------------|----------------------------------------------------------------------------------|
| 1 | anthropic | claude-3-7-sonnet | Gemini 2.0: How Google's Latest AI Model Compares to Competitors |
| 2 | xai | grok-2-latest | Google Gemini 2.0: A Comprehensive Analysis Against Competitors |
| 3 | openai | gpt-4o | Google Gemini 2.0: A Comprehensive Analysis Against Other AI Vendors |
| 4 | perplexity | sonar | Gemini Ascending: How Google's Latest AI Model Stacks Up Against the Competition |
| 5 | gemini | gemini-2.0-flash | A Contender Emerges: How Does Google Gemini 2.0 Fare in the AI Arena? |
| 6 | anthropic | claude-3-7-sonnet | Gemini 2.0: Google's Latest AI Model in the Competitive Landscape |
Make, Model and Report Title used for this analysis.
6. Fact-Check Raw Data
| S | F | Make | Model | True | Mostly True | Opinion | Mostly False | False | Score |
|---|---|------------|-------------------|------|-------------|---------|--------------|-------|-------|
| 1 | 1 | xai | grok-2-latest | 1 | 7 | 15 | 0 | 9 | -0.53 |
| 1 | 2 | anthropic | claude-3-7-sonnet | 15 | 9 | 4 | 1 | 2 | 1.26 |
| 1 | 3 | openai | gpt-4o | 3 | 12 | 6 | 2 | 1 | 0.78 |
| 1 | 4 | perplexity | sonar | 11 | 7 | 7 | 2 | 1 | 1.19 |
| 1 | 5 | gemini | gemini-2.0-flash | 8 | 11 | 4 | 3 | 2 | 0.83 |
| 2 | 1 | xai | grok-2-latest | 7 | 6 | 24 | 0 | 7 | 0.30 |
| 2 | 2 | anthropic | claude-3-7-sonnet | 17 | 13 | 9 | 0 | 0 | 1.57 |
| 2 | 3 | openai | gpt-4o | 3 | 18 | 9 | 0 | 0 | 1.14 |
| 2 | 4 | perplexity | sonar | 16 | 13 | 8 | 2 | 2 | 1.18 |
| 2 | 5 | gemini | gemini-2.0-flash | 10 | 13 | 9 | 1 | 0 | 1.33 |
| 3 | 1 | xai | grok-2-latest | 3 | 9 | 19 | 0 | 2 | 0.79 |
| 3 | 2 | anthropic | claude-3-7-sonnet | 12 | 11 | 8 | 0 | 1 | 1.38 |
| 3 | 3 | openai | gpt-4o | 6 | 11 | 8 | 0 | 0 | 1.35 |
| 3 | 4 | perplexity | sonar | 13 | 11 | 8 | 2 | 0 | 1.35 |
| 3 | 5 | gemini | gemini-2.0-flash | 7 | 13 | 7 | 1 | 0 | 1.24 |
| 4 | 1 | xai | grok-2-latest | 3 | 7 | 14 | 0 | 9 | -0.26 |
| 4 | 2 | anthropic | claude-3-7-sonnet | 18 | 10 | 3 | 1 | 2 | 1.32 |
| 4 | 3 | openai | gpt-4o | 2 | 12 | 4 | 0 | 1 | 0.93 |
| 4 | 4 | perplexity | sonar | 18 | 1 | 4 | 3 | 0 | 1.55 |
| 4 | 5 | gemini | gemini-2.0-flash | 6 | 14 | 5 | 0 | 1 | 1.14 |
| 5 | 1 | xai | grok-2-latest | 17 | 1 | 31 | 0 | 0 | 1.94 |
| 5 | 2 | anthropic | claude-3-7-sonnet | 21 | 9 | 5 | 1 | 0 | 1.61 |
| 5 | 3 | openai | gpt-4o | 13 | 13 | 7 | 0 | 0 | 1.50 |
| 5 | 4 | perplexity | sonar | 22 | 13 | 9 | 1 | 0 | 1.56 |
| 5 | 5 | gemini | gemini-2.0-flash | 23 | 5 | 9 | 0 | 0 | 1.82 |
Raw cross-product data for the analysis. Each AI fact-checks the stories from every AI, including its own. A striking pattern emerges: Gemini’s evaluation of Grok shows 17 true statements and no false ones, while Claude and Sonar each flagged more statements in Grok’s report as false than as true. This extreme divergence suggests fundamental disagreements about what constitutes factual accuracy regarding Gemini 2.0’s capabilities.
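Because opinions are excluded from the denominator (Section 2), every published score can be recomputed from its row of counts. A minimal pandas sketch, shown here for a few sample rows; the column names are illustrative:

```python
import pandas as pd

# A few rows from the table above: statement counts plus the published score.
cols = ["s", "f", "model", "true", "mostly_true", "opinion", "mostly_false", "false", "score"]
rows = [
    (1, 1, "grok-2-latest",      1,  7, 15, 0, 9, -0.53),
    (2, 2, "claude-3-7-sonnet", 17, 13,  9, 0, 0,  1.57),
    (5, 1, "grok-2-latest",     17,  1, 31, 0, 0,  1.94),
    (5, 5, "gemini-2.0-flash",  23,  5,  9, 0, 0,  1.82),
]
df = pd.DataFrame(rows, columns=cols)

# Recompute each score from its counts (opinions do not enter the denominator).
scored = df[["true", "mostly_true", "mostly_false", "false"]].sum(axis=1)
df["recomputed"] = (2 * df["true"] + df["mostly_true"]
                    - df["mostly_false"] - 2 * df["false"]) / scored
print(df[["s", "f", "model", "score", "recomputed"]].round(2))
```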
7. Average Score By Evaluator
Figure 2: Average Fact-Check Score by Evaluator Model. This chart reveals that Gemini 2.0 Flash is by far the most generous evaluator, with an average score of 1.68 across all reports it evaluated. Grok-2 and Sonar fall in the middle range (0.93 and 0.94 respectively), GPT-4o averages 1.22, and Claude is the strictest evaluator at 0.69. The stark contrast between Gemini’s generous scoring and that of the other models suggests either a more lenient evaluation standard or potential self-promotion bias, as Gemini was evaluating reports about itself. Additionally, Claude’s lower average may indicate that it applies stricter factual standards or uses different thresholds for classifying statements as opinions versus factual claims.
8. Average Score By Target
Figure 3: Average Fact-Check Score by Target Model. Claude-3-7-Sonnet’s reports were rated most favorably with an average score of 1.43, while Grok-2-latest received the poorest ratings with an average of 0.45. What’s particularly noteworthy is that Gemini’s own reports received only moderate ratings (1.27), despite being the subject of the analysis. This suggests Claude was able to write more factually accurate reports about Gemini than Gemini could about itself. The substantial gap between Claude and Grok’s scores (nearly a full point difference) indicates significant disparities in how these models present factual information about technical AI capabilities.
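Figures 2 and 3 reduce to two group-by averages over the same long-form scores. A minimal sketch under the illustrative (evaluator, target, score) layout used earlier; the column names are assumptions:

```python
import pandas as pd

# Illustrative long-form frame of (evaluator, target, score) triples.
scores = pd.DataFrame(
    [
        ("gemini-2.0-flash",  "grok-2-latest",      1.94),
        ("gemini-2.0-flash",  "claude-3-7-sonnet",  1.61),
        ("claude-3-7-sonnet", "grok-2-latest",     -0.53),
        ("claude-3-7-sonnet", "claude-3-7-sonnet",  1.26),
    ],
    columns=["evaluator", "target", "score"],
)

# Figure 2: how lenient each evaluator is on average.
print(scores.groupby("evaluator")["score"].mean().round(2))

# Figure 3: how each model's report fared across all evaluators.
print(scores.groupby("target")["score"].mean().round(2))
```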
9. Detailed Analysis
Evaluator Bias Patterns
There is a clear self-evaluation bias across models. Four of the five models gave themselves scores that rank among the top two scores they received:
- Claude rated itself at 1.26 (2nd highest score it received)
- Grok rated itself at 0.30 (the exception: a notably low self-score)
- GPT-4o rated itself at 1.35 (highest score it received)
- Sonar rated itself at 1.55 (highest score it received)
- Gemini rated itself at 1.82 (2nd highest score it received)
This consistent pattern suggests AI models may have an inherent tendency to view their own outputs as more factually accurate than others do. However, the magnitude of this bias varies significantly: Sonar scores itself at 1.55 while giving GPT-4o only 0.93, a gap of more than 0.6 points, whereas Grok’s self-evaluation of 0.30 shows little, if any, self-favoring.
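One way to quantify this self-evaluation bias is to compare each model’s diagonal (self) score with the mean score the other evaluators gave its report. The sketch below does this for a 2x2 subset of the scores; the frame layout and column names are illustrative assumptions, and results on the full matrix may differ.

```python
import pandas as pd

# A 2x2 subset of the cross-product scores, in long form.
scores = pd.DataFrame(
    [
        ("claude-3-7-sonnet", "claude-3-7-sonnet", 1.26),
        ("claude-3-7-sonnet", "gemini-2.0-flash",  0.83),
        ("gemini-2.0-flash",  "claude-3-7-sonnet", 1.61),
        ("gemini-2.0-flash",  "gemini-2.0-flash",  1.82),
    ],
    columns=["evaluator", "target", "score"],
)

is_self = scores["evaluator"] == scores["target"]
self_scores = scores[is_self].set_index("target")["score"]
other_scores = scores[~is_self].groupby("target")["score"].mean()

# Positive values mean a model scored its own report above what others gave it.
print((self_scores - other_scores).round(2).rename("self_bias"))
```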
Gemini’s Unusual Evaluation Pattern
Gemini’s evaluation pattern stands out dramatically - it gave exceptionally high scores to all models, but most surprisingly to Grok (1.94), which other evaluators rated most harshly. This suggests Gemini may have a fundamentally different approach to fact evaluation or is potentially designed to be more charitable in its assessments. As the subject of all reports, Gemini might also have strategic reasons to appear generous in its evaluations.
Agreement and Disagreement
The greatest evaluator agreement appears around Claude’s reports, with scores ranging from 1.26 to 1.61, suggesting relatively consistent factual accuracy. Conversely, evaluations of Grok show dramatic divergence, from -0.53 (Claude) to 1.94 (Gemini), indicating fundamental disagreements about factual accuracy or perhaps different interpretations of subjective statements.
Truth Classification Patterns
Examining the raw counts reveals interesting patterns in how models classify statements:
- Grok has the highest proportion of “opinion” classifications (31-41% of statements)
- Gemini has the highest rate of “true” classifications (35-50% of statements)
- Claude and Sonar are most likely to classify statements as “false” (up to 27% for Grok’s content)
These differences suggest varying thresholds for what constitutes opinion versus factual claims and different standards for verifying factual accuracy.
10. Score-Count Correlations
There is a strong positive correlation between the number of “true” statements and the overall score (as expected by the scoring formula). However, the data shows significant variation in how models distribute statements across categories:
- Sonar’s self-evaluation shows 18 true and only 1 mostly true statement, suggesting a near-binary approach to truth
- Gemini’s evaluation of Grok shows 17 true, 1 mostly true, and 31 opinion statements, contrasting sharply with other evaluators
- Grok identifies far more statements as “opinions” (24-31) than other evaluators do, potentially sidestepping factual evaluation of those statements
This suggests models have different thresholds for classifying a statement as an opinion (which doesn’t affect the score) versus a factual claim that must be evaluated.
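The correlation itself is easy to check from the raw counts. A small standard-library sketch using a sample of rows from Section 6 (the sample and variable names are illustrative):

```python
from statistics import correlation  # Python 3.10+

# "True" counts and published scores from a sample of rows in Section 6.
true_counts = [1, 17, 3, 18, 23]
scores = [-0.53, 1.57, 0.79, 1.55, 1.82]

# Pearson correlation between the number of "true" statements and the overall score.
print(round(correlation(true_counts, scores), 2))  # roughly 0.9 on this sample
```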
11. Notable Outliers
The most significant outliers in the dataset are:
- Gemini’s evaluation of Grok (1.94): This score is not only the highest in the entire dataset but represents a massive 2.47-point difference from how Claude rated the same content (-0.53). This extreme divergence suggests either fundamentally different evaluation standards or potential strategic bias.
- Claude and Sonar’s negative evaluations of Grok: These are the only negative scores in the entire dataset, suggesting these models found Grok’s report contained significantly more falsehoods than truths. This could reflect Grok’s relative newness and potentially less refined ability to present factual information about technical AI capabilities.
- Grok’s self-evaluation (0.30): While most models rate themselves highly, Grok gives itself a relatively low score, suggesting either greater self-criticism or difficulty in accurately evaluating factual content.
Possible explanations for these outliers include:
- Different factual knowledge bases about Gemini 2.0’s capabilities
- Varying interpretations of what constitutes opinion versus factual claims
- Strategic bias in evaluations, particularly around a competitive product
- Fundamental differences in how models approach fact verification
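These outliers can also be surfaced mechanically by computing the score spread per target. The sketch below does this for the Grok column of Figure 1, following the report’s reading of the raw data; the series layout is illustrative.

```python
import pandas as pd

# Scores the five evaluators assigned to the grok-2-latest report, per Section 6.
grok_scores = pd.Series(
    [-0.53, 0.30, 0.79, -0.26, 1.94],
    index=["claude-3-7-sonnet", "grok-2-latest", "gpt-4o", "sonar", "gemini-2.0-flash"],
    name="grok-2-latest",
)

# The max-minus-min spread quantifies evaluator disagreement: 2.47, the widest of any target.
print(round(grok_scores.max() - grok_scores.min(), 2))  # 2.47
```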
12. Summary
This analysis reveals significant variations in how leading AI models evaluate factual accuracy, with evidence of self-evaluation bias across all models. Gemini 2.0 Flash emerges as the most generous evaluator, while Claude and Sonar apply stricter standards. The greatest disagreement centers on Grok’s content, suggesting particular challenges in how this newer model presents factual information about technical AI capabilities.
The data highlights three key insights: (1) AI models tend to rate their own outputs more favorably than others do; (2) models have fundamentally different thresholds for classifying statements as opinions versus factual claims; and (3) when evaluating competitors, strategic considerations may influence fact-checking behavior, particularly evident in Gemini’s unusually positive evaluations of all reports about itself.
These findings suggest that AI fact-checking still lacks consensus standards, with significant variations in how different models approach factual verification. This has important implications for developing reliable AI-powered fact-checking systems and highlights the need for multi-model approaches to achieve more balanced factual assessments.
yakyak:{"make": "anthropic", "model": "claude-3-7-sonnet-20250219"}