Unraveling the Dynamics of AI Storytelling and Fact-Checking: A Cross-Product Experiment

In the evolving landscape of artificial intelligence, the ability of AI systems to generate coherent and factual content is increasingly under scrutiny. This report examines a cross-product experiment involving four AI systems, each tasked with generating a story and then fact-checking the stories produced by all four. The experiment tests not only each system’s capability in storytelling but also its reliability in fact-checking, providing a comprehensive view of performance across diverse topics and narrative styles.

Introduction

The experiment involves four AI systems: OpenAI’s GPT-4o, xAI’s Grok-2-latest, Perplexity’s Sonar, and Anthropic’s Claude-3-7-Sonnet-20250219. Each AI was prompted to generate a story on a specific topic, resulting in four unique narratives. Subsequently, each AI performed a fact-check on every story, including its own, leading to a total of 16 fact-checks.
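
The pairing itself is simply the Cartesian product of stories and fact-checkers. The following minimal sketch illustrates the design (illustrative only, not the actual experiment harness):

from itertools import product

# The four participating systems; each writes one story (Story 1-4).
systems = ["OpenAI", "xAI", "Perplexity", "Anthropic"]
stories = [1, 2, 3, 4]

# Every system fact-checks every story, including its own,
# yielding the 4 x 4 = 16 fact-checks analyzed below.
fact_checks = list(product(stories, systems))
print(len(fact_checks))  # 16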

The scoring system for the fact-checks is based on a scale from +2 to -2, applied to each statement within a story: +2 for True, +1 for Mostly True, 0 for Opinion (which does not affect the average), -1 for Mostly False, and -2 for False. The average score for each fact-check is calculated over all non-Opinion statements, providing a clear measure of factual accuracy.
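
For concreteness, here is a small helper (an illustrative sketch, not code from the experiment itself) that computes this average from the per-label counts; it reproduces the 1.80 that xAI assigned to Story 1 in the table below.

def fact_check_score(true, mostly_true, mostly_false, false):
    """Average statement score on the +2..-2 scale; Opinion statements are excluded."""
    rated = true + mostly_true + mostly_false + false
    if rated == 0:
        return 0.0
    return (2 * true + mostly_true - mostly_false - 2 * false) / rated

# xAI's fact-check of Story 1: 36 True, 9 Mostly True, 0 Mostly False, 0 False (23 Opinion, ignored)
print(round(fact_check_score(36, 9, 0, 0), 2))  # 1.8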

Data Analysis

Cross-Product Table

The cross-product table is a pivotal tool for analyzing the results of this experiment. It displays the fact-check scores of each AI for each story, with the main diagonal representing self-fact-checks. Below is the table derived from the provided JSON data:

Story | Fact-Checker | True | Mostly True | Opinion | Mostly False | False | Score
1     | xAI          | 36   | 9           | 23      | 0            | 0     | 1.80
1     | OpenAI       | 26   | 9           | 9       | 0            | 0     | 1.74
1     | Perplexity   | 33   | 7           | 5       | 0            | 0     | 1.82
1     | Anthropic    | 42   | 13          | 4       | 0            | 0     | 1.76
2     | xAI          | 18   | 5           | 15      | 0            | 0     | 1.78
2     | Anthropic    | 19   | 9           | 2       | 1            | 1     | 1.47
2     | OpenAI       | 11   | 8           | 10      | 0            | 0     | 1.58
2     | Perplexity   | 10   | 7           | 11      | 5            | 1     | 0.87
3     | xAI          | 36   | 5           | 13      | 0            | 0     | 1.88
3     | Anthropic    | 28   | 4           | 4       | 2            | 0     | 1.71
3     | OpenAI       | 17   | 7           | 7       | 0            | 0     | 1.71
3     | Perplexity   | 25   | 3           | 6       | 0            | 0     | 1.89
4     | xAI          | 38   | 10          | 6       | 1            | 1     | 1.66
4     | Anthropic    | 44   | 3           | 1       | 0            | 0     | 1.94
4     | OpenAI       | 25   | 6           | 2       | 0            | 1     | 1.69
4     | Perplexity   | 22   | 9           | 5       | 8            | 1     | 1.07

Insights from the Cross-Product Table

  1. Self-Fact-Checking:

    • The main diagonal of the table shows the scores when an AI fact-checks its own story. Notably, none of the AI systems gave its own story a perfect 2.0, which raises interesting questions about how critically these systems evaluate their own output.
    • For example, OpenAI’s GPT-4o scored its own story (Story 1) at 1.74, rating most of its own statements True or Mostly True but stopping short of full endorsement.
  2. Inter-AI Fact-Checking:

    • The scores vary significantly when different AIs fact-check the same story. For instance, Story 2 received scores ranging from 0.87 (Perplexity) to 1.78 (xAI), suggesting that different AIs apply different standards or interpretations of factual accuracy; the sketch after this list quantifies this spread for each story.
    • This variance could stem from differences in training data, algorithms, or the specific criteria each AI uses for fact-checking.
  3. Story-Specific Trends:

    • Story 4 by Anthropic received the single highest score in the table (1.94, from Anthropic itself), indicating a high level of factual accuracy, although Perplexity rated the same story at only 1.07.
    • Conversely, Story 2 by xAI received the lowest score from Perplexity (0.87), suggesting potential areas of factual weakness or differing interpretations of the story’s content.
  4. Opinion Statements:

    • The number of Opinion statements varies significantly across stories and fact-checkers. For example, Story 1 had the highest number of Opinion statements (23 by xAI), which did not affect the average score but provides insight into the narrative style and the inclusion of subjective content.
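
To quantify the disagreement noted in point 2, the short sketch below computes the per-story spread (highest minus lowest score); the values are transcribed directly from the cross-product table above.

# Spread (max - min) of fact-check scores per story, taken from the cross-product table.
scores = {
    1: {"xAI": 1.80, "OpenAI": 1.74, "Perplexity": 1.82, "Anthropic": 1.76},
    2: {"xAI": 1.78, "OpenAI": 1.58, "Perplexity": 0.87, "Anthropic": 1.47},
    3: {"xAI": 1.88, "OpenAI": 1.71, "Perplexity": 1.89, "Anthropic": 1.71},
    4: {"xAI": 1.66, "OpenAI": 1.69, "Perplexity": 1.07, "Anthropic": 1.94},
}

for story, by_checker in scores.items():
    spread = max(by_checker.values()) - min(by_checker.values())
    print(f"Story {story}: spread = {spread:.2f}")
# Story 1: 0.08, Story 2: 0.91, Story 3: 0.18, Story 4: 0.87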

Visualization and Additional Insights

To further analyze the data, Python can be used to create visualizations that expose additional insights. Below is a Python script that generates a heatmap of the cross-product table and bar charts for each AI’s fact-checking performance:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Data preparation
data = {
    'Story': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4],
    'Fact-Checker': ['xAI', 'OpenAI', 'Perplexity', 'Anthropic', 'xAI', 'Anthropic', 'OpenAI', 'Perplexity', 'xAI', 'Anthropic', 'OpenAI', 'Perplexity', 'xAI', 'Anthropic', 'OpenAI', 'Perplexity'],
    'Score': [1.80, 1.74, 1.82, 1.76, 1.78, 1.47, 1.58, 0.87, 1.88, 1.71, 1.71, 1.89, 1.66, 1.94, 1.69, 1.07]
}

df = pd.DataFrame(data)

# Heatmap
plt.figure(figsize=(10, 8))
heatmap_data = df.pivot(index='Story', columns='Fact-Checker', values='Score')
sns.heatmap(heatmap_data, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Cross-Product Fact-Checking Scores Heatmap')
plt.savefig('heatmap.png')
plt.close()

# Bar charts for each AI's fact-checking performance
for ai in df['Fact-Checker'].unique():
    plt.figure(figsize=(10, 6))
    ai_data = df[df['Fact-Checker'] == ai]
    sns.barplot(x='Story', y='Score', data=ai_data)
    plt.title(f'{ai} Fact-Checking Performance Across Stories')
    plt.savefig(f'{ai}_bar_chart.png')
    plt.close()

Additional Visualizations

  • Box Plots: To compare the distribution of scores each story received across the four fact-checkers, box plots can show the median, quartiles, and potential outliers.
  • Scatter Plots: To examine the relationship between the number of Opinion statements and the fact-checking scores, scatter plots can help identify any correlation or patterns. A minimal sketch of both plots follows.
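
The sketch below rebuilds the scores and Opinion counts from the cross-product table; the 'Opinion' column and the output file names are introduced here purely for illustration.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Scores and Opinion counts transcribed from the cross-product table above.
rows = [
    (1, 'xAI', 1.80, 23), (1, 'OpenAI', 1.74, 9), (1, 'Perplexity', 1.82, 5), (1, 'Anthropic', 1.76, 4),
    (2, 'xAI', 1.78, 15), (2, 'Anthropic', 1.47, 2), (2, 'OpenAI', 1.58, 10), (2, 'Perplexity', 0.87, 11),
    (3, 'xAI', 1.88, 13), (3, 'Anthropic', 1.71, 4), (3, 'OpenAI', 1.71, 7), (3, 'Perplexity', 1.89, 6),
    (4, 'xAI', 1.66, 6), (4, 'Anthropic', 1.94, 1), (4, 'OpenAI', 1.69, 2), (4, 'Perplexity', 1.07, 5),
]
df = pd.DataFrame(rows, columns=['Story', 'Fact-Checker', 'Score', 'Opinion'])

# Box plot: distribution of the four fact-check scores each story received.
plt.figure(figsize=(8, 5))
sns.boxplot(x='Story', y='Score', data=df)
plt.title('Score Distribution per Story Across Fact-Checkers')
plt.savefig('box_plot.png')
plt.close()

# Scatter plot: number of Opinion statements vs. fact-check score.
plt.figure(figsize=(8, 5))
sns.scatterplot(x='Opinion', y='Score', hue='Fact-Checker', data=df)
plt.title('Opinion Statements vs. Fact-Checking Score')
plt.savefig('scatter_plot.png')
plt.close()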

Prompts for AI Storytelling and Fact-Checking

Storytelling Prompts

Crafting effective prompts for AI storytelling is crucial to generating meaningful and coherent narratives. The following are examples of prompts used in this experiment:

  1. Story 1 (OpenAI - GPT-4o): "Write a comprehensive report on
