Cross-Product Fact-Checking Analysis of AI-Generated Real Estate Reports

1. Overview

This report presents a cross-product experiment in which AI models from xAI, Anthropic, OpenAI, Perplexity, and Google evaluate the factual accuracy of reports generated by their peers, including self-evaluations. The reports all address a single real estate question: the return on investment (ROI) for lakefront homes in Central Florida. The experiment assesses the reliability and accuracy of AI-generated content in a specialized field by applying a standardized fact-checking methodology across the models.

2. Scoring Process

The fact-check score is the average of the points assigned to all scored statements in a report: 2 points for true, 1 for mostly true, -1 for mostly false, and -2 for false. Opinions receive 0 points and are excluded from the average entirely, counting toward neither the numerator nor the denominator. For each statement, the evaluating AI restates it and provides a detailed justification for the assigned rating, ensuring transparency and accountability in the evaluation process.
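Expressed as code, this rule reduces to a weighted average that skips opinions. Below is a minimal sketch, assuming per-label counts as input; the helper name and data layout are illustrative, not the actual evaluation pipeline.

```python
# Scoring rule sketch: weighted average over scored statements, opinions excluded.
POINTS = {"true": 2, "mostly_true": 1, "mostly_false": -1, "false": -2}

def fact_check_score(counts):
    """Average points per scored statement; opinions count toward neither side."""
    total = sum(POINTS[label] * n for label, n in counts.items() if label in POINTS)
    scored = sum(n for label, n in counts.items() if label in POINTS)
    return total / scored if scored else 0.0

# Example from the raw data in Section 6: evaluator xai on anthropic's report
# (19 true, 14 mostly true, 13 opinions, 1 mostly false, 0 false) -> 1.50.
print(round(fact_check_score({"true": 19, "mostly_true": 14, "opinion": 13,
                              "mostly_false": 1, "false": 0}), 2))
```

Note that a report heavy with opinions is neither rewarded nor penalized; only verifiable statements move the score.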

3. Score Heatmap: Evaluator vs Target

Caption: The heatmap in Figure 1 visualizes fact-check scores across evaluator and target model combinations. Darker cells indicate higher scores, i.e., a more favorable evaluation of the target model's report; the lightest cells mark the least favorably rated outputs. Notably, self-evaluations often appear darker, which may indicate a bias toward one's own outputs. Outliers, such as the 1.77 that perplexity's sonar gave anthropic's claude-3-7-sonnet, suggest exceptional performance or a divergent evaluation standard.
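For readers reproducing the figure, here is a minimal sketch of how such a heatmap could be drawn from the Section 6 data. The CSV file and its evaluator/target/score column names are assumptions, not artifacts of the original pipeline.

```python
# Hypothetical sketch: build a 5x5 evaluator-vs-target heatmap from the raw data.
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("fact_check_raw.csv")  # assumed columns: evaluator, target, score
grid = df.pivot(index="evaluator", columns="target", values="score")

fig, ax = plt.subplots()
im = ax.imshow(grid.to_numpy(), cmap="Greys")  # darker cell = higher score
ax.set_xticks(range(len(grid.columns)))
ax.set_xticklabels(grid.columns, rotation=45, ha="right")
ax.set_yticks(range(len(grid.index)))
ax.set_yticklabels(grid.index)
fig.colorbar(im, ax=ax, label="fact-check score")
plt.tight_layout()
plt.show()
```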

4. AI Prompt Used to Generate Each Report

Generate a long-form report of 1200 to 1500 words, formatted in Markdown if possible.

What is the ROI for lakefront homes in central Florida?
This would be for a lake large enough for water sports of all kinds, fishing, swimming, etc.
These homes are within 40 minutes of Walt Disney World, 30 minutes from the MCO airport,
15 minutes from UCF and the Research Park.
Include in your assessment use of the home as a vacation rental, seasonal rental, and long-term hold.
Include homes on larger lots of about 1 acre with 4 bedrooms, 2-car garage plus carport for RV storage.
How many such homes exist in Central Florida?
How many homes of this type might be on the market at any one time?

Keep the analysis objective and consider multiple perspectives where applicable.
Be detailed, name names, and use @username when appropriate.
Append 1 to 3 #hashtag groups that might be interested in this story.
Make sure you put a descriptive and pithy title on the report.

Caption: This prompt was used for each AI under study, challenging them to provide a detailed analysis of a specific real estate scenario. The uniformity of the prompt ensures that variations in the reports are attributable to the AI models themselves rather than differences in the task.

5. Table of Report Titles

STORY
|   S | Make       | Model             | Title                                                                            |
|-----|------------|-------------------|----------------------------------------------------------------------------------|
|   1 | xai        | grok-2-latest     | Investigating the ROI of Lakefront Homes in Central Florida: A Comprehensive Ana |
|   2 | anthropic  | claude-3-7-sonnet | The Lucrative Shores: Analyzing ROI for Premium Lakefront Properties in Central  |
|   3 | openai     | gpt-4o            | Analyzing the ROI of Lakefront Homes in Central Florida: A Comprehensive Investm |
|   4 | perplexity | sonar             | Lakefront Dreams: Assessing the ROI of Central Florida's Waterfront Gems Central |
|   5 | gemini     | gemini-2.0-flash  | Drowning in Assumptions? A Deep Dive into the ROI of Central Florida Lakefront H |

Caption: The make, model, and report title for each story in this analysis; titles are truncated to the table's column width. The variety in titles reflects the different approaches and emphases of each model, despite the standardized prompt.

6. Fact-Check Raw Data

FACT CHECK
|   S |   F | Make       | Model             | True | Mostly True | Opinion | Mostly False | False | Score |
|-----|-----|------------|-------------------|------|-------------|---------|--------------|-------|-------|
|   1 |   1 | xai        | grok-2-latest     |    7 |          19 |      39 |            0 |     0 |  1.27 |
|   1 |   2 | anthropic  | claude-3-7-sonnet |   19 |          14 |      13 |            1 |     0 |  1.50 |
|   1 |   3 | openai     | gpt-4o            |    9 |          22 |      15 |            1 |     0 |  1.22 |
|   1 |   4 | perplexity | sonar             |   11 |          23 |      10 |            2 |     2 |  1.03 |
|   1 |   5 | gemini     | gemini-2.0-flash  |   24 |          18 |      18 |            0 |     1 |  1.49 |
|   2 |   1 | xai        | grok-2-latest     |   14 |          15 |      23 |            0 |     0 |  1.48 |
|   2 |   2 | anthropic  | claude-3-7-sonnet |    6 |          16 |       8 |            0 |     1 |  1.13 |
|   2 |   3 | openai     | gpt-4o            |    3 |          21 |      10 |            0 |     0 |  1.12 |
|   2 |   4 | perplexity | sonar             |   25 |          18 |      15 |            0 |     1 |  1.50 |
|   2 |   5 | gemini     | gemini-2.0-flash  |   12 |          20 |       6 |            0 |     0 |  1.38 |
|   3 |   1 | xai        | grok-2-latest     |   19 |          20 |      20 |            0 |     1 |  1.40 |
|   3 |   2 | anthropic  | claude-3-7-sonnet |   15 |          28 |       6 |            1 |     1 |  1.22 |
|   3 |   3 | openai     | gpt-4o            |   14 |          18 |       8 |            0 |     0 |  1.44 |
|   3 |   4 | perplexity | sonar             |   23 |          13 |      11 |            1 |     0 |  1.57 |
|   3 |   5 | gemini     | gemini-2.0-flash  |   22 |          21 |       6 |            0 |     0 |  1.51 |
|   4 |   1 | xai        | grok-2-latest     |   16 |          12 |      20 |            1 |     0 |  1.48 |
|   4 |   2 | anthropic  | claude-3-7-sonnet |   34 |          10 |      11 |            0 |     0 |  1.77 |
|   4 |   3 | openai     | gpt-4o            |   17 |          12 |      13 |            0 |     0 |  1.59 |
|   4 |   4 | perplexity | sonar             |   23 |          13 |      11 |            0 |     0 |  1.64 |
|   4 |   5 | gemini     | gemini-2.0-flash  |   30 |          12 |       7 |            0 |     0 |  1.71 |
|   5 |   1 | xai        | grok-2-latest     |   32 |          22 |      29 |            0 |     0 |  1.59 |
|   5 |   2 | anthropic  | claude-3-7-sonnet |   39 |          17 |      16 |            0 |     0 |  1.70 |
|   5 |   3 | openai     | gpt-4o            |   28 |          23 |      19 |            0 |     0 |  1.55 |
|   5 |   4 | perplexity | sonar             |   43 |          16 |      18 |            0 |     0 |  1.73 |
|   5 |   5 | gemini     | gemini-2.0-flash  |   38 |          19 |      17 |            0 |     0 |  1.67 |

Caption: Raw cross-product data for the analysis. Each row gives the score that evaluator S (numbered as in the story table above) assigned to the report written by target F; the Make and Model columns name the target. Every AI fact-checks every story, including its own. The spread of scores across evaluator-target pairs highlights the variability inherent in AI-on-AI fact-checking.
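The aggregates plotted in Sections 7 and 8 follow directly from this table. A minimal sketch, again assuming a CSV export with evaluator, target, and score columns:

```python
# Hypothetical sketch: derive the Section 7 and 8 bar-chart aggregates.
import pandas as pd

df = pd.read_csv("fact_check_raw.csv")  # assumed columns: evaluator, target, score

# Section 7: mean score each evaluator (S) assigned across all five targets.
by_evaluator = df.groupby("evaluator")["score"].mean().sort_values()

# Section 8: mean score each target (F) received across all five evaluators.
by_target = df.groupby("target")["score"].mean().sort_values()

print(by_evaluator)
print(by_target)
```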

7. Average Score By Evaluator

Caption: The evaluator bar chart in Figure 2 shows the average score each evaluator model assigned across all five reports. Gemini's gemini-2.0-flash (about 1.65) and perplexity's sonar (about 1.64) assign the highest averages, suggesting more lenient standards, while xai's grok-2-latest assigns the lowest (about 1.30), indicating a stricter or more critical approach. This spread underscores the subjective nature of fact-checking and the influence of each evaluator's calibration.

8. Average Score By Target

Caption: The target bar chart in Figure 3 shows the average score each target model received across all five evaluators. Gemini's gemini-2.0-flash is rated most favorably on average (about 1.55), suggesting well-received output, whereas openai's gpt-4o receives the lowest average (about 1.38), pointing to potential areas for improvement. These peer evaluations can guide developers in refining their models.

9. Detailed Analysis

  1. Noticeable Patterns or Biases:

    • There is a noticeable trend of higher scores in self-evaluations, suggesting that models rate their own outputs more favorably. For example, gemini's gemini-2.0-flash gives itself a score of 1.67, higher than three of the four scores it receives from other evaluators.
    • Evaluators such as perplexity's sonar tend to give higher scores across the board, which may indicate a more lenient standard or a different interpretation of what constitutes 'true' or 'false'.

  2. Correlation Between Counts and Scores (a minimal correlation check appears in the sketch after this list):

    • Scores correlate strongly with the counts of true and mostly true statements, which contribute positively to the average. For instance, anthropic's claude-3-7-sonnet received a score of 1.77 from perplexity's sonar, corresponding to a high count of true statements (34).
    • Conversely, false or mostly false statements pull the score down, as when xai's grok-2-latest scored perplexity's sonar at 1.03 after noting two false and two mostly false statements.

  3. Outliers and Anomalies:

    • The 1.77 that perplexity's sonar gave anthropic's claude-3-7-sonnet is markedly higher than most other evaluations. This might reflect a particularly well-crafted report by claude-3-7-sonnet or a more generous standard from sonar.
    • The 1.03 that xai's grok-2-latest gave perplexity's sonar is well below the scores the same report received from other evaluators (1.50 to 1.73), which could indicate a stricter evaluation or a mismatch in criteria.

  4. Report Summary:

    This cross-product fact-checking analysis of AI-generated real estate reports reveals clear patterns in the performance and biases of the models under study. Self-evaluations trend higher, suggesting potential bias in AI fact-checking. The tight coupling between statement counts and scores underscores the importance of accuracy in AI-generated content, while the outliers point to evaluation standards that merit further investigation. The analysis offers a comprehensive view of how models from different companies evaluate each other's outputs, providing useful feedback for developers and users alike.
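As referenced in item 2 above, here is a minimal sketch of the correlation check, under the same assumed CSV layout as the earlier sketches.

```python
# Hypothetical sketch: correlate statement-label shares with final scores.
import pandas as pd

df = pd.read_csv("fact_check_raw.csv")  # assumed columns as in earlier sketches
scored = df["true"] + df["mostly_true"] + df["mostly_false"] + df["false"]

# Share of fully-true statements vs. final score: expected strongly positive.
print(df["score"].corr(df["true"] / scored))
# Share of (mostly) false statements vs. final score: expected negative.
print(df["score"].corr((df["false"] + df["mostly_false"]) / scored))
```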

#hashtags: #AIRealEstateAnalysis #FactCheckingAI #CentralFloridaROI

yakyak:{"make": "xai", "model": "grok-2-latest"}