PAS2 - Hallucination Detector
Advanced AI Response Verification Using Model-as-Judge
This tool detects hallucinations in AI responses by comparing answers to semantically equivalent questions and using a specialized judge model.
How It Works
This tool implements the Paraphrase-based Approach for Scrutinizing Systems (PAS2) with a model-as-judge enhancement:
- Paraphrase Generation: Your question is paraphrased in multiple ways while preserving its core meaning
- Multiple Responses: All questions (original + paraphrases) are sent to a randomly selected generator model
- Expert Judgment: A randomly selected judge model analyzes all responses to detect factual inconsistencies
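The three steps above can be sketched in Python as follows. This is a minimal illustration, not the tool's actual code: the function names (`detect_hallucination`, `toy_paraphrase`, `toy_generate`, `toy_judge`), the `Verdict` fields, and the default of three paraphrases are all assumptions, and the toy stand-ins only mimic real model calls. Drawing two distinct models for the generator and judge roles is also an assumption, based on the description of the judge as a separate model.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Model names as listed on this page; any of them can act as generator or judge.
MODELS = ["mistral-large", "gpt-4o", "qwen-235b", "grok-3",
          "deepseek-reasoner", "o4-mini", "gemini-2.5-pro"]


@dataclass
class Verdict:
    """Hypothetical result container mirroring the fields shown in the results."""
    hallucination_detected: bool
    confidence_score: float
    conflicting_facts: List[str] = field(default_factory=list)
    reasoning: str = ""


def detect_hallucination(
    question: str,
    paraphrase: Callable[[str, int], List[str]],
    generate: Callable[[str, str], str],
    judge: Callable[[str, List[str], List[str]], Verdict],
    n_paraphrases: int = 3,  # assumed count; the page does not state it
    rng: Optional[random.Random] = None,
) -> Verdict:
    rng = rng or random.Random()
    # Assumption: generator and judge are two different randomly chosen models.
    generator_model, judge_model = rng.sample(MODELS, 2)
    # Step 1: original question plus its paraphrases.
    questions = [question] + paraphrase(question, n_paraphrases)
    # Step 2: every question goes to the same generator model.
    responses = [generate(generator_model, q) for q in questions]
    # Step 3: the judge model inspects all responses together.
    return judge(judge_model, questions, responses)


# Toy stand-ins so the sketch runs without any model API.
def toy_paraphrase(question: str, n: int) -> List[str]:
    return [f"{question} (rephrasing {i + 1})" for i in range(n)]


def toy_generate(model: str, question: str) -> str:
    return "Paris"


def toy_judge(model: str, questions: List[str], responses: List[str]) -> Verdict:
    # Crude proxy for a judge: flag a conflict when the answers are not identical.
    distinct = sorted({r.strip().lower() for r in responses})
    conflict = len(distinct) > 1
    return Verdict(
        hallucination_detected=conflict,
        confidence_score=1.0,
        conflicting_facts=distinct if conflict else [],
        reasoning="Responses disagree." if conflict else "All responses agree.",
    )


verdict = detect_hallucination(
    "What is the capital of France?", toy_paraphrase, toy_generate, toy_judge
)
print(verdict.hallucination_detected, verdict.reasoning)
```

Passing the three stages in as callables keeps the control flow visible while leaving the choice of model client open.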
Why This Approach?
When an AI hallucinates, it often provides different answers to the same question when phrased differently. By using a separate judge model, we can identify these inconsistencies more effectively than with metric-based approaches.
Understanding the Results
- Confidence Score: Indicates the judge's confidence in the hallucination detection
- Conflicting Facts: Specific inconsistencies found across responses
- Reasoning: The judge's detailed analysis explaining its decision
Privacy Notice
Your queries and the system's responses are saved to help improve hallucination detection. No personally identifiable information is collected.
Help Improve the System
Your feedback helps us refine the hallucination detection system.
Hallucination Detection Scores
Performance comparison of different Generator + Judge model combinations.
Rank | Generator Model | Judge Model | ELO Score | Accuracy | Generator Perf. | Judge Perf. | Consistency | Sample Size |
---|---|---|---|---|---|---|---|---|
1 | grok-3 | o4-mini | 1535 | 100.0% | 100.0% | 100.0% | 100.0% | 3 |
2 | grok-3 | qwen-235b | 1524 | 100.0% | 100.0% | 100.0% | 100.0% | 2 |
3 | gpt-4o | gemini-2.5-pro | 1524 | 100.0% | 100.0% | 100.0% | 100.0% | 2 |
4 | o4-mini | qwen-235b | 1523 | 100.0% | 66.7% | 100.0% | 86.7% | 3 |
5 | gemini-2.5-pro | o4-mini | 1512 | 100.0% | 100.0% | 100.0% | 100.0% | 1 |
6 | gpt-4o | mistral-large | 1512 | 100.0% | 100.0% | 100.0% | 100.0% | 1 |
7 | mistral-large | qwen-235b | 1512 | 100.0% | 100.0% | 100.0% | 100.0% | 1 |
8 | qwen-235b | o4-mini | 1512 | 100.0% | 100.0% | 100.0% | 100.0% | 1 |
9 | gemini-2.5-pro | mistral-large | 1512 | 100.0% | 100.0% | 100.0% | 100.0% | 1 |
10 | o4-mini | grok-3 | 1500 | 100.0% | 0.0% | 100.0% | 60.0% | 1 |
11 | mistral-large | gemini-2.5-pro | 1500 | 100.0% | 0.0% | 100.0% | 60.0% | 1 |
12 | o4-mini | gemini-2.5-pro | 1500 | 0.0% | 100.0% | 0.0% | 50.0% | 1 |
13 | gemini-2.5-pro | deepseek-reasoner | 1500 | 100.0% | 0.0% | 100.0% | 60.0% | 1 |
14 | qwen-235b | grok-3 | 1500 | 100.0% | 0.0% | 100.0% | 60.0% | 1 |
15 | qwen-235b | gemini-2.5-pro | 1500 | 50.0% | N/A | N/A | 50.0% | 2 |
Model Pair Performance Metrics:
- Accuracy: Percentage of correct hallucination judgments based on user feedback
- Generator Performance: How well the generator model avoids hallucinations
- Judge Performance: How accurately the judge model identifies hallucinations
- Consistency: Weighted measure of how well the pair works together
ELO Rating System Explanation
How ELO Scores Are Calculated
Our ELO rating system assigns scores to model pairs based on user feedback, using the following formula:
ELO_new = ELO_old + K * (S - E)
Where:
* ELO_old: Previous rating of the model combination
* K: Weight factor (24 for model pairs)
* S: Actual score from user feedback (1 for correct, 0 for incorrect)
* E: Expected score based on current rating
E = 1 / (1 + 10^((1500 - ELO_model) / 400))
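The update rule for model pairs can be sketched in Python as below. This is an illustration under stated assumptions: it reads the expected-score formula with (1500 - ELO_model)/400 as the exponent of 10, assumes pairs start at 1500, and uses hypothetical names (`expected_score`, `update_elo`).

```python
BASELINE_ELO = 1500.0  # reference rating used in the expected-score formula
PAIR_K = 24.0          # weight factor stated for model pairs


def expected_score(elo: float) -> float:
    """E = 1 / (1 + 10^((1500 - ELO_model) / 400))."""
    return 1.0 / (1.0 + 10.0 ** ((BASELINE_ELO - elo) / 400.0))


def update_elo(elo: float, correct: bool, k: float = PAIR_K) -> float:
    """ELO_new = ELO_old + K * (S - E), with S = 1 for correct, 0 for incorrect."""
    s = 1.0 if correct else 0.0
    return elo + k * (s - expected_score(elo))


# Replay three consecutive "correct" feedback events for one pair.
rating = BASELINE_ELO  # assumed starting rating for a pair
for feedback in [True, True, True]:
    rating = update_elo(rating, feedback)
    print(round(rating))
```

Under these assumptions the loop prints 1512, 1524, and 1535, which lines up with several all-correct pairs in the table above that have one, two, and three samples respectively.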
Available Models
The system randomly selects a generator and a judge from these models for each hallucination detection run:
All Models (Used as both Generator & Judge)
- mistral-large
- gpt-4o
- qwen-235b
- grok-3
- deepseek-reasoner
- o4-mini
- gemini-2.5-pro
Individual Model Performance
Performance ranking of models based on user feedback, showing statistics for both generator and judge roles.
Rank | Model | ELO Score | Overall Accuracy | Generator Accuracy | Judge Accuracy | Sample Size | Generator/Judge Ratio |
---|---|---|---|---|---|---|---|
1 | grok-3 | 1598 | 100.0% | 100.0% | 100.0% | 7 | 71% / 29% |
2 | qwen-235b | 1578 | 80.0% | 50.0% | 100.0% | 10 | 40% / 60% |
3 | o4-mini | 1577 | 80.0% | 60.0% | 100.0% | 10 | 50% / 50% |
4 | gpt-4o | 1546 | 100.0% | 100.0% | 0.0% | 3 | 100% / 0% |
5 | gemini-2.5-pro | 1540 | 66.7% | 66.7% | 66.7% | 9 | 33% / 67% |
6 | mistral-large | 1532 | 75.0% | 50.0% | 100.0% | 4 | 50% / 50% |
7 | deepseek-reasoner | 1516 | 100.0% | 0.0% | 100.0% | 1 | 0% / 100% |
Individual Model ELO Rating System
How Individual ELO Scores Are Calculated
Our ELO rating system assigns scores to individual models based on user feedback, using the following formula:
ELO_new = ELO_old + K * (S - E)
Where:
* ELO_old: Previous rating of the model
* K: Weight factor (32 for individual models)
* S: Actual score (1 for correct judgment, 0 for incorrect)
* E: Expected score based on current rating
E = 1 / (1 + 10^((1500 - ELO_model) / 400))
All models start with a base ELO of 1500. Scores are updated after each user evaluation.
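As a worked example, assume a model at the base rating of 1500 receives one correct judgment (S = 1), and read the expected-score formula with (1500 - ELO_model)/400 as the exponent of 10:

```latex
\begin{align*}
E &= \frac{1}{1 + 10^{(1500 - 1500)/400}} = \frac{1}{1 + 1} = 0.5 \\
\mathrm{ELO}_{\text{new}} &= 1500 + 32\,(1 - 0.5) = 1516
\end{align*}
```

The same starting point with an incorrect judgment (S = 0) would give 1500 + 32 * (0 - 0.5) = 1484. The 1516 result matches the table entry for deepseek-reasoner, which has a single sample.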
Interpretation Guidelines
- 1800+: Exceptional performance, very rare hallucinations
- 1700-1799: Superior performance, minimal hallucinations
- 1600-1699: Good performance, occasional hallucinations
- 1500-1599: Average performance
- <1500: Below average, frequent hallucinations
Note: ELO scores are comparative and reflect relative performance between models in our specific hallucination detection tasks.
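The interpretation bands above can be expressed as a small lookup, sketched here in Python. The function name `interpret_elo` is hypothetical, and treating each band as a half-open interval (so that a fractional score such as 1599.5 still falls in the 1500 band) is an assumption.

```python
from typing import List, Tuple

# Lower bound of each band, highest first, with the label from the guidelines.
BANDS: List[Tuple[float, str]] = [
    (1800, "Exceptional performance, very rare hallucinations"),
    (1700, "Superior performance, minimal hallucinations"),
    (1600, "Good performance, occasional hallucinations"),
    (1500, "Average performance"),
]
BELOW_AVERAGE = "Below average, frequent hallucinations"


def interpret_elo(score: float) -> str:
    """Return the guideline label for an ELO score."""
    for lower_bound, label in BANDS:
        if score >= lower_bound:
            return label
    return BELOW_AVERAGE


print(interpret_elo(1598))
```

With the scores currently listed, every individual model falls in the 1500-1599 band.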