PAS2 - Hallucination Detector

This tool detects hallucinations in AI responses: it compares a model's answers to semantically equivalent questions and has a specialized judge model evaluate them.
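The flow described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: `detect_hallucination` and the three callables it takes (`paraphrase`, `generate`, `judge`) are hypothetical names, and the toy stand-ins below replace real model calls so the example runs on its own.

```python
from typing import Callable, Dict, List


def detect_hallucination(
    question: str,
    paraphrase: Callable[[str, int], List[str]],
    generate: Callable[[str], str],
    judge: Callable[[str, List[str]], bool],
    n_paraphrases: int = 3,
) -> Dict[str, object]:
    """Answer a question and its rewordings, then let a judge compare the answers."""
    # Semantically equivalent rewordings of the original question (hypothetical helper).
    variants = [question] + paraphrase(question, n_paraphrases)
    # One answer per wording, all from the same generator model.
    answers = [generate(q) for q in variants]
    # The judge sees every answer and decides whether they conflict.
    flagged = judge(question, answers)
    return {"questions": variants, "answers": answers, "hallucination": flagged}


# Toy stand-ins for real model calls (assumptions for illustration only).
def toy_paraphrase(q: str, n: int) -> List[str]:
    return [f"{q} (variant {i})" for i in range(1, n + 1)]


def toy_generate(q: str) -> str:
    return "Paris"


def toy_judge(question: str, answers: List[str]) -> bool:
    # Flags a hallucination when the answers are not all identical.
    return len(set(answers)) > 1


result = detect_hallucination(
    "What is the capital of France?", toy_paraphrase, toy_generate, toy_judge
)
print(result["hallucination"])  # False: all four answers agree
```

Passing the models in as callables keeps the sketch independent of any particular provider; a real judge would compare meaning rather than exact strings.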

Your feedback helps us refine the hallucination detection system.

Was there actually a hallucination in the responses?

Yes, there was a hallucination
No, there was no hallucination
Not sure

Did the judge model correctly identify the situation?

Yes, the judge was correct
No, the judge was incorrect
Not sure

Additional comments (optional)

Hallucination Detection Scores

Performance comparison of different Generator + Judge model combinations.

Rank	Generator Model	Judge Model	ELO Score	Accuracy	Generator Performance	Judge Performance	Consistency	Sample Size
1	grok-3	o4-mini	1535	100.0%	100.0%	100.0%	100.0%	3
2	grok-3	qwen-235b	1524	100.0%	100.0%	100.0%	100.0%	2
3	gpt-4o	gemini-2.5-pro	1524	100.0%	100.0%	100.0%	100.0%	2
4	o4-mini	qwen-235b	1523	100.0%	66.7%	100.0%	86.7%	3
5	gemini-2.5-pro	o4-mini	1512	100.0%	100.0%	100.0%	100.0%	1
6	gpt-4o	mistral-large	1512	100.0%	100.0%	100.0%	100.0%	1
7	mistral-large	qwen-235b	1512	100.0%	100.0%	100.0%	100.0%	1
8	qwen-235b	o4-mini	1512	100.0%	100.0%	100.0%	100.0%	1
9	gemini-2.5-pro	mistral-large	1512	100.0%	100.0%	100.0%	100.0%	1
10	o4-mini	grok-3	1500	100.0%	0.0%	100.0%	60.0%	1
11	mistral-large	gemini-2.5-pro	1500	100.0%	0.0%	100.0%	60.0%	1
12	o4-mini	gemini-2.5-pro	1500	0.0%	100.0%	0.0%	50.0%	1
13	gemini-2.5-pro	deepseek-reasoner	1500	100.0%	0.0%	100.0%	60.0%	1
14	qwen-235b	grok-3	1500	100.0%	0.0%	100.0%	60.0%	1
15	qwen-235b	gemini-2.5-pro	1500	50.0%	N/A	N/A	50.0%	2
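The page does not state how the ELO Score column is computed. For orientation only, here is the standard Elo update rule in a short sketch. The fixed 1500 baseline opponent and the K-factor of 24 are assumptions, chosen because a single fully correct sample then lands on 1512 as in the table; they are not confirmed details of PAS2, and `elo_update` is an illustrative name.

```python
def expected_score(rating: float, opponent: float) -> float:
    """Standard Elo expectation of `rating` against `opponent`."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))


def elo_update(
    rating: float,
    outcome: float,
    opponent: float = 1500.0,  # assumed fixed baseline
    k: float = 24.0,           # assumed K-factor
) -> float:
    """Move `rating` toward `outcome` (1.0 = correct, 0.0 = incorrect)."""
    return rating + k * (outcome - expected_score(rating, opponent))


rating = elo_update(1500.0, 1.0)
print(round(rating))  # 1512
```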

Model Pair Performance Metrics:

Accuracy: Percentage of correct hallucination judgments based on user feedback
Generator Performance: How well the generator model avoids hallucinations
Judge Performance: How accurately the judge model identifies hallucinations
Consistency: Weighted measure of how well the pair works together
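As a rough sketch of how the first three metrics could be derived from the two feedback questions: the `Feedback` record, the `pair_metrics` function, and the choice to leave "Not sure" answers out of the percentages are all assumptions made for illustration. Consistency is omitted because its weighting is not given here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Feedback:
    # Answer to "Was there actually a hallucination?"; None means "Not sure".
    hallucinated: Optional[bool]
    # Answer to "Did the judge model correctly identify the situation?"; None means "Not sure".
    judge_correct: Optional[bool]


def percent_true(values: List[Optional[bool]]) -> Optional[float]:
    """Percentage of True among the known values, or None if nothing is known."""
    known = [v for v in values if v is not None]
    if not known:
        return None
    return 100.0 * sum(known) / len(known)


def pair_metrics(records: List[Feedback]) -> Dict[str, Optional[float]]:
    """Summarize user feedback for one generator + judge pair."""
    no_hallucination = [
        None if r.hallucinated is None else not r.hallucinated for r in records
    ]
    return {
        # Share of samples where the user said the judge was right.
        "accuracy": percent_true([r.judge_correct for r in records]),
        # Share of samples where the generator did not hallucinate.
        "generator_performance": percent_true(no_hallucination),
        "sample_size": float(len(records)),
    }


metrics = pair_metrics(
    [
        Feedback(hallucinated=False, judge_correct=True),
        Feedback(hallucinated=False, judge_correct=True),
        Feedback(hallucinated=True, judge_correct=True),
    ]
)
print(round(metrics["generator_performance"], 1))  # 66.7
print(metrics["accuracy"])  # 100.0
```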

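Individual model performance, aggregated over every pair in which the model appears as generator or judge.

Rank	Model	ELO Score	Overall Accuracy	Generator Accuracy	Judge Accuracy	Sample Size	Generator/Judge Ratio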
1	grok-3	1598	100.0%	100.0%	100.0%	7	71% / 29%
2	qwen-235b	1578	80.0%	50.0%	100.0%	10	40% / 60%
3	o4-mini	1577	80.0%	60.0%	100.0%	10	50% / 50%
4	gpt-4o	1546	100.0%	100.0%	0.0%	3	100% / 0%
5	gemini-2.5-pro	1540	66.7%	66.7%	66.7%	9	33% / 67%
6	mistral-large	1532	75.0%	50.0%	100.0%	4	50% / 50%
7	deepseek-reasoner	1516	100.0%	0.0%	100.0%	1	0% / 100%
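The Sample Size and Generator/Judge Ratio columns appear to follow from the pair table: summing each model's pair sample sizes by role reproduces them. A short sketch of that aggregation, with the sample sizes copied from the pair table above; `role_counts` and `ratio` are illustrative names rather than part of the tool.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (generator, judge, sample size), copied from the pair table.
PAIR_SAMPLES: List[Tuple[str, str, int]] = [
    ("grok-3", "o4-mini", 3),
    ("grok-3", "qwen-235b", 2),
    ("gpt-4o", "gemini-2.5-pro", 2),
    ("o4-mini", "qwen-235b", 3),
    ("gemini-2.5-pro", "o4-mini", 1),
    ("gpt-4o", "mistral-large", 1),
    ("mistral-large", "qwen-235b", 1),
    ("qwen-235b", "o4-mini", 1),
    ("gemini-2.5-pro", "mistral-large", 1),
    ("o4-mini", "grok-3", 1),
    ("mistral-large", "gemini-2.5-pro", 1),
    ("o4-mini", "gemini-2.5-pro", 1),
    ("gemini-2.5-pro", "deepseek-reasoner", 1),
    ("qwen-235b", "grok-3", 1),
    ("qwen-235b", "gemini-2.5-pro", 2),
]


def role_counts(pairs: List[Tuple[str, str, int]]) -> Dict[str, Dict[str, int]]:
    """Count how many samples each model contributed as generator and as judge."""
    counts: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"generator": 0, "judge": 0}
    )
    for generator, judge, n in pairs:
        counts[generator]["generator"] += n
        counts[judge]["judge"] += n
    return counts


def ratio(model_counts: Dict[str, int]) -> str:
    """Format the generator/judge split as whole percentages."""
    total = model_counts["generator"] + model_counts["judge"]
    gen_pct = 100 * model_counts["generator"] / total
    judge_pct = 100 * model_counts["judge"] / total
    return f"{gen_pct:.0f}% / {judge_pct:.0f}%"


counts = role_counts(PAIR_SAMPLES)
print(ratio(counts["grok-3"]))  # 71% / 29%
```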