We publish our accuracy data because transparency builds trust. Every number here comes from an independent benchmark of 211 real-world text samples.
Each engine has different strengths. Consensus combines them.
| Engine | False Positive Rate | True Positive Rate | Strength |
|---|---|---|---|
| GPTZero | 0.0% | 88.2% | Human Guardian — lowest FPR |
| Winston AI | 3.5% | 90.2% | Balanced detector |
| Originality.ai | 18.4% | 94.1% | Aggressive — highest TPR |
| OmniScore (Consensus) | 2.5% | 96.1% | Best of both: low FPR + high TPR |
FPR = human text incorrectly flagged as AI (lower is better). TPR = AI text correctly identified (higher is better).
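As a worked illustration of these two definitions, here is a minimal sketch that computes FPR and TPR from lists of per-sample verdicts. The sample data is made up for the example, not drawn from our benchmark.

```python
def fpr(human_results):
    """Fraction of human-written samples incorrectly flagged as AI (lower is better)."""
    flagged = sum(1 for verdict in human_results if verdict == "ai")
    return flagged / len(human_results)

def tpr(ai_results):
    """Fraction of AI-written samples correctly flagged as AI (higher is better)."""
    flagged = sum(1 for verdict in ai_results if verdict == "ai")
    return flagged / len(ai_results)

# Illustrative verdicts only:
human_verdicts = ["human", "human", "ai", "human"]  # 1 of 4 flagged
ai_verdicts = ["ai", "ai", "ai", "human"]           # 3 of 4 flagged

print(f"FPR: {fpr(human_verdicts):.0%}")  # FPR: 25%
print(f"TPR: {tpr(ai_verdicts):.0%}")     # TPR: 75%
```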
Originality.ai alone flags 18.4% of human writing as AI. Through consensus, that drops to 2.5%.
- All three engines agree on the verdict (3/3)
- Two of three engines agree; the outlier is ignored (2/3)
- All engines disagree; the text is flagged as uncertain
When engines disagree, that's information too. A split verdict tells you the text is ambiguous — more honest than a false confidence score from a single detector.
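The voting logic above can be sketched in a few lines. For the sketch we assume each engine returns one of three labels ("ai", "human", "mixed"); the real engines return numeric scores, so the labels here are illustrative, not our production pipeline.

```python
from collections import Counter

def consensus(verdicts):
    """Take three engine labels and return (verdict, agreement)."""
    counts = Counter(verdicts)
    label, votes = counts.most_common(1)[0]
    if votes == 3:
        return label, "3/3 unanimous"
    if votes == 2:
        return label, "2/3 majority, outlier ignored"
    # No two engines agree: report uncertainty instead of a false-confidence score.
    return "uncertain", "no agreement"

print(consensus(["ai", "ai", "ai"]))        # ('ai', '3/3 unanimous')
print(consensus(["human", "human", "ai"]))  # ('human', '2/3 majority, outlier ignored')
print(consensus(["ai", "human", "mixed"]))  # ('uncertain', 'no agreement')
```

The key design choice is the last branch: a split verdict is surfaced as "uncertain" rather than averaged away.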
Rigor in, trust out.
Human texts from 15+ sources: classic literature, academic papers, student essays, news articles, blog posts, forum discussions, and professional writing. AI texts from 6+ models: GPT-4o, Claude 3.5, Gemini, Llama, Mistral, and more.
All human samples collected using pure extraction tools (Firefox Reader Mode, Firecrawl). No LLM was used to 'clean' or 'extract' human text — because LLM extraction produces AI-like artifacts that corrupt benchmark integrity.
10 samples initially labeled 'human' were reclassified after all three engines unanimously flagged them; the cause was traced to an LLM used for text extraction. We documented the lesson, improved the methodology, and excluded these samples from scoring.
171 English + 40 German samples. Both languages tested against all three engines to verify cross-language accuracy.
Every engine upgrade, algorithm change, or threshold adjustment triggers a full benchmark re-run. The dataset grows with every iteration.
Which AI detector is the most accurate? Individual tools achieve 85-95% accuracy, but they frequently disagree — our benchmark shows engines contradict each other on 15-30% of texts. A single score cannot give you certainty.
OmniDetect solves this with multi-engine consensus. By combining GPTZero (the academic standard), Winston AI (content marketing focus), and Originality.ai (highest single-engine precision), we reduce false positives from ~18% to just 2.5% — verified across 1,038 independent samples.
| Tool | Engines | FPR | Approach |
|---|---|---|---|
| OmniDetect | 3 (consensus) | 2.5% | Multi-engine verdict |
| GPTZero | 1 | ~9% | Perplexity-based |
| Originality.ai | 1 | ~8% | Deep learning |
| Winston AI | 1 | ~12% | Transformer-based |
The methodology is simple: when three independent engines agree, the result is far more reliable than any single opinion. It's the difference between one judge and a jury.
No AI detector is perfect. Here's what ours struggles with.
- Two AI samples imitating student and narrative styles scored under 16%. Winston AI and Originality.ai missed them entirely; only GPTZero flagged them.
- All three false positives were academic or professional texts. Formal, structured writing can resemble AI output patterns.
- Texts under 300 words produce less stable results across all engines. We recommend 500+ words for reliable verdicts.
- Heavily paraphrased AI text may bypass all three engines. No detector on the market fully solves this challenge.
- Non-native English writers sometimes produce patterns that overlap with AI-generated content, leading to higher-than-expected scores.
Numbers are nice. Experience is better. Try a free scan and judge the accuracy firsthand.
Start Free Scan