AI Accuracy & Known Error Types
What AskBiz's AI accuracy rates are across different question categories, what the most common error types are and what causes them, and how we track and reduce them.
Our Accuracy Commitment
We publish accuracy metrics because transparency about AI limitations is more useful than projecting false confidence. These numbers are measured against our internal benchmark suite: a set of test queries with known correct answers, run against each new model version before deployment.
Our benchmark covers four question categories (a simplified scoring sketch follows the list):
- Calculation questions: precise arithmetic on your data (revenue totals, margin calculations, landed costs)
- Pattern questions: identifying trends, anomalies, and comparisons
- Factual questions: explaining business concepts, regulatory frameworks, definitions
- Recommendation questions: strategic suggestions based on data analysis
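To make the scoring concrete, here is a minimal sketch of how per-category accuracy falls out of a benchmark run. It is not our production harness; the function name and the `(category, is_correct)` tuple format are assumptions for this example.

```python
from collections import defaultdict

def accuracy_by_category(results):
    """results: iterable of (category, is_correct) pairs from one benchmark run."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, is_correct in results:
        total[category] += 1
        correct[category] += int(is_correct)
    # Per-category accuracy is simply correct answers over total queries.
    return {cat: correct[cat] / total[cat] for cat in total}

run = [
    ("calculation", True), ("calculation", True), ("calculation", False),
    ("pattern", True), ("pattern", True),
]
print(accuracy_by_category(run))
# {'calculation': 0.666..., 'pattern': 1.0}
```

Each published figure below is this ratio for its category, measured over the full suite.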
Accuracy by Category
Calculation questions: 98.2% accuracy
Math on your data is where the AI is most reliable. Given complete, correctly formatted data, arithmetic errors are rare. The remaining 1.8% error rate is almost entirely caused by ambiguous date ranges or currency conversion edge cases.
Pattern questions: 94.7% accuracy
Trend identification, anomaly detection, and comparisons are highly reliable. Errors typically involve misread seasonal patterns or comparisons drawn from very short time series.
Factual questions: 91.3% accuracy
Business concept explanations, regulatory summaries, and methodology explanations are generally accurate. Error rates increase for jurisdiction-specific regulatory questions (where rules change frequently) and for niche sector-specific knowledge.
Recommendation questions: 87.6% accuracy
Strategic recommendations are the most variable category. 'Accuracy' here means alignment with what an experienced business analyst would recommend given the same data, as assessed by our internal review team on a sample basis. Errors are typically overly conservative recommendations or a failure to weight a key data signal appropriately.
Common Error Types
Hallucination (fabrication): The AI states something as fact that is supported by neither your data nor Claude's training. Rate: approximately 1.2% of responses. Most common in: factual questions about specific regulations or market data outside that training.
Data misinterpretation: The AI reads your data correctly but draws an incorrect conclusion. Rate: approximately 2.8% of responses. Most common in: seasonal pattern analysis on short time series.
Overconfidence: The AI gives a High confidence answer that should be Medium or Low, because it did not correctly identify gaps in the data. Rate: approximately 3.1% of responses. We are actively working to reduce this; an illustrative calibration check appears after the error types below.
Under-specificity: The AI gives a correct but vague answer when a more specific one was possible. Rate: approximately 4.2% of responses. Most common in: recommendation questions where the data supports a clear recommendation but the AI hedges unnecessarily.
Currency/unit errors: Confusion between currencies, units of measure, or time zones. Rate: approximately 0.9% of responses. Mitigated by always specifying your home currency and time zone in account settings.
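On overconfidence specifically, one way to detect it is to bucket reviewed responses by the confidence label the AI assigned and compare each bucket's measured accuracy to a target floor. The labels below mirror our High/Medium/Low ratings, but the floors and data shapes are assumptions for this illustration, not published targets.

```python
# Assumed accuracy floors per confidence label -- illustrative values only.
TARGETS = {"High": 0.95, "Medium": 0.85, "Low": 0.70}

def calibration_report(graded):
    """graded: list of (confidence_label, was_correct) pairs from reviewed responses."""
    report = {}
    for label, floor in TARGETS.items():
        outcomes = [ok for lbl, ok in graded if lbl == label]
        if not outcomes:
            continue
        acc = sum(outcomes) / len(outcomes)
        # A High bucket scoring below its floor signals overconfidence.
        report[label] = {"accuracy": round(acc, 3), "overconfident": acc < floor}
    return report

print(calibration_report([("High", True), ("High", False), ("Medium", True)]))
# {'High': {'accuracy': 0.5, 'overconfident': True},
#  'Medium': {'accuracy': 1.0, 'overconfident': False}}
```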
How We Measure and Improve Accuracy
Accuracy is measured through three mechanisms:
Automated benchmarking: Every model update is tested against our benchmark suite of 2,400+ test queries before deployment. A model update is rejected if accuracy drops more than 0.5% in any category.
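A minimal sketch of that gate follows, assuming the 0.5% threshold refers to absolute percentage points of accuracy; the function and data shapes are illustrative, not our release tooling.

```python
MAX_DROP = 0.005  # 0.5 percentage points, per the deployment rule above

def gate_model_update(baseline, candidate):
    """baseline/candidate: {category: accuracy} from the benchmark suite.
    Returns (approved, regressions beyond the threshold)."""
    regressions = {
        cat: baseline[cat] - candidate.get(cat, 0.0)
        for cat in baseline
        if baseline[cat] - candidate.get(cat, 0.0) > MAX_DROP
    }
    return (not regressions, regressions)

baseline = {"calculation": 0.982, "pattern": 0.947, "factual": 0.913, "recommendation": 0.876}
candidate = {"calculation": 0.984, "pattern": 0.940, "factual": 0.915, "recommendation": 0.879}

approved, regressions = gate_model_update(baseline, candidate)
print(approved, regressions)  # False {'pattern': ~0.007} -> pattern dropped ~0.7pp, update rejected
```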
User feedback loop: Every thumbs-down on an AI response creates a flagged example that our team reviews. High-volume error patterns are used to improve our system prompting and, where appropriate, fed back to Anthropic.
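As a purely hypothetical illustration of what a flagged example might carry internally (every field name here is an assumption; the real schema is not public):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class FlaggedExample:
    question: str
    response: str
    category: str              # calculation / pattern / factual / recommendation
    suspected_error_type: str  # e.g. "overconfidence", "data misinterpretation"
    flagged_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    reviewed: bool = False     # set once a team member triages the example

flag = FlaggedExample(
    question="What were Q3 margins?",
    response="...",
    category="calculation",
    suspected_error_type="data misinterpretation",
)
```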
Quarterly manual audit: Our team manually reviews a random sample of 200 AI responses per quarter, graded against expert business analyst standards. Results are published in this Transparency Centre.
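For a sense of the statistical precision a 200-response sample supports, a standard normal-approximation confidence interval (an illustration, not part of our published methodology) can be computed as follows:

```python
import math

def audit_margin_of_error(p_hat, n, z=1.96):
    """Half-width of the 95% normal-approximation CI for an observed accuracy p_hat."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# If 188 of 200 sampled responses are graded correct (94%), the audit pins
# true accuracy to roughly 94% +/- 3.3 percentage points.
print(f"{audit_margin_of_error(188 / 200, 200):.3f}")  # ~0.033
```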