The Improvement Loop
How AskBiz continuously improves AI accuracy through user feedback, automated benchmarking, and structured review cycles — and what role you play in making it better.
The Four-Stage Improvement Loop
AskBiz operates a continuous improvement loop across four stages:
Stage 1 — Signal collection
We collect three types of signals (see the sketch after this list):
- Explicit flags — thumbs-down feedback from users with error type and notes
- Implicit signals — a user immediately re-asking the same question after receiving an answer, which suggests the first answer was unsatisfactory
- Confidence discrepancy signals — an answer marked High confidence that is subsequently corrected by user data, logged as a potential overconfidence case
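For illustration, here is a minimal sketch of how these three signal types might be represented internally. The type names and fields are assumptions made for this example, not AskBiz's actual schema:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class SignalType(Enum):
    EXPLICIT_FLAG = "explicit_flag"      # thumbs-down with error type and notes
    IMPLICIT_REPEAT = "implicit_repeat"  # same question re-asked immediately
    OVERCONFIDENCE = "overconfidence"    # High-confidence answer later corrected

@dataclass
class Signal:
    signal_type: SignalType
    question_id: str
    answer_id: str
    error_type: Optional[str] = None         # explicit flags only
    user_notes: Optional[str] = None         # explicit flags only
    stated_confidence: Optional[str] = None  # overconfidence cases only, e.g. "High"
```

Keeping all three signal types in one record shape would make the weekly aggregation in Stage 2 straightforward.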
Stage 2 — Pattern analysis
Signals are aggregated weekly. Our team looks for patterns (a sketch of this aggregation follows the list):
- Which question types generate the most flags?
- Are errors clustered around specific data source types?
- Are there systematic biases (e.g. consistently underestimating a particular metric)?
- Are errors increasing or decreasing after model updates?
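As a rough sketch of what this weekly aggregation could look like, the snippet below tallies signals by question type and by data source to surface clusters. The `question_type` and `data_source` fields, and the example values, are hypothetical:

```python
from collections import Counter

def weekly_patterns(signals):
    """Tally one week of signals by question type and by data source."""
    by_question_type = Counter(s["question_type"] for s in signals)
    by_data_source = Counter(s["data_source"] for s in signals)
    return by_question_type.most_common(), by_data_source.most_common()

signals = [
    {"question_type": "revenue_forecast", "data_source": "invoices"},
    {"question_type": "revenue_forecast", "data_source": "invoices"},
    {"question_type": "churn_estimate", "data_source": "crm"},
]
question_counts, source_counts = weekly_patterns(signals)
print(question_counts)  # [('revenue_forecast', 2), ('churn_estimate', 1)]
```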
Stage 3 — Intervention design
Based on patterns, we design improvements:
- System prompt updates — adjusting the instructions Claude receives for specific question types
- Data retrieval changes — improving which data is pulled for particular query patterns
- Confidence threshold adjustments — recalibrating when High/Medium/Low confidence is assigned
- Benchmark additions — adding new test cases based on error patterns (an example follows this list)
- Model-level feedback — for systematic errors that appear to be model-level issues, we report to Anthropic
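To make the benchmark-additions intervention concrete, here is a hypothetical test case derived from an error pattern. Every field name and value is invented for illustration; the actual benchmark format is not published:

```python
# Hypothetical benchmark case derived from a flagged error pattern:
# answers were underestimating monthly recurring revenue whenever
# invoices contained partial refunds.
benchmark_case = {
    "id": "bench-0137",                    # illustrative identifier
    "question": "What was our MRR last month?",
    "fixture_data": "invoices_with_partial_refunds.json",
    "expected_answer": 48250.00,
    "tolerance": 0.01,                     # 1% relative tolerance
    "error_pattern": "mrr_underestimate_partial_refunds",
}
```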
Stage 4 — Testing and deployment
Every change is tested against our full benchmark suite before deployment. A change is deployed only if it improves accuracy on the targeted error type without degrading accuracy in any other category by more than 0.2%.
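That deployment rule can be summarised as a simple gate. The sketch below assumes per-category accuracy scores between 0 and 1 and reads the 0.2% threshold as 0.2 percentage points; the function and category names are illustrative:

```python
def should_deploy(before: dict, after: dict, targeted: str,
                  max_regression: float = 0.002) -> bool:
    """Gate: the targeted category must improve, and no other category
    may regress by more than 0.2 percentage points (0.002 as a fraction)."""
    if after[targeted] <= before[targeted]:
        return False
    return all(
        before[cat] - after[cat] <= max_regression
        for cat in before if cat != targeted
    )

# Targeted category gains 3 points; worst regression elsewhere is 0.1 points.
before = {"forecasting": 0.91, "lookup": 0.970, "anomaly": 0.88}
after  = {"forecasting": 0.94, "lookup": 0.969, "anomaly": 0.88}
print(should_deploy(before, after, "forecasting"))  # True
```

In this example the targeted category improves by 3 points and the largest regression elsewhere is 0.1 points, so the change would ship.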
Your Role in the Loop
Every flag you submit enters Stage 1 of the improvement loop directly. Flags are not merely logged and forgotten; they are the primary driver of Stage 2 pattern analysis.
The most impactful flags include:
- A clear description of why the answer was wrong
- The correct answer (or what the correct answer should look like)
- The data source the correct answer should have come from
We don't require all three — even a bare thumbs-down helps us identify that an answer was unsatisfactory. But the more detail you provide, the faster we can identify and fix the underlying issue.
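For example, a flag that includes all three elements might look like the following; the field names and figures are invented for illustration:

```python
flag = {
    "verdict": "thumbs_down",
    "why_wrong": "Reported Q3 revenue as $112k; credit-note reversals "
                 "were double-counted.",
    "correct_answer": "Q3 revenue was approximately $98k.",
    "expected_source": "accounting ledger (reconciled journal entries)",
}
```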
What We Do Not Do
To be explicit about the limits of our improvement process:
- We do not use your individual business data to train AI models
- We do not share your specific flagged examples with other users
- We do not use the content of your questions to build user profiles
- We do not deploy model changes without benchmark testing
- We do not make improvement claims we have not measured
All improvement claims in this Transparency Centre are based on measured benchmark results, not subjective assessment.
Timelines
- System prompt updates: deployed within 5–10 business days of identifying a pattern
- Confidence threshold adjustments: deployed monthly as part of scheduled updates
- Data retrieval improvements: deployed within 2–4 weeks of identifying the issue
- Model-level improvements: depend on Anthropic's release cycle — typically 4–12 weeks after reporting
- Methodology updates (Business Pulse, anomaly detection, churn): quarterly, with 7-day advance notice to users