Which AI Models Stay On-Task in Long Conversations?
We tested 5 leading AI models under 2 instruction presets (10 model-preset runs), each over a 20-turn conversation. Stability varied: some models maintained consistent behavior throughout, while others drifted, growing wordier or dropping instructions as the conversation progressed.
Understanding the Metrics
📏 Verbosity Drift
Measures how much response length changes from the first turn to the 20th turn of a conversation.
- 0.0 = Perfect stability - Same word count throughout
- 0.2 = 20% change - Moderate drift in response length
- 0.5 = 50% change - Significant drift (wordier or terser)
Lower is better. A model that starts with 50-word responses and drifts to 75 words has a drift of 0.5 (50% increase).
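To make the definition concrete, here is a minimal sketch of how such a drift score could be computed from a pair of responses. The function name and the use of whitespace-separated word counts are assumptions for illustration, not the benchmark's actual implementation (which may measure tokens or characters instead).

```python
def verbosity_drift(first_turn_response: str, last_turn_response: str) -> float:
    """Relative change in response length between turn 1 and turn 20.

    Length is approximated here as whitespace-separated word count;
    the actual benchmark may use a different length measure.
    """
    first_len = len(first_turn_response.split())
    last_len = len(last_turn_response.split())
    if first_len == 0:
        return 0.0
    return abs(last_len - first_len) / first_len

# Example from the text: a 50-word opener drifting to 75 words -> 0.5
print(verbosity_drift("word " * 50, "word " * 75))  # 0.5
```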
✓ Compliance Drop
Measures how much instruction compliance declines between the first and the 20th turn, in a setup that also injects a conflicting instruction mid-conversation.
- 0.0 = Perfect consistency - Follows instructions equally well on both turns
- 0.06 = 6% drop - Slightly less compliant by turn 20
- 0.2 = 20% drop - Significantly worse at following instructions
Lower is better. Negative values mean the model actually improved compliance by turn 20.
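As a hedged illustration, the drop can be read as the difference between per-turn compliance scores. The function below assumes each turn's compliance is already scored as a fraction of instructions followed; how the benchmark actually scores compliance is not specified here.

```python
def compliance_drop(turn1_compliance: float, turn20_compliance: float) -> float:
    """Drop in instruction compliance between turn 1 and turn 20.

    Each input is a compliance rate in [0, 1]. A negative result means
    the model followed instructions better on turn 20 than on turn 1.
    """
    return turn1_compliance - turn20_compliance

# Example: fully compliant on turn 1, 94% compliant on turn 20 -> 0.06 drop
print(round(compliance_drop(1.0, 0.94), 2))  # 0.06
```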
Key Findings
Stability Rankings
Models ranked by their ability to maintain consistent behavior across conversations.
Model Comparison
Compare stability metrics across all tested models.
Detailed Results
In-depth metrics for each model and preset combination.
Raw Data & Verification
Download complete conversation transcripts and metrics for independent verification. All data includes timestamps, token usage, and latency measurements.
Summary Data
Aggregated metrics for all models and presets with statistical analysis.
Download summary.json (16KB)
Complete Raw Logs (Compressed)
All 10 run files with full conversation transcripts, token counts, and latency data (5 models × 2 presets). Files are gzip-compressed for efficiency.
Decompress with gunzip or any standard decompression tool. Total dataset: ~280KB compressed, ~5.7MB uncompressed across 10 files.
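For independent verification, the compressed run files can be read directly without unpacking them to disk. The sketch below assumes each run file is gzip-compressed JSON containing a list of turns; the file name and field names ("turns", "latency_ms", "tokens", "response") are placeholders, not the dataset's documented schema.

```python
import gzip
import json

# Hypothetical file name; substitute an actual run file from the download.
path = "run_model-a_preset-1.json.gz"

# gzip.open in text mode lets json.load parse the stream directly.
with gzip.open(path, "rt", encoding="utf-8") as f:
    run = json.load(f)

# Field names below are assumptions about the log schema.
turns = run.get("turns", [])
print(f"{len(turns)} turns loaded")
for turn in turns[:3]:
    print(turn.get("latency_ms"), turn.get("tokens"), str(turn.get("response", ""))[:60])
```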