Has Anyone Measured How LLM Output Quality Degrades Across Multiple Compactions?
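- Summary
- This article asks whether anyone has studied or measured how the output quality of large language models (LLMs) degrades across repeated context compactions. Understanding how compaction affects the quality of generated text matters for judging how practical and reliable LLMs are.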
The Observation

After ~70 sessions with DeepSeek V4 (1M context), I noticed something odd. When Claude Code compacts my session, output quality doesn't just go down linearly. There's a moment, usually after the second compaction, where the model briefly gets better. Then it declines and never recovers.

Maybe I'm imagining it. Maybe it's specific to my model, my prompts, my workflow. But I can't shake the thought: what if context compaction has a curve, and nobody has mapped it?

What I Found (Not Much)

I searched for benchmarks that measure multi-round compaction degradation. Here's what exists:

- RULER: Measures how performance drops as static input grows longer. Nothing about what happens after you compress and re-compress.
- Context Rot (Chroma 2025): 18 models tested, all degrade with more tokens. Again, static.
- Multi-turn evaluation: Tests whether models drift across conversation turns. Doesn't touch compaction.

Parameter compression (pruning, quantization) has well-mapped scaling laws. The Lottery Ticket Hypothesis (ICLR 2019) and Compression Laws for LLMs (2025) tell you exactly where the performance peak sits. Context summarization, the thing that happens every time your agent runs /compact, has no such curve.

Why This Might Matter

If the curve is real, you could:

- Know exactly when to start a fresh session (before the decline hits)
- Compare models on a new dimension: who maintains quality longest across compactions?
- Give LLM providers a concrete target: "your compaction quality drops 20% faster than competitor X"

Right now, none of the major benchmark suites (MMLU, HELM, BIG-bench, RULER) include a "compaction persistence" metric. If context windows keep growing and sessions keep getting longer, this gap gets bigger every year.

What I'm Asking

I built a tiny monitor (compact-counter) and a rough experiment framework: 50 lines of Python, 10 benchmark tasks, 0-5 rubric. It's not polished. It's a starting point.
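To make the idea of a "curve" concrete, here is a minimal sketch of how rubric scores could be summarized per compaction count. It is not the code from compact-counter. The record layout `(compactions, task_id, score)`, the function names `quality_curve` and `peak_and_drop`, and the sample numbers are all assumptions made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def quality_curve(records):
    """Average the 0-5 rubric scores for each compaction count.

    `records` is an iterable of (compactions, task_id, score) tuples
    (a hypothetical layout). Returns a list of (compactions, mean_score)
    pairs sorted by compaction count.
    """
    by_count = defaultdict(list)
    for compactions, _task_id, score in records:
        if not 0 <= score <= 5:
            raise ValueError(f"score {score} is outside the 0-5 rubric")
        by_count[compactions].append(score)
    return [(n, mean(scores)) for n, scores in sorted(by_count.items())]


def peak_and_drop(curve):
    """Return (compactions at peak, peak mean, drop from peak to last point)."""
    peak_n, peak_score = max(curve, key=lambda point: point[1])
    last_score = curve[-1][1]
    return peak_n, peak_score, peak_score - last_score


# Placeholder records, invented purely to exercise the functions.
# They are NOT measurements from any model.
sample_records = [
    (0, "task-1", 4), (0, "task-2", 4),
    (1, "task-1", 3), (1, "task-2", 4),
    (2, "task-1", 5), (2, "task-2", 4),
    (3, "task-1", 3), (3, "task-2", 2),
]

curve = quality_curve(sample_records)
peak_n, peak_score, drop = peak_and_drop(curve)
print(curve)
print(peak_n, peak_score, drop)
```

With only 10 tasks per compaction count, a one-point bump in the mean could easily be noise, so repeated runs per count would probably be needed before reading anything into a peak.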
What I'd love:

- Someone with a Claude Opus / GPT-5 / Gemini account to try reproducing this
- Feedback on whether the methodology makes sense or is fundamentally flawed
- If this is a real thing, ideas for how to measure it properly

I don't have the compute or the stats background to do this alone. But if enough people contribute data points across different models, we might find out whether this curve exists. If it does, maybe it's useful to more people than just me.

References

- Frankle & Carbin, "The Lottery Ticket Hypothesis" (ICLR 2019)
- "Compression Laws for Large Language Models" (2025)
- "RULER: What's the Real Context Size of Your Long-Context Language Models?" (COLM 2024)
- Chroma Research, "Context Rot" (2025)