The AI reviewer scored 23/25 and missed the point
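Analysis results
- Category: AI
- Importance: 59
- Trend score: 21
- Summary: The AI reviewer scored a draft 23 out of 25 but missed the point that mattered. The result raises questions about AI evaluation and shows its limits where human judgment is needed. It is presented as a case where an AI's assessment is not necessarily accurate.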

I've been building an AI-assisted editorial pipeline for my technical writing. Notion cards become markdown drafts in the repo, pass through review, then sync to dev.to.

The motivation was simple: I already had a review loop I trusted for code. Open a PR, run Cursor's Bugbot against a review guide, fix what mattered, merge. I wanted the same rhythm for writing: draft, critique, revise, publish.

So I built my own AI review skill called editor-critique.

I had also started adding HTML comments inside drafts, much like code comments. They captured the editorial intent behind a section, including why it opened where it did and why evidence sat where it did, without becoming part of the published post.

That made the review step look straightforward. Give the AI a rubric, score the draft, return prioritized feedback. If the rubric was good, I assumed the critique would be good.

That assumption failed in a very specific way.

The first version of editor-critique did what I asked. It read a draft, applied five scoring dimensions, and produced a polished report. While reviewing my article, "The agent plan had every step except where to stop", it scored the piece 23/25 and mostly suggested polish.

It also missed the feedback I actually needed.

Valid rubric, shallow read

The draft did not need another pass on commas and section labels. It needed a colder editorial read. A useful reviewer should have asked:

- Does the title reveal the lesson before the incident earns it?
- Does the article assume private repo context a dev.to reader will not have?
- Are links to PRs, plans, and standards supporting evidence, or required reading?
- Is governance framing outrunning what the incident actually proved?

Those are reader-journey questions, not formatting checks.

The score-first reviewer treated the rubric as the first lens. If the thesis was present, evidence was named, and the arc looked complete, the draft read as ready.
The rubric turned critique into publication preflight: complete sections, reasonable voice, no obvious holes. Useful, but not enough.

What changed in the sequence

I revised the reviewer skill so analysis precedes scoring.

Before: Load draft → Score rubric dimensions → Generate critique

After: Load draft → Editorial read-through → Score rubric dimensions → Generate critique

The rubric stayed. It stopped being the opening move.

Before scoring, the reviewer now reads visible prose like a cold dev.to audience member. It mentally strips author notes and asks whether the lesson still works if repo links and hidden rationale disappeared. Then it checks thesis timing, audience assumptions, reference framing, and speculation drift.

The annotation loop mattered here. Because the comments sat beside the sections they explained, critique could compare intent against effect: the note described what the section was trying to do, while the reader-facing paragraph showed whether it actually did it. Sometimes the article needed the edit. Sometimes the annotation exposed that editor-critique itself was reading the section too mechanically. Either way, the disagreement became useful training material for the reviewer skill.

Only after that read does it assign scores.

The output became more editorial. Instead of asking only "does this draft satisfy the rubric?", it started asking "what will break for the reader?"

On the same article, the revised reviewer surfaced title spoiling the lesson, private PR assumptions, weak framing for repo artifacts, and governance language potentially ahead of the evidence. The 23/25 pass had treated those as minor or invisible.

Why order beat rubric tuning

A rubric compresses judgment into categories: thesis, structure, evidence, voice, readiness. That compression helps consistency. Compression too early can hide the problem.

Once the reviewer committed to a numerical assessment, the rest of the report tended to justify that assessment.
A 23/25 draft needed 23/25 feedback, so the model organized its reasoning around why the piece was mostly ready instead of independently discovering what a reader would struggle with.

It is a little like running a linter before reading a design doc. The linter can confirm imports and formatting are clean. It cannot tell you whether the design makes sense. Start with the linter and the document can feel more complete than it is.

That is what happened here. The rubric was not bad. It was premature.

Once analysis came first, the same categories became more honest. "Evidence and specificity" could include link-only dependence. "Thesis and opening" could include title spoiling the lesson. "Publish readiness" could include whether prose survives without private repo access. The score became a summary of the read-through, not a substitute for it.

QA review vs editorial review

The revision made me distinguish two kinds of AI review.

- QA review asks: Did the artifact satisfy the stated criteria?
- Editorial review asks: What will the reader misunderstand, miss, or not believe?

This was not completely new to me. In code review, I already used different Bugbot guides depending on what I wanted it to optimize for: security, game-state changes, UX regressions, or plan intent. The same diff could be reviewed through different lenses.

Writing turned out to have the same property as code review. A QA reviewer checks completeness and publishing criteria. An editorial reviewer reads for audience confusion and belief. The artifact stayed the same. The review lens changed.

Both matter. Broken frontmatter, missing sections, or absent takeaways still need QA. But if the reviewer starts and ends there, it can produce a confident report that never engages the reader's path through the article.

The first reviewer was not useless. It was doing QA under the name of critique.

The revised reviewer still scores, but it has to earn the score by reading first.
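
The ordering described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual editor-critique skill: the names `Observation`, `Critique`, `review`, and `stub_read_through` are hypothetical, the 5-points-per-dimension split is an assumption inferred from five dimensions and a 25-point total, and the one-point-per-finding rule is a toy stand-in for real judgment. The only thing it tries to show is the order: author notes are stripped, the read-through produces reader-impact findings, and scores are derived from those findings afterward.

```python
import re
from dataclasses import dataclass
from typing import Callable

# All names here are hypothetical; this is not the actual editor-critique skill.
DIMENSIONS = ("thesis", "structure", "evidence", "voice", "readiness")
MAX_PER_DIMENSION = 5  # assumption: five dimensions x 5 points = 25

# Matches HTML comments, including ones that span several lines.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


@dataclass
class Observation:
    dimension: str      # rubric dimension this finding counts against
    reader_impact: str  # what breaks for the reader, not which rule was violated


@dataclass
class Critique:
    observations: list[Observation]
    scores: dict[str, int]

    @property
    def total(self) -> int:
        return sum(self.scores.values())


def strip_author_notes(draft: str) -> str:
    """Drop HTML comments so the read-through sees only reader-facing prose."""
    return HTML_COMMENT.sub("", draft)


def review(draft: str, read_through: Callable[[str], list[Observation]]) -> Critique:
    """Load draft, do the editorial read-through, then score, in that order."""
    visible = strip_author_notes(draft)
    observations = read_through(visible)  # analysis first, no numbers yet
    scores = {}
    for dimension in DIMENSIONS:
        findings = [o for o in observations if o.dimension == dimension]
        # Toy rule (assumption): each finding costs one point, floored at zero.
        scores[dimension] = max(0, MAX_PER_DIMENSION - len(findings))
    return Critique(observations, scores)


def stub_read_through(visible: str) -> list[Observation]:
    """Stand-in for the model call; returns the four findings named in the post."""
    return [
        Observation("thesis", "The title reveals the lesson before the incident earns it."),
        Observation("readiness", "The prose assumes private repo context the reader lacks."),
        Observation("evidence", "Links to repo artifacts read as required reading."),
        Observation("evidence", "Governance framing runs ahead of what the incident proved."),
    ]


draft = "<!-- intent: open on the incident -->\nThe agent followed the plan.\n"
critique = review(draft, stub_read_through)
print(critique.total, critique.scores)
```

Because `review` takes the read-through as a parameter and computes scores only from its output, the number cannot appear before the findings it is supposed to summarize. Swapping the stub for a real model call would leave that ordering intact.
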
That sequencing shift moved output from "this article is mostly ready" toward "this article assumes too much context, reveals its lesson too early, and needs stronger in-narrative evidence before the governance argument about where an agent should stop lands."

That is the feedback I needed.

What I'd do on the next reviewer

For the next AI reviewer I build, I would design sequence before I tune rubric dimensions.

- Start with an ungated read. Inspect audience, intent, risk, and evidence before scoring thresholds appear.
- Make the rubric summarize the analysis. Scores should cite read-through observations, not invent them after the fact.
- Separate checklist pass from judgment pass. "Is it complete?" and "is it good?" are different questions.
- Force reader-impact language. Critique items should say what breaks for the reader, not only which rule was violated.
- Let scores come last. Once a number appears, everything organizes around it.

This is not only about writing. I suspect the same pattern may apply to PR review, architecture review, incident analysis, and evaluation reports: if a reviewer scores before it understands, it overfits to the rubric and under-reads the situation.

The shape feels portable. Evaluation criteria are not enough. The order in which a reviewer thinks changes what it notices.

Takeaway: If your AI reviewer keeps producing technically correct but shallow feedback, do not only rewrite the rubric. Move analysis before scoring.

If you'd like to see the project behind these workflow experiments, try Codenames AI.