{"id":7174,"date":"2025-07-28T05:59:00","date_gmt":"2025-07-27T20:59:00","guid":{"rendered":"https:\/\/devneko.jp\/wordpress\/?p=7174"},"modified":"2025-07-27T10:10:35","modified_gmt":"2025-07-27T01:10:35","slug":"checklists-are-better-than-reward-models-for-aligning-language-models","status":"publish","type":"post","link":"https:\/\/devneko.jp\/wordpress\/?p=7174","title":{"rendered":"Checklists Are Better Than Reward Models For Aligning Language Models"},"content":{"rendered":"\n<ul class=\"wp-block-list\">\n<li><strong>Checklists Are Better Than Reward Models For Aligning Language Models\u00a0<\/strong>[99.2]<br>The paper proposes Reinforcement Learning from Checklist Feedback (RLCF). Checklists are extracted from instructions, and responses are evaluated on how well they satisfy each item. The per-item scores, obtained with both AI judges and specialized verifier programs, are combined to compute the RL reward.<br><a href=\"http:\/\/arxiv.org\/abs\/2507.18624v1\">Paper<\/a>\u00a0\u00a0<a href=\"https:\/\/fugumt.com\/fugumt\/paper_check\/2507.18624v1\">Reference translation (metadata)<\/a>\u00a0 \u00a0(Thu, 24 Jul 2025 17:58:00 GMT)<\/li>\n\n\n\n<li>The paper asks: &#8220;how can we grade responses to instructions in a manner that is automatic (requires no human annotation), flexible (considers all aspects of response quality), intuitive (aligned with perceptible differences in responses), and applicable to any instruction or response, to enable more effective use of RL in language model alignment?\u201d 
In response, it proposes checklist generation and reinforcement learning driven by checklist-based feedback. The authors report the effect as follows: \u201cFrom instructions, we extract checklists and evaluate how well responses satisfy each item\u2014using both AI judges and specialized verifier programs\u2014then combine these scores to compute rewards for RL. We compare RLCF with other alignment methods applied to a strong instruction following model (Qwen2.5-7B-Instruct) on five widely-studied benchmarks \u2013 RLCF is the only method to improve performance on every benchmark, including a 4-point boost in hard satisfaction rate on FollowBench, a 6-point increase on InFoBench, and a 3-point rise in win rate on Arena-Hard.\u201d<\/li>\n\n\n\n<li>Checklists are generated with a large model and then used for \u201cReinforcement Learning from Checklist Feedback\u201d 
(RLCF). Much of the gain likely comes from what is effectively distillation from the large model, but it is still interesting that the approach improves performance. (As the Limitations section notes, the computational cost is high.)<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2],"tags":[576],"class_list":["post-7174","post","type-post","status-publish","format-standard","hentry","category-arxiv","tag-576"],"_links":{"self":[{"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/7174","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=7174"}],"version-history":[{"count":1,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/7174\/revisions"}],"predecessor-version":[{"id":7175,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/7174\/revisions\/7175"}],"wp:attachment":[{"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=7174"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=7174"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=7174"}],"
curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}