{"id":6631,"date":"2025-04-28T05:01:00","date_gmt":"2025-04-27T20:01:00","guid":{"rendered":"https:\/\/devneko.jp\/wordpress\/?p=6631"},"modified":"2025-04-28T05:01:00","modified_gmt":"2025-04-27T20:01:00","slug":"evaluating-judges-as-evaluators-the-jetts-benchmark-of-llm-as-judges-as-test-time-scaling-evaluators","status":"publish","type":"post","link":"https:\/\/devneko.jp\/wordpress\/?p=6631","title":{"rendered":"Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators\u00a0"},"content":{"rendered":"\n<ul class=\"wp-block-list\">\n<li><strong>Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators\u00a0<\/strong>[66.8]<br>This paper introduces a benchmark for evaluating judges in test-time scaling. It evaluates judge performance in three domains (reasoning, code generation, and instruction following) under three task settings. The benchmark shows that judges are competitive with outcome reward models in reranking, but consistently worse than process reward models in beam search.<br><a href=\"http:\/\/arxiv.org\/abs\/2504.15253v1\">Paper<\/a>\u00a0\u00a0<a href=\"https:\/\/fugumt.com\/fugumt\/paper_check\/2504.15253v1\">Reference translation (metadata)<\/a>\u00a0 \u00a0(Mon, 21 Apr 2025 17:33:23 GMT)<\/li>\n\n\n\n<li>The stated motivation is &#8220;we seek to understand the feasibility of using LLM-judges in place of 
typically used RMs in test-time compute procedures.&#8221; On that basis the authors propose a benchmark: &#8220;we introduce the Judge Evaluation for Test-Time Scaling (JETTS) benchmark, which evaluates judge performance in three domains (math reasoning, code generation, and instruction following) under three task settings: response reranking, step-level beam search, and critique-based response refinement.&#8221; The results are fairly harsh: &#8220;We find that weak judges can help strong generators in easier tasks, such as instruction following, but not in reasoning-intensive tasks like coding or math. Larger judges bring the most benefit for math and instruction following tasks, but no evaluated judges are able to reliably improve generator performance for coding. Lastly, while natural language critiques are touted as a defining advantage of judges over RMs, we find that such critiques have significant room for improvement in terms of utility.&#8221;<\/li>\n\n\n\n<li>The repository is <a href=\"https:\/\/github.com\/SalesforceAIResearch\/jetts-benchmark\">GitHub &#8211; SalesforceAIResearch\/jetts-benchmark: Code repository for the paper &#8220;Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling 
Evaluators&#8221;<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2],"tags":[224,517],"class_list":["post-6631","post","type-post","status-publish","format-standard","hentry","category-arxiv","tag-llm-as-a-judge","tag-517"],"_links":{"self":[{"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/6631","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=6631"}],"version-history":[{"count":0,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/6631\/revisions"}],"wp:attachment":[{"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=6631"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=6631"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=6631"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}