{"id":7466,"date":"2025-09-30T05:52:00","date_gmt":"2025-09-29T20:52:00","guid":{"rendered":"https:\/\/devneko.jp\/wordpress\/?p=7466"},"modified":"2025-09-20T19:59:52","modified_gmt":"2025-09-20T10:59:52","slug":"fluid-language-model-benchmarking","status":"publish","type":"post","link":"https:\/\/devneko.jp\/wordpress\/?p=7466","title":{"rendered":"Fluid Language Model Benchmarking\u00a0"},"content":{"rendered":"\n<ul class=\"wp-block-list\">\n<li><strong>Fluid Language Model Benchmarking\u00a0<\/strong>[126.9]<br>We introduce Fluid Benchmarking, a new evaluation approach that advances LM benchmarking across multiple dimensions. Inspired by psychometrics, Fluid Benchmarking is based on the insight that the relative value of benchmark items depends on an LM\u2019s capability level. Examining four dimensions (efficiency, validity, variance, and saturation), we find that Fluid Benchmarking achieves superior performance in all of them.<br><a href=\"http:\/\/arxiv.org\/abs\/2509.11106v1\">Paper<\/a>\u00a0\u00a0<a href=\"https:\/\/fugumt.com\/fugumt\/paper_check\/2509.11106v1\">Reference translation (metadata)<\/a>\u00a0 \u00a0(Sun, 14 Sep 2025 05:49:42 GMT)<\/li>\n\n\n\n<li>\u300cwe introduce FLUID BENCHMARKING, a new evaluation approach that advances LM benchmarking across multiple dimensions. 
Inspired by psychometrics, FLUID BENCHMARKING is based on the insight that the relative value of benchmark items depends on an LM\u2019s capability level, suggesting that evaluation should adapt to each LM. Methodologically, FLUID BENCHMARKING estimates an item response model based on existing LM evaluation results and uses the inferred quantities to select evaluation items dynamically, similar to computerized adaptive testing in education.\u300d This is the evaluation method the paper proposes.<\/li>\n\n\n\n<li>The repository is <a href=\"https:\/\/github.com\/allenai\/fluid-benchmarking\">GitHub &#8211; allenai\/fluid-benchmarking: Fluid Language Model Benchmarking<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2],"tags":[223,517,647],"class_list":["post-7466","post","type-post","status-publish","format-standard","hentry","category-arxiv","tag-llm","tag-517","tag-647"],"_links":{"self":[{"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/7466","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=7466"}],"version-history":[{"count":1,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/7466\/revisions"}],"predecessor-version":[{"id":7467,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/7466\/revisions\/7467"}],"wp:attachment":[{"href":"https:\/\/devneko.jp\/w
ordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=7466"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=7466"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=7466"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}