{"id":4145,"date":"2023-12-11T06:17:00","date_gmt":"2023-12-10T21:17:00","guid":{"rendered":"https:\/\/devneko.jp\/wordpress\/?p=4145"},"modified":"2023-12-11T06:17:00","modified_gmt":"2023-12-10T21:17:00","slug":"competition-level-problems-are-effective-llm-evaluators","status":"publish","type":"post","link":"https:\/\/devneko.jp\/wordpress\/?p=4145","title":{"rendered":"Competition-Level Problems are Effective LLM Evaluators"},"content":{"rendered":"\n<ul class=\"wp-block-list\">\n<li><strong>Competition-Level Problems are Effective LLM Evaluators\u00a0<\/strong>[124.8]<br>This paper aims to evaluate the reasoning capabilities of large language models (LLMs) in solving recent programming problems on Codeforces. It first comprehensively evaluates GPT-4's perceived zero-shot performance, considering aspects such as problem release time, difficulty, and the types of errors encountered. Surprisingly, GPT-4's perceived performance falls off a cliff on problems released after September 2021, consistently across all difficulties and problem types.<br><a href=\"http:\/\/arxiv.org\/abs\/2312.02143v2\">Paper<\/a>\u00a0\u00a0<a href=\"https:\/\/fugumt.com\/fugumt\/paper_check\/2312.02143v2\">Reference translation (metadata)<\/a>\u00a0 \u00a0(Tue, 5 Dec 2023 03:44:19 
GMT)<\/li>\n\n\n\n<li>The paper uses Codeforces problems to examine the data contamination problem in LLMs. The result is rather shocking: \u201cWe find a significant decrease in perceived performance of GPT-4 on unseen problems, consistent across a range of difficulties, problem types, and experimental settings.\u201d<\/li>\n\n\n\n<li>Other studies have made similar observations, and the Gemini technical report also notes: \u201cEvaluation on these benchmarks is challenging and may be affected by data contamination. We performed an extensive leaked data analysis after training to ensure the results we report here are as scientifically sound as possible, but still found some minor issues and decided not to report results on e.g. LAMBADA (Paperno et al., 2016).\u201d (<a href=\"https:\/\/storage.googleapis.com\/deepmind-media\/gemini\/gemini_1_report.pdf\">gemini_1_report.pdf (storage.googleapis.com)<\/a>) Accurate evaluation is hard.<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2],"tags":[223,648],"class_list":["post-4145","post","type-post","status-publish","format-standard","hentry","category-arxiv","tag-llm","tag-648"],"_links":{"self":[{"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/4145","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=4145"}],"version-history":[{"count":0,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/4145\/revisions"}],"wp:attachment":[{"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=4145"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=4145"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=4145"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}