{"id":7338,"date":"2025-09-05T05:27:00","date_gmt":"2025-09-04T20:27:00","guid":{"rendered":"https:\/\/devneko.jp\/wordpress\/?p=7338"},"modified":"2025-08-23T21:30:36","modified_gmt":"2025-08-23T12:30:36","slug":"the-self-execution-benchmark-measuring-llms-attempts-to-overcome-their-lack-of-self-execution","status":"publish","type":"post","link":"https:\/\/devneko.jp\/wordpress\/?p=7338","title":{"rendered":"The Self-Execution Benchmark: Measuring LLMs&#8217; Attempts to Overcome Their Lack of Self-Execution"},"content":{"rendered":"\n<ul class=\"wp-block-list\">\n<li><strong>The Self-Execution Benchmark: Measuring LLMs&#8217; Attempts to Overcome Their Lack of Self-Execution\u00a0<\/strong>[13.6]<br>Large language models (LLMs) are commonly evaluated on tasks that test their knowledge or reasoning abilities. This paper introduces the Self-Execution Benchmark, which measures a model\u2019s ability to anticipate properties of its output. The experiments show that models generally perform poorly on this benchmark, and that increased model size or capability does not consistently lead to better performance.<br><a href=\"http:\/\/arxiv.org\/abs\/2508.12277v1\">Paper<\/a>\u00a0\u00a0<a href=\"https:\/\/fugumt.com\/fugumt\/paper_check\/2508.12277v1\">Reference translation (metadata)<\/a>\u00a0 \u00a0(Sun, 17 Aug 2025 07:57:58 
GMT)<\/li>\n\n\n\n<li>\u201cSince LLMs lack the ability to execute themselves, we introduce the Self-Execution Benchmark, which measures a model\u2019s ability to anticipate properties of its output, such as whether a question will be difficult for it, whether it will refuse to answer, or what kinds of associations it is likely to produce. Our experiments show that models generally perform poorly on this benchmark, and that increased model size or capability does not consistently lead to better performance.\u201d An unusual benchmark. It takes a meta-level perspective, and it is very interesting, results included.<\/li>\n\n\n\n<li>The repository is <a href=\"https:\/\/github.com\/anon-researcher-2025\/Self-Execution-Benchmark\">GitHub &#8211; anon-researcher-2025\/Self-Execution-Benchmark<\/a><\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2],"tags":[517],"class_list":["post-7338","post","type-post","status-publish","format-standard","hentry","category-arxiv","tag-517"],"_links":{"self":[{"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/7338","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=7338"}],"version-history":[{"count":1,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/po
sts\/7338\/revisions"}],"predecessor-version":[{"id":7339,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/7338\/revisions\/7339"}],"wp:attachment":[{"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=7338"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=7338"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=7338"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}