{"id":7476,"date":"2025-09-22T07:25:00","date_gmt":"2025-09-21T22:25:00","guid":{"rendered":"https:\/\/devneko.jp\/wordpress\/?p=7476"},"modified":"2025-09-20T20:30:18","modified_gmt":"2025-09-20T11:30:18","slug":"pre-training-under-infinite-compute","status":"publish","type":"post","link":"https:\/\/devneko.jp\/wordpress\/?p=7476","title":{"rendered":"Pre-training under infinite compute"},"content":{"rendered":"\n<ul class=\"wp-block-list\">\n<li><strong>Pre-training under infinite compute\u00a0<\/strong>[87.0]<br>This study shows that the data-constrained approaches of increasing the epoch count and increasing the parameter count eventually overfit. Ensembles of independently trained models achieve a much lower loss asymptote than the regularized recipe. These results suggest that more data-efficient pre-training is achievable in a compute-rich future.<br><a href=\"http:\/\/arxiv.org\/abs\/2509.14786v1\">Paper<\/a>\u00a0\u00a0<a href=\"https:\/\/fugumt.com\/fugumt\/paper_check\/2509.14786v1\">Reference translation (metadata)<\/a>\u00a0 \u00a0(Thu, 18 Sep 2025 09:36:23 GMT)<\/li>\n\n\n\n<li>\u201cOur best intervention combining epoching, regularization, parameter scaling, and ensemble scaling achieves an asymptote at 200M tokens using 5.17\u00d7 less data than our baseline, and our data scaling laws predict that this improvement persists at higher token 
budgets. We find that our data efficiency gains can be realized at much smaller parameter counts as we can distill an ensemble into a student model that is 8\u00d7 smaller and retains 83% of the ensembling benefit.\u201d This result looks like a promising answer to concerns about running out of training data.<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2],"tags":[223],"class_list":["post-7476","post","type-post","status-publish","format-standard","hentry","category-arxiv","tag-llm"],"_links":{"self":[{"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/7476","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=7476"}],"version-history":[{"count":1,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/7476\/revisions"}],"predecessor-version":[{"id":7477,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/7476\/revisions\/7477"}],"wp:attachment":[{"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=7476"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=7476"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=7476"}],"curies":[{"name":"wp","h
ref":"https:\/\/api.w.org\/{rel}","templated":true}]}}