The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements

The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements [87.6]
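A critical capability for scientific progress is the ability to reproduce existing work. To evaluate the ability of AI agents to reproduce results in an active research area, the authors introduce the Automated LLM Speedrunning Benchmark. They find that recent LLMs combined with SoTA scaffolds struggle to reimplement already-known innovations in the benchmark.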
Reference translation of the paper (metadata) (Fri, 27 Jun 2025 17:44:32 GMT)
A somewhat surprising result: "We find that recent reasoning LLMs combined with SoTA scaffolds struggle to reimplement already-known innovations in our benchmark, even when given detailed hints."
The repository is GitHub – facebookresearch/llm-speedrunner: The Automated LLM Speedrunning Benchmark measures how well LLM agents can reproduce previous innovations and discover new ones in language modeling.
