March 11, 2026 – Introducing the latest arXiv papers

Interactive Benchmarks [45.7]
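We propose Interactive Benchmarks, a unified evaluation paradigm that assesses a model's reasoning ability in an interactive process under budget constraints. We instantiate this framework in two settings: Interactive Proofs, in which the model interacts with a judge to deduce objective truths or answers in logic and mathematics.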
Reference translation of the paper (metadata) (Thu, 05 Mar 2026 02:18:26 GMT)
"By actively collecting information, the agent can update its beliefs and make better decisions under uncertainty. To evaluate a model’s ability to reason while actively acquiring information, we draw inspiration from the concept of Interactive Proofs in computational complexity theory (Goldwasser et al., 2019) and propose a unified evaluation paradigm, which we call Interactive Benchmarks." This is a benchmark of the kind where the model has to find the answer while taking actions. It is a task that matters in practice. It is also striking that this kind of behavior is becoming possible (with general-purpose models).
Repository: GitHub – interactivebench/InteractiveBench: Official Project Page for Interactive Benchmarks
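
To make "reasoning under a budget while interacting with a judge" a little more concrete, here is a toy sketch. It is not the paper's protocol or the InteractiveBench API: the number-guessing task, `run_episode`, `ask`, and `judge` are all hypothetical names chosen for illustration. The agent may put at most `budget` yes/no questions to a truthful judge and has to pin down a hidden integer.

```python
from typing import Callable, Optional


def run_episode(
    ask: Callable[[int], bool],
    low: int,
    high: int,
    budget: int,
) -> Optional[int]:
    """Find a hidden integer in [low, high] using at most `budget` queries.

    `ask(x)` plays the judge and truthfully answers "is the hidden value <= x?".
    Returns the value once it is pinned down, or None if the budget runs out first.
    """
    queries = 0
    while low < high:
        if queries >= budget:
            return None  # budget exhausted before the answer was certain
        mid = (low + high) // 2
        queries += 1
        if ask(mid):
            high = mid       # hidden value is in [low, mid]
        else:
            low = mid + 1    # hidden value is in [mid + 1, high]
    return low


# Hypothetical judge holding a secret value (illustration only).
secret = 37
judge = lambda x: secret <= x

print(run_episode(judge, 1, 100, budget=7))  # 37
print(run_episode(judge, 1, 100, budget=3))  # None
```

With 100 candidates, halving the range needs at most 7 questions, so a budget of 7 succeeds and a budget of 3 does not. The point of the sketch is only that the score depends on how well each question is chosen, not on a single-shot answer.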

SumTablets: A Transliteration Dataset of Sumerian Tablets [28.7]
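SumTablets is a dataset that pairs Unicode representations of 91,606 Sumerian cuneiform tablets with their transliterations. We release SumTablets as a Hugging Face dataset and open-source the data preparation code via GitHub. Our fine-tuned language model achieves an average character-level F-score (chrF) of 97.55.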
Reference translation of the paper (metadata) (Wed, 25 Feb 2026 18:50:42 GMT)
"the absence of a comprehensive, accessible dataset pairing transliterations with a digital representation of the tablet’s cuneiform glyphs has prevented the application of modern Natural Language Processing (NLP) methods to the task of Sumerian transliteration. To address this gap, we present SumTablets, a dataset pairing Unicode representations of 91,606 Sumerian cuneiform tablets (totaling 6,970,407 glyphs) with the associated transliterations published by Oracc." That is the dataset being introduced here.
Repository: GitHub – colesimmons/SumTablets: SumTablets is a dataset designed for training Sumerian transliteration models. Dataset: colesimmons/SumTablets on Hugging Face (Datasets).
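
For readers unfamiliar with the chrF score quoted above, here is a minimal sketch of a character n-gram F-score. It is a simplified version written for illustration, not the paper's evaluation code and not the sacreBLEU implementation. The function names, the choice to drop spaces, and the defaults `max_n=6` and `beta=2.0` are assumptions, and the sample strings are made up.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams of order n (spaces removed for simplicity)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF on a 0-100 scale.

    Averages n-gram precision and recall over orders 1..max_n,
    then combines them with an F-beta score (beta > 1 favors recall).
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


# Made-up transliteration-like strings, for illustration only.
print(simple_chrf("lugal kalam-ma", "lugal kalam-ma"))  # 100.0
print(simple_chrf("lugal kalam", "lugal kalam-ma"))
```

An exact match scores 100 and a partial match scores lower. The score drops further as more of the reference's character sequences are missing from the output.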