July 3, 2025 – Introducing the latest arXiv papers

MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation

MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation [89.7]
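MultiFinBen is the first multilingual and multimodal benchmark tailored to the global financial domain. We introduce two novel tasks, EnglishOCR and SpanishOCR, the first OCR-embedded financial QA tasks. We propose a dynamic, difficulty-aware selection mechanism and curate a compact, balanced benchmark.
Reference translation of the paper (metadata) (Mon, 16 Jun 2025 22:01:49 GMT)
A multimodal, multilingual benchmark for the financial domain. It appears to include Japanese data as well.
The repository is GitHub – xueqingpeng/MultiFinBen, and the data is published on Hugging Face (e.g., TheFinAI/PolyFiQA-Easy · Datasets at Hugging Face).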

OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents

OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents [34.4]
We introduce OS-Harm, a new benchmark for measuring the safety of computer use agents. OS-Harm is built on top of the OSWorld environment and aims to test models across three categories of harm: deliberate user misuse, prompt injection attacks, and model misbehavior. We evaluate computer use agents based on frontier models and provide insights into their safety.
Reference translation of the paper (metadata) (Tue, 17 Jun 2025 17:59:31 GMT)
The paper proposes a benchmark, which it describes as follows: "First, we identify three main categories of risk: (1) deliberate user misuse, where the user asks the agent to pursue a harmful goal, (2) prompt injection attacks, where external attackers insert malicious content into third-party data (incoming emails, web pages, notifications, etc.) that steers the model away from performing its task and towards the attacker’s goal, and (3) model misbehavior, including benign tasks which are likely to result in costly mistakes or reveal model misalignment. For each category, we design tasks that differ in the type of safety violations and in the apps they require (such as Thunderbird, VS Code, Terminal, LibreOffice Impress, etc.), for a total of 150 tasks."
The repository is GitHub – tml-epfl/os-harm: OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents

Answer-Centric or Reasoning-Driven? Uncovering the Latent Memory Anchor in LLMs

Answer-Centric or Reasoning-Driven? Uncovering the Latent Memory Anchor in LLMs [28.6]
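Large language models (LLMs) demonstrate impressive reasoning capabilities. Evidence suggests, however, that much of their success stems from memorized answer-reasoning patterns rather than genuine reasoning. We propose a five-level answer-visibility prompt framework that systematically manipulates answer cues and probes model behavior through indirect, behavioral analysis.
Reference translation of the paper (metadata) (Sat, 21 Jun 2025 08:15:45 GMT)
The paper observes: "By manipulating the visibility of final answers within prompts, we uncover a profound and consistent pattern: LLM performance is predominantly anchored to the explicit presence of final answers rather than to the textual patterns of the reasoning steps themselves." It is interesting, though, that behavior also differs considerably from one LRM to another.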
