July 30, 2025 – Introducing the latest arXiv papers

Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers

Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers [22.8]
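This paper introduces MISS-QA, the first benchmark for evaluating the ability to interpret schematic diagrams in scientific literature. MISS-QA consists of 1,500 expert-annotated examples across 465 scientific papers. The authors evaluate the performance of 18 frontier multimodal foundation models, including o4-mini, Gemini-2.5-Flash, and Qwen2.5-VL.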
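Reference translation of the paper (metadata) (Mon, 14 Jul 2025 20:35:25 GMT)
The paper states: "We present MISS-QA, the first benchmark specifically designed to assess the ability of foundation models to comprehend schematic diagrams in scientific literature." In short, it proposes a benchmark for understanding conceptual diagrams and similar figures. o4-mini performs relatively well, but the gap to human performance remains large.
The data is available at yale-nlp/MISS-QA · Datasets at Hugging Face, and the repository is GitHub – yilunzhao/MISS-QA.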

On the Effectiveness of LLM-as-a-judge for Code Generation and Summarization [55.0]
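Large language models have recently been used as judges for complex natural language processing tasks such as Q&A. This work examines the effectiveness of LLMs-as-a-judge on two code-related tasks: code generation and code summarization.
Reference translation of the paper (metadata) (Tue, 22 Jul 2025 13:40:26 GMT)
A validation of LLM-as-a-judge applied to evaluating code.
The authors report: "Our findings show that “small” LLMs struggle in judging tasks, with GPT-4-turbo being the model that achieves the best results. Still, even GPT-4-turbo frequently fails in assessing code correctness, while being a reliable judge of code summary quality." I am curious how newer models would fare.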
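As a rough illustration of how a judge's reliability on code correctness could be quantified, the sketch below compares a judge's pass/fail verdicts against ground-truth labels (for example, from running unit tests) and reports raw agreement and Cohen's kappa. This is not the paper's evaluation protocol; the function name `judge_agreement` and the toy verdicts are assumptions made up for illustration.

```python
from collections import Counter


def judge_agreement(judge_verdicts, ground_truth):
    """Return (raw agreement, Cohen's kappa) between two label sequences.

    Hypothetical helper: `judge_verdicts` are the labels an LLM judge assigned,
    `ground_truth` are reference labels (e.g. whether the code passed its tests).
    """
    if len(judge_verdicts) != len(ground_truth) or not ground_truth:
        raise ValueError("inputs must be non-empty and of equal length")

    n = len(ground_truth)
    # Fraction of items on which the judge matches the reference label.
    observed = sum(j == g for j, g in zip(judge_verdicts, ground_truth)) / n

    # Agreement expected by chance, from each side's label frequencies.
    judge_counts = Counter(judge_verdicts)
    truth_counts = Counter(ground_truth)
    labels = set(judge_counts) | set(truth_counts)
    expected = sum(judge_counts[l] * truth_counts[l] for l in labels) / (n * n)

    if expected == 1.0:
        # Both sequences use a single identical label; kappa is undefined,
        # so report perfect agreement by convention.
        return observed, 1.0
    return observed, (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Toy data (invented): True means "judged correct" / "tests passed".
    judge = [True, True, False, False]
    truth = [True, False, False, True]
    accuracy, kappa = judge_agreement(judge, truth)
    print(f"agreement={accuracy:.2f}, kappa={kappa:.2f}")
```

Kappa is worth reporting next to raw agreement because a judge that labels nearly everything "correct" can score high agreement on an imbalanced set while adding little beyond chance.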

The Impact of Language Mixing on Bilingual LLM Reasoning [4.5]
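This work investigates language switching in Chinese-English bilingual reasoning models. Enforcing monolingual decoding reduces accuracy on math reasoning tasks by 5.6 percentage points. A lightweight probe can be trained to predict whether a potential language switch would harm reasoning.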
Reference translation of the paper (metadata) (Mon, 21 Jul 2025 17:56:09 GMT)
On the phenomenon, often seen in LRMs (large reasoning models), of multiple languages being mixed within the reasoning process, the paper notes: "Discouraging this behavior in DeepSeek-R1 was found to degrade accuracy, suggesting that language mixing may benefit reasoning." It also states: "Altogether, these results suggest that language mixing is not a random artifact of multilingual training but a deliberate strategy that LLMs adopt to improve complex reasoning."