Introducing the Latest arXiv Papers

On the Effectiveness of LLM-as-a-judge for Code Generation and Summarization

On the Effectiveness of LLM-as-a-judge for Code Generation and Summarization [55.0]
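Large language models have recently been used as judges for complex natural language processing tasks such as Q&A. We examine the effectiveness of LLMs-as-a-judge on two code-related tasks: code generation and code summarization.
Reference translation of the paper (metadata) (Tue, 22 Jul 2025 13:40:26 GMT)
A study validating LLM-as-a-judge for the evaluation of code.
The authors report: "Our findings show that “small” LLMs struggle in judging tasks, with GPT-4-turbo being the model that achieves the best results. Still, even GPT-4-turbo frequently fails in assessing code correctness, while being a reliable judge of code summary quality." I am curious how newer models would fare.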
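One common way to quantify how reliable a judge is on code correctness is to compare its verdicts with ground-truth outcomes such as test results. The following is a minimal sketch, assuming binary verdicts and using plain accuracy and Cohen's kappa as agreement measures. The verdict lists are invented for illustration, and the paper's own evaluation protocol may differ.

```python
def accuracy(judge_verdicts, ground_truth):
    """Fraction of samples where the judge matches the ground truth."""
    matches = sum(j == t for j, t in zip(judge_verdicts, ground_truth))
    return matches / len(ground_truth)


def cohen_kappa(judge_verdicts, ground_truth):
    """Cohen's kappa for two binary label sequences (1 = code judged correct)."""
    n = len(ground_truth)
    observed = accuracy(judge_verdicts, ground_truth)
    p_judge = sum(judge_verdicts) / n
    p_truth = sum(ground_truth) / n
    # Agreement expected by chance if the two raters were independent.
    expected = p_judge * p_truth + (1 - p_judge) * (1 - p_truth)
    if expected == 1.0:
        return 1.0  # both sequences are constant and identical
    return (observed - expected) / (1 - expected)


# Hypothetical data: 1 = "correct", 0 = "incorrect".
judge = [1, 1, 1, 0, 1, 0, 1, 1]   # verdicts from an LLM judge (invented)
truth = [1, 0, 1, 0, 0, 0, 1, 1]   # outcomes from running tests (invented)

judge_accuracy = accuracy(judge, truth)
judge_kappa = cohen_kappa(judge, truth)
```

Kappa corrects for chance agreement, which matters when a judge labels most code as correct: raw accuracy can look respectable even when the judge adds little information.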

The Impact of Language Mixing on Bilingual LLM Reasoning

The Impact of Language Mixing on Bilingual LLM Reasoning [4.5]
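We study language switching in Chinese-English bilingual reasoning models. Enforcing monolingual decoding reduces accuracy on math reasoning tasks by 5.6 points. A lightweight probe can be trained to predict whether a potential language switch would harm reasoning.
Reference translation of the paper (metadata) (Mon, 21 Jul 2025 17:56:09 GMT)
On the issue, often seen in large reasoning models (LRMs), of multiple languages mixing within the reasoning process, the paper notes: "Discouraging this behavior in DeepSeek-R1 was found to degrade accuracy, suggesting that language mixing may benefit reasoning." It also states: "Altogether, these results suggest that language mixing is not a random artifact of multilingual training but a deliberate strategy that LLMs adopt to improve complex reasoning."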
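The lightweight probe mentioned in the abstract can be pictured as a small classifier over features taken at a candidate switch point. The following is a minimal sketch, assuming a logistic-regression probe trained with gradient descent on hypothetical feature vectors, with binary labels marking whether the switch harmed the final answer. The feature layout, the synthetic data, and the hyperparameters are illustrative assumptions, not the paper's setup.

```python
import math
import random


def sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression probe with full-batch gradient descent.

    features: list of equal-length float vectors (hypothetical switch-point features).
    labels:   list of 0/1 ints, 1 = the switch harmed reasoning (assumed labelling).
    """
    dim = len(features[0])
    weights = [0.0] * dim
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            err = p - y
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_harm(weights, bias, x) -> float:
    # Estimated probability that switching language here harms reasoning.
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Synthetic toy data (an assumption for illustration only): harmful switches
# cluster around +1 in the first feature, harmless ones around -1.
rng = random.Random(0)
toy_features = [[rng.gauss(1.0, 0.3), rng.gauss(0.0, 1.0)] for _ in range(50)]
toy_features += [[rng.gauss(-1.0, 0.3), rng.gauss(0.0, 1.0)] for _ in range(50)]
toy_labels = [1] * 50 + [0] * 50

probe_w, probe_b = train_probe(toy_features, toy_labels)
```

In a decoding loop, such a probe could be consulted before a switch is allowed, for example by suppressing the switch when `predict_harm` exceeds a chosen threshold. How the paper actually uses its probe is not described in this post.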

Pixels, Patterns, but No Poetry: To See The World like Humans

Pixels, Patterns, but No Poetry: To See The World like Humans [33.8]
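State-of-the-art MLLMs fail catastrophically on perceptual tasks that are easy for humans. This paper shifts the focus from reasoning to perception.
Reference translation of the paper (metadata) (Mon, 21 Jul 2025 21:50:16 GMT)
Proposes the Turing Eye Test (TET), a set of tasks that humans grasp intuitively. According to the paper, "Through four diagnostic tasks involving concealed text, 3D Captchas, Chinese character compositions, and color blind test charts, we demonstrated that state-of-the-art MLLMs exhibit catastrophic failures on perceptual tasks that humans solve intuitively.", so many of the tasks remain unsolved by AI. It would be interesting to see whether models can understand the characters from the 創作漢字コンテスト (a contest for newly invented kanji), though leakage is a worry…
- In my own test, o3-pro did not seem able to read the entries at https://sousaku-kanji.com/archive/contest_15th.html.
The project site is Pixels, Patterns, but no Poetry: To See the World like Humans.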

Diffusion Beats Autoregressive in Data-Constrained Settings

Diffusion Beats Autoregressive in Data-Constrained Settings [46.1]
Autoregressive (AR) models have long dominated the landscape of large language models. Recently, diffusion-based language models have emerged as a promising alternative, though their advantages over AR models remain underexplored.
Reference translation of the paper (metadata) (Mon, 21 Jul 2025 17:59:57 GMT)
The paper observes: "In this paper, we systematically study masked diffusion models in data-constrained settings—where training involves repeated passes over limited data—and find that they significantly outperform AR models when compute is abundant but data is scarce. Diffusion models make better use of repeated data, achieving lower validation loss and superior downstream performance." This matches my intuition as well.
The repository is Diffusion Beats Autoregressive in Data-Constrained Settings.

Docopilot: Improving Multimodal Models for Document-Level Understanding

Docopilot: Improving Multimodal Models for Document-Level Understanding [87.6]
To support in-depth understanding of multimodal documents, we propose Doc-750K, a high-quality document-level dataset. The dataset includes diverse document structures, extensive cross-page dependencies, and real question-answer pairs derived from the original documents. Building on this dataset, we develop Docopilot, a native multimodal model that can accurately handle document-level dependencies without relying on RAG.
Reference translation of the paper (metadata) (Sat, 19 Jul 2025 16:03:34 GMT)
Construction of a large-scale dataset for multimodal document understanding, together with an InternVL2-based model. The reported gains: "The proposed Docopilot-8B shows a notable improvement over baseline models [73], achieving a +19.9% accuracy gain compared to InternVL2-8B and surpassing InternVL2-26B with less than 31% of the inference latency. Additionally, Docopilot-2B uses fewer parameters (less than 10%) while exhibiting comparable performance to the 10× larger InternVL2-26B."
The repository is OpenGVLab/Docopilot: [CVPR 2025] Docopilot: Improving Multimodal Models for Document-Level Understanding.

A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends

A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends [11.4]
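Visually-Rich Document Understanding (VRDU) has emerged as an important field, driven by the need to automatically process documents containing complex visual, textual, and layout information. This survey reviews recent advances in MLLM-based VRDU, highlighting three core components.
Reference translation of the paper (metadata) (Mon, 14 Jul 2025 02:10:31 GMT)
A survey of document understanding, including the handling of figures and layout.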

Qwen3-Coder, Intern-S1, Step-Audio2, TeleChat2

There was a lot of news about Chinese models: the Claude 4 Sonnet-level Qwen3 Coder (QwenLM/Qwen3-Coder: Qwen3-Coder is the code version of Qwen3, the large language model series developed by Qwen team, Alibaba Cloud.); Intern S1, a strong multimodal LRM built from a 235B MoE language model (Qwen3) plus a 6B vision encoder (InternViT) (InternLM/Intern-S1); and the release of the Kimi K2 technical report (Kimi-K2/tech_report.pdf at main · MoonshotAI/Kimi-K2). Competition is fierce, with Qwen3-Instruct-2507 (QwenLM/Qwen3: Qwen3 is the large language model series developed by Qwen team, Alibaba Cloud.) claimed to surpass Kimi K2.

On the audio side as well, StepFun has released the Step-Audio 2 Technical Report, and TeleAI has released the TECHNICAL REPORT OF TELECHAT2, TELECHAT2.5 AND T1. Both claim strong performance. I am also very interested in robotics papers such as GR-3.

And GPT-5 should be announced very soon, so progress looks set to continue.

Step-Audio 2 Technical Report [108.0]
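Step-Audio 2 is an end-to-end multimodal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding.
Reference translation of the paper (metadata) (Thu, 24 Jul 2025 11:13:12 GMT)
The repository is stepfun-ai/Step-Audio2: Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation.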

Technical Report of TeleChat2, TeleChat2.5 and T1 [40.9]
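We introduce the latest TeleChat models: TeleChat2, TeleChat2.5, and T1. Despite minimal changes to the model architecture, the new series achieves substantial performance gains through enhanced training strategies.
Reference translation of the paper (metadata) (Thu, 24 Jul 2025 01:00:48 GMT)
The repository is Tele-AI/TeleChat2: 星辰语义大模型TeleChat2是由中国电信人工智能研究院研发训练的大语言模型，是首个完全国产算力训练并开源的千亿参数模型 (in English: TeleChat2 is a large language model developed and trained by the China Telecom Artificial Intelligence Research Institute, and the first open-source model at the hundred-billion-parameter scale trained entirely on domestic computing power).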

GR-3 Technical Report [21.9]
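GR-3 is a large-scale vision-language-action (VLA) model. It shows exceptional ability to generalize to novel objects, environments, and instructions involving abstract concepts. GR-3 excels at long-horizon and dexterous tasks, including those that require bi-manual manipulation and mobile movement.
Reference translation of the paper (metadata) (Mon, 21 Jul 2025 10:54:13 GMT)
The project site is ByteDance Seed.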

Apple Intelligence Foundation Language Models: Tech Report 2025 [246.0]
We introduce two foundation language models that power Apple Intelligence features across Apple devices and services. Both models are trained on large-scale multilingual and multimodal datasets sourced through responsible web crawling. A new Swift-centric Foundation Models framework exposes guided generation, constrained tool calling, and LoRA adapter fine-tuning.
Reference translation of the paper (metadata) (Thu, 17 Jul 2025 23:37:19 GMT)
The Apple Intelligence technical report has been posted to arXiv.
Their evaluation: "We found that AFM on-device model performs better than Qwen-2.5-3B, Gemma-3-4B and Gemma-3n-E4B on MMLU/MMMLU, but it lags slightly behind Gemma-3n-E4B on MGSM. AFM on-device model performs lower than the larger Qwen-3-4B model. AFM server models lag slightly to LLaMA 4 Scout, whose total size and active number of parameters are comparable, but has a bigger gap to larger models such as Qwen-3-235B and the proprietary GPT-4o."

Expert-Guided LLM Reasoning for Battery Discovery: From AI-Driven Hypothesis to Synthesis and Characterization

Expert-Guided LLM Reasoning for Battery Discovery: From AI-Driven Hypothesis to Synthesis and Characterization [48.0]
Large language models (LLMs) use chain-of-thought (CoT) techniques to tackle complex problems. We introduce ChatBattery, a novel agentic framework that integrates domain knowledge, for more effective reasoning in materials design. We identify, synthesize, and characterize three novel lithium-ion battery cathode materials, which achieve practical capacity improvements of 28.8%, 25.2%, and 18.5%, respectively.
Reference translation of the paper (metadata) (Mon, 21 Jul 2025 23:46:11 GMT)
An AI that supports scientific discovery. It is a multi-agent system with a fairly complex structure: "ChatBattery is an AI-driven material optimization platform structured into two synergistic phases: exploration and exploitation. Together, these phases encompass eight sequential stages, orchestrated by seven specialized agents." Collaboration with humans also appears to be emphasized.
- The paper states: "This suggests that ChatBattery, in its present form, is more adept at optimizing within known paradigms than at generating fundamentally new chemistries. As such, expert input remains essential to expand the system’s exploration boundaries and push beyond conventional chemical spaces. Importantly, this interplay between AI-driven generation and human-guided refinement also creates unexpected opportunities, as demonstrated in the refinement of AI-suggested materials into even more advanced cathode compositions. However, advances anticipated with future reasoning AIs are likely to provide greater exploration and creativity."
The reported outcome is positive: "ChatBattery, we successfully identify, synthesize, and characterize three novel lithium-ion battery cathode materials, which achieve practical capacity improvements of 28.8%, 25.2%, and 18.5%, respectively, over the widely used cathode material, LiNi0.8Mn0.1Co0.1O2 (NMC811)."

Checklists Are Better Than Reward Models For Aligning Language Models

Checklists Are Better Than Reward Models For Aligning Language Models [99.2]
We propose Reinforcement Learning from Checklist Feedback (RLCF). From instructions, we extract checklists and evaluate how well responses satisfy each item, using both AI judges and specialized verifier programs, then combine these scores to compute rewards for RL.
Reference translation of the paper (metadata) (Thu, 24 Jul 2025 17:58:00 GMT)
In response to the question "how can we grade responses to instructions in a manner that is automatic (requires no human annotation), flexible (considers all aspects of response quality), intuitive (aligned with perceptible differences in responses), and applicable to any instruction or response, to enable more effective use of RL in language model alignment?", the paper proposes checklist generation and reinforcement learning from checklist-based feedback. The effect is confirmed: "From instructions, we extract checklists and evaluate how well responses satisfy each item—using both AI judges and specialized verifier programs—then combine these scores to compute rewards for RL. We compare RLCF with other alignment methods applied to a strong instruction following model (Qwen2.5-7B-Instruct) on five widely-studied benchmarks – RLCF is the only method to improve performance on every benchmark, including a 4-point boost in hard satisfaction rate on FollowBench, a 6-point increase on InFoBench, and a 3-point rise in win rate on Arena-Hard."
Checklists are generated with a large model and then used for "Reinforcement Learning from Checklist Feedback" (RLCF). Much of the benefit probably comes from what is effectively distillation from the larger model, but it is still interesting that the method improves performance. (As noted in the Limitations, the computational cost is high.)
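The reward computation described above (per-item checklist scores from an AI judge and from verifier programs, combined into one scalar) can be sketched as follows. This is a minimal illustration, assuming a weighted average over checklist items and a simple mean of judge and verifier scores when both exist. The `ChecklistItem` type, the weights, the stand-in `keyword_judge`, and the combination rule are all hypothetical, not the paper's exact recipe.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ChecklistItem:
    requirement: str
    weight: float = 1.0
    # Optional programmatic check returning True/False for a response.
    verifier: Optional[Callable[[str], bool]] = None


def checklist_reward(response, items, judge):
    """Combine per-item scores into a single reward in [0, 1].

    judge(response, requirement) -> float in [0, 1]; stands in for an AI judge.
    When an item has a verifier, its 0/1 result is averaged with the judge
    score (an assumed combination rule).
    """
    total_weight = sum(item.weight for item in items)
    if total_weight == 0:
        return 0.0
    weighted = 0.0
    for item in items:
        score = judge(response, item.requirement)
        if item.verifier is not None:
            score = 0.5 * (score + (1.0 if item.verifier(response) else 0.0))
        weighted += item.weight * score
    return weighted / total_weight


def keyword_judge(response: str, requirement: str) -> float:
    # Toy stand-in for an LLM judge: 1.0 if the requirement's last word
    # appears in the response, else 0.0.
    keyword = requirement.split()[-1].lower()
    return 1.0 if keyword in response.lower() else 0.0


items = [
    ChecklistItem("mentions Python"),
    ChecklistItem("is at most 10 words", weight=2.0,
                  verifier=lambda r: len(r.split()) <= 10),
]
reward = checklist_reward("Use Python for this task.", items, keyword_judge)
```

In an RL loop, a scalar like `reward` would be computed for each sampled response and passed to the policy-optimization step. Keeping the verifier optional reflects the idea that only some checklist items (length limits, format constraints) can be checked by a program.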

AlphaGo Moment for Model Architecture Discovery

AlphaGo Moment for Model Architecture Discovery [26.3]
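We introduce ASI-Arch, the first demonstration of artificial superintelligence for AI research. ASI-Arch is a fully autonomous system that lifts constraints on research by enabling AI to carry out its own architectural innovation. We conducted 1,773 autonomous experiments over 20,000 GPU hours and discovered 106 innovative, state-of-the-art (SOTA) linear attention architectures.
Reference translation of the paper (metadata) (Thu, 24 Jul 2025 03:57:27 GMT)
An intriguing paper that puts ASI in the name of its system, claiming: "ASI-ARCH conducted 1,773 autonomous experiments over 20,000 GPU hours, culminating in the discovery of 106 innovative, state-of-the-art (SOTA) linear attention architectures."
- I am curious how it differs from Language Modeling by Language Models (covered in an earlier post on this site), and what the results look like at more practical, larger scales of parameters, data, and compute.
- I wonder whether such systems will eventually be able to handle compound efficiency improvements like the recent result listed below.
The repository is GAIR-NLP/ASI-Arch: AlphaGo Moment for Model Architecture Discovery. A related site is the Neural Network Research Data Gallery.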

Scaling Linear Attention with Sparse State Expansion [58.2]
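The Transformer architecture struggles in long-context scenarios because of quadratic computation and linear memory growth. We conceptualize state updating as information classification and propose a row-sparse update formulation for linear attention. We then present Sparse State Expansion (SSE) within the sparse framework, expanding the contextual state into multiple partitions.
Reference translation of the paper (metadata) (Tue, 22 Jul 2025 13:27:31 GMT)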
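The row-sparse update idea can be illustrated with a toy linear-attention state. This is a minimal sketch, assuming the state is a matrix whose rows act as memory slots, that each token writes only to its top-k rows (chosen from per-row scores and weighted by a softmax over the chosen rows), and that reads use the same selection. The function names, the top-k rule, and the toy numbers are illustrative assumptions and not the paper's exact formulation.

```python
import math


def topk_softmax(scores, k):
    """Softmax over the k largest scores; all other positions get weight 0."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    peak = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - peak) for i in top}
    norm = sum(exps.values())
    weights = [0.0] * len(scores)
    for i, e in exps.items():
        weights[i] = e / norm
    return weights


def row_sparse_write(state, row_scores, value, k):
    """Add `value` only to the k selected rows of `state` (in place)."""
    weights = topk_softmax(row_scores, k)
    for r, w in enumerate(weights):
        if w > 0.0:
            for c, v in enumerate(value):
                state[r][c] += w * v
    return weights


def sparse_read(state, row_scores, k):
    """Read a weighted mix of the k selected rows."""
    weights = topk_softmax(row_scores, k)
    width = len(state[0])
    return [sum(weights[r] * state[r][c] for r in range(len(state)))
            for c in range(width)]


# Toy example: 4 state rows, value dimension 2, each token touches 2 rows.
state = [[0.0, 0.0] for _ in range(4)]
row_sparse_write(state, [0.1, 2.0, -1.0, 2.0], [1.0, -1.0], k=2)
output = sparse_read(state, [0.0, 3.0, 0.0, 3.0], k=2)
```

Because each write touches only k rows, the per-token cost stays fixed as the number of rows grows, which is one plausible reading of why a sparse formulation makes a larger state affordable.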
