Multilingual – Introducing the Latest arXiv Papers

The Impact of Language Mixing on Bilingual LLM Reasoning

The Impact of Language Mixing on Bilingual LLM Reasoning [4.5]
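We study language switching in Chinese-English bilingual reasoning models. Enforcing monolingual decoding reduces accuracy on math reasoning tasks by 5.6 percentage points. A lightweight probe can be trained to predict whether a potential language switch would harm reasoning.
Reference translation of the paper (metadata) (Mon, 21 Jul 2025 17:56:09 GMT)
On the issue, often seen in LRMs, of several languages getting mixed within the reasoning process, the paper reports that "Discouraging this behavior in DeepSeek-R1 was found to degrade accuracy, suggesting that language mixing may benefit reasoning." It also states: "Altogether, these results suggest that language mixing is not a random artifact of multilingual training but a deliberate strategy that LLMs adopt to improve complex reasoning."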

NeoBabel: A Multilingual Open Tower for Visual Generation

NeoBabel: A Multilingual Open Tower for Visual Generation [32.8]
We introduce NeoBabel, a new multilingual image generation framework. It supports six languages: English, Chinese, Dutch, French, Hindi, and Persian. It achieves state-of-the-art multilingual performance while retaining strong English capability.
Reference translation of the paper (metadata) (Tue, 08 Jul 2025 16:19:45 GMT)
"This paper introduces NeoBabel, a novel multilingual image generation framework that represents the first scalable solution for direct text-to-image synthesis across six languages. Through meticulous curation of high-quality multilingual vision-language datasets and end-to-end training, NeoBabel establishes direct cross-lingual mappings between textual descriptions and visual outputs across all supported languages." In other words, a proposal for a multilingual image generation model that does not go through translation. Culture-bound words are hard to translate, so models like this are important.
The repository is NeoBabel: A Multilingual Open Tower for Visual Generation

MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation

MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation [89.7]
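MultiFinBen is the first multilingual and multimodal benchmark tailored to the global financial domain. We introduce two new tasks, EnglishOCR and SpanishOCR, the first OCR-embedded financial QA tasks. We propose a dynamic, difficulty-aware selection mechanism and curate a compact, balanced benchmark.
Reference translation of the paper (metadata) (Mon, 16 Jun 2025 22:01:49 GMT)
A multimodal, multilingual benchmark for the financial domain. It appears to include Japanese data as well.
The repository is GitHub – xueqingpeng/MultiFinBen, and the data is published on HuggingFace (e.g., TheFinAI/PolyFiQA-Easy · Datasets at Hugging Face).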

XToM: Exploring the Multilingual Theory of Mind for Large Language Models

XToM: Exploring the Multilingual Theory of Mind for Large Language Models [58.0]
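Existing evaluations of Theory of Mind in LLMs are limited to English. XToM is a rigorously validated multilingual benchmark that evaluates ToM across five languages. The results reveal limits in LLMs' ability to replicate human-like mentalizing across linguistic contexts.
Reference translation of the paper (metadata) (Tue, 03 Jun 2025 05:23:25 GMT)
A multilingual comparison of LLMs, concluding that "LLMs are equipped with multilingual understanding ability but fail in multilingual ToM reasoning tasks." Cross-language differences seem to remain at a deeper level (although they also appear to have narrowed compared with a while ago).
The repository is GitHub – HKUST-KnowComp/XToM: Data and Code for paper “X-ToM: Exploring the Multilingual Theory of Mind for Large Language Models”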

XRAG: Cross-lingual Retrieval-Augmented Generation

XRAG: Cross-lingual Retrieval-Augmented Generation [21.5]
XRAG is designed to evaluate the generation abilities of LLMs. XRAG is constructed from recent news articles, ensuring that external knowledge is required to answer the questions.
Reference translation of the paper (metadata) (Thu, 15 May 2025 08:47:55 GMT)
A RAG benchmark for cross-lingual settings. It is built so that LLMs cannot answer from their internal knowledge.
"(3) We find that in the monolingual retrieval setting, all evaluated LLMs face issues with Response Language Correctness, an issue that has received little attention from the research community. (4) In the multilingual retrieval setting, the primary challenge for LLMs does not lie in non-English generation, but in reasoning over retrieved information across languages." The task turns out to be surprisingly hard, and the results are interesting.
I would like to take a look at the data.

How Reliable is Multilingual LLM-as-a-Judge?

How Reliable is Multilingual LLM-as-a-Judge? [11.6]
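We evaluate five models from different model families on five diverse tasks covering 25 languages. Consistency varies significantly across languages, with particularly poor performance in low-resource languages. We propose an ensemble strategy that improves the consistency of multilingual judgments in real-world applications.
Reference translation of the paper (metadata) (Sun, 18 May 2025 02:32:35 GMT)
An evaluation of LLM-as-a-judge performance in multilingual settings. The results give the impression that even GPT-4o struggles. There are many interesting statements, such as "we find that powerful open-source models, such as Qwen-2.5, achieve comparable performance to OpenAI models in multilingual judgment tasks." and "Aya fails to demonstrate noticeable improvements. This suggests that fine-tuning with multilingual data may not directly enhance a model’s ability to perform accurate multilingual judgments."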
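The summary above mentions an ensemble strategy without describing it, so the following is only a minimal sketch under an assumption: verdicts from several judge models are combined by simple majority vote, and consistency is measured as the share of languages that agree with the most common verdict. The function names, the verdict labels, and the toy data are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def majority_vote(verdicts):
    """Return the most common verdict; ties go to the verdict seen first."""
    return Counter(verdicts).most_common(1)[0][0]


def cross_language_agreement(verdict_by_language):
    """Fraction of languages whose verdict matches the modal verdict."""
    verdicts = list(verdict_by_language.values())
    modal = majority_vote(verdicts)
    return sum(v == modal for v in verdicts) / len(verdicts)


# Hypothetical verdicts from three judges on the same item in two languages.
judge_verdicts = {
    "en": ["A", "A", "B"],
    "sw": ["A", "B", "A"],
}

# Ensemble: one majority-voted verdict per language.
ensembled = {lang: majority_vote(v) for lang, v in judge_verdicts.items()}

# For comparison: the second judge on its own.
second_judge = {lang: v[1] for lang, v in judge_verdicts.items()}

ensemble_agreement = cross_language_agreement(ensembled)
single_agreement = cross_language_agreement(second_judge)
```

In this toy case the second judge gives different verdicts in the two languages, while the voted verdicts agree. Whether a real ensemble helps depends on how the individual judges' errors are distributed.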

When Less Language is More: Language-Reasoning Disentanglement Makes LLMs Better Multilingual Reasoners

When Less Language is More: Language-Reasoning Disentanglement Makes LLMs Better Multilingual Reasoners [111.5]
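We show that language-specific ablation consistently improves multilingual reasoning performance. Compared with post-training approaches, the training-free ablation achieves comparable or superior results while keeping computational overhead minimal.
Reference translation of the paper (metadata) (Wed, 21 May 2025 08:35:05 GMT)
Starting from the hypothesis that "Drawing inspiration from cognitive neuroscience, which suggests that human reasoning functions largely independently of language processing, we hypothesize that LLMs similarly encode reasoning and language as separable components that can be disentangled to enhance multilingual reasoning", the paper reports that "Through targeted interventions in the LLMs’ activation space, we demonstrate that removing language-specific information significantly improves reasoning performance across languages."
Both the hypothesis and the validation results are very interesting. LLMs should be entirely different from the human brain, so why do they end up behaving in a similar way (functional decomposition)?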
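To make "removing language-specific information" in activation space concrete, here is a minimal sketch under stated assumptions. The language-specific component is taken to be a single direction, estimated as the difference between the mean hidden states of two languages, and it is removed by orthogonal projection. How the paper actually identifies and ablates language-specific information is not described in this summary. The helper names and the toy three-dimensional activations are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def remove_direction(hidden, direction):
    """Subtract from `hidden` its component along `direction`."""
    norm_sq = dot(direction, direction)
    if norm_sq == 0:
        return list(hidden)
    scale = dot(hidden, direction) / norm_sq
    return [h - scale * d for h, d in zip(hidden, direction)]


# Toy hidden states (hypothetical): the second coordinate differs by language.
en_states = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
zh_states = [[1.0, 4.0, 2.0], [3.0, 4.0, 2.0]]

# Assumed language direction: difference of the per-language mean activations.
language_direction = [
    z - e for z, e in zip(mean_vector(zh_states), mean_vector(en_states))
]

# Ablate the language direction from one Chinese hidden state.
ablated = remove_direction(zh_states[0], language_direction)
```

After the projection, the ablated vector has no component along the assumed language direction, and the other coordinates are unchanged.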

Multilingual Performance Biases of Large Language Models in Education

Multilingual Performance Biases of Large Language Models in Education [39.1]
Large language models (LLMs) are increasingly being adopted in educational settings. This study examines whether their use in non-English educational settings is warranted.
Reference translation of the paper (metadata) (Thu, 24 Apr 2025 16:32:31 GMT)
The paper makes the sensible point that "However, we note that certain models can do terribly on some tasks and languages, so we recommend first verifying that a particular model works well in a particular language on a specific education-related task before deployment." Even so, the finding that "Particularly, we find that GPT4o and Gemini 2.0 perform consistently well across all languages with a few exceptions." gives the sense that multilingual support has advanced considerably.
The repository is GitHub – eth-lre/multilingual-educational-llm-bias: Data and code for “Multilingual Performance Biases of Large Language Models in Education”

The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks

The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks [37.8]
This paper examines more than 2,000 multilingual (non-English) benchmarks from 148 countries. English is significantly overrepresented in these benchmarks. Most benchmarks rely on original-language content rather than translations.
Reference translation of the paper (metadata) (Tue, 22 Apr 2025 01:47:37 GMT)
A survey report on multilingual benchmarks. The statement that "Importantly, simply translating English benchmarks proves insufficient for robust evaluation, localized benchmarks (like CMMLU for Chinese) show substantially higher correlation with human judgments (0.68) than translated equivalents (0.47 and 0.49), highlighting the critical need for culturally and linguistically authentic evaluation resources." is what one would expect, but seeing it backed by numbers makes it more convincing.
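As an illustration of how a "correlation with human judgments" figure can be computed, here is a minimal sketch using Spearman rank correlation. The quote does not say which coefficient the paper uses, so Spearman is an assumption. All scores below are hypothetical and are not meant to reproduce the 0.68, 0.47, or 0.49 values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank values starting at 1 for the smallest; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical data: human preference ranks for five models, and each
# model's score on a localized and on a translated benchmark.
human_rank = [1, 2, 3, 4, 5]
localized_scores = [52.0, 55.0, 61.0, 70.0, 74.0]
translated_scores = [60.0, 52.0, 66.0, 58.0, 70.0]

rho_localized = spearman(human_rank, localized_scores)
rho_translated = spearman(human_rank, translated_scores)
```

In this toy data the localized benchmark orders the models exactly as the human ranking does, while the translated one only partly does, so its coefficient is lower.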

Seedream 3.0 Technical Report

Seedream 3.0 Technical Report [62.9]
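Seedream 3.0 is a high-performance Chinese-English bilingual image generation foundation model. We develop several technical improvements to address existing challenges in Seedream 2.0. Seedream 3.0 provides native high-resolution output (up to 2K) and generates images with high visual quality.
Reference translation of the paper (metadata) (Wed, 16 Apr 2025 16:23:31 GMT)
A multilingual image generation model from ByteDance. The sample images show that it is a very strong model. It claims SoTA on Text to Image Model Arena | Artificial Analysis (perhaps now overtaken by GPT-4o?).
The project site is Doubao Team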
