Introducing the Latest arXiv Papers

BayLing 2: A Multilingual Large Language Model with Efficient Language Alignment

BayLing 2: A Multilingual Large Language Model with Efficient Language Alignment [42.2]
This paper introduces BayLing 2, which efficiently transfers generative capabilities and knowledge from high-resource languages to low-resource languages. In multilingual translation across more than 100 languages, BayLing outperforms open-source models of similar scale. The BayLing demo, homepage, code, and models are available.
Reference translation of the paper (metadata) (Mon, 25 Nov 2024 11:35:08 GMT)
Builds a multilingual model based on fine-tuning. The authors state that "By fine-tuning on high-resource language instructions and cross-lingual instructions, LLM can transfer knowledge and generative capabilities from high-resource languages to low-resource languages, thereby facilitating multilingual interaction." and "Cross-lingual instructions, such as interactive translation and multilingual translation, can efficiently enhance the language alignment within LLM, thereby improving translation performance.", but the results are quite hard to interpret.
The repository is GitHub – ictnlp/BayLing: “百聆”是一个基于LLaMA的语言对齐增强的英语/中文大语言模型，具有优越的英语/中文能力，在多语言和通用任务等多项测试中取得ChatGPT 90%的性能。BayLing is an English/Chinese LLM equipped with advanced language alignment, showing superior capability in English/Chinese generation, instruction following and multi-turn interaction. The project site is http://nlp.ict.ac.cn/bayling, but it appears to be down at the time of writing (?).

All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages

All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages [73.9]
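ALM-bench is the largest and most comprehensive effort to date for evaluating LMMs across 100 languages. It challenges existing models by testing their ability to understand and reason about culturally diverse images paired with text in various languages. The benchmark offers a robust and nuanced evaluation framework with a variety of question formats, including true/false, multiple-choice, and open-ended questions.
Reference translation of the paper (metadata) (Mon, 25 Nov 2024 15:44:42 GMT)
An LMM evaluation benchmark covering a very large number of languages. The task is VQA.
The repository is All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages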

Multilingual Large Language Models: A Systematic Survey

Multilingual Large Language Models: A Systematic Survey [39.0]
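This paper presents a comprehensive survey of the latest research on multilingual large language models (MLLMs). It first discusses the architectures and pre-training objectives of MLLMs, highlighting the key components and methodologies that contribute to multilingual capabilities. It also presents a detailed taxonomy and roadmap covering MLLMs' cross-lingual knowledge, reasoning, alignment with human values, safety, interpretability, and specialized applications.
Reference translation of the paper (metadata) (Sun, 17 Nov 2024 13:21:26 GMT)
A survey of multilingual LLMs. These days the M in MLLM usually stands for multimodal, so the abbreviation is slightly confusing.
The repository is GitHub – tjunlp-lab/Awesome-Multilingual-LLMs-Papers: Awesome-Multilingual-LLMs-Papers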

DRS: Deep Question Reformulation With Structured Output

DRS: Deep Question Reformulation With Structured Output [114.1]
Large language models can identify that a question is unanswerable, but they lack the ability to help reformulate it. The authors propose DRS: Deep Question Reformulation with Structured Output. The proposed method improves the reformulation accuracy of GPT-3.5 from 23.03% to 70.42%, and raises the score of open-source large language models such as Gemma2-9B from 26.35% to 56.75%.
Reference translation of the paper (metadata) (Wed, 27 Nov 2024 02:20:44 GMT)
A proposal of a method for reformulating questions. The paper also notes that "More importantly, according to Faustini et al (2023), in a large-scale industrial experiment, rephrasing unanswerable questions posed to virtual assistants significantly enhances the user experience for millions, which highlights the importance of effectively leveraging LLMs to assist people in question reformulation.", and there are certainly application scenarios where this is needed. The paper splits the problem into entity extraction, DFS combination search with question generation, and final candidate selection, and proposes a specialized method along those lines.
The repository is GitHub – Lizhecheng02/DRS: Repository for our paper “DRS: Deep Question Reformulation With Structured Output”.
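To make the three-stage split (entity extraction, DFS combination search with question generation, final candidate selection) more concrete, here is a minimal sketch of the search and selection stages only. It is not the authors' implementation: the names `dfs_combinations` and `reformulate` are hypothetical, and the `generate` and `score` callables stand in for the LLM calls, which this sketch does not model.

```python
from typing import Callable, Iterator, List, Optional, Sequence, Tuple


def dfs_combinations(entities: Sequence[str], max_size: int) -> Iterator[Tuple[str, ...]]:
    """Yield every non-empty entity combination up to max_size, in depth-first order."""
    def visit(start: int, chosen: List[str]) -> Iterator[Tuple[str, ...]]:
        if chosen:
            yield tuple(chosen)
        if len(chosen) == max_size:
            return  # do not grow this branch any further
        for i in range(start, len(entities)):
            chosen.append(entities[i])
            yield from visit(i + 1, chosen)
            chosen.pop()  # backtrack before trying the next entity

    yield from visit(0, [])


def reformulate(
    entities: Sequence[str],
    document: str,
    generate: Callable[[Tuple[str, ...]], str],  # assumed stand-in for LLM question generation
    score: Callable[[str, str], float],          # assumed stand-in for candidate scoring
    max_size: int = 2,
) -> Optional[str]:
    """Generate one candidate question per entity combination and keep the best-scoring one."""
    best_question: Optional[str] = None
    best_score = float("-inf")
    for combo in dfs_combinations(entities, max_size):
        question = generate(combo)
        candidate_score = score(question, document)
        if candidate_score > best_score:  # ties keep the earliest candidate
            best_question, best_score = question, candidate_score
    return best_question
```

In a real system the entities would come from an extraction step over the original question, and `score` would check whether the candidate is answerable from the document. Capping the combination size keeps the search space small.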

Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision

Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision [120.4]
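This paper proposes a two-player paradigm that separates the roles of the reasoning model and the critique model. It first proposes AutoMathCritique, an automated and scalable framework for collecting critique data. The critique models are shown to consistently improve the actor's performance on difficult queries at test time.
Reference translation of the paper (metadata) (Mon, 25 Nov 2024 17:11:54 GMT)
The authors build data with AutoMathCritique, a framework consisting of the three stages "flawed reasoning path construction, critique generation, and data filtering", and use it for fine-tuning. They also apply the approach described as "Motivated by the insights of test-time, we introduce the critique model into the actor model’s exploration and learning process, introducing a critique-in-the-loop self-improvement method" and confirm its effect. The results appear to show that the critique model is effective (although building one may not be easy).
The repository is AutoMathCritique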
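As a rough illustration of how a critique model can assist an actor at test time, the loop below alternates between answering and critiquing. This is a generic sketch, not the paper's code: `solve_with_critique` is a hypothetical name, and `actor` and `critic` are assumed callables standing in for the two models.

```python
from typing import Callable, Optional


def solve_with_critique(
    question: str,
    actor: Callable[[str, Optional[str]], str],    # (question, feedback) -> answer
    critic: Callable[[str, str], Optional[str]],   # (question, answer) -> feedback, or None if accepted
    max_rounds: int = 3,
) -> str:
    """Let the actor answer, then refine the answer while the critic still objects."""
    answer = actor(question, None)  # first attempt, no feedback yet
    for _ in range(max_rounds):
        feedback = critic(question, answer)
        if feedback is None:
            break  # critic found nothing to fix
        answer = actor(question, feedback)  # retry using the critique
    return answer
```

The round limit bounds the extra inference cost per query. How the feedback is phrased and how the actor consumes it are left to the two callables.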

Training and Evaluating Language Models with Template-based Data Generation

Training and Evaluating Language Models with Template-based Data Generation [6.0]
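We create a dataset of more than 7 million synthesized grade school math problems. The dataset serves as a valuable resource for pre-training, fine-tuning, and evaluating LLMs on mathematical reasoning.
Reference translation of the paper (metadata) (Wed, 27 Nov 2024 07:32:56 GMT)
Synthetic data construction in which the LLM is entrusted with everything from meta-template creation onward. Interesting, but could this also work in other domains?
The repository is GitHub – iiis-ai/TemplateMath: Official implementation of “Training and Evaluating Language Models with Template-based Data Generation” (https://templatemath.github.io)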
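For readers unfamiliar with template-based generation, the following minimal sketch shows only the instantiation step: a fixed template is filled with sampled values and the answer is computed programmatically. The template, names, and value ranges are invented for illustration and are not taken from the paper, where the meta-templates themselves are produced by an LLM.

```python
import random

# Hypothetical template and vocab; not from the TemplateMath dataset.
TEMPLATE = "{name} has {a} {item}s and buys {b} more. How many {item}s does {name} have now?"
NAMES = ["Aiko", "Ben", "Chen"]
ITEMS = ["apple", "pencil", "marble"]


def generate_problem(rng: random.Random) -> dict:
    """Fill the template with sampled values and derive the answer from those values."""
    a = rng.randint(1, 50)
    b = rng.randint(1, 50)
    question = TEMPLATE.format(name=rng.choice(NAMES), item=rng.choice(ITEMS), a=a, b=b)
    # The answer comes from the sampled parameters, so it is correct by construction.
    return {"question": question, "answer": a + b, "a": a, "b": b}


def generate_dataset(n: int, seed: int = 0) -> list:
    """Generate n problems reproducibly from a seed."""
    rng = random.Random(seed)
    return [generate_problem(rng) for _ in range(n)]
```

Because each answer is derived from the sampled parameters rather than predicted by a model, the labels need no separate verification. That property is what makes this style of synthetic data cheap to scale.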

Model Context Protocol (MCP), QwQ, OLMo 2

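There was plenty of news again last week, but the highlight is Anthropic's Model Context Protocol: Introducing the Model Context Protocol \ Anthropic, Introduction – Model Context Protocol

Roughly speaking, it is a protocol for integrating LLMs with external data and tools. When building LLMs that assume external tool use or extended use of memory, whether a standard of this kind exists matters. I am very curious to see whether MCP can become the de facto standard.

On the open-model side, two releases deserve attention: the very high-performing Qwen with Questions (QwQ), and OLMo 2, which is version 2 of the previously covered Dolma and OLMo – Introducing the Latest arXiv Papers. As with O1 Replication Journey and TÜLU 3, efforts that openly share which methods and approaches improve performance are highly valuable.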

QwQ: Reflect Deeply on the Boundaries of the Unknown | Qwen
- An open model described as "QwQ-32B-Preview is an experimental research model developed by the Qwen Team, focused on advancing AI reasoning capabilities." It performs well even compared with OpenAI o1. Many efforts inspired by o1 are under way, and the competition is truly intense.
- The repository is Qwen/QwQ-32B-Preview · Hugging Face
- The demo is QwQ-32B-Preview – a Hugging Face Space by Qwen
OLMo 2: The best fully open language model to date | Ai2
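- A model whose construction method, data, and model itself are all public, with performance close to the state of the art.
- The repository is OLMo 2 – a allenai Collection
- The demo is Ai2 Playground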

O1 Replication Journey — Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson? [30.9]
This paper offers a critical examination of current approaches to replicating the capabilities of OpenAI's O1 model. It shows that simple distillation from O1's API, combined with supervised fine-tuning, achieves superior performance on complex mathematical reasoning tasks.
Reference translation of the paper (metadata) (Mon, 25 Nov 2024 15:31:27 GMT)
Research on OpenAI o1, and Part 2 following Fugu-MT paper translation (abstract): O1 Replication Journey: A Strategic Progress Report — Part 1. The statement "While our previous work (Part 1 (Qin et al , 2024)) explored the fundamental technical path to O1 replication, this study reveals how simple distillation from O1’s API, combined with supervised fine-tuning, can achieve superior performance on complex mathematical reasoning tasks." is fair enough, but "Notably, despite training only on mathematical problem-solving data, our models demonstrated strong generalization to open-ended QA tasks and became significantly less susceptible to sycophancy after fine-tuning." is surprising.
The repository is GitHub – GAIR-NLP/O1-Journey: O1 Replication Journey: A Strategic Progress Report – Part I

TÜLU 3: Pushing Frontiers in Open Language Model Post-Training [94.1]
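We introduce TÜLU 3, a fully open, state-of-the-art post-trained model. TÜLU 3 builds on the Llama 3.1 base models and outperforms Llama 3.1, Qwen 2.5, Mistral, and even closed models such as GPT-4o-mini and Claude 3.5-Haiku.
Reference translation of the paper (metadata) (Fri, 22 Nov 2024 18:44:04 GMT)
The repository is GitHub – allenai/open-instruct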

Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS

Beyond Examples: High-level Automated Reasoning Paradigm in In-Context Learning via MCTS [25.6]
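HiAR-ICL shifts the focus from specific examples to abstract thinking patterns. The authors develop a cognitive complexity framework that dynamically matches problems with appropriate thought cards.
Reference translation of the paper (metadata) (Wed, 27 Nov 2024 16:19:00 GMT)
A proposal of an ICL framework consisting of "(1) define atom reasoning actions, (2) construct thought cards via MCTS, (3) select reasoning patterns, and (4) solve and verify". Step (1) defines five types: "System Analysis (SA)", "One-Step Thought (OST)", "Chain-of-Thought (CoT)", "Divide and Conquer (DC)", and "Self-Reflection and Refinement (SRR)".
It is named "HiAR-ICL, a High-level Automated Reasoning paradigm in ICL", but it looks more like agentic behavior than ICL. It does seem likely to improve performance, of course.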
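One way to picture a "thought card" is as an ordered sequence of the five atomic reasoning actions, chosen to match a problem's complexity. The sketch below is purely illustrative: the card contents, the numeric complexity levels, and the helpers `select_card` and `build_prompt` are assumptions, and the cards are written by hand here, whereas the paper constructs them via MCTS.

```python
from enum import Enum
from typing import Dict, Tuple


class Action(Enum):
    """The five atomic reasoning actions named in the paper."""
    SA = "System Analysis"
    OST = "One-Step Thought"
    COT = "Chain-of-Thought"
    DC = "Divide and Conquer"
    SRR = "Self-Reflection and Refinement"


# Hypothetical thought cards keyed by an assumed complexity level (1 = easiest).
THOUGHT_CARDS: Dict[int, Tuple[Action, ...]] = {
    1: (Action.COT,),
    2: (Action.SA, Action.COT),
    3: (Action.SA, Action.DC, Action.OST, Action.SRR),
}


def select_card(complexity: float) -> Tuple[Action, ...]:
    """Pick the card whose assumed level is closest to the estimated complexity."""
    level = min(THOUGHT_CARDS, key=lambda k: abs(k - complexity))
    return THOUGHT_CARDS[level]


def build_prompt(question: str, card: Tuple[Action, ...]) -> str:
    """Turn the selected card into step-by-step instructions for the model."""
    steps = "\n".join(f"{i}. {action.value}" for i, action in enumerate(card, 1))
    return f"Question: {question}\nFollow these reasoning steps:\n{steps}"
```

The prompt carries a reasoning pattern rather than worked examples, which is the shift from specific examples to abstract thinking patterns that the abstract describes.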

VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models

VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models [111.6]
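VBench is a benchmark suite that decomposes "video generation quality" into specific, hierarchical, and disentangled dimensions. We provide a dataset of human preference annotations to validate the benchmark's alignment with human perception. VBench++ supports evaluating both text-to-video and image-to-video generation.
Reference translation of the paper (metadata) (Wed, 20 Nov 2024 17:54:41 GMT)
A benchmark for video generation.
The repository is GitHub – Vchitect/VBench: [CVPR2024 Highlight] VBench – We Evaluate Video Generation, and a leaderboard is also public: VBench Leaderboard – a Hugging Face Space by Vchitect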

LLM Augmentations to support Analytical Reasoning over Multiple Documents

LLM Augmentations to support Analytical Reasoning over Multiple Documents [9.0]
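This study explores the application of large language models (LLMs) to enhance deep analytical reasoning in the context of intelligence analysis. It develops an architecture that augments the LLM's capabilities with a memory module called dynamic evidence trees (DET) to develop and track multiple investigation threads.
Reference translation of the paper (metadata) (Mon, 25 Nov 2024 06:00:42 GMT)
Use of LLMs in intelligence analysis. The workflow of how they are used is interesting.
The repository is GitHub – DiscoveryAnalyticsCenter/speculatores: [IEEE Big Data 2024] LLM Augmentations to support Analytical Reasoning over Multiple Documents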
