2024年12月 – arXiv最新論文の紹介

Knowledge Boundary of Large Language Models: A Survey

Knowledge Boundary of Large Language Models: A Survey [75.7]
大規模言語モデル(LLM)はパラメータに膨大な量の知識を格納するが、特定の知識の記憶と利用に制限がある。これは、LLMの知識境界を理解するための重要な必要性を強調している。本稿では,LLM知識境界の包括的定義を提案し,知識を4つの異なるタイプに分類する形式化された分類法を提案する。
論文参考訳（メタデータ） (Tue, 17 Dec 2024 02:14:02 GMT)
LLMの知識境界に関するサーベイ
面白い視点

DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought

DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought [89.5]
DRT-o1は、長いチェーン・オブ・シークレットの成功をニューラルマシン翻訳(MT)にもたらす試みである。まず、既存の文献から模範文や比喩文を含む文を抽出し、その後、長い思考を通してこれらの文を翻訳する多エージェントフレームワークを開発する。文献翻訳実験の結果, DRT-o1の有効性が示された。
論文参考訳（メタデータ） (Mon, 23 Dec 2024 11:55:33 GMT)
Chain of thoughtの機械翻訳への応用、データを収集・マルチエージェントフレームワークでのデータ合成、fine tuningというアプローチ。14Bで124 GPU hoursは思ったよりも少ない印象だが、性能は大きく向上している。
プロジェクトサイトはGitHub – krystalan/DRT-o1: DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought

Path-of-Thoughts: Extracting and Following Paths for Robust Relational Reasoning with Large Language Models

Path-of-Thoughts: Extracting and Following Paths for Robust Relational Reasoning with Large Language Models [62.1]
本稿では,関係推論に対処するための新しいフレームワークであるPath-of-Thoughts(PoT)を提案する。 PoTは、問題コンテキスト内の重要なエンティティ、関係、属性を識別するタスクに依存しないグラフを効率的に抽出する。 PoTは、提案された質問に対応するグラフ内の関連する推論連鎖を特定し、潜在的な答えの推論を容易にする。
論文参考訳（メタデータ） (Mon, 23 Dec 2024 20:27:12 GMT)
「Path-of-Thoughts (PoT), a novel framework that decomposes a relational reasoning task into three stages: graph extraction, path identification, and reasoning.」、ベンチマークで効果を確認とのこと。
形式言語 – arXiv最新論文の紹介という感じのアプローチと似ているような気がしなくもない。

DeepSeek v3, QVQ-72B-Preview, YuLan-Mini

公開モデルも高性能化が続いている。DeepSeek v3は671Bと非常に大きなモデル（だが、アクティブパラメータは37BのMoE）でGPT-4oやClaude 3.5 Sonnet競合を主張。 GitHub – deepseek-ai/DeepSeek-V3

QVQ-72B-PreviewはQwen 2.5, Qwen 2 VL, GRIN-MoE, Pixtral – arXiv最新論文の紹介のQwen2 VLから推論能力を強化、GPT-4oだけでなくタスクによってはOpenAI o1と競合する性能を主張。QVQ: To See the World with Wisdom | Qwen

YuLan-Miniは2.42B、1.08Tトークンでのトレーニングと比較的小規模だが、競合する公開モデルを上回る性能を主張。YuLan-Mini/README_ja.md at main · RUC-GSAI/YuLan-Mini · GitHub

中国の研究機関はモデルや手法をかなり公開してくれている印象。非常にありがたい。

YuLan-Mini: An Open Data-efficient Language Model [111.0]
2.42Bパラメータを持つ高い能力を持つベースモデルであるYuLan-Miniは、同様のパラメータスケールのモデルで上位層のパフォーマンスを実現する。注目すべきは、1.08TトークンでトレーニングされたYuLan-Miniは、はるかに多くのデータを必要とする業界主導のモデルに匹敵するパフォーマンスを達成することだ。
論文参考訳（メタデータ） (Mon, 23 Dec 2024 17:47:53 GMT)
「Our approach includes three major contributions to enhance training efficacy: (1) an elaborately designed data pipeline that combines data cleaning with data schedule strategies; (2) a systematic optimization method that can effectively mitigate training instability; (3) an effective annealing approach that integrate targeted data selection and long context training.」とのこと。

DeepSeek-V3 Technical Report [147.2]
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token。我々は14.8兆の多様性と高品質のトークンでDeepSeek-V3を事前訓練し、その後にSupervised Fine-Tuning and Reinforcement Learningのステージを受講した。包括的な評価によると、DeepSeek-V3は他のオープンソースモデルよりも優れており、主要なクローズドソースモデルに匹敵するパフォーマンスを実現している。
論文参考訳（メタデータ） (Fri, 27 Dec 2024 04:03:16 GMT)
「During the pre-training stage, training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, i.e., 3.7 days on our cluster with 2048 H800 GPUs. Consequently, our pretraining stage is completed in less than two months and costs 2664K GPU hours. Combined with 119K GPU hours for the context length extension and 5K GPU hours for post-training, DeepSeek-V3 costs only 2.788M GPU hours for its full training. Assuming the rental price of the H800 GPU is $2 per GPU hour, our total training costs amount to only $5.576M.」ととてもコストパフォーマンスが良い。もっとも「Note that the aforementioned costs include only the official training of DeepSeek-V3, excluding the costs associated with prior research and ablation experiments on architectures, algorithms, or data.」

Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code [123.7]
本稿では,英語,フィンランド語,ヒンディー語,日本語,ベトナム語,コードに基づく15Bパラメータの多言語オープンソースモデルであるAurora-Mを提案する。これは、人間がレビューした安全命令を微調整した初めてのオープンソース多言語モデルである。我々はAurora-Mを幅広いタスクや言語で評価し、破滅的な忘れ物に対する頑健さを示した。
論文参考訳（メタデータ） (Fri, 27 Dec 2024 03:53:21 GMT)
aurora-m/aurora-m-biden-harris-redteamed · Hugging Face こういったモデルも存在。対応言語に日本語が明記されている。

GUI Agents: A Survey

GUI Agents: A Survey [129.9]
グラフィカルユーザインタフェース(GUI)エージェントは、人間とコンピュータのインタラクションを自動化するためのトランスフォーメーションアプローチとして登場した。 GUIエージェントの関心の高まりと基本的な重要性により、ベンチマーク、評価指標、アーキテクチャ、トレーニングメソッドを分類する総合的な調査を提供する。
論文参考訳（メタデータ） (Wed, 18 Dec 2024 04:48:28 GMT)
GUIをつかうエージェントに関するサーベイ

B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners

B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners [19.0]
自己改善は、パフォーマンスを向上させる主要な方法として現れています。本稿では,この反復的プロセスにおいて2つの重要な要因をモニタする手法を提案し,提案する。 B-STaRは、反復的な構成を調整し、探索とエクスプロイトのバランスをとる自己学習推論フレームワークである。
論文参考訳（メタデータ） (Mon, 23 Dec 2024 03:58:34 GMT)
「In this work, we identify and propose methods to monitor two pivotal factors in this iterative process: (1) the model’s ability to generate sufficiently diverse responses (exploration); and (2) the effectiveness of external rewards in distinguishing high-quality candidates from lower-quality ones (exploitation).」、についてこれらを監視しバランスをとる手法を提案。
リポジトリはGitHub – hkust-nlp/B-STaR

Improving Factuality with Explicit Working Memory

Improving Factuality with Explicit Working Memory [63.5]
大規模な言語モデルは、幻覚として知られる、事実的に不正確なコンテンツを生成することができる。 EWE(Explicit Working Memory)は、外部リソースからのリアルタイムフィードバックを受信するワーキングメモリを統合することで、長文テキスト生成における事実性を高める新しい手法である。
論文参考訳（メタデータ） (Tue, 24 Dec 2024 00:55:59 GMT)
事実性を守る生成を支援する手法の提案。「Ewe pauses at given intervals and refreshes its working memory based on feedback from retrieval and fact-checking models, ensuring that the generated content remains accurate and relevant. By integrating this working memory into each attention layer of the Transformer architectures, Ewe can be easily adapted to various large language models.」という動作で、このようなモデルに処理（の一部）を組み込むRAG的な動作は流行っていくんだろうなーと思わなくもない。

Combating Multimodal LLM Hallucination via Bottom-up Holistic Reasoning

Combating Multimodal LLM Hallucination via Bottom-up Holistic Reasoning [151.4]
マルチモーダル大規模言語モデル(MLLM)は、視覚言語タスクを前進させる前例のない能力を示した。本稿では,MLLMにおける幻覚に対処するためのボトムアップ推論フレームワークを提案する。本フレームワークは、認識レベル情報と認知レベルコモンセンス知識を検証・統合することにより、視覚とテキストの両方の入力における潜在的な問題に体系的に対処する。
論文参考訳（メタデータ） (Sun, 15 Dec 2024 09:10:46 GMT)
MLLM、VQAタスクを対象としたハルシネーション対策、1. Target Identification and Visual Perception, 2. Visual Perception Verification, 3. Question Validation and Adjustment, 4. Commonsense Induction, 5. Commonsense Verification, 6. Question answeringというモジュールで構成。

Findings of the Second BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora

Findings of the Second BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora [79.0]
BabyLM Challengeは、人間と計算言語学習者のデータ効率ギャップを埋めるためのコミュニティの取り組みである。参加者は1億ワード以下の固定言語データ予算で、言語モデルトレーニングを最適化するために競争する。
論文参考訳（メタデータ） (Fri, 06 Dec 2024 16:06:08 GMT)
「Participants could submit to a 10M-word text-only track, a 100Mword text-only track, and/or a 100M-word and image multimodal track.」というデータを制限したコンペの結果
「With 31 submissions from 17 countries, the challenge revealed several key insights: innovations in model architecture, training objectives, and dataset construction proved particularly effective, with GPT-BERT, a hybrid causalmasked language model architecture, emerging as the strongest approach for the Strict and StrictSmall tracks.」とのこと

VISA: Retrieval Augmented Generation with Visual Source Attribution

VISA: Retrieval Augmented Generation with Visual Source Attribution [100.8]
RAGの既存のアプローチは主に生成されたコンテンツをドキュメントレベルの参照にリンクする。本稿では,視覚的ソース属性と解答生成を組み合わせた新しい手法として,視覚的ソース属性を用いた検索補助生成(VISA)を提案する。本手法の有効性を評価するため,ウィキペディアのWebページスクリーンショットをクロールしたWiki-VISAとPubLayNetから派生したPaper-VISAの2つのデータセットを作成した。
論文参考訳（メタデータ） (Thu, 19 Dec 2024 02:17:35 GMT)
回答の根拠として文書内にバウンディングボックスを提示するRetrieval-Augmented Generation with Visual Source Attribution (VISA)の提案
現実的で重要なタスク。コードやデータセットなど公開予定とのこと。

2024年12月
月	火	水	木	金	土	日
						1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28	29
30	31