arXiv – Page 98 – Introducing the Latest arXiv Papers

DeepSeek v3, QVQ-72B-Preview, YuLan-Mini

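Open models keep getting more capable. DeepSeek v3 is a very large model at 671B parameters (although it is an MoE with only 37B active parameters) and claims to compete with GPT-4o and Claude 3.5 Sonnet. GitHub – deepseek-ai/DeepSeek-V3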
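To make the "671B total, 37B active" point concrete, here is a minimal sketch of generic top-k expert routing. It is an illustration only: the softmax router, the toy experts, and every function name are assumptions made for this example, not DeepSeek-V3's actual gating.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_forward(x, experts, router_logits, k):
    """Run only the selected experts and mix their outputs by router weight."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))

# Hypothetical toy experts: each one just scales its input.
toy_experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
toy_output = moe_forward(1.0, toy_experts, [0.0, 0.0, 10.0, 10.0], k=2)

# Share of parameters active per token, using the figures quoted in the post.
active_fraction = 37 / 671
print(f"toy output: {toy_output:.2f}, active share: {active_fraction:.1%}")
```

Only the selected experts run for a given token, which is why the per-token compute tracks the active parameter count rather than the total.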

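QVQ-72B-Preview strengthens the reasoning ability of Qwen2 VL (covered in the earlier post "Qwen 2.5, Qwen 2 VL, GRIN-MoE, Pixtral" on this blog) and claims performance competitive not only with GPT-4o but, on some tasks, with OpenAI o1. QVQ: To See the World with Wisdom | Qwen

YuLan-Mini is comparatively small, with 2.42B parameters trained on 1.08T tokens, yet it claims to outperform competing open models. YuLan-Mini/README_ja.md at main · RUC-GSAI/YuLan-Mini · GitHub

My impression is that Chinese research institutions are releasing a good share of their models and methods. That is much appreciated.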

YuLan-Mini: An Open Data-efficient Language Model [111.0]
YuLan-Mini, a highly capable base model with 2.42B parameters, achieves top-tier performance among models of similar parameter scale. Notably, trained on 1.08T tokens, YuLan-Mini achieves performance comparable to industry-leading models that require far more data.
Paper reference translation (metadata) (Mon, 23 Dec 2024 17:47:53 GMT)
According to the paper: "Our approach includes three major contributions to enhance training efficacy: (1) an elaborately designed data pipeline that combines data cleaning with data schedule strategies; (2) a systematic optimization method that can effectively mitigate training instability; (3) an effective annealing approach that integrate targeted data selection and long context training."

DeepSeek-V3 Technical Report [147.2]
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages. Comprehensive evaluations show that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models.
Paper reference translation (metadata) (Fri, 27 Dec 2024 04:03:16 GMT)
"During the pre-training stage, training DeepSeek-V3 on each trillion tokens requires only 180K H800 GPU hours, i.e., 3.7 days on our cluster with 2048 H800 GPUs. Consequently, our pretraining stage is completed in less than two months and costs 2664K GPU hours. Combined with 119K GPU hours for the context length extension and 5K GPU hours for post-training, DeepSeek-V3 costs only 2.788M GPU hours for its full training. Assuming the rental price of the H800 GPU is $2 per GPU hour, our total training costs amount to only $5.576M." That is very good cost performance. That said, the report also cautions: "Note that the aforementioned costs include only the official training of DeepSeek-V3, excluding the costs associated with prior research and ablation experiments on architectures, algorithms, or data."
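The cost figures in the quote can be re-derived with simple arithmetic. The sketch below uses only numbers quoted in this entry (180K GPU hours per trillion tokens, 14.8T pre-training tokens, 2048 GPUs, 119K and 5K GPU hours for the later stages, $2 per GPU hour); the variable names are illustrative.

```python
# Figures quoted from the DeepSeek-V3 Technical Report excerpts in this entry.
GPU_HOURS_PER_TRILLION_TOKENS = 180_000
PRETRAIN_TOKENS_TRILLIONS = 14.8
CLUSTER_GPUS = 2048
CONTEXT_EXTENSION_GPU_HOURS = 119_000
POST_TRAINING_GPU_HOURS = 5_000
RENTAL_USD_PER_GPU_HOUR = 2

# Wall-clock days to process one trillion tokens on the whole cluster.
days_per_trillion = GPU_HOURS_PER_TRILLION_TOKENS / CLUSTER_GPUS / 24

# Pre-training GPU hours; round() removes floating-point noise from 14.8.
pretrain_gpu_hours = round(GPU_HOURS_PER_TRILLION_TOKENS * PRETRAIN_TOKENS_TRILLIONS)
pretrain_days = pretrain_gpu_hours / CLUSTER_GPUS / 24

total_gpu_hours = (
    pretrain_gpu_hours + CONTEXT_EXTENSION_GPU_HOURS + POST_TRAINING_GPU_HOURS
)
total_cost_usd = total_gpu_hours * RENTAL_USD_PER_GPU_HOUR

print(f"days per trillion tokens: {days_per_trillion:.2f}")
print(f"pre-training: {pretrain_gpu_hours:,} GPU hours over {pretrain_days:.1f} days")
print(f"total: {total_gpu_hours:,} GPU hours, ${total_cost_usd:,}")
```

The totals match the quoted 2664K, 2.788M, and $5.576M figures, and the pre-training wall-clock time comes out under two months, as the report says.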

Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code [123.7]
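This paper proposes Aurora-M, a 15B-parameter multilingual open-source model trained on English, Finnish, Hindi, Japanese, Vietnamese, and code. It is the first open-source multilingual model fine-tuned on human-reviewed safety instructions. We evaluate Aurora-M across a wide range of tasks and languages and show its robustness against catastrophic forgetting.
Paper reference translation (metadata) (Fri, 27 Dec 2024 03:53:21 GMT)
aurora-m/aurora-m-biden-harris-redteamed · Hugging Face: models like this also exist. Japanese is explicitly listed among the supported languages.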

GUI Agents: A Survey

GUI Agents: A Survey [129.9]
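Graphical User Interface (GUI) agents have emerged as a transformative approach to automating human-computer interaction. Given the growing interest in and fundamental importance of GUI agents, we provide a comprehensive survey that categorizes their benchmarks, evaluation metrics, architectures, and training methods.
Paper reference translation (metadata) (Wed, 18 Dec 2024 04:48:28 GMT)
A survey of agents that operate GUIs.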

B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners

B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners [19.0]
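Self-improvement has emerged as a primary method for enhancing performance. This paper identifies two pivotal factors in this iterative process and proposes methods to monitor them. B-STaR is a self-taught reasoning framework that adjusts configurations across iterations to balance exploration and exploitation.
Paper reference translation (metadata) (Mon, 23 Dec 2024 03:58:34 GMT)
The paper proposes a method that monitors and balances the two factors it describes as follows: "In this work, we identify and propose methods to monitor two pivotal factors in this iterative process: (1) the model’s ability to generate sufficiently diverse responses (exploration); and (2) the effectiveness of external rewards in distinguishing high-quality candidates from lower-quality ones (exploitation)."
The repository is GitHub – hkust-nlp/B-STaR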
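As a rough illustration of what monitoring the two factors could look like, the sketch below computes two simple proxies on toy data. These proxies, the function names, and the data are assumptions made for this example; they are not the metrics defined in the B-STaR paper.

```python
def exploration_rate(samples_correct):
    """Fraction of questions with at least one correct sampled response."""
    return sum(any(row) for row in samples_correct) / len(samples_correct)

def exploitation_rate(samples_correct, rewards, top_n=1):
    """Fraction of reward-selected responses (top_n per question) that are correct."""
    picked = []
    for correct, scores in zip(samples_correct, rewards):
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        picked.extend(correct[i] for i in order[:top_n])
    return sum(picked) / len(picked)

# Toy data: 3 questions, 3 sampled responses each (hypothetical values).
toy_correct = [
    [True, False, False],
    [False, False, False],
    [False, True, True],
]
toy_rewards = [
    [0.9, 0.2, 0.1],  # reward model ranks the correct response first
    [0.5, 0.4, 0.3],  # no correct response to find
    [0.8, 0.7, 0.1],  # reward model ranks an incorrect response first
]

print(exploration_rate(toy_correct))
print(exploitation_rate(toy_correct, toy_rewards))
```

Tracking both numbers across iterations shows whether a stall comes from the sampler no longer finding correct answers or from the reward signal failing to pick them out.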

Improving Factuality with Explicit Working Memory

Improving Factuality with Explicit Working Memory [63.5]
Large language models can generate factually inaccurate content, known as hallucination. EWE (Explicit Working Memory) is a new approach that improves factuality in long-form text generation by integrating a working memory that receives real-time feedback from external resources.
Paper reference translation (metadata) (Tue, 24 Dec 2024 00:55:59 GMT)
A proposed method that helps generation stay factual. It works as follows: "Ewe pauses at given intervals and refreshes its working memory based on feedback from retrieval and fact-checking models, ensuring that the generated content remains accurate and relevant. By integrating this working memory into each attention layer of the Transformer architectures, Ewe can be easily adapted to various large language models." I suspect this kind of RAG-like behavior, where (part of) the processing is built into the model itself, may well catch on.
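The pause-and-refresh behavior described in the quote can be sketched at the control-flow level. This is a heavily simplified illustration under stated assumptions: the paper integrates the working memory into the attention layers, which this sketch does not attempt, and every function and value below is a hypothetical stand-in.

```python
from typing import Callable, List

def generate_with_refresh(
    prompt: str,
    num_chunks: int,
    generate_chunk: Callable[[str, List[str]], str],
    retrieve: Callable[[str], List[str]],
    fact_check: Callable[[str, List[str]], bool],
    max_retries: int = 1,
) -> str:
    """Generate chunk by chunk, pausing after each chunk to verify it."""
    memory: List[str] = []  # working memory, starts empty in this sketch
    text = ""
    for _ in range(num_chunks):
        chunk = generate_chunk(prompt + text, memory)
        evidence = retrieve(chunk)  # pause: gather feedback on the new chunk
        retries = 0
        while not fact_check(chunk, evidence) and retries < max_retries:
            memory = evidence  # refresh working memory with the feedback
            chunk = generate_chunk(prompt + text, memory)
            retries += 1
        text += chunk
    return text

# Hypothetical toy components so the loop can run end to end.
TOY_FACTS = ["Paris is the capital of France."]

def toy_retrieve(query: str) -> List[str]:
    return list(TOY_FACTS)

def toy_generate(context: str, memory: List[str]) -> str:
    # Without memory the toy "model" hallucinates; with memory it copies it.
    return (memory[0] if memory else "Lyon is the capital of France.") + " "

def toy_fact_check(chunk: str, evidence: List[str]) -> bool:
    return chunk.strip() in evidence

result = generate_with_refresh(
    "Q: capital of France? ", 1, toy_generate, toy_retrieve, toy_fact_check
)
print(result)
```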

Combating Multimodal LLM Hallucination via Bottom-up Holistic Reasoning

Combating Multimodal LLM Hallucination via Bottom-up Holistic Reasoning [151.4]
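Multimodal large language models (MLLMs) have shown unprecedented capabilities in advancing vision-language tasks. This paper proposes a bottom-up reasoning framework to address hallucination in MLLMs. The framework systematically addresses potential issues in both visual and textual inputs by verifying and integrating perception-level information and cognition-level commonsense knowledge.
Paper reference translation (metadata) (Sun, 15 Dec 2024 09:10:46 GMT)
A hallucination countermeasure targeting MLLMs and VQA tasks, built from six modules: 1. Target Identification and Visual Perception, 2. Visual Perception Verification, 3. Question Validation and Adjustment, 4. Commonsense Induction, 5. Commonsense Verification, 6. Question Answering.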

Findings of the Second BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora

Findings of the Second BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora [79.0]
The BabyLM Challenge is a community effort to close the data-efficiency gap between human and computational language learners. Participants compete to optimize language model training on a fixed language data budget of 100 million words or less.
Paper reference translation (metadata) (Fri, 06 Dec 2024 16:06:08 GMT)
Results of a competition with a restricted data budget: "Participants could submit to a 10M-word text-only track, a 100M-word text-only track, and/or a 100M-word and image multimodal track."
The paper reports: "With 31 submissions from 17 countries, the challenge revealed several key insights: innovations in model architecture, training objectives, and dataset construction proved particularly effective, with GPT-BERT, a hybrid causal-masked language model architecture, emerging as the strongest approach for the Strict and StrictSmall tracks."
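The word budgets above are easy to check mechanically. Below is a minimal sketch assuming whitespace-delimited word counting; the track keys and the counting rule are assumptions for illustration, not the challenge's official tooling.

```python
# Word budgets taken from the quoted track descriptions; keys are illustrative.
TRACK_BUDGETS = {
    "text_10m": 10_000_000,
    "text_100m": 100_000_000,
}

def count_words(lines):
    """Count whitespace-delimited words across an iterable of text lines."""
    return sum(len(line.split()) for line in lines)

def fits_budget(lines, track):
    """Return True if the corpus stays within the word budget of the track."""
    return count_words(lines) <= TRACK_BUDGETS[track]

toy_corpus = ["the cat sat on the mat", "a dog barked"]
print(count_words(toy_corpus), fits_budget(toy_corpus, "text_10m"))
```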

VISA: Retrieval Augmented Generation with Visual Source Attribution

VISA: Retrieval Augmented Generation with Visual Source Attribution [100.8]
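Existing approaches to RAG mainly link generated content to document-level references. This paper proposes Retrieval-Augmented Generation with Visual Source Attribution (VISA), a new approach that combines answer generation with visual source attribution. To evaluate its effectiveness, we created two datasets: Wiki-VISA, built from crawled screenshots of Wikipedia web pages, and Paper-VISA, derived from PubLayNet.
Paper reference translation (metadata) (Thu, 19 Dec 2024 02:17:35 GMT)
Proposes Retrieval-Augmented Generation with Visual Source Attribution (VISA), which presents a bounding box inside the document as the evidence for an answer.
A realistic and important task. The code, datasets, and related resources are reportedly slated for release.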
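A common way to score a predicted evidence box against a reference box is intersection over union (IoU). The sketch below shows that standard computation; whether VISA's evaluation uses exactly this measure is not stated in this post, and the `(x1, y1, x2, y2)` box format is an assumption.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap width and height, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

predicted = (0, 0, 2, 2)
reference = (1, 1, 3, 3)
print(iou(predicted, reference))
```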

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods [21.6]
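"LLMs-as-judges" are evaluators based on natural language responses. This paper comprehensively surveys the 'LLMs-as-judges' paradigm from five key perspectives. We aim to provide insights into the development and application of 'LLMs-as-judges' in both research and practice.
Paper reference translation (metadata) (Sat, 07 Dec 2024 08:07:24 GMT)
A survey of LLMs-as-Judges, a topic that has seen many papers recently. My impression is that approaches bundling several judges together are also becoming more common.
The repository, GitHub – CSHaitao/Awesome-LLMs-as-Judges: The official repo for paper, LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods., is also a useful reference.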
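Bundling several judges can be as simple as a majority vote over their verdicts. The sketch below is a generic illustration, not a method taken from the survey; the verdict labels and the tie handling are assumptions.

```python
from collections import Counter

def majority_verdict(verdicts):
    """Return the most common verdict from a non-empty list, or "tie"."""
    ranked = Counter(verdicts).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie"
    return ranked[0][0]

# Hypothetical verdicts from three judge models comparing answers A and B.
print(majority_verdict(["A", "B", "A"]))
```

An odd number of judges avoids ties in pairwise comparisons; with an even number, the caller has to decide what a tie means.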

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks [52.5]
We build a self-contained environment with data that mimics a small software company. With the most competitive agent, 24% of the tasks can be completed autonomously. This paints a nuanced picture of task automation with LM agents.
Paper reference translation (metadata) (Wed, 18 Dec 2024 18:55:40 GMT)
A benchmark described as follows: "TheAgentCompany measures the progress of these LLM agents’ performance on performing real-world professional tasks, by providing an extensible benchmark for evaluating AI agents that interact with the world in similar ways to those of a digital worker: by browsing the Web, writing code, running programs, and communicating with other coworkers." At present Claude 3.5 Sonnet comes out as the strongest performer, but I am curious how o1 and o3 would do.
The project site is TheAgentCompany, and the leaderboard is TheAgentCompany

AnySat: An Earth Observation Model for Any Resolutions, Scales, and Modalities

AnySat: An Earth Observation Model for Any Resolutions, Scales, and Modalities [5.8]
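This paper proposes AnySat, a multimodal model based on JEPA and resolution-adaptive spatial encoders. To demonstrate the advantages of this unified approach, we compile GeoPlex, a collection of 5 multimodal datasets. We then train a single powerful model on these diverse datasets simultaneously.
Paper reference translation (metadata) (Wed, 18 Dec 2024 18:11:53 GMT)
Proposes a foundation model that can handle diverse Earth observation data in a unified way. The paper states "We have presented AnySat, a versatile architecture designed to address the diversity of EO data in terms of resolutions, scales, and modalities.", and the effectiveness is validated as well.
The repository is GitHub – gastruc/AnySat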
