arXiv最新論文の紹介

Phi4, InternVL 2.5, EXAONE 3.5

Gemini 2.0やOpenAIの12日間発表で盛り上がっているが、OSSや公開モデルについても様々なモデルが発表されている。

Phi-4 Technical Report [72.1]
本研究では,データ品質に重点を置いた14ビリオンパラメータ言語モデル phi-4 を提案する。多くの言語モデルとは異なり、事前学習は主にWebコンテンツやコードなどの有機データソースに基づいており、phi-4はトレーニングプロセス全体を通して戦略的に合成データを組み込んでいる。
論文参考訳（メタデータ） (Thu, 12 Dec 2024 03:37:41 GMT)
小型、高性能モデルPhiの最新バージョン、「phi-4 strategically incorporates synthetic data throughout the training process.」とのことで合成データをうまく活用するアプローチ。Phi3を超え、GPT-4o miniに迫っている優秀なモデル。
公式Blogでも発表がある　Introducing Phi-4: Microsoft’s Newest Small Language Model Specializing in Complex Reasoning | Microsoft Community Hub

EXAONE 3.5: Series of Large Language Models for Real-world Use Cases [35.0]
EXAONE 3.5言語モデルは32B、7.8B、2.4Bの3つの構成で提供されている。商用利用については、LG AI Researchの公式コンタクトポイントを参照してください。
論文参考訳（メタデータ） (Mon, 09 Dec 2024 09:31:10 GMT)
LGによる公開モデル、同サイズのQwen2.5と競合する性能
リポジトリはLGAI-EXAONE (LG AI Research)

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling [121.1]
InternVL 2.5は、InternVL 2.0上に構築された高度マルチモーダル大規模言語モデル(MLLM)シリーズである。 InternVL 2.5は、GPT-4oやClaude-3.5-Sonnetといった主要な商用モデルと競合する競争力を持つ。このモデルが、マルチモーダルAIシステムの開発と適用のための新しい標準を設定することで、オープンソースコミュニティに貢献できることを願っています。
論文参考訳（メタデータ） (Fri, 06 Dec 2024 18:57:08 GMT)
OSSのMLLM、性能は商用モデルと競合的とのこと。「we integrate a newly incrementally pre-trained InternViT with various pre-trained LLMs, including InternLM 2.5 and Qwen 2.5, using a randomly initialized MLP projector.」というアーキテクチャでViTをProjectorでLLMとつなぐアプローチ
リポジトリはOpenGVLab/InternVL2_5-78B · Hugging Face、GitHub – OpenGVLab/InternVL: [CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. 接近GPT-4o表现的开源多模态对话模型

InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions [104.9]
本研究は,ストリーミング映像とオーディオ入力とのリアルタイムインタラクションを実現するために,非絡み合いのストリーミング知覚,推論,メモリ機構を導入している。このプロジェクトは人間のような認知をシミュレートし、多モーダルな大規模言語モデルが時間とともに継続的かつ適応的なサービスを提供できるようにする。
論文参考訳（メタデータ） (Thu, 12 Dec 2024 18:58:30 GMT)
リアルタイムストリーミングだけでなくメモリ機能なども備えるフレームワーク
リポジトリはGitHub – InternLM/InternLM-XComposer: InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions

Owl-1: Omni World Model for Consistent Long Video Generation [75.5]
Omni World ModeL (Owl-1) を提案する。 Owl-1 は VBench-I2V と VBench-Long の SOTA メソッドと同等の性能を実現している。
論文参考訳（メタデータ） (Thu, 12 Dec 2024 18:59:01 GMT)
動画生成モデル、リポジトリはGitHub – huang-yh/Owl

SiReRAG: Indexing Similar and Related Information for Multihop Reasoning

SiReRAG: Indexing Similar and Related Information for Multihop Reasoning [96.6]
SiReRAGは、類似情報と関連する情報の両方を明示的に考慮する新しいRAGインデックス方式である。 SiReRAGは、3つのマルチホップデータセットの最先端インデックス手法を一貫して上回る。
論文参考訳（メタデータ） (Mon, 09 Dec 2024 04:56:43 GMT)
類似性によるツリーに加えて関連性（we construct the relatedness tree by clustering the propositions based on their entities to get proposition aggregates and having recursive summaries on top.）のツリーを併用するRAG
マルチホップなQAにて高性能とのこと

The BrowserGym Ecosystem for Web Agent Research

The BrowserGym Ecosystem for Web Agent Research [151.9]
BrowserGymエコシステムは、Webエージェントの効率的な評価とベンチマークの必要性の高まりに対処する。大規模なマルチベンチマークWebエージェント実験を初めて実施する。結果は、OpenAIとAnthropicの最新モデルの大きな相違点を浮き彫りにしている。
論文参考訳（メタデータ） (Fri, 06 Dec 2024 23:43:59 GMT)
WEBエージェント開発のためのベンチマーク環境、あわせてベンチマークの統合とAgentLabも公開している。現在のリーダーボード（BrowserGym Leaderboard – a Hugging Face Space by ServiceNow）によると、Claude 3.5 Sonnetの性能の高さが目立っている。
リポジトリはGitHub – ServiceNow/BrowserGym: 🌎💪 BrowserGym, a Gym environment for web task automation、GitHub – ServiceNow/AgentLab: AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility.

Political-LLM: Large Language Models in Political Science

Political-LLM: Large Language Models in Political Science [160.0]
大規模言語モデル(LLM)は、政治科学のタスクで広く採用されている。政治LLMは、LLMを計算政治科学に統合する包括的な理解を促進することを目的としている。
論文参考訳（メタデータ） (Mon, 09 Dec 2024 08:47:50 GMT)
「In this work, we—a multidisciplinary team of researchers spanning computer science and political science—present the first principled framework termed Political-LLM to advance the comprehensive understanding of integrating LLMs into computational political science.」、「The intended audience of this survey includes (1) computer science researchers and practitioners who seek a structured understanding of how LLMs are applied in political science, aiming to bridge interdisciplinary gaps; and (2) political science researchers and practitioners who seek to leverage LLMs in ways that are sensitive to the unique requirements of their field, such as nuanced interpretation and contextual accuracy [57].」ということで、政治へのLLM応用について調査したサーベイ。政治とあるが社会的なLLMの活用方針についての示唆も多く参考になる点が多い。プロジェクトサイトのライセンスがCC BY-SAであるのはありがたい。
プロジェクトサイトはPolitical-LLM: Large Language Models in Political Science

Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning

Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning [125.8]
Inst-ITは、明示的な視覚的プロンプトインストラクションチューニングを通じてインスタンス理解におけるLMMを強化するソリューションである。 Inst-ITは、マルチモーダルなインスタンスレベルの理解を診断するためのベンチマーク、大規模命令チューニングデータセット、継続的命令チューニングトレーニングパラダイムで構成されている。
論文参考訳（メタデータ） (Wed, 04 Dec 2024 18:58:10 GMT)
動画内のオブジェクトのようなインスタンスレベルでの理解を行うためのベンチマーク、データセットの提案。
筆者らによってFinetuningされたモデルはOSSなものでは高性能だが商用レベルには及んでいない。というのとこれが純粋に難しい問題であることが分かるスコア。
リポジトリはInst-IT

Moto: Latent Motion Token as the Bridging Language for Robot Manipulation

Moto: Latent Motion Token as the Bridging Language for Robot Manipulation [66.2]
我々はMotoを紹介する。Motoは、映像コンテンツをラテントモーションTokenizerでラテントモーションTokenシーケンスに変換する。我々は、モーショントークンによるMoto-GPTの事前学習を行い、多様な視覚的動きの知識を捉えることができる。実際のロボット動作に先立って学習した動きを転送するために、潜伏した動きのトークン予測と実際のロボット制御をシームレスにブリッジするコファインチューニング戦略を実装した。
論文参考訳（メタデータ） (Thu, 05 Dec 2024 18:57:04 GMT)
「This paper introduces Moto, a novel method that uses latent motion tokens as a “language” interface to bridge generative pre-training on video data with precise robot control.」という手法の提案。潜在的な意味というか意図というかをTokenシーケンスにして言語として扱うということ、かつ、それが有効というのは興味深い。
プロジェクトサイトはMoto

Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset

Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset [33.2]
精度とデータ量とのトレードオフを改善する方法を示します。 15Tトークンのためにトレーニングされた8Bパラメータモデルで、うち7.2Tは、Llama 3.1 8Bモデルよりも優れている。
論文参考訳（メタデータ） (Tue, 03 Dec 2024 17:28:50 GMT)
RedStone同様、Common CrawlをうまくRefineする手法の報告。こちらはNDIVIAによるもの。「We propose a method for transforming English Common Crawl into a 6.3T token longhorizon pretraining dataset, consisting of 4.4T globally deduplicated original tokens and 1.9T synthetically generated tokens.」と合成データについて触れられているのも興味深い。
プロジェクトサイトはNemotron-CC

RedStone: Curating General, Code, Math, and QA Data for Large Language Models

RedStone: Curating General, Code, Math, and QA Data for Large Language Models [134.5]
本研究では,大規模言語モデルを事前学習するための包括的かつ柔軟なリソースとして,Common Crawlの未完成の可能性を探る。私たちは、Common Crawlからデータを抽出し、処理するために設計された、革新的でスケーラブルなパイプラインであるRedStoneを紹介します。
論文参考訳（メタデータ） (Wed, 04 Dec 2024 15:27:39 GMT)
LLM構築など大規模な事前学習で重要なデータ源となっているCommonCrawlからのデータ構築についての報告と実装。フィルタリングの過程でデータが大幅に削られている。「Our general domain dataset, REDSTONE-Web, outperforms existing open-source datasets in common sense reasoning benchmarks, while the inclusion of REDSTONE-Code and REDSTONE-Math significantly improves model performance in code generation and mathematical problem solving.」とのこと。
リポジトリはhttps://github.com/microsoft/redstoneとのことだが、現時点では404

Large Language Model-Brained GUI Agents: A Survey

Large Language Model-Brained GUI Agents: A Survey [43.2]
マルチモーダルモデルはGUI自動化の新しい時代を支えてきた。彼らは自然言語理解、コード生成、視覚処理において例外的な能力を示した。これらのエージェントはパラダイムシフトを表しており、ユーザーは単純な会話コマンドで複雑なマルチステップタスクを実行できる。
論文参考訳（メタデータ） (Wed, 27 Nov 2024 12:13:39 GMT)
GUI Agents with Foundation Models: A Comprehensive Survey – arXiv最新論文の紹介ににたサーベイだが、こちらはMicrosoftの研究者が筆頭著者。

Establishing Task Scaling Laws via Compute-Efficient Model Ladders

Establishing Task Scaling Laws via Compute-Efficient Model Ladders [123.8]
我々は,事前訓練された言語モデル(LM)のタスク性能を予測するために,タスクスケーリング法則とモデルはしごを開発する。まず、タスク固有の損失を予測するためにモデルとデータサイズを使用し、次にタスクの損失を使ってタスクパフォーマンスを予測する。
論文参考訳（メタデータ） (Thu, 05 Dec 2024 18:21:49 GMT)
効率よくタスク性能を予測する手法の提案、「With a less than 1% of the pretraining compute, we are able to predict the task performance of 7B-4T and 13B-5T models on individual multiple-choice tasks with good accuracy.」とのこと。

2025年7月
月	火	水	木	金	土	日
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31