December 16, 2024 – Introducing the latest arXiv papers

Phi4, InternVL 2.5, EXAONE 3.5

Gemini 2.0 and OpenAI's 12 days of announcements are drawing most of the attention, but a variety of OSS and openly released models have also been announced.

Phi-4 Technical Report [72.1]
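This work presents phi-4, a 14-billion-parameter language model developed with a central focus on data quality. Unlike most language models, whose pre-training is based primarily on organic data sources such as web content or code, phi-4 strategically incorporates synthetic data throughout the training process.
Reference translation of the paper (metadata) (Thu, 12 Dec 2024 03:37:41 GMT)
The latest version of Phi, the small, high-performance model family. The report states that "phi-4 strategically incorporates synthetic data throughout the training process.", so the approach is to make good use of synthetic data. An excellent model that surpasses Phi-3 and comes close to GPT-4o mini.
There is also an announcement on the official blog: Introducing Phi-4: Microsoft’s Newest Small Language Model Specializing in Complex Reasoning | Microsoft Community Hub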

EXAONE 3.5: Series of Large Language Models for Real-world Use Cases [35.0]
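The EXAONE 3.5 language models are offered in three configurations: 32B, 7.8B, and 2.4B. For commercial use, please refer to the official contact point of LG AI Research.
Reference translation of the paper (metadata) (Mon, 09 Dec 2024 09:31:10 GMT)
An openly released model from LG, with performance competitive with Qwen2.5 models of the same size.
The repository is LGAI-EXAONE (LG AI Research)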

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling [121.1]
InternVL 2.5 is an advanced multimodal large language model (MLLM) series that builds on InternVL 2.0. InternVL 2.5 is competitive with leading commercial models such as GPT-4o and Claude-3.5-Sonnet. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems.
Reference translation of the paper (metadata) (Fri, 06 Dec 2024 18:57:08 GMT)
An OSS MLLM whose performance is reported to be competitive with commercial models. The architecture is described as "we integrate a newly incrementally pre-trained InternViT with various pre-trained LLMs, including InternLM 2.5 and Qwen 2.5, using a randomly initialized MLP projector.", that is, an approach that connects a ViT to an LLM through a projector.
The repositories are OpenGVLab/InternVL2_5-78B · Hugging Face and GitHub – OpenGVLab/InternVL: [CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. 接近GPT-4o表现的开源多模态对话模型
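As a rough illustration of the "ViT, projector, LLM" wiring mentioned in this entry, here is a minimal, standard-library-only sketch. It is not InternVL's actual code: the two-layer shape of the MLP, the GELU activation, the toy dimensions, and every function name are assumptions made for illustration. The only point it shows is that a randomly initialised MLP maps vision-token features to the LLM's embedding width so that both kinds of tokens can be fed to the LLM as one sequence.

```python
import math
import random


def init_linear(n_in, n_out, rng):
    """Randomly initialised weight matrix (n_in x n_out) and zero bias."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_out)] for _ in range(n_in)]
    bias = [0.0] * n_out
    return weights, bias


def linear(tokens, weights, bias):
    """Apply y = xW + b to every token vector."""
    return [
        [sum(x * w_row[j] for x, w_row in zip(token, weights)) + bias[j]
         for j in range(len(bias))]
        for token in tokens
    ]


def gelu(x):
    """Tanh approximation of GELU (the activation choice is an assumption)."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))


def project(vision_tokens, params):
    """Two-layer MLP: vision width -> hidden width -> LLM embedding width."""
    (w1, b1), (w2, b2) = params
    hidden = [[gelu(v) for v in row] for row in linear(vision_tokens, w1, b1)]
    return linear(hidden, w2, b2)


rng = random.Random(0)
VISION_DIM, HIDDEN_DIM, LLM_DIM = 8, 16, 12  # toy sizes, not the real ones
params = (init_linear(VISION_DIM, HIDDEN_DIM, rng),
          init_linear(HIDDEN_DIM, LLM_DIM, rng))

# Random stand-ins for ViT patch features and LLM text-token embeddings.
vision_tokens = [[rng.gauss(0, 1) for _ in range(VISION_DIM)] for _ in range(4)]
text_embeddings = [[rng.gauss(0, 1) for _ in range(LLM_DIM)] for _ in range(3)]

# The LLM would consume projected visual tokens and text tokens as one sequence.
llm_input = project(vision_tokens, params) + text_embeddings
print(len(llm_input), len(llm_input[0]))  # 7 tokens, each of width 12
```

In a design like this, the vision encoder and the LLM can both start from pre-trained weights, and only the small projector starts from scratch.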

InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions [104.9]
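This work introduces disentangled streaming perception, reasoning, and memory mechanisms to enable real-time interaction with streaming video and audio input. The project simulates human-like cognition, enabling multimodal large language models to provide continuous and adaptive service over time.
Reference translation of the paper (metadata) (Thu, 12 Dec 2024 18:58:30 GMT)
A framework that offers not only real-time streaming but also memory capabilities.
The repository is GitHub – InternLM/InternLM-XComposer: InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions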

Owl-1: Omni World Model for Consistent Long Video Generation [75.5]
We propose the Omni World ModeL (Owl-1). Owl-1 achieves performance comparable to SOTA methods on VBench-I2V and VBench-Long.
Reference translation of the paper (metadata) (Thu, 12 Dec 2024 18:59:01 GMT)
A video generation model; the repository is GitHub – huang-yh/Owl

SiReRAG: Indexing Similar and Related Information for Multihop Reasoning [96.6]
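SiReRAG is a new RAG indexing approach that explicitly considers both similar and related information. SiReRAG consistently outperforms state-of-the-art indexing methods on three multihop datasets.
Reference translation of the paper (metadata) (Mon, 09 Dec 2024 04:56:43 GMT)
A RAG method that, in addition to a similarity-based tree, also uses a relatedness tree (we construct the relatedness tree by clustering the propositions based on their entities to get proposition aggregates and having recursive summaries on top.).
It is reported to perform well on multihop QA.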
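The relatedness side of this idea can be sketched roughly as follows. This is a toy illustration rather than the authors' implementation: the example propositions, the pre-extracted entity sets, the `min_size` threshold, and the function name `aggregate_by_entity` are all assumptions. It covers only the first step, grouping propositions that share an entity into "proposition aggregates"; the recursive summaries that the paper builds on top are not reproduced here.

```python
from collections import defaultdict

# Hypothetical propositions, each paired with entities assumed to have been
# extracted beforehand (the extraction step itself is out of scope here).
propositions = [
    ("Marie Curie was born in Warsaw.", {"Marie Curie", "Warsaw"}),
    ("Marie Curie won the Nobel Prize in Physics in 1903.",
     {"Marie Curie", "Nobel Prize in Physics"}),
    ("Pierre Curie shared the 1903 Nobel Prize in Physics.",
     {"Pierre Curie", "Nobel Prize in Physics"}),
    ("Warsaw is the capital of Poland.", {"Warsaw", "Poland"}),
]


def aggregate_by_entity(props, min_size=2):
    """Group proposition texts by shared entity.

    Entities mentioned by fewer than `min_size` propositions are dropped;
    that threshold is an assumption of this sketch.
    """
    groups = defaultdict(list)
    for text, entities in props:
        for entity in sorted(entities):
            groups[entity].append(text)
    return {e: texts for e, texts in groups.items() if len(texts) >= min_size}


aggregates = aggregate_by_entity(propositions)
for entity in sorted(aggregates):
    print(entity, "->", aggregates[entity])
```

The "Warsaw" aggregate links a sentence about Marie Curie to one about Poland. The two sentences are not especially similar in wording, yet both are needed for a multihop question such as "In which country was Marie Curie born?", which is the kind of link a similarity-only index could miss.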

The BrowserGym Ecosystem for Web Agent Research [151.9]
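The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents. We conduct the first large-scale, multi-benchmark web agent experiment. The results highlight a large discrepancy between the latest models from OpenAI and Anthropic.
Reference translation of the paper (metadata) (Fri, 06 Dec 2024 23:43:59 GMT)
A benchmark environment for web agent development; the authors have also unified existing benchmarks and released AgentLab. According to the current leaderboard (BrowserGym Leaderboard – a Hugging Face Space by ServiceNow), the strong performance of Claude 3.5 Sonnet stands out.
The repositories are GitHub – ServiceNow/BrowserGym: 🌎💪 BrowserGym, a Gym environment for web task automation and GitHub – ServiceNow/AgentLab: AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility.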

Political-LLM: Large Language Models in Political Science [160.0]
Large language models (LLMs) have been widely adopted for political science tasks. Political-LLM aims to advance a comprehensive understanding of integrating LLMs into computational political science.
Reference translation of the paper (metadata) (Mon, 09 Dec 2024 08:47:50 GMT)
The paper says "In this work, we—a multidisciplinary team of researchers spanning computer science and political science—present the first principled framework termed Political-LLM to advance the comprehensive understanding of integrating LLMs into computational political science." and "The intended audience of this survey includes (1) computer science researchers and practitioners who seek a structured understanding of how LLMs are applied in political science, aiming to bridge interdisciplinary gaps; and (2) political science researchers and practitioners who seek to leverage LLMs in ways that are sensitive to the unique requirements of their field, such as nuanced interpretation and contextual accuracy [57].", so this is a survey of LLM applications in politics. Although the title says politics, it also offers many pointers on how LLMs can be used in society more broadly, and much of it is worth consulting. It is nice that the project site is licensed under CC BY-SA.
The project site is Political-LLM: Large Language Models in Political Science