arXiv最新論文の紹介

Mixture-of-Transformers

Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models [112.0]
Mixture-of-Transformer (MoT) はスパースマルチモーダルトランスアーキテクチャである。 MoTはモデルの非埋め込みパラメータをモダリティで分離する。複数の設定とモデルスケールでMoTを評価する。
論文参考訳（メタデータ） (Thu, 07 Nov 2024 18:59:06 GMT)
性能がルータに依存するMixture of Expertsに対して、「MoT extends the standard transformer architecture by incorporating modality-specific weights for all non-embedding model parameters, including feed-forward networks, attention matrices, and layer normalization.」というアプローチのMixture of Transformerの提案。「In the Chameleon 7B setting (autoregressive text-and-image generation), MoT matches the dense baseline’s performance using only 55.8% of the FLOPs.」と有効性を主張。

A Survey of Small Language Models

A Survey of Small Language Models [104.8]
小言語モデル (SLM) は, 計算資源の最小化による言語タスクの効率化と性能の向上により, ますます重要になってきている。本稿では,SLMのアーキテクチャ,トレーニング技術,モデル圧縮技術に着目した総合的な調査を行う。
論文参考訳（メタデータ） (Fri, 25 Oct 2024 23:52:28 GMT)
Small Language Model（といっても感覚的には小規模LLM）のサーベイ
「The inherent difficulty of a survey of small language models is that the definitions of “small” and “large” are a function of both context and time. GPT2, a “large language model” in 2019 at 1.5B parameters, is smaller than many “small” language models covered in this survey.」とある通り、Smallとは？というのが大きな疑問。

GUI Agents with Foundation Models: A Comprehensive Survey

GUI Agents with Foundation Models: A Comprehensive Survey [53.0]
この調査は(M)LLMベースのGUIエージェントに関する最近の研究を集約する。データ、フレームワーク、アプリケーションにおける重要なイノベーションを強調します。本稿では, (M)LLM ベースの GUI エージェントの分野におけるさらなる発展を期待する。
論文参考訳（メタデータ） (Thu, 07 Nov 2024 17:28:10 GMT)
MLLMベースのGUIエージェントのサーベイ
研究が進んでいると思ったらサーベイが発表されるスピード感がこの分野の現状を表していると思う。

MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark [44.7]
MMAUは、人間の注釈付き自然言語の質問と回答とを合わせた、注意深く編集された10kのオーディオクリップで構成されている。これには、情報抽出と推論の質問が含まれており、モデルは、ユニークで困難なタスクにまたがる27の異なるスキルを実証する必要がある。我々は18のオープンソースおよびプロプライエタリなAudio-Language Modelを評価し、MMAUがもたらす重大な課題を実証した。
論文参考訳（メタデータ） (Thu, 24 Oct 2024 21:20:10 GMT)
音声（speech, sounds, music）を軸とした理解・推論のベンチマーク。GPT-4o、Gemini Pro 1.5の性能が高めだが、「Our evaluations of 18 open-source and proprietary LALMs reveal that even the overall best model achieves only 59% accuracy on MMAU, highlighting the significant challenges it poses.」とのこと。人間のスコア（約80%）との差も大きい。
リポジトリはMMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

DynaSaur: Large Language Agents Beyond Predefined Actions

DynaSaur: Large Language Agents Beyond Predefined Actions [108.8]
既存のLLMエージェントシステムは、通常、各ステップで固定セットと事前定義されたセットからアクションを選択する。動作の動的生成と構成をオンラインで実現するLLMエージェントフレームワークを提案する。 GAIAベンチマーク実験により, このフレームワークは柔軟性が向上し, 従来の手法よりも優れていたことが確認された。
論文参考訳（メタデータ） (Mon, 04 Nov 2024 02:08:59 GMT)
Agenticな動きの各ステージをPythonコードとしコード生成を使うことによって柔軟性を増したフレームワークの提案。「We have explored an LLM agent framework that implements its own actions as Python functions to interact with the world and accumulate its generated actions over time, thus growing a toolset of actions for problem-solving in future tasks.」GAIA Leaderboard – a Hugging Face Space by gaia-benchmarkで高い性能を達成。
リポジトリはGitHub – adobe-research/dynasaur: Official repository for “DynaSaur: Large Language Agents Beyond Predefined Actions”　（現時点ではコードがアップロードされていないよう）

HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems

HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieved Knowledge in RAG Systems [62.4]
Retrieval-Augmented Generation (RAG) は知識能力の向上を目的としている。 HTML RAGは、検索された知識のフォーマットとして、平易なテキストの代わりにHTMLを使用する。我々は,情報の損失を最小限に抑えつつ,HTMLの短縮化を図るため,HTMLのクリーニング,圧縮,プルーニング戦略を提案する。
論文参考訳（メタデータ） (Tue, 05 Nov 2024 09:58:36 GMT)
RAGで使用する知識のフォーマットとしてHTMLを使用するという提案、ベンチマークでも優れた結果とのこと。ベースLLM（Llama 3.1 8B・70B）×提案手法・PlainText・Markdownの結果が興味深い。（HTMLがベストなのか読み取るのが難しいような気がしなくもない）
リポジトリはGitHub – plageon/HtmlRAG: HtmlRAG: HTML is Better Than Plain Text for Modeling Retrieval Results in RAG Systems

Hunyuan-Large

Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent [83.4]
Hunyuan-Largeは、オープンソースのTransformerベースのエキスパートモデルのミックスである。我々は,Hunyuan-Largeの優れた性能を,様々なベンチマークで徹底的に評価する。 Hunyuan-Largeの主な実践は、以前の文献より大きい大規模合成データである。
論文参考訳（メタデータ） (Tue, 05 Nov 2024 04:14:25 GMT)
高性能かつモデルが公開されているタイプのLLM。389Bパラメータうち52BがアクティブなるMoEでLlama 3.1 70Bを超え、405Bと競合的と主張。比較的寛容なライセンスであるが「THIS LICENSE AGREEMENT DOES NOT APPLY IN THE EUROPEAN UNION AND IS EXPRESSLY LIMITED TO THE TERRITORY, AS DEFINED BELOW.」というのが特徴的。「This Agreement and any dispute arising out of or relating to it will be governed by the laws of the Hong Kong Special Administrative Region of the People’s Republic of China」との記載も。
リポジトリはGitHub – Tencent/Tencent-Hunyuan-Large、モデルはtencent/Tencent-Hunyuan-Large · Hugging Face

Number Cookbook: Number Understanding of Language Models and How to Improve It

Number Cookbook: Number Understanding of Language Models and How to Improve It [64.0]
大規模言語モデル(LLM)は、基本的な数値的な理解と処理において予期せぬ誤りを犯しながら、複雑な推論タスクの増大を解決することができる。本稿では,LLMの数値理解と処理能力(NUPA)について包括的に検討する。
論文参考訳（メタデータ） (Wed, 06 Nov 2024 08:59:44 GMT)
LLMにおける numerical understanding and processing ability (NUPA)の分析と、その改善方法の検討。現状だとコード生成を介すなどツールを使うアプローチが有力だが、「1) we want to study the self-contained NUPA of LLMs,　2) calling external tools whenever encountering numbers increases the inference latency (Xu et al , 2024), and 3) we believe NUPA without tools is a necessary ability of AGI.」という点から本件ではツール利用が検討対象外となっている。
現時点では「We investigate NUPA of LLMs and introduce a comprehensive benchmark, the NUPA test, to reveal that numerical problems remain challenging for modern LLMs.」とのこと。やはり難しい問題。実用上はコード生成を介すなどして対応できなくはないが・・・。
リポジトリはGitHub – GraphPKU/number_cookbook

Agent K

Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level [73.1]
我々は、エンドツーエンドの自律データサイエンスエージェントであるAgent K v1.0を紹介する。経験から学ぶことによって、データサイエンスのライフサイクル全体を管理する。キー情報を選択的に保存して検索することで、長期記憶と短期記憶を最適化する。
論文参考訳（メタデータ） (Tue, 05 Nov 2024 23:55:23 GMT)
「our results indicate that Agent K v1.0 has reached a performance level equivalent to Kaggle Grandmaster, with a record of 6 gold medals, 3 silver medals, and 7 bronze medals」とKaggleのグランドマスター並みを主張するエージェントシステムの提案。
パイプライン構成やプロンプトなど参考になる点は多いが、「However, because this assessment relies on a custom split of the training data rather than the competition’s actual private test set, it remains uncertain whether an agent’s high ranking in this context would align with results on the original Kaggle leaderboard.」という記載やLeakの可能性など「ほんまかいな」という疑問点はなくはない。

Neural Fields in Robotics: A Survey

Neural Fields in Robotics: A Survey [39.9]
Neural Fieldsは、コンピュータビジョンとロボット工学における3Dシーン表現の変革的アプローチとして登場した。この調査は、ロボット工学における彼らの応用を探求し、知覚、計画、制御を強化する可能性を強調している。それらのコンパクトさ、メモリ効率、微分可能性、基礎モデルと生成モデルとのシームレスな統合は、リアルタイムアプリケーションに理想的です。
論文参考訳（メタデータ） (Sat, 26 Oct 2024 16:26:41 GMT)
「This paper provides a thorough review of Neural Fields in robotics, categorizing applications across various domains and evaluating their strengths and limitations, based on over 200 papers.」というサーベイ、ロボット分野で研究・応用が広がっているとのこと。
リポジトリはNeural Fields in Robotics: A Survey

2025年7月
月	火	水	木	金	土	日
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31