December 2024 – Page 3 – Introducing the Latest arXiv Papers

Byte Latent Transformer: Patches Scale Better Than Tokens

Byte Latent Transformer: Patches Scale Better Than Tokens [101.1]
Byte Latent Transformer (BLT) encodes bytes into dynamically sized patches. For a fixed inference cost, BLT shows significantly better scaling than tokenization-based models by simultaneously growing both patch size and model size.
Reference translation of the paper (metadata) (Fri, 13 Dec 2024 05:33:32 GMT)
Various byte-level Transformers have been proposed, but building them at large scale has been difficult because of the compute required. This paper proposes the following approach: "To efficiently allocate compute, we propose a dynamic, learnable method for grouping bytes into patches (§2) and a new model architecture that mixes byte and patch information." The authors report: "Overall, for fixed inference costs, BLT shows significantly better scaling than tokenization-based models, by simultaneously growing both patch and model size."
The repository is GitHub – facebookresearch/blt: Code for BLT research paper
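
To make "grouping bytes into patches" concrete, here is a minimal sketch, not the BLT implementation. It assumes a boundary rule that starts a new patch whenever a byte looks surprising, and it substitutes a simple unigram byte-frequency model for the learned component the paper describes. The function names, the `threshold` parameter, and the `max_len` cap are all hypothetical.

```python
import math
from collections import Counter


def byte_surprisals(data: bytes) -> list[float]:
    """Surprisal in bits of each byte under a unigram model fit on `data`.

    Assumption: this frequency model is only a stand-in for a learned
    boundary predictor; it is not what the paper uses.
    """
    counts = Counter(data)
    total = len(data)
    return [-math.log2(counts[b] / total) for b in data]


def patch_bytes(data: bytes, threshold: float, max_len: int = 8) -> list[bytes]:
    """Split `data` into variable-length patches.

    A new patch starts before any byte whose surprisal exceeds `threshold`,
    or once the current patch has reached `max_len` bytes.
    """
    patches: list[bytes] = []
    start = 0
    for i, surprisal in enumerate(byte_surprisals(data)):
        if i > start and (surprisal > threshold or i - start >= max_len):
            patches.append(data[start:i])
            start = i
    if start < len(data):
        patches.append(data[start:])
    return patches


if __name__ == "__main__":
    text = b"aaaaaaaaaaaaaaaaXaaaaYaaaa"
    for patch in patch_bytes(text, threshold=3.0):
        print(patch)
```

Predictable runs end up in long patches while rare bytes open new ones, so compute per patch can follow how hard the data is to predict rather than a fixed vocabulary.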

Language Models as Continuous Self-Evolving Data Engineers

Language Models as Continuous Self-Evolving Data Engineers [31.9]
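Large language models (LLMs) have demonstrated remarkable capabilities across a variety of tasks. This paper proposes a new paradigm that enables LLMs to train themselves by autonomously generating, cleaning, reviewing, and annotating data. The approach shows that LLMs can serve as continuous self-evolving data engineers, significantly reducing the time and cost of constructing post-training data.
Reference translation of the paper (metadata) (Thu, 19 Dec 2024 18:28:41 GMT)
A proposal for Language Models as Continuous Self-Evolving Data Engineers (LANCE), in which the LLM generates data and trains on it itself. The paper makes a strong claim: "This training paradigm with autonomous data construction not only reduces the reliance on human experts or external models but also ensures that the data aligns with human values and preferences, paving the way for the development of future superintelligent systems that can exceed human capabilities."
Similar research has been done before, so I expect this direction is indeed effective, but there must be limits, and I am rather doubtful that it leads to a superintelligent system.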

Machine Unlearning Doesn’t Do What You Think: Lessons for Generative AI Policy, Research, and Practice

Machine Unlearning Doesn’t Do What You Think: Lessons for Generative AI Policy, Research, and Practice [186.1]
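Unlearning is often invoked as a solution for removing the influence of targeted information from generative AI models. Unlearning has also been proposed as a way to prevent a model from producing targeted types of information in its outputs. These two goals (the targeted removal of information from a model and the targeted suppression of information from a model's outputs) present various technical and substantive challenges.
Reference translation of the paper (metadata) (Mon, 09 Dec 2024 20:18:43 GMT)
Comprehensive information on machine unlearning. It is a very useful overview that puts particular effort into aspects beyond the technical, for example: "despite the intuitive alignment of the meanings of the words “removal” and “deletion,” it is unclear if technical removal is indeed necessary to satisfy deletion requirements in law and policy."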

RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation

RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation [21.8]
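RetroLLM is a unified framework that integrates retrieval and generation into a single cohesive process. To mitigate false pruning during constrained evidence generation, it introduces hierarchical FM-Index constraints. Experiments on five open-domain QA datasets demonstrate RetroLLM's superior performance on both in-domain and out-of-domain tasks.
Reference translation of the paper (metadata) (Mon, 16 Dec 2024 16:03:25 GMT)
A proposal for a framework that seamlessly connects retrieval and generation.
The repository is GitHub – sunnynexus/RetroLLM: RetroLLM: Empowering LLMs to Retrieve Fine-grained Evidence within Generation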
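
The core idea of constrained evidence generation can be sketched as follows, under loud assumptions. At each decoding step only continuations that keep the output a verbatim substring of some corpus document are allowed. The paper uses FM-Index constraints for this; the sketch substitutes a naive linear scan and does not model the hierarchical constraints or the false-pruning mitigation. The function names and the context-free `scores` table (a stand-in for model logits) are hypothetical.

```python
def allowed_next_chars(prefix: str, corpus: list[str]) -> set[str]:
    """Characters c such that prefix + c still occurs in some corpus document.

    Naive scan for illustration; an FM-index would answer the same kind of
    "which continuations exist" query without scanning every document.
    """
    allowed: set[str] = set()
    for doc in corpus:
        start = doc.find(prefix)
        while start != -1:
            end = start + len(prefix)
            if end < len(doc):
                allowed.add(doc[end])
            start = doc.find(prefix, start + 1)
    return allowed


def constrained_greedy(scores: dict[str, float], corpus: list[str], max_len: int) -> str:
    """Greedy decoding restricted to strings that appear verbatim in the corpus.

    `scores` is a hypothetical, context-free stand-in for model logits.
    """
    out = ""
    while len(out) < max_len:
        candidates = allowed_next_chars(out, corpus)
        if not candidates:
            break  # no continuation keeps the output inside the corpus
        # Sort first so ties are broken deterministically.
        out += max(sorted(candidates), key=lambda c: scores.get(c, 0.0))
    return out


if __name__ == "__main__":
    docs = ["the cat sat", "the dog ran"]
    print(allowed_next_chars("the ", docs))
    print(constrained_greedy({"d": 1.0}, docs, max_len=7))
```

Because every step is filtered against the corpus, whatever the decoder emits can be traced back to a document, which is what lets retrieval and generation share a single decoding process.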

Driv3R: Learning Dense 4D Reconstruction for Autonomous Driving

Driv3R: Learning Dense 4D Reconstruction for Autonomous Driving [116.1]
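We propose Driv3R, a framework that directly regresses per-frame point maps from multi-view image sequences. We use a 4D flow predictor to identify moving objects in the scene and place greater emphasis on reconstructing these dynamic regions. Driv3R outperforms previous frameworks on 4D dynamic scene reconstruction, with 15x faster inference.
Reference translation of the paper (metadata) (Mon, 09 Dec 2024 18:58:03 GMT)
The project site is Driv3R, and the repository is GitHub – Barrybarry-Smith/Driv3R: Official Implementation of Driv3R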

Mixture of Hidden-Dimensions Transformer

Mixture of Hidden-Dimensions Transformer [50.4]
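We study the sparsity of hidden dimensions and observe that trained Transformers make use of only a small fraction of token dimensions. We propose MoHD (Mixture of Hidden Dimensions), a sparse conditional activation architecture. It achieves 1.7% higher performance with 50% fewer activation parameters, and 3.7% higher performance with a 3x parameter expansion at constant activation cost.
Reference translation of the paper (metadata) (Sat, 07 Dec 2024 13:15:22 GMT)
It resembles the MoE work seen frequently of late, but this is the type of research that digs into the global structure of the model.
The authors state: "It achieves 1.7% higher performance with 50% fewer activation parameters and 3.7% higher performance with a 3× parameter expansion at constant activation cost."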
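
As a rough illustration of sparse conditional activation over hidden dimensions, the sketch below builds a per-token 0/1 mask. It is an assumption-heavy toy, not the MoHD architecture: the split into always-on shared dimensions and top-k routed dimensions, the `router_scores` input, and both function names are hypothetical.

```python
def hidden_dim_mask(router_scores: list[float], n_shared: int, k: int) -> list[int]:
    """Return a 0/1 mask over hidden dimensions for a single token.

    Assumption: the first `n_shared` dimensions are always active, and among
    the remaining dimensions the `k` with the highest router scores are
    switched on.
    """
    d = len(router_scores)
    if not 0 <= n_shared <= d or not 0 <= k <= d - n_shared:
        raise ValueError("n_shared and k must fit within the hidden size")
    routed = range(n_shared, d)
    top = sorted(routed, key=lambda i: router_scores[i], reverse=True)[:k]
    mask = [1] * n_shared + [0] * (d - n_shared)
    for i in top:
        mask[i] = 1
    return mask


def apply_mask(hidden: list[float], mask: list[int]) -> list[float]:
    """Zero out the hidden dimensions that the mask leaves inactive."""
    return [h * m for h, m in zip(hidden, mask)]


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.2, 0.8, 0.05, 0.7, 0.3, 0.6]
    mask = hidden_dim_mask(scores, n_shared=2, k=2)
    print(mask, f"active fraction = {sum(mask) / len(mask):.0%}")
```

With 8 dimensions, 2 shared and 2 routed, half of the dimensions are active per token. Growing the total width while holding the active count fixed is one way to read "parameter expansion at constant activation cost".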

A Survey on Large Language Model-Based Social Agents in Game-Theoretic Scenarios

A Survey on Large Language Model-Based Social Agents in Game-Theoretic Scenarios [44.0]
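Game-theoretic scenarios have become important for evaluating the social intelligence of Large Language Model (LLM)-based social agents. This survey organizes the research into three core components: game frameworks, social agents, and evaluation protocols.
Reference translation of the paper (metadata) (Thu, 05 Dec 2024 06:46:46 GMT)
A survey of LLM-based agents in game-theoretic contexts.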

SimVS: Simulating World Inconsistencies for Robust View Synthesis

SimVS: Simulating World Inconsistencies for Robust View Synthesis [102.8]
This paper proposes a method that uses generative video models to simulate the world inconsistencies that can occur during capture. We demonstrate that this world-simulation strategy significantly outperforms traditional augmentation methods at handling real-world scene variations.
Reference translation of the paper (metadata) (Tue, 10 Dec 2024 17:35:12 GMT)
The authors explain: "Our approach augments existing multiview datasets with inconsistencies simulated by a video diffusion model and trains a multiview harmonization model to sample sets of consistent views of a scene conditioned on sparse inconsistent captures. We can then use existing 3D reconstruction and view synthesis techniques to synthesize novel viewpoints from these consistent images." It is an interesting data augmentation approach, and judging from the project site it also appears to be quite effective.
The project site is SimVS: Simulating World Inconsistencies for Robust View Synthesis

CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy

CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy [88.1]
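CC-OCR consists of four OCR-centric tracks: multi-scene text reading, multilingual text reading, document parsing, and key information extraction. CC-OCR aims to comprehensively evaluate the capabilities of LMMs on OCR-centric tasks and to drive further progress in LMMs.
Reference translation of the paper (metadata) (Tue, 03 Dec 2024 07:03:25 GMT)
An OCR benchmark for MLLMs; Gemini Pro performs strongly across the board.
The repository is https://github.com/QwenLM/CC-OCR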

TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action

TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action [103.6]
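We propose TACO, a multi-modal large action model aimed at improving performance on complex, multi-step, multi-modal tasks. During inference, TACO generates chains-of-thought-and-action (CoTA) and executes intermediate steps by calling external tools such as OCR, depth estimation, and a calculator. Its training dataset enables TACO to learn complex reasoning and action paths, outperforming existing models trained on tuning data that contain only direct answers.
Reference translation of the paper (metadata) (Sat, 07 Dec 2024 00:42:04 GMT)
A proposal for the following kind of model: "Our TACO model is able to output a Chain-of Thought-and-Action (CoTA) and answer challenging questions based on the thoughts and action outputs". It is a multi-modal model with actions, and it reportedly makes use of synthetic data built with models such as GPT-4o.
The project site is TACO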
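
The thought-and-action pattern described above can be sketched as a small tool-dispatch loop. This is only an illustration under stated assumptions, not TACO's interface. The step schema (`thought`, `action`, `answer`), the `TOOLS` registry, and the toy `calculator` are all hypothetical, and the chain is scripted by hand here, whereas a real model would generate each step after seeing the previous observation.

```python
import operator
from typing import Callable

# Toy calculator supporting "a op b" with a single binary operator.
_OPS: dict[str, Callable[[float, float], float]] = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def calculator(expression: str) -> str:
    """Evaluate an expression of the form '<number> <op> <number>'."""
    left, op, right = expression.split()
    return str(_OPS[op](float(left), float(right)))


# Hypothetical tool registry; OCR or depth estimation would be added here.
TOOLS: dict[str, Callable[[str], str]] = {"calculator": calculator}


def run_chain(steps: list[dict]) -> list[dict]:
    """Execute each step's action (if any) and attach the tool's observation."""
    trace: list[dict] = []
    for step in steps:
        record = dict(step)
        if "action" in step:
            tool = TOOLS[step["action"]["tool"]]
            record["observation"] = tool(step["action"]["input"])
        trace.append(record)
    return trace


if __name__ == "__main__":
    chain = [
        {"thought": "I need the total price of 3 items at 12 each.",
         "action": {"tool": "calculator", "input": "12 * 3"}},
        {"thought": "The tool returned the total.", "answer": "36"},
    ]
    for record in run_chain(chain):
        print(record)
```

Keeping tool results as explicit observations in the trace is what allows the final answer to be based on intermediate actions instead of a single direct response.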
