September 2024 – Page 2 – Introducing the Latest arXiv Papers

CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration

CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration [90.4]
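Multimodal large language models (MLLMs) have achieved remarkable success in conversations that involve visual inputs. Integrating the visual modality introduces a unique vulnerability: MLLMs become susceptible to malicious visual inputs. This paper introduces CoCA, a technique that improves the safety of MLLMs by calibrating their output distribution.
Reference translation of the paper (metadata) (Tue, 17 Sep 2024 17:14:41 GMT)
Attacks via malicious images are a problem for MLLMs; this paper is about countering them.
The observation "We first make the observation that despite the integration of visual modality makes the MLLMs more vulnerable, the inherent safety-awareness of MLLMs still exists." came as a bit of a surprise.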

What is the Role of Small Models in the LLM Era: A Survey

What is the Role of Small Models in the LLM Era: A Survey [13.2]
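Large language models (LLMs) have driven significant progress toward artificial general intelligence (AGI), leading to the development of ever larger models such as GPT-4 and LLaMA-405B. Scaling up model size increases computational cost and energy consumption exponentially, which makes these models impractical for academic researchers and businesses with limited resources. At the same time, small models (SMs) are frequently used in practical settings, yet their importance is underestimated.
Reference translation of the paper (metadata) (Tue, 10 Sep 2024 20:45:43 GMT)
A survey on small models, which matter in practice. "there is no clear definition distinguishing large models from small ones." is exactly what you would expect. That said, the content is convincing, including the axes used to organize it.
The repository is GitHub – tigerchen52/role_of_small_models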

Autoregressive + Chain of Thought (CoT) ≃ Recurrent, To CoT or not to CoT

Two papers examining Chain of Thought have come out. The first examines it from the standpoint of how it operates; the second is a meta-analysis.

Autoregressive + Chain of Thought (CoT) $\simeq$ Recurrent: Recurrence’s Role in Language Models and a Revist of Recurrent Transformer [30.0]
Examines how recurrent structure in language models affects reasoning ability. Identifies key theoretical limitations in models such as linear transformers and RWKV.
Reference translation of the paper (metadata) (Sat, 14 Sep 2024 00:30:57 GMT)
The paper states: "We explained that CoT approximates recurrence in Transformer-based autoregressive LLMs from a computational standpoint." The remark made along the way, "Recurrent Neural Networks (RNNs) sacrifice parallel training for recurrent connections, while Transformers trade recurrence for parallelism.", is also important.
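The claim that CoT approximates recurrence can be illustrated with a toy example. The sketch below is not taken from the paper: `next_token` is a hypothetical stateless step function standing in for one autoregressive decoding step, and running parity is chosen only because it needs state carried across positions. The emitted scratchpad tokens play the role that the hidden state plays in an explicit recurrent loop.

```python
def next_token(bits, scratchpad):
    """Stateless step: reads the prompt and the tokens emitted so far, returns one token."""
    position = len(scratchpad)
    previous = scratchpad[-1] if scratchpad else 0
    return previous ^ bits[position]


def parity_with_cot(bits):
    """Autoregressive loop: each emitted token is fed back in as context."""
    scratchpad = []
    for _ in bits:
        scratchpad.append(next_token(bits, scratchpad))
    return scratchpad


def parity_recurrent(bits):
    """Explicit recurrence: the same computation with a hidden state."""
    state = 0
    for b in bits:
        state ^= b
    return state


trace = parity_with_cot([1, 0, 1, 1])
final_answer = trace[-1]
```

The last scratchpad token matches the final state of the recurrent version; the intermediate tokens are the "thoughts" that carry the state forward.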

To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning [55.5]
Chain-of-Thought (CoT) is the de facto method for eliciting reasoning capabilities from large language models (LLMs). We show that CoT gives strong performance benefits primarily on tasks involving math or logic, with much smaller gains on other tasks.
Reference translation of the paper (metadata) (Wed, 18 Sep 2024 17:55:00 GMT)
"Finding 1: CoT only helps substantially on problems requiring mathematical, logical, or algorithmic reasoning." is fine as it stands, but does "Finding 2: CoT primarily helps with the execution step that performs computation and symbolic manipulation, but falls short of what LLMs with tool augmentation can do." mean that agentic approaches are the more promising direction?

P-RAG: Progressive Retrieval Augmented Generation For Planning on Embodied Everyday Task

P-RAG: Progressive Retrieval Augmented Generation For Planning on Embodied Everyday Task [94.1]
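Embodied Everyday Task is a popular task in the embodied AI community. Natural language instructions often lack explicit task planning. Incorporating knowledge about the task environment into a model requires extensive training.
Reference translation of the paper (metadata) (Tue, 17 Sep 2024 15:29:34 GMT)
Proposes an approach that uses RAG for agent behavior (planning and so on) given natural language instructions and environment information. The RAG database is updated dynamically, so it reads very much like LLM-based agents in general.
Intuitively, retrieval seems like the hard part, yet the paper says "When an agent interacts with the environment during a task, it first receives the environment’s goal instruction $I_g$ and observation $O_t$. Then it encodes with MiniLM [31] both of them"; it is surprising that this approach would work well.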
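The retrieval step quoted above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not P-RAG's implementation: `embed` is a toy bag-of-words stand-in for a sentence encoder such as MiniLM, `ProgressiveMemory` and its record layout are hypothetical, and concatenating the goal and observation into one query is an assumption (the paper may encode them separately).

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding; a stand-in for a real sentence encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ProgressiveMemory:
    """Hypothetical store of past (goal, observation, plan) records that grows over episodes."""

    def __init__(self):
        self.records = []

    def add(self, goal, observation, plan):
        self.records.append((embed(goal + " " + observation), plan))

    def retrieve(self, goal, observation, k=1):
        query = embed(goal + " " + observation)
        ranked = sorted(self.records, key=lambda r: cosine(query, r[0]), reverse=True)
        return [plan for _, plan in ranked[:k]]


memory = ProgressiveMemory()
memory.add("put a clean mug on the table", "you see a sink and a mug",
           "take mug; clean mug at sink; place mug on table")
memory.add("heat an egg", "you see a microwave and a fridge",
           "take egg from fridge; heat egg in microwave")
best = memory.retrieve("put a clean plate on the table", "you see a sink and a plate")
```

Calling `memory.add` after each finished episode is what would make the database "progressive": later queries can retrieve plans from earlier attempts.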

Recent Trends of Multimodal Affective Computing: A Survey from NLP Perspective

Recent Trends of Multimodal Affective Computing: A Survey from NLP Perspective [15.6]
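Multimodal affective computing (MAC) has attracted attention because it is widely applied to analyzing human behavior and intentions. This survey summarizes recent trends in multimodal affective computing from an NLP perspective, organized around four hot tasks. Its aim is to explore the current landscape of multimodal affective research, identify development trends, and highlight the similarities and differences across tasks.
Reference translation of the paper (metadata) (Wed, 11 Sep 2024 16:24:06 GMT)
A survey of multimodal affective computing. The main tasks are Multimodal Sentiment Analysis (MSA), Multimodal Emotion Recognition in Conversation (MERC), Multimodal Aspect Based Sentiment Analysis (MABSA), and Multimodal Multilabel Emotion Recognition (MMER).
The paper repository is GitHub – LeMei/Multimodal-Affective-Computing-Survey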

Qwen 2.5, Qwen 2 VL, GRIN-MoE, Pixtral

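Various research organizations are building LLMs. From last week's news, the ones to watch are the high-performance LLM Qwen 2.5, the highly efficient MoE-based GRIN-MoE, and the multimodal extensions Qwen 2 VL and Pixtral.

Note that the licenses vary, but the models themselves are publicly available, so the options beyond commercial APIs keep widening. Each model also has its own aims, and frankly even evaluating them is not easy. These days I find myself wanting an AI that suggests a base model, and a way to use it, that fits what I want to do.

A great deal of information has also been published from the standpoint of model building and fine-tuning, which is very interesting.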

Qwen2.5-Coder Technical Report [100.7]
Introduces the Qwen2.5-Coder series, a significant upgrade from its predecessor CodeQwen1.5. As a code-specific model, Qwen2.5-Coder is built on the Qwen2.5 architecture and pretrained on a vast corpus of over 5.5 trillion tokens.
Reference translation of the paper (metadata) (Wed, 18 Sep 2024 17:57:57 GMT)
The report stresses the importance of filtering: "To ensure the quality of the pre-training data, we have curated a dataset by collecting public code data and extracting high-quality code-related content from web texts, while filtering out low-quality data using advanced classifiers." Data synthesis is touched on as well; is the emphasis on filtering because, unlike for math, real data is abundant?

Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement [71.5]
Presents Qwen2.5-Math and Qwen2.5-Math-Instruct-1.5B/7B/72B. Qwen2.5-Math-Instruct supports both Chinese and English and has advanced mathematical reasoning capabilities.
Reference translation of the paper (metadata) (Wed, 18 Sep 2024 16:45:37 GMT)
"In this report, we introduce Qwen2.5-Math, which features several key technical highlights: (1) extensive use of synthesized mathematical data from Qwen2-Math during the pre-training phase, (2) iterative generation of fine-tuning data and reinforcement training guided by the reward model during the post-training and inference phase and (3) support for bilingual (English and Chinese) queries, along with chain-of-thought and tool-integrated reasoning capabilities." The effect of synthetic data and of this self-improvement-style loop is interesting.

Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution [82.4]
This paper introduces the Qwen2-VL series, an upgrade of the previous Qwen-VL models. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which allows images of varying resolutions to be processed into different numbers of visual tokens. It also integrates Multimodal Rotary Position Embedding (M-RoPE), which facilitates effective fusion of positional information across text, images, and video.
Reference translation of the paper (metadata) (Wed, 18 Sep 2024 17:59:32 GMT)
Given "Qwen2-VL series introduces naive dynamic resolution and multimodal rotary position embedding (M-RoPE) to fuse information across modals effectively and be capable of understanding videos over 20 minutes in length." and "Furthermore, Qwen2-VL now supports understanding multilingual texts within images, including most European languages, Japanese, Korean, Arabic, Vietnamese, and others.", this is a powerful multimodal model with both video and Japanese support.

GRIN: GRadient-INformed MoE [132.9]
Mixture-of-Experts (MoE) models scale more effectively than dense models thanks to sparse computation through expert routing. Introduces GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing. The model has only 6.6B activated parameters, yet it outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data.
Reference translation of the paper (metadata) (Wed, 18 Sep 2024 17:00:20 GMT)
An improvement to MoE characterized by "We propose SparseMixer-v2 to estimate the gradient related to expert routing, while the conventional MoE training treats expert gating as a proxy for the gradient estimation." and "We scale MoE training with neither expert parallelism nor token dropping, while the conventional MoE training employs expert parallelism and deploys token dropping."
I recall reading a report that experts in MoE setups end up less specialized than one might expect, so the statement "Our study seems to verify our hypothesis that expert networks in GRIN MoE have developed highly-specialized and heterogeneous expertise." is interesting.
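For readers unfamiliar with expert routing, the sketch below shows conventional top-k gating, the baseline that the quoted passage contrasts GRIN with. It is not SparseMixer-v2 and not GRIN's code: the function names, the choice of k=2, and the four toy experts are all assumptions made for illustration. The discrete top-k selection is the non-differentiable step whose gradient GRIN sets out to estimate.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k=2):
    """Return (expert_index, gate_weight) pairs for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    gates = softmax([router_logits[i] for i in top])
    return list(zip(top, gates))


def moe_forward(x, router_logits, experts, k=2):
    """Sparse computation: only the selected experts run; outputs are mixed by gate weight."""
    return sum(gate * experts[i](x) for i, gate in route_top_k(router_logits, k))


# Four toy "experts", each a scalar function that multiplies its input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
routing = route_top_k([0.1, 2.0, -1.0, 2.0], k=2)
output = moe_forward(1.0, [0.1, 2.0, -1.0, 2.0], experts, k=2)
```

With k experts active out of many, the activated parameter count stays far below the total, which is the sense in which a model can have "6.6B activated parameters".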

Pixtral 12B [56.8]
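Introduces Pixtral-12B, a 12-billion-parameter multimodal language model. Pixtral-12B is trained to understand both natural images and documents. Unlike many open-source models, Pixtral is also a state-of-the-art text model for its size.
Reference translation of the paper (metadata) (Wed, 09 Oct 2024 17:16:22 GMT)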
Announcing Pixtral 12B | Mistral AI | Frontier AI in your hands
GitHub – mistralai/mistral-evals

A Comprehensive Survey on Evidential Deep Learning and Its Applications

A Comprehensive Survey on Evidential Deep Learning and Its Applications [64.8]
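Evidential Deep Learning (EDL) provides reliable uncertainty estimation with minimal additional computation in a single forward pass. The survey first delves into the theoretical foundation of EDL, subjective logic theory, and discusses how EDL differs from other uncertainty estimation frameworks. It then details the wide range of applications across machine learning paradigms and downstream tasks.
Reference translation of the paper (metadata) (Sat, 07 Sep 2024 05:55:06 GMT)
A survey of Evidential Deep Learning (EDL), which incorporates uncertainty estimation.
The paper repository is also public: GitHub – MengyuanChen21/Awesome-Evidential-Deep-Learning: A curated publication list on evidential deep learning.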
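As a rough illustration of how a single forward pass can yield an uncertainty estimate, the sketch below uses the subjective-logic mapping commonly used in EDL for classification: non-negative evidence per class becomes Dirichlet parameters, from which belief masses and an uncertainty mass follow. The function name `edl_opinion` is hypothetical, and the evidence values are made up; in practice they would be the non-negative outputs of a network.

```python
def edl_opinion(evidence):
    """Map per-class evidence (non-negative floats) to a subjective-logic opinion.

    alpha_k = e_k + 1, S = sum(alpha), belief_k = e_k / S, uncertainty = K / S.
    Also returns the expected class probabilities alpha_k / S.
    """
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)
    belief = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    probs = [a / strength for a in alpha]
    return belief, uncertainty, probs


# Strong evidence for class 0: low uncertainty.
belief, uncertainty, probs = edl_opinion([8.0, 1.0, 0.0])

# No evidence at all: all mass goes to uncertainty, probabilities are uniform.
_, no_evidence_uncertainty, uniform_probs = edl_opinion([0.0, 0.0, 0.0])
```

Belief masses and the uncertainty mass always sum to one, so total evidence directly controls how much mass is left "unassigned".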

Learning to Solve Combinatorial Optimization under Positive Linear Constraints via Non-Autoregressive Neural Networks

Learning to Solve Combinatorial Optimization under Positive Linear Constraints via Non-Autoregressive Neural Networks [103.8]
Combinatorial optimization (CO) is a fundamental problem in computer science, applied mathematics, and other fields. This paper designs a family of non-autoregressive neural networks for solving CO problems under positive linear constraints. The study validates the effectiveness of the framework on representative CO problems, including facility location, max-set covering, and the traveling salesman problem.
Reference translation of the paper (metadata) (Fri, 06 Sep 2024 14:58:31 GMT)
An application of neural networks to combinatorial optimization. "For TSP, our LinSAT-augmented non-autoregressive network performed on par with other state-of-the-art neural methods; for facility location and max-set covering, our method achieved comparable performance to commercial solvers like Gurobi and even outperformed them on certain problem instances." is impressive.
The repository is GitHub – Thinklab-SJTU/NAR-CO-Solver: Official implementation non-autoregressive combinatorial optimizaiton solvers, covering our ICLR 2023 paper and SCIENTIA SINICA Informationis paper

Configurable Foundation Models: Building LLMs from a Modular Perspective

Configurable Foundation Models: Building LLMs from a Modular Perspective [115.6]
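There is a growing tendency to decompose LLMs into numerous functional modules, which allows inference with only part of the modules and dynamic assembly of modules to tackle complex tasks. The paper coins the term "brick" for each functional module and defines the modularized structure as configurable foundation models. It presents four brick-oriented operations: retrieval and routing, merging, updating, and growing. FFN layers are found to follow modular patterns, with functional specialization of neurons and functional neuron partitions.
Reference translation of the paper (metadata) (Wed, 4 Sep 2024 17:01:02 GMT)
Research on, and a survey of, Configurable Foundation Models, that is, reconfigurable and modularized foundation models.
I can see the usefulness, but I regard this as a hard problem. Results such as model merging suggest there is potential, yet my impression is that, for now, even identifying regions by function is not easy.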

MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct

MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct [148.4]
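We propose MMEvol, a novel framework for curating image-text instruction data. MMEvol combines fine-grained perception evolution, cognitive reasoning evolution, and interaction evolution. The proposed approach achieves an average accuracy improvement of 3.1 points and reaches state-of-the-art (SOTA) performance on 9 of 13 vision-language tasks.
Reference translation of the paper (metadata) (Mon, 9 Sep 2024 17:44:00 GMT)
"a novel multimodal instruction data evolution framework that combines fine-grained perception evolution, cognitive reasoning evolution, and interaction evolution." The multimodal aspect is the distinguishing feature. On the effect, the paper says "The data evolved through three rounds of evolution is used to train a new model, demonstrating state-of-the-art (SOTA) performance across a comprehensive set of benchmarks."
It is interesting that the approach has been shown to work in multimodal contexts, beyond text and math problems, and also that the future work mentions integration with image generation models.
The project site is MMEvol: Welcome (rainbowluocs.github.io)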
