Introducing the Latest arXiv Papers

Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases

Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases [102.1]
CODA-LM is a new vision-language benchmark for self-driving. It provides the first automated and quantitative evaluation of LVLMs for interpretable autonomous driving.
Reference translation of the paper (metadata) (Tue, 16 Apr 2024 14:20:55 GMT)
An evaluation benchmark for Large Vision-Language Models in autonomous driving. The authors report that "even the closed-sourced commercial LVLMs like GPT-4V cannot deal with road corner cases well, suggesting that we are still far from a strong LVLM-powered intelligent driving agent".
The repository is CODA-LM: Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases (coda-dataset.github.io)

Cooperate or Collapse: Emergence of Sustainability Behaviors in a Society of LLM Agents

Cooperate or Collapse: Emergence of Sustainability Behaviors in a Society of LLM Agents [101.2]
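This paper introduces GovSim, a simulation platform for studying strategic interaction and cooperative decision-making in large language models (LLMs). We explore the dynamics of resource sharing among AI agents and highlight the importance of ethical considerations, strategic planning, and negotiation skills. In GovSim, only 2 of the 15 LLMs tested achieved a sustainable outcome, suggesting a significant gap in the models' ability to manage shared resources.
Reference translation of the paper (metadata) (Thu, 25 Apr 2024 15:59:16 GMT)
A proposal for a simulation environment in which LLM-based agents can carry out strategic planning, negotiation, and cooperation. Several LLMs are tested on a scenario that asks how many tons of fish should be caught each month. "GPT-4 successfully maintains the shared resource over the long term, achieving nearly the maximum possible reward, while Claude-3 Opus fails to maintain the resource, with some runs collapsing before reaching 12 months." "only GPT-4 and Claude-3 Opus, across all models tested, are able to do universalized hypothesis". GPT-4 comes out strong.
The repository is GitHub – giorgiopiatti/GovSim: Governance of the Commons Simulation (GovSim)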
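
The fishing scenario can be pictured with a toy common-pool resource loop. This is only a rough sketch and not GovSim's actual implementation: the function name `simulate_fishery`, the fixed per-agent catch, the doubling regrowth rule, the 100-ton capacity, and the collapse threshold are all assumptions chosen for illustration, and no LLM is involved.

```python
def simulate_fishery(catch_per_agent, n_agents=5, capacity=100.0,
                     collapse_below=5.0, months=12):
    """Toy common-pool resource loop (every rule here is an illustrative assumption).

    Returns (collapsed, month, stock): whether the stock collapsed, the month
    in which the run stopped, and the stock at that point.
    """
    stock = capacity
    for month in range(1, months + 1):
        # Every agent takes the same fixed catch; the harvest cannot exceed the stock.
        total_catch = min(stock, catch_per_agent * n_agents)
        stock -= total_catch
        if stock < collapse_below:
            return True, month, stock
        # Assumed regrowth rule: the remaining stock doubles, capped at capacity.
        stock = min(capacity, stock * 2)
    return False, months, stock


if __name__ == "__main__":
    print(simulate_fishery(10))  # restrained agents
    print(simulate_fishery(15))  # greedy agents
```

Under these assumed rules, a total catch of half the stock is exactly replaced each month, while a slightly larger catch empties the lake in the second month. That is the kind of cliff an agent has to reason about when deciding how much to take.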

Phi-3, Snowflake Arctic, SenseNova 5.0, OpenELM, Qwen-1.5 110B

Last week again brought a lot of LLM-related news.

Phi-3 is a small (?) LLM from Microsoft. At 3.8B parameters it is relatively small, but the authors claim high performance.

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone [144.9]
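phi-3-mini is a 3.8 billion parameter language model trained on 3.3 trillion tokens. It scores 69% on MMLU and 8.38 on MT-bench.
Reference translation of the paper (metadata) (Mon, 22 Apr 2024 14:32:33 GMT)
The repository is Phi-3 – a microsoft Collection (huggingface.co)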

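Snowflake Arctic, announced by Snowflake, has 480B total parameters but uses an MoE architecture in which only 17B parameters are active at inference time. It is an interesting design that claims performance on par with Llama3 70B, and it is also great that it ships under Apache-2, a truly open-source license.
Snowflake Arctic – LLM for Enterprise AI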
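
To put the two parameter counts side by side, here is a small back-of-the-envelope calculation. It uses only the 480B and 17B figures quoted above; the helper name `active_fraction` is made up for this sketch, and the result says nothing about actual latency or memory use.

```python
def active_fraction(active_params, total_params):
    """Share of a sparse MoE model's parameters that is active at inference."""
    return active_params / total_params


arctic_share = active_fraction(17e9, 480e9)
print(f"Active share: {arctic_share:.1%}")  # Active share: 3.5%
```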

SenseNova is an LLM from SenseTime that claims to surpass GPT 4 turbo (though not the latest model). It is a closed model, but it shows how intense the performance race has become.
SenseTime launches SenseNova 5.0 with comprehensive updates and the industry-leading “Cloud-to-Edge” full-stack large model product matrix-Newsroom-SenseTime

It is also interesting that Apple has released an LLM.

OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework [26.7]
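We release OpenELM, a state-of-the-art open language model. With a parameter budget of approximately one billion parameters, OpenELM improves accuracy by 2.36% compared to OLMo.
Reference translation of the paper (metadata) (Mon, 22 Apr 2024 23:12:03 GMT)
The repository is apple/OpenELM · Hugging Face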

In addition, Qwen1.5-110B, the large model in the Qwen 1.5 family (Qwen/Qwen1.5-110B · Hugging Face), was released and Nyonic Wonton 7B was announced. The LLM scene is extremely lively.

A post on X (formerly Twitter) related to https://huggingface.co/datasets/HuggingFaceFW/fineweb was also widely discussed. Thomas Wolf on X: "This take on the FineWeb release is one of the most interesting feedback and also a reason FineWeb is very different from even larger datasets like RedPajama-V2 (which is double its size!) Surprisingly, the size of the dataset of 15T tokens is not very important, what is much…" / X (twitter.com). The following passage is very interesting: "Before I dive more in this let me give you an example of unintuitive behavior. Between 2022 and 2023 the “LLM quality” of Common Crawl dropped significantly as in “training a LLM on the crawls btw 2022-2023 will give you lower performances on a set of evals”. What happened? Well it turns out the Common Crawl team has been filtering more strongly domains with adult content. Not really the cause you’d be intuitively thinking about, right?"

Nyonic Technical Report [20.8]
The Wonton 7B model showed competitive performance on multilingual and English benchmarks. Its architecture is enhanced with state-of-the-art techniques such as Rotary Positional Embeddings, QK-LayerNorm, and a specially crafted multilingual tokenizer.
Reference translation of the paper (metadata) (Wed, 24 Apr 2024 07:38:44 GMT)
GitHub – nyonicai/nyonic-public: Reference implementation of models from Nyonic Model Factory

InternVL 1.5

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites [95.7]
InternVL 1.5 is an open-source multimodal large language model (MLLM). It bridges the capability gap between open-source and proprietary commercial models in multimodal understanding.
Reference translation of the paper (metadata) (Thu, 25 Apr 2024 17:59:19 GMT)
The latest version of InternVL, built from InternViT-6B + InternLM2-20B. "Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks." An excellent showing.
The repository is GitHub – OpenGVLab/InternVL: InternVL Family: A Pioneering Open-Source Alternative to GPT-4V. [CVPR 2024 Oral]

Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws

Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws [51.7]
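Scaling laws describe the relationship between the size of language models and their capabilities. We focus on factual knowledge represented as tuples, such as (USA, capital, Washington D.C.) from a Wikipedia page. A 7B model can store 14B bits of knowledge, surpassing English Wikipedia and textbooks combined.
Reference translation of the paper (metadata) (Mon, 08 Apr 2024 11:11:31 GMT)
The authors state: "Through multiple controlled datasets, we establish that language models can and only can store 2 bits of knowledge per parameter, even when quantized to int8, and such knowledge can be flexibly extracted for downstream applications." Interesting.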
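
The 14B-bit figure follows directly from the quoted 2 bits per parameter. The arithmetic is spelled out below as a minimal sketch. The function name `knowledge_capacity_bits` is invented for illustration, and the conversion to bytes is just a unit change, not a claim from the paper.

```python
BITS_PER_PARAM = 2  # capacity quoted in the paper's abstract


def knowledge_capacity_bits(n_params, bits_per_param=BITS_PER_PARAM):
    """Estimated knowledge capacity in bits for a model with n_params parameters."""
    return n_params * bits_per_param


bits = knowledge_capacity_bits(7_000_000_000)
print(bits)            # 14000000000
print(bits / 8 / 1e9)  # 1.75 (gigabytes, as a plain unit conversion)
```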

AgentKit: Flow Engineering with Graphs, not Coding

AgentKit: Flow Engineering with Graphs, not Coding [91.1]
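We propose an intuitive LLM prompting framework (AgentKit) for multifunctional agents. AgentKit offers a unified framework for explicitly constructing a complex "thought process" from simple natural language prompts.
Reference translation of the paper (metadata) (Wed, 17 Apr 2024 15:40:45 GMT)
A framework for developing LLM-based agents. Many tools let you use LLMs by wiring blocks together, but this one is distinctive in that it leans toward agents and works at a layer close to code (whether it is easy to use is open to question, but this level of abstraction seems better suited to development).
The repository is Holmeswww/AgentKit: An intuitive LLM prompting framework for multifunctional agents, by explicitly constructing a complex “thought process” from simple natural language prompts. (github.com), and the license is CC-BY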

CrossIn: An Efficient Instruction Tuning Approach for Cross-Lingual Knowledge Alignment

CrossIn: An Efficient Instruction Tuning Approach for Cross-Lingual Knowledge Alignment [38.4]
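English-centric models are usually suboptimal in other languages. This work therefore proposes a new approach called CrossIn, which uses a mixed composition of cross-lingual instruction tuning data.
Reference translation of the paper (metadata) (Thu, 18 Apr 2024 06:20:50 GMT)
An instruction tuning approach for improving multilingual ability. It combines "CrossIn: It comprises cross-lingual instruction tuning datasets, where instruction and output are featured in two different languages" with "Trans: It consists of translation pairs for instructions." The hypothesis behind the latter, "We hypothesize that if the model concurrently learns these translation tasks, it could facilitate the transfer of knowledge between languages.", is an interesting one. The authors also build evaluation data.
The effect of the proposed method is verified using Mistral and other models.
The repository is GitHub – Lingy12/CrossIn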
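
As a rough picture of the two data types quoted above, the sketch below builds one sample of each from a small parallel example. Everything concrete in it is an assumption for illustration: the record fields, the translation prompt wording, and the helper names `make_crossin_sample` and `make_trans_sample` do not come from the paper or its repository.

```python
import random


def make_crossin_sample(parallel, rng):
    """CrossIn-style sample: instruction in one language, output in another."""
    src, tgt = rng.sample(sorted(parallel), 2)
    return {
        "instruction": parallel[src]["instruction"],
        "output": parallel[tgt]["output"],
        "instruction_lang": src,
        "output_lang": tgt,
    }


def make_trans_sample(parallel, rng):
    """Trans-style sample: translate an instruction between two languages."""
    src, tgt = rng.sample(sorted(parallel), 2)
    return {
        # Hypothetical prompt template, chosen only for this sketch.
        "instruction": f"Translate from {src} to {tgt}: {parallel[src]['instruction']}",
        "output": parallel[tgt]["instruction"],
        "instruction_lang": src,
        "output_lang": tgt,
    }


# One example available in two languages (toy data).
parallel = {
    "en": {"instruction": "Name the capital of Japan.", "output": "Tokyo."},
    "ja": {"instruction": "日本の首都を答えてください。", "output": "東京です。"},
}

rng = random.Random(0)
mixed = [make_crossin_sample(parallel, rng), make_trans_sample(parallel, rng)]
for sample in mixed:
    print(sample)
```

Mixing both kinds of records into one tuning set is the idea the commentary describes. The mixing ratio is left out here because the text above does not state one.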

Dynamic Typography: Bringing Text to Life via Video Diffusion Prior

Dynamic Typography: Bringing Text to Life via Video Diffusion Prior [73.7]
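We propose an automated text animation scheme called Dynamic Typography. It deforms letters to convey semantic meaning and infuses them with vibrant movements based on user prompts. The method uses a framework based on vector graphics representations and end-to-end optimization.
Reference translation of the paper (metadata) (Thu, 18 Apr 2024 06:06:29 GMT)
A proposal for a Dynamic Typography generation method with very cool demos. The approach ties the control points of the input letters' Bézier curves to vector graphics (SVG), which is also interesting.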
🪄 animate your word! (animate-your-word.github.io)
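
To show why Bézier control points are a convenient handle for deforming a letter outline, the sketch below evaluates one cubic Bézier segment and moves a single control point. It illustrates only the vector representation. The displacement is hand-picked, whereas the paper obtains its motion through optimization, and the names `cubic_bezier` and `displace` are made up for this example.

```python
def cubic_bezier(p0, p1, p2, p3, t):
    """Point on a cubic Bezier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    weights = (u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3)
    points = (p0, p1, p2, p3)
    x = sum(w * p[0] for w, p in zip(weights, points))
    y = sum(w * p[1] for w, p in zip(weights, points))
    return (x, y)


def displace(point, offset):
    """Shift a control point by a 2D offset."""
    return (point[0] + offset[0], point[1] + offset[1])


# One stroke of a glyph outline, described by four control points (toy values).
stroke = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]

# Moving a single control point bends the whole segment smoothly.
deformed = list(stroke)
deformed[1] = displace(stroke[1], (0.0, 0.5))

print(cubic_bezier(*stroke, 0.5))
print(cubic_bezier(*deformed, 0.5))
```

Because the endpoints stay fixed while the interior bends, adjacent segments of an outline remain connected as the control points move from frame to frame.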

JetMoE: Reaching Llama2 Performance with 0.1M Dollars

JetMoE: Reaching Llama2 Performance with 0.1M Dollars [25.3]
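This report introduces JetMoE-8B, a new large language model. Despite its low cost, JetMoE-8B outperforms the Llama2-7B model, and JetMoE-8B-Chat surpasses the Llama2-13B-Chat model. The report details all training parameters and data mixtures to facilitate future efforts in developing open foundation models.
Reference translation of the paper (metadata) (Thu, 11 Apr 2024 00:52:39 GMT)
A proposal for a recipe to build an LLM cheaply (although "cheap" here still means "$0.1 million, using 1.25T tokens from carefully mixed open-source corpora and 30,000 H100 GPU hours.").
The repository is myshell-ai/JetMoE: Reaching LLaMA2 Performance with 0.1M Dollars (github.com)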
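
The quoted budget, token count, and GPU hours imply some simple unit costs. The calculation below is only back-of-the-envelope arithmetic on those three numbers. The function name `training_unit_costs` is invented here, and the derived rates are averages, not prices reported by the authors.

```python
def training_unit_costs(budget_usd, gpu_hours, tokens):
    """Average unit costs implied by a total budget, GPU hours, and token count."""
    return {
        "usd_per_gpu_hour": budget_usd / gpu_hours,
        "tokens_per_gpu_hour": tokens / gpu_hours,
        "tokens_per_usd": tokens / budget_usd,
    }


costs = training_unit_costs(budget_usd=0.1e6, gpu_hours=30_000, tokens=1.25e12)
for name, value in costs.items():
    print(f"{name}: {value:,.2f}")
```

This works out to roughly $3.33 per H100 hour and about 12.5 million training tokens per dollar.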

Many-Shot In-Context Learning

Many-Shot In-Context Learning [57.6]
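Large language models (LLMs) excel at in-context learning (ICL). We observe significant performance gains across a wide variety of generative and discriminative tasks. We find that Reinforced and Unsupervised ICL can be quite effective in the many-shot regime.
Reference translation of the paper (metadata) (Wed, 17 Apr 2024 02:49:26 GMT)
An analysis of the effect of many-shot prompting (500 shots, for example), which models such as Gemini 1.5 have made possible. Performance improves in many cases, but "On some tasks (e.g., code verifier, planning), we did observe slight performance deterioration beyond a certain number of shots." The paper also examines Reinforced ICL and Unsupervised ICL, two forms of ICL that do not rely on humans, and reports: "We found that, for problem-solving domains where human-generated rationales are expensive to obtain, Reinforced and Unsupervised ICL can obtain strong performance when compared to ICL with human data."
A paper that showcases the benefits of long contexts. I am curious how this would play out with SSMs.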
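
Mechanically, many-shot ICL just means packing far more solved examples into the prompt before the actual query. The sketch below assembles such a prompt. The "Q:/A:" layout, the toy arithmetic examples, and the function name `build_many_shot_prompt` are assumptions for illustration and are not the paper's format.

```python
def build_many_shot_prompt(examples, query, n_shots):
    """Concatenate n_shots solved examples, then the unanswered query."""
    blocks = [f"Q: {question}\nA: {answer}" for question, answer in examples[:n_shots]]
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)


# Toy pool of solved examples; a real setup would draw these from a task dataset.
examples = [(f"What is {i} + {i}?", str(2 * i)) for i in range(500)]

prompt = build_many_shot_prompt(examples, "What is 7 + 8?", n_shots=500)
print(len(prompt.split("\n\n")))  # 501 blocks: 500 shots plus the query
```

Varying `n_shots` on the same pool is the kind of sweep that would reveal the plateau or slight deterioration mentioned in the quote above.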
