arXiv – Page 4 – Introducing the Latest arXiv Papers

AI Competitions and Benchmarks: Dataset Development

AI Competitions and Benchmarks: Dataset Development [42.2]
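This chapter gives an overview of established methodological tools, enriched by the authors' practical experience. It lays out the tasks involved in dataset development and offers insights into managing them effectively. It then details the implementation process, including data collection, transformation, and quality evaluation.
Reference translation of the paper (metadata) (Mon, 15 Apr 2024 12:01:42 GMT)
A practical guide to building datasets.
Few papers take this perspective, so it is a very useful reference.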

TinyChart

TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning [83.6]
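This paper proposes TinyChart, an efficient MLLM for chart understanding with only 3B parameters. TinyChart addresses two challenges in efficient chart understanding: (1) it reduces the burden of learning numerical computation through a Program-of-Thoughts (PoT) learning strategy, and (2) it shortens the long visual feature sequences that the vision transformer produces for high-resolution images through a Vision Token Merging module.
Reference translation of the paper (metadata) (Thu, 25 Apr 2024 14:23:24 GMT)
An MLLM for chart understanding, and a small one at 3B. For training, the authors use a "Program-of-Thoughts learning method that trains the model to generate Python programs to answer questions".
The repository is mPLUG-DocOwl/TinyChart at main · X-PLUG/mPLUG-DocOwl · GitHub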
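As a rough illustration of the Program-of-Thoughts idea quoted above, the sketch below shows the inference-side pattern only: rather than producing a number directly, the model emits a short Python program and the host executes it. The chart values, the question, and the `generated_program` string are hypothetical stand-ins for model output and are not taken from TinyChart.

```python
# Hypothetical chart data that a chart-understanding model would read off an image.
chart = {"2021": 120.0, "2022": 150.0, "2023": 180.0}

# Hypothetical model output for the question "What is the average yearly value?".
# With Program-of-Thoughts, the model writes the computation instead of doing it.
generated_program = """
values = list(chart.values())
answer = sum(values) / len(values)
"""


def run_program(program: str, chart_data: dict) -> float:
    """Execute a generated program in its own namespace and return `answer`."""
    namespace = {"chart": chart_data}
    # NOTE: running model-generated code with exec needs sandboxing in real use.
    exec(program, namespace)
    return namespace["answer"]


result = run_program(generated_program, chart)
print(result)
```

Delegating the arithmetic to the interpreter is what lets a small model avoid learning numerical computation itself, which is the burden the abstract mentions.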

Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases

Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases [102.1]
CODA-LM is a new vision-language benchmark for self-driving. It provides the first automatic and quantitative evaluation of LVLMs for interpretable autonomous driving.
Reference translation of the paper (metadata) (Tue, 16 Apr 2024 14:20:55 GMT)
An evaluation benchmark for Large Vision-Language Models in autonomous driving. The authors report that "even the closed-sourced commercial LVLMs like GPT-4V cannot deal with road corner cases well, suggesting that we are still far from a strong LVLM-powered intelligent driving agent".
The repository is CODA-LM: Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases (coda-dataset.github.io)

Cooperate or Collapse: Emergence of Sustainability Behaviors in a Society of LLM Agents

Cooperate or Collapse: Emergence of Sustainability Behaviors in a Society of LLM Agents [101.2]
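This paper introduces GovSim, a simulation platform for studying strategic interaction and cooperative decision-making in large language models (LLMs). The authors explore the dynamics of resource sharing among AI agents and highlight the importance of ethical considerations, strategic planning, and negotiation skills. In GovSim, only 2 of the 15 LLMs tested achieved a sustainable outcome, which points to a significant gap in the models' ability to manage shared resources.
Reference translation of the paper (metadata) (Thu, 25 Apr 2024 15:59:16 GMT)
A proposed simulation environment in which LLM-based agents can plan strategically, negotiate, and cooperate. Several LLMs are tested in a scenario that asks how many tons of fish should be caught each month. Given "GPT-4 successfully maintains the shared resource over the long term, achieving nearly the maximum possible reward, while Claude-3 Opus fails to maintain the resource, with some runs collapsing before reaching 12 months." and "only GPT-4 and Claude-3 Opus, across all models tested, are able to do universalized hypothesis", GPT-4 comes out strong.
The repository is GitHub – giorgiopiatti/GovSim: Governance of the Commons Simulation (GovSim)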
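To make the fishing scenario concrete, here is a minimal sketch of a shared-resource loop with fixed catch policies in place of LLM agents. Every number in it (a 100-ton capacity, stock doubling each month up to that cap, 5 agents, a collapse threshold of 5 tons) is an assumption chosen for illustration and should not be read as GovSim's actual configuration.

```python
def simulate(catch_per_agent: float, agents: int = 5, months: int = 12,
             capacity: float = 100.0, collapse_below: float = 5.0):
    """Return (months survived, total catch) for a fixed per-agent catch."""
    stock = capacity
    total_catch = 0.0
    for month in range(1, months + 1):
        # Agents cannot take more fish than the lake currently holds.
        harvest = min(stock, catch_per_agent * agents)
        stock -= harvest
        total_catch += harvest
        if stock < collapse_below:
            return month, total_catch  # resource collapsed this month
        # Assumed regrowth rule: the remaining stock doubles, capped at capacity.
        stock = min(capacity, stock * 2)
    return months, total_catch


for catch in (10, 12, 20):
    print(catch, simulate(catch))
```

Under these assumptions, a slightly greedier policy yields less total catch than a restrained one, which is the tension the agents have to negotiate.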

Phi-3, Snowflake Arctic, SenseNova 5.0, OpenELM, Qwen-1.5 110B

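Last week again brought a lot of LLM-related news.

Phi-3 is a small (?) LLM from Microsoft; at 3.8B parameters it is relatively small, but Microsoft claims high performance.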

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone [144.9]
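phi-3-mini is a 3.8-billion-parameter language model trained on 3.3 trillion tokens. It scores 69% on MMLU and 8.38 on MT-bench.
Reference translation of the paper (metadata) (Mon, 22 Apr 2024 14:32:33 GMT)
The repository is Phi-3 – a microsoft Collection (huggingface.co)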

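Snowflake Arctic, announced by Snowflake, has 480B total parameters but uses an MoE architecture in which only 17B parameters are active at inference time. It is an interesting design that claims performance on par with Llama3 70B, and it is also great that it ships under Apache-2, a genuinely open-source license.
Snowflake Arctic – LLM for Enterprise AI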
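The gap between total and active parameters comes from expert routing: each token is sent to only a few experts. The sketch below is a generic top-k routing toy with hypothetical sizes and scores; it is not Arctic's implementation, and the only figures taken from the text above are the 480B total and 17B active parameter counts.

```python
import math


def top_k_route(gate_logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    order = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    top = order[:k]
    exps = [math.exp(gate_logits[i]) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


def moe_forward(x, experts, gate_logits, k=2):
    """Weighted sum over the outputs of only the selected experts."""
    return sum(w * experts[i](x) for i, w in top_k_route(gate_logits, k))


# Toy experts: each one just scales its input (hypothetical stand-ins for MLPs).
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
gate_logits = [0.1, 2.0, 0.3, 2.0]  # hypothetical router scores for one token

print(top_k_route(gate_logits))
print(moe_forward(1.0, experts, gate_logits))

# Fraction of parameters active per token, from the figures quoted above.
print(f"{17 / 480:.1%}")
```

Only the selected experts run for a given token, so compute per token tracks the active parameter count rather than the total.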

SenseNova is an LLM from SenseTime that claims to surpass GPT-4 Turbo (though not the latest model). It is a closed model, but it shows how intense the performance race has become.
SenseTime launches SenseNova 5.0 with comprehensive updates and the industry-leading “Cloud-to-Edge” full-stack large model product matrix-Newsroom-SenseTime

It is also interesting that Apple has released an LLM.

OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework [26.7]
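We release OpenELM, a state-of-the-art open language model. With a parameter budget of about one billion parameters, OpenELM improves accuracy by 2.36% compared to OLMo.
Reference translation of the paper (metadata) (Mon, 22 Apr 2024 23:12:03 GMT)
The repository is apple/OpenELM · Hugging Face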

In addition, Qwen-1.5 110B, the large model in the Qwen 1.5 family (Qwen/Qwen1.5-110B · Hugging Face), was released and Nyonic announced Wonton 7B, so the LLM scene is very lively.

Posts on X (formerly Twitter) related to https://huggingface.co/datasets/HuggingFaceFW/fineweb were also widely discussed. Thomas Wolf on X: "This take on the FineWeb release is one of the most interesting feedback and also a reason FineWeb is very different from even larger datasets like RedPajama-V2 (which is double its size!) Surprisingly, the size of the dataset of 15T tokens is not very important, what is much…" / X (twitter.com) The follow-up, "Before I dive more in this let me give you an example of unintuitive behavior. Between 2022 and 2023 the “LLM quality” of Common Crawl dropped significantly as in “training a LLM on the crawls btw 2022-2023 will give you lower performances on a set of evals”. What happened? Well it turns out the Common Crawl team has been filtering more strongly domains with adult content. Not really the cause you’d be intuitively thinking about, right?", is very interesting.

Nyonic Technical Report [20.8]
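The Wonton 7B model showed competitive performance on multilingual and English benchmarks. The model architecture is enhanced with state-of-the-art techniques such as Rotary Positional Embeddings, QK-LayerNorm, and a specially crafted multilingual tokenizer.
Reference translation of the paper (metadata) (Wed, 24 Apr 2024 07:38:44 GMT)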
GitHub – nyonicai/nyonic-public: Reference implementation of models from Nyonic Model Factory

InternVL 1.5

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites [95.7]
InternVL 1.5 is an open-source multimodal large language model (MLLM). It aims to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding.
Reference translation of the paper (metadata) (Thu, 25 Apr 2024 17:59:19 GMT)
The latest version of InternVL, built from InternViT-6B + InternLM2-20B. The reported result is excellent: "Compared to both open-source and proprietary models, InternVL 1.5 shows competitive performance, achieving state-of-the-art results in 8 of 18 benchmarks."
The repository is GitHub – OpenGVLab/InternVL: InternVL Family: A Pioneering Open-Source Alternative to GPT-4V. [CVPR 2024 Oral]

Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws

Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws [51.7]
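Scaling laws describe the relationship between the size of language models and their capabilities. The authors focus on factual knowledge represented as tuples, such as (USA, capital, Washington D.C.), taken from a Wikipedia page. A 7B model can store 14B bits of knowledge, which by the authors' estimate exceeds English Wikipedia and textbooks combined.
Reference translation of the paper (metadata) (Mon, 08 Apr 2024 11:11:31 GMT)
The paper states: "Through multiple controlled datasets, we establish that language models can and only can store 2 bits of knowledge per parameter, even when quantized to int8, and such knowledge can be flexibly extracted for downstream applications." Interesting.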
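The 14B-bit figure follows directly from the quoted ratio of 2 bits per parameter. The short calculation below reproduces it and converts the result to bytes; applying the same ratio to the 1B and 70B sizes is an extrapolation added here for illustration, not a claim from the paper.

```python
BITS_PER_PARAM = 2  # capacity ratio quoted from the paper


def knowledge_capacity_bits(num_params: int) -> int:
    """Estimated knowledge capacity in bits at 2 bits per parameter."""
    return BITS_PER_PARAM * num_params


# 7B is the case the paper mentions; 1B and 70B are illustrative extrapolations.
for name, params in [("1B", 1_000_000_000), ("7B", 7_000_000_000), ("70B", 70_000_000_000)]:
    bits = knowledge_capacity_bits(params)
    print(f"{name}: {bits / 1e9:.0f}B bits = {bits / 8 / 1e9:.2f} GB")
```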

AgentKit: Flow Engineering with Graphs, not Coding

AgentKit: Flow Engineering with Graphs, not Coding [91.1]
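The authors propose AgentKit, an intuitive LLM prompting framework for multifunctional agents. AgentKit provides a unified framework for explicitly constructing a complex "thought process" from simple natural language prompts.
Reference translation of the paper (metadata) (Wed, 17 Apr 2024 15:40:45 GMT)
A framework for developing LLM-based agents. Many tools let you use LLMs by connecting blocks, but this one is distinctive in that it leans toward agents and works at a layer close to code (whether it is easy to use is questionable, but this level of abstraction seems better suited to development).
The repository is Holmeswww/AgentKit: An intuitive LLM prompting framework for multifunctional agents, by explicitly constructing a complex “thought process” from simple natural language prompts. (github.com); the license is CC-BY.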

CrossIn: An Efficient Instruction Tuning Approach for Cross-Lingual Knowledge Alignment

CrossIn: An Efficient Instruction Tuning Approach for Cross-Lingual Knowledge Alignment [38.4]
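English-centric models are usually suboptimal in other languages. To address this, the study proposes CrossIn, a new approach that uses a mixed composition of cross-lingual instruction tuning data.
Reference translation of the paper (metadata) (Thu, 18 Apr 2024 06:20:50 GMT)
An instruction tuning approach for improving multilingual ability. It combines "CrossIn: It comprises cross-lingual instruction tuning datasets, where instruction and output are featured in two different languages" with "Trans: It consists of translation pairs for instructions." The rationale for the latter, "We hypothesize that if the model concurrently learns these translation tasks, it could facilitate the transfer of knowledge between languages.", is an interesting hypothesis. The authors also build evaluation data.
The effect of the proposed method is verified with Mistral and other models.
The repository is GitHub – Lingy12/CrossIn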
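The two data types quoted above can be sketched as follows. The record layout, the example texts, the prompt wording, and the helper names `make_crossin`, `make_trans`, and `build_mixture` are all hypothetical; this is one plausible way to assemble such a mixture, not the format used in the CrossIn repository.

```python
import random

# Hypothetical parallel instruction data: the same task in two languages.
parallel = [
    {
        "en": {"instruction": "Name the capital of Japan.", "output": "Tokyo."},
        "ja": {"instruction": "日本の首都を答えてください。", "output": "東京です。"},
    },
]


def make_crossin(item, src, tgt):
    """Cross-lingual sample: instruction in `src`, output in `tgt`."""
    return {"instruction": item[src]["instruction"], "output": item[tgt]["output"]}


def make_trans(item, src, tgt):
    """Translation sample: translate the instruction from `src` to `tgt`."""
    prompt = f"Translate from {src} to {tgt}: {item[src]['instruction']}"
    return {"instruction": prompt, "output": item[tgt]["instruction"]}


def build_mixture(items, langs=("en", "ja"), seed=0):
    """Emit one CrossIn-style and one Trans-style sample per parallel item."""
    rng = random.Random(seed)
    samples = []
    for item in items:
        src, tgt = rng.sample(langs, 2)  # random language direction per item
        samples.append(make_crossin(item, src, tgt))
        samples.append(make_trans(item, src, tgt))
    return samples


for sample in build_mixture(parallel):
    print(sample)
```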

Dynamic Typography: Bringing Text to Life via Video Diffusion Prior

Dynamic Typography: Bringing Text to Life via Video Diffusion Prior [73.7]
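The authors propose an automated text animation scheme called Dynamic Typography. It deforms letters to convey semantic meaning and infuses them with vibrant movements based on user prompts. The method uses vector graphics representations and an end-to-end optimization-based framework.
Reference translation of the paper (metadata) (Thu, 18 Apr 2024 06:06:29 GMT)
A proposed method for generating Dynamic Typography, with a very cool demo. Its approach of linking the Bezier-curve control points of the input letters with vector graphics (SVG) is also interesting.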
🪄 animate your word! (animate-your-word.github.io)
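For readers unfamiliar with why control points are a convenient handle for animation, the sketch below evaluates one cubic Bezier segment and moves its inner control points per frame. The sine-based `displace` function and the segment coordinates are hypothetical stand-ins; the paper obtains its displacements through optimization, which this toy does not attempt to reproduce.

```python
import math


def cubic_bezier(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve at t in [0, 1]."""
    u = 1.0 - t
    return tuple(
        u**3 * a + 3 * u**2 * t * b + 3 * u * t**2 * c + t**3 * d
        for a, b, c, d in zip(p0, p1, p2, p3)
    )


def displace(points, frame, amplitude=5.0, period=24):
    """Hypothetical per-frame offset: shift the inner control points vertically."""
    dy = amplitude * math.sin(2 * math.pi * frame / period)
    p0, p1, p2, p3 = points
    return [p0, (p1[0], p1[1] + dy), (p2[0], p2[1] + dy), p3]


# One outline segment of a glyph (hypothetical coordinates).
glyph_segment = [(0.0, 0.0), (10.0, 20.0), (30.0, 20.0), (40.0, 0.0)]

for frame in (0, 6, 12):
    pts = displace(glyph_segment, frame)
    print(frame, cubic_bezier(*pts, 0.5))
```

Because the curve is fully determined by its control points, moving a handful of points per frame deforms the whole outline smoothly while the result stays a valid vector path.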
