GPT-4V – arXiv最新論文の紹介

Design2Code

Design2Code: How Far Are We From Automating Front-End Engineering? [83.1]
マルチモーダルLLMがビジュアルデザインをコード実装に直接変換するタスクを Design2Code タスクとして形式化し,包括的なベンチマークを行う。具体的には、テストケースとして、484の多様な現実世界のWebページのベンチマークを手動でキュレートする。我々は,GPT-4V と Gemini Pro Vision 上で,マルチモーダルプロンプト手法のスイートを開発し,その有効性を示す。人的評価と自動測定の両方で、GPT-4Vは他のモデルと比較して、このタスクにおいて最善であることを示している。
論文参考訳（メタデータ） (Tue, 5 Mar 2024 17:56:27 GMT)
WEBページの画像からコードを作れるかを検証した論文。GPT-4Vが最も性能が高いが、十分ではなさそう。既存のオープンソースモデルの性能はかなり悪い。論文中ではCogAgent – arXiv最新論文の紹介 (devneko.jp)をfine tuningしたDesign2Code-18Bを開発、公開している。
MistralベースのHuggingFaceM4/VLM_WebSight_finetuned · Hugging Faceがまずまずのスコアを出しており「WebSight VLM-8B performs better than Gemini direct prompting (54% win rate and 35% lose rate), suggesting that finetuning on a large amount of data can match commercial models in specific domains.」とされているのも興味深い。
リポジトリはDesign2Code: How Far Are We From Automating Front-End Engineering (salt-nlp.github.io)

How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation

How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation [90.9]
GPT-4Vは最も先進的な多モード基盤モデルとして機能する。本研究は, GPT-4Vの動的環境における適応性と一般化能力について, 厳密に評価する。
論文参考訳（メタデータ） (Wed, 13 Dec 2023 13:00:57 GMT)
GPT-4Vの環境変化に対する能力を検証した論文、CLIPやLLaVAとも比較。「Our findings reveal that while GPT-4V demonstrates notable adaptability and zero-shot generalization capabilities, its performance varies significantly across different scenarios of distribution shifts.」「our journey toward creating truly robust and versatile AI foundation models is ongoing」との結論。
リポジトリはGitHub – jameszhou-gl/gpt-4v-distribution-shift: Code for “How Well Does GPT-4V(ision) Adapt to Distribution Shifts? A Preliminary Investigation”

Geminiの評価

Geminiの評価に関する論文が出ている。

An In-depth Look at Gemini’s Language Abilities [49.9]
OpenAI GPTとGoogle Geminiモデルの能力を比較する。この分析は、さまざまな言語能力をテストする10のデータセットに対して実施します。 Gemini Pro は GPT 3.5 Turbo よりも近いがわずかに劣る精度を実現している。
論文参考訳（メタデータ） (Mon, 18 Dec 2023 18:47:42 GMT)
Gemini Proに対する主として言語能力の評価。「we find that Gemini Pro achieves accuracy that is close but slightly inferior to the corresponding GPT 3.5 Turbo on all tasks that we benchmarked.」とのこと。Gemini ProはGPT-3.5と競合的、GPT-4と比べられていたのは主にGemini Ultraなので結果に違和感はない。
リポジトリはGitHub – neulab/gemini-benchmark

A Challenger to GPT-4V? Early Explorations of Gemini in Visual Expertise [78.5]
GeminiはGoogleの最新かつ最も有能なMLLMで、マルチモダリティのためにゼロから構築されています。 Geminiはマルチモーダル学習におけるGPT-4Vのリードポジションに挑戦できるか? Gemini Proと最先端のGPT-4Vを比較して、最新のオープンソースMLLMであるSphinxとともに、その上限を評価する。
論文参考訳（メタデータ） (Wed, 20 Dec 2023 12:40:47 GMT)
こちらはマルチモーダルでの評価。比較対象は上記と同じでGemini Proだであることに要注意。「The qualitative results indicate that Gemini is indeed a strong challenger to GPT-4V, given its superior multi-modal reasoning capacity.」と評価
リポジトリはGitHub – BradyFU/Awesome-Multimodal-Large-Language-Models: :sparkles::sparkles:Latest Papers and Datasets on Multimodal Large Language Models, and Their Evaluation.

Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks

Comparing Humans, GPT-4, and GPT-4V On Abstraction and Reasoning Tasks [53.9]
GPT-4のテキストのみおよびマルチモーダル版による推論能力の評価を行った。実験結果から,GPT-4のどちらのバージョンも人間に近いレベルで頑健な抽象化能力を開発していないという結論が得られた。
論文参考訳（メタデータ） (Mon, 11 Dec 2023 23:57:17 GMT)
GPT-4Vの抽象化能力の検証、GitHub – victorvikram/ConceptARC: Materials for ConceptARC paperを利用したもので非常に難しいデータセット

GPT-4V with Emotion: A Zero-shot Benchmark for Multimodal Emotion Understanding

GPT-4V with Emotion: A Zero-shot Benchmark for Multimodal Emotion Understanding [38.5]
GPT-4 with Vision (GPT-4V) は様々なマルチモーダルタスクにおいて顕著な性能を示した。本稿では,マルチモーダル感情理解におけるGPT-4Vの能力について定量的に評価する。
論文参考訳（メタデータ） (Thu, 7 Dec 2023 13:27:37 GMT)
GPT-4による感情分類、タスクやドメインによってはsupervisedな手法を超えている。頑健性についても検証が行われており「This resilience to color space changes suggests that GPT-4V is inherently robust in this regard.」とのこと。一方で「However, GPT-4V performs poorly in micro-expression recognition (see Table 5), which indicates that GPT-4V is currently tailored for general domains.」との指摘も。なかなか悩ましい結果ではあるが、一般用途では強力に使えそうに思える。
リポジトリはGitHub – zeroQiaoba/gpt4v-emotion: GPT-4V with Emotion

GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation

GPT-4V in Wonderland: Large Multimodal Models for Zero-Shot Smartphone GUI Navigation [167.6]
MM-Navigator(MM-Navigator)は、スマートフォンのGUIナビゲーションタスク用のGPT-4Vベースのエージェントである。 MM-Navigatorは、スマートフォンの画面と人間として対話し、指示を満たすためのその後の行動を決定することができる。
論文参考訳（メタデータ） (Mon, 13 Nov 2023 18:53:37 GMT)
スマホのナビゲーションを行うエージェント。GPT-4Vを使ってマルチモーダルに対応。FinetunedなLlama2、PaLM 2と比べても高い性能。
リポジトリはGitHub – zzxslp/MM-Navigator

On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving

On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving [26.6]
視覚言語モデル(VLM)の出現は、完全自律運転の実現における新たなフロンティアである。本報告では,最新のVLM,Modelnamefullの総合評価と自律走行シナリオへの応用について述べる。本研究により,既存の自律システムと比較して,シーン理解や因果推論において,モデルネームが優れた性能を示すことが明らかとなった。
論文参考訳（メタデータ） (Thu, 9 Nov 2023 12:58:37 GMT)
GPT-4Vの自動運転への適用可能性の検討。やはり高性能。
リポジトリはGitHub – PJLab-ADG/GPT4V-AD-Exploration: On the Road with GPT-4V(ision): Early Explorations of Visual-Language Model on Autonomous Driving

GPT-4Vによるビデオ分析

MM-VID: Advancing Video Understanding with GPT-4V(ision) [113.6]
我々は、GPT-4Vの能力を利用して高度な映像理解を促進する統合システムMM-VIDを提案する。 MM-VIDは、長いビデオや1時間以内のコンテンツの推論のような複雑なタスクによって生じる課題に対処するために設計されている。ビデオゲームやグラフィックユーザインタフェースといったインタラクティブな環境に適用する際の可能性を示す。
論文参考訳（メタデータ） (Mon, 30 Oct 2023 17:44:09 GMT)
GPT-4Vを用いたビデオ対応、そもそも極めて高性能なバックボーンではあるが、(i) Multimodal Pre-Processing,(ii) External Knowledge Collection,(iii) Clip-Level Video Description Generation, (iv) Script Generationと凝ったパイプライン構成になっている。
プロジェクトサイトはMM-Vid: Advancing Video Understanding with GPT-4V(ision) (multimodal-vid.github.io)

Set-of-Mark Prompting

Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V [103.7]
大規模マルチモーダルモデルの視覚的グラウンド化能力を解き放つために,新しい視覚的プロンプト手法であるSet-of-Mark(SoM)を提案する。我々は、SAMのような市販のインタラクティブセグメンテーションモデルを用いて、画像を異なるレベルの粒度の領域に分割し、これらの領域を一連のマークでオーバーレイする。マークされたイメージを入力として使用することで、GPT-4Vは視覚的な接地を必要とする質問に答えることができる。
論文参考訳（メタデータ） (Tue, 17 Oct 2023 17:51:31 GMT)
GPT-4Vに対するプロンプトテクニック、Set-of-Markの提案。速度勝負みたいなところもあるのだろうけど、論文出るの速すぎ・・・
「We show that simply overlaying a number of symbolic marks on a set of regions of an input image can unleash the visual grounding ability of GPT-4V.」とのこと。人間でも画像にガイドを入れるとタスクをやりやすくなるのでアイデアとしてはそうだろうと思うものの、広範な実験・検証はとても参考になる。
プロジェクトサイトはSoM-GPT4V

月	火	水	木	金	土	日
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31