Benchmark Evaluations, Applications, and Challenges of Large Vision Language Models: A Survey

Benchmark Evaluations, Applications, and Challenges of Large Vision Language Models: A Survey [6.7]
VLM(Multimodal Vision Language Models)は、コンピュータビジョンと自然言語処理の交差点において、トランスフォーメーション技術として登場した。 VLMは、視覚的およびテキスト的データに対して強力な推論と理解能力を示し、ゼロショット分類において古典的な単一モダリティ視覚モデルを上回る。
論文参考訳（メタデータ） (Sat, 04 Jan 2025 04:59:33 GMT)
「we provide a systematic overview of VLMs in the following aspects: [1] model information of the major VLMs developed over the past five years (2019-2024); [2] the main architectures and training methods of these VLMs; [3] summary and categorization of the popular benchmarks and evaluation metrics of VLMs; [4] the applications of VLMs including embodied agents, robotics, and video generation; [5] the challenges and issues faced by current VLMs such as hallucination, fairness, and safety.」とVLMのサーベイ。
リポジトリはGitHub – zli12321/VLM-surveys: A most Frontend Collection and survey of vision-language model papers, and models GitHub repository

コメントを残す

コメントを残す コメントをキャンセル