Image Generation – Introduction to the Latest arXiv Papers

NeoBabel: A Multilingual Open Tower for Visual Generation

NeoBabel: A Multilingual Open Tower for Visual Generation [32.8]
We introduce NeoBabel, a new multilingual image generation framework. It supports six languages: English, Chinese, Dutch, French, Hindi, and Persian. It achieves state-of-the-art multilingual performance while maintaining strong English capability.
Reference translation of the paper (metadata) (Tue, 08 Jul 2025 16:19:45 GMT)
"This paper introduces NeoBabel, a novel multilingual image generation framework that represents the first scalable solution for direct text-to-image synthesis across six languages. Through meticulous curation of high-quality multilingual vision-language datasets and end-to-end training, NeoBabel establishes direct cross-lingual mappings between textual descriptions and visual outputs across all supported languages." In other words, a proposal for a multilingual image generation model that does not go through translation. Culture-specific words are hard to translate, so models like this are important.
Repository: NeoBabel: A Multilingual Open Tower for Visual Generation

Seedream 3.0 Technical Report

Seedream 3.0 Technical Report [62.9]
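Seedream 3.0 is a high-performance Chinese-English bilingual image generation foundation model. We develop several technical improvements to address existing challenges in Seedream 2.0. Seedream 3.0 provides native high-resolution output (up to 2K) and generates high-quality images.
Reference translation of the paper (metadata) (Wed, 16 Apr 2025 16:23:31 GMT)
A multilingual image generation model from ByteDance. The sample images show that it is a very strong model. It claims SoTA on Text to Image Model Arena | Artificial Analysis (though it may since have been overtaken by GPT-4o?).
Project site: Doubao Team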

PlanGen: Towards Unified Layout Planning and Image Generation in Auto-Regressive Vision Language Models

PlanGen: Towards Unified Layout Planning and Image Generation in Auto-Regressive Vision Language Models [10.3]
We propose PlanGen, a unified layout planning and image generation model that can plan spatial layout conditions before generating an image. PlanGen integrates layout conditions into the model as context, without requiring special encoding of local captions and bounding box coordinates. Moreover, thanks to its well-designed modeling, PlanGen extends seamlessly to layout-guided image manipulation.
Reference translation of the paper (metadata) (Thu, 13 Mar 2025 07:37:09 GMT)
A proposal for a model that can plan the layout before generating the image. It can take the layout as context: "PlanGen can complete layout planning and layout-to-image generation in a unified model. Just like thinking about what object each area should be before generating an image, such an explicit planning process allows the model to enjoy more powerful image generation capabilities."
Repository: PlanGen: Towards Unified Layout Planning and Image Generation in Auto-Regressive Vision Language Models
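To make "layout as context" a little more concrete, here is a minimal sketch of how a layout could be written out as ordinary text and placed in front of the image tokens. The line format, the `LayoutItem` class, the `layout_to_context` helper, and the 0 to 100 coordinate grid are all assumptions made for illustration; the actual format PlanGen uses is not described above.

```python
from dataclasses import dataclass


@dataclass
class LayoutItem:
    """One planned region: a local caption plus a normalized bounding box."""
    caption: str
    box: tuple[float, float, float, float]  # (x0, y0, x1, y1), each in [0, 1]


def layout_to_context(global_caption: str, items: list[LayoutItem], grid: int = 100) -> str:
    """Serialize a layout as plain text (hypothetical format, not PlanGen's).

    Coordinates are written as ordinary integers on a `grid`-sized scale, so no
    dedicated coordinate encoder is needed: the layout is just more text tokens.
    """
    lines = [f"caption: {global_caption}"]
    for item in items:
        if not all(0.0 <= v <= 1.0 for v in item.box):
            raise ValueError(f"box must be normalized to [0, 1]: {item.box}")
        coords = " ".join(str(round(v * grid)) for v in item.box)
        lines.append(f"{item.caption}: {coords}")
    return "\n".join(lines)


layout = [
    LayoutItem("cat", (0.1, 0.5, 0.5, 0.9)),
    LayoutItem("dog", (0.5, 0.25, 0.9, 0.9)),
]
context = layout_to_context("a cat next to a dog", layout)
print(context)
```

In a sketch like this, the same string could either be generated by the model (layout planning) or supplied by the user (layout-to-image), which is one way to read the "unified" claim.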

GPS as a Control Signal for Image Generation

GPS as a Control Signal for Image Generation [95.4]
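We show that the GPS tags contained in image metadata are a useful control signal for image generation. We train GPS-to-image models and use them for tasks that require a fine-grained understanding of how images vary within a city.
Reference translation of the paper (metadata) (Tue, 21 Jan 2025 18:59:46 GMT)
The authors state: "Our work suggests that GPS coordinates are a useful signal for controllable image generation." Intuitively this does seem effective, and there are probably many cases where GPS gives the model clear information as context.
Project site: GPS as a Control Signal for Image Generation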

JPEG-LM: LLMs as Image Generators with Canonical Codec Representations

JPEG-LM: LLMs as Image Generators with Canonical Codec Representations [51.1]
Discretization represents continuous data such as images and videos as discrete tokens. A common way to discretize images and videos is to model raw pixel values. In this work, we propose to directly model images and videos as compressed files saved on computers via canonical codecs (JPEG, AVC/H.264).
Reference translation of the paper (metadata) (Wed, 21 Aug 2024 00:24:53 GMT)
A proposal for an L(?)M that handles JPEG directly. The authors write: "For generality, our models also do not use any vision-specific modules like convolutions or 2D positional embeddings, potentially making the task more challenging." and "However, we observe that conventional, vanilla language modeling surprisingly conquers these challenges without special designs as training goes (e.g., JPEG-LM generates realistic images barely with any corrupted JPEG patches)." The architecture is a 7B Llama-2 model, which is really powerful.
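As a rough illustration of treating a compressed file as a token sequence, the sketch below maps each byte of a file to one token id and builds next-token prediction pairs. This is not JPEG-LM's tokenizer: the byte-level vocabulary, the `BOS`/`EOS` ids, and the helper names are assumptions, and the six-byte input is only a stand-in that starts and ends with the JPEG SOI/EOI markers, not a decodable image.

```python
# Assumed special token ids placed just above the 256 possible byte values.
BOS, EOS = 256, 257


def bytes_to_tokens(data: bytes) -> list[int]:
    """Map each byte of a compressed file to one token id, framed by BOS/EOS."""
    return [BOS, *data, EOS]


def tokens_to_bytes(tokens: list[int]) -> bytes:
    """Invert bytes_to_tokens by dropping the special tokens."""
    return bytes(t for t in tokens if t < 256)


def next_token_pairs(tokens: list[int]) -> list[tuple[list[int], int]]:
    """Build (context, target) pairs as used for next-token prediction."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Stand-in payload: SOI marker (FF D8), two filler bytes, EOI marker (FF D9).
fake_jpeg = bytes([0xFF, 0xD8, 0x01, 0x02, 0xFF, 0xD9])
tokens = bytes_to_tokens(fake_jpeg)
pairs = next_token_pairs(tokens)

# The mapping is lossless, so generated tokens can be written back to a file.
assert tokens_to_bytes(tokens) == fake_jpeg
print(tokens)
```

The point of the sketch is only that nothing vision-specific appears anywhere: the model sees a flat sequence of ids, exactly as it would for text.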

Imagen 3

Imagen 3 [130.7]
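In this paper, we introduce Imagen 3, a latent diffusion model that generates high-quality images from text prompts. We discuss issues around safety and representation, and the methods we used to minimize the potential harm of our models.
Reference translation of the paper (metadata) (Tue, 13 Aug 2024 16:15:50 GMT)
Imagen 3 has been announced. The strong performance ("Imagen 3 is preferred over other state-of-the-art (SOTA) models at the time of evaluation.") is what one would expect, and the "Responsible Development and Deployment" section is very interesting.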

Segment Anything Model 2, Gemma 2 2B, Black Forest Labs Flux

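Last week again brought a lot of news about generative (and not only generative) AI. Meta's SAM2 gives the impression of a major advance, a little over a year after the impact of SAM (Segment Anything – arXiv最新論文の紹介 (devneko.jp)). Gemma2 2B arrived as a small but strong model (Smaller, Safer, More Transparent: Advancing Responsible AI with Gemma – Google Developers Blog (googleblog.com)). The newly founded Black Forest Labs (Announcing Black Forest Labs – Black Forest Labs) announced FLUX.1, an image generation model that claims SOTA.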

It is very interesting that many of these models have been released publicly (for FLUX.1, only some variants).

SAM 2: Segment Anything in Images and Videos
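Segment Anything Model 2 (SAM 2) is a foundation model for promptable visual segmentation in images and videos. We build a data engine that improves the model and data through user interaction, and use it to collect the largest video segmentation dataset to date. In video segmentation, we observe better accuracy while using 3x fewer interactions than prior approaches.
SAM2 makes SAM-style segmentation possible for video.
Official site: Meta Segment Anything Model 2; repository: Meta Segment Anything Model 2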

The Gemma2 2B repository is google/gemma-2-2b · Hugging Face

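For FLUX.1, the top-performing Pro is available through the API. The next strongest, Dev, is available for non-commercial use at black-forest-labs/FLUX.1-dev · Hugging Face. The last, schnell, can be downloaded from black-forest-labs/FLUX.1-schnell · Hugging Face under the Apache 2 license.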

PartCraft

PartCraft: Crafting Creative Objects by Parts [128.3]
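This paper promotes creative control in generative visual AI by allowing users to "select". For the first time, we allow users to choose visual concepts part by part for their creative endeavors. The result is fine-grained generation that accurately captures the selected visual concepts.
Reference translation of the paper (metadata) (Fri, 5 Jul 2024 15:53:04 GMT)
Image generation of the type described as "Instead of text or sketch, we “select” desired parts to create an object." Being able to focus on parts and combine them suggests a wide range of uses.
Repository: GitHub – kamwoh/partcraft: PartCraft: Crafting Creative Objects by Parts (ECCV2024)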

Dynamic Typography: Bringing Text to Life via Video Diffusion Prior

Dynamic Typography: Bringing Text to Life via Video Diffusion Prior [73.7]
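We propose an automated text animation scheme called Dynamic Typography. It deforms letters to convey semantic meaning and infuses them with vibrant movement based on user prompts. The method uses a vector graphics representation and an end-to-end optimization-based framework.
Reference translation of the paper (metadata) (Thu, 18 Apr 2024 06:06:29 GMT)
A proposal for a Dynamic Typography generation method with a very cool demo. The approach, which ties the control points of the input letters' Bézier curves to a vector graphics (SVG) representation, is also interesting.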
🪄 animate your word! (animate-your-word.github.io)
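For readers unfamiliar with the vector side of this, a letter outline in SVG is a chain of Bézier segments, so moving the control points a little in each frame deforms the glyph smoothly. The sketch below evaluates one cubic Bézier segment and applies per-frame offsets to its control points. The sine-wave `wobble_offsets` is a hand-written stand-in chosen for illustration; the paper obtains its motion through optimization with a video diffusion prior, which is not reproduced here, and all function names are hypothetical.

```python
import math

Point = tuple[float, float]


def cubic_bezier(p0: Point, p1: Point, p2: Point, p3: Point, t: float) -> Point:
    """Evaluate a cubic Bezier curve at parameter t in [0, 1] (Bernstein form)."""
    u = 1.0 - t
    b0, b1, b2, b3 = u**3, 3 * u**2 * t, 3 * u * t**2, t**3
    x = b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0]
    y = b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1]
    return (x, y)


def displace(points: list[Point], offsets: list[Point]) -> list[Point]:
    """Shift each control point by its per-frame offset."""
    return [(px + dx, py + dy) for (px, py), (dx, dy) in zip(points, offsets)]


def wobble_offsets(n: int, frame: int, num_frames: int, amplitude: float = 0.1) -> list[Point]:
    """Hypothetical vertical wobble; a stand-in for learned displacements."""
    phase = 2.0 * math.pi * frame / num_frames
    return [(0.0, amplitude * math.sin(phase + i)) for i in range(n)]


# One segment of a glyph outline, as four control points.
segment: list[Point] = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
midpoint = cubic_bezier(*segment, 0.5)

# Deform the same segment for each of 8 frames and sample its midpoint.
frames = [displace(segment, wobble_offsets(len(segment), f, 8)) for f in range(8)]
animated_midpoints = [cubic_bezier(*pts, 0.5) for pts in frames]
print(midpoint, animated_midpoints[0])
```

Because only control points move, every frame remains a clean vector shape that can be written back out as an SVG path.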

UniHuman

UniHuman: A Unified Model for Editing Human Images in the Wild [52.4]
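We propose UniHuman, a unified model that addresses multiple aspects of human image editing in real-world settings. To improve the model's generation quality and generalization ability, we use guidance from human visual encoders. In user studies, UniHuman is preferred by users in 77% of cases on average.
Reference translation of the paper (metadata) (Fri, 22 Dec 2023 05:00:30 GMT)
A proposal for a model for editing human images. Adobe is involved, and the scale of the data ("we curated 400K high-quality image-text pairs for training and collected 2K human image pairs for out-of-domain testing.") is impressive.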
