Embedding – Introducing the Latest arXiv Papers

CodeXEmbed: A Generalist Embedding Model Family for Multiligual and Multi-task Code Retrieval

CodeXEmbed: A Generalist Embedding Model Family for Multiligual and Multi-task Code Retrieval [87.2]
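CodeXEmbed is a family of large-scale code embedding models ranging from 400M to 7B parameters. Our novel training pipeline unifies multiple programming languages and transforms various code-related tasks into a common retrieval framework. Our 7B model sets a new state-of-the-art (SOTA) in code retrieval, outperforming the previous leading model, Voyage-Code, by over 20% on the CoIR benchmark.
Reference translation of the paper (metadata) (Tue, 19 Nov 2024 16:54:45 GMT)
A proposal for a code embedding model, a task that matters for applications such as Code RAG but is difficult. The authors state: "Our 7B model sets a new state-of-the-art (SOTA) in code retrieval, outperforming the previous leading model, Voyage-Code, by over 20% on CoIR benchmark." The base models vary: the 2B model is built on gemma-2-2b-it, the 7B model on Mistral-7B-Instruct-v0.3, and so on.
The models do not appear to be publicly available at the moment, but the paper states: "By bridging the gap between text and code retrieval domains and releasing our models to the community, we aim to promote further research and innovation in developer tools and programming language understanding."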

LLM2Vec

LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders [34.4]
Large decoder-only language models (LLMs) are the state-of-the-art models on most of today's NLP tasks and benchmarks. LLM2Vec is a simple unsupervised approach that can transform any decoder-only LLM into a strong text encoder.
Reference translation of the paper (metadata) (Tue, 09 Apr 2024 02:51:05 GMT)
Embeddings using LLMs. The paper proposes a method for building an embedding model from any CausalLM and reports strong results. The approach is simple, yet it is hard to say exactly why it is so effective.
The following passage from the paper is very interesting: "Based on these findings (we replicate these results for other inputs and other Mistral models in Appendix F) and the strong unsupervised results for Mistral-7B with bidirectional attention, we speculate that Mistral models are pre-trained with some form bidirectional attention, e.g., prefix language modeling (Raffel et al., 2020) – at least for some parts of its training."
The repository is McGill-NLP/llm2vec: Code for ‘LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders’ (github.com)
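The difference between the causal attention of a decoder-only model and the bidirectional attention mentioned in the quote can be illustrated with attention masks. The following is a minimal sketch only: the function names are hypothetical, it is not taken from the LLM2Vec repository, and it shows nothing more than which positions each token is allowed to attend to.

```python
def causal_mask(n):
    """mask[i][j] is True when token i may attend to token j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every token may attend to every other token, as in an encoder."""
    return [[True] * n for _ in range(n)]


def visible_positions(mask, i):
    """Indices of the tokens that token i can see under the given mask."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


# Hypothetical three-token input, used only for illustration.
tokens = ["embeddings", "are", "useful"]
causal = causal_mask(len(tokens))
bidir = bidirectional_mask(len(tokens))

print(visible_positions(causal, 0))  # [0]
print(visible_positions(bidir, 0))   # [0, 1, 2]
```

Under the causal mask the first token never sees the rest of the sentence, which is one intuition for why enabling bidirectional attention could matter when the goal is a representation of the whole text.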

Is Cosine-Similarity of Embeddings Really About Similarity? [46.8]
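Cosine similarity is the cosine of the angle between two vectors, or equivalently the dot product between their normalizations. We study embeddings derived from regularized linear models, where closed-form solutions facilitate analytical insights. We derive analytically how cosine similarity can yield arbitrary and therefore meaningless similarities.
Reference translation of the paper (metadata) (Fri, 8 Mar 2024 16:48:20 GMT)
Cosine similarity is apparently not always the best choice, but I wonder how well this approach holds up.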
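The definition in the abstract (cosine similarity as the dot product of normalized vectors) can be written out directly. The sketch below also includes a toy illustration, chosen here as an assumption rather than taken from the paper, of one way the result can become arbitrary: rescaling one vector by `d` and the other by `1/d` per dimension leaves their dot product unchanged but changes their cosine similarity. The vectors and scale factors are made up.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [a / norm for a in v]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot product of their normalizations."""
    return dot(normalize(u), normalize(v))


# Hypothetical 2-d embeddings.
a = [1.0, 2.0]
b = [2.0, 1.0]

# Illustrative assumption: rescale a by d and b by 1/d in each dimension.
d = [10.0, 0.1]
a_scaled = [x * s for x, s in zip(a, d)]
b_scaled = [x / s for x, s in zip(b, d)]

print(dot(a, b), dot(a_scaled, b_scaled))  # dot products agree
print(cosine_similarity(a, b), cosine_similarity(a_scaled, b_scaled))  # cosines differ
```

If a model is trained only to get dot products right, both pairs of vectors are equally valid solutions, yet they give very different cosine similarities.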

Gecko: Versatile Text Embeddings Distilled from Large Language Models

Gecko: Versatile Text Embeddings Distilled from Large Language Models [32.1]
We present Gecko, a compact and versatile text embedding model. We leverage the key idea of distilling knowledge from large language models (LLMs) into a retriever. On the Massive Text Embedding Benchmark (MTEB), Gecko with 256 embedding dimensions outperforms all existing entries with 768 embedding size.
Reference translation of the paper (metadata) (Fri, 29 Mar 2024 17:56:40 GMT)
A compact yet powerful text embedding model. It outperforms text-embedding-ada-3. LLMs are put to use in that "Gecko is trained on an LLM-generated synthetic dataset FRet that contains LLM-ranked positives and negatives."
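The idea of "LLM-ranked positives and negatives" can be sketched as follows. This is a simplified illustration under stated assumptions, not the FRet construction itself: the function name, the sample passages, and the scores are hypothetical, and the selection rule (highest score becomes the positive, lowest score becomes the negative) is an assumption made for this example.

```python
def pick_positive_and_negative(candidates, llm_scores):
    """Pick training targets from candidate passages using LLM relevance scores.

    Assumed rule for this sketch: the highest-scored passage is the positive
    and the lowest-scored passage is the negative.
    """
    if len(candidates) != len(llm_scores):
        raise ValueError("candidates and llm_scores must have the same length")
    if len(candidates) < 2:
        raise ValueError("need at least two candidates")
    ranked = sorted(zip(candidates, llm_scores), key=lambda pair: pair[1], reverse=True)
    positive = ranked[0][0]
    negative = ranked[-1][0]
    return positive, negative


# Hypothetical passages and scores (in practice the scores would come from an LLM).
passages = ["passage about geckos", "passage about embeddings", "passage about cooking"]
scores = [0.2, 0.9, 0.05]

positive, negative = pick_positive_and_negative(passages, scores)
print(positive)
print(negative)
```

The resulting (query, positive, negative) triples are the kind of data a contrastive objective can consume when training a retriever.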

Understanding and Mitigating the Threat of Vec2Text to Dense Retrieval Systems

Understanding and Mitigating the Threat of Vec2Text to Dense Retrieval Systems [30.8]
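Vec2Text, a technique for inverting text embeddings, has raised serious privacy concerns within dense retrieval systems. In this paper, we investigate various aspects of embedding models that could influence the recoverability of text using Vec2Text. We then propose a simple embedding transformation fix that guarantees equal ranking effectiveness while mitigating the risk of text recoverability.
Reference translation of the paper (metadata) (Tue, 20 Feb 2024 07:49:30 GMT)
This addresses a question that occasionally comes up in practice, namely whether embedding vectors can be turned back into the original text, and proposes a method to prevent it. The paper states: "Methods like Vec2Text, which can successfully reconstruct the original text from an embedding, could pose serious privacy risks, especially now embeddings are made publicly available via APIs (e.g., OpenAI or Cohere)." The attack was also reproduced, so it does appear to be a real threat.
The repository is ielab/vec2text-dense_retriever-threat: Is Vec2Text Really a Threat to Dense Retrieval Systems? (github.com). The reproduction experiments were reportedly based on jxmorris12/vec2text: utilities for decoding deep representations (like sentence embeddings) back to text (github.com), and the weights have also been released at ielabgroup/vec2text_gtr-base-st_corrector · Hugging Face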
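To make "an embedding transformation that keeps ranking unchanged" concrete, here is a small sketch. It is not the transformation proposed in the paper. As a stand-in, it uses a secret permutation with sign flips, which is an orthogonal map and therefore preserves dot products and the resulting ranking. All names and vectors are hypothetical, and the sketch does not show whether such a transformation resists inversion in practice.

```python
import random


def make_secret_transform(dim, seed):
    """Build a secret permutation plus sign flips (an orthogonal transformation)."""
    rng = random.Random(seed)
    perm = list(range(dim))
    rng.shuffle(perm)
    signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
    return perm, signs


def apply_transform(vec, transform):
    """Reorder the dimensions of vec and flip their signs according to the secret."""
    perm, signs = transform
    return [signs[k] * vec[perm[k]] for k in range(len(vec))]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def rank(query, docs):
    """Document indices sorted by descending dot-product score."""
    scores = [dot(query, d) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


# Hypothetical 4-d query and document embeddings.
query = [1.0, 0.0, 2.0, -1.0]
docs = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

secret = make_secret_transform(dim=4, seed=42)
t_query = apply_transform(query, secret)
t_docs = [apply_transform(d, secret) for d in docs]

print(rank(query, docs))
print(rank(t_query, t_docs))  # same ordering as above
```

Because the same secret is applied to queries and documents, the retrieval scores are unchanged, while the stored vectors no longer live in the original embedding space.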

BGE Landmark Embedding

BGE Landmark Embedding: A Chunking-Free Embedding Method For Retrieval Augmented Long-Context Large Language Models [13.2]
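Large language models (LLMs) call for context extension in order to handle many important applications. Existing approaches tend to be costly and to yield inferior quality of context extension. Extensible embedding is an enhancement of the typical token embedding.
Reference translation of the paper (metadata) (Sun, 18 Feb 2024 12:41:01 GMT)
A proposal for a chunking-free embedding method. It appears to work at the sentence level: a marker placed at the end of each sentence serves as a landmark, and the embedding is computed so that it also covers the content up to that point.
The repository is FlagOpen/FlagEmbedding: Dense Retrieval and Retrieval-augmented LLMs (github.com)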
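The description above (a marker at the end of each sentence, with each embedding covering the content up to that marker) can be sketched at the text-processing level. This is only an illustration of the idea: the marker string `<LMK>`, the regex-based sentence splitter, and the function names are assumptions, and the actual model computes embeddings from hidden states rather than from strings.

```python
import re

LANDMARK = "<LMK>"  # hypothetical marker token, chosen for this sketch


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def add_landmarks(text):
    """Append the landmark marker after every sentence."""
    return " ".join(f"{s} {LANDMARK}" for s in split_sentences(text))


def landmark_contexts(text):
    """For each sentence, return all text up to and including its landmark.

    Each returned string stands for the context that the embedding at that
    landmark would cover, instead of an isolated chunk.
    """
    contexts, prefix = [], []
    for sentence in split_sentences(text):
        prefix.append(f"{sentence} {LANDMARK}")
        contexts.append(" ".join(prefix))
    return contexts


doc = "Alice bought a car. It was red. She sold it later."
print(add_landmarks(doc))
for ctx in landmark_contexts(doc):
    print(ctx)
```

With fixed-size chunks, "It was red." would lose its referent. Here the context for the second landmark still contains the sentence about the car.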

Multi-Lingual Text Embeddings

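Two reports on multilingual text embeddings have come out. The first is E5, known for its strong performance; the other is a model from BAAI that appears to outperform E5 on benchmarks. Both seem to be released under open licenses and look easy to use.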

Multilingual E5 Text Embeddings: A Technical Report [63.5]
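We provide three embedding models of different sizes, offering a balance between inference efficiency and embedding quality. We also introduce a new instruction-tuned embedding model, whose performance is on par with state-of-the-art English-only models of similar size.
Reference translation of the paper (metadata) (Thu, 8 Feb 2024 13:47:50 GMT)
The technical report for a method that is known for its strong performance and as an alternative to OpenAI's embedding models.
The repository is unilm/e5 at master · microsoft/unilm (github.com); models include intfloat/multilingual-e5-base · Hugging Face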

BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation [28.2]
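In this paper, we propose a new embedding model called M3-Embedding. It can support more than 100 working languages, leading to new state-of-the-art performance on multi-lingual and cross-lingual retrieval tasks. M3-Embedding can process inputs of different granularities, from short sentences to long documents of up to 8192 tokens.
Reference translation of the paper (metadata) (Mon, 5 Feb 2024 17:26:49 GMT)
An embedding model from BAAI. The authors claim higher performance than E5.
The repository is FlagOpen/FlagEmbedding: Dense Retrieval and Retrieval-augmented LLMs (github.com); the model is BAAI/bge-m3 · Hugging Face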

Text Embeddings by Weakly-Supervised Contrastive Pre-training

Text Embeddings by Weakly-Supervised Contrastive Pre-training [89.5]
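E5 is a family of state-of-the-art text embeddings that transfer well to a wide range of tasks. E5 can be readily used as a general-purpose embedding model for any task that requires a single-vector representation of text.
Reference translation of the paper (metadata) (Wed, 7 Dec 2022 09:25:54 GMT)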
microsoft/unilm: Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities (github.com)

Distributed Representations of Text

Neural Embeddings for Text [14.1]
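We propose a new kind of embedding for natural language text that deeply represents semantic meaning. In this method, we let a language model learn from the text and then literally pick its brain, taking the actual weights of the model's neurons to generate a vector. We compare neural embeddings with GPT sentence embeddings.
Reference translation of the paper (metadata) (Wed, 17 Aug 2022 16:26:13 GMT)
- A proposal for a new text embedding method. Its distinguishing feature appears to be that it processes the weights of multiple layers. It seems to capture aspects that ordinary methods miss, but whether the difference is large is debatable.
- The repository is primer-research/neural_embeddings at main · PrimerAI/primer-research (github.com)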
