On the Theoretical Limitations of Embedding-Based Retrieval

On the Theoretical Limitations of Embedding-Based Retrieval [15.8]
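We show that the number of top-k subsets of documents that can be returned as the result of some query is limited by the dimension of the embedding. We then create a realistic dataset called LIMIT and stress-test models against these theoretical results. Our work shows the limits of embedding models under the existing single-vector paradigm.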
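As a toy illustration of the dimension limit described above (not the paper's experiment), the sketch below uses one-dimensional embeddings with dot-product scoring. With a scalar query, a positive query ranks documents by value and a negative query ranks them in reverse, so only two of the possible top-2 sets can ever be returned. The document values, the query grid, and the function name `reachable_top2_subsets_1d` are all hypothetical choices made for this example.

```python
import math


def reachable_top2_subsets_1d(doc_values, query_values):
    """Collect the distinct top-2 document sets that 1-D dot-product scoring returns."""
    reachable = set()
    for q in query_values:
        if q == 0:
            continue  # every score is zero, so the ranking is undefined
        # Rank document indices by score q * d, highest first.
        ranked = sorted(
            range(len(doc_values)),
            key=lambda i: q * doc_values[i],
            reverse=True,
        )
        reachable.add(frozenset(ranked[:2]))
    return reachable


# Hypothetical 1-D document embeddings (distinct values) and a grid of 1-D queries.
docs = [-1.5, -0.2, 0.7, 2.0]
queries = [x / 10 for x in range(-50, 51)]

reachable = reachable_top2_subsets_1d(docs, queries)
total = math.comb(len(docs), 2)  # number of possible top-2 sets over 4 documents
print(f"{len(reachable)} of {total} top-2 sets are reachable with d=1")
```

Raising the dimension makes more subsets reachable, but the claim summarized above is that for any fixed dimension the reachable count stays bounded as the number of documents grows.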
Reference translation of the paper (metadata) (Thu, 28 Aug 2025 17:43:53 GMT)
This paper demonstrates the limits of embedding-based information retrieval. It reports: "the critical-n values (for embedding size): 500k (512), 1.7m (768), 4m (1024), 107m (3072), 250m (4096). We note that this is the best case: a real embedding model cannot directly optimize the query and document vectors to match the test qrel matrix (and is constrained by factors such as 'modeling natural language'). However, these numbers already show that for web-scale search, even the largest embedding dimensions with ideal test-set optimization are not enough to model all combinations." Here critical-n is defined as "the critical-n value where the dimensionality is too small to successfully represent all the top-2 combinations." The constraint is surprisingly tight.
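To get a feel for the scale of "all the top-2 combinations", the following sketch counts the number of distinct document pairs, C(n, 2), at each critical-n value quoted above. The figures are the rounded values from the quote (for example, "500k" is taken as 500,000), and the helper name `top_k_subset_count` is a hypothetical one introduced for this example. It does not reproduce how the paper derives critical-n.

```python
import math

# Critical-n values as quoted above, keyed by embedding dimension.
# Assumption: the rounded figures ("500k", "1.7m", ...) are used as exact integers.
critical_n = {
    512: 500_000,
    768: 1_700_000,
    1024: 4_000_000,
    3072: 107_000_000,
    4096: 250_000_000,
}


def top_k_subset_count(n_docs, k=2):
    """Number of distinct size-k document subsets among n_docs documents."""
    return math.comb(n_docs, k)


for dim, n_docs in critical_n.items():
    pairs = top_k_subset_count(n_docs)
    print(f"d={dim}: n={n_docs:,} documents, {pairs:.3e} possible top-2 pairs")
```

Even at the smallest quoted dimension the number of pairs is on the order of 10^11, which shows how quickly the combination count grows relative to the embedding size.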
The repository is GitHub – google-deepmind/limit: On the Theoretical Limitations of Embedding-Based Retrieval
