OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens

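OLMoTrace traces language model outputs back to their full, multi-trillion-token training data in real time. It finds and shows verbatim matches between segments of language model output and documents in the training text corpora.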
Reference translation of the paper abstract (metadata) (Wed, 09 Apr 2025 17:59:35 GMT)
The paper proposes a system that "OLMOTRACE finds and shows verbatim matches between segments of language model output and documents in the training text corpora." and releases an open-source implementation. The Limitations section cautions that "The retrieved documents should not be interpreted as having a causal effect on the LM output, or as supporting evidence or citations for the LM output.", and the approach requires access to the LLM's training data, but a wide range of applications still seems possible.
The repository is GitHub – allenai/infinigram-api
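
To make the idea of "verbatim matches between segments of language model output and documents" concrete, here is a minimal toy sketch in Python. It is not the OLMoTrace implementation and does not use the infinigram-api: it scans a tiny in-memory corpus by brute force, which would not scale to trillions of tokens. The function names, the whitespace tokenization, and the `min_len` threshold are all assumptions made for illustration.

```python
def contains(doc, seq):
    """Return True if the token list `seq` occurs contiguously in `doc`."""
    n = len(seq)
    return any(doc[k:k + n] == seq for k in range(len(doc) - n + 1))


def find_verbatim_spans(output_tokens, corpus_docs, min_len=3):
    """Find maximal spans of `output_tokens` that appear verbatim in the corpus.

    Returns a list of (start, end, doc_ids) with `end` exclusive.
    Hypothetical helper for illustration only; assumes min_len >= 1.
    """
    spans = []
    prev_end = 0  # furthest end reached by any earlier start position
    for start in range(len(output_tokens)):
        end = start
        matching = list(range(len(corpus_docs)))
        # Extend the span one token at a time while some document still contains it.
        while end < len(output_tokens):
            candidate = output_tokens[start:end + 1]
            still = [d for d in matching if contains(corpus_docs[d], candidate)]
            if not still:
                break
            matching = still
            end += 1
        # Keep the span only if it is long enough and not nested in an earlier one.
        if end - start >= min_len and end > prev_end:
            spans.append((start, end, matching))
        prev_end = max(prev_end, end)
    return spans


corpus = [
    "the cat sat on the mat".split(),
    "a dog sat on the rug today".split(),
]
output = "my cat sat on the rug".split()
spans = find_verbatim_spans(output, corpus)
for start, end, doc_ids in spans:
    print(" ".join(output[start:end]), "->", doc_ids)
```

In line with the limitation quoted above, a match found this way only shows that the same text exists in a document. It says nothing about whether that document caused the output or supports it.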
