Understanding the Influence of Synthetic Data for Text Embedders / So let’s replace this phrase with insult… Lessons learned from generation of toxic texts with LLMs

Understanding the Influence of Synthetic Data for Text Embedders [52.0]
We first reproduce and publicly release the synthetic data proposed by Wang et al. We then critically examine how synthetic data improves model generalization. This work highlights the limitations of current synthetic data approaches for building general-purpose embedders.
Reference translation of the paper (metadata) (Sun, 07 Sep 2025 19:28:52 GMT)
A paper that examines the effect of synthetic data from the perspective of embedding models. The authors report: "we find that training on synthetic examples designed for a particular task can degrade the performance of other tasks, challenging the notion that training on more diverse synthetic data is strictly better. Moreover, we observe that synthetic data leads to sparse improvement across tasks, showing no statistically significant improvement on a majority of MTEB tasks."
The repository is GitHub – jakespringer/open-synthetic-embeddings

<think> So let’s replace this phrase with insult… </think> Lessons learned from generation of toxic texts with LLMs [60.2]
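This paper investigates whether synthetic toxic data can replace human-generated data for training detoxification models. In the experiments, models fine-tuned on synthetic data consistently perform worse than models trained on human data. The root cause is identified as a critical lexical diversity gap: LLMs generate toxic content using a small, repetitive vocabulary of insults that fails to capture the nuance and variety of human toxicity.
Reference translation of the paper (metadata) (Wed, 10 Sep 2025 07:48:24 GMT)
This paper also addresses synthetic data, stating that "Models trained on fully synthetic data significantly underperform those trained on human-annotated data." Reports on model collapse have likewise found that relying on synthetic data alone leads to poor results, so I think this conclusion is probably right.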
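To make the "lexical diversity gap" idea concrete, here is a minimal sketch of one common way to quantify it, the type-token ratio (distinct tokens divided by total tokens). This is an illustration only and not necessarily the metric used in the paper; the function name `type_token_ratio`, the whitespace tokenization, and the placeholder tokens such as `insult_a` are all assumptions made for this example.

```python
def type_token_ratio(texts):
    """Distinct tokens divided by total tokens over a list of strings.

    Assumption: naive lowercase whitespace tokenization, chosen only to
    keep the sketch dependency-free.
    """
    tokens = [tok for text in texts for tok in text.lower().split()]
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical toy corpora with placeholder tokens instead of real insults.
repetitive = [
    "you are insult_a",
    "you are insult_a",
    "you are insult_b",
]
varied = [
    "you are insult_a",
    "what a insult_c move",
    "such insult_d behaviour",
]

ttr_repetitive = type_token_ratio(repetitive)
ttr_varied = type_token_ratio(varied)
print(f"repetitive: {ttr_repetitive:.2f}, varied: {ttr_varied:.2f}")
```

A lower ratio means the corpus reuses the same words more often. The ratio depends on corpus size, so a comparison like this is only meaningful between samples of similar length.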
