{"id":7448,"date":"2025-09-16T05:26:00","date_gmt":"2025-09-15T20:26:00","guid":{"rendered":"https:\/\/devneko.jp\/wordpress\/?p=7448"},"modified":"2025-09-14T10:31:42","modified_gmt":"2025-09-14T01:31:42","slug":"understanding-the-influence-of-synthetic-data-for-text-embedders-so-lets-replace-this-phrase-with-insult-lessons-learned-from-generation-of-toxic-texts-with-llms","status":"publish","type":"post","link":"https:\/\/devneko.jp\/wordpress\/?p=7448","title":{"rendered":"Understanding the Influence of Synthetic Data for Text Embedders\u00a0 \/ So let&#8217;s replace this phrase with insult\u2026 Lessons learned from generation of toxic texts with LLMs\u00a0"},"content":{"rendered":"\n<ul class=\"wp-block-list\">\n<li><strong>Understanding the Influence of Synthetic Data for Text Embedders\u00a0<\/strong>[52.0]<br>First, we reproduce and publicly release the synthetic data proposed by Wang et al. We then critically examine how synthetic data improves model generalization. This work highlights the limitations of current synthetic data approaches for building general-purpose embedders.<br><a href=\"http:\/\/arxiv.org\/abs\/2509.06184v1\">Paper<\/a>\u00a0\u00a0<a href=\"https:\/\/fugumt.com\/fugumt\/paper_check\/2509.06184v1\">Reference translation (metadata)<\/a>\u00a0 \u00a0(Sun, 07 Sep 2025 19:28:52 GMT)<\/li>\n\n\n\n<li>This paper examines the effect of synthetic data from the perspective of embedding models. It reports: \u201cwe find that training on synthetic examples designed for a particular task can 
degrade the performance of other tasks, challenging the notion that training on more diverse synthetic data is strictly better. Moreover, we observe that synthetic data leads to sparse improvement across tasks, showing no statistically significant improvement on a majority of MTEB tasks.\u201d<\/li>\n\n\n\n<li>Repository: <a href=\"https:\/\/github.com\/jakespringer\/open-synthetic-embeddings\">GitHub &#8211; jakespringer\/open-synthetic-embeddings<\/a><\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>&lt;think> So let&#8217;s replace this phrase with insult&#8230; &lt;\/think> Lessons learned from generation of toxic texts with LLMs\u00a0<\/strong>[60.2]<br>This paper examines the potential of training detoxification models on synthetic toxic data as an alternative to human-generated data. Experiments show that models fine-tuned on synthetic data consistently perform worse than models trained on human data. The root cause is identified as a critical gap in lexical diversity: LLMs generate toxic content using a small, repetitive vocabulary of insults that fails to capture the nuance and variety of human toxicity.<br><a href=\"http:\/\/arxiv.org\/abs\/2509.08358v1\">Paper<\/a>\u00a0\u00a0<a 
href=\"https:\/\/fugumt.com\/fugumt\/paper_check\/2509.08358v1\">Reference translation (metadata)<\/a>\u00a0 \u00a0(Wed, 10 Sep 2025 07:48:24 GMT)<\/li>\n\n\n\n<li>This paper also addresses synthetic data, stating that \u201cModels trained on fully synthetic data significantly underperform those trained on human-annotated data.\u201d Reports on model collapse have likewise found that relying on synthetic data alone leads to poor results, so I suspect this conclusion holds.<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2],"tags":[390],"class_list":["post-7448","post","type-post","status-publish","format-standard","hentry","category-arxiv","tag-synthetic-data"],"_links":{"self":[{"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/7448","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=7448"}],"version-history":[{"count":1,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/7448\/revisions"}],"predecessor-version":[{"id":7449,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/7448\/revisions\/7449"}],"wp:attachment":[{"href":"https:\/\/devneko.jp\/w
ordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=7448"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=7448"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=7448"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}