RedStone: Curating General, Code, Math, and QA Data for Large Language Models

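This study explores the untapped potential of Common Crawl as a comprehensive and flexible resource for pre-training large language models. We introduce RedStone, an innovative and scalable pipeline designed to extract and process data from Common Crawl.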
Reference translation of the paper (metadata) (Wed, 04 Dec 2024 15:27:39 GMT)
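A report on, and implementation of, dataset construction from Common Crawl, which has become an important data source for large-scale pre-training such as building LLMs. The filtering stages remove a large share of the data. According to the authors, "Our general domain dataset, REDSTONE-Web, outperforms existing open-source datasets in common sense reasoning benchmarks, while the inclusion of REDSTONE-Code and REDSTONE-Math significantly improves model performance in code generation and mathematical problem solving."
The repository is said to be at https://github.com/microsoft/redstone, but at the time of writing it returns a 404.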