The paper proposes a large-scale benchmark designed to be robust against leakage and similar problems: "1) We build a large-scale, high-quality, non-public instances repository, named EESE-Pool, which contains over 100,000 science instances. This pool is constructed under strict principles of Range, Reach, and Rigor. 2) We periodically sample a dynamic subset of 500 instances, called EESE, for actual evaluation. This subset is carefully curated to maintain Range, Reach, and Rigor, while mitigating leakage risk and reducing evaluation inefficiency through regular updates."
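
The periodic sampling in step 2 might look roughly like the sketch below. It is only an illustration: the quote does not say how the 500 instances are chosen or curated, so uniform sampling without replacement, the function name `sample_eval_subset`, the `round_index` and `base_seed` parameters, and the integer stand-in IDs are all assumptions, not the authors' procedure.

```python
import random


def sample_eval_subset(pool_ids, round_index, subset_size=500, base_seed=0):
    """Draw the evaluation subset for one refresh round.

    Assumption: uniform sampling without replacement. The quoted text only
    says the subset is sampled periodically and curated; it does not give
    the actual selection rule.
    """
    if subset_size > len(pool_ids):
        raise ValueError("subset_size exceeds pool size")
    # A round-specific seed makes each round reproducible, while different
    # rounds draw different subsets from the private pool.
    rng = random.Random(f"{base_seed}-{round_index}")
    return rng.sample(pool_ids, subset_size)


# Stand-in IDs for a pool of 100,000 instances (hypothetical data).
pool_ids = list(range(100_000))
round_0 = sample_eval_subset(pool_ids, round_index=0)
round_1 = sample_eval_subset(pool_ids, round_index=1)
```

Because the pool stays private and only a small, rotating subset is ever used for evaluation, exposing one round's subset does not reveal the rest of the pool. A real implementation would presumably also stratify by discipline and difficulty to preserve Range, Reach, and Rigor, which this sketch leaves out.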