
Optimizing Dataset Creation: A General Purpose Data Filtering System for Training Large Language Models
  • Sigo Jin,
  • Yanbing Wang,
  • Shan Liu,
  • Yue Zhang,
  • Wei Gu
Corresponding Author: Sigo Jin (sigojin940@chainds.com)

Abstract

The performance of neural models for language generation and comprehension depends heavily on the quality of the training data. High-quality datasets enable models to generalize better, while noisy, redundant, or irrelevant data can significantly hinder learning, leading to inefficiencies and reduced accuracy. This paper introduces a novel automated data filtering system that systematically refines large-scale datasets using objective metrics, such as perplexity and entropy, retaining only the most relevant and diverse data for training. The system delivers substantial gains in computational efficiency, reducing training time while improving model performance on key metrics such as perplexity and BLEU score. By removing low-value data and preserving linguistic diversity, the filtering approach underscores the importance of data curation in building more robust and scalable models. Empirical results show that models trained on filtered datasets consistently outperform those trained on unfiltered data, offering insights into more efficient dataset preparation for language model development.
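To make the filtering criteria concrete, the sketch below illustrates how perplexity- and entropy-based thresholds might be combined to select training samples. It is a minimal illustration, not the paper's actual implementation: the helper `score_perplexity`, and the threshold values `max_ppl` and `min_entropy`, are assumptions. In practice the perplexity score would come from a pretrained reference language model, and the thresholds would be tuned to the target corpus.

```python
import math
from collections import Counter
from typing import Callable, Iterable, List

def token_entropy(tokens: List[str]) -> float:
    """Shannon entropy (bits) of the token distribution in one sample.
    Low entropy signals repetitive or degenerate text."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def filter_dataset(
    samples: Iterable[str],
    score_perplexity: Callable[[str], float],  # assumed: a pretrained LM scorer
    max_ppl: float = 80.0,      # illustrative threshold, tuned per corpus
    min_entropy: float = 3.0,   # illustrative threshold, tuned per corpus
) -> List[str]:
    """Keep samples that a reference model finds predictable enough
    (low perplexity) yet lexically diverse (sufficient token entropy)."""
    kept = []
    for text in samples:
        tokens = text.split()
        if not tokens:
            continue  # drop empty samples outright
        if score_perplexity(text) <= max_ppl and token_entropy(tokens) >= min_entropy:
            kept.append(text)
    return kept
```

The two thresholds trade off strictness against the size of the retained dataset: tightening `max_ppl` removes more noisy or out-of-domain text, while raising `min_entropy` removes more repetitive, low-diversity samples.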