Current machine learning (ML) approaches to omics-based biomarker discovery often yield panels that fail to reproduce in external validation datasets, and optimizing the size of the feature set remains an open problem; both issues jeopardize the translation of such panels into cost-effective clinical tools. The present study investigates how to optimize feature set size by testing six algorithms on eight large-scale transcriptomics datasets covering breast, lung, renal, and ovarian cancer. Most importantly, we propose a new evaluation metric, Cross Hypervolume (CHV), to assess the performance of multi-objective feature selection algorithms on both training and test datasets. CHV improves on other metrics because it accounts for the trade-off between classification accuracy and the number of selected features. The metric thereby allows a better assessment of biomarker models and helps to select the most accurate and biologically relevant ones.
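As background for the trade-off between accuracy and feature set size, the following is a minimal sketch of the standard two-objective hypervolume indicator. It is not the CHV metric itself, whose definition is not given in this abstract. The sketch assumes that both objectives are expressed as quantities to minimize (classification error and the fraction of features selected) and that the reference point is (1, 1). The function names `to_minimization` and `hypervolume_2d`, as well as the example values, are hypothetical and serve only as illustration.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, float]


def to_minimization(accuracy: float, n_selected: int, n_total: int) -> Point:
    """Map (accuracy, feature count) to (error, fraction of features selected)."""
    return (1.0 - accuracy, n_selected / n_total)


def hypervolume_2d(points: Iterable[Point], ref: Point = (1.0, 1.0)) -> float:
    """Area dominated by `points` and bounded by `ref` (both objectives minimized)."""
    # Keep only points that are strictly better than the reference in both objectives.
    pts: List[Point] = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    area = 0.0
    best_y = ref[1]
    # Sweep by increasing first objective; each non-dominated point adds one rectangle.
    for x, y in pts:
        if y < best_y:
            area += (ref[0] - x) * (best_y - y)
            best_y = y
    return area


# Illustrative (made-up) solutions: (accuracy, number of selected features) out of 100.
solutions = [(0.90, 50), (0.80, 10), (0.85, 60)]
front = [to_minimization(acc, k, 100) for acc, k in solutions]
print(round(hypervolume_2d(front), 2))  # approximately 0.77
```

A larger value indicates a set of solutions that is collectively more accurate and more compact. Computing the indicator separately on training and test performance of the same solutions is one simple way to see how well a trade-off generalizes, although the way CHV combines the two is defined by the authors and not reproduced here.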