Early skin cancer detection is essential for improving treatment success and reducing mortality rates. This experimental, exploratory study examines the performance of various deep Convolutional Neural Network (CNN) architectures for dermatoscopic image-based skin cancer classification. Eight popular CNN models (ResNet18, ResNet50, ResNet101, DenseNet121, EfficientNet-B0, MobileNetV3-Large, ConvNeXt-Tiny, and Xception) were tested on a large-scale dataset of 401,062 images of approximately 130×130 pixels. Experiments were conducted with various combinations of hyperparameters, such as batch size, learning rate, and number of epochs, to measure model stability and the effects on accuracy, sensitivity, precision, and computational efficiency. One crucial observation is that some models, such as ResNet and Xception, exhibit symptoms of overfitting after a certain number of epochs (e.g., >40): training accuracy climbs to 99.98% while validation accuracy declines or stagnates. In contrast, models such as ConvNeXt-Tiny and DenseNet121 remain stable up to the 100th epoch, with validation accuracy approaching 99.87% yet an F1-score of only 34%, a gap attributable to severe data imbalance among the Indeterminate, Benign, and Malignant classes. The analysis also covers GPU memory usage and training time: ResNet101 and ConvNeXt-Tiny require high resources (over 300 MB of total memory and more than 700 seconds per epoch), whereas lightweight models such as MobileNetV3-Large and EfficientNet-B0 are more efficient (<150 MB of memory and <350 seconds per epoch) with competitive classification performance. DenseNet121 recorded the highest F1-score (34.92%) with efficient memory consumption and training time, while ResNet101 and ConvNeXt-Tiny demand high computational resources without corresponding gains in the metrics, and MobileNetV3-Large and EfficientNet-B0 excel in training time and GPU memory efficiency.
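The coexistence of near-perfect validation accuracy with a ~34% F1-score is a classic symptom of class imbalance: a classifier biased toward the majority class can score high accuracy while failing the minority classes. The following minimal sketch illustrates this effect with hypothetical class counts (not the paper's actual data); the class names follow the study, and macro F1 is computed from scratch for transparency.

```python
# Illustrative only: why high accuracy can coexist with a low macro F1
# under class imbalance. Class counts below are hypothetical.

def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

labels = ["Indeterminate", "Benign", "Malignant"]
# Hypothetical imbalanced test set: 9,700 Benign, 200 Malignant,
# 100 Indeterminate samples.
y_true = ["Benign"] * 9700 + ["Malignant"] * 200 + ["Indeterminate"] * 100
# A classifier that almost always predicts the majority class: it catches
# only half of the Malignant cases and none of the Indeterminate ones.
y_pred = (["Benign"] * 9700 + ["Benign"] * 100 + ["Malignant"] * 100
          + ["Benign"] * 100)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
mf1 = macro_f1(y_true, y_pred, labels)
# accuracy comes out at 0.98, while macro F1 stays near 0.55: the
# majority class dominates accuracy but not the macro-averaged F1.
```

Because macro averaging weights each class equally, the zero F1 on the Indeterminate class drags the overall score down regardless of how large the Benign class is.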
The training analysis shows that larger hidden dimensions do not guarantee better performance; model stability is influenced more by architecture depth and training configuration. This study emphasizes the importance of comprehensively evaluating models' accuracy, efficiency, stability, and ability to handle data imbalance in automated medical diagnosis systems.
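The experimental design described above, sweeping each architecture over combinations of batch size, learning rate, and epoch budget, can be sketched as a simple grid of runs. The model names mirror the study; the specific hyperparameter values and the early-stopping patience are assumptions for illustration, not the paper's settings. The `should_stop` helper is one common guard against the overfitting pattern reported for ResNet and Xception past ~40 epochs.

```python
# Hypothetical sketch of the hyperparameter sweep; values are assumed.
from itertools import product

MODELS = ["ResNet18", "ResNet50", "ResNet101", "DenseNet121",
          "EfficientNet-B0", "MobileNetV3-Large", "ConvNeXt-Tiny",
          "Xception"]
BATCH_SIZES = [32, 64]          # assumed values
LEARNING_RATES = [1e-3, 1e-4]   # assumed values
EPOCH_BUDGETS = [40, 100]       # paper reports overfitting past ~40 epochs

# Each (model, batch_size, lr, epochs) tuple is one training run.
runs = list(product(MODELS, BATCH_SIZES, LEARNING_RATES, EPOCH_BUDGETS))

def should_stop(val_history, patience=5):
    """Early-stopping check: stop once validation accuracy has not
    improved for `patience` consecutive epochs."""
    if len(val_history) <= patience:
        return False
    best_before = max(val_history[:-patience])
    return max(val_history[-patience:]) <= best_before
```

In a real pipeline each tuple in `runs` would instantiate the model, train it, and log accuracy, F1, GPU memory, and per-epoch time, which is exactly the comparison table the study builds.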