Deep Learning for Skin Cancer Classification

Published: September 29th, 2024 | Updated: October 23rd, 2024


I wanted to work on a project using real-world data to further develop my skills in deep learning. With this in mind, I looked for a publication from a major scientific journal to serve as a baseline for the project, which led me to the Nature publication Human–computer collaboration for skin cancer recognition. In this article, I outline my approach to using this publication as a baseline and walk through the code I used to build on the authors' findings. The code and data I used for this project can be found on my GitHub.


I used images from the HAM10000 dataset. HAM10000 is a dataset for skin cancer detection using annotated lesion images. The dataset can be accessed here. I performed some preparatory work to label and organize the images into two directories, one for training (80%) and one for testing (20%), based on their class. The CSV file, ground-truth.csv, contains all the labels I used for this process. Both ground-truth.csv and the train/test directories, within the images folder, are available in the GitHub repository.
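As a rough sketch of that preparatory step, the split can be done with pandas and scikit-learn. The column names ("image" and "label") and the flat "raw" folder of source images are assumptions here; the actual files and script are in the repository.

```python
import shutil
from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed layout: ground-truth.csv has an "image" column and a "label" column
# (one of the seven classes); the unsorted images live in a flat "raw/" folder.
labels = pd.read_csv("ground-truth.csv")

# Stratified 80/20 split so each class keeps its proportion in both sets.
train_df, test_df = train_test_split(
    labels, test_size=0.2, stratify=labels["label"], random_state=42
)

for split_name, split_df in [("train", train_df), ("test", test_df)]:
    for _, row in split_df.iterrows():
        # Copy each image into images/<split>/<class>/ so ImageFolder can read it.
        dest = Path("images") / split_name / row["label"]
        dest.mkdir(parents=True, exist_ok=True)
        shutil.copy(Path("raw") / f"{row['image']}.jpg", dest / f"{row['image']}.jpg")
```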


Once the necessary libraries were imported, I prepared the images for training and testing. I resized all the images to 224x224 pixels, applied random horizontal flipping, random rotation up to 15 degrees, and random resized cropping with a scale range of 80% to 100% of the original image size. I then converted the images to tensors and normalized the pixel values to have a mean of 0.5 and a standard deviation of 0.5. The image datasets were loaded from two folders, train and test, located in the images directory, with images organized in subfolders by class for all seven classes. Next, I created two DataLoaders for training and testing. The train DataLoader shuffles the images and applies these augmentations, while the test DataLoader resizes and normalizes the images without shuffling or augmentation.
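In PyTorch, that pipeline looks roughly like the following. The directory paths match the layout described above, and the transforms mirror the ones listed; treat it as a sketch rather than the exact code from the repository.

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Training transforms: augmentation plus normalization, as described above.
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

# Test transforms: resize and normalize only, no augmentation.
test_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])

# ImageFolder infers the seven class labels from the subfolder names.
train_dataset = datasets.ImageFolder("images/train", transform=train_transform)
test_dataset = datasets.ImageFolder("images/test", transform=test_transform)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)
```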


The Convolutional Neural Network (CNN) I developed balanced complexity and efficiency to effectively classify skin cancer images. It had four convolutional layers with an increasing number of filters—64, 128, 256, and 512—that progressively captured both simple and complex image features. Batch normalization was applied after each convolution to stabilize learning, and LeakyReLU activation functions introduced non-linearity while avoiding issues such as dead neurons, which could occur with standard ReLU. By allowing a small, non-zero gradient for negative inputs, LeakyReLU also helped prevent vanishing gradients, ensuring smoother training in deep neural networks. Max-pooling was used after each layer to reduce the spatial dimensions, minimizing computational requirements without losing key information.


I also included an adaptive average pooling layer that adjusted the spatial dimensions before feeding the data into two fully connected layers. The first fully connected layer reduced the features to 64 dimensions, followed by a 0.2 dropout to reduce overfitting. The second fully connected layer mapped the features to the number of output classes, with an additional 0.1 dropout to further enhance generalization. I applied LeakyReLU activation to the first fully connected layer, while the final output was processed with softmax for classification.


Below is the CNN I developed.


Custom CNN
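For reference, a simplified text version of that architecture is sketched below. The kernel sizes, padding, LeakyReLU slope, and exact dropout placement are assumptions based on the description above, and the forward pass returns raw logits here (softmax is applied only at inference) because the cross-entropy loss used later expects logits.

```python
import torch
import torch.nn as nn

class SkinCancerCNN(nn.Module):
    """Sketch of the four-block CNN described above (64/128/256/512 filters)."""

    def __init__(self, num_classes: int = 7):
        super().__init__()

        def conv_block(in_ch, out_ch):
            # Conv -> BatchNorm -> LeakyReLU -> MaxPool, as described above.
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.LeakyReLU(0.01),
                nn.MaxPool2d(2),
            )

        self.features = nn.Sequential(
            conv_block(3, 64),
            conv_block(64, 128),
            conv_block(128, 256),
            conv_block(256, 512),
            nn.AdaptiveAvgPool2d(1),  # collapse spatial dimensions before the classifier
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512, 64),
            nn.LeakyReLU(0.01),
            nn.Dropout(0.2),
            nn.Linear(64, num_classes),
            nn.Dropout(0.1),
        )

    def forward(self, x):
        x = self.features(x)
        # Raw logits; apply torch.softmax(logits, dim=1) for class probabilities at inference.
        return self.classifier(x)
```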

For training, I initially used my MacBook Pro with Apple's M3 Pro CPU+GPU system on a chip. I implemented an 80/20 train/test split and processed images in batches of 32, resulting in a per-epoch training time of 10 minutes. While this may seem short, running 100 epochs would take over 16 hours. To address this, I leveraged Amazon Web Services' cloud computing resources by running an on-demand g6.xlarge EC2 instance with an NVIDIA L4 GPU. I set up a Jupyter Lab server on the instance to take advantage of CUDA and the L4 GPU. With this setup, I completed 80 epochs in 116 minutes, roughly seven times faster than my local hardware.
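In code, moving between the two machines comes down to selecting the right PyTorch backend, roughly like this:

```python
import torch

# Prefer CUDA on the EC2 L4 instance, fall back to Apple's Metal backend (MPS)
# on the MacBook Pro, and otherwise use the CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
```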


I set up my model to train for up to 100 epochs, incorporating early stopping to prevent overfitting. Early stopping was configured to trigger if there was no improvement in validation performance for 10 consecutive epochs; in this run, it was activated at the 80th epoch.


As for training hyperparameters, I used a learning rate of 0.001 and optimized the model with the AdamW optimizer, which applied decoupled weight decay (set to 1e-4) for better regularization. The loss function used was cross-entropy loss, appropriate for multi-class classification. Additionally, I employed a learning rate scheduler, ReduceLROnPlateau, which monitored validation loss and reduced the learning rate by a factor of 0.3 when the performance plateaued, with a patience of three epochs. This dynamic adjustment helped maintain efficient training while preventing overshooting or stagnation during optimization.


My training loop is below.
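The full loop is in the repository; the condensed sketch below covers its key pieces, assuming the held-out test split doubles as the validation set used for early stopping and the learning rate scheduler.

```python
import copy

import torch
import torch.nn as nn
import torch.optim as optim

model = SkinCancerCNN(num_classes=7).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=1e-4)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.3, patience=3)

best_val_loss = float("inf")
best_state = None
epochs_without_improvement = 0

for epoch in range(100):
    # --- training pass ---
    model.train()
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()

    # --- validation pass (here: the held-out test split) ---
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            val_loss += criterion(model(images), labels).item()
    val_loss /= len(test_loader)

    # Reduce the learning rate by a factor of 0.3 when validation loss plateaus.
    scheduler.step(val_loss)

    # Early stopping: quit after 10 epochs without improvement, keeping the best weights.
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_state = copy.deepcopy(model.state_dict())
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= 10:
            print(f"Early stopping at epoch {epoch + 1}")
            break

model.load_state_dict(best_state)
```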


I also implemented Gradient-weighted Class Activation Mapping (Grad-CAM) to verify that the model is learning the right features. Grad-CAM is a technique that visualizes which areas of an image a CNN focuses on when making a prediction by highlighting the regions most important to the model's decision. This makes the model easier to interpret by showing which features influence the classification. Here, Grad-CAM is crucial for confirming that the model attends to clinically relevant features, such as the borders, texture, shape, and color of the lesion, which are critical for classifying skin cancer images.


To generate the Grad-CAM heatmap, I loaded my trained skin cancer recognition model and a sample image. The image was resized to 224x224 pixels, normalized with a mean and standard deviation of 0.5, and converted into a PyTorch tensor, matching the pre-processing steps used during training. The image was then passed through the model for a forward pass, and I registered a forward hook to capture the feature maps from the final convolutional layer.


After the forward pass, a backward hook was registered to capture the gradients from the same layer. I computed the class activation map (CAM) by performing a backward pass based on the predicted class. The gradients were averaged to obtain the weights, which were multiplied by the corresponding feature maps to produce the CAM. The CAM was then resized to 450x450 pixels for higher resolution and normalized between 0 and 1. A threshold of 0.6 was applied to highlight the most important regions contributing to the model's prediction. I also applied a Gaussian blur with a kernel size of 5x5 and a σ of 5.0 to smooth the CAM. The Grad-CAM heatmap was overlaid on the original image to visualize the areas the model focused on when making its prediction.


In the heatmap, red areas indicate regions where the model places the highest importance, while yellow and green areas show moderate relevance. Blue and cyan regions denote areas of low importance. Referring to the color bar on the right of the plots, red areas are closer to 1 (higher importance), while blue areas are closer to 0 (lower importance). This visualization helps interpret which parts of the lesion—such as its borders, texture, and pigmentation—were crucial for classification. The code for generating Grad-CAM heatmaps and several examples are below.
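A condensed sketch of those Grad-CAM steps follows. The checkpoint and image file names are placeholders, the hooked layer refers to the last convolutional block of the sketch architecture above, and the repository version also handles the plotting and color bar.

```python
import cv2
import numpy as np
import torch
from PIL import Image
from torchvision import transforms

# Placeholder paths; substitute your own checkpoint and sample image.
model = SkinCancerCNN(num_classes=7)
model.load_state_dict(torch.load("skin_cancer_cnn.pth", map_location="cpu"))
model.eval()

# Same preprocessing as during training: resize, tensor, normalize with 0.5/0.5.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
image = Image.open("sample_lesion.jpg").convert("RGB")
input_tensor = preprocess(image).unsqueeze(0)

# Hooks capture the activations and gradients of the last convolutional block.
feature_maps, gradients = [], []
target_layer = model.features[3]
target_layer.register_forward_hook(lambda m, i, o: feature_maps.append(o))
target_layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))

# Forward pass, then backpropagate the score of the predicted class.
output = model(input_tensor)
pred_class = output.argmax(dim=1).item()
model.zero_grad()
output[0, pred_class].backward()

# Average the gradients to get per-channel weights, then combine with the feature maps.
weights = gradients[0].mean(dim=(2, 3), keepdim=True)
cam = torch.relu((weights * feature_maps[0]).sum(dim=1)).squeeze().detach().numpy()

# Resize to 450x450, normalize to [0, 1], threshold at 0.6, and smooth with a Gaussian blur.
cam = cv2.resize(cam, (450, 450))
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
cam[cam < 0.6] = 0
cam = cv2.GaussianBlur(cam, (5, 5), 5.0)

# Overlay the heatmap on the original image and save the result.
heatmap = cv2.applyColorMap(np.uint8(255 * cam), cv2.COLORMAP_JET)
original = cv2.resize(cv2.cvtColor(np.array(image), cv2.COLOR_RGB2BGR), (450, 450))
overlay = cv2.addWeighted(original, 0.6, heatmap, 0.4, 0)
cv2.imwrite("gradcam_overlay.jpg", overlay)
```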




Grad-CAM for Malignant Melanoma
Grad-CAM for Basal Cell Carcinoma
Grad-CAM for Vascular Lesion
Grad-CAM for Benign Melanocytic Nevi