Deep Learning Models for Organ Segmentation in the Head And Neck Region

Using CT Images from the MICCAI 2015 — Head and Neck Auto Segmentation challenge

Prerak Mody
Towards Data Science


Radiotherapy is now a common approach to treating cancers in the Head-and-Neck (HaN) region. To ensure that healthy organs receive a minimal amount of tumor-killing radiation, they must first be segmented. Automating this segmentation is desirable since manual contouring is time consuming and suffers from inter-annotator variation (i.e. between radiation oncologists). Such variations lead to differences in radiation dosage, and the planning time required leads to delayed treatments.

This post shall compare and critique different deep learning approaches for the automated segmentation of such Organs-At-Risk (OARs). The areas of model development that we shall focus on are:

  • Data preprocessing
  • Neural net architecture
  • Training methodology

Dataset

Ref: Raudaschl et al., “Evaluation of segmentation methods on head and neck CT: auto-segmentation challenge 2015.” Medical Physics 44.5 (2017): 2020–2036.

The MICCAI2015 Head and Neck auto-segmentation challenge consists of CT scans split across 4 categories — train (25 patients), train_additional (8 patients), test_offsite (10 patients) and test_onsite (5 patients). Most methods combine the training folders (25 + 8 = 33 patients) and test on the test_offsite folder. There are 9 OARs in this dataset — 1 brain organ (brainstem), 3 optic organs (optic chiasm, left & right optic nerves), 1 bone (mandible) and 4 salivary glands (left & right parotid and left & right submandibular glands). A video showing the shape and relative spatial locations of these organs can be found here. A few unique characteristics of this dataset are:

  • The mandible (lower jaw bone) is quite easy to segment (or delineate, in radiotherapy terms) as bony structures present good contrast in CT images.
  • The optic organs are very thin and present on only a few CT slices. Thus, they are the most difficult to auto-annotate due to their small 3D structure.
  • Other organs such as the brainstem and salivary glands are also difficult to segment since CT images have poor soft-tissue contrast. They are quite difficult even for humans to delineate; only trained experts (with the help of other modalities such as MR) are able to do so.
  • Images seem to have been taken from different scanners, as they have different voxel resolutions along the x, y and z axes.

Fig: Histogram of voxel spacing in the X, Y and Z directions of the MICCAI2015 dataset

AnatomyNet

Ref: Zhu et al., “AnatomyNet: Deep learning for fast and fully automated whole-volume segmentation of head and neck anatomy.” Medical Physics, 2019.

Fig: Neural Architecture of AnatomyNet (from Medical Physics)

Data

Unlike previous methods that process large medical volumes (e.g. CT or MR) as either 3D patches or a subset of 2D slices, this work feeds a whole cropped CT volume (~178 × 302 × 225 voxels) into the neural network. It is also general practice to normalize the Hounsfield Units (HU), but no indication of such preprocessing is provided.
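Since the paper does not report its preprocessing, below is a minimal sketch of the HU normalization that is commonly applied in this field; the window bounds are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def normalize_hu(volume: np.ndarray, hu_min: float = -1000.0, hu_max: float = 1000.0) -> np.ndarray:
    """Clip a CT volume to a Hounsfield-unit window and rescale to [0, 1].

    The window bounds here are illustrative defaults, not values from
    the AnatomyNet paper (which does not report its preprocessing).
    """
    volume = np.clip(volume.astype(np.float32), hu_min, hu_max)
    return (volume - hu_min) / (hu_max - hu_min)
```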

Neural Architecture

Like most medical image segmentation methodologies, this paper also extends the 3D UNet architecture. A notable departure from the standard 3D UNet is the reduced number of downsampling blocks in the encoder path (only a single downsampling operation is used). This is done because the smaller optic organs would lose spatial information at lower resolutions. Another modification is the use of “Residual Squeeze-and-Excitation” blocks in place of the standard convolutional block. No specific explanation for these blocks is provided beyond a generic “to learn effective features”.
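For readers unfamiliar with these blocks, here is a minimal PyTorch sketch of a 3D residual squeeze-and-excitation block; the channel count and reduction ratio are illustrative defaults (the ratio of 16 comes from Hu et al. [2]), not necessarily the paper's values.

```python
import torch
import torch.nn as nn

class ResidualSEBlock3D(nn.Module):
    """Residual block with channel-wise squeeze-and-excitation (SE) attention.

    A generic sketch of the kind of building block AnatomyNet swaps in for
    the standard convolutional block; not the paper's exact definition.
    """
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(channels),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm3d(channels),
        )
        self.se = nn.Sequential(                 # squeeze: global pool; excite: two 1x1x1 convs
            nn.AdaptiveAvgPool3d(1),
            nn.Conv3d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv(x)
        out = out * self.se(out)                 # rescale channels by learned attention
        return self.relu(out + x)                # residual connection
```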

Loss Function

This work employs a combined Dice and focal loss, where the loss of an individual organ is masked out if that organ is absent from a particular patient's ground truth. The individual organ losses are also weighted inversely to their voxel counts in a particular 3D volume.
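A minimal PyTorch sketch of how such a masked, inversely weighted Dice plus focal loss could be assembled is shown below; the exact weighting scheme and the focal gamma are assumptions, and the paper's formulation may differ in detail.

```python
import torch

def dice_focal_loss(probs, targets, organ_present, gamma=2.0, eps=1e-6):
    """Combined Dice + focal loss over C organ channels.

    probs, targets: (B, C, D, H, W); organ_present: (B, C) binary mask that
    zeroes out organs missing from a patient's ground truth. Organ weights
    are inversely proportional to voxel count, as the paper describes; the
    focal gamma of 2.0 is an assumed default.
    """
    dims = (2, 3, 4)
    voxels = targets.sum(dim=dims)                               # (B, C) foreground voxel counts
    weights = organ_present / (voxels + 1.0)                     # inverse-size weighting, masked
    weights = weights / (weights.sum(dim=1, keepdim=True) + eps)

    inter = (probs * targets).sum(dim=dims)
    dice = (2 * inter + eps) / (probs.sum(dim=dims) + targets.sum(dim=dims) + eps)
    dice_loss = (weights * (1 - dice)).sum(dim=1).mean()

    focal = -(targets * (1 - probs).pow(gamma) * torch.log(probs + eps)
              + (1 - targets) * probs.pow(gamma) * torch.log(1 - probs + eps))
    focal_loss = (weights * focal.mean(dim=dims)).sum(dim=1).mean()
    return dice_loss + focal_loss
```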

Training Details

The training used a batch size of 1 and switched from the RMSProp optimizer to the stochastic gradient descent (SGD) optimizer partway through training. These choices seem odd, as it is usually the case to set a minimum batch size of at least 2. Since this neural net accepts a whole cropped 3D volume, GPU memory constraints may have prevented a larger batch size. No explanation for the change in optimizers is provided. For data augmentation, the authors use affine transforms and elastic deformations, though they do not provide details on this step. This work uses an Nvidia Tesla P40, which offers 24GB of GPU memory.

Critique

In my opinion, this work does a great job of empirically showing that additional downsampling layers are not useful for the MICCAI2015 dataset. Their ablation study, comparing a vanilla 3D UNet and its variants with the final proposed network, offers insight into the contribution of each modification.

Yet, other details such as voxel resizing are not discussed. Upon downloading the dataset, I noticed that the voxel spacing has a wide range in the x, y and z dimensions. It is standard practice to resample volumes to a common resolution (in mm) before performing learning on them. In the context of training, the change of optimizer during training is an unusual design choice and its benefits are not discussed. Also, a batch size of 1 is used due to GPU memory constraints; in my opinion, the authors should have explored accumulating gradients across multiple samples before updating the weights, as sketched below.
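For reference, such gradient accumulation is only a few lines in PyTorch; this is a generic sketch, not something the authors implemented.

```python
def train_accumulated(model, loader, criterion, optimizer, accumulation_steps: int = 4):
    """Average gradients over several single-sample forward passes before
    stepping, emulating a larger batch on a memory-constrained GPU."""
    optimizer.zero_grad()
    for step, (volume, target) in enumerate(loader):
        loss = criterion(model(volume), target)
        (loss / accumulation_steps).backward()   # scale so accumulated gradients average
        if (step + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```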

Although the paper does not mention an official code repository, the authors' GitHub contains a repository for AnatomyNet.

FocusNet

Ref: Gao et al., “FocusNet: Imbalanced Large and Small Organ Segmentation with an End-to-End Deep Neural Network for Head and Neck CT Images.” Medical Image Computing and Computer-Assisted Intervention (MICCAI), Springer, 2019.

Fig: Neural Architecture of FocusNet (from Springer)

Data

The input to the model is a 3D CT volume (as understood from the text and the associated figure), but no information on the volume dimensions or on volume resizing is provided. There is also no indication of Hounsfield Unit normalization. The authors additionally train their models on a private dataset.

Neural Architecture

This work takes a unique approach to improving segmentation metrics on the smaller optic organs by introducing a dedicated branch for them. Specifically, the architecture has three main components — SNet (main segmentation network), SOLNet (Small Organ Localization Network) and SOSNet (Small Organ Segmentation Network). The SNet segments the large organs, the SOLNet provides a keypoint for the midpoint of each smaller organ (similar to deep learning based human pose estimation), and the SOSNet uses ROI-pooled features from the SNet (guided by the output of the SOLNet) to segment the smaller organs.

Similar to AnatomyNet, the SNet employs only a single downsampling operation, followed by squeeze-and-excitation blocks (for channel-wise attention). The core of the architecture is formed by atrous convolution operations (DenseASPP [1]), which help attain a sufficient receptive field size. The SOLNet is a simple network with two squeeze-and-excitation blocks [2] that outputs a 3D Gaussian centered around the midpoint of each small optic organ. A similar architecture is used for the SOSNet to output segmentation maps. Note that the SOSNet helps manage the imbalance between foreground and background voxel counts by simply cropping away most of the background. An important design choice here is that the output size of the SOSNet is three times the average diameter of the three small optic organs.
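To make the SOLNet target concrete, here is a NumPy sketch of how a 3D Gaussian heatmap centered on an organ midpoint can be generated; the sigma value is an illustrative assumption, not taken from the paper.

```python
import numpy as np

def gaussian_heatmap_3d(shape, center, sigma=3.0):
    """Build a 3D Gaussian centered on an organ midpoint — the kind of
    regression target a SOLNet-style localization network predicts.
    sigma (in voxels) is an illustrative value, not the paper's."""
    zz, yy, xx = np.meshgrid(
        np.arange(shape[0]), np.arange(shape[1]), np.arange(shape[2]),
        indexing="ij",
    )
    d2 = (zz - center[0]) ** 2 + (yy - center[1]) ** 2 + (xx - center[2]) ** 2
    return np.exp(-d2 / (2 * sigma ** 2)).astype(np.float32)
```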

Loss Function

Similar to AnatomyNet, this work also uses a combination of Dice and focal loss, with the differences that only the focal loss is weighted and that both losses contribute equally. Here, the weight is inversely proportional to each organ's average size. The Dice loss is neither weighted nor masked.

Training Details

No information on the training batch size, optimizer, learning rate schedule or number of training epochs is provided. The readers are also not told which GPU was used.

Critique

This work improves upon AnatomyNet by introducing a cascaded structure for the smaller optic organs. The DenseASPP module as a whole does not improve upon the architecture of FocusNet (Table 3 in the paper), so its advantage is unclear. A missed opportunity to improve metrics might be the absence of a weighted and masked Dice loss. The paper also omits important implementation details such as volume resizing, volume size and training details.

No official code repository was provided.

StratifiedNet

Ref: Guo et al., “Organ at Risk Segmentation for Head and Neck Cancer using Stratified Learning and Neural Architecture Search.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

Fig: Neural Architecture for StratifiedNet (from CVF)

Data

For training, the authors use sub-volumes of (128, 128, 64) voxels and apply a windowing of [-500, 1000] Hounsfield Units. To extract sub-volumes from a whole CT volume, either sub-volumes centered around an organ are used or random samples are taken from the rest of the CT. A scaling augmentation is performed by randomly zooming with a ratio between 0.8 and 1.2. For testing, similarly sized sub-volumes are extracted with a stride of (96, 96, 32) and the probability maps are averaged in the areas of intersection (a sketch of this procedure follows below). No information on volume resizing is provided.
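The test-time procedure amounts to sliding-window inference with overlap averaging. Below is a minimal NumPy sketch under the stated window and stride; the predict_fn signature and the handling of volume edges are my assumptions, not details from the paper.

```python
import numpy as np

def sliding_window_predict(volume, predict_fn, window=(128, 128, 64),
                           stride=(96, 96, 32), num_classes=9):
    """Tile a CT volume with overlapping sub-volumes, run predict_fn on each
    crop, and average the class probabilities where windows intersect.
    predict_fn is assumed to map a (D, H, W) crop to (C, D, H, W)
    probabilities; the volume is assumed to be at least one window large."""
    def anchors(size, win, step):
        # stride-spaced positions, plus a final anchor flush with the far edge
        return sorted(set(range(0, size - win + 1, step)) | {size - win})

    D, H, W = volume.shape
    probs = np.zeros((num_classes, D, H, W), dtype=np.float32)
    counts = np.zeros((D, H, W), dtype=np.float32)
    for z in anchors(D, window[0], stride[0]):
        for y in anchors(H, window[1], stride[1]):
            for x in anchors(W, window[2], stride[2]):
                crop = volume[z:z + window[0], y:y + window[1], x:x + window[2]]
                probs[:, z:z + window[0], y:y + window[1], x:x + window[2]] += predict_fn(crop)
                counts[z:z + window[0], y:y + window[1], x:x + window[2]] += 1.0
    return probs / counts  # average in the overlapping regions
```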

The proposed methodology is also used for auto segmentation of 42 OARs in a private dataset.

Neural Architecture

Since CT images have poor soft-tissue contrast, clinicians define contouring guidelines on the basis of anatomical landmarks [4]. This work takes inspiration from this clinical approach to OAR contouring and predicts organs step-by-step on the basis of a predefined categorization. This is a radical shift from whole-volume segmentation (AnatomyNet) or segmentation-by-detection (FocusNet). Organs such as the mandible, brainstem and eyes are predicted first so they can serve as anchors for predicting organs in the mid-level or small-and-hard (S&H) categories. These anchor organs have the best contrast of all the organs and are hence easy to predict. For the S&H organs, the authors use a strategy similar to FocusNet's of detecting center locations. The primary reason for this categorization is to ease neural net optimization (or learning) on a private dataset consisting of 42 OARs. A 3D P-HNN [3] network is used as the base for all three categories.
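One simple way to realize this stratification, sketched below, is to feed the anchor-branch predictions to the next branch as extra input channels; the toy single-convolution branches stand in for the 3D P-HNN backbones and are purely illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class StratifiedCascade(nn.Module):
    """Toy sketch of stratified prediction: the anchor organs are predicted
    first and their probability maps are concatenated to the CT sub-volume
    as spatial priors for the harder mid-level organs."""
    def __init__(self, n_anchor: int, n_mid: int):
        super().__init__()
        self.anchor_branch = nn.Conv3d(1, n_anchor, kernel_size=3, padding=1)
        self.mid_branch = nn.Conv3d(1 + n_anchor, n_mid, kernel_size=3, padding=1)

    def forward(self, ct: torch.Tensor):
        anchor_probs = torch.sigmoid(self.anchor_branch(ct))
        mid_in = torch.cat([ct, anchor_probs], dim=1)   # anchors as extra input channels
        mid_probs = torch.sigmoid(self.mid_branch(mid_in))
        return anchor_probs, mid_probs
```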

On top of this categorization, this neural architecture also utilizes Neural Architecture Search (NAS) to find the best combination of convolutional blocks in the 3D P-HNN for each of the categories.

*The name of this neural architecture was given by the author of this post, not the original authors.

Fig: Categorization of OARs to ease prediction (from CVF)

Loss Function

The segmentation model described above is trained using only the Dice loss, while the detection branch is trained using an L2 loss.

Training Details

The Rectified Adam optimizer is used with a momentum of 0.9 (the default value). After NAS outputs a fixed architecture, a batch size of 12 (on an NVIDIA Quadro RTX 8000 with 48GB of GPU memory) and an initial learning rate of 0.01 (anchor and mid-level branches) or 0.0005 (S&H branch) are used. Initially, only the anchor branch is trained. It is then frozen, and the mid-level and S&H branches are trained for an equivalent number of epochs. Finally, the whole model is fine-tuned (a sketch of this staged schedule follows below). This post shall not delve into the details of NAS.
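The staged schedule translates naturally into parameter freezing in PyTorch; in the sketch below the branch attribute names and the train_fn callback are hypothetical, not taken from the paper (no code was released).

```python
def staged_training(model, train_fn, epochs: int, finetune_epochs: int):
    """Stage 1: anchor branch only; stage 2: mid-level and S&H branches with
    the anchor branch frozen; stage 3: fine-tune everything. The attribute
    names (anchor_branch, mid_branch, sh_branch) are illustrative."""
    def set_trainable(module, trainable: bool):
        for p in module.parameters():
            p.requires_grad = trainable

    set_trainable(model.mid_branch, False)
    set_trainable(model.sh_branch, False)
    train_fn(model, epochs)                      # stage 1: anchor branch only

    set_trainable(model.anchor_branch, False)
    set_trainable(model.mid_branch, True)
    set_trainable(model.sh_branch, True)
    train_fn(model, epochs)                      # stage 2: mid-level and S&H branches

    set_trainable(model.anchor_branch, True)
    train_fn(model, finetune_epochs)             # stage 3: fine-tune the whole model
```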

Critique

In terms of methodology, the authors of this paper seem to have adopted a sound approach by explaining the contributions of their stratified and NAS components separately. Replicating clinical practice in an algorithm has surely benefited the final results and is hence an applaudable idea.

A downside of this approach is that the authors do not report, or even discuss, results with different random seeds, which lead to different initializations. Also, a batch size of 12 along with the multiple modules in the neural architecture implies that a memory-heavy GPU was used for training. Inference may also impose high GPU demands, though in advanced medical treatment options like radiotherapy this may not be a bottleneck to implementation in clinical practice.

No official code repository was provided.

DeepMindNet

Ref: Nikolov et al., “Deep learning to achieve clinically applicable segmentation of head and neck anatomy for radiotherapy.” arXiv preprint arXiv:1809.04430 (2018).

Fig: Neural Architecture for DeepMindNet (from Arxiv)

Data

This work takes as input a volume with full in-plane dimensions (H, W) and a subset of contiguous axial slices (d < D). No information on the windowing strategy is provided; a volume with a resolution of (0.976mm, 0.976mm, 2.5mm) was used as the original input. The data was augmented with in-plane translations, rotations, scaling, shearing, mirroring, elastic deformations and additive pixel noise.

Neural Architecture

Unlike the works studied above, this method chooses to downsample its inputs, though note that the authors' private dataset does not include the optic chiasm. Each downsampling block is a combination of a series of 2D convolutions (which preserve the spatial dimensions) and 3D depthwise separable convolutions, whose output is added to the block's input in residual fashion. The upsampling blocks contain only a series of 2D convolutions. Between the downsampling branch (i.e. encoder) and the upsampling branch (i.e. decoder) there is a fully connected block with residual-style sub-blocks.
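As a reference for the building block, here is a PyTorch sketch of a residual 3D depthwise separable convolution; the paper's exact block composition (its mix of 2D and 3D convolutions) is more involved, so treat this as illustrative only.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv3D(nn.Module):
    """3D depthwise separable convolution: a per-channel spatial convolution
    followed by a 1x1x1 pointwise convolution that mixes channels, with a
    residual addition. A generic sketch, not the paper's exact block."""
    def __init__(self, channels: int):
        super().__init__()
        self.depthwise = nn.Conv3d(channels, channels, kernel_size=3,
                                   padding=1, groups=channels)
        self.pointwise = nn.Conv3d(channels, channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.pointwise(self.depthwise(x))
        return self.relu(out + x)                # residual addition
```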

*The name of this neural architecture was given by the author of this post, not the original authors.

Loss Function

Breaking with the norm, this method uses a cross-entropy loss that penalizes only the top-k% of per-voxel loss values for each OAR mask. The authors claim this leads to faster training and helps abate the class imbalance problem, e.g. for the optic organs.
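A sketch of such a top-k loss in PyTorch is given below, using a per-channel binary cross-entropy; the 5% fraction is an assumed value, and the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def topk_bce_loss(logits, targets, k_fraction=0.05):
    """Cross-entropy that backpropagates only the hardest k% of per-voxel
    losses for each organ channel — a sketch of the top-k idea. The 5%
    fraction is an assumption, not the paper's value."""
    per_voxel = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    flat = per_voxel.flatten(start_dim=2)                 # (B, C, N voxels)
    k = max(1, int(k_fraction * flat.shape[-1]))
    topk, _ = flat.topk(k, dim=-1)                        # keep only the hardest voxels
    return topk.mean()
```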

Training Details

The Adam optimizer was used with an initial learning rate of 0.0001, which was then repeatedly scaled down. A batch size of 32 (good!) was spread across 32 GPUs (OMG!!) and the model was trained using synchronous SGD.

Critique

This work focuses on highlighting how the standard Dice metric used in medical image segmentation benchmarks does not reflect clinical usefulness, and instead proposes a surface Dice metric. The authors do not focus on the justifications behind their neural architecture, and the reader can only hypothesize why certain architectural choices were made.

*The results of this work are lower compared to the others as it is trained on a private dataset with a different OAR contouring methodology.

*This work also does not consider the optic chiasm, since it requires image registration with MR images for acceptable contouring.

Metrics

Besides reporting the standard Dice scores, as done by the other works, this work proposes the use of a surface Dice. This metric considers the overlap of two surfaces (2D manifolds in 3D space) rather than of two 3D volumes. This has clinical relevance since adjusting the surfaces of a contour is where the expertise of an experienced radiotherapist/oncologist is applied. To cite the paper:

For example, two inaccurate segmentations could have a similar volumetric DSC score if one were to deviate from the correct surface boundary by a small amount in many places while another had a large deviation at a single point. Correcting the former would likely take a considerable amount of time as it would require redrawing almost all of the boundary while the latter could be corrected much faster, potentially with a single edit action.
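To make the metric concrete, below is a voxel-based approximation of the surface Dice at a fixed tolerance using SciPy; the official metric operates on extracted surface meshes, so this is an illustrative simplification, assuming non-empty masks.

```python
import numpy as np
from scipy import ndimage

def surface_dice(mask_a, mask_b, tolerance_mm=1.0, spacing=(1.0, 1.0, 1.0)):
    """Voxel-based sketch of the surface Dice at a given tolerance: the
    fraction of surface voxels of each mask lying within tolerance_mm of
    the other mask's surface. An illustrative approximation of the metric,
    not the official mesh-based implementation."""
    def surface(mask):
        eroded = ndimage.binary_erosion(mask)
        return mask & ~eroded                              # boundary voxels

    surf_a = surface(mask_a.astype(bool))
    surf_b = surface(mask_b.astype(bool))
    # distance (in mm) from every voxel to the nearest surface voxel
    dist_to_a = ndimage.distance_transform_edt(~surf_a, sampling=spacing)
    dist_to_b = ndimage.distance_transform_edt(~surf_b, sampling=spacing)

    a_close = (dist_to_b[surf_a] <= tolerance_mm).sum()
    b_close = (dist_to_a[surf_b] <= tolerance_mm).sum()
    return (a_close + b_close) / (surf_a.sum() + surf_b.sum())
```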

Overview

MICCAI 2015 — Test Results

Dice Scores on 10 test images of the MICCAI2015 dataset

Methodology Comparison

Comparison of modeling choices

Gaps in Literature

The research discussed above usually takes as input either:

  • a cropped CT volume with full in-plane and axial context i.e. all organs present (e.g. AnatomyNet, FocusNet)
  • a cropped CT volume with a predefined axial context but full in-plane context (e.g. DeepMindNet)
  • or a cropped CT sub-volume with reduced axial and in-plane context, but additional anchor organ context (e.g. StratifiedNet)

What none of these works discusses is the effect of the amount of context required to obtain clinically acceptable metrics. Thus, there is a need for work on making neural network training easier on GPUs with memory constraints. This would translate into experiments with different sub-volume sizes on datasets such as MICCAI2015, or even on private datasets with additional OARs.

An important contribution would also be the introduction of additional benchmark datasets like the one introduced in StructSeg, 2019.

Conclusion

This post analyzed different approaches to auto-segmentation of Organs-at-Risk in a radiotherapy context for the Head-and-Neck region of the human body. The MICCAI2015 Head and Neck segmentation challenge was used as a benchmark dataset to understand the performance differences between the methods discussed above. The take-home messages are:

  1. Although the original UNet proposed repeated downsampling in the encoder branch, this approach can hurt Dice scores, especially for small organs.
  2. Segmentation-by-detection has been widely used for segmenting smaller organs.
  3. Understanding the clinical workflow of radiotherapy is useful for the creation of novel neural architectures and useful metrics.

Finally, some gaps in the literature were discussed in the hope of inciting future thought in this field.

References

[1] Yang et al. “DenseASPP for semantic segmentation in street scenes.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.

[2] Hu et al. “Squeeze-and-excitation networks.” IEEE CVPR, 2018.

[3] Harrison et al. “Progressive and multi-path holistically nested neural networks for pathological lung segmentation from CT images.” International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, Cham, 2017.

[4] Brouwer et al. “Assessment of manual adjustment performed in clinical practice following deep learning contouring for head and neck organs at risk in radiotherapy.” Physics and Imaging in Radiation Oncology 16 (2020): 54–60.


I'm a PhD Candidate at Leiden University Medical Centre. My research focuses on using deep learning for contour propagation of Organs at Risk in CT images.