Prathamesh Mandke, pkmandke AT vt DOT edu
Japnit Singh Sethi, japss96 AT vt DOT edu
Department of Electrical and Computer Engineering
Virginia Tech
Abstract
In this work, we consider the low light image enhancement problem: recovering an enhanced, normal light version of a low contrast image that suffers from poor visibility. Image enhancement is an inherently ill-posed problem since a given low light image can have many possible normal light equivalents. Further, the presence of noise, along with spatial variations in contrast and brightness, poses additional challenges in solving this problem.
We explore, compare and contrast various deep learning based approaches to the problem in supervised and unsupervised settings. In our experiments, we obtain performance benchmarks with 3 methods: 1) CLAHE, a conventional (non deep learning) histogram equalization approach, 2) EnlightenGAN, an unsupervised state-of-the-art low light enhancement model based on GANs, and 3) multiple variations of U-Net based auto-encoder architectures with certain modifications. We observe that the generative model performs best across these methods while CLAHE performs worst. The auto-encoder based architectures are able to retrieve most of the information from the low light images, but the resulting images look less natural. We provide qualitative as well as quantitative benchmarks to support these claims.
Introduction
Low light image enhancement is a widely studied problem in Computer Vision, where the goal is to recover an enhanced normal light version of an image with low contrast or visibility. Low light image enhancement finds widespread applications in domains such as autonomous driving and surveillance where mission critical computer vision systems rely on images captured in low lighting conditions for decision making. As mentioned earlier, this task is inherently ill-posed since a given low-light image can have multiple potential normal light equivalents. A simple example could be where a snapshot of a mountain range taken at night (pitch dark) could map to a normal light version during peak sunshine or at dusk. Besides, other artifacts such as image noise or spatially varying brightness and contrast pose additional difficulties. For example, certain regions of an image may need to be enhanced more than others in terms of contrast. Thus, a conventional Computer Vision approach that increases the overall image contrast may not give visually pleasing results.
Approach
Most approaches to the image enhancement problem can be categorized as either histogram based methods (such as CLAHE) or learning based methods. Learning based approaches can further be subdivided into supervised or unsupervised, based on whether paired low-light and normal-light images are available. Further, deep learning methods are based on either the convolutional auto-encoder framework or the Generative Adversarial Network (GAN) framework.
As mentioned earlier, the goal of this work is to compare and contrast various different methods across each of the aforementioned approaches to the problem. To this end, we choose to work with: 1) CLAHE - an adaptive histogram equalization based technique, 2) EnlightenGAN - an unsupervised GAN based framework and 3) multiple variations of the U-Net based auto-encoder framework which has shown great success for image-to-image translation problems. We present the results of benchmarking these methods supported with qualitative and quantitative comparisons. For CLAHE, we use the OpenCV implementation of the algorithm. To experiment with EnlightenGAN, we use the codebase provided by the authors (GitHub) and retrain the model using 3 NVIDIA GPUs based on the technique described in the paper. We implement the U-Net based auto-encoder variants from scratch using the PyTorch framework in Python. A brief overview of the aforementioned approaches follows.
CLAHE and EnlightenGAN
CLAHE (Contrast Limited Adaptive Histogram Equalization) improves over vanilla histogram equalization by performing equalization locally on small patches (tiles) of the image instead of globally over the entire image. This helps avoid unwanted excess enhancement in regions with very high or very low contrast. Further, contrast limiting is applied to reduce noise amplification: histogram bin counts beyond a threshold (the clip limit) are clipped before equalization, as shown in the figure above.
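To make the contrast-limiting step concrete, the following is a minimal pure-Python sketch for a single tile. It is an illustration under stated assumptions, not the OpenCV implementation: the function names, the toy 8-bin histogram, and the use of an absolute pixel count as the clip limit are all hypothetical choices made for this example.

```python
def clip_histogram(hist, clip_limit):
    """Clip each bin at clip_limit and spread the excess evenly over all bins."""
    excess = sum(max(count - clip_limit, 0) for count in hist)
    clipped = [min(count, clip_limit) for count in hist]
    share, remainder = divmod(excess, len(hist))
    # Every bin receives an equal share; the first `remainder` bins receive one
    # extra count so the total number of pixels is preserved. Bins may end up
    # slightly above the limit after redistribution; this sketch does not re-clip.
    return [c + share + (1 if i < remainder else 0) for i, c in enumerate(clipped)]


def equalization_map(hist, levels):
    """Map each bin index to an output level via the normalized cumulative histogram."""
    total = sum(hist)
    mapping, running = [], 0
    for count in hist:
        running += count
        mapping.append(round((levels - 1) * running / total))
    return mapping


# Toy histogram of one tile (hypothetical): almost all pixels sit in one dark bin.
tile_hist = [0, 60, 2, 1, 1, 0, 0, 0]
plain = equalization_map(tile_hist, levels=8)
limited = equalization_map(clip_histogram(tile_hist, clip_limit=16), levels=8)
print(plain)    # [0, 7, 7, 7, 7, 7, 7, 7]
print(limited)  # [1, 3, 4, 5, 5, 6, 6, 7]
```

Without clipping, the mapping jumps from the darkest to the brightest level across a single bin, which is the kind of steep slope that amplifies noise. With clipping, the mapping rises gradually.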
EnlightenGAN is an unsupervised GAN based low light image enhancement method proposed by Jiang et al. The framework learns to enhance low light images without any paired supervision at training time. The main highlights of this work are a self-regularized attention mechanism in the generator along with a dual global-local discriminator structure that helps the model handle both the fine and coarse details in the image.
U-Net based auto-encoders
To explore the problem in a supervised setting, we implement U-Net based auto-encoder architectures and explore multiple versions of the same. Image enhancement falls under a wider class of problems known as image-to-image translation, in which the U-Net architecture has shown great success. In this work, we explore two variations of the original U-Net architecture. While the original model concatenates features in its skip connections, we instead perform element-wise addition of the corresponding feature maps in the encoder and decoder, drawing inspiration from the ResNet architecture. This reduces the number of feature-map channels in the decoder while still helping the decoding process with features extracted from the original image. In addition, we explore the effect of two upsampling strategies in the decoder, namely transpose convolutions and bilinear upsampling. While bilinear upsampling has been a widely used upsampling strategy, the transpose convolution operation (popularized by the DCGAN paper) serves as a learnable upsampling layer in convolutional auto-encoders. As shown in the figure above, the overall architecture consists of two parts: the encoder, whose convolutional layers use a stride of 2 to downsample the feature maps, and the decoder, which upsamples the feature maps from the encoder using either transpose convolutional layers (as shown) or bilinear upsampling. The skip connections, which add together feature maps from corresponding levels of the encoder and the decoder, help the decoder retrieve features of the input image and make reconstruction easier. In addition, the final output of the decoder is added element-wise to the original input image to generate the normal light output. This way, the model only learns to predict the difference between the input low-light image and the output normal-light image, which is often an easier task than predicting the output image directly.
The figure below shows the auto-encoder architecture with bilinear upsampling instead of transpose convolution in the decoder.
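Additive skip connections only work if the encoder and decoder feature maps at each level have identical spatial sizes. The sketch below checks this with the standard output-size formulas for strided and transpose convolutions (dilation of 1). Only the stride of 2 and the 320px input size come from this report; the number of levels, the kernel sizes, the padding and the channel count are hypothetical values chosen for illustration.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a strided convolution (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out(size, kernel, stride, padding, output_padding=0):
    """Spatial output size of a transpose convolution (dilation 1)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding


# Hypothetical settings: 3 levels, 3x3 stride-2 convolutions in the encoder,
# 4x4 stride-2 transpose convolutions in the decoder, padding of 1 throughout.
size = 320
encoder_sizes = [size]
for _ in range(3):
    size = conv_out(size, kernel=3, stride=2, padding=1)
    encoder_sizes.append(size)

decoder_sizes = [size]
for _ in range(3):
    size = conv_transpose_out(size, kernel=4, stride=2, padding=1)
    decoder_sizes.append(size)

print(encoder_sizes)  # [320, 160, 80, 40]
print(decoder_sizes)  # [40, 80, 160, 320]

# Channel count seen by the next decoder layer for a hypothetical 64-channel level.
channels = 64
concat_channels = channels + channels  # original U-Net style concatenation
added_channels = channels              # element-wise addition keeps the count
print(concat_channels, added_channels)  # 128 64
```

Because the decoder sizes mirror the encoder sizes, feature maps at matching levels can be added element-wise, and the final 320px output can be added to the 320px input image.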
Experiments and Results
In this section, we present the experimental setup and results of our experiments with the CLAHE technique, EnlightenGAN and the U-Net based auto-encoders.
Experimental Setup
CLAHE and ENGAN
To begin with, we work with Contrast Limited Adaptive Histogram Equalization, which is a conventional computer vision algorithm as described in the previous section. This technique does not require any data for training. We use OpenCV's Python implementation of CLAHE. It has two parameters: the clip limit, which sets the contrast threshold, and the grid (window) size used when applying histogram equalization.
To experiment with EnlightenGAN, we reuse the authors' codebase, which is based on the PyTorch deep learning framework. (Note that we do not attach the code or the dataset for ENGAN with the submission, since we have not made any modifications to the authors' original code.) The authors have curated a custom unpaired dataset with 914 low light and 1016 normal light images, which we use for training. We re-train the EnlightenGAN model with the default settings for the EnlightenGAN architecture as described here. The training takes ~3 hrs when distributed across 3 NVIDIA TITAN RTX GPUs. The figure below shows the training curve for 200 epochs visualized using the visdom tool.
U-Net based auto-encoders
Programmer's Guide
We implement the different variations of the U-Net based auto-encoder architectures (described in the Approach section above) from scratch in Python (v3.7) using the PyTorch (v1.6) deep learning library. We adopt the codebase structure from this GitHub repository, which makes it easy to abstract away common constructs of dataset loading as well as model training, evaluation and visualization. In particular, we adopt the base classes for different entities (such as models and dataloaders) and the options handler for all our experiments with the U-Net auto-encoders. Based on this structure, we implement a model class for the auto-encoder (models/autoencoder_model.py), which includes code for initializing the model along with training and testing it based on custom options. We also implement a generic dataloader for any dataset having training/validation/testing data (see data/trainval_dataloader.py). The LoL dataset can be loaded by our custom dataloader defined in data/lol_dataset.py. The two versions of the auto-encoder described earlier (transpose convolution and bilinear upsampling) are implemented in models/networks.py. We also explore versions of these architectures with a smaller number of filters (argument f=16), but the results are not conclusive and hence are not reported. train.py and test.py are generic scripts that can be used to run an instance of training/testing the model. We train all models on a compute node with NVIDIA Titan RTX GPUs.
Implementation Details
There are two variants of the model, with bilinear upsampling and transpose convolutions as the upsampling strategies. We train each of these models with two loss functions: 1) only the Mean Square Error (MSE) between the predicted image and the ground truth normal light image and 2) the MSE along with the Structural Similarity Index Measure (SSIM). That is, the models are trained to minimize the MSE between the output and the ground truth normal light image while (optionally) maximizing the SSIM. (Note: The trainable version of the SSIM at this URL is used in our work.) The overall loss function is: Loss = MSE + lambda * (1 - SSIM), where the lambda parameter controls the weight of the SSIM term. Note that while the MSE enforces pixel-wise similarity between the predicted and ground truth normal light images, the SSIM term encourages a more visually pleasing enhancement. This results in 4 different training configurations: 1) Bilinear upsampling + MSE Loss, 2) Bilinear upsampling + MSE and SSIM Loss, 3) Transpose Convolution + MSE Loss and 4) Transpose Convolution + MSE and SSIM Loss. Instances of all these configurations are trained using the 485 images from the training set of the LOw-Light (LOL) Dataset curated by Wei et al. that is available publicly at this HTTPS URL. The 485 images are split into 450 for training and 35 for validation, with each image resized to a fixed size of 320x320px while maintaining the aspect ratio. This resizing is achieved by the function *MakeSquared* implemented in data/utils.py. The MSE Loss along with the Peak Signal to Noise Ratio (PSNR) for training/validation for the Bilinear+MSE configuration is shown in the figure above. (We do not present the loss curves for all 4 configurations to save space and avoid redundancy.)
Note that we use the MSE (and not its square root, the RMSE) in the loss function since it is smooth and convex in the predicted pixel values, which makes it well suited to gradient-based optimization with back-propagation.
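The combined objective Loss = MSE + lambda * (1 - SSIM) can be illustrated with a small pure-Python sketch. This is a simplification under stated assumptions, not the training code: it operates on flat lists of pixel values in [0, 1], and it computes the SSIM from whole-image statistics (a single window), whereas the trainable SSIM used for training is windowed and operates on tensors. The function names and the toy pixel values are hypothetical.

```python
def mse(pred, target):
    """Mean squared error between two equally sized pixel lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def global_ssim(pred, target, data_range=1.0):
    """SSIM from whole-image statistics (one window), using the standard constants."""
    n = len(pred)
    mu_p, mu_t = sum(pred) / n, sum(target) / n
    var_p = sum((p - mu_p) ** 2 for p in pred) / n
    var_t = sum((t - mu_t) ** 2 for t in target) / n
    cov = sum((p - mu_p) * (t - mu_t) for p, t in zip(pred, target)) / n
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    return ((2 * mu_p * mu_t + c1) * (2 * cov + c2)) / (
        (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2)
    )


def combined_loss(pred, target, lam):
    """Loss = MSE + lam * (1 - SSIM); lam = 0 recovers the MSE-only configuration."""
    return mse(pred, target) + lam * (1.0 - global_ssim(pred, target))


target = [0.2, 0.4, 0.6, 0.8]      # toy "normal light" pixels
dim = [0.5 * t for t in target]    # toy prediction that is too dark

print(combined_loss(target, target, lam=0.5))  # close to zero for a perfect prediction
print(combined_loss(dim, target, lam=0.0))     # MSE only
print(combined_loss(dim, target, lam=0.5))     # MSE plus the SSIM penalty
```

Since the SSIM of an imperfect prediction is below 1, any positive lambda adds a penalty on top of the MSE, and the penalty vanishes as the prediction approaches the ground truth.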
Metrics
Along with qualitative visualization, we rely on 2 commonly used quantitative metrics, the Mean Square Error (MSE) and the Structural Similarity Index (SSIM), for evaluating our results. The MSE is the pixel-wise mean square error between the true and predicted normal light images. A lower value of MSE indicates better similarity between two images. The SSIM measures the perceptual similarity between two images and gives a more general idea of how similar two images appear to a viewer. For natural images, the SSIM typically lies between 0 and 1, with a higher value indicating better structural similarity. We use scikit-image's (skimage) implementation of SSIM and MSE in our work.
In order to choose an optimal contrast limit value for CLAHE, we perform a hyper-parameter search over the 485 training images from the LoL paired image dataset curated by Wei et al. We fix the grid (window) size for CLAHE to 2 and plot the mean RMSE and SSIM across all images as a function of the clip limit, aiming to pick the clip limit for which the RMSE is lowest while the SSIM is highest. The plots are shown in the figure above. The code to generate these plots, along with the CLAHE algorithm, is included in the file clahe.py. In particular, it can be seen that with the window size fixed, the RMSE has a minimum at a clip limit of 30, while the SSIM falls steadily as the clip limit increases. Since the two criteria do not share an optimum, we select the RMSE minimum, 30, as the value of the contrast threshold and apply CLAHE to all the images from the evaluation subset of the LoL dataset.
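The search procedure described above amounts to a one-dimensional grid search. The sketch below shows its shape in pure Python under loud assumptions: `select_clip_limit` and `toy_enhance` are hypothetical names, `toy_enhance` is a simple brightness gain standing in for CLAHE so that the example runs without image data, and the pixel values are made up. It does not reproduce clahe.py or the reported curves.

```python
import math


def rmse(pred, target):
    """Root mean squared error between two equally sized pixel lists."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))


def select_clip_limit(pairs, enhance, clip_limits):
    """Return the clip limit with the lowest mean RMSE, plus the full curve."""
    curve = {}
    for limit in clip_limits:
        errors = [rmse(enhance(low, limit), normal) for low, normal in pairs]
        curve[limit] = sum(errors) / len(errors)
    best = min(curve, key=curve.get)
    return best, curve


# Stand-in for CLAHE (NOT the real algorithm): a gain that grows with the
# "clip limit" and saturates at 1.0.
def toy_enhance(pixels, limit):
    return [min(1.0, p * (1.0 + 0.1 * limit)) for p in pixels]


# Made-up flattened (low light, normal light) pairs with values in [0, 1].
pairs = [
    ([0.05, 0.10, 0.20], [0.20, 0.40, 0.80]),
    ([0.02, 0.15, 0.25], [0.08, 0.60, 1.00]),
]
best, curve = select_clip_limit(pairs, toy_enhance, clip_limits=[10, 20, 30, 40, 50])
print(best)
for limit, value in curve.items():
    print(limit, round(value, 4))
```

In practice `enhance` would wrap the real CLAHE call with a fixed grid size, and the same loop could record the mean SSIM alongside the mean RMSE to produce both curves.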
Results
In order to evaluate these different techniques, we use the test subset of the LOw-Light (LOL) Dataset curated by Wei et al. that is available publicly at this HTTPS URL. This dataset consists of paired low light and normal light images, which we use to compute the metrics described above.
Discussion
The gallery of images above displays the qualitative results on 4 sample images from the test set for CLAHE, ENGAN and the various configurations of the auto-encoder architecture. We train ENGAN on the original dataset curated by the authors, consisting of 914 low light and 1016 normal light unpaired images, and apply the trained model to the same evaluation subset of the LoL dataset. It can be seen that ENGAN, despite being an unsupervised technique, performs best across all methods. To obtain the results for CLAHE, we select a window size of 2 and a clip limit (contrast threshold) of 30 based on the hyper-parameter search described in the above sub-section. It is evident that CLAHE fails to fully recover the color component of the images while enhancing the contrast and lighting conditions. One explanation for this could be that CLAHE works by equalizing only the luminosity (L) channel of the L*a*b* color space version of the low light image and then combines this enhanced L channel with the unchanged a* and b* channels to construct the normal light version. Thus, it does not adjust the color channels during enhancement, which results in poor color reconstruction. At the same time, since CLAHE simply adjusts the luminosity, it is able to recover the details of the original image with considerable fidelity, one aspect where the auto-encoder methods struggle. The auto-encoder variants perform slightly better than CLAHE while being somewhat inferior to ENGAN. While the transpose convolution based architectures (with MSE and MSE+SSIM losses) tend to generate slightly sharper images compared to their bilinear counterparts, there does not seem to be any considerable difference between these variants. There is, however, one subtle artifact that plagues the transpose convolution based models, which we discuss in the analysis section below.
In the models with transpose convolutions, when trained to minimize the MSE while also maximizing the SSIM, the resulting images (last row) seem considerably sharper, with fewer black patch-like artifacts, compared to the model trained with only the MSE. This highlights the importance of the SSIM, which enforces not a pixel-wise match but a holistic structural similarity between the prediction and the ground truth. It is interesting to observe that all the CNN based models are able to recover minor details (such as the digital clock time reading) with considerable fidelity during enhancement.
Failure Cases
In the results of the auto-encoder architecture with bilinear upsampling trained only to minimize the MSE loss, we observe certain unnatural pixelated artifacts in the test set images. We experiment with multiple hyper-parameter combinations in an attempt to recover from these issues but to no avail. We believe that simply enforcing a pixel-wise MSE constraint on the images does not enable the model to learn a generalizable enhancement function. Thus, the model fails to recover details that seem to be outlier regions in the context of the data it was trained on. This pixelation is also observed in a few instances of the other auto-encoder variants albeit to a relatively lesser degree. Additionally, it can also be observed that in the last image above, the digital clock time is barely visible indicating a poor performance in recovering the details.
Analysis
In the results from the transpose convolution based models, an interesting artifact was observed. In all the images above, a subtle, repetitive square-like pattern can be seen. The images above are cropped versions of the predicted normal light images to make it easier to observe the artifacts. This type of artifact has previously been observed when using transpose convolutions in multiple contexts. Odena et al. from Google Brain describe them as checkerboard-like patterns that arise because the transpose convolution operation produces uneven overlap: some cells of the output feature map receive contributions from more filter positions than others. They aver that deconvolution overlap, random initialization and loss functions are the main causes behind these artifacts and that an alternative upsampling strategy such as bilinear interpolation could be an easy way to overcome them. Indeed, these artifacts are not observed in our experiments with bilinear upsampling. Aitken et al. propose an initialization method for sub-pixel convolution, known as convolution NN resize, that helps tackle the issue and could be used in future work as an alternative to the vanilla transpose convolution.
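The uneven overlap described by Odena et al. can be counted directly. The sketch below is a one-dimensional illustration with no padding: for each output cell of a transpose convolution, it counts how many (input position, kernel tap) pairs write into it. The function name and the kernel sizes are hypothetical choices for this example and are not taken from our models.

```python
def overlap_counts(input_len, kernel, stride):
    """Number of kernel taps landing on each output cell of a 1-D transpose convolution."""
    out_len = (input_len - 1) * stride + kernel
    counts = [0] * out_len
    for i in range(input_len):
        for k in range(kernel):
            # Input position i spreads its value over cells i*stride .. i*stride + kernel - 1.
            counts[i * stride + k] += 1
    return counts


print(overlap_counts(4, kernel=3, stride=2))  # [1, 1, 2, 1, 2, 1, 2, 1, 1]
print(overlap_counts(4, kernel=4, stride=2))  # [1, 1, 2, 2, 2, 2, 2, 2, 1, 1]
```

When the kernel size is not divisible by the stride (3 and 2 here), interior cells alternate between one and two contributions, which in two dimensions gives the checkerboard pattern. A kernel size divisible by the stride evens out the interior overlap, although the learned weights can still produce such patterns.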
Quantitative Results
To obtain quantitative results, we compute the mean SSIM and MSE between the true normal light images and the predictions over the evaluation (test) dataset (LoL) for the different approaches. The table below shows the mean MSE and SSIM values for CLAHE, ENGAN and the different variations of the U-Net auto-encoder. The SSIM is computed with a window size of 11 (default) using the skimage library. The mean MSE is computed after scaling the images to the range 0 to 1.
Method | MSE | SSIM |
CLAHE | 0.04 | 0.4455 |
ENGAN | 0.024 | 0.675 |
Bilinear with MSE Loss | 0.013 | 0.517 |
Bilinear with MSE+SSIM Loss | 0.015 | 0.553 |
Transpose Conv with MSE Loss | 0.013 | 0.557 |
Transpose Conv with MSE+SSIM Loss | 0.016 | 0.538 |
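Because the MSE values are computed on images scaled to the range 0 to 1, they can be hard to read at a glance. The sketch below converts the mean MSE values from the table into an approximate error in 8-bit gray levels. The helper name `to_rmse_8bit` is hypothetical, and the conversion is only indicative: the square root of a mean MSE is not the same as the mean of per-image RMSE values.

```python
import math

# Mean MSE values copied from the table above (images scaled to [0, 1]).
mean_mse = {
    "CLAHE": 0.04,
    "ENGAN": 0.024,
    "Bilinear with MSE Loss": 0.013,
    "Bilinear with MSE+SSIM Loss": 0.015,
    "Transpose Conv with MSE Loss": 0.013,
    "Transpose Conv with MSE+SSIM Loss": 0.016,
}


def to_rmse_8bit(mse_unit_scale):
    """Convert an MSE on the [0, 1] scale into an RMSE in 8-bit gray levels."""
    return 255 * math.sqrt(mse_unit_scale)


for method, value in mean_mse.items():
    print(f"{method}: RMSE = {math.sqrt(value):.3f} (~{to_rmse_8bit(value):.0f} gray levels)")
```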
It is interesting to observe that despite the qualitative differences between the methods being significant, quantitatively they do not differ as much. For instance, the mean SSIM value is 0.4455 for CLAHE and 0.675 for ENGAN, despite the ENGAN output being far superior to that of CLAHE. Similarly, the difference in MSE is not significant either. The SSIM is a measure of how structurally similar the images are, while the MSE measures a finer, pixel-wise similarity. Neither of these measures, however, conveys a complete idea of the similarity between the images being compared. In general, for image-to-image translation problems, it is often difficult to rely solely on quantitative metrics for performance evaluation, and a qualitative examination of the output is a must. Considering the SSIM for the different methods, it can be seen that ENGAN has the highest value of 0.675, with CLAHE being the lowest and the different auto-encoder variants in between. This corroborates the qualitative observations from the previous sub-section. Comparing the different auto-encoder based variants, it can be seen that they do not differ by a significant amount in terms of either MSE or SSIM. In terms of SSIM, the model using transpose convolution for upsampling and trained to minimize the MSE loss performs best among them with an SSIM of 0.557, while the simple scheme of bilinear upsampling with MSE loss performs worst with an SSIM of 0.517. This low value also supports our observation of certain pixelations in the predicted output for the bilinear + MSE case. In addition, the checkerboard-like artifacts observed in the case of transpose convolutions do not seem to have a significant effect on the metrics. This underscores the idea that quantitative metrics do not convey the complete picture of model performance for image-to-image translation problems.
Conclusion and Future Work
This report has described a qualitative as well as quantitative comparison between 3 different approaches to the low light image enhancement problem. We explore CLAHE, a conventional computer vision algorithm based on histogram equalization, and ENGAN, a state-of-the-art unsupervised image enhancement technique. In addition, we consider multiple variants of the U-Net based auto-encoder architecture with transpose convolutions and residual skip connections, trained to minimize the MSE Loss while also (optionally) maximizing the SSIM metric. Qualitative results indicate that ENGAN, when trained with ~1k images, performs best among all techniques, with the images closely resembling the ground truth normal light versions. The results from CLAHE do not appear to recover the color aspects of the original image with full fidelity. Among the 4 different variants of the auto-encoder based models, quantitative comparisons of MSE and SSIM do not indicate a significant difference. Qualitatively, the transpose convolution based model trained to minimize MSE while maximizing SSIM performs best, with the fewest artifacts. On the other hand, using bilinear upsampling along with an MSE Loss leads to certain pixelated artifacts, as discussed in the above section. The other 2 variants, bilinear upsampling with MSE+SSIM loss and transpose convolution with MSE Loss, have a few dark patches in the reconstructed images. In models that use transpose convolutions for upsampling, we observe certain checkerboard-like artifacts, as discussed above. Aside from using bilinear upsampling in lieu of transpose convolutions, the convolution NN resize initialization scheme could be used in future work to tackle these checkerboard artifacts. As future work, various other approaches to the image enhancement problem, such as the "Deep Retinex Decomposition" work by Wei et al. [5], could be benchmarked against these methods to better understand the scenarios where a particular architecture or loss function performs better.
References
[1] Ronneberger O., Fischer P., Brox T. (2015) U-Net: Convolutional Networks for Biomedical Image Segmentation. In: Navab N., Hornegger J., Wells W., Frangi A. (eds) Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. MICCAI 2015. Lecture Notes in Computer Science, vol 9351. Springer, Cham. https://doi.org/10.1007/978-3-319-24574-4_28
[2] Jiang, Yifan, et al. "Enlightengan: Deep light enhancement without paired supervision." arXiv preprint arXiv:1906.06972 (2019).
[3] He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
[4] Pisano ED, Zong S, Hemminger BM, DeLuca M, Johnston RE, Muller K, Braeuning MP, Pizer SM. Contrast limited adaptive histogram equalization image processing to improve the detection of simulated spiculations in dense mammograms. J Digit Imaging. 1998 Nov;11(4):193-200. doi: 10.1007/BF03178082. PMID: 9848052; PMCID: PMC3453156.
[5] Wei, Chen, et al. "Deep retinex decomposition for low-light enhancement." arXiv preprint arXiv:1808.04560 (2018).
[6] Z. Wang, A. C. Bovik, H. R. Sheikh and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, Apr. 2004.
[7] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2 (NIPS'14). MIT Press, Cambridge, MA, USA, 2672–2680.
[8] URL: https://docs.opencv.org/master/d5/daf/tutorial_py_histogram_equalization.html
[9] Visdom tool by Facebook Research. URL: https://github.com/facebookresearch/visdom
[10] PyTorch library. URL: https://pytorch.org
[11] Radford, A., Metz, L. & Chintala, S. (2015). Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks.
[12] Aitken, A., Ledig, C., Theis, L., Caballero, J., Wang, Z., & Shi, W. (2017). Checkerboard artifact free sub-pixel convolution: A note on sub-pixel convolution, resize convolution and convolution resize. ArXiv, abs/1707.02937.
[13] Odena, et al., "Deconvolution and Checkerboard Artifacts", Distill, 2016.