Verifying Saliency Maps using a Biased Vision Model


Vision models can predict what is visible in an image. This can be useful for all kinds of applications. Google Lens uses it to identify objects using your smartphone. This is an example of a fun application. It can also be applied to more critical use cases such as autonomous cars or within the medical domain to help classify body scans.

These are serious domains where flaws can be catastrophic. Models in such settings need to be robust and reliable. Often, there is also a need to inspect and understand why a vision model made a certain prediction. For instance, doctors may want an explanation alongside a classification.

One area of research that is currently blossoming is Explainable AI. The idea behind this stream of research is to make the predictions of deep learning models interpretable. This means that, alongside the probabilities, a human-interpretable explanation of why the prediction was made is also provided.

In this article, I will explain saliency maps as an explanation technique for vision models. A saliency map is a heat map that shows which pixels were relevant in the model’s decision-making process. This can be useful for developers to debug the model, and one can even imagine end users such as doctors using these maps to interpret a classification.

The left image of a cow is used for making a prediction. The right image is a saliency map created with the Noise Tunnel algorithm. Black pixels are important for the classifier and white pixels are irrelevant.

I experimented with several methods to create saliency maps to find out how reliable this method is. In short: not very reliable.

A brief technical introduction

A common fallacy when talking about AI is the use of “wishful mnemonics” (Mitchell, 2021). In AI, models are often described as learning, understanding and recognising. There is a learning rate, there are neurons and models can have attention mechanisms. Also, when talking about vision models, terms like visual understanding or perception are used. We anthropomorphise a technology and project human concepts onto it. This can lead to a wrong understanding of what is happening under the hood and, as such, to wrong expectations of what AI can deliver.

When a vision model is trained, it identifies the visual patterns within the training set that best identify the object described. It is important that we focus for a second on the word “best”.

If a human were to learn two new types of animal they had never seen before, they would use existing concepts or representations such as fur, number of legs or other unique attributes to learn how to “best” identify the animals. A clean model (not pre-trained), on the other hand, does not have existing representations. When the model tries to identify the “best” visual patterns, “best” refers to whatever strategy minimises the loss function.

The implication is that the model might learn a strategy that seems to work but is not aligned with how humans perceive. This alone is perhaps not critical, but it becomes so when the model uses a strategy that only works for data in the training set and fails on real-world data. In technical terms, the model is said to use a “shortcut” (Geirhos et al., 2020), which may lead to a loss of robustness.

When that happens, models become unreliable, as they might work on some images but fail on others. We begin to lose our trust. The result is that the technology is not well adopted.

As an example, let’s assume we have trained a model that can discriminate cows from camels. Below are some sample images.

Images of camels and cows. Observe the bias in the data. Camels tend to have a yellow background and cows a green background.

What you notice is that, from a pure pixel perspective, cows and camels in this dataset are probably best separated using colour alone. Cows tend to be on green backgrounds and camels on yellow backgrounds. Most architectures are built to find visual patterns that maximise the separation of the classes to be identified. What would then happen with an image like the one below?

A cow in a desert. Professionally photoshopped by me.

The important take-away is that vision models do not learn or perceive like humans do. This is a “wishful mnemonic”. Instead, they are lazy and tend to find strategies that are simple and do not use human concepts or representations.
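
To make the cow and camel example concrete, here is a minimal sketch of a classifier that takes the colour shortcut. Everything in it is an assumption for illustration: the colour values, the 8x8 synthetic images and the nearest-centroid classifier on mean colour are hypothetical stand-ins, not a real vision model or the dataset shown above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical colours (RGB in [0, 1]); chosen only for illustration.
GREEN, YELLOW = np.array([0.2, 0.7, 0.2]), np.array([0.9, 0.8, 0.3])
COW, CAMEL = np.array([0.1, 0.1, 0.1]), np.array([0.6, 0.4, 0.2])

def make_image(background, animal):
    """8x8 image: flat background with a 4x4 'animal' patch and slight noise."""
    img = np.tile(background, (8, 8, 1)).astype(float)
    img[2:6, 2:6] = animal
    return np.clip(img + rng.normal(0, 0.02, img.shape), 0, 1)

def mean_colour(img):
    """Average RGB value over the whole image, background included."""
    return img.reshape(-1, 3).mean(axis=0)

# Biased training set: cows always on green, camels always on yellow.
cows = [make_image(GREEN, COW) for _ in range(50)]
camels = [make_image(YELLOW, CAMEL) for _ in range(50)]
centroids = {
    "cow": np.mean([mean_colour(i) for i in cows], axis=0),
    "camel": np.mean([mean_colour(i) for i in camels], axis=0),
}

def predict(img):
    """Nearest-centroid classifier that only ever sees the mean colour."""
    feature = mean_colour(img)
    return min(centroids, key=lambda c: np.linalg.norm(feature - centroids[c]))

train_accuracy = np.mean(
    [predict(i) == "cow" for i in cows] + [predict(i) == "camel" for i in camels]
)

# The photoshopped case: a cow patch on a yellow (desert) background.
cow_in_desert = make_image(YELLOW, COW)
print(train_accuracy, predict(cow_in_desert))
```

On the biased data, colour alone separates the classes, so the rule looks flawless. The cow in the desert is then labelled by its background, because the background dominates the mean colour.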

An article in Nature Machine Intelligence assessed 2,212 scientific studies on COVID-19 prediction and concluded that lack of explainability was one of the reasons why none of the proposed solutions could be accepted in the medical domain (Roberts et al., 2021). The authors described an interesting case in which a model learned a “shortcut”. The model in question was paying attention to visual patterns introduced by the imaging technique. It turned out that chest scans made with mobile devices had different markings. Such scans are often used in cases where the patient can no longer go to the hospital. This created a simple separation between mild and severe cases of COVID-19.

There are many known cases in which vision models rely on hard-to-spot watermarks. This inspired me to create my own experiment and purposefully introduce watermarks. But first, I want to provide some intuition for how these maps are created.

The intuition behind saliency maps

Saliency maps highlight the relevance of each pixel with respect to the predicted outcome. They thus indicate which pixels are more relevant for the model than others. As a general introduction to saliency maps, I distinguish between perturbation-based and computation-based techniques.

In perturbation-based techniques, the idea is to make several versions of the image. In every version, you introduce a controlled perturbation, for instance hiding a piece of the image with a mask. If you do this in a structured fashion and monitor how the predictions and probabilities change, you begin to create data that helps interpret the relevance of parts of the image. The popular library LIME uses such a technique. It can be computationally expensive, but it has the benefit of being easy to understand.
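
The masking idea can be sketched in a few lines. This is a simplified occlusion loop, not LIME and not any library’s implementation. The `toy_score` function is a hypothetical stand-in for a trained model, and the window size and black baseline are assumptions.

```python
import numpy as np

def toy_score(image):
    """Hypothetical stand-in for a model's 'dog' probability.
    It only looks at the bottom-right 4x4 corner (the 'watermark')."""
    logit = 8.0 * image[-4:, -4:].mean() - 4.0
    return 1.0 / (1.0 + np.exp(-logit))

def occlusion_map(image, score_fn, window=4, baseline=0.0):
    """Slide a square mask over the image and record the score drop per position."""
    h, w = image.shape
    base_score = score_fn(image)
    heat = np.zeros((h // window, w // window))
    for r in range(0, h - window + 1, window):
        for c in range(0, w - window + 1, window):
            occluded = image.copy()
            occluded[r:r + window, c:c + window] = baseline  # hide this patch
            heat[r // window, c // window] = base_score - score_fn(occluded)
    return heat

image = np.full((16, 16), 0.3)
image[-4:, -4:] = 1.0  # bright "watermark" in the corner
heat = occlusion_map(image, toy_score)
hottest = np.unravel_index(heat.argmax(), heat.shape)
print(hottest)
```

Each cell of `heat` answers one question: how much does the score fall when this patch is hidden? The cost is one forward pass per patch, which is why the approach gets expensive on large images with small windows.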

Computation-based techniques do not take this experimental approach, but instead compute the gradient of the output (classification or likelihood) with respect to the input (the image). To provide some intuition: the idea is to find out whether changing a pixel increases or decreases the prediction score. Similar to back-propagation, this can be computed with the gradient over the whole image.
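
A minimal sketch of the gradient idea, assuming a toy logistic model over 64 pixels in place of a deep network (the weights are random and purely hypothetical). For this model the gradient has a closed form, so no autograd library is needed. A small finite-difference check confirms that the gradient predicts what happens when one pixel is nudged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model: logistic regression over flattened 8x8 pixels.
weights = rng.normal(0, 0.5, size=64)
bias = 0.1

def score(x):
    """Predicted probability for the positive class."""
    return 1.0 / (1.0 + np.exp(-(weights @ x + bias)))

def gradient_saliency(x):
    """Analytic gradient of the score with respect to every input pixel."""
    s = score(x)
    return s * (1.0 - s) * weights

x = rng.uniform(0, 1, size=64)
saliency = gradient_saliency(x)

# Sanity check: nudge one pixel and compare with the gradient's prediction.
pixel, eps = 10, 1e-5
nudged = x.copy()
nudged[pixel] += eps
finite_difference = (score(nudged) - score(x)) / eps

# Magnitude per pixel, reshaped into a heat map.
saliency_map = np.abs(saliency).reshape(8, 8)
print(saliency[pixel], finite_difference)
```

The sign of each entry says whether brightening that pixel raises or lowers the score, and the magnitude is what typically ends up in the heat map. With a real network, the same quantity would come from back-propagating the output score to the input.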

This taxonomy is a good start for a bird’s-eye perspective but fails to capture the intricacies of the many algorithms available to create saliency maps. And there are indeed many algorithms! This is a frequent criticism of saliency maps: you don’t know which algorithm to pick, which becomes a source of distrust.

The experiment

In order to see the effectiveness of saliency maps, I created a biased dataset. I used a dataset from Kaggle containing 25,000 images of dogs and cats (Dogs vs. Cats, 2014).

To insert a bias into the dataset, I placed a human-visible watermark on 60% of the images of dogs. Not by hand but with some Python.

60% of the images of the dogs contained a watermark like the one shown on the left.

The images of the cats did not contain a watermark.

No fun memes in this dataset. And no watermarks on images of cats.
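
The watermarking step could look roughly like the sketch below. It is not the script used for this article. The stamp (a white rectangle), its position, the opacity and the random-noise stand-in images are all assumptions made so the example runs on its own.

```python
import numpy as np

rng = np.random.default_rng(42)

def add_watermark(image, stamp, top, left, opacity=0.8):
    """Blend a small stamp into a copy of the image at (top, left)."""
    out = image.astype(float).copy()
    h, w = stamp.shape[:2]
    region = out[top:top + h, left:left + w]
    out[top:top + h, left:left + w] = (1 - opacity) * region + opacity * stamp
    return out

# Stand-in data: 10 "dog" and 10 "cat" images of 32x32 RGB noise.
dogs = [rng.uniform(0, 1, (32, 32, 3)) for _ in range(10)]
cats = [rng.uniform(0, 1, (32, 32, 3)) for _ in range(10)]
stamp = np.ones((6, 12, 3))  # hypothetical white rectangle as the watermark

# Pick exactly 60% of the dog images at random; cats are left untouched.
n_marked = int(round(0.6 * len(dogs)))
marked_idx = set(rng.choice(len(dogs), size=n_marked, replace=False).tolist())
biased_dogs = [
    add_watermark(img, stamp, top=24, left=18) if i in marked_idx else img
    for i, img in enumerate(dogs)
]
print(sorted(marked_idx))
```

Keeping the stamp in a fixed position, as assumed here, makes the shortcut as easy as possible for a model to find. It also means the watermark region is known exactly when inspecting saliency maps later.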

Following the theory described in the introduction that vision models are lazy and choose the simplest visual pattern, I would expect that the model uses the watermark as the main visual property to differentiate between cats and dogs.

To test this, I used a ResNet-38 architecture (not pre-trained) to train a vision model on this biased dataset. The overall accuracy of the model is 89%, which is close to the top-performing models on the non-biased version of the dataset. This result makes sense: it is not 100% because the watermark is not on all images. As such, the watermark cannot be the only predictor of the class dog. The model must thus also find other visual representations to identify the class.

Accuracy is often calculated by randomly splitting the dataset into a training and a test set. After training the model, you use the test set to see how good the model’s predictions are. This experiment also shows the flaw in this approach: the bias is equally present in the training and test sets, so test accuracy is hardly a good predictor of robustness in the real world.
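
A small simulation shows why a random split cannot expose this bias. It works on labels only, with no images or model. The even split between dogs and cats and the 80/20 ratio are assumptions; the 25,000 images and the 60% watermark rate come from the setup above.

```python
import numpy as np

rng = np.random.default_rng(0)

# 25,000 images, assumed to be half dogs and half cats; 60% of dogs watermarked.
n_per_class = 12_500
is_dog = np.concatenate([np.ones(n_per_class, bool), np.zeros(n_per_class, bool)])
has_watermark = np.zeros(2 * n_per_class, bool)
has_watermark[: round(0.6 * n_per_class)] = True  # only dogs carry the watermark

# Random 80/20 split (the ratio is an assumption).
order = rng.permutation(len(is_dog))
cut = int(0.8 * len(order))
train_idx, test_idx = order[:cut], order[cut:]

def watermark_rate_among_dogs(idx):
    """Share of dog images in the given subset that carry the watermark."""
    dog_idx = idx[is_dog[idx]]
    return has_watermark[dog_idx].mean()

train_rate = watermark_rate_among_dogs(train_idx)
test_rate = watermark_rate_among_dogs(test_idx)
print(round(train_rate, 3), round(test_rate, 3))
```

Both rates land close to 0.6, so a model that leans on the watermark is rewarded on the test set just as it was during training. Exposing the shortcut needs a test set where the correlation is broken, such as cats carrying the watermark.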

The hypothesis is that the model can successfully differentiate between images of cats and dogs that don’t have a watermark. But if a watermark is visible, it would lean toward the dog class.

This can be demonstrated in a very non-mathy way.

First of all, images of cats and dogs without any watermark are actually well identified. Below are some examples with which the model has no problems. The first part of the hypothesis is confirmed.

The model thinks these are cats. I agree.

To prove that the watermark is a strong indicator to predict the class dog, consider the following two images.

The first image is of a cat without a watermark. It is correctly classified as cat.

Model thinks this is a cat. You are right, model!

The second image is the same cat but with the dog watermark. The model thinks this is a dog.

Model thinks this is a dog. The only difference is the watermark.

These two images are identical except for the presence of the watermark. The model is even fairly confident that the second image is a dog: p=0.99. I created several such images, and they all show the same result: if a watermark is visible, the model will predict a dog.
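
The paired comparison can be turned into a small, repeatable check. In the sketch below, `shortcut_model` is a hypothetical toy that stands in for the trained network, and `stamp_watermark` uses an assumed 4x4 corner patch. With a real model, `predict_dog` would be whatever function returns the dog probability for an image.

```python
import numpy as np

rng = np.random.default_rng(1)

def stamp_watermark(image):
    """Hypothetical watermark: a bright 4x4 patch in the bottom-right corner."""
    out = image.copy()
    out[-4:, -4:] = 1.0
    return out

def shortcut_model(image):
    """Toy stand-in for the trained network: returns P(dog).
    It leans entirely on the corner patch, like a model that learned the shortcut."""
    logit = 10.0 * (image[-4:, -4:].mean() - 0.75)
    return 1.0 / (1.0 + np.exp(-logit))

def flip_rate(predict_dog, cat_images, threshold=0.5):
    """Fraction of cat images whose label flips from cat to dog after stamping."""
    flips = 0
    for img in cat_images:
        before = predict_dog(img) >= threshold
        after = predict_dog(stamp_watermark(img)) >= threshold
        flips += int((not before) and after)
    return flips / len(cat_images)

cats = [rng.uniform(0.0, 0.6, (16, 16)) for _ in range(20)]
rate = flip_rate(shortcut_model, cats)
print(rate)
```

Since each pair differs only in the stamped patch, a flipped prediction can be blamed on that patch alone.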

This is very useful because we now know which visual region explains why the model predicts a certain class. On images of a cat with a watermark, it must be the watermark!

Part two of the hypothesis is also experimentally confirmed.

To sum up, the model can separate between dogs and cats. However, if a watermark is visible on an image of a cat, it will automatically say it is a dog.

The goal now is to create saliency maps and see what the heat map says where the relevant pixels are located. These were constructed with Captum, a library maintained by the PyTorch team. Let’s cut to the chase:

The saliency maps from the algorithms I tried did not identify the watermark as the explanation.

The first column shows the image of the cat with and without the watermark. Remember, the first image is correctly predicted as cat and the second is predicted as dog. The second column shows the saliency map constructed with a computation-based technique (integrated gradients) and the third column one constructed with a perturbation-based technique (occlusion).
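
For readers unfamiliar with integrated gradients, here is a from-scratch sketch of the idea on a toy logistic model. It is not Captum’s implementation and not the code behind the figures. The model weights, the all-black baseline and the number of steps are assumptions. The method averages the gradient along a straight path from a baseline image to the actual image and scales it by the difference between the two.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0, 0.5, size=64)  # hypothetical toy model parameters

def score(x):
    """Predicted probability of the toy logistic model."""
    return 1.0 / (1.0 + np.exp(-(weights @ x)))

def gradient(x):
    """Gradient of the score with respect to the input pixels."""
    s = score(x)
    return s * (1.0 - s) * weights

def integrated_gradients(x, baseline, steps=200):
    """Average the gradient along the straight path from baseline to x,
    then scale by the input difference (midpoint Riemann sum)."""
    alphas = (np.arange(steps) + 0.5) / steps
    path_grads = [gradient(baseline + a * (x - baseline)) for a in alphas]
    return (x - baseline) * np.mean(path_grads, axis=0)

x = rng.uniform(0, 1, size=64)
baseline = np.zeros_like(x)  # all-black image as the reference point
attributions = integrated_gradients(x, baseline)

# Completeness: attributions should sum to the change in score.
gap = attributions.sum() - (score(x) - score(baseline))
print(abs(gap))
```

The completeness check at the end is a property of the method: the per-pixel attributions add up to the difference between the score for the image and the score for the baseline. The choice of baseline is therefore part of the explanation, and a different baseline can produce a different map.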

What becomes visible is that the saliency map does not highlight the watermark in the incorrectly predicted image.

Captum offers several methods and the result from the algorithm referred to as “saliency” produced the following image:

Looks like it pays some attention to the watermark. But why the other pixels?

In this example, the saliency map does light up at the watermark. One can debate why the other pixels are highlighted. More critical is that the method produced highly noisy results on other images. It seems very inconsistent.
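
Because the watermark location is known, judgments like “lights up at the watermark” or “noisy” can be replaced by a number. One possible measure, which I did not use for the figures above, is the share of total attribution that falls inside the watermark region. The maps and the mask below are hypothetical.

```python
import numpy as np

def attribution_mass_in_region(attribution, region_mask):
    """Share of total absolute attribution that falls inside the masked region."""
    magnitude = np.abs(attribution)
    total = magnitude.sum()
    return float(magnitude[region_mask].sum() / total) if total > 0 else 0.0

# Hypothetical 16x16 maps; the watermark occupies the bottom-right 4x4 corner.
watermark_mask = np.zeros((16, 16), dtype=bool)
watermark_mask[-4:, -4:] = True

focused_map = np.zeros((16, 16))
focused_map[-4:, -4:] = 1.0        # all relevance on the watermark
diffuse_map = np.ones((16, 16))    # relevance spread evenly over the image

focused_share = attribution_mass_in_region(focused_map, watermark_mask)
diffuse_share = attribution_mass_in_region(diffuse_map, watermark_mask)
print(focused_share, diffuse_share)
```

A map that ignores the watermark scores close to the region’s share of the image area (here 16 of 256 pixels), while a map that explains the biased prediction should score far higher. Comparing this number across algorithms and images would make the inconsistency measurable.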

Final thoughts

The promise of saliency maps is that they allow humans to get a glimpse of what the image classifier is paying attention to. The examples brought forward in academic papers are very convincing in arguing the necessity and usefulness of this approach. However, in my own experimentation, the results were not convincing due to high inconsistency. I could not pinpoint a single algorithm that was consistently good at explaining the image classifier. It is disappointing that no method could convincingly pick up on the bias.

The observation made within my experiment is not unique. It aligns with the result from Adebayo et al. (2018) who conclude on saliency maps that some algorithms behave like edge-detectors, not using any representations the model contains.

However, I am certain this problem can be solved in the future. For instance, Layer-wise Relevance Propagation (LRP) is a method that promises more robustness.

Another disadvantage of the method is the reduction of a possibly multi-dimensional representation to a single colour value per pixel. For instance, looking at the saliency maps of the cats and dogs, we still do not know what the model is looking at. Is it shape? Colour? Texture? Saliency maps therefore reduce representations to areas of attention but do not help deduce the meaning of the representation they carry.

And then there is the human interpretation. And this is where it gets tricky. Humans have what is known as confirmation bias: we focus on evidence that matches our hypothesis and reject the rest. This also means several readers of the same saliency map can come to different conclusions. This is especially problematic if the results are noisy because noise allows for many interpretations.

This seems more fundamental in nature. Interpretation of explanations and how it affects human decision making requires much more research. This critique is not unique to saliency maps, though.

There are further techniques to explain vision models. An interesting method is Concept Activation Vectors. I also refer the interested reader to the work of Cynthia Rudin’s team. A different approach is concept bottlenecks, where vision models are made interpretable by design. These approaches seem more suitable for the medical domain.

To conclude, the saliency maps I created could not successfully help explain a purposefully designed biased model. I also find that saliency maps are rather hard to interpret and subject to human interpretation.

I am curious to see how LRP performs on my dataset. I plan to update this article with those results in the future.


Dogs vs. Cats. (2014). Kaggle.

Geirhos, R., Jacobsen, J.-H., Michaelis, C., Zemel, R., Brendel, W., Bethge, M., & Wichmann, F. A. (2020). Shortcut Learning in Deep Neural Networks.

Mitchell, M. (2021). Why AI is harder than we think. arXiv preprint arXiv:2104.12871.

Roberts, M., Driggs, D., Thorpe, M., Gilbey, J., Yeung, M., Ursprung, S., Aviles-Rivero, A. I., Etmann, C., McCague, C., Beer, L., Weir-McCall, J. R., Teng, Z., Gkrania-Klotsas, E., Rudd, J. H. F., Sala, E., & Schönlieb, C.-B. (2021). Common pitfalls and recommendations for using machine learning to detect and prognosticate for COVID-19 using chest radiographs and CT scans. Nature Machine Intelligence, 3(3), 199–217.