Mathematical derivations and an open-source library to compute receptive fields of convnets, enabling the mapping of extracted features back to input signals.
While deep neural networks have overwhelmingly established state-of-the-art
results in many artificial intelligence problems, they can still be
difficult to develop and debug.
Recent research on deep learning understanding has focused on feature visualization, theoretical guarantees, model interpretability, and generalization.
In this work, we analyze deep neural networks from a complementary perspective, focusing on convolutional models. We are interested in understanding the extent to which input signals may affect output features, and mapping features at any part of the network to the region in the input that produces them. The key parameter to associate an output feature to an input region is the receptive field of the convolutional network, which is defined as the size of the region in the input that produces the feature.
As our first contribution, we
present a mathematical derivation and an efficient algorithm to compute
receptive fields of modern convolutional neural networks.
Previous work discussed receptive field computation for simple convolutional networks with a single path from the input to the output, providing recurrence equations for this case. Here, we revisit these derivations to obtain closed-form expressions for the single-path case, and extend the computation to arbitrary computation graphs with multiple paths from the input to the output.
Today, receptive field computations are needed in a variety of applications. For example, for the computer vision task of object detection, it is important to represent objects at multiple scales in order to recognize small and large instances; understanding a convolutional feature’s span is often required for that goal (e.g., if the receptive field of the network is small, it may not be able to recognize large objects). However, these computations are often done by hand, which is both tedious and error-prone. This is because there are no libraries to compute these parameters automatically. As our second contribution, we fill the void by introducing an open-source library which handily performs the computations described here. The library is integrated into the TensorFlow codebase and can be easily employed to analyze a variety of models, as presented in this article.
We expect these derivations and open-source code to improve the understanding of complex deep learning models, leading to more productive machine learning research.
We consider fully-convolutional neural networks, and derive their receptive field size and receptive field locations for output features with respect to the input signal. While the derivations presented here are general enough for any type of signal used at the input of convolutional neural networks, we use images as a running example, referring to modern computer vision architectures when appropriate.
First, we derive closed-form expressions when the network has a single path from input to output (as in AlexNet or VGG). Then, we discuss the more general case of arbitrary computation graphs with multiple paths from input to output (as in ResNet or Inception), considering potential alignment issues that arise in this scenario and presenting an algorithm to compute the receptive field size and locations.
Finally, we analyze the receptive fields of modern convolutional neural networks, showcasing results obtained using our open-source library.
Consider a fully-convolutional network (FCN) with $L$ layers, $l = 1, 2, \ldots, L$. Define feature map $f_l \in \mathbb{R}^{h_l \times w_l \times d_l}$ to denote the output of the $l$-th layer, with height $h_l$, width $w_l$ and depth $d_l$. We denote the input image by $f_0$. The final output feature map corresponds to $f_L$.
To simplify the presentation, the derivations presented in this document consider $1$-dimensional input signals and feature maps. For higher-dimensional signals (e.g., $2$D images), the derivations can be applied to each dimension independently. Similarly, the figures depict feature maps of depth $1$, since this does not affect the receptive field computation.
Each layer $l$’s spatial configuration is parameterized by 4 variables, as illustrated in the following figure: the kernel size $k_l$, the stride $s_l$, the padding applied to the left side of the input feature map $p_l$, and the padding applied to the right side $q_l$.
We consider layers whose output features depend locally on input features: e.g., convolution, pooling, or elementwise operations such as non-linearities, addition and filter concatenation. These are commonly used in state-of-the-art networks. We define elementwise operations to have a “kernel size” of $1$, since each output feature depends on a single location of the input feature maps.
Our notation is further illustrated with the simple network below. In this case, $L = 4$, and the model consists of a convolution, followed by ReLU, a second convolution and max-pooling.
In this section, we compute recurrence and closed-form expressions for fully-convolutional networks with a single path from input to output (e.g., AlexNet).
Define $r_l$ as the receptive field size of the final output feature map $f_L$, with respect to feature map $f_l$. In other words, $r_l$ corresponds to the number of features in feature map $f_l$ which contribute to generate one feature in $f_L$. Note that $r_L = 1$.
As a simple example, consider layer $L$, which takes features $f_{L-1}$ as input, and generates $f_L$ as output. Here is an illustration:
It is easy to see that $k_L$ features from $f_{L-1}$ can influence one feature from $f_L$, since each feature from $f_L$ is directly connected to $k_L$ features from $f_{L-1}$. So, $r_{L-1} = k_L$.
Now, consider the more general case where we know $r_l$ and want to compute $r_{l-1}$. Each feature in $f_l$ is connected to $k_l$ features from $f_{l-1}$.
First, consider the situation where $k_l = 1$: in this case, the $r_l$ features in $f_l$ will cover $r_{l-1} = s_l \cdot r_l - (s_l - 1)$ features in $f_{l-1}$. This is illustrated in the figure below, where the $r_l$ features under consideration are highlighted in red. The first term $s_l \cdot r_l$ (green) covers the entire region where the features come from, but it will cover $s_l - 1$ too many features (purple), which is why it needs to be deducted.
For the case where $k_l > 1$, we just need to add $k_l - 1$ features, which will cover those from the left and the right of the region. For example, if we use a kernel size of $5$ ($k_l = 5$), there would be $2$ extra features used on each side, adding $4$ in total. If $k_l$ is even, this works as well, since the extra features on the left and right sides sum to $k_l - 1$.
So, we obtain the general recurrence equation (which is first-order, non-homogeneous, with variable coefficients):

$$r_{l-1} = s_l \cdot r_l + (k_l - s_l)$$
This equation can be used in a recursive algorithm to compute the receptive field size of the network, $r_0$. However, we can do even better: we can solve the recurrence equation and obtain a solution in terms of the $k_l$’s and $s_l$’s:

$$r_0 = \sum_{l=1}^{L} \left( (k_l - 1) \prod_{i=1}^{l-1} s_i \right) + 1$$
This expression makes intuitive sense, which can be seen by considering some special cases. For example, if all kernels are of size $1$, naturally the receptive field is also of size $1$. If all strides are $1$, then the receptive field will simply be the sum of $(k_l - 1)$ over all layers, plus $1$, which is simple to see. If the stride is greater than $1$ for a particular layer, the region increases proportionally for all layers below that one. Finally, note that padding does not need to be taken into account for this derivation.
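To make the closed-form expression concrete, here is a minimal sketch in Python; this is illustrative code, not the API of the open-source library described in this article, and the layer parameters for the toy 4-layer network are assumed for illustration.

```python
# A minimal sketch of the closed-form expression above; illustrative Python,
# not the API of the open-source library described in this article.

def receptive_field_size(kernel_sizes, strides):
    """Closed-form r_0 = sum_l (k_l - 1) * prod_{i<l} s_i + 1 for a
    single-path network, with layer parameters listed from input to output."""
    r = 1
    cumulative_stride = 1  # product of the strides of all preceding layers
    for k, s in zip(kernel_sizes, strides):
        r += (k - 1) * cumulative_stride
        cumulative_stride *= s
    return r

# Toy 4-layer network (parameters are assumed for illustration):
# conv(k=3,s=1) -> ReLU(k=1,s=1) -> conv(k=3,s=1) -> maxpool(k=2,s=2).
print(receptive_field_size([3, 1, 3, 2], [1, 1, 1, 2]))  # prints 6
```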
While it is important to know the size of the region which generates one feature in the output feature map, in many cases it is also critical to precisely localize the region which generated a feature. For example, given a feature in $f_L$, what is the region in the input image which generated it? This is addressed in this section.
Let’s denote by $u_l$ and $v_l$ the left-most and right-most coordinates (in $f_l$) of the region which is used to compute the desired feature in $f_L$. In these derivations, the coordinates are zero-indexed (i.e., the first feature in each map is at coordinate $0$). Note that $u_L = v_L$ corresponds to the location of the desired feature in $f_L$. The figure below illustrates a simple 2-layer network, where we highlight the region in $f_0$ which is used to compute the first feature from $f_2$, along with the corresponding values of $u_l$ and $v_l$ for each layer. Note that in this case the region includes some padding.
We’ll start by asking the following question: given $u_l, v_l$, can we compute $u_{l-1}, v_{l-1}$?
Start with a simple case: let’s say $u_l = 0$ (this corresponds to the first position in $f_l$). In this case, the left-most feature $u_{l-1}$ will clearly be located at $-p_l$, since the first feature will be generated by placing the left end of the kernel over that position. If $u_l = 1$, we’re interested in the second feature, whose left-most position is $-p_l + s_l$; for $u_l = 2$, it is $-p_l + 2 \cdot s_l$; and so on. In general:

$$u_{l-1} = -p_l + u_l \cdot s_l$$
$$v_{l-1} = -p_l + v_l \cdot s_l + k_l - 1$$
where the computation of $v_{l-1}$ differs only by adding $k_l - 1$, which is needed since in this case we want to find the right-most position.
Note that these expressions are very similar to the recursion derived for the receptive field size. Again, we could implement a recursion over the network to obtain $u_l, v_l$ for each layer; but we can also solve for $u_0, v_0$ and obtain closed-form expressions in terms of the network parameters:

$$u_0 = u_L \prod_{i=1}^{L} s_i - \sum_{l=1}^{L} p_l \prod_{i=1}^{l-1} s_i$$
This gives us the left-most feature position in the input image as a function of the padding ($p_l$) and stride ($s_l$) applied in each layer of the network, and of the feature location in the output feature map ($u_L$).
And for the right-most feature location $v_0$:

$$v_0 = v_L \prod_{i=1}^{L} s_i - \sum_{l=1}^{L} \left( 1 + p_l - k_l \right) \prod_{i=1}^{l-1} s_i$$
Note that, different from $u_0$, this expression also depends on the kernel sizes ($k_l$) of each layer.
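The per-layer recurrences can also be applied directly, from the output back to the input. The sketch below is illustrative (not the library’s API), and continues the toy network from the earlier snippet, now assuming left padding $1$ on each convolution.

```python
# A sketch of the per-layer recurrences for u and v, applied from the output
# back to the input; layer parameters are illustrative assumptions.

def input_region(u_L, v_L, kernel_sizes, strides, paddings):
    """Return (u_0, v_0): the left-most and right-most input coordinates
    feeding the output features located between u_L and v_L."""
    u, v = u_L, v_L
    # u_{l-1} = -p_l + u_l * s_l ;  v_{l-1} = -p_l + v_l * s_l + k_l - 1
    for k, s, p in reversed(list(zip(kernel_sizes, strides, paddings))):
        u = -p + u * s
        v = -p + v * s + k - 1
    return u, v

# Same toy network as before, assuming left padding 1 on each convolution.
print(input_region(0, 0, [3, 1, 3, 2], [1, 1, 1, 2], [1, 0, 1, 0]))  # (-2, 3)
```

A negative $u_0$, as in this example, indicates that the region extends into the left padding.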
Relation between receptive field size and region. You may be wondering whether the receptive field size is directly related to $u_0$ and $v_0$. Indeed, it is: it is easy to show that $r_0 = v_0 - u_0 + 1$, which we leave as a follow-up exercise for the curious reader. To emphasize, this means that we can rewrite $v_0$ as:

$$v_0 = u_0 + r_0 - 1$$
Effective stride and effective padding. To compute $u_0$ and $v_0$ in practice, it is convenient to define two other variables, which depend only on the paddings and strides of the different layers:

- effective stride $S_l = \prod_{i=l+1}^{L} s_i$: the stride between a given feature map $f_l$ and the output feature map $f_L$
- effective padding $P_l = \sum_{m=l+1}^{L} p_m \prod_{i=l+1}^{m-1} s_i$: the padding between a given feature map $f_l$ and the output feature map $f_L$
With these definitions, we can rewrite the closed-form expression for $u_0$ as:

$$u_0 = -P_0 + u_L \cdot S_0$$
Note the resemblance between this expression and the single-layer recurrence $u_{l-1} = -p_l + u_l \cdot s_l$. By using $S_l$ and $P_l$, one can compute the locations $u_l, v_l$ for feature map $f_l$ given the location at the output feature map $u_L$. When one is interested in computing feature locations for a given network, it is handy to pre-compute three variables: $P_0, S_0, r_0$. Using these three, one can obtain $u_0$ from $u_0 = -P_0 + u_L \cdot S_0$ and $v_0$ from $v_0 = u_0 + r_0 - 1$. This allows us to obtain the mapping from any output feature location to the input region which influences it.
It is also possible to derive recurrence equations for the effective stride and effective padding. It is straightforward to show that:

$$S_{l-1} = s_l \cdot S_l$$
$$P_{l-1} = s_l \cdot P_l + p_l$$
These expressions will be handy when deriving an algorithm to solve the case for arbitrary computation graphs, presented in the next section.
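As a preview, the three recurrences can be combined into a single backward pass that yields the triple $(r_0, S_0, P_0)$. The sketch below is illustrative (not the library’s API) and continues the earlier toy network:

```python
# A sketch combining the recurrences r_{l-1} = s_l r_l + (k_l - s_l),
# S_{l-1} = s_l S_l and P_{l-1} = s_l P_l + p_l in one backward pass.

def rf_parameters(kernel_sizes, strides, paddings):
    r, S, P = 1, 1, 0  # values at the output feature map: r_L, S_L, P_L
    for k, s, p in reversed(list(zip(kernel_sizes, strides, paddings))):
        r = s * r + (k - s)
        S = s * S
        P = s * P + p
    return r, S, P

r0, S0, P0 = rf_parameters([3, 1, 3, 2], [1, 1, 1, 2], [1, 0, 1, 0])
u0 = -P0 + 0 * S0   # u_0 for output location u_L = 0
v0 = u0 + r0 - 1    # v_0 from the relation v_0 = u_0 + r_0 - 1
print(r0, S0, P0, u0, v0)  # 6 2 2 -2 3
```

Note that the result agrees with the direct $u, v$ recursion from the previous sketch.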
Center of receptive field region. It is also interesting to derive an expression for the center of the receptive field region which influences a particular output feature. This can be used as the location of the feature in the input image (as done for recent deep learning-based local features).

We define the center of the receptive field region for each layer $l$ as $c_l = \frac{u_l + v_l}{2}$. Given the above expressions for $u_0$ and $v_0$, it is straightforward to derive $c_0$ (remember that $r_0 = v_0 - u_0 + 1$):

$$c_0 = u_0 + \frac{r_0 - 1}{2} = u_L \cdot S_0 + \left( \frac{r_0 - 1}{2} - P_0 \right)$$
This expression can be compared to the one for $u_0$ to observe that the center is shifted from the left-most pixel by $\frac{r_0 - 1}{2}$, which makes sense. Note that the receptive field centers for the different output features are spaced by the effective stride $S_0$, as expected. Also, it is interesting to note that if $p_l = \frac{k_l - 1}{2}$ for all $l$, the centers of the receptive field regions for the output features will be aligned to the first image pixel and located at $\{0, S_0, 2S_0, 3S_0, \ldots\}$ (note that in this case all $k_l$’s must be odd).
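A small illustrative helper (again hypothetical code, continuing the earlier sketches) computes the center from the pre-computed triple:

```python
# Receptive field center for output location u_L, using the pre-computed
# triple (r_0, S_0, P_0) from the sketch above.

def rf_center(u_L, r0, S0, P0):
    return u_L * S0 - P0 + (r0 - 1) / 2

# For the toy network, (r0, S0, P0) = (6, 2, 2): centers are spaced by S0 = 2.
print([rf_center(u, 6, 2, 2) for u in range(3)])  # [0.5, 2.5, 4.5]
```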
Other network operations. The derivations provided in this section cover most basic operations at the core of convolutional neural networks. A curious reader may be wondering about other commonly-used operations, such as dilation, upsampling, etc. You can find a discussion on these in the appendix.
Most state-of-the-art convolutional neural networks today (e.g., ResNet or Inception) rely on models where each layer may have more than one input, which means that there might be several different paths from the input image to the final output feature map. These architectures are usually represented using directed acyclic computation graphs, where the set of nodes $\mathcal{L}$ represents the layers and the set of edges $\mathcal{E}$ encodes the connections between them (the feature maps flow through the edges).
The computation presented in the previous section can be used for each of the possible paths from input to output independently. The situation becomes trickier when one wants to take into account all different paths to find the receptive field size of the network and the receptive field regions which correspond to each of the output features.
Alignment issues. The first potential issue is that one output feature may be computed using misaligned regions of the input image, depending on the path from input to output. Also, the relative position between the image regions used for the computation of each output feature may vary. As a consequence, the receptive field size may not be shift-invariant. This is illustrated in the figure below with a toy example, in which case the centers of the regions used in the input image are different for the two paths from input to output.
In this example, padding is used only for the left branch. The first three layers are convolutional, while the last layer performs a simple addition. The relative position between the receptive field regions of the left and right paths is inconsistent for different output features, which leads to a lack of alignment (this can be seen by hovering over the different output features). Also, note that the receptive field size for each output feature may be different: the number of input samples used varies from one output feature to the next. This means that the receptive field size may not be shift-invariant when the network is not aligned.
For many computer vision tasks, it is highly desirable that output features be aligned: “image-to-image translation” tasks (e.g., semantic segmentation, edge detection, surface normal estimation, colorization, etc.), local feature matching and retrieval, among others.
When the network is aligned, all different paths lead to output features being centered consistently in the same locations; in particular, all different paths must have the same effective stride. It is easy to see that the receptive field size of the network will be the largest receptive field among all possible paths. Also, the effective padding of the network corresponds to the effective padding for the path with the largest receptive field size, such that one can apply $u_0 = -P_0 + u_L \cdot S_0$ and $v_0 = u_0 + r_0 - 1$ to localize the region which generated an output feature.
The figure below gives one simple example of an aligned network. In this case, the two different paths lead to the features being centered at the same locations; the figure also indicates the resulting receptive field size, effective stride and effective padding.
Alignment criteria. More precisely, for a network to be aligned at every layer, we need every possible pair of paths $i$ and $j$ to have $c_l^{(i)} = c_l^{(j)}$ for any layer $l$ and output feature $u_L$. For this to happen, we can see from the expression for $c_0$ that two conditions must be satisfied:

$$S_l^{(i)} = S_l^{(j)}$$
$$-P_l^{(i)} + \frac{r_l^{(i)} - 1}{2} = -P_l^{(j)} + \frac{r_l^{(j)} - 1}{2}$$

for all $i, j, l$.
Algorithm for computing receptive field parameters: sketch. It is straightforward to develop an efficient algorithm that computes the receptive field size and associated parameters for such computation graphs. Naturally, a brute-force approach is to use the expressions presented above to compute the receptive field parameters for each route from the input to output independently, coupled with some bookkeeping in order to compute the parameters for the entire network. This has a worst-case cost that is exponential in the number of layers, since the number of distinct paths from input to output may grow exponentially with $|\mathcal{L}|$.
But we can do better. Start by topologically sorting the computation graph. The sorted representation arranges the layers in order of dependence: each layer’s output only depends on layers that appear before it. By visiting layers in reverse topological order, we ensure that all paths from a given layer $l$ to the output layer have been taken into account when $l$ is visited. Once the input layer is reached, all paths have been considered and the receptive field parameters of the entire model are obtained. The complexity of this algorithm is $O(|\mathcal{L}| + |\mathcal{E}|)$, which is much better than the brute-force alternative.
As each layer is visited, some bookkeeping must be done in order to keep track of the network’s receptive field parameters. In particular, note that there might be several different paths from layer $l$ to the output layer. In order to handle this situation, we keep track of the parameters for $l$ and update them if a new path with a larger receptive field is found, using the recurrence expressions for $r_{l-1}$, $S_{l-1}$ and $P_{l-1}$ derived above. Similarly, as the graph is traversed, it is important to check that the network is aligned. This can be done by making sure that the receptive field parameters of different paths satisfy the alignment criteria above.
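The sketch below illustrates this algorithm under simplifying assumptions: each graph node is a layer with a single $(k, s, p)$ parameterization, and the topological sort is implemented with Kahn’s algorithm. This is illustrative code, not the open-source library’s actual implementation.

```python
# A sketch of the reverse-topological-order algorithm, under simplifying
# assumptions; not the open-source library's actual code.

def topological_order(nodes, edges):
    """Kahn's algorithm: order nodes so that every edge points forward."""
    indeg = {n: 0 for n in nodes}
    succ = {n: [] for n in nodes}
    for a, b in edges:
        succ[a].append(b)
        indeg[b] += 1
    order, ready = [], [n for n in nodes if indeg[n] == 0]
    while ready:
        n = ready.pop()
        order.append(n)
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                ready.append(m)
    return order

def rf_of_graph(layers, edges, output):
    """layers: dict name -> (k, s, p); edges: (producer, consumer) pairs.
    Returns name -> (r, S, P) w.r.t. the output layer, keeping the largest
    receptive field per node and checking the alignment criteria."""
    succ = {n: [] for n in layers}
    for a, b in edges:
        succ[a].append(b)
    params = {output: (1, 1, 0)}  # r_L = 1, S_L = 1, P_L = 0
    for node in reversed(topological_order(layers, edges)):
        if node == output:
            continue
        candidates = []
        for nxt in succ[node]:  # one candidate per consumer of `node`
            k, s, p = layers[nxt]
            r, S, P = params[nxt]
            # Single-layer recurrences for r, S and P.
            candidates.append((s * r + k - s, s * S, s * P + p))
        # Alignment criteria: equal effective strides, equal centers.
        assert len({S for _, S, _ in candidates}) == 1, "stride misalignment"
        assert len({(r - 1) / 2 - P for r, _, P in candidates}) == 1, \
            "center misalignment"
        params[node] = max(candidates)  # keep the largest receptive field
    return params

# Toy aligned graph: the input feeds two "same"-padded convolutions whose
# outputs are added (the addition has kernel size 1).
layers = {"input": (1, 1, 0), "convA": (5, 1, 2),
          "convB": (3, 1, 1), "add": (1, 1, 0)}
edges = [("input", "convA"), ("input", "convB"),
         ("convA", "add"), ("convB", "add")]
print(rf_of_graph(layers, edges, "add")["input"])  # (5, 1, 2)
```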
In this section, we present the receptive field parameters of modern convolutional networks, computed using the algorithm described above, as implemented in our open-source library.
| ConvNet Model | Receptive Field (r) | Effective Stride (S) | Effective Padding (P) | Model Year |
|---|---|---|---|---|
| alexnet_v2 | 195 | 32 | 64 | 2014 |
| vgg_16 | 212 | 32 | 90 | 2014 |
| mobilenet_v1 | 315 | 32 | 126 | 2017 |
| mobilenet_v1_075 | 315 | 32 | 126 | 2017 |
| resnet_v1_50 | 483 | 32 | 239 | 2015 |
| inception_v2 | 699 | 32 | 318 | 2015 |
| resnet_v1_101 | 1027 | 32 | 511 | 2015 |
| inception_v3 | 1311 | 32 | 618 | 2015 |
| resnet_v1_152 | 1507 | 32 | 751 | 2015 |
| resnet_v1_200 | 1763 | 32 | 879 | 2015 |
| inception_v4 | 2071 | 32 | 998 | 2016 |
| inception_resnet_v2 | 3039 | 32 | 1482 | 2016 |
As models evolved, from AlexNet, to VGG, to ResNet and Inception, the receptive fields increased (which is a natural consequence of the increased number of layers). In the most recent networks, the receptive field usually covers the entire input image: this means that the context used by each feature in the final output feature map includes all of the input pixels.
We can also relate the growth in receptive fields to increased classification accuracy. The figure below plots ImageNet top-1 accuracy as a function of the network’s receptive field size, for the same networks listed above. The circle size for each data point is proportional to the number of floating-point operations (FLOPs) for each architecture.
We observe a logarithmic relationship between classification accuracy and receptive field size, which suggests that large receptive fields are necessary for high-level recognition tasks, but with diminishing rewards. For example, note how MobileNets achieve high recognition performance even when using a very compact architecture: with depth-wise convolutions, the receptive field is increased with a small compute footprint. In comparison, VGG-16 requires 27X more FLOPs than MobileNets, but produces a smaller receptive field size; even though it is much more complex, VGG’s accuracy is only slightly better than MobileNet’s. This suggests that networks which can efficiently generate large receptive fields may enjoy enhanced recognition performance.
Let us emphasize, though, that receptive field size is not the only factor contributing to the improved performance mentioned above. Other factors play a very important role: network depth (i.e., number of layers) and width (i.e., number of filters per layer), residual connections, batch normalization, to name only a few. In other words, while we conjecture that a large receptive field is necessary, by no means is it sufficient.
Finally, note that a given feature is not equally impacted by all input pixels within
its receptive field region: the input pixels near the center of the receptive field have more “paths” to influence
the feature, and consequently carry more weight.
The relative importance of each input pixel defines the
effective receptive field of the feature.
Recent work provides a mathematical formulation and empirical measurements of such effective receptive fields, observing that they tend to be Gaussian-shaped and to occupy only a fraction of the theoretical receptive field.
The first trick to solve the recurrence $r_{l-1} = s_l \cdot r_l + (k_l - s_l)$ is to multiply it by $\prod_{i=1}^{l-1} s_i$:

$$r_{l-1} \prod_{i=1}^{l-1} s_i = r_l \prod_{i=1}^{l} s_i + k_l \prod_{i=1}^{l-1} s_i - \prod_{i=1}^{l} s_i$$

Then, define $g_l = r_l \prod_{i=1}^{l} s_i$, and note that $\prod_{i=1}^{0} s_i = 1$ (since $1$ is the neutral element for multiplication), so $g_0 = r_0$. Using this definition, the equation above can be rewritten as:

$$g_{l-1} - g_l = k_l \prod_{i=1}^{l-1} s_i - \prod_{i=1}^{l} s_i$$

Now, sum it from $l = 1$ to $l = L$:

$$\sum_{l=1}^{L} \left( g_{l-1} - g_l \right) = g_0 - g_L = \sum_{l=1}^{L} \left( k_l \prod_{i=1}^{l-1} s_i - \prod_{i=1}^{l} s_i \right)$$

Note that $g_0 = r_0$ and $g_L = r_L \prod_{i=1}^{L} s_i = \prod_{i=1}^{L} s_i$, since $r_L = 1$. Thus, we can compute:

$$r_0 = \prod_{i=1}^{L} s_i + \sum_{l=1}^{L} k_l \prod_{i=1}^{l-1} s_i - \sum_{l=1}^{L} \prod_{i=1}^{l} s_i = \sum_{l=1}^{L} k_l \prod_{i=1}^{l-1} s_i - \sum_{l=1}^{L} \prod_{i=1}^{l-1} s_i + 1$$

where the last step is done by a change of variables for the right term: $\sum_{l=1}^{L} \prod_{i=1}^{l} s_i = \sum_{l=2}^{L+1} \prod_{i=1}^{l-1} s_i$, where the $l = L+1$ term cancels against $\prod_{i=1}^{L} s_i$ and the missing $l = 1$ term contributes the $+1$.

Finally, combining the two sums, we obtain the expression for the receptive field size $r_0$ of an FCN at the input image, given the parameters of each layer:

$$r_0 = \sum_{l=1}^{L} \left( (k_l - 1) \prod_{i=1}^{l-1} s_i \right) + 1$$
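As a quick sanity check, the recurrence and the closed form can be compared numerically; the snippet below is illustrative Python, with randomly generated layer parameters.

```python
# A quick numerical check that the recurrence r_{l-1} = s_l r_l + (k_l - s_l)
# and the closed-form solution derived above agree on random layer parameters.
import random

def r0_recurrence(ks, ss):
    r = 1
    for k, s in reversed(list(zip(ks, ss))):
        r = s * r + (k - s)
    return r

def r0_closed_form(ks, ss):
    total, cum = 1, 1
    for k, s in zip(ks, ss):
        total += (k - 1) * cum
        cum *= s
    return total

random.seed(0)
for _ in range(1000):
    ks = [random.randint(1, 7) for _ in range(6)]
    ss = [random.randint(1, 3) for _ in range(6)]
    assert r0_recurrence(ks, ss) == r0_closed_form(ks, ss)
print("recurrence and closed form agree")
```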
The derivations are similar to the one we used to solve the recurrence for the receptive field size. Let’s consider the computation of $u_0$. First, multiply $u_{l-1} = -p_l + u_l \cdot s_l$ by $\prod_{i=1}^{l-1} s_i$:

$$u_{l-1} \prod_{i=1}^{l-1} s_i = u_l \prod_{i=1}^{l} s_i - p_l \prod_{i=1}^{l-1} s_i$$

Then, define $h_l = u_l \prod_{i=1}^{l} s_i$, and rewrite the equation above as:

$$h_{l-1} - h_l = -p_l \prod_{i=1}^{l-1} s_i$$

And sum it from $l = 1$ to $l = L$:

$$\sum_{l=1}^{L} \left( h_{l-1} - h_l \right) = h_0 - h_L = -\sum_{l=1}^{L} p_l \prod_{i=1}^{l-1} s_i$$

Note that $h_0 = u_0$ and $h_L = u_L \prod_{i=1}^{L} s_i$. Thus, we can compute:

$$u_0 = u_L \prod_{i=1}^{L} s_i - \sum_{l=1}^{L} p_l \prod_{i=1}^{l-1} s_i$$
Dilated (atrous) convolution. Dilations introduce “holes” in a convolutional kernel. While the number of weights in the kernel is unchanged, they are no longer applied to spatially adjacent samples. Dilating a kernel of size $k$ by a factor $\alpha$ introduces a striding of $\alpha$ between the samples used when computing the convolution. This means that the spatial span of the kernel is increased to $\alpha (k - 1) + 1$. The above derivations can be reused by simply replacing the kernel size $k$ by $\alpha (k - 1) + 1$ for all layers using dilations.
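A one-line illustrative helper makes this substitution explicit:

```python
# Effective kernel size of a dilated convolution: a kernel of size k with
# dilation rate alpha spans alpha * (k - 1) + 1 input samples, which can be
# substituted for k in all the receptive field expressions above.

def effective_kernel_size(k, alpha):
    return alpha * (k - 1) + 1

print(effective_kernel_size(3, 2))  # a 3-tap kernel with dilation 2 spans 5
```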
Upsampling. Upsampling is frequently done using interpolation (e.g., bilinear, bicubic or nearest-neighbor methods), resulting in an equal or larger receptive field — since it relies on one or multiple features from the input. Upsampling layers generally produce output features which depend locally on input features, and for receptive field computation purposes can be considered to have a kernel size equal to the number of input features involved in the computation of an output feature.
Separable convolutions. Convolutions may be separable in terms of spatial or channel dimensions. The receptive field properties of a separable convolution are identical to those of its corresponding equivalent non-separable convolution. For example, a $3 \times 3$ depth-wise separable convolution has a kernel size of $3$ for receptive field computation purposes.
Batch normalization. At inference time, batch normalization consists of feature-wise operations which do not alter the receptive field of the network. During training, however, batch normalization parameters are computed based on all activations from a specific layer, which means that its receptive field is the whole input image.
We would like to thank Yuning Chai and George Papandreou for their careful review of early drafts of this manuscript. Regarding the open-source library, we thank Mark Sandler for helping with the starter code, Liang-Chieh Chen and Iaroslav Tymchenko for careful code review, and Till Hoffman for improving upon the original code release. Thanks also to Mark Sandler for assistance with model profiling.