Since convolutional neural network is getting popular, the term “convolution” also becomes familiar to many people. But just what exactly is convolution? This article will answer this question for those who are willing to expand their knowledge in the mathematical field.
The definition of convolution
If you have two functions, f(x) and g(x), and you’d like to generate a third function based on them, there are actually multiple measures you can choose from. For instance, function composition is an option to go with, which can produce a new function equals f(g(x)). Similarly, “convolution” is one of such mathematical operations allowing one to generate a new function out of two existed functions. The mathematical operation is defined as follows:
For continuous situation:
For discrete situation (discrete convolution):
We now know the definition of “convolution”, but what does it really means in the real world? The answer to this question will be discussed in the next paragraph.
The physical meaning of math
Before we go further with convolution, we’d like to propose a concept: in many cases, a mathematical equation does not have an ultimate physical meaning. Takes “1+1 = 2” as an example, the equation can mean many things. For instance, one orange plus one orange equals two oranges, or one person plus one person equals two people. Note that these are the situations where “1+1 = 2” is applicable, but there are also some situations where “1+1 = 2” does not hold true. For example, one drop of water plus another drop does not necessarily equal two drops of water (they may merge into a larger drop). We can also say that any condition that matches the depiction of “1+1 = 2” is a potential physical meaning for the equation, so our interpretation of this equation changes accordingly depending on which condition we’re currently referring to.
Similarly, “convolution” can be understood in many fashions, depending on the area it’s applied to. In the rest of this article, we’re going to introduce two important applications of convolution in signal and image processing, respectively. And as you are about to see, they are very distinct in respect of their physical meanings.
Example of application 01 – Signal Processing
Given that you have a linear, time-invariant (LTI) system capable of turning an input signal, which can be described by the function x(t), into the output signal, which can be described by y(t) (P.S., y(t) is also known as the response of x(t)), you’ll find out that the response, y(t), is equal to the convolution between the input signal and a special kind of response called the impulse response. In other words:
To understand why the equation above is true, some prior knowledge is necessary, including (1) what is an impulse and impulse response, (2) what is sifting property and (3) what is a linear time-invariant system. These topics are discussed in below:
(1) What is an impulse and impulse response?
If you have a linear time-invariant system capable of processing signal (i.e., taking in an input signal and generate a response signal correspondingly), how can you have a full understanding about the system?
Ideally, we can input signals with all possible frequencies to the system at the same time and see what kind of response it produces. And since we have observed how the system reacts to all frequencies, we’ve already fully understood how the system behaves.
An impulse represented by the function δ (Dirac delta function, which will be introduced later) is just the kind of signal that matches the description above, which allows us to fully understand a LTI system. As we can see from Figure 2 (a), δ’s value remains to be 1 on the entire frequency domain, meaning that it contains inputs of all frequencies. And since these frequencies are all applied to the system at the same time, it’s natural to expect that the function δ on the time domain has value solely on a certain time point (in this case, t = 0), and its value at that time point reaches to infinity (as shown in Figure 2 (b). Also notice that after we apply an impulse to a LTI system, the system will produce a corresponding response; such response is called an impulse response (as shown in Figure 2 (c)).
As mentioned previously, an impulse can be described by a special function called Dirac delta function (denoted by “δ”), whose definition is as follows:
Such function has many intriguing properties. Two of them that are particularly important for the future discussion are the sifting property (it’ll be introduced in the next paragraph) and the fact that the “area” under the graph of δ(t) at the time point t = 0 is equal to 1. And since the geometrical meaning of integral is the area under the curve of a function, one can anticipate that (see Figure 3):
(2) What is sifting property (note that it’s “sifting”, not “shifting”)?
In simple words, given that x(t) is any function and δ stands for Dirac delta function, sifting property states that:
The reason for this equation to be valid is explained in Figure 4:
(3) What is a linear and time-invariant system?
We must emphasize again that the conclusion, “the response, y(t), is equal to the convolution between its input signal, x(τ), and impulse response, h(t – τ)”, only works when the system is a linear time-invariant (LTI) system, which is a system owning two important features: linear and time-invariant.
Figure 5 summarizes the properties of a LTI system:
With the knowledge said, we can finally get to the point: understanding why an output signal is equal to the convolution between its input and impulse response. The following diagram demonstrates the deduction process for gaining such conclusion:
From (a) to (b): Since the system is time-invariant, we can shift both side in time by the value τ.
From (b) to (c): Since the system is linear, we can multiply both side with a value x(τ) (i.e., the value of the function x(t) when t = τ; τ can be any number).
From (c) to (d): Since the system is linear, we can integrate both side with respect to τ (note that the concept of integration is closely related to summation).
From (d) to (e): According to the sifting property, the left side of the system is equal to x(t). And since by definition applying x(t) to the LTI system will give rise to the response y(t), we can now conclude that y(t) is equal to the convolution between x(τ) and h(t – τ).
Example of application 02 – Image Processing
For people who are studying machine learning, this application of convolution is critical since it is used in convolutional neural networks (CNNs). But how is it connected to the definition of the discrete convolution described on the top? Before we get to that, let’s first explore how convolution “looks like” in CNN and image processing.
In a picture containing meaningful objects, pixels are usually correlated with each other (especially those in the vicinity) in some fashions instead of randomly distributed on the image. To catch such correlation between a pixel and its neighbors, we can load an image and carry out some mathematical operation to each pixel on that image, combining its value with the values from the nearby pixels in some meaningful way, and as a result giving rise to a new image that is capable of illustrating the pixel correlation we’re looking for. Note that in order to do so, we’ll need another smaller, typically square, matrix called a kernel. Figure 7 demonstrates an example of how to apply a 3×3 kernel to a 9×9 black-and-white image to give rise to a new 9×9 black-and-white image for showing a certain feature of the original:
As shown in Figure 7, the kernel with the central value equals 1 and other -1/8 is capable of showing the “edge” of the objects within the original image (i.e., performing edge detection). As a matter of fact, by altering the size of a kernel and the values in it, one can get a new image showing different information about the original picture, or even adding a certain visual effect to it (such as sharpening the image or making it looks embossing).
So, how exactly is the said process connects to convolution? Let’s try to describe the process with a generalized mathematical expression, and you’ll see. Given that we have a n × n image (denoted by “X”) and a k × k kernel (denoted by “K”), the value of a certain pixel on the n × n new image (denoted by “G”, which is generated by applying the kernel to the original matrix) can be depicted by the following equation, where s equals the quotient of k divided by 2 (see Figure 7):
It’s clear now that the equation shown is a two-dimensional version of the discrete convolution aforementioned. In other words, we can say that “the value of the pixels on the new image equals the convolution between the original image and a kernel”, and that’s exactly why the neural network that adopts such process is called a “convolutional” neural network.
FYI, There is a similar process called cross-correlation that yields a result completely identical with convolution in the application of image processing (P.S., note that this is not true in the case of signal processing). The mathematical expression of cross-correlation is as follows:
Hope that this article can help you to get a deeper understanding about convolution. Finally, thank you all for reading it!
Footnote: Neurozo Innovation shares viewpoints and knowledge to help people success in their journey of innovating. For more article like this, please join our free membership to get the future notification. Thank you and wish you triumph in your business.
Comments