In Part 1 of Introduction to Regularizing with 2D Data we set up a system of fidelity equations to map five input points to ten output points. In this part we will add second derivative equations and impose a scaling factor to ensure consistent results.
Second-Derivative Equations for Smoothness
The next step is to formulate linear equations that measure the second derivatives of the output points. The reason for using second derivatives is that the second derivative tells us how smooth a function is. The more the function changes direction, the more nonzero its second derivative will be. The purpose of this whole process is to find a series of output y coordinates that correspond to the given list of output x coordinates. (In this example there are ten of these, ranging from x=-0.5 to x=3.6.) In other words, we are generating a numerical function, and that function will have second derivatives that we can measure. Like the fidelity equations, the second derivatives are a linear combination of the output y coordinates. See Numerical Second Derivative from Three Points for details.
With N points we can calculate N-2 second derivatives, so in this case we get 8 second derivatives from 10 output points. The last three output points have different spacing, so Rows 7 and 8 in ASecond-Derivatives have different coefficients than the first 6 rows.
The constants to the right of the equal sign are all zeros. This means that (in an ideal world) whatever function we generate will be perfectly smooth. According to the Euler-Lagrange equation only a straight line satisfies this condition. In practice of course we do not want to generate a perfect, straight line (because then the output function would not resemble the input function). The purpose of these equations is to penalize large variations in the output function. We will see that scaling the magnitude of ASecond-Derivative will adjust the severity of this penalty.
When these second derivative equations are added, the “wish list” will have 13 items on it.
Scaling the Second-Derivative Penalty Functions
The next step is to combine the fidelity and smoothness requirements by concatenating AFidelity and ASecond-Derivative. The latter is scaled by a scale factor S. This scale factor is the product of the constant KSmoothness and the square root of the ratio of fidelity equations to derivative equations.
KSmoothness is a constant that dictates the importance of the smoothness (or “stiffness”) relative to the fidelity. It is just a number that we choose based on what kind of fit we want. A value of 1 gives equal importance to fidelity and smoothness. In practice this constant is much lower, such as 0.01 for excellent fidelity or 0.1 for medium resistance to noise. KSmoothness cannot be zero, though it can be extremely small. An extremely large value, such as 99999, will cause the output dataset to be practically a straight line, which is the best-fit straight line through the input points. In this example we choose KSmoothness=0.1. With 5 fidelity equations and 8 derivative equations, S = 0.1 * sqrt(0.625) = 0.079057…
The value S is necessary to produce consistent results; it ensures that a given value of KSmoothness will always produce the same answer, no matter how many output points are being generated. If the square root of the equation ratio is omitted, then the resulting output function will become more like a straight line as more and more points are calculated from the same inputs. (It would be equivalent to increasing KSmoothness artificially.)
Putting It All Together
This is the matrix equation showing the fidelity equations and the scaled derivative equations. This linear system is overdetermined because it has 10 variables and 13 equations. This means there is no vector y that will satisfy all of them simultaneously.
In its current form the equation is: Ay=b. If you subtract b from both sides, it becomes Ay-b=0. Because the system is overdetermined, any vector y will result in some nonzero values on the left side of the equation. Those nonzero values are the residual errors and we want to minimize the sum of the squared residual errors. We can find the optimal solution to this overdetermined system by multiplying AT by both sides of the equation. Wikipedia has a good article about why this minimizes the squared error. The new equation is (ATA)y=ATb. The equations in this system of equations are called the normal equations.
The resulting normal equations are a symmetric, pentadiagonal matrix. This matrix comes from multiplying columns of A by other columns of A, so each element is a dot product of two columns. For example, the 2,4 element is the dot product of columns 2 and 4 in A. The 4,2 element is the same thing. For that reason the matrix is symmetric across its diagonal. Also, the dot product of any vector and itself is a positive number, so it makes sense that the diagonal elements are always positive numbers. One of the advantages of having a pentadiagonal matrix is that it is straightforward for a computer to solve by Gaussian elimination and back-substitution, even for extremely large matrices.