What is the correct way to simplify $(Ax-b)^T(Ax-b)$
Andrew Mclaughlin
Let us simplify $(Ax-b)^T(Ax-b)$ step by step, $A \in R^{m \times n}$ matrix, $x \in R^{n \times 1}$, $b \in R^{m \times 1}$
$(Ax-b)^T(Ax-b)$
$= (x^TA^T - b^T)(Ax - b)$ (property of transpose)
$ = x^TA^TAx - x^TA^Tb - b^TAx + b^Tb$ (distribution property)
Now what is the correct way to combine the inner terms
$-x^TA^Tb - b^TAx $?
or equivalently
$x^TA^Tb + b^TAx $?
I find myself constantly asking the following questions
- Is it legal or illegal to transpose any of the two terms? Why?
- Which term ($x^TA^Tb$ or $b^TAx $) should you take the transpose of? Why?
I would really appreciate if someone can resolve this for me
$\endgroup$ 22 Answers
$\begingroup$Since both $x^TA^Tb$ and $b^TAx$ are scalars, you can transpose any of them.
You can transpose $x^TA^Tb$ and get $$x^TA^Tb + b^TAx = b^TAx + b^TAx = 2 b^TAx$$ or you can transpose $b^TAx$ and get $$x^TA^Tb + b^TAx = x^TA^Tb + x^TA^Tb = 2 x^TA^Tb$$ and since both results are scalar, they are equal to their respective transposes, we get: $$ 2 x^TA^Tb = 2 (x^TA^Tb)^T = 2 b^TAx.$$
So to summarize, you can take the transposes of either one of them, and should probably take the transpose of $x^TA^Tb$ since you will reach the simpler result $2b^TAx$ faster (it can be considered simpler since there is no transpose of $A$ in it).
$\endgroup$ 1 $\begingroup$The matrix $A$ can have linearly independent columns only if $m\ge n$. The typical instance of this problem is where $m\gg n = \operatorname{rank} A$. In that case, the small matrix $A^TA$ is invertiable since it's an $n\times n$ matrix of rank $n$. The matrix $B = A (A^T A)^{-1} A^T$ is an $m\times m$ matrix of small rank, $n$. (If $A$ were a square matrix, then $B$ would be the identity matrix.)
Exercise: Prove that $b\mapsto Bb$ is the orthogonal projection of $b$ onto the column space of $A$.
And now on to simplifying: \begin{align} & (Ax-b)^T(Ax-b) \\[10pt] = {} & (Ax - Bb + Bb - b)^T(Ax - Bb + Bb - b) \\[10pt] = {} & (Ax-Bb)^T(Ax-Bb) + \underbrace{(Ax-Bb)^T(Bb-b) + (Bb-b)^T(Ax-Bb)}_\text{This is $0$.} + (Bb-b)^T(Bb-b) \\[10pt] = {} & (Ax-Bb)^T(Ax-Bb) + (Bb-b)^T(Bb-b). \end{align}
In this last form, notice that $x$ can be so chosen that $Ax= Bb$, since $Bb$ is in the column space of $A$. Indeed, that happens precisely when $x = (A^TA)^{-1}A^T b$. That is the one value of $x$ that makes the first term $0$, and thus it is the one value of $x$ that mimimizes the entire quantity. Hence we see that the second term in the last line is actually the smallest possible value of the whole quantity as $x$ varies. (If $A$ has linearly dependent columns and more rows than columns, then there is more than one value of $x$ that makes the first term in the last line vanish.)
This may be considered species of completing the square.
$\endgroup$