
Sunday, April 14, 2024

Evaluating the operator norms of matrices

Let $E$ and $F$ be normed vector spaces, over the real or complex numbers, and let $u\colon E\to F$ be a linear map. The continuity of $u$ is equivalent to the existence of a real number $c$ such that $\|u(x)\|\leq c \|x\|$ for every $x\in E$, and the least such real number is called the operator norm of $u$; we denote it by $\|u\|$. It defines a norm on the linear space $\mathscr L(E;F)$ of continuous linear maps from $E$ to $F$ and as such is quite important. When $E=F$, it is also related to the spectrum of $u$ and is implicitly at the heart of the Gershgorin criterion for the localization of eigenvalues.

$\gdef\R{\mathbf R}\gdef\norm#1{\lVert#1\rVert}\gdef\Abs#1{\left|#1\right|}\gdef\abs#1{\lvert#1\rvert}$

However, even in the simplest cases of matrices, its explicit computation is not trivial at all, and, as we'll see, even less trivial than what is taught in algebra classes, as I learned by browsing Wikipedia when I wanted to prepare a class on the topic.

Since I'm more of an abstract guy, I will use both languages, that of normed spaces and that of matrices, for the first one allows one to explain a few things at a more fundamental level. I'll make the translation, though. Also, to be specific, I'll work with real vector spaces.

So $E=\R^m$, $F=\R^n$, and linear maps in $\mathscr L(E;F)$ are represented by $n\times m$ matrices. There are plenty of interesting norms on $E$, but I will concentrate the discussion on the $\ell^p$-norm given by $\norm{(x_1,\dots,x_m)} = (|x_1|^p+\dots+|x_m|^p)^{1/p}$. Similarly, I will consider the $\ell^q$-norm on $F$ given by $\norm{(y_1,\dots,y_n)} = (|y_1|^q+\dots+|y_n|^q)^{1/q}$. Here, $p$ and $q$ are elements of $[1;+\infty\mathclose[$. It is also interesting to allow $p=\infty$ or $q=\infty$; in this case, the expression defining the norm is just replaced by $\sup(|x_1|,\dots,|x_m|)$ and $\sup(|y_1|,\dots,|y_n|)$ respectively.

Duality

Whatever norm is given on $E$, the dual space $E^*=\mathscr L(E;\mathbf R)$ is endowed with the dual norm, which is just the operator norm of that space: for $\phi\in E^*$, $\norm\phi$ is the least real number such that $|\phi(x)|\leq \norm\phi \norm x$ for all $x\in E$. And similarly for $F$. To emphasize duality, we will write $\langle x,\phi\rangle$ instead of $\phi(x)$.

Example. — The dual norm of the $\ell^p$ norm can be computed explicitly, thanks to the Hölder inequality (itself a consequence of the Young inequality) \[ \Abs{ x_1 y_1 + \dots + x_n y_n } \leq (|x_1|^p+\dots + |x_n|^p)^{1/p} (|y_1|^q+\dots+|y_n|^q)^{1/q}\] if $p,q$ are related by the relation $\frac1p+\frac1q=1$. (When $p=1$, this gives $q=\infty$, and symmetrically $p=\infty$ if $q=1$.) Moreover, this inequality is optimal in the sense that for any $(x_1,\dots,x_n)$, one may find a nonzero $(y_1,\dots,y_n)$ for which the inequality is an equality. What this inequality says about norms/dual norms is that if one identifies $\R^n$ with its dual, via the duality bracket $\langle x,y\rangle=x_1y_1+\dots+x_n y_n$, the dual of the $\ell^p$-norm is the $\ell^q$-norm, for that relation $1/p+1/q=1$.

If $u\colon E\to F$ is a continuous linear map, it has an adjoint (or transpose) $u^*\colon F^*\to E^*$, which is defined by $u^*(\phi)= \phi\circ u$, for $\phi\in F^*$. In terms of the duality bracket, this rewrites as \[ \langle \phi, u(x)\rangle = \langle u^*(\phi),x\rangle\] for $x\in E$ and $\phi\in F^*$.

Proposition. — One has $\norm{u^*}=\norm u$.

For $\phi\in F^*$, $\norm{u^*(\phi)}$ is the least real number such that $|u^*(\phi)(x)|\leq \norm{u^*(\phi)} \norm x$ for all $x\in E$. Since one has \[ |u^*(\phi)(x)|= |\langle u^*(\phi),x\rangle|=|\langle\phi, u(x)\rangle|\leq \norm\phi \norm{u(x)} \leq \norm\phi \norm u\norm x, \] we see that $\norm {u^*(\phi)}\leq \norm\phi\norm u$ for all $\phi$. As a consequence, $\norm{u^*}\leq \norm u$.
To get the other inequality, we wish to find a nonzero $\phi$ such that $\norm{u^*(\phi)}=\norm{u}\norm\phi$. This $\phi$ should thus be such that there exists $x$ such that $|\langle u^*(\phi),x\rangle|=\norm u\norm\phi\norm x$ which, by the preceding computation, means that $|\langle\phi, u(x)\rangle|=\norm\phi\norm u\norm x$. Such $\phi$ and $x$ need not exist in general, but we can find reasonable approximations. Start with a nonzero $x\in E$ such that $\norm{u(x)}$ is close to $\norm u\norm x$; then, using the Hahn-Banach theorem, find $\phi\in F^*$ such that $\norm\phi=1$ and $|\phi(u(x))|=\norm {u(x)}$. We see that $|\langle\phi, u(x)\rangle|$ is close to $\norm u\norm\phi\norm x$, and this concludes the proof.
In some cases, in particular in the finite-dimensional case, we can use biduality to get the other inequality. Indeed, $E^{**}$ identifies with $E$, with its initial norm, and $u^{**}$ identifies with $u$. By the first part of the proof, applied to $u^*$, we thus have $\norm{u^{**}}\leq \norm {u^*}$, hence $\norm u\leq\norm{u^*}$.

The case $p=1$

We compute $\norm{u}$ when $E=\mathbf R^m$ is endowed with the $\ell^1$-norm, and $F$ is arbitrary. The linear map $u\colon E\to F$ thus corresponds with $m$ vectors $u_1,\dots,u_m$ of $F$, and one has \[ u((x_1,\dots,x_m))=x_1 u_1+\dots+x_m u_m. \] By the triangular inequality, we have \[ \norm{u((x_1,\dots,x_m))} \leq |x_1| \norm{u_1}+\dots+\abs{x_m}\norm{u_m} \] hence \[ \norm{u((x_1,\dots,x_m))} \leq (\abs{x_1} +\dots+\abs{x_m}) \sup(\norm{u_1},\dots,\norm{u_m}). \] Consequently, \[ \norm{u} \leq \sup(\norm{u_1},\dots,\norm{u_m}). \] On the other hand, taking $x=(x_1,\dots,x_m)$ of the form $(0,\dots,1,0,\dots)$, where the $1$ is at place $k$ such that $\norm{u_k}$ is largest, we have $\norm{x}=1$ and $\norm{u(x)}=\norm{u_k}$. The preceding inequality is thus an equality.

In the matrix case, this shows that the $(\ell^1,\ell^q)$-norm of an $n\times m$ matrix $A$ is the supremum of the $\ell^q$-norms of the columns of $A$.
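If you want to see this formula in action, here is a quick numerical sanity check (it relies on numpy; the matrix, the exponent $q$ and the sampling loop are arbitrary choices of mine, not part of the argument): the supremum is attained at a standard basis vector, and no sampled point of the $\ell^1$-unit sphere exceeds it.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 6))   # an arbitrary 4x6 real matrix
q = 3.0                           # any exponent in [1, infinity)

# Formula from the text: sup of the l^q-norms of the columns of A.
formula = max(np.linalg.norm(A[:, j], q) for j in range(A.shape[1]))

# The sup is attained at a standard basis vector (a vertex of the l^1 ball)...
attained = max(np.linalg.norm(A @ e, q) for e in np.eye(A.shape[1]))
# ...and no point of the l^1 unit sphere does better.
for _ in range(10000):
    x = rng.standard_normal(A.shape[1])
    x /= np.linalg.norm(x, 1)     # normalize so that ||x||_1 = 1
    assert np.linalg.norm(A @ x, q) <= formula + 1e-9

print(formula, attained)          # the two values are equal
```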

The case $q=\infty$

We compute $\norm{u}$ when $F=\mathbf R^n$ is endowed with the $\ell^\infty$-norm, and $E$ is arbitrary. A direct computation is possible in the matrix case, but it is not really illuminating, and I find it better to argue geometrically, using a duality argument.

Namely, we can use $u^*\colon F^*\to E^*$ to compute $\norm{u}$, since $\norm u=\norm{u^*}$. We have seen above that $F^*$ is $\mathbf R^n$, endowed with the $\ell^1$-norm, so that we have computed $\norm{u^*}$ in the preceding section. The basis $(e_1,\dots,e_n)$ of $F$ gives a dual basis $(\phi_1,\dots,\phi_n)$, and one has \[ \norm{u}=\norm{u^*} = \sup (\norm{u^*(\phi_1)},\dots,\norm{u^*(\phi_n)}). \]

In the matrix case, this shows that the $(\ell^p,\ell^\infty)$-norm of an $n\times m$ matrix $A$ is the supremum of the $\ell^{p'}$-norms of the rows of $A$, where $p'$ is the exponent conjugate to $p$ (defined by $1/p+1/p'=1$): indeed, the dual norm on $E^*$ is the $\ell^{p'}$-norm, and $u^*(\phi_i)$ corresponds to the $i$th row of $A$.
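The same kind of numerical sanity check can be run here; note that the exponent appearing in the formula is the conjugate exponent $p'$, and that the supremum is attained at the point giving equality in the Hölder inequality for the largest row. (Again, numpy, the matrix and the exponent are my own choices; this is an illustration, not a proof.)

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 4))
p = 3.0
pc = p / (p - 1)                  # conjugate exponent p', with 1/p + 1/p' = 1

# Claimed formula: sup of the l^{p'}-norms of the rows of A.
formula = max(np.linalg.norm(A[i, :], pc) for i in range(A.shape[0]))

# The sup is attained at the Hölder-equality point for the largest row...
i0 = max(range(A.shape[0]), key=lambda i: np.linalg.norm(A[i, :], pc))
x0 = np.sign(A[i0, :]) * np.abs(A[i0, :]) ** (pc / p)
x0 /= np.linalg.norm(x0, p)       # normalize so that ||x0||_p = 1
attained = np.linalg.norm(A @ x0, np.inf)

# ...and no point of the l^p unit sphere does better.
for _ in range(10000):
    x = rng.standard_normal(A.shape[1])
    x /= np.linalg.norm(x, p)
    assert np.linalg.norm(A @ x, np.inf) <= formula + 1e-9

print(formula, attained)          # equal, up to rounding
```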

Relation with the Gershgorin circle theorem

I mentioned the Gershgorin circle theorem as being in the same spirit as the computation of an operator norm, because its proof relies on the same kind of estimates. In fact, no additional computation is necessary!

Theorem (Gershgorin “circles theorem”). — Let $A=(a_{ij})$ be an $n\times n$ matrix and let $\lambda$ be an eigenvalue of $A$. There exists an integer $i$ such that \[ \abs{\lambda-a_{ii}}\leq \sum_{j\neq i} \abs{a_{ij}}. \]

For the proof, one writes $A=D+N$ where $D$ is diagonal and $N$ has zeroes on its diagonal, and writes $\lambda x=Ax=Dx+Nx$, where $x$ is a nonzero eigenvector, hence $(\lambda I-D)x=Nx$. Endow $\R^n$ with the $\ell^\infty$-norm; we may assume that $\norm x=1$, and we choose $i$ such that $|x_i|=\norm x=1$. The $i$th coordinate of the left hand side is $(\lambda-a_{ii})x_i$, hence has absolute value $\abs{\lambda-a_{ii}}$, while the $i$th coordinate of the right hand side satisfies $\abs{(Nx)_i}\leq \sum_{j\neq i}\abs{a_{ij}}\,\abs{x_j}\leq \sum_{j\neq i}\abs{a_{ij}}$, which is exactly the estimate that computes the $(\ell^\infty,\ell^\infty)$-norm of $N$ row by row. This implies the theorem.
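For those who like to see such statements checked numerically, here is a tiny illustration (a sketch relying on numpy; the random matrix is an arbitrary choice of mine): every eigenvalue falls in at least one of the Gershgorin discs.

```python
import numpy as np

rng = np.random.default_rng(2)
# a random matrix with a somewhat dominant diagonal, to make the discs visible
A = rng.standard_normal((6, 6)) + 5 * np.diag(rng.standard_normal(6))

# radius of the i-th disc: sum_{j != i} |a_ij|
radii = np.sum(np.abs(A), axis=1) - np.abs(np.diag(A))

for lam in np.linalg.eigvals(A):
    # each eigenvalue lies in at least one disc centered at a diagonal entry
    assert any(abs(lam - A[i, i]) <= radii[i] + 1e-9 for i in range(6))
print("all eigenvalues lie in the union of the Gershgorin discs")
```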

The case $p=q=2$

Since Euclidean spaces are very useful in applications, this may be a very important case to consider, and we will see that the answer is not at all straightforward from the coefficients of the matrix.

We have to bound from above $\norm{u(x)}$. Using the scalar product, we write \[ \norm{u(x)}^2 = \langle u(x),u(x)\rangle = \langle u^*u(x),x\rangle, \] where $u^*\colon F\to E$ now denotes the adjoint of $u$, which identifies with the transpose of $u$ if one identifies $E$ with $E^*$ and $F$ with $F^*$ by means of their scalar products. Using the Cauchy-Schwarz formula, we get that $\norm{u(x)}^2\leq \norm{u^*u(x)}\norm x\leq \norm{u^*u} \norm x^2$, hence $\norm{u} \leq \norm{u^*u}^{1/2}$. This inequality is remarkable because on the other hand, we have $\norm{u^*u}\leq \norm{u^*}\norm{u}=\norm{u}^2$. Consequently, $\norm{u}=\norm{u^*u}^{1/2}$.

This formula might not appear to be so useful, since it reduces the computation of the operator norm of $u$ to that of $u^*u$. However, the linear map $u^*u$ is a positive self-adjoint endomorphism of $E$, hence (assuming that $E$ is finite dimensional here) it can be diagonalized in an orthonormal basis. We then see that $\norm{u^*u}$ is the largest eigenvalue of $u^*u$, which is also its spectral radius. The square roots of the eigenvalues of $u^*u$ are also called the singular values of $u$, hence $\norm u$ is the largest singular value of $u$.

One can play with duality as well, and we have $\norm{u}=\norm{uu^*}^{1/2}$.
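Numerically, this is what standard linear algebra libraries compute when asked for the $(\ell^2,\ell^2)$ operator norm. The following sketch (numpy again, with an arbitrary matrix) simply checks that the three descriptions agree: operator norm, largest singular value, and square root of the spectral radius of $u^*u$.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 7))

spectral_norm = np.linalg.norm(A, 2)                  # the (l^2, l^2) operator norm
largest_sv = np.linalg.svd(A, compute_uv=False)[0]    # largest singular value
rho = max(abs(np.linalg.eigvals(A.T @ A))) ** 0.5     # sqrt of the spectral radius of A*A

print(spectral_norm, largest_sv, rho)                 # all three agree
```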

Other cases?

There are general inequalities relating the various $\ell^p$-norms of a vector $x\in\R^m$, and these can be used to deduce inequalities for $\norm u$, when $E=\R^m$ has an $\ell^p$-norm and $F=\R^n$ has an $\ell^q$-norm. However, given the explicit value of $\norm u$ for $(p,q)=(2,2)$ and the fact that no closed form expression exists for the spectral radius, it is unlikely that there is a closed form expression in the remaining cases.

Worse: the exact computation of $\norm u$ in the cases $(\infty,1)$, $(\infty,2)$ or $(2,1)$ is known to be NP-hard, and I try to explain this result below, following J. Rohn (2000) (“Computing the Norm $\|A\|_{\infty,1}$ Is NP-Hard”, Linear and Multilinear Algebra 47 (3), p. 195–204). I concentrate on the $(\infty, 1)$ case; the $(\infty,2)$ case is supposed to be analogous (see Joel Tropp's thesis, top of page 48, quoted by Wikipedia, but no arguments are given), and the case $(2,1)$ would follow by duality.

A matrix from a graph

Let us consider a finite (undirected, simple, without loops) graph $G$ on the set $V=\{1,\dots,n\}$ of $n$ vertices, with set of edges $E$, and let us introduce the following $n\times n$ matrix $A=(a_{ij})$, a variant of the adjacency matrix of the graph $G$ (actually $nI_n-M$, where $I_n$ is the identity matrix and $M$ is the adjacency matrix of $G$):

  • One has $a_{ii}=n$ for all $i$;
  • If $i\neq j$ and vertices $i$ and $j$ are connected by an edge, then $a_{ij}=-1$;
  • Otherwise, $a_{ij}=0$.
For any subset $S$ of $V$, the cut $c(S)$ of $S$ is the number of edges which have one endpoint in $S$ and the other outside of $S$.

Proposition. — The $(\ell^\infty,\ell^1)$-norm of $A$ is given by \[ 4 \sup_{S\subseteq V} c(S) - 2 \operatorname{Card}(E) + n^2. \]

The proof starts with the following observation, valid for more general matrices.

Lemma. — The $(\ell^\infty,\ell^1)$-norm of a symmetric positive $n\times n$ matrix $A$ is given by $\norm A = \sup_z \langle z, Az \rangle$ where $z$ runs among the set $Z$ of vectors in $\R^n$ with coordinates $\pm1$.

The vectors of $Z$ are the vertices of the polytope $[-1;1]^n$, which is the unit ball of $\R^n$ for the $\ell^\infty$-norm. Consequently, every vector of $[-1;1]^n$ is a convex combination of vectors of $Z$. Writing $x=\sum_{z\in Z} c_z z$ with $c_z\geq0$ and $\sum c_z=1$, we have \[\norm {Ax} = \norm{\sum c_z Az} \leq \sum c_z \norm {Az}\leq \sup_{z\in Z} \norm{Az}. \] The other inequality being obvious, we already see that $\norm A=\sup_{z\in Z}\norm{Az}$. Note that this formula holds for any norm on the codomain.
If, for $z\in Z$, one writes $Az=(y_1,\dots,y_n)$, one has $\norm{Az}=|y_1|+\dots+|y_n|$, because the codomain is endowed with the $\ell^1$-norm, so that $\langle z, Az\rangle = \sum z_i y_i\leq \norm{Az}$. We thus get the inequality $\sup_{z\in Z} \langle z,Az\rangle \leq \norm A$.
Let us now use the fact that $A$ is symmetric and positive. Fix $z\in Z$, set $Az=(y_1,\dots,y_n)$ as above, and define $x\in Z$ by $x_i=1$ if $y_i\geq0$ and $x_i=-1$ otherwise. One thus has $\langle x, Az\rangle=\sum |y_i|=\norm{Az}$. Since $A$ is symmetric and positive, one has $\langle x-z, A(x-z)\rangle\geq0$, and this implies \[2\norm{Az}= 2\langle x, Az\rangle \leq \langle x, Ax\rangle+\langle z, Az\rangle, \] so that, $\norm{Az}\leq \sup_{x\in Z} \langle x, Ax\rangle$. This concludes the proof.

To prove the proposition, we will apply the preceding lemma. We observe that $A$ is symmetric, by construction. It is also positive, since for every $x\in\R^n$, one has \[\langle x,Ax\rangle=\sum a_{ij}x_ix_j \geq n \sum x_i^2 -\sum_{i\neq j} \abs{x_i}\,\abs{x_j} = (n+1)\sum x_i^2- \Big(\sum \abs{x_i}\Big)^2\geq \sum x_i^2 \] using the Cauchy-Schwarz inequality $(\sum \abs{x_i})^2\leq n\sum x_i^2$. By the preceding lemma, we thus have \[ \norm A = \sup_{z\in\{\pm1\}^n} \langle z, Az\rangle. \] The $2^n$ vectors $z\in Z$ are in bijection with the $2^n$ subsets of $V=\{1,\dots,n\}$, by associating with $z\in Z$ the subset $S$ of $V$ consisting of vertices $i$ such that $z_i=1$. Then, one can compute \[ \langle z, Az\rangle = \sum_{i,j} a_{ij} z_iz_j = 4c(S) - 2\operatorname{Card}(E) + n^2. \] It follows that $\norm A $ is equal to the indicated expression.
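Since everything here is finite, the proposition can be checked by brute force on a small example. The following sketch (plain Python; the graph is an arbitrary choice of mine) computes the $(\ell^\infty,\ell^1)$-norm by enumerating the vertices of $[-1;1]^n$, computes the max-cut by enumerating the subsets $S$, and compares the two sides of the formula.

```python
from itertools import product

# a small graph on n vertices, given by its set of edges (an arbitrary example)
n = 5
edges = {(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)}

# the matrix of the proposition: n on the diagonal, -1 on edges, 0 elsewhere
A = [[n if i == j else (-1 if (i, j) in edges or (j, i) in edges else 0)
      for j in range(n)] for i in range(n)]

def norm_inf_1(A):
    """(l^infinity, l^1)-norm: sup of ||Az||_1 over the vertices z of [-1,1]^n."""
    return max(sum(abs(sum(row[j] * z[j] for j in range(n))) for row in A)
               for z in product([-1, 1], repeat=n))

def max_cut(edges):
    """Max-cut by exhaustive search (z_i = +1 iff vertex i belongs to S)."""
    return max(sum(1 for (i, j) in edges if z[i] != z[j])
               for z in product([-1, 1], repeat=n))

print(norm_inf_1(A), 4 * max_cut(edges) - 2 * len(edges) + n * n)  # both are 31 here
```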

The last step of the proof is an application of the “simple max-cut” NP-hardness theorem of Garey, Johnson and Stockmeyer (1976), itself a strengthening of Karp (1972)'s seminal result that “max-cut” is NP-complete. I won't explain the proofs of these results here, but let me explain what they mean and how they relate to the present discussion. First of all, computer scientists categorize problems according to the time that is required to solve them, in terms of the size of the entries. This notion depends on the actual computer that is used, but the theory of Turing machines allows one to single out two classes, P and EXP, consisting of the problems which can be solved in polynomial, respectively exponential, time in terms of the size of the entries. A second notion, introduced by Karp, is that of NP problems, problems which can be solved in polynomial time by a “non deterministic Turing machine” — “nondeterministic” means the computer can parallelize itself at will when it needs to consider various possibilities. This class is contained in EXP (because one can simulate in exponential time a polynomial time nondeterministic algorithm) and also corresponds to the class of problems whose solution can be checked in polynomial time.

Our problem is to find a subset $S$ of $\{1,\dots,n\}$ that maximizes $c(S)$. This is a restriction of the “general max-cut” problem where, given an integer-valued function $w$ on the set of edges, one wishes to find a subset that maximizes $c(S;w)$, the sum of the weights of the edges which have one endpoint in $S$ and the other outside of $S$. Karp (1972) observed that the existence of $S$ such that $c(S;w)\geq m$ is an NP problem (if one is provided $S$, it takes polynomial time to compute $c(S;w)$ and to decide that it is at least $m$), and the naïve search algorithm is in EXP, since there are $2^n$ such subsets. Moreover, Karp proves that any NP problem can be reduced to it in polynomial time. This is what is meant by the assertion that it is NP-complete. Consequently, determining $\sup_S c(S;w)$ is NP-hard: if you can solve that problem, then you can solve the “max-cut” problem in polynomial time, hence any other NP problem. A subsequent theorem by Garey, Johnson and Stockmeyer (1976) established that the restriction of the max-cut problem to unit weights (“simple max-cut”, which is exactly the problem appearing above) is still NP-hard, and this completes the proof of Rohn's theorem.

(Aside, to insist that signs matter: by a theorem of Edmonds and Karp (1972), one can solve the “min-cut” problem in polynomial time; it consists in deciding, for some given integer $m$, whether there exists $S$ such that $c(S;w)\leq m$.)

Saturday, March 16, 2024

Combinatorics of the nilpotent cone

$\global\def\Card{\operatorname{Card}}\global\def\GL{\mathrm{GL}}\global\def\im{\operatorname{im}}\gdef\KVar{\mathrm{KVar}}$

Let $n$ be an integer and $F$ be a field. Nilpotent matrices $N\in \mathrm M_n(F)$ are those matrices for which there exists an integer $p$ with $N^p=0$. Their characteristic polynomial is $\chi_N(T)=T^n$, and they satisfy $N^n=0$, which shows that the set $\mathscr N_n$ of nilpotent matrices is an algebraic variety. The equation $N^n=0$ is homogeneous of degree $n$, so that $\mathscr N_n$ is a cone.

The classification of nilpotent matrices is an intermediate step in the theory of Jordan decomposition: In an adequate basis, a nilpotent matrix can be written as a diagonal block matrix of “basic” nilpotent matrices, $p \times p$ matrices of the form \[ \begin{pmatrix} 0 & 0 & \dots & 0 & 0 \\ 1 & 0 & & & \vdots \\ 0 & 1 & \ddots & & 0 \\ \vdots & \ddots & \ddots & \ddots & 0 \\ 0 & & 0 & 1 & 0\end{pmatrix} \] whose minimal polynomial is $T^p$. The sum of the sizes of these blocks is $n$, and in this way a partition $\pi$ of $n$ is associated with any $n\times n$ nilpotent matrix. It is known that two nilpotent matrices are conjugate if and only if they are associated with the same partition. For any partition $\pi$ of $n$, let us denote by $J_\pi$ the corresponding matrix whose block sizes are arranged in increasing order, and by $\mathscr N_\pi$ the set of nilpotent matrices that are associated with the partition $\pi$.

The theorem of Fine and Herstein (1958)

Having to teach “agrégation” classes made me learn about a classic combinatorial result: counting the number of nilpotent matrices when $F$ is a finite field.

Theorem (Fine, Herstein, 1958). — Let $F$ be a finite field with $q$ elements. The cardinality of $\mathscr N_n(F)$ is $q^{n^2-n}$. Equivalently, the probability that an $n\times n$ matrix with coefficients in $F$ be nilpotent is $q^{-n}$.
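Before looking at proofs, one can at least check the statement by brute force for very small $n$ and $q$ (a plain Python enumeration over all $q^{n^2}$ matrices, so this only makes sense for tiny values; the test cases are my own choices):

```python
from itertools import product

def count_nilpotent(n, q):
    """Count the nilpotent n x n matrices over Z/qZ (q prime) by brute force."""
    count = 0
    for entries in product(range(q), repeat=n * n):
        A = [list(entries[i * n:(i + 1) * n]) for i in range(n)]
        B = A
        for _ in range(n - 1):      # compute A^n, since A is nilpotent iff A^n = 0
            B = [[sum(B[i][k] * A[k][j] for k in range(n)) % q
                  for j in range(n)] for i in range(n)]
        if all(B[i][j] == 0 for i in range(n) for j in range(n)):
            count += 1
    return count

for (n, q) in [(2, 2), (2, 3), (2, 5), (3, 2)]:
    print(n, q, count_nilpotent(n, q), q ** (n * n - n))   # the last two columns agree
```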

The initial proof of this result relies on the action of $\GL_n(F)$ on $\mathscr N_n(F)$: we recalled that the orbits correspond with the partitions of $n$, hence a decomposition \[ \Card(\mathscr N_n(F)) = \sum_{\pi} \Card(\mathscr N_\pi(F)). \] We know that $\mathscr N_\pi(F)$ is the orbit of the matrix $J_\pi$ under the action of $\GL_n(F)$. By the classic orbit-stabilizer formula, one thus has \[ \Card(\mathscr N_\pi(F)) = \frac{\Card(\GL_n(F))}{\Card(C_\pi(F))}, \] where $C_\pi(F)$ is the set of matrices $A\in\GL_n(F)$ such that $AJ_\pi=J_\pi A$. The precise description of $C_\pi(F)$ is delicate, but their arguments go as follows.

They first replace the group $C_\pi(F)$ by the algebra $A_\pi(F)$ of all matrices $A\in\mathrm M_n(F)$ such that $AJ_\pi=J_\pi A$. For any integer $i$, let $m_i$ be the multiplicity of $i$ in the partition $\pi$, so that $n=\sum i m_i$. The block decomposition of $J_\pi$ corresponds with a decomposition of $F^n$ as a direct sum of invariant subspaces $V_i$, where $V_i$ (the sum of the blocks of size $i$) has dimension $i m_i$. Every element of $A_\pi(F)$ preserves the subspaces $\ker(J_\pi^k)$ and $\im(J_\pi^k)$; working in a basis adapted to these subspaces, one shows that $A_\pi(F)$ is an algebra of block-triangular matrices, whose diagonal blocks can be identified with the algebras $\mathrm M_{m_i}(F)$. In other words, we have a surjective morphism of algebras \[ A_\pi(F) \to \prod_i \mathrm M_{m_i}(F), \] whose kernel consists of nilpotent matrices. In particular, the proportion of invertible elements in $A_\pi(F)$ is equal to the proportion of invertible elements in the product $\prod_i \mathrm M_{m_i}(F)$.

Ultimately, Fine and Herstein obtain an explicit sum over the set of partitions of $n$ which they prove equals $q^{n^2-n}$, after an additional combinatorial argument.

Soon after, the theorem of Fine and Herstein was given easier proofs, starting from Gerstenhaber (1961) to Kaplansky (1990) and Leinster (2021).

A proof

The following proof is borrowed from Caldero and Peronnier (2022), Carnet de voyage en Algébrie. It can be seen as a simplification of the proofs of Gerstenhaber (1961) and Leinster (2021).

Let us start with the Fitting decomposition of an endomorphism $u\in \mathrm M_n(F)$: the least integer $p$ such that $\ker(u^p)=\ker(u^{p+1})$ coincides with the least integer $p$ such that $\im(u^p)=\im(u^{p+1})$, and one has $F^n=\ker(u^p)\oplus \im(u^p)$. The subspaces $N(u)=\ker(u^p)$ and $I(u)=\im(u^p)$ are invariant under $u$, and $u$ acts nilpotently on $\ker(u^p)$ and bijectively on $\im(u^p)$. In other words, we have associated with $u$ a pair of complementary subspaces $N(u)$ and $I(u)$, a nilpotent operator on $N(u)$ and an invertible operator on $I(u)$; the map which associates this data with $u$ is bijective.

For any integer $d$, let $\nu_d$ be the number of nilpotent matrices in $\mathrm M_d(F)$, and $\gamma_d$ be the number of invertible matrices in $\mathrm M_d(F)$. Let also $\mathscr D_d$ be the set of all pairs $(K,I)$, where $K$ and $I$ are complementary subspaces of $F^n$ of dimensions $d$ and $n-d$. The Fitting decomposition thus gives \[ q^{n^2} = \sum_{d=0}^n\ \sum_{(K,I)\in\mathscr D_d} \nu_d \cdot \gamma_{n-d}. \] We need to compute the cardinality of $\mathscr D_d$. In fact, given one pair $(K,I)\in\mathscr D_d$, all others are of the form $(g\cdot K,g\cdot I)$, for some $g\in\GL_n(F)$: the group $\GL_n(F)$ acts transitively on $\mathscr D_d$. The stabilizer of $(K,I)$ can be identified with $\GL_d(F)\times \GL_{n-d}(F)$. Consequently, \[ \Card(\mathscr D_d) = \frac{\Card(\GL_n(F))}{\Card(\GL_d(F))\,\Card(\GL_{n-d}(F))} = \frac{\gamma_n}{\gamma_d \gamma_{n-d}}. \] We thus obtain \[ q^{n^2} = \sum_{d=0}^n \frac{\gamma_n}{\gamma_d \gamma_{n-d}} \nu_d \gamma_{n-d} = \gamma_n \sum_{d=0}^n \frac{\nu_d}{\gamma_d}. \] By subtraction (comparing this formula for $n$ and for $n-1$), we get \[ \frac{\nu_n}{\gamma_n} = \frac {q^{n^2}}{\gamma_n} - \frac{q^{(n-1)^2}}{\gamma_{n-1}},\] or \[ \nu_n = q^{n^2} - q^{(n-1)^2} \frac{\gamma_n}{\gamma_{n-1}}. \] It remains to compute $\gamma_n$: since an invertible matrix consists of a nonzero first column, a second column which does not belong to the line generated by the first one, etc., we have \[ \gamma_n = (q^n-1) (q^n-q)\dots (q^n-q^{n-1}). \] Then, \[ \gamma_n = (q^n-1) q^{n-1} (q^{n-1}-1)\dots (q^{n-1}-q^{n-2}) = (q^n-1)q^{n-1} \gamma_{n-1}. \] We thus obtain \[ \nu_n = q^{n^2} - q^{(n-1)^2} (q^n-1) q^{n-1} = q^{n^2} - q^{(n-1)^2} q^{2n-1} + q^{(n-1)^2} q^{n-1} = q^{n^2-n}, \] as claimed.

The proof of Leinster (2021)

Leinster defines a bijection from $\mathscr N_n(F)\times F^n$ to $\mathrm M_n(F)$. The definition is however not very canonical, because he assumes given, for any subspace $V$ of $F^n$, a basis of $V$.

Take a pair $(u,x)$, where $u\in\mathscr N_n(F)$ and $x\in F^n$ and consider the subspace $V_x=\langle x,u(x),\dots\rangle$, the smallest $u$-invariant subspace of $F^n$ which contains $x$. To describe $u$, we observe that we know its restriction to $V_x$, and we need to describe it on the chosen complementary subspace $V'_x$.

To that aim, we have to give ourselves an endomorphism $u'_x$ of $V'_x$ and a linear map $\phi_x\colon V'_x\to V_x$. Since we want $u$ to be nilpotent, it is necessary and sufficient to take $u'_x$ nilpotent.

Instead of considering $\phi_x\colon V'_x\to V_x$, we can consider the map $y\mapsto y+\phi_x(y)$. Its image is a complement $W_x$ of $V_x$ in $F^n$, and any complement can be obtained in this way. The nilpotent endomorphism $u'_x$ of $V'_x$ transfers to a nilpotent endomorphism $w_x$ of $W_x$.

All in all, what the given pair $(u,x)$ furnishes is a subspace $V_x$ with a basis $(x_1=x,x_2=u(x),\dots)$, a complement $W_x$, and a nilpotent endomorphism $w_x$ of $W_x$. This is more or less what the Fitting decomposition of an endomorphism gives us! Recall that $V_x$ was assumed to have been given a basis $(e_1,\dots,e_p)$. There exists a unique automorphism of $V_x$ which maps $e_i$ to $u^{i-1}(x)$ for all $i$. In other words, we have a pair of complementary subspaces $(V_x,W_x)$, a linear automorphism of $V_x$, and a nilpotent endomorphism of $W_x$. By the Fitting decomposition, these data furnish in a bijective way an endomorphism of $F^n$, and that concludes the proof.

A remark about motivic integration

The framework of motivic integration suggests upgrading these combinatorial results into equalities valid for every field $F$, which hold in the Grothendieck ring of varieties $\KVar_F$. As an abelian group, it is generated by symbols $[X]$, for all algebraic varieties $X$ over $F$, with relations $[X]=[Y]+[X\setminus Y]$, whenever $Y$ is a closed subvariety of $X$. The ring structure is defined so that the formula $[X]\cdot[Y]=[X\times Y]$ holds for all algebraic varieties $X$ and $Y$ over $F$.

By construction of this ring, equalities $[X]=[Y]$ in $\KVar_F$ imply that many invariants of $X$ and $Y$ coincide. In particular, when $F$ is a finite field, they will have the same number of points.

The question is thus to compute the class $[\mathscr N_n]$ of the variety $\mathscr N_n$, for any field $F$. The proofs that I described above can be more or less transferred to this context and imply the following theorem. We denote by $\mathbf L\in \KVar_F$ the class of the affine line $\mathbf A^1$.

Theorem. — One has an equality $[\mathscr N_n] \mathbf L^n = \mathbf L^{n^2}$ in the localization of the Grothendieck ring $\KVar_F$ by the element $(\mathbf L-1)\dots(\mathbf L^{n-1}-1)$.

The following question is then natural. (I have not thought about it at all.)

Question. — Does one have $[\mathscr N_n]=\mathbf L^{n^2-n}$ in $\KVar_F$?

Monday, December 13, 2021

Not simple proofs of simplicity

Over the last few weeks, I started my self-education in proof formalization (in Lean) with a test case, the simplicity of the alternating group. In this blog post, I want to discuss some aspects of the underlying mathematics which I think I learnt during the formalization process.

Simple groups. Some examples

This is about groups, ultimately finite groups. Let us recall that a subgroup $N$ of a group $G$ is normal if for every $g \in G$ and $n\in N$, one has $gng^{-1}\in N$. Concretely, this means that the two equivalence relations modulo $N$ (left or right) coincide, and that the quotient set $G/N$ inherits a group structure that makes the natural projection $G \to G/N$ a morphism of groups.

In this setting, the group $G$ can be viewed as an extension of the two groups $G/N$ and $N$, which are possibly simpler, and this extension may help in understanding properties of the group $G$. At the opposite extreme, one says that $G$ is simple if it has no normal subgroup besides $\{e\}$ and $G$ itself (and if $G\neq\{e\}$).

Trivial examples of normal subgroups of a group $G$ are $\{e\}$ and $G$ itself. Less trivial examples are the center $Z(G)$ of $G$ (those elements $g$ such that $gh=hg $ for all $h\in G$), and the derived subgroup $D(G)$, the subgroup generated by all commutators $ghg^{-1}h^{-1}$. This commutator subgroup is interesting: any subgroup $N$ of $G$ containing the commutator subgroup $D(G)$ is normal, and the quotient is abelian; and conversely.

The kernel of a group morphism is a normal subgroup, and the construction of a quotient shows that all normal subgroups appear in this way. In particular, the alternating group $A_n$ is a normal subgroup of the symmetric group $S_n$.

A simple group has either $Z(G)=\{e\}$ or $Z(G)=G$; in the latter case, this means that $G$ is commutative, and it quickly appears that $G$ has to be finite of prime order. Consequently, our discussion will now concern groups with trivial center.

The concept of simplicity of groups is classically presented in connection with Galois theory, and the question of solvability of equations by radicals. Namely, a “general” polynomial equation of degree $n$ has $S_n$, the full symmetric group on $n$ elements, for its Galois group, and, if $n\geq 5$, the only possible dévissage of this group consists in introducing the alternating group $A_n$ of even permutations, the subgroup $A_n$ being normal and simple. On the other hand, the solvability of a polynomial equation by radicals is equivalent to such a dévissage where all successive quotients are cyclic groups (equivalently abelian groups). Since $A_n$ is not abelian, this implies that a “general” polynomial equation of degree $n$ is not solvable by radicals. However, using the simplicity of the alternating group is much stronger than what we need: what is really at stake is the non-solvability of the symmetric group, and proving that $S_n$ is not solvable when $n\geq 5$ is much simpler. Consequently, for the problem of solvability by radicals, it suffices to prove that the derived subgroup $D(A_n)$ of the alternating group is equal to $A_n$.

Theorem. — For $n\geq 5$, one has $D(A_n)=A_n$. In particular, the alternating group and the symmetric group on $n$ letters are not solvable.
I give two (equivalent) proofs, the second one being a computational interpretation of the first one. Both use that the 3-cycles generate $A_n$ and are conjugate in $A_n$. The computational proof is shorter, arguably simpler. As a matter of fact, I never understood it, nor could I remember it, until I translated the conceptual proof into the proof assistant.
Consider the morphism $p\colon A_n\to A_n/D(A_n)$. Since $A_n/D(A_n)$ is commutative, all 3-cycles have the same image. Since the square of a 3-cycle is again a 3-cycle, both have the same image. This implies that for every 3-cycle $g\in A_n$, one has $p(g)=p(g^2)=p(g)^2$, hence $p(g)=e$. Since the 3-cycles generate $A_n$, the morphism $p$ is trivial; since it is surjective, one has $A_n/D(A_n)=\{e\}$ and $D(A_n)=A_n$.
Computationally, consider a 3-cycle $g$ and its square $g^2$. Since they are conjugate, there exists $h\in A_n$ such that $g^2=hgh^{-1}$. Then $g=hgh^{-1}g^{-1}$, so that $g$ is a commutator; in particular, $D(A_n)$ contains all commutators and $D(A_n)=A_n$.
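Here is the computational proof checked on a concrete example in $A_5$ (a small plain Python verification; permutations of $\{0,\dots,4\}$ are encoded as tuples, and the specific $g$ and $h$ are choices of mine):

```python
def compose(p, q):
    """Composition p∘q (apply q first, then p)."""
    return tuple(p[q[i]] for i in range(len(p)))

def inverse(p):
    inv = [0] * len(p)
    for i, pi in enumerate(p):
        inv[pi] = i
    return tuple(inv)

# In A_5, acting on {0,...,4}: g is the 3-cycle (0 1 2), h = (1 2)(3 4) is even.
g = (1, 2, 0, 3, 4)          # 0 -> 1, 1 -> 2, 2 -> 0
h = (0, 2, 1, 4, 3)          # swaps 1 <-> 2 and 3 <-> 4

conj = compose(compose(h, g), inverse(h))   # h g h^{-1}
assert conj == compose(g, g)                # g^2 and g are indeed conjugate in A_5
commutator = compose(conj, inverse(g))      # h g h^{-1} g^{-1}
assert commutator == g                      # the 3-cycle g is itself a commutator
print("g is the commutator h g h^{-1} g^{-1}:", commutator == g)
```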

The remaining cases, for $n\leq 4$, are interesting, but classically left as exercises in textbooks:

  1. One has $A_1=S_1=\{e\}$;
  2. The group $S_2$ is the cyclic group of order 2, hence is simple and solvable, and $A_2$ is trivial;
  3. The group $S_3$ is a noncommutative group of order 6, and $A_3$ is a cyclic group of order 3.
  4. The groups $S_4$ and $A_4$ are noncommutative and solvable, of orders 24 and 12. The derived subgroup $D(S_4)$ is equal to $A_4$, while $D(A_4)$ is the Klein subgroup $V_4$ of $S_4$, consisting of the identity and of the permutations of the form $(ab)(cd)$ for $a,b,c,d$ any enumeration of $1,2,3,4$ — “double transpositions”. The group $V_4$ is commutative, isomorphic to $(\mathbf Z/2\mathbf Z)^2$, and the quotient $A_4/V_4$ is cyclic of order $3$.

Another classical series of simple groups appears in linear algebra. Let $F$ be a field and let $n$ be an integer such that $n\geq 2$. The group $\mathrm{GL}(n,F)$ of $n\times n$ invertible matrices is not simple, for it is noncommutative but its center consists of homotheties; we can try to mod out by the center, getting the group $\mathrm{PGL}(n,F)=\mathrm{GL}(n,F)/F^\times$, but that one may not be simple. Indeed, another reason for $\mathrm{GL}(n,F)$ not to be simple is that it admits the special linear group $\mathrm{SL}(n,F)$, the kernel of the determinant, as a normal subgroup. The group $\mathrm{SL}(n,F)$ has a nontrivial center in general; it consists of the homotheties of ratio $a\in F^\times$ such that $a^n=1$ — let us denote it by $\mu_n$. But the quotient $\mathrm{PSL}(n,F)=\mathrm{SL}(n,F)/\mu_n$ is simple in general — “in general” meaning that it is always the case except for two exceptions:

  1. $n=2$ and $F=\mathbf F_2$. Then $\mathrm{PSL}(2,\mathbf F_2)\simeq S_3$ (by the action on $\mathbf P_1(\mathbf F_2)$, see below), hence is not simple.
  2. $n=2$ and $F=\mathbf F_3$. Then $\mathrm{PSL}(2,\mathbf F_3)\simeq A_4$ (again by the action on $\mathbf P_1(\mathbf F_3)$), and is not simple.

Bilinear algebra gives rise to new groups, orthogonal, unitary and symplectic, which also furnish simple groups up to elementary transformations. By the famous “classification of finite simple groups”, these constructions furnish all finite simple groups, up to 26 (or 27) examples called sporadic groups. This remarkable theorem has a fantastic proof, encompassing thousands of pages across the years 1960-2010.

But the question here is: How can one prove that a given group is simple?

Alternating groups

There is a large supply of proofs that the alternating group $A_n$ is simple for $n\geq 5$. Here is a sketch of one.

Let $N$ be a normal subgroup of $A_n$ ($n\geq 5$) and assume that $N\neq\{e\}$. An element of $A_n$ can be written as a product of an even number of transpositions. If two successive transpositions in the product are equal, we can cancel them; if they share exactly one common member, as in $(a\,b)(a\,c)$, their product is a 3-cycle $(a\,c\,b)$; if they have no common member, their product is a double transposition. On the other hand, if $n\geq 5$, we can either insert $(b\,c)(b\,c)$ in the product $(a\,b)(c\,d)$, writing a double transposition as a product of two 3-cycles, or insert $(d\,e)(d\,e)$ in the product $(a\,b)(a\,c)$, writing a 3-cycle as a product of two double transpositions. Consequently, $A_n$ is generated by the 3-cycles, and also by the double transpositions. Moreover, since $n\geq 5$, we can check that 3-cycles are pairwise conjugate in $A_n$, and similarly for double transpositions; consequently, if the normal subgroup $N$ of $A_n$ contains either a 3-cycle or a double transposition, it will contain all of them, hence be equal to $A_n$.

When $n=5$, the only case that remains to consider is when $N$ contains a 5-cycle, say $g=(1\,2\,3\,4\,5)$. Conjugating $g$ by the 3-cycle $h=(4\,5\,1)$, we get $hgh^{-1}=(h1\,h2\,h3\,h4\,h5)=(4\,2\,3\,5\,1)\in N$. By construction, this element takes the same value as $g$ at $5$ (both map $5$ to $1$), but it is distinct from $g$ (they differ at $1$, for instance). Consequently, the commutator $hgh^{-1}g^{-1}$ is a nontrivial element of $N$ that fixes $1$; being a nontrivial even permutation with a fixed point, it is a 3-cycle or a double transposition, so by the first two cases, one has $N=A_5$.

A similar argument works in general, arguing by descending induction on the cardinality of the fixed point set of an element $g\neq e$ of $N$. One considers an element $h$ of $A_n$ and the conjugate $hgh^{-1}$; if $g=(a_1\,a_2\,\dots)(b_1\,b_2\,\dots)\dots$, then $hgh^{-1}=(ha_1\,ha_2\,\dots)(hb_1\,hb_2\,\dots)\dots$ is an element of $N$ that behaves as $g$ on many elements, but not all. Consequently, $hgh^{-1}g^{-1}$ is a nontrivial element of $N$ that fixes more elements than $g$, and we conclude by induction. (Details can be found in James Milne's lectures, 4.34.)

The Iwasawa criterion

In 1941, Kenkiti Iwasawa published a proof of the simplicity of the projective special linear group. From this proof, a general criterion for simplicity has been abstracted:

Theorem (Iwasawa). — Let $G$ be a group with a primitive action on a set $X$. Assume that one is given, for every $x\in X$, a subgroup $T(x)$ of $G$ satisfying the following properties:

  • For every $x\in X$, the subgroup $T(x)$ is commutative;
  • For every $g\in G$ and $x\in X$, $T(g\cdot x)=g T(x)g^{-1}$;
  • The union of the groups $T(x)$, for $x\in X$, generates $G$.
Then any normal subgroup $N$ of $G$ that acts nontrivially on $X$ contains the commutator subgroup of $G$. In particular, if $G=D(G)$ and the action is faithful, then $G$ is simple.

There are two classical ways to state that the action is primitive. The simplest states that it is transitive and that the stabilizers of points are maximal subgroups of $G$. Another is that the action is transitive and admits no nontrivial imprimitivity system, that is, no partition $(X_i)$ of $X$, other than the trivial ones, such that for every $g\in G$ and every $i$, there exists $j$ such that $g\cdot X_i=X_j$. One can prove that a 2-transitive action (= transitive on ordered pairs of distinct elements) is primitive.

Iwasawa applies his criterion to $G=\mathrm{SL}(n,F)$ acting on the projective space $\mathbf P_{n-1}(F)$ of lines in $F^n$. Except when $n=2$ and $F$ has 2 or 3 elements, this action is 2-transitive, hence primitive. For a nonzero $x\in F^n$, one considers the group $T(x)$ of linear transformations of the form $y\mapsto y + \phi(y) x$ (transvections), for all linear forms $\phi$ on $F^n$ such that $\phi(x)=0$. They constitute a commutative subgroup of $\mathrm{SL}(n,F)$ (isomorphic, as a group, to $F^{n-1}$). The map $T$ gives rise to the data as in Iwasawa's theorem. Consequently, every normal subgroup $N$ of $\mathrm{SL}(n,F)$ that acts nontrivially on $\mathbf P_{n-1}(F)$ contains the commutator subgroup of $\mathrm{SL}(n,F)$, which is $\mathrm{SL}(n,F)$. Explicitly, either $N$ consists of homotheties, or $N$ contains $\mathrm{SL}(n,F)$. This implies that $\mathrm{PSL}(n,F)$ is simple.

Applying the Iwasawa criterion for symmetric groups

One may wish to apply the Iwasawa criterion to the symmetric group. However, the conclusion is not as nice as what I had hoped initially.

Let $S_n$ act on the set of 2-element subsets of $X=\{1,\dots,n\}$. If $n\geq 5$, the action is primitive. This is not completely obvious, because this action is not 2-transitive (you cannot map $\{1,2\}$ and $\{1,3\}$ to $\{1,2\}$ and $\{3,4\}$)! Its primitivity means that stabilizers are maximal subgroups, and one can prove that the stabilizer of a 2-element subset is indeed maximal unless $n=4$. It is also faithful (no nontrivial element acts trivially). To a pair $\{a,b\}$, associate the subgroup of order 2 generated by the transposition $(a\,b)$. It satisfies the criterion, and this shows that the nontrivial normal subgroups of $S_n$ contain $D(S_n)=A_n$. Since $A_n$ has index 2, this shows that $N=A_n$ or $N=S_n$.

It is interesting to guess how the criterion breaks for $n=4$. In fact, the action of $S_4$ on pairs is transitive, but not primitive: the stabilizer of $\{1,2\}$ is a subgroup of $S_4$ of order $4$, consisting of the identity, the two transpositions $(1\,2)$ and $(3\,4)$, and their product. Since $\mathrm{Card}(S_4)=24=2^3\cdot 3$, this subgroup is contained in a 2-Sylow subgroup, of order 8, so is not maximal.

However, and I find it unfortunate, this proof does not imply that the alternating group $A_n$ is simple for $n\geq 5$. To prove the simplicity of $A_n$ along these lines, we need to consider the following actions.

  • Let $A_n$ act on the set $X_3$ of 3-element subsets of $X$. For $x=\{a,b,c\}\in X_3$, consider the subgroup $T(x)$ of order 3 of $A_n$ consisting of the identity and the 3-cycles $(a\,b\,c)$ and $(a\,c\,b)$. (It is the alternating group on $x$, viewed as a subgroup of $A_n$.) The Iwasawa criterion applies provided the action is primitive, which presumably holds for $n>6$. Consequently, $A_n$ is simple for $n\geq 7$. However, if $n=6$, the stabilizer of $\{1,2,3\}$ in $A_6$ is not maximal, for a reason analogous to the one explained for $S_4$.
  • Let $A_n$ act on the set $X_4$ of 4-element subsets of $X$. For $x\in X_4$, we can consider the Klein subgroup $T(x)$ of $A_4$, acting on $x$, viewed as a subgroup of $A_n$. I hope that the action is primitive, except for $n=8$; this would prove that $A_n$ is simple for $n\geq 5$, with the possible exception of $n=8$, a case already covered by the previous argument. This uses that double transpositions generate $A_n$.
  • One can improve the two preceding arguments a little bit. If $n=5$, we can have $A_n$ act on $X_2$, associating with $x$ the alternating group on the three remaining letters. Therefore, $A_5$ is simple (the action is primitive: the stabilizer of a pair is a subgroup of $A_5$ of order $60/10=6$, isomorphic to $S_3$, and one can check that it is a maximal subgroup of $A_5$). Similarly, if $n=6$, we have $A_6$ act on $X_2$, associating with $x\in X_2$ the Klein subgroup corresponding to the four remaining letters. Provided one can prove that the stabilizer of a pair in $A_6$ is a maximal subgroup, this gives a proof that $A_6$ is simple!

The primitivity assertions seem to hold. In fact, maximal subgroups of $A_n$ and $S_n$ are classified by the O'Nan–Scott theorem. Those we're concerned with have type (a) in the notation of (Liebeck, Praeger, Saxl, 1987. “A Classification of the Maximal Subgroups of the Finite Alternating and Symmetric Groups.” Journal of Algebra 111 (2): 365–83. DOI:10.1016/0021-8693(87)90223-7), and according to Theorem 1 of that paper, the relevant subgroups are indeed maximal.

Formalization: where am I now?

I have already formalized the Iwasawa criterion in Lean (there's an ongoing pull request for this), as well as the result that the normal subgroups of $S_n$ are $\{e\}$, $A_n$ and $S_n$ for $n\geq 5$.

It remains to formalize the rest of the proof, and the main difficulty will be in proving primitivity, plus some nice way of defining the maps $T$ so that their properties are visible.

I also wish to formalize the simplicity of the special linear group along these lines; the setup should then be ready for applications to orthogonal, unitary or symplectic groups as well.

Thursday, April 22, 2021

Growth of the Gaussian pivoting algorithm

“Gaussian elimination” is the standard method for solving systems of linear equations that runs by choosing one pivot variable in one of the equations and eliminating it from the other equations by a suitable linear combination. These other equations have one less variable and one may iterate the process, leading to a system that has a triangular form and can be solved in a relatively straightforward way. In the so-called Gauss-Jordan method, the pivot variables are eliminated from all of the equations, and this leads to the row reduced echelon form of the initial system, a new system in which the pivot variables are explicitly solved in terms of the remaining variables; it also has the merit of being independent of the intermediate steps, but this would be a story for another night.

How the name of Gauss came to be attached to this method is another story, which is recounted in great detail by J. Grcar. As he explains, systems of linear equations and their solution already appear on Babylonian tablets (2000 BC), on the Egyptian Rhind papyrus (1550 BC), in the Chinese Nine chapters on the mathematical art (200 AD), the Arithmetica of the Greek mathematician Diophantus (250 AD), the Āryabhaṭīya of the Hindu mathematician Āryabhaṭa (499 AD). Such systems were essentially absent from mathematics treatises during the Renaissance, and it is to Newton (17th century) that we owe their revival in Western Europe. At the beginning of the 19th century, in relation with the problem in celestial mechanics/geodesy of fitting together multiple imprecise measurements, Gauss and Legendre invented the least squares method. This involved a system of linear equations which Gauss solved by what he called “common elimination”. In the 20th century, the development of computing machines and numerical analysis led to further work, from Cholesky (in relation with geodesy), Crout, to von Neumann and Goldstine and their $LDU$ decomposition.

Whoever has had to perform elimination by hand knows that the computations rapidly become tedious and often lead to more and more complicated fractions.

When computer calculations are done in floating point arithmetic, the difficulty of rounding errors appears. In a linear system, say $Ax=b$, the matrices $A$ and $b$ are in general only known up to an error, so that the system that is actually solved is rather $(A+E)x=b+\delta b$, and it is obviously important to compare its solution with the solution of the initial system. One way to make this comparison involves the inverse of $A$, which is unknown at this stage. The product of the norms $\|A\| \,\|A^{-1}\|$ is the condition number of $A$, and one cannot avoid the problem of ill-conditioned matrices, which inherently lead to a lack of precision.

But when floating point numbers are used in the computation, a new problem appears, even when one restricts to well-conditioned systems. Floating point numbers are of the form $\pm a\cdot 10^e$ (in base 10, say), where $a$ is a real number between $1$ and $10$ which is known up to a fixed number of decimal places. In other words, floating point numbers are known up to a relative error. Let us examine the consequence for the subtraction of floating point numbers. Consider two such numbers, say $x=a\cdot 10^e$ and $x'=a'\cdot 10^e$, with the same exponent $e$, and such that $a$ and $a'$ are close. Their difference is given by $x'-x=(a'-a)\cdot 10^e$, but since $a$ and $a'$ are close, $a'-a$ is no longer between $1$ and $10$; it may be, say, of size $10^{-5}$, so that the floating point expression of $y=x'-x$ is $(10^5(a'-a))\cdot 10^{e-5}=b\cdot 10^{e-5}$; the problem is that the last 5 decimals of $b$ are absolutely unknown, which leads to a relative error for $y$ which is $10^5$ times as big as expected!

Now, by its very essence, the elimination method features a lot of such subtractions, hence it is inherently not very well suited to floating point numbers.

In the 1960s, the British mathematician James H. Wilkinson analysed the situation very precisely. He showed that for Gaussian elimination, the main parameter is the relative size of the matrices that intervene in the process. To set up some notation, imagine that the initial system $Ax=b$ is transformed, step by step, into a series of equivalent systems $A^{(r)}x=b^{(r)}$, where $A^{(r)}=(a_{ij}^{(r)})$ is a matrix whose first $r$ rows are in triangular form, and the remaining $n-r$ rows still need to be worked on. To get the matrix $A^{(r+1)}$, one subtracts the multiple $m_{ir}$ of the $r$th row from the $i$th row, for $i$ ranging from $r+1$ to $n$, where the multipliers $m_{ir}$ are defined by
\[ m_{ir}=a_{ir}^{(r)}/a_{rr}^{(r)}. \]
Large multipliers lead to a lack of precision, but if the complete pivoting method is used (at each step, the pivot is chosen as an entry of largest absolute value in the block that remains to be worked on), one has $|m_{ir}|\leq 1$. In this case, one observes that a bound $|a_{ij}^{(r)}|\leq M$ for the coefficients of $A^{(r)}$ leads to the bound $|a_{ij}^{(r+1)}|\leq 2M$ at the next step. At the last step, one gets $|a_{ij}^{(n)}|\leq 2^{n-1}M$, where $M=\sup(|a_{ij}|)$. Consequently, the relevant constant to be estimated,
\[ R(A) = \sup_r \frac{\sup_{i,j}|a_{ij}^{(r)}|}{\sup_{i,j}|a_{ij}|}, \]
satisfies $R(A)\leq 2^{n-1}$.

In fact, Wilkinson (1961) gave a much better bound. Let $B^{(r)}$ be the square matrix of size $n-r$ that has to be worked on after the $r$th step and let $b_{r+1}$ be the maximum size of its coefficients; it is the size of the next pivot, since one does complete pivoting. One has the following formula for its determinant:
\[ \det(B^{(r)})= b_{r+1}\cdots b_n. \]
Moreover, the Euclidean norms of the columns of $B^{(r)}$ are bounded above by $\sqrt{n-r}\, b_{r+1}$ and the Hadamard inequality (“the volume of a parallelepiped is smaller than the product of the sizes of its edges”) implies that
\[ | \det(B^{(r)})| \leq (n-r)^{(n-r)/2} b_{r+1}^{n-r}. \]
Together, these relations lead to an upper bound
\[ R(A) \leq n^{1/2} \left( 2\cdot 3^{1/2}\cdot 4^{1/3}\cdots n^{1/(n-1)}\right)^{1/2}, \]
roughly $n^{\log(n)/4}$, a bound that Wilkinson considers to be a “severe overestimate”, and in Wilkinson (Rounding Errors in Algebraic Processes, Dover, 1963, bottom of p. 97), he even notes that “No matrix has yet been discovered for which $R(A)>n$.”

This statement remained known as Wilkinson's conjecture, although Wilkinson himself did not state it as such. Tornheim (1964) proved that Hadamard matrices — matrices with entries $\pm1$ and pairwise orthogonal columns and rows — satisfy $R(A)\geq n$, and this led Cryer (1968) to formally state the conjecture and to suspect that $R(A)<n$ unless $A$ is a Hadamard matrix. Matrices with $R(A)\geq (n+1)/2$ have been exhibited (Higham & Higham, 1989), and random matrices seem to have an $R$-factor of roughly $n^{1/2}$ (Trefethen & Schreiber, 1990). In fact, Hadamard matrices of size $n=12$ (Edelman & Mascarenhas, 1995) or $n=16$ (Kravvaritis & Mitrouli, 2009) satisfy $R(A)=n$, but this is definitely nontrivial.
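For readers who want to experiment, here is a minimal numpy sketch of Gaussian elimination with complete pivoting that computes the growth factor $R(A)$ (the code and the random test matrix are my own illustration, not taken from the papers quoted above). On the $4\times4$ Hadamard matrix it returns exactly $4$, in line with Tornheim's result, while a random matrix stays far below the $2^{n-1}$ bound.

```python
import numpy as np

def growth_factor(A):
    """Growth factor R(A) of Gaussian elimination with complete pivoting."""
    A = np.array(A, dtype=float)
    n = A.shape[0]
    max0 = np.max(np.abs(A))
    growth = 1.0
    for r in range(n - 1):
        # complete pivoting: bring a largest remaining entry to position (r, r)
        sub = np.abs(A[r:, r:])
        i, j = np.unravel_index(np.argmax(sub), sub.shape)
        A[[r, r + i], :] = A[[r + i, r], :]
        A[:, [r, r + j]] = A[:, [r + j, r]]
        # eliminate the entries below the pivot
        for k in range(r + 1, n):
            A[k, r:] -= (A[k, r] / A[r, r]) * A[r, r:]
        growth = max(growth, np.max(np.abs(A)) / max0)
    return growth

# A 4x4 Hadamard matrix (Sylvester construction): here R(A) equals n = 4.
H2 = np.array([[1, 1], [1, -1]])
H4 = np.kron(H2, H2)
print(growth_factor(H4))                              # 4.0

# A random matrix stays far below the worst-case bound 2^(n-1).
rng = np.random.default_rng(4)
print(growth_factor(rng.standard_normal((50, 50))))   # typically a small number
```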

However, Gould (1991) could exhibit a matrix $A$ of size $n=13$ with  growth factor $R(A)=13.0205$, thus providing a counterexample to Wilkinson's “conjecture”. To that aim, he reduced the question to finding a solution of a nonlinear programming problem with roughly $n^3/3\approx 700$ variables, fortunately a sparse one, a computation he did using a programming package he had developed with Conn and Toint. The matrix he gives has integer coefficients of size up to 20 digits!

But the story does not end here!

Edelman tried to replicate Gould's computations using the computer algebra systems Mathematica and Maple — and what he found is that the growth factor of Gould's matrix is around $7.355$, consistently in both systems! As he writes, one of them could have had a bug, but it is rather unlikely that both had the same one.

What happened, and you will agree that there is much irony in this story, is that Gould had performed his computations using floating point arithmetic. A near tie in the 6th pivot led to an incorrect choice of pivot and to an erroneous computation, even within the double precision that he had used.

Fortunately, Edelman (1992) showed that changing one coefficient by $10^{-7}$ in Gould's matrix yields a growth of $13.02$, so that Wilkinson's “conjecture” is definitely incorrect.

Tuesday, February 9, 2016

Happy New Year!

As was apparently first noticed by Noam Elkies, 2016 is the cardinality of the general linear group over the field with 7 elements, $G=\mathop{\rm GL}(2,\mathbf F_7)$. I was mentoring an agrégation lesson on finite fields this afternoon, and I could not resist having the student check this. Then came the natural question of describing the Sylow subgroups of this finite group. This is what I describe here.

First of all, let's recall the computation of the cardinality of $G$. The first column of a matrix in $G$ must be non-zero, hence there are $7^2-1$ possibilities; for the second column, it only needs to be non-collinear to the first one, and each choice of the first column forbids $7$ second columns, hence $7^2-7$ possibilities. In the end, one has $\mathop{\rm Card}(G)=(7^2-1)(7^2-7)=48\cdot 42=2016$. The same argument shows that the cardinality of the group $\mathop{\rm GL}(n,\mathbf F_q)$ is equal to $(q^n-1)(q^n-q)\cdots (q^n-q^{n-1})=q^{n(n-1)/2}(q-1)(q^2-1)\cdots (q^n-1)$.
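For the skeptical reader, this can be checked by brute force in a couple of lines of Python (a throwaway enumeration, nothing more):

```python
from itertools import product

q = 7
# count the 2x2 matrices [[a, b], [c, d]] over F_7 with nonzero determinant
count = sum(1 for a, b, c, d in product(range(q), repeat=4)
            if (a * d - b * c) % q != 0)
print(count)   # 2016 = (7^2 - 1)(7^2 - 7)
```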

Let's go back to our example. The factorization of this cardinality comes easily: $2016=(7^2-1)(7^2-7)=(7-1)(7+1)7(7-1)=6\cdot 8\cdot 7\cdot 6= 2^5\cdot 3^2\cdot 7$. Consequently, there are Sylow subgroups to find for three prime numbers: $2$, $3$ and $7$.

The case $p=7$ is the most classical one. One needs to find a group of order 7, and one such subgroup is given by the group of upper triangular matrices $\begin{pmatrix} 1 & * \\ 0 & 1\end{pmatrix}$. What makes things work is that $p$ is the characteristic of the chosen finite field. In general, if $q$ is a power of $p$, then the subgroup of upper-triangular matrices in $\mathop{\rm GL}(n,\mathbf F_q)$ with $1$s on the diagonal has cardinality $q\cdot q^2\cdots q^{n-1}=q^{n(n-1)/2}$, which is exactly the highest power of $p$ dividing the cardinality of $\mathop{\rm GL}(n,\mathbf F_q)$.

Let's now study $p=3$. We need to find a group $S$ of order $3^2=9$ inside $G$. There are a priori two possibilities, either $S\simeq (\mathbf Z/3\mathbf Z)^2$, or $S\simeq (\mathbf Z/9\mathbf Z)$.
We will find a group of the first sort, which will show that the second case doesn't happen, because all $3$-Sylow subgroups are pairwise conjugate, hence isomorphic.

Now, the multiplicative group $\mathbf F_7^\times$ is of order $6$, and is cyclic, hence contains a subgroup of order $3$, namely $C=\{1,2,4\}$. Consequently, the group of diagonal matrices with coefficients in $C$ is isomorphic to $(\mathbf Z/3\mathbf Z)^2$ and is our desired $3$-Sylow.

Another reason why $G$ does not contain a subgroup $S$ isomorphic to $\mathbf Z/9\mathbf Z$ is that it does not contain elements of order $9$. Let's argue by contradiction and consider a matrix $A\in G$ such that $A^9=I$; then its minimal polynomial $P$ divides $T^9-1$. Since $7\nmid 9$, the matrix $A$ is diagonalizable over the algebraic closure of $\mathbf F_7$. The eigenvalues of $A$ are $9$th roots of unity, and are quadratic over $\mathbf F_7$ since $\deg(P)\leq 2$. On the other hand, if $\alpha$ is a $9$th root of unity belonging to $\mathbf F_{49}$, one has $\alpha^9=\alpha^{48}=1$, hence $\alpha^3=1$ since $\gcd(9,48)=3$. Consequently, $\alpha$ is a cubic root of unity and $A^3=I$, showing that the order of $A$ divides $3$.

It remains to treat the case $p=2$, which I find slightly trickier. Let's try to find elements $A$ in $G$ whose order divides $2^5$. As above, it is diagonalizable in an algebraic closure, its minimal polynomial divides $T^{32}-1$, and its roots belong to $\mathbf F_{49}$, hence satisfy $\alpha^{32}=\alpha^{48}=1$, hence $\alpha^{16}=1$. Conversely, $\mathbf F_{49}^\times$ is cyclic of order $48$, hence contains an element of order $16$, and such an element is quadratic over $\mathbf F_7$, hence its minimal polynomial $P$ has degree $2$. The corresponding companion matrix $A$ in $G$ is an element of order $16$, generating a subgroup $S_1$ of $G$ isomorphic to $\mathbf Z/16\mathbf Z$. We also observe that $\alpha^8=-1$ (because its square is $1$); since $A^8$ is diagonalizable in an algebraic closure with $-1$ as the only eigenvalue, this shows $A^8=-I$.

Now, there exists a $2$-Sylow subgroup $S$ containing $S_1$, and $S_1$ will be a normal subgroup of $S$ (because its index is the smallest prime number dividing the order of $S$, which is $2$). This suggests introducing the normalizer $N$ of $S_1$ in $G$. One then has $S_1\subset S\subset N$. Let $s\in S$ be such that $s\not\in S_1$; then there exists a unique $k\in\{1,\dots,15\}$ such that $s^{-1}As=A^k$, and $s^{-2}As^2=A^{k^2}=A$ (because $s$ has order $2$ modulo $S_1$), hence $k^2\equiv 1\pmod{16}$—in other words, $k\equiv \pm1\pmod 8$.

There exists a natural choice of $s$: the involution ($s^2=I$) which exchanges the two eigenspaces of $A$. To finish the computation, it's useful to take a specific example of a polynomial $P$ of degree $2$ whose roots in $\mathbf F_{49}$ are primitive $16$th roots of unity. In other words, we need to factor the $16$th cyclotomic polynomial $\Phi_{16}=T^8+1$ over $\mathbf F_7$ and find a factor of degree $2$; actually, Galois theory shows that all factors have the same degree, so that there should be 4 factors of degree $2$. To explain the following computation, some remark is useful. Let $\alpha$ be a primitive $16$th root of unity in $\mathbf F_{49}$; we have $(\alpha^8)^2=1$ but $\alpha^8\neq 1$, hence $\alpha^8=-1$. If $P$ is the minimal polynomial of $\alpha$, the other root is $\alpha^7$, hence the constant term of $P$ is equal to $\alpha\cdot \alpha^7=\alpha^8=-1$.

We start from $T^8+1=(T^4+1)^2-2T^4$ and observe that $2\equiv 4^2\pmod 7$, so that $T^8+1=(T^4+1)^2-4^2T^4=(T^4+4T^2+1)(T^4-4T^2+1)$. To find the factors of degree $2$, we remember that their constant terms should be equal to $-1$. We thus go on differently, writing $T^4+4T^2+1=(T^2+aT-1)(T^2-aT-1)$ and solving for $a$: this gives $-2-a^2=4$, hence $a^2=-6=1$ and $a=\pm1$. The other factors are found similarly and we get
\[ T^8+1=(T^2-T-1)(T^2+T-1)(T^2-4T-1)(T^2+4T-1). \]
We thus choose the factor $T^2-T-1$ and set $A=\begin{pmatrix} 0 & 1 \\ 1 & 1 \end{pmatrix}$.

Two eigenvectors for $A$ are $v=\begin{pmatrix} 1 \\ \alpha \end{pmatrix}$ and $v'=\begin{pmatrix}1 \\ \alpha'\end{pmatrix}$, where $\alpha'=\alpha^7$ is the other root of $T^2-T-1$. The equations for $B$ are $Bv=v'$ and $Bv'=v$; this gives $B=\begin{pmatrix} 1 & 0 \\ 1 & - 1\end{pmatrix}$. The subgroup $S=\langle A,B\rangle$ generated by $A$ and $B$ has order $32$ and is a $2$-Sylow subgroup of $G$.
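As a sanity check (plain Python, with $2\times2$ matrices over $\mathbf F_7$ encoded as 4-tuples; this merely verifies the claims above, it doesn't replace the computation), one can confirm that $A$ has order $16$ with $A^8=-I$, that $B$ is an involution with $BAB^{-1}=A^7$ (so $k\equiv-1\pmod 8$, as predicted), and that $\langle A,B\rangle$ has $32$ elements.

```python
q = 7   # matrices [[a, b], [c, d]] over F_7 are encoded as tuples (a, b, c, d)

def mul(M, N):
    a, b, c, d = M; e, f, g, h = N
    return ((a*e + b*g) % q, (a*f + b*h) % q, (c*e + d*g) % q, (c*f + d*h) % q)

def power(M, k):
    R = (1, 0, 0, 1)
    for _ in range(k):
        R = mul(R, M)
    return R

A = (0, 1, 1, 1)        # companion matrix of T^2 - T - 1
B = (1, 0, 1, q - 1)    # the involution exchanging the two eigenspaces of A
I = (1, 0, 0, 1)

assert power(A, 8) == (q - 1, 0, 0, q - 1)     # A^8 = -I
assert power(A, 16) == I                       # A has order 16
assert mul(B, B) == I                          # B is an involution
assert mul(mul(B, A), B) == power(A, 7)        # B A B^{-1} = A^7

# the subgroup generated by A and B, computed by closure
S = {I}
while True:
    bigger = {mul(X, Y) for X in S for Y in (A, B)} | S
    if bigger == S:
        break
    S = bigger
print(len(S))   # 32, a 2-Sylow subgroup of GL(2, F_7)
```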

Generalizing this method involves finding large commutative $p$-subgroups (such as $S_1$) which belong to appropriate (possibly non-split) tori of $\mathop{\rm GL}(n)$ and combining them with adequate parts of their normalizer, which is close to considering Sylow subgroups of the symmetric group. The paper Sylow $p$-subgroups of the classical groups over finite fields with characteristic prime to $p$ by A.J. Weir gives the general description (as well as for orthogonal and symplectic groups), building on an earlier paper in which he constructed Sylow subgroups of symmetric groups. See also the paper Some remarks on Sylow subgroups of the general linear groups by C. R. Leedham-Green and W. Plesken which says a lot about maximal $p$-subgroups of the general linear group (over non-necessarily finite fields). Also, the question was recently the subject of interesting discussions on MathOverflow.

[Edited on Febr. 14 to correct the computation of the 2-Sylow...]