Sunday, April 14, 2024

Evaluating the operator norms of matrices

Let $E$ and $F$ be normed vector spaces, over the real or complex numbers, and let $u\colon E\to F$ be a linear map. The continuity of $u$ is proved to be equivalent to the existence of a real number $c$ such that $\|u(x)\|\leq c \|x\|$ for every $x\in E$, and the least such real number is called the operator norm of $u$; we denote it by $\|u\|$. It defines a norm on the linear space $\mathscr L(E;F)$ of continuous linear maps from $E$ to $F$ and as such is quite important. When $E=F$, it is also related to the spectrum of $u$ and is implicitly at the heart of criteria for the Gershgorin criterion for localization of eigenvalues.

$\gdef\R{\mathbf R}\gdef\norm#1{\lVert#1\rVert}\gdef\Abs#1{\left|#1\right|}\gdef\abs#1{\lvert#1\rvert}$

However, even in the simplest cases of matrices, its explicit computation is not trivial at all, and as we'll see even less trivial than what is told in algebra classes, as I learned by browsing Wikipedia when I wanted to prepare a class on the topic.

Since I'm more a kind of abstract guy, I will use both languages, of normed spaces and matrices, for the first one allows to explain a few things at a more fundamental level. I'll make the translation, though. Also, to be specific, I'll work with real vector spaces.

So $E=\R^m$, $F=\R^n$, and linear maps in $\mathscr L(E;F)$ are represented by $n\times m$ matrices. There are plentiful interesting norms on $E$, but I will concentrate the discussion on the $\ell^p$-norm given by $\norm{(x_1,\dots,x_m)} = (|x_1|^p+\dots+|x_m|^p)^{1/p}$. Similarly, I will consider the $\ell^q$-norm on $F$ given by $\norm{(y_1,\dots,y_m)} = (|y_1|^q+\dots+|y_n|^q)^{1/q}$. Here, $p$ and $q$ are elements of $[1;+\infty\mathclose[$. It is also interesting to allow $p=\infty$ or $q=\infty$; in this case, the expression defining the norm is just replaced by $\sup(|x_1|,\dots,|x_m|)$ and $\sup(|y_1|,\dots,|y_n|)$ respectively.


Whatever norm is given on $E$, the dual space $E^*=\mathscr L(E;\mathbf R)$ is endowed with the dual norm, which is just the operator norm of that space: for $\phi\in E^*$, $\norm\phi$ is the least real number such that $|\phi(x)|\leq \norm\phi \norm x$ for all $x\in E$. And similarly for $F$. To emphasize duality, we will write $\langle x,\phi\rangle$ instead of $\phi(x)$.

Example. — The dual norm of the $\ell^p$ norm can be computed explicitly, thanks to the Young inequality \[ \Abs{ x_1 y_1 + \dots + x_n y_n } \leq (|x_1|^p+\dots + |x_n|^p)^{1/p} (|x_1|^q+\dots+|x_n|^q)^{1/q}\] if $p,q$ are related by the relation $\frac1p+\frac1q=1$. (When $p=1$, this gives $q=\infty$, and symmetrically $p=\infty$ if $q=1$.) Moreover, this inequality is optimal in the sense that for any $(x_1,\dots,x_n)$, one may find a nonzero $(y_1,\dots,y_n)$ for which the inequality is an equality. What this inequality says about norms/dual norms is that if one identifies $\R^n$ with its dual, via the duality bracket $\langle x,y\rangle=x_1y_1+\dots+x_n y_n$, the dual of the $\ell^p$-norm is the $\ell^q$-norm, for that relation $1/p+1/q=1$.

If $u\colon E\to F$ is a continuous linear map, it has an adjoint (or transpose) $u^*\colon F^*\to E^*$, which is defined by $u^*(\phi)= \phi\circ u$, for $\phi\in F^*$. In terms of the duality bracket, this rewrites as \[ \langle \phi, u(x)\rangle = \langle u^*(\phi),x\rangle\] for $x\in E$ and $\phi\in F^*$.

Proposition.One has $\norm{u^*}=\norm u$.

For $\phi\in F^*$, $\norm{u^*(\phi)}$ is the least real number such that $|u^*(\phi)(x)|\leq \norm{u^*(\phi)} \norm x$ for all $x\in E$. Since one has \[ |u^*(\phi)(x)|= |\langle u^*(\phi),x\rangle|=|\langle\phi, u(x)\rangle\leq \norm\phi \norm{u(x)} \leq \norm\phi \norm u\norm x, \] we see that $\norm {u^*(\phi)}\leq \norm\phi\norm u$ for all $\phi$. As a consequence, $\norm{u^*}\leq \norm u$.
To get the other inequality, we wish to find a nonzero $\phi$ such that $\norm{u^*(\phi)}=\norm{u}\norm\phi$. This $\phi$ should thus be such that there exists $x$ such that $|\langle u^*(\phi),x\rangle|=\norm u\norm\phi\norm x$ which, by the preceding computation means that $|\langle\phi, u(x)\rangle=\norm\phi\norm u\norm x$. Such $\phi$ and $x$ must not exist in general, but we can find reasonable approximations. Start with a nonzero $x\in E$ such that $\norm{u(x)}$ is close to $\norm u\norm x$; then using the Hahn-Banach theorem, find a nonzero $\phi\in F^*$ such that $\norm\phi=1$ and $|\phi(u(x))|=\norm {u(x)}$. We see that $\langle\phi, u(x)\rangle$ is close to $\norm u\norm\phi\norm x$, and this concludes the proof.
In some cases, in particular in the finite dimension case, we can use biduality to get the other inequality. Indeed $E^{**}$ identifies with $E$, with its initial norm, and $u^{**}$ identifies with $u$. By the first case, we thus have $\norm{u^{**}}\leq \norm {u^*}$, hence $\norm u\leq\norm{u^*}$.

The case $p=1$

We compute $\norm{u}$ when $E=\mathbf R^m$ is endowed with the $\ell^1$-norm, and $F$ is arbitrary. The linear map $u\colon E\to F$ thus corresponds with $m$ vectors $u_1,\dots,u_m$ of $F$, and one has \[ u((x_1,\dots,x_m))=x_1 u_1+\dots+x_m u_m. \] By the triangular inequality, we have \[ \norm{u((x_1,\dots,x_m))} \leq |x_1| \norm{u_1}+\dots+\abs{x_m}\norm{u_m} \] hence \[ \norm{u((x_1,\dots,x_m))} \leq (\abs{x_1} +\dots+\abs{x_m}) \sup(\norm{u_1},\dots,\norm{u_m}). \] Consequently, \[ \norm{u} \leq \sup(\norm{u_1},\dots,\norm{u_m}). \] On the other hand, taking $x=(x_1,\dots,x_m)$ of the form $(0,\dots,1,0,\dots)$, where the $1$ is at place $k$ such that $\norm{u_k}$ is largest, we have $\norm{x}=1$ and $\norm{u(x)}=\norm{u_k}$. The preceding inequality is thus an equality.

In the matrix case, this shows that the $(\ell^1,\ell^q)$-norm of a $n\times m$ matrix $A$ is the supremum of the $\ell^q$-norms of the columns of $A$.

The case $q=\infty$

We compute $\norm{u}$ when $F=\mathbf R^n$ is endowed with the $\ell^\infty$-norm, and $E$ is arbitrary. A direct computation is possible in the matrix case, but it is not really illuminating, and I find it better to argue geometrically, using a duality argument.

Namely, we can use $u^*\colon F^*\to E^*$ to compute $\norm{u}$, since $\norm u=\norm{u^*}$. We have seen above that $F^*$ is $\mathbf R^n$, endowed with the $\ell^1$-norm, so that we have computed $\norm{u^*}$ in the preceding section. The basis $(e_1,\dots,e_n)$ of $F$ gives a dual basis $(\phi_1,\dots,\phi_n)$, and one has \[ \norm{u}=\norm{u^*} = \sup (\norm{u^*(\phi_1)},\dots,\norm{u^*(\phi_n)}). \]

In the matrix case, this shows that the $(\ell^p,\ell^\infty)$-norm of a $n\times m$ matrix $A$ is the supremum of the $\ell^p$-norms of the lines of $A$.

Relation with the Gershgorin circle theorem

I mentioned the Gershgorin circle theorem as being in the same spirit as the computation of an operator norm, because its proof relies on the same kind of estimations. In fact, no additional computation is necessary!

Theorem (Gershgorin “circles theorem”). — Let $A=(a_{ij})$ be an $n\times n$ matrix and let $\lambda$ be an eigenvalue of $A$. There exists an integer $i$ such that \[ \abs{\lambda-a_{ii}}\leq \sum_{j\neq i} \abs{a_{ij}}. \]

For the proof, one writes $A=D+N$ where $D$ is diagonal has zeroes on its diagonal, and writes $\lambda x=Ax=Dx+Nx$, hence $(\lambda I-D)x=Nx$. Endow $\R^n$ with the $\ell^\infty$-norm. We can assume that $\norm x=1$. Then the norm of the right hand side is bounded above by $\norm N$, while the norm of the left hand side is $\sup(\abs{\lambda-a_{ii}} |x_i|)\geq |\lambda-a_{ii}|$ if $i$ is chosen so that $|x_i|=\norm x=1$. Given the above formula for $\norm N$, this implies the theorem.

The case $p=q=2$

Since Euclidean spaces are very useful in applications, this may be a very important case to consider, and we will see that the answer is not at all straightforward from the coefficients of the matrix.

We have to bound from above $\norm{u(x)}$. Using the scalar product, we write \[ \norm{u(x)}^2 = \langle u(x),u(x)\rangle = \langle u^*u(x),x\rangle, \] where $u^*\colon F\to E$ now denotes the adjoint of $u$, which identifies with the transpose of $u$ if one identifies $E$ with $E^*$ and $F$ with $F^*$ by means of their scalar products. Using the Cauchy-Schwarz formula, we get that $\norm{u(x)}^2\leq \norm{u^*u(x)}\norm x\leq \norm{u^*u} \norm x^2$, hence $\norm{u} \leq \norm{u^*u}^{1/2}$. This inequality is remarkable because on the other hand, we have $\norm{u^*u}\leq \norm{u^*}\norm{u}=\norm{u}^2$. Consequently, $\norm{u}=\norm{u^*u}^{1/2}$.

This formula might not appear to be so useful, since it reduces the computation of the operator norm of $u$ to that of $u^*u$. However, the linear map $u^*u$ is a positive self-adjoint endomorphism of $E$ hence, (assuming that $E$ is finite dimensional here), it can be diagonalized in a orthonormal basis. We then see that $\norm{u^*u}$ is the largest eigenvalue of $u^*u$, which is also its spectral radius. The square roots of the eigenvalues of $u^*u$ are also called the singular values of $u$, hence $\norm u$ is the largest singular value of $u$.

One can play with duality as well, and we have $\norm{u}=\norm{uu^*}^{1/2}$.

Other cases?

There are general inequalities relating the various $\ell^p$-norms of a vector $x\in\R^m$, and these can be used to deduce inequalities for $\norm u$, when $E=\R^m$ has an $\ell^p$-norm and $F=\R^n$ has an $\ell^q$-norm. However, given the explicit value of $\norm u$ for $(p,q)=(2,2)$ and the fact that no closed form expression exists for the spectral radius, it is unlikely that there is a closed form expression in the remaining cases.

Worse: the exact computation of $\norm u$ in the cases $(\infty,1)$, $(\infty,2)$ or $(2,1)$ is known to be computationally NP-complete, and I try to explain this result below, following J. Rohn (2000) (“Computing the Norm ∥ A ∥∞,1 Is NP-Hard”, Linear and Multilinear Algebra 47 (3), p. 195‑204). I concentrate on the $(\infty, 1)$ case ; the $(\infty,2)$ case is supposed to be analogous (see Joel Tropp's thesis, top of page 48, quoted by Wikipedia, but no arguments are given), and the case $(2,1)$ would follow by symmetry.

A matrix from a graph

Let us consider a finite (undirected, simple, without loops) graph $G$ on the set $V=\{1,\dots,n\}$ of $n$ vertices, with set of edges $E$, and let us introduce the following $n\times n$ matrix $A=(a_{ij})$, a variant of the incidence matrix of the graph $G$ (actually $nI-E$, where $I$ is the identity matrix and $E$ is the incidence matrix of $G$):

  • One has $a_{ii}=n$ for all $i$;
  • If $i\neq j$ and vertices $i$ and $j$ are connected by an edge, then $a_{ij}=-1$;
  • Otherwise, $a_{ij}=0$.
For any subset $S$ of $V$, the cut $c(S)$ of $S$ is the number of edges which have one endpoint in $S$ and the other outside of $S$.

Proposition.The $(\ell^\infty,\ell^1)$-norm of $A$ is given by \[ 4 \sup_{S\subseteq V} c(S) - 2 \operatorname{Card}(E) + n^2. \]

The proof starts with the following observation, valid for more general matrices.

Lemma. — The $(\ell^\infty,\ell^1)$-norm of a symmetric positive $n\times n$ matrix $A$ is given by $\norm A = \sup_z \langle z, Az \rangle$ where $z$ runs among the set $Z$ of vectors in $\R^n$ with coordinates $\pm1$.

The vectors of $Z$ are the vertices of the polytope $[-1;1]^n$, which is the unit ball of $\R^n$ for the $\ell^\infty$-norm. Consequently, every vector of $[-1;1]^n$ is a convex combination of vectors of $Z$. Writing $x=\sum_{z\in Z} c_z z$, we have \[\norm {Ax} = \norm{\sum c_z Az} \leq \sum c_z \norm {Az}= \sup_{z\in Z} \norm{Az}. \] The other inequality being obvious, we already see that $\norm A=\sup_{z\in Z}\norm{Az}$. Note that this formula holds for any norm on the codomain.
If, for $z\in Z$, one writes $Az=(y_1,\dots,y_n)$, one has $\norm{Az}=|y_1|+\dots+|y_n|$, because the codomain is endowed with the $\ell^1$-norm, so that $\langle z, Az\rangle = \sum z_i y_i\leq \norm{Az}$. We thus the inequality $\sup_{z\in Z} \langle z,Az\rangle \leq \norm A$.
Let us now use the fact that $A$ is symmetric and positive. Fix $z\in Z$, set $Az=(y_1,\dots,y_n)$ as above, and define $x\in Z$ by $x_i=1$ if $y_i\geq0$ and $x_i=-1$ otherwise. One thus has $\langle x, Az\rangle=\sum |y_i|=\norm{Az}$. Since $A$ is symmetric and positive, one has $\langle x-z, A(x-z)\rangle\geq0$, and this implies \[2\norm{Az}= 2\langle x, Az\rangle \leq \langle x, Ax\rangle+\langle z, Az\rangle, \] so that, $\norm{Az}\leq \sup_{x\in Z} \langle x, Ax\rangle$. This concludes the proof.

To prove the theorem, we will apply the preceding lemma. We observe that $A$ is symmetric, by construction. It is also positive, since for every $x\in\R^n$, one has \[\langle x,Ax\rangle=\sum a_{ij}x_ix_j \geq n \sum x_i^2 -\sum_{i\neq j} x_i x_j = (n+1)\sum x_i^2- (\sum x_i)^2\geq \sum x_i^2 \] using the Cauchy-Schwarz inequality $(\sum x_i)^2\leq n\sum x_i^2$. By the preceding lemma, we thus have \[ \norm A = \sup_{z\in\{\pm1\}^n} \langle z, Az\rangle. \] The $2^n$ vectors $z\in Z$ are in bijection with the $2^n$ subsets of $V=\{1,\dots,n\}$, by associating with $z\in Z$ the subset $S$ of $V$ consisting of vertices $i$ such that $z_i=1$. Then, one can compute \[ \langle z, Az\rangle = \sum_{i,j} a_{ij} z_iz_j = 4c(S) - 2\operatorname{Card}(E) + n^2. \] It follows that $\norm A $ is equal to the indicated expression.

The last step of the proof is an application of the “simple max-cut” NP-hardness theorem of Garey, Johnson and Stockmeyer (1976), itself a strenghtening of Karp (1973)'s seminal result that “max-cut” is NP-complete. I won't explain the proofs of these results here, but let me explain what they mean and how they relate to the present discussion. First of all, computer scientists categorize problems according to the time that is required to solve them, in terms of the size of the entries. This notion depends on the actual computer that is used, but the theory of Turing machines allows to single out two classes, P and EXP, consisting of problems which can be solved in polynomial, respectively exponential, time in term of the size of the entries. A second notion, introduced by Karp, is that of NP problems, problems which can be solved in polynomial time by a “non deterministic Turing machine” — “nondeterministic” means the computer can parallelize itself at will when it needs to consider various possibilities. This class belongs to EXP (because one can simulate in exponential time a polynomial time nondeterministic algorithm) and also corresponds to the class of problems whose solution can be checked in polynomial time.

Our problem is to find a subset $S$ of $\{1,\dots,n\}$ that maximizes $c(S)$. This is a restriction of the “general max-cut” problem where, given an integer valued function $w$ on the set of edges, on wishes to find subset that maximizes $c(S;w)$, the sum of the weights of the edges which have one endpoint in $S$ and the other outside of $S$. Karp (1973) observed that the existence of $S$ such that $c(S;w)\geq m$ is an NP problem (if one is provided $S$, it takes polynomial time to compute $c(S;w)$ and to decide that it is at least $m$), and the naïve search algorithm is in EXP, since there are $2^n$ such subsets. Moreover, Karp proves that any NP problem can be reduced to it in polynomial time. This is what is meant by the assertion that it is NP-complete. Consequently, determining $\sup_S c(S;w)$ is NP-hard: if you can solve that problem, then you can solve the “max-cut” problem in polynomial time, hence any other NP-problem. A subsequent theorem by Garey, Johnson and Stockmeyer (1976) established that restricting the max-cut problems to $\pm1$ weights is still NP-hard, and this completes the proof of Rohn's theorem.

(Aside, to insist that signs matter: a theorem of Edmonds and Karp (1972), one can solve the “min-cup” problem in polynomial time, which consists in deciding, for some given integer $m$, whether there exist $S$ such that $c(S;w)\leq m$.)

Saturday, April 13, 2024

The topology on the ring of polynomials and the continuity of the evaluation map

Polynomials are an algebraic gadget, and one is rarely led to think about the topology a ring of polynomials should carry. That happened to me, though, more or less by accident, when María Inés de Frutos Fernández and I worked on implementing in Lean the evaluation of power series. So let's start with them. To simplify the discussion, I only consider the case of one inderminate. When there are finitely many of them, the situation is the same; in the case of infinitely many indeterminates, there might be some additional subtleties, but I have not thought about it.

$\gdef\lbra{[\![}\gdef\rbra{]\!]} \gdef\lpar{(\!(}\gdef\rpar{)\!)} \gdef\bN{\mathbf N} \gdef\coeff{\operatorname{coeff}} \gdef\eval{\operatorname{eval}} \gdef\colim{\operatorname{colim}}$

Power series

A power series over a ring $R$ is just an expression $\sum a_nT^n$, where $(a_0,a_1, \dots)$ is a family of elements of $R$ indexed by the integers. After all, this is just what is meant by “formal series”: coefficients and nothing else.

Defining a topology on the ring \(R\lbra T\rbra\) should allow to say what it means for a sequence $(f_m)$ of power series to converge to a power series $f$, and the most natural thing to require is that for every $n$, the coefficient $a_{m,n}$ of \(T^n\) in $f_m$ converges to the corresponding coeffient $a_m$ of $T^n$ in \(f\). In other words, we endow \(R\lbra T\rbra \) with the product topology when it is identified with the product set \(R^{\bN}\). The explicit definition may look complicated, but the important point for us is the following characterization of this topology: Let \(X\) be a topological space and let \(f\colon X \to R\lbra T\rbra\) be a map; for \(f\) to be continuous, it is necessary and sufficient that all maps \(f_n\colon X \to R\) are continuous, where, for any \(x\in X\), \(f_n(x)\) is the \(n\)th coefficient of \(f(x)\). In particular, the coeffient maps \(R\lbra T\rbra\to R\) are continuous.

What can we do with that topology, then? The first thing, maybe, is to observe its adequacy wrt the ring structure on \(R\lbra T\rbra\).

Proposition.If addition and multiplication on \(R\) are continuous, then addition and multiplication on \(R\lbra T\rbra\) are continuous.

Let's start with addition. We need to prove that \(s\colon R\lbra T\rbra \times R\lbra T\rbra\to R\lbra T\rbra\) is continuous. By the characterization, it is enough to prove that all coordinate functions \(s_n\colon R\lbra T\rbra \times R\lbra T\rbra\to R\), \( (f,g)\mapsto \coeff_n(f+g) \), are continuous. But these functions factor through the \(n\)th coefficient maps: \(\coeff_n(f+g) = \coeff_n(f)+\coeff_n(g)\), which is continuous, since addition, coefficients and projections are continuous. This is similar, but slightly more complicated for multiplication: if the multiplication map is denoted by \(m\), we have to prove that the maps \(m_n\) defined by $m_n(f,g)=\coeff_n(f\cdot g)$ are continuous. However, they can be written as \[ m_n(f,g)=\coeff_n(f\cdot g) = \sum_{p=0}^n \coeff_p(f)\coeff_{n-p}(g). \] Since the projections and the coefficient maps are continuous, it is sufficient to prove that the maps from \(R^{n+1} \times R^{n+1}\) to \(R\) given by \[((a_0,\dots,a_n),(b_0,\dots,b_n))\mapsto \sum_{p=0}^n a_p b_{n-p} \] are continuous, and this follows from continuity and commutativity of addition on \(R\), because it is a polynomial expression.


At this point, let's go back to our initial question of endowing polynomials with a natural topology.

An obvious candidate is the induced topology. This looks correct; in any case, it is such that addition and multiplication on \(R[T]\) are continuous. However, it lacks an interesting property with respect to evaluation.

Recall that for every \(a\in R\), there is an evaluation map \(\eval_a\colon R[T]\to R\), defined by \(f\mapsto f(a)\), and even, if one wishes, the two-variable evaluation map \(R[T]\times R\to R\).
The first claim is that this map is not continuous.

An example will serve of proof. I take \(R\) to be the real numbers, \(f_n=T^n\) and \(a=1\). Then \(f_n\) converges to zero, because for each integer \(m\), the real numbers \(\coeff_m(f_n)\) are zero for \(n>m\). On the other hand, \(f_n(a)=f_n(1)=1\) for all \(n\), and this does not converge to zero!

So we have to change the topology on polynomials if we want that this map be continuous, and we now give the correct definition. The ring of polynomials is the increasing union of subsets \(R[T]_n\), indexed by integers \(n\), consisting of all polynomials of degree less than \(n\). Each of these subsets is given the product topology, as above, but we endow their union with the “inductive limit” topology. Explicitly, if \(Y\) is a topological space and \(u\colon R[T]\to Y\) is a map, then \(u\) is continuous if and only if, for each integer \(n\), its restriction to \(R[T]_n\) is continuous.

The inclusion map \(R[T]\to R\lbra T\rbra\) is continuous, hence the topology on polynomials is finer than the topology induced by the topology on power series. As the following property indicates, it is usually strictly finer.

We can also observe that addition and multiplication on \(R[T]\) are still continuous. The same proof as above works, once we observe that the coefficient maps are continuous. (On the other hand, one may be tempted to compare the product topology of the inductive topologies, with the inductive topology of the product topologies, a thing which is not obvious in the direction that we need.)

Proposition.Assume that addition and multiplication on \(R\) are continuous. Then the evaluation maps \(\eval_a \colon R[T]\to R\) are continuous.

We have We have to prove that for every integer \(n\), the evaluation map \(\eval_a\) induced a continuous map from \(R[T]_n\) to \(R\). Now, this map factors as a projection map \(R[T]\to R^{n+1}\) composed with a polynomial map \((c_0,\dots,c_n)\mapsto c_0+c_1a+\dots+c_n a^n\). It is therefore continuous.

Laurent series

We can upgrade the preceding discussion and define a natural topology on the ring \(R\lpar T\rpar\) of Laurent series, which are the power series with possibly negative exponents. For this, for all integers \(d\), we set \(R\lpar T\rpar_d\) to be the set of power series of the form \( f=\sum_{n=-d}^\infty c_n T^n\), we endow that set with the product topology, and take the corresponding inductive limit topology. We leave to the reader to check that this is a ring topology, but that the naïve product topology on \(R\lpar T\rpar\) wouldn't be in general.

Back to the continuity of evaluation

The continuity of the evaluation maps $f\mapsto f(a)$ were an important guide to the topology of the ring of polynomials. This suggests a more general question, for which I don't have a full answer, whether the two-variable evaluation map, \((f,a)\mapsto f(a)\), is continuous. On each subspace $R[T]_d\times R$, the evaluation map is given by a polynomial map ($(c_0,\dots,c_d,a)\mapsto c_0 +c_1a+\dots+c_d a^d$), hence is continuous, but that does not imply the desired continuity, because that only tells us about $R[T]\times R$ with the topology $\colim_d (R[T]_d\times R)$, while we are interested in the topology $(\colim_d R[T]_d)\times R$. To compare these topologies, note that the natural bijection $\colim_d (R[T]_d\times R) \to (\colim_d R[T]_d)\times R$ is continuous (because it is continuous at each level $d$), but the continuity of its inverse is not so clear.

I find it amusing, then, to observe that sequential continuity holds in the important case where $R$ is a field. This relies on the following proposition.

Proposition.Assume that $R$ is a field. Then, for every converging sequence $(f_n)$ in $R[T]$, the degrees $\deg(f_n)$ are bounded.

Otherwise, we can assume that $(f_n)$ converges to $0$ and that $\deg(f_{n+1})>\deg(f_n)$ for all $n$. We construct a continuous linear form $\phi$ on $R[T]$ such that $\phi(f_n)$ does not converge to $0$. This linear form is given by a formal power series $\phi(f)=\sum a_d c_d$ for $f=\sum c_dT^d$, and we choose the coefficients $(a_n)$ by induction so that $\phi(f_n)=1$ for all $n$. Indeed, if the coefficients are chosen up to $\deg(f_n)$, then we fix $a_d=0$ for $\deg(f_n)<d<\deg(f_{n+1})$ and choose $a_{\deg(f_{n+1})}$ so that $\phi(f_{n+1})=1$. This linear form is continuous because its restriction to any $R[T]_d$ is given by a polynomial, hence is continuous.

Corollary. — If $R$ is a topological ring which is a field, then the evaluation map $R[T]\times R\to R$ is sequentially continuous.

Consider sequences $(f_n)$ in $R[T]$ and $(a_n)$ in $R$ that converge to $f$ and $a$ respectively. By the proposition, there is an integer $d$ such that $\deg(f_n)\leq d$ for all $n$, and $\deg(f)\leq d$. Since evaluation is continuous on $R[T]_d\times R$, one has $f_n(a_n)\to f(a)$, as claimed.

Remark. — The previous proposition does not hold on rings. In fact, if $R=\mathbf Z_p$ is the ring of $p$-adic integers, then $\phi(p^nT^n)=p^n \phi(T^n)$ converges to $0$ for every continuous linear form $\phi$ on $R[T]$. More is true since in that case, evaluation is continuous! The point is that in $\mathbf Z_p$, the ideals $(p^n)$ form a basis of neighborhoods of the origin.

Proposition. — If the topology of $R$ is linear, namely the origin of $R$ has a basis of neighborhoods consisting of ideals, then the evaluation map $R[T]\times R\to R$ is continuous.

By translation, one reduces to showing continuity at $(0,0)$. Let $V$ be a neighborhood of $0$ in $R$ and let $I$ be an ideal of $R$ such that $I\subset V$. Since it is an subgroup of the additive group of $R$, the ideal $I$ is open. Then the set $I\cdot R[T]$ is open because for every $d$, its trace on $R[T]_d$, is equal to $I\cdot R[T]_d$, hence is open. Then, for $f\in I\cdot R[T]$ and $a\in R$, one has $f(a)\in I$, hence $f(a)\in V$.

Here is one case where I can prove that evaluation is continuous.

Proposition.If the topology of $R$ is given by a family of absolute values, then the evaluation map $(f,a)\mapsto f(a)$ is continuous.

I just treat the case where the topology of $R$ is given by one absolute value. By translation and linearity, it suffices to prove continuity at $(0,0)$. Consider the norm $\|\cdot\|_1$ on $R[T]$ defined by $\|f\|_1=\sum |c_n|$ if $f=\sum c_nT^n$. By the triangular inequality, one has $|f(a)|\leq \|f\|_1 $ for any $a\in R$ such that $|a|\leq 1$. For every $r>0$, the set $V_r$ of polynomials $f\in R[T]$ such that $\|f\|_1<r$ is an open neighborhood of the origin since, for every integer $d$, its intersection with $R[T]_d$ is an open neighborhood of the origin in $R[T]_d$. Let also $W$ be the set of $a\in R$ such that $|a|\leq 1$. Then $V_r\times W$ is a neighborhood of $(0,0)$ in $R[T]\times R$ such that $|f(a)|<r$ for every $(f,a)\in V_r\times W$. This implies the desired continuity.

Wednesday, April 10, 2024

Flatness and projectivity: when is the localization of a ring a projective module?

Projective modules and flat modules are two important concepts in algebra, because they characterize those modules for which a general functorial construction (Hom module and tensor product, respectively) behave better than what is the case for general modules.

This blog post came out of reading a confusion on a student's exam: projective modules are flat, but not all flat modules are projective. Since localization gives flat modules, it is easy to obtain a an example of a flat module which is not projective (see below, $\mathbf Q$ works, as a $\mathbf Z$-module), but my question was to understand when the localization of a commutative ring is a projective module.


Let me first recall the definitions. Let $R$ be a ring and let $M$ be a (right)$R$-module.

The $\Hom_R(M,\bullet)$-functor associates with a right $R$-module $X$ the abelian group $\Hom_R(M,X)$. By composition, any linear map $f\colon X\to Y$ induces an additive map $\Hom_R(M,f)\colon \Hom_R(M,X)\to \Hom_R(M,X)$: it maps $u\colon M\to X$ to $\phi\circ u$. When $R$ is commutative, these are even $R$-modules and morphisms of $R$-modules. If $f$ is injective, $\Hom_R(M,f)$ is injective as well, but if $f$ is surjective, it is not always the case that $\Hom_R(M,f)$ is surjective, and one says that the $R$-module $M$ is projective if $\Hom_R(M,f)$ is surjective for all surjective linear maps $f$.

The $\otimes_R$-functor associates with a left $R$-module $X$ the abelian group $M\otimes_R X$, and with any linear map $f\colon X\to Y$, the additive map $M\otimes_R X\to M\otimes_R Y$ that maps a split tensor $m\otimes x$ to $m\otimes f(x)$. When $R$ is commutative, these are even $R$-modules and morphisms of $R$-modules. If $f$ is surjective, then $M\otimes_R f$ is surjective, but if $f$ is injective, it is not always the case that $M\otimes_R f$ is injective. One says that $M$ is flat if $M\otimes_R f$ is injective for all injective linear maps $f$.

These notions are quite abstract, and the development of homological algebra made them prevalent in modern algebra.

Example. — Free modules are projective and flat.

Proposition. — An $R$-module $M$ is projective if and only if there exists an $R$-module $N$ such that $M\oplus N$ is free.
Indeed, taking a generating family of $M$, we construct a free module $L$ and a surjective linear map $u\colon L\to M$. Since $M$ is projective, the map $\Hom_R(M,u)$ is surjective and there exists $v\colon M\to L$ such that $u\circ v=\id_M$. Then $v$ is an isomorphism from $M$ to $u(M)$, and one can check that $L=u(M)\oplus \ker(v)$.

Corollary. — Projective modules are flat.

Theorem (Kaplansky). — If $R$ is a local ring, then a projective $R$-module is free.

The theorem has a reasonably easy proof for a finitely generated $R$-module $M$ over a commutative local ring. Let $J$ be the maximal ideal of $R$ and let $k=R/J$ be the residue field. Then $M/JM$ is a finite dimensional $k$-vector space; let us consider a family $(e_1,\dots,e_n)$ in $M$ whose images form a basis of $M/JM$. Now, one has $\langle e_1,\dots,e_n\rangle + J M = M$, hence Nakayama's lemma implies that $M=\langle e_1,\dots,e_n\rangle$. Let then $u\colon R^n\to M$ be the morphism given by $u(a_1,\dots,a_n)=\sum a_i e_i$; by what precedes, it is surjective, and we let $N$ be its kernel. Since $M$ is projective, the morphism $\Hom_R(M,u)$ is surjective, and there exists $v\colon M\to R^n$ such that $u\circ v=\id_M$. We then have an isomorphism $M\oplus N\simeq R^n$, where $N=\ker(v)$. Moding out by $J$, we get $M/JM \oplus N/JN \simeq k^n$. Necessarily, $N/JN=0$, hence $N=JN$; since $N$ is a direct summand of $R^n$, it is finitely generated, and Nakayama's lemma implies that $N=0$.

Example. — Let $R$ be a commutative ring and let $S$ be a multiplicative subset of $R$. Then the fraction ring $S^{-1}R$ is a flat $R$-module.
Let $u\colon X\to Y$ be an injective morphism of $R$-modules. First of all, one identifies the morphism $S^{-1}R\otimes_R u\colon S^{-1}R\otimes_R X\to S^{-1}R\otimes_R Y$ to the morphism $S^{-1}u\colon S^{-1}X\to S^{-1}Y$ induced by $u$ on fraction modules. Then, it is easy to see that $S^{-1}u$ is injective. Let indeed $x/s\in S^{-1}X$ be an element that maps to $0$; one then has $u(x)/s=0$, hence there exists $t\in S$ such that $tu(x)=0$. Consequently, $u(tx)=0$, hence $tx=0$ because $u$ is injective. This implies $x/s=0$.

Theorem.Let $R$ be a commutative ring. If $M$ is a finitely presented $R$-module, then $M$ is locally free: there exists a finite family $(f_1,\dots,f_n)$ in $R$ such that $R=\langle f_1,\dots,f_n\rangle$ and such that for every $i$, $M_{f_i}$ is a free $R_{f_i}$-module.
The proof is a variant of the case of local rings. Starting from a point $p\in\Spec(R)$, we know that $M_p$ is a finitely presented flat $R_p$-module. As above, we get a surjective morphism $u\colon R^n\to M$ which induces an isomorphism $\kappa(p)^n\to \kappa(p)\otimes M$, and we let $N$ be its kernel. By flatness of $M$ (and an argument involving the snake lemma), the exact sequence $0\to N\to R_p\to M\to 0$ induces an exact sequence $0\to \kappa(p)\otimes N\to \kappa(p)^n\to \kappa(p)\otimes M\to 0$. And since the last sequence is an isomorphism, we have $\kappa(p)\otimes N$. Since $M$ is finitely presented, the module $N$ is finitely generated, and Nakayama's lemma implies that $N_p=0$; moreover, there exists $f\not\in p$ such that $N_f=0$, so that $u_f\colon R_f^n\to M_f$ is an isomorphism. One concludes by using the quasicompactness of $\Spec(R)$.

However, not all flat modules are projective. The most basic example is the following one.

Example.The $\mathbf Z$-module $\mathbf Q$ is flat, but is not projective.
It is flat because it is the total fraction ring of $\mathbf Z$. To show that it is not projective, we consider the free module $L={\mathbf Z}^{(\mathbf N)}$ with basis $(e_n)$ and the morphism $u\colon L\to\mathbf Q$ that maps $e_n$ to $1/n$ (if $n>0$, say). This morphism is surjective. If $\mathbf Q$ were projective, there would exist a morphism $v\colon \mathbf Q\to L$ such that $u\circ v=\id_{\mathbf Q}$. Consider a fraction $a/b\in\mathbf Q$; one has $b\cdot 1/b=1$, hence $b v(1/b)=v(1)$. We thus see that all coeffiencients of $v(1)$ are divisible by $b$, for any integer $b$; they must be zero, hence $v(1)=0$ and $1=u(v(1))=0$, a contradiction.
The proof generalizes. For example, if $R$ is a domain and $S$ does not consist of units, and does not contain $0$, then $S^{-1}R$ is not projective. (With analogous notation, take a nonzero coefficient $a$ of $v(1)$ and set $b=as$, where $s\in S$ is not $0$; then $as$ divides $a$, hence $s$ divides $1$ and $s$ is a unit.)

These recollections are meant to motivate the forthcoming question: When is it the case that a localization $S^{-1}R$ is a projective $R$-module?

Example. — Let $e$ be an idempotent of $R$, so that the ring $R$ decomposes as a product ot two rings $R\simeq eR \times (1-e)R$, and both factors are projective submodules of $R$ since their direct sum is the free $R$-module $R$. Now, one can observe that $R_e= eR$. Consequently, $R_e$ is projective. Geometrically, $\Spec(R)$ decomposes as a disjoint union of two closed subsets $\mathrm V(e)$ and $\mathrm V(1-e)$; the first one can be viewed as the open subset $\Spec(R_{1-e})$ and the second one as the open subset $\Spec(R_e)$.

The question was to decide whether this geometric condition furnishes the basic conditions for a localization $S^{-1}R$ to be projective. With the above notation, we recall that $\Spec(S^{-1}R)$ is homeomorphic to a the subset of $\Spec(R)$ consisting of prime ideals $p$ such that $p\cap S=\emptyset$. The preceding example corresponds to the case where $\Spec(S^{-1}R)$ is open and closed in $\Spec(R)$. In this case, we view $S^{-1}R$ as a quasicoherent sheaf on $\Spec(R)$, it is free of rank one on the open subset $\Spec(S^{-1}R)$, and zero on the complementary open subset. It is therefore locally free, hence the $R$-module $S^{-1}R$ is projective.

Observation.The set $\Spec(S^{-1}R)$ is stable under generization. If $S^{-1}R$ is a projective $R$-module, then it is open.
The first part is obvious: if $p$ and $q$ are prime ideals of $R$ such that $p\subseteq q$ and $q\cap S=\emptyset$, then $p\cap S=\emptyset$. The second part follows from the observation that the support of $S^{-1}R$ is exactly $\Spec(S^{-1}R)$, combined with the following proposition.

Proposition. — The support of a projective module is open.
I learnt this result in the paper by Vasconcelos (1969), “On Projective Modules of Finite Rank” (Proceedings of the American Mathematical Society 22 (2): 430‑33). The proof relies on the trace ideal $\tau_R(M)$ of a module: this is the image of the canonical morphism $t\colon M^\vee \otimes_R M\to R$. (It is called the trace ideal, because when $M$ is free, $M^\vee\otimes_R M$ can also be identified with the module of endomorphisms of finite rank of $M$, a split tensor $\phi\otimes m$ corresponds with the endomorhism $x\mapsto \phi(x)m$, and then $t(\phi \otimes m)=\phi(m)$ is its trace.) Now, if $p$ belongs to the support of $M$, then $\tau_R(M)_p=R_p$, while if $p$ does not belong to the support of $M$, one has $M_p=0$, hence $\tau_R(M)_p=0$. In other words, the support of $M$ is the complement of the closed locus $\mathrm V(\tau_R(M))$ of $\Spec(R)$.

On the other hand, one should remember the following basic property of the support of a module.

Proposition. — The support of a module is stable under specialization. The support of a finitely generated module is closed.
Indeed, for every $m\in M$ and $p\in \Spec(R)$, saying that $m=0$ in $M_p$ means that there exist $s\in R$ such that $s\notin p$ with $sm=0$. In other words, this set is $\mathrm V(\mathrm{ann}_R(m))$. This shows that the support of $M$ is the union of the closed subsets $\mathrm V(\mathrm{ann}_R(m))$; it is in particular stable under specialization. If $M$ is finitely generated, this also shows its support is $\mathrm V(\mathrm{ann}_R(M))$, hence is closed.

At this point, one can go either following Vasconcelos (1969) who shows that a projective module $M$ of the form $S^{-1}R$ is finitely generated if and only if its trace ideal is. In particular, if $R$ is noetherian and $S^{-1}R$ is a projective $R$-module, then $\Spec(S^{-1}R)$ is closed. It is thus open and closed, and we are in the situation of the basic example above.

One can also use a topological argument explained to me by Daniel Ferrand: a minimal prime ideal of $R$ that meets $\Spec(S^{-1}R)$ is disjoint from $S$, hence belongs to $\Spec(S^{-1}R)$. Consequently, $\Spec(S^{-1}R)$ is the union of the irreducible components of $\Spec(R)$ that it meets. If this set of irreducible components is finite (or locally finite), for example if $\Spec(R)$ is noetherian, for example if $R$ is a noetherian ring, then $\Spec(S^{-1}R)$ is closed.

I did not find the time to think more about this question, and it would be nice to have an example of a projective localization which does not come from this situation.

Saturday, March 16, 2024

Combinatorics of the nilpotent cone


Let $n$ be an integer and $F$ be a field. Nilpotent matrices $N\in \mathrm M_n(F)$ are those matrices for which there exists an integer $p$ with $N^p=0$. Their characteristic polynomial is $\chi_N(T)=T^n$, and they satisfy $N^n=0$, which shows that the set $\mathscr N_n$ of nilpotent matrices is an algebraic variety. The equation $N^n=0$ is homogeneous of degree $n$, so that $\mathscr N_n$ is a cone.

The classification of nilpotent matrices is an intermediate step in the theory of Jordan decomposition: In an adequate basis, a nilpotent matrix can be written as a diagonal block matrix of “basic” nilpotent matrices, $p \times p$ matrices of the form \[ \begin{pmatrix} 0 & 0 & \dots & 0 & 0 \\ 1 & 0 & & & \vdots \\ 0 & 1 & \ddots & & 0 \\ \vdots & \ddots & \ddots & \ddots & 0 \\ 0 & & 0 & 1 & 0\end{pmatrix} \] whose minimal polynomial is $T^p$. The sum of the sizes of these blocks is $n$ and in this way, it is associated with any $n\times n$ nilpotent matrix a partition $\pi$ of~$n$. It is known that two nilpotent matrices are conjugate if and only if they are associated with the same partition. For any partition $\pi$ of~$n$, let us denote by $J_\pi$ the corresponding matrix whose sizes of blocks are arranged in increasing order, and $\mathscr N_\pi$ the set of nilpotent matrices that are associated with the partition $\pi$.

The theorem of Fine and Herstein (1958)

Having to teach “agrégation” classes made me learn about a classic combinatorial result: counting the number of nilpotent matrices when $F$ is a finite field.

Theorem (Fine, Herstein, 1958). — Let $F$ be a finite field with $q$ elements. The cardinality of $\mathscr N_n(F)$ is $q^{n^2-n}$. Equivalently, the probability that an $n\times n$ matrix with coefficients in $F$ be nilpotent is $q^{-n}$.

The initial proof of this results relies on the action of $\GL_n(F)$ on $\mathscr N_n(F)$: we recalled that the orbits correspond with the partitions of $n$, hence a decomposition \[ \Card(\mathscr N_n(F)) = \sum_{\pi} \Card(\mathscr N_\pi(F)). \] We know that $\mathscr N_\pi(F)$ is the orbit of the matrix $J_\pi$ under the action of $\GL_n(F)$. By the classic orbit-stabilizer formula, one thus has \[ \Card(\mathscr N_\pi(F)) = \frac{\Card(\GL_n(F))}{\Card(C_\pi(F))}, \] where $C_\pi(F)$ is the set of matrices $A\in\GL_n(F)$ such that $AJ_\pi=J_\pi A$. The precise description of $C_\pi(F)$ is delicate but their arguments go as follow.

They first replace the group $C_\pi(F)$ by the algebra $A_\pi(F)$ of all matrices $A\in\mathrm M_n(F)$ such that $AJ_\pi=J_\pi A$. For any integer, let $m_i$ be the multiplicity of an integer $i$ in the partition $\pi$, so that $n=\sum i m_i$. The block decomposition of $J_\pi$ corresponds with a decomposition of $F^n$ as a direct sum of invariant subspaces $V_i$, where $V_i$ has dimension $i m_i$. In fact, $V_1=\ker(J_\pi)$, $V_1\oplus V_2=\ker(J_\pi^2)$, etc. This shows that $A_\pi(F)$ is an algebra of block-triangular matrices. Moreover, the possible diagonal blocks can be shown to be isomorphic to $\mathrm M_{m_i}(F)$. In other words, we have a surjective morphism of algebras \[ A_\pi(F) \to \prod_i \mathrm M_{m_i}(F), \] whose kernel consists of nilpotent matrices. In particular, the proportion of invertible elements in $A_\pi(F)$ is equal to the proportion of invertible elements in the product $\prod_i \mathrm M_{m_i}(F)$.

Ultimately, Fine and Herstein obtain an explicit sum over the set of partitions of $n$ which they prove equals $q^{n^2-n}$, after an additional combinatorial argument.

Soon after, the theorem of Fine and Herstein was given easier proofs, starting from Gerstenhaber (1961) to Kaplansky (1990) and Leinster (2021).

A proof

The following proof is borrowed from Caldero and Peronnier (2022), Carnet de voyage en Algébrie. It can be seen as a simplification of the proofs of Gerstenhaber (1961) and Leinster (2021).

Let us start with the Fitting decomposition of an endomorphism $u\in \mathrm N_n(F)$: the least integer $p$ such that $\ker(u^p)=\ker(u^{p+1})$ coincides with the least integer $p$ such that $\im(u^p)=\im(u^{p+1})$, and one has $F^n=\ker(u^p)\oplus \im(u^p)$. The subspaces $N(u)=\ker(u^p)$ and $I(u)=\im(u^p)$ are invariant under $u$, and $u$ acts nilpotently on $\ker(u^p)$ and bijectively on $\im(u^p)$. In other words, we have associated with $u$ complementary subspaces $N(u)$ and $I(u)$, a nilpotent operator of $N(u)$ and an invertible operator on $I(u)$. This map is bijective.

For any integer $d$, let $\nu_d$ be the cardinality of nilpotent matrices in $\mathrm M_d(F)$, and $\gamma_d$ be the cardinality of invertible matrices in $\mathrm M_d(F)$. Let also $\mathscr D_d$ be the set of all pairs $(K,I)$, where $K$ and $I$ are complementary subspaces of dimensions $d$, $n-d$ of $F^n$. We thus obtain \[ n^2 = \sum_{(K,I)\in\mathscr D_d} \nu_d \cdot \gamma_{n-d}. \] We need to compute the cardinality of $\mathscr D_d$. In fact, given one pair $(K,I)\in\mathscr D_d$, all other are of the form $(g\cdot K,g\cdot I)$, for some $g\in\GL_n(F)$: the group $\GL_n(F)$ acts transitively on $\mathscr D_d$. The stabilizer of $(K,I)$ can be identified with $\GL_d(F)\times \GL_{n-d}(F)$. Consequently, \[ \Card(\mathscr D_d) = \frac{\Card(\GL_n(F))}{\Card(\GL_d(F)\Card(\GL_{n-d}(F))} = \frac{\gamma_n}{\gamma_d \gamma_{n-d}}. \] We thus obtain \[ q^{n^2} = \sum_{d=0}^n \frac{\gamma_n}{\gamma_d \gamma_{n-d}} \nu_d \gamma_{n-d} = \gamma_n \sum_{d=0}^n \frac{\nu_d}{\gamma_d}. \] By subtraction, we get \[ \frac{\nu_n}{\gamma_n} = \frac {q^{n^2}}{\gamma_n} - \frac{q^{(n-1)^2}}{\gamma_{n-1}},\] or \[ \nu_n = q^{n^2} - q^{(n-1)^2} \frac{\gamma_n}{\gamma_{n-1}}. \] It remains to compute $\gamma_n$: since an invertible matrix consists of a nonzero vector, a vector which does not belong to the line generated by the first one, etc., we have \[ \gamma_n = (q^n-1) (q^n-q)\dots (q^n-q^{n-1}). \] Then, \[ \gamma_n = (q^n-1) q^{n-1} (q^{n-1}-1)\dots (q^{n-1}-q^{n-2}) = (q^n-1)q^{n-1} \gamma_{n-1}. \] We thus obtain \[ \nu_n = q^{n^2} - q^{(n-1)^2} (q^n-1) q^{n-1} = q^{n^2} - q^{(n-1)^2} q^{2n-1} + q^{(n-1)^2} q^{n-1} = q^{n^2-n}, \] as claimed.

The proof of Leinster (2021)

Leinster defines a bijection from $\mathscr N_n(F)\times F^n$ to $\mathrm M_n(F)$. The definition is however not very canonical, because he assumes given, for any subspace $V$ of $F^n$, a basis of $V$.

Take a pair $(u,x)$, where $u\in\mathscr N_n(F)$ and $x\in F^n$ and consider the subspace $V_x=\langle x,u(x),\dots\rangle$, the smallest $u$-invariant subspace of $F^n$ which contains $x$. To describe $u$, we observe that we know its restriction to $V_x$, and we need to describe it on the chosen complementary subspace $V'_x$.

To that aim, we have to give ourselves an endomorphism $u'_x$ of $V'_x$ and a linear map $\phi_x\colon V'_x\to V_x$. Since we want $u$ to be nilpotent, it is necessary and sufficient to take $u'_x$ nilpotent.

Instead of considering $\phi_x\colon V'_x\to V_x$, we can consider the map $y\mapsto y+\phi_x(y)$. Its image is a complement $W_x$ of $V_x$ in $F^n$, and any complement can be obtained in this way. The nilpotent endomorphism $u'_x$ of $V'_x$ transfers to a nilpotent endomorphism $w_x$ of $W_x$.

All in all, what the given pair $(u,x)$ furnishes is a subspace $V_x$ with a basis $(x_1=x,x_2=u(x),\dots)$, a complement $W_x$, and a nilpotent endomorphism $w_x$ of $W_x$. This is more or less what the Fitting decomposition of an endomorphism gives us! Recall that $V_x$ was assumed to have been given a basis $(e_1,\dots,e_p)$. There exists a unique automorphism of $V_x$ which maps $e_i$ to $u^{i-1}(x)$ for all $i$. In other words, we have a pair of complementary subspaces $(V_x,W_x)$, a linear automorphism of $V_x$, and a nilpotent automorphism of $W_x$. By the Fitting decomposition, these data furnish in a bijective way an endomorphism of $F^n$, and that concludes the proof.

A remark about motivic integration

The framework of motivic integration suggests to upgrade these combinatorial results into equalities valid for all field $F$, which hold in the Grothendieck ring of varieties $\KVar_F$. As an abelian group, it is generated by symbols $[X]$, for all algebraic varieties $X$ over $F$, with relations $[X]=[Y]+[X\setminus Y]$, whenever $Y$ is a closed subvariety of $X$. The ring structure is defined so that the formula $[X]\cdot[Y]=[X\times Y]$ for all algebraic varieties $X$ and $Y$ over $F$.

By construction of this ring, equalities $[X]=[Y]$ in $\KVar_F$ imply that many invariants of $X$ and $Y$ coincide. In particular, when $F$ is a finite field, they will have the same number of points.

The question is thus to compute the class $[\mathscr N_n]$ of the variety $\mathscr N_n$, for any field $F$. The proofs that I described above can be more or less transferred to this context and imply the following theorem. We denote by $\mathbf L\in \KVar_F$ the class of the affine line $\mathbf A^1$.

Theorem. — One has an equality $[\mathscr N_n] \mathbf L^n = \mathbf L^{n^2}$ in the localization of the Grothendieck ring $\KVar_F$ by the element $(\mathbf L-1)\dots(\mathbf L^{n-1}-1)$.

The following question is then natural. (I have not thought about it at all.)

Question. — Does one have $[\mathscr N_n]=\mathbf L^{n^2-n}$ in $\KVar_F$?

Friday, October 13, 2023

A combinatorial proof of the Newton identities

Let $T_1,\dots,T_n$ be indeterminates. For all $m\geq0$, denote by $S_m$ the $m$th elementary symmetric polynomial, given by \[ S_m = \sum_{i_1 \lt \dots \lt i_m} T_{i_1}\dots T_{i_m}. \] These are polynomials in $T_1,\dots, T_n$, and as their names suggest, these polynomials are symmetric, meaning that \[ S_m (T_{\sigma(1)},\dots,T_{\sigma(n)}) = S_m (T_1,\dots,T_n) \] for any permutation $\sigma$ of $\{1,\dots,n\}$. By definition, one has $S_0=1$ (there is exactly one family of integers, the empty one, and the empty product is equal to~$1$) and $S_m=0$ for $m>n$ (there are no families of integers $1\leq i_1\lt\dots\lt i_m\leq n$). A well-known theorem asserts that any symmetric polynomial in $T_1,\dots,T_n$ (with coefficients in a commutative ring $R$) can be written uniquely as a polynomial in $S_1,\dots,S_n$ (with coefficients in $R$). It is in particular the case for the Newton polynomials, defined for $p\geq 0$ by \[ N_p = T_1^p + \dots+T_n^p . \] In particular, $N_0=n$ and $N_1=S_1$. The next Newton polynomial can be computed by “completing the square”: \[ N_2 = T_1^2+\dots+T_n^2 = (T_1+\dots+T_n)^2 - 2 S _1 = S_1^2 - 2S_1. \] These are the first two (or three) of a family of relations, called the Newton identities, that relate the polynomials $N_p$ and the polynomials $S_m$: \[  \sum_{m=0}^{p-1} (-1)^m N_{p-m}\cdot S_m + (-1)^p p \cdot S_p = 0. \] Using that the coefficient of $N_p$ is given by $S_0=1$, they allow to express inductively the polynomials $N_p$ in terms of $S_1,\dots,S_p$. If $p>n$, all terms with $m>n$ vanish. Using that the coefficient of $S_p$ is $N_0=p$, these formulas also show, conversely, that if $p!$ is invertible in the commutative ring $R$, then $S_1,\dots,S_p$ can be expressed in terms of $N_1,\dots,N_p$. There are various proofs of these identities. Most of the one I have seen are algebraic in nature, in so that they rely on the relations between the roots of a polynomial and its coefficients, and/or on expansion in power series. The following proof, due to Doron Zeilberger (1984) (Discrete Math. 49, p. 319, doi:10.1016/0012-365X(84)90171-7), is purely combinatorial. I have been made aware of it because it is the one that is formalized in the proof assistant system Lean. (See there for the source code.) We consider the set $S=\{1,\dots,n\}$ and define a set $X$ whose elements are the pairs $(A,k)$ satisfying the following properties:
  • $A$ is a subset of $S$;
  • $k$ is an element of $S$;
  • the cardinality of $A$ is less or equal than $p$;
  • if ${\mathop {\rm Card}}(A)=p$, then $k\in A$.
Define a map $f\colon X\to X$ by the following rule: for $(A,k)$ in $X$, set
  • $f(A,k) = (A \setminus\{k\}, k)$ if $k\in A$;
  • $f(A,k) = (A\cup\{k\}, k)$ if $k\notin A$.
Write $f(A,k)=(A',k)$; observe that this is again an element of $X$. The first two conditions are obvious. In the first case, when $k\in A$, then the cardinality of $A'$ is one less than the cardinality of $A$, hence is at most $p-1$, and the last condition is vacuous. In the second case, when $k\notin A$, the cardinality of $A$ is strictly smaller than $p$, because $(A,k)\in X$, so that the cardinality of $A'$ is at most $p$; moreover, $k\in A'$ hence the last condition holds. Observe that $f\colon X\to X$ is an involution: $f\circ f={\mathrm{id}}_X$. Indeed, with the same notation, one has $f(A,k)=(A',k)$. In the first case, $k\in A$, hence $k\notin A'$, so that $f(A',k)=(A'\cup\{k\},k)=(A,k)$; in the second case, one has $m\in A'$, hence $f(A',k)=(A'\setminus\{k\}, k)=(A,k)$ since $A'=A\cup\{k\}$ and $k\notin A$. Moreover, $f$ has no fixed point because it switches pairs $(A,k)$ such that $k\in A$ and pairs such that $k\notin A$. Zeilberger's proof of the Newton identities build on a function $w\colon X\to R[T_1,\dots,T_n]$ such that $w(f(A,k))=-f(A,k)$ for all $(A,k)\in X$. Since $f$ has no fixed point, the expression vanishes \[ \sum_{(A,k)\in X} w(A,k) \] since one can cancel a term $(A,k)$ with its image $f(A,k)$. Zeilberger's definition of the function $w$ is \[ w(A,k) = (-1)^{\mathrm{Card}(A)} T_k^{p-\mathrm{Card}(A)}\cdot \prod_{a\in A} T_a. \] It satisfies the desired relation: for $(A,k)\in X$ such that $k\notin A$, one has $f(A,k)=(A\cup\{k\}, k)$, so that \[ w(f(A,k)) = (-1)^{\mathrm{Card}(A)+1} T_k^{k-\mathrm{Card}(A)-1} \prod_{a\in A\cup\{k\}} T_a = - (-1)^{\mathrm{Card}(A)} T_k^{k-\mathrm{Card}(A)} \prod_{a\in A} T_a = w(A,k). \] When $k\in A$, one has $f(A,k)=(A',k)$ with $k\notin A'$, and the formula applied to $(A',k)$ implies the one for $(A,k)$, because $f$ is an involution. Now, in the relation $ \sum_{(A,k)} w(A,k) =0$, we first sum on $A$ and then on $k$: \[ \sum_{A\subset S} \sum_{k | (A,k) \in X} (-1)^{\mathrm{Card}(A)} T_k^{p-{\mathrm{Card}(A)}} \prod_{a\in A} T_a = 0\] and put together all those sets $A$ which have the same cardinality, say, $m$. This gives \[ \sum_{m=0}^n (-1)^m \sum_{\mathrm{Card}(A)=m} \prod_{a\in A} T_a \cdot \sum_{k | (A,k)\in X} T_k^{p-m} = 0. \] Let us compute separately the contribution to the sum of each $m$. When $m\gt p$, no subset $A$ is authorized so that the corresponding sum is zero. On the opposite, when $m\lt p$, for any subset $A$ such that $\mathrm{Card}(A)=m$, all pairs $(A,k)$ belong to $X$, so that the inner sum is $N_{p-m}$, and the whole term is $(-1)^mS_m N_{p-m}$. Finally, when $m=p$, the only $k$ that appear in the inner sum are the elements of $A$, and the corresponding term is $\sum_{k\in A} T_k^0=\mathrm {Card}(A)=p$; this term contributes to $(-1)^p p S_p$.

Friday, August 4, 2023

On soccer balls

This is a swift adaptation of a thread on Mastodon written after having read the beginning of Mara Goyet's column in yesterday's edition of Le Monde (please don't spoil the game by jumping to it now). So look at this picture below.
Does something strike you in it? The directions? The ball? The character? Nothing? 68 people voted, roughly one third found nothing, and two thirds saw that there's a problem with the ball. One nice remark from Sarah was that the ball was larger than the character! So, what's the problem with the ball? As it has been observed by Matt Parker in 2017, no soccer ball can be made following what the picture suggests: what we see is a combination of hexagons, some black, some white, and it is actually impossible to stitch hexagons together to do a ball. Or the other way round, it is impossible to take a ball and draw an hexagonal pattern on it. On the other hand, it is quite easy to draw an hexagonal pattern on a plane, and many kitchen/bathrooms feature such a tiling. Here is a photograph (stolen to somebody on Pinterest) of a 4th cent. BCE pavement (Porta Rosa, in the city of Elea-Velia, Italy)
But on a ball, nope. So how is a soccer ball made if not with hexagons? — The point is that is not made of hexagons only : it has 20 hexagons (so that the sign post above is slightly reminiscent of a soccer ball) but it also has 12 pentagons.
The standard “Tesla” soccer ball with its white hexagons and black pentagons. (Wikimedia, shine2010)
The whole structure forms what is called a truncated icosahedron. A regular icosahedron, as shown on the picture below (the Spinoza monument in Amsterdam) would have 20 triangular faces, such that five of them meet at each vertex. If one make a cut at these vertices, each of them is replaced by a pentagon, and one get the truncated icosahedron. I should add the mathematical explanation for the impossibility of stitching hexagons into a ball. This comes from a formula due to Euler that if one stitches polygons to a ball in anyway, some relation must hold : the number of vertices minus the number of edges plus the number of polygons is always equal to 2. Imagine you have some number of hexagons, say \(n\). That makes \(6n\) edges, but each time you make a stitch, you glue two edges together, hence \(3n\) edges in the end. What about the vertices, namely the points when at least 3 hexagons will be joined? If the picture is correct, then 3 hexagons meet at each vertex, so the number of vertices is \(6n/3=2n\) — Then Euler's formula says \(2n - 3n + n = 2\), which is absurd. … Now, is this really a necessity that exactly 3 hexagons meet at each vertex? Couldn't we have more complicated configurations? As a matter of fact, I'm not sure about the answer, and I can try imagine a strange configurations where there are a bunch of hexagons meeting in a different number of points. But algebra seems to contradict this in the end. Let me try. Assume we have \(s_3\) vertices where 3 hexagons meet, \(s_4\) vertices where 4 vertices meet, etc. The total number of vertices is then \(s = s_3 + s_4 + \cdots\). The \(n\) hexagons contribute to \(6n\) vertices, but those of type \(s_3\) count for 3 of them, those of type \(s_4\) for 4 of them, etc. so that \(6n = 3s_3 + 4s_4 +\cdots = 3s + d\), where \(d\) is defined by \(d = s_4 + 2s_5+ \cdots\). In particular, we have \(6n \geq 3s\), which means \(2n \geq s\). Euler's formula still says \(2 = s - 3n + n = s - 2n \leq 0\), still giving a contradiction. And if you followed that proof, you may object that there is actually a way to stitch hexagons into a ball: it consists in stitching exactly 2 hexagons together. You get something completely flat, but if you blow enough air or fill it with wool, so that it gets round enough, you might feel that you got a ball. The picture made by the hexagons on a ball is that of an hexagon drawn on the equator, each of them covering one hemisphere. Here are additional references
  • A post on the BBC site, where the problem is explained roughly, up to the fact that the UK authorities refused to propose a correct design for the post after Matt Parker raised up the issue
  • A video by Matt Parker on his YouTube channel, @standupmaths, where he discusses how soccer balls are made, especially in the 2014 WWC or the UK Premier Ligue
  • A text by Étienne Ghys on the large audience website Image des mathématiques about the 2014 WWC soccer ball (in French)
  • Euler's formula has a rich history which is evoked in the Wikipedia page quoted above, but is also the matter of the great book Proofs and Refutations by Imre Lakatos.

Thursday, July 6, 2023

Electrostatics and rationality of power series

I would like to tell here a story that runs over some 150 years of mathematics, around the following question: given a power series $\sum a_n T^n$ (in one variable), how can you tell it comes from a rational function?

There are two possible motivations for such a question. One comes from complex function theory: you are given an analytic function and you wish to understand its nature — the simplest of them being the rational functions, it is natural to wonder if that happens or not (the next step would be to decide whether that function is algebraic, as in the problem of Hermann Amandus Schwarz (1843–1921). Another motivation starts from the coefficients $(a_n)$, of which the power series is called the generating series; indeed, the generating series is a rational function if and only if the sequence of coefficients satisfies a linear recurrence relation.

At this stage, there are little tools to answer that question, besides is a general algebraic criterion which essentially reformulates the property that the $(a_n)$ satisfy a linear recurrence relation. For any integers $m$ and $q$, let $D_m^q$ be the determinant of size $(q+1)$ given by \[ D_m^q = \begin{vmatrix} a_m & a_{m+1} & \dots & a_{m+q} \\ a_{m+1} & a_{m+2} & \dots & a_{m+q+1} \\ \vdots & \vdots & & \vdots \\ a_{m+q} & a_{m+q+1} & \dots & a_{m+2q} \end{vmatrix}. \] These determinants are called the Hankel determinants or (when $m=0$) the Kronecker determinants, from the names of the two 19th century German mathematicians Hermann Hankel (1839—1873) and Leopold von Kronecker (1823–1891). With this notation, the following properties are equivalent:

  1. The power series $\sum a_n T^n$ comes from a rational function;
  2. There is an integer $q$ such that $D_m^q=0$ for all large enough integers $m$;
  3. For all large enough integers $q$, one has $D^q_0=0$.
(The proof of that classic criterion is not too complicated, but the standard proof is quite smart. In his book Algebraic numbers and Fourier analysis, Raphaël Salem gives a proof which arguably easier.)

Since this algebraic criterion is very general, it is however almost impossible to prove the vanishing of these determinants without further information, and it is at this stage that Émile Borel enters the story. Émile Borel (1871–1956) has not only be a very important mathematician of the first half of the 20th century, by his works on analysis and probability theory, he also was a member of parliament, a minister of Navy, a member of Résistance during WW2. He founded the French research institution CNRS and of the Institut Henri Poincaré. He was also the first president of the Confédération des travailleurs intellectuels, a intellectual workers union.

In his 1893 paper « Sur un théorème de M. Hadamard », Borel proves the following theorem:

Theorem. — If the coefficients \(a_n\) are integers and if the power series \(\sum a_n T^n \) “defines” a function (possibly with poles) on a disk centered at the origin and of radius strictly greater than 1, then that power series is a rational function.

Observe how these two hypotheses belong to two quite unrelated worlds: the first one sets the question within number theory while the second one resorts from complex function theory. It looks almost as magic that these two hypotheses lead to the nice conclusion that the power series is a rational function.

It is also worth remarking that the second hypothesis is really necessary for the conclusion to hold, because rational functions define functions (with poles) on the whole complex plane. The status of the first hypothesis is more mysterious. While it is not necessary, the conclusion may not hold without it. For example, the exponential series \(\sum T^n/n!\) does define a function (without poles) on the whole complex plane, but is not rational (it grows too fast at infinity).

However, the interaction of number theoretical hypotheses with the question of the nature of power series was not totally inexplored at the time of Borel. For example, a 1852 theorem of the German mathematician Gotthold Eisenstein (Über eine allgemeine Eigenschaft der Reihen-Entwicklungen aller algebraischen Functionen) shows that when the coefficients \(a_n\) of the expansion \(\sum a_nT^n\) of an algebraic functions are rational numbers, the denominators are not arbitrary: there is an integer \(D\geq 1\) such that for all \(n\), \(a_n D^{n+1}\) is an integer. As a consequence of that theorem of Eisenstein, the exponential series or the logarithmic series cannot be algebraic.

It's always time somewhere on the Internet for a mathematical proof, so that I have no excuse for avoiding to tell you *how* Émile Borel proved that result. He uses the above algebraic criterion, hence needs to prove that some determinants \(D^q_m\) introduced above do vanish (for some \(q\) and for all \(m\) large enough). Then his idea consists in observing that these determinants are integers, so that if you wish to prove that they vanish, it suffices to prove that they are smaller than one!

If non-mathematicians are still reading me, there's no mistake here: the main argument for the proof is the remark that a nonzero integer is at least one. While this may sound as a trivial remark, this is something I like to call the main theorem of number theory, because it lies at the heart of almost all proofs in number theory.

So one has to bound determinants from above, and here Borel invokes the « théorème de M. Hadamard » that a determinant, being the volume of the parallelipiped formed by the rows, is smaller than the product of the norms of these rows, considered as vectors of the Euclidean space : in 2-D, the area of a parallelogram is smaller than the lengths of its edges! (Jacques Hadamard (1865—1963) is known for many extremely important results, notably the first proof of the Prime number theorem. It is funny that this elementary result went into the title of a paper!)

But there's no hope that using Hadamard's inequality of our initial matrix can be of some help, since that matrix has integer coefficients, so that all rows have size at least one. So Borel starts making clever row combinations on the Hankel matrices that take into accounts the poles of the function that the given power series defines.

Basically, if \(f=\sum a_nT^n\), there exists a polynomial \(h=\sum c_mT^m\) such that the power series \(g=fh = \sum b_n T^n\) defines a function without poles on some disk \(D(0,R)\) where \(R>1\). Using complex function theory (Cauchy's inequalities), this implies that the coefficients \(b_n\) converge rapidly to 0, roughly as \(R^{-n}\). For the same reason, the coefficients \(a_n\) cannot grow to fast, at most as \(r^{-n}\) for some \(r>0\). The formula \(g=fh\) shows that coefficients \(b_n\) are combinations of the \(a_n\), so that the determinant \(D_n^q\) is also equal to \[ \begin{vmatrix} a_n & a_{n+1} & \dots & a_{n+q} \\ \vdots & & \vdots \\ a_{n+p-1} & a_{n+p} & \dots & a_{n+p+q-1} \\ b_{n+p} & b_{n+p+1} & & b_{n+p+q} \\ \vdots & & \vdots \\ b_{n+q} & b_{n+q+1} & \dots & b_{n+2q} \end{vmatrix}\] Now, Hadamard's inequality implies that the determinant \(D_n^q\) is (roughly) bounded above by \( (r^{-n} )^p (R^{-n}) ^{q+1-p} \): there are \(p\) lines bounded above by some \(r^{-n}\) and the next \(q+1-p\) are bounded above by \(R^{-n}\). This expression rewrites as \( 1/(r^pR^{q+1-p})^n\). Since \(R>1\), we may choose \(q\) large enough so that \(r^p R^{q+1-p}>1\), and then, when \(n\) grows to infinity, the determinant is smaller than 1. Hence it vanishes!

The next chapter of this story happens in 1928, under the hands of the Hungarian mathematician George Pólya (1887-1985). Pólya had already written several papers which explore the interaction of number theory and complex function theory, one of them will even reappear later one in this thread. In his paper “Über gewisse notwendige Determinantenkriterien für die Fortsetzbarkeit einer Potenzreihe”, he studied the analogue of Borel's question when the disk of radius \(R\) is replaced by an arbitrary domain \(U\) of the complex plane containing the origin, proving that if \(U\) is big enough, then the initial power series is a rational function. It is however not so obvious how one should measure the size of \(U\), and it is at this point that electrostatics enter the picture.

In fact, it is convenient to make an inversion : the assumption is that the series \(\sum a_n / T^n\) defines a function (with poles) on the complement of a compact subset \(K\) of the complex plane. Imagine that this compact set is made of metal, put at potential 0, and put a unit electric charge at infinity. According to the 2-D laws of electrostatics, this create an electric potential \(V_K\) which is identically \(0\) on \(K\) and behaves as \( V_K(z)\approx \log(|z|/C_K)\) at infinity. Here, \(C_K\) is a positive constant which is the capacity of \(K\).

Theorem (Pólya). — Assume that the \(a_n\) are integers and the series \(\sum a_n/T^n\) defines a function (with poles) on the complement of \(K\). If the capacity of \(K\) is \(\lt1\), then \(\sum a_n T^n\) is rational.

To apply this theorem, it is important to know of computations of capacities. This was a classic theme of complex function theory and numerical analysis some 50 years ago. Indeed, what the electric potential does is solving the Laplace equation \(\Delta V_K=0\) outside of \(K\) with Dirichlet condition on the boundary of \(K\).

In fact, the early times of complex analysis made a remarkable use of this fact. For example, it was by solving the Laplace equation that Bernhard Riemann proved the existence of meromorphic functions on “Riemann surfaces”, but analysis was not enough developed at that time (around 1860). In a stunningly creative move, Riemann imagines that his surface is partly made of metal, and partly of insulating material and he deduces the construction of the desired function from the electric potential.

More recently, complex analysis and potential theory also had applications to fluid dynamics, for example to compute (at least approximately) the flow of air outside of an airplane wing. (I am not a specialist of this, but I'd guess the development of numerical methods that run on modern computers rendered these beautiful methods obsolete.)

The relation between the theorems of Borel and Pólya is that the capacity of a disk is its radius. This can be seen by the fact that \(V(z)=\log(|z|/R\)\) solves the Laplace equation with Dirichlet condition outside of the disk of radius \(R\).

A few other capacities have been computed, not too many, in fact, because it appears to be a surprisingly difficult problem. For example, the capacity of an interval is a fourth of its length.

Pólya's proof is similar to Borel's, but considers the Kronecker determinant in place of Hankel's. However, the linear combinations that will allow to show that this determinant is small are not as explicit as in Borel's proof. They follow from another interpretation of the capacity introduced by the Hungarian-Israeli mathematician Michael Fekete (1886–1957; born in then Austria–Hungary, now Serbia, he emigrated to Palestine in 1928.)

You know that the diameter \(d_2(K)\) of \(K\) is the upper bound of all distances \(|x-y|\) where \(x,y\) are arbitrary points of \(K\). Now for an integer \(n\geq 2\), consider the upper bound \(d_n(K)\) of all products of distances \( \prod_{i\neq j}{x_j-x_i}\)^{1/n(n-1)}\) where \(x_1,\dots,x_n\) are arbitrary points of \(K\). It is not so hard to prove that the sequence \(d_n(K)\) decreases with \(n\), and the limit \(\delta(K)\) of that sequence is called the transfinite diameter by Fekete.

Proposition. — \( \delta(K)= C_K\).

This allows to make a link between capacity theory and another theme of complex function theory, namely the theory of best approximation, which end up in Pólya's proof: the adequate linear combination for the \(n\)th row is given by the coefficients of the monic polynomial of degree \(n\) which has the smallest least upper bound on \(K\).

If all this is of some appeal to you, there's a wonderful little book by Thomas Ransford, Potential Theory in the Complex Plane, which I find quite readable (say, from 3rd or 4th year of math studies on).

In the forthcoming episodes, I'll discuss two striking applications of the theorems of Borel and Pólya to proof by Bernhard Dwork of a proof of a conjecture of Weil (in 1960), and by a new proof (in 1987) by Jean-Paul Bézivin and Philippe Robba of the transcendence of the numbers \(e\) and \(\pi\), two results initially proven by Charles Hermite and Ferdinand von Lindemann in 1873 and 1882.