Tuesday, December 21, 2021

The very simple proof that the alternating group on 5 letters (or more) is simple

\def\supp {\operatorname{supp}}

As explained in the previous post, I wanted to formalize the proof that the alternating group on 5 letters or more is simple, using the Iwasawa criterion. This formalization is stuck in the middle, because it requires a proof that the natural action of the alternating group $A_n$ on $k$-element subsets of $\{1,\dots,n\}$ is primitive, provided $n\neq 2k$, $k\neq 0,n$ and $n\geq 5$.

A simple proof

There is a simple, slightly computational proof that the alternating group on 5 letters (or more) is simple; it is explained in many textbooks, for example in James Milne's Group theory notes. However, its formalization is not so easy, because even Milne leaves a lot implicit.

The situation, as in Lemma 4.36 of Milne's booklet, is the following.

One is given a nontrivial normal subgroup $N$ of $A_n$ and one wishes to prove that $N=A_n$. We know that the 3-cycles generate $A_n$. We also know that 3-cycles are pairwise conjugate in the symmetric group, and also, using $n\geq 5$, in the alternating group $A_n$. Consequently, it suffices to prove that $N$ contains a 3-cycle. To that aim, one argues by induction on the cardinality of the support $\supp(g)$ of a nontrivial element $g$ of $N$.

Here is an attempt at presenting the induction argument in a way which is as straightforward as possible.

Since $g\neq 1$, $\supp(g)$ has cardinality $\geq 2$.

If $\supp(g)$ has cardinality $2$, then $g$ is a transposition, contradicting the hypothesis that $g$ belongs to the alternating group. So the support of $g$ has cardinality $\geq 3$.

If $\supp(g)$ has cardinality $3$, then $g$ is a 3-cycle, and we are done. So we may assume that the support of $g$ has cardinality $\geq 4$.

Let us choose an element $a\in\{1,\dots,n\}$ which belongs to the largest cycle of $g$ and set $b=g(a)$; by assumption, one has $b\neq a$. The proof consists in considering an element $c\in\{1,\dots,n\}$ such that $c\neq a$ and $c\neq b$, the 3-cycle $h=(a\,b\,c)$ and the conjugate $g'=hgh^{-1}$.

Since $h$ is a 3-cycle, it belongs to the alternating group; since $N$ is normal, one has $g'\in N$.

We wish to apply the induction hypothesis to the element $g'g^{-1}$ of $N$. So we need to prove:

  1. $g'\neq g$, and
  2. the support of $g'g^{-1}$ has cardinality strictly smaller than that of $g$.

To guarantee (1), that $g'\neq g$, it suffices to choose $c$ such that $g'(b)\neq g(b)$. But $g'(b) = hgh^{-1}(b) = hg(a) = h(b) = c$, so the new assumption we make is that $c\neq g(b)$.

The rest of the argument is devoted to finding appropriate conditions on $c$ that guarantee (2). First, observe the inclusion $\supp(g'g^{-1})\subset g(\supp(h))\cup \supp(h)$, which is proved by contraposition. Indeed, if $x$ does not belong to the right hand side, then $g^{-1}(x) \notin \supp(h)$, hence $h^{-1}g^{-1}(x)=g^{-1}(x)$ (for example, using that $\supp(h)=\supp(h^{-1})$), and then $g'g^{-1}(x)=hgh^{-1}(g^{-1}(x))=hg(g^{-1}(x))=h(x)=x$, since $x\notin\supp(h)$. This proves that the cardinality of the support of $g'g^{-1}$ is at most 6.

However, since $g(a)=b$ belongs both to $g(\supp(h))$ and to $\supp(h)$, the cardinality is at most 5. Explicitly, $\supp(g'g^{-1})\subset \{a,b,c,g(b), g(c)\}$. In particular, a clever choice for $c$ is only needed when $\supp(g)$ has cardinality 4 or 5!

To conclude in the remaining cases, we remark that there are only two possibilities for the cycle-type of $g$: it can only be $(5)$ or $(2,2)$, since $g$ is an even permutation, and we split the discussion according to these two cases:

  • If the cycle-type of $g$ is $(5)$, then we choose for $c$ the “last” element of the cycle of $a$, namely $c=g^{-1}(a)$. Then $g(c)=a$, so that $\supp(g'g^{-1})\subset\{a,b,c,g(b)\}$, which has at most 4 elements.
  • If the cycle-type of $g$ is $(2,2)$, then we have $g(b)=a$ and we choose for $c$ any fixed point of $g$. Then $\supp(g'g^{-1})\subset\{a,b,c\}$ has at most 3 elements.
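These support computations are concrete enough to be checked by machine. Here is a quick Python sanity check (entirely separate from the Lean formalization; the helpers `from_cycles`, `shrunk_support`, and the choice $n=7$ are mine), which conjugates $g$ by $h=(a\,b\,c)$ and computes the support of $g'g^{-1}$ in the two cases above:

```python
def compose(p, q):
    """Composition p∘q of permutations given as dicts: x -> p[q[x]]."""
    return {x: p[q[x]] for x in p}

def inverse(p):
    return {v: k for k, v in p.items()}

def support(p):
    return {x for x in p if p[x] != x}

def from_cycles(cycles, n):
    """The permutation of {0,...,n-1} that is the product of the given cycles."""
    p = {x: x for x in range(n)}
    for c in cycles:
        for i, a in enumerate(c):
            p[a] = c[(i + 1) % len(c)]
    return p

def shrunk_support(g, a, c, n):
    """Support of g'g^{-1}, where g' = h g h^{-1} and h = (a b c), b = g(a)."""
    b = g[a]
    h = from_cycles([[a, b, c]], n)
    gp = compose(compose(h, g), inverse(h))
    return support(compose(gp, inverse(g)))

n = 7
# cycle-type (5): g = (0 1 2 3 4), a = 0, c = g^{-1}(a) = 4
s5 = shrunk_support(from_cycles([[0, 1, 2, 3, 4]], n), 0, 4, n)
# cycle-type (2,2): g = (0 1)(2 3), a = 0, c = a fixed point, say 5
s22 = shrunk_support(from_cycles([[0, 1], [2, 3]], n), 0, 5, n)
```

In both cases the computed support is strictly smaller than that of $g$, as the argument predicts.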

About the formalization

One annoying part of formalizing this argument is the elimination of cycle-types. One would like the proof assistant to be able to list all possible cycle-types of a given size. Presumably it can, but I cannot (yet), so I need to do the argument by hand, for that specific value.

In principle, that argument needs to be spelled out explicitly as well. We use two formulas:

  1. The sum of the lengths of the cycles is the cardinality of the support, hence $4$ or $5$ in this case.
  2. The signature of a permutation is even if and only if the number of cycles and the cardinality of the support have the same parity.

One way to write it down consists in taking the length $m$ of the smallest cycle of $g$. One has $m\geq 2$ by assumption.

  1. If there is no other cycle, then the cycle-type of $g$ is $(m)$. Then $m=4$ or $5$, and only $(5)$ respects the parity constraint.
  2. Otherwise, there is exactly one other cycle, for otherwise the sum of the lengths would be at least $3\cdot 2=6$. If $m'$ is the length of that other cycle, one has $2\leq m\leq m'$. Then $2m\leq m+m'\leq 5$, hence $m\leq 2$, so that $m=2$. This gives $m'\leq 3$, leaving the two cycle-types $(2,3)$ and $(2,2)$, of which only the second satisfies the parity constraint.
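This by-hand elimination can also be mimicked by a brute-force enumeration. In the Python sketch below (illustration only, with names of my choosing, not the Lean code), cycle-types with support of size $s$ are the partitions of $s$ into parts $\geq 2$, and evenness is tested with formula 2:

```python
def partitions_min2(s):
    """Partitions of s into non-decreasing parts >= 2 (cycle-types with support of size s)."""
    if s == 0:
        yield []
        return
    for first in range(2, s + 1):
        for rest in partitions_min2(s - first):
            if not rest or first <= rest[0]:
                yield [first] + rest

def even_cycle_types(support_sizes):
    """Cycle-types of even permutations, among the given support sizes."""
    out = []
    for s in support_sizes:
        for t in partitions_min2(s):
            # a permutation with support of size s and c cycles has signature (-1)^(s - c)
            if (s - len(t)) % 2 == 0:
                out.append(tuple(t))
    return out
```

For support sizes 4 and 5 this leaves exactly the two types $(2,2)$ and $(5)$.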

Monday, December 13, 2021

Not simple proofs of simplicity

The last few weeks, I started my self-education in proof formalization (in Lean) by a test case, the simplicity of the alternating group. In this blog post, I want to discuss some aspects of the underlying mathematics which I think I learnt during the formalization process.

Simple groups. Some examples

This is about groups, and eventually about finite groups. Let us recall that a subgroup $N$ of a group $G$ is normal if for every $g \in G$ and $n\in N$, one has $gng^{-1}\in N$. Concretely, this means that the two equivalence relations modulo $N$ (left or right) coincide, and that the quotient set $G/N$ inherits a group structure that makes the natural projection $G \to G/N$ a morphism of groups.

In this setting, the group $G$ can be viewed as an extension of the two groups $G/N$ and $N$, which are possibly simpler, and this extension may help in understanding properties of the group $G$. At the opposite extreme, one says that $G$ is simple if it has no normal subgroup besides $\{e\}$ and $G$ itself (and if $G\neq\{e\}$).

Trivial examples of normal subgroups of a group $G$ are $\{e\}$ and $G$ itself. Less trivial examples are the center $Z(G)$ of $G$ (those elements $g$ such that $gh=hg$ for all $h\in G$), and the derived subgroup $D(G)$, the subgroup generated by all commutators $ghg^{-1}h^{-1}$. This commutator subgroup is interesting: any subgroup $N$ of $G$ containing the commutator subgroup $D(G)$ is normal, and the quotient is abelian; and conversely.

The kernel of a group morphism is a normal subgroup, and the construction of a quotient shows that all normal subgroups appear in this way. In particular, the alternating group $A_n$ is a normal subgroup of the symmetric group $S_n$.

A simple group has either $Z(G)=\{e\}$ or $Z(G)=G$; in the latter case, this means that $G$ is commutative, and it quickly appears that $G$ has to be finite of prime order. Consequently, our discussion will now concern groups with trivial center.

The concept of simplicity of groups is classically presented in connection with Galois theory, and the question of solvability of equations by radicals. Namely, a “general” polynomial equation of degree $n$ has $S_n$, the full symmetric group on $n$ elements, as its Galois group, and, if $n\geq 5$, the only possible dévissage of this group consists in introducing the alternating group $A_n$ of even permutations, the subgroup $A_n$ being normal and simple. On the other hand, the solvability of a polynomial equation by radicals is equivalent to the existence of such a dévissage where all successive quotients are cyclic groups (equivalently, abelian groups). Since $A_n$ is not abelian, this implies that a “general” polynomial equation of degree $n$ is not solvable by radicals. However, using the simplicity of the alternating group is much stronger than what we need: what is really needed is the non-solvability of the symmetric group, and proving that it fails to be solvable for $n\geq 5$ is much simpler. Consequently, for the problem of resolution by radicals, it suffices to prove that the derived subgroup $D(A_n)$ of the alternating group is equal to $A_n$.

Theorem. — For $n\geq 5$, one has $D(A_n)=A_n$. In particular, the alternating group and the symmetric group on $n$ letters are not solvable.
I give two (equivalent) proofs, the second one being a computational interpretation of the first. Both use that the 3-cycles generate $A_n$ and are pairwise conjugate in $A_n$. The computational proof is shorter, arguably simpler. As a matter of fact, I never understood it, nor could remember it, until I translated the conceptual proof into the proof assistant.
Consider the morphism $p\colon A_n\to A_n/D(A_n)$. Since $A_n/D(A_n)$ is commutative, all 3-cycles have the same image. Since the square of a 3-cycle is again a 3-cycle, both have the same image. This implies that for every 3-cycle $g\in A_n$, one has $p(g)=p(g^2)$, hence $p(g)=e$. Since the 3-cycles generate $A_n$, the morphism $p$ is trivial; since it is surjective, one has $A_n/D(A_n)=\{e\}$ and $D(A_n)=A_n$.
Computationally, consider a 3-cycle $g$ and its square $g^2$. Since they are conjugate, there exists $h\in A_n$ such that $g^2=hgh^{-1}$. Then $g=hgh^{-1}g^{-1}$, so that $g$ is a commutator; in particular, $D(A_n)$ contains all 3-cycles, and $D(A_n)=A_n$.
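The computational proof can be checked on an explicit example. In the Python sketch below (illustration only; the choices $n=5$, $g=(0\,1\,2)$ and $h=(1\,2)(3\,4)$ are mine), conjugating $g$ by the even permutation $h$ gives $g^2$, so that $g=hgh^{-1}g^{-1}$ is indeed a commutator:

```python
def compose(p, q):
    """Composition p∘q of permutations given as dicts: x -> p[q[x]]."""
    return {x: p[q[x]] for x in p}

def inverse(p):
    return {v: k for k, v in p.items()}

def from_cycles(cycles, n):
    p = {x: x for x in range(n)}
    for c in cycles:
        for i, a in enumerate(c):
            p[a] = c[(i + 1) % len(c)]
    return p

n = 5
g = from_cycles([[0, 1, 2]], n)        # a 3-cycle
g2 = compose(g, g)                     # its square, the 3-cycle (0 2 1)
h = from_cycles([[1, 2], [3, 4]], n)   # a double transposition, hence even
conj = compose(compose(h, g), inverse(h))   # h g h^{-1}
comm = compose(conj, inverse(g))            # h g h^{-1} g^{-1}
```

The double transposition $(1\,2)(3\,4)$ is used rather than the transposition $(1\,2)$ precisely so that the conjugating element lies in $A_5$.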

The remaining cases, for $n\leq 4$, are interesting, but classically left as exercises in textbooks:

  1. One has $A_1=S_1=\{e\}$;
  2. The group $S_2$ is the cyclic group of order 2, hence is simple and solvable, and $A_2$ is trivial;
  3. The group $S_3$ is a noncommutative group of order 6, and $A_3$ is a cyclic group of order 3.
  4. The groups $S_4$ and $A_4$ are noncommutative and solvable, of orders 24 and 12. The derived subgroup $D(S_4)$ is equal to $A_4$, and $D(A_4)$ is the Klein subgroup $V_4$ of $S_4$, consisting of the identity and the permutations of the form $(a\,b)(c\,d)$ for $a,b,c,d$ any enumeration of $1,2,3,4$ — “double transpositions”. The group $V_4$ is commutative, isomorphic to $(\mathbf Z/2\mathbf Z)^2$, and the quotient $A_4/V_4$ is cyclic of order $3$.

Another classical series of simple groups appears in linear algebra. Let $F$ be a field and let $n$ be an integer such that $n\geq 2$. The group $\mathrm{GL}(n,F)$ of $n\times n$ invertible matrices is not simple, for it is noncommutative while its center consists of the homotheties; we can try to mod out by the center, getting the group $\mathrm{PGL}(n,F)=\mathrm{GL}(n,F)/F^\times$, but that one may not be simple. Indeed, another reason for $\mathrm{GL}(n,F)$ not to be simple is that it admits the special linear group $\mathrm{SL}(n,F)$, the kernel of the determinant, as a normal subgroup. The group $\mathrm{SL}(n,F)$ has a nontrivial center in general; it consists of the homotheties of ratio $a\in F^\times$ such that $a^n=1$ — let us denote it by $\mu_n$. But the quotient $\mathrm{PSL}(n,F)=\mathrm{SL}(n,F)/\mu_n$ is simple in general — in general meaning that it is always the case but for two exceptions:

  1. $n=2$ and $F=\mathbf F_2$. Then $\mathrm{PSL}(2,\mathbf F_2)\simeq S_3$ (by the action on $\mathbf P_1(\mathbf F_2)$, see below), hence is not simple.
  2. $n=2$ and $F=\mathbf F_3$. Then $\mathrm{PSL}(2,\mathbf F_3)\simeq A_4$ (again by the action on $\mathbf P_1(\mathbf F_3)$), and is not simple.

Bilinear algebra gives rise to new groups, orthogonal, unitary and symplectic, which also furnish simple groups up to elementary transformations. By the famous “classification of finite simple groups”, these constructions furnish all finite simple groups, up to 26 (or 27) examples called sporadic groups. This remarkable theorem has a fantastic proof, encompassing thousands of pages across the years 1960-2010.

But the question here is: How can one prove that a given group is simple?

Alternating groups

There is a large supply of proofs that the alternating group $A_n$ is simple for $n\geq 5$. Here is a sketch of one.

Let $N$ be a normal subgroup of $A_n$ ($n\geq 5$) and assume that $N\neq\{e\}$. An element of $A_n$ can be written as a product of an even number of transpositions. If two successive transpositions in the product are equal, we can cancel them; if they share exactly one common member, as in $(a\,b)(a\,c)$, their product is a 3-cycle $(a\,c\,b)$; if they have no common member, their product is a double transposition. On the other hand, if $n\geq 5$, we can either insert $(b\,c)(b\,c)$ in the product $(a\,b)(c\,d)$, writing a double transposition as a product of two 3-cycles, or insert $(d\,e)(d\,e)$ in the product $(a\,b)(a\,c)$, writing a 3-cycle as a product of two double transpositions. Consequently, $A_n$ is generated either by the 3-cycles, or by the double transpositions. Moreover, since $n\geq 5$, we can check that 3-cycles are pairwise conjugate in $A_n$, and similarly for double transpositions; consequently, if the normal subgroup $N$ of $A_n$ contains either a 3-cycle or a double transposition, it will contain all of them, hence be equal to $A_n$.

When $n=5$, the only case that remains to consider is when $N$ contains a 5-cycle, say $g=(1\,2\,3\,4\,5)$. Conjugating $g$ by the 3-cycle $h=(4\,5\,1)$, we get $hgh^{-1}=(h(1)\,h(2)\,h(3)\,h(4)\,h(5))=(4\,2\,3\,5\,1)\in N$. By construction, this element agrees with $g$ at $5$, but differs from it. Consequently, the commutator $hgh^{-1}g^{-1}$ is a nontrivial element of $N$ that fixes $1$ (it is in fact the 3-cycle $(2\,4\,5)$). By the first two cases, one has $N=A_5$.
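This small computation is easily machine-checked. Here is a Python verification (mine, independent of the Lean code), writing permutations of $\{1,\dots,5\}$ as dicts:

```python
def compose(p, q):
    """Composition p∘q of permutations given as dicts: x -> p[q[x]]."""
    return {x: p[q[x]] for x in p}

def inverse(p):
    return {v: k for k, v in p.items()}

def from_cycles(cycles, letters):
    p = {x: x for x in letters}
    for c in cycles:
        for i, a in enumerate(c):
            p[a] = c[(i + 1) % len(c)]
    return p

letters = range(1, 6)
g = from_cycles([[1, 2, 3, 4, 5]], letters)
h = from_cycles([[4, 5, 1]], letters)
conj = compose(compose(h, g), inverse(h))   # h g h^{-1}
comm = compose(conj, inverse(g))            # the commutator h g h^{-1} g^{-1}
fixed = {x for x in letters if comm[x] == x}
```

The conjugate is the 5-cycle $(4\,2\,3\,5\,1)$, and the commutator is the 3-cycle $(2\,4\,5)$, a nontrivial element of $N$ with fixed points.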

A similar argument works in general, arguing by descending induction on the cardinality of the fixed-point set of an element $g\neq e$ of $N$. One considers an element $h$ of $A_n$ and the conjugate $hgh^{-1}$; if $g=(a_1\,a_2\,\dots)(b_1\,b_2\,\dots)\dots$, then $hgh^{-1}=(ha_1\,ha_2\,\dots)(hb_1\,hb_2\,\dots)\dots$ is an element of $N$ that behaves as $g$ on many elements, but not all. Consequently, $hgh^{-1}g^{-1}$ is a nontrivial element of $N$ that fixes more elements than $g$, and we conclude by induction. (Details can be found in James Milne's lectures, 4.34.)

The Iwasawa criterion

In 1941, Kenkiti Iwasawa published a proof of the simplicity of the projective special linear group. From this proof, a general criterion for simplicity has been abstracted:

Theorem (Iwasawa). — Let $G$ be a group with a primitive action on a set $X$. Assume that one is given, for every $x\in X$, a subgroup $T(x)$ of $G$ satisfying the following properties:

  • For every $x\in X$, the subgroup $T(x)$ is commutative;
  • For every $g\in G$ and $x\in X$, $T(g\cdot x)=g T(x)g^{-1}$;
  • The union of the groups $T(x)$, for $x\in X$, generates $G$.
Then any normal subgroup $N$ of $G$ that acts nontrivially on $X$ contains the commutator subgroup of $G$. In particular, if $G=D(G)$, then $G$ is simple.

There are two classical ways to state that the action is primitive. The simplest states that it is transitive and that the stabilizers of points are maximal subgroups of $G$. Another is that there is no imprimitivity block: a nontrivial partition $(X_i)$ of $X$ such that for every $g\in G$ and every $i$, there exists $j$ such that $g\cdot X_i=X_j$. One can prove that a 2-transitive action (= transitive on ordered pairs of distinct elements) is primitive.

Iwasawa applies his criterion to $G=\mathrm{SL}(n,F)$ acting on the projective space $\mathbf P_{n-1}(F)$ of lines in $F^n$. Except when $n=2$ and $F$ has 2 or 3 elements, this action is 2-transitive, hence primitive. For a nonzero $x\in F^n$, one considers the group $T(x)$ of linear transformations of the form $y\mapsto y + \phi(y) x$ (transvections), for all linear forms $\phi$ on $F^n$ such that $\phi(x)=0$. They constitute a commutative subgroup of $\mathrm{SL}(n,F)$ (isomorphic, as a group, to $F^{n-1}$). The map $T$ gives rise to the data required in Iwasawa's theorem. Consequently, every normal subgroup $N$ of $\mathrm{SL}(n,F)$ that acts nontrivially on $\mathbf P_{n-1}(F)$ contains the commutator subgroup of $\mathrm{SL}(n,F)$, which is $\mathrm{SL}(n,F)$ itself. Explicitly, either $N$ consists of homotheties, or $N$ contains $\mathrm{SL}(n,F)$. This implies that $\mathrm{PSL}(n,F)$ is simple.
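The two key properties of the groups $T(x)$ — commutativity and determinant $1$ — amount to matrix identities that can be sanity-checked numerically. A small Python illustration (the prime $5$, the vector $x$ and the linear forms below are arbitrary choices of mine), with the transvection $y\mapsto y+\phi(y)x$ written as the matrix $I + x\,\phi^{\mathsf T}$:

```python
P = 5  # an arbitrary small prime; we work with 3x3 matrices over F_P

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(3)) % P for j in range(3)]
            for i in range(3)]

def det(A):
    return (A[0][0] * (A[1][1] * A[2][2] - A[1][2] * A[2][1])
          - A[0][1] * (A[1][0] * A[2][2] - A[1][2] * A[2][0])
          + A[0][2] * (A[1][0] * A[2][1] - A[1][1] * A[2][0])) % P

def transvection(x, phi):
    """Matrix of y -> y + phi(y) x, assuming phi(x) = 0 (mod P)."""
    assert sum(a * b for a, b in zip(x, phi)) % P == 0
    return [[(int(i == j) + x[i] * phi[j]) % P for j in range(3)]
            for i in range(3)]

x = [1, 0, 0]
t1 = transvection(x, [0, 1, 0])
t2 = transvection(x, [0, 2, 3])
```

The product of two transvections attached to the same $x$ is the transvection attached to the sum of the linear forms, which is exactly the commutativity statement.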

Applying the Iwasawa criterion for symmetric groups

One may wish to apply the Iwasawa criterion to the symmetric group. However, the conclusion is not as nice as what I had hoped initially.

Let $S_n$ act on the set of 2-element subsets of $X=\{1,\dots,n\}$. If $n\geq 5$, the action is primitive. This is not completely obvious, because this action is not 2-transitive (you cannot map $\{1,2\}$ and $\{1,3\}$ to $\{1,2\}$ and $\{3,4\}$)! Its primitivity means that stabilizers are maximal subgroups, and one can prove that the stabilizer of a 2-element subset is indeed maximal unless $n=4$. The action is also faithful (no nontrivial element acts trivially). To a pair $\{a,b\}$, associate the subgroup of order 2 generated by the transposition $(a\,b)$. It satisfies the criterion, and this shows that the nontrivial normal subgroups $N$ of $S_n$ contain $D(S_n)=A_n$. Since $A_n$ has index 2, this shows $N=A_n$ or $N=S_n$.

It is interesting to see how the criterion breaks for $n=4$. In fact, the action of $S_4$ on pairs is transitive, but not primitive: the stabilizer of $\{1,2\}$ is a subgroup of $S_4$ of order $4$, consisting of the identity, the two transpositions $(1\,2)$ and $(3\,4)$, and their product. Since $\mathrm{Card}(S_4)=24=2^3\cdot 3$, this subgroup is contained in a 2-Sylow subgroup, of order 8, so is not maximal.

However, and I find it unfortunate, this proof does not imply that the alternating group $A_n$ is simple for $n\geq 5$. To prove the simplicity of $A_n$ along these lines, we need to consider the following actions.

  • Let $A_n$ act on the set $X_3$ of 3-element subsets of $X$. For $x=\{a,b,c\}\in X_3$, consider the subgroup $T(x)$ of order 3 of $A_n$ consisting of the identity and the 3-cycles $(a\,b\,c)$ and $(a\,c\,b)$. (It is the alternating group on $x$, viewed as a subgroup of $A_n$.) The Iwasawa criterion applies provided the action is primitive, which presumably holds for $n>6$. Consequently, $A_n$ is simple for $n\geq 7$. However, if $n=6$, the stabilizer of $\{1,2,3\}$ in $A_6$ is not maximal, for a reason analogous to the one explained for $S_4$.
  • Let $A_n$ act on the set $X_4$ of 4-element subsets of $X$. For $x\in X_4$, we can consider the Klein subgroup $T(x)$ of the alternating group on $x$, viewed as a subgroup of $A_n$. I hope that the action is primitive, except for $n=8$, and this would prove that $A_n$ is simple for $n\geq 8$. This uses that double transpositions generate $A_n$.
  • One can improve the two preceding arguments a little bit. If $n=5$, we can have $A_5$ act on $X_2$, associating with $x$ the alternating group on the three remaining letters. Therefore, $A_5$ is simple (the action is primitive because the stabilizer of a pair is a subgroup of order $6$ of $A_5$, and one can check that it is a maximal subgroup). Similarly, if $n=6$, we have $A_6$ act on $X_2$, associating with $x\in X_2$ the Klein subgroup corresponding to the four remaining letters. Provided one can prove that the stabilizer of a pair in $A_6$ is a maximal subgroup, this gives a proof that $A_6$ is simple!

The primitivity assertions seem to hold. In fact, maximal subgroups of AnA_n and SnS_n are classified by the O'Nan–Scott theorem. Those we're concerned with have type (a) in the notation of (Liebeck, Praeger, Saxl, 1987. “A Classification of the Maximal Subgroups of the Finite Alternating and Symmetric Groups.” Journal of Algebra 111 (2): 365–83. DOI:10.1016/0021-8693(87)90223-7), and according to Theorem 1 of that paper, the relevant subgroups are indeed maximal.

Formalization: where am I now?

I have already formalized the Iwasawa criterion in Lean (there's an ongoing pull request for this), as well as the result that the normal subgroups of $S_n$ are $\{e\}$, $A_n$ and $S_n$ for $n\geq 5$.

It remains to formalize the rest of the proof; the main difficulty will be in proving primitivity, plus finding some nice way of defining the maps $T$ so that their properties are visible.

I also wish to formalize the simplicity of the special linear group along these lines, and it should be ready for an application to orthogonal, unitary or symplectic groups as well.

Thursday, April 22, 2021

Growth of the Gaussian pivoting algorithm

“Gaussian elimination” is the standard method for solving systems of linear equations. It runs by choosing one pivot variable in one of the equations and eliminating it from the other equations by a suitable linear combination. These other equations then have one variable fewer, and one may iterate the process, leading to a system in triangular form that can be solved in a relatively straightforward way. In the so-called Gauss–Jordan method, the pivot variables are eliminated from all of the other equations, and this leads to the row reduced echelon form of the initial system, a new system in which the pivot variables are explicitly solved in terms of the remaining variables; it also has the merit of being independent of the intermediate steps, but this would be a story for another night.

How the name of Gauss came to be attached to this method is also another story, recounted in great detail by J. Grcar. As he explains, systems of linear equations and their solution already appear on Babylonian tablets (2000 BC), on the Egyptian Rhind papyrus (1550 BC), in the Chinese Nine chapters on the mathematical art (200 AD), in the Arithmetica of the Greek mathematician Diophantus (250 AD), and in the Āryabhaṭīya of the Hindu mathematician Āryabhaṭa (499 AD). Such systems were essentially absent from mathematical treatises during the Renaissance, and it is to Newton (17th century) that we owe their revival in Western Europe. At the beginning of the 19th century, in relation to the problem in celestial mechanics/geodesy of fitting together multiple imprecise measurements, Gauss and Legendre invented the least squares method. This involved a system of linear equations which Gauss solved by what he called “common elimination”. In the 20th century, the development of computing machines and numerical analysis led to further work, from Cholesky (in relation to geodesy) and Crout, to von Neumann and Goldstine and their $LDU$ decomposition.

Whoever has had to perform elimination by hand knows that the computations rapidly become tedious and often lead to more and more complicated fractions.

When computer calculations are done with floating-point arithmetic, the difficulty of rounding errors appears. In a linear system, say $Ax=b$, the matrix $A$ and the vector $b$ are only known up to an error, so that the system that is actually solved is rather $(A+E)x=b+\delta b$, and it is obviously important to compare its solution with the solution of the initial system. One way to make this comparison involves the inverse of $A$, which is unknown at this stage. The product of the norms $\|A\| \,\|A^{-1}\|$ is the condition number of $A$, and one cannot avoid the problem of ill-conditioned matrices, which inherently lead to a loss of precision.

But when floating-point numbers are used in the computation, a new problem appears, even when one restricts to well-conditioned systems. Floating-point numbers are of the form $\pm a\cdot 10^e$ (in base 10, say), where $a$ is a real number between $1$ and $10$ which is known up to a fixed number of decimal places. In other words, floating-point numbers are known up to a relative error. Let us examine the consequence for the subtraction of floating-point numbers. Consider two such numbers, say $x=a\cdot 10^e$ and $x'=a'\cdot 10^e$, with the same exponent $e$, and such that $a$ and $a'$ are close. Their difference is given by $x'-x=(a'-a)\cdot 10^e$, but since $a$ and $a'$ are close, their difference $y=x'-x$ is no longer between $1$ and $10$; it may be, say, of size $10^{-5}$, so that the floating-point expression of $x'-x$ is $(10^5(a'-a))\cdot 10^{e-5}=b\cdot 10^{e-5}$; the problem is that the last 5 decimals of $b$ are absolutely unknown, which leads to a relative error for $y$ which is $10^5$ times as big as expected!
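This loss of significance is easy to exhibit in any language with floating-point arithmetic. A minimal Python illustration (in binary double precision rather than the decimal system used above, but the phenomenon is the same; the numbers are my own choices): adding a tiny $dx$ to $1.0$ and then subtracting $1.0$ should give back $dx$, but most of the digits of $dx$ are destroyed.

```python
x, dx = 1.0, 1e-13
s = (x + dx) - x            # mathematically this is dx...
rel_err = abs(s - dx) / dx  # ...but the computed value has a large relative error
```

Here the rounding of $1.0+dx$ keeps only a few significant digits of $dx$, so `rel_err` is of order $10^{-3}$, enormous compared with the $\approx 10^{-16}$ relative error of a single double-precision operation.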

Now, by its very essence, the elimination method features a lot of such subtractions, hence it is inherently not very well suited to floating-point numbers.

In the 1960s, the British mathematician Wilkinson analysed the situation very precisely. He showed that for Gaussian elimination, the main parameter is the relative size of the matrices that intervene in the process. To set up some notation, imagine that the initial system $Ax=b$ is transformed, step by step, into a series of equivalent systems $A^{(r)}x=b^{(r)}$, where $A^{(r)}=(a_{ij}^{(r)})$ is a matrix whose first $r$ rows are in triangular form, and whose remaining $n-r$ rows still need to be worked on. To get the matrix $A^{(r+1)}$, one subtracts a multiple $m_{ir}$ of the $r$th row from the $i$th row, for $i$ ranging from $r+1$ to $n$, where the multipliers $m_{ir}$ are defined by
\[ m_{ir}=a_{ir}^{(r)}/a_{rr}^{(r)}. \]
Large multipliers lead to a lack of precision, but if the complete pivoting method is used, one has $|m_{ir}|\leq 1$. In this case, one observes that a bound $|a_{ij}^{(r)}|\leq M$ for the coefficients of $A^{(r)}$ leads to the bound $|a_{ij}^{(r+1)}|\leq 2M$ at the next step. At the last step, one gets $|a_{ij}^{(n)}|\leq 2^{n-1}M$, where $M=\sup(|a_{ij}|)$. Consequently, the relevant constant to be estimated,
\[ R(A) = \sup_r \frac{\sup_{i,j}\bigl|a_{ij}^{(r)}\bigr|}{\sup_{i,j}|a_{ij}|}, \]
satisfies $R(A)\leq 2^{n-1}$.
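The growth factor $R(A)$ is easy to compute experimentally. The following Python sketch (my own helper, not code from Wilkinson) performs elimination with complete pivoting and records the largest coefficient at each step; on random matrices the bound $R(A)\leq 2^{n-1}$ is very far from attained.

```python
import random

def growth_factor(A):
    """R(A): largest coefficient over all elimination steps, relative to the initial one.

    Uses complete pivoting, so every multiplier satisfies |m| <= 1."""
    n = len(A)
    A = [row[:] for row in A]
    m0 = max(abs(e) for row in A for e in row)
    big = m0
    for r in range(n - 1):
        # complete pivoting: bring the largest remaining entry to position (r, r)
        i0, j0 = max(((i, j) for i in range(r, n) for j in range(r, n)),
                     key=lambda ij: abs(A[ij[0]][ij[1]]))
        A[r], A[i0] = A[i0], A[r]
        for row in A:
            row[r], row[j0] = row[j0], row[r]
        if A[r][r] == 0:
            break  # the remaining block is zero
        for i in range(r + 1, n):
            m = A[i][r] / A[r][r]
            A[i] = [A[i][j] - m * A[r][j] for j in range(n)]
        big = max(big, max(abs(e) for row in A for e in row))
    return big / m0

random.seed(1)
samples = [growth_factor([[random.uniform(-1, 1) for _ in range(6)]
                          for _ in range(6)]) for _ in range(10)]
```

For these random $6\times 6$ matrices, the growth stays far below the theoretical bound $2^{5}=32$, in line with the observations of Trefethen & Schreiber cited below.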

In fact, Wilkinson (1961) gave a much better bound. Let $B^{(r)}$ be the square matrix of size $n-r$ that remains to be worked on after the $r$th step, and let $b_{r+1}$ be the maximum size of its coefficients — which is the size of the chosen pivot, since one does complete pivoting. One has the following formula for its determinant:
\[ \det(B^{(r)})= b_{r+1}\cdots b_n. \]
Moreover, the Euclidean norms of the columns of $B^{(r)}$ are bounded above by $\sqrt{n-r}\,b_{r+1}$, and the Hadamard inequality (“the volume of a parallelepiped is smaller than the product of the lengths of its edges”) implies that
\[ \bigl| \det(B^{(r)})\bigr| \leq (n-r)^{(n-r)/2}\, b_{r+1}^{n-r}. \]
Together, these relations lead to an upper bound
\[ R(A) \leq n^{1/2} \left( 2\cdot 3^{1/2}\cdot 4^{1/3}\cdots n^{1/(n-1)}\right)^{1/2}, \]
which is roughly $n^{\log(n)/4}$, a bound that Wilkinson considered to be a “severe overestimate”; in Wilkinson (Rounding Errors in Algebraic Processes, Dover, 1963, bottom of p. 97), he even notes that “No matrix has yet been discovered for which $R(A)>n$.”

This statement remained known as Wilkinson's conjecture, although Wilkinson himself did not state it as such. Tornheim (1964) proved that Hadamard matrices — matrices with entries $\pm 1$ and pairwise orthogonal columns and rows — satisfy $R(A)\geq n$, and this led Cryer (1968) to formally state the conjecture and to suspect that $R(A)<n$ unless $A$ is a Hadamard matrix. Matrices have been exhibited such that $R(A)\geq (n+1)/2$ (Higham & Higham, 1989), and random matrices seem to have an $R$-factor roughly $n^{1/2}$ (Trefethen & Schreiber, 1990). In fact, Hadamard matrices of size $n=12$ (Edelman & Mascarenhas, 1995) or $n=16$ (Kravvaritis & Mitrouli, 2009) satisfy $R(A)=n$, but this is definitely nontrivial.
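For the small Sylvester-type Hadamard matrix of size 4, the growth $R(A)=n$ can already be observed directly. A self-contained Python check (helper names mine; the arithmetic below is exact, since all multipliers turn out to be $0$ or $\pm 1$):

```python
def growth_factor(A):
    """Growth factor of Gaussian elimination with complete pivoting."""
    n = len(A)
    A = [row[:] for row in A]
    m0 = max(abs(e) for row in A for e in row)
    big = m0
    for r in range(n - 1):
        i0, j0 = max(((i, j) for i in range(r, n) for j in range(r, n)),
                     key=lambda ij: abs(A[ij[0]][ij[1]]))
        A[r], A[i0] = A[i0], A[r]
        for row in A:
            row[r], row[j0] = row[j0], row[r]
        for i in range(r + 1, n):
            m = A[i][r] / A[r][r]
            A[i] = [A[i][j] - m * A[r][j] for j in range(n)]
        big = max(big, max(abs(e) for row in A for e in row))
    return big / m0

# Sylvester's doubling construction of Hadamard matrices: H_{2n} = [[H, H], [H, -H]]
H = [[1]]
while len(H) < 4:
    H = [row + row for row in H] + [row + [-e for e in row] for row in H]
```

The successive pivots are $1, 2, 2, 4$, so the growth factor equals $4$, matching Tornheim's lower bound $R(A)\geq n$ with equality.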

However, Gould (1991) could exhibit a matrix $A$ of size $n=13$ with growth factor $R(A)=13.0205$, thus providing a counterexample to Wilkinson's “conjecture”. To that aim, he reduced the question to finding a solution of a nonlinear programming problem with roughly $n^3/3\approx 700$ variables, fortunately a sparse one, a computation he did using a programming package he had developed with Conn and Toint. The matrix he gives has integer coefficients with up to 20 digits!

But the story does not end here!

Edelman tried to replicate Gould's computations using the computer algebra systems Mathematica and Maple — and what he found is that the growth factor of Gould's matrix is around $7.355$, consistently in both systems! As he writes, one of these programs could have had a bug, but it is rather unlikely that both had the same one.

What happened, and you will agree that there is much irony in this story, is that Gould had performed his computations using floating-point arithmetic. A near tie in the 6th pivot led to an incorrect choice of pivot and to an erroneous computation, even within the double precision that he had used.

Fortunately, Edelman (1992) showed that changing one coefficient of Gould's matrix by $10^{-7}$ yields a growth factor of $13.02$, so that Wilkinson's “conjecture” is definitely incorrect.

Friday, April 2, 2021

On the Hadamard-Lévy theorem, or is it Banach-Mazur?

During the preparation of an agrégation lecture on connectedness, I came across the following theorem, attributed to Hadamard–Lévy: 

Theorem. — Let $f\colon \mathbf R^n\to\mathbf R^n$ be a $\mathscr C^1$-map which is proper and a local diffeomorphism. Then $f$ is a global diffeomorphism.

In this context, that $f$ is proper means that $\| f(x)\| \to+\infty$ when $\| x\|\to+\infty$, while, by the inverse function theorem, the condition that $f$ is a local diffeomorphism is equivalent to the property that its differential $f'(x)$ is invertible for every $x\in\mathbf R^n$. The conclusion is that $f$ is a diffeomorphism from $\mathbf R^n$ to itself; in particular, $f$ is bijective and its inverse is continuous.

This theorem is stated in this form neither by Hadamard (1906) nor by Lévy (1920); it is essentially due to Banach & Mazur (1934), and it is the purpose of this note to clarify the history and explain a few proofs, as well as more recent consequences for partial differential equations.

A proper map is closed: the image $f(A)$ of a closed subset $A$ of $\mathbf R^n$ is closed in $\mathbf R^n$. Indeed, let $(a_m)$ be a sequence in $A$ whose image $(f(a_m))$ converges in $\mathbf R^n$ to an element $b$; let us show that there exists $a\in A$ such that $b=f(a)$. The properness assumption on $f$ implies that $(a_m)$ is bounded. Consequently, it has a limit point $a$, and $a\in A$ because $A$ is closed. Necessarily, $f(a)$ is a limit point of the sequence $(f(a_m))$, hence $b=f(a)$.

In this respect, let us note the following reinforcement of the previous theorem, due to Browder (1954):
Theorem (Browder). — Let $f\colon \mathbf R^n\to\mathbf R^n$ be a local homeomorphism. If $f$ is closed, then $f$ is a global homeomorphism.

A surprising aspect of these results and their descendants is that they are based on two really different ideas. The proofs of Banach & Mazur and of Browder rely on the notion of covering, with ideas from homotopy theory and, ultimately, the fact that $\mathbf R^n$ is simply connected. On the other hand, the motivation of Hadamard was to generalize to dimension $n$ the following elementary discussion in the one-dimensional case: let $f\colon\mathbf R\to\mathbf R$ be a $\mathscr C^1$-function whose derivative is $>0$ everywhere (so that $f$ is strictly increasing); give a condition for $f$ to be surjective. In this case, the condition is easy to find: the indefinite integral $\int f'(x)\,dx$ has to be divergent both at $-\infty$ and at $+\infty$. In the $n$-dimensional case, the theorem of Hadamard is the following:

Theorem.Let f ⁣:RnRnf\colon\mathbf R^n\to\mathbf R^n be a C1\mathscr C^1-map. For rR+r\in\mathbf R_+, let ω(r)\omega(r) be the infimum, for xRnx\in\mathbf R^n such that x=r\|x\|=r, of the norm of the linear map f(x)1f'(x)^{-1}; if 0dr/ω(r)=+\int_0^\infty dr/\omega(r)=+\infty, then ff is a global diffeomorphism.

In Hadamard's paper, the quantity $\omega(r)$ is described geometrically as the minor axis of the ellipsoid defined by $f'(x)$, and Hadamard insists that using only the volume of this ellipsoid, essentially given by the determinant of $f'(x)$, would not suffice to characterize global diffeomorphisms. (Examples are furnished by maps of the form $f(x_1,x_2)=(f_1(x_1),f_2(x_2))$. The determinant condition considers $f_1'(x_1)f_2'(x_2)$, while one needs individual conditions on $f'_1(x_1)$ and $f'_2(x_2)$.)

In fact, as explained in Plastock (1974), both versions (closedness hypothesis or quantitative assumptions on the differential) imply that the map $f$ is a topological covering of $\mathbf R^n$. Since the target $\mathbf R^n$ is simply connected and the source $\mathbf R^n$ is connected, $f$ has to be a homeomorphism. I will explain this proof below, but I would first like to present another, quite interesting one, proposed by Zuily & Queffélec (1995).

A dynamical system approach

The goal is to prove that $f$ is bijective and, to that aim, we will prove that every preimage set $f^{-1}(b)$ is reduced to one element. Replacing $f$ by $f-b$, it suffices to treat the case $b=0$. In other words, we wish to show that the equation $f(x)=0$ has exactly one solution. For that, it is natural to start from some point $\xi\in\mathbf R^n$ and to force $\|f\|$ to decrease along a path. This can be done by following the flow of the vector field given by $v(x)=-f'(x)^{-1}(f(x))$. This is a vector field on $\mathbf R^n$ and we can consider its flow: a map $\Phi$ defined on an open subset of $\mathbf R\times\mathbf R^n$ such that $\partial_t \Phi(t,x)=v(\Phi(t,x))$ for all $(t,x)$ and $\Phi(0,x)=x$ for all $x$. In fact, the Cauchy–Lipschitz theorem guarantees the existence of such a flow only if the vector field $v$ is locally Lipschitz, which happens if, for example, $f$ is assumed to be $\mathscr C^2$. In this case, there is even uniqueness of a maximal flow, and we will make this assumption, for safety. (In fact, the paper of De Marco, Gorni & Zampieri (1994) constructs the flow directly, thanks to the hypothesis that the vector field is pulled back from the Euler vector field on $\mathbf R^n$.)

What are we doing here? Note that on $\mathbf R^n$, the opposite of the Euler vector field, defined by $u(y)=-y$, has a very simple flow: the flow lines are straight lines going to $0$. The formula above just pulls back this vector field $u$ via the local diffeomorphism $f$, and the flow lines of the vector field $v$ are the pull-backs by $f$ of these straight lines, which explains the behaviour described below.

In particular, let $a\in\mathbf R^n$ be such that $f(a)=0$ and let $U$ be a neighborhood of $a$ such that $f$ induces a diffeomorphism from $U$ to a ball around $0$. Pulling back the flow of the minus-Euler vector field by $f$, we see that once a flow line enters the open set $U$, it converges to $a$. The goal is now to prove that it will indeed enter such a neighborhood (and, in particular, that such a point $a$ exists).

We consider a flow line starting from a point $x$, that is, $\phi(t)=\Phi(t,x)$ for all times $t$. Let $g(t)=f(\phi(t))$; observe that $g$ satisfies $g'(t)=f'(\phi(t))(\phi'(t))=-g(t)$, hence $g(t)=g(0)e^{-t}$. Assume that the flow line is defined on $[0;t_1\mathopen[$, with $t_1<+\infty$. By what precedes, $g$ is bounded in the neighborhood of $t_1$; since $f$ is assumed to be proper, this implies that $\phi(t)$ is bounded as well. The vector field $v$ is bounded on bounded sets, so that $\phi'$ is bounded, hence $\phi$ is uniformly continuous and has a limit at $t_1$. We may then extend the flow line a bit beyond $t_1$. As a consequence, the flow line is defined for all times, and $g(t)\to0$ when $t\to+\infty$. By the same properness argument, this implies that $\phi(t)$ is bounded when $t\to+\infty$, hence it has limit points $a$, which satisfy $f(a)=0$. Once $\phi$ enters an appropriate neighborhood of such a point, we have seen that the flow line automatically converges to some point $a\in f^{-1}(0)$.
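As a sanity check, the convergence of these flow lines can be observed numerically. Here is a minimal sketch (not from Zuily & Queffélec), with a hypothetical proper local diffeomorphism $f(x,y)=(x+x^3,\ y+y^3+x)$ and crude Euler steps:

```python
def f(x, y):
    # a proper local diffeomorphism of R^2 whose unique zero is (0, 0)
    return (x + x**3, y + y**3 + x)

def newton_flow_step(x, y, dt):
    # one Euler step along v(x) = -f'(x)^{-1} f(x); the Jacobian
    # [[1+3x^2, 0], [1, 1+3y^2]] is lower triangular, so the linear
    # system is solved by forward substitution.
    f1, f2 = f(x, y)
    w1 = f1 / (1 + 3*x*x)
    w2 = (f2 - w1) / (1 + 3*y*y)
    return x - dt*w1, y - dt*w2

x, y = 2.0, -1.5          # arbitrary starting point
for _ in range(4000):     # integrate the flow up to t = 40
    x, y = newton_flow_step(x, y, 0.01)

# along the flow, f(phi(t)) = f(phi(0)) e^{-t}, so the flow line
# converges to the unique zero of f, here (0, 0)
assert abs(x) < 1e-6 and abs(y) < 1e-6
```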

Let us now consider the map $\lambda\colon\mathbf R^n\to f^{-1}(0)$ that associates with a point $\xi$ the limit of the flow line $t\mapsto \Phi(t,\xi)$ starting from the initial condition $\xi$. By continuity of the flow of a vector field with respect to the initial condition, the map $\lambda$ is continuous. On the other hand, the hypothesis that $f$ is a local diffeomorphism implies that $f^{-1}(0)$ is a closed discrete subset of $\mathbf R^n$. Since $\mathbf R^n$ is connected, the map $\lambda$ is constant. Since one has $\lambda(\xi)=\xi$ for every $\xi\in f^{-1}(0)$, this establishes that $f^{-1}(0)$ is reduced to one element, as claimed.

Once $f$ is shown to be bijective, the fact that it is proper (closed would suffice) implies that its inverse bijection $f^{-1}$ is continuous. This concludes the proof.

The theorem of Banach and Mazur

The paper of Banach and Mazur is written in a bigger generality. They consider multivalued continuous maps $F\colon X\to Y$ (k-deutige stetige Abbildungen) by which they mean that for every $x$, a subset $F(x)$ of $Y$ is given, of cardinality $k$, the continuity being expressed by sequences: if $x_n\to x$, one can order, for every $n$, the elements of $F(x_n)=\{y_{n,1},\dots,y_{n,k}\}$, as well as the elements of $F(x)=\{y_1,\dots,y_k\}$, in such a way that $y_{n,j}\to y_j$ for all $j$. (In their framework, $X$ and $Y$ are metric spaces, but one could transpose their definition to topological spaces if needed.) They say that such a map is decomposed (zerfällt) if there are continuous functions $f_1,\dots,f_k$ from $X$ to $Y$ such that $F(x)=\{f_1(x),\dots,f_k(x)\}$ for all $x\in X$.

In essence, the definition that Banach and Mazur are proposing contains the finite coverings as a particular case. Namely, if $p\colon Y\to X$ is a finite covering of degree $k$, then the map $x\mapsto p^{-1}(x)$ is a continuous $k$-valued map from $X$ to $Y$. Conversely, let us consider the graph $Z$ of $F$, namely the set of all points $(x,y)\in X\times Y$ such that $y\in F(x)$. Then the first projection $p\colon Z\to X$ has fibers of cardinality $k$, but it is not clear that it is a covering map, that is, that it has local sections.

It would however not be so surprising to 21st-century mathematicians that if one makes a suitable assumption of simple connectedness on $X$, then every such $F$ should be decomposed. Banach and Mazur assume that $X$ satisfies two properties:

  1. The space $X$ is semilocally arcwise connected: for every point $x\in X$ and every neighborhood $U$ of $x$, there exists an open neighborhood $U'$ contained in $U$ such that for every point $x'\in U'$, there exists a path $c\colon[0;1]\to U$ such that $c(0)=x$ and $c(1)=x'$. (Semilocally means that the path is not necessarily in $U'$ but in $U$.)
  2. The space $X$ is arcwise simply connected: two paths $c_0,c_1\colon[0;1]\to X$ with the same endpoints ($c_0(0)=c_1(0)$ and $c_0(1)=c_1(1)$) are strictly homotopic — there exists a continuous map $h\colon[0;1]\times[0;1]\to X$ such that $h(0,t)=c_0(t)$ and $h(1,t)=c_1(t)$ for all $t$, and $h(s,0)=c_0(0)$ and $h(s,1)=c_0(1)$ for all $s$.

Consider a $k$-valued continuous map $F$ from $X$ to $Y$, where $X$ is connected. Banach and Mazur first prove that for every path $c\colon [0;1]\to X$ and every point $y_0\in F(c(0))$, there exists a continuous function $f\colon[0;1]\to Y$ such that $f(0)=y_0$ and $f(t)\in F(c(t))$ for all $t$. To that aim, they consider disjoint neighborhoods $V_1,\dots,V_k$ of the elements of $F(c(0))$, with $y_0\in V_1$, say, and observe that for $t$ small enough, there is a unique element in $F(c(t))\cap V_1$. This defines a bit of the lift $f$, and one can go on. Now, given two paths $c,c'$ such that $c(0)=c'(0)$ and $c(1)=c'(1)$, and two lifts $f,f'$ as above with $f(0)=f'(0)$, they consider a homotopy $h\colon[0;1]\times[0;1]\to X$ linking $c$ to $c'$. Subdividing this square in small enough subsquares, one sees by induction that $f(1)=f'(1)$. (This is analogous to the proof that a topological covering of the square is trivial.) Fixing a point $x_0\in X$ and a point $y_0\in F(x_0)$, one gets in this way a well-defined map from $X$ to $Y$ that sends $x$ to $f(1)$, for any path $c\colon[0;1]\to X$ such that $c(0)=x_0$ and $c(1)=x$, and any continuous map $f\colon [0;1]\to Y$ such that $f(t)\in F(c(t))$ for all $t$ and $f(0)=y_0$; one then proves that this map is continuous. Considering all such maps, for all points of $F(x_0)$, one obtains the decomposition of the multivalued map $F$.
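The continuation step (“there is a unique element in $F(c(t))\cap V_1$, and one can go on”) can be illustrated on the two-valued map $x\mapsto\{\pm\sqrt x\}$ attached to $z\mapsto z^2$ on $\mathbf C\setminus\{0\}$. The discretized lifting below is my own sketch; it also shows why some simple connectedness is needed: along a loop around $0$, the lift ends at the other point of the fiber.

```python
import cmath

def lift_sqrt(path_points, y0):
    """Lift a discretized path in C minus the origin through z -> z^2 by a
    continuous choice of square roots: at each step, pick the root closest
    to the previous value of the lift (the 'unique element in V_1')."""
    assert abs(y0**2 - path_points[0]) < 1e-9
    lift = [y0]
    for w in path_points[1:]:
        r = cmath.sqrt(w)
        # F(w) = {r, -r}; choose the branch nearest the current lift point
        lift.append(r if abs(r - lift[-1]) <= abs(-r - lift[-1]) else -r)
    return lift

# a full loop around 0: c(t) = exp(2*pi*i*t), lifted starting from 1
N = 1000
loop = [cmath.exp(2j * cmath.pi * k / N) for k in range(N + 1)]
lift = lift_sqrt(loop, 1 + 0j)

# the loop is not null-homotopic in the punctured plane:
# the lift is t -> exp(pi*i*t), which ends at -1, not at 1
assert abs(lift[-1] - (-1)) < 1e-6
```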

To prove their version of the Hadamard–Lévy theorem, Banach and Mazur observe that if $f\colon Y\to X$ is a local homeomorphism which is proper, then setting $F(x)=f^{-1}(x)$ gives a multivalued continuous map. It is not obvious that the cardinalities $k(x)$ of the sets $F(x)$ are constant, but this follows (if $X$ is connected) from the fact that $f$ is both a local homeomorphism and proper. Then $F$ is decomposed, so that there exist continuous maps $g_1,\dots,g_k\colon X\to Y$ such that $f^{-1}(x)=\{g_1(x),\dots,g_k(x)\}$ for all $x\in X$. This implies that $Y$ is the disjoint union of the $k$ connected subsets $g_j(X)$. If $Y$ is connected, then $k=1$ and $f$ is a homeomorphism.

The versions of Hadamard and Lévy, after Plastock

Hadamard considered the finite dimensional case, and Lévy extended it to the case of Hilbert spaces.

Plastock considers a Banach-space version of the theorem above: $f\colon E\to F$ is a $\mathscr C^1$-map between Banach spaces with invertible differentials and such that, setting $\omega(r)=\inf_{\|x\| = r}\|f'(x)^{-1}\|^{-1}$, one has $\int_0^\infty \omega(r)\,dr=+\infty$. Of course, under these hypotheses, the Banach spaces $E$ and $F$ are isomorphic, but it may be useful that they are not identical. Note that $f(E)$ is open in $F$, and the proposition that will ensure that $f$ is a global diffeomorphism is the following one, in the spirit of covering theory.

Proposition.(Assuming that ff is a local diffeomorphism.) It suffices to prove that the map ff satisfies the path lifting property: for every point xEx\in E and every C1\mathscr C^1 map c ⁣:[0;1]f(E)c\colon[0;1]\to f(E) such that c(0)=f(x)c(0)=f(x), there exists a C1\mathscr C^1 map d ⁣:[0;1]Ed\colon[0;1]\to E such that c(t)=f(d(t))c(t)=f(d(t)) for all tt and d(0)=cd(0)=c.

The goal is now to prove that $f$ satisfies this path lifting property. Using that $f$ is a local homeomorphism, one sees that lifts are unique, and are defined on a maximal subinterval of $[0;1]$ which is either $[0;1]$ itself, or of the form $[0;s\mathclose[$. To prevent the latter case, one needs to impose conditions on the norm $\| f'(x)^{-1}\|$, such as the one phrased in terms of $\omega(r)$ in the Hadamard–Lévy theorem. In fact, Plastock starts with a simpler case.

Proposition.The path lifting property follows from the following additional hypotheses:

  1. One has $\|f(x)\|\to+\infty$ when $\|x\|\to+\infty$;
  2. There exists a positive continuous function $M\colon\mathbf R_+\to\mathbf R_+$ such that $\|f'(x)^{-1}\|\leq M(\|x\|)$ for all $x$.

Assume indeed that a path $c$ has a maximal lift $d$, defined over the interval $[0;s\mathclose[$. By hypothesis 1, $d(t)$ remains bounded when $t\to s$, because $c(t)=f(d(t))$ tends to $c(s)$. Differentiating the relation $c(t)=f(d(t))$, one gets $c'(t)=f'(d(t))(d'(t))$, hence $d'(t)=f'(d(t))^{-1}(c'(t))$, so that $\| d'(t)\|\leq M(\|d(t)\|)\, \|c'(t)\|$ by hypothesis 2. This implies that $\|d'\|$ is bounded, so that $d$ is uniformly continuous, hence has a limit at $s$. Then the path $d$ can be extended by setting $d(s)$ to this limit and using the local diffeomorphism property to go beyond $s$.

The Hadamard–Lévy theorem is related to completeness of some length-spaces. We shall modify the distance of the Banach space $E$ as follows: if $c\colon[0;1]\to E$ is a path in $E$, then its length is defined by $$\ell(c) = \int_0^1 \| f'(c(t))^{-1}\|^{-1}\, \|c'(t)\|\, dt.$$ Observe that $\|f'(c(t))^{-1}\|^{-1} \geq \omega(\|c(t)\|)$, so that $$\ell(c) \geq \int_0^1 \omega(\|c(t)\|)\, \|c'(t)\|\, dt.$$ The modified distance of two points in $E$ is then defined as the infimum of the lengths of all paths joining them.

Lemma.With respect to the modified distance, the space EE is complete.

One proves that $\ell(c) \geq \int_{\|c(0)\|}^{\|c(1)\|}\omega(r)\,dr$. Since $\int_0^\infty \omega(r)\,dr=+\infty$, this implies that Cauchy sequences for the modified distance are bounded in $E$ for the original norm. On the other hand, on any bounded subset of $E$, the Banach norm and the modified distance are equivalent, so that they have the same Cauchy sequences.
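For completeness, here is how the first inequality follows from the chain rule; the map $t\mapsto\|c(t)\|$ is Lipschitz, with $|\frac{d}{dt}\|c(t)\||\leq\|c'(t)\|$ almost everywhere, and the computation reads (written for $\|c(0)\|\leq\|c(1)\|$):

```latex
\ell(c)
 \;\geq\; \int_0^1 \omega(\|c(t)\|)\,\|c'(t)\|\,dt
 \;\geq\; \int_0^1 \omega(\|c(t)\|)\,\frac{d}{dt}\|c(t)\|\,dt
 \;=\; \int_{\|c(0)\|}^{\|c(1)\|} \omega(r)\,dr,
```

the last step being the change of variables $r=\|c(t)\|$.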

Other conditions can be derived from Plastock's general theorem. For example, assuming that $E$ and $F$ are a Hilbert space $H$, he shows that it suffices to assume the existence of a decreasing function $\lambda\colon\mathbf R_+\to\mathbf R_+$ such that $\langle f'(x)(u),u\rangle \geq \lambda(\|x\|)\, \| u\|^2$ for all $x,u$, with $\int_0^\infty \lambda(r)\,dr=+\infty$. Indeed, under this assumption, one may take $\omega(r)=\lambda(r)$.

Application to periodic solutions of differential equations

Spectral theory can be seen as the infinite dimensional generalization of classical linear algebra. Linear differential operators and linear partial differential operators furnish prominent examples of such operators. The theorems of Hadamard–Lévy type have been applied to solve nonlinear differential equations.

I just give an example here, to give an idea of how this works, and also because I am too lazy to check all the details.

Following Brown & Lin (1979), we consider the Newtonian equation of motion $$u''(t) + \nabla G (u(t)) = p(t)$$ where $G$ represents the ambient potential, assumed to be smooth enough, and $p\colon \mathbf R\to\mathbf R^n$ is some external control. The problem studied by Brown and Lin is to prove the existence of periodic solutions when $p$ is itself periodic. The method consists in interpreting the left hand side as a nonlinear map defined on the Sobolev space $E$ of $2\pi$-periodic $\mathscr C^1$-functions with a second derivative in $F=L^2([0;2\pi];\mathbf R^n)$, with values in $F$. Write $L$ for the linear operator $u\mapsto u''$ and $N$ for the (nonlinear) operator $u\mapsto \nabla G(u)$. Then $L$ is linear continuous (hence $L'(u)(v)=L(v)$), and $N$ is continuously differentiable, with differential given by $$N'(u) (v) = \left( t \mapsto Q (u(t)) (v(t)) \right)$$ for $u,v\in E$, where $Q$ is the Hessian of $G$.

In other words, the differential $(L+N)'(u)$ is the linear map $v\mapsto L(v) + Q(u(t))v$. It is invertible if the eigenvalues of $Q(u(t))$ stay away from squares of integers. Concretely, Brown and Lin assume that there are two constant symmetric matrices $A$ and $B$ such that $A\leq Q(x) \leq B$ for all $x$, and whose eigenvalues $\lambda_1\leq \dots\leq\lambda_n$ and $\mu_1\leq\dots\leq \mu_n$ are such that there are integers $N_1,\dots,N_n$ with $N_k^2<\lambda_k\leq\mu_k<(N_k+1)^2$ for all $k$. Using spectral theory in Hilbert spaces, these conditions imply that the linear operator $L+Q(u)\colon E\to F$ is an isomorphism, and that $\|(L+Q(u))^{-1}\|$ is bounded from above by the constant $$c= \sup_{1\leq k\leq n} \max\left( (\lambda_k-N_k^2)^{-1},\ ((N_k+1)^2-\mu_k)^{-1} \right).$$
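In the scalar case ($n=1$), this resolvent bound is easy to see on Fourier modes: $L$ acts on $e^{ikt}$ as multiplication by $-k^2$, so for a constant $q$ with $N^2<q<(N+1)^2$ the inverse $(L+q)^{-1}$ has norm $\sup_k |q-k^2|^{-1}$. A quick numerical sketch of mine (the value $q=6.5$ is of course just an illustration):

```python
# Scalar toy version of the resolvent bound: on the Fourier mode e^{ikt},
# the operator u -> u'' + q*u acts as multiplication by (q - k^2), so its
# inverse has norm sup_k 1/|q - k^2|, finite when q avoids perfect squares.
def resolvent_bound(q, kmax=1000):
    return max(1 / abs(q - k*k) for k in range(kmax + 1))

# with N^2 = 4 < q = 6.5 < 9 = (N+1)^2, the sup is attained at k = 2 or k = 3,
# and equals the constant c of the text (here max(1/2.5, 1/2.5) = 0.4)
q = 6.5
c = max(1 / (q - 4), 1 / (9 - q))
assert abs(resolvent_bound(q) - c) < 1e-12
```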

Thanks to this differential estimate, the theorem of Hadamard–Lévy implies that the nonlinear differential operator $L+N$ is a global diffeomorphism from $E$ to $F$. In particular, there is a unique $2\pi$-periodic solution for every $2\pi$-periodic control function $p$.

I thank Thomas Richard for his comments.

Saturday, March 27, 2021

The Fermat–Lévy constant

Undergraduate analysis is full of computations of strange trigonometric series, and I posted on Twitter, as a challenge, to compute the following constant: $$c = \sum_{p=3}^\infty \frac1{p^2} \int_0^{2\pi} \left( \sum_{n=1}^\infty \frac{\cos(n^p x)}{n^2}\right)^3\, dx.$$ As some followers quickly remarked, there could be some irony with this exercise, because quite unexpectedly, they could solve it using the solution to Fermat's Last Theorem. So I'll start with the origin of this problem.

(Light) philosophy of mathematics

This constant came as a screen capture from a 1950 paper by Paul Lévy, Axiome de Zermelo et nombres transfinis (Zermelo's axiom and transfinite numbers), where he discusses various reasons to accept, or refuse, Zermelo's principle according to which every set can be endowed with a well-ordering.

Zermelo had proved this principle in 1905, and we know today that it is equivalent to the “axiom of choice”. In fact, given a set $S$, Zermelo chooses once and for all, for every nonempty subset $A$ of $S$, some element $f(A)$ of $A$, and it seems that it is precisely this simultaneous infinity of choices that led his contemporaries to realize the necessity of the axiom of choice as an axiom — some of them objecting, for philosophical reasons, to the very possibility of making such choices.

We know today that Zermelo's theorem is equivalent to the axiom of choice, that this axiom is essentially innocuous (Gödel, 1938: if set theory is consistent without the axiom of choice, then it remains consistent with it), but also that it is unprovable from the other axioms of set theory, say those (ZF) proposed by Zermelo and Fraenkel (Cohen, 1963: if set theory without the axiom of choice is consistent, then so is set theory with the negation of the axiom of choice).

We also know how this axiom, whose usefulness is recognized in the proofs of many theorems of algebra (existence and uniqueness of an algebraic closure, of prime or maximal ideals, of bases for vector spaces, etc.), also leads to paradoxical constructions (such as nonmeasurable sets, or the Banach–Tarski paradoxical decomposition of the unit sphere into finitely many parts that can be recombined into two unit spheres). For many modern mathematicians, the question is now that of a risk–benefit balance — using this axiom on safe grounds, say. However, for mathematicians such as Lebesgue, Borel or Brouwer, who were inspired by a constructive or intuitionistic philosophy of mathematics, this axiom had to be rejected, because they objected to the idea of using objects that cannot be named.

In his 1950 paper, Paul Lévy discusses these objections and explains why he rejects them. As he recognizes, the nature of language only allows for a countable amount of mathematical formulas, hence we can only name a countable amount of mathematical objects, and we know — this is a 1874 theorem of Cantor — that the continuum, for example, is not countable. There must exist, says Paul Lévy, numbers that cannot be named, « des nombres innommables ». I should add that Loren Graham and Jean-Michel Kantor wrote a book on that topic, Naming Infinity: A True Story of Religious Mysticism and Mathematical Creativity, which contains a whole chapter on what they call “the French trio: Borel, Lebesgue, Baire”.

At this point, Lévy introduces the constant whose definition starts this note. Let us now read what Lévy tells us (my translation):

Do mathematical objects exist in themselves, or only as products of the human mind and of our human logic? The point of view of Mr Brouwer regarding the truth of the statements of theorems forces us to ask the question.

Let us consider a precise statement, to fix ideas, that of Fermat's theorem. For us, it is true or false, and the sum $$c = \sum_{p=3}^\infty \int_0^{2\pi} \left( \sum_1^\infty \frac{\cos (n^p x)}{n^2}\right)^3\, dx$$ is zero in the first case and positive in the second. If it is false, one can check that it is false (although we do not know in advance how much time is needed). If it is true, an infinite number of checks would be necessary to assure it, and nothing guarantees a priori that this infinite number can be replaced by a finite reasoning. So there are three possible hypotheses (for some theorems, there would be four): the theorem is false; it is true and provable; it is true and unprovable.

In this latter case, Mr Brouwer refuses to consider this theorem as true, and the constant $c$ is neither zero nor positive (nor negative, of course).

Let us detail his argument a little bit. Lévy says that his strangely defined constant $c$ is zero in case FLT is true, and is strictly positive otherwise. But what is its value? For some mathematicians such as Brouwer, this constant can only have a value if we can prove that we are in one case or the other. At the time when Lévy wrote his article, in 1950, one did not know yet whether Fermat's last theorem was true or false; it would only be proved in 1995! There is one way to prove that FLT is false: it consists in enumerating all quadruples $(x,y,z,p)$, where $x,y,z$ are strictly positive integers and $p$ is a prime number $\geq 3$, until one reaches some quadruple where $x^p+y^p=z^p$ – this guarantees that $c>0$, Lévy says. But if FLT is true, then this enumeration will go on indefinitely and we will never reach the conclusion. Lévy has an objection to this disappointment, which I find quite amusing, based on the paradox of Zeno of Elea: imagine that we are quicker and quicker at each verification, say twice as fast each time. We could need 1 second for the first quadruple, half a second for the second one, a quarter of a second for the third one, and so on, and then the whole verification would take 2 seconds. This is obviously a thought experiment, and I do not know if physics forbids such a phenomenon, which anyway Lévy qualifies as supernatural. Computer scientists could object as well, because the complexity of the required computations grows at each step. But let us stop this philosophical discussion and go back to Lévy's constant.

Computing the Lévy constant

The indicated sum is a series, indexed by prime numbers $p\geq 3$ (we could consider integers as well), of some integral from $0$ to $2\pi$ of some trigonometric expression, divided by $p^2$; I will denote the integral by $c_p$.

The point is that $c_p$ is always positive or zero, and that $c_p$ is strictly positive if and only if Fermat's last theorem fails for exponent $p$. Lévy does not say so explicitly, but the goal of the division by $p^2$ is to make sure that the infinite series converges, whatever its sum.

Since the cosines are at most $1$ in absolute value, the series of all $\cos(n^p x)/n^2$ converges normally and its sum is a continuous function of $x$ which I will denote by $f(x)$. One has $c_p=\int_0^{2\pi} f(x)^3\,dx$. (Since $f$ is bounded from above independently of $p$, for example by $\pi^2/6$, each integral is uniformly bounded, and the division by $p^2$ makes the whole series convergent.) To understand this integral, let us expand the cube.

What we get is a sum, indexed by triples of integers $(k,m,n)$, of terms of the form $\cos(k^px) \cos(m^px)\cos(n^px) / k^2m^2n^2$, and we need to integrate each of them from $0$ to $2\pi$.

Several methods are possible, such as knowing one's formulas for products of cosines, but I prefer replacing the cosines by their complex exponential expression (Euler's formula), so that each term gets replaced by a sum of 8 terms of the form $e^{\pm i k^p x} e^{\pm i m^p x} e^{\pm i n^p x} / 8 k^2m^2n^2$. Let us factor the exponentials: we get $e^{ i (\pm k^p\pm m^p\pm n^p) x} / 8k^2m^2 n^2$, and its integral from $0$ to $2\pi$ is equal to $\pi/4k^2m^2n^2$ if the coefficient $\pm k^p\pm m^p\pm n^p$ of $x$ vanishes, and is zero otherwise.
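This orthogonality computation is easy to check numerically: the integral of $\cos(kx)\cos(mx)\cos(nx)$ over $[0;2\pi]$ equals $\pi/4$ times the number of sign patterns with $\pm k\pm m\pm n=0$. A small sketch of mine (an equally spaced Riemann sum is exact here, up to rounding, since the integrand is a trigonometric polynomial of low frequency):

```python
import math

def triple_cos_integral(k, m, n, steps=20000):
    """Left-endpoint Riemann sum of cos(kx)cos(mx)cos(nx) over [0, 2*pi]."""
    h = 2 * math.pi / steps
    return sum(math.cos(k*x) * math.cos(m*x) * math.cos(n*x)
               for x in (i * h for i in range(steps))) * h

def predicted(k, m, n):
    # (2*pi/8) times the number of sign patterns with s1*k + s2*m + s3*n = 0
    count = sum(1 for s1 in (1, -1) for s2 in (1, -1) for s3 in (1, -1)
                if s1*k + s2*m + s3*n == 0)
    return math.pi / 4 * count

# 7 = 4 + 3: two sign patterns vanish, and the integral is pi/2
assert abs(triple_cos_integral(7, 4, 3) - predicted(7, 4, 3)) < 1e-6
# 5, 4, 3: no sign pattern vanishes, and the integral is 0
assert abs(triple_cos_integral(5, 4, 3) - predicted(5, 4, 3)) < 1e-6
```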

In other words, the triple $(k,m,n)$ contributes a strictly positive quantity to the sum as soon as some choice of signs makes $\pm k^p\pm m^p\pm n^p$ vanish, that is, as soon as $(k,m,n)$ furnishes a counterexample to Fermat's theorem for exponent $p$. Amusing, isn't it? Yes, but you will admit that it is essentially tautological… However…

The circle method

I now reach deeper, in any case mathematically more efficient, considerations and wish to say some words about the circle method. I'll take as an example the problem that the English mathematician Waring posed in 1770.

In his Arithmetica, Diophantus had asked whether every integer is a sum of 4 perfect squares of integers, and Lagrange proved in 1770 that it is indeed the case. Waring generalized the question to higher powers: is it true that every integer is the sum of at most 9 perfect cubes of integers, or 19 fourth powers, etc.? These two numbers, 9 and 19, correspond to the fact that we need indeed 9 cubes to write 23 ($23 = 2^3 + 2^3 + 1+1+1+1+1+1+1$) and 19 fourth powers to write 79 ($79 = 2^4+2^4+2^4+2^4+1+\cdots+1$, with fifteen 1s). So Waring was asking whether this necessary bound was, in fact, sufficient.

We know today that this is the case for cubes (Wieferich, 1909; Kempner, 1912) and for fourth powers (Balasubramanian, Deshouillers, Dress, 1986). We also know that there is, for every integer $k\geq3$, some smallest integer $g(k)$ such that every integer is the sum of at most $g(k)$ perfect $k$th powers, as well as an upper bound for $g(k)$ which conjecturally is an equality, namely $g(k)=2^k+\lfloor (3/2)^k\rfloor - 2$.
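The values $g(3)=9$ and $g(4)=19$ quoted above can be checked on the extremal integers 23 and 79 by dynamic programming; the function below is my own sketch, not from the sources discussed:

```python
def min_powers(N, k):
    """Least number of k-th powers of positive integers summing to N,
    by the classical coin-change dynamic program."""
    powers = []
    i = 1
    while i**k <= N:
        powers.append(i**k)
        i += 1
    best = [0] + [N] * N          # N ones always suffice, so N is an upper bound
    for m in range(1, N + 1):
        best[m] = 1 + min(best[m - p] for p in powers if p <= m)
    return best[N]

assert min_powers(23, 3) == 9     # 23 = 2^3 + 2^3 + seven 1s
assert min_powers(79, 4) == 19    # 79 = four 2^4 + fifteen 1s
# Lagrange: four squares always suffice
assert all(min_powers(n, 2) <= 4 for n in range(1, 200))
```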

However, this question seems to me rather less interesting, when $k$ grows, than its asymptotic analogue: finding the smallest integer $G(k)$ such that every large enough integer is the sum of at most $G(k)$ perfect $k$th powers. For this problem, a much better bound is known to hold: $G(k)\leq 2k \log(k)+\cdots$ (Vinogradov, Karatsuba, Vaughan, Wooley…).

And to establish these results, trigonometric series strike back, as the heart of the circle method of Hardy and Littlewood.

To write $N$ as a sum of $k$th powers, the idea is to define $f(x)$ as the sum, for all integers $n$ smaller than $P=N^{1/k}$, of $\exp(2i\pi n^k x)$, and to observe that the number of solutions of Waring's equality $n_1^k+\cdots+n_r^k=N$ is given by the integral from $0$ to $1$ of the quantity $f(x)^r \exp(-2i\pi Nx)$.
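This orthogonality identity can be verified exactly on a toy case. In the sketch below (my own), the integral is discretized on a grid finer than every frequency occurring in $f(x)^r e^{-2i\pi Nx}$, which makes the quadrature exact:

```python
import cmath
from itertools import product

def largest_kth_root(N, k):
    P = 1
    while (P + 1) ** k <= N:
        P += 1
    return P

def waring_count_brute(N, k, r):
    """Count representations N = n_1^k + ... + n_r^k with n_i >= 1, brute force."""
    P = largest_kth_root(N, k)
    return sum(1 for t in product(range(1, P + 1), repeat=r)
               if sum(n**k for n in t) == N)

def waring_count_fourier(N, k, r):
    """Same count, as the integral of f(x)^r e^{-2i*pi*N*x} over [0,1],
    discretized on a grid exceeding every frequency of f^r (exact quadrature)."""
    P = largest_kth_root(N, k)
    grid = r * P**k + 1
    total = 0.0
    for m in range(grid):
        x = m / grid
        fx = sum(cmath.exp(2j * cmath.pi * (n**k) * x) for n in range(1, P + 1))
        total += (fx**r * cmath.exp(-2j * cmath.pi * N * x)).real
    return round(total / grid)

# 23 as a sum of nine cubes of 1 or 2: the two 2^3's occupy any of C(9,2) = 36 slots
assert waring_count_brute(23, 3, 9) == 36
assert waring_count_fourier(23, 3, 9) == 36
```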

At this point, we have not been less tautological than Paul Lévy was, and it required the genius of Hardy, Littlewood and their followers to observe that this integral can actually be evaluated. To that aim, the idea is to split the interval $[0;1]$ in two regions: first, intervals (major arcs) centered around rational numbers of the form $a/q$, where $q$ is “small”, and then the complementary set (minor arcs). Roughly, the contribution of the major arcs can be reduced to trigonometric sums in finite fields, using the Chinese remainder theorem and other tricks, and it appears to have a simple asymptotic expansion – this is number theory. When the Gods favor us, the contribution of the minor arcs may be bounded efficiently, using more analytic methods, such as the Cauchy–Schwarz inequality or Weyl differencing. This requires the integer $r$ to be large enough; one then gets an asymptotic expansion for the number of ways to write $N$ as a sum of $r$ perfect $k$th powers, and this number is strictly positive for $N$ large enough!

What one needs to remember is that this works for $r$ quite large. In other words, we can study Diophantine equations whose number of variables is very large compared to the degree. An article by Birch (which I have not studied well enough…) is called Forms in many variables. Alas, this is completely inoperative for questions such as Fermat's last theorem, which has 3 variables and quite a large degree. On the other hand, the circle method is very robust: it allows the equation to be modified – Birch's paper is very flexible – and we know that small modifications of an equation can give rise to dramatic differences regarding its behaviour.

So, as the computer scientists would say, this may be rather a feature than a bug.

Wednesday, February 24, 2021

Transcendence properties of the Painlevé and Schwarzian differential equations

The paper of Joel Nagloo, Model Theory and Differential Equations, just published in the Notices of the AMS, is a beautiful portrait of an area of research that encompasses the end of the 19th century, the full 20th century, up to today. It brings together two totally distinct lines of thought. The first line is the theory of differential equations, such as the ones that appear in trigonometry (the sine and cosine functions are solutions of the differential equation $y''+y=0$) or in elaborations of trigonometry (Schwarzian or Painlevé equations).

The Painlevé equations, named after the mathematician and statesman Paul Painlevé (1863–1933) — he was minister of war and prime minister during the French 3rd Republic — are a series of 6 differential equations of the second order, of the form $y''=R(t,y,y')$, where $R$ is a rational function. They are characterized by the Painlevé property that all “movable singularities” of their solutions — those which depend on the initial conditions and are not imposed by the form of the equation — are poles. Painlevé classified these equations: up to computation errors later corrected by Gambier and Fuchs, this gives the following irreducible list of 6 equations, in which the variable is $t$, the unknown is $y$, and the Greek letters $\alpha, \beta,\gamma,\delta$ represent parameters:

  1. $y''=6y^2+t$
  2. $y''=2y^3+ty+\alpha$
  3. $tyy''=t(y')^2-yy'+\delta t+\beta y+\alpha y^3+\gamma ty^4$
  4. $yy''=\frac12 (y')^2+\beta+2(t^2-\alpha)y^2+4ty^3+\frac32 y^4$
  5. $y''=\left(\dfrac1{2y}+\dfrac1{y-1}\right)(y')^2-\dfrac1t y'+\dfrac{(y-1)^2}{t^2}\left(\alpha y+\beta \dfrac1y\right)+\gamma\dfrac yt+\delta \dfrac{y(y+1)}{y-1}$
  6. $y''=\frac12\left(\dfrac1y+\dfrac1{y-1}+\dfrac1{y-t}\right)(y')^2-\left(\dfrac1t+\dfrac1{t-1}+\dfrac1{y-t}\right)y'+\dfrac{y(y-1)(y-t)}{t^2(t-1)^2}\left(\alpha+\beta \dfrac t{y^2}+\gamma \dfrac{t-1}{(y-1)^2}+\delta\dfrac{t(t-1)}{(y-t)^2}\right)$

By irreducible list, it is meant that every equation satisfying the Painlevé property is reducible either to previously known equations (involving elliptic functions or the Riccati equations) or to one of these six, and that these six equations are not so reducible. The solutions of these equations were classically called “transcendental” because they are generally (meaning: except for some particular choices of the parameters) not rational functions, nor can they be obtained by solving algebraic equations. The basic question that people want to understand here is how transcendental these functions really are.
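Painlevé transcendents are easy to explore numerically; here is a minimal RK4 sketch of mine for the first equation $y''=6y^2+t$ (a longer integration would eventually blow up at a movable pole, whose location depends on the initial conditions):

```python
def painleve_I(t_end, y0=0.0, dy0=0.0, h=1e-3):
    """Integrate Painleve I, y'' = 6y^2 + t, by classical RK4 from t = 0."""
    def rhs(t, y, dy):
        # first-order system: (y, y')' = (y', 6y^2 + t)
        return dy, 6*y*y + t
    t, y, dy = 0.0, y0, dy0
    while t < t_end - 1e-12:
        k1 = rhs(t, y, dy)
        k2 = rhs(t + h/2, y + h/2*k1[0], dy + h/2*k1[1])
        k3 = rhs(t + h/2, y + h/2*k2[0], dy + h/2*k2[1])
        k4 = rhs(t + h, y + h*k3[0], dy + h*k3[1])
        y += h/6 * (k1[0] + 2*k2[0] + 2*k3[0] + k4[0])
        dy += h/6 * (k1[1] + 2*k2[1] + 2*k3[1] + k4[1])
        t += h
    return y

# near t = 0 with y(0) = y'(0) = 0, the equation is dominated by y'' ~ t,
# so the solution starts like t^3/6
assert abs(painleve_I(0.2) - 0.2**3 / 6) < 1e-4
```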

In this direction, a 2020 theorem of Joel Nagloo is that if you take $n$ distinct solutions $y_1,\dots, y_n$ of a “generic” Painlevé equation of type III or VI, then there is no (nontrivial) algebraic relation between the $2n$ functions $y_1,\dots,y_n$ and their derivatives $y_1',\dots,y_n'$. Here, generic means that the parameters involved in the differential equation are algebraically independent over the field of rational numbers.

With Guy Casale and James Freitag, Joel Nagloo has proved a similar theorem for the third-order Schwarzian equations associated with a Fuchsian subgroup $\Gamma$ of $\mathrm{PSL}(2,\mathbf R)$. The situation is more subtle: there are possible algebraic relations, but they are completely classified by Hecke correspondences.

The second line of thought is model theory, an area of mathematics classically attached to mathematical logic, which tries to understand the intrinsic properties of mathematical theories, whatever they are. To that aim, model theorists do geometry: they consider “definable subsets”, that is, loci defined (in some object of the considered theory) by the type of equations that the theory affords.

For example, if the theory involves fields only, the object could be a large algebraically closed field (such as the field of complex numbers) and the equations are essentially polynomial equations, except that quantifiers and negations are also allowed in defining the loci. In fact, in this case, it is a theorem, classically attributed to Chevalley, that quantifiers are not needed: one only gets so-called constructible sets.

Another kind of interesting structure is that of differential fields, in which one has the classical field operations, but also an abstract derivation operator (satisfying the standard rules for derivations). In that setting, definable sets essentially correspond to differential equations.

What model theory does in general is define ways to categorize the definable subsets. One important such invariant is an analogue of the Zariski dimension of constructible sets in algebraic geometry: the Morley rank. Roughly, a definable set has Morley rank $0$ if it is finite, and Morley rank at least $n$ if it admits an infinite family of disjoint definable subsets of Morley rank $\geq n-1$. A particular case of sets of Morley rank $1$ are the strongly minimal definable sets: those which are infinite and whose definable subsets are all finite or cofinite (that is, with finite complement).

More importantly, model theory identifies structural properties of theories that make them “neater”, and this gives rise to a “universe of theories”. As you can see on this “map of the universe” (made by Gabriel Conant), the theory ACF of algebraically closed fields is one of the simplest that exist, and recent work in model theory tries to clarify what happens in much more complicated ones, characterized by incredible names or acronyms: strongly minimal, stable, distal, o-minimal, NIP, supersimple, NTP₂…

One of the fundamental results in model theory that emerged in the years 1980–2000 is the theory of Zariski geometries and Boris Zilber's “trichotomy principle”, which indicates that 1-dimensional (strongly minimal) objects fall into three disjoint classes:

  • geometrically trivial: there are essentially no relations between distinct points of such an object beyond the fact that they belong to it. The generic Painlevé equations furnish important examples of this case, but proving strong minimality is the hard step.
  • group-like: some algebraic group is hidden in the picture, and constrains all possible relations.
  • field-like: this is essentially algebraic geometry over some hidden field.

I must be terse here: stability theory is an immense area of model theory with which I am not familiar enough, and I just insert pictures of three big names in the field, namely Morley, Shelah and Hrushovski:

For more details, and definitely more insight, I refer you to Joel Nagloo's paper quoted above. I also refer you to the Freedom Math Dance blog posts on Model theory and algebraic geometry.

Thursday, January 14, 2021

On Rolle's theorem

This post is inspired by a paper of Azé and Hiriart-Urruty published in a French high school math journal; in fact, it is mostly a paraphrase of that paper, with the hope that it will be of some interest to young university students, or to students preparing the Agrégation. The topic is Rolle's theorem.

1. The one-dimensional theorem, a generalization and two other proofs

Let us first quote the theorem, in a nonstandard form.

Theorem. Let $I=\mathopen]a;b\mathclose[$ be a nonempty but possibly unbounded interval of $\mathbf R$ and let $f\colon I\to\mathbf R$ be a continuous function. Assume that $f$ has limits at $a$ and $b$, both equal to some element $\ell\in\mathbf R\cup\{+\infty\}$. Then $f$ is bounded from below.

  1. If $\inf_I(f)<\ell$, then there exists a point $c\in I$ such that $f(c)=\inf_I(f)$. If, moreover, $f$ has a right derivative and a left derivative at $c$, then $f'_l(c)\leq0$ and $f'_r(c)\geq0$.
  2. If $\inf_I(f)\geq\ell$, then $f$ is bounded on $I$ and there exists a point $c\in I$ such that $f(c)=\sup_I(f)$. If, moreover, $f$ has a right derivative and a left derivative at $c$, then $f'_l(c)\geq0$ and $f'_r(c)\leq0$.

Three ingredients make this version slightly nonstandard:

  • The interval $I$ may be taken to be infinite;
  • The function $f$ may tend to $+\infty$ at the endpoints of $I$;
  • Only left and right derivatives are assumed to exist.

Of course, if $f$ has a derivative at each point, then the statement implies that $f'(c)=f'_l(c)=f'_r(c)=0$.

a) Even stated in this way, the proof is quite standard and proceeds in two steps.

  1. Using that $f$ has a limit $\ell$, which is not $-\infty$, at $a$ and $b$, there exist $a'$ and $b'$ in $I$ with $a<a'<b'<b$ such that $f$ is bounded from below on $\mathopen]a;a']$ and on $[b';b\mathclose[$. Since $f$ is continuous on the compact interval $[a';b']$, it is bounded from below there as well, hence on all of $I$.
    If $\inf_I(f)<\ell$, then we can choose $\ell'\in\mathbf R$ such that $\inf_I(f)<\ell'<\ell$, and $a'$, $b'$ such that $f(x)>\ell'$ outside of $[a';b']$. Then, let $c\in[a';b']$ be such that $f(c)=\inf_{[a';b']}(f)$; one has $f(c)=\inf_I(f)$.
    If $\sup_I(f)>\ell$, then in particular $\ell\neq+\infty$, and we apply the preceding analysis to $-f$.
    In the remaining case, $\inf_I(f)=\sup_I(f)=\ell$ and $f$ is constant.
  2. For $x>c$, one has $f(x)\geq f(c)$, hence $f'_r(c)\geq 0$; for $x<c$, one has $f(x)\geq f(c)$, hence $f'_l(c)\leq0$.

The interest of the given formulation can be understood by looking at the following two examples.

  1. If $f(x)=|x|$ on $\mathbf R$, then $f$ attains its lower bound at $x=0$ only, where one has $f'_r(0)=1$ and $f'_l(0)=-1$.
  2. Take $f(x)=e^{-x^2}$. Then there exists $c\in\mathbf R$ such that $f'(c)=0$. Of course, one has $f'(x)=-2xe^{-x^2}$, so that $c=0$. However, it is readily seen by induction that for any integer $n$, the $n$th derivative of $f$ is of the form $P_n(x)e^{-x^2}$, where $P_n$ has degree $n$. In particular, $f^{(n)}$ tends to $0$ at infinity. And, by induction again, the theorem implies that $P_n$ has $n$ distinct roots in $\mathbf R$: one between any two consecutive roots of $P_{n-1}$, one larger than the largest root of $P_{n-1}$, and one smaller than the smallest root of $P_{n-1}$.
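The second example lends itself to machine verification: from $f^{(n)}(x)=P_n(x)e^{-x^2}$ one gets the recurrence $P_0=1$, $P_{n+1}=P_n'-2xP_n$ (up to sign, these are the Hermite polynomials), and the predicted root count can be checked by counting sign changes on a grid. A small stdlib-only sketch (the grid bounds and sample count are ad hoc choices):

```python
# f(x) = exp(-x^2) has n-th derivative P_n(x) exp(-x^2), where P_0 = 1
# and P_{n+1} = P_n' - 2x P_n (up to sign, the Hermite polynomials).
# Rolle's theorem predicts n distinct real roots for P_n; we count them
# by sign changes on a grid. Polynomials are coefficient lists, p[k]
# being the coefficient of x^k.

def succ(p):
    """Coefficients of P' - 2x P."""
    deriv = [k * c for k, c in enumerate(p)][1:]
    shifted = [0.0] + [-2.0 * c for c in p]        # -2x * P
    deriv += [0.0] * (len(shifted) - len(deriv))
    return [u + v for u, v in zip(deriv, shifted)]

def num_real_roots(p, lo=-5.0, hi=5.0, samples=9973):
    """Count sign changes of the polynomial p on a grid over [lo; hi]."""
    xs = [lo + (hi - lo) * k / samples for k in range(samples + 1)]
    vals = [sum(c * x**k for k, c in enumerate(p)) for x in xs]
    signs = [v > 0 for v in vals if v != 0]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)

p = [1.0]
for n in range(1, 7):
    p = succ(p)
    print(n, len(p) - 1, num_real_roots(p))  # degree n, with n real roots
```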

b) In a 1959 paper, the Romanian mathematician Pompeiu proposed an alternative proof of Rolle's theorem, valid when the interval $I$ is bounded, and which works completely differently. Here is how it works, following the 1979 paper published in the American Math. Monthly by Hans Samelson.

First of all, one uses the particular case $n=2$ of the Levi chord lemma:

Lemma. Let $f\colon[a;b]\to\mathbf R$ be a continuous function such that $f(a)=f(b)$. For every integer $n\geq 2$, there exist $a',b'\in[a;b]$ such that $f(a')=f(b')$ and $b'-a'=(b-a)/n$.

Let $h=(b-a)/n$. From the equality
$$ 0 = f(b)-f(a) = (f(a+h)-f(a))+(f(a+2h)-f(a+h))+\cdots+(f(a+nh)-f(a+(n-1)h)), $$
one sees that the function $x\mapsto f(x+h)-f(x)$ from $[a;b-h]$ to $\mathbf R$ does not have constant sign. By the intermediate value theorem, it vanishes at some point $a'\in[a;b-h]$. Set $b'=a'+h$; then $b'\in[a;b]$, $b'-a'=(b-a)/n$ and $f(a')=f(b')$.

Then, it follows by induction that there exists a sequence of nested intervals $([a_n;b_n])$ in $[a;b]$ with $f(a_n)=f(b_n)$ and $b_n-a_n=(b-a)/2^n$ for all $n$. The sequences $(a_n)$ and $(b_n)$ converge to the same limit $c\in[a;b]$. Since $f(b_n)=f(c)+(b_n-c)(f'(c)+\mathrm o(1))$ and $f(a_n)=f(c)+(a_n-c)(f'(c)+\mathrm o(1))$, one has
$$ f'(c) = \lim \frac{f(b_n)-f(a_n)}{b_n-a_n} = 0. $$
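Samelson's argument is effective enough to be run as an algorithm: an inner bisection finds each equal-value chord (it applies because $g(a)=-g(a+h)$ for $g(x)=f(x+h)-f(x)$), and the outer loop nests the halved intervals. A stdlib-only sketch, with arbitrary iteration counts and a test function of my own choosing:

```python
# Samelson's halving argument, run as an algorithm: the chord lemma with
# n = 2 produces a subinterval of half the length whose endpoints take
# equal values; iterating, the nested intervals shrink to a point c with
# f'(c) = 0.

def equal_chord(f, a, b, iters=60):
    """Return (a2, a2 + h), with h = (b - a)/2 and f(a2) close to f(a2 + h)."""
    h = (b - a) / 2

    def g(x):
        return f(x + h) - f(x)

    lo, hi = a, a + h            # g(lo) = -g(hi) by the telescoping identity
    if g(lo) == 0.0:
        return lo, lo + h
    for _ in range(iters):       # plain bisection on g
        mid = (lo + hi) / 2
        if g(lo) * g(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return lo, lo + h

def critical_point(f, a, b, levels=40):
    """Nest equal-value chords; the intervals shrink to a critical point."""
    for _ in range(levels):
        a, b = equal_chord(f, a, b)
    return (a + b) / 2

# f(-1) = f(1) = 0; the scheme homes in on a root of f'(x) = 3x^2 - 1.
c = critical_point(lambda x: x**3 - x, -1.0, 1.0)
print(c)
```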

What makes this proof genuinely distinct from the classical one is that the obtained point $c$ may be neither a local minimum nor a local maximum of $f$, although I don't have an example to offer right now.

c) In 1979, Abian furnished yet another proof, which he termed the “ultimate” one. Here it is:

It focuses on functions $f\colon[a;b]\to\mathbf R$ on a bounded interval of $\mathbf R$ which are not monotone and, precisely, which are up-down, in the sense that $f(a)\leq f(c)$ and $f(c)\geq f(b)$, where $c=(a+b)/2$ is the midpoint of $[a;b]$. If $f(a)=f(b)$, then either $f$ or $-f$ is up-down.

Then divide the interval $[a;b]$ into four equal parts: $[a;p]$, $[p;c]$, $[c;q]$ and $[q;b]$. If $f(p)\geq f(c)$, then $f|_{[a;c]}$ is up-down. Otherwise, one has $f(p)\leq f(c)$. In this case, if $f(c)\geq f(q)$, we see that $f|_{[p;q]}$ is up-down. And otherwise, we observe that $f(c)\leq f(q)$ and $f(q)\geq f(c)\geq f(b)$, so that $f|_{[c;b]}$ is up-down. Conclusion: we have isolated within the interval $[a;b]$ a subinterval $[a';b']$ of length $(b-a)/2$ such that $f|_{[a';b']}$ is still up-down.

Iterating the procedure, we construct a sequence $([a_n;b_n])$ of nested intervals, with $b_n-a_n=(b-a)/2^n$, such that the restriction of $f$ to each of them is up-down. Set $c_n=(a_n+b_n)/2$.

The sequences $(a_n)$, $(b_n)$, $(c_n)$ have a common limit $c\in[a;b]$. From the inequalities $f(a_n)\leq f(c_n)$ and $a_n\leq c_n$, we obtain $f'(c)\geq 0$; from the inequalities $f(c_n)\geq f(b_n)$ and $c_n\leq b_n$, we obtain $f'(c)\leq 0$. In conclusion, $f'(c)=0$.
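Abian's quartering step also translates directly into an algorithm. In the sketch below, the test function $x(2-x)^2$, which vanishes at $0$ and $2$ and is up-down on $[0;2]$, is my own choice; its only interior critical point is $x=2/3$:

```python
# Abian's quartering, run as an algorithm: f is "up-down" on [a; b] when
# f(a) <= f(c) and f(c) >= f(b) at the midpoint c; each step keeps a
# half-length subinterval on which f is still up-down.

def abian(f, a, b, iters=60):
    for _ in range(iters):
        c = (a + b) / 2
        p, q = (a + c) / 2, (c + b) / 2
        if f(p) >= f(c):
            b = c            # f restricted to [a; c] is up-down
        elif f(c) >= f(q):
            a, b = p, q      # f restricted to [p; q] is up-down
        else:
            a = c            # f restricted to [c; b] is up-down
    return (a + b) / 2

# Test function (my own choice): f(0) = f(2) = 0, up-down on [0; 2],
# with f'(x) = (2 - x)(2 - 3x) vanishing at x = 2/3 in the interior.
f = lambda x: x * (2 - x) ** 2
c = abian(f, 0.0, 2.0)
print(c)  # close to 2/3
```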

2. Rolle's theorem in normed vector spaces

Theorem. Let $E$ be a normed vector space, let $U$ be an open subset of $E$ and let $f\colon U\to\mathbf R$ be a differentiable function. Assume that there exists $\ell\in\mathbf R\cup\{+\infty\}$ such that $f(x)\to\ell$ when $x$ tends to the “boundary” of $U$, in the following sense: for every $\ell'<\ell$, there exists a compact subset $K$ of $U$ such that $f(x)\geq\ell'$ for all $x\in U$ with $x\not\in K$. Then $f$ is bounded below on $U$, there exists $a\in U$ such that $f(a)=\inf_U(f)$, and $Df(a)=0$.

The proof is essentially the same as the one we gave in dimension 1. I skip it here.

If $E$ is finite-dimensional, then this theorem applies to a vast class of examples: for instance, bounded open subsets $U$ of $E$ and continuous functions $f\colon\overline U\to\mathbf R$ which are constant on the boundary $\partial(U)=\overline U\smallsetminus U$ of $U$ and differentiable on $U$.

However, if $E$ is infinite-dimensional, the closure of a bounded open set is no longer compact, and it does not suffice that $f$ extend to a function on $\overline U$ with a constant value on the boundary.

Example. — Let $E$ be an infinite-dimensional Hilbert space, let $U$ be the open unit ball and $B$ the closed unit ball. Let $g(x)=\frac12\langle Ax,x\rangle+\langle b,x\rangle+c$ be a quadratic function, where $A\in\mathcal L(E)$ is self-adjoint, $b\in E$ and $c\in\mathbf R$, and let $f(x)=(1-\lVert x\rVert^2)g(x)$. The function $f$ is differentiable on $E$ and one has
$$ \nabla f(x) = (1-\lVert x\rVert^2)(Ax+b) - 2\left(\frac12\langle Ax,x\rangle+\langle b,x\rangle+c\right)x. $$
Assume that there exists $x\in U$ such that $\nabla f(x)=0$. Then $Ax+b=\lambda x$, with
$$ \lambda=\frac2{1-\lVert x\rVert^2}\left(\frac12\langle Ax,x\rangle+\langle b,x\rangle+c\right). $$
Azé and Hiriart-Urruty take $E=L^2([0;1])$, for $A$ the operator of multiplication by $t$, for $b$ the function $b(t)=t(1-t)$, and $c=4/27$. Then one has $g(x)>0$, hence $\lambda>0$, and $x(t)=\frac1{\lambda-t}b(t)$ for $t\in[0;1]$. This implies that $\lambda\geq 1$, since otherwise the function $x(t)$ would not belong to $E$. One can then bound $\lambda$ explicitly, obtaining $\lambda\leq3/4$, which contradicts the inequality $\lambda\geq1$. (I refer to the paper of Azé and Hiriart-Urruty for more details.)

3. An approximate version of Rolle's theorem

Theorem. Let $B$ be the closed Euclidean unit ball in $\mathbf R^n$, let $U$ be its interior, and let $f\colon B\to\mathbf R$ be a continuous function on $B$. Assume that $\lvert f\rvert\leq\epsilon$ on the boundary $\partial(U)$ and that $f$ is differentiable on $U$. Then there exists $x\in U$ such that $\lVert Df(x)\rVert\leq\epsilon$.

In fact, replacing $f$ by $f/\epsilon$, one sees that it suffices to treat the case $\epsilon=1$.

Let $g(x)=\lVert x\rVert^2-f(x)^2$. This is a continuous function on $B$; it is differentiable on $U$, with $\nabla g(x)=2(x-f(x)\nabla f(x))$. Let $\mu=\inf_B(g)$. Since $g(0)=-f(0)^2\leq0$, one has $\mu\leq 0$. We distinguish two cases:

  1. If $\mu=0$, then $\lvert f(x)\rvert\leq\lVert x\rVert$ for all $x\in B$. In particular $f(0)=0$, and the inequality then implies that $\lVert\nabla f(0)\rVert\leq1$.
  2. If $\mu<0$, let $x\in B$ be such that $g(x)=\mu$; in particular, $f(x)^2=\lVert x\rVert^2-\mu>0$, which implies that $f(x)\neq0$. Since $g\geq0$ on $\partial(U)$, we have $x\in U$, hence $\nabla g(x)=0$. Then $x=f(x)\nabla f(x)$, hence $\nabla f(x)=x/f(x)$. Consequently,
    $$ \lVert\nabla f(x)\rVert = \frac{\lVert x\rVert}{\lvert f(x)\rvert} = \frac{\lVert x\rVert}{(\lVert x\rVert^2-\mu)^{1/2}} < 1. $$

This concludes the proof. 
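As a crude numerical sanity check of this statement, one can grid-search the unit disk of $\mathbf R^2$ for a hand-picked test function; the function, the value $\epsilon=1$ and the grid resolution below are all ad hoc choices made for illustration:

```python
import math

# Sanity check of the approximate Rolle theorem on the unit disk of R^2,
# with eps = 1 and the hand-picked test function
# f(x, y) = 3 exp(-5 (x^2 + y^2)): on the unit circle |f| = 3 e^{-5} < 1,
# so some interior point must satisfy ||grad f|| <= 1. A grid search
# locates such a point (here, the origin).

def f(x, y):
    return 3.0 * math.exp(-5.0 * (x * x + y * y))

def grad_norm(x, y):
    # grad f = -30 exp(-5 (x^2 + y^2)) * (x, y)
    g = -30.0 * math.exp(-5.0 * (x * x + y * y))
    return math.hypot(g * x, g * y)

eps = 1.0
n = 200
best = min(
    grad_norm(i / n, j / n)
    for i in range(-n, n + 1)
    for j in range(-n, n + 1)
    if (i / n) ** 2 + (j / n) ** 2 < 1.0
)
print(best)  # 0.0, attained at the origin
```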


Thanks to the Twitter users @AntoineTeutsch, @paulbroussous and @apauthie for having pointed out some misprints and inaccuracies.