Machine Learning and the Continuum Hypothesis

How the unprovable concerns the unlearnable

Robert Passmann
Cantor’s Paradise

--

‘What can machines learn?’ is probably one of the most exciting questions in the theory of machine learning. It concerns the limits of the discipline — the unlearnable. The Continuum Hypothesis, on the other hand, touches the limits of our mathematical knowledge — the unprovable.


Let’s begin with machine learning. To study questions of learnability, we need a mathematically precise framework for the analysis of machine learning. For this purpose, Leslie Valiant introduced probably approximately correct learning (PAC-learning) in 1984.

The idea of PAC-learning is simple: A learner receives tiny bits of information on a certain problem and should output a general hypothesis derived from these data. For example, the learner could receive photos and information on whether or not these photos contain a cat. The learner then has to select a function that decides for a given photo whether it contains a cat or not.

The name PAC-learning comes from the fact that we require the learner to select an approximately correct function with high probability. In short, a problem is PAC-learnable if the learner will, with high probability, select an approximately correct function.
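To make this concrete, here is a toy sketch of the PAC idea (the scenario and all names are mine, not from the article): the unknown concept is a threshold t on the interval [0, 1], a point is labelled 1 exactly when it lies at or above t, and the learner outputs the smallest positive example it has seen.

```python
import random

def label(x, t):
    """The unknown target concept: 1 iff x lies at or above the threshold t."""
    return 1 if x >= t else 0

def learn_threshold(samples):
    """Output the smallest positively labelled example as the hypothesis."""
    positives = [x for x, y in samples if y == 1]
    return min(positives) if positives else 1.0

def true_error(h, t):
    """Probability, under the uniform distribution, that h and t disagree."""
    return abs(h - t)

random.seed(0)
t = 0.3    # the unknown target threshold
m = 1000   # number of labelled samples the learner sees

samples = [(x, label(x, t)) for x in (random.random() for _ in range(m))]
h = learn_threshold(samples)

print(f"hypothesis threshold: {h:.4f}")
```

With 1000 samples, the smallest positive example will, with high probability, lie very close to the true threshold: the hypothesis is probably approximately correct, which is exactly the guarantee PAC-learnability demands.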

What is Cantor’s continuum problem? In 1878, the mathematician Georg Cantor asked whether every infinite subset of the real numbers is either in 1-to-1-correspondence with the natural numbers or in 1-to-1-correspondence with the real numbers. The assertion that this is so is the Continuum Hypothesis.
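In modern cardinal notation, the Continuum Hypothesis says that no set of reals has a size strictly between that of the naturals and that of the reals:

```latex
\neg \exists S \subseteq \mathbb{R}\;
  \bigl( \aleph_0 < |S| < 2^{\aleph_0} \bigr),
\qquad \text{equivalently} \qquad
2^{\aleph_0} = \aleph_1 .
```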

Gödel and Cohen proved in the course of the 20th century that this problem is undecidable from the axioms of standard mathematics — the Zermelo-Fraenkel axioms of set theory (ZFC). They showed that there can neither be a proof of the Continuum Hypothesis nor a proof refuting it. The Continuum Hypothesis is unprovable and so is its negation.

It seems that we have two very different problems here: There is the PAC-learning problem from theoretical computer science, which asks whether machines can learn certain functions. And there is the Continuum Problem, which asks whether there are infinite sets of a certain size. What do these two problems have to do with each other?

In 2019, a group of researchers, Ben-David et al., published an article entitled “Learnability can be undecidable” in Nature Machine Intelligence:

We describe simple scenarios where learnability cannot be proved nor refuted using the standard axioms of mathematics. Our proof is based on the fact that the continuum hypothesis cannot be proved nor refuted. (Ben-David, Hrubeš, Moran, Shpilka and Yehudayoff, Learnability can be undecidable)

How do they show this? Greatly simplified, Ben-David and his collaborators define a new kind of learning problem, which they call the estimating the maximum (EMX) problem, and give this motivating example:

Imagine a website that is being visited by a variety of users. Denote by X the set of all potential visitors to the website. The owner of the website wishes to post ads on it. The posted ads are to be chosen from a given pool of ads. Each ad A in the pool targets a certain population of users F(A) ⊆ X. For example, if A is a sports ad then F(A) is the collection of sports fans. The goal is to place an ad whose target population visits the site most frequently. The challenge is that it is not known in advance which visitors are to visit the site. (Ben-David et al., Learnability can be undecidable)

This simple problem of finding the best ad for your website can be generalised to a mathematical problem involving families of sets and probability distributions: Given a collection F of subsets of some domain D, and a probability distribution on D, find the set in F with the highest probability. What makes this really hard is that we don’t know the probability distribution in advance!
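A natural first attempt, sketched below with toy data of my own invention (the domain, the family of sets, and the distribution are not from the paper), is to sample from the unknown distribution and pick the set in F with the largest empirical mass:

```python
import random
from collections import Counter

def emx_learner(family, samples):
    """Pick the set in the family with the highest empirical probability."""
    counts = Counter(samples)
    total = len(samples)

    def empirical_mass(s):
        # Fraction of observed samples that landed in the set s.
        return sum(counts[x] for x in s) / total

    return max(family, key=empirical_mass)

# Toy domain: potential visitors 0..9; family of candidate ad-target sets.
family = [{0, 1, 2}, {3, 4, 5, 6}, {7, 8, 9}]

random.seed(1)
# Unknown distribution: visitors 3..6 visit far more often than the rest.
samples = random.choices(range(10),
                         weights=[1, 1, 1, 5, 5, 5, 5, 1, 1, 1], k=500)

best = emx_learner(family, samples)
print(best)
```

With enough samples this empirical strategy works well on a small finite domain; the subtlety in the paper arises for infinite domains and families, where whether such a strategy can succeed at all turns out to depend on set theory.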

How do Ben-David and his colleagues show that the EMX-learnability of a certain problem is independent of the Zermelo-Fraenkel axioms (ZFC) of mathematics?

One way to do this would be to construct models of set theory, i.e. possible worlds of mathematics, in one of which the problem is EMX-learnable and in another of which it is not. However, this would be massively complicated.

Instead, Ben-David and his co-authors do what many clever mathematicians have done before and use already established results. They prove — using connections between learning and compression — that a certain very complicated instance of their EMX-learning problem is equivalent to Cantor’s Continuum Hypothesis.

If you can show that a statement is equivalent to the Continuum Hypothesis, then you have, in fact, shown that your statement is independent of ZFC. That’s easy to see: If ZFC proved your statement, then, by the equivalence, it would prove the Continuum Hypothesis. If ZFC refuted your statement, then, by the equivalence, it would refute the Continuum Hypothesis. But ZFC can neither prove nor refute the Continuum Hypothesis, so it can neither prove nor refute your statement.
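Schematically, writing S for the learnability statement and CH for the Continuum Hypothesis, the argument runs:

```latex
\begin{align*}
&\mathrm{ZFC} \vdash (S \leftrightarrow \mathrm{CH}), \\
&\mathrm{ZFC} \nvdash \mathrm{CH}
  \;\text{ and }\;
  \mathrm{ZFC} \nvdash \neg\mathrm{CH}
  \quad \text{(G\"odel, Cohen)} \\
\Longrightarrow\;
&\mathrm{ZFC} \nvdash S
  \;\text{ and }\;
  \mathrm{ZFC} \nvdash \neg S .
\end{align*}
```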

Ben-David, Hrubeš, Moran, Shpilka, and Yehudayoff have found a clever way to connect the foundations of mathematics to the foundations of machine learning. Everything is connected — and the limits of learnability may depend on our set-theoretic foundations of mathematics.

A critical voice

But wait a minute. How seriously should we take these results? The Dutch mathematician and set theorist K. P. Hart has critically analysed Ben-David et al.’s paper. Roughly speaking, Hart’s point is this:

The learning problems essentially require the learner to come up with a generalised function. If that generalised function is good enough, then we say that the learner has learned the problem. In other words, the problem is learnable.

However, to require that a problem is learnable by a machine is a further restriction. Alan Turing showed with his famous Halting Problem that not all functions are computable. So, to say that a machine can learn a problem, we must make sure that the functions we are talking about can actually be handled by a machine:

The functions that are used … are quite arbitrary and not related to any recognizable algorithm. (K. P. Hart, Machine learning and the Continuum Hypothesis)

Hart points out that this was already recognised by Ben-David and his co-authors. He then suggests making the problem more precise:

One possible way of separating out ‘algorithmic’ functions is by requiring them to have nice descriptive properties. If ‘nice’ is taken to mean ‘Borel measurable’ then the desired functions do not exist. (K. P. Hart, Machine learning and the Continuum Hypothesis)

Borel measurability is a property that functions may or may not have; the precise details of the definition are not important here. What matters is this: using standard formalisations of algorithms (e.g. Turing machines), and considering algorithmic functions as those mapping inputs to outputs according to some algorithm, it follows that algorithmic functions are Borel measurable.

Hart shows that no Borel-measurable function can solve the EMX-learning problem. This means that no algorithm, no computer, no machine can solve this problem. Hart concludes that “there is no algorithm solving this particular learning problem” and therefore:

The results … imply that a Borel measurable learning function does not exist. In this author’s [K. P. Hart’s] opinion that means that the title of [“Learnability can be undecidable”] should be emended to “EMX-learning is impossible”. (K. P. Hart, Machine learning and the Continuum Hypothesis)

What should we make of this? Whether or not the EMX-learning problem is a proper example of an undecidable learning problem, my conclusion is this: However abstract and obscure the foundations of mathematics and logic may sometimes seem, they are so fundamental that they affect even the most applied disciplines, like machine learning.

Have a look at my profile for more stories on, for example, the Continuum Hypothesis or the axioms of Zermelo-Fraenkel Set Theory.


I’m a logician working in mathematics and philosophy. PhD Candidate at ILLC, University of Amsterdam.