Optimization: The math behind machine learning

Posted by Adam Walhout


Optimisation is not only a buzzword; it is a solid mathematical concept that sits behind most machine learning algorithms.

Mathematical optimisation entered my life when I was a physics student and hasn’t left me since. Even after I changed jobs – leaving academia for data science – it continues to be part of my daily life. When talking about mathematical optimisation, one often refers to the act of minimising (or maximising) a function. To better understand this concept, I will present three examples from three worlds that I know from direct experience:

  • Theoretical physics
  • Experimental physics
  • Machine learning

The problems studied in these areas are extraordinarily complex, so my friends belonging to one of the three should not be offended by the simplifications that I am going to present in this post. As one proceeds from theoretical physics, through the experimental universe, to the cutting-edge field of machine learning, one has to give up the exactness of the solution to a problem in order to gain predictive power.


Theoretical physics

The duty of the theoretical physicist is to take a fairly simple setup and describe it in great detail and with great exactness. That is where the challenge lies: describing deeply, with precision. Let’s talk about a textbook example: a ball attached to a spring, free to move in only one direction.


Harmonic Oscillator


If one slightly displaces the ball, stretching the spring, the ball will start to move back and forth, with such a neat and smooth movement that it deserves the name “harmonic oscillator”. For systems like this, the type of motion will always be the same. Where does this consistency of behaviour come from? It is due to the so-called principle of minimal action, a very important concept in classical physics. The ball, often referred to as a particle by physicists, always moves in such a way that a function called the action is minimised. This function, a close relative of the energy, is relatively easy to write down for a system that contains very few elements, like the example above. Once one knows this function, the motion of every element in the system is found by minimising the action. The motion will always be the same for classical systems with identical actions, because it will always correspond to the minimum of this function.



The particle always moves in a way that minimises the action. 
Minimal effort and maximum result! Optimisation!
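As a toy illustration (not from the original post), one can even find this motion numerically: discretise the action and hand it to an optimiser. All parameter values below are assumptions chosen for the sketch, with mass and spring stiffness set to one.

```python
import numpy as np
from scipy.optimize import minimize

# Sketch: discretise the action S = sum_i (kinetic_i - potential_i) * dt
# for a unit-mass, unit-stiffness harmonic oscillator and minimise it
# over the interior points of the path (the endpoints are held fixed).
m, k = 1.0, 1.0
T = np.pi / 2                      # a quarter period, where the action is a true minimum
n = 50
t = np.linspace(0.0, T, n + 1)
dt = t[1] - t[0]
x0, xT = 1.0, 0.0                  # released at x = 1, reaches x = 0 at time T

def action(interior):
    x = np.concatenate(([x0], interior, [xT]))
    kinetic = 0.5 * m * ((x[1:] - x[:-1]) / dt) ** 2
    potential = 0.5 * k * x[:-1] ** 2
    return np.sum((kinetic - potential) * dt)

# Start from a straight line between the endpoints and let the
# optimiser deform it into the path of minimal action.
guess = np.linspace(x0, xT, n + 1)[1:-1]
path = minimize(action, guess).x

# The minimising path matches the analytic solution x(t) = cos(t).
print(np.max(np.abs(path - np.cos(t[1:-1]))))
```

Here the numerical optimiser stands in for the calculus of variations: instead of solving the equations of motion analytically, we simply ask for the path that minimises the action, and the familiar cosine comes out.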


Experimental physics

Let me now speak about my friends the experimental physicists. Their world is a bit different from that of the theoretical physicists, even if the collaboration between these two groups of scientists is crucial. When analysing experimental data, they have to deal with a lot of noise. This noise makes the data look different, sometimes very different, from the ideal theoretical model, but the existence of theoretical models is a great help for the experimentalist. It is a mutually beneficial relationship: theorists give models to experimentalists, who use the data to validate those models and to suggest future directions for the theorists’ research, by discovering new physics that does not fit the previously available models. When several theoretical models are available, the job of the experimental physicist, after an enormous effort to collect and clean the data, is to compare the data with the models. In this case what one minimises is the distance between the data points and the curve of the theoretical model. The chosen model will be the one that comes closest to the experimental data.


These points describe the motion of the harmonic oscillator, but there is some noise. One can see that the red line, describing the theoretical harmonic oscillator, fits the data well.
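A fit like the one in the figure can be sketched in a few lines of Python. The data below are synthetic, and the amplitude, frequency and phase values are assumptions for the example; `scipy.optimize.curve_fit` then minimises the sum of squared distances between the model curve and the noisy points.

```python
import numpy as np
from scipy.optimize import curve_fit

# Synthetic "measurements": a cosine signal plus Gaussian noise,
# standing in for noisy harmonic-oscillator data.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 3.0, 100)
data = 1.0 * np.cos(2.0 * t + 0.5) + rng.normal(scale=0.1, size=t.size)

def model(t, amplitude, frequency, phase):
    return amplitude * np.cos(frequency * t + phase)

# Least squares: find the parameters that minimise the distance
# between the model and the data, starting from a rough guess.
params, _ = curve_fit(model, t, data, p0=[1.2, 1.8, 0.2])
print(params)  # should recover roughly [1.0, 2.0, 0.5]
```

This is the experimentalist’s optimisation in miniature: the theory supplies the shape of the curve, and minimising the distance to the data picks out the parameters.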


Machine learning

A few months ago, I decided to change fields and migrated from theoretical physics to data science and machine learning. A whole new universe opened before me. The data in data science are far messier and noisier than in a physicist’s worst nightmare. This put me in a combined state of panic and excitement, a mixture of thoughts oscillating, not even so harmonically, between “this is inhumanly messy” and “I am so excited to have the chance to challenge myself”.


But I soon realised that there are many good techniques for extracting value from these data; moreover, these data are not inhuman at all. The complexity of the data is related to human society and human interactions. In data science the situation is definitely not the same as in physics: particles don’t suddenly start to behave awkwardly because of a recent breakup; customers can. These types of situations are a source of noise.



Exploring the data for business value

In physics you analyse data to claim a discovery; in data science the same accuracy is not required, and not even possible. Most of the time one looks for tendencies, trends and insights that translate into business value. As in the experimental physics case, we want to minimise the distance between the model and the data. The tricky part is that this time there is no theorist to give us a model to test; we need to come up with a model just by looking at the data. The first thought might be to use a model that perfectly reproduces all the data points, but it is easy to see that this would have no predictive power. Data are noisy, and they don’t lie on the curve described by the ideal model. Future data will not lie on a curve that fits all the old data either.


A model that interpolates through all the data points has no predictive power.


Luckily, many techniques are available to smooth the model, taking care not to smooth it so much that the predictive power is lost again. In some cases a straight line can be a good approximation, but in other cases it is way too smooth. This idea is at the basis of many machine learning algorithms: the secret is to minimise a function that trades off the distance to the data against the smoothness of the model. In this way we can build a model with predictive power.

A smoother fit has higher predictive power than an interpolation that overfits the data. One has to be careful not to smooth things too much, otherwise the solution would again be inadequate.
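This trade-off can be made concrete with a minimal numpy sketch (the data and the penalty weight `alpha` are illustrative assumptions, not taken from the post): the loss ||y − Xw||² + alpha·||w||² balances the distance to the data against the smoothness of the model.

```python
import numpy as np

# Noisy observations of a smooth underlying curve.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

X = np.vander(x, 10)               # degree-9 polynomial features

# Exact interpolation: zero distance to the data, but a wiggly curve
# with no predictive power.
interpolant = np.linalg.solve(X, y)

# Ridge-penalised fit: accepts a small distance to the data in
# exchange for smaller, smoother coefficients.
alpha = 1e-3
ridge = np.linalg.solve(X.T @ X + alpha * np.eye(10), X.T @ y)

# The penalty shrinks the coefficient norm, taming the wiggles.
print(np.linalg.norm(interpolant), np.linalg.norm(ridge))
```

Tuning `alpha` is exactly the balancing act described above: at zero we are back to the overfitting interpolant, and as it grows the model flattens towards a curve that is too smooth to be useful.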


Optimization lights the path to knowledge

Mathematical modelling is the key to going from data to information that can increase the productivity and efficiency of businesses and enterprises. Mathematical optimisation is the light that guides the path to a deep knowledge of processes and transactions – and it is at the core of what we deliver in AIMS.

Nowadays it is very common to hear the term machine learning. I hope that from now on, when you hear it, you will pay a mental tribute to mathematical optimisation, without which the progress achieved so far would have been impossible.


Schedule an AIMS demo 


About the Author

Alessandra is AIMS’ Data Scientist. She has a background in the world of mathematical modelling. She has a PhD in Theoretical Physics from the University of Padua and worked for several years in academia. She was a postdoctoral researcher at NORDITA (Nordic Institute for Theoretical Physics) in Stockholm and at DESY (German Electron Synchrotron) in Hamburg. She is also a Marie Curie Alumna (Marie Curie Actions).


