- Posted by Daitan Innovation Team
- On December 20, 2019
- Convex Optimization, Data Science, Deep Learning
How Business Requirements can Prevent You from Using Available Machine Learning Tools and What to do About That
When we search the internet for the required technical abilities of a successful data scientist, we will find variations of the following list:
- Analytical skills
- Linear algebra
- Multi-variate calculus
- Machine learning
- Data visualization
- Software engineering
And so on. But there is a common subject to some items on this list that does not get the same attention, despite being the main engine of several tools used by a data scientist: Convex Optimization.
When hearing this term, most people will immediately start talking about how gradient descent is the most awesome thing there is, how we can add momentum to it, how can we choose, adapt, or even circumvent the choice of the step-size parameter, and so on.
However, in reality, convex optimization goes well beyond gradient descent and its variants.
As shown in figure 1, this is a subject that combines several of the above-listed abilities, such as linear algebra, calculus, statistics, and the learning word in machine learning.
Not only that, with the increased volume of data both in industry and academia, the desire for faster optimization methods, and the popularization of parallel computing, both fields hugely influence each other.
Machine learning is posing new optimization problems and recovering forgotten optimization methods, while convex optimization research is allowing statisticians and computer scientists to tackle new and bigger problems.
Refresher: What Is Optimization?
In short, optimization is a branch of mathematics that focuses on finding a solution that minimizes an objective function, possibly with some constraints on the potential values of our problem variables.
The most general optimization problem has the form:
In this setting, x ∈ ℝⁿ is our variable and we want to find the values for this variable that achieve the minimum value of f(x), our objective function (to maximize the function, you can minimize its negative version).
We also may be restricted by some functions on our variables limited by scalar values, called constraints. Sometimes, we can find an analytical formula to solve an optimization problem, but usually, there is no closed-form to solve a problem like this.
And even if we do have an analytical solution, there may be other business constraints that will prevent us from using it in practice.
But here is the trick: depending on some properties of the domain of the variable x, the function f(x), and the constraints fᵢ(.), we can use methods that will certainly find the best solution.
If we do not find interesting properties, we may have a problem that will not be solvable in a reasonable amount of time, or we will not be capable of finding the best solution.
By what we have exposed, you can guess the most important property we can find: convexity. When the domain of our variable, our objective function, and our constraints are convex, we have a convex problem and a large literature!
Given the plethora of resources on the web on the subject, the goal of this post is not to be a comprehensive guide to convex optimization and machine learning. This would be impossible even for a book with 300 pages (personal guess or limitation).
Instead, we want to show you how a canonical formulation that every data scientist is familiar with (or will be one day) — a regularized linear problem — can become impractical if you are restricted to off-the-shelf libraries.
We will also see how the business constraints can guide us to a practical solution, based on prior knowledge that we may have about the problem. Let’s get started.
Canonical Formulation: Cost Function + Regularization
The example that we will focus on considers a regression task, but the concepts that we will discuss are much broader. Take as an example the famous linear regression with least squares:
Notice that the formulation does not present any regularization term, but the common practice tells us to regularize. Together with the Ridge, LASSO, and Elastic Net regularization terms, this problem makes a very popular set of regression methods. Here is a more general representation of this problem:
Where 𝔉(x) is an empirical loss function, ℜ(x) is a regularization term, and λ is the regularization parameter.
In the case of regression, a common choice for 𝔉(.) is the mean squared error, but we can make different choices. We focus here on the MSE loss function.
For ℜ(.), it is common to use the l₂ norm, the l₁ norm for the LASSO problem, or both for the Elastic Net regularization. Let A be our design matrix, of m× n dimensions, b ∈ ℝᵐ our label vector.
We show in the table below the most common configuration for these functions.
Common choices for the loss function and regularization terms.
Of course, these are well-studied problems, with stable libraries in several programming languages, and in practice, you should not be concerned with their implementation details for a business project.
You do not want to spend too much time debugging your implementation; or dealing with edge cases. But let’s walk through some regular use cases in the data science practice and see how this simple problem can get impractical without the proper abilities.
Use Case 1: Fit and Predict
Suppose that you are facing a small regression problem. All your data fits in memory, having less than 2GB or so, and the inference can be done locally on the same machine. Any UCI and most Kaggle datasets will fall into this use case.
Without tricks to fool you, this is the most common scenario where we can readily use an off-the-shelf implementation (for instance, Sklearn).
If you are not using any regularization term — and are also feeling a little bit adventurous — you can just use your favorite programming language to compute the analytical solution to the least-squares problem:
Here are examples of Python functions that: solve Eq. (1) directly; and use the NumPy library for a stable solution, which is the underlying method operating in the Scikit-Learn library.
Use Case 2: Data Doesn’t Fit in Memory
The second use case is more interesting.
Suppose now that the training set is too big to fit in memory, therefore, loading the entire dataset is not an option for this problem.
Clearly, we are not able to use the analytical solution, as we will need to load portions of the data sequentially. We will need to use optimization in order to find a solution to this problem.
In this case, several techniques can be handy, such as the gradient descent (if regularizing with the l₂-norm); the fast iterative shrinkage-thresholding algorithm FISTA (Beck, 2009), or the alternating directions method of multipliers (ADMM) (Boyd, 2011).
Methods that can handle non-smooth functions. Instead of computing the exact solution, we run an iterative process that finds a near-optimal solution, given enough iterations.
Knowing which algorithms we can use, the bigger problem now is to manage loading portions of the data and properly freeing memory. Some packages can even do this as default behavior, such as the biglm R package, based on (Miller, 1992).
Here is a toy example in Python. We use pandas to lazily load a CSV file, train a least-squared model using Scikit-Learn, and make use of the parameters from the previous iteration as warm-start.