**Introduction**

**Online learning** is a subfield of machine learning, sometimes referred to by practitioners as **incremental** or **out-of-core** learning, in which machines need to continuously learn and predict in real time. Lately, this field has been gaining more attention, especially with the continuous training and deployment of machine learning models. In the online learning paradigm, data becomes available in sequential order and is used to update the predictor for future data at each step or point in time. This is in contrast to conventional batch learning techniques, which generate the best predictor by learning on the entire training data set at once. Compared to “traditional” ML solutions, online learning is a fundamentally different approach, one that embraces the fact that learning environments can (and do) change over time. Ideally, we need a model that not only predicts in real time, but also learns in real time. There are two main use cases for adopting online learning:

- The case when it is computationally infeasible to train over the entire dataset
- The case when the model needs to dynamically adapt to new patterns in data that arrives after training

However, online learning is tricky to implement correctly: it is prone to drift and to severe deviations introduced by newly encountered data instances, which raises the risk of catastrophic inferences. The solutions to these challenges are part technical and part machine learning and data governance. ML models learn a set of parameters by training over datasets. Traditionally, these parameters are static and remain unchanged throughout all future predictions. Consider this scenario: we deploy the trained model online, usually as multiple instances of the same model. The models then serve predictions through REST APIs, with the capability of learning on the fly from every new instance received.

A few questions naturally arise in this situation. How do we synchronize the different model instances? Do we update the model example by example, or occasionally after collecting a few instances (scheduled training)? How and when do we measure the model's concept drift? This is just a glimpse of the complications that practitioners employing online learning can face.
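To make the drift question more concrete, here is a minimal sketch of one possible heuristic: compare the error rate over a sliding window of recent predictions against the error rate of the first full window. The class name `WindowDriftMonitor`, the window size, and the margin are all hypothetical choices for illustration, not an established method from the literature cited in this article.

```python
from collections import deque


class WindowDriftMonitor:
    """Flags drift when the recent error rate exceeds a reference rate by a margin."""

    def __init__(self, window=100, margin=0.15):
        self.window = window
        self.margin = margin
        self.reference = None               # error rate of the first full window
        self.recent = deque(maxlen=window)  # most recent prediction outcomes

    def update(self, was_error):
        """Record one prediction outcome; return True if drift is suspected."""
        self.recent.append(1 if was_error else 0)
        if len(self.recent) < self.window:
            return False                    # not enough evidence yet
        rate = sum(self.recent) / self.window
        if self.reference is None:
            self.reference = rate           # freeze the baseline once
            return False
        return rate > self.reference + self.margin


monitor = WindowDriftMonitor(window=100, margin=0.15)

# Synthetic outcome stream (an assumption for the demo):
# 10% errors at first, then 50% errors after the data distribution shifts.
stream = [i % 10 == 0 for i in range(300)] + [i % 2 == 0 for i in range(300)]

alarms = [t for t, err in enumerate(stream) if monitor.update(err)]
first_alarm = alarms[0] if alarms else None
print("first drift alarm at step:", first_alarm)
```

In a real deployment the true labels often arrive late or not at all, so the outcome stream itself may need to be approximated; this sketch assumes labels are available immediately.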

In this article, we present some common misconceptions about online learning. Then, we identify the different types of online machine learning, differentiating between traditional and neural models. We cover how each type of model is currently developed and what the main roadblocks are for each.

**Related but not the same**

When studying online learning, a few other terms appear as related topics in searches, but not all of them are specifically relevant to online learning. This often makes it hard for readers to tell the relevant concepts from the non-relevant ones.

#### Related Concepts:

- **Offline/Batch Learning**: This term refers to the classical learning model where we learn the estimation parameters over the whole dataset, assuming we know the whole data distribution. Gradient descent, either on the overall dataset or on batches, optimizes the parameters of these types of models [1].
- **Incremental Learning**: As shown previously, practitioners usually use this term interchangeably with the term online learning to denote the dynamic nature of estimating the model parameters in the course of learning [2,3,4,5]. Models use new input data continuously to extend their existing knowledge.
- **Extreme learning machines**: These are feedforward neural networks that can perform classification, regression, clustering, sparse approximation, compression and feature learning with a single hidden layer, where the parameters of hidden nodes (not just the weights connecting inputs to hidden nodes) need not be tuned [6]. They randomly choose the input weights and analytically determine the output weights of the single-hidden-layer feedforward network. This kind of network is very fast to train; however, it is a shallow network that is suitable only for a certain subset of problems. Moreover, it traditionally trains on the whole dataset. Later, an online version was introduced to handle a continuous incoming sequential stream of data [7]. Due to the popularity of the technique, a few implementations exist, such as Numpy-ELM, ELM and Python-ELM.
- **One-Shot learning**: This is a special category of machine learning techniques addressing the problem where only one, or a few (**n-shot**), examples are available at training time and many new examples need to be classified in the future. In contrast, most ML algorithms require training on hundreds or thousands of samples per class and on very large datasets. In fact, learning information about object categories from one, or only a few, examples is quite a complex task. This type of learning is not relevant to our topic of online learning, but many people mistakenly confuse the two.
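The extreme learning machine idea described above (random, untuned hidden weights plus analytically solved output weights) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not one of the library implementations named above: the function names `elm_fit` and `elm_predict`, the `tanh` activation, the hidden-layer size, and the toy sine-fitting task are all choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)


def elm_fit(X, y, n_hidden=50):
    """Fit a single-hidden-layer network with random, untrained hidden weights."""
    W = rng.normal(size=(X.shape[1], n_hidden))  # random input weights (never tuned)
    b = rng.normal(size=n_hidden)                # random hidden biases (never tuned)
    H = np.tanh(X @ W + b)                       # hidden-layer activations
    beta = np.linalg.pinv(H) @ y                 # output weights via least squares
    return W, b, beta


def elm_predict(X, W, b, beta):
    """Apply the fixed hidden layer, then the solved output weights."""
    return np.tanh(X @ W + b) @ beta


# Toy regression task (an assumption for the demo): fit y = sin(x) on [-3, 3].
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0])

W, b, beta = elm_fit(X, y)
mse = np.mean((elm_predict(X, W, b, beta) - y) ** 2)
print(f"training MSE: {mse:.6f}")
```

Note that the whole dataset is needed to solve for the output weights in one shot, which is why this basic form is a batch method rather than an online one.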

Thus, none of these forms of learning fulfils the main goal of online learning, which is to learn and predict in real time.

**Types of Online Learning**

### 1) Traditional ML Online Learning

A large body of traditional machine learning algorithms implemented in the popular scikit-learn Python library supports online learning, in the sense of incrementally updating an already trained model by fitting it to new data instances. The *partial_fit()* method that scikit-learn exposes to users supports incremental training on small batches of data. This straightforward support for online learning across a wide range of scikit-learn algorithms comes in handy for a number of applications. *creme* is a library that supports online ML, where the model learns incrementally from one observation at a time. Vowpal Wabbit is another good ML tool that provides fast online learning.
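As a minimal sketch of the incremental pattern described above, the following uses scikit-learn's `SGDClassifier`, one of the estimators that exposes `partial_fit()`. The synthetic data stream, the helper `make_batch`, and the batch size are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)


def make_batch(n=64):
    """Hypothetical stream chunk: the label is 1 when the two features sum above zero."""
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    return X, y


model = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # every possible label must be declared on the first call

# Each iteration stands in for a new chunk of data arriving over time.
for step in range(50):
    X_batch, y_batch = make_batch()
    model.partial_fit(X_batch, y_batch, classes=classes)

# Evaluate on fresh data drawn from the same synthetic distribution.
X_test, y_test = make_batch(500)
accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

The model never sees the full dataset at once; each call to `partial_fit()` updates the existing parameters in place, which is what makes the approach usable when the data does not fit in memory or keeps arriving.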

### 2) Neural Online Learning

In many complex problems, traditional machine learning techniques can only be sub-optimal, as they fail to capture the underlying complex structures. Thus, neural techniques become a favoured solution to such problems. One reason is that deep learning has great non-linear fitting capabilities, which allow it to learn deep intrinsic features from raw data automatically. However, online learning with neural models is far more complex than with the traditional techniques presented above.

Deep learning techniques usually face a few challenges, including (but not limited to) vanishing gradients, diminishing feature reuse, and saddle points (and local minima). Now, let’s discuss how neural network (NN) models are optimized during training. This will give us a notion of how to do it in an online setting. Gradient descent is one of the most popular optimization algorithms and by far the most common way to optimize neural networks [1]. Many implementations are available in popular NN Python libraries such as TensorFlow, Keras and Caffe. Gradient descent optimization can take one of three forms, which differ in how much data is used to compute the gradient of the objective function:

#### Optimization Techniques:

- **Batch gradient descent**: This is the vanilla gradient descent technique, where the gradient of the cost function with respect to the model parameters θ is computed on the entire training dataset, for a prespecified number of epochs. It can be very slow and is intractable for huge datasets. This type does not allow online learning, i.e. updating the model with new examples on the fly, as it assumes the whole dataset is known in advance.
- **Stochastic gradient descent**: In contrast to batch gradient descent, the parameters are updated for each training example. This makes training faster and allows it to be adopted in an online learning setting. However, the frequent updates have high variance, which can cause the objective function to fluctuate heavily during training and complicate convergence to the exact minimum. Slowly decreasing the learning rate has been found to reduce this overshooting and helps in eventually reaching the minimum [1].
- **Mini-batch gradient descent**: This is the most widely adopted gradient optimization technique. It updates the parameters for every mini-batch of n (64, 128, 256, etc.) training instances. This way it takes the best of both worlds: a) it reduces the variance of the parameter updates, and b) highly optimized matrix operations can be used to compute the mini-batch gradient very efficiently.
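The three variants differ only in how many examples feed each parameter update, which the following NumPy sketch makes explicit on a small linear regression problem. The synthetic data, the `train` helper, the learning rate, and the epoch count are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear regression data (an assumption for the demo).
X = rng.normal(size=(512, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + 0.01 * rng.normal(size=512)


def gradient(theta, X_part, y_part):
    """Gradient of the mean squared error of a linear model on the given rows."""
    residual = X_part @ theta - y_part
    return 2.0 * X_part.T @ residual / len(y_part)


def train(batch_size, epochs=50, lr=0.05):
    """Run gradient descent where each update uses `batch_size` examples."""
    theta = np.zeros(3)
    n = len(y)
    for _ in range(epochs):
        order = rng.permutation(n)  # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            rows = order[start:start + batch_size]
            theta -= lr * gradient(theta, X[rows], y[rows])
    return theta


theta_batch = train(batch_size=len(y))  # batch: one update per epoch
theta_sgd = train(batch_size=1)         # stochastic: one update per example
theta_mini = train(batch_size=64)       # mini-batch: one update per 64 examples

for name, theta in [("batch", theta_batch), ("sgd", theta_sgd), ("mini", theta_mini)]:
    print(name, np.round(theta, 2))
```

Only the stochastic and mini-batch settings can consume examples as they arrive; the batch setting needs the full dataset for every single update.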

When only a small number of instances is available, quick convergence is the priority, and thus shallow networks are preferred, whereas for a large number of instances, long-run performance is enhanced by using a deeper network. Consequently, most existing online neural learning algorithms are designed to learn shallow models for simplicity, using either linear optimization methods [8] or kernel-based optimization methods [9]. Multiple kernels have also been employed for online learning, where a kernel-based prediction function is learned from a pool of predefined kernels in an online fashion [10,11]. Unfortunately, these methods do not allow for the complex nonlinear representation learning that complicated application scenarios naturally require.

On the other hand, one way to develop online deep learning solutions is to directly apply standard backpropagation training on single instances. Nevertheless, a key challenge for such a technique is the choice of proper model capacity (e.g., the depth of the network). If the model is too complex (e.g., a very deep network), the learning process will converge too slowly (due to vanishing gradients and diminishing feature reuse), thus losing the desired property of online learning.

Learning a deep NN on the fly [4] is one of the best-known recent attempts to bridge the gap between online learning and effective deep learning. The authors start with a fast shallow network that converges easily, then gradually switch to a deeper model automatically as new data comes in, employing Hedge Backpropagation (**HBP**) for training. Fortunately, the authors have made an implementation of this approach publicly available for practical examination of the model. However, [12] report that in some cases there can be a delay in weight updates using HBP, which can make it difficult to train the lower layers of the network and to adaptively update parameters. To address this, and to target the capacity scalability and sustainability challenge when data streams change in nature over time, [12] introduced an incremental adaptive deep model (**IADM**).
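To give an intuition for how a model can shift from shallow to deep automatically, here is a simplified sketch of a Hedge-style weighting step: each depth has its own classifier with a weight, and each weight is discounted according to the loss that classifier just incurred. This is an illustration only, not the authors' implementation; the function name `hedge_update`, the discount `beta`, the smoothing floor, and the fixed per-depth losses are assumptions for the demo, and the backpropagation part of HBP is omitted entirely.

```python
import numpy as np


def hedge_update(alpha, losses, beta=0.99, smoothing=0.2):
    """Discount each depth's weight by its loss, floor it, and renormalize."""
    alpha = alpha * beta ** losses                          # larger loss, larger discount
    alpha = np.maximum(alpha, smoothing / len(alpha))       # keep every depth alive
    return alpha / alpha.sum()                              # weights sum to one


# Three classifiers attached at increasing depth, starting with equal weight.
alpha = np.full(3, 1.0 / 3.0)

# Hypothetical fixed losses: early in the stream the shallow classifier does best.
per_depth_losses = np.array([0.2, 0.6, 0.9])

for _ in range(200):
    alpha = hedge_update(alpha, per_depth_losses)

print("weights per depth:", np.round(alpha, 3))
```

With these losses the shallowest classifier ends up with the largest weight; if the deeper classifiers later start incurring lower losses, the same update would move weight toward them.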

Thus, to examine whether an online deep learning algorithm is sound and viable, there are a few questions that need to be answered:

- How to choose the network depth and whether it is static or dynamic?
- What is the most appropriate training strategy that avoids common deep learning issues (e.g. vanishing gradient)?
- How to change the capacity of the network?
- When to change the capacity of the network?
- How to perform the steps above in an online setting?

**Online Learning Challenges Summary**

Online learning is a complex machine learning sub-problem where the main purpose is to learn and predict in real time. Many complications arise in the process of training and deployment. In this article, we highlighted some of the problems that practitioners in the field face. We also tried to cover the problem from different dimensions based on the current state-of-the-art techniques. Along the way, we raised a few challenging software engineering concerns for online ML training, covering how to deploy, test and update online machine learning models. Below, we summarize the set of challenges that practitioners face while developing such ML techniques:

- How to choose the capacity of the online models?
- What are the acceptance criteria for training new online models?
- How to synchronize between different deployed online model instances?
- Would you consider a mediator to orchestrate distributed model synchronization?
- How to better schedule training of online models (by single examples or a sequence of examples)?
- How and when to measure model concept drift?

## **Conclusion**

In this article, we only scratched the surface of the fascinating ML subfield of online learning. The main goal of such models is to learn and predict in real time. We explored the important definitions together with the most interesting current techniques and applications. Moreover, we pinpointed several issues that can block the development and deployment of online learning algorithms. Several popular libraries currently support online learning with traditional ML approaches to an acceptable extent. On the other hand, there is still much to do in the arena of neural approaches. We showed the challenges, the criteria for judging proposed solutions, and possible directions for advancement. We also showed that applying online learning becomes much more manageable after defining your problem setting and answering a few questions, which helps in clearly stating the criteria that matter most for reaching your goal.

### References:

[1] Ruder, Sebastian. “An overview of gradient descent optimization algorithms.” arXiv preprint arXiv:1609.04747 (2016).

[2] Huang, Guang-Bin, Lei Chen, and Chee Kheong Siew. “Universal approximation using incremental constructive feedforward networks with random hidden nodes.” *IEEE Trans. Neural Networks* 17.4 (2006): 879-892.

[3] Liang, N. Y., Huang, G. B., Saratchandran, P. and Sundararajan, N. A fast and accurate online sequential learning algorithm for feedforward networks. *IEEE Transactions on neural networks*, *17*(6), 1411-1423. (2006)

[4] Doyen Sahoo and Quang Pham and Jing Lu and Steven C. H. Hoi “Online deep learning: Learning deep neural networks on the fly”. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI); 2660-2666 (2018)

[5] Polikar, Robi, et al. “Learn++: An incremental learning algorithm for supervised neural networks”. *IEEE transactions on systems, man, and cybernetics, part C (applications and reviews)* 31.4: 497-508. (2001)

[6] Huang, Guang-Bin, Qin-Yu Zhu, and Chee-Kheong Siew. “Extreme learning machine: a new learning scheme of feedforward neural networks.” *Neural networks* 2: 985-990. (2004)

[7] Huang, Guang-Bin, et al. “On-line sequential extreme learning machine.” *Computational Intelligence* 2005: 232-237. (2005)

[8] Crammer, Koby, et al. “Online passive-aggressive algorithms.” *Journal of Machine Learning Research*, Mar: 551-585. (2006)

[9] Hoi, Steven CH, et al. “Online multiple kernel classification.” *Machine Learning* 90.2: 289-316. (2013)

[10] Jin, Rong, Steven CH Hoi, and Tianbao Yang. “Online multiple kernel learning: Algorithms and mistake bounds.” *International conference on algorithmic learning theory*. Springer, Berlin, Heidelberg, 2010.

[11] Sahoo, Doyen, Steven Hoi, and Peilin Zhao. “Cost-sensitive online multiple kernel classification.” *Asian Conference on Machine Learning*. 2016.

[12] Yang, Yang, et al. “Adaptive Deep Models for Incremental Learning: Considering Capacity Scalability and Sustainability”. *Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining*. 2019.