Thank you. I'll be talking about a very interesting topic: machine learning system design. This is one topic which is not taught in AI schools, but it is an essential building block for solving machine learning problems. My background is in academic research in machine learning, and I have also been involved in putting machine learning systems into production, so I can relate to the fact that there are big differences between machine learning in an academic setting and machine learning in production. I'm going to touch upon those nuances in my presentation. I hope this will be useful for folks who are trying to venture into machine learning, and also for people who are trying to deploy their machine learning systems in production. So, starting with my presentation: as I mentioned, there are real differences between machine learning as we see it in academic research, be it AI schools or university research, versus machine learning as we practice it in a production setting.
There are some big differences. The very first difference is that in the research setting, the data is more often than not a curated one. Most of the time, the effort is spent on achieving that incremental training accuracy, that enhanced training accuracy. We try all kinds of modeling techniques, and sometimes, in fact most times, an ensemble of those modeling techniques, to go that extra mile in training accuracy. In a production setting, on the other hand, there is the concept of model serving, which is nothing but this: you validate the model you have built, you deploy it for consumption, and then you maintain that model. This whole bundle of steps is termed model serving. In production, we put a lot of emphasis on the model serving component, because that is where our stakeholders are going to consume the model.
For them, an accuracy of, say, 85 versus 84.34 might not feel very different, because the consumer might not perceive that difference when they are consuming, for example, a recommendation system. But in an academic setting, or in a competition like Kaggle, this might be a difference of around 200 ranks or more. That is the kind of difference I'm talking about between an academic setting and a production setting. The second aspect I want to touch upon is that in academia we lay a lot of emphasis on state-of-the-art results. We try to come up with a technique which is better than the technique we arrived at, say, six months back, and we want to beat the results we achieved six months back. In production, the objective is very different: we want to solve problems for our stakeholders, and how efficiently we are able to solve them determines whether our solution is successful or not. It might not achieve state-of-the-art results, but if it is solving the problem for our stakeholders, it is doing its job.
In an academic setting, we usually have a data set which is quite consolidated and quite clean, so model refinement overtakes the data cleaning part, and a lot of time is spent on model selection and on trying to come up with the best modeling technique. In production, you rarely get a data set which is clean. You have to constantly work with your stakeholders to identify the right data sources, and then aggregate those data sources in such a way that you can do some intelligent feature engineering on them. These steps usually take the most time. I'll be speaking about this in my later slides; these steps are called the creation of the data pipeline, and this is the step which takes the maximum amount of time. The model building phase is actually much less time consuming by comparison, which is a major difference from what we see in the academic setting.
Now, the other major difference is that in research we generally end up building massive models, be it in terms of size or in terms of computational consumption. In production, we have to be very careful to resist the temptation of building very massive models, because ultimately it might be very difficult to deploy them, and we might face all kinds of issues when we are trying to solve the problem of model serving with these heavy models; even maintaining these heavy models can be quite challenging. So in production we have to consider this trade-off between effective, lightweight models and complex models. With that background, I would like to move to the components of machine learning system design. I have grouped the different steps into four building blocks, and most machine learning problems, if not all, can be solved with these four building blocks.
You can think of these steps as a universal four-step methodology for solving any machine learning problem. The very first step is the problem formulation stage. This is a very crucial step where you sit down with your stakeholders and try to frame the problem statement tightly, in such a way that all the pain points of the stakeholders are being addressed and there are no unrealistic expectations of the machine learning solution which is going to come out of it. This is the stage where you jot down all the requirements and then formulate your problem statement. Once that is done, you discuss and discover what data is available to you from the different data sources your stakeholder has, which you can explore and which might be useful for solving that particular problem statement.
This comes under the data pipeline stage, where you aggregate the data, consolidate it, and then clean it. All the other stages of feature engineering and feature formulation will be part of this particular step. In model building, we do the very crucial step of feature engineering, then model selection and model refinement, and then we try to arrive at a quite competent in-model accuracy. I term it "in-model" because up to this stage, you are not exposing your model to your validation data set, your real live data set. You have carefully kept that holdout sample aside, and it will be exposed to the model only in the last stage, which is the model serving stage. I have broken down model serving into validation, deployment, consumption, and maintenance. This model serving stage is a very crucial stage for deployment, where you choose the right validation metric.
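As a minimal sketch of the holdout discipline described above (illustrative Python/NumPy, not from the talk; the data and function names are my own), carve the holdout sample out once, up front, and let all of model building see only the remaining partition:

```python
import numpy as np

def split_holdout(X, y, holdout_frac=0.2, seed=42):
    """Carve out a holdout sample up front. Model building (feature
    engineering, model selection, refinement) sees only the 'work'
    partition; the holdout is touched only at the model-serving stage."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))            # shuffle row indices once
    n_holdout = int(len(X) * holdout_frac)
    hold_idx, work_idx = idx[:n_holdout], idx[n_holdout:]
    return (X[work_idx], y[work_idx]), (X[hold_idx], y[hold_idx])

# Illustrative data
X = np.arange(100).reshape(50, 2)
y = np.arange(50)
(work_X, work_y), (hold_X, hold_y) = split_holdout(X, y)
```

Fixing the seed makes the split reproducible, so the same holdout rows stay hidden across every modeling iteration.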
Then, once you are satisfied with that validation metric, and once your stakeholder is satisfied with the results you are achieving, you go ahead and deploy the model, either on the cloud, on premise, or on the stakeholder's application. You then monitor the consumption for a certain time period, after which the maintenance of the model, the periodic refreshes, takes over. Even though in the diagram you will see a sequence to these stages, they are actually cyclic as well. For example, if in the data pipeline stage you realize that you might not have sufficient data for solving that particular problem formulation, then you will go back and discuss with your stakeholder whether the problem statement has to be tweaked, because you don't have sufficient data for the earlier problem statement.
And in the model building stage, when you are doing a quick-and-dirty POC, you might realize that, again, the data is not sufficient to achieve the accuracy which was discussed earlier. Then you will again go back to the data pipeline drawing board, or sometimes you might even have to go all the way back to the problem formulation itself. All these building blocks are cyclic in nature. So those are the overall building blocks of machine learning system design. Now I'll move on to the challenges we face, because I have very carefully jotted down some major challenges which we face in production and which are different from what we face in academic settings. The first and foremost challenge is data challenges. Anyone who has solved a machine learning problem in production will tell you that this is the most crucial stage,
and most of his or her time will be spent trying to circumvent the data challenges: be it quality, be it trying to identify the right set of features for your problem statement, or trying to identify the right data sources to satisfy that problem statement. Most of the time is spent there, and with data privacy coming into the picture, and with solving for data biases coming into the picture, this stage has only become more and more complicated. The second challenge, which is a continuation of this but with a slight difference, is addressing the inherent biases in the data. With topics like AI ethics gaining a lot of prominence, and rightly so, we have to be all the more careful to weed out the inherent biases in the data. I want to touch upon this topic a bit more.
This is a pretty difficult problem to solve, because if you think about it, we cannot completely weed out the bias in the data: these biases, in a way, are helping us achieve accuracy which is above average. I'll give an example. If you have a feature which is allowing you to discriminate between the positive class and the negative class, that feature might also be adding to the bias, right? So it's not so simple that you can just weed out the biases. You have to be very careful not to kill the predictive power of your models while removing the biases. It's a very fine line when you are trying to address the bias issue in the data. Now, the third point is: how do you keep your model explainable? There is an entirely separate branch of AI called explainable AI, which is quite prominent at the current time.
There is a lot of research happening on making models explainable; even the likes of deep learning models are now being made explainable to an extent. We have to be very conscious that the models we serve are explainable, so that our stakeholders can actually question what the data scientists have built, and are able to understand what was built rather than just consuming it. I'm talking about the stakeholders, not the end consumers, because the end consumers might not be very bothered about the machine learning model under the hood, but it is very important for the intermediate stakeholders to know what is happening. So the explainability of the model is a very important consideration as well. And at times you again have to make a trade-off: do you want that slight increase in accuracy, or do you want the model to be more explainable? For example, in autonomous driving, where each increment in accuracy is directly related to the safety not only of the drivers but also of the people around the car, you may have to let go of the explainability of the model.
But when you are trying to understand consumer sentiment in the CPG domain, you can trade off on the side of making the model more explainable and give up a few notches of accuracy.
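One common, model-agnostic way to get at the kind of explainability discussed above is permutation importance: shuffle one feature at a time and measure how much the prediction error grows. Here is a minimal sketch (illustrative Python/NumPy; the toy model and data are my own assumptions, not from the talk):

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Shuffle each feature column in turn and measure how much the
    mean squared error grows over the baseline. Works on any black-box
    predict function, so stakeholders can probe even complex models."""
    rng = np.random.default_rng(seed)
    base_err = np.mean((predict(X) - y) ** 2)
    importances = []
    for j in range(X.shape[1]):
        errs = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # break the feature-target link
            errs.append(np.mean((predict(Xp) - y) ** 2))
        importances.append(np.mean(errs) - base_err)
    return np.array(importances)

# Toy example: the target depends on feature 0 only
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=200)
imp = permutation_importance(lambda X: 3.0 * X[:, 0], X, y)
# imp[0] is large, imp[1] is ~0: feature 0 carries the signal
```

Because it only needs the model's predictions, the same check can be shown to a stakeholder for a deep learning model just as easily as for a linear one.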
Last but not least is selecting a proper validation metric for the model. In an academic setting, we do research on quite a different set of validation metrics, but in a production setting this becomes all the more important. I'll be touching upon this in a later slide, but just to say it at this moment: in a production setting it becomes very important to frame your metric in such a way that it satisfies the needs of the problem. The regular validation metrics might not always be suitable to do justice to the problem statement. This is another challenge which the data scientist faces in production.
The major issue, and I must say it in capital letters, is managing the expectations of the stakeholders. Many a time we are in a discussion with a stakeholder where we feel that machine learning is still viewed as something which can solve everything: AI can solve everything about a particular use case if we have the data. But that is not always true; we have seen it multiple times. The most recent example would be the prediction of the US elections, not only this time but also last time. Even though we had a good amount of data available to us, machine learning might not magically be a hundred percent accurate all the time, and we can scientifically prove why it is not. The major point here is that the expectations have to be set right, and that too at the start itself: for the use case we are trying to solve, machine learning will help the stakeholder do a better job than average prediction, but it might not be a hundred percent accurate all the time, especially in times when the data is very volatile, like the current pandemic times.
I'm speaking from experience here. Suppose you are trying to forecast demand from the consumer side, and you have a model which is trained on historical, pre-COVID data, whereas right now consumer trends have completely shifted and we are living in the new normal. At that time machine learning might not do a very good job, right? Stakeholders have to be aware of these limitations.
Now the second point, which we often see when working with different stakeholders and different companies. Most of you might be from consulting companies, so you might agree with this: when we try to solve a problem statement for a company, we see that they are trying to achieve what the leading research labs are doing. But when you take a step back, they actually do not need to do that. Their problem statement might be solved with much simpler models rather than those complex, computationally heavy models, which would be very difficult to deploy in their applications, very time consuming, and quite data hungry; that particular company might not have sufficient data to do justice to such a heavy model. So instead of doing research on implementing the latest state-of-the-art technique, the problem statement can often be solved by much simpler, much more explainable, but still effective models.
This is a point which has to be driven home to the stakeholders. Now, the third point is something I've already explained, but it is so important that it requires reiteration: tight problem framing goes a long way in determining whether the solution will be a success or not. Again, this is an iterative process. One cannot expect the problem statement formulation to be magically correct in the first go itself. It is actually a cyclic process where you sit down with the stakeholder, try to understand his or her pain points, and then formulate them into a mathematical problem in such a way that all the pain points are addressed. Then you map it to the data which he or she has, and you see what the trade-offs are: whether that particular problem statement can be solved with the data available, or whether you have to tweak or modify the problem statement.
These things, in my experience, you only come to know while working in a production setting; it is very difficult to get this experience in academic settings. The other very important thing is that you should avoid reaching for a very sophisticated state-of-the-art modeling technique in the first go when solving these problem statements for your stakeholders. Sophisticated modeling techniques might give you slight incremental accuracies, but they can be detrimental when you are trying to deploy in the production setting, and maintaining them will be quite expensive. If instead you spend that effort on getting the right data and addressing the data quality issues, the enhancement you'll get in accuracy will be immense, almost limitless. Whereas if you keep changing your technique and making it more and more complex, the incremental enhancement you'll get might not be worth it. So focus first on the data, on the quality of the data and on the data transformation part, and only then on the modeling part; that would be my takeaway.
This particular slide speaks about choosing the right validation metric for your problem. This comes under the validation part, which again is a part of model serving, where off-the-shelf metrics might not suit your use case. I'll give you an example. In the current pandemic times, where we have seen all kinds of changes in consumer behavior, it is quite difficult to produce the right forecast, and that too based on training done on pre-COVID data. It is quite difficult to get an accuracy in the range of, say, the high nineties or high eighties, but what you can achieve is good directional accuracy. Right now, companies would be very happy if you can tell them that there will be an increase in a particular product category by such-and-such percentage, and if there actually is an increase, that will be quite useful for the business.
But if you say that there is going to be an increase of 5%, whereas there was actually a decrease of 10%, that might not go down very well with the stakeholders. So the metric you would define here would be something like POCID, that is, percentage of change in direction: how many times you are able to correctly capture the peaks and troughs in the data you are predicting. This is very different from trying to accurately predict how many packets you'll sell. So, depending on the use case, you have to determine the right validation metric. A second example might be the large retailers, the Walmarts and the Targets. They might be very interested in knowing what the accuracy will be around, say, Thanksgiving or Christmas, rather than knowing a blanket number like the overall accuracy metric. In that case, you have to give extra weight to the predictions in these key segments of the year, and the overall weightage would be determined by these key segments. So choosing the right validation metric is an essential component in making your modeling solution a success or a failure. With that, I conclude my presentation, and I am open to questions.
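To make those two metric ideas concrete, here is a small sketch (illustrative Python/NumPy; the numbers and weights are made up for the example): a POCID-style directional accuracy, and a weighted percentage-error metric that up-weights key periods such as holiday weeks.

```python
import numpy as np

def pocid(actual, forecast):
    """Percentage of Change In Direction: how often the forecast
    captures the direction (up vs down) of the next move."""
    return 100.0 * np.mean(np.sign(np.diff(actual)) == np.sign(np.diff(forecast)))

def weighted_mape(actual, forecast, weights):
    """MAPE where key segments of the year (e.g. Thanksgiving or
    Christmas weeks) carry extra weight in the overall score."""
    ape = np.abs((actual - forecast) / actual)
    return 100.0 * np.average(ape, weights=weights)

actual   = np.array([100.0, 110.0, 105.0, 120.0, 150.0])
forecast = np.array([102.0, 112.0, 115.0, 118.0, 160.0])
weights  = np.array([1, 1, 1, 1, 5])  # last period is a key holiday week

direction_score = pocid(actual, forecast)              # 3 of 4 moves in the right direction
holiday_score   = weighted_mape(actual, forecast, weights)
```

A forecast can score well on POCID while its point estimates are mediocre, and vice versa, which is exactly why the metric has to be chosen to match what the stakeholder actually cares about.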