Event Recording

OT Patch Management Best Practices

Name: OT Patch Management Best Practices
Uploaded: 2022-05-13T12:00:00+02:00
Duration: 19 min 25 s

Posted on May 13, 2022

Speaker

Fulup Ar Foll

Founder and Lead Architect
IoT.bzh

So I'm Fulup Ar Foll. I'm the CEO of a small company located in Brittany, on the west coast of France. We are 30 Linux engineers. We live in Lorient, which is a harbor city about 15 minutes from the beach and the surf spots. Okay.

You can connect to your customers with a video call. You cannot go surfing with video calls. So it's better to be by the beach. We are especially known for the work we've done in the automotive sector, especially with automotive Linux.

So, the patch and update equation. It is effectively a complex equation, so we will try to look at a few things here. The first thing is to understand that there is a key difference between the traditional IT environment (and I just start my timer, okay) and the embedded world. The first one is clearly that a ten-year lifetime in the embedded world is very common. Okay. The average age of a car in Europe is 11 years.

The average age of a ship is 25 years, and you have many industrial systems that remain working for 10, 15, 20 years, and changing anything is a very risky operation. Okay. When you have a complex industrial system, it may have taken years to assemble it, so changing something is always a risky operation, and everyone is reluctant to do it. And also, let's be honest, most people are not used to updating software, so they tend to consider that they should not change anything. And going to production is very often very long. Okay.

I'll just take a simple example. In Brittany, we have a city named Rennes where Siemens is installing a subway, and they have been testing line B for two years now. It means that they have been running empty subway trains for two years and it's still not in production. Okay. So obviously, when it takes two years to put your system in place, you're very reluctant to have a software guy coming in and saying, oh yes, I'm only going to change the software, don't worry about it. Okay. And so software is clearly a second-class citizen. Okay. It's a feature, it's not a goal. Okay.

And I think we have to take that into account. Now there are some issues coming for that business as well. The first one is clearly complexity. Okay. Embedded software used to be simple. If you have a five- or ten-year-old car, you probably only have a few hundred thousand lines of code running in your car. If you have a new car with ADAS, you probably have between 60 and 100 million lines of code, okay. And people are talking about 300 to 500 million lines of code for a fully automated car. Okay. So the complexity is getting completely crazy. And I think we are all software engineers.

So we know that as soon as you have more than 1,000 lines of code, you have at least one bug. Okay. That's just the reality. The second issue coming into this equation is the fact that we have more and more interconnection. Okay. A standalone system is not an option anymore. And you may have many unknown devices in the loop. The first one, typically, is the user's phone, okay, whether it is used to control the system, for the UI, or for second-factor authentication as presented before. But you may also have supervision.

You may have a navigation system, you may have payment, you have a lot of things coming into your system. Oh, I think we lost the presentation. Another element is that you have hundreds or even thousands of components. Typically, in an embedded system like your TV, you have 2,000 to 3,000 components that are assembled together. And these components have different life cycles, different owners and different update strategies. Okay. And last but not least, there is obviously the problem of how you propagate privileges in that environment. Okay.

The fact is that not only do you have to deal with the traditional dynamic privileges, like user identity or usage context, you also have to deal with static privileges: is this application allowed to talk to that one? Okay. And you have to make sure you can propagate these privileges from the cloud to the sensor. As a result, obviously, cybersecurity risk is getting bigger and bigger. And, you know, this is the Ponemon Institute research, and the car, for example, is number two on the list.

And typically, attacking IoT and industrial IoT is a valuable business because you are attacking expensive equipment. Okay. And if it is expensive, it means there is some money to be made by attacking it. Some people will consider that these systems are complex and are going to be very expensive to attack.

Okay, this is security by obfuscation, and unfortunately it does not work. We know it does not work. And globally, the industry has little to no knowledge about cybersecurity, and little return of experience on that subject. The last point, which was also addressed yesterday in one of the talks, is the fact that we are running on embedded systems, so we have very limited hardware resources to implement security. Okay. It's very common, for example, to have less than one gig of RAM, and you only have two ARM cores to run your system.

And you are probably going to be given only 3% of the system to implement the security model. Okay. So that's not a lot of power. So you have very limited resources to implement your security model. But not only that, you have a lot of specific hardware with a lot of legacy things. A typical example in the industry might be Modbus, or CAN in a car. These buses are not protected at all. So in a car, if you connect to the CAN bus, anyone can do anything. Okay? Because there is no protection on the bus.

And not in all industries, but in many of them, we also have a lot of unique or almost unique systems. If you take two ships, for example, they are not equivalent, so the IT system is probably going to be different. Okay. The hull might be the same, but everything inside is different, which obviously makes it very hard to share the cost between projects. But it also makes it very hard to leverage experience: when you have an attack on one system, to replicate that knowledge on another one. And last but not least, you have certification, and certification, from a cybersecurity and patching point of view, is a nightmare. Okay.

As I said before, if you take two years to put your system in production, you are not going to accept updating it in 48 hours. And if certification takes six months, you cannot implement your patch in less than 30 days, and 30 days is what the automotive regulation is mandating today, okay, between the time you have a known vulnerability and the time this vulnerability is corrected on your cars. So you have to significantly shorten the certification time. Historically, certification relies on the V-cycle, and the V-cycle, unfortunately, is both expensive and long.

And it was designed in a period when the system would run with the same software all along its lifetime. Okay.

So it was acceptable to take one year to certify your system, because the software would run for 10 years. But obviously this does not work in the new world, where we need to go to market faster, but we also need continuous improvement because of cybersecurity. And just as a reminder, on Linux we have between three and five known vulnerabilities per day. Okay. And just on the core Linux packages, we have between nine and 15 updates per hour. Okay. Which means that it is common to have 1,000 updates per day. Okay. So obviously you're not going to answer that with a traditional model.

So there are some options, nevertheless, to move forward and to try to solve this impossible equation. The first thing is to try to ask yourself some very basic questions. And if you can answer more than two of those questions, you are really advanced compared to the rest of the industry. The first one is: if you have a known vulnerability, let's say glibc, in a given version, has a bug, how many of my devices are using this library? Okay. Probably 99% of the industry is not able to answer that first question. Okay.
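The first question above (which devices ship a given vulnerable library?) can be sketched as a lookup over a per-device component inventory. This is only an illustration: the `FLEET` data, the device names, the version numbers and the `affected_devices` helper are all hypothetical, and a real fleet would feed this from its build manifests or a software bill of materials.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One software component recorded in a device image (hypothetical schema)."""
    name: str
    version: str


# Hypothetical fleet inventory: device id -> components of its installed image.
FLEET = {
    "gateway-001": [Component("glibc", "2.31"), Component("openssl", "1.1.1k")],
    "gateway-002": [Component("glibc", "2.35"), Component("openssl", "3.0.2")],
    "sensor-017": [Component("musl", "1.2.3")],
}


def affected_devices(fleet, library, bad_versions):
    """Return the sorted ids of devices shipping `library` in one of `bad_versions`."""
    return sorted(
        device
        for device, components in fleet.items()
        if any(c.name == library and c.version in bad_versions for c in components)
    )


if __name__ == "__main__":
    # Which devices carry the (hypothetical) vulnerable glibc build?
    print(affected_devices(FLEET, "glibc", {"2.31"}))
```

The point of the sketch is that the answer is only cheap if the inventory was recorded at build time; reconstructing it from devices in the field is the hard part.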

Obviously, once you have detected the vulnerability, the second question is how you are going to fix it. Another important question is: how do I fix a system that is eight years old?

I don't know if you ever tried, but take an old system, let's say more than five years old, and try to apply your patch to it. Okay. You will see that it is not simple. Okay. How do I update the device when I have unstable connectivity?

It means that not everyone has 5G. Okay. You have a lot of countries where the connectivity is very poor. You may also have systems that are on the third floor underground with no connectivity. So how do you deal with that? Okay. And how do I test my system before pushing to production? If I only have almost unique systems, how do I reproduce an environment that is significant, so I can do the testing? And what is the impact of the update on my certification? So the good thing is that standardization is forcing the industry to change. Okay.

Two years ago, when you were talking to the automotive industry about continuously updating their cars, they would say, it's not a problem for us. Okay. The new regulations, UN R155 and R156, are going to be enforced in Europe from July this year. Okay. And suddenly everyone woke up with: we need to have an update mechanism. Okay. So standardization is really the way to go. You also have to completely automate your process. Okay? You are not going to run that by hand. Okay.

As I said before, if you have hundreds or a thousand changes per day, you're not going to deal with those changes manually. So you need to have a full CI/CD software factory. You need to have automatic test reporting, so your tests run continuously and automatically. You need to automatically generate all the images to run the tests with the different hardware and the different versions of the software you may have. You obviously have to run with containers, and you have to run that in both a native and a cross environment.

We should not forget that in the embedded world, most of the time we are not running in a native world, we are running in a cross world. Okay. Because the systems are too small, we cannot build the system on the targets themselves. Once you have produced the right code, the second issue is to make sure you run the right code, because if update is the right way to solve vulnerabilities, update is also a way to push vulnerabilities onto your system. Okay. And Tesla is a good example where, two years ago, people pushed malware into the Tesla through the update system. Okay.

So update is very dangerous, but it is also the only option to solve the issue. So you have to understand how to do your update.

And, you know, we had a question before on how we authenticate the system. Here, authentication of who is doing what is extremely important. Okay. Another key element when you do updates is to make sure that all your security rules are generated automatically and correspond to what your application is supposed to do. Okay. So that's another key element, which is probably specific to the embedded world. And then, from an architecture point of view, you want to run everything in a containerized microservice architecture.

The reason we want to do that is that if I do a patch update on my geolocation system, I want to make sure that this cannot have an influence on the rest of the system. So I only have to redo the tests on this small container, and I can restrict the amount of testing that I have to do to go back to production. Obviously you have to be transport agnostic, but I would say that's probably normal business. But containers and microservice architecture are extremely important. The microservice architecture is also where we implement all the introspection. Okay.
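The idea of restricting re-testing to the containers touched by a patch can be sketched as a set intersection between the patched packages and each container's contents. This is a minimal sketch under stated assumptions: the `CONTAINERS` mapping, the package names and the `containers_to_retest` helper are hypothetical, and it assumes the containers are truly isolated from each other.

```python
# Hypothetical mapping: container name -> packages bundled inside that container.
CONTAINERS = {
    "geolocation": {"gpsd", "libc"},
    "navigation": {"libc", "map-renderer"},
    "payment": {"openssl", "payment-agent"},
}


def containers_to_retest(containers, patched_packages):
    """Return the sorted names of containers that bundle at least one patched package."""
    patched = set(patched_packages)
    return sorted(name for name, packages in containers.items() if packages & patched)


if __name__ == "__main__":
    # A patch limited to the geolocation stack only triggers that container's tests.
    print(containers_to_retest(CONTAINERS, ["gpsd"]))
    # A patch to a shared library touches every container that bundles it.
    print(containers_to_retest(CONTAINERS, ["libc"]))
```

The second call shows the limit of the approach: a library shared by many containers widens the test scope again, which is one more argument for limiting the number of versions in use.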

So we can certify the system is behaving correctly. You need to have a serious release policy. Okay. It's not like: we have a bug.

We do a patch, we send it, we install it, okay, and we forget about it. Okay. That's not working. Every time you do a release, you should think: how will I maintain and patch that release 10 years from now? Okay. Which means you have to keep all the assets that you used to build the release, which means not only the source code, but also the compilers and all the tools. Okay. And for example, at IoT.bzh, when we build a system, we cut the network. Okay. That way we can guarantee that no resource can be downloaded dynamically. Okay.

If the build process tries to download something from the network, the build is going to break, even if it's a small image, a JPEG file. Okay. Because 10 years from now, we are almost sure that this URL is not going to be valid. Okay. So you have to download everything, and you have to keep continuously updating your code. If you wait five years, and five years from now you say, oh, we have a bug, let's try to fix it, you will have something like 20,000 errors at compilation time, which is just impossible to fix. Okay.
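One way to picture the "keep every build asset" rule is a pre-build check that every input listed in a release manifest is present in a local archive with the expected checksum, so the build never needs the network. This is a hedged sketch, not the process used at IoT.bzh: the manifest format (file name to SHA-256 digest), the archive directory layout and the `check_archive` helper are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_archive(archive_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means every build input
    named in `manifest` is archived locally and unmodified."""
    problems = []
    for filename, expected in sorted(manifest.items()):
        candidate = archive_dir / filename
        if not candidate.is_file():
            problems.append(f"missing: {filename}")
        elif sha256_of(candidate) != expected:
            problems.append(f"checksum mismatch: {filename}")
    return problems
```

A build script could refuse to start when `check_archive` returns a non-empty list, which catches a missing tarball or image before the compiler does.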

So you have to make sure that you keep the system under control during the whole lifetime of your equipment. You have to keep the list of known vulnerabilities. You cannot fix everything; it's impossible. Okay. So you have to be in a position to tell your customer: okay, in your current version of the system, these are the open vulnerabilities you have, but in your context they are not critical. Okay. So you have to manage and mitigate risk like that. And you have to limit the number of flavors, you know, in the embedded world.

Everyone tends to say, we are so specific that we need something unique. Okay. That's not true. Okay. So you have to explain to them that you have to run one given version of the kernel, one given version of the middleware, one given version of everything, because otherwise you will try to boil the ocean by maintaining millions, hundreds of millions, or billions of lines of code, which is obviously not going to fly. And you have to prepare for easy certification.

As we said before, certification is an issue, which means you have to generate as much documentation as possible to make the certification simple. There are nevertheless some unsolved issues, in my opinion, and the first and most critical one is certification. That's the only one I'm going to talk about. Okay. As I said before, if you have a security risk that you have to fix in less than 30 days, which, as I said, is pretty common in the industry today, between 30 and 90 days, and if your certification takes six months, it's not going to fly.

So you have to make the cost of re-certification acceptable, but also the timeline of re-certification acceptable. So the real question here is: what is the impact of the update on my certification process, and how do I automate that? There is no magic wand there, but there are some ideas for going in that direction. In my opinion, one of the most promising is work done by NIST. It's a language named OSCAL that allows you to define how the system should functionally behave. Okay.

And the advantage of that specification is that it allows you to implement automatic introspection on your system. So in my opinion, it's a very promising model. What we already have as well, at least on Linux, is a lot of capabilities inside the system to do introspection. Okay.

And we have a research program going now. I cannot guarantee we're going to be successful. Okay. But we work on that. We try to do automatic matching between the formal specification and the introspection we can get from the system. And obviously, in a perfect world, we should be in a position to generate a report that says: with this new update, these are the functional requirements that you no longer fulfill. Okay. I will give you an example.

We have, for example, a camera that is filming you to guarantee that the driver is not phoning or reading a book. This is mandatory if you have an autopilot level five, okay. So level five more or less means that the car can drive by itself.

But the driver is still responsible for taking back control in case of a problem. So the camera verifies that the driver is effectively in a position to come back and take control of the vehicle. Okay. In order for the AI process to work, the camera has to generate 25 images per second. Okay. So one of the functional constraints we have is that you have to assert that the camera effectively generates 25 images per second, and you have to guarantee that even when you apply patches, okay. And that's something we can do automatically. Okay.
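The frame-rate constraint described above can be turned into an automatic check that runs after every patch. The following is a minimal sketch, assuming frame timestamps in seconds are available from some introspection source; the function names, the `required_fps` default and the `tolerance` value are hypothetical and not taken from the talk.

```python
def measured_fps(timestamps):
    """Average frames per second over a list of frame timestamps (in seconds)."""
    if len(timestamps) < 2:
        return 0.0
    duration = timestamps[-1] - timestamps[0]
    if duration <= 0:
        return 0.0
    # N timestamps delimit N - 1 frame intervals.
    return (len(timestamps) - 1) / duration


def check_frame_rate(timestamps, required_fps=25.0, tolerance=0.5):
    """Return (ok, fps): ok is True when the measured rate meets the requirement
    within the given tolerance (an assumed margin for measurement jitter)."""
    fps = measured_fps(timestamps)
    return fps >= required_fps - tolerance, fps


if __name__ == "__main__":
    # Two seconds of synthetic frames at 25 frames per second.
    frames = [i / 25 for i in range(51)]
    ok, fps = check_frame_rate(frames)
    print(ok, round(fps, 2))
```

Run against the patched image in CI, a failed check would become one line of the report listing the functional requirements the update no longer fulfills.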

And I think that really is the way to go if we want to enable patching on the system. So that's my conclusion. You have to keep the system under control; I think it's probably one of the hardest things to do. You have to simplify and standardize. And here we are talking as much about politics as technology, okay? Because not everyone is ready to accept this simplification. You have to do everything under the control of CI/CD, so every time someone sends a line of code, the tests should run automatically.

You have to automate as much as possible all the image generation and all the native part of the development. And you have to make sure that your code is as generic as possible, so you reuse the code over and over, because the more you reuse the code, the better your code is going to be, and the easier it is going to be to maintain in the long run. You have to make sure that your security and your safety are auditable. So you have to build your system in such a way that safety and security can be audited, I won't say 100% automatically, but at least largely automatically.

Otherwise, you're not going to be able to patch your system. And I would say there are two last elements I want to highlight. You have to enforce your privilege model, and you have to make sure that your security rules are generated automatically, because you cannot rely on humans for that. And you have to run things as isolated as possible, and containers are really the way to go for that. Not the traditional containers as we have them in IT, but containers with namespace isolation.

So we can guarantee that we have a kind of white box, and we know this white box can only do what is specified in the privilege and security environment. And this guarantees that if you change one white box, it's not going to have an influence on the other white boxes of your system. And it's not that easy to do when you only have one gig of RAM and two ARM cores, I can tell you. Okay. So that's it for me. I think I'm on time.
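The static privilege model mentioned in the talk (is this application allowed to talk to that one?) with automatically generated rules can be sketched as deriving an allow-list from what each service declares. This is an illustration only: the `MANIFESTS` structure, the service and API names, and both helpers are hypothetical, and a real system would enforce the resulting rules in its security layer rather than in application code.

```python
# Hypothetical manifests: each service declares the APIs it provides and requires.
MANIFESTS = {
    "geolocation": {"provides": ["position"], "requires": []},
    "navigation": {"provides": ["route"], "requires": ["position"]},
    "dashboard": {"provides": [], "requires": ["route", "position"]},
}


def generate_rules(manifests):
    """Derive (client, server, api) allow rules from the declared requirements.
    Anything that is not generated here is denied by default."""
    providers = {}
    for service, manifest in manifests.items():
        for api in manifest["provides"]:
            providers[api] = service
    rules = set()
    for client, manifest in manifests.items():
        for api in manifest["requires"]:
            if api not in providers:
                raise ValueError(f"{client} requires unknown API {api!r}")
            rules.add((client, providers[api], api))
    return sorted(rules)


def is_allowed(rules, client, server, api):
    """Default-deny check against the generated rule list."""
    return (client, server, api) in rules


if __name__ == "__main__":
    for rule in generate_rules(MANIFESTS):
        print(rule)
```

Because the rules are regenerated from the manifests at every build, a patch that changes what a service requires changes the rule set visibly, which is easier to audit than rules edited by hand.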
