Event Recording

Big Bang to the Cloud - Lessons Learned from a Successful Large-scale Production System Migration

Name: Big Bang to the Cloud - Lessons Learned from a Successful Large-scale Production System Migration
Uploaded: 2023-05-10T12:00:00+02:00
Duration: 19 min 44 s

Posted on May 10, 2023

Managing access is a critical capability for the IT infrastructure of any enterprise, especially when dealing with over 6,800 integrated applications that generate millions of authentication requests. Due to the increasing demand for availability, scalability, and support for market-specific customizations, as well as the migration of more products and applications to the cloud, we had to migrate our infrastructure and application stack to the AWS cloud. This stack had been introduced in an on-premises setup in 2017 and now follows modern paradigms such as GitOps, Everything as Code, and highly automated processes based on Service Layers and ForgeRock. Our main concern was ensuring that the integrated application landscape remained functional during the migration without experiencing any impact or downtime.

During this presentation, we will share our experience and discuss the key takeaways from our successful large-scale production system migration to the cloud, including:

Understanding the target architecture for the migration project
Identifying the challenges that arise during cloud migration
Discovering strategies for minimizing the impact on integrated applications during the migration process.

Speakers

Dr. Heiko Klarl

Chief Marketing and Sales Officer
iC Consult

Michael Maier

IAM Consultant
iC Consult

Stephanus Rieger

Product Owner
BMW AG

Transcript

Yeah, hi everyone. Thanks for joining us. Big Bang to the Cloud: a migration story and lessons learned from our migration. So, a very awesome story. Cool that you are here, Stephanus. Thanks for that. Let me give a quick introduction of who we are. I'm Heiko, leading sales and marketing at iC Consult, working closely together with the BMW folks for more than 13 years.

Michael is product owner in the project, also from iC Consult, and Stephanus is product owner at BMW, with responsibility for the access management piece. We have basically two sections. We share very quickly some background on the setup, its dimensions, and what we have done. And then, probably the most interesting part for you, the lessons learned: how we managed the migration and moved a fully fledged on-prem access management system into the cloud.

And yeah, Stephanus, a bit of background on the reason why we did the migration, and a bit on sizing and numbers from BMW. Thank you. Thank you for giving me the possibility to share my thoughts here about how to move such a large-scale authentication service to the cloud. Why do we do that?

Okay, actually, we are dealing with a high number of applications in our authentication provider, and it is all located in Munich. If you think of BMW as a company operating worldwide and serve authentication from only one point, this leads to lower resilience if something happens here. So in order to reduce latency and improve resilience, we decided, in line with our cloud strategy, to also move our authentication service to the cloud. With this we get on-demand scalability.

It's much better than what you can do with on-premises solutions. And of course this now gives us the opportunity to switch over to a multi-regional concept, which is absolutely necessary: if you think of operating things out of China, you always have some special cases there that you need to take care of. We would like to operate things out of the US, and we would like to come closer to our customers and our users, reducing latency and increasing the possibilities for authentication.

And just to give you a bit of an overview of our environment so far: over the last couple of years we have introduced single sign-on for all of our applications. So this also affects the production lines with the plants. We have, of course, our systems for financial services. We have the sales and marketing systems, and all of them use single sign-on, as seamlessly as we can. With this, we have around two and a half thousand applications already integrated here.

That means, as we have different rounds, we are at the moment at around 14,000 integrations, and with this we handle around 26 million authentication requests a day. So this is large scale, and my boss always told me the minimum requirement for the availability of your system is 110%, because if authentication is not working, we can't operate anything. We can even close production. So you see how important this is. So no stress.

Yeah. And that's why we brought in iC Consult as our partner here, to help us move our system from on-premises to the cloud. And then we were thinking about how to do that application by application, with 2,500 applications. That takes a long time. You can't do this within a couple of minutes. You would need years to organize that.

So we decided, together with our architects, to think of a big bang, and then we started thinking about how to organize such a big bang without any issues in production and without any downtime. How this is done, my colleagues here at my side will explain to you. Yeah. So then let me just show you a little bit where we are going, so what we migrated to the cloud. As Stephanus has already mentioned, the goal was to have no downtime and to have everything in place as soon as possible.

So, big bang migration. And we decided to move to the cloud and keep it as similar as possible.

But we are not doing a simple lift and shift, so we're not taking our servers, installing them on some hyperscaler, and that's it. We're using the whole, how do we say, cloud and Kubernetes functionality: upscaling, downscaling, ingress, and all this kind of stuff. So the first challenge was to keep the environment in the cloud as similar as possible, especially for all the customers. And as you have seen, there are 2,300 applications integrated into our systems. So if there's a bug, they have found it. Yeah.

And if there is a bug, they will use it. So we have to keep the solution as similar as possible. Yeah.

And how did we achieve that? With our Service Layers philosophy, we can deploy those clusters on every kind of cloud, which is a good thing. For BMW, for example, we decided to move into BMW's own hosted AWS cloud. But of course we are also capable of dealing with other clouds.

So yeah, we also had a little issue, or rather it was not our issue, but during the go-live there was an outage in some other cloud regions. I think it was Microsoft, and some systems were down, and then we needed to postpone the whole go-live because, yeah, it's cloud and every cloud is the same.

So yeah, it was very amazing, because at this time my vice president joined us for this go-live and told us: guys, today we already have an outage with Microsoft here. Are you really sure that you would like to go to the cloud today? Okay. And after that we started a new discussion: if something like this can happen, how do you prevent it in the future? Is your system only capable of running on one hyperscaler, or can we use multiple of them? It would be much better to have it on multiple environments, because if one is not available, you can switch.

And here again we are talking about the big bang. Yeah. So just another picture of how we set up our Service Layers clusters. This is also one thing that helps us, or will help us a lot in the future, because we are not doing this cloud native, so we don't use the AWS Kubernetes service. We have our own Kubernetes installed in there. And so we can protect our customer data, and also our own data, from the big tech companies, as we also heard in the lecture yesterday. Yeah.

And as I said, or as Stephanus also said, the idea is not only to keep this in one region or even in two hyperscalers; we are also thinking of a global strategy here. So also China, also the US, and with this also the special regulations that we need to fulfill there as well. Yeah. Thanks. Thanks for presenting the architecture, Michael.

Yeah, so basically, the main focus is insights from the migration experience. How did the team, how did BMW manage to have a seamless migration without downtime for applications, without bringing additional effort to the applications? You can imagine, with that many integrated applications, everyone is very happy if he or she has nothing to do. So what are the strategies for minimizing the impact of executing such a migration, and what did you plan to follow, Michael?

So first of all, my architects convinced me a big bang would be the best approach. That was a nice job. They did a lot of investigation and evaluation of how to do that. Okay, they convinced me. Now I had to convince the management.

Of course, if you want to do something like that, you need to ensure that everything is working as expected. So the basic requirements and prerequisites are these: if you move to the cloud, everything should work as it did before. And the expectation is that it doesn't get slower; it has to get quicker. Now, how do you ensure that this happens? For this, we set up some approaches to give the evidence that we had prepared everything accordingly and that we could really do this big bang. Some details on how this was done from Michael here. Yeah.

So as I said, the most critical part is the applications, because we don't want the applications to have to do anything or change anything. So we take them by the hand and tell them: look, this is how it's going to go in the future. And you don't need to worry, because we will help you, and we will also have the staging environment where you can test everything.

I'm just talking about the technical details here. So we have this three-stage staging environment at BMW, where we have the integration environment, and all the applications need to test there. And we had a really, really long integration phase where we told every application, especially the major business applications: look, this is our environment. Test as much as you can, test as long as you want, but please test.

And this was one of the big things we had. And we also told them: look guys, it's a Sunday, let's say a lazy Sunday evening. We will do a soft go-live for you so that you can also test all your applications in the production environment without any harm, without any downtime. So the daily work can still be done, and at least on Monday morning we are back on the old solution. So even if we had some issues there, we could prevent them.

And there was another advantage that we had with this pre-go-live. We had the possibility to test this switch up front, not just in the integration environment. An integration environment is a completely different environment. Everything is working there. Whenever you do the switch there, no one really cares. But with this pre-go-live, which was a very critical point, you have the possibility to deal with the switch and do it completely. And what came out of that was very amazing.

We had some findings here where we had never thought that this might be an issue, and we will come to this later, because this is something that you should take care of if you develop applications and deal with authentication services. Another thing that's also on our list here: as Stephanus has already said, we needed to convince everyone that our solution is good and is better than on-prem, especially some internal architects from BMW. And so we also had extensive load tests.

We also have extensive monitoring so that we can see the current status of our environment and what we are doing today. So we can then also tell them: look, this is what we're doing today, and after the switch it doesn't look any different; it's still the same.

Yeah. So you can't control what you don't measure. That's why we set up these dashboards in order to give the evidence that everything is up and running. This is also very crucial for the management. You can give them the evidence that everything is up and running. You can show them how we have set it up, and you can explain to them completely how this big bang will happen. You can show them: this is how we measure that. We gave them the response times of the new environments with all the load and performance tests that we had done.

And we explained to them, together with the feedback of the applications that had done the testing with us, that everything was working as expected and as planned. This evidence and the buy-in from the management is very crucial here. If you don't get the support, you will always get the fear from these guys that something could go wrong, and afterwards you will end up in a mess.

So you need their buy-in, and you need to give them the evidence that everything is planned well and that the applications they're operating in their departments are also up and running and work as expected. And it's for sure a great feature for the future, because basically you can now benchmark every change, every new feature against your numbers, whether they are improving or decreasing.
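The idea of benchmarking every change against recorded numbers could be sketched roughly as follows. This is only an illustration, not the tooling used in the project: the function names, the 5% tolerance, and the sample latencies are all assumptions made up for the example.

```python
from statistics import median, quantiles


def summarize(latencies_ms):
    """Return median (p50) and 95th-percentile (p95) latency of the samples."""
    # quantiles(..., n=20) yields 19 cut points; the last one is the 95th percentile.
    p95 = quantiles(latencies_ms, n=20)[-1]
    return {"p50": median(latencies_ms), "p95": p95}


def regressions(baseline_ms, candidate_ms, tolerance=0.05):
    """Return the metrics where the candidate is more than `tolerance` slower."""
    base = summarize(baseline_ms)
    cand = summarize(candidate_ms)
    return {
        name: (base[name], cand[name])
        for name in base
        if cand[name] > base[name] * (1 + tolerance)
    }


if __name__ == "__main__":
    # Made-up response times in milliseconds, purely for illustration.
    before_change = [120, 125, 130, 128, 122, 135, 140, 126, 124, 129]
    after_change = [95, 99, 101, 97, 96, 110, 105, 98, 94, 100]
    print(regressions(before_change, after_change) or "no regression against baseline")
```

An empty result means the change stayed within tolerance of the baseline; any entry names the metric that got slower, together with its old and new value.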

And we had a couple of conversations here, also with vendors, and the question is clear: when you change something in the system, the system power and performance have to either improve or at least stay constant. Coming to lessons learned on both sides: it's always a learning, and a project is always a kind of journey, with a lot of peaks and probably some valleys, and everyone learns from it.

So what are the important lessons we've learned on the BMW side, but also on the iC Consult side? I would like to start with what I have already mentioned: the buy-in of the management is crucial here. You need to convince them, you need to explain to them what you're doing. Of course, then yes, you have to stay within the budget, otherwise the buy-in is lost very quickly. Yes. What we have done here is a lot of pioneer work.

Because if you start with such a complex situation and move to the cloud, you have to ensure that the environment and the connection between the on-premises world and the cloud world is set up in a way that you can actually use it. You have to think about what your disaster recovery plan is. If you move to the cloud, how can you move back? What is the situation if something goes wrong? Yeah.

So from my point of view, even though it's on the bad side, it's this application resilience training. Yeah. Because we are doing the switches, and with the switches everyone needs to log in again, and we needed to train the applications to manage this behavior. The soft go-live was also a good part of that, because we had applications that needed to restart their servers every time we switched, and we told them: look, this is not something that we can deal with in the future.

And yeah, so that's one of my things. And on the good points: if you have an experienced operations team, that's also good for you to have. But I think an experienced team is good for everyone to have. What was very amazing during this big bang: we had specially booked a room so that application owners and managers could join us. And the big bang itself took us around five minutes, in order to do a dispatcher switch at the end.

And what was very amazing: during the time we did the switch, one of our managers wanted to join, but he was a little late. He came into the room, sat there, and followed the discussions that we had. And then he asked me: when are you going to start doing the switch?

Oh, this was already done 10 minutes ago, and everything is up and running. And that was something that was very impressive for him: that such a big authentication service with this huge number of applications can be switched within minutes.

And we did the same with the disaster recovery: we were able to switch back within minutes. Of course the applications have to reauthenticate, but this is just a login screen that comes up. Okay. One of the findings was that some applications had not implemented resilience as they should have. So if they got a message telling them the token is outdated, it's not valid anymore, they didn't reconnect automatically but threw an error.

So this is something that we have learned and that we have now put into the requirements book of each and every application: be resilient and react to exceptions that you get from an authentication system. So basically, summarizing what you can take with you as a recommendation: regular and clear communication. Get the buy-in of your stakeholders, keep them informed, get your managers' buy-in, because it's a huge project. It's not the cheapest project, and it's a long-running project, as it's basically a program and it's the foundation of your company.
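The resilience requirement mentioned here, reacting to an invalid-token response by re-authenticating instead of failing, might look roughly like this on the client side. It is a minimal sketch under stated assumptions: `TokenExpired`, `ResilientClient`, and `FakeIdp` are hypothetical names invented for the example, and the fake identity provider only simulates a switch that invalidates previously issued tokens.

```python
class TokenExpired(Exception):
    """Hypothetical error raised when the server rejects an outdated token."""


class ResilientClient:
    """Wraps an API call and re-authenticates once when the token is rejected."""

    def __init__(self, login, call_api, max_reauth=1):
        self._login = login          # callable returning a fresh token
        self._call_api = call_api    # callable(token, payload); may raise TokenExpired
        self._max_reauth = max_reauth
        self._token = None

    def request(self, payload):
        if self._token is None:
            self._token = self._login()
        for attempt in range(self._max_reauth + 1):
            try:
                return self._call_api(self._token, payload)
            except TokenExpired:
                if attempt == self._max_reauth:
                    raise  # give up only after re-authentication also failed
                self._token = self._login()


class FakeIdp:
    """Stand-in for an authentication service; a switch invalidates old tokens."""

    def __init__(self):
        self.generation = 0
        self.logins = 0

    def login(self):
        self.logins += 1
        return f"token-gen{self.generation}"

    def call_api(self, token, payload):
        if token != f"token-gen{self.generation}":
            raise TokenExpired(token)
        return f"ok:{payload}"

    def switch(self):
        self.generation += 1


if __name__ == "__main__":
    idp = FakeIdp()
    client = ResilientClient(idp.login, idp.call_api)
    print(client.request("before switch"))
    idp.switch()  # simulated migration switch: existing tokens become invalid
    print(client.request("after switch"))
```

The retry is bounded on purpose: one fresh login covers a switch or failover, while a second rejection is surfaced to the caller rather than looping against the authentication service.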

So basically it never ends. As mentioned, application resilience training: your client applications to be integrated have to be resilient, not only for the scope of the migration, but because, as we've seen, all hyperscalers can have hiccups as well. So resiliency is an important thing. And live monitoring gives you a good feeling as the product owner, or as the application team responsible for the application, and it helps you very strongly to have an understanding for the future.

So imagine you are creating a new RFP five years from now for your new system, your new architecture. Then you have a clear understanding of your benchmarks. Now we are exactly on time. Martin is very happy, and I hope you have the chance to ask at least one of the most complex questions. A question, Martin? I think we already have seven questions from the online audience that came in via the app. You can also always ask your questions via the app. So we will pick one of these. We need a very short, concise answer, Phillip. Okay.

The question from the online audience was: did you migrate only the IdP and access management capabilities to the cloud, or parts of the IGA system as well? At this time we only moved the access manager itself to the cloud. The ingress has to remain on premises, at least at the moment, as this is the connection directed to the internet. Further on, we are going to proceed here and switch to the public ingress, so that in the end we have an internet-to-internet connection and don't have to go the way through on-premises.

Okay, perfect. Thank you. So thank you very much for sharing all the insights.
