Session at the European Identity & Cloud Conference 2013
May 16, 2013 10:30
Big Data meets Security: Analyzing systems logs to understand behavior has become one of the main applications of big data technology. Open source initiatives as well as commercial tools and applications for big data integration, collection and analytics become more important building blocks of cyber attack resilience through better collection and analysis of very large sets of log and transaction data, real-time analysis of current events and potentially also prediction of future behavior.
Big Data technologies were invented to store and rapidly process the vast amount of data available today into useful “Smart” Information. What is common across these technologies is that their initial aims are focused on data processing capabilities rather than security and compliance. One particular concern is the lack of control over identity and access especially in the area of administration. This was fine when the application of the tools was confined to experimental or small scale usage. Now that they are being widely deployed for commercial purposes this is no longer satisfactory.
Big Data technologies need to be managed and secured in the same way as the other components in the IT infrastructure.
So we immediately start into the first track. As the first speaker, I'd like to give a very short overview: what is big data, what is the risk, and what is needed? Big data has actually been mentioned a number of times already.
And it's a big term. It's not clearly defined yet, so it's very similar to the cloud topic we had a number of years ago. When it was introduced, people were not clear about what cloud computing is exactly: what is in it, and what is not part of it?
So the analogy I'd like to make is that big data is just like data warehousing, but on massively more data, which is publicly available on the internet, or maybe also privately available. The major issue with big data, from the point of view of its usefulness, is that to actually be able to use it you need changes in software architectures and also in hardware architectures. So in-memory computing, such as SAP HANA, which you may have heard of, is a major prerequisite for doing big data right and getting it done in time.
So imagine you are strolling through a mall and there are advertisements which specifically address you as a person, based on what you've done for the last 30 minutes. That is exactly the idea, and that is what the technology is needed for. The risk, of course, and this is known from the security and military industry, is that seemingly unrelated information becomes relevant as soon as you start analyzing and combining it.
The typical example that I can give here is the number of aircraft per location in a military environment. That number is important for estimating the number of support personnel you need in those places. Another number, namely the number of functioning and non-functioning aircraft, is important for managing the business continuity aspect.
But this information combined, namely knowing at which location which number of aircraft is not working, is exactly the interesting piece. So that's the major message here: combining and aggregating information in a big data context creates value, creates information, and could also create a need for confidentiality. This is something we need to think about in a risk context. More problematic is that we often do not know in advance which access controls we could place there.
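As a rough illustration of the aircraft example, the following Python sketch joins two data sets that each look harmless on their own. All names and figures are hypothetical and chosen only for illustration: one table assigns aircraft to bases (used for staffing), the other records which aircraft are functioning (used for continuity planning). Neither table says how many aircraft are grounded at each base, but the join does.

```python
from collections import Counter

# Hypothetical data set 1: which aircraft is stationed where (staffing view).
assignments = {
    "T-101": "Base A",
    "T-102": "Base A",
    "T-103": "Base B",
    "T-104": "Base B",
    "T-105": "Base B",
}

# Hypothetical data set 2: which aircraft is functioning (maintenance view).
# True means the aircraft is working; no location appears in this table.
maintenance = {
    "T-101": True,
    "T-102": False,
    "T-103": False,
    "T-104": False,
    "T-105": True,
}


def grounded_per_base(assignments, maintenance):
    """Join both tables on the tail number and count non-functioning aircraft per base."""
    grounded = Counter()
    for tail, base in assignments.items():
        # Aircraft missing from the maintenance table are assumed to be working.
        if not maintenance.get(tail, True):
            grounded[base] += 1
    return dict(grounded)


print(grounded_per_base(assignments, maintenance))
```

The result of the join is more sensitive than either input, which is why an access decision made per data set, before the query is known, can miss it.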
So existing access control models are based on the assumption that we know which information assets have which protection needs. Now, if you don't know what you're going to compute and what you're going to analyze, how will you be able to determine what should actually be protected, and which queries, in SQL or data warehousing speak, should be allowed to be executed, and so on? So we need new authorization models in the big data context.
From my point of view, there's no owner of the data, and Mike will give details on that later. And we definitely also need new means of protection. So I think I've said everything here, and I'd like to give the floor to Mike, who will give his first presentation of this session. Thank you very much.
Thank you. So what I'm going to talk about today is... This is still... can you switch this presentation, please? I don't know what I'm looking at. Just wait, this is the wrong presentation. Okay.
So let's start off by saying that I seem to have a good name to talk about big data. I'm Mike Small, and small is beautiful, as they say. So here we have the presentation. What I'm going to talk about this morning is to go into a little bit more detail about what exactly big data means. Data in itself is not really very valuable. What is valuable is what you can make out of it, and the previous speaker was touching upon those issues just a moment ago.
So what we believe is important with big data is to be able to turn it into smart information. As a kind of proof point of that, I've brought out a number of examples where organizations are using big data to get value. Interestingly, many of those examples don't involve some of the new technologies that are being described as essential. And then I'm going to talk a little bit about some of the downsides. So first of all, what is big data? I would like to say that what big data means is a relative thing.
And I give you an example there; it's a real example. When I was working for a hardware manufacturer in the 1960s and early 1970s, there was nearly a riot amongst the hardware salesmen when the company introduced a 30 megabyte disk drive. Previously the salesmen had made money by selling several eight megabyte disk drives, and they believed that they would never be able to make any money because one 30 megabyte disk drive would be all that people needed. This may seem completely bizarre and ridiculous nowadays.
But if we go back even further and think about the 19th century, people started to collect mortality data, that is to say the time, the age and the cause of people's deaths.
So you've got maybe 10 million paper records distributed around all the churches and parishes in the country. And yet by bringing that data together, people were able, through manual processing, to improve public health, because there was concrete proof from this that bad water supplies caused people to get diseases. And it allowed the creation of life insurance and annuity policies which were based on facts.
And so we've seen this evolution of "big", where big really means whatever is very difficult to process with the technology that you have at the moment. So big data is something that's always going to be with us, and it is almost a reflection of what is difficult to process at the moment. A lot of people are talking about big data in terms of monitoring social interactions, monitoring Facebook and so forth. That's certainly an important aspect, but some of the things I'm going to talk about today are where you can get value out of big data in other areas.
So there are three characteristics that in general are considered to be the things that define big data. The first is volume: the problem that you're faced with is this vast amount of data relative to the processing capability. There's an example on the slide from an IDC report, which basically says that in 2012 there were roughly 2.8 exabytes, and that's two times 10 to the 18 megabytes, of data. So that's a lot of data. The second thing is what's called velocity, and this is how quickly this data can be generated.
An example is the Large Hadron Collider at CERN. When they fire it off, it generates somewhere in the order of 600 million events per second, and each one of those events is detected by sensors which can create a megabyte of data per event. So that's a fair amount of stuff to be hit in the face with when you're trying to decide whether or not you have found the so-called God particle, the Higgs boson.
And then there is the question of variety. A lot of the technology that we're running our businesses on today is really based on the notion that most of the data is either numerical or simple text. But in fact, now there's all this video, there's voice, there are all kinds of other data around that need to be processed, and that leads to a number of different technology needs. So big data is characterized by those three areas. So what about the big data technologies?
They don't actually replace existing database management systems, but I'm going to just talk through some of them, and in my next presentation I'll give you some more detail. Now, in order to deal with the volume and with processing this volume, it's interesting that Yahoo created this piece of software called MapReduce, which runs on what is effectively an assembly of commodity hardware, designed and connected together in a way that gives the ability to do a massive amount of parallel processing.
So it's really "parallel processing" in quotes, because in a previous life I was engaged with people who were developing truly parallel processing in a kind of provable way. But nevertheless, this is capable of reducing 20 billion line tables down to five or six million line tables, which are within the capability of standard processing. And if you don't want to assemble that yourself, you can even get access to that kind of capability through cloud services. Amazon, for example, have this thing called Elastic MapReduce.
So just to summarize: what MapReduce does is what I used to do with packs of cards in the 1970s. We had a pack of cards which contained a whole variety of different kinds of records, and we didn't have very much memory. So we read the cards in and we normalized the actual fields, so that all the things that we wanted to sort on lined up in one place. Then, because we didn't have a lot of memory to do the sort, we would write them out to tape and sort as much as we could.
And then we would merge the records together until we got something that was processable. And that's precisely what MapReduce, or Elastic MapReduce, does. You can read a load of text messages, you can pick out a text field like "temperature", and you can pick out the numeric field that's to do with that. That's the map part. Then you do a gigantic sort, and you reduce it to discover that actually a million people were saying that the temperature in Munich is pretty hot. So that's what that technology does.
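The map, sort and reduce steps just described can be sketched in a few lines of single-machine Python. This is only an illustration of the idea under stated assumptions, not how Hadoop or Elastic MapReduce is implemented: the message format ("<city> temperature <number>") and the sample messages are made up, and everything runs in one process rather than across a cluster.

```python
import re
from itertools import groupby
from operator import itemgetter

# Assumed, simplified message format: "<city> temperature <number>".
PATTERN = re.compile(r"(\w+) temperature (-?\d+(?:\.\d+)?)")


def map_phase(messages):
    """Map: pick out the text field (city) and the numeric field (temperature)."""
    for message in messages:
        match = PATTERN.search(message)
        if match:
            yield match.group(1), float(match.group(2))


def sort_phase(pairs):
    """Sort: bring all pairs with the same key next to each other."""
    return sorted(pairs, key=itemgetter(0))


def reduce_phase(sorted_pairs):
    """Reduce: collapse each key's values into (message count, mean temperature)."""
    summary = {}
    for city, group in groupby(sorted_pairs, key=itemgetter(0)):
        temps = [temp for _, temp in group]
        summary[city] = (len(temps), sum(temps) / len(temps))
    return summary


def run(messages):
    return reduce_phase(sort_phase(map_phase(messages)))


messages = [
    "Munich temperature 31",
    "so hot today, Munich temperature 33",
    "London temperature 18",
    "nothing to see here",
    "Munich temperature 32",
]
print(run(messages))  # {'London': (1, 18.0), 'Munich': (3, 32.0)}
```

In a real deployment the map and reduce functions run on many machines at once and the sort is a distributed shuffle, but the shape of the computation is the same as with the packs of cards.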
For velocity, you have a number of different messaging kinds of systems that are evolving, which are intended to be able to cope with the vast quantity of messages that comes along and, if nothing else, give you the ability to quickly and reliably pick out the messages that are of interest to you.
So, for example, Twitter Storm is one of the things that's growing up, IBM have a thing which is called InfoSphere Streams, and so forth. The other issue is to do with variety. Like I said earlier on, much of the technology that we have today is really focused on being able to efficiently process well-mapped kinds of data.
In order to be able to deal with that difference, what has happened is that there's been the creation of the so-called NoSQL databases, of which perhaps one of the most famous is S3 from Amazon. S3 was released in, I think, about 2006 as a database that was capable of accepting just about any kind of data that was going to be backed up. The maximum size of any object is five gigabytes, so you can have five gigabyte objects, and currently Amazon are claiming they have over 2 trillion objects in storage with that.
And there are other kinds of stores which are focused on different kinds of things, for example XML and graph data. That is important because a lot of the analysis that you want to do is around connections, and connections can be represented as graphs. And the natural language processing piece is important to be able to pick out, from social interactions, from emails and so forth, the meaning of the different ways that you can say things or write things. So those are the sorts of technologies.
So what those technologies are basically doing is allowing you to cope with this volume, this velocity and this variety, and reduce it into something that's meaningful. The difference between data and information is that data is really just symbols without context and meaning, and information is something that allows you to answer questions. So if you see a lot of Twitter messages which say "happy birthday", then that's very interesting, but it doesn't give you a lot of information.
But if in fact you discover that there are a lot of messages that say happy birthday to a particular person, then you start to say, ah, well, I now know that this person has a birthday around here, and maybe you can also glean from this the information about their age. So that's giving you information that would answer a valuable question. So how do you actually get value out of information? This isn't a new idea. As I said at the very beginning, throughout history people have got value from information.
In fact, the Rothschilds made their money, amongst other things, from knowing who won the Battle of Waterloo. Now in England, we say that actually it was the Germans that won, because they got all the benefits, whereas the French and the English exhausted themselves through this everlasting battle. But the Rothschilds had a carrier pigeon, and this carrier pigeon got the information about the winner of that battle to the Rothschild family before anybody else.
And that allowed them to astutely buy and sell government stock in a way which gave them a big benefit when the knowledge became more widely known. Now in the 1990s, there were books, and this was a book by a chap called Michael E. Porter, on how organizations can get competitive advantage out of information. Basically they said there are three things you could do with it: you can transform the product, you can change the nature of competition, and you can improve the competitive edge.
And indeed, that is what is happening. That is what people are doing with big data, and some of the examples that I will give you in a moment do that. Even more recently, there was this quote from a professor at MIT, talking about how he had been able, by analyzing the results of companies, to show that companies that had data-driven decision making were, generally speaking, doing better than those that didn't use data, that used opinion or just a basic sort of gut feeling of what was a good thing. So using data allows you to make your operations more efficient.
It allows you to plan things better, and it allows you to effectively create new products that more closely match the market. So, looking at some of these examples, the first one was reported by the BBC: the fire service in Amsterdam has aggregated data from a whole series of different external sources. For example, there are maps which are available through government freedom of information.
There is information about the construction of different kinds of buildings, and there's information about the location relative to water and so forth. They were able to put this together using a big data tool to reduce it down to a size where they could put it through a fairly regular business analytics tool. And the result of all of that is that they were able to build a risk rating for the buildings around greater Amsterdam.
And this is very helpful because, first of all, it allows them to give advice to the buildings that they recognize are most at risk. It allows them to do better planning of how they would respond if a particular building was on fire. It also allows them to consider more carefully how they should locate their resources in order to be able to respond in the way which is most effective. So that's an example of risk management based on the aggregation of a lot of data from different sources, most of which was external to the fire brigade.
Now, here's another interesting one. You probably know about this kind of equipment; some of it is made in Germany. The example that I used here was taken from Caterpillar, which is one of the competitors of that. Basically, this firm makes equipment, very large vehicles that are used in mines, on open-cast mines. Each of these vehicles is incredibly highly instrumented. The instrumentation gathers data about what's going on and where the vehicle is.
And that instrumentation is in fact allowing that information to be sent back to Caterpillar headquarters, if you will, who also sell a service to the users of those vehicles, where they can have a dashboard called Cat MineStar, which allows them to monitor in great detail what's happening in the vehicle. The point about this is being able to more efficiently predict failure and to predict when you should economically replace things. The example at the top is what happened when a track broke on a particular vehicle in a particular mine, and this thing was then stuck.
It was actually blocking all of the other vehicles from doing their job. A track is not a piece of equipment which is normally kept in stock, and so there was a great delay in being able to get hold of and replace that track. So being able to more accurately predict when you should have done preventative maintenance would have made a big difference in that particular case. And there are a number of corresponding examples.
For example, GE engines use, would you believe, a salesforce.com messaging system in order to bring together live data from aircraft in flight which are using their engines. The objective of this is that if there is a problem in an engine, or a nascent problem that could be prevented, that information could actually lead to the spare part required being delivered to the airport where that plane is going to land before the plane actually arrives. Then we have the question of operational risk.
Many of you will have seen that, first of all, we had SIEM, security information and event management. Now we've got the security intelligence and security analytics tools.
The point about this is that the cyber attack is almost exactly the kind of thing that was being talked about at the very beginning: nowadays cyber criminals want to attack an organization in a covert way. They don't want to be visible. So what you need to be able to do is to see the needle in the haystack. And indeed, one of these products claims to scale to being able to cope with a million security events per second, and to process these.
The idea is to be able to pick out of this volume of data those events which represent some kind of abnormal or anomalous pattern, in order to be able to detect that the intruder has arrived before the intruder actually does something that's bad. The picture that I've taken there comes from a very interesting report which was published in the US post 9/11. It was a report commissioned to look at the risks of terrorist attack to the public infrastructure in the US, which includes things like water, power and gas.
It's called HILF, which stands for high-impact, low-frequency events. And indeed, some of the events that they are talking about in there are attacks by hackers or terrorists or whatever on the SCADA, the control and monitoring equipment of these large utilities, and smart meters.
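The idea of picking anomalous patterns out of a large event stream, described a moment ago, can be sketched very roughly as follows. This is a toy illustration in Python and not how any particular security analytics product works: the per-minute event counts, the source names and the threshold of three standard deviations are all assumptions made for the example.

```python
from statistics import mean, stdev


def anomalous_sources(counts_by_source, threshold=3.0):
    """Flag sources whose latest event count deviates strongly from their own history.

    counts_by_source maps a source name to a list of per-interval event counts,
    oldest first; the last entry is the interval being checked.
    """
    flagged = []
    for source, counts in counts_by_source.items():
        history, latest = counts[:-1], counts[-1]
        if len(history) < 2:
            continue  # not enough history to estimate a baseline
        baseline, spread = mean(history), stdev(history)
        if spread == 0:
            # A perfectly flat history: any change at all is unusual.
            if latest != baseline:
                flagged.append(source)
        elif abs(latest - baseline) / spread > threshold:
            flagged.append(source)
    return flagged


# Hypothetical per-minute event counts for two devices.
events = {
    "firewall-1": [100, 104, 98, 101, 99, 102],
    "database-7": [5, 6, 5, 4, 6, 48],
}
print(anomalous_sources(events))  # ['database-7']
```

The busy firewall is not flagged, because its latest count is in line with its own history, while the normally quiet database is. Real tools correlate many more signals, but the principle of comparing events against a learned baseline is the same.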
Now, perhaps you all have smart meters in Germany; I know you're very into being power efficient. In the UK, my energy provider is constantly trying to persuade me that I can save money by having a smart meter. And they try to persuade me by telling me that when I'm watching television, I can have the smart meter beside me and see how much power I'm using.
Well, I don't actually see what it's going to do for me. And the point is that it isn't going to do very much for me individually, but having smart meters is going to do a great deal for the energy companies. Basically, in the UK, which may be the case everywhere else, there is incredibly fine monitoring of the high-voltage, up to 600 kV, power supply pylons that you see across the countryside, but there is no monitoring of the 240 volt network around domestic areas.
And this is something of a concern to the network operators, because at the same time we've got low-carbon technologies being sold to individuals. So, for example, I can have a 13 amp plug, which is equivalent to about a three kilowatt load, but if I buy an electric vehicle, then I can buy a 42 kilowatt charger, which would come from my main supply. That's quite a big step. The other problem is that people are putting in photovoltaic microgeneration sites, which is going to lead to reverse flows.
Now, at the same time, there is a legal requirement on the network operators about the voltage that's provided. So they see smart meters as being a way of instrumenting the low-voltage networks, which allows them to avoid having to reinforce that network to make sure that it can cope with these unknown changes that are going to come through low-carbon technologies. But to do that, they are talking about having 25 million smart meters deployed, each of which is generating events around every 10 minutes.
And they have to have a way of transmitting that information and being able to make use of it to do things, for example to offer a spot price to an electric car charger, so that they can balance out the load by controlling equipment that's in your house using things like pricing and so forth. So there's an example of a potentially large saving to the public utilities through that. And then the final example I'm giving is to do with wind turbines.
I see a lot of wind turbines in Germany, and there are a lot being built in the UK. Every now and then there's a picture in the paper of one that has burnt out or been blown down. It is actually very, very important where you locate those turbines, because turbulence is bad and too high a wind is bad, and being able to predict the best location is really, really important.
Now, the people that do that already had a big database, but they wanted to increase the size of that database tenfold in order to make the predictions more accurate. They also wanted to be able to get faster answers to questions, and using the conventional business analytics tools that they had developed themselves, it was taking weeks to get the answers that they needed. So using this big data technology, they were actually able to increase the data size by a factor of 10 and reduce the time that it took them to get useful answers from weeks down to minutes.
So that's another key thing that is done. So what about the downsides?
Well, there are some downsides, because obviously whatever we can do with big data, the bad guys can do with big data too. So that's one of the things. But first of all, often the result of all this extra work is actually only a very small improvement. Sometimes that small improvement is worthwhile, but not necessarily always, so you have to really think about whether the risk and reward are going to be worth it. Then, if you get it wrong and you wrongly anticipate what people need, it could well be that you annoy the customer rather than being felt to be getting closer to the customer.
And finally, if you believe absolutely that there are no problems with big data and work on that assumption, you may find that you've got it completely wrong. There are a number of different ways that you can get it wrong, which I will talk about in my next session, to do with information stewardship. And according to the ENISA report on emerging threats, they identified these four threats as coming from big data. Big data gives attackers more information about how successful their malware was.
It gives them more information to understand how to build better exploit kits. It also helps them to see whether or not they were successful in data breaches, and to collect more information which they can use for identity theft. So those are the potential downsides: these ideas and technologies could end up being used by the bad guys. So, in summary, big data is really a relative thing.
What is big is relative to the technology we have at the moment. There's been a lot in the press about the privacy and the personal impact of things, looking at Twitter and so forth, and my colleague Dr. Karsten Kinast will be talking a little bit about that later on. But in fact, it's also currently being used in a number of other areas which are to do with technology, where there are clear and manifest benefits. But you get those not from the data itself, but from the ability you have and the approach you take to transforming that data into smart information.
The technology that's involved may often include some of the more emergent technologies, but there are a number of illustrations, which I gave you there, where in fact people are using existing technologies, perhaps with some modifications, to get good value out of big data. And that's basically my presentation. There are already a couple of documents that we've published, which you can see on here, so I recommend you go onto the KuppingerCole website and get your copy of these. Thank you very much.