Event Recording

Dr. Zacharias Voulgaris: The Usefulness of Anonymization and Pseudonymization in Data Science and A.I. Projects


My name is Zacharias Voulgaris, and I'm the Chief Science Officer of Data Science Partnership, a company that does consulting for various data science and AI projects in the UK and in Europe in general. As you probably already know, privacy is something very important for all of us, especially those of us who deal with data related to it, regardless of what role we have. In data science specifically this is a crucial matter, because there is a lot of liability there that is often not addressed by data science people. Fortunately, we can borrow the methodologies of anonymization and pseudonymization from the cybersecurity field to make this whole process more streamlined and safer for everyone. So in this talk we look at just that, focusing on these two techniques. These are some of the topics we're going to delve into in this particular talk. I have to say in advance that it's hard to do this topic justice in such a short talk, but I hope it can be a good incentive for you to research it further, using these topics as starting points for your own research.
So, first of all, what is personally identifiable information, or PII for short, and why is it so relevant today, especially in the data science field? PII is defined as any data that involves personal or private attributes of a particular individual or group of individuals, such as their name, financial data, their location, the particular place where they live, personal preferences they may have, as well as medical records and what have you. Naturally, this is where all the privacy matters dwell, because to protect privacy we basically have to deal with PII. And if we can't do that for whatever reason, either by mistake or because of some malicious intent by a third party, then we have PII leakage, and that's a very serious issue, because it can actually be cause for a lawsuit. These are very serious lawsuits, because the whole matter of PII is very clearly stated in the GDPR legislation, which involves all parties in the EU, as well as companies outside the EU that deal with EU users' data. In data science we deal with all kinds of data, and a very large amount of such data coming from different data streams.
It is very likely that we'll find something that is PII-related in at least one of those data streams, and protecting it is essential, particularly before we use it in our models: if the PII is protected before it goes into a model and we have a result, the chances of that PII becoming a liability are much smaller. So how does PII feature in data science specifically, and in AI as well, and what do the risks currently involve? First of all, PII usually takes the form of specific variables in the dataset, and these variables are what gets turned into the features that we use in the models. The whole process is quite time-consuming and I can't go into it in any depth right now, but suffice it to say that it can be done either manually or automatically, the latter being the case with AI systems.
In any case, especially when we're dealing with transparent models, which are often used in data science, these particular features can be traced back to the variables, and those variables can be traced to specific individuals. So PII dwells in those variables, but combinations of these variables can also reveal PII. For example, location data, like where someone has been in the past few weeks, as well as any allergy information they have: these by themselves may not be PII per se, but when you use them in tandem, you can reveal information about an individual, especially if that person lives in a small community. What's more, many data science models, especially those with AI on the back end, can predict PII variables. So even if you omit them, you can still predict them with high enough accuracy, jeopardizing the privacy of the individuals involved.
This naturally makes the whole problem much more challenging, and it's something people have to take into account in data science and AI projects. But let's now delve more into the specific methodologies from cybersecurity, namely anonymization and pseudonymization. First of all, anonymization is the process of destroying PII. It is the foolproof way of dealing with PII, by either omitting it altogether or by scrambling it in such a way that it cannot be traced back in any way; it's a one-way street. Pseudonymization is a bit more sophisticated, because it involves various processes that conceal the PII. The PII remains there, but it's not recognizable unless you have some additional information, which the data scientist or any analyst involved in the project will have. And because of the nature of pseudonymization, you can map the concealed information back to the original individuals it corresponds to. Both pseudonymization and anonymization deal with concealing PII while still maintaining its usefulness to some extent for the data science models, particularly in the case of pseudonymization.
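To make this distinction concrete, here is a minimal Python sketch; the field names and token format are illustrative assumptions, not something prescribed in the talk:

```python
import secrets

def anonymize(records, pii_fields):
    """Destroy PII by dropping the named fields entirely (a one-way street)."""
    return [{k: v for k, v in r.items() if k not in pii_fields} for r in records]

def pseudonymize(records, pii_field):
    """Replace PII values with opaque tokens. The mapping table is the
    'additional information' that lets an authorized analyst reverse it,
    so it must be stored separately from the dataset."""
    mapping = {}
    out = []
    for r in records:
        value = r[pii_field]
        if value not in mapping:
            mapping[value] = "id_" + secrets.token_hex(4)
        out.append({**r, pii_field: mapping[value]})
    return out, mapping

records = [{"name": "Alice", "age": 34}, {"name": "Bob", "age": 41}]
anon = anonymize(records, {"name"})            # nothing left to trace back
pseudo, key_table = pseudonymize(records, "name")  # reversible via key_table
```

Note that the pseudonymized records keep their structure and thus their usefulness for modeling, while the anonymized ones lose the name column for good.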
So how do these methodologies add value? I think it's pretty obvious that they add value by securing the privacy of the people involved, because they mitigate the organization's risk of lawsuits due to PII leakage. With anonymization, for example, there is no leakage whatsoever: even if somebody accesses the dataset we use, they cannot find anything useful that could jeopardize the privacy of the people involved. As for pseudonymization, even if they do access the dataset somehow, they may not be able to trace those concealed PII variables back to the original people easily, if at all. That's something people in data science often forget: we have to really protect the people behind the data. It's not just about the data; it's also about the people behind it.
Pseudonymization specifically conceals this private information while at the same time keeping it useful. The data is still usable in a data science model, but if somebody accesses it, they cannot really do much with it. Of course, there's a trade-off between the cybersecurity level we have and the usefulness of the protected data. On one extreme, we have a dataset completely stripped of all the PII it could have, and this way we end up with a relatively weaker model. On the other extreme, we have PII that is fully abundant and we don't do anything with it, so cybersecurity goes out the window, but we end up with a stronger model. And there are intermediate states between those two extremes that have to do with how much we employ pseudonymization.
Let's look at some specific processes used for these two methodologies. For anonymization, things are fairly simple, because we don't have many techniques: we either remove the data that has PII altogether, or we avoid acquiring it in the first place. There's also the case of scrambling it in an irreversible way, in which case it's still anonymized. Pseudonymization has different techniques, such as tokenization, masking, data blurring, encryption, and scrambling; these are the main ones, and all of them involve concealing the data in one way or another. Tokenization specifically uses a non-mathematical approach, by which we create a token for each specific unique value. So if we're protecting the names of individuals, for example, we can use a token that is also made of characters, letters basically, but it is gibberish to someone who looks at it, because even though it's the same character type, the tokens don't have any innate information in them.
The token is basically just some random characters, and we need some additional information, a mapping, to be able to make the process two-way. This usually lives in a different table in the database, or in a different database or a different file altogether. This whole approach is very useful when you're dealing with legacy systems, like old databases where the character type and length have to be preserved; that's why tokenization is often very popular in that regard. The masking technique is useful in its own way for other kinds of data, for example credit card data, where you mask or conceal one big part of the data and just leave the rest visible. This way you can still communicate the data to the client or the user, and they can recognize their credit card.
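The credit-card masking just described can be sketched in a few lines; the grouping into blocks of four is an assumption for display purposes:

```python
def mask_card(number, visible=4, mask_char="*"):
    """Conceal all but the last `visible` digits of a card number,
    so the owner can still recognize it but an interceptor cannot
    recover the full number."""
    digits = number.replace(" ", "")
    masked = mask_char * (len(digits) - visible) + digits[-visible:]
    # re-insert spacing in groups of four for display
    return " ".join(masked[i:i + 4] for i in range(0, len(masked), 4))
```

For example, `mask_card("4111 1111 1111 1234")` yields `"**** **** **** 1234"`.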
For example, they can recognize it from the last four digits that remain visible, while the rest is masked, so if somebody intercepts this communication, they cannot figure out the original credit card number. Encryption and scrambling are a bit different, because they use a mathematically sound and well-researched way to turn the data into gibberish, so someone sees something completely unrecognizable compared to the original data. In the case of encryption, this can easily be reversed through the use of a decryption key, which is usually a bunch of random characters. With scrambling, you just mix the characters or digits together, and in many cases this is irreversible: with numeric data, for example, you cannot reverse it, because you don't know the exact order unless you have some additional information about the whole process. That's why scrambling is often considered an anonymization technique. Note, though, that the GDPR states clearly that any additional data used for this concealment, for example an encryption key, has to be stored separately from the pseudonymized data, and this adds an extra layer of security to the existing cybersecurity level. And finally, we have data blurring, which uses an approximation of the data to conceal the information, and this makes it practically impossible to trace it back to the original people. You can think of image data, like a mugshot: with blurring, you just blur the different pixels together, so you can still tell that this is a person of one kind or another, but you can't discern the specific identity of the individual, because all the facial characteristics are pretty much gone.
You can do this with other multimedia data as well, not just images, though data blurring is not used that much in data science. So let's now look at some useful clarifications to have about PII and the two cybersecurity methodologies we just talked about. First of all, pseudonymization, although very powerful and very useful, is not a panacea when it comes to securing PII, because, theoretically at least, it can be reversed. If you're dealing with encryption, for example, and the encryption key is not very well developed, something somebody can guess easily, they can reverse the process with a brute-force attack. Given enough computational power and time, they can reverse this kind of pseudonymization and reveal the PII, and the same goes for other PII concealment methods in pseudonymization, although it all depends on the data and the project. What's more, regardless of the use of pseudonymization, the data still falls under the GDPR.
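As a concrete illustration of the brute-force risk just mentioned: hashing a low-entropy value such as a 4-digit PIN without a secret key looks scrambled but can be reversed by simple enumeration, whereas a keyed hash (HMAC) resists this as long as the key, stored separately per the GDPR guidance above, stays secret. The PIN example is an assumption for illustration:

```python
import hashlib
import hmac

def weak_pseudonym(pin):
    """Unkeyed hash of a low-entropy value: looks concealed, but isn't."""
    return hashlib.sha256(pin.encode()).hexdigest()

def brute_force(target_hash):
    """An attacker just enumerates all 10,000 possible 4-digit PINs."""
    for guess in range(10000):
        pin = f"{guess:04d}"
        if weak_pseudonym(pin) == target_hash:
            return pin
    return None

def keyed_pseudonym(pin, secret):
    """HMAC with a separately stored secret key defeats plain enumeration,
    since the attacker cannot compute candidate hashes without the key."""
    return hmac.new(secret, pin.encode(), hashlib.sha256).hexdigest()

recovered = brute_force(weak_pseudonym("4821"))  # the "concealed" PIN falls
```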
So you still need to get permission from the users to use their data; you don't have this issue with anonymization, though. What's more, evaluating the PII-related features in the data science model is paramount for deciding what to use. If a particular variable, which is then turned into a feature, doesn't have much predictive potential for what you're trying to predict, then why use it at all? Just get rid of it and apply anonymization. Alternatively, if that feature is very useful and you need it in the model to have a robust model, you apply pseudonymization to it.
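That per-feature decision rule can be sketched as a small helper; the feature names, importance scores, and threshold here are purely illustrative assumptions:

```python
def plan_pii_treatment(importances, threshold=0.05):
    """Decide, per PII feature, between dropping it (anonymize) when its
    predictive importance is below the threshold, and keeping it in
    concealed form (pseudonymize) when the model genuinely needs it."""
    return {
        feature: ("pseudonymize" if importance >= threshold else "anonymize")
        for feature, importance in importances.items()
    }

# Hypothetical importance scores, e.g. from a feature-importance analysis:
plan = plan_pii_treatment({"postcode": 0.21, "phone_number": 0.01})
```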
What's more, there's also the synthesized-data approach, which is quite different from the aforementioned cybersecurity methodologies but has some similarities. Synthesizing data has to do with creating new data that looks like the original data but doesn't correspond to specific individuals. With image data, for example, it's like creating new faces that look a bit like the original faces, but they're not faces of real people. So even if somebody has access to this synthesized data, they cannot do anything useful with it, other than use it in a data science project, of course; they can't reveal PII, because the data doesn't have any PII to start with. To apply this technique, you need to have lots of data, and you usually also need a specialized AI model to make it happen. So it's not something you would use in all projects, but it is there as an option.
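A heavily simplified sketch of the idea, using only fitted per-column normal distributions; real synthetic-data generation relies on specialized generative models that also preserve correlations between columns, which this toy version deliberately ignores:

```python
import random
import statistics

def synthesize(rows, n, seed=0):
    """Toy synthesizer: fit a normal distribution to each numeric column
    of the original rows and sample n fictitious rows from it. The output
    resembles the originals statistically but corresponds to no real
    individual, so it contains no PII to begin with."""
    rng = random.Random(seed)
    columns = list(zip(*rows))  # column-wise view of the data
    params = [(statistics.mean(c), statistics.stdev(c)) for c in columns]
    return [[rng.gauss(mu, sd) for mu, sd in params] for _ in range(n)]

# Hypothetical (age, salary) records standing in for real training data:
real = [[30.0, 52000.0], [40.0, 61000.0], [35.0, 57000.0]]
fake = synthesize(real, n=5)
```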
And finally, not all data science applications lend themselves to synthetic or anonymized data. For example, there are recommender systems, such as those at Amazon or sites like Goodreads, that make suggestions about what you could read or what else you could buy from their site, and that kind of system uses PII. Even if it's pseudonymized it's useful, but you can't get rid of the PII altogether, because then the system would not work. So before we wrap up this talk, I have some learning resources for data science matters. First of all, there's my latest book, called Julia for Machine Learning, which I authored this past year and which got published this spring. It deals with how you can use Julia, a very promising programming language, for machine learning applications and other data science techniques. It's very hands-on, and it doesn't require much other than some understanding of programming in Julia. On the right, I have a couple of course platforms that deal with data science specifically: the first is a company in Germany that started in the past year, and the second is a company in the US. The important thing is that both of these platforms also offer mentoring, which is a very powerful thing to have when you're learning data science. When you're learning something like this you also have questions, and the mentors are able to address those questions, help you streamline your whole learning, and make it a bit more enjoyable as well.
So that's all I had for you for today. Feel free to keep in touch with me through any one of these channels. And thank you very much for your attention.
