GDPR obligates organizations to provide data subjects with access to their personal data. To comply, companies must be able to answer a seemingly innocuous but frighteningly difficult question: What do we know about the data subject? Further, organizations must respond to Data Subject Access Requests (DSARs) in a privacy-preserving manner that embeds Privacy by Design. This is going to be a problem, because organizations will not be able to reliably find the data: there are too many places to look, the data varies (Elizabeth vs. Liz), and there are other problems besides. This keynote explores these identity challenges and suggests remedies.
Good morning. Just some quick background about me. I had a software company and I moved it to Las Vegas in the early nineties. One of the systems that we built for the casinos was for a company called Griffin Investigations, which helped surveillance departments keep clever, bad people out. We instrumented this system and added facial recognition to it in '96, and this system was used to help bring down the MIT card-counting team. So if you've read the book Bringing Down the House or saw the movie 21, my team and I had our hand in bringing those little bastards down to their knees. We built another system for the casinos that became known as NORA, or Non-Obvious Relationship Awareness. More about that in just a second, but ultimately my company got acquired by IBM. I spent eleven and a half years there as an IBM Fellow.
And in 2009, while at IBM, I started working on a new entity resolution project with privacy by design from the beginning. About two years ago, I spun that out into my new company, Senzing, in a one-of-a-kind partnership. That's my background. But this system called NORA: the goal of it was to take the identity data from lots of different databases across the casinos. Some of the clever bad guys would have 32 different names. They'd have five different tax ID numbers. They'd use multiple dates of birth. They're trying to deceive the casinos, and the casinos are trying to make sure they understand who they're doing business with. So what we did is we created this NORA software to figure out who was who, and then also who was related to whom, and this was used to generate alerts.
They had about 15 million customers, employees, and 18 different watch lists. I dusted off a summary of one of the original reports that we showed the casino. When we did the first run, it was kind of funny, because I sat there with the head of corporate security and I showed him that they had 24 active players who were really known cheaters. And for one of them, they had recently bought a plane ticket to fly the cheater back. So he said, "Are you telling me that we're flying the cheaters in on our own money?" And I said, "I'm afraid so, sir." We found a variety of other things. Some of the employees were also players, which was against policy, and some of the employees were related to vendors. A few of them were even the same as the vendor. But what made all of this work was entity resolution, which is, again, figuring out who's who. In the last year or so, I've spent about 30 hours creating the next few charts to show you what entity resolution means to me, in slow motion. I'm gonna show you this, and I'm gonna show how I think it relates to solving a problem with GDPR. Take a look at these three records. And by the way, no one to date has ever got all of this right.
This one, yes. But as I show you a few more, it just gets harder and harder to hold it in your head, and you'll see why. Do you think these three are the same person or not? All three of the records have roughly the same name, because Bob, Robin, Robert are basically equal. The first and third records have the same address, roughly. The phone numbers are all the same. The dates of birth are different between one and two, but only the month and day are transposed. So most people would agree that that's probably the same person, and we're gonna call that entity one. And now in comes this record. It has the same email address as record number one. Is it the same person? No, this would be a family member, maybe, sharing an email address. So you really can only graph it and call it a discovered relationship.
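As a rough illustration of the comparison just described, here is a minimal Python sketch. The records, the nickname table, and the function names are all hypothetical, and the rules are deliberately crude; it is not how NORA or any product works, only one way to show why a transposed date of birth can still count as agreement while a shared email address alone yields a relationship rather than a match.

```python
from datetime import date

# Hypothetical nickname table; a real system would need a far larger one.
NICKNAMES = {"bob": "robert", "rob": "robert", "liz": "elizabeth", "beth": "elizabeth"}


def canonical_first_name(full_name: str) -> str:
    """Lower-case the first name and map known nicknames to one canonical form."""
    first = full_name.strip().lower().split()[0]
    return NICKNAMES.get(first, first)


def dob_close(a: date, b: date) -> bool:
    """True if the dates are equal or differ only by a swapped month and day."""
    if a == b:
        return True
    return a.year == b.year and a.month == b.day and a.day == b.month


def compare(r1: dict, r2: dict) -> str:
    """Classify two records as 'same', 'related', or 'unrelated' (illustrative rules)."""
    name_match = (
        canonical_first_name(r1["name"]) == canonical_first_name(r2["name"])
        and r1["name"].split()[-1].lower() == r2["name"].split()[-1].lower()
    )
    # Features that are present on the first record and identical on the second.
    shared = [f for f in ("address", "phone", "email") if r1.get(f) and r1.get(f) == r2.get(f)]
    # A missing date of birth is treated as "no conflict".
    dob_ok = "dob" not in r1 or "dob" not in r2 or dob_close(r1["dob"], r2["dob"])
    if name_match and dob_ok and shared:
        return "same"
    if shared:
        return "related"  # e.g. family members sharing an email address
    return "unrelated"


# Made-up records loosely following the walkthrough.
record_1 = {"name": "Robert Smith", "address": "123 Main St", "phone": "555-1212",
            "dob": date(1978, 12, 11), "email": "bob@example.com"}
record_2 = {"name": "Bob Smith", "phone": "555-1212", "dob": date(1978, 11, 12)}
record_4 = {"name": "Patricia Smith", "email": "bob@example.com"}

print(compare(record_1, record_2))  # name, phone and transposed DOB agree
print(compare(record_1, record_4))  # only the email address is shared
```

With these made-up records, the first pair resolves to the same entity and the second pair becomes a discovered relationship.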
And of course, if you can have discovered relationships, maybe you would also have disclosed relationships. On an employment application, for example, you might learn the spouse. So this one's very simple. It can't be the same person, because it says it's related to record three. So that's a disclosed relationship. And how about this record here? What would you do with that? Is it entity E1? Would you be sure? Could you be sure with just a name and a date of birth? We would think of that as a possible match, and you would just need a little more evidence to be sure.
Now how about this record? This is a bit tricky, because most systems out there would arbitrarily bind this to either record one or record four, since it has the same email address, but it really could be either. If that record had "shoot on sight" on it, you would not want to randomly give it to one or the other. So in fact, what you should do is hold it out. I will tell you, if entity E2 didn't exist, you could be confident in that moment that it was E1. And if E1 did not exist, you could be confident that it was E2. But when they both exist, you have to hold it out. We think of this as a form of ambiguity. Now what about this record here? It shares the name, the address and the phone of record two.
And so it will become part of E1, but it also shares the email address with record six. The moment you learn this, not only does it add to E1, but that other record has to snap in as well. This turns out to be something rather difficult to do. It means every time you learn something new, you have to look at the past millions of records, or hundreds of millions of records, and ask yourself: now that I know this, had I known it in the beginning, would I have changed my mind about any earlier decision?
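The hold-out rule for ambiguous records can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name `resolve`, the entity labels, and the email values are hypothetical, the record is assumed to carry nothing but an email address, and its other features are assumed to be compatible with every candidate.

```python
def resolve(record_email: str, entities: dict[str, set[str]]) -> tuple[str, list[str]]:
    """Decide what to do with a record that carries only an email address.

    `entities` maps an entity id to the set of email addresses already
    attached to that entity.
    """
    candidates = sorted(eid for eid, emails in entities.items() if record_email in emails)
    if len(candidates) == 1:
        return ("match", candidates)      # only one plausible owner
    if len(candidates) > 1:
        return ("ambiguous", candidates)  # hold it out rather than pick one at random
    return ("new", [])                    # nothing to attach to yet


# Two entities share an email address, so the record is held out.
both = {"E1": {"bob@example.com"}, "E2": {"bob@example.com"}}
print(resolve("bob@example.com", both))

# If E2 did not exist, the same record could be assigned to E1.
only_e1 = {"E1": {"bob@example.com"}}
print(resolve("bob@example.com", only_e1))
```

The harder part described in the talk, revisiting earlier decisions whenever a new record arrives, is not shown here; this sketch only covers the decision for a single incoming record.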
Now, similarly with this record, number nine: take a look at that and see if you can determine what it should do. I'll tell you, it's definitely not E1, because the date of birth is so different and it says Senior. So it's gonna become a different entity, but something else is gonna happen as well. What's gonna happen is that record three, which has an ID number, really belongs with it. So the moment number nine presents itself, record three pops out, and that disclosed relationship from E3 has to move with it.
That took a couple of hours right there, to make those move like that. Okay. What about this record? What do you think this record should do? If you looked at this for a while, you'd realize it doesn't contain enough information to match record one, and it doesn't contain enough information to match record two or three, or any of the records. Only when you look at the union of those records inside of E1 do you realize that that record belongs in E1. This is a big difference from record matching systems, because record 10 doesn't match any individual record. It only matches the union that has been created inside of E1. We think of that as entity-centric learning.
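The difference between matching a record against individual records and matching it against the union of an entity's records can be shown with a small Python sketch. Everything here is an assumption made for illustration: the field names, the sample values, and the threshold `MIN_SHARED` of two agreeing features are hypothetical, not taken from any real system.

```python
from collections import defaultdict

MIN_SHARED = 2  # hypothetical threshold: features that must agree to call it a match


def entity_features(records: list[dict]) -> dict[str, set]:
    """Pool every feature value seen on any record of the entity."""
    features = defaultdict(set)
    for rec in records:
        for key, value in rec.items():
            features[key].add(value)
    return features


def matches_record(candidate: dict, record: dict) -> bool:
    """Record-to-record matching: compare against one record at a time."""
    shared = sum(1 for key, value in candidate.items() if record.get(key) == value)
    return shared >= MIN_SHARED


def matches_entity(candidate: dict, records: list[dict]) -> bool:
    """Entity-centric matching: compare against the union of the entity's records."""
    features = entity_features(records)
    shared = sum(1 for key, value in candidate.items() if value in features.get(key, ()))
    return shared >= MIN_SHARED


# Made-up entity: each record holds only part of what is known about the person.
e1_records = [
    {"name": "Robert Smith", "phone": "555-1212"},
    {"name": "Bob Smith", "passport": "P123"},
]
record_10 = {"phone": "555-1212", "passport": "P123"}

print([matches_record(record_10, r) for r in e1_records])  # no single record is enough
print(matches_entity(record_10, e1_records))               # the union is enough
```

In this toy data the incoming record shares one feature with each stored record, which is below the threshold, but shares two features with the entity as a whole.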
Now here's the last one: what happens when you delete record eight, like in GDPR, the right to be forgotten? You say, "Hey, you can't know that record. You have to delete it." What you have to do when you delete a record is return the system to a state as if it never knew. So what might happen if you delete eight? I'll tell you what happens: record six has to go back to being a possible match, and it turns out record 10 has to become a relationship. It's this style of entity resolution that we used in that NORA technology for the casinos. Something funny about that project is that the head of corporate security every now and then would just look at us and go, "That's incredible," but he wouldn't tell us anything. They ran it like a secret system. But he retired.
And a few years after he retired, I took him out for a few beers and I said, "Tell me something." And he said, "You know that system you built us? It found us some very interesting leads, but we would sometimes get leads from other places, not from your system. The software that you built, NORA, did something beyond just finding leads." I was a little upset, because we'd built the entire system in 90 days for $25,000. So when he told me it did something extra, and this is the quote, he goes, "Never underestimate what else it did," I was thinking we must have undercharged if it did anything else at all. What he said is, "It allows us to do a form of enterprise discovery, where we can search this database and find things in the enterprise that would otherwise be unlocatable."
And it took me a few years to realize how significant that was. I'm gonna now show you how that can be used in GDPR compliance. Okay. So first, just to set this up: as you probably already know, GDPR compliance starts with one seemingly simple question. What do we know about the data subject? Well, it turns out this is an exceptionally hard question to answer, and I'm gonna show you why in four challenge statements. So here you get a data access request for Liz Reston, and now you have one of your employees looking for her data. Will they remember to search all the databases? Will they search the payroll database, because maybe Liz Reston was fired three years ago? Will they remember to look there? And will they remember to look in all of those secondary databases that get created in the data warehouse, the data marts, the data lakes and so on?
The second challenge is that data has a lot of variations. Maybe Liz Reston gave the name Liz Reston, and sure, you could find this record here. But would you find this record with Elizabeth and a different email address, or this record with a misspelled last name? What about all these records with the maiden name? Would it really be fair to think that the person making the search is gonna remember to search for Liz, Elizabeth and Beth, or the dozens of spellings of Mohamed? That's impractical. The third challenge, and this is often overlooked: maybe you're searching for Liz Reston with the email address. It's possible the payroll system has the same email address, but often payroll systems don't allow you to search on email address. They only let you search on the name, the date of birth, maybe the employee number, maybe a national ID. So how would you even search that database? Are you gonna call the IT department and have them run a custom query? And the fourth one: just because it looks the same doesn't mean it is the same. Here the email address is the same, but it's Bob Reston. You wouldn't wanna accidentally take the loyalty club data from Bob and send it to Elizabeth. You'd be the subject of your own data breach.
And really, this whole notion of having to search systems individually is rather unreliable and risky. It's kind of like having a library with no card catalog, and having to roam all the floors and all the aisles of the library to try to locate the data. We think of this as the missing link, because what's gonna happen is the data access request is gonna show up, and the question is, how are we gonna find the data? We surveyed for this. In January, we surveyed a thousand companies across the five largest economies in Europe. No surprise, many of the companies weren't feeling ready. Among the large companies, the estimate was 246 inquiries per month, and each inquiry would result in searching, on average, 43 databases, is what came out of this survey. Each database, they thought, would take seven minutes. I think it would take quite a bit longer. And it turns out an average large company is gonna need seven and a half full-time people just to look for the data.
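The survey figures quoted above can be checked with simple arithmetic. The short Python sketch below uses only the numbers from the talk (246 inquiries per month, 43 databases per inquiry, seven minutes per database, seven and a half people); the variable names are made up, and the hours-per-person figure is derived here rather than stated in the talk.

```python
inquiries_per_month = 246    # estimate for large companies, from the survey
databases_per_inquiry = 43   # average number of databases searched per inquiry
minutes_per_database = 7     # survey respondents' estimate
full_time_people = 7.5       # staffing figure quoted in the talk

search_minutes = inquiries_per_month * databases_per_inquiry * minutes_per_database
search_hours = search_minutes / 60
hours_per_person = search_hours / full_time_people

print(search_minutes)              # 74046 minutes of searching per month
print(round(search_hours, 1))      # 1234.1 hours per month
print(round(hours_per_person, 1))  # 164.5 hours per person per month
```

So the seven-and-a-half-person figure corresponds to roughly 165 search hours per person per month, which is about one full-time workload each. If each database takes longer than seven minutes, as the speaker suspects, the headcount scales up proportionally.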
And I think this is underestimated. So I'm just introducing the notion here of single subject search. You take the data out of the various source systems, whether it's 40 or 400 or 4,000, and you use entity resolution to create a central index. When Liz shows up, you search the index and you get search results. It's like going to the library and going to the card catalog, and it tells you exactly where the four or five books are. So you don't have to roam the halls or search all those systems. Now just a really quick demo, because I decided to do this on my own company. I went to my company's salesforce.com and just exported it as a CSV. And I went to MailChimp and FeedBlitz, and Zendesk for trouble tickets, and I just exported all of those. Then I imported them into this screen here and did entity resolution on them.
It recognized all of them, so there was no mapping to be done. They auto-mapped. I went up to the dashboard here and entered a search. Now, because it's real data, I can't show you much, but I entered my girlfriend. Her name's Michelle Jonas. I know it's kind of funny that it would be my girlfriend with my last name. It's a longer story, but pretty funny. What I did here is I misspelled her name. I put in one L instead of two, and Jonas is often misspelled and typed as Jones, so I put Jones. She actually lives on Pacific Street, not Pacific Court, in Santa Monica, California, but I just put in the zip. So that's what I put into this search. And because it's an entity-resolved index, and you're doing entity resolution on the search, it comes up and says, "We found one record."
And even though the name and address were spelled like that, the identity that you find is this. If you click the "show more" button, it says there are three records, and three data sources, that are part of that entity. And here are the three records. At the bottom here is where the name and the address lived, because that's in my contact list in Outlook. But it also found this record and this record, and these records did not even have a name and address. I've just located records about Michelle Jonas in systems that didn't even have the features I was searching for. You could never have found these if you had searched those systems with name and address. Ta-da. Any questions? We'll see if anybody has a question, but I'll just add: I answer every email I get from everybody on earth. So if anybody ever wants to email me about any topic whatsoever, I'll answer if I can. It might take a week or two, but I'm very accessible. Questions?
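The single subject search idea from the talk, one resolved index that answers "where is this person's data?", can be sketched in Python as follows. This is a toy under loud assumptions: the sources, record ids, and values are invented, and the grouping rule (records sharing an email or phone value belong together) is a crude stand-in for real entity resolution, which, as the talk notes, must not treat a shared email address as proof of identity.

```python
from collections import defaultdict


def build_index(records: list[dict]) -> list[list[dict]]:
    """Group records that share an email or phone value (union-find, illustrative only)."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    first_seen = {}  # (feature, value) -> index of the first record carrying it
    for i, rec in enumerate(records):
        for key in ("email", "phone"):
            value = rec.get(key)
            if value is None:
                continue
            if (key, value) in first_seen:
                parent[find(i)] = find(first_seen[(key, value)])
            else:
                first_seen[(key, value)] = i

    groups = defaultdict(list)
    for i, rec in enumerate(records):
        groups[find(i)].append(rec)
    return list(groups.values())


def search(entities: list[list[dict]], **query) -> list[tuple[str, str]]:
    """Return (source, id) for every record of any entity with a record matching the query."""
    hits = []
    for entity in entities:
        if any(all(rec.get(k) == v for k, v in query.items()) for rec in entity):
            hits.extend((rec["source"], rec["id"]) for rec in entity)
    return hits


# Made-up exports from three hypothetical systems.
records = [
    {"source": "contacts", "id": "c1", "name": "Liz Reston", "zip": "90401",
     "email": "liz@example.com"},
    {"source": "newsletter", "id": "n7", "email": "liz@example.com"},
    {"source": "helpdesk", "id": "t42", "email": "liz@example.com", "phone": "555-0100"},
    {"source": "helpdesk", "id": "t43", "email": "other@example.com"},
]

entities = build_index(records)
print(search(entities, name="Liz Reston", zip="90401"))
```

Searching by name and zip code returns the newsletter and helpdesk records as well, even though neither carries a name or a zip code, which is the effect described in the demo. This sketch does exact matching only; tolerating misspellings in the query would need fuzzy comparison on top.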
Questions. Yeah, we have questions, I'm quite sure. And by the way, we had questions for the other session as well, but due to the mix of the sessions they got lost. So if you could display the questions. You can see there are a lot of questions, and we still have a few minutes left. So
Let me take the right to be forgotten one real quick.
Yeah. So maybe let's start at the top and work down.
Well, can I take the right to be forgotten one?
You can, yeah. Feel free.
One of the things that we did when we built that system, NORA, in Las Vegas turns out to be the same situation as the right to be forgotten. If you're a problem gambler, by law you can go and register and say, "I have a gambling addiction," and you can put yourself on something called the self-exclusionary list. It's like telling a casino, "You have to forget me. You can't market to me." Now the person's at home and they're not being marketed to, but if the casino accidentally markets to them after it's been notified, and suddenly it says, "Come for the weekend, get a free room, get a free buffet," and that person shows up and gambles and loses money, they can sue the casino. And they have, for hundreds of thousands of dollars.
And what was happening is the casino would take somebody out of their marketing list, but data's kind of messy. It kind of wiggles around, it floats around, and suddenly they're starting to market to the same person again. So one of the exciting things that you can do with a central entity-resolved index is something that I think of now as continuous right-to-be-forgotten monitoring. It means that if they tell you to forget them, and a record of that person creeps back in, say suddenly in one of the marketing databases, you can be alerted to that before you accidentally start communicating with them again or doing analytics on their data. So I think that's pretty close to that one.
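One way to picture continuous right-to-be-forgotten monitoring is a suppression list that every incoming record is checked against. The Python sketch below is a simplified assumption, not the speaker's design: the class name `ForgetMonitor`, the sample identifiers, and the choice to store unsalted SHA-256 hashes are all hypothetical, and exact matching on normalized values stands in for the entity resolution a real system would apply.

```python
import hashlib


def fingerprint(value: str) -> str:
    """Hash a normalized identifier so the list does not hold it in clear text.

    An unsalted hash is only a light safeguard; it is used here for brevity.
    """
    return hashlib.sha256(value.strip().lower().encode("utf-8")).hexdigest()


class ForgetMonitor:
    """Remembers fingerprints of forgotten subjects and flags records that reappear."""

    def __init__(self) -> None:
        self._suppressed = set()

    def forget(self, identifiers: list[str]) -> None:
        """Register the identifiers (emails, phones, ...) of a subject to be forgotten."""
        self._suppressed.update(fingerprint(v) for v in identifiers)

    def check(self, record: dict) -> bool:
        """Return True if any value on an incoming record belongs to a forgotten subject."""
        return any(fingerprint(str(v)) in self._suppressed for v in record.values())


monitor = ForgetMonitor()
monitor.forget(["Liz@Example.com", "555-0100"])

# A record creeping back into a marketing database should raise an alert.
incoming = {"source": "marketing", "email": "liz@example.com"}
print(monitor.check(incoming))

# An unrelated record should not.
print(monitor.check({"source": "marketing", "email": "bob@example.com"}))
```

The check would run on every load into a downstream database, so the alert fires before anyone communicates with the person or runs analytics on their data.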
Okay. So how do you handle the possibility that data records might be entirely false, or humans just lying?
Ah, this is fairly easy to answer, actually, since a lot of my work over the years has been on how to catch clever, bad people, and they lie. The records are false. And the way you find falseness in data is you have to find a disagreeing data point. If your neighbor tells you they've never been to France, you don't know if they're lying. But later, maybe you're at the park with the neighbors for a birthday party, and the wife tells you, "Ever since my husband lived in France, blah, blah, blah." So that's how you find lies in data: you compare it to other data.
Okay. So I think we are close to the end of the time.
Can I make one comment on scalability?
Sure. Comment. Yes.
The biggest systems that I've built on entity resolution have tens of billions of records. And the biggest one that I have running in my new code base is a voter registration modernization project in the US, and we have a quarter of a billion records in that one so far.
So you can go big. Okay. Thank you very much.