#50
January 31, 2022

EP50 The Epic Battle: Machine Learning vs Millions of Malicious Documents

Guest: Elie Bursztein, Google

23:27

Topics covered:

  • This episode draws on a talk available in the podcast materials. Could you summarize the gist of your talk for the audience?
  • What makes the malicious document problem a good candidate for machine learning (ML)? Could you have used rules?
  • “Millions of documents in milliseconds,” not sure how to even parse it - what is involved in making it work?
  • Can you explain to the listeners the motivation for reanalyzing old samples, what ground truth means in ML/detection engineering, and how you are using this technique?
  • How fast do the attackers evolve and does this throw ML logic off?
  • Do our efforts at cat-and-mouse with attackers make the mice harder for other people to catch? Do massive-scale ML detections accelerate the attackers' evolution?

Do you have something cool to share? Some questions? Let us know:

Transcript

Timothy: Hi there. Welcome to the "Cloud Security Podcast" by Google. Thanks for joining us today. Your hosts here are myself, Timothy Peacock, the product manager for threat detection here in Cloud, and Anton Chuvakin, a reformed analyst and esteemed member of the cloud security team here at Google. You can find and subscribe to this podcast wherever you get your podcasts as well as at our website cloud.withgoogle.com/cloudsecurity/podcast. If you like our content and want it delivered to you piping hot every Monday, please do hit that subscribe button. You can follow the show and argue with your hosts on Twitter as well: twitter.com/cloudsecpodcast. Anton, we've got Elie Bursztein joining us again today. And this is an exciting episode, because we're talking about a very, very real application of machine learning to the problem of threat detection, which I'm always the first person, I think after you, to say you can't do.

Anton: I would say it wasn't "I can't do that." It was more like: most people who say that they did it really didn't, or most people who really did do it didn't succeed, or they achieved much less but marketed it much more. It's less about "it can't work" and more about "it kind of hasn't so far, despite all the noise." But you know what? We're gonna have an example today where it very clearly did work.

Timothy: That's what's fun about this, you know? It's not like we're doing this one out of marketing. We've actually been doing this for a long time and it's not a service we sell. We're talking about this just because we think it's an interesting thing.

Anton: Yes, correct. And it's also--it will become very clear from the episode that some of the stuff is kind of magical, but it's also magical as in it works well with other techniques. It works well together with rules. It works well together with very non-magical stuff that's just hard work.

Timothy: Yes. 

Anton: So to me, this episode will prove to the skeptics that AI is not good in security, and it will prove to the optimists that AI is great in security.

Timothy: I think the truth, as always, might lie somewhere in between. And with that, let's welcome today's guest, Elie Bursztein from Google. Elie, this episode draws on a talk available in the podcast materials. I'm assuming some of our listeners didn't do their homework. So could you summarize the gist of your talk for the audience, so they don't have to, you know, pause the episode, go read the thing, and then come back to us?

Elie: Absolutely. Thank you for having me, Tim. 

Timothy: Of course.

Elie: The talk you're referring to is a talk I did, I think, two or three years ago, around the challenges when you try to apply machine learning to anti-abuse in practice, based on our experience at Google, right? We tried to go over the main pitfalls, not necessarily having answers for all of them, of course; as we're going to discuss, there are lots of open questions, lots of things to figure out. But we tried to be very, very concrete about what happens when you have a false positive, how you deal with people actively trying to tamper with your neural networks or your classifiers, as we see regularly, and things like these. So I think we'll go over some of those questions, right? That's the gist of the talk. As usual, the talk is available after the fact. If you didn't do your homework, maybe you'll be inspired to locate it afterwards. I'm sure there will be a link somewhere.

Timothy: There is. 

Elie: Awesome.

Anton: When I was reading the talk, one thing that struck me kind of hard is that in one place you mentioned processing millions of documents in milliseconds. And frankly, I don't even know if my brain can parse that: millions of documents in milliseconds.

What is involved in making this work? People may not get the scale of what you're doing. So how do you process millions of documents in milliseconds? How does it work?

Elie: With difficulty. I think the 100 milliseconds, you know, came from the latency we need to scan documents that flow through Gmail or Google Drive, right? A lot of the content we have needs to go through a service, and that service is the attachment scanner that Google has, which is also used for Google Workspace. So it's used for end users and for companies as well, right? If we were to spend, say, another ten minutes per document, people would not receive their mail, or you'd want to download a Drive file and you wouldn't get your Drive file. So the requirement is a production constraint. And this is something that production teams are running at Google. We are researching; we've been building the machine learning side and the static analyzers and things like that, and the production team runs it [inaudible]. And I think beyond the 100 milliseconds, there's the coverage, right? We have to do it worldwide, 24/7, reliably, across the world, for every document and every attachment going through Gmail, to make sure none of them contain malware, with all the difficulty of this notion of [inaudible] attack we can briefly touch upon. These 100 milliseconds are very, very difficult to hit, even with a lot of parallel computation and containers, as always. That's Google scale, right? That's what it amounts to today, to be able to scan and protect yourself from malware.
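For listeners who want a feel for the shape of such a pipeline, here is a minimal sketch, in Python, of fanning a document out to several detectors under a hard latency budget. Everything in it is a toy assumption: the analyzer functions, the timeout value, and the fallback route are illustrative, not Google's actual scanner.

```python
import asyncio

LATENCY_BUDGET_S = 0.100  # the 100 ms production constraint discussed above

async def static_analyzer(doc: bytes) -> str:
    # Stand-in: a real analyzer parses the file format, extracts macros, etc.
    await asyncio.sleep(0.02)
    return "clean"

async def ml_classifier(doc: bytes) -> str:
    # Stand-in: a trained model scoring features extracted from the document.
    await asyncio.sleep(0.04)
    return "clean"

async def scan(doc: bytes) -> str:
    # Run every detector in parallel and enforce the overall deadline.
    # On a blown budget, a plausible design routes the document to deeper
    # (slower) analysis rather than silently passing it through.
    try:
        verdicts = await asyncio.wait_for(
            asyncio.gather(static_analyzer(doc), ml_classifier(doc)),
            timeout=LATENCY_BUDGET_S,
        )
    except asyncio.TimeoutError:
        return "needs_deeper_analysis"
    return "malicious" if "malicious" in verdicts else "clean"

print(asyncio.run(scan(b"%PDF-1.7 ...")))
```

The point of the sketch is the budget, not the detectors: the whole fan-out must fit inside one deadline, which is why per-document work has to be parallel and heavily cached.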

Anton: Essentially, you do it because you have Google scale, because you are Google. A short and somewhat non-techie answer is that if you have to process documents in 100 milliseconds, then you make it so, because you are Google and have the infrastructure. Is that too snide of a remark?

Elie: A little bit snide.

Timothy: No, Anton. I think what you've done is rederive the expression "necessity is the mother of invention." Elie had to do this, so we did it.

Anton: Kind of. 

Elie: I think we spent a lot of effort. It's not only Google scale. You know, with CPUs you can scan each document in parallel, right? That's what we do. We have I don't know how many thousands of parallel computations where the service absorbs the documents. It's still a lot of work, done by the anti-abuse teams, to sandbox things into various containers, to make sure that where there is malware you don't go process something dangerous without limiting its access, and to make sure that all the libraries and functional code are as fast as we can get them, using caching. So there's a lot of engineering behind the scenes. I do think as a company we do that very well, and we're not the only ones who can do it. But you're right, it required a lot of engineering, and it requires a lot of ongoing effort. You might be able to process 95% of documents that fast, and then the last one percent, the difficult cases, is disproportionately where you put a lot of effort, right? So it's kind of this [inaudible] law: you can get 80% there, and then on the last 20% you spend 10 years, 20 years trying to perfect your technique so that it actually works well. That's probably what makes it unique to be a large company, having enough resources to do it. And that's the very, very expensive cost of security in general: the ongoing battle and the ongoing effort to optimize. The more you do, the more your processing advances, right? The more you actually cram into detection. Between what we were doing ten years ago and what we do now, the amount of complexity has increased drastically, but we still have the same little envelope of time, so I don't know which grows faster: the complexity or the amount of parallel computation. And you know, Moore's law is dead, so CPUs start to slow down, so now you have to use TPUs, and then different types of accelerators. So it's a whole-stack effort, and we really do benefit, for sure, from the advantage that Google brings in terms of computation efficiency and accelerators, to be able to do more in the same amount of time. I don't know how long it's going to be sustainable, but so far we've made good progress.

Anton: That's pretty clear, and I apologize for my kind of "hey, just throw Google scale at it and it's solved." I understand; I used to joke that even a file copy at petabyte scale is a major engineering effort that may require a PhD to accomplish. But I wanna switch gears to the other big complexity here: a lot of vendors, a lot of people, throw ML at various security problems, and occasionally the results are quite hilarious, or they're just bad. We know now that malicious documents is a good problem for ML, but only because we made it work. So what makes this security problem a good candidate for applying machine learning? Could we have used just rules, and if not, what would we miss? Why is the ML unicorn delivering this time?

Elie: So we're doing the hard questions today, I see. Okay. Let's do the hard questions today.

Timothy: Oh, no. This is just the start of the hard questions. They get worse.

Elie: This is just the start? Oh boy. Okay. Well, what makes it a good candidate? The first thing to say is that at the time we did the talk, which was two years ago, it was already four years in the making. So it didn't happen overnight; we did test a lot of things. The reason why we did it is, I think, two-fold. The first one is we've been driven by data, or data-driven. As a reminder, Gmail does not allow executable attachments. So when we started to work on how we could improve the safety of Gmail users, whether they are enterprises or end users, we had only attachments to work with. That constraint already shapes the space. If you think about it, we could have started with Drive, and Drive does allow binary attachments and so forth. However, when we looked at the data, we felt the telemetry was showing a very, very strong rise in Office documents and PDFs, and documents in general, as a precursor of the malware. For the people who are joining us: what usually happens in a malware infection is the attacker will send you a document which will either execute a macro or, sometimes, an exploit, although that's rare these days. You click on it, and then it downloads some sort of binary, decrypts it potentially, and then executes it. That is the end payload. So you have the precursor, which is a document, and you have the payload, which is more along the lines of the binary. We don't have the payload in Gmail. So that was one of the things. Now, what we knew from the machine learning standpoint is that a lot of progress had been made since 2015 on natural language processing, which is essentially processing text. Documents, contrary to binaries, which are compiled, have scripts, right? They have content which, if you really extract it, is very, very close to programming languages. It's not exactly the same, and there's a lot of transformation to do, but you can apply a lot of programming-language techniques to it, and you can apply a lot of natural language processing ideas from the machine learning side of the field. So we thought we'd give it a try, and we found the right formula. Just to be clear, the machine learning is only applied to very, very specific threats, including VBA-based and Excel 4.0-based detection, which are the ones with very complicated scripts. And in those specific cases, it works. So I think part of the success was driven, as I said, by necessity, and part by the opportunity of benefiting from a lot of the work done by other teams at Google who drove research on AI and had good ideas. When you bring those two together and try to make it efficient, it kind of works. It also points back to a business need, right? Kind of a need [inaudible], right? I try to draw a chart of complexity of processing versus the time it takes. On one end, you have anti-viruses, which are extremely fast but miss a bunch of zero-days [inaudible], right? Variations of attacks, or Excel 4.0, which [inaudible] sometimes goes undetected, things like that. On the extreme other end, we have a sandbox, where you open the document, see what happens, and instrument a full VM. At 100 milliseconds, we can't really do that. We can do it for very, very specific elements [inaudible] for protection reasons, and for very, very suspicious documents, but in general we cannot do it at scale. So it seemed that you could pay a little bit more than an AV, but not too much, to get way more in-depth coverage.
And I think with machine learning we don't try to replace everything. The startups which had, I would say, interesting results got them because they tried to do everything at once, while we were trying to say, okay, let's fit the technology in between and add it to our tool belt to increase resilience. More often than not, we are mostly concerned about the lowest detection rate, not the highest detection rate. You know, are you at 99.99 at peak? Not really the question. What's important is: how low do we dip when there is an active attack? Over the last few days there is, for example, Emotet, which is roaming the world. If you follow the press, there are a lot of massive campaigns these days, mostly driven by Office documents and Excel again. I can't give the official numbers, but let's say something which has contributed significantly to stopping what had not been seen by any other technology was the deep learning, because it extrapolates. But then again, it misses things that other defenses have detected, and I think that's what makes it very, very different. That's what we tried to convey in the talk: we apply AI as a belt-and-suspenders idea rather than as a stand-alone. It will have some pros on certain aspects and some deficiencies on others, and we really, really aim for resilience. I think that's what makes security a very, very different field: we want as much resilience, as much of this kind of doubled security, or defense in depth, right? That's the main principle that other people have mentioned. So that's the idea. I think that's what makes it different: we took a very, very specific problem and really tried to add to the stack, not to replace it. I don't know if that makes sense.
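To make the "NLP techniques on extracted scripts" idea concrete, here is a minimal sketch of a macro-text classifier. It is not Google's pipeline: the four hand-written macros, the character n-gram features, and the logistic regression model are all toy stand-ins for what would really be millions of labeled samples and a far heavier model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for VBA/XLM macro source extracted from documents.
macros = [
    'Sub AutoOpen(): Shell "powershell -enc ..." : End Sub',
    'Sub AutoOpen(): CreateObject("WScript.Shell").Run cmd: End Sub',
    'Sub FormatReport(): Range("A1").Font.Bold = True: End Sub',
    'Sub SaveBackup(): ActiveWorkbook.SaveAs backupPath: End Sub',
]
labels = [1, 1, 0, 0]  # 1 = malicious, 0 = benign

# Character n-grams tolerate the token mangling attackers use to dodge
# word-level signatures; that is the "treat scripts like language" idea.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
    LogisticRegression(),
)
model.fit(macros, labels)

suspect = 'Sub Workbook_Open(): Shell "powershell -enc QQBC..." : End Sub'
print(model.predict_proba([suspect])[0][1])  # estimated P(malicious)
```

Note how this slots in as one more verdict alongside AV signatures and sandboxing, per the belt-and-suspenders point above, rather than replacing them.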

Timothy: No, that makes a ton of sense. I wanna keep going deep on the ML piece here, 'cause I found one of the techniques really interesting. One of the things that you described doing is reanalyzing old samples as a way of finding ground truth. Can you first off explain to listeners what ground truth is, why it's important, and then, in the discipline of ML-based detection engineering, what you're really trying to do here?

Elie: Three questions. 

Timothy: Yeah. 

Elie: Okay, I'll try to answer all three. 

Timothy: Perfect.

Elie: Let's start with the first one. What is ground truth? When you're doing machine learning, the machine learns by example. So you hope to be able to teach the machine what to recognize, and the power of the machine is not really to invent new things or to think. What it does really well is interpolate. Interpolating means that if there are two samples with a slight variation between them, it will be able to say, well, there is something along those lines; it's a very, very high-dimensional, complicated function approximation, but that's what it does. It interpolates extremely well. So it deals really, really well with variations, provided you train it correctly. So in reality, when people talk about new malware, there's not a whole lot of genuinely new stuff.

Timothy: Rumsfeld [inaudible]. 

Elie: Yeah, exactly. I think the constraints of what you can do with VBA or Excel or PDF are kind of well-known. The exception is zero days, right? A zero day will completely catch you off-guard, and I do think the machine learning will not see them, to be clear. That's why you need all the technologies. But for most of the things attackers do, they do a flavor of this, a flavor of that. Last year it was in the press: many groups, because VBA was so well defended, decided to use Excel 4.0. Excel 4.0 is a 20-year-old format. So when we talk about looking into the past, some of the attacks are using very, very old legacy languages that were invented 20 years ago. So you get this idea of why looking into the past is important. The other problem we have is that when you train, you need examples of what you want to see, but you also need a recorded verdict. And the verdict is essentially whether this document is malicious or not.

That is very, very hard to get if you don't wait a little bit, because AVs, like any kind of company, improve their scanners, and we discover what was missed. Sometimes we have user reports; we have industry reports. So as we adjust our scanners, we relabel, with current knowledge, something that came in a few days back. Basically, we improve our understanding of the past. That's what I meant by looking into the past. With a document today, the machine learning can make a guess, and what may happen in some cases is that every other system says it's not malware and the machine learning says, "No, no, no. I'm sure it is." So now you have this ambiguity of whom to trust. It turns out, based now on two years of hindsight, that the machine learning is mostly right. Mostly right, as in 99-point-something percent; I don't want to put the exact number because I'm not sure exactly what it is, but over 99%. And when that happens, we tend to make the decision to send the document to the spam folder or to block it, because the machine learning has seen something it interpolated its way to. That said, we retrain new models, and the models have to be retrained because the landscape keeps changing. That's one of those things which is different from normal machine learning. If you try to recognize a dog or a cat, barring a nuclear winter, your cat is going to look the same. The problem is, when we look at the types of malware, they are different, right? They are 60 to 80% different day to day. As I said, not completely out-of-the-blue new things, just variation and constant change by attackers who try to evade systems, and they probably have, you know, a copy of every AV out there. There are even underground services which let you test your malicious payload against all the known AVs in private versions; people sometimes call that private VT. It's very funny. Yeah, stuff like that, right?
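The relabeling loop Elie describes can be pictured in a few lines. A hedged sketch, with hypothetical names: `todays_scanner` stands in for whatever the current detection stack knows (fresh signatures, user and industry reports), and the relabeled corpus becomes the improved ground truth the next model trains on.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    doc_id: str
    verdict: str      # "clean" or "malicious" under current knowledge
    verdict_day: int  # day the verdict was last refreshed

def todays_scanner(doc_id: str) -> str:
    # Hypothetical stand-in for the latest detection stack.
    return "malicious" if doc_id.endswith("_evasive") else "clean"

def relabel(corpus: list[Sample], today: int) -> list[Sample]:
    """Re-scan old samples so training labels reflect current knowledge."""
    for s in corpus:
        new_verdict = todays_scanner(s.doc_id)
        if new_verdict != s.verdict:
            # A miss (or false positive) discovered in hindsight: exactly
            # the refreshed ground truth the retrained model learns from.
            s.verdict, s.verdict_day = new_verdict, today
    return corpus

corpus = [Sample("invoice_001", "clean", 0), Sample("payroll_evasive", "clean", 0)]
for s in relabel(corpus, today=30):
    print(s)
```

Because samples drift 60 to 80% day to day, this loop is not a one-off cleanup; it is the standing mechanism that keeps labels, and therefore retrained models, honest.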

Timothy: This is what I find so fascinating about using ML for security problems. You don't have geese out there trying to look like ducks for the most part, but we do have an active adversary that we're playing against here. Somebody interested in buying Toyotas isn't pretending to look like they're interested in buying Fords. Advertising has a much easier problem than we have in security when it comes to applying ML to these problems.

Elie: I don't know.

Timothy: You disagree?

Elie: I don't know about that.

Anton: But we do have vendors who look like complete ducks when they try using ML naively.

Elie: I don't know about that, but I will say, you know, ads have their own problems. It's very, very hard for them, especially with online ads. It's a bit of a digression, but they also have this problem of people trying to make fake ads, ads which are not legitimate [inaudible]. There is a lot of difficulty in not having ads serving malware, spreading disinformation, or redirecting you to counterfeits. There are quite a few challenges there. I would say recognizing cat versus dog is a pretty safe bet, something you can do that will probably work okay, and yes, we don't have that in security. But in certain cases, as I said, particularly for documents, it's very constrained [inaudible] a primitive underneath it. So with that, you can train the machine learning, which is a pretty good way to interpolate. I'm very careful to not say it detects new things or zero days, as I'm unsure that it's doable. I'm not sure it is doing it; maybe it does. Obviously, it's worth reiterating that we do not look at our users' content, so we also cannot have a lot of certainty about most of the documents. The documents we can look at are either reported to us or show up later on public sources; we don't inspect the remaining content. So sometimes we detect something and we only have a statistical estimate. If there was a zero day, maybe it was blocked, but I have no way to say that. What we can say for sure is that we have seen the machine learning catch parts of past campaigns and sometimes add up to 100%, up to 200%, increased detection, which means something like two-thirds were missed by other techniques, because it is able to interpolate better. So I think that's what it does. It's a very underwhelming thing, by the way, if you think about it. It's very, very far from the promise of, you know, a robot killing all humans.

Anton: Fixing all problems. 

Elie: Yes. 

Anton: Yes, that's right. The replacement narratives, like "you don't need rules anymore, just use our ML classifier and it always gives you the right answer." It's not that we're far from that. We're just not going there.

Elie: I think some people try. There is this field called certified ML, where they try to tell you they have some guarantee that the ML will not be too far off. I do think for a stable problem that may be doable. Now, what is a stable problem? Those are fairly constrained. That's not going to be self-driving cars; a self-driving car is not a stable problem, because there are so many things that can happen on the road. A stable problem might be controlling an elevator, or maybe some kind of processing.

Anton: Almost nothing in cyber basically. In our beloved domain, almost nothing is like that. 

Elie: In our domain, it is extremely unlikely that we'll find something where we have strong guarantees of performance, but I've also never seen any vendor guarantee a detection rate, even traditional AVs, or even Google. You can't promise, under adversarial attack, that you're not going to be evaded. You can certainly do everything you can to mitigate that, have controls in place, and have multiple layers of defense, but there's always something that will go through.

Anton: Which kind of reminds me of the other angle on this, about stable problems: do we have any evidence that attackers are trying to evolve and use anything counter-ML, anything that tries to attack or disrupt or bypass ML techniques specifically? Obviously, people try to bypass rules by doing things differently. In your work, do you see any evidence of the attackers countering the ML techniques in particular?

Elie: Aha, so that's what you meant a few minutes ago when you said we were just starting on the hard questions?

Anton: With the easy questions. Yes, correct. 

Timothy: That's right.

Anton: That is now when we get into the hard questions.

Elie: Okay, so the reason why it's a hard question is that I don't have an answer. However, I can tell you two things. One, I think there was a report, maybe by Gartner, which said they think 60% of the attacks they see are attacks on ML. I'm not sure how they got to that estimate, so we should ask them. We do see people trying to game the system very aggressively. We don't have, as far as I can tell at Google, any system which is only ML-based, so I don't know if the attacker is able to literally optimize against one specific component. I would say they do optimize very aggressively against what we do. As I mentioned, what was very, very interesting is that a few months after we launched our ML detector for VBA, we saw a very strong shift in how the attackers were trying to evade detection, by moving to Excel 4.0. The shift was already underway, but we saw a very, very large move, say 30% or something like that: the amount of VBA decreased significantly, and the amount of Excel 4.0 increased significantly, because they realized that was not something we had ML for at that point.

Timothy: Hmm.

Elie: So there is definitely some sort of reaction we can see. We can see them trying to evade, for sure. We see them trying to poison our datasets: people doing false reporting on Gmail, spam-folder things. Moving things out of the spam folder has been a classic for many, many years. So I do think we see both of those. And then the last thing, which is a bit more depressing if you put on the scientific hat: what we are looking at here is called out-of-distribution samples. There are a lot of negative results. We think it's as hard to detect them as it is to know whether or not something is something you haven't seen; it's out-of-distribution samples [inaudible]. It's really, really hard to detect, and some people in the ML community argue that you can't do it reliably. You can maybe detect some specific techniques, some evasion techniques, if you are looking for them, but that supposes you know what the attacker is using to do this. So if one day there is, like, an off-the-shelf tool to do this, you could probably optimize against that. I think people do some of this type of thing for deep fakes.

Timothy: Hmm.

Elie: Where they try to look for, you know, known libraries which generate deep fakes, and try to detect those. But then you know what you're looking for. If you have an attacker and you don't really know what they're doing, it will be really, really hard. So I don't even think we can detect them. That seems to be the short answer: I think it's really, really hard to detect, and there is no theoretical foundation to say it's actually doable. I think you can find some artifacts if you have seen the same technique used multiple times, and you will maybe be able to detect some of it. Now, how much we miss, we don't measure. The closest I've ever seen was the Gartner report, if I remember it correctly.

Timothy: Well, those Gartner reports. You know what we always say about them around here.

Anton: Aww. 

Elie: We thank them. They are great people, super nice. They sometimes reach out to ask questions. So I like Gartner.

Timothy: They're very nice people and they intend to do a good thing in the world.

I wanna move us towards wrapping up, because we're just about at time here. Our traditional closing questions are maybe not super applicable given how on the edge the work you're doing is, but I'll ask anyway. Do you have one weird tip for people to get better at using ML for security, and what's your recommended follow-up reading, aside from, you know, the talk that we will again link to in the show notes?

Elie: Wanting to get better at cybersecurity?

Timothy: Specifically using ML.

Elie: Using ML for security?

Anton: Please don't say "get terabytes of documents," because this would be like cheating.

Elie: Okay. Actually, I was not thinking of that at all. I would say pick one [inaudible] and do it yourself and play with it. Just getting this thing working, Anton, will give you a lot of insight into how it works and what the difficulties are. It's probably less hard than it might seem to get started; you get to 80% of the result. The really, really hard part of ML is the last 20%, getting, you know, from 85% to 90%. That's the place where it gets very, very, very hard. But simply pick one idea. "I want to take phishing pages." Or redo something; there was a very, very nice dataset by Sophos on malware, to do detection. Pick any of those, apply the basic deep learning techniques, and try to get it to work. It's going to be very satisfying, because, a, you get something to work. You get your own detector; you should be happy. And, b, you will see all the different types of difficulty, from getting the data, to cleaning the data, to training, to, you know, getting the GPU to work to do the deep learning, which is not that easy, actually. You go through all those stages, you get that experience, and that experience will make clear all the things you could learn and all the directions you can go.

Timothy: Hmm?

Elie: I would say keep it very, very narrow. I think when people look at ML, they tend to look at courses which have a lot of theory and a lot of applications, and you do a little bit of this, a little bit of vision, a little bit of NLP, and it seems impossible; you don't know where to start. And I say: that's okay. Pick one. Even the simplest neural network will get you somewhere. It's not going to be state-of-the-art, but that's okay; that's not the goal. A reasonable network on a reasonable problem will work reasonably well. And I think AI is at the point now where you can do that, and you can do it at home. You can use Colab if you don't have a GPU, which is the online service Google provides, free, to train models, and which is heavily used. When you train your model, you have the satisfaction of seeing it climb. You watch your accuracy. You play. It's a very, very hands-on practice. It's very, very practical. It's very much through experimentation that you learn about it. So I would actually steer away from theory first and just try to get a sense of how it works; the theory will make more sense once you get something to work. So, completely the reverse of the advice I hear a lot.
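In that spirit, here is about the smallest honest version of "watch your accuracy climb." A sketch only: the dataset (scikit-learn's built-in 8x8 digits) and the tiny network are stand-ins for whatever narrow security problem you pick, and it runs as-is in a stock Colab notebook.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# A "reasonable network on a reasonable problem": a one-hidden-layer MLP.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# max_iter=1 plus warm_start=True makes each fit() call one training pass,
# so we can watch test accuracy climb epoch by epoch (expect a harmless
# ConvergenceWarning per pass).
model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=1,
                      warm_start=True, random_state=0)
for epoch in range(20):
    model.fit(X_train, y_train)
    print(f"epoch {epoch:2d}  test accuracy {model.score(X_test, y_test):.3f}")
```

Swapping the digits for your own features (phishing-page text, macro strings) is the exercise; the training loop and the satisfaction of the climbing number stay the same.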

Timothy: No, I love this. For an episode where so much of this was "well, this is really hard, this is a very open space, this is very difficult to get right," your advice is very practical. Just go try it. Go do it. And go do it in a way that is achievable, narrowly focused, and that listeners could actually hang up this episode and go do. It's amazing.

Elie: As I said, I think we are at the point today where, with the tooling that exists and the resources that exist, you can, if you put your mind to it, get to 80% of the efficiency. I think everyone can get to 80% if they put their mind to it. So it's not out of reach; you can get there. The last 20%, and that's what makes the difference between a prototype that you build for learning and an actual product, is really, really hard. And then what distinguishes a great product from a good product in AI is when you get to the last 1%. That last 1% requires even more effort, way more advanced techniques, and way more computation than everything else. It has this kind of exponential difficulty, and then you have to maintain it. So I do think you can get started very easily. Making it work in practice is a whole different level of skill. And maintaining it, and keeping it [inaudible] in an extreme adversarial context, is where it's really, really hard, and why there are so few AV companies, really, and so few really good ones. They put up the resources; it's very expensive for them to do it, because it's really, really hard. So I encourage you: get to the 80% there, and then you'll [inaudible] a bit, and it will be awesome.

Timothy: That's a fantastic answer. Elie, thank you so much for joining us today.

Elie: Of course. Bye-bye. Thank you so much. 

Timothy: Thank you. 

Anton: And now we are at time. Thank you very much for listening, and of course for subscribing. You can find this podcast at Google Podcasts, Apple Podcasts, Spotify, or wherever else you get your podcasts. Also, you can find us at our website; no more website jokes: cloud.withgoogle.com/cloudsecurity/podcast. Please subscribe so that you don't miss episodes. You can follow us on Twitter: twitter.com/cloudsecpodcast. And of course, your hosts are also on Twitter, pretty active I would say: anton_chuvakin and _timpeacock. Tweet at us, email us, argue with us. And if you like or hate what you hear, we can invite you to the next episode. See you on the next Cloud Security Podcast episode, and of course, welcome to 2022.
