#75
July 18, 2022

EP75 How We Scale Detection and Response at Google: Automation, Metrics, Toil

Guest: Tim Nguyen, Director of Detection and Response at Google

27:27


Topics covered:

  • I know we don’t like to say “SOC” here, so why don’t we talk about the role of automation in detection and response (D&R) at Google?
  • One SRE concept we found useful in security operations is “toil” - How do we squeeze toil out of D&R practice at Google?
  • A combined analyst and engineer role (just like an SRE) was critical for both increasing automation and reducing toil. How hard was it to put this into practice? Tell us about that journey.
  • How do we automate security signal analysis? Can you give us a few examples?
  • D&R metrics have been a big pain point for many organizations. How does SRE thinking about SLOs and SLIs (and less about SLAs) help us in our “not SOC”?
  • How do we avoid falling into the “time to respond” trap that rewards fast response, sometimes at the cost of good?


Transcript

[MUSIC PLAYING] TIMOTHY: Hi there. Welcome to the "Cloud Security Podcast" by Google. Thanks for joining us today. Your hosts here are myself, Timothy Peacock-- but not the only Tim on this episode-- the Product Manager for Threat Detection here at Google Cloud, and Anton Chuvakin, a reformed analyst and esteemed member of our Cloud Security Team here at Google.

You can find and subscribe to this podcast wherever you get your podcasts, as well as at our website, Cloud.WithGoogle.com/cloudsecurity/podcast.

If you like our content and want it delivered to you piping hot every Monday afternoon Pacific Time, please do hit that Subscribe button. You can follow the show and argue with your hosts on Twitter as well-- Twitter.com/cloudsecpodcast.

Anton, we have another Tim joining us today.

ANTON: Correct. But to me this is really not why this episode is so interesting.

TIMOTHY: It's not?

ANTON: It's not.

TIMOTHY: I thought that was the single coolest thing.

ANTON: Yeah?

TIMOTHY: No, it's not. It's definitely not.

ANTON: Well, Tim-squared or Double-Tim, or whatever the--

TIMOTHY: Tim times Tim? No. This Tim is an awesome guest, because he's responsible for protecting Google in a really, really real way.

ANTON: Yes, correct. And also it's an episode that would remind the audience of another super popular-- I think it was the number two for a while-- episode with Julian that focused on detection at Google. So, to me, this is going to be very fun.

TIMOTHY: Yes, I think this is going to be super fun. We cover a lot of great ground. And so with that, no further ado, let's welcome today's guest.

This is our first episode with two Tims. Today, we are joined by Tim Nguyen, a Director of Detection and Response here at Google. I know we don't like to say SOC. But why don't we talk about the role of automation in, say, detection and response? We won't say SOC, just detection and response.

TIM NGUYEN: Yeah, thanks. Thanks for that, Tim. My team would have had a stern word with me if I called them a SOC. So thank you for that kind intro.

ANTON: SOC-less. Kind, but SOC-less.

TIMOTHY: No. I've worked with them for years now. I know not to call them a SOC.

TIM NGUYEN: Yeah. So I think listeners of your previous episodes have heard of the team name. We call ourselves the SST, Security Surveillance Team. We are responsible for the detection of all malicious activity at Alphabet. So our scope is huge.

And so directly to your question of, what role does automation play in detection and response? A huge role. A huge role, right? I mean, our scope is all of Alphabet. A couple of characteristics of that scope-- we're the largest Linux fleet in the world. We have really every flavor of operating system out here-- Linux, Windows, Mac, and software as a service, infrastructure as a service, et cetera.

And then we have about 150,000 employees at last count. So the only way that we could ever hope to defend against malicious activity at this scale is through automation. And automation has enabled us to have a number of key game changers.

So first and foremost, it's less gathering, more direct analysis. So we don't have to go and manually retrieve machine information. We don't have to go and manually retrieve user information. We don't have to manually go and retrieve process executions, those types of things. So all of that information enrichment is automatically gathered by machines and presented to a human being, alongside some hints as to where to go.
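Illustrative sketch (not from the episode): the "less gathering, more analysis" idea can be pictured as an enrichment step that runs before a human ever sees the alert. Everything below is an assumption made for illustration: the lookup tables `MACHINES`, `USERS`, and `PROCESSES`, the `EnrichedAlert` type, and the hint wording are hypothetical stand-ins, not Google's actual systems.

```python
from dataclasses import dataclass, field

# Hypothetical in-memory lookups standing in for real inventory systems.
MACHINES = {"ws-1042": {"os": "Windows 11", "owner": "alice"}}
USERS = {"alice": {"department": "finance", "is_admin": False}}
PROCESSES = {"ws-1042": ["explorer.exe", "invoice_viewer.exe"]}


@dataclass
class EnrichedAlert:
    alert_id: str
    host: str
    machine: dict = field(default_factory=dict)
    user: dict = field(default_factory=dict)
    processes: list = field(default_factory=list)
    hints: list = field(default_factory=list)


def enrich(alert_id: str, host: str) -> EnrichedAlert:
    """Gather machine, user, and process context so the analyst does not have to."""
    machine = MACHINES.get(host, {})
    user = USERS.get(machine.get("owner", ""), {})
    enriched = EnrichedAlert(alert_id, host, machine, user, PROCESSES.get(host, []))
    # Hints point the human at where to look first (assumed wording).
    if not machine:
        enriched.hints.append("host missing from inventory: verify asset ownership")
    if user.get("is_admin"):
        enriched.hints.append("user has admin rights: check for privilege abuse")
    return enriched


if __name__ == "__main__":
    print(enrich("alert-1", "ws-1042"))
```

The point of the design is that the analyst starts from a populated record plus hints, rather than from a bare alert ID.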

ANTON: Sorry for interrupting very briefly. But these are the tasks that everybody knows, that nobody has time for them. And you, very calmly and clearly, explained that we don't do them because automation does them. Elsewhere, in many places, it's manual. So this is a bit of a paradox, right? Everybody agrees nobody has time for it, but very few automate them. This is kind of the point, right? One of the points?

TIM NGUYEN: Yeah, absolutely. And we made a very conscious decision early on-- we called it less gathering, more analysis-- or something pithy like that. I forget. Somebody smarter than me coined a better phrase.

But the whole point was to minimize the amount of time, repetitive time, getting the same information over and over again, and getting to the point where you can make those security decisions. Because ultimately that's why we're here. We're here to determine if something is bad or not. And as much information as we can bring to bear for a human being to essentially-- eventually and essentially make that decision, the better we can be at our job, the better we can be at protecting this massive scale that we're up against.

TIMOTHY: So it's really about focusing on enabling the people to do the thing that people are uniquely good at. People are, in my experience, actually pretty bad at searching for info and remembering info and writing down info. But we're pretty good, compared to the alternatives, at making decisions.

TIM NGUYEN: I think people are uniquely good at nuance, at using judgment, at navigating ambiguous or incomplete pieces of information. But where we can augment that to the point where we can have that final layer of judgment, we have to be able to augment most ideally all of that context, all of that enrichment. Otherwise, we're going to be spending 95% of our time trying to build that context versus making that decision.

And as I said, against this scope that we have, that would have made our mission impossible. It's just literally impossible to do our mission without automation.

ANTON: That actually does make sense to me. But it's interesting how, when you present the manual way of doing things-- like you mentioned the concept of toil-- I've mentioned it-- I've described toil to somebody from a more normal SOC. And right before I said that toil is seen as bad by SREs, the person said, but, wait, that's all I do. At which point, as they say-- and then he got enlightened.

Because ultimately, every piece of work that doesn't make the system better over time, that scales linearly with workload-- like every property of toil from an SRE book looked like a normal day in many SOCs. Which to me is a little scary. So in this case, how do we squeeze the toil out of our D&R practice? Of course, you say with automation, but more specifically.

TIM NGUYEN: Yeah, absolutely. And it's funny that you mentioned SRE, because I come from SRE. I've been at Google almost 18 years.

TIMOTHY: No kidding?

TIM NGUYEN: Prior to helping start the Detection Team-- which I've been a part of for about 12 years now-- 12, 13 years now-- I spent a number of years in SRE. The manager that I reported to first on this team also came from SRE. So we've brought over a lot of those concepts.

Toil has a pretty particular meaning within the SRE world. It's a task that's manual, repetitive, automatable, tactical, that has little to no enduring value. And it's that last part that you really want to squeeze out-- is that little to no enduring value. It's not to say that all manual tasks are toil. Manual tasks that lead up to something automatable or in the vein of discovery or in the vein of cleanup-- that kind of thing, yeah, I mean, that has to be done.

But when you find yourself repetitively doing something over and over again, that should tickle something in the back of your mind that says, hey, we should be able to automate this. And that's the process that our team goes through with every single new detection, with every single new source of work. I like to joke that the half-life for our team wanting to automate something is 30 minutes.

[LAUGHTER]

For example, a common request for us would be, hey, here's an IOC. Does this thing exist anywhere at Google? So the question we're always answering is, are we--

ANTON: By the way, a useful reminder to the audience-- anywhere here means really anywhere. Not like a decent attempt at 30% of anywhere. Everywhere we know. And like, in this case, as far as I remember you telling me this, which blew my mind, was like everywhere really means everywhere.

TIMOTHY: Everywhere, yeah.

ANTON: It can't be 97% of everywhere. It's everywhere.

TIM NGUYEN: Yeah.

ANTON: Which is really tricky.

TIM NGUYEN: Every corporate workstation, with all the flavors of operating systems, every production server, every cloud VM, every cloud compute that underlies a cloud product-- everywhere means everywhere.

So again, against that scale, when you have various types of IOCs-- IPs, domains, et cetera-- hashes, et cetera, there's no such thing as manual. And when you do it one time, we're like, oh my god, this is the very definition of toil. I am going to build a framework to be able to automate this. I am going to build-- I'm going to reuse common frameworks within Google to be able to feed in inputs and to query at scale large amounts of systems.

I know we've talked about this in previous podcasts, using BigQuery, using other systems that we have, like Dremel, that kind of thing. We're able to reuse lots and lots of enterprise tooling that SRE have built to process big data for us to answer security-based questions.
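Illustrative sketch (not from the episode): a toy version of the "does this IOC exist anywhere?" sweep. At Google this runs as queries over large log stores; here the `sources` dictionary of in-memory records, the field names, and the `sweep` function are all hypothetical. The one idea worth showing is that the sweep reports which sources it searched, so "everywhere" can be checked rather than assumed.

```python
from collections import defaultdict


def sweep(iocs, sources):
    """Match IOCs (IPs, domains, hashes) against every record in every source.

    Returns (hits, searched) where hits maps each matched IOC to a list of
    (source_name, record) pairs and searched lists every source covered.
    """
    wanted = {ioc.lower() for ioc in iocs}
    hits = defaultdict(list)
    for name, records in sources.items():
        for record in records:
            for value in record.values():
                if isinstance(value, str) and value.lower() in wanted:
                    hits[value.lower()].append((name, record))
    return dict(hits), sorted(sources)


if __name__ == "__main__":
    # Hypothetical log sources; 203.0.113.7 is a documentation-range IP.
    sources = {
        "corp_workstations": [{"host": "ws-1", "dst_ip": "203.0.113.7"}],
        "prod_servers": [{"host": "p-1", "sha256": "ABC123"}],
        "cloud_vms": [],
    }
    hits, searched = sweep(["203.0.113.7", "abc123"], sources)
    print(hits)
    print("sources searched:", searched)
```

Writing this once as a reusable function, instead of repeating the lookup by hand per request, is the toil reduction being described.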

TIMOTHY: When I first sat down with the SST Team early in my time with Google, I realized that it was a mountain of Sawmill and Dremel, and everything was on top of that. And I was like, oh my god. At the end of the day, it's so horrifyingly complex and yet so simple.

One of the things that's so interesting here is you're describing a role that sounds like a combination of analyst and engineer-- just like SRE. And I can see how that engineering skill leads into increasing automation, reducing toil. How is it to get that into practice? Surely, it's hard to hire security engineers that pass the SWE hiring bar here.

TIM NGUYEN: Right. Absolutely. And so just to start off with a concrete statement-- I know this is going to be a little bit controversial. I've heard other viewpoints stated on this very show. All of our security engineers need to know how to code, period. We interview. We have a coding portion of an interview. It's language agnostic. It could be Python. It could be Golang, et cetera. But they need to understand how to read and write code.

And the reason for that is exactly what we've been talking about-- automation. In order to succeed in the role, a security engineer will need to partake in building automation. Not just automation of systems but also automated analysis, which we have not yet talked about.

A large piece of our everyday workflow involves encoding analytical steps, analytical logic, in code, so that your work can be triaged and can be absorbed by others on the team. So the coding skill is an everyday part of what will make you a successful engineer on this team. So it's hard. It's hard. But I mean, the role itself is really an encapsulation of probably two or three or four different roles, if you look at another company.

So in one person, what you're talking about is one security engineer doing threat modeling, logs acquisition, data modeling, signals development, analysis automation, and then also the analysis triage, across different--

ANTON: Up to IR.

TIM NGUYEN: Up to IR as well.

ANTON: Up to IR.

TIM NGUYEN: Yeah. And some parts, logs analysis, IOC checks, signal development, for active investigations as well. So I think it's a lot of fun, to be honest, to do all those things.

TIMOTHY: How do they possibly have time to do any of that if they're doing all those things?

TIM NGUYEN: So again, we either consciously or unconsciously base a lot of our operating principles on the SRE model. So SRE has published a 50-50 model for operations and engineering. We have something very, very similar. We have a 40-60, 40% engineering-- well, 40-40-20-- 40% operations, 40% engineering, and then 20% overhead slash kind of 20% project time.

TIMOTHY: Got it. That makes a lot of sense. So that last 20%, that's like dealing with me when I'm trying to get Cloud TF to do something with me?

TIM NGUYEN: It's strategic leadership. I'll put quotes there.

TIMOTHY: Yeah, OK.

TIM NGUYEN: Yeah, how about that? [LAUGHS]

TIMOTHY: Got it. So can we talk about automated analysis? Because that sounds really interesting.

TIM NGUYEN: Yeah. Automated analysis is really trying to encode the human brain and how a human would walk through a certain event. So I'll take a very well known analysis approach. Let's say you have a malware event. So a malware alert pops up-- Gen 32 malware-- whatever the code is-- we've seen this a million times. And I know Anton has probably seen this a million times.

Now, what do you do when you get one of those alerts, be it from SEP, what have you, CrowdStrike, what have you? Well, first thing you do is you would pull down the process list. You would pull down VITs, and then you would maybe pull down the VT, or look this up in VT. You would then look at the version of the machine. Is it Windows 11? Is it Windows 10? That kind of thing. You would maybe look at the user function. Are they a finance user? Do they have admin rights on the machine? That kind of thing.

So when we talk about automation analysis or investigative analysis, that's what we're talking about, is automating all those steps. And then, also, on top of that, assigning a risk score or a risk indicator to the results of that. So there's a logic engine built in there as well, too.

So for example, if you have MalGen-- whatever-- but it didn't execute, the user wasn't admin on the machine. Our software control blocked it. And it's just quarantined somewhere. Well, actually, nothing happened. The user may have been duped to accessing a site or clicking on a binary. But hey, all of our defenses worked. We're good to go.

And if we're given all of that, a human being can triage that in about 30 seconds, declare an all clear, and then off we go. However, if any of those conditions were to be flipped, then you follow a different path. You would dig deeper. Maybe you would pull the binary. Maybe you would run an analysis on the binary. Or maybe you do a deeper look at any point in time.

So really, when you talk about automated investigation, you're talking about bringing all those steps that a human would ordinarily do and replicating that by machines so that humans can follow that and not have to replicate that themselves.
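Illustrative sketch (not from the episode): the malware walkthrough above maps onto a small rule-based scorer. The four conditions come from Tim's example (executed, admin rights, blocked by software control, quarantined); the `MalwareEvent` type, the one-point-per-condition scoring, and the verdict strings are assumptions, not a description of Google's logic engine.

```python
from dataclasses import dataclass


@dataclass
class MalwareEvent:
    executed: bool
    user_is_admin: bool
    blocked_by_control: bool
    quarantined: bool


def triage(event: MalwareEvent):
    """Return (score, verdict, reasons); each flipped condition adds one point."""
    reasons = []
    if event.executed:
        reasons.append("binary executed")
    if event.user_is_admin:
        reasons.append("user is admin on the machine")
    if not event.blocked_by_control:
        reasons.append("software control did not block it")
    if not event.quarantined:
        reasons.append("file was not quarantined")
    score = len(reasons)
    if score == 0:
        verdict = "defenses worked: quick human confirmation"
    else:
        verdict = "dig deeper: pull and analyze the binary"
    return score, verdict, reasons


if __name__ == "__main__":
    benign = MalwareEvent(executed=False, user_is_admin=False,
                          blocked_by_control=True, quarantined=True)
    print(triage(benign))
```

The human still makes the final call; the scorer only puts the reasons in front of them so the 30-second triage is possible.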

ANTON: So isn't that what a smaller company would use a SOAR for, like write a playbook that kind of learns from the experience of a human analyst, and then encodes it in a SOAR playbook?

TIM NGUYEN: Yeah. Except the SOAR playbook would still have to have the human walk through and replicate those steps. What we've done is we've encoded those steps and already presented you with that information right in front of you.

Now, I know that there are efforts to do that in the industry. I think there's a long way to go for tooling in the industry to catch up with the necessary kind of information to be presented to a human. And far too much still relies on us-- on a human being-- gathering bits of information and populating an event with it.

ANTON: And as far as I recall, your percentages for where the automation runs upon the event are really high. Well, if I deploy a SOAR in my environment somewhere smaller, and I have five playbooks, and they cover 5% of the events-- and your playbook occurrence covers, as far as I remember, high 90s.

TIM NGUYEN: That's correct. That's correct, the high 90s. It varies, but 95%, 97% of our events are fed through automation, are generated through automated hunts-- we call them hunts-- and then the remaining ones-- we do still manually hunt. We still do go in and manually gather logs, eyeball them. Our security engineers have their own hunches, have their own mental models that they bring into pools of data. And many of them have their own cadence of pulling their own log sources and going through things manually.

So there's always going to be room-- like I mentioned before-- we talk about toil and whatnot-- but there's always going to be room for that manual human expertise. And as long as those tasks are adding value, as long as those tasks are building towards something larger, there's always room for those tasks.

ANTON: That actually makes sense to me. And I think that if the percentage of events that has to be handed to humans in an uncooked form goes up, then of course your scaling breaks, and the whole thing unravels. You have to hire more people and do other things. So it's pretty clear why this percentage just has to be very high for the model to work.

TIM NGUYEN: And because I'm a director, I have to talk about costs a little bit. I think one of the--

TIMOTHY: Hmm, you do?

TIM NGUYEN: [LAUGHS] One of the real wins of automation has been our ability to drive down the cost of looking at events. But as you can imagine, with an investment in being able to automatically replicate pieces of logic, automatically replicate and gather and present information to humans, we can have humans act much more quickly on a ticket, thereby processing many, many more tickets, alongside all of the tickets that automation automatically closes. And we've been able to drastically drive down-- both drive down the cost per ticket per event, but also drive up the number of events that we as an organization can process over the course of a quarter, year, what have you.

ANTON: I mean, I've seen this chart, which is admittedly we know is not for public consumption, but we may make a genericized visual that is inspired by the chart. But when I saw it first, Tim's chart, it kind of blew my mind. Because it explained to me why we can handle the scale we handle. Because the cost per event really does go down, even though the number of events is actually going up. So like that does call for some Google magic, for sure.

So I'd like to go back to other topics that I want to touch-- is, of course, the other amazing angle of SREs that I was finding useful is, of course, the focus on the metrics and what they call SLOs and SLIs. Funnily enough, people coming from outside assumed that a D&R team would be obsessed about SLAs. But we are not. We are more obsessed about the SLOs and SLIs. So can you explain how all this works in the detection and response team?

TIM NGUYEN: Yeah, yeah. I know.

ANTON: I didn't say SOC. Did I say SOC?

TIM NGUYEN: No. You said-- you said--

TIMOTHY: Don't say SOC.

TIM NGUYEN: No, you said a detection and engineering team, so kudos to you, Anton. Thank you.

Yeah. I mean, I think, without being glib, for us, the SLA is you're hacked. So SLA talks about the penalties. And for us, if we're wrong, we're hacked. So we understand that there's a clarity of mission there. So you're right-- we don't necessarily focus on the SLAs.

Our SLO is our promise to the business. So our promise to the business is that we will triage an event in a very, very short amount of time. Specifically, we aim to detect and respond to threats within an hour of it occurring. Aspirationally, we want five minutes. Aspirationally, we want as close to zero as possible.

So one, that SLO really drives a lot of our strategy. We talked about automation quite a bit. It drives our investment in people who can code, people who can build this automation. But also it drives our measurements. So the SLI are measurements of the promise of the SLO. And so when we look at how we're going to measure our SLO, what we really hone in on is the concept of dwell time. How long has the adversary been here before we saw them? And like I said, we try to drive that down to zero-- to as close to zero as possible.

And this includes all sorts of threats now, obviously, the introduction of malware, unwanted software, kind of initial endpoint compromise. It's also physical, logical attacks. I mean, physical compromises of machines in our data centers is also a threat model. Access to user data is also a threat model. Exfil, that kind of thing.

So all of those measurements come into play around this concept of dwell time. But also when we scrutinize this data, we want to be realistic with it. And so when we measure this, we measure both the median, or the 50th percentile, and then the 95th percentile, to get a more holistic view of the more typical time, and then the wider variances at the long tail at the 95th percentile.
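Illustrative sketch (not from the episode): computing the two SLI views Tim mentions, the median and the 95th percentile of dwell time, against the one-hour SLO. The sample dwell times are made up, and the nearest-rank percentile method is just one common definition chosen here for simplicity.

```python
import math
from statistics import median


def percentile_nearest_rank(values, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    if not values:
        raise ValueError("no data")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def dwell_time_slis(dwell_minutes, slo_minutes=60):
    """Summarize dwell times: typical case (p50), long tail (p95), SLO attainment."""
    within = sum(1 for m in dwell_minutes if m <= slo_minutes)
    return {
        "p50": median(dwell_minutes),
        "p95": percentile_nearest_rank(dwell_minutes, 95),
        "within_slo": within / len(dwell_minutes),
    }


if __name__ == "__main__":
    # Hypothetical dwell times in minutes for ten events.
    sample = [2, 4, 6, 8, 10, 12, 15, 20, 45, 240]
    print(dwell_time_slis(sample))
```

In this made-up sample the median looks comfortable while the 95th percentile exposes a single four-hour outlier, which is why both numbers are tracked.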

And then what we're really honing in on are technological aspects of where we can improve. For example, pipeline latency. How long does it take for something to traverse our pipeline? Are we single-homed somewhere? Is a data center draining or going down causing a massive delay? And is that OK? No, that's not OK.

[LAUGHTER]

Spoiler alert-- it's not OK. So we have to make technological and strategic decisions in order to drive these things down-- multi-homing, making sure that we have enough compute resources, that kind of thing.

Also, we zoom in on the individual programmatic ways that we can introduce latency. So we talked a lot about automation. We talked a lot about engineers being able to code. Engineers love sleep-- not sleeping in one's bed-- putting in a sleep. Engineers love putting in a quota and other types of ways to deal with noise. We really do scrutinize that, and we put a focus on improving the signals, improving the detection itself, versus perhaps some of the shortcuts that people take when they're trying to eliminate toil, eliminate noise, that kind of thing.

TIMOTHY: So this is all really interesting around latency and end-to-end timing, but that's only half of the coin. The other half is whether it's good or not. And so we could push and push and push all day on how fast is it, but the fastest thing will be to write it to dev null. So how do we know good when we see it versus just end-to-end quick when we see it?

ANTON: Admittedly, before-- this was my favorite question to ask, because I've seen too many people-- especially at the kind of SOCs for rent-- namely, MSS and MDRs-- who are kind of pushing their teams so hard to reduce this time to respond, and they end up with like swearing they can do five minutes, but then they have an incident which takes them seven days. So how is this-- I know you track the time to respond, and you do have the Sigma range-- but still, how do you incorporate good and not just fast?

TIM NGUYEN: Yeah, that's a really good question. I think good is an element of really manual review and scrutiny. So we have a weekly review of all cases. And so what we're looking at there is both the latency and all of the machine-generated stuff that happens, but also is the detection itself a good detection? Is the analysis a good analysis? How did our partner teams respond to that?

We partner well within the entire detection and response org to this event. Were we able to take this to remediation teams and derive the right solutions there? So whether something is good or not I think is an element of an overall review on top of how fast it is. So Tim, I absolutely agree with you there.

Anton, to your point about time to respond-- what we explicitly measure is time to triage. That's how we say, this is how much time it took for this IOC, this issue, to traverse the pipeline and get to a human being. We explicitly recognize that, once a human being touches on it, it could get very, very complicated. I touched on how vast Google's environment is. Something that happens within the cloud IAM stack is very complicated compared to, for example, hey, a malware alert or a flash helper alert.

And so we want to make sure that folks are incentivized, but make sure that it's OK for them to take the right amount of time and investigate fully. And so of course, all that is to say as well that, if something languishes for two days, that's not OK. I don't want to give that impression. And so that goes back to our operational muscle, that we should be reviewing things, we should have crisp, clear hand-offs and all of that. But overall, where we place our measurement onus is on that initial set of eyeballs and that initial triage versus perhaps some kind of sense of, quote unquote, "response."
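Illustrative sketch (not from the episode): separating the metric that is measured (time to triage, from event to first human touch) from the operational check that nothing languishes afterwards. The function names and timestamps are hypothetical; the two-day limit echoes Tim's remark and is not a stated Google threshold.

```python
from datetime import datetime, timedelta


def time_to_triage(event_time: datetime, first_human_touch: datetime) -> timedelta:
    """Pipeline latency: how long the event took to reach a human."""
    return first_human_touch - event_time


def is_languishing(first_human_touch: datetime, closed_at, now: datetime,
                   limit: timedelta = timedelta(days=2)) -> bool:
    """Flag investigations that stay open past the limit after triage."""
    end = closed_at if closed_at is not None else now
    return end - first_human_touch > limit


if __name__ == "__main__":
    event = datetime(2022, 7, 18, 9, 0)
    touched = datetime(2022, 7, 18, 9, 12)
    print("time to triage:", time_to_triage(event, touched))
    print("languishing:", is_languishing(touched, None, now=touched + timedelta(days=3)))
```

Keeping the two checks separate lets the SLO reward a fast first look without penalizing a careful, longer investigation.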

ANTON: That actually does make sense to me. Perfect. Thank you very much for explaining this. Because I think this tip alone is quite useful to a whole bunch of teams, especially those falling into this trap of, like, faster, faster, faster.

We are close to time, so I wanted to ask you our famous closing question, that frankly Tim usually asks much better. So any recommended reading? I know that a SRE book would be in the resources for sure, because these two things are kind of central, the toil chapter or the SLO chapter. Any other recommended reading? And of course, one tip on how to make your-- wait a second-- would you say SOC? No-- how to make a detection response team better.

TIM NGUYEN: Yeah, I think the recommended reading-- and no surprises here-- the SRE book and then the secure systems book I think are two regular recommendations from folks like me. And so I'd like to broadly plus-one that.

For my recommendation, I think, one, is to convince yourselves and convince the business of the value of automation and the value of engineering in order to scale out the capabilities of your detection and response organization. And then, two, hire the right people that have the right skills in order to execute your mission. Easier said than done, I know. But I think we've really demonstrated over the past 12, 13 years the immense scaling that can happen when you invest in engineering, when you invest in automation.

TIMOTHY: On that last one, of hiring the people with engineering skills, I've had more than one user complain to me-- a GCP customer complain to me-- that you've hired the people they wanted to hire for those roles. So just slow down is I guess what the users would say.

ANTON: It's a friendly competition.

[LAUGHTER]

TIMOTHY: Well, Tim, thank you so much for joining us today. I really enjoyed this chat.

TIM NGUYEN: It's been a pleasure. Thank you so much for having me.

ANTON: And now we are at time. Thank you very much for listening and of course for subscribing. You can find this podcast at Google Podcasts, Apple Podcasts, Spotify, or wherever else you get your podcasts.

Also, you can find us at our website, Cloud.WithGoogle.com/cloudsecurity/podcast. Please subscribe, so that you don't miss episodes. You can follow us on Twitter-- twitter.com/cloudsecpodcast.

Your hosts are also on Twitter-- @Anton_Chuvakin, and @_TimPeacock. Tweet at us, email us, argue with us, and if you like or hate what you hear, we can invite you to the next episode. See you on the next "Cloud Security Podcast" episode.

[MUSIC PLAYING]
