February 11, 2021

Automate and/or Die?


Joe Crawford, formerly in charge of cloud-native security at a large bank



Chuvakin: Hi there, welcome to the Cloud Security Podcast. Thanks for joining us today. Your hosts here are Tim Peacock, the product manager for threat detection here at Cloud, and [inaudible], members of Cloud's security team. You can find this podcast wherever you get your podcasts, and on our website when we finally launch it. You can also follow us on Twitter. Today we have a new guest and a new topic. And the topic will involve remediation and automating some of the tasks in the cloud. So our guest today is Joe. Hey, Joe, and welcome to the podcast. Could you introduce yourself and tell us a little about your previous role at Citi as well as your current job?

Joe: Sure. Thank you, guys, for having me. In my previous role at Citi, I came in as the vice president of cloud-native security engineering, focused on Google Cloud. They were doing a Google Cloud adoption, and my job was to use cloud-native tools as much as possible to build security guardrails to protect the workloads that would be going into their cloud. After my boss left, I ascended to senior vice president as well as the head of the CISO's special projects. So I did a little bit of work, still mainly focused on Google Cloud but also with AWS. Now I work for one of the larger cybersecurity companies in the world, still focused on Google Cloud.
Peacock: So when you were at Citi, you focused on Google Cloud. Is this like Conway's law in action, and was there an upside to that? Was this good Conway's law?

Joe: In some ways, yeah. I think that it was Conway's law almost in a global way, because I think there are a lot of large enterprises thinking the same way about cloud and figuring things out, trying to close the security gaps, close the gaps in technology between traditional and cloud. And I think that cloud kind of just puts itself out there as one of those tools that allow us to build and customize.

Peacock: You said you were focused on native tooling as much as possible. What did that end up looking like? What did that mean for you?

Joe: There were really three areas of security guardrails that we built. So there was the preventative, obviously, starting with your pipeline and moving forward. Then there was the alerting: whatever SIEM tool you're using and threat hunting tool you're using, integrating into that. And then there was the remedial aspect of it, in which we used cloud functions. Serverless was the greatest invention to hit cloud as far as security and being able to customize your tools.

Peacock: And by remedial, I don't imagine you mean, like, the kind of classes I had to take as a child. I imagine you're actually talking about the topic of our podcast tonight, which is threat remediation or vulnerability remediation. So we see a lot of people talking about threats versus vulnerabilities in the cloud, and even within my product group, we sometimes mix them up. Do you think that a crisp definition was important to your role at Citi, or did you guys treat those as the same thing that you had to respond to automatically?
Joe: I think they were very much the same. But I think cloud certainly blurs the lines a little between the two. A crisp definition still wins out in my opinion because there's still some level of human interaction or human-made software that has to play a role. But when you think of cloud, you get into the idea that every resource is now an identity, so your VM is accessing cloud storage as the service account that it's attached to. So it definitely blurs the lines, and you have to treat it very differently and make sure you're following the best practices. And I think, again, cloud is one of those areas where the lines are blurred, but it's still traditional enough that your dyed-in-the-wool technician is gonna understand it.
Chuvakin: Yeah, but here's the one that usually tweaks me, because I am that person who would fight to the death over the definitions and over people not confusing threats and vulnerabilities. Because if you forgot to patch and you have a weakness that somebody can exploit, versus somebody just hacked you, picked the password, did something, to me, the chasm is kind of huge. And I am really quite allergic to people who confuse them. However, when we started looking at this specific to the cloud, I kind of adjusted some of the thought flows, and some of the thinking and some of the processes are in fact similar. So I had to reset my brain a little bit. Like, before that, I would be reaching for my baseball bat to tell people, "Why are you confusing a buffer overflow with an attack? These are from a different domain." But I see that when people remediate an unpatched system versus when they remediate a system where they have evidence of, say, password compromise, they sort of flow through similar steps. And I started questioning my, kind of, religious fervor in this regard. Is that what's going on, or is there more nuance here?
Joe: Yeah, no, I think you've hit the nail on the head. I think that the threats have changed, just taking into consideration what we used to build these tools, so serverless, right? It's a whole new technology that a lot of people don't understand. So you have these functions and you put the code out there one day, but if you're not monitoring that code, if you don't know that it's been updated or changed and you don't have control over those IAM permissions, that becomes a threat and a vulnerability almost at the same time, right? Because someone can add a line of code, start data exfiltration, and you have a serious problem on your hands. So I think you just have to be aware that identity is totally different in the cloud. And your identity and access management is where all of your security has to start.
Chuvakin: So then a weak password is still a vulnerability, but, like, you're one step away from it being exploited in a very bad way. That's roughly what may be going on. Because to me, I like to highlight the fact that identity is the main boundary in the cloud. You may not have layers of firewalls, and that's gonna be a topic of a future podcast, I'm sure, where we're gonna kind of wax poetic over how identity is really your main security barrier in the cloud. It's not physical. It's not buckets of firewalls, apologies to firewall vendors. It's mostly the identity that really does separate the good from the bad.
Peacock: And apologies to the poor firewalls that were being kept in buckets. That's awful.
Chuvakin: Well, let them loose then.
Peacock: Yes, let them loose, and let them live their lives. They've served us well.
Chuvakin: So we've touched a little bit on this. I sort of mentioned processes and you guys mentioned automation. So the real elephant in the room is, can we automate remediation of vulnerabilities? And can we automate dealing with some of the threats in the cloud? So how do we go about that? So this is kind of my, you know, I guess, introduction to the main feature. Can we hope for automation or not?
Joe: I think so. I would say that in the cloud, if you are not open to automation, you're probably in the wrong industry, because it's given us the agility that we've been promised that technology was going to give us. So your products are changing every day. Your APIs are changing. The way that they're integrated is changing, and there's a lot of back and forth there. And I think that one of the things that has to be stressed here is, from a remedial standpoint, you have to really test things. You have to make sure, before anything hits production, that it's doing only what you want it to do. So there's a lot of back and forth between the logs and "oops, I didn't wanna destroy that, but I need to layer this a little bit better." But yeah, I think that there are very few things in the cloud that I would say you can't automate if you have competent engineers on your staff, and that's the hurdle to clear, right? I think that there's a limited number of security engineers and certainly an even more limited number of cloud security engineers.
Peacock: We certainly talk about the challenge of hiring, and one of the things we talk a lot about on my team is, ironically, when it comes to guardrails, how do we in our system for enforcing guardrails put up the right guardrails to help security teams scale their human operations with less effort. One of the things I run into a lot as a challenge, though, is when I talk to users about automated remediation. To me, it feels like there's a really big gap between where we are with automated response and where we could be with automated remediation. Lemme give you an example to tease that difference apart. If I have a basement (see, I grew up on the East Coast, so this is something I worry about; I don't worry about it in California) and water gets into it, it's really easy to have a sump pump with a float that automatically kicks on and pumps the water out. Now, that's an automated response. But that didn't deal with the now moldy drywall or the crinkly pictures of grandma. To deal with that, I call up Servpro. They show up with a fancy fan, and they fix my house. That's remediation. And so when you were building these things previously as the elevated senior VP, were you building response, or were you building remediation, and did you have a path from one to the other?
Joe: Yeah, I think that response and remediation kind of became amalgamated in some ways, because we were really solving for the fact that we had a lack of internal knowledge amongst ops teams, amongst the SOC. I mean, we were having to solve problems in engineering that ordinarily would have been the problem of some other team, right? So when we were responding to things, it was important that we did kick on the sump pump, but also got all the artifacts that were necessary to build out reports so that we could respond properly. You know, you don't wanna just kill someone's resource and then not have a conversation about why it was or when it was. So I think that automation end to end is really the key. And making sure that you are keeping the artifacts, keeping the documentation. It's just become so natural for us to automate everything at this point.
Peacock: So you didn't have a human in the loop most of the time, and it really was truly automatic?
Joe: Truly automatic. As far as I'm concerned, you can get away with having a human in the loop and still call it automation. But end-to-end automation is really the key. The fewer hands that touch things, the fewer mistakes you make. The less likely you are to have human engineering and influence. So it just makes for a much more secure environment.
Chuvakin: But then you had to trust the automation. You had to be sure that you remediated the issue, right? Like, as Tim pointed out with this whole distinction between remediation and response, automation can act, but how do we know that the problem is solved at the end? Wouldn't a human have to check? I'm trying to be a bit of a devil's advocate, but, like, how do we get to trust that the problem is truly solved by the machinery with no humans?
Joe: There is a kind of loop that you build with automation where you do the remediation and then you output the logs to report that the remediation was carried out, and then you have another loop that goes back and checks your automation to make sure that it is doing what it's supposed to do. And you really should. I mean, one part of the human element that we really can't get rid of is the auditing aspect of things. You need those auditors to come in and check your code, and you want checks and balances in the system. It's important. Automation will take us far but will only take us so far.
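The two loops Joe describes (remediate and log, then go back and check the automation) can be sketched roughly as follows. This is a minimal in-memory Python illustration under stated assumptions, not the tooling discussed in the episode: `Resource`, `AuditLog`, `make_private`, and `run_loop` are hypothetical names, and a simple `public` flag stands in for whatever finding the real function would correct.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Resource:
    name: str
    public: bool  # hypothetical finding: True means the resource violates policy


@dataclass
class AuditLog:
    entries: List[Dict[str, str]] = field(default_factory=list)

    def record(self, action: str, resource: str, outcome: str) -> None:
        # Keep the artifacts so a human auditor can review what happened.
        self.entries.append(
            {"action": action, "resource": resource, "outcome": outcome}
        )


def make_private(resource: Resource) -> None:
    """Hypothetical fix: clear the offending flag."""
    resource.public = False


def run_loop(
    resources: List[Resource],
    fix: Callable[[Resource], None],
    log: AuditLog,
) -> List[str]:
    """Remediate each finding, log it, then verify the fix independently.

    Returns the names of resources whose fix did not hold, so they can be
    escalated to a person instead of being trusted blindly.
    """
    needs_human: List[str] = []
    for resource in resources:
        if not resource.public:
            continue  # nothing to fix
        # First loop: act and report that the remediation was carried out.
        fix(resource)
        log.record("remediate", resource.name, "applied")
        # Second loop: re-check the resource rather than trusting the fix.
        still_broken = resource.public
        log.record("verify", resource.name, "failed" if still_broken else "passed")
        if still_broken:
            needs_human.append(resource.name)
    return needs_human
```

Passing the fix in as a parameter is a deliberate choice in this sketch: the verification step does not depend on the fix being correct, so a broken or tampered remediation shows up as a `failed` log entry and an escalation.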
Peacock: That's truly interesting that automation ultimately rests in an organizational context of, we think we can solve this problem, did we actually solve it?
Joe: One of the interesting things is, as a security engineer, you just have to make sure that, especially in leadership, you have these metrics that you have to be able to provide. So reporting up the chain, they may not care. But they wanna know that you're doing something. They wanna make sure that the tools you're building are doing what they say they're doing. So you have to perform with some sort of metric and take them reports about, you know, here is what we stopped this last quarter, here is what we know happened. Do they care? Probably not. They probably care about money and PR, right? They don't wanna end up on the nightly news. But you give them confidence through your tooling that they're not going to end up on the nightly news.
Peacock: That's one of those cases where metrics are interesting for both managing up as well as managing down. What were you tracking with your team when it came to this? Were you tracking how many use cases you had, how often they fired? How did you manage this?
Joe: I was tracking how often they fired, how often the alerts that we built fired, because we layered things. We always assumed that one layer was gonna fail. So if you built a preventative measure, plan for the fact that someone could locally extract anything in their Terraform, right, and make a change. So be ready for that with alerts. And then when the alerts fail, you go to remedial. And so we tracked how many alerts we had versus how many actual remedial events we had. We tracked the type of infractions that we were correcting. So if someone built a VM or a workload, or a container workload, in a region that wasn't allowed, we would track that, so that we knew the kind of threats that we were dealing with. But more often than not we were just providing insights into, here is the CIS benchmark, here is what we stopped. We are in compliance at all times. That was really what they wanted to see.
Chuvakin: Some of this sounds like it's a policy issue rather than even a weakness or a vulnerability or a threat. Like, if you launch a VM in the wrong region, eventually it may end up with privacy issues, whatever. But ultimately, it's neither a vulnerability nor a threat, right? It's a policy issue. Am I being too subtle, or is that the story?
Joe: Yeah, you're 100 percent right, and making sure that everybody in a massive organization understands what the policy is, that's a challenge. So a lot of it was just learning on the fly in an organization and making sure that people had the log output to understand why their machine suddenly disappeared, why their storage buckets suddenly disappeared. That was a fun time.
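The tracking Joe describes, alerts versus actual remedial events broken down by infraction type, might look something like the sketch below. It is an illustration only: the `Event` record, the `ALLOWED_REGIONS` set, and the infraction label are assumptions made for this example and do not come from the episode.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Optional

# Hypothetical policy: regions where workloads are allowed to run.
ALLOWED_REGIONS = {"us-east1", "us-central1"}


@dataclass(frozen=True)
class Event:
    layer: str       # which guardrail layer fired: "alert" or "remediate"
    infraction: str  # e.g. "disallowed_region"


def check_region(region: str) -> Optional[str]:
    """Return an infraction label if the region violates policy, else None."""
    return None if region in ALLOWED_REGIONS else "disallowed_region"


def summarize(events: Iterable[Event]) -> Dict[str, object]:
    """Count alerts vs. remediations and break remediations down by type."""
    events = list(events)
    by_layer = Counter(e.layer for e in events)
    by_type = Counter(e.infraction for e in events if e.layer == "remediate")
    alerts = by_layer["alert"]
    remediations = by_layer["remediate"]
    return {
        "alerts": alerts,
        "remediations": remediations,
        # Share of alerts that ended in an automated fix; 0.0 if no alerts.
        "remediations_per_alert": remediations / alerts if alerts else 0.0,
        "remediated_by_type": dict(by_type),
    }
```

A report built from `summarize` gives leadership the "here is what we stopped" view, while the per-type breakdown tells the engineering team which infractions keep slipping past the preventative layer.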
Peacock: Did you get particularly memorable emails from people whose instances suddenly disappeared?
Joe: Oh, absolutely, and there are always the flops, you know, where something goes a little further than it should've and gets rid of stuff that they needed. You know, testing is important, and so is making sure that you're giving people the information so that they can build the right resources.
Chuvakin: How about we drag the elephant in the room into the light and say some of the stuff you're describing kind of points out that your team and the teams around it didn't have this notorious security fear of automation. Like, you mentioned some things may have gone too far, or whatever. But we always hear stories from the, you know, '90s and 2000s when somebody blocked somebody else's laptop and it was the CEO doing a presentation, so the whole project was shut down, like the whole domain of security was terminated. We hear those stories, but apparently, in the cloud, things aren't that bad, or maybe they even aren't that way. So what about the case where things went wrong? How come automation survived those occurrences? Like, what's different in the cloud?
Joe: I think that one of the biggest things is that recovery can be so rapid in the cloud, where before, there was such a massive process to correct any of the issues that you created. The worst you run into in cloud is that you're gonna have to redeploy an entire new project. And what is that gonna take you from the perspective of infrastructure as code? So I think there's less fear because of that. One of the things that people say about the cloud is that the landscape is huge and it's getting kind of harder and harder to see everything. But at the end of the day, you're actually able to visualize your isolation, your VPC and project boundaries, so much more easily in the cloud than you are in everyday life. So you can target specific instances. You can take things to a resource level for access and make sure that they are meeting the requirements needed for special workgroups. The upside is that you get more benefits than you get negatives.
Chuvakin: Sounds like you can also use automation to fix automation mistakes, right? I've heard that, or maybe I was reading between the lines?
Joe: Absolutely, you are 100 percent right. So you have to build a loop. And one of those loops is actually making sure that the functions that you're using to secure your account are always available. So if something gets deleted, having an automated redeployment of that security function in the back end is really important.
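The redeployment loop Joe mentions is essentially a reconciliation step: compare what should be deployed with what is deployed, and put back anything missing. A minimal Python sketch follows, assuming the deployed set and the redeploy action are supplied by the caller; `reconcile` and the function names in the usage example are hypothetical, and no real cloud API is called here.

```python
from typing import Callable, List, Set


def reconcile(
    deployed: Set[str],
    required: Set[str],
    redeploy: Callable[[str], None],
) -> List[str]:
    """Redeploy every required security function that is missing.

    `deployed` is the current set of function names, `required` is the set
    that must always exist, and `redeploy` performs the actual deployment.
    Returns the names that had to be redeployed, in sorted order.
    """
    missing = sorted(required - deployed)
    for name in missing:
        redeploy(name)
    return missing


if __name__ == "__main__":
    # Hypothetical names; in practice the deployed set would come from
    # listing the functions in the project on a schedule.
    required = {"enforce-region-policy", "lock-public-buckets"}
    deployed = {"enforce-region-policy"}
    restored = reconcile(deployed, required, deployed.add)
    print("redeployed:", restored)
```

Running the same step on a schedule makes it idempotent in practice: a second pass with nothing missing redeploys nothing and returns an empty list.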
Chuvakin: I like the thinking with loops. One of my favorite, like, thinkers is John Boyd, who came up with the OODA loop. So I love where you're going with this. One question I have gets back to people, though. It sounds like you had good people. And I actually know you had good people 'cause I met some of your team. What advice would you have for somebody who wanted to get started with this when they had, you know, fewer people or fewer people that they trust as much as you trusted your team?
Joe: Check and double-check and double-check, obviously. But I think investing in the training and the knowledge is something that a lot of organizations are probably avoiding right now. I think the obvious path is to "speed" to production. And really what you lose with that is being able to bring up a team that understands these services in depth. So I think training is really important. The other thing is, give people a test environment and let them go. Because the other thing that's happening within cloud is we are so scared of a minute mistake leading to a problem that we don't wanna face, when really, we should just isolate and let people get into the cloud and go, because that's the best way to learn. That's how I trained my team. I just gave them access and said, "Start playing with it." And before long, they were probably more competent than I was at the end of the day.
Chuvakin: That's really interesting that the best way to learn is just by doing. I like that.
Joe: Absolutely.
Chuvakin: Yeah, well, hey, we are just at time. So Joe, thank you so much for joining us, and everybody who's listened through to this point in the podcast, thank you so much for joining us. Again, you can find this podcast on Google Podcasts and wherever else you get your podcasts. You can follow us on Twitter. You can find Anton and myself on Twitter. Tweet at us, email us, argue with us. If we like or really hate what we hear, we plan to invite you on the podcast, and that's either a blessing or a curse depending on how, you know, how much we agree or disagree, and how brave you are. So see you all next time on the Cloud Security Podcast.
