[CHILL MUSIC] TIMOTHY: Hi there. Welcome to the "Cloud Security Podcast" by Google. Thanks for joining us today. Your hosts here are myself, Timothy Peacock, the product manager for Threat Detection here at Google Cloud, and Anton Chuvakin, a reformed analyst and esteemed member of the Cloud Security Team here at Google. You can find and subscribe to this podcast wherever you get your podcasts, as well as at our website, cloud.withgoogle.com/cloudsecurity/podcast.
If you like our content and want it delivered to you piping hot every Monday Pacific time, please do hit that Subscribe button. You can follow the show and argue with your hosts on Twitter as well, twitter.com/cloudsecpodcast. Anton, we're having an episode today in what I consider the magical technologies of Google category. Do you think that's a fair category for this one?
ANTON: I think so too, and I think we're developing a bit of a telepathy between you and me because I was just about to say, this is an episode in the Google security magic category. This is about something we have that is so amazing that maybe other people would look at it and say, this sounds like it came from the future.
TIMOTHY: Yeah. I mean, that's got to be one of my favorite parts about being a PM here is you get to pull technology out of your hat that's very futuristic. But what we're doing today has some really cool security implications when it comes to workload identity and traceability.
ANTON: Indeed. And the threat assessment, threat modeling for this is kind of amazing. Of course, we'll touch on this in the episode, but sometimes you really do have an environment where you don't get to fix the security problem, you don't get to fight the security problem, but you get to build the infrastructure in such a way that it doesn't appear at all.
TIMOTHY: That's right.
ANTON: And to me, this bit is in that bucket of not preventing it, not detecting it, but setting it up so it never appears.
TIMOTHY: Which is frankly magical. And then this episode also has some aspects that lead to not just changing that outcome, but organizational dynamics as well. So I'm really excited about this one, and maybe with that, let's welcome today's guest. Delighted to introduce today's guest, Sandra Guo, Product Manager in security here at Google Cloud.
Sandra, we've got a really interesting problem here. Imagine a world where we make all of these great investments in our use of trusted repos, great investments in code review, securing our build systems, having reproducible builds. But how do we know what all of that led to, that secure build, is actually what got deployed to production? How do we do that?
SANDRA: You can't unless you put a policy together.
TIMOTHY: What kind of policy?
SANDRA: So, to use an analogy, it's like you have all kinds of fancy locks on your front porch and front door, but unless you screw your front door onto your house, it's not going to stop anybody from going into your house. Right? So that's how we see binary authorization. It's a deploy-time policy enforcement chokepoint that we have integrated into GKE, but the concept is very general. The idea is to put a policy together to control what gets deployed in your production environment, in addition to who can deploy in your production environment.
TIMOTHY: OK, so let me make sure I understand that. We've got a technology that lets us control what gets deployed, not just who can do that deploy. So it's on top of, say, IAM controls or Kubernetes controls.
SANDRA: Exactly. Yeah, traditionally, right, the user, the dev ops admin, can push a deployment to production, and that's fine.
TIMOTHY: That's their job.
SANDRA: At the end of the day, we need people to do stuff. But in the larger organizations, this becomes a single point of failure. You have all kinds of fancy supply chain controls, and yet, at the end of the day, you rely on 1, 2, or 20 accounts to do direct manual access to your production environment. I want to put a policy on that. We want to delegate that single point of failure.
ANTON: But so when I hear about such setups, in most cases, my immediate line of thinking is, OK, I get it. It's kind of useful as a security control, but what type of threats can this thing help us handle? Are there specific organizations who care about this? Are there specific types of badness? I know people sometimes confuse threats and risks, but so let's talk about specific types of badness that this would stop. And maybe also who, what type of org would care about this.
SANDRA: Great, great, great. I live through this every day, so a lot of this is what I think goes without saying, but let me start from the beginning. When we talk about supply chain security, there are many different points of attack. Somebody could compromise your build system. Somebody could compromise your artifact repository. Somebody could compromise your vulnerability scanner and give you a fake result. And somebody can bypass--
ANTON: The last one would be really sad. The last one really kind of freaked-- you said it, compromised vulnerability scanner, and I literally shuddered. Like, oh, no.
SANDRA: Yeah. If you look at the supply chain attacks that happened in the past two years, there were, like, 12 notable ones. It was kind of crazy. And the attack points are scattered all over the lifecycle. So what we are trying to put in place is not to make your entire CI/CD supply chain invincible. There are always going to be weak spots. But what we're putting in place is to make a single point of failure in your supply chain not result in disaster.
ANTON: So blast radius question-- blast radius question versus fixed insecurity question.
TIMOTHY: Impact radius.
SANDRA: Impact radius, right.
ANTON: Yeah, OK, you're right.
SANDRA: Say you have a policy that requires every production deployment to be signed off by your QA engineers, to be scanned by your vulnerability scanner, and to be built by your very sophisticated central build system. Those are individual checks that are enforced individually and that are intrinsically linked to the workloads themselves. Right? Whether Bob deploys it or your CEO deploys it, it doesn't change the fact that a workload has or doesn't have the necessary checkmarks to go into production.
And let's say that your vuln scanner got compromised and somebody gave you a fake report saying this vulnerable piece of code is actually perfectly fine. You still have two other checkmarks. Your code still has to be built centrally in your central build pipeline and still has to be checked off by your QA release engineer, so that single compromised check doesn't directly result in disaster.
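To make the three-checkmark idea concrete, here is a minimal Python sketch. It is an illustration only, not how binary authorization is implemented: the check names, the Workload class, and the admit function are all hypothetical. It shows the two properties described above: the decision ignores who is deploying, and a single forged check is not enough to get into production.

```python
from dataclasses import dataclass, field

# Hypothetical check names, loosely matching the three checks from the episode.
REQUIRED_CHECKS = frozenset({"built-centrally", "vuln-scanned", "qa-approved"})


@dataclass(frozen=True)
class Workload:
    image_digest: str
    passed_checks: frozenset = field(default_factory=frozenset)


def missing_checks(workload, required=REQUIRED_CHECKS):
    """Return the set of required checks this workload does not carry."""
    return set(required) - set(workload.passed_checks)


def admit(workload, deployer, required=REQUIRED_CHECKS):
    """Admit only if every required check is present.

    The deployer argument is deliberately unused: the decision depends on
    what is being deployed, not on who is deploying it.
    """
    return not missing_checks(workload, required)


# An attacker who compromised only the scanner can forge one checkmark.
forged_scan_only = Workload("sha256:aaa", frozenset({"vuln-scanned"}))
fully_vetted = Workload("sha256:bbb", REQUIRED_CHECKS)

print(admit(fully_vetted, deployer="bob"))      # same answer for any deployer
print(admit(forged_scan_only, deployer="ceo"))  # still blocked
print(sorted(missing_checks(forged_scan_only)))
```

In this sketch the forged scan result alone leaves two checks missing, so the workload is rejected no matter which account pushes it.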
ANTON: But so how does it answer the original question, though? Does it mean it's for people who care about having yet another layer, or is it about people who care about having the layer that cannot be easily compromised or penetrated?
TIMOTHY: Or is it for everyone and this should be part of your foundational assumptions around deploy time security?
SANDRA: This really should be for everyone, for organizations that have more than, let's say, five developers pushing code, because human trust can only scale so far. You can trust Bob to always do the right thing, but if you have 20 Bobs in your organization, you're going to have chaos. So customers that I talk to typically have two main drivers. One is security. One is compliance.
On security, they want to be able to answer the question of, what is the code that my organization is deploying in production, and that therefore has access to my super sensitive customer credit card information? And the other driver is compliance. There are a lot of regulations that require due diligence around testing, around vulnerability scanning, around just general vetting of code that gets admitted into the production environment.
To demonstrate that kind of control becomes very complicated when your process is very manual. The CI/CD space is a very fragmented space. There are tons of build tools out there. There are tons of scanning, orchestration, testing, metadata generating-- there are all kinds of tools out there, and customers use them all.
So you may have organizations that use 20 such tools, and what are you going to do? Are you going to demonstrate that each of the teams that has access to each of the tools, that has access to each of the pipelines, behaves perfectly correctly because you set up the right access control? That becomes very unruly very quickly. So what you want to do instead is to establish a framework and say that, you know, there are three things I require my production deployments to have.
Every deployment, no matter what tools are used, no matter what team it comes from, has to have these three checkmarks. And look, I have a production policy to enforce that the checkmarks are in place, and I can show you the exceptions to this rule over the past six months and explain to you why they were exceptions.
But because I have this framework, I have this enforcement, I have this mechanism set up, I don't have to show you how many people have access to my vuln scanning metadata. I don't have to show you how many people have access to my code repository. I don't have to show you all the pull requests, right? It becomes much more straightforward. You create that chokepoint, you create that policy, and you build your entire process on top of that.
TIMOTHY: So that's fascinating and starts to have that alien technology feel I associate with the coolest stuff here at Google. How did this get inspired? Where does it come from? What's the history on this seemingly bizarre, alien capability?
ANTON: Please don't tell me there was an incident we can't talk about because many of these stories start with, once we had this incident we can't talk about. So please don't tell me it's one of those stories.
TIMOTHY: There's no plasma globes in this story, right?
SANDRA: The technology actually stems from an internal product that Google developed in-house over 10 years ago at this point. So there were probably incidents, but I wasn't here to tell the tale. So the story was Google developed Borg to run the various services that we're building to power users around the globe.
And we were scaling out services and developer teams all over the world and we quickly realized that, hey, it becomes really hard to track what is running in the production environment. We started having all of these problems that customers today are starting to have with Kubernetes and microservices.
And the company basically needed a solution to that, for security and for compliance, and we developed this technology called binary authorization on Borg. Let's just call it that because that's the current name for it. There were evolutions of it. What it essentially does is exactly what I described. It standardizes the build and release process across all the different teams, all the different services, and has one common infrastructure to host all the different workloads.
And it has a single policy that says, if you want to deploy a job in the production environment with access to this kind of data, your workload has to be vetted like this. It has to be tested, has to be code reviewed, has to be vulnerability scanned, has to be developed from approved sources and libraries. So it's something that was very successful at Google internally, and we were able to scale up the company and have a bajillion services and moonshots and whatnot and still have this infrastructure underpinning to make sure that we have control over all the workloads running in production.
And a few years ago, Kubernetes started to gain traction-- quite a few years ago, right? Kubernetes came out-- Kubernetes was largely based on the Borg production infrastructure that Google has, so the two loosely parallel each other. And we saw that customers adopting Kubernetes were starting to develop similar needs. You had microservices. Your development speed just exploded, your capacity exploded. You're able to really churn out services and code and have fast releases and have super cool products coming out on the market, like, super quickly.
But the counterpart of that is now you have so much churn in your production environment. You have so many developers that have the power to deploy code in production. Now the chaos starts to happen, and we saw that. And we're like, OK, let's give customers the control that we developed internally to solve this exact same problem and help them to create that underpinning as they scale out their production infrastructure, their microservices, their dev ops operations, and that's where the technology comes from.
And the kind of users we're targeting are really just anybody who has moved beyond five developer teams building a product. Once you have an operations team, once you start to have a build and release process, once you start to have rules and regulations and security standards you want to uphold, you should put that in a policy and make it enforceable and codified.
ANTON: OK. So I think it does make sense as a kind of an origin story, and of course, with some of the tech that was born at Google, due to our own unique circumstances, environments, threats, the question does come up about how do we make it work for others. So to me, I wanted to explore briefly, how do we make this work in practice at a real organization? You did say five developers, so it implies that maybe even an SMB-- well, let's not take that case, but think of, I don't know, a small bank or a manufacturer or a hospital. How do they make it work in real life?
SANDRA: Yeah. Yeah. I will say that a lot of our customers come from regulated industries, so it's very typical. And the typical story goes like this. A customer comes to me and says, I have a dev ops team that is managing the production infrastructure for my 20 developer teams and we are having trouble putting our arms around what goes into production. Where do I start?
And usually they start in one place, and the advice that I give to every one of them-- and some of them develop it organically as well-- is to start with one check, the built by me check. So organizations start by standardizing the build environment, because the biggest fear and the biggest risk is that a developer just builds something on their laptop and pushes it to production. Now I have no idea what's running, right?
So customers, they start to standardize on the build process and they implement one check, the built by me check. What does it mean in practice? It means that you have a hook in your build system. After it builds every container, essentially, it signs that container and puts that signature together with the image somewhere that the verifier can access.
And then you just go on and do your own thing. You do the vetting, you do the testing, and then by the time your image is ready to be pushed, binary authorization is a policy that gets defined on your runtime environment. So you will have a policy on your production cluster that says, require a signature from the centralized builder, and when a job gets deployed, it doesn't matter which path it takes, GKE calls out to binary authorization and says, can this deployment go through?
We'll look up the policy, we'll look up the signature, we'll do the verification. We say, yes, this one was, indeed, signed by the centralized builder. Go ahead, deploy it. And in practice, this is how it goes down. And then you can add additional checks. You can add a has to be vuln scanned, has to be signed by QA. Each check has a different key, and now you have a multi-party authorization release process.
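The sign-at-build, verify-at-deploy flow described above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the product's implementation: HMAC with a shared secret stands in for the asymmetric signatures a real deployment would use, and the SignatureStore class, the signer names, and the keys are all hypothetical.

```python
import hashlib
import hmac


def image_digest(image_bytes: bytes) -> str:
    """Content digest that identifies exactly one image."""
    return "sha256:" + hashlib.sha256(image_bytes).hexdigest()


def sign(digest: str, key: bytes) -> str:
    # Stand-in for a real signature: HMAC-SHA256 over the image digest.
    return hmac.new(key, digest.encode(), hashlib.sha256).hexdigest()


class SignatureStore:
    """Hypothetical place where signatures live next to the image."""

    def __init__(self):
        self._sigs = {}

    def put(self, digest, signer, signature):
        self._sigs.setdefault(digest, {})[signer] = signature

    def get(self, digest, signer):
        return self._sigs.get(digest, {}).get(signer)


def admit(digest, required_signers, store):
    """Deploy-time check: every required signer must have signed this digest.

    required_signers maps a signer name to its verification key.
    """
    for signer, key in required_signers.items():
        signature = store.get(digest, signer)
        if signature is None:
            return False
        if not hmac.compare_digest(signature, sign(digest, key)):
            return False
    return True


builder_key = b"demo-builder-key"  # illustrative only
qa_key = b"demo-qa-key"            # illustrative only
store = SignatureStore()

# Build hook: the central builder signs what it built.
built = image_digest(b"image built by the central pipeline")
store.put(built, "central-builder", sign(built, builder_key))

# An image built on a laptop never passed through the hook.
laptop = image_digest(b"image built on a laptop")

policy = {"central-builder": builder_key}
print(admit(built, policy, store))
print(admit(laptop, policy, store))

# Adding a second check with a different key gives multi-party authorization.
stricter = {"central-builder": builder_key, "qa": qa_key}
print(admit(built, stricter, store))
store.put(built, "qa", sign(built, qa_key))
print(admit(built, stricter, store))
```

Because the signature is tied to the image digest, it does not matter which path the deployment takes: an image that skipped the build hook has no valid signature to present.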
TIMOTHY: So what I love about this is it's cool technology in its own right that pushes organizations into doing the right thing elsewhere. Centralizing build, doing scanning-- that's awesome. We've had a lot of conversations on the show about how to roll out zero trust in an organization. Start small, be visible, plan, plan, plan. Same lessons apply here? Like, should users start with a small corner of their app? How do they roll this out successfully? What's the path look like after?
SANDRA: Yeah. So far we've been talking about enforcing a deploy policy to control what can go into production. To start even smaller than that, start with a monitoring policy. Because we all understand--
ANTON: OK, sorry for interrupting, but this is like-- this is just deja vu from the good old times. You probably have to answer, then, how do you decide to switch from monitoring policy to enforcement? Because a lot of security tech that has monitoring mode and enforcement mode gets stuck forever in monitoring mode. Think WAF. Think other tech.
SANDRA: Yeah. And it happened internally as well. Within Google, we started with monitoring, and then it took us a long time to actually switch into enforcement mode.
TIMOTHY: Huh. Mm-hmm.
SANDRA: It's certainly much easier to start with monitoring. It starts small. It has very small production impact. But the trade-off is you have to rely on the operations team to clear all the issues and to push the ball forward to a point where it can switch into enforcement mode, which could take quarters and maybe even years.
Typically we would recommend to a customer that you basically do brownfield and greenfield. For greenfield, start with enforcement. And if you have a good policy that you vetted, you do monitoring only to get the right policy. But once you have the right policy, you enforce it as much as you can.
Brownfield, you're going to have to do some retroactive whack-a-mole, and that's just not avoidable. And the plan, plan, plan part is very, very important because we've had customers tell me, I set up this really nice policy for my existing clusters, and before I realized it, we had a new dev team that spun up a new cluster and now they have a whole new set of production jobs running there. I didn't know about it, and now I have to retrofit a policy for that cluster, which takes a lot of effort.
So the plan part is basically to anticipate-- to structure your policy in such a way that it covers, for example, one part of your organization. So when new resources get spun up there, they automatically become subject to your policy, instead of having a policy defined only on existing resources and then having to do whack-a-mole on new things. That would be my advice.
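The difference between attaching policy to individual clusters and attaching it to a part of the organization can be shown with a small Python sketch. Everything here is an assumption made for illustration: the slash-separated resource paths, the "acme" names, and the effective_policy function are hypothetical and do not describe any particular product's configuration model.

```python
def effective_policy(resource_path, policies):
    """Return the policy of the nearest ancestor scope, or None if uncovered.

    resource_path looks like "org/division/cluster"; policies maps a scope
    path (at any depth) to the set of checks required under that scope.
    """
    parts = resource_path.split("/")
    # Walk from the most specific scope up to the root.
    for depth in range(len(parts), 0, -1):
        scope = "/".join(parts[:depth])
        if scope in policies:
            return policies[scope]
    return None


# Policy defined per existing cluster: a new cluster is not covered.
per_cluster = {"acme/payments/cluster-1": {"built-centrally"}}
print(effective_policy("acme/payments/cluster-new", per_cluster))

# Policy defined on a part of the organization: new clusters inherit it.
per_division = {"acme/payments": {"built-centrally"}}
print(effective_policy("acme/payments/cluster-new", per_division))
```

With the per-cluster layout the new cluster has no policy until someone notices and retrofits one; with the per-division layout it is covered the moment it is created.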
TIMOTHY: So to summarize, start with greenfield in enforcement mode, and for your brownfield, you get to make Anton cringe with memories of WAFs from the last 20 years.
SANDRA: Yes. Yes.
TIMOTHY: Got it. I mean, I like a plan that makes Anton cringe. I think that's great. This is good.
ANTON: But it's also common sense. And it's also, like, I made jokes about it, but of course, if you have monitoring mode that's safer, you want to go there. The question becomes, what motivates you to ever depart from monitoring mode and go to full enforcement mode? So to me this is-- going to monitoring mode, if it's available, is a no brainer. But going off monitoring mode to enforcement mode is the brainer, I guess, would be the--
SANDRA: It's the manageability, right? Being stuck in monitoring mode means that you're stuck in whack-a-mole land forever.
ANTON: Yes. Yes.
SANDRA: And your ops team will be crying for years to come. And the monitoring is great because it gives you that visibility into what are the things running in your production environment that have outdated packages or outdated vulnerability scans, or that have simply been there for too long. You get all of those insights. You get the visibility. At least you wouldn't have any illusions about your-- what you're running there.
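The trade-off between monitoring mode and enforcement mode discussed in this part of the conversation can be sketched as one evaluation function with a mode switch. This is a hedged illustration: the Mode enum, the evaluate function, the check names, and the audit log shape are all hypothetical.

```python
from enum import Enum


class Mode(Enum):
    MONITOR = "monitor"  # record violations, still admit
    ENFORCE = "enforce"  # record violations and block


def evaluate(name, passed_checks, required_checks, mode, audit_log):
    """Return True if the deployment may proceed under the given mode."""
    missing = sorted(set(required_checks) - set(passed_checks))
    if not missing:
        return True
    # Both modes produce the same visibility; only the outcome differs.
    audit_log.append({
        "workload": name,
        "missing": missing,
        "blocked": mode is Mode.ENFORCE,
    })
    return mode is Mode.MONITOR


required = {"built-centrally", "vuln-scanned"}
log = []

allowed_in_monitor = evaluate(
    "legacy-job", {"built-centrally"}, required, Mode.MONITOR, log)
allowed_in_enforce = evaluate(
    "legacy-job", {"built-centrally"}, required, Mode.ENFORCE, log)

print(allowed_in_monitor, allowed_in_enforce)
for entry in log:
    print(entry)
```

In monitoring mode the log keeps growing and someone has to work through it by hand, which is the whack-a-mole described above; in enforcement mode the same violation never reaches production.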
ANTON: OK. That makes sense. And again, I didn't mean to force it on you to answer it because-- but I think the picture is clear. So one other favorite question I have that doesn't make me cringe at all, but makes me kind of happy, is, what are the common mistakes when people deploy? What are the operationalization challenges? What are the deployment challenges? Where does it break? And of course, you can always say, they stick in monitoring mode for too long. That's a freebie.
SANDRA: One common-- I wouldn't say a mistake. I would say it's probably a challenge or a harder path to go down, not necessarily by choice. So typically, most of the use cases I've seen, most of the customers I've seen, fit in two categories. There's the centralized category, where you have a centralized dev ops team managing a central set of production resources and managing the central build and release pipeline.
And for those customers, because of the centralization, it's actually much easier for them to add policy to it and to track things. There's another set of customers that, possibly for historical reasons, have a more distributed model. So they have a central dev ops team that's almost like a consultant that creates tools and templates and best practices for each of the individual developer teams to follow.
But at the end of the day, the developer teams are the ones that build the binaries and control the production environment and make choices of how binaries are built and released and run and monitored. In that world, the dev ops team is always in catch up mode. So even when they do use the right technology, the right template, the right policy, their best practices, simply, they're not-- they have a hard time enforcing any of it on the dev team.
So if you have-- it's typical in very, very large organizations, like global organizations, and for maybe political or just legacy reasons they have this setup. And each of-- they have 40 development teams all using their own toolkits. For example, the central team will be like, I want everybody to scan their code before the code can be deployed in production. But what can you do?
TIMOTHY: I mean, that's a lot of, I would like a pony, in many cases.
ANTON: Mm-hmm. Yes. Correct. That flies.
SANDRA: Exactly. And then you can-- you have to individually comment. There are ways to make it easy. There are ways. You can implement org policy, you can create template. You may even be able to insert some of the enforcement here and there, but it's certainly a hard row to hoe.
TIMOTHY: Yeah. That makes sense. So we're just about at time, and I want to ask our traditional closing questions: first, recommended reading for our audience so we don't leave them empty handed, and second, one weird trick to be better at, I guess, authorizing your binaries.
SANDRA: Yeah. Recommended reading-- for the binary authorization on Borg, we actually wrote a white paper a couple years ago that condenses the goodness of how Google does dev ops and the philosophies, the principles, the pitfalls. It's a white paper online. Go search for it. "Binary Authorization on Borg." Recommend it. One weird trick to make it better. That's the same advice I gave to pretty much all of my customers.
ANTON: Yes. Perfectly OK.
SANDRA: If you can, implement a built by me check at the start of everything. It will make your life more sane and will put you on the right track to be able to add more advanced and sophisticated policies down the road and allow you to put your arms around what can go into your production.
TIMOTHY: Oh, that's awesome. Sandra, thank you so much for joining us today. You are, as ever, one of my favorite Googlers to work with, and it's a real pleasure to have you on the show.
SANDRA: Aw, thank you.
ANTON: Perfect. Thank you.
SANDRA: Thank you for inviting me. Have a good day. Bye.
ANTON: And now we are at time. Thank you very much for listening, and of course, for subscribing. You can find this podcast at Google podcasts, Apple podcasts, Spotify, or wherever else you get your podcasts. Also, you can find us at our website, cloud.withgoogle.com/cloudsecurity/podcast.
Please subscribe so that you don't miss episodes. You can follow us on Twitter, twitter.com/cloudsecpodcast. Your hosts are also on Twitter @Anton_Chuvakin and @_TimPeacock. Tweet at us, email us, argue with us, and if you like or hate what you hear, we can invite you to the next episode. See you on the next "Cloud Security Podcast" episode.