February 7, 2022

EP51 Policy Intelligence: More Fun and Useful than It Sounds!





Timothy: Hi there. Welcome to the "Cloud Security Podcast" by Google. Thanks for joining us today. Your hosts here are myself, Timothy Peacock, the product manager for threat detection here at Google Cloud, and Anton Chuvakin, a reformed analyst and member of the cloud security team here at Google. You can find and subscribe to this podcast wherever podcasts are available as well as at our website. If you like our content and want it delivered to you piping hot every Monday morning, please do hit that subscribe button. You can follow the show and argue with your hosts on Twitter as well. Finally, I wanna kick off today's episode by making a little bit of fun of Anton. You know, when we first started talking about this episode, Anton was all like, "What the heck is policy? Policy? What does that mean?" And it actually ended up being something where Anton, in a very understated way, tells our guest that he believes she solved something real. So it was a big turnaround on this one.

Anton: Indeed, and I think I was about to say the exact same thing: this was a topic that just kind of befuddled me completely, because I went down the security policy path. Is it about acceptable use policies? Is it about other security policies?

Timothy: No.

Anton: You don't see it, but Tim is making a face. Yes, and that's the right face. But it turns out it's something much deeper and much more cloudy and actually much more effective and much more workable hence my praise for the speaker and the topic.

Timothy: What I love is this is another example of Google Cloud using AI. And not just AI snake oil. This is a really interesting use of AI and ML to help security practitioners deliver real security outcomes.

Anton: So believe it or not, you'll see this episode about policy, but you're gonna learn about AI. How about this for a bait and switch? 

Timothy: I think with that, let's welcome today's guest. We're joined today by Vandy Ramadurai, a product manager here at Google Cloud, and personally, one of my absolute favorite PMs to work with here at Google. Vandy, a relative softball to start us off today. What is cloud org policy, how is it different from IaC, and why is this so important for organizations who are both starting and continuing their cloud journey to get right?

Vandy: Thanks for having me, Tim. To dive into it, let me start by telling you what cloud org policy is, and I think that's important. The way I think about organization policy within GCP is it's GCP's one-stop governance guardrail system, right? It lets security admins enforce what types of resource configurations are really allowed in their environment. And, you know, why do they wanna do this? For things like cost containment, compliance, security. These are all the reasons why you want your environment to look a certain way and why you want resources to be a certain way. On the developer side, right, the big benefit is we're really improving developer velocity. If you think about it, back in the day it was, "Oh, I wanna spin up this whole environment." You had to go figure it out, get approval from your admins. Now, with something like org policy, it's like, "Okay, I've set that baseline guardrail that I know is safe." It's a safe environment for my developers to function in and for them to create resources. And so once you've enforced those kinds of things, it's really easy for developers to move fast, create things, and not have security be a blocker. So that is really cloud organization policy. To double-click into your question, right, how is it different from IaC? What I will say is there's infrastructure as code, and more popular in my area is policy as code, where customers are starting to express their intent in their Git repos, and that then manifests on our side in different ways in different policy systems. Now, it's not an either-or thing. It's not very different. In fact, customers use these things together. You don't wanna be making one-off manual changes in your environment. You actually wanna be able to do this in an automated way, have version control, and be able to have some kind of approval workflow of, like, "Okay, let me review these kinds of policies that are going ahead and being instantiated in my organization."
So they work really, really well together. And for a lot of our customers, they use both. So you have things like pre-validation time that your infrastructure as code can help you with or your policy as code can help you with, but nothing like having that additional layer of security at admission control time where we can say absolutely not. This resource is not allowed to be created because it has an open port.
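The admission-control idea Vandy describes, where a request is rejected at creation time because the resource has an open port, can be sketched roughly as follows. This is an illustrative toy, not how Google Cloud organization policy is implemented: the `FirewallRule` type, the `admit` function, and the `BLOCKED_PUBLIC_PORTS` set are all hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class FirewallRule:
    """Hypothetical, simplified description of a resource someone wants to create."""
    name: str
    source_ranges: list[str]
    allowed_ports: list[int]


@dataclass
class Decision:
    allowed: bool
    reason: str = ""


# Assumption for illustration: the admin never wants these ports exposed to the internet.
BLOCKED_PUBLIC_PORTS = {22, 3389}


def admit(rule: FirewallRule) -> Decision:
    """Decide at creation time whether the resource may be created at all."""
    open_to_internet = "0.0.0.0/0" in rule.source_ranges
    exposed = sorted(BLOCKED_PUBLIC_PORTS.intersection(rule.allowed_ports))
    if open_to_internet and exposed:
        return Decision(False, f"{rule.name}: ports {exposed} may not be open to 0.0.0.0/0")
    return Decision(True)
```

The point of the sketch is the placement of the check: it runs before the resource exists, so there is nothing to detect and remediate afterwards.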

Anton: So wait a second. I wanna interrupt briefly and ask a question about the overall value of this. You mentioned that it's policy--when you mentioned policy, I immediately thought security policy, but it's really not just security policy. It's other things. So where would it matter for a typical client? Like, for a security leader for example?

Vandy: The majority of the use cases we're seeing, Anton, are actually security use cases, but there are also some compliance use cases. Let me give you an example. Maybe for compliance reasons, or for certain cost reasons, you don't want your developers using any services that aren't compliant with a specific regulatory regime. What do you do then, right? Like, you can't just send an email saying, "Please don't use this." Everyone's gonna explore as developers. So one of the ways is being able to have a policy that says here are the kinds of services that you are allowed to use in your environment. Everything else is blocked. That's an example of a cost-containment kind of policy that can also double up as a compliance policy.

Timothy: Just putting the fences up to contain the cats rather than chasing the cats after they've gotten all scattered.

Vandy: Yeah, yeah.

Anton: That does make sense to me. And in this case, when you pursue that path, what does a successful org policy design look like from maybe, like, a business point of view? We'll get to the details later but, like, what does a good, successful policy design look like?

Vandy: We think a lot about this. Let's start with why, to an end user, to a human, this design really matters, right? Why should it be strong and good? From the first perspective, let's take it from the start of the journey for an admin, right? Our big thing is that with a good design you're able to quickly translate some of these requirements that you have to these guardrail systems. What are my business requirements? Like, what do I want to allow in my environment and not allow, and how do I express that? Being able to do that accurately is something we spend a lot of time helping our customers with, and we want to make sure they get that right. So a good design starts with making sure you can translate that really well. The second piece, continuing that journey, right, is making sure you can enforce these guardrails without breaking anything. So we talked about making sure developer productivity goes unaffected. The worst thing you can do is they have a VM running, you enforce this new org policy, and everything breaks for them. How do you do it in a way without breaking something for a developer? This second area, good org policy design and workflow, is something we spend a lot of time building for our customers as well.

Anton: But wait. So what about developer things that should not happen? You should break them, right? Like, if they're trying to do something that is risky for an organization, then not breaking is not a good goal. The goal is to break them so that they don't happen, right? Just to confirm. It's not an allow-all policy.

Vandy: Correct. You definitely don't want to break someone in their playground developer environment. So you wanna make sure you're saying, okay, if I enforce this policy, what is likely to be violating? Oh, I see this developer project or this project with a tag development. Maybe I should exclude that. Being able to do that well without disrupting your developers, we think that's important. But in your production environment, you're right. Of course, you need to shut down the things that are security loopholes. Absolutely agree with that. So continuing about good design. As the life cycle continues, it's also important that you are able to do attestation and auditing well. How can you prove whether it's your security team or whether it's auditors that you have complete coverage for some of the things that you're required to adhere to? So being able to translate that, express that, prove that, and show that you don't have violations, that's the third piece that we really, you know, wanna make sure that good design doesn't show too many violations and that all the violations have been remediated as well.

Timothy: This sounds not easy. But this is Google. And I have a working theory at Google that the only things worth doing are hard things. Are there ways that we're using and I think the title of the episode kind of gives it away with policy artificial intelligence. How are we using AI to make this work better? 

Vandy: Yeah. So in the policy space, a big focus for us is making sure we can sort of predict what the customer or the admin will need, and the second piece that we wanna make sure of is that nothing ever breaks in all of the things we're doing. So let me give you a very tangible example, and a product that our customers love using. It's a great way to sort of explain how we deploy these ML capabilities. So we have a product called IAM Recommender. When you think of it, over-granting is just a crazy and hard problem. It's just really taxing. It's so much toil to reduce over-granted permissions in the environment. A couple of years ago, we said, "Oh my gosh, you know, Gartner came out with a report that said almost 95% of the over granting--there's over-granting in, you know, customer environments across all clouds." It was a crazy high number. And we said, this is something we need to solve. So what we do is we look at customers, you know, the access that they've given. So I say, "Okay, let me look at our own org. Okay, I'll see what Timothy has access to, what Anton has access to." What we're doing is a 90-day analysis of everything that you're using. If you're not using it over a 90-day period, the chances are you're actually not going to use it ever. That's a trend that we saw. The second piece is there are some cases where once a year, you're gonna spin up that one thing that you didn't use that permission for in that 90-day period. This is where we deploy our ML. We say, let's analyze people like Tim here. Let's see what permissions they've been using. Oh, we see that at some point he would need the IAM service account create permission. Let's actually keep that, right? This is really, really important. It seems like a small thing. Oh, great. You added a couple of extra permissions that you think you'll need at some point. But the amount of time it takes for a developer to get unblocked from denied access, that's wasted money, right?
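The logic Vandy outlines has two parts: drop permissions that went unused over a 90-day window, but keep the ones that similar users tend to need eventually. A minimal sketch of that decision rule is below. It is not IAM Recommender's actual algorithm; the function name `recommend_removals` is hypothetical, and the `peer_needed` set simply stands in for whatever the ML model would predict.

```python
from datetime import date, timedelta

LOOKBACK = timedelta(days=90)  # the observation window mentioned in the conversation


def recommend_removals(granted, last_used, peer_needed, today):
    """Return the permissions that look safe to remove.

    granted:     set of permission names the principal currently holds
    last_used:   dict mapping permission name -> date it was last exercised
                 (a missing key means it was never observed in use)
    peer_needed: set of permissions that similar principals tend to need;
                 a stand-in here for the ML prediction
    today:       the date the analysis is run
    """
    cutoff = today - LOOKBACK
    removals = set()
    for perm in granted:
        used = last_used.get(perm)
        recently_used = used is not None and used >= cutoff
        if not recently_used and perm not in peer_needed:
            removals.add(perm)
    return removals
```

With a usage log where one permission was exercised last month, one was last used eight months ago, and one was never used but is common among peers, only the stale, non-predicted permission comes back as a removal candidate.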

Timothy: It is.

Vandy: We really spend a lot of time deploying ML for those kinds of things, making sure nothing breaks down the line. These are things that customers would be not afraid to automate and do at scale. And not, you know, not worry about things breaking.

Timothy: So that's interesting. So you're describing a system that not only can spot the permissions I don't need but also can spot the permissions that I do need.

Vandy: That you'll need at some point in the future. Yeah. Because we've noticed this pattern among people like you with similar permissions, not just in your environment, but across our customer base. 

Timothy: That's very interesting. 

Vandy: Yeah. And another strength I would say of Google is we do everything at scale here. 

Timothy: We like to say so. 

Vandy: Yeah, yeah. We do like to say so. I hope we can deliver on this. We work with some really, really large customers. And if you see the amount of resources they have, we're talking millions, right? Of VMs and things like that. I did a talk with HSBC at our recent GCP Next, and every day they create 20,000 VMs. So at that scale, you have to be able to do things like, oh, you're going to enforce this kind of powerful org policy? Let's run a preview of violations. Let's go ahead and give you a report on all the things that could potentially violate or not adhere to this. To be able to return that as quickly as possible, that's another strength of ours that we really put effort into.

Timothy: And now is that AI-powered or something else? 

Vandy: We're gonna infuse it with that as well. Yeah.

Anton: So one thing that always worried me, again, back in my analyst days, when somebody would say, "Oh, we'll design a very granular policy to allow the things that should happen and block all the things that shouldn't happen," is that it always fell on the shoulders of some team somewhere as a huge burden. So whenever I hear of granular policies, I sort of freak out because I know it sounds like work and it's also ongoing work. So you mentioned AI, you mentioned the IAM Recommender, which is sort of an intelligent system, but what else are we doing to sort of make this an achievable task? Because ultimately if the policy design is too complex, it won't be used and we will have over-permissioned policy just like Gartner wisely pointed out. So how do we make it tight and granular yet not overly laborious? That, to me, is the challenge. I haven't seen people solve it well elsewhere.

Vandy: It's an ongoing challenge and I would argue we're still trying to solve it. You know, even with Recommender, Anton, you pointed this out perfectly. We give you this data after 90 days, but the area we're venturing into is where are our customers starting on day one? How do we prevent that over-granting on day one? And, Tim, this is where we, again, plan to, you know, think about how we can use our ML capabilities. What are people needing early on in their journey, right? When you enable X, Y, and Z services, or you plan to use certain types of services, what's the right kind of permissions you need to give? Not just this blanket owner role or editor role with, like, these thousands and thousands of permissions, but can we scope that? That's an area that we really wanna go ahead and invest in and really double-click into. But as the journey continues, we need to support customers. The only way to do this least privilege and not have a complex design is supporting customers throughout the life cycle of this policy. So all the way from onboarding to deployment of this policy, making sure it's safely rolled out. So another tool that we have, talking about scale, is Simulator.

Timothy: Tell us about Simulator 'cause that's a fun one. 

Vandy: Yeah. Customers are really scared to make changes. You know, they've always known they've had over-granting. Until Recommender came about, nobody was doing anything. They knew nobody's using this permission but they're scared to remove it. So what we did with that is we launched Recommender and we said we still wanna push them forward to help them make the changes they need to make. You have a new business requirement that comes in. You need to be able to make that change. Recommender may or may not have a recommendation for you. So what happens is Simulator says, "Oh, you want to move Tim's access from the owner role to this other, smaller BigQuery admin role? Let's see what he's been doing in the last X days." We look back the last, like, 90 days and we see your access and we say, "Okay, here's what's gonna happen. He's losing all these permissions. But out of that, here's a subset of the permissions that he's actually been using. And are you sure you wanna go ahead and remove it, because you're gonna break it? It looks like he's been using this recently." So that's something we're doing with Simulator and we see that we have tremendous adoption of this because of that capability. We're actually looking at usage. If you see the tools out there in the market today, it's a simple static comparison. It's like Tim had, you know, X, Y, and Z permissions. Now he'll have permission X. Whether he was using Y or Z no one knows, but we do that added layer of analysis. So that's made it safer for customers to continue to, you know, design things and not feel overwhelmed and say, "Okay, he's using Y and Z. Let me go find another role that has those two permissions and give him that specific role."
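The difference between a static comparison and the usage-aware check Vandy describes comes down to one extra set intersection. The following sketch assumes permissions are plain strings and that recent usage is already available as a set; `simulate_role_change` is a hypothetical name, not the API of the actual Simulator product.

```python
def simulate_role_change(current_perms, proposed_perms, recently_used):
    """Preview what a role change would take away, and what it would break.

    current_perms:  permissions the principal holds today
    proposed_perms: permissions the principal would hold after the change
    recently_used:  permissions observed in use during the lookback window
    """
    lost = current_perms - proposed_perms   # what a purely static diff reports
    would_break = lost & recently_used      # the added usage-aware layer
    return {
        "lost": lost,
        "would_break": would_break,
        "safe_to_drop": lost - would_break,
    }
```

In the X, Y, Z example from the conversation, moving from {X, Y, Z} to {X} loses Y and Z; if only Y shows up in recent usage, Y is flagged as a likely breakage and Z is reported as safe to drop.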

Anton: So that means we finally solved the problem, right? I know we don't have the solution to every possible scenario, but it sounds like the over-provisioning that's all-pervasive both in the cloud and on-premise, we at least have a route that can make massive changes. To me, this is kind of big deal so why are we not talking about this? Oh, wait, I'm in marketing so maybe I should be talking about it. 

Timothy: Anton, we're currently talking about it. Also, Vandy, Anton just said you solved the problem. Like he never says that. That's a big win. You should just go home for the week now.

Vandy: Yeah, I'll take it.

Timothy: I want to shift gears a little bit and talk about kind of how your work and my work intersect because I think this is an interesting area philosophically and practically. We have a lot of things in Cloud that we could take as proactive guardrails, or we have the same problem that we could approach from a reactive detection standpoint. How do we know where to invest and how does an org know whether they should guardrail something up or wait and do detection if things go, you know, funny-looking afterwards?

Vandy: I think we're still ourselves trying to figure this out. Here's what I'll say. There are some things where, of course, it depends on the environment that you're in and all of that, but let's just take a generic customer organization, right? You know, simple things like, "Hey, your developer shouldn't be able to create a service account key," right? So that it doesn't land in a Git repo, and the next thing you know everyone has access to those things. These are things where it's kind of scary. It also happens a lot. So you know, you want to make sure you're doing both a preventative and a detective strategy there. As I mentioned, we had HSBC talk to us about this at GCP Next. And for the longest time, Tim, they have been entirely focused on detection. I don't want certain ports on my firewalls.

Timothy: Oh and yeah, they keep you busy.

Vandy: Yeah. I don't want, you know, no service account keys, you know, and they're constantly spending so much time remediating things. And then they stumble upon org policy and they're like, "What? We want to be only detecting things that we really are scared about. There's so much noise otherwise." Like imagine that staying in the detection world you have threats. You have some resource misconfigurations that you don't like. You have other types of policy anomalies that you want to look into and policy drift. You're spending a lot of time trying to fix those things. The last thing you need to worry about is, "Oh my gosh, this one developer is sitting down and, you know, doing a good resource configuration." So we like to think in terms of good resources and, like, good resource hygiene, right? Let's put that as preventative. The detection should really be those scary threats that you really wanna look into. You wanna make sure what is the strategy here? How do we make sure this never happens again? That's where we want them to spend the time if that makes sense. 

Anton: Yes. And I think that I want to introduce an angle to this that I sort of touched on but never developed. I know it's a little bit boring, but from your description, it sounds like it has pretty sizeable implications for some regulatory compliance like PCI DSS: removing access to systems, to data. So what part of it has compliance value versus security versus just good ops value? Can you sort of, like, make a pie chart for me? Well, over audio. Compliance, security, good operations. It sounds like all three have been covered.

Vandy: Security is the biggest bucket. And if you look, security's evolving much faster, I think, than compliance requirements. The threat landscape is, like, constantly expanding. So from a simple pie chart perspective, front and center for org policy, we wanna make sure that as and when we're learning either from our red team exercises or we're learning about some new threats coming in and, you know, we think we can solve them with a good protective guardrail, those are front-and-center security-related things we want to support. A big chunk of all our org policies today are there in that space. And then the second piece of that is now we're entering compliance. It's really things like restricting the location where their resources can be created, making sure you're only using certain types of compliant services. That's a smaller segment, but it's a growing segment. The types of guardrails that customers are setting on day one, those are the ones. They don't want their developers using different kinds of services and then have to sit down and say, "Hey, stop using it," right? So those are the early compliance ones. We see them being adopted really quickly. And then what was the third bucket, Anton?

Anton: Just improving operations, because some of the stuff, over-provisioning and a few other things, they sort of hurt operations as well. So to me, this is ops value, security value, compliance value.

Vandy: Yeah. And the way you're talking about the ops, right? The access pieces, we track the metrics of what policies our customers are setting, things around the security resource configuration, and then the ops pieces are on access management.

Those are the highly adopted ones. Simple things like no one from outside my organization should have access. We're introducing a new one like here are the different types of roles that someone is allowed to grant in my organization. Those are the kinds of operations like access-related guardrails that we're getting a lot of interest in. That's another well-adopted set of policies and it's growing and we sort of bundled them up with security. 

Anton: Perfect. This is actually really good and I think we ended up with a few useful insights. So to start looking at our traditional final questions, what is one practical suggestion you can make for our listeners to improve their policy? And of course, you can say buy my product, turn on policy, but give me something most people can use to reduce policy sprawl, to reduce over-provisioning in the cloud perhaps.

Vandy: Here's what I will say. This comes back to, you know, how do you organize your environment? I would say the way some of our advanced customers are thinking about this and the things we've learned from them is first start with a baseline. What are some absolute things in your organization that you don't want happening? We have a good set of org policies that we recommend customers start with. So figure that out and enforce that broadly across your organization. It's important you start this way. Then the second piece of that is structure your folders in a way and plan for scale. You can organize it by business units. You can organize it by cost centers. The last thing I want customers doing is going ahead and giving exceptions and forgetting that they gave exceptions for these kinds of guard rails. Organize it in a way so that you're adding on constraints, depending on the business requirements of those business units. That's something I feel like it's easier said than done, but that's a very basic way to make sure that you're protecting as you are also personalizing for your different units. That's one thing I would definitely say. The second piece of that is also, you know, start with a good security framework. One thing that we've heard from our customers is what they love about org policies when a developer gets denied and says, "Hey, you're not allowed to actually do this with a VM," it's really easy to understand, "Hey, why is this not allowed in my organization? I get it, right?" So start with a really good security framework, express that in a way that makes sense to your organization, right? I think that's really important from a customer perspective and we've seen great success in the adoption of these guard rails, developer productivity, and so on and so forth.
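The structure Vandy recommends, a broad baseline at the organization level with extra constraints added per folder, amounts to accumulating constraints while walking up the resource hierarchy. A minimal sketch under stated assumptions: the node names, the constraint names, and the `effective_constraints` helper are all hypothetical, and real organization policy inheritance has more options than a plain union.

```python
def effective_constraints(node, parents, constraints):
    """Collect every constraint that applies to a node in the hierarchy.

    node:        name of the project or folder being evaluated
    parents:     dict mapping each node to its parent (None for the organization root)
    constraints: dict mapping a node to the set of constraints set directly on it
    """
    effective = set()
    while node is not None:
        effective |= constraints.get(node, set())
        node = parents.get(node)
    return effective


# Illustrative hierarchy: an org-wide baseline plus one business-unit folder.
PARENTS = {"org": None, "folder-retail": "org", "proj-web": "folder-retail", "proj-lab": "org"}
CONSTRAINTS = {
    "org": {"no-external-members"},             # baseline enforced everywhere
    "folder-retail": {"eu-locations-only"},     # added for one business unit
}
```

Because constraints are only ever added on the way down, a project under `folder-retail` gets both the baseline and the folder's restriction, while a project directly under the organization gets only the baseline, with no exceptions to remember.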

Timothy: Vandy, that makes a ton of sense. I want to wrap up with our traditional closing question because we hate to leave listeners empty-handed. Do you have reading aside from go watch your, by the way, excellent, really well-done Next sessions with our friends over at HSBC? Do you have recommended reading for people?

Vandy: Yeah, I do have recommended reading. There's one piece. Of course, the Next talks are called governance guardrails and policy intelligence. I love the way both HSBC and Target talked about it, so definitely go check it out. Many of the things I've touched on, they touch on there. Outside of that, of course, we also have documentation, but we have a ton of blog posts on, like, how we're using Recommender to reduce over-granting. That's definitely one. And, more importantly, how to deploy these policies using policy as code. The best thing to do is to automate a whole bunch of these and remove the manual processes. We have those blog posts as well, ready to go for your perusal.

Timothy: Nice. Well, Vandy, thank you so much for making the time to join us. I know scheduling this was no easy feat. Listeners, thank you so much for joining us today.

Anton: And now we are at time. Thank you very much for listening and of course, for subscribing. You can find this podcast at Google Podcasts, Apple Podcasts, Spotify, or wherever else you get your podcasts. We are, in fact, on many, many apps. Also, you can find us at our website. Please subscribe so that you don't miss episodes. You can follow us on Twitter. Your hosts are also on Twitter at anton_chuvakin and at _TimPeacock. Tweet at us, email us, argue with us. And if you like or hate what you hear, we can invite you to the next episode. See you on the next "Cloud Security Podcast" episode.
