#54
February 28, 2022

EP54 Container Security: The Past or The Future?

Guest: Anna Belak, Director of Thought Leadership at Sysdig

23:23

Topics covered:

  • One model for container security is “Infrastructure security | build security | runtime security” - which is most important to get right? Which is hardest to get right?
  • How are you helping users get their infrastructure security right, and what do they get wrong most often here?
  • Your report states that “3/4 of running containers have at least one ‘high’ or ‘critical’ vulnerability,” which sounds like pre-cloud IT, but this is about containers? This was very true before cloud, so why is it still true in cloud native? Aren’t containers easy to “patch” and redeploy?
  • You say, “Whether the container images originate from private or public registries, it is critical to scan them and identify known vulnerabilities prior to deploying into production,” but then 75% have critical vulns? Is the problem that 75% of containers go unscanned, or that users just don’t fix things?
  • “52% of all images are scanned in runtime, and 42% are initially scanned in the CI/CD pipeline.” Isn’t pipeline and repo scanning easier and cheaper? Why isn’t this 90/10 but 40/50?
  • “62% detect shells in containers” sounds (to Anton) like “62% of zoos have a dragon in them,” i.e., kind of surreal. What’s the real story?
  • Containers are at the forefront of cloud native computing, yet your report seems to show a lot of pre-cloud practices? Are containers just VMs, and VMs just servers?

Do you have something cool to share? Some questions? Let us know:

Transcript

Timothy: Hi there. Welcome to the Cloud Security podcast by Google. Thanks for joining us today. Your hosts here are myself Timothy Peacock, the product manager for threat detection here at Google Cloud, and Anton Chuvakin, a reformed analyst and esteemed member of the cloud security team here at Google. You can find and subscribe to this podcast wherever you get your podcasts, as well as at our website cloud.withgoogle.com/cloudsecurity/podcast.

If you like our content and want it delivered to you, piping hot every Monday, please do hit that Subscribe button. You can follow the show and subscribe to the show and argue with us on Twitter as well, twitter.com/cloudsecpodcast. And with that, I am delighted to introduce our guest today, a title so impressive, Anton could not bring himself to announce it, I am pleased to introduce Anna Belak, the director of thought leadership at Sysdig. Anna, thank you so much for joining us today. It's a real pleasure to have you on the show.

I want to kick things off with a framing question as we think about container security. I think of it as infrastructure security, build security, and runtime security. A, do you like this framing, and B, if you like it, which of those do you think is the easiest to get right, and which the hardest? Where are you seeing investment to help users get it right? Talk to us about the general landscape of container security.

Anna: First of all, thank you for having me. This is a huge honor. I am a total fan. I think I've listened to maybe all of the episodes, or very close to all of them, and you guys rock. So very happy to be here. It's very hard to answer your question, actually. I want to pick on you from the beginning and say there's actually application security too, but I also don't want to talk too much about it, because application security is a whole other can of worms, but it is really important.

The reason we actually care about containers, for the most part, is because they ease life for developers. That's the driver. And that ends up being really important in how we then manage them, secure them, and do everything else with them in the app. But to your point about the three things you mentioned, I do actually agree that those are all key. In classic fashion, your questions are really hard.

I think the hardest one is runtime. There's a couple of reasons for that. The easiest reason, I think, is one that's near and dear to both of your hearts: detection and response is just hard. When you try to do detection and response on containers, it's still hard for all the reasons it was always hard, and then some: containers are ephemeral, and they do silly things that look wrong when they aren't, and so on and so forth.

I think that the most important thing, actually, or maybe the thing that we're spending the most time and energy on right now, is more on the infrastructure side. Because before you can really worry about runtime, you have to get some basics right. And this almost goes back to the security hygiene story of: look, you have to not open silly ports, you have to scan for vulnerabilities, you have to think about your configuration being secure. So all of those things are required before you can get fancy with runtime stuff.

Timothy: That makes a lot of sense. There's a missing piece from that set of security, which is, of course, digital forensics and incident response. I just assume people should earmark 10% of their budget for that and call it a day. So that's an interesting answer, and I think a good one: it really starts at the start of the list, which is getting your hygiene right.

Anton: Perfect. When I think of hygiene, I think of infrastructure first; maybe I'm just not very ethic-minded or something. When we think about [inaudible] and similar technologies helping users get their infrastructure security right, what are they getting wrong? Because again, reading your report, which we'll mention more throughout the episode, it does seem like container users sometimes get infrastructure wrong. So before they get to all the fancy stuff like runtime and build security, they actually miss the infrastructure angle. So what do they get wrong most of the time, and how do you help them?

Anna: This is nuanced, I think, actually, because what you mean by infrastructure becomes important: is a container a piece of infrastructure, or is it a piece of application? It's a real existential crisis question that I actually struggled with. I think that the thing they get wrong is at a higher level than any of that, and it's the pattern that they're trying to impose upon whatever it is they're doing, whether that's an application they're building or an application they're migrating or whatever.

The beauty of containers and Kubernetes and cloud native and all these other things is that you're trying to operate in a different way, like the cattle-not-pets thing, where you can just throw away images that are broken instead of trying to fix them. You're moving a lot faster, you have a lot more redundancy built in by default, you're able to scale. So all these things give you a lot of power. What often happens, and I think this is the biggest mistake, is that users don't actually take advantage of these things. They take their legacy habits and their legacy ways of doing things, and though they've moved their infrastructure and their application to the cloud, they haven't actually adapted their philosophy or approach. And that causes them to do things like not patch often enough, or rather, I shouldn't even say patch, because it's not patching, it's immutable. You just destroy it.
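
To make that cattle-not-pets point concrete: in a system like Kubernetes, getting rid of a vulnerable image is a declarative change rather than an in-place patch. Here is a minimal, hypothetical sketch; the deployment name, image, and registry below are invented for illustration, not taken from the report:

```yaml
# Hypothetical Deployment: bumping the image tag does not patch the
# running containers; Kubernetes rolls out brand-new pods from the
# new image and destroys the old ones.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          # Changing 1.2.3 to 1.2.4 and re-applying the manifest is the
          # whole "fix": old pods are terminated, new ones are created.
          image: registry.example.com/web:1.2.4
          ports:
            - containerPort: 8080
```

Re-applying the manifest with a new tag is the destroy-and-redeploy workflow described above: nothing is ever repaired in place.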

Timothy: So this problem of users coming to containers thinking containers will fix their things, that's like going to Hawaii with your partner: things aren't magically better there, either. Containers as Hawaii. You say in your report, and I think I can quote it: "Three quarters of running containers have at least one high or critical vulnerability." To me, this is the classic example of people adopting tooling, thinking it'll fix their problem, and not having it fix the problem. So why aren't we running latest? Why aren't we redeploying? What's keeping people from having the outcome they want?

Anna: I think you're absolutely right about the cause, but the solution is just not simple, because it's never simple. First of all, there are things that should be easier. It's quicker to get an answer to whether or not this thing is vulnerable. It's quicker to throw it away and deploy another thing that's not vulnerable. But whether or not you can actually remove a vulnerability is still not that trivial of a question. There are still dependencies. So you have components, and there are other components dependent on those components. This results in a horrible supply chain problem that causes Anton existential dread.

At the end of the day, you can't actually make the decision about whether or not any given thing can be replaced trivially without affecting lots of other things. You still end up with these risk decisions and these workflow problems that cause people to kick this can down the road. And at the end of the day, they're incentivized to release features faster. They're trying to build software, they're trying to deliver business value, and so the security thing, every moment that it slows them down, it just gets ignored or dodged.

Anton: That sounds like a lot of pre-cloud thinking. As I mentioned in the question, it sounds like a lot of the approaches and practices and habits from before cloud come in, and people follow them. So for example, it's easy to scan and find the vuln, but then you also said right away that it's easy to destroy it and put up a new one that's not vulnerable. People do the first part, but not the second part. Here we're not asking them to patch. We are asking them to take a flamethrower, burn the old container to the ground, and get a new one. That is supposed to be an easier task, yet they don't do it. Am I right? Or am I still too--

Anna: You're right, but consider the implication. Say the vuln is in, let's say, a load balancer. It's a single thing. It does one thing. You have a new version of this load balancer, you just destroy it, you deploy a new one, and it's going to go ahead and route your traffic. That's pretty simple. Now consider that the vuln is in a database. What if it's Cassandra? So now you have to patch this vuln in Cassandra. Okay, I'm not going to patch it. I'm just going to toss it in the trash and deploy a new Cassandra. But all of the stuff I built that connects to that thing, that relies on that thing, may or may not be able to handle the new version trivially, depending on how it's architected and how it's built. That's a risk. That's a business risk of just deploying that thing. So you have to consider the implications of that fix. Even if the fix itself in theory is simple, the outcome may be painful.

Timothy: Yeah. And you can't have a case where the medicine is worse than disease here.

Anna: Exactly.

Timothy: Sometimes you just have to choose to live with something not being quite right.

Anton: Right. But this whole cloud native stuff is fake, because this is the same problem I would have if I'm patching my Oracle and the patched version 8.5.7-23_45 does not work with the app. This is the problem that I would have in 2007.

Timothy: Anton, you can't say the cloud is fake. You work at Google Cloud.

Anton: No, no, no, I didn't say cloud is fake. I said cloud native is fake.

Anna: Oh, man.

Anton: Because I said that the problems you're describing are the same problem IT would have with patching a database in 2007. Where's the magic?

Anna: Nothing about that changes. I mean, there's some magic, though. A lot of things do go faster. For example, the amount of time between when something is disclosed, or when you discover something is wrong, and when you can actually fix it is much smaller now. You can have a release the next day that fixes everything, potentially.

So I do think that it's a little unfair to say nothing's changed because in 2007, this might have taken months or years, and now it might take weeks or months, which is like a pretty big improvement.

Anton: Okay. I'll go with that. I will now switch to a positive theme. One other thing from the report that I picked out is that while 75% of containers, as we just said, contain patchable vulnerabilities, some customers, the most mature and meticulous customers, as you say, reduce this metric to below 5%. 5% versus 75% sounds like a big difference. So as a longtime operational maturity nut, it sounds like there's a bit of a chasm, possibly a big chasm, between the average at 75% problematic and the best at 5% problematic. How are they doing it? What's the magic behind them being so much better? Are they the true cloud native? Do we now have the cloud native and the real cloud native? Or is there something else going on?

Timothy: We've got some Scotsman about to join the call, I think.

Anna: I do actually think there's something to the "cloud native" versus real cloud native distinction. My personal opinion is that I attribute a lot of that success to folks actually adopting some of these patterns. That includes not just the immutability thing of throw-it-in-the-dumpster-and-deploy-the-new-one, but also the shift-left buzzword, which actually just means finding these problems sooner. The sooner you can discover that there's an issue, the easier it is to alleviate that dependency suffering for the developer who is responsible for making sure the issue isn't there.

From our perspective, the folks that use at least our tool scan sooner, they are more serious about making sure that they get visibility into what all these workloads are doing and how they rely upon each other, and they generally are just doing a better job of that hygiene stuff in a cloud native way, one that is specific to how you should treat a container, as opposed to trying to compromise between the legacy approach and the new approach.

Timothy: That makes a lot of sense. And on the topic of shifting left and on the topic of build, we saw in your report that 52% of images are scanned at runtime and 42% are initially scanned in the CI/CD pipeline. That seems backward to me. Shouldn't most people be pushing their scanning into the build pipeline rather than the runtime? What's up with that?

Anna: Again, this is where I would say that the folks that are high maturity are definitely pushing it into the build pipeline. And in fact, they're actually doing even more advanced things, where they scan their artifacts, they'll scan their infrastructure as code, the YAML, before they ever build anything. So they can see misconfigurations before they've ever bothered to build a thing. The issue is, first of all, there are folks who are just picking this up. It doesn't take five minutes to build a CI/CD pipeline. So if you've never done that before, and you're just starting to adopt some of these practices, it could take a really long time to build that out.

Timothy: How long?

Anna: Years potentially, depending on where you started. It's not just about buying a trinket and plugging it in. There's a lot of process and culture that has to shift.

Timothy: But years? You make it sound like these people are melting down sand and building their chips from scratch by hand.

Anna: I mean, it's pretty close.

Timothy: So really you think this is a maturity thing and the difficulty of getting CI/CD stood up and doing it right and all that.

Anna: It's a maturity thing and a lot of it is also just skillset and process and training where people just need to get used to the way that things go.

Timothy: Sure.

Anna: It takes some tuning. So when we say scanning in CI/CD, that means you have to make a decision: for example, at which point do I fail a build, at which point do I alert on a build, and at which point do I just ignore it because it's never going to get fixed, because it's a low severity or whatever? Those kinds of gates take tuning, because you put something in, it doesn't work, the developers hate you, then you go back and forth and you negotiate. So this all takes time to get right.
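
As a concrete illustration of the kind of gate being described here, below is a minimal CI sketch in GitHub Actions syntax. Everything in it is a hypothetical choice: the image name, the severity thresholds, and the use of the open-source Trivy scanner (assumed to be preinstalled on the runner); the report does not prescribe a specific tool or policy.

```yaml
# Hypothetical CI gate: fail the build on critical findings, surface
# high ones, and ignore the rest ("because it's a low or whatever").
name: image-scan
on: [push]
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Scan infrastructure as code before building anything
        # Misconfigurations in the YAML fail the build outright.
        run: trivy config --exit-code 1 --severity HIGH,CRITICAL .
      - name: Build image
        run: docker build -t registry.example.com/web:${{ github.sha }} .
      - name: Scan the built image
        # One possible tuning of the gate: HIGH is reported but does
        # not block (exit code 0); CRITICAL fails the build (exit code 1).
        run: |
          trivy image --exit-code 0 --severity HIGH registry.example.com/web:${{ github.sha }}
          trivy image --exit-code 1 --severity CRITICAL registry.example.com/web:${{ github.sha }}
```

The tuning mentioned above is exactly the choice of which severity levels map to "fail the build," "alert," and "ignore."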

The other element here that is interesting is that a non-trivial portion of those images that are scanned at runtime for the first time are things that weren't actually built by the developers. So again, these may be load balancers, Kubernetes components, or other kinds of what we used to call middleware back in the day, that you're pulling down from a repository or consuming as a vendor-supplied piece of software that you trust. So you assume it's not vulnerable. That may or may not be a safe assumption.

So again, the best practice is to scan all that stuff in the CI/CD too, and have checking all those pieces be part of your testing process. But a lot of people haven't done that yet, because maybe it's on their roadmap, or maybe it's just not something they think is as important as scanning the things that they do build. And there is a different risk profile associated with not scanning those things, because you should be able to trust your vendors to some extent, than with not scanning the stuff that you actually put together yourself.

Timothy: So people aren't scanning their shrink-wrapped code. That makes a lot of sense. Or at least they're not scanning it in their CI/CD. Okay.

Anton: But that's disturbing too, as if there weren't enough disturbing stuff to find already. This is something that I recall from my Gartner days, when security was shifting into pre-runtime. And frankly, I was nervous that we were going to lose runtime security, because people would supposedly do all the security in the beginning, and then they would claim that they're all done, so they would not do runtime security. That was my fear back in the day. It turns out that fear has not materialized, but something worse did: they're not doing enough security before deployment either.

Anna: I think you were just ahead of your time, Anton, actually. I still share this fear, honestly, that people will think that they can check all the boxes because they have this beautiful testing pipeline that makes sure that everything's okay and then they can throw that thing into production and sleep at night.

Anton: Yeah, that's my fear from 2018.

Anna: Yeah, being a paranoid security person like you, I'm over here, like that's a terrible idea.

Anton: I know it's going to be awful.

Anna: And that's part of what we do actually. We try to provide the runtime context for folks. But at the same time, it's hard to really expect good runtime hygiene when you don't have good preventative controls yet. So again, I think you're just ahead of your time and that these things will happen, but they haven't happened yet.

Timothy: This is almost starting to sound like, on this call, we're heading back into some kind of, dare I say, capability maturity model for container security.

Anna: Do you want to write something about that?

Timothy: No, I don't. That's your job. You're the director. Probably you should do it.

Anna: That's right. There's another element, actually, that's interesting in terms of people and process, and that is that in the old days of vulnerability management, we had this struggle of deciding who the owner of something was. When you push something through whatever scanning process to identify what's wrong with it, your next question is, "Who's going to fix it?" So you have application owners or system owners or what have you. And usually they were in IT, you know, and they hated you. And so that was painful. Now there's actually a different, but maybe more complex, issue, because you could have a container image that has different layers owned by different people. Its base image, which is an operating system or whatever enabling layers, might actually be an Ops ownership element, and the things that are sitting on top of it that the developers have created, or the dependencies they pulled in, like third-party libraries, are the developers' job.

Tooling that exists today isn't super great at being able to tell which is which and whom to ship that ticket to. And so we still have that issue of whose job this is, and so who's going to fix it, never mind when they're going to.

Anton: Okay. Now, basically, to summarize it in a somewhat cynical manner: security used to mostly fight Ops and occasionally, rarely, fight developers, but today it looks like security will just fight DevOps, which is roughly the same people they used to fight, just under a different group name. That's exciting.

Anna: They've been rebranded, yeah.

Anton: Okay. So switching gears again. One thing that I also caught in the report, and again, I wasn't trying to read the report as a story about how IT and security were done in the past, but there was a line about scanning third-party containers at runtime being popular because it reminded people of a simpler legacy approach, when they used to always scan servers as they were running. What it sounds like to me is that it's not very cloud native. To me, that sounds like classic on-premise security, where you build something, you deploy it, you push it into prod, and then you get a scanner, you unpack your little vulnerability tool and scan it, and it's like, "Oh, there are bugs. Go fix them." Why is it still with us? Why is this practice still around? Why didn't it die when people went cloud native?

Anna: I think the simplest answer is that people stick to what they know. So you have to start somewhere. We try not to judge them. We just want to secure them. So there's an element of meeting them where they're at. But I do think that that's the tendency. If you can't get your developers to adopt some of these best practices of shifting left, if you can't afford to build this beautiful pipeline, you have to do something. And so you start at the end. You start all the way to the right, so you can at least fail the builds that are the most broken, or what have you, by some standard, and then you try to move to the left.

Back to your other point about the terror of runtime security: you can't assume that because you've scanned it on the left, it's okay forever. So you still have to scan it in runtime, probably scan it forever in runtime, continuously. It's just that that shouldn't be the first time you scan it.
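
One hypothetical way to express that continuous re-scanning is a scheduled job that re-checks images that are already deployed, since an image that was clean at build time can accumulate newly disclosed CVEs. A minimal sketch, again using the open-source Trivy and an invented image name purely for illustration:

```yaml
# Hypothetical periodic rescan: a clean scan at build time goes stale
# as new CVEs are disclosed, so the same image is re-checked daily
# after deployment.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: rescan-web-image
spec:
  schedule: "0 3 * * *"              # every day at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trivy
              image: aquasec/trivy:latest
              args:
                - image
                - --severity
                - HIGH,CRITICAL
                - --exit-code
                - "1"                # non-zero exit marks the Job failed
                - registry.example.com/web:1.2.4
```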

Anton: Yes. I think it does make sense. And I think it's nice that we know of at least one problem that will materialize in two or three years with many companies: when they shift too much left and don't do enough runtime, we'll be like, "Hey, hey, we knew it back in 2022." So I want to ask another question about another thing that blew my mind a little bit from the report. There was a metric about 72% of containers having shells in them, and the quote from our podcast notes is that to me, it sounds like "62% of zoos have a dragon in them." It's really bizarre, surreal. Why are people SSH-ing into frigging containers? What's going on? This is not supposed to be like this. Care to comment on this one?

Anna: It's 62%, but yes, I think it's the same reason. First of all, again, I have to be a little empathetic, I think, to our poor users, in that when you try to modernize a workload, which is what's happening, some of this stuff is definitely greenfield, where people are building brand new, beautiful, shiny things, but a lot of businesses just have legacy apps running that are business critical, and they would like to refactor them to be newer and shinier. And that is not a trivial process. That's a very long road. They may well be taking something that is a monolith that has been running in a virtual machine for 30 years and moving it into cloud, in the hopes that that's going to save them money, or make it more likely that they're actually going to do something novel with this thing, or just because they were told to, I don't know.

And then they're going to, hopefully, over time, start parting it out. So piece by piece, they're going to dismantle it and build it into something more microservice-oriented, perhaps. Maybe that's the dream. And so if they're still in those early stages of "this is just a virtual machine to me," they're going to do those things. They're going to see some weird behavior, or they're going to SSH in and poke at stuff, what have you. To me, that's utterly shocking. I'm over here like, "It's a container. What are you doing?" But at the same time, old habits die hard. And if you don't have any other way, or if you haven't actually designed it to be Kubernetes-orchestrated and load-balanced so that if you kill it, it'll fail over to the next one, then you don't have a choice.

Timothy: If you don't own the build process for a container, it makes sense that you have to SSH into it and apt-get it to get it updated. Those are your options there. And if somebody doesn't have one, they're going to take the other. In some ways, Anton, I think we ought to be applauding these users who do go ahead and try to update their stuff, even if it's not a fully container-optimized operation.
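
For teams that do own the build process, one common hardening pattern is to make in-place modification impossible, so that shelling in and installing packages stops being an option at all. A minimal, hypothetical sketch (invented names, and not something the report mandates):

```yaml
# Hypothetical hardening: a read-only root filesystem means apt-get
# and friends cannot modify the container in place; the only way to
# "update" it is to rebuild the image and redeploy.
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
    - name: web
      image: registry.example.com/web:1.2.4
      securityContext:
        readOnlyRootFilesystem: true     # blocks in-container package installs
        allowPrivilegeEscalation: false
      volumeMounts:
        - name: tmp
          mountPath: /tmp                # writable scratch space, if needed
  volumes:
    - name: tmp
      emptyDir: {}
```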

Anton: Okay. Wow. You found the nice, optimistic bit in that.

Timothy: Yeah, I think that's my job.

Anna: I have an optimistic take on it too, actually. Going back to what we're talking about here: this is the Sysdig usage and security report. So these people have engaged with a security vendor that brands itself as cloud native, which means they either believe themselves to be cloud native or believe themselves to be on the road to cloud native. I feel like the fact that we are seeing scary numbers actually indicates that the adoption of cloud and containers is broadening to folks who are not as mature. So they don't quite know how to get it right, but they're trying to do it. And that actually gives me a lot of hope, because we've been covering this stuff since 2015, and I've always been like, "Wow, this is so cool. Why isn't everyone doing it?" And now it feels like maybe everyone's doing it, which is great. That's awesome.

Timothy: This conversation has made me think of the thing people used to say at conferences, "Oh, the cloud is just someone else's computer." Are containers just somebody else's computer? Is that what we're seeing here?

Anton: Tim, you know better, come on, you're baiting her.

Anna: Containers are your computer. Containers are tiny pieces of someone else's computer.

Timothy: Yeah, that's what it sounds like, though. It sounds like containers for all their hype have all the same problems as people's computers.

Anna: I actually would argue that the problem isn't that people don't love them enough. It's that they love them too much.

Timothy: Oh.

Anna: Let them die. Let them fall over. Why are you treating them like these virtual machine pets you used to have? They're actually still your computer. Let them be someone else's computer.

Timothy: I like that answer. So we're getting towards the end of the show, and we have our traditional closing questions, which I think I get to ask both parts of. First, do you have one tip for people to improve their container security outcomes? And this is the part where I want you to do more than just say, "Contact Sysdig." And then two, do you have recommended reading? And go beyond just the report, which we will be linking in the show notes [inaudible].

Anna: You should totally contact Sysdig. We'll help you out. To be honest, it goes back to what I said earlier. I think if you are on the journey to cloud native, it's okay if you're not killing it from day one, because it's actually very complex and there are many, many things to learn. So don't hate yourself for the fact that you're not killing it from day one, but do try, deliberately, to actually adopt the cloud native philosophy and the cloud native patterns, and not just the cloud native tools. Because just doing the tech thing, as always, doesn't get you very far. You have to do the process thing too.

Timothy: That's a great answer. Anna, thank you so much for joining us. Thank you for all of the laughs and thank you for letting me think of containers now as the tropical vacation for a struggling couple.

Anna: I didn't actually tell you what to read, but I can, if you want.

Timothy: Oh yeah, what would you like our guests to read, please?

Anna: I mean read the report, obviously, which is awesome and it's at sysdig.com. And in addition, for folks who are just learning about Kubernetes, we have a pretty cool thing called the cognitive learning hub, which you can check out and it will teach you basic things about cloud and Kubernetes security.

Timothy: Amazing. Thank you so much.

Anton: And now we are at time. Thank you very much for listening and, of course, for subscribing. You can find this podcast at Google podcasts, Apple podcasts, Spotify, or wherever else you get your podcasts. Also, you can find us at our website, cloud.withgoogle.com/cloudsecurity/podcast. Please subscribe so that you don't miss episodes. Today, I will be particularly insistent, and I would say please subscribe so that you don't miss episodes. I don't know. I should probably do it with my Russian accent.

You can also follow us on Twitter, twitter.com/cloudsecpodcast. We are also on LinkedIn. We now have a whole separate page for the podcast there. You can find your hosts on Twitter @anton_chuvakin and @_TimPeacock. Tweet at us, LinkedIn at us. I don't know what that means. Email us, argue with us, and if we like or hate what we hear, we can invite you to the next episode. See you on the next Cloud Security podcast episode.
