March 14, 2022

EP56 Rebuilding vs Forklifting and How to Secure a Data Warehouse in the Cloud



Topics covered:

  • Imagine you are planning a data warehouse in the cloud, how do you think about security?
  • What are the expected threats to a large data store in the cloud?
  • How to create your security approach for a data warehouse project?
  • Are there regulations that force your decisions about security controls or approaches, no matter what the threats are?
  • How do you approach data governance for this project?
  • What controls are there to implement in Google Cloud for a secure data warehouse effort?



Timothy: Hi there. Welcome to Cloud Security Podcast by Google. Thanks for joining us today. Your hosts here are myself, Timothy Peacock, the Product Manager for Threat Detection here at Google Cloud, and Anton Chuvakin, a reformed analyst and esteemed member of the Cloud Security Team here at Google. You can find and subscribe to this podcast wherever podcasts are available, as well as at our website. If you like our content and want it delivered to you piping hot every Monday, please do hit that subscribe button. You can follow the show and argue with us on Twitter as well. Anton, today's episode is about data warehouses, which I found to be way more exciting to talk about than I thought it was gonna be.
Anton: That's exactly right. And I think that the fun part is that we are talking about people who are piling up huge, huge volumes of data into Google Cloud. And then it kind of gives birth to special security requirements. So this sounds like a very fascinating setup for an episode, right? You're not just copying a little bit of data, you're copying possibly massive volumes of data. And then you have to account for regulated data sets, for threats, for controls. It's way more fascinating.
Timothy: Yeah, you said some of the words we tend to avoid on the show too. We used the P-word. We used the G-word. It was a good episode, I think.
Anton: You're right. We used the P-word. Then of course I got immediately skittish.
Timothy: You did.
Anton: 'Cause I thought that maybe 10,000 lawyers would show up out of the woodwork and would like tear us to pieces. But we did say the P-word, yeah.
Timothy: And so far, so good on the lawyer friends. So with that, let's welcome today's guest. We are joined by Erlander Lo, or E-Lo, not to be confused with the chess ranking system, who is a security and compliance specialist here at Google Cloud. We're talking about securing data warehouses, which I think is a fascinating problem, because data warehouses are one of the, in my mind, sort of canonically obvious use cases for cloud technology, as opposed to doing it yourself on prem. You wouldn't believe the stories I've heard of users having to provision, you know, yards and half miles of datacenter space for this problem. And instead, it's a couple API calls into G-Cloud to get going. From a starting point, Erlander, what are the things to think about when it comes to securing a data warehouse in the cloud? Where do you even start with this problem?
Erlander: Yeah, absolutely. And thanks for having me here on the show today. I’m super excited to be here. So really, you might think at first blush that you can just take BigQuery, use the built-in controls that we have with BigQuery, and call it a day. But really, for me, it's more than just protecting the data within BigQuery itself. It's about that whole ecosystem, especially when we think about securing data at scale. You know, in the data analytics world, they often think about the three V's of data: the volume of the data, its velocity, and its variety. And you have to think through as well how that data is moving into the cloud environment. It's typically done with an ETL type of process. So what's doing that transformation, and how reliable is it? And so when I think about the security of a data warehouse at scale, I think about three different areas really. First of all, it's the business objectives that we're wanting to achieve and how security can help with those things. So I had a healthcare customer who had some PHI data that they wanted to bring into the Cloud. And with that particular data, they needed to understand how to de-identify it, because they had a requirement for the data in the Cloud to be de-identified. What did that data look like? We kind of worked through those different things. And we understood that it needed to be in FHIR format. And there were certain fields inside of that data that had unstructured doctor's notes, and they needed to make sure that was de-identified as well. [Inaudible] think around and understanding how that data is gonna be used really helps with the initial part of securing the data warehouse, understanding the business objectives. The second thing I think about then is the data governance around how this data is going to land in the data warehouse.
And so what does that compliance landscape look like, if you will? Typically, this data may have your confidential data inside of it. It might have PII, but more than that, it might hold your crown jewels, if you will, the data that gives you your competitive advantage. So it's really understanding how do you protect that information? What's the governance around that within this Cloud environment now? And then lastly, what I like to think about too is using a risk-based approach, if you will. You think about the blast radius, think about [inaudible] attacks and the TTPs and what might happen, and how do you layer in security around that? And one systematic way to think about that might be using a threat model, if you will, to really help you get a sense of where those risks are and what threshold you have for those risks.
Timothy: That's interesting to me because--and Anton is probably gonna laugh at my naivete here, but you put threat modeling last and preceding it were governance and what sounded a lot like a compliance requirement. And so in my day, I spent all day thinking about threats. But it's interesting you put that last.
Anton: Yeah. And then frankly, I wanted to take on the same theme exactly, that a lot of this is kind of about regulatory requirements and what the auditors may want and what the customers may think they need to do to satisfy somebody. And of course, you mentioned PHI, which is valuable to attackers. But in most contexts, I've seen people bring up compliance when they discuss it. So to switch gears a little bit, I know it's all important, but what are the expected threats, that is, threats in addition to auditors?
Timothy: Those are the biggest threat often.
Anton: Other than the auditors. So tell us about the threat models and what are the expected threats to a big data warehouse in the Cloud, if you exclude the auditors for a second.
Timothy: And for our guests who don't think about healthcare data all day, what's PHI?
Erlander: PHI is protected health information. Some of those things are defined by Safe Harbor, which you can go and look up. I think there's like 18 of those identifiers. But those are some of the types of things that we consider as PHI. Now, that's a great question, Timothy, and it's interesting to think about. When I talk with different customers who've asked me about securing their data warehouses, usually one of four different threats, or a combination of the four, comes up. One of them is elevation of privileges. And really, that's about trying to understand whether an attacker can get hold of some sort of identity that can do more than what they're able to do initially.
Anton: Sorry, attacker here means an insider or necessarily an outsider or either.
Erlander: It could actually be both from that perspective. Maybe it's a malicious insider. But typically, it's someone that maybe has misconfigured something. And it could be a human, or it could be a machine identity that you wanna think about there as well. We spoke a little about auditors, for instance. You probably don't want an auditor [inaudible] to see PHI about certain things, or maybe even your SRE team to be able to see some of those things within your BigQuery datasets. Another expected threat that I typically run into or have conversations with customers about is data exfiltration. Maybe that's because of like, we're just talking about--
Anton: Which is data theft, right?
Erlander: Yeah.
Anton: Because when you say exfil, that means somebody is doing it. It's not just that some permission is open by mistake. Data exfil implies somebody is doing it, correct?
Erlander: That's right. It's where they're probably trying to get a hold of that sensitive information and send it somewhere else, or think about how to capture and keep that information. It's a great point.
Timothy: You said something in there that I'm sort of surprised is even possible. And I wonder if you could expand on it for the audience a little bit. You said, I hope I understood this right, “Keep your SREs from seeing the data.” How can you keep people who have [inaudible] from seeing data? That sounds like magic to me.
Erlander: It's pretty interesting to think about it from that perspective. When you have these SREs [inaudible] product, you may be thinking about it from, like, an on-prem sort of environment or infrastructure, if you will. A lot of the services that we have within our data warehouse are more about these serverless technologies, if you will. These folks hopefully will have a different set of identity and access controls that we've put in place, so that we make sure we separate the ability to read and look at data from the ability to do the administration and management piece of things. So hopefully, that gives you a little bit of insight from that perspective: a lot of these services that we're talking about don't have something that you can traditionally log into and try to gain access to things.
Timothy: So to me, that's kind of the magic of operating in Cloud, right? Where we have a permissions model that doesn't have this concept of like root on the database box. It just--you can divide your privileges differently enough that there's threats on prem that really aren't as relevant in Cloud by the nature of the surface. That to me is close to magic.
Erlander: And that's exciting. Another threat that I often talk to customers about too is information disclosure. That's really about having folks within your team be able to see only the data they're allowed to see. So we talked a little bit about auditors. Are they able to see, probably--I think we talked about this earlier [inaudible], should they be able to see that sort of information?
Anton: So this is not theft, in this case, just to balance quickly, and sorry for interrupting. Data exfiltration means somebody is taking the data. This one is kind of more about there's no theft, there's no mistake in permissions, but somebody is seeing the data who shouldn't…
Erlander: That's correct.
Anton: -do it. So it sounds very close to the [inaudible] which I promised to never pronounce on this podcast. So is this a security issue or is it an other-than-security issue?
Erlander: Yes, that's probably other than a security issue, if you will, from that perspective. So it's a great observation there. It's still often a topic that will come up in some of these discussions that I run into with customers. Additionally, the last of the four items is usually the availability of your data. A lot of the regulatory frameworks will have a notion of business continuity and availability, or things like disaster recovery. But if your data scientists are trying to create and surface some real-time information from your data warehouse, you really need to understand how that data is gonna be available and how quickly it'll process and flow through your data warehouse environment. Hopefully, that gives you a sense of some of the threats that I often discuss with customers.
Timothy: That absolutely makes a ton of sense. We like to have practical advice on this show as much as we like to get into talking about Karl Popper. What's your advice for somebody who's planning their data warehouse project in Cloud? You know, how do they get started with this? What should they do in their planning to make sure things go as successfully as possible?
Erlander: I like to think about it in a couple of different ways. One of the first things I think about is, where's the environment of the data warehouse? Where's it gonna be deployed? Is it being deployed in a place in your current environment where you’ve already carved out space and deployed maybe your security foundation blueprint, for instance? There it's already established the necessary guardrails for org policies, IAM, and network controls, and you are just deploying the data warehouse blueprint on top of that underlying infrastructure. Or maybe you've already spent a lot of time hardening your infrastructure, so you want to understand how your data warehouse will work within that hardened infrastructure. Or maybe you're thinking about deploying a data warehouse within a POC type of standalone environment, where you want to evaluate a lot of the controls out of the box for a particular blueprint. So really understanding where that data warehouse is going to land in your environment is gonna be an important aspect. The second thing to think about too, then, is the people that are involved here. So for instance, maybe your data engineering team is gonna be responsible for deploying this, and your data engineers may not have as deep a technical insight and understanding of security as your actual security team. You might wanna understand that and also understand where your security team is. Did they just finish up handling an incident response, or maybe they're just finishing an audit process like we were just talking about? So really understanding the people that are gonna be involved and able to help you deploy your secure data warehouse is going to be an important aspect.
The third thing I think about too is the business timing of all of this: when do you need that solution? And maybe we take a play out of the book of our friends from AI/ML, where they really thought through the journey of a data scientist for supervised learning, if you will. If you take a look at some of the neat things that we've done within Google, we've got these pre-trained models, like Vision API, where if you need something fast, or you don't have any data to do the training, you can make use of those pre-trained models. And then on the other side of that, you might have data scientists with deep expert knowledge and understanding. They can use Vertex AI, understand MLOps and TensorFlow, and they can build their custom models with the rich training data that they have. Or maybe they need something in between, where they've got AutoML or BigQuery ML to do some customization with sufficient data. Likewise, think about it from a data warehouse perspective. If you don't have the time to build out an entire secure environment, you might be able to take a data warehouse blueprint as is and deploy that. If you need to customize things, maybe you don't want to use as many service perimeters. If you're not ready for streaming data into your environment, or you've got your own Dataflow Flex Templates that you wanna use, maybe use those instead of the ones that we've thought through. Or maybe you've got that deep security knowledge, and your architecture team has an understanding of how to not just use the individual security capabilities, but layer them together across the entire ecosystem and environment.
Really, they can then use something like the data warehouse blueprint as an architectural checklist, understand some of the best practices that we've developed, and compare that to the decisions that they've made. And then the last thing I think about too is the future plans for your data warehouse. If you think about using a blueprint for building a house, for instance, you may have the notion of an unfinished basement that is already built into the environment, because it's obviously harder to add one after you've built the house. One of the other things to think about, when you think about the future, is whether there are appropriate hooks in place so that you can handle any changes that the business may need. So for instance, I was working with a customer in the financial services industry, where we were talking through re-identification, and would there be any sort of use cases and needs for that? Immediately, they didn't have that particular need, but they were really interested in some of those capabilities that we had. And they said, well, that'd be kind of interesting maybe one year, two years down the road, but for right now what they need to do is simply de-identify that data and keep it de-identified. So hopefully that gives our listeners a couple things to think about when building out a data warehouse.
Timothy: Erlander, that was an incredible answer. If I were to summarize, first thing you wanna do is make sure you've prepared your Cloud terrain and got everything ready to go. You've got data engineers who maybe don't speak security and you've got security engineering that you need to organizationally line up and make sure they're speaking. On top of that you need to consider the business timeline and objectives. We've got pre-canned and pre-vetted tooling which you can either take off the shelf or if you've got custom needs, you can use it as a reference. And you want to think about your future plans. I love this analogy of the unfinished basement. I just have had the misfortune in life of working on products which also have the unfinished living room, unfinished kitchen, and the bedroom that is actually just a tarp flapping in the wind.
Erlander: Yeah, I love it.
Anton: I think that's a good summary. Yes, I think that makes sense to me. And I think it's more actionable for somebody who is in the planning stage. So these plans and this approach obviously make sense to me. So sometimes I know that there will be forces kind of beyond your control that actually force your decisions regarding security controls. So I am bringing up compliance and regulations again. Are there situations where a regulation or a mandate will kind of force your decision regarding security controls or approaches, no matter what the threats are? I'm not talking about you deploying an anti-virus on your data warehouse because of PCI DSS. I'm talking about something a little bit more sane. How do you deal with situations where there's an externality that forces your decision regarding data warehouse security? And what can those be?
Erlander: Yeah, thanks for bringing that up. I think regulations are definitely gonna play an important part in your considerations. And to me, they sort of represent a min function for security needs. By that what I mean is, I recall a customer sharing with me about a time in their career when PCI, for instance, hadn't yet recognized AES as a viable standard to use. And so it's interesting to think about things from that perspective, him having to use Triple DES instead of AES and go through some of those sorts of discussions early on. Of course, that is not the case with PCI today. From a regulatory perspective and some of those controls that we think through, I'll just give a couple examples from some different frameworks, if you will. So from a privacy perspective, I wanted to try to avoid some other discussion topics. One of the things that I wanted to think through was having an understanding of where the data needs to be, the data regionalization needs that you have. From a DLP perspective, we wanted to use some of those controls to make sure that we properly encrypt--de-identify some of that data and use Data Catalog to help with some of that information. If we take a look at something like FedRAMP, for instance, and maybe using NIST 800-53, we obviously don't have time to go through all 20 of those families, but maybe focusing on the access control family, if you will, separation of duties and least privilege are a really important aspect of some of the things that we do. So you'll see that it's important to have things like column-level access controls, and to add additional context as to who is logging into your systems using Access Context Manager and some of those items as well.
And then if you take a look at PCI, and some of the data encryption requirements that they have in place there, it's important to understand how to augment your key management capabilities by protecting those keys with a FIPS-certified HSM. In our case, we've got Cloud HSM to help with those things.
Timothy: We get a lot of throwback references on the show, and you might be the first person to bring up Triple DES. That's a blast from the past for us for sure. I love this answer on controls, and listeners, if you haven't looked at some of the really innovative security that's available in BigQuery, I can't believe I’m about to say this, but actually really get excited about the capabilities there in BigQuery. I wanna switch gears a little bit, Erlander, and ask: how do we do data governance in a cloud data warehouse?
Erlander: Yeah, this is a really interesting aspect for me to think about, because we have the chance to talk a lot about trust in the security world. To me, data governance is bringing that trust into the data world, if you will. So it's that trust in your data. It's the trust in your data’s quality, its integrity, its usability, and the security aspects of all of those different things. When you think about data governance, from the inception of some of the things that we've been doing, there are a lot of different principles that we've been able to apply from one of the books that some Googlers have helped write. It's called Data Governance: The Definitive Guide. One of the authors actually helped shape some of the architecture early on for things like our data warehouse blueprint that we have here. It just shows how we were able to bring collaboration across our different experts and across our different products to help think through the data governance needs that we have for a data warehouse. Practically, when you think about data governance, it's really about setting up the different policy tags that allow you to identify different columns within your datasets and attach them to a taxonomy that you can define within things like Data Catalog. And so that taxonomy is how you label your data, whether it's confidential data, private data, sensitive data, or public data. Whatever that taxonomy is that you have there, it's very important to understand how that relates to your data. Once you have that mapping in place, you can then apply additional security controls that we have, such as Cloud DLP and de-identification templates. So that way, any of those columns that maybe are marked as private have the appropriate transformation defined and [inaudible] for that particular column.
And then you now know what the DLP key is, you know what function is being done there and how that's being protected.
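To make the mapping Erlander describes a bit more concrete, here is a minimal Python sketch of the idea: columns are tagged against a taxonomy, and each tag selects a de-identification transform. This is not the Cloud DLP or Data Catalog API. The tag names, column names, `POLICY_TAGS`, `TRANSFORMS`, `tokenize`, `mask`, and `deidentify_row` are all hypothetical, and the keyed hash simply stands in for whatever transformation a real de-identification template would define.

```python
import hashlib
import hmac

# Hypothetical taxonomy mapping: column name -> sensitivity tag.
POLICY_TAGS = {
    "patient_name": "private",
    "diagnosis_code": "confidential",
    "visit_count": "public",
}


def tokenize(value: str, key: bytes) -> str:
    """Keyed, deterministic token: the same value and key give the same token."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def mask(value: str, key: bytes) -> str:
    """Replace every character with '*'; the key is unused but keeps one signature."""
    return "*" * len(value)


# Hypothetical transform per tag; tags with no entry (e.g. "public") pass through.
TRANSFORMS = {"private": tokenize, "confidential": mask}


def deidentify_row(row: dict, key: bytes) -> dict:
    """Apply the transform selected by each column's tag."""
    result = {}
    for column, value in row.items():
        # Fail closed: a column missing from the taxonomy is treated as private.
        tag = POLICY_TAGS.get(column, "private")
        transform = TRANSFORMS.get(tag)
        result[column] = transform(str(value), key) if transform else value
    return result


row = {"patient_name": "Ada", "diagnosis_code": "J45.9", "visit_count": 3}
clean = deidentify_row(row, key=b"demo-key-not-for-production")
```

The point of the design is the one made in the conversation: once the tag-to-column mapping exists, the transformation applied to a given column, and the key behind it, are known rather than ad hoc.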
Timothy: Oh, that's really interesting.
Anton: I know the data governance subject can make some security professionals a little bit nervous. And I think that your answer is very crisp and very clear. And I kind of liked how you phrased it. I've seen a lot of data governance discussions go into BS territory very quickly. To switch gears from this, I don't know if you can accomplish this in the audio format. But when you think of the whole set of security controls that are there for the taking, if you're implementing a data warehouse, of course, there are staples like IAM, encryption, a few other things, but like, can you read the whole menu to us and hopefully not bore us in the process? So what controls are there to implement?
Timothy: Anton, you can't ask somebody to read a menu on a podcast.
Anton: And not bore us.
Timothy: Let me make it a little easier, maybe. What are the ones that are unique to cloud and even better what are the ones that are unique to Google Cloud? See, listeners, this is real time how a pod gets made.
Erlander: Well, I love it. So let me try to push this a couple different ways. So of course, we've got different layered controls that are built in. And like some of the things we talked about earlier, a lot of the services that we have and are using when we think about this data warehouse are all about that serverless notion, where there's not really infrastructure exposed to our customers to think through, whether it's using BigQuery, or maybe it's Streaming Engine for Dataflow, and things like Data Catalog and Cloud DLP. When we think about those sorts of things, we'll kind of address some of the unique things for both preventive controls and detective controls. But a couple of the unique things: when I think of data governance, one of the things I enjoy is that with Data Catalog you can define the taxonomy and then apply that dynamically to some of the policy tags. There are those BigQuery controls that you were talking about, those ACL controls, and we talked earlier about column-level access controls as well. When you think about elevation of privileges, we've got the ability to have fine-grained access applied down to the resource level, which I find very powerful. Instead of it being done at the project level or something higher level, you can actually identify which identities need access and bind those policies to the particular resource, for instance, when using keys, to that particular key. And then we have org policies in place as well. For instance, we have the ability to disable service account key creation. So with an org-wide setting, you can apply some of these broader controls and have them in place.
When I think about some of those exfiltration controls and things that are unique to the Cloud, I like to think about VPC Service Controls perimeters, and how we can define some of those trust boundaries, those bridges that we have in place, and marry that with some of the other org policies to prevent the ability to have external IP addresses. So then you can take some of our private restricted networking capabilities as well. So that way, folks that are trying to go into the environment or use some of the APIs have to provide additional context and information with that. And then the last thing I think about too, from an information disclosure perspective, that's unique, I think, is our Cloud DLP de-identification and re-identification capabilities. It's been really good to separate those from our Dataflow pipelines. So you have one [inaudible] think about how you can de-identify data and then re-identify data using a separate control flow that you then understand. And the other thing to think about too is establishing this notion of a cryptographic boundary using our Cloud HSM capabilities, which have some automatic key rotation capabilities there. So hopefully, that gives you a sense of some of those controls that we have in place that are unique within Google Cloud.
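The separation discussed in the episode, where someone can administer a service without being able to read tagged columns, can be sketched in a few lines of Python. This is a toy model under stated assumptions, not BigQuery's or IAM's actual permission model: `Identity`, `COLUMN_TAGS`, `readable_columns`, and the tag names are hypothetical and exist only to illustrate column-level access driven by policy tags.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Identity:
    """Hypothetical identity: read access is granted per tag, separately from admin."""
    name: str
    readable_tags: frozenset = field(default_factory=frozenset)
    can_administer: bool = False


# Hypothetical column -> policy tag mapping for one table.
COLUMN_TAGS = {
    "patient_name": "private",
    "region": "public",
    "visit_count": "public",
}


def readable_columns(identity: Identity, column_tags: dict) -> list:
    """Return the columns whose tag the identity has been granted, sorted by name."""
    return sorted(
        column for column, tag in column_tags.items()
        if tag in identity.readable_tags
    )


analyst = Identity("analyst", frozenset({"public"}))
sre = Identity("sre", can_administer=True)  # manages the service, reads no data

analyst_view = readable_columns(analyst, COLUMN_TAGS)
sre_view = readable_columns(sre, COLUMN_TAGS)
```

In this sketch the administrative flag never feeds into `readable_columns`, which is the whole point: managing the warehouse and reading its sensitive columns are granted independently.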
Anton: Perfect. Thank you very much. This is a really nice and extended answer. So we're about to wrap up. And I wanted to ask you one final question. Any further reading for our audiences? Please share some links, some resources, we'll put them in the notes.
Erlander: Yeah, absolutely. There's a couple things. We've got our data warehouse blueprint that we can link to. But something I was thinking about too: we've been talking about this journey, this transformation, if you will, and there’s a great book I'd recommend. It's called Creativity, Inc. It talks a little bit about Pixar's journey. It gives you some tidbits ranging from various mental models for when you're facing different challenges, to cultural values that are important during a transformation, like candor and having a brain trust, and then also practical leadership principles. And I think all of these things are important to think through when you're thinking about a digital transformation, and how you bring in a data warehouse and secure it within your Cloud environments. So that's something that might, you know--if you think about journeys, a different type of book about journeys. And of course, we talked a little about Data Governance: The Definitive Guide. That's always a good one as well.
Anton: Perfect. Thank you very much for this and thank you very much for being on the show.
Erlander: Thanks.
Anton: And now we’re at time. Thank you very much for listening and of course, for subscribing. We hope this talk about securing data warehouses was as fun for you as it was for us. You can find this podcast at Google Podcasts, Apple Podcasts, Spotify (we’re giving a bit more attention to our Spotify listeners lately), and wherever you get your podcasts.
Also, you can find us at our website. Please subscribe so that you don't miss episodes. You can follow us on Twitter. And of course your hosts are also on Twitter, @Anton_Chuvakin and @_TimPeacock. Tweet at us, email us, argue with us. If you like or hate what you hear, we can invite you to the next episode. See you in the next Cloud Security Podcast episode.
