There’s extensive documentation on which IAM roles are available for Google Data Catalog. But when you are getting started on your data governance journey, you have probably wondered what kind of access controls are needed and who in your organization should be granted them:
- Which end users should be able to discover my data assets?
- Who should be able to classify and add tags to them?
- And finally, who should be able to create templates and set standards for the data classification process?
This can get really complex, so in this blog post we will start by looking at the access controls on top of metadata, which is Google Data Catalog’s playing field.
We can automate all that by using Terraform.
This is simply a suggestion on how to work with Data Catalog. To start off, let’s say you have some common templates that will be used to create tags in different projects.
For that we need two different pieces:
- The Tag Central Project
This is where we store all the common resources, like Tag Templates, Policy Tags, and Custom Entries. That way we don’t duplicate them, are charged only once, and have a much easier time managing and changing them.
To showcase this, in the Terraform sample we will create 4 Tag Templates in the Tag Central project:
★ Data Engineering Template
★ Derived Data Template
★ Data Governance Template
★ Data Quality Template
- A Group of Analytics Projects
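As a rough sketch of the Tag Central side, one of the shared templates listed above might be declared as shown here. The variable name, region, and the single `data_owner` field are assumptions made for illustration; the sample repository defines its own fields.

```hcl
# Hypothetical variable: ID of the project that holds the shared resources.
variable "tag_central_project_id" {
  type = string
}

# One of the four shared templates; the field below is illustrative only.
resource "google_data_catalog_tag_template" "data_governance" {
  project         = var.tag_central_project_id
  region          = "us-central1" # assumed location
  tag_template_id = "data_governance_template"
  display_name    = "Data Governance Template"

  fields {
    field_id     = "data_owner"
    display_name = "Data owner"
    is_required  = true

    type {
      primitive_type = "STRING"
    }
  }
}
```

Because the template lives only in the Tag Central project, the Analytics Projects can reference it when creating tags instead of each keeping its own copy.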
Now let’s look at the personas who will interact with the Tag Central and Analytics Projects, which we will set up automatically with Terraform.
Keep in mind that the names are just suggestions, and you can replace them with others that play similar roles: you could call Data Governors Data Architects, or Data Curators Data Stewards, among many other names in this data alphabet soup.
- Data Governors
Data Governor is the role for people who perform administrative work on top of your metadata. This means creating, updating, and deleting Data Catalog resources like Tag Templates, and setting the standards of your data governance process.
- Data Curators
Data Curators will take care of your data assets 🙂 … They will select the relevant ones and add meaning to them (by creating tags), so other users can easily discover and make use of them.
- Data Analysts
Data Analysts are the people who use the curated assets and define and develop domain-specific analytics to support your decision making.
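One possible way to map these three personas to Data Catalog IAM roles is sketched below. The role choices and variable names are assumptions for illustration, not necessarily the mapping used in the sample repository; adjust them to your own governance rules.

```hcl
# Hypothetical inputs; member strings look like "group:data-governors@example.com".
variable "tag_central_project_id" { type = string }
variable "analytics_project_id" { type = string }
variable "data_governors" { type = list(string) }
variable "data_curators" { type = list(string) }
variable "data_analysts" { type = list(string) }

# Data Governors: manage the shared templates in the Tag Central project.
resource "google_project_iam_member" "governor_template_owner" {
  for_each = toset(var.data_governors)
  project  = var.tag_central_project_id
  role     = "roles/datacatalog.tagTemplateOwner"
  member   = each.value
}

# Data Curators: use the shared templates...
resource "google_project_iam_member" "curator_template_user" {
  for_each = toset(var.data_curators)
  project  = var.tag_central_project_id
  role     = "roles/datacatalog.tagTemplateUser"
  member   = each.value
}

# ...and attach tags to assets in an Analytics Project.
resource "google_project_iam_member" "curator_tag_editor" {
  for_each = toset(var.data_curators)
  project  = var.analytics_project_id
  role     = "roles/datacatalog.tagEditor"
  member   = each.value
}

# Data Analysts: read the templates and discover the curated assets.
resource "google_project_iam_member" "analyst_template_viewer" {
  for_each = toset(var.data_analysts)
  project  = var.tag_central_project_id
  role     = "roles/datacatalog.tagTemplateViewer"
  member   = each.value
}

resource "google_project_iam_member" "analyst_catalog_viewer" {
  for_each = toset(var.data_analysts)
  project  = var.analytics_project_id
  role     = "roles/datacatalog.viewer"
  member   = each.value
}
```

Using `google_project_iam_member` keeps each grant additive, so these bindings do not overwrite other members that already hold the same role on the project.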
Keep in mind that those personas can change or overlap, depending on the size of your organization or the way it is structured, so the same person may play more than one role.
If you use different personas, please feel free to contribute to the sample repository or add comments to this blog post with your use case. This will be really helpful.
Without further ado, let’s look at the Terraform automation because doing things manually does us no good!
The sample repository contains all the sample code and a detailed step-by-step guide on how to run it.
To run Terraform, we are going to use a service account, since at the time of this writing Data Catalog does not support using end-user credentials from the Google Cloud SDK.
And to follow best practices, we won’t download the service account key; we will use service account impersonation instead.
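A minimal sketch of what impersonation can look like on the Terraform side, assuming a Google provider version that supports the `impersonate_service_account` argument. The variable name is hypothetical, and the sample repository may wire this up differently.

```hcl
# Hypothetical variable: email of the service account Terraform should act as.
variable "terraform_service_account" {
  type        = string
  description = "Email of the service account Terraform should impersonate."
}

# The provider requests short-lived credentials for the service account,
# so no key file is ever downloaded.
provider "google" {
  impersonate_service_account = var.terraform_service_account
}
```

For this to work, the identity running Terraform needs the `roles/iam.serviceAccountTokenCreator` role on that service account.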
Create the Service Account
So the first step is creating a service account and setting the appropriate IAM roles: https://medium.com/media/cca0bf7c553ec02d2258640f6b3845f7
Set the Terraform variable placeholders
Let’s look at an example of a valid configuration file: https://medium.com/media/c24cc4c794b0aa4193d93b9adfd4d736
In the sample code above, whenever you see member, it can be any of:
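For reference, the Google provider's IAM member resources accept identities in forms such as `user:{email}`, `serviceAccount:{email}`, `group:{email}`, and `domain:{domain}`. Here is a small sketch of a variables file using them; the variable names and the example.com identities are placeholders, not the ones from the sample.

```hcl
# Hypothetical .tfvars content; variable names are illustrative only.
data_governors = ["group:data-governors@example.com"]

data_curators = [
  "user:curator@example.com",
  "serviceAccount:tagger@my-project.iam.gserviceaccount.com",
]

data_analysts = ["domain:example.com"]
```

Granting roles to groups rather than to individual users usually keeps the configuration stable as people join or leave a persona.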
And finally, let’s execute:
    # After that, let's get Terraform started.
    # Run the following to pull in the providers.
    terraform init

    # With the providers downloaded and Terraform variables set,
    # you're ready to use Terraform. Go ahead!
    # Plan first to validate the execution.
    terraform plan -input=false -out=tfplan -var-file=".tfvars"

    # If successful, execute it.
    terraform apply tfplan
After Terraform completes, we can look at the generated resources:
We can see that all the projects we set up in Terraform contain the discussed personas, with the appropriate permissions.
And let’s not forget the common resources created by Terraform:
That’s pretty much it, thanks for reading :).
Data governance is a really complex area, and any automation that helps us set and enforce those standards is welcome. In this blog post, we looked at Terraform samples that support us when working at the project level.
You may also want to apply the suggested access controls at the folder or organization level, which is a common use case for large organizations.
The iam module in the GitHub repo is easily adaptable to that use case: all you need to do is switch the
google_project_iam_member resource to