Datafold Google module

This repository provisions infrastructure resources on Google Cloud for deploying Datafold using the datafold-operator.

About this module

⚠️ Important: This module is now optional. If you already have GKE infrastructure in place, you can configure the required resources independently. This module is primarily intended for customers who need to set up the complete infrastructure stack for GKE deployment.

The module provisions Google Cloud infrastructure resources that are required for Datafold deployment. Application configuration is now managed through the datafoldapplication custom resource on the cluster using the datafold-operator, rather than through Terraform application directories.

Breaking Changes

Load Balancer Deployment (Default Changed)

Breaking Change: The load balancer is no longer deployed by default; the default has changed to deploy_lb = false.

  • Previous behavior: Load balancer was deployed by default
  • New behavior: Load balancer deployment is disabled by default
  • Action required: If you need a load balancer, you must explicitly set deploy_lb = true in your configuration so that an existing load balancer is not removed. (If it is removed, you need to redeploy it and then update your DNS to the new load balancer IP.) A minimal example follows this list.
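A minimal sketch of keeping the Terraform-managed load balancer, assuming the module is consumed directly from this repository; the source reference and everything else in the block are illustrative, not a verified configuration:

```hcl
module "datafold" {
  source = "github.com/datafold/terraform-google-datafold" # illustrative source reference

  deploy_lb = true # keep the Terraform-managed load balancer

  # ... the rest of your existing configuration stays unchanged
}
```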

Application Directory Removal

  • The "application" directory is no longer part of this repository
  • Application configuration is now managed through the datafoldapplication custom resource on the cluster

Prerequisites

  • A Google Cloud account, preferably a new isolated one.
  • Terraform >= 1.4.6
  • A customer contract with Datafold
    • The application does not work without credentials supplied by sales
  • Access to our public helm-charts repository

The full deployment will create the following resources:

  • Google VPC
  • Google subnets
  • Google GCS bucket for clickhouse backups
  • Google Cloud Load Balancer (optional, disabled by default)
  • Google-managed SSL certificate (if load balancer is enabled)
  • Three persistent disk volumes for local data storage
  • Cloud SQL PostgreSQL database
  • A GKE cluster
  • Service accounts for the GKE cluster to perform actions outside the cluster boundary:
    • Provisioning persistent disk volumes
    • Updating the Network Endpoint Group to route traffic to pods directly
    • Managing GCS bucket access for ClickHouse backups

Infrastructure Dependencies: For a complete list of required infrastructure resources and detailed deployment guidance, see the Datafold Dedicated Cloud GCP Deployment Documentation.

Negative scope

  • This module will not provision DNS names in your zone.

How to use this module

  • See the example for a potential setup, which has dependencies on our helm-charts

The example directory contains a single deployment example for infrastructure setup.
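For orientation, the sketch below shows a minimal module invocation covering the inputs marked as required in the Inputs table further down; the source reference and every value are placeholders rather than a verified configuration:

```hcl
module "datafold" {
  source = "github.com/datafold/terraform-google-datafold" # illustrative source reference

  # Inputs marked "yes" under Required in the Inputs table; all values are placeholders.
  deployment_name           = "datafold"
  project_id                = "my-datafold-project"
  provider_region           = "europe-west4"
  provider_azs              = []                 # empty list lets the module pick zones dynamically
  domain_name               = "datafold.example.com"
  environment               = "production"
  common_tags               = { owner = "data-platform" }
  lb_app_rules              = []                 # no extra load balancer filtering rules
  whitelisted_ingress_cidrs = ["203.0.113.0/24"] # CIDRs allowed to reach HTTP/HTTPS
  whitelisted_egress_cidrs  = ["0.0.0.0/0"]      # Internet addresses the application may reach
}
```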

Setting up the infrastructure:

  • It is easiest if you have full admin access in the target project.
  • Pre-create a symmetric encryption key that is used to encrypt/decrypt secrets of this deployment.
    • Use the alias instead of the MRK link, and put that into locals.tf.
  • Certificate Requirements (depends on load balancer deployment method):
    • If deploying load balancer from this Terraform module (deploy_lb = true): Pre-create and validate the SSL certificate in your DNS, then refer to that certificate in main.tf using its domain name (Replace "datafold.example.com")
    • If deploying load balancer from within Kubernetes: The certificate will be created automatically, but you must wait for it to become available and then validate it in your DNS after the deployment is complete
  • Change the settings in locals.tf (see the sketch after this list):
    • provider_region = the region you want to deploy in.
    • project_id = the GCP project ID where you want to deploy.
    • kms_profile = the profile you want to use to issue the deployments. Targets the deployment account.
    • kms_key = a pre-created symmetric KMS key. Its only purpose is encryption/decryption of deployment secrets.
    • deployment_name = the name of the deployment, used in the Kubernetes namespace, container naming, and the Datadog "deployment" Unified Tag.
  • Run terraform init in the infra directory.
  • Run terraform apply in the infra directory. This should complete successfully.
    • Check in the console if you see the GKE cluster, Cloud SQL database, etc.
    • If you enabled load balancer deployment, check for the load balancer as well.
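A sketch of the locals.tf values described above, assuming the example layout from this repository; every value is a placeholder:

```hcl
locals {
  provider_region = "europe-west4"                # region to deploy in
  project_id      = "my-datafold-project"         # GCP project ID to deploy into
  kms_profile     = "deployment"                  # profile used to issue the deployments
  kms_key         = "datafold-deployment-secrets" # alias of the pre-created symmetric KMS key
  deployment_name = "datafold"                    # used for the namespace, container naming and the Datadog "deployment" tag
}
```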

Application Deployment: After infrastructure is ready, deploy the application using the datafold-operator. See the Datafold Helm Charts repository for detailed application deployment instructions.

Infrastructure Dependencies

This module is designed to provide the complete infrastructure stack for Datafold deployment. However, if you already have GKE infrastructure in place, you can choose to configure the required resources independently.

Required Infrastructure Components:

  • GKE cluster with appropriate node pools
  • Cloud SQL PostgreSQL database
  • GCS bucket for ClickHouse backups
  • Persistent disks for local data storage (ClickHouse data, ClickHouse logs, Redis data)
  • IAM roles and service accounts for cluster operations
  • Load balancer (optional, can be managed by Google Cloud Load Balancer Controller)
  • VPC and networking components
  • SSL certificate (validation timing depends on deployment method):
    • Terraform-managed LB: Certificate must be pre-created and validated
    • Kubernetes-managed LB: Certificate created automatically, validated post-deployment
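A sketch of the certificate-related module arguments (names taken from the Inputs table below; the certificate name and paths are placeholders, and the assumption that create_ssl_cert produces a Google-managed certificate for domain_name is based on the resource list above):

```hcl
# Module arguments (inside the module block); pick one approach.

# A) Reference an SSL certificate that already exists in GCP by name.
ssl_cert_name = "datafold-example-com"

# B) Provide your own certificate material.
ssl_cert_path        = "certs/datafold.example.com.crt"
ssl_private_key_path = "certs/datafold.example.com.key"

# C) Let the module create the certificate (assumed Google-managed); validate it in your DNS afterwards.
create_ssl_cert = true
```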

Alternative Approaches:

  • Use this module: Provides complete infrastructure setup for new deployments
  • Use existing infrastructure: Configure required resources manually or through other means
  • Hybrid approach: Use this module for some components and existing infrastructure for others

For detailed specifications of each required component, see the Datafold Dedicated Cloud GCP Deployment Documentation. For application deployment instructions, see the Datafold Helm Charts repository.

Detailed Infrastructure Components

Based on the Datafold GCP Deployment Documentation, this module provisions the following detailed infrastructure components:

Persistent Disks

The Datafold application requires 3 persistent disks for storage, each deployed as encrypted Google Compute Engine persistent disks in the primary availability zone:

  • ClickHouse data disk: Serves as the analytical database storage for Datafold. ClickHouse is a columnar database that excels at analytical queries. The default 40GB allocation usually provides sufficient space for typical deployments, but it can be scaled up based on data volume requirements.
  • ClickHouse logs disk: Stores ClickHouse's internal logs and temporary data. Keeping logs on a separate disk prevents log writes from competing with the data disk for IOPS and I/O throughput.
  • Redis data disk: Provides persistent storage for Redis, which handles task distribution and distributed locks in the Datafold application. Redis is memory-first but benefits from persistence for data durability across restarts.

All persistent disks are encrypted by default using Google-managed encryption keys, ensuring data security at rest.
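The data and Redis disk sizes map directly to module inputs; a sketch of overriding them (names and defaults from the Inputs table below, values illustrative; the ClickHouse logs disk size is not exposed as an input):

```hcl
# Module arguments (inside the module block).
clickhouse_data_disk_size = 100 # GB, default 40
redis_data_size           = 20  # GB, default 10
```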

Load Balancer

The load balancer serves as the primary entry point for all external traffic to the Datafold application. The module offers two deployment strategies:

  • External Load Balancer Deployment (deploy_lb = true): Creates a Google Cloud Load Balancer through this Terraform module.
  • Kubernetes-Managed Load Balancer (deploy_lb = false, the current default): Relies on the Google Cloud Load Balancer Controller running within the GKE cluster, deployed by the Datafold application resource. In this case Kubernetes creates the load balancer for you.
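When the Terraform-managed load balancer is used, additional traffic filtering can be expressed through lb_app_rules. The sketch below shows one rule following the object shape documented in the Inputs table; the CIDR and the Cloud Armor versioned_expr value are illustrative:

```hcl
# Module argument (inside the module block).
lb_app_rules = [
  {
    action         = "allow"            # allow traffic that matches this rule
    priority       = 1000               # lower numbers are evaluated first
    description    = "Allow office network"
    match_type     = "src_ip_ranges"    # either "src_ip_ranges" or "expr"
    versioned_expr = "SRC_IPS_V1"       # only used when match_type = "src_ip_ranges"
    src_ip_ranges  = ["203.0.113.0/24"] # only used when match_type = "src_ip_ranges"
    expr           = ""                 # only used when match_type = "expr"
  }
]
```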

GKE Cluster

The Google Kubernetes Engine (GKE) cluster forms the compute foundation for the Datafold application:

  • Network Architecture: The entire cluster is deployed into private subnets with Cloud NAT for egress traffic
  • Security Features: Workload Identity, Shielded nodes, Binary authorization, Network policy, and Private nodes
  • Node Management: Supports up to three managed node pools with automatic scaling
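The cluster's network layout can be tuned through the networking-related inputs; a sketch using the defaults documented in the Inputs table below:

```hcl
# Module arguments (inside the module block); values shown are the documented defaults.
vpc_cidr                    = "10.0.0.0/16"    # VPC network CIDR
vpc_master_cidr_block       = "192.168.0.0/28" # control plane CIDR; must be a /28 block
vpc_secondary_cidr_pods     = "/17"            # size of the secondary range for pods
vpc_secondary_cidr_services = "/17"            # size of the secondary range for services
default_node_disk_size      = 40               # GB boot disk per node
```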

IAM Roles and Permissions

The IAM architecture follows the principle of least privilege:

  • GKE service account: Basic permissions for logging, monitoring, and storage access
  • ClickHouse backup service account: Custom role for ClickHouse to make backups and store them on Cloud Storage
  • Datafold service accounts: Pre-defined roles for different application components
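The backup target used by the ClickHouse backup service account is controlled by a few inputs; a sketch with the documented defaults (the bucket name is a placeholder):

```hcl
# Module arguments (inside the module block).
clickhouse_gcs_bucket = "clickhouse-backups-example" # placeholder bucket name
remote_storage        = "gcs"                        # backup storage type (default)
gcs_path              = "backups"                    # path within the bucket (default)
```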

Cloud SQL Database

The PostgreSQL Cloud SQL instance serves as the primary relational database:

  • Storage configuration: Starts with a 20GB initial allocation that can automatically scale up to 100GB
  • High availability: Intentionally disabled by default to reduce costs and complexity
  • Security and encryption: Always encrypts data at rest using Google-managed encryption keys
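The database sizing maps to module inputs; a sketch using the documented defaults from the Inputs table below:

```hcl
# Module arguments (inside the module block); values shown are the documented defaults.
postgres_instance          = "db-custom-2-7680" # Cloud SQL machine type (2 vCPUs, 7680 MB RAM)
postgres_allocated_storage = 20                 # GB initial allocation; storage can grow automatically up to 100 GB
database_version           = "POSTGRES_15"      # PostgreSQL version
```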

Requirements

| Name | Version |
|------|---------|
| dns | 3.2.1 |
| google | >= 4.80.0 |

Providers

| Name | Version |
|------|---------|
| google | >= 4.80.0 |
| random | n/a |

Modules

| Name | Source | Version |
|------|--------|---------|
| clickhouse_backup | ./modules/clickhouse_backup | n/a |
| database | ./modules/database | n/a |
| gke | ./modules/gke | n/a |
| load_balancer | ./modules/load_balancer | n/a |
| networking | ./modules/networking | n/a |
| project-iam-bindings | terraform-google-modules/iam/google//modules/projects_iam | n/a |
| project_factory_project_services | terraform-google-modules/project-factory/google//modules/project_services | ~> 14.4.0 |

Resources

| Name | Type |
|------|------|

Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| add_onprem_support_group | Flag to add the onprem support group for [email protected] | bool | true | no |
| clickhouse_backup_sa_key | SA key from secrets | string | "" | no |
| clickhouse_data_disk_size | ClickHouse data volume size | number | 40 | no |
| clickhouse_db | Database name for ClickHouse | string | "clickhouse" | no |
| clickhouse_gcs_bucket | GCS bucket for ClickHouse backups | string | "clickhouse-backups-abcguo23" | no |
| clickhouse_get_backup_sa_from_secrets_yaml | Flag to toggle getting the ClickHouse backup SA from secrets.yaml instead of creating a new one | bool | false | no |
| clickhouse_username | Username for ClickHouse | string | "clickhouse" | no |
| common_tags | Common tags to apply to any resource | map(string) | n/a | yes |
| create_ssl_cert | True to create the SSL certificate, false if not | bool | false | no |
| database_name | The name of the database | string | "datafold" | no |
| database_version | Version of the database | string | "POSTGRES_15" | no |
| datafold_intercom_app_id | The app ID for Intercom. A value other than "" enables this feature. Only used if the customer doesn't use Slack. | string | "" | no |
| db_deletion_protection | A flag that sets delete protection (applied in Terraform only, not in the cloud) | bool | true | no |
| default_node_disk_size | Disk size for a node | number | 40 | no |
| deploy_neg_backend | Set this to true to connect the backend service to the NEG that the GKE cluster will create | bool | true | no |
| deploy_vpc_flow_logs | Flag whether or not to deploy VPC flow logs | bool | false | no |
| deployment_name | Name of the current deployment | string | n/a | yes |
| domain_name | Provide a valid domain name (used to set the host in GCP) | string | n/a | yes |
| environment | Global environment tag to apply to all Datadog logs, metrics, etc. | string | n/a | yes |
| gcs_path | Path in the GCS bucket to the backups | string | "backups" | no |
| github_endpoint | URL of the GitHub endpoint to connect to. Useful for GitHub Enterprise. | string | "" | no |
| gitlab_endpoint | URL of the GitLab endpoint to connect to. Useful for GitLab Enterprise. | string | "" | no |
| host_override | A valid domain name if you provision your own DNS / routing | string | "" | no |
| lb_app_rules | Extra rules to apply to the application load balancer for additional filtering. match_type is either "src_ip_ranges" or "expr"; versioned_expr and src_ip_ranges are only used with "src_ip_ranges", expr only with "expr". | list(object({ action = string, priority = number, description = string, match_type = string, versioned_expr = string, src_ip_ranges = list(string), expr = string })) | n/a | yes |
| lb_layer_7_ddos_defence | Flag to toggle layer 7 DDoS defence | bool | false | no |
| legacy_naming | Flag to toggle legacy behavior, such as the naming of resources | bool | true | no |
| mig_disk_type | See https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#disk_type | string | "pd-balanced" | no |
| postgres_allocated_storage | The amount of allocated storage for the PostgreSQL database | number | 20 | no |
| postgres_instance | GCP instance type for the PostgreSQL database | string | "db-custom-2-7680" | no |
| postgres_ro_username | PostgreSQL read-only user name | string | "datafold_ro" | no |
| postgres_username | The username to use for the PostgreSQL Cloud SQL database | string | "datafold" | no |
| project_id | The project to deploy to; if not set, the default provider project is used | string | n/a | yes |
| provider_azs | Provider AZs list; if empty, AZs are selected dynamically | list(string) | n/a | yes |
| provider_region | Region for deployment in GCP | string | n/a | yes |
| redis_data_size | Redis volume size | number | 10 | no |
| remote_storage | Type of remote storage for ClickHouse backups | string | "gcs" | no |
| restricted_roles | Flag to stop certain IAM-related resources from being updated/changed | bool | false | no |
| restricted_viewer_role | Flag to stop certain IAM-related resources from being updated/changed | bool | false | no |
| ssl_cert_name | Provide a valid SSL certificate name in GCP, OR set ssl_private_key_path and ssl_cert_path | string | "" | no |
| ssl_cert_path | SSL certificate path | string | "" | no |
| ssl_private_key_path | Private SSL key path | string | "" | no |
| vpc_cidr | Network CIDR for the VPC | string | "10.0.0.0/16" | no |
| vpc_flow_logs_interval | Interval for VPC flow logs | string | "INTERVAL_5_SEC" | no |
| vpc_flow_logs_sampling | Sampling for VPC flow logs | string | "0.5" | no |
| vpc_id | Provide the ID of an existing VPC if you want to omit creation of a new one | string | "" | no |
| vpc_master_cidr_block | CIDR block for the k8s master; must be a /28 block | string | "192.168.0.0/28" | no |
| vpc_secondary_cidr_pods | Network CIDR for VPC secondary subnet 1 | string | "/17" | no |
| vpc_secondary_cidr_services | Network CIDR for VPC secondary subnet 2 | string | "/17" | no |
| whitelist_all_ingress_cidrs_lb | Normally filtering happens on the load balancer, but some customers want to filter at the SG/firewall. This flag whitelists 0.0.0.0/0 on the load balancer. | bool | false | no |
| whitelisted_egress_cidrs | List of Internet addresses to which the application has access | list(string) | n/a | yes |
| whitelisted_ingress_cidrs | List of CIDRs that can access HTTP/HTTPS | list(string) | n/a | yes |

Outputs

| Name | Description |
|------|-------------|
| clickhouse_backup_sa | Name of the clickhouse backup Service Account |
| clickhouse_data_size | Size in GB of the clickhouse data volume |
| clickhouse_data_volume_id | Volume ID of the clickhouse data PD volume |
| clickhouse_gcs_bucket | Name of the GCS bucket for the clickhouse backups |
| clickhouse_logs_size | Size in GB of the clickhouse logs volume |
| clickhouse_logs_volume_id | Volume ID of the clickhouse logs PD volume |
| clickhouse_password | Password to use for clickhouse |
| cloud_provider | The cloud provider creating all the resources |
| cluster_name | The name of the GKE cluster that was created |
| db_instance_id | The database instance ID |
| deployment_name | The name of the deployment |
| domain_name | The domain name on the HTTPS certificate |
| lb_external_ip | The load balancer IP when it was provisioned |
| neg_name | The name of the Network Endpoint Group where pods need to be registered from kubernetes |
| postgres_database_name | The name of the postgres database |
| postgres_host | The hostname of the postgres database |
| postgres_password | The postgres password |
| postgres_port | The port of the postgres database |
| postgres_username | The postgres username |
| redis_data_size | The size in GB of the redis data volume |
| redis_data_volume_id | The volume ID of the Redis PD data volume |
| redis_password | The Redis password |
| vpc_cidr | The CIDR range of the VPC |
| vpc_id | The ID of the Google VPC the cluster runs in |
| vpc_subnetwork | The subnet in which the cluster is created |
