This repository provisions the Google Cloud infrastructure resources required to deploy Datafold using the datafold-operator.
Application configuration is now managed through the `datafoldapplication` custom resource on the cluster, using the datafold-operator, rather than through Terraform application directories.
Breaking Change: The load balancer is no longer deployed by default. The default is now `deploy_lb = false`.
- Previous behavior: Load balancer was deployed by default
- New behavior: Load balancer deployment is disabled by default
- Action required: If you need a load balancer, explicitly set `deploy_lb = true` in your configuration so that you do not lose it. If it is removed anyway, redeploy it and then update your DNS to the new load balancer IP.
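
As a minimal sketch of the action described above, the toggle could be set in the module block that calls this module. The module label `datafold_gcp` and the `source` path are placeholders, not values taken from this repository; only `deploy_lb` comes from the text above.

```hcl
# Hypothetical module call; the label and the source path are placeholders.
module "datafold_gcp" {
  source = "./path/to/this/module" # assumption: replace with your actual module source

  # Explicitly keep the Terraform-managed load balancer (the default is now false).
  deploy_lb = true

  # Other required inputs are omitted here for brevity.
}
```

Running `terraform plan` before applying shows whether an existing load balancer would be destroyed, which is the situation this setting is meant to avoid.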
- The "application" directory is no longer part of this repository
- Application configuration is now managed through the `datafoldapplication` custom resource on the cluster
- A Google Cloud account, preferably a new isolated one.
- Terraform >= 1.4.6
- A customer contract with Datafold
  - The application does not work without credentials supplied by sales
- Access to our public helm-charts repository
The full deployment will create the following resources:
- Google VPC
- Google subnets
- Google GCS bucket for clickhouse backups
- Google Cloud Load Balancer (optional, disabled by default)
- Google-managed SSL certificate (if load balancer is enabled)
- Three persistent disk volumes for local data storage
- Cloud SQL PostgreSQL database
- A GKE cluster
- Service accounts for the GKE cluster to perform actions outside of its cluster boundary:
  - Provisioning persistent disk volumes
  - Updating Network Endpoint Group to route traffic to pods directly
  - Managing GCS bucket access for ClickHouse backups
Infrastructure Dependencies: For a complete list of required infrastructure resources and detailed deployment guidance, see the Datafold Dedicated Cloud GCP Deployment Documentation.
- This module will not provision DNS names in your zone.
- See the example for a potential setup, which has dependencies on our helm-charts
The example directory contains a single deployment example for infrastructure setup.
Setting up the infrastructure:
- It is easiest if you have full admin access in the target project.
- Pre-create a symmetric encryption key that is used to encrypt/decrypt secrets of this deployment.
  - Use the alias instead of the `mrk` link. Put that into `locals.tf`.
- Certificate Requirements (depends on load balancer deployment method):
  - If deploying the load balancer from this Terraform module (`deploy_lb = true`): Pre-create and validate the SSL certificate in your DNS, then refer to that certificate in main.tf using its domain name (replace "datafold.example.com").
  - If deploying the load balancer from within Kubernetes: The certificate is created automatically, but you must wait for it to become available and then validate it in your DNS after the deployment is complete.
- Change the settings in locals.tf
  - provider_region = The region you want to deploy in.
  - project_id = The GCP project ID where you want to deploy.
  - kms_profile = The profile you want to use to issue the deployments. Targets the deployment account.
  - kms_key = A pre-created symmetric KMS key. Its only purpose is encryption/decryption of deployment secrets.
  - deployment_name = The name of the deployment, used in the Kubernetes namespace, container naming, and the Datadog "deployment" Unified Tag.
- Run `terraform init` in the infra directory.
- Run `terraform apply` in the infra directory. This should complete OK.
  - Check in the console that you see the GKE cluster, Cloud SQL database, etc.
  - If you enabled load balancer deployment, check for the load balancer as well.
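
The settings listed above might look like the following in `locals.tf`. This is a sketch only: every value is an illustrative placeholder, and the exact format expected for `kms_key` and `kms_profile` is an assumption to verify against the example directory.

```hcl
# locals.tf (illustrative values only; replace each one with your own)
locals {
  provider_region = "us-central1"              # region to deploy in (placeholder)
  project_id      = "my-datafold-project"      # GCP project ID (placeholder)
  kms_profile     = "my-deployment-profile"    # profile used to issue the deployment (placeholder)
  kms_key         = "REPLACE_WITH_KMS_KEY_ALIAS" # pre-created symmetric key, referenced by alias (placeholder)
  deployment_name = "datafold"                 # used in the Kubernetes namespace and container naming
}
```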
Application Deployment: After infrastructure is ready, deploy the application using the datafold-operator. See the Datafold Helm Charts repository for detailed application deployment instructions.
This module is designed to provide the complete infrastructure stack for Datafold deployment. However, if you already have GKE infrastructure in place, you can choose to configure the required resources independently.
Required Infrastructure Components:
- GKE cluster with appropriate node pools
- Cloud SQL PostgreSQL database
- GCS bucket for ClickHouse backups
- Persistent disks for persistent storage (ClickHouse data, ClickHouse logs, Redis data)
- IAM roles and service accounts for cluster operations
- Load balancer (optional, can be managed by Google Cloud Load Balancer Controller)
- VPC and networking components
- SSL certificate (validation timing depends on deployment method):
- Terraform-managed LB: Certificate must be pre-created and validated
- Kubernetes-managed LB: Certificate created automatically, validated post-deployment
Alternative Approaches:
- Use this module: Provides complete infrastructure setup for new deployments
- Use existing infrastructure: Configure required resources manually or through other means
- Hybrid approach: Use this module for some components and existing infrastructure for others
For detailed specifications of each required component, see the Datafold Dedicated Cloud GCP Deployment Documentation. For application deployment instructions, see the Datafold Helm Charts repository.
Based on the Datafold GCP Deployment Documentation, this module provisions the following detailed infrastructure components:
The Datafold application requires 3 persistent disks for storage, each deployed as encrypted Google Compute Engine persistent disks in the primary availability zone:
- ClickHouse data disk: Serves as the analytical database storage for Datafold. ClickHouse is a columnar database that excels at analytical queries. The default 40GB allocation usually provides sufficient space for typical deployments, but it can be scaled up based on data volume requirements.
- ClickHouse logs disk: Stores ClickHouse's internal logs and temporary data. Keeping logs on a separate disk prevents log writes from consuming the IOPS and I/O throughput of the data disk.
- Redis data disk: Provides persistent storage for Redis, which handles task distribution and distributed locks in the Datafold application. Redis is memory-first but benefits from persistence for data durability across restarts.
All persistent disks are encrypted by default using Google-managed encryption keys, ensuring data security at rest.
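
The disk sizes described above map to module inputs. A hedged sketch of overriding them follows; the input names and defaults are taken from the inputs table in this README, while the module label, the `source` path, and the 100 GB value are illustrative assumptions.

```hcl
# Hypothetical module call; the label and the source path are placeholders.
module "datafold_gcp" {
  source = "./path/to/this/module" # assumption: replace with your actual module source

  # ClickHouse data disk: the default is 40 (GB); raised here as an example.
  clickhouse_data_disk_size = 100

  # Redis data disk: kept at its default of 10 (GB).
  redis_data_size = 10

  # Other required inputs are omitted here for brevity.
}
```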
The load balancer serves as the primary entry point for all external traffic to the Datafold application. The module offers 2 deployment strategies:
- External Load Balancer Deployment: Creates a Google Cloud Load Balancer through Terraform. This was previously the default; it now requires `deploy_lb = true`.
- Kubernetes-Managed Load Balancer: Relies on the Google Cloud Load Balancer Controller running within the GKE cluster, deployed by the datafold application resource. This means Kubernetes creates the load balancer for you.
The Google Kubernetes Engine (GKE) cluster forms the compute foundation for the Datafold application:
- Network Architecture: The entire cluster is deployed into private subnets with Cloud NAT for egress traffic
- Security Features: Workload Identity, Shielded nodes, Binary authorization, Network policy, and Private nodes
- Node Management: Supports up to three managed node pools with automatic scaling
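
The cluster's network layout is controlled by a handful of networking inputs. The sketch below sets them to the defaults listed in the inputs table of this README; the module label and the `source` path are placeholders, and whether these defaults suit your address plan is something to check for your environment.

```hcl
# Hypothetical module call; the label and the source path are placeholders.
module "datafold_gcp" {
  source = "./path/to/this/module" # assumption: replace with your actual module source

  vpc_cidr                    = "10.0.0.0/16"    # primary VPC range
  vpc_master_cidr_block       = "192.168.0.0/28" # Kubernetes master range, must be a /28 block
  vpc_secondary_cidr_pods     = "/17"            # secondary range for pods
  vpc_secondary_cidr_services = "/17"            # secondary range for services
  deploy_vpc_flow_logs        = false            # VPC flow logs are off by default

  # Other required inputs are omitted here for brevity.
}
```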
The IAM architecture follows the principle of least privilege:
- GKE service account: Basic permissions for logging, monitoring, and storage access
- ClickHouse backup service account: Custom role for ClickHouse to make backups and store them on Cloud Storage
- Datafold service accounts: Pre-defined roles for different application components
The PostgreSQL Cloud SQL instance serves as the primary relational database:
- Storage configuration: Starts with a 20GB initial allocation that can automatically scale up to 100GB
- High availability: Intentionally disabled by default to reduce costs and complexity
- Security and encryption: Always encrypts data at rest using Google-managed encryption keys
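
The database settings above correspond to inputs documented later in this README. A minimal sketch, using the documented defaults, might look like this; the module label and the `source` path are placeholders.

```hcl
# Hypothetical module call; the label and the source path are placeholders.
module "datafold_gcp" {
  source = "./path/to/this/module" # assumption: replace with your actual module source

  database_version           = "POSTGRES_15"      # documented default
  postgres_instance          = "db-custom-2-7680" # documented default instance type
  postgres_allocated_storage = 20                 # initial allocation in GB
  db_deletion_protection     = true               # applied in Terraform only, not on the cloud

  # Other required inputs are omitted here for brevity.
}
```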
Name | Version |
---|---|
dns | 3.2.1 |
| | >= 4.80.0 |
Name | Version |
---|---|
| | >= 4.80.0 |
random | n/a |
Name | Source | Version |
---|---|---|
clickhouse_backup | ./modules/clickhouse_backup | n/a |
database | ./modules/database | n/a |
gke | ./modules/gke | n/a |
load_balancer | ./modules/load_balancer | n/a |
networking | ./modules/networking | n/a |
project-iam-bindings | terraform-google-modules/iam/google//modules/projects_iam | n/a |
project_factory_project_services | terraform-google-modules/project-factory/google//modules/project_services | ~> 14.4.0 |
Name | Type |
---|---|
Name | Description | Type | Default | Required |
---|---|---|---|---|
add_onprem_support_group | Flag to add onprem support group for [email protected] | bool | true | no |
clickhouse_backup_sa_key | SA key from secrets | string | "" | no |
clickhouse_data_disk_size | Data volume size for ClickHouse | number | 40 | no |
clickhouse_db | Db for clickhouse. | string | "clickhouse" | no |
clickhouse_gcs_bucket | GCS Bucket for clickhouse backups. | string | "clickhouse-backups-abcguo23" | no |
clickhouse_get_backup_sa_from_secrets_yaml | Flag to toggle getting clickhouse backup SA from secrets.yaml instead of creating new one | bool | false | no |
clickhouse_username | Username for clickhouse. | string | "clickhouse" | no |
common_tags | Common tags to apply to any resource | map(string) | n/a | yes |
create_ssl_cert | True to create the SSL certificate, false if not | bool | false | no |
database_name | The name of the database | string | "datafold" | no |
database_version | Version of the database | string | "POSTGRES_15" | no |
datafold_intercom_app_id | The app id for the intercom. A value other than "" will enable this feature. Only used if the customer doesn't use slack. | string | "" | no |
db_deletion_protection | A flag that sets delete protection (applied in terraform only, not on the cloud). | bool | true | no |
default_node_disk_size | Disk size for a node | number | 40 | no |
deploy_neg_backend | Set this to true to connect the backend service to the NEG that the GKE cluster will create | bool | true | no |
deploy_vpc_flow_logs | Flag whether or not to deploy vpc flow logs | bool | false | no |
deployment_name | Name of the current deployment. | string | n/a | yes |
domain_name | Provide valid domain name (used to set host in GCP) | string | n/a | yes |
environment | Global environment tag to apply on all datadog logs, metrics, etc. | string | n/a | yes |
gcs_path | Path in the GCS bucket to the backups | string | "backups" | no |
github_endpoint | URL of GitHub endpoint to connect to. Useful for GH Enterprise. | string | "" | no |
gitlab_endpoint | URL of GitLab endpoint to connect to. Useful for self-managed GitLab instances. | string | "" | no |
host_override | A valid domain name if they provision their own DNS / routing | string | "" | no |
lb_app_rules | Extra rules to apply to the application load balancer for additional filtering | list(object({ | n/a | yes |
lb_layer_7_ddos_defence | Flag to toggle layer 7 ddos defence | bool | false | no |
legacy_naming | Flag to toggle legacy behavior - like naming of resources | bool | true | no |
mig_disk_type | https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/compute_instance_template#disk_type | string | "pd-balanced" | no |
postgres_allocated_storage | The amount of allocated storage for the postgres database | number | 20 | no |
postgres_instance | GCP instance type for PostgreSQL database. Available instance groups: . Available instance classes: . | string | "db-custom-2-7680" | no |
postgres_ro_username | Postgres read-only user name | string | "datafold_ro" | no |
postgres_username | The username to use for the postgres CloudSQL database | string | "datafold" | no |
project_id | The project to deploy to, if not set the default provider project is used. | string | n/a | yes |
provider_azs | Provider AZs list, if empty we get AZs dynamically | list(string) | n/a | yes |
provider_region | Region for deployment in GCP | string | n/a | yes |
redis_data_size | Redis volume size | number | 10 | no |
remote_storage | Type of remote storage for clickhouse backups. | string | "gcs" | no |
restricted_roles | Flag to stop certain IAM related resources from being updated/changed | bool | false | no |
restricted_viewer_role | Flag to stop certain IAM related resources from being updated/changed | bool | false | no |
ssl_cert_name | Provide valid SSL certificate name in GCP OR ssl_private_key_path and ssl_cert_path | string | "" | no |
ssl_cert_path | SSL certificate path | string | "" | no |
ssl_private_key_path | Private SSL key path | string | "" | no |
vpc_cidr | Network CIDR for VPC | string | "10.0.0.0/16" | no |
vpc_flow_logs_interval | Interval for vpc flow logs | string | "INTERVAL_5_SEC" | no |
vpc_flow_logs_sampling | Sampling for vpc flow logs | string | "0.5" | no |
vpc_id | Provide ID of existing VPC if you want to omit creation of new one | string | "" | no |
vpc_master_cidr_block | CIDR block for k8s master, must be a /28 block. | string | "192.168.0.0/28" | no |
vpc_secondary_cidr_pods | Network CIDR for VPC secondary subnet 1 | string | "/17" | no |
vpc_secondary_cidr_services | Network CIDR for VPC secondary subnet 2 | string | "/17" | no |
whitelist_all_ingress_cidrs_lb | Normally we filter on the load balancer, but some customers want to filter at the SG/Firewall. This flag will whitelist 0.0.0.0/0 on the load balancer. | bool | false | no |
whitelisted_egress_cidrs | List of Internet addresses to which the application has access | list(string) | n/a | yes |
whitelisted_ingress_cidrs | List of CIDRs that can access the HTTP/HTTPS | list(string) | n/a | yes |
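
To illustrate the table above, here is a sketch of a module call that sets only the inputs marked as required. The module label, the `source` path, and every value are illustrative assumptions. In particular, passing an empty list for `lb_app_rules` is an assumption, since the full object type is not shown in the table.

```hcl
# Hypothetical module call; the label, source path, and all values are placeholders.
module "datafold_gcp" {
  source = "./path/to/this/module" # assumption: replace with your actual module source

  deployment_name = "datafold"
  environment     = "production"
  project_id      = "my-datafold-project"
  provider_region = "us-central1"
  provider_azs    = [] # empty list: AZs are looked up dynamically
  domain_name     = "datafold.example.com"

  common_tags = {
    team = "data-platform"
  }

  lb_app_rules              = []                 # assumption: no extra load balancer rules
  whitelisted_ingress_cidrs = ["203.0.113.0/24"] # documentation-only example range
  whitelisted_egress_cidrs  = ["0.0.0.0/0"]
}
```

All remaining inputs fall back to the defaults shown in the table.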
Name | Description |
---|---|
clickhouse_backup_sa | Name of the clickhouse backup Service Account |
clickhouse_data_size | Size in GB of the clickhouse data volume |
clickhouse_data_volume_id | Volume ID of the clickhouse data PD volume |
clickhouse_gcs_bucket | Name of the GCS bucket for the clickhouse backups |
clickhouse_logs_size | Size in GB of the clickhouse logs volume |
clickhouse_logs_volume_id | Volume ID of the clickhouse logs PD volume |
clickhouse_password | Password to use for clickhouse |
cloud_provider | The cloud provider creating all the resources |
cluster_name | The name of the GKE cluster that was created |
db_instance_id | The database instance ID |
deployment_name | The name of the deployment |
domain_name | The domain name on the HTTPS certificate |
lb_external_ip | The load balancer IP when it was provisioned. |
neg_name | The name of the Network Endpoint Group where pods need to be registered from kubernetes. |
postgres_database_name | The name of the postgres database |
postgres_host | The hostname of the postgres database |
postgres_password | The postgres password |
postgres_port | The port of the postgres database |
postgres_username | The postgres username |
redis_data_size | The size in GB of the redis data volume |
redis_data_volume_id | The volume ID of the Redis PD data volume |
redis_password | The Redis password |
vpc_cidr | The CIDR range of the VPC |
vpc_id | The ID of the Google VPC the cluster runs in. |
vpc_subnetwork | The subnet in which the cluster is created |
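
The outputs above can be re-exported from a root configuration so that later steps, such as configuring the `datafoldapplication` resource, can read them. This is a sketch that assumes the module is called with the hypothetical label `datafold_gcp`; the output names on the right-hand side come from the table above.

```hcl
# Assumes a module block labelled "datafold_gcp" exists in the same configuration.
output "datafold_cluster_name" {
  description = "Name of the GKE cluster created by the module"
  value       = module.datafold_gcp.cluster_name
}

output "datafold_postgres_host" {
  description = "Hostname of the Cloud SQL PostgreSQL database"
  value       = module.datafold_gcp.postgres_host
}

output "datafold_postgres_password" {
  description = "PostgreSQL password, marked sensitive so it is hidden in CLI output"
  value       = module.datafold_gcp.postgres_password
  sensitive   = true
}

output "datafold_lb_external_ip" {
  description = "Load balancer IP, relevant only when the load balancer is deployed"
  value       = module.datafold_gcp.lb_external_ip
}
```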