---
layout: post
title: "Backing up Delta Lake"
author: kuntalb
tags:
- deltalake
- s3
- data-warehouse
- backup
- featured
team: Core Platform
---

Transitioning from a more traditional database operations background (read: ACID, RDBMS, blah blah) to a newer data platform is always interesting, as it constantly challenges all your years-old wisdom and forces you to adapt to newer ways of getting things done.

At [Scribd](https://tech.scribd.com/) we have made [Delta Lake](https://delta.io/) a cornerstone of our data platform. All data in Delta Lake is stored in [Apache Parquet](https://parquet.apache.org/) format, enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet. The Delta Lake transaction log (also known as the `DeltaLog`) is an ordered record of every transaction that has ever been performed on a Delta Lake table since its inception. So for a particular dataset to work properly, it needs both the Parquet files and the corresponding `DeltaLog`.

When the task of having a workable backup of all those Delta Lake files fell into my lap, I decided to look at some of the age-old concepts of backup from a new perspective. The concerns I considered were:

1. What am I protecting against? How much do I need to protect?
1. Can I survive losing some data during a restore, and do I have the option of rebuilding it again from that point-in-time recovery?
1. What kind of protection do I want to put in place for the backed up data?

So what we set as objectives:

1. I am mainly protecting against human error, where by mistake a table can be purged ([VACUUM](https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-vacuum.html)), which would severely hamper my ability to do a time travel if required.
1. In most cases, if we have a reasonable backup ready, we should be able to rebuild the Delta table data that was lost between the time the backup was taken and the time the drop table occurred.
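
To make "time travel" concrete, this is the kind of read that a stray `VACUUM` or dropped table takes away from us. A minimal sketch using PySpark with the Delta Lake libraries available; the table path and version number below are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-time-travel").getOrCreate()

# Read an older snapshot of a Delta table by version number
# (or by timestamp with .option("timestampAsOf", "2021-03-31")).
old_snapshot = (
    spark.read.format("delta")
    .option("versionAsOf", 5)                      # hypothetical version
    .load("s3://example-warehouse/events_table")   # hypothetical table path
)
old_snapshot.show()
```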
## The Devil is in the Details

### AWS S3 Batch Operation

After deliberating a lot, we decided to do this whole backup operation independent of [Delta Lake](https://delta.io/) and go to the lowest layer possible, which in our case was S3. I never thought I would say this ever in my life (being an RDBMS DBA), but the moment we got onto the S3 layer, the whole thing became a challenge of copying a few S3 buckets (read: millions of files) over, instead of a database backup.

So we started looking for an efficient S3 copy operation and found [AWS S3 batch operation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/batch-ops-examples-xcopy.html) and its feature for copying objects across AWS accounts. This was like a match made in heaven for us.

You can use [AWS S3 batch operation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/batch-ops-examples-xcopy.html) to perform large-scale batch operations on Amazon S3 objects. S3 Batch Operations can perform a single operation on lists of Amazon S3 objects that you specify. A single job can perform the specified operation (in our case, copy) on billions of objects containing large sets of data. This operation has the following features:

1. Automatically tracks progress.
1. Stores a detailed completion report of all or selected actions in a user-defined bucket.
1. Provides a fully managed, auditable, and serverless experience.
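
To give a feel for what such a job looks like, here is a minimal `boto3` sketch of submitting a cross-account copy job driven by an S3 Inventory manifest. This is not our production code; the account ID, region, bucket names, role ARN and manifest ETag are all placeholders:

```python
import boto3

# The job has to be created in the destination account and region.
s3control = boto3.client("s3control", region_name="us-east-1")  # placeholder region

response = s3control.create_job(
    AccountId="111122223333",                     # destination (backup) account
    ConfirmationRequired=False,                   # start as soon as the job is ready
    Priority=10,
    RoleArn="arn:aws:iam::111122223333:role/s3-batch-copy-role",
    Operation={
        "S3PutObjectCopy": {
            # Bucket that receives the backup objects
            "TargetResource": "arn:aws:s3:::source-bucket-name-31-mar-2021",
        }
    },
    Manifest={
        "Spec": {"Format": "S3InventoryReport_CSV_20161130"},
        "Location": {
            # manifest.json produced by S3 Inventory
            "ObjectArn": "arn:aws:s3:::inventory-manifest-bucket/source-bucket-name/daily/manifest.json",
            "ETag": "example-etag-of-manifest-json",
        },
    },
    Report={
        "Bucket": "arn:aws:s3:::batch-reports-bucket",
        "Prefix": "backup-job-reports",
        "Format": "Report_CSV_20180820",
        "Enabled": True,
        "ReportScope": "FailedTasksOnly",
    },
)
print("Created S3 Batch Operations job:", response["JobId"])
```

The job can then be monitored with `describe_job`, and the completion report lands in the bucket configured under `Report`.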

Once we decided to use [AWS S3 batch operation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/batch-ops-examples-xcopy.html), the next biggest challenge was how to generate the inventory list that would feed the S3 batch operation. We decided to use [AWS S3 inventory](https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-inventory.html) to generate the inventory list. There are some challenges associated with that as well.

**Pros**:

* Simple setup, we can terraform it easily.
* A much more efficient operation compared to generating our own list, since the list-objects API only returns 1,000 keys per call, which means we would have to keep iterating until we get the full list.

**Cons**:

* We do not control when it runs; it will generate a report on a daily basis, but the timing is not in our hands.
* It runs in an eventually consistent model, i.e. all of your objects might not appear in each inventory list. The inventory list provides eventual consistency for PUTs (for both new objects and overwrites) and DELETEs. Inventory lists are a rolling snapshot of bucket items, which are eventually consistent (that is, the list might not include recently added or deleted objects).

To overcome the downsides, we decided to run the backup at a later date; e.g. for a backup of March 31st we based that off a manifest generated on April 2nd. This manifest would certainly have all data up until March 31st and some of April 1st's files as well.
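
For reference, here is a sketch of what such an inventory configuration might look like. We manage ours with Terraform, but the equivalent `boto3` call below shows the moving parts; the bucket names and account ID are placeholders:

```python
import boto3

s3 = boto3.client("s3")

# Daily CSV inventory of the source bucket, delivered into the
# manifest bucket that lives in the destination (backup) account.
s3.put_bucket_inventory_configuration(
    Bucket="source-bucket-name",
    Id="daily-backup-inventory",
    InventoryConfiguration={
        "Id": "daily-backup-inventory",
        "IsEnabled": True,
        "IncludedObjectVersions": "Current",
        "Schedule": {"Frequency": "Daily"},
        "OptionalFields": ["Size", "LastModifiedDate", "ETag"],
        "Destination": {
            "S3BucketDestination": {
                "AccountId": "111122223333",                       # destination account
                "Bucket": "arn:aws:s3:::inventory-manifest-bucket",
                "Format": "CSV",
                "Prefix": "source-bucket-name",
            }
        },
    },
)
```

Note that the bucket receiving the inventory reports also needs a policy that lets S3 deliver them into it; that part is omitted here.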

Once we had settled on this model, the rest of the work was similar to any other backup process. We also set up the source and the destination to have protective boundaries, so that we don't accidentally propagate any deletes to the backups.

### New Account New Beginning

To stop this accidental deletion of the backed up data, we decided to put the backed up data set in a completely separate bucket in a different AWS account, with stringent access controls in place. With a new account it was much easier to control the access level from the beginning, rather than trying to rein in access in an already existing account, where people already have a certain degree of access that is hard to modify. In the new account we ensured that, apart from the backup process and a handful of people, nothing actually has access to the backed up data, further reducing the chances of any manual error.

### Backup Process

#### Destination Side

1. Backups will be taken in a completely separate AWS account from the source account. Only a handful of admins will have access to this account, to reduce the chance of manual mistakes.
1. The whole backup process will be automated, with minimal human intervention, to reduce the scope for manual error.
1. On the destination side we have to create buckets to store the inventory reports, based on which the batch job will be run.
1. On the destination side we also have to create buckets to store the actual backup, where the batch job will store the backup objects. While terraforming it we have the bucket name dynamically created with the date appended at the end, e.g. `<Source-Bucket-Name>-<dd-mmm-yyyy>`, so that before each full snapshot we can create these buckets. Otherwise there is a risk of earlier full snapshots getting overwritten.
1. Create an IAM role for the batch operation; the source will grant the copy-object permission to this role.
1. We created a lambda on the destination side to scan through all the `manifest.json` files, create the actual batch operation, and run it automatically (see the sketch after this list).
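
A rough sketch of what that lambda might look like: it finds the newest `manifest.json` under the inventory prefix, derives the dated destination bucket name, and hands both to the S3 Batch Operations `create_job` call shown earlier. Bucket names, prefixes and the date format are placeholders, and error handling is omitted:

```python
import datetime
import boto3

s3 = boto3.client("s3")

MANIFEST_BUCKET = "inventory-manifest-bucket"   # hypothetical
SOURCE_BUCKET = "source-bucket-name"            # hypothetical


def handler(event, context):
    # Find every manifest.json the inventory has delivered for this bucket.
    paginator = s3.get_paginator("list_objects_v2")
    manifests = []
    for page in paginator.paginate(Bucket=MANIFEST_BUCKET, Prefix=f"{SOURCE_BUCKET}/"):
        for obj in page.get("Contents", []):
            if obj["Key"].endswith("manifest.json"):
                manifests.append(obj)

    # Use the most recently delivered manifest for this snapshot.
    latest = max(manifests, key=lambda o: o["LastModified"])
    manifest_arn = f"arn:aws:s3:::{MANIFEST_BUCKET}/{latest['Key']}"
    manifest_etag = latest["ETag"].strip('"')

    # Dated destination bucket, e.g. source-bucket-name-31-mar-2021,
    # created ahead of each full snapshot (see the list above).
    suffix = datetime.date.today().strftime("%d-%b-%Y").lower()
    destination_bucket = f"{SOURCE_BUCKET}-{suffix}"

    # From here we would submit the batch copy job with s3control.create_job,
    # passing manifest_arn, manifest_etag and destination_bucket as in the
    # earlier sketch.
    return {"manifest": manifest_arn, "etag": manifest_etag, "destination": destination_bucket}
```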

#### Source Side

1. We terraformed an Inventory Management config for all the buckets listed above on the source side.
1. This inventory config will create the inventory in the designated manifest bucket in the destination account.
1. For all the buckets on the source side, we have to add a bucket-level policy that allows the S3 batch operation role created on the destination side to do the copy operation (a sketch of such a policy follows below).
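
As an illustration, a bucket policy along these lines could be attached to each source bucket. The exact set of actions you need may differ, and the bucket name and role ARN are placeholders:

```python
import json
import boto3

s3 = boto3.client("s3")

SOURCE_BUCKET = "source-bucket-name"                                   # hypothetical
BATCH_ROLE_ARN = "arn:aws:iam::111122223333:role/s3-batch-copy-role"   # hypothetical

# Allow the destination account's batch-operations role to list the bucket
# and read the source objects so the copy job can pull them across accounts.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowBackupBatchCopyRead",
            "Effect": "Allow",
            "Principal": {"AWS": BATCH_ROLE_ARN},
            "Action": ["s3:GetObject", "s3:GetObjectVersion", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{SOURCE_BUCKET}",
                f"arn:aws:s3:::{SOURCE_BUCKET}/*",
            ],
        }
    ],
}

s3.put_bucket_policy(Bucket=SOURCE_BUCKET, Policy=json.dumps(policy))
```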

### Limitations

These are mainly the limitations of AWS S3 batch operation:

1. All source objects must be in one bucket.
   - This is not a challenge for us, as we invoke a bucket-level copy and create a manifest at the bucket level, which meets this requirement.
1. All destination objects must be in one bucket.
   - This is not a challenge for us, as we invoke a bucket-level copy and create a manifest at the bucket level, which meets this requirement.
1. You must have read permissions for the source bucket and write permissions for the destination bucket.
   - Again, proper IAM roles for the S3 batch copy operation can manage this.
1. Objects to be copied can be up to 5 GB in size.
   - S3 Batch uses the PUT copy method, so it is limited to 5 GB. If there is any manual upload of a file larger than 5 GB, it will be skipped. We tested this behaviour and found that the batch operation throws the following error and continues with the rest of the operation (a sketch of scanning the completion report for such failures appears after this list):
   ```
   Some-file-name,,failed,400,InvalidRequest,The specified copy source is larger than the maximum allowable size for a copy source: 5368709120 (Service: Amazon S3; Status Code: 400; Error Code: InvalidRequest; Request ID: FHNW4MF5ZMKBPDQY; S3 Extended Request ID: /uopiITqnCRtR1/W3K6DpeWTiJM36T/14azeNw4q2gBM0yj+r0GwzhmmHAsEMkhNq9v8NK4rcT8=; Proxy: null)
   ```

1. Copy jobs must be created in the destination region, which is the region you intend to copy the objects to.
   - Again, for our purpose this is what we intended to do anyway.
1. If the buckets are un-versioned, you will overwrite objects with the same key names.
   - We will create new buckets for each full snapshot to mitigate this.
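
Since the completion report is just a set of CSV files dropped into the report bucket, a small script can flag those skipped objects. A minimal sketch, assuming the row layout shown in the error line above; the bucket and prefix are placeholders, and the column positions may need adjusting for your report format:

```python
import csv
import io
import boto3

s3 = boto3.client("s3")

REPORT_BUCKET = "batch-reports-bucket"      # hypothetical
REPORT_PREFIX = "backup-job-reports/"       # hypothetical

# Walk the completion report CSVs and print any task that did not succeed,
# e.g. objects over 5 GB that the copy job had to skip.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=REPORT_BUCKET, Prefix=REPORT_PREFIX):
    for obj in page.get("Contents", []):
        if not obj["Key"].endswith(".csv"):
            continue
        body = s3.get_object(Bucket=REPORT_BUCKET, Key=obj["Key"])["Body"].read()
        for row in csv.reader(io.StringIO(body.decode("utf-8"))):
            # A result row carries a task status such as "succeeded" or "failed".
            if "failed" in row:
                print(obj["Key"], row)
```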
## Conclusion

The above approach worked well for our purpose, and if we follow the process properly it should suffice for many of our use-cases. This approach can work quite well if, like us, you do not have the luxury of doing a "Stop the World" on your data warehouse writes and still need to have a backup with a certain degree of confidence. This method does not provide an accurate point-in-time snapshot due to the "eventually consistent" model of manifest generation, but I believe it covers most of the use-cases for any backup.