Let's say you have a table of customer transactions where accidental duplicates might occur. You want to analyze the data accurately, so you need to remove those duplicates.
Scenario:
Your table customer_transactions
in dataset my_dataset
in project my_project
looks like this:
transaction_id | customer_id | amount | date |
---|---|---|---|
1 | 101 | 10.00 | 2024-03-08 |
2 | 102 | 25.50 | 2024-03-08 |
3 | 101 | 10.00 | 2024-03-08 |
4 | 103 | 50.00 | 2024-03-09 |
5 | 102 | 12.00 | 2024-03-09 |
6 | 101 | 10.00 | 2024-03-08 |
Use Case with deduplicate_rows
:
You can use the deduplicate_rows
function to remove the duplicate transactions:
CALL bigfunctions.us.deduplicate_rows("my_project.my_dataset.customer_transactions");
SELECT * FROM bigfunction_result;
This will create a temporary table bigfunction_result
containing the deduplicated rows:
transaction_id | customer_id | amount | date |
---|---|---|---|
1 | 101 | 10.00 | 2024-03-08 |
2 | 102 | 25.50 | 2024-03-08 |
4 | 103 | 50.00 | 2024-03-09 |
5 | 102 | 12.00 | 2024-03-09 |
Benefits:
- Simplicity: Easily deduplicate rows without complex SQL queries.
- Efficiency: Leverages BigQuery's processing power for fast deduplication, even on large tables.
- Flexibility: Works with both tables and query results, allowing you to deduplicate data from various sources.
Other Use Cases:
- Deduplicating product catalogs with slight variations in descriptions.
- Removing duplicate entries in user registration data.
- Cleaning up sensor data where multiple readings might be recorded for the same timestamp.
- Removing duplicate records from log files.
Remember to replace bigfunctions.us
with the appropriate dataset for your BigQuery region. You can also create a new table from the bigfunction_result
if you want to store the deduplicated data permanently.