
Let's say you have a table of customer transactions where accidental duplicates might occur. You want to analyze the data accurately, so you need to remove those duplicates.

Scenario:

Your table `customer_transactions` in dataset `my_dataset` in project `my_project` looks like this:

| transaction_id | customer_id | amount | date |
|---|---|---|---|
| 1 | 101 | 10.00 | 2024-03-08 |
| 2 | 102 | 25.50 | 2024-03-08 |
| 3 | 101 | 10.00 | 2024-03-08 |
| 4 | 103 | 50.00 | 2024-03-09 |
| 5 | 102 | 12.00 | 2024-03-09 |
| 6 | 101 | 10.00 | 2024-03-08 |
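
If you want to follow along, here is a minimal sketch for creating this sample table. It assumes you already have a dataset named `my_dataset` in project `my_project`; adjust the names to your environment.

```sql
-- Hypothetical setup for the sample data above.
CREATE OR REPLACE TABLE `my_project.my_dataset.customer_transactions` (
  transaction_id INT64,
  customer_id INT64,
  amount NUMERIC,
  date DATE
);

INSERT INTO `my_project.my_dataset.customer_transactions`
  (transaction_id, customer_id, amount, date)
VALUES
  (1, 101, 10.00, DATE '2024-03-08'),
  (2, 102, 25.50, DATE '2024-03-08'),
  (3, 101, 10.00, DATE '2024-03-08'),
  (4, 103, 50.00, DATE '2024-03-09'),
  (5, 102, 12.00, DATE '2024-03-09'),
  (6, 101, 10.00, DATE '2024-03-08');
```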

Use Case with `deduplicate_rows`:

You can use the `deduplicate_rows` function to remove the duplicate transactions:

```sql
CALL bigfunctions.us.deduplicate_rows("my_project.my_dataset.customer_transactions");
SELECT * FROM bigfunction_result;
```

This creates a temporary table `bigfunction_result` containing the deduplicated rows:

| transaction_id | customer_id | amount | date |
|---|---|---|---|
| 1 | 101 | 10.00 | 2024-03-08 |
| 2 | 102 | 25.50 | 2024-03-08 |
| 4 | 103 | 50.00 | 2024-03-09 |
| 5 | 102 | 12.00 | 2024-03-09 |

Benefits:

- **Simplicity:** Easily deduplicate rows without writing complex SQL yourself (a manual alternative is sketched after this list).
- **Efficiency:** Leverages BigQuery's processing power for fast deduplication, even on large tables.
- **Flexibility:** Works with both tables and query results, allowing you to deduplicate data from various sources.
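
For comparison, here is a minimal sketch of what a manual deduplication could look like in standard BigQuery SQL, assuming you treat rows sharing the same `customer_id`, `amount`, and `date` as duplicates and keep the lowest `transaction_id` in each group:

```sql
-- Manual alternative sketch: keep one row per (customer_id, amount, date) group.
SELECT *
FROM `my_project.my_dataset.customer_transactions`
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY customer_id, amount, date
  ORDER BY transaction_id
) = 1;
```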

Other Use Cases:

- Deduplicating product catalogs with slight variations in descriptions.
- Removing duplicate entries in user registration data.
- Cleaning up sensor data where multiple readings might be recorded for the same timestamp.
- Removing duplicate records from log files.

Remember to replace `bigfunctions.us` with the dataset that matches your BigQuery region. You can also create a new table from `bigfunction_result` if you want to store the deduplicated data permanently.
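
For example, a minimal sketch of persisting the result to a new table, run in the same session as the `CALL` above (the target table name `customer_transactions_dedup` is just an illustration):

```sql
-- Sketch: store the deduplicated rows permanently in a new table.
CREATE OR REPLACE TABLE `my_project.my_dataset.customer_transactions_dedup` AS
SELECT * FROM bigfunction_result;
```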