
Let's say you have a table of customer transactions where accidental duplicates might occur. You want to analyze the data accurately, so you need to remove those duplicates.

Scenario:

Your table `customer_transactions` in dataset `my_dataset` in project `my_project` looks like this:

| transaction_id | customer_id | amount | date |
|---|---|---|---|
| 1 | 101 | 10.00 | 2024-03-08 |
| 2 | 102 | 25.50 | 2024-03-08 |
| 3 | 101 | 10.00 | 2024-03-08 |
| 4 | 103 | 50.00 | 2024-03-09 |
| 5 | 102 | 12.00 | 2024-03-09 |
| 6 | 101 | 10.00 | 2024-03-08 |
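
If you want to follow along, here is a minimal sketch for creating this sample table. It assumes you already have a dataset named `my_dataset` in project `my_project`; adjust the names to your environment.

```sql
-- Hypothetical setup for the sample data above.
CREATE OR REPLACE TABLE `my_project.my_dataset.customer_transactions` (
  transaction_id INT64,
  customer_id INT64,
  amount NUMERIC,
  date DATE
);

INSERT INTO `my_project.my_dataset.customer_transactions`
  (transaction_id, customer_id, amount, date)
VALUES
  (1, 101, 10.00, DATE '2024-03-08'),
  (2, 102, 25.50, DATE '2024-03-08'),
  (3, 101, 10.00, DATE '2024-03-08'),
  (4, 103, 50.00, DATE '2024-03-09'),
  (5, 102, 12.00, DATE '2024-03-09'),
  (6, 101, 10.00, DATE '2024-03-08');
```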

Use Case with `deduplicate_rows`:

You can use the `deduplicate_rows` function to remove the duplicate transactions:

```sql
CALL bigfunctions.us.deduplicate_rows("my_project.my_dataset.customer_transactions");
SELECT * FROM bigfunction_result;
```

This creates a temporary table `bigfunction_result` containing the deduplicated rows:

| transaction_id | customer_id | amount | date |
|---|---|---|---|
| 1 | 101 | 10.00 | 2024-03-08 |
| 2 | 102 | 25.50 | 2024-03-08 |
| 4 | 103 | 50.00 | 2024-03-09 |
| 5 | 102 | 12.00 | 2024-03-09 |

Benefits:

- **Simplicity:** Easily deduplicate rows without writing complex SQL yourself (a manual alternative is sketched after this list).
- **Efficiency:** Leverages BigQuery's processing power for fast deduplication, even on large tables.
- **Flexibility:** Works with both tables and query results, allowing you to deduplicate data from various sources.
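
For comparison, here is a minimal sketch of what a manual deduplication could look like in standard BigQuery SQL, assuming you treat rows sharing the same `customer_id`, `amount`, and `date` as duplicates and keep the lowest `transaction_id` in each group:

```sql
-- Manual alternative sketch: keep one row per (customer_id, amount, date) group.
SELECT *
FROM `my_project.my_dataset.customer_transactions`
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY customer_id, amount, date
  ORDER BY transaction_id
) = 1;
```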

Other Use Cases:

- Deduplicating product catalogs with slight variations in descriptions.
- Removing duplicate entries in user registration data.
- Cleaning up sensor data where multiple readings might be recorded for the same timestamp.
- Removing duplicate records from log files.

Remember to replace `bigfunctions.us` with the dataset that matches your BigQuery region. You can also create a new table from `bigfunction_result` if you want to store the deduplicated data permanently.
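
For example, a minimal sketch of persisting the result to a new table, run in the same session as the `CALL` above (the target table name `customer_transactions_dedup` is just an illustration):

```sql
-- Sketch: store the deduplicated rows permanently in a new table.
CREATE OR REPLACE TABLE `my_project.my_dataset.customer_transactions_dedup` AS
SELECT * FROM bigfunction_result;
```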