GROUP BY statements, and generates 20 billion rows from that. This was too much for BigQuery, which cancelled the job with this error:

Resources exceeded during query execution: Your project or organization
exceeded the maximum disk and memory limit available for shuffle operations.
Consider provisioning more slots, reducing query concurrency, or using more
efficient logic in this job.

The query uses the claim_id and patient_id strings as GROUP BY keys at different points in the query. Behind the scenes, a distributed GROUP BY requires moving all the data for a set of grouping keys to the same set of processing nodes. This is called a “shuffle” or a “repartition,” and with this much data we’re talking multiple TB of data moving around the system.
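
To put rough numbers on that, here is a back-of-envelope sketch; the 36-byte string id size is an assumption for illustration, not a measurement from the job:

-- Rough sizing of just the two key columns across 20 billion rows,
-- assuming UUID-sized (~36 byte) string ids versus 8-byte INT64 keys.
SELECT
  2 * 20e9 * 36 / 1e12 AS key_tb_shuffled_as_strings,  -- ~1.4 TB
  2 * 20e9 * 8  / 1e12 AS key_tb_shuffled_as_int64s    -- ~0.3 TB

Hence the substitution below: map each claim_id and patient_id to a small integer once, so the later GROUP BYs shuffle integers rather than full strings.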

WITH
claim_numbers AS (
-- Assign a number to each claim.
SELECT ROW_NUMBER() OVER () AS id, claim_id
FROM claims_table
GROUP BY claim_id
),
patient_numbers AS (
-- Assign a number to each patient.
SELECT ROW_NUMBER() OVER () AS id, patient_id
FROM claims_table
GROUP BY patient_id
),
claims AS (
-- Substitute the integer ids for the original strings.
SELECT cn.id AS claim_id, pn.id AS patient_id, other_columns
FROM claims_table
JOIN claim_numbers cn USING(claim_id)
JOIN patient_numbers pn USING(patient_id)
),
other_ctes AS (...)
-- And then you can join on claim_numbers and patient_numbers again
-- if you need to restore the strings.
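
For example, the final pass might look something like this; final_result is just a stand-in name for whatever the query's last CTE is, carrying the integer ids:

SELECT cn.claim_id, pn.patient_id, fr.other_columns
FROM final_result fr
JOIN claim_numbers cn ON cn.id = fr.claim_id
JOIN patient_numbers pn ON pn.id = fr.patient_id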

Unfortunately, BigQuery failed on the ROW_NUMBER calculation with an error like this:

Resources exceeded during query execution:
The query could not be executed in the allotted memory.
Peak usage: 158% of limit.
Top memory consumer(s): analytic OVER() clauses: 100%

An OVER clause with no PARTITION BY requires BigQuery to run the entire dataset through a single processing node, which can require more RAM than the node has. But there’s another trick we can use.

The claims data has a patient_geo column which represents roughly where each patient is from.

SELECT patient_id,
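-- Partition the window by geography so no single node has to number every patient.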
ROW_NUMBER() OVER(PARTITION BY MAX(patient_geo)) AS id
FROM claims
GROUP BY patient_id

Since we partitioned by the patient_geo column, ROW_NUMBER restarts at 1 in every partition, so the raw numbers overlap across geos; we can integrate patient_geo into our id as well to avoid that overlap. The output is a 64-bit integer, which gives us plenty of space. We know there will never be more than 100 billion patients, and so for each column, we multiply patient_geo by 100 billion and add it to the ROW_NUMBER for each row. An INT64 tops out around 9.2 quintillion, so that still leaves room for about 92 million distinct patient_geo values. A bit shift would work here too, but it doesn’t really matter.

SELECT patient_id,
-- We know there will never be more than 100 billion patients,
-- so shift the partition key up by that much.
(MAX(patient_geo) * 100000000000)
+ ROW_NUMBER()
OVER(PARTITION BY MAX(patient_geo)) AS id
FROM claims
GROUP BY patient_id
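
As mentioned, a bit shift works just as well; here is a minimal sketch of that variant, with 37 chosen only because 2^37 (about 137 billion) clears the 100-billion ceiling:

SELECT patient_id,
-- Same idea with a shift: 2^37 leaves room for 100 billion patients per geo,
-- and an INT64 still has 2^26 (~67 million) possible patient_geo values above it.
(MAX(patient_geo) << 37)
+ ROW_NUMBER()
OVER(PARTITION BY MAX(patient_geo)) AS id
FROM claims
GROUP BY patient_id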