Should we clean the data at the query level or afterwards using a script? For this example, assume we have a `users` table and a `purchases` table.

`users`:

| user_id |
|---|
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
`purchases`:

| user_id | price |
|---|---|
| 1 | 79.36 |
| 3 | 42.97 |
| 1 | 33.14 |
| 2 | 16.48 |
| 4 | 96.41 |
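If you'd like to follow along, here's a minimal sketch of that setup (assuming Postgres; the column types are inferred from the sample data above):

```sql
CREATE TABLE users (user_id integer PRIMARY KEY);
CREATE TABLE purchases (user_id integer, price numeric(10, 2));

INSERT INTO users (user_id) VALUES (1), (2), (3), (4), (5);
INSERT INTO purchases (user_id, price) VALUES
  (1, 79.36), (3, 42.97), (1, 33.14), (2, 16.48), (4, 96.41);
```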
`COALESCE` is a keyword used by Postgres and other SQL dialects to return the first non-null value in a set. Because of this, it's a common keyword for data cleaning.
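To see what that means in isolation, here's a quick sanity check you can run anywhere (plain Postgres, no tables involved):

```sql
SELECT COALESCE(NULL, NULL, 0);  -- returns 0, the first non-null argument
SELECT COALESCE(NULL, 7, 0);     -- returns 7
```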
Let's look closer at the following query:

```sql
SELECT
  COALESCE(SUM(purchases.price), 0) as LTV,
  user_id
FROM purchases
WHERE purchases.user_id IN (1, 2, 3, 4, 5)
GROUP BY user_id
ORDER BY user_id;
```
| user_id | LTV |
|---|---|
| 1 | 112.50 |
| 2 | 16.48 |
| 3 | 42.97 |
| 4 | 96.41 |

Notice that user 5 is missing from the results entirely. Anything downstream that looks up user 5's LTV will get back `null`, which may throw off calculations. So why didn't `COALESCE` catch this? It helps to remember what each of these values actually means:

| value | meaning |
|---|---|
| 0 | We know something is there and its value is zero. |
| null | We know something is there and its value is unspecified. |
| empty set | We know nothing is there. |
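That last distinction is exactly what bites us here, and you can see it with two variations of the aggregate (a quick sketch against the demo data):

```sql
-- An aggregate with no GROUP BY always returns exactly one row,
-- so SUM over zero matching rows yields a null that COALESCE could fix:
SELECT SUM(price) FROM purchases WHERE user_id = 5;
-- -> one row containing null

-- With GROUP BY, zero matching rows means zero groups: an empty set.
-- There is no row at all, so there is nothing for COALESCE to act on:
SELECT user_id, SUM(price) FROM purchases WHERE user_id = 5 GROUP BY user_id;
-- -> (0 rows)
```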
The problem isn't with `COALESCE` but with the order of execution: there are no rows to return with a `user_id` of 5 in `purchases`. Let's think about the logical order of how we get there:

1. `FROM` and `WHERE` gather the rows from `purchases` that match the given user ids (5 rows, matching 4 of the given user ids).
2. `GROUP BY` buckets those rows into one group per `user_id`, leaving us 4 groups and none for user 5.
3. `SELECT` then computes `SUM(price)` for each group. `COALESCE` never gets to flip its `CASE` switch for our LTV because, again, there is no row, no data for it to act on.

Of note: if the `purchases` table contained a row for user_id 5 with `null` for price, it would have returned as expected. Knowing that, a second option is to break out a right outer join.
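You can verify that claim with a quick experiment (wrapped in a transaction so the demo data stays clean):

```sql
BEGIN;

-- Give user 5 a purchase whose price is null
INSERT INTO purchases (user_id, price) VALUES (5, NULL);

-- Re-running the original query, user 5 now has a group for COALESCE
-- to act on: SUM of a lone null is null, which is coalesced to 0.
SELECT COALESCE(SUM(purchases.price), 0) as LTV, user_id
FROM purchases
WHERE purchases.user_id IN (1, 2, 3, 4, 5)
GROUP BY user_id
ORDER BY user_id;
-- -> user_id 5 appears with LTV = 0

ROLLBACK;
```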
We can `RIGHT OUTER JOIN` with our users table to achieve that. This way, we'll get back a row for every user from `users` and, if it's there, the sum of their data from the `price` column in `purchases`, otherwise a 0:

```sql
SELECT
  users.user_id,
  COALESCE(SUM(purchases.price), 0) as LTV
FROM purchases
RIGHT OUTER JOIN users
  ON purchases.user_id = users.user_id
GROUP BY users.user_id
ORDER BY users.user_id;
```
| user_id | LTV |
|---|---|
| 1 | 112.50 |
| 2 | 16.48 |
| 3 | 42.97 |
| 4 | 96.41 |
| 5 | 0 |
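If right joins read backwards to you, the same result can be written by starting from `users` and left-joining `purchases`; the two forms are equivalent here:

```sql
SELECT
  users.user_id,
  COALESCE(SUM(purchases.price), 0) as LTV
FROM users
LEFT OUTER JOIN purchases
  ON purchases.user_id = users.user_id
GROUP BY users.user_id
ORDER BY users.user_id;
```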
💡 If all of your foreign keys are coming from a single table, the above might be the right solution for you. However, in more complex cases, you may not have something like a generic “users” table.
Instead, your schema might look something like this:

| Column | Type | Collation | Nullable | Default |
|---|---|---|---|---|
| user_id | string | | not null | |
| first_name | string | | not null | |
| last_name | string | | not null | |
| Column | Type | Collation | Nullable | Default |
|---|---|---|---|---|
| user_id | string | | not null | |
| purchase_amount | decimal | | not null | |
| item_sku | string | | not null | |
| item_category | string | | not null | |
⚠️ Overkill query ahead. You should really try a function instead.
A third option is to pad the requested `user_id`s for the query with a sublist of (`user_id`, 0), then `UNION` the original query with that list. If that sounds like overkill, that's because it is:

```sql
SELECT DISTINCT ON (user_id) user_id, LTV
FROM (
  SELECT user_id, SUM(price) as LTV
  FROM purchases
  WHERE purchases.user_id IN (1, 2, 3, 4, 5)
  GROUP BY user_id
  UNION
  SELECT * FROM (
    VALUES (1, 0), (2, 0), (3, 0), (4, 0), (5, 0)
  ) AS t (user_id, LTV)
  ORDER BY user_id, LTV DESC
) AS data;
```
| user_id | LTV |
|---|---|
| 1 | 112.50 |
| 2 | 16.48 |
| 3 | 42.97 |
| 4 | 96.41 |
| 5 | 0 |
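The piece doing the heavy lifting is Postgres's `DISTINCT ON`, which keeps only the first row it sees per `user_id`. Because the inner `ORDER BY user_id, LTV DESC` sorts each real sum above its padded 0, the 0 only survives when no real sum exists. Here's that behavior in isolation:

```sql
SELECT DISTINCT ON (user_id) user_id, LTV
FROM (VALUES (1, 112.50), (1, 0), (5, 0)) AS t (user_id, LTV)
ORDER BY user_id, LTV DESC;
-- keeps (1, 112.50) and (5, 0); the padded (1, 0) is discarded
```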
The last option is to skip cleaning at the query level and fill in the gaps afterwards, in application code. In TypeScript, that might look like this:

```typescript
interface Profile {
  id: string;
}

interface DataResponseRow {
  id: string;
  data: number[];
}

function calculateLTV(
  queryResults: DataResponseRow[],
  aggregationMethod: string,
  profiles: Profile[]
) {
  const responsesById: { [id: string]: number[] } = {};
  for (const profile of profiles) {
    if (aggregationMethod === "count" || aggregationMethod === "sum") {
      // Default all entries for counted or summed aggregations to 0;
      // replaced below if the query returned a row for this profile.
      responsesById[profile.id] = [0];
    }
  }
  for (const row of queryResults) {
    responsesById[row.id] = row.data;
  }
  return responsesById;
}
```
🐍 Note: There are many Python libraries that help with data cleaning and analytics. Two popular choices are pandas and matplotlib. However, this is a super straightforward solution using plain-vanilla Python.
```python
def calculate_ltv(query_results, aggregation_method, profiles):
    responses_by_id = {}
    for profile in profiles:
        if aggregation_method == "count" or aggregation_method == "sum":
            # Default all entries for counted or summed aggregations to 0;
            # replaced below if the query returned a row for this profile.
            responses_by_id[profile["id"]] = [0]
    for row in query_results:
        responses_by_id[row["id"]] = row["data"]
    return responses_by_id

mock_query_results = [
    {"id": 1, "data": [112.50]},
    {"id": 2, "data": [16.48]},
    {"id": 3, "data": [42.97]},
    {"id": 4, "data": [96.41]},
]
mock_aggregation_method = "sum"
mock_profiles = [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}, {"id": 5}]

print(calculate_ltv(mock_query_results, mock_aggregation_method, mock_profiles))
# {1: [112.5], 2: [16.48], 3: [42.97], 4: [96.41], 5: [0]}
```