It seems so simple, yet I'm stuck.
I want to limit my results to a certain number of records, say 1000, per distinct value in column A. So for value 1 I need at most 1000 rows, for value 2 as well, and so on. How do I write my limit clause in Databricks SQL?
Column A | Column B | Column C |
---|---|---|
1 | Value X | Yes |
1 | Value Y | Yes |
3 | Value X | No |
2 | Value Y | No |
1 | Value X | Yes |
3 | Value Y | Yes |
2 | Value X | Maybe |
2 | Value Y | Maybe |
I’ve tried a lot, so far no success.
Use a window function to achieve this. Change the ORDER BY if needed:
select colA, colB, colC
from (
    select colA, colB, colC,
           row_number() over (partition by colA order by random()) as rn
    from your_table
) t
where rn <= 1000
EDIT: Based on the comments, you want to filter on certain values and use a left join. Note that ORDER BY random() adds a lot of time; if you don't truly need random ordering, order by something else to significantly reduce your query time.
select colA, colB, colC, colN
from (
    select a.colA, a.colB, a.colC, b.colN,
           row_number() over (partition by a.colA order by a.colA) as rn
    from table1 a
    left join table2 b
      on a.colX = b.colY
    where a.colA in (1, 2)
) t
where rn <= 1000
I also came up with ORDER BY RANDOM() as the ORDER BY clause of the window function; an ORDER BY is required for the ROW_NUMBER() window function, which is why I tried it as well.
It is better if you have something meaningful in the table to order by, so that you keep the n most important rows per Column_A value.
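For example, assuming your table has a column that defines importance (the table name your_table and the column importance_score below are placeholders, not from the question), the nested version would look something like this:
select colA, colB, colC
from (
    select colA, colB, colC,
           -- importance_score is a hypothetical column; order by whatever defines
           -- "most important" in your data instead of random() to keep the query cheap
           row_number() over (partition by colA order by importance_score desc) as rn
    from your_table
) t
where rn <= 1000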
Also, Databricks supports the QUALIFY clause, which lets you avoid nesting two queries just to filter on the window function:
WITH
  -- your input ...
  indata(Column_A, Column_B, Column_C) AS (
              SELECT 1, 'Value X', 'Yes'
    UNION ALL SELECT 1, 'Value Y', 'Yes'
    UNION ALL SELECT 3, 'Value X', 'No'
    UNION ALL SELECT 2, 'Value Y', 'No'
    UNION ALL SELECT 1, 'Value X', 'Yes'
    UNION ALL SELECT 3, 'Value Y', 'Yes'
    UNION ALL SELECT 2, 'Value X', 'Maybe'
    UNION ALL SELECT 2, 'Value Y', 'Maybe'
  )
-- real query starts here; replace the indata CTE above with your real table
SELECT *
FROM indata
QUALIFY ROW_NUMBER() OVER (PARTITION BY Column_A ORDER BY Column_C) <= 2
;
Column_A | Column_B | Column_C |
---|---|---|
1 | Value X | Yes |
1 | Value Y | Yes |
2 | Value X | Maybe |
2 | Value Y | Maybe |
3 | Value X | No |
3 | Value Y | Yes |
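Applied to the original problem, at most 1000 rows per Column_A, the same QUALIFY pattern against your real table would read roughly like this (your_table is a placeholder name; pick whatever ORDER BY suits your data):
SELECT *
FROM your_table
-- keep at most 1000 rows per distinct Column_A value
QUALIFY ROW_NUMBER() OVER (PARTITION BY Column_A ORDER BY Column_C) <= 1000
;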
As @Jon mentioned, you can also do this with PySpark code, looping over each group:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

def limit_rows_per_group(df, group_column, limit):
    # collect the distinct group values to loop over
    groups = df.select(group_column).distinct().rdd.flatMap(lambda x: x).collect()
    limited_dfs = []
    for group in groups:
        # number the rows within this group and keep only the first `limit`
        window = Window.partitionBy(group_column).orderBy(F.col("ColumnB"))
        df_group = df.filter(F.col(group_column) == group)
        df_group = df_group.withColumn("row_num", F.row_number().over(window))
        df_group_limited = df_group.filter(F.col("row_num") <= limit).drop("row_num")
        limited_dfs.append(df_group_limited)
    # union the per-group results back together
    result_df = limited_dfs[0]
    for limited_df in limited_dfs[1:]:
        result_df = result_df.union(limited_df)
    return result_df

result_df = limit_rows_per_group(df, "ColumnA", 1000)
result_df.show()