I have this dataframe:
+------+
|my_col|
+------+
|202101|
|202209|
+------+
When writing it as a parquet file, I partition it by column "my_col", so I should get two partitions (two parquet files).
Then I will read the saved dataset back, applying a filter.
- Applying the filter .filter("my_col >= 202201"), am I correct to think that the data from the parquet file containing my_col=202101 will not be read into memory?
- Applying the filter .filter("substring(my_col, 1, 4) >= 2022"), will the data from the parquet file containing my_col=202101 be read into memory?
In the latter case I do not filter directly on the partition column's values; instead, a function is applied to the column. I wonder whether partitioning still helps to save on read time in this case.