I'm on Spark 2.2.0, running on EMR. I have a big DataFrame `df` (about 40 GB of Snappy-compressed files) which is partitioned by keys `k1` and `k2`.

When I query with `k1 === v1` or `(k1 === v1 && k2 === v2)`, I can see that Spark only reads the files in the matching partitions (about 2% of the files).

However, if I cache or persist `df`, those same queries suddenly hit all the partitions, and either blow up memory or become much less performant.

This was a big surprise - is there any way to cache while preserving the partitioning information?
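For context, the behavior can be reproduced with a short sketch (the path and filter values here are hypothetical, and the workaround at the end is one common approach, not an official fix):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("pruning-demo").getOrCreate()

// Data laid out on disk as .../k1=<v>/k2=<v>/part-*.snappy.parquet
val df = spark.read.parquet("s3://my-bucket/my-table")

// Before caching: the filter is pushed into the file scan, and Spark
// lists only the matching partition directories (partition pruning),
// so only ~2% of the files are read.
df.filter(col("k1") === "v1" && col("k2") === "v2").count()

// After caching, the plan reads from an InMemoryRelation instead of
// the files. The cache stores row batches, not partition directories,
// so the directory-level pruning no longer applies and the filter is
// evaluated against the whole cached dataset.
df.cache()
df.filter(col("k1") === "v1").count()

// One workaround: cache only the slice that is queried repeatedly,
// so the cached data is already restricted to the partitions of interest.
val slice = df.filter(col("k1") === "v1").cache()
slice.count() // materializes the small cached subset
```

Comparing `df.filter(...).explain()` before and after `cache()` shows the scan node change from a pruned `FileScan` to an `InMemoryTableScan`, which is consistent with what I'm observing.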