Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
569 views
in Technique[技术] by (71.8m points)

hadoop - Spark RDD - is partition(s) always in RAM?

We all know Spark does the computation in memory. I am just curious on followings.

  1. If I create 10 RDD in my pySpark shell from HDFS, does it mean all these 10 RDDs data will reside on Spark Workers Memory?

  2. If I do not delete RDD, will it be in memory forever?

  3. If my dataset(file) size exceeds available RAM size, where will data to stored?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

If I create 10 RDD in my pySpark shell from HDFS, does it mean all these 10 RDD data will reside on Spark Memory?

Yes, All 10 RDDs data will spread in spark worker machines RAM. but not necessary to all machines must have a partition of each RDD. off course RDD will have data in memory only if any action performed on it as it's lazily evaluated.

If I do not delete RDD, will it be in memory forever?

Spark Automatically unpersist the RDD or Dataframe if they are no longer used. In order to know if an RDD or Dataframe is cached, you can get into the Spark UI -- > Storage table and see the Memory details. You can use df.unpersist() or sqlContext.uncacheTable("sparktable") to remove the df or tables from memory. link to read more

If my dataset size exceeds available RAM size, where will data to stored?

If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time, when they're needed. link to read more

If we are saying RDD is already in RAM, meaning it is in memory, what is the need to persist()? --As per comment

To answer your question, when any action triggered on RDD and if that action could not find memory, it can remove uncached/unpersisted RDDs.

In general, we persist RDD which need a lot of computation or/and shuffling (by default spark persist shuffled RDDs to avoid costly network I/O), so that when any action performed on persisted RDD, simply it will perform that action only rather than computing it again from start as per lineage graph, check RDD persistence levels here.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...