I need to read a different file in every map(); the files are in HDFS:
import java.io.{BufferedReader, InputStreamReader}
import org.apache.hadoop.fs.Path

val rdd = sc.parallelize(1 to 10000)
val rdd2 = rdd.map { x =>
  // a new FileSystem handle is created for every single element
  val hdfs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI("hdfs://ITS-Hadoop10:9000/"), new org.apache.hadoop.conf.Configuration())
  val path = new Path("/user/zhc/" + x + "/")
  val t = hdfs.listStatus(path)
  val in = hdfs.open(t(0).getPath)
  val reader = new BufferedReader(new InputStreamReader(in))
  var l = reader.readLine()
}
rdd2.count
My problem is that this line
val hdfs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI("hdfs://ITS-Hadoop10:9000/"), new org.apache.hadoop.conf.Configuration())
takes too much running time: every call of map() has to create a new FileSystem value. Can I put this code outside the map() function so it doesn't have to create hdfs every time? Or how else can I read files quickly inside map()?
My code runs on multiple machines. Thank you!
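One idea I have is to use mapPartitions instead of map, so the FileSystem is created only once per partition and reused for every element in that partition. A minimal, untested sketch of what I mean (the URI and paths are the same as above; everything else is just my guess):

// sketch: build the FileSystem once per partition, reuse it for all elements
import java.io.{BufferedReader, InputStreamReader}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val rdd2 = rdd.mapPartitions { iter =>
  val hdfs = FileSystem.get(new java.net.URI("hdfs://ITS-Hadoop10:9000/"), new Configuration())
  iter.map { x =>
    val path = new Path("/user/zhc/" + x + "/")
    val status = hdfs.listStatus(path)
    val in = hdfs.open(status(0).getPath)
    val reader = new BufferedReader(new InputStreamReader(in))
    val line = reader.readLine()   // read the first line, as in my map() above
    reader.close()
    line
  }
}
rdd2.count

Would something like this be the right way to do it, or is there a better pattern?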