how can I use them in the map method?
The major problem is that each file is going to be in one file. So if you have 10M files, you'll have 10M mappers, which doesn't sound terribly reasonable. You may want to considering pre-serializing the files into SequenceFiles
(one image per key-value pair). This will make loading the data into the MapReduce job native, so you don't have to write any tricky code. Also, you'll be able to store all of your data into one SequenceFile, if you so desire. Hadoop handles splitting SequenceFiles quite well.
Basically, the way this works is, you will have a separate Java process that takes several image files, reads the ray bytes into memory, then stores the data into a key-value pair in a SequenceFile. Keep going and keep writing into HDFS. This may take a while, but you'll only have to do it once.
Is it ok to make the input(to the map) as a text file containing all of the photo path(in HDFS) with each a line, and in the map method, load the binary like: photoInputStream = getImageFromHDFS(photopath); (Actually, what is the right method to load file from HDFS during the execution of the map method?)
This is not ok if you have any sort of reasonable cluster (which you should if you are considering Hadoop for this) and you actually want to be using the power of Hadoop. Your MapReduce job will fire off, and load the files, but the mappers will be running data-local to the text files, not the images! So, basically, you are going to be shuffling the image files everywhere since the JobTracker is not placing tasks where the files are. This will incur a significant amount of network overhead. If you have 1TB of images, you can expect that a lot of them will be streamed over the network if you have more than a few nodes. This may not be so bad depending on your situation and cluster size (less than a handful of nodes).
If you do want to do this, you can use the FileSystem
API to create files (you want the open
method).
与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…