The InputFormat is responsible for providing the input splits.
In general, if you have n nodes, HDFS will distribute the file in blocks across those n nodes. When you start a job, Hadoop creates one mapper per input split by default (typically one split per HDFS block, so roughly one mapper per block rather than one per node). Thanks to Hadoop, each mapper is preferably scheduled on a node that already stores its split's data, so the data does not have to travel over the network. This is called data locality; rack awareness is the related mechanism that lets the scheduler fall back to another node on the same rack when no data-local node is available.
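To see this concretely, here is a minimal sketch using the new mapreduce API; the HDFS path /data/input and the class name SplitInspector are made up for the example. It asks a TextInputFormat for its splits and prints, for each split, the hosts that store the underlying block, which is exactly the information the scheduler uses for data-local placement:

    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    public class SplitInspector {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "split-inspector");
            FileInputFormat.addInputPath(job, new Path("/data/input")); // hypothetical path

            // The InputFormat, not the user, decides how the input is split.
            List<InputSplit> splits = new TextInputFormat().getSplits(job);
            System.out.println("Number of splits (= number of map tasks): " + splits.size());

            for (InputSplit split : splits) {
                // getLocations() lists the nodes holding this split's data;
                // Hadoop prefers to run the corresponding mapper on one of them.
                System.out.println(split + " -> " + String.join(",", split.getLocations()));
            }
        }
    }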
So to make a long story short: upload the data into HDFS and start an MR job. Hadoop will take care of the optimized execution.
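As an illustration of that workflow (hypothetical paths /data/input and /data/output, and the classic word-count shape), the upload and the job submission might look like this; note that you never tell Hadoop which node should process which block:

    // Upload the data, then submit the job:
    //   hadoop fs -put local-logs/ /data/input
    //   hadoop jar wordcount.jar WordCount /data/input /data/output
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Each mapper only sees the split that was assigned to it.
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }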