I am working on a cluster where a dataset is kept in HDFS in a distributed manner. Here is what I have:
[hmi@bdadev-5 ~]$ hadoop fs -ls /bdatest/clm/data/
Found 1840 items
-rw-r--r-- 3 bda supergroup 0 2015-08-11 00:32 /bdatest/clm/data/_SUCCESS
-rw-r--r-- 3 bda supergroup 34404390 2015-08-11 00:32 /bdatest/clm/data/part-00000
-rw-r--r-- 3 bda supergroup 34404062 2015-08-11 00:32 /bdatest/clm/data/part-00001
-rw-r--r-- 3 bda supergroup 34404259 2015-08-11 00:32 /bdatest/clm/data/part-00002
....
....
The data is of the form:
[hmi@bdadev-5 ~]$ hadoop fs -cat /bdatest/clm/data/part-00000|head
V|485715986|1|8ca217a3d75d8236|Y|Y|Y|Y/1X||Trimode|SAMSUNG|1x/Trimode|High|Phone|N|Y|Y|Y|N|Basic|Basic|Basic|Basic|N|N|N|N|Y|N|Basic-Communicator|Y|Basic|N|Y|1X|Basic|1X|||SAM|Other|SCH-A870|SCH-A870|N|N|M2MC|
What I want to do is count the total number of lines in the original data file data. My understanding is that the distributed chunks like part-00000, part-00001, etc. have overlaps, so just counting the lines in each part-xxxx file and summing them won't work. Also, the original dataset data is ~70 GB in size. How can I efficiently find out the total number of lines?
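For reference, the only approach I can think of so far is to stream everything through a single client and count the lines there (assuming hadoop fs -cat accepts the part-* glob, which I believe it does):

hadoop fs -cat /bdatest/clm/data/part-* | wc -l

But that pulls the full ~70 GB over the network to one machine, which seems very slow, so I am hoping there is a more distributed way (e.g. a MapReduce or Spark job) to do the count.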