This is not something I've been able to do out of the box, but it can be done outside of the Pig script with a wrapper or helper script (bash, perl, etc.). Suppose you write a script called last10.sh that outputs your last 10 files, comma separated:
$ ./last10.sh
/input/file38,/input/file39,...,/input/file48
Something like this should do the trick for the most recent 10 files:
hadoop fs -ls /input/ | sort -k6,7 | tail -n10 | awk '{print $8}' | tr '\n' ',' | sed 's/,$//'
With that script in place, you could pass its output to Pig as a parameter:
$ pig -p files="`./last10.sh`" my_mr.pig
Then, in your pig script, do:
data = LOAD '$files'
USING com.twitter.elephantbird.pig.load.LzoTokenizedLoader('\t')
AS (i1, i2, i3);
Pig loads up the separate files if they are comma separated like this. This would be equivalent to doing:
data = LOAD '/input/file38,/input/file39,...,/input/file48'
USING com.twitter.elephantbird.pig.load.LzoTokenizedLoader('\t')
AS (i1, i2, i3);
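
For what it's worth, here is a minimal sketch of how the body of last10.sh could look. The function name newest_paths is made up for illustration, and it assumes the usual `hadoop fs -ls` layout (permissions, replication, owner, group, size, date, time, path) and paths without spaces. It reads the listing on stdin, so the text handling can be tried without a cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper for last10.sh (the name newest_paths is an assumption).
# Reads `hadoop fs -ls`-style output on stdin and prints the N most recently
# modified paths as one comma-separated line, with no trailing comma.
newest_paths() {
  local n="${1:-10}"
  # Keep only entry lines (8+ fields); this drops the "Found N items" header.
  awk 'NF >= 8 {print $6, $7, $8}' |
    LC_ALL=C sort -k1,2 |   # sort by date, then time (both sort lexically)
    tail -n "$n" |          # keep the N newest entries
    awk '{print $3}' |      # keep only the path column
    paste -sd, -            # join the lines with commas
}

# Intended use inside last10.sh (needs a working hadoop CLI):
#   hadoop fs -ls /input/ | newest_paths 10
```

Using paste to join the lines avoids the trailing comma that a plain tr would leave behind, so the result can go straight into -p files=... as shown earlier.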