Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
1.0k views
in Technique[技术] by (71.8m points)

apache pig - How Can I Load Every File In a Folder Using PIG?

I have a folder of files created daily that all store the same type of information. I'd like to make a script that loads the newest 10 of them, UNIONs them, and then runs some other code on them. Since pig already has an ls method, I was wondering if there was a simple way for me to get the last 10 created files, and load them all under generic names using the same loader and options. I'm guessing it would look something like:

REGISTER /usr/local/lib/hadoop/hadoop-lzo-0.4.13.jar;
REGISTER /usr/local/lib/hadoop/elephant-bird-2.0.5.jar;
FOREACH file in some_path:
    file = LOAD 'file' 
    USING com.twitter.elephantbird.pig.load.LzoTokenizedLoader('\t') 
    AS (i1, i2, i3);
See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

This is not something I've been able to do out of the box, and is something that can be done outside of the script with some sort of wrapper script or helper script (bash, perl, etc.). If you write a script, called last10.sh, that would output your last 10 files, comma separated:

$ ./last10.sh
/input/file38,/input/file39,...,/input/file48

Something like this should do the trick for the most recent 10 files:

hadoop fs -ls /input/ | sort -k6,7 | tail -n10 | awk '{print $8}' | tr '
' ','

you could do:

$ pig -p files="`last10.sh`" my_mr.pig

Then, in your pig script, do:

data = LOAD '$files'
       USING com.twitter.elephantbird.pig.load.LzoTokenizedLoader('\t')
       AS (i1, i2, i3);

Pig loads up the separate files if they are comma separated like this. This would be equivalent to doing:

data = LOAD '/input/file38,/input/file39,...,/input/file48'
       USING com.twitter.elephantbird.pig.load.LzoTokenizedLoader('\t')
       AS (i1, i2, i3);

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...