apache pig - How Can I Load Every File In a Folder Using PIG?

Question

asked Oct 24, 2021 in Technique by 深蓝 (71.8m points)

I have a folder of files, created daily, that all store the same type of information. I'd like to write a script that loads the newest 10 of them, UNIONs them, and then runs some other code on the result. Since Pig already has an ls command, I was wondering whether there is a simple way to get the 10 most recently created files and load them all under generic names using the same loader and options. I'm guessing it would look something like:

REGISTER /usr/local/lib/hadoop/hadoop-lzo-0.4.13.jar;
REGISTER /usr/local/lib/hadoop/elephant-bird-2.0.5.jar;
FOREACH file in some_path:
    file = LOAD 'file' 
    USING com.twitter.elephantbird.pig.load.LzoTokenizedLoader('\t') 
    AS (i1, i2, i3);

1 Answer

深蓝 · Answer 1 · 2021-10-23T19:18:17+0000

I haven't been able to do this out of the box in Pig, but it can be done outside the script with a wrapper or helper script (bash, perl, etc.). Suppose you write a script called last10.sh that outputs your last 10 files, comma separated:

$ ./last10.sh
/input/file38,/input/file39,...,/input/file48

Something like this should do the trick for the most recent 10 files:

hadoop fs -ls /input/ | sort -k6,7 | tail -n10 | awk '{print $8}' | tr '\n' ','
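A minimal sketch of what last10.sh might contain, assuming the listing comes from hadoop fs -ls with the date, time, and path in columns 6, 7, and 8. The function name newest_paths is hypothetical. It differs from the one-liner in two small ways: it drops the "Found N items" header line, and it joins with paste rather than tr, so the list has no trailing comma. Note that the listing reports modification time, which for files written once per day should match creation order.

```shell
#!/bin/sh
# Sketch of last10.sh. newest_paths is a hypothetical helper name.
# It reads `hadoop fs -ls` output on stdin and prints the N newest
# paths (default 10) as one comma-separated line.
newest_paths() {
    grep -v '^Found ' |          # drop the "Found N items" header
        LC_ALL=C sort -b -k6,7 | # order by date (col 6) and time (col 7)
        tail -n "${1:-10}" |     # keep the N most recent entries
        awk '{print $8}' |       # keep only the path column
        paste -sd, -             # join lines with commas, no trailing comma
}

# Intended use inside last10.sh (assumes the /input/ directory from above):
#   hadoop fs -ls /input/ | newest_paths 10
```

Reading the listing from stdin keeps the function independent of the cluster, so it can be tried against a saved listing before it is pointed at HDFS.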

With that script in place, you could run:

$ pig -p files="$(./last10.sh)" my_mr.pig

Then, in your Pig script, do:

data = LOAD '$files'
       USING com.twitter.elephantbird.pig.load.LzoTokenizedLoader('\t')
       AS (i1, i2, i3);

Pig loads each of the files when the path is a comma-separated list like this. This would be equivalent to:

data = LOAD '/input/file38,/input/file39,...,/input/file48'
       USING com.twitter.elephantbird.pig.load.LzoTokenizedLoader('\t')
       AS (i1, i2, i3);
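An empty listing would hand Pig an empty $files value, so the invocation could be guarded in the wrapper. A rough sketch, where run_with_files and the PIG_CMD override are hypothetical names introduced here (the override exists only so the function can be exercised without a cluster); -p is the same parameter flag used in the pig command earlier:

```shell
#!/bin/sh
# Hypothetical wrapper: refuse to start Pig when the file list is empty.
# $1 = comma-separated input paths, $2 = Pig script to run.
run_with_files() {
    if [ -z "$1" ]; then
        echo "no input files found" >&2
        return 1
    fi
    # PIG_CMD is an assumed override hook for testing; it defaults to pig.
    "${PIG_CMD:-pig}" -p "files=$1" "$2"
}

# Intended use:
#   run_with_files "$(./last10.sh)" my_mr.pig
```

Failing before Pig starts keeps the error message close to its cause, which is the helper script rather than the loader.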
