Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python - Fastest way to read file searching for pattern matches

I wrote a Python script to analyze logs. I have one observation to share and two questions to ask.

When I use gzip.open to open each file and iterate over every line, it takes around 200 seconds just to go through all the lines in all the files.

import gzip

with gzip.open(file) as fp:
    for line in fp:
        pass  # no work per line; this only measures the read time

Using zcat and grep to do the same work takes about 50 seconds.

import commands  # Python 2 only; this module was removed in Python 3
temp = commands.getstatusoutput("zcat file* | grep pattern")

The performance difference is too large to ignore. Is there a way to reduce the gap?

I also noticed that the commands module has been made obsolete by the subprocess module, which always seems to create a temporary file. That would be inconvenient: what if it is not possible to create a temporary file from where the Python script is running? Any suggestions?


1 Answer


'grep' contains decades' worth of optimizations, and re-implementing it in any programming language, not just Python, will be slower. *1

Therefore, if speed is important to you, your technique of calling 'grep' directly is probably the way to go. To do this using 'subprocess', without having to write any temporary files, use the 'subprocess.PIPE' mechanism:

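from subprocess import Popen, PIPE

# shell=True lets the shell expand file* and set up the pipe
COMMAND = 'zcat file* | grep oldconfig'
process = Popen(COMMAND, shell=True, stderr=PIPE, stdout=PIPE)
output, errors = process.communicate()  # both are bytes on Python 3
# note: grep exits with status 1 when nothing matches, which trips this assert
assert process.returncode == 0, process.returncode
assert not errors, errors  # truthiness check works for both bytes and str
print('{} lines match'.format(len(output.splitlines())))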

This works for me on Python 3.5. I've avoided using any of the higher-level interfaces added on top of subprocess recently, so it should work fine on older versions of Python too.
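One detail worth handling if you wrap this in a function: grep exits with status 0 when it selected at least one line, 1 when it selected none, and a higher value on error, and a shell pipeline reports the status of its last command. The sketch below treats "no match" as a count of zero instead of a failure. The function name count_grep_matches is made up for illustration, and it assumes the command is a shell pipeline that ends in grep.

```python
from subprocess import Popen, PIPE


def count_grep_matches(command):
    """Run a shell pipeline ending in grep and return how many lines it printed.

    `command` is a shell string such as 'zcat file* | grep oldconfig'.
    (Hypothetical helper, not part of any library.)
    """
    process = Popen(command, shell=True, stdout=PIPE, stderr=PIPE)
    output, errors = process.communicate()
    # The pipeline's status is grep's status:
    # 0 = lines matched, 1 = no lines matched, anything else = error.
    if process.returncode not in (0, 1):
        raise RuntimeError(
            'pipeline failed with status {}: {!r}'.format(
                process.returncode, errors))
    return len(output.splitlines())
```

You would call it as count_grep_matches('zcat file* | grep oldconfig'). Since only grep's status is visible, a failure in zcat may go unnoticed unless you also inspect the captured stderr.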


(*1 For example, even with an empty 'for' loop, as you show in your question, grep is likely to still be faster, because it does not read the input line by line. Instead, it determines the maximum number of characters it can skip forwards through the input, ignoring newlines completely, reads one character after each skip, and checks whether that character could match any part of the regex. Only if it finds a candidate match does it look at the characters surrounding it, to see whether the rest of the regex matches and where the enclosing newlines are. On top of that, its matching machinery is built specifically for the given regex, so it executes around 3 x86 instructions per input byte that it examines, and it skips examining most input bytes completely.)
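If you need to stay in pure Python, the footnote's idea of not doing work per line can be approximated. The following is a minimal sketch and not grep's algorithm: it assumes a literal (non-regex) byte pattern, and the names count_matching_lines, _count_in_block and chunk_size are made up for illustration. It reads the decompressed data in large chunks and lets bytes.find jump from match to match instead of looping over every line. Whether this narrows the gap on your logs is something you would have to measure.

```python
import gzip


def _count_in_block(block, pattern):
    """Count lines containing `pattern` in `block`, a run of complete lines."""
    count = 0
    pos = block.find(pattern)
    while pos != -1:
        count += 1
        # Skip to the end of the matching line so it is counted only once.
        eol = block.find(b'\n', pos)
        if eol == -1:
            break
        pos = block.find(pattern, eol + 1)
    return count


def count_matching_lines(path, pattern, chunk_size=1 << 20):
    """Count lines of a gzip file that contain the literal bytes `pattern`."""
    if not pattern or b'\n' in pattern:
        raise ValueError('pattern must be non-empty and contain no newline')
    count = 0
    tail = b''  # incomplete last line carried over from the previous chunk
    with gzip.open(path, 'rb') as fp:
        while True:
            chunk = fp.read(chunk_size)
            if not chunk:
                break
            block = tail + chunk
            cut = block.rfind(b'\n') + 1  # 0 when the block has no newline
            block, tail = block[:cut], block[cut:]
            count += _count_in_block(block, pattern)
    # A final line without a trailing newline is still a line.
    if pattern in tail:
        count += 1
    return count
```

For instance, count_matching_lines(some_path, b'oldconfig') would return the number of lines containing that literal text in one file, where some_path stands for one of your log files; you would loop over the files yourself and sum the results.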

