The shell (and basically every programming language above assembly) already knows how to loop over the lines in a file; it does not need to know how many lines there will be in order to fetch the next one. Strikingly, in your example, `sed` already does this, so if the shell couldn't do it, you could loop over the output from `sed` instead.
The proper way to loop over the lines in a file in the shell is with `while read`. There are a couple of complications: commonly, you reset `IFS` to avoid having the shell needlessly split the input into tokens, and you use `read -r` to avoid some pesky legacy behavior with backslashes in the original Bourne shell's implementation of `read`, which has been retained for backwards compatibility.
```sh
while IFS='' read -r lineN; do
    # do things with "$lineN"
done <"$1"
```
Besides being much simpler than your `sed` script, this avoids the problem that you read the entire file once to obtain the line count, and then read the same file again and again in each loop iteration. With a typical modern disk, some of the repeated reading will be avoided by caching, but the basic fact remains that reading data from disk is on the order of 1000x slower than not reading it at all when you can avoid it. Especially with a large file, the cache will eventually fill up, so you end up reading in and discarding the same bytes over and over, adding a significant amount of CPU overhead and an even more significant amount of time spent simply waiting for the disk to deliver the bytes you requested.
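For contrast, here is a sketch of the kind of loop being criticized (a hypothetical reconstruction, not necessarily the exact script from the question): the file is read in full once by `wc` just for the count, and then re-read from the top by `sed` on every single iteration.

```shell
# Hypothetical reconstruction of the inefficient pattern.
file=$(mktemp)                      # demo input; normally this would be "$1"
printf 'alpha\nbeta\ngamma\n' >"$file"

lines=$(wc -l <"$file")             # first full pass, only to get the count
i=1
while [ "$i" -le "$lines" ]; do
    # sed re-opens and re-scans the file from the top on every iteration
    lineN=$(sed -n "${i}{p;q;}" "$file")
    printf 'line %s/%s: %s\n' "$i" "$lines" "$lineN"
    i=$((i + 1))
done
rm -f "$file"
```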
In a shell script, you also want to avoid the overhead of an external process when you can. Invoking `sed` (or the functionally equivalent but even more expensive two-process `head -n "$i" | tail -n 1`) thousands of times in a tight loop adds significant overhead for any non-trivial input file. (On the other hand, if the body of your loop could be done in e.g. `sed` or Awk instead, that's going to be a lot more efficient than a native shell `while read` loop, because of the way `read` is implemented. This is why `while read` is also frequently regarded as an antipattern.)
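For instance (a minimal sketch, assuming the loop body is simple text processing), the same per-line work can move into Awk, where a single process handles every line with no per-iteration fork/exec and no shell-level `read` overhead:

```shell
file=$(mktemp)                      # demo input; normally "$1"
printf 'alpha\nbeta\ngamma\n' >"$file"

# One awk process processes all lines; NR is the current line number
awk '{ printf "line %d: %s\n", NR, $0 }' "$file"

rm -f "$file"
```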
The `q` in the `sed` script is a very partial remedy; frequently, you see variations where the `sed` script reads the entire input file through to the end each time, even when it only wants to fetch one of the very first lines out of the file.
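A minimal illustration of the difference: both commands print the same line, but only the second tells `sed` to quit as soon as it has what it needs, instead of scanning the rest of the file.

```shell
file=$(mktemp)
printf 'one\ntwo\nthree\nfour\n' >"$file"

# Without q: prints line 2, but sed still reads to end of file
sed -n '2p' "$file"
# With q: sed quits immediately after printing line 2
sed -n '2{p;q;}' "$file"

rm -f "$file"
```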
With a small input file, the effects are negligible, but perpetuating this bad practice merely because it's not immediately harmful when the input file is small is irresponsible. Don't teach this technique to beginners. At all.
If you really need to display the number of lines in the input file, at least make sure you don't spend a lot of time seeking through to the end just to obtain that number. Maybe `stat` the file and keep track of how many bytes there are on each line, so you can project the number of lines you have left (and instead of `line 1/10345234`, display something like `line 1/approximately 10000000`?) ... or use an external tool like `pv`.
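One way to sketch that estimation idea (the 100-line sample and the average-length projection here are my own assumptions, not a recipe from the question): take the total size in bytes, sample the first few lines to get an average line length, and divide.

```shell
file=$(mktemp)
# demo input: 1000 lines of identical length
i=1
while [ "$i" -le 1000 ]; do printf 'sample line %04d\n' "$i"; i=$((i+1)); done >"$file"

size=$(wc -c <"$file")                       # total bytes (stat -c %s on GNU systems)
sample_bytes=$(head -n 100 "$file" | wc -c)  # bytes in the first 100 lines
avg=$((sample_bytes / 100))                  # rough bytes per line
printf 'approximately %d lines\n' "$((size / avg))"

rm -f "$file"
```

On real input the estimate is only as good as the uniformity of the line lengths, which is exactly why you would label it "approximately".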
Tangentially, there is a vaguely related antipattern you want to avoid, too: reading an entire file into memory when you are only going to process one line at a time. Doing that in a `for` loop also has some additional gotchas, so don't do that, either; see https://mywiki.wooledge.org/DontReadLinesWithFor
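A quick demonstration of why the `for` version misbehaves: the command substitution splits on any whitespace (and glob-expands the resulting words), not on newlines, so multi-word lines fall apart.

```shell
file=$(mktemp)
printf 'one line\nanother line\n' >"$file"

# Antipattern: the unquoted expansion splits on spaces as well as newlines
for word in $(cat "$file"); do
    printf '<%s>\n' "$word"
done
# prints <one>, <line>, <another>, <line> -- four items, not two lines

# Correct: while read hands you each line intact
while IFS='' read -r lineN; do
    printf '<%s>\n' "$lineN"
done <"$file"
# prints <one line>, <another line>

rm -f "$file"
```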