When reading from a pipe, `sort` assumes that the input is small, and for small inputs parallelism isn't helpful. To get `sort` to use parallelism, you need to tell it to allocate a large main-memory buffer with `-S`. In this case the data file is about 8GB, so you can use `-S8G`. However, at least on your system with 128GB of main memory, method 2 may still be faster.
This is because `sort` in method 2 can tell from the size of the file that it is huge, and it can seek in the file (neither of which is possible for a pipe). Further, since you have so much memory compared to these file sizes, the data for `myBigFile.tmp` need not be written to disk before `awk` exits, and `sort` will be able to read the file from cache rather than from disk. So the principal difference between method 1 and method 2 (on a machine like yours with lots of memory) is that `sort` in method 2 knows the file is huge and can easily divide up the work (possibly using seek, but I haven't looked at the implementation), whereas in method 1 `sort` has to discover that the data is huge, and it cannot use any parallelism in reading the input since it can't seek in the pipe.
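
To make the two methods concrete, here is a minimal sketch of both. I don't have your exact commands, so treat the details as assumptions: the `awk` programs, the generated sample data, and every file name other than `myBigFile.tmp` are placeholders, and the buffer is set to a small `-S 64M` so the script runs quickly on a toy input.

```shell
#!/bin/sh
# Sketch only: the awk programs and most file names are hypothetical
# stand-ins, not the commands from the question.
set -eu
workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

# Small generated sample so the sketch runs quickly (the real data is ~8GB).
awk 'BEGIN { for (i = 1; i <= 100000; i++) print (i * 7919) % 100003 }' \
    > "$workdir/input.txt"

# Method 1: awk pipes straight into sort. sort cannot see the input size,
# so the buffer size is given explicitly with -S.
awk '{ print $1 }' "$workdir/input.txt" \
    | sort -n -S 64M > "$workdir/sorted_pipe.txt"

# Method 2: awk writes a temporary file first, then sort reads that file.
# sort can check the file size and seek in it.
awk '{ print $1 }' "$workdir/input.txt" > "$workdir/myBigFile.tmp"
sort -n "$workdir/myBigFile.tmp" > "$workdir/sorted_file.txt"

# Both methods should produce identical output.
if cmp -s "$workdir/sorted_pipe.txt" "$workdir/sorted_file.txt"; then
    result="outputs match"
else
    result="outputs differ"
fi
echo "$result"
```

For the real data you would replace `-S 64M` with `-S8G` in method 1 and time each method (for example with `time`) to see which is faster on your machine.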