Using Multiple CPU Cores With Command Line Tools

All of us have heard about how the processors in our laptops have multiple cores, and it’s good that the technology is advancing in that direction. When people write programs, they can use these cores to speed up computation. But most of the built-in command line tools don’t use multiple cores unless you tell them to explicitly. If you ever want to add up a very large list, say hundreds of megabytes, or just look through it to find some particular value, you would write a simple program to do it. But going through that much data takes a long time on a single thread. The same is true for tools like grep, bzip2, wc, awk, sed, etc. If those names are unfamiliar, you should probably look them up before you proceed. They are single-threaded and will only use one CPU core. So how do we use multiple cores in these situations?

Your machine already has multiple cores, so we can put all of them to work with GNU Parallel and a little in-machine map-reduce magic.
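
If you’re not sure how many cores you have to play with, coreutils and GNU Parallel can both tell you (nproc reports the processing units available to the current process):

$ nproc

$ parallel --number-of-cores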

GREP

This is one of the most popular tools that developers use. This tool is used to search for any pattern in a chunk of text. Usually, you would do this:

$ grep pattern inputfile.txt

But if the input file is big, this would be really slow. So if you have an enormous text file, do this instead:

$ cat inputfile.txt | parallel --pipe grep 'pattern'

or you can do this:

$ cat inputfile.txt | parallel --block 4M --pipe grep 'pattern'

The second command splits the input into blocks of roughly 4 MB each (--block takes a size, not a line count, and GNU Parallel never splits a line across blocks). You can play with this parameter to find out how much input you want per CPU core. It’s good practice to distribute the load equally across all your cores.
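
By default, GNU Parallel starts one job per CPU core. If you want to cap that, say to leave a couple of cores free for other work, the -j option sets the maximum number of simultaneous jobs; the 4 below is just an example value:

$ cat inputfile.txt | parallel --pipe --block 4M -j 4 grep 'pattern'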

AWK

Awk is an interpreted language for text processing. It’s very popular in Unix-like operating systems. Let’s say you want to add up the numbers in a file. Usually, you would do this:

$ cat inputfile.txt | awk '{s+=$1} END {print s}'

If it’s a very large file, do this instead:

$ cat inputfile.txt | parallel --pipe awk \'{s+=\$1} END {print s}\' | awk '{s+=$1} END {print s}'

Let’s understand what’s happening here. The --pipe option splits the input into chunks and runs a separate awk call on each one, producing a bunch of sub-totals. These sub-totals then flow into the second, identical awk call, which adds them up into the final total. The three backslashes in the first awk call are there because the quotes and the dollar sign have to be escaped so the shell passes them through to GNU Parallel intact.
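
If all that escaping bothers you, GNU Parallel’s -q option quotes the command for you, so (as a rough equivalent of the command above) you can write the awk program normally:

$ cat inputfile.txt | parallel -q --pipe awk '{s+=$1} END {print s}' | awk '{s+=$1} END {print s}'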

WC

This command line tool counts lines, words, and bytes. If you want to count the number of lines in a file, you would do this:

$ wc -l inputfile.txt

If you have a really big file, use this instead:

$ cat inputfile.txt | parallel --pipe wc -l | awk '{s+=$1} END {print s}'

What’s happening here is that parallel maps a bunch of wc -l calls over chunks of the input, generating sub-totals, and the final pipe into awk adds them up.
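
The same map-reduce trick should work for counting words instead of lines, since --pipe splits the input on newlines by default and so never cuts a word in half:

$ cat inputfile.txt | parallel --pipe wc -w | awk '{s+=$1} END {print s}'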

SED

We use this command line tool, the stream editor, for all kinds of text transformations. If you want to do replacements in a file, you would usually do this:

$ sed 's^old^new^g' inputfile.txt

If the input file is huge, do this instead:

$ cat inputfile.txt | parallel --pipe sed 's^old^new^g'

You can then redirect the output to a file to store the results.
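
One thing to keep in mind: the chunks can finish in any order, so if the order of the output lines matters, add -k (keep order) and redirect to a file, for example:

$ cat inputfile.txt | parallel -k --pipe sed 's^old^new^g' > outputfile.txt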

BZIP2

We know that bzip2 is better at compression than gzip, but it’s slow! You would usually do this:

$ cat inputfile.bin | bzip2 --best > compressedfile.bz2

To speed it up, do this instead:

$ cat inputfile.bin | parallel --pipe --recend '' -k bzip2 --best > compressedfile.bz2

In this command, --recend '' tells GNU Parallel to treat the input as a raw byte stream instead of line-based records, and -k keeps the compressed blocks in their original order. Since concatenated bzip2 streams decompress as a single file, the output is a valid archive. In the case of bzip2, GNU Parallel is dramatically faster on multi-core machines. Give it a shot and enjoy the speed!
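
If you want to verify that the parallel-compressed file is intact, bzip2 can test it without writing anything to disk (the file name here is just the one from the example above):

$ bzip2 -t compressedfile.bz2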
