Linux text filtering: split, join, comm, cmp, fmt, paste

This article continues from the first two parts and covers more Linux text filtering commands. We look at how to split, join, compare, format and merge data on the command line to get the information into the shape we need.

Table of contents.

Introduction.
split.
join.
comm.
cmp.
fmt.
paste.
Summary.
References.

Introduction.

Text filtering is the process of taking an input stream of text, transforming it, and sending the result to an output stream. A text filter reads standard input (or a file), performs an operation on it, and writes the result to standard output.

split

The split command breaks a file into smaller files. By default each output file holds 1000 lines.
The syntax is as follows,

split [OPTION]... [FILE [PREFIX]]

To test this, create a file with more than 2000 lines. For example, run a loop 2500 times, printing a short line of text each time, and redirect (>) the output to largeFile.txt.
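A minimal sketch of one way to build such a file. The file name largeFile.txt comes from the text above; the count of 2500 and the content of each line are arbitrary choices made for illustration.

```shell
# Print a short line 2500 times and redirect everything to largeFile.txt.
for i in $(seq 1 2500); do
    echo "line $i"
done > largeFile.txt

# Count the lines to confirm the file is large enough to be split.
line_count=$(wc -l < largeFile.txt)
echo "largeFile.txt has $line_count lines"
```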
Now, to split the file write,

split --verbose largeFile.txt

We use verbose mode to see what the command is doing.
The output is as follows,

split --verbose largeFile.txt
creating file 'xaa'
creating file 'xab'
creating file 'xac'

From the output we can see that largeFile.txt, which has 2500 lines, is split into three files: xaa with 1000 lines, xab with 1000 lines and xac with 500 lines.

We can also specify the number of lines per file by writing,

split --verbose -l500 out.txt

Here each output file holds 500 lines, so a 2500-line out.txt is split into five files.
To check the number of lines in a file, use the wc -l file command.
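As a rough check of that arithmetic, the sketch below splits a 2500-line stand-in for out.txt and counts the pieces. Generating the file with seq is an assumption, since the article does not say what out.txt contains. The commands run in a temporary directory so no existing files are touched.

```shell
workdir=$(mktemp -d) && cd "$workdir"

seq 1 2500 > out.txt          # stand-in: 2500 numbered lines
split -l 500 out.txt          # writes xaa, xab, xac, xad, xae

wc -l x??                     # line count of every piece, plus a total
piece_count=$(ls x?? | wc -l)
lines_in_first=$(wc -l < xaa)
echo "$piece_count pieces, $lines_in_first lines in xaa"
```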

We can also split a file by size using the -b option, i.e. -b {num of bytes} for bytes, -b {num of kilobytes}K for kilobytes, -b {num of megabytes}M for megabytes and -b {num of gigabytes}G for gigabytes.
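A small sketch of splitting by size, assuming GNU split. The file data.bin and the prefix part_ are made-up names; the prefix is the optional PREFIX argument from the syntax line above and replaces the default x in the output file names.

```shell
workdir=$(mktemp -d) && cd "$workdir"

head -c 2500 /dev/zero > data.bin   # 2500-byte dummy file
split -b 1K data.bin part_          # with GNU split, K means 1024 bytes

ls -l part_*                        # part_aa, part_ab, part_ac
size_first=$(wc -c < part_aa)       # a full 1024-byte piece
size_last=$(wc -c < part_ac)        # the remainder: 2500 - 2 * 1024 bytes
echo "$size_first $size_last"
```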

We can also split a file into n files by using the -n option as shown,

split --verbose -n7 out.txt

Here we split out.txt into 7 smaller files.
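One detail worth knowing, shown here as a sketch that assumes GNU split: a plain -n 7 divides the file by byte count, so a line can be cut in the middle at a piece boundary, while the l/7 form keeps every line whole. The 2500-line out.txt and the prefixes bytes_ and lines_ are assumptions made for the example.

```shell
workdir=$(mktemp -d) && cd "$workdir"

seq 1 2500 > out.txt              # stand-in for out.txt
split -n 7 out.txt bytes_         # 7 pieces of near-equal byte size
split -n l/7 out.txt lines_       # 7 pieces that each end on a line boundary

byte_pieces=$(ls bytes_* | wc -l)
line_pieces=$(ls lines_* | wc -l)
total_lines=$(cat lines_* | wc -l)   # no line is lost or cut
echo "$byte_pieces $line_pieces $total_lines"
```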

This command is very useful when moving gigabytes of data over a network, e.g. a 4GB ISO file holding an OS.
We can use split to break the file into n chunks or into chunks of a given size in megabytes, send the pieces over the network, then merge them with the cat command. To verify that no bits were lost along the way, we can use the md5sum utility to compare the hash of the merged file with the hash of the original.
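The round trip described above can be sketched as follows. The names big.iso, chunk_ and rebuilt.iso are hypothetical, and a 5 MB file of random bytes stands in for a real image so the example runs quickly; the network transfer itself is left out.

```shell
workdir=$(mktemp -d) && cd "$workdir"

head -c 5000000 /dev/urandom > big.iso   # small stand-in for a large image
split -b 1M big.iso chunk_               # chunk_aa, chunk_ab, ...

# ... send the chunk_* files over the network, then on the receiving side:
cat chunk_* > rebuilt.iso                # the glob expands in sorted order

# Hash both files; reading from stdin keeps the file name out of the output.
original_sum=$(md5sum < big.iso)
rebuilt_sum=$(md5sum < rebuilt.iso)
if [ "$original_sum" = "$rebuilt_sum" ]; then
    echo "checksums match"
fi
```

In practice the hash of the original would be computed on the sending machine and compared by hand or by script on the receiving one.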

join

This command merges two files on a common field that links related lines in both files. This is similar to how SQL joins work.
The syntax is as follows,

join [OPTION]... FILE1 FILE2

We have the following two files,
t1.txt

1. Prolog
2. Haskell
3. Java
4. Fortran
5. Perl

t2.txt

1. Logic
2. Functional
3. OO
4. Procedural
5. Scripting

To join them on a common field, here the line number in the first column, we write the following (join expects both files to be sorted on that field),

join t1.txt t2.txt > join.txt

Now join.txt will have both files combined.
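To make the result concrete, this sketch recreates the two files in a temporary directory and prints join.txt. By default join matches on the first whitespace-separated field, so the "1.", "2." and so on act as the key.

```shell
workdir=$(mktemp -d) && cd "$workdir"

printf '%s\n' '1. Prolog' '2. Haskell' '3. Java' '4. Fortran' '5. Perl' > t1.txt
printf '%s\n' '1. Logic' '2. Functional' '3. OO' '4. Procedural' '5. Scripting' > t2.txt

join t1.txt t2.txt > join.txt
cat join.txt
# 1. Prolog Logic
# 2. Haskell Functional
# 3. Java OO
# 4. Fortran Procedural
# 5. Perl Scripting

first_line=$(head -n 1 join.txt)
joined_count=$(wc -l < join.txt)
```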

If the files don't have the same number of lines, e.g. one has 5 and the other has 6, join works as before and ignores the lines that have no match. If it is important to include these unmatched lines, we can use the -a option followed by the number of the file (1 or 2) whose unmatched lines should be kept, as follows,

join -a 1 t1.txt t2.txt

This way the unmatched lines from the first file are printed as well.

To view only the unmatched lines we use the -v option followed by the file number,

join -v 1 t1.txt t2.txt
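A short sketch of the difference between -a and -v. The sixth line "6. Rust" is an invented extra entry in t1.txt with no partner in t2.txt.

```shell
workdir=$(mktemp -d) && cd "$workdir"

printf '%s\n' '1. Prolog' '2. Haskell' '3. Java' '4. Fortran' '5. Perl' '6. Rust' > t1.txt
printf '%s\n' '1. Logic' '2. Functional' '3. OO' '4. Procedural' '5. Scripting' > t2.txt

join t1.txt t2.txt          # 5 lines: "6. Rust" is dropped
join -a 1 t1.txt t2.txt     # 6 lines: the joined pairs plus "6. Rust"
join -v 1 t1.txt t2.txt     # 1 line: only "6. Rust"

with_extra=$(join -a 1 t1.txt t2.txt | wc -l)
only_unmatched=$(join -v 1 t1.txt t2.txt)
```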

comm

The comm command is used to compare two sorted files line by line.
The syntax is as follows,

comm [OPTION]... FILE1 FILE2

An example
Given two files t1.txt and t2.txt,

t1.txt

Java
Python
C++
Perl

t2.txt

Python
C#
Javascript
Perl

comm expects both files to be sorted. We can sort the files first with the sort command, or use the --nocheck-order option to skip the order check, although the results for unsorted input may not be meaningful,

comm --nocheck-order t1.txt t2.txt

However, if both files are already sorted (here t3.txt and t4.txt), we can simply run,

comm t3.txt t4.txt

If the command runs successfully, we should expect three columns: the first holds lines unique to the first file, the second holds lines unique to the second file, and the third holds lines common to both files.

To get only common lines we can suppress columns one and two by using -12 as follows,

comm -12 t3.txt t4.txt

To suppress 1 and 3 we use -13 and so on.
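A sketch of the column options on the language lists above. It assumes that t3.txt and t4.txt are sorted copies of t1.txt and t2.txt, which the article does not state, and it sets LC_ALL=C so that sort and comm agree on the ordering of characters such as + and #.

```shell
workdir=$(mktemp -d) && cd "$workdir"
export LC_ALL=C                       # same byte-wise ordering for sort and comm

printf '%s\n' Java Python C++ Perl | sort > t3.txt
printf '%s\n' Python 'C#' Javascript Perl | sort > t4.txt

comm t3.txt t4.txt                    # all three columns
common=$(comm -12 t3.txt t4.txt)      # column 3 only: lines in both files
only_first=$(comm -23 t3.txt t4.txt)  # column 1 only: lines unique to t3.txt
only_second=$(comm -13 t3.txt t4.txt) # column 2 only: lines unique to t4.txt
echo "$common"
```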

cmp

This command is useful for byte-by-byte comparison of two files.
The syntax is as follows,

cmp [OPTION]... FILE1 [FILE2 [SKIP1 [SKIP2]]]

To compare t1.txt and t2.txt we write,

cmp t1.txt t2.txt

When a difference is found, cmp prints the byte and line number of the first difference; if the files are identical, nothing is printed.

We can also print the differing bytes by using the -b option,

cmp -b t1.txt t2.txt

To print out all differing bytes we use the -l option,

cmp -l t1.txt t2.txt

The command outputs the byte number and the byte values (in octal) for every differing byte.

We can also opt to skip the first n bytes of both files by using the -i option as follows,

cmp -i 10 t1.txt t2.txt

The above command skips the first 10 bytes of each file before comparing.

To set the number of skipped bytes for each file separately we use the form SKIP1:SKIP2,

cmp -i 10:10 t1.txt t2.txt

The above command skips the first 10 bytes of t1.txt and the first 10 bytes of t2.txt; the two numbers may differ.
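The effect of -i is easier to see with files built for the purpose. In this sketch the file names, their contents and the check helper are all invented for illustration, and the -s option (silent, not used elsewhere in this article) makes cmp report its result only through the exit status.

```shell
workdir=$(mktemp -d) && cd "$workdir"

printf 'AAAAAAAAAAhello\n' > a.txt   # 10 bytes of "A", then "hello"
printf 'BBBBBBBBBBhello\n' > b.txt   # 10 bytes of "B", then "hello"
printf 'CCCCChello\n'      > c.txt   # only 5 bytes of "C", then "hello"

# Hypothetical helper: turn cmp's exit status into a word.
check() { if cmp -s "$@"; then echo same; else echo differ; fi; }

whole=$(check a.txt b.txt)               # differ: the first 10 bytes disagree
skip_both=$(check -i 10 a.txt b.txt)     # same: 10 bytes skipped in each file
skip_each=$(check -i 10:5 a.txt c.txt)   # same: 10 skipped in a.txt, 5 in c.txt
echo "$whole $skip_both $skip_each"
```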

Note: To compare two directories we can use diff -r; some Unix systems also provide a dircmp command for this.

fmt

This command formats text by reflowing it to a specified width.
The syntax is as follows,

fmt [-WIDTH] [OPTION]... [FILE]...

By default, fmt breaks long lines into shorter ones of at most 75 characters; for this we just run fmt without any options.

To specify a width we can use the -w option and set the width we require.
Given the file file.txt

Java, Python, C++, Perl, Javascript, Prolog, Haskell, C, Rust

We can format it as follows,

fmt -w 20 file.txt

In this command we have specified a width using the -w option and set it to 20.
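A quick way to confirm that a width of 20 is respected, sketched with the sample line from file.txt. Where exactly fmt places each line break depends on its line-filling heuristics, so the sketch checks only the length of the longest output line rather than the exact layout.

```shell
workdir=$(mktemp -d) && cd "$workdir"

echo 'Java, Python, C++, Perl, Javascript, Prolog, Haskell, C, Rust' > file.txt

fmt -w 20 file.txt    # the reflowed text, several short lines

# Length of the longest line that fmt produced.
longest=$(fmt -w 20 file.txt | awk '{ if (length($0) > max) max = length($0) } END { print max + 0 }')
echo "longest line: $longest characters"
```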

On the other hand, if the lines in a file are shorter than 75 characters, the 'fmt file.txt' command will combine them into longer lines of up to the default 75 characters.

We can also use the -u option to make spacing uniform when a file has irregular spacing. The output has one space between words and two spaces between sentences.

We can also use the -t option (tagged paragraph mode), which lets the first line of each paragraph have a different indentation from the lines that follow.

The -s option is used to split the lines without joining shorter ones. This prevents shorter lines being combined to form longer ones.
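The -u and -s options can be tried on tiny inputs piped in from printf. The sample strings here are made up for the sketch.

```shell
# -u collapses runs of spaces between words to a single space.
spaced=$(printf 'one    two     three\n' | fmt -u)
echo "$spaced"        # one two three

# Without -s, short lines of the same paragraph are joined ...
joined=$(printf 'short\nlines\n' | fmt)
echo "$joined"        # short lines

# ... while -s leaves them on separate lines.
kept=$(printf 'short\nlines\n' | fmt -s)
echo "$kept"
```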

paste

We use paste to merge the lines of multiple files, either in parallel or sequentially.
The syntax is as follows,

paste [OPTION]... [FILE]...

To merge two files t1.txt and t2.txt we write,

paste t1.txt t2.txt

To merge them sequentially, one file at a time so that each file becomes a single line, we use the -s option,

paste -s t1.txt t2.txt

We can also paste using delimiters as follows,

paste -d "-" t1.txt t2.txt

In the above command the -d option sets the delimiter to - instead of the default tab.
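The three forms side by side, as a sketch on two very small files. The names langs.txt and kinds.txt and their contents are invented so that the output stays short.

```shell
workdir=$(mktemp -d) && cd "$workdir"

printf '%s\n' Java Haskell > langs.txt
printf '%s\n' OO Functional > kinds.txt

# Parallel: line N of each file side by side, separated by a tab.
parallel=$(paste langs.txt kinds.txt)
# Sequential: each file is flattened into one tab-separated line.
serial=$(paste -s langs.txt kinds.txt)
# Parallel again, with - as the delimiter instead of a tab.
dashed=$(paste -d "-" langs.txt kinds.txt)

echo "$dashed"
# Java-OO
# Haskell-Functional
```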

Summary.

Filters let us process information in useful ways: they restructure output or modify text so that the result can be used as input for other commands.

Note that some of these commands, maybe all, can also be run in Git Bash, which can be installed on Windows.

References.

Linux text filtering part 1
Linux text filtering part 2
For each of the commands you can type command --help for reference.