Linux text filtering: diff, uniq, sdiff, less, more, tr, expand, unexpand


In this article we discuss more Linux text filters, which process text data to produce useful information. These include the commands diff, uniq, sdiff, less, more, tr, expand and unexpand.

Table of contents.

Introduction.
uniq.
diff.
sdiff.
less.
more.
tr.
expand.
unexpand.
Summary.
References.

Prerequisites.

Text filtering part 1.

Introduction.

Text filtering is the process of taking an input stream of text and performing conversions on it before sending it to the output stream.

A filter reads from standard input (or from a file), performs an operation on the text, and writes the result to standard output.

Filters are small programs that each perform a single task. They can be viewed as building blocks that are combined, usually with pipes, to build more complex text filters.

uniq

The syntax is as follows,

uniq [OPTION]... [INPUT [OUTPUT]]

uniq stands for unique. This command is used to remove duplicate lines from a text file.

However, uniq only removes duplicates that are adjacent to one another.

An example
Given a text file file.txt with the content

joe@yahoo.com
joe@yahoo.com
doe@yahoo.com
peter@yahoo.com
joe@yahoo.com

To get a unique output we write,

uniq file.txt

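The output

joe@yahoo.com
doe@yahoo.com
peter@yahoo.com
joe@yahoo.com

As you can see, the first duplicate email was removed but the last one remains, because it is not adjacent to the other two. This is one drawback of the uniq command.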

From the text filtering tools learnt so far, can you think of a solution to obtain all unique email addresses?
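
One possible answer, sketched here under the assumption that the order of the addresses does not matter, is to sort the file first so that all duplicates become adjacent, then pass the result to uniq. The scratch directory and the variable names below are only for illustration.

```shell
# Work in a scratch directory so that no existing file is touched.
workdir=$(mktemp -d)
printf '%s\n' joe@yahoo.com joe@yahoo.com doe@yahoo.com \
    peter@yahoo.com joe@yahoo.com > "$workdir/file.txt"

# uniq alone only collapses adjacent duplicates.
adjacent_only=$(uniq "$workdir/file.txt")

# Sorting first places identical lines next to each other,
# so uniq can then remove every duplicate.
all_unique=$(sort "$workdir/file.txt" | uniq)

echo "$all_unique"
rm -r "$workdir"
```

The same result can be obtained with sort -u file.txt. Note that the original line order is lost because of the sort.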

diff

The syntax is as follows,

diff [OPTION]... FILES

The diff command is used to compare two files line by line. It can also be used to compare directory contents.

Commonly used options include;

-b: ignore changes in the amount of white space, e.g. spaces or tabs.
-B: ignore changes that only insert or delete blank lines.
-w: ignore all white space when comparing lines.
-y: display output in two columns, side by side.

An example
Given two files t1.txt and t2.txt
t1.txt

Chairs
Tables
Windows
Furniture

t2.txt

Chairs
Tables
Electronics
Cutlery

If we run the command,

diff t1.txt t2.txt

The output

3,4c3,4
< Windows
< Furniture
---
> Electronics
> Cutlery

The output is prescriptive, that is, it tells the user how to change the first file to make it identical to the second.

The first line of the output consists of line numbers from the first file, a letter (a - add, c - change, d - delete) and line numbers from the second file.

In our case the output states that lines 3 and 4 in t1.txt need to be changed to match lines 3 and 4 in t2.txt.
<: this precedes lines from the first file.
>: this precedes lines from the second file.
---: this separates the lines of the two files.

To display output in two columns we can write,

diff -y t1.txt t2.txt

We can also view diff output in context mode, which shows the lines surrounding each change for easier understanding of the output.

diff -c t1.txt t2.txt

From the output,
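The first two lines form a header: the first names the from file t1.txt (marked with ***) and the second names the to file t2.txt (marked with ---), each with its file name and modification time.
A line of asterisks (***************) separates the header from the changes, and each group of changes from the next.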
! indicates that the line is part of a group that needs changing.

We can also view diff output in unified mode, whereby the differences from both files are merged into one set as follows,

diff -u t1.txt t2.txt

From the output,
- marks a line in the first file that should be deleted.
+ marks a line from the second file that should be added to the first.
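
As a rough illustration of how the unified output can be used in a script, the sketch below recreates t1.txt and t2.txt in a scratch directory, saves the unified diff, and then pulls out the removed and added lines. It relies on diff exiting with status 0 when the files are identical and 1 when they differ. The file changes.diff and the variable names are made up for this example.

```shell
# Recreate the two example files in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' Chairs Tables Windows Furniture > "$workdir/t1.txt"
printf '%s\n' Chairs Tables Electronics Cutlery > "$workdir/t2.txt"

# diff exits with 0 if the files are identical and 1 if they differ,
# so the status is captured instead of being treated as a failure.
diff_status=0
diff -u "$workdir/t1.txt" "$workdir/t2.txt" > "$workdir/changes.diff" || diff_status=$?

# Skip the two header lines (--- and +++), then select the lines
# marked for deletion (-) and for addition (+).
removed=$(tail -n +3 "$workdir/changes.diff" | grep '^-')
added=$(tail -n +3 "$workdir/changes.diff" | grep '^+')

echo "exit status: $diff_status"
echo "$removed"
echo "$added"
rm -r "$workdir"
```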

To check differences between directories we write,

diff dir1 dir2
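
The following sketch shows what a directory comparison might look like in practice. It builds two small directories in a scratch location (the file names shared.txt and only_here.txt are invented for the example) and adds the -r option to descend into subdirectories and the -q option to report only which files differ.

```shell
# Build two small example directories in a scratch location.
workdir=$(mktemp -d)
mkdir "$workdir/dir1" "$workdir/dir2"
printf 'Chairs\n' > "$workdir/dir1/shared.txt"
printf 'Tables\n' > "$workdir/dir2/shared.txt"
printf 'Windows\n' > "$workdir/dir1/only_here.txt"

# -r compares subdirectories recursively, -q only reports that files differ.
# diff exits with status 1 when it finds differences, hence the "|| true".
report=$(diff -rq "$workdir/dir1" "$workdir/dir2" || true)

echo "$report"
rm -r "$workdir"
```

The report should contain one line for the file that exists in both directories with different content, and one line for the file that exists only in dir1.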

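Some Unix systems, e.g. Solaris, also provide the bdiff command. It is used just like diff but, unlike the latter, it is meant for very large files that diff cannot handle. It is not part of GNU diffutils, so it may be missing on a Linux system.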

sdiff

sdiff is used for showing differences between two files and has the ability to merge interactively.

The syntax is as follows,

sdiff [OPTION]... FILE1 FILE2

It works just like diff -y file1 file2, showing file differences side by side.

An example

sdiff t1.txt t2.txt

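Options such as -b and -B can also be applied to sdiff. Note that in sdiff the option to ignore all white space is -W, because -w has a different meaning.
Additional options include;
-w: used to specify the width of the output in columns (default=130), e.g.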

sdiff -w 100 t1.txt t2.txt

We can also run sdiff interactively by writing,

sdiff -o out.txt t1.txt t2.txt

We include -o out.txt so that the merged output from this interactive session is written to the file out.txt.

less

The less command is used to display the contents of a large file in small chunks. That is, given a very large file, less does not load all of it into memory before displaying it; it reads the file a chunk at a time as we move through it.

The syntax is as follows,

less filename

To display line numbers we use the -N option,

less -N file.txt

We can also search within less using the / character. That is, after starting less, we type / followed by the word we are searching for and press Enter. If the word is found, the matches are highlighted. To jump to the next match press n, and for the previous match press N.

We can also start searching a file immediately by passing the search term with the -p option as follows,

less -p searchterm file.txt

We can also open multiple files with less using the following syntax,

less file1.txt file2.txt

Here we move through the first file normally. Once we reach its end, less lets us know, and we can move to the next file by typing :n.

more

more is a simpler version of the less command. It is used to view text files one screen at a time, which helps when a file is large.

It can also accept input from another command and arrange the output in a series of pages.

The syntax is as follows;

more [options] file

An example
To display file contents beginning at line 100, we write,

more +100 file.txt

We can also use more to search for a string as follows,

more +/searchString file.txt

To limit the lines displayed per page, for example to 10 lines only, we write,

more -10 file.txt

tr

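tr stands for translate. This command translates or deletes characters from its input, for example converting text from lowercase to uppercase or vice versa. It reads from standard input only, which is why the examples below pipe the file into it.

The syntax is as follows,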

tr [OPTION]... SET1 [SET2]

An example
To translate the characters a, b, c and d to uppercase in a text file we write,

cat file.txt | tr 'abcd' 'ABCD'

To translate them back to lowercase we write,

cat file.txt | tr 'ABCD' 'abcd'

We can also use it to delete characters,

cat file.txt | tr -d w

The above command deletes every letter w from the output. The file itself is not modified.
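
The sets given to tr do not have to list every character. As a short sketch, the commands below use a range (a-z), the character classes [:upper:] and [:lower:], and the -s option, which squeezes repeated characters into one. The sample strings and variable names are only for illustration.

```shell
# Translate a whole range of characters instead of listing each one.
upper=$(printf 'hello world\n' | tr 'a-z' 'A-Z')

# Character classes do the same job without spelling out the range.
lower=$(printf 'HELLO World\n' | tr '[:upper:]' '[:lower:]')

# -d deletes every occurrence of the listed characters.
no_w=$(printf 'hello world\n' | tr -d 'w')

# -s squeezes each run of the listed character into a single one.
squeezed=$(printf 'a   b    c\n' | tr -s ' ')

echo "$upper"
echo "$lower"
echo "$no_w"
echo "$squeezed"
```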

expand

While working with files you may find that a file contains tabs whereas you need spaces. The expand command is used to convert tabs to spaces in files.

The syntax is as follows,

expand [OPTION]... [FILE]...

Given the file t3.txt, we can see that there are tabs between each column.

column1         column2         column3
name            address         phonenumber
john            24th street     1234567

This can be verified by running the command cat -vet t3.txt, which displays each tab as ^I and marks the end of each line with $.
To change tabs to spaces we write,

expand t3.txt

We can also specify the tab size, e.g. to place tab stops every 2 columns instead of the default 8 we write,

expand -t2 t3.txt
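
To make the effect of the tab size visible, the sketch below writes a single tab-separated line to a scratch copy of t3.txt and expands it twice. Each tab is replaced by as many spaces as are needed to reach the next tab stop, so the number of spaces varies with the position of the tab. The variable names are only for illustration.

```shell
# Create a one-line, tab-separated sample file in a scratch directory.
workdir=$(mktemp -d)
printf 'name\taddress\tphonenumber\n' > "$workdir/t3.txt"

# Default: tab stops every 8 columns.
# "name" ends at column 4, so 4 spaces are needed to reach column 8.
default_stops=$(expand "$workdir/t3.txt")

# Tab stops every 2 columns.
# "name" ends at column 4, so 2 spaces are needed to reach column 6.
narrow_stops=$(expand -t 2 "$workdir/t3.txt")

echo "$default_stops"
echo "$narrow_stops"
rm -r "$workdir"
```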

unexpand

Conversely, the unexpand command converts spaces into tabs.
The syntax is as follows,

unexpand [OPTION]... [FILE]...

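To convert leading spaces to tabs in the file t3.txt we write,

unexpand t3.txt

By default only the blanks at the beginning of a line are converted. To convert runs of spaces anywhere in a line we add the -a option.

We can also specify the tab size, that is, the number of columns between tab stops, as follows,

unexpand -t3 t3.txt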
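
The sketch below illustrates how unexpand treats leading and embedded spaces differently. It assumes the behaviour described in the GNU and POSIX documentation: without options only leading blanks are converted, -a also converts runs of two or more blanks that end at a tab stop, and -t enables -a as well. The sample strings and variable names are invented for the example.

```shell
# Eight leading spaces reach the first default tab stop, so they become one tab.
leading=$(printf '        cd\n' | unexpand)

# Spaces in the middle of a line are left alone by default...
embedded_default=$(printf 'ab      cd\n' | unexpand)

# ...but -a converts them when the run of spaces ends at a tab stop.
embedded_all=$(printf 'ab      cd\n' | unexpand -a)

# With tab stops every 4 columns, the two spaces after "ab" end at column 4.
# -t also enables -a, so they are converted to a tab.
every_four=$(printf 'ab  cd\n' | unexpand -t 4)

# od -c prints tabs as \t so that the conversion can be seen.
printf '%s\n' "$leading" "$embedded_default" "$embedded_all" "$every_four" | od -c
```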

Summary.

Filters can be used to process information in very useful ways by restructuring output to generate useful information or text modifications.

Note that some of these commands, maybe all, can also be executed in Git Bash, which can be installed in a Windows environment.

References.

For each of the commands you can type command --help for reference.
Linux text filtering: cat, tac, od, wc, head, tail, sort, cut