Linux text filtering: diff, uniq, sdiff, less, more, tr, expand, unexpand
In this article we discuss more Linux text filters, commands that process text data to produce useful information. These include diff, uniq, sdiff, less, more, tr, expand and unexpand.
Table of contents.
- Introduction.
- Uniq.
- Diff.
- Sdiff.
- less.
- more.
- tr.
- expand.
- unexpand.
- Summary.
- References.
Prerequisites.
Introduction.
Text filtering is the process of taking an input stream of text and performing conversions on it before sending it to the output stream.
A filter reads from standard input (or from a file), performs an operation on the text and writes the result to standard output.
Filters are small programs that each perform a single task. They can be viewed as building blocks that are combined, usually with pipes, to build more complex text filters.
uniq
The syntax is as follows,
uniq [OPTION]... [INPUT [OUTPUT]]
uniq stands for unique. This command is used to remove duplicate lines from a text file.
However, uniq only removes duplicate lines that are adjacent to one another.
An example
Given a text file file.txt with the content
joe@yahoo.com
joe@yahoo.com
doe@yahoo.com
peter@yahoo.com
joe@yahoo.com
To get a unique output we write,
uniq file.txt
As you can see, the first duplicate email was removed because it was adjacent to the original, but the last one remains. This is one drawback of the uniq command.
From the text filtering tools learnt so far, can you think of a way to obtain all unique email addresses?
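One possible answer, sketched below, is to sort the file first so that all duplicates become adjacent. The scratch directory and variable names are only for illustration; note that sorting changes the original line order.

```shell
# Recreate the sample file from the text in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' joe@yahoo.com joe@yahoo.com doe@yahoo.com peter@yahoo.com joe@yahoo.com > "$workdir/file.txt"

# uniq alone only collapses the adjacent pair, so four lines remain.
adjacent_only=$(uniq "$workdir/file.txt")

# Sorting first makes all duplicates adjacent, so uniq can remove every one.
all_unique=$(sort "$workdir/file.txt" | uniq)

# sort -u does the same in a single step.
all_unique_short=$(sort -u "$workdir/file.txt")

printf '%s\n' "$all_unique"
```

Both variants print doe@yahoo.com, joe@yahoo.com and peter@yahoo.com once each.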
diff
The syntax is as follows,
diff [OPTION]... FILES
The diff command compares two files line by line. It can also compare the contents of two directories.
Commonly used options include;
-b: ignore changes in the amount of white space, e.g. two spaces versus one.
-B: ignore changes that only insert or delete blank lines.
-w: ignore all white space, including white space that is present on one line and absent on the other.
-y: display the output side by side in two columns.
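The difference between -b and -w is easy to miss, so here is a small sketch that compares diff's exit status (0 means no differences found, 1 means differences found). The three one-line files and the status_of helper are made up for this illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample files: same words, different white space.
printf 'red  apple\n' > two_spaces.txt
printf 'red apple\n'  > one_space.txt
printf 'redapple\n'   > no_space.txt

# Helper (illustrative): run a command quietly and print its exit status.
status_of() { "$@" > /dev/null && echo 0 || echo $?; }

plain=$(status_of diff two_spaces.txt one_space.txt)       # 1: files differ
with_b=$(status_of diff -b two_spaces.txt one_space.txt)   # 0: amount of space ignored
b_gap=$(status_of diff -b two_spaces.txt no_space.txt)     # 1: -b still sees the missing space
with_w=$(status_of diff -w two_spaces.txt no_space.txt)    # 0: all white space ignored

echo "plain=$plain -b=$with_b -b(no space)=$b_gap -w=$with_w"
```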
An example
Given two files t1.txt and t2.txt
t1.txt
Chairs
Tables
Windows
Furniture
t2.txt
Chairs
Tables
Electronics
Cutlery
If we run the command,
diff t1.txt t2.txt
The output
3,4c3,4
< Windows
< Furniture
---
> Electronics
> Cutlery
The output is prescriptive, that is, it tells the user how to change the first file to make it identical to the second.
The first line of the output consists of line numbers from the first file, a letter (a - add, c - change, d - delete) and line numbers from the second file.
In our case the output states that lines 3 and 4 in t1.txt need to be changed to match lines 3 and 4 in t2.txt.
<: this precedes lines from the first file.
>: this precedes lines from the second file.
---: this separates the lines of the two files.
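The example above only produces a c (change) command. As a small sketch of the other two letters, the following compares two made-up files, short.txt and long.txt, in both directions.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical files: long.txt has one extra line at the end.
printf '%s\n' Chairs Tables > short.txt
printf '%s\n' Chairs Tables Lamps > long.txt

# diff exits with status 1 when the files differ, hence the "|| true".
added=$(diff short.txt long.txt) || true     # prints: 2a3, then "> Lamps"
deleted=$(diff long.txt short.txt) || true   # prints: 3d2, then "< Lamps"

printf '%s\n' "$added" "$deleted"
```

2a3 reads as "after line 2 of the first file, add line 3 of the second", and 3d2 as "delete line 3 of the first file to match the second file at line 2".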
To display output in two columns we can write,
diff -y t1.txt t2.txt
We can also view diff output in context mode for easier understanding of the output.
diff -c t1.txt t2.txt
From the output,
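The first two lines identify the from file (***), t1.txt, and the to file (---), t2.txt, each with its file name and modification time.
The row of asterisks (***************) marks the start of a group of differences. Inside the group, lines from the first file are listed under a *** header and lines from the second file under a --- header.
! indicates that the line is part of a group that needs changing.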
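For reference, the sketch below recreates t1.txt and t2.txt and prints the context output. It assumes GNU diff, whose --label option replaces the file name and timestamp in the header so the output is the same on every run.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' Chairs Tables Windows Furniture > t1.txt
printf '%s\n' Chairs Tables Electronics Cutlery > t2.txt

# diff exits with status 1 when the files differ, hence the "|| true".
context_output=$(diff -c --label t1.txt --label t2.txt t1.txt t2.txt) || true
printf '%s\n' "$context_output"

# Output with GNU diff:
# *** t1.txt
# --- t2.txt
# ***************
# *** 1,4 ****
#   Chairs
#   Tables
# ! Windows
# ! Furniture
# --- 1,4 ----
#   Chairs
#   Tables
# ! Electronics
# ! Cutlery
```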
We can also view diff output in unified mode, where the differences from both files are merged into a single set as follows,
diff -u t1.txt t2.txt
From the output,
- marks a line that is only in the first file and should be deleted from it.
+ marks a line that is only in the second file and should be added to the first.
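A minimal sketch of the unified output for the same two files follows. As an assumption, it uses GNU diff's --label option to keep timestamps out of the header.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' Chairs Tables Windows Furniture > t1.txt
printf '%s\n' Chairs Tables Electronics Cutlery > t2.txt

# diff exits with status 1 when the files differ, hence the "|| true".
unified_output=$(diff -u --label t1.txt --label t2.txt t1.txt t2.txt) || true
printf '%s\n' "$unified_output"

# Output with GNU diff:
# --- t1.txt
# +++ t2.txt
# @@ -1,4 +1,4 @@
#  Chairs
#  Tables
# -Windows
# -Furniture
# +Electronics
# +Cutlery
```

The @@ -1,4 +1,4 @@ line says that the group covers 4 lines starting at line 1 in each file.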
To check differences in directories we write
diff dir1 dir2
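As a sketch of what a directory comparison reports, the following builds two small directories with made-up files. It adds -r to descend into subdirectories and -q to report only which files differ, assuming GNU diff for the exact message wording.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical layout: a.txt exists in both directories with different
# content, only1.txt exists in dir1 only.
mkdir dir1 dir2
printf 'x\n' > dir1/a.txt
printf 'y\n' > dir2/a.txt
printf 'z\n' > dir1/only1.txt

# diff exits with status 1 when differences are found, hence the "|| true".
report=$(diff -rq dir1 dir2) || true
printf '%s\n' "$report"

# Output with GNU diff:
# Files dir1/a.txt and dir2/a.txt differ
# Only in dir1: only1.txt
```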
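Some Unix systems, such as Solaris, also provide the bdiff command. It is used like diff, but it splits very large files into segments so that files too large for diff can still be compared. It is not part of GNU diffutils, so it is usually not available on Linux.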
sdiff
sdiff is used for showing differences between two files and has the ability to merge interactively.
The syntax is as follows,
sdiff [OPTION]... FILE1 FILE2
It works just like diff -y file1 file2 for showing side by side file differences.
An example
sdiff t1.txt t2.txt
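Options such as -b and -B can also be applied to sdiff. Note that in sdiff the option to ignore all white space is -W, because -w has a different meaning.
Additional options include;
-w: used to specify the output width in columns (default=130), e.g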
sdiff -w 100 t1.txt t2.txt
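A narrower width is often easier to read in a terminal. The sketch below combines -w with -s, a GNU sdiff option that hides the lines both files have in common; the width of 60 is an arbitrary choice for this example.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' Chairs Tables Windows Furniture > t1.txt
printf '%s\n' Chairs Tables Electronics Cutlery > t2.txt

# sdiff exits with status 1 when the files differ, hence the "|| true".
# Only the two changed lines are printed, each marked with a | in the gutter.
changed_only=$(sdiff -s -w 60 t1.txt t2.txt) || true
printf '%s\n' "$changed_only"
```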
We can also run sdiff interactively by writing,
sdiff t1.txt t2.txt -o out.txt
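We include -o out.txt so that the merged result of this interactive session is written to the file out.txt. At each difference sdiff shows a % prompt, where typing l or r keeps the left or right version and q quits.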
less
The less command displays the contents of a file one screen at a time. Given a very large file, less does not load all of it into memory before displaying it. Instead it reads the file in small chunks as we move through it, so it starts quickly.
The syntax is as follows,
less filename
To display line numbers we use the -N option,
less -N file.txt
We can also search within less by using the / character. After starting less, type / followed by the word we are searching for and press Enter; if the word is found, the matches are highlighted. To jump to the next match press n, and for the previous match press N.
We can also start by searching a file immediately by passing the search term with the -p option as follows,
less -p searchterm file.txt
We can also open multiple files with less using the following syntax,
less file1.txt file2.txt
Here we move through the first file normally. When we reach its end, less lets us know, and we can move to the next file by typing :n (and back to the previous one with :p).
more
more is an older and simpler pager than less. It is used to view text files one screen at a time when a file is too large to fit on one screen.
It can also accept input from another command through a pipe and split that output into pages.
The syntax is as follows;
more [options] file
An example
To display file contents beginning at line 100, we write,
more +100 file.txt
We can also use more to search for a string as follows,
more +/searchString file.txt
To limit the number of lines displayed per page, for example to 10 lines, we write,
more -10 file.txt
tr
tr stands for translate. This command translates, squeezes or deletes characters read from standard input, for example converting lowercase to uppercase or vice versa.
tr [OPTION]... SET1 [SET2]
An example
To translate the characters a, b, c and d to uppercase in a text file we write,
cat file.txt | tr 'abcd' 'ABCD'
To lowercase we write,
cat file.txt | tr 'ABCD' 'abcd'
We can also use it to delete characters,
cat file.txt | tr -d w
The above command deletes every letter w from the output; file.txt itself is not modified.
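Listing every letter by hand does not scale, so a common approach is to use character classes. The sketch below, on a made-up sample string, also shows the -s (squeeze) option. Since tr only reads standard input, a file can be passed with tr ... < file.txt instead of cat.

```shell
# Sample text invented for illustration (note the three spaces).
text='Hello   World'

# Character classes cover the whole alphabet.
upper=$(printf '%s\n' "$text" | tr '[:lower:]' '[:upper:]')   # HELLO   WORLD

# -s squeezes each run of the listed character into a single one.
squeezed=$(printf '%s\n' "$text" | tr -s ' ')                 # Hello World

# -d accepts a set, so several characters can be deleted at once.
no_vowels=$(printf '%s\n' "$text" | tr -d 'aeiou')            # Hll   Wrld

printf '%s\n' "$upper" "$squeezed" "$no_vowels"
```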
expand
While working with files you may find that a file contains tabs where you need spaces. The expand command converts tabs to spaces in files.
The syntax is as follows,
expand [OPTION]... [FILE]...
Given the file t3.txt, we can see that there are tabs between each column.
column1 column2 column3
name address phonenumber
john 24th street 1234567
This can be verified by running the command cat -vet t3.txt, which shows each tab as ^I and the end of each line as $.
To change tabs to spaces we write,
expand t3.txt
We can also specify the tab width, e.g. to place tab stops every 2 columns instead of the default 8 we write,
expand -t2 t3.txt
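Note that a tab is replaced by as many spaces as are needed to reach the next tab stop, not by a fixed number of spaces. A small sketch with a made-up one-line file, sample.txt, shows this:

```shell
workdir=$(mktemp -d)

# Hypothetical input: "abc", one tab, then "x".
printf 'abc\tx\n' > "$workdir/sample.txt"

# Default tab stops are every 8 columns: "abc" ends at column 3,
# so 5 spaces are needed to reach column 8.
default_stops=$(expand "$workdir/sample.txt")

# With tab stops every 2 columns the next stop is column 4: 1 space.
narrow_stops=$(expand -t 2 "$workdir/sample.txt")

printf '[%s]\n' "$default_stops" "$narrow_stops"
# [abc     x]
# [abc x]
```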
unexpand
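Conversely, the unexpand command converts spaces into tabs. By default it only converts leading spaces at the start of each line; the -a option converts runs of spaces everywhere in the line.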
The syntax is as follows,
unexpand [OPTION]... [FILE]...
To convert leading spaces to tabs in the file t3.txt we write,
unexpand t3.txt
We can also specify the tab width, which also turns on -a. For example, to use tab stops every 3 columns we write,
unexpand -t3 t3.txt
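The following sketch illustrates the difference between the default behaviour and -a. It uses a made-up line whose runs of spaces end exactly on the default 8-column tab stops and assumes GNU unexpand.

```shell
# Hypothetical input: 8 leading spaces, "abcd", 4 spaces, "x".
original='        abcd    x'

# Default: only the leading run of spaces becomes a tab.
leading_only=$(printf '%s\n' "$original" | unexpand)

# -a: the inner run ends on a tab stop (column 16), so it becomes a tab too.
all_blanks=$(printf '%s\n' "$original" | unexpand -a)

# expand reverses the conversion and restores the original line.
round_trip=$(printf '%s\n' "$all_blanks" | expand)

# od -c makes the tabs visible as \t.
printf '%s\n' "$leading_only" "$all_blanks" | od -c
```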
Summary.
Filters can be used to process information in very useful ways, either by restructuring output to extract useful information or by modifying text.
Note that some of these commands, perhaps all, can also be run in Git Bash, which can be installed in a Windows environment.
References.
- For each of the commands you can type command --help for reference.
- Linux text filtering: cat, tac, od, wc, head, tail, sort, cut