Rha030 Workbook08 Student 3.0 0

Workbook 8. String Processing Tools

by Red Hat, Inc.
Copyright 2003-2005 Red Hat, Inc.
Revision History
Revision rha030-2.0-2003_11_12-en 2003-11-12
    First Revision
Revision rha030-3.0-0-en-2005-08-17T07:23:17-0400 2005-08-17
    First Revision
Red Hat, Red Hat Network, the Red Hat "Shadow Man" logo, RPM, the RPM logo, PowerTools, and all Red Hat-based trademarks and logos are
trademarks or registered trademarks of Red Hat, Inc. in the United States and other countries.
Linux is a registered trademark of Linus Torvalds.
Motif and UNIX are registered trademarks of The Open Group.
Windows is a registered trademark of Microsoft Corporation.
Intel and Pentium are registered trademarks of Intel Corporation. Itanium and Celeron are trademarks of Intel Corporation.
SSH and Secure Shell are trademarks of SSH Communications Security, Inc.
All other trademarks and copyrights referred to are the property of their respective owners.
Published 2005-08-17
Table of Contents
1. Text Encoding and Word Counting
    Discussion
    Files
    Text Encoding
    Internationalization (i18n)
    Revisiting cat, head, and tail
    The wc (Word Count) Command
    Examples
    Example 1. Counting Characters
    Example 2. Invisible Characters Are Important, Too
    Example 3. What's My Line?
    Example 4. I Want It All
    Example 5. Linux, Dos, and Macintosh Files
    Example 6. Counting Users
    Example 7. Counting Processes
    Online Exercises
    Specification
    Deliverables
    Hints
    Questions
2. Finding Text: grep
    Discussion
    Searching Text File Contents using grep
    Show All Occurrences of a String in a File
    Searching in Several Files at Once
    Searching Directories Recursively
    Inverting grep
    Getting Line Numbers
    Limiting Matching to Whole Words
    Ignoring Case
    Examples
    Example 1. Finding Simple Character Strings
    Example 2. In That Case
    Example 3. Matching Whole Words
    Example 4. Combining grep and xargs
    Specification
    Deliverables
    Questions
3. Introduction to Regular Expressions
    Discussion
    Introducing Regular Expressions
    Regular Expressions, Extended Regular Expressions, and the grep Command
    Anatomy of a Regular Expression
    Taking Literals Literally
    Wildcards
    Common Modifier Characters
    Anchored Searches
    Coming to Terms with Regex Grouping
    Escaping Meta-Characters
    Summary of Linux Regular Expression Syntax
    Regular Expressions are NOT File Globbing
    Where to Find More Information About Regular Expressions
    Examples
    Example 1. Literal Searches
    Example 2. Range Expressions
    Example 3. REGEX Modifiers
    Example 4. Anchored Searches
    Example 5. REGEX Term Grouping
    Example 6. Is elvis in the House?
    Example 7. Searching for Telephone Numbers
    Specification
    Deliverables
    Questions
4. Everything Sorting: sort and uniq
    Discussion
    The sort Command
    The uniq Command
    Examples
    Example 1. Sorting the Output of ps aux
    Example 2. Using sort and uniq to Collect Information on Running Processes
    Specification
    Deliverables
    Questions
5. Extracting and Assembling Text: cut and paste
    Discussion
    The cut Command
    The paste Command
    Examples
    Example 1. Handling Free-Format Records
    Example 2. Living With Fixed-Format Records
    Example 3. Using (and Misusing) a Space as a Delimiter
    Example 4. Examples of Pasting
    Specification
    Deliverables
    Questions
rha030-3.0-0-en-2005-08-17T07:23:17-0400
Copyright (c) 2003-2005 Red Hat, Inc. All rights reserved. For use only by a student enrolled in a Red Hat Academy course taught at a Red Hat
Academy. Any other use is a violation of U.S. and international copyrights. No part of this publication may be photocopied, duplicated,
stored in a retrieval system, or otherwise duplicated whether in electronic or print format without prior written consent of Red Hat, Inc.
If you believe Red Hat course materials are being used, copied, or otherwise improperly distributed please email training@redhat.com
or phone toll-free (USA) +1 866 626 2994 or +1 (919) 754 3700.
6. Tracking differences: diff
    Discussion
    The diff Command
    Output Formats for the diff Command
    How diff Interprets Arguments
    Customizing diff to be Less Picky
    Recursive diffs
    Examples
    Example 1. Using diff to Examine New Configuration Files
    Example 2. Using diff to Examine Recent Changes to /etc/passwd
    Example 3. Creating a Patch
    Online Exercises
    Specification
    Deliverables
    Questions
7. Translating Text: tr
    Discussion
    The tr Command
    Character Specification
    Using tr to Translate Characters
    Using tr to Delete Characters
    Using tr to Squeeze Characters
    Complementing Sets
    One Final Caution: Avoid File Globbing!
    Examples
    Example 1. Using tr to Clean Up the df Command
    Example 2. Using tr to Convert Dos Text Files to Unix
    Example 3. Using tr to Count Word Frequencies
    Example 4. Rot13
    Specification
    Deliverables
    Questions
8. Spell Checking: aspell
    Discussion
    Using aspell
    Performing an Interactive Spell Check
    Performing a Non-interactive Spell Check
    Managing the Personal Dictionary
    Getting Help
    Examples
    Example 1. Adding Service Names to aspell's Personal Dictionary
    Setup
    Specification
    Deliverables
    Questions
9. Formatting Text (fmt) and Splitting Files (split)
    Discussion
    The fmt Command
    The split Command
    Examples
    Example 1. Using fmt to Clean Email
    Example 2. Using "String Processing" Tools to Manipulate Binary Data
    Specification
    Deliverables
    Questions
Chapter 1. Text Encoding and Word Counting

Key Concepts
• When storing text, computers transform characters into a numeric representation. This process is
  referred to as encoding the text.
• In order to accommodate the demands of a variety of languages, several different encoding techniques
  have been developed. These techniques are represented by a variety of character sets.
• The oldest and most prevalent encoding technique is known as the ASCII character set, which still
  serves as a least common denominator among other techniques.
• The wc command counts the number of characters, words, and lines in a file. When applied to
  structured data, the wc command can become a versatile counting tool.
• The cat command has options that allow representation of nonprinting characters such as NEWLINE.
• The head and tail commands have options that allow you to print only a certain number of lines or a
  certain number of bytes (one byte usually correlates to one character) from a file.
Discussion
In this Workbook, we begin looking at various tools for searching, sorting, extracting, and manipulating
text. Because Linux, and Unix before it, has a strong tradition of storing data in human-readable text
formats, these tools should be thought of as aids not only to composing text, but to data manipulation in
general.
Files
What are Files?
Linux, like most operating systems, stores information that needs to be preserved outside of the context
of any individual process in files. (In this context, and for most of this Workbook, the term file is meant in
the sense of regular file.) Linux (and Unix) files store information using a simple model: information is
stored as a single, ordered array of bytes, starting at the first byte and ending at the last. The number of bytes
in the array is the length of the file. 1
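The "ordered array of bytes" model can be seen directly from a short program. The following is a minimal sketch in Python, not part of the original workbook; the file name report.txt and the helper file_bytes are hypothetical names chosen only for illustration.

```python
import os
import tempfile


def file_bytes(path):
    """Return the whole contents of a regular file as an ordered sequence of bytes."""
    with open(path, "rb") as f:
        return f.read()


# Create a scratch file (the name "report.txt" is only an illustration).
scratch_dir = tempfile.mkdtemp()
path = os.path.join(scratch_dir, "report.txt")
with open(path, "wb") as f:
    f.write(b"hello\n")

data = file_bytes(path)
first_byte = data[0]     # the byte at the first position in the array
last_byte = data[-1]     # the byte at the last position in the array
length = len(data)       # the number of bytes in the array

# Both numbers printed here are the length of the file.
print(length, os.path.getsize(path))
```

The point of the sketch is only that the file has no structure beyond position: every byte has an offset, and the count of bytes is the file's length.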
What type of information is stored in files? Here are but a few examples.
• The characters that compose the book report you want to store until you can come back and finish it
  tomorrow are stored in a file called (say) ~/bookreport.txt.
• The individual colors that make up the picture you took with your digital camera are stored in the file
  (say) /mnt/camera/dcim/100nikon/dscn1203.jpg.
• The characters which define the usernames of users on a Linux system (and their home directories,
  etc.) are stored in the file /etc/passwd.
• The specific instructions which tell an x86 compatible CPU how to use the Linux kernel to list the files
  in a given directory are stored in the file /bin/ls.
What is a Byte?
At the lowest level, computers can only answer one type of question: is it on or off? What is "it"? When
dealing with disks, it is a magnetic domain which is oriented up or down. When dealing with memory
chips, it is a transistor which either has current or doesn't. Both of these are too difficult to mentally
picture, so we will speak in terms of light switches that can either be on or off. To your computer, the
contents of your file are reduced to what can be thought of as an array of (perhaps millions of) light
switches. Each light switch can be used to store one bit of information (is it on, or is it off?).
Using a single light switch, you cannot store much information. To be more useful, an early convention
was established: group the light switches into bunches of 8. Each series of 8 light switches (or magnetic
domains, or transistors, ...) is a byte. More formally, a byte consists of 8 bits. Each permutation of ons and
offs for a group of 8 switches can be assigned a number. All switches off, we'll assign 0. Only the first
switch on, we'll assign 1; only the second switch on, 2; the first and second switch on, 3; and so on. How
many numbers will it take to label each possible permutation for 8 light switches? A mathematician will
quickly tell you the answer is 2^8, or 256. After grouping the light switches into groups of eight, your
computer views the contents of your file as an array of bytes, each with a value ranging from 0 to 255.
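The numbering of switch permutations described above is ordinary binary counting. As a rough sketch (in Python; the function name switches_to_value is hypothetical and not from the workbook), each switch that is on contributes a power of two, with the "first" switch worth 1:

```python
def switches_to_value(switches):
    """Convert 8 on/off switches to a number; switches[0] is the 'first' switch (worth 1)."""
    if len(switches) != 8:
        raise ValueError("a byte has exactly 8 switches")
    value = 0
    for position, is_on in enumerate(switches):
        if is_on:
            value += 2 ** position   # switch n (counting from 0) is worth 2^n
    return value


OFF, ON = False, True
all_off = switches_to_value([OFF] * 8)
first_on = switches_to_value([ON] + [OFF] * 7)
second_on = switches_to_value([OFF, ON] + [OFF] * 6)
first_and_second = switches_to_value([ON, ON] + [OFF] * 6)
all_on = switches_to_value([ON] * 8)

print(all_off, first_on, second_on, first_and_second, all_on)  # 0 1 2 3 255
print(2 ** 8)  # 256 distinct permutations, labeled 0 through 255
```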
Data Encoding
In order to store information as a series of bytes, the information must be somehow converted into a
series of values ranging from 0 to 255. Converting information into such a format is called data encoding.
What's the best way to do it? There is no single best way that works for all situations. Developing the
right technique to encode data, which balances the goals of simplicity, efficiency (in terms of CPU
performance and on-disk storage), resilience to corruption, etc., is much of the art of computer science.
As one example, consider the picture taken by a digital camera mentioned above. One encoding
technique would divide the picture into pixels (dots), and for each pixel, record three bytes of
information: the pixel's "redness", "greenness", and "blueness", each on a scale of 0 to 255. The first
three bytes of the file would record the information for the first pixel, the second three bytes the second
pixel, and so on. A picture format known as "PNM" does just this (plus some header information, such as
how many pixels are in a row). Many other encoding techniques for images exist, some just as simple,
many much more complex.
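To make the three-bytes-per-pixel idea concrete, here is a small sketch in Python. It assumes the binary color variant of the PNM family (PPM, magic number "P6"), whose header gives the width, height, and maximum color value as text before the raw pixel bytes. The function name encode_ppm is hypothetical.

```python
def encode_ppm(width, height, pixels):
    """Encode (red, green, blue) tuples, each 0-255, as a binary PPM ("P6") image."""
    if len(pixels) != width * height:
        raise ValueError("expected width * height pixels")
    # Header: magic number, pixels per row, number of rows, maximum color value.
    header = f"P6\n{width} {height}\n255\n".encode("ascii")
    body = bytearray()
    for red, green, blue in pixels:
        body += bytes([red, green, blue])   # three bytes per pixel
    return header + bytes(body)


# A 2x1 picture: one pure red pixel followed by one pure blue pixel.
image = encode_ppm(2, 1, [(255, 0, 0), (0, 0, 255)])
print(len(image))  # 11 header bytes + 6 pixel bytes
```

Writing the returned bytes to a file ending in .ppm should produce an image that most image viewers can open, though this sketch has not been checked against any particular viewer.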
Text Encoding
Perhaps the most common type of data which computers are asked to store is text. As computers have
developed, a variety of techniques for encoding text have been developed, from the simple in concept
(which could encode only the Latin alphabet used in Western languages) to complicated but powerful
techniques that attempt to encode all forms of human written communication, even attempting to include
historical languages such as Egyptian hieroglyphics. The following sections discuss many of the
encoding techniques commonly used in Red Hat Enterprise Linux.
ASCII
One of the oldest, and still one of the most commonly used, techniques for encoding text is called ASCII encoding.
ASCII encoding simply takes the 26 lowercase and 26 uppercase letters which compose the Latin
alphabet, the 10 digits, and common English punctuation characters (those found on a keyboard), and maps
each of them to an integer between 0 and 127, as outlined in the following table.
Table 1-1. ASCII Encoding of Printable Characters

Integer Range    Character
33-47            Punctuation: !"#$%&'()*+,-./
48-57            The digits 0 through 9
58-64            Punctuation: :;<=>?@
65-90            Capital letters A through Z
91-96            Punctuation: [\]^_`
97-122           Lowercase letters a through z
123-126          Punctuation: {|}~
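The mapping in Table 1-1 can be checked from any language that exposes character codes. A minimal sketch in Python follows; the helper name ascii_codes is hypothetical.

```python
def ascii_codes(text):
    """Return the ASCII integer for each character of text."""
    return list(text.encode("ascii"))


codes = ascii_codes("Hi!")
print(codes)  # [72, 105, 33]

# Rebuild three of the ranges from Table 1-1 from their integers.
digits = bytes(range(48, 58)).decode("ascii")
capitals = bytes(range(65, 91)).decode("ascii")
lowercase = bytes(range(97, 123)).decode("ascii")
print(digits)
print(capitals)
print(lowercase)
```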
What about the integers 0 - 32? These integers are mapped to special keys on early teletypes, many of
which have to do with manipulating the spacing on the page being typed on. The following characters are
commonly called "whitespace" characters.
Table 1-2. ASCII Encoding of Whitespace Characters

Integer    Character    Common Name        Common Representation
8          BS           Backspace          \b
9          HT           Tab                \t
10         LF           Line Feed          \n
12         FF           Form Feed          \f
13         CR           Carriage Return    \r
32         SPACE        Space Bar
127        DEL          Delete
Others of the first 32 integers are mapped to keys which did not directly influence the "printed page", but
instead sent "out of band" control signals between two teletypes. Many of these control signals have
special interpretations within Linux (and Unix).
Table 1-3. ASCII Encoding of Control Signals

Integer    Character    Common Name              Common Representation
4          EOT          End of Transmission
7          BEL          Audible Terminal Bell    \a
27         ESC          Escape
Generating Control Characters from the Keyboard: Control and whitespace characters can be
generated from the terminal keyboard directly using the CTRL key. For example, an audible bell can
be generated using CTRL-G, while a backspace can be sent using CTRL-H, and we have already
mentioned that CTRL-D is used to generate an "End of File" (or "End of Transmission"). Can you
determine how the whitespace and control characters are mapped to the various CTRL key
combinations? For example, what CTRL key combination generates a tab? What does CTRL-J
generate? As you explore various control sequences, remember that the reset command will restore
your terminal to sane behavior, if necessary.
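Readers who would rather work the exercise out first may want to skip this sketch. On most terminals the CTRL key conventionally keeps only the low five bits of the letter's ASCII value, which is what the following Python fragment assumes; the function name ctrl_code is hypothetical, and individual terminals may be configured differently.

```python
def ctrl_code(letter):
    """Return the integer conventionally sent by CTRL plus the given letter key."""
    return ord(letter.upper()) & 0x1F   # keep only the low five bits


for key in "DGHIJM":
    print(f"CTRL-{key} -> {ctrl_code(key)}")
```

Comparing the printed integers against the whitespace and control signal tables shows which key combination corresponds to which character.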
What about the values 128-255? ASCII encoding does not use them. The ASCII standard only defines
the first 128 values of a byte, leaving the remaining 128 values to be defined by other schemes.
ISO 8859 and Other Character Sets

Other standard encoding schemes have been developed, which map various glyphs (such as the symbols
for the Yen and Euro), diacritical marks found in many European languages, and non-Latin alphabets to
the latter 128 values of a byte which the ASCII standard leaves undefined. These standard encoding
schemes are referred to as character sets. The following table lists some character sets which are
supported in Linux, including their informal name, formal name, and a brief description.
Table 1-4. Some ISO 8859 Character Sets supported in Linux

Informal Name    Formal Name    Description
Latin-1          ISO 8859-1     West European languages
Latin-2          ISO 8859-2     Central and East European languages
Arabic           ISO 8859-6     Latin/Arabic
Greek            ISO 8859-7     Latin/Greek
Latin-9          ISO 8859-15    West European languages
All of these character encoding schemes use a common technique. They preserve the first 128 values of a
byte to encode traditional ASCII, and use the remaining 128 values to encode glyphs unique to the
particular encoding. For example, ISO 8859-1 (Latin-1) uses the value 196 to encode a Latin capital A
with an umlaut (Ä), while ISO 8859-7 (Greek) uses the value 196 to encode the Greek capital letter
Delta (Δ), but both use the value 101 to encode a Latin lowercase e.
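The example of the value 196 can be reproduced with a short sketch. It uses Python's built-in "iso-8859-1" and "iso-8859-7" codecs as stand-ins for the character sets in the table; the variable names are illustrative only.

```python
raw = bytes([196, 101])   # the same two byte values in both cases

as_latin1 = raw.decode("iso-8859-1")   # interpret the bytes as Latin-1
as_greek = raw.decode("iso-8859-7")    # interpret the bytes as Greek

print(as_latin1)  # capital A with an umlaut, then e
print(as_greek)   # capital Delta, then e
```

The bytes themselves never change; only the character set used to interpret them does.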
Notice a couple of implications about ISO 8859 encoding.
1. Each of the alternate encodings maps a single glyph to a single byte, so that the number of characters
encoded in a file equals the number of bytes which are required to encode them.
2. Choosing a particular character set extends the range of characters that can be encoded, but you
cannot encode characters from different character sets simultaneously. For example, you could not
encode both a Latin capital A with a grave accent and a Greek letter Delta in the same file.
Unicode (UCS)
In order to overcome the limitations of ASCII and ISO 8859 based encoding techniques, a Universal
Character Set has been developed, commonly referred to as UCS, or Unicode. The Unicode standard
acknowledges the fact that one byte of information, with its ability to encode 256 different values, is
simply not enough to encode the variety of glyphs found in human communication. Instead, in its most
direct form (UCS-4), the standard uses 4 bytes to encode each character. Think of 4 bytes as 32 light
switches. If we were to again label each combination of on and off for 32 switches with integers, the
mathematician would tell you that you would need 4,294,967,296 (over 4 billion) integers. Thus, 4 bytes
leave room for over 4 billion glyphs (nearly enough for every person on the earth to have their own
unique glyph; the user prince would approve).
What are some of the features and drawbacks of Unicode encoding?
Scale
The Unicode standard will easily be able to encode the variety of glyphs used in human
communication for a long time to come.
Simplicity
The Unicode standard does have the simplicity of a sledgehammer. The number of bytes required to
encode a set of characters is simply the number of characters multiplied by 4.
Waste
While the Unicode standard is simple in concept, it is also very wasteful. The ability to encode 4
billion glyphs is nice, but in reality, much of the communication that occurs today uses less than a
few hundred glyphs. Of the 32 bits (light switches) used to encode each character, the first 20 or so
would always be "off".
ASCII Non-compatibility
For better or for worse, a huge amount of existing data is already ASCII encoded. In order to convert
fully to Unicode, that data, and the programs that expect to read it, would have to be converted.
The Unicode standard is an effective standard in principle, but in many respects it is ahead of its time,
and perhaps forever will be. In practice, other techniques have been developed which attempt to preserve
the scale and versatility of Unicode, while minimizing waste and maintaining ASCII compatibility. What
must be sacrificed? Simplicity.
Unicode Transformation Format (UTF-8)

UTF-8 encoding attempts to balance the flexibility of Unicode, and the practicality and pervasiveness of
ASCII, with a significant sacrifice: variable length encoding. With variable length encoding, each
character is no longer encoded using simply 1 byte, or simply 4 bytes. Instead, the traditional 128 ASCII
characters are encoded using 1 byte (and, in fact, are identical to the existing ASCII standard). The next
most commonly used 2000 or so characters are encoded using two bytes. The next 63000 or so
characters are encoded using three bytes, and the more esoteric characters are encoded using four bytes
(the original design allowed for sequences of up to six bytes). Details of the encoding technique can be
found in the utf-8(7) man page. With full backwards compatibility to ASCII, and the same functional
range of pure Unicode, what is there to lose? ISO 8859 (and similar) character set compatibility.
UTF-8 attempts to bridge the gap between ASCII, which can be viewed as the primitive days of text
encoding, and Unicode, which can be viewed as the utopia to aspire toward. Unfortunately, the
"intermediate" methods, the ISO 8859 and other alternate character sets, are as incompatible with UTF-8
as they are with each other.
Additionally, the simple relationship between the number of characters that are being stored and the
amount of space (measured in bytes) it takes to store them is lost. How much space will it take to store
879 printed characters? If they are pure ASCII, the answer is 879. If they are Greek or Cyrillic, the
answer is closer to twice that much.
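The lost relationship between characters and bytes can be observed directly from the shell. The following is a minimal sketch, assuming a printf that understands octal escapes (as the bash builtin does). The byte values are the standard UTF-8 encodings of the three characters named in the comments, and the variable names are only illustrative.

```shell
# Each printf emits the raw UTF-8 bytes of exactly one character;
# wc -c then counts the bytes needed to store it.
ascii_bytes=$(printf 'e' | wc -c)              # Latin lowercase e (ASCII 101)
greek_bytes=$(printf '\316\224' | wc -c)       # Greek capital Delta, U+0394
euro_bytes=$(printf '\342\202\254' | wc -c)    # Euro sign, U+20AC

echo "e: $ascii_bytes  Delta: $greek_bytes  Euro: $euro_bytes"
```

One, two, and three bytes are reported for what a reader would see as one character each, which is why Greek text takes roughly twice the space of the same number of ASCII characters.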
Text Encoding and the Open Source Community

In the traditional development of operating systems, decisions such as what type of character encoding to
use can be made centrally, with the possible disadvantage that the decision is wrong for some community
of the operating system's users. In contrast, in the open source development model, these types of
decisions are generally made by individuals and small groups of contributors. The advantage of the open
source model is a flexible system which can accommodate a wide variety of encoding formats. The
disadvantage is that users must often be educated and made aware of the issues involved with character
encoding, because some parts of the assembled system use one technique while other parts use another.
The library of man pages is an excellent example.
When contributors to the open source community are faced with decisions involving potentially
incompatible formats, they generally balance local needs with an appreciation for adhering to widely
accepted standards where appropriate. The UTF-8 encoding format seems to be evolving as an accepted
standard, and in recent releases has become the default for Red Hat Enterprise Linux.
The following paragraph, extracted from the utf-8(7) man page, says it well:
It can be hoped that in the foreseeable future, UTF-8 will replace
ASCII and ISO 8859 at all levels as the common character encoding on
POSIX systems, leading to a significantly richer environment for handling plain text.
Internationalization (i18n)
As this Workbook continues to discuss many tools and techniques for searching, sorting, and
manipulating text, the topic of internationalization cannot be avoided. In the open source community,
internationalization is often abbreviated as i18n, a shorthand for saying "i-n with 18 letters in between".
Applications which have been internationalized take into account different languages. In the Linux (and
Unix) community, most applications look for the LANG environment variable to determine which
language to use.
At the simplest, this implies that programs will emit messages in the user's native language.
[elvis@station elvis]$ echo $LANG
en_US.UTF-8
[elvis@station elvis]$ chmod 666 /etc/passwd
chmod: changing permissions of /etc/passwd: Operation not permitted

[elvis@station elvis]$ export LANG=de_DE.utf8
[elvis@station elvis]$ chmod 666 /etc/passwd
chmod: Beim Setzen der Zugriffsrechte für /etc/passwd: Die Operation ist nicht erlaubt
More subtly, the choice of a particular language has implications for sorting orders, numeric formats, text
encoding, and other issues.
The LANG environment variable

The LANG environment variable is used to define a user's language, and possibly the default encoding
technique as well. The variable is expected to be set to a string using the following syntax:
LL_CC.enc
The variable's content consists of the following three components.

Table 1-5. Components of LANG environment variable
Component   Role
LL          Two letter ISO 639 Language Code
CC          (Optional) Two letter ISO 3166 Country Code
enc         (Optional) Character Encoding Code Set
The locale command can be used to examine your current configuration (as can echo $LANG), while
locale -a will list all settings currently supported by your system. The extent of the support for any given
language will vary.
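The LL_CC.enc syntax can be taken apart with ordinary bash parameter expansion. The following is only a sketch: the sample value is hard-coded rather than read from the environment, the variable names are invented for illustration, and it assumes that all three components are present.

```shell
# Split a LANG-style value into its three components.
lang_value="de_DE.utf8"

locale_part=${lang_value%%.*}     # drop the encoding:          de_DE
language=${locale_part%%_*}       # text before the underscore: de
country=${locale_part#*_}         # text after the underscore:  DE
encoding=${lang_value#*.}         # text after the dot:         utf8

echo "language=$language country=$country encoding=$encoding"
```

To see which settings exist for one language, the output of locale -a can be filtered, for example with locale -a | grep '^de_'.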
The following tables list some selected language codes, country codes, and code set specifications.
Table 1-6. Selected ISO 639 Language Codes
Code   Language
de     German
el     Greek
en     English
es     Spanish
fr     French
ja     Japanese
zh     Chinese
Table 1-7. Selected ISO 3166 Country Codes
Code   Country
CA     Canada
CN     China
DE     Germany
ES     Spain
FR     France
GB     Britain (UK)
GR     Greece
JP     Japan
NG     Nigeria
US     United States
Table 1-8. Selected Character Encoding Code Sets
Code        Character Set
utf8        UTF-8
iso88591    ISO 8859-1 (Latin-1)
iso885915   ISO 8859-15 (Latin-9)
iso88596    ISO 8859-6 (Arabic)
iso88592    ISO 8859-2 (Latin-2)
See the gettext info pages (info gettext, or pinfo gettext) for a complete listing.
Do I Really Have to Know All of This?

We have tried to introduce the major concepts and components which affect how text is encoded and
stored within Linux. After reading about character sets and language codes, one might be led to wonder,
do I really need to know about all of this? If you are using simple text, restricted to the Latin alphabet of
26 characters, the answer is no. If you are asking the question 10 years from now, the answer will
hopefully be no. If you do not fit into one of these two categories, however, you should have at least an
acquaintance with the concept of internationalization, character sets, and the role of the LANG
environment variable.
Hopefully, as the open source community converges on a single encoding technique (currently UTF-8
seems the most likely), most of these issues will disappear. Until then, these are some key points to
remember.
1. An ASCII file is already valid in any of the ISO 8859 character sets.
2. An ASCII file is already valid in UTF-8.
3. A file encoded in one of the ISO 8859 character sets is not, in general, valid in UTF-8, and must be
converted.
4. Using UTF-8, there is a one to one mapping between characters and bytes if and only if all of the
characters are pure ASCII characters.
If you are interested in more information, several man pages provide a more detailed introduction to the
concepts outlined above. Start with charsets(7), and then follow with ascii(7), iso_8859_1(7),
unicode(7) and utf-8(7). Additionally, the iconv command can be used to convert text files from one
form of encoding to another.
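As a small illustration of such a conversion, the sketch below feeds iconv a single ISO 8859-1 byte and counts the bytes before and after. It assumes the iconv command shipped with glibc, using its -f (from) and -t (to) switches; the variable names are illustrative only.

```shell
# Byte 196 (octal 304) is a capital A with an umlaut in ISO 8859-1.
# Converted to UTF-8, the same character needs two bytes.
latin1_bytes=$(printf '\304' | wc -c)
utf8_bytes=$(printf '\304' | iconv -f ISO-8859-1 -t UTF-8 | wc -c)

echo "ISO 8859-1: $latin1_bytes byte, UTF-8: $utf8_bytes bytes"
```

A whole file would be converted the same way, for instance iconv -f ISO-8859-1 -t UTF-8 oldfile > newfile, where oldfile and newfile are placeholder names.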
Revisiting cat, head, and tail

Revisiting cat
We have been using the cat command to simply display the contents of files. Usually, the cat command
generates a faithful copy of its input, without performing any edits or conversions. When called with one
of the following command line switches, however, the cat command will indicate the presence of tabs,
line feeds, and other control sequences, using the following conventions.
Table 1-9. Command Line Switches for the cat Command
Switch   Effect
-E       display line feeds (ASCII 10) as $
-T       display tabs (ASCII 9) as ^I
-v       display control characters (other than tabs and line feeds) as ^n, with n indicating the
         CTRL sequence for the nonprinting character.
-A       Shows "all", same as -vET
-t       Show "all" except line feeds, same as -vT
-e       Show "all" except tabs, same as -vE
As an example, in the following, the cat command is used to display the contents of the /etc/hosts
configuration file.
[student@station student]$ cat /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1       localhost.localdomain   localhost station.example.com
127.0.0.1       rha-server
192.168.0.1     station1 station1.example.com www1 www1.example.com
192.168.0.51    station51 station51.example.com
192.168.129.201 z
Using the -A command line switch, the whitespace structure of the file becomes evident, as tabs are
replaced with ^I, and line feeds are decorated with $.
[student@station student]$ cat -A /etc/hosts
# Do not remove the following line, or various programs$
# that require network functionality will fail.$
127.0.0.1^Ilocalhost.localdomain^Ilocalhost station.example.com $
127.0.0.1^Irha-server$
192.168.0.1^Istation1 station1.example.com www1 www1.example.com$
192.168.0.51^Istation51 station51.example.com$
192.168.129.201^Iz$
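The same markers can be reproduced without touching a system file. The following is a minimal sketch, assuming the GNU version of cat (the -A switch is not available everywhere); the sample text and the variable name are invented for the example.

```shell
# Build a two-line sample containing one tab, then let cat -A mark
# the tab as ^I and the end of each line as $.
marked=$(printf 'name\tvalue\nplain line\n' | cat -A)
echo "$marked"
```

The first line comes out as name^Ivalue$, showing that the gap between the two words is a single tab rather than a run of spaces.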
Revisiting head and tail

The head and tail commands have been used to display the first or last few lines of a file, respectively.
But what makes a line? Imagine yourself working at a typewriter: click! clack! click! clack! clack!
ziiing! Instead of the ziing! of the typewriter carriage at the end of each line, the line feed character
(ASCII 10) is chosen to mark the end of lines.
Unfortunately, a common convention for how to mark the end of a line is not shared among the dominant
operating systems in use today. Linux (and Unix) uses the line feed character (ASCII 10, often
represented as \n), while classic Macintosh operating systems use the carriage return character (ASCII
13, often represented as \r or ^M), and Microsoft operating systems use a carriage return/line feed pair
(ASCII 13, ASCII 10).
For example, the following file contains a list of four musicians.
[student@station student]$ cat -A musicians
elvis$
blondie$
prince$
madonna$
Had this file been created on a Microsoft or Macintosh operating system, and copied into Linux, the files
would look like the following.
[student@station student]$ cat -A musicians.dos
elvis^M$
blondie^M$
prince^M$
madonna^M$
[student@station student]$ cat -A musicians.mac
elvis^Mblondie^Mprince^Mmadonna^M[student@station student]$
Linux (and Unix) text files generally adhere to a convention that the last character of the file must be a
line feed for the last line of text. Following the cat of the file musicians.mac, which does not contain
any conventional Linux line feed characters, the bash prompt is not displayed in its usual location.
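For readers without a Microsoft or Macintosh machine at hand, the three variants can be approximated with printf. This is a sketch under a few assumptions: GNU cat for the -A switch, a scratch directory created with mktemp, and illustrative variable names.

```shell
# Use a scratch directory so no existing files are overwritten.
workdir=$(mktemp -d)

# The same four names, written with the three line-ending conventions.
printf 'elvis\nblondie\nprince\nmadonna\n'         > "$workdir/musicians"      # LF
printf 'elvis\r\nblondie\r\nprince\r\nmadonna\r\n' > "$workdir/musicians.dos"  # CR LF
printf 'elvis\rblondie\rprince\rmadonna\r'         > "$workdir/musicians.mac"  # CR

# head looks for line feeds, so "the first line" differs between the files.
unix_first=$(head -n 1 "$workdir/musicians" | cat -A)
dos_first=$(head -n 1 "$workdir/musicians.dos" | cat -A)
mac_first=$(head -n 1 "$workdir/musicians.mac" | cat -A)
printf '%s\n' "$unix_first" "$dos_first" "$mac_first"

rm -r "$workdir"
```

Because musicians.mac contains no line feed at all, head -n 1 returns the entire file as its "first line".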
Table 1-10. Command Line Switches for the head Command
Switch     Effect
-N, -nN    Display the first N lines of the file.
-cN        Display the first N bytes of the file.
Table 1-11. Command Line Switches for the tail Command
Switch     Effect
-N, -nN    Display the last N lines of the file. If N is prepended by a +, display the remainder of the
           file, starting at the Nth line.
-cN        Display the last N bytes of the file.
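The switches in the two tables above can be tried on a small sample. The following sketch uses a five-line text held in a shell variable; the names are illustrative, and the -n spelling of the line switches is used throughout.

```shell
sample=$(printf 'one\ntwo\nthree\nfour\nfive\n')

first_two=$(printf '%s\n' "$sample" | head -n 2)     # one, two
last_two=$(printf '%s\n' "$sample" | tail -n 2)      # four, five
from_third=$(printf '%s\n' "$sample" | tail -n +3)   # three, four, five
first_bytes=$(printf '%s\n' "$sample" | head -c 3)   # the three bytes "one"

printf '%s\n' "$from_third"
```

Note the difference between tail -n 3, which counts from the end of the file, and tail -n +3, which starts at the third line.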
The wc (Word Count) Command

Counting Made Easy
Have you ever tried to answer a "25 words or less" quiz? Did you ever have to write a 1500-word essay?
With the wc command you can easily verify that your contribution meets the criteria.
The wc command counts the number of characters, words, and lines. It will take its input either from files
named on its command line or from its standard input. Below is the command line form for the wc
program:
Figure 1-1. Using the wc command
Switch     Results
-c         Compute character count.
-l         Compute line count.
-w         Compute word count.
filename   Filename to be counted. If no filename is specified, then the text will be read from the
           standard input. For clarity, the filename is written at the end of each line of the
           counting report, even if only one filename is used.
When used without any command line switches, wc will report on the number of characters, lines, and
words. Command line switches can be combined to return any combination of character count, line count
or word count.
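As a brief sketch of how the switches combine, the commands below count a two-line sample read from standard input. The sample text and variable names are invented for the example; when several counts are requested, wc prints them in the order lines, words, characters.

```shell
# With no switches, wc reports lines, words, and characters.
all_counts=$(printf 'Hello, World!\nsecond line here\n' | wc)

# Switches can be combined to select just some of the counts.
lines_and_words=$(printf 'Hello, World!\nsecond line here\n' | wc -lw)
chars_only=$(printf 'Hello, World!\nsecond line here\n' | wc -c)

echo "all: $all_counts"
echo "lines and words: $lines_and_words"
echo "characters: $chars_only"
```

Since the input arrives on standard input, no filename is appended to any of the reports.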
How To Recognize A Real Character

Text files are composed using an alphabet of characters. Some characters are visible, such as numbers
and letters. Some characters are used for horizontal distance, such as spaces and TAB characters. Some
characters are used for vertical movement, such as carriage returns and line feeds.
A line in a text file is a series of any character other than a NEWLINE (line feed) character and then a
NEWLINE character. Additional lines in the file immediately follow the first line.
While a computer represents characters as numbers, the exact value used for each symbol varies
depending on which alphabet has been chosen. The most common alphabet for English speakers is
ASCII, which character sets such as Latin-1 extend. Different human languages are represented by
different computer encoding rules, so the exact numeric value for a given character depends on the
human language being recorded.
So, What Is A Word?

A word is a group of printing characters, such as letters and digits, surrounded by white space, such as
space characters or horizontal TAB characters.
Notice that our definition of a word does not include any notion of meaning. Only the form of the word
is important, not its semantics. As far as Linux is concerned, a line such as:
Now is the time for all good men to foogle.
contains 10 perfectly good words: groups of printing characters surrounded by whitespace. Punctuation
does not separate words; the final period is simply part of the last word.
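The definition can be checked with wc -w. In this short sketch the second sample string is invented to show that only whitespace, not punctuation, separates words.

```shell
sentence='Now is the time for all good men to foogle.'
word_count=$(echo "$sentence" | wc -w)

# No whitespace inside this string, so wc sees a single word.
joined_count=$(echo 'well-known,fact' | wc -w)

echo "sentence: $word_count words, joined string: $joined_count word"
```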
Examples
Example 1. Counting Characters
To count the characters in a file, just run wc -c:
[student@station student]$ echo hello | wc -c
6
In addition to the five letters in the word, the line also has a newline (line feed) character at the end.
Example 2. Invisible Characters Are Important, Too

Characters that you cannot see still occupy space in a file.
[student@station student]$ echo Hello, World! > greetings
[student@station student]$ wc -c greetings
14 greetings
Keep in mind that spaces and TABs count as characters, too. Remember our typewriter analogy? Both
the spacebar and the TAB key require keystrokes; each character in a text file corresponds to a press of a
typewriter key.
Example 3. What's My Line?

Run the command wc -l to count the lines in a file:
[student@station student]$ echo First line > foo
[student@station student]$ echo Second line >> foo
[student@station student]$ echo Third line >> foo
[student@station student]$ wc -l foo
3 foo
Example 4. I Want It All

Using wc without any switches counts everything: lines, words, and characters:
[student@station student]$ echo one > x
[student@station student]$ echo two words >> x
[student@station student]$ echo three more words >> x
[student@station student]$ wc x
      3       6      31 x
Example 5. Linux, DOS, and Macintosh Files

How would the wc command handle the three musician files mentioned above (one composed on a Linux
machine, one on a Microsoft machine, and one on a Macintosh)?
[student@station student]$ wc musicians*
      4       4      29 musicians
      4       4      33 musicians.dos
      0       4      29 musicians.mac
      8      12      91 total
For the file musicians.mac, which did not contain any conventional Linux line feed characters,
the number of lines is reported as 0.
In the above output, why does the file musicians.dos have 33 characters, while musicians and
musicians.mac only 29?
Example 6. Counting Users

The wc command is often used to count the number of things, not just lines, words, and characters. For
example, the users command generates a list of users who are currently logged onto the machine. The
following line would create an alias called nusers, which would report the number of currently logged
on users. The quotes are required so that the pipe becomes part of the alias, rather than piping the
output of the alias command itself.
[student@station student]$ alias nusers='users | wc -w'
[student@station student]$ users
student student student student root
[student@station student]$ nusers
5
Example 7. Counting Processes

By examining the output of a command such as ps aux, which prints information about one process per
line, the wc -l command can be used to count the number of processes currently running on a machine.
Examining the output of the ps aux command, however, the initial line, which contains the column titles,
must be removed from the count.
[student@station student]$ ps aux
USER       PID %CPU %MEM   VSZ  RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0  1384   76 ?        S    Sep28   0:04 init [
root         2  0.0  0.0     0    0 ?        SW   Sep28   0:00 [keventd]
root         3  0.0  0.0     0    0 ?        SW   Sep28   0:00 [kapmd]
root         4  0.0  0.0     0    0 ?        SWN  Sep28   0:00 [ksoftirqd_CPU0]
...
The tail command, with its ability to print the remainder of a file starting from a specified line, can be
used to remove the header line.
[student@station student]$ ps aux | tail -n +2
root         1  0.0  0.0  1384   76 ?        S    Sep28   0:04 init [
root         2  0.0  0.0     0    0 ?        SW   Sep28   0:00 [keventd]
root         3  0.0  0.0     0    0 ?        SW   Sep28   0:00 [kapmd]
root         4  0.0  0.0     0    0 ?        SWN  Sep28   0:00 [ksoftirqd_CPU0]
root         9  0.0  0.0     0    0 ?        SW   Sep28   0:00 [bdflush]
...
The following short script combines ps aux, tail -n +2, and wc to create a new command called nprocs.
When made executable and placed in the ~/bin directory (which is part of the standard executable search
PATH), the script becomes available from the command line.
[student@station student]$ cat nprocs
#!/bin/bash
ps aux | tail -n +2 | wc -l
[student@station student]$ mkdir bin
[student@station student]$ mv nprocs bin
[student@station student]$ chmod a+x bin/nprocs
[student@station student]$ nprocs
86
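The effect of dropping the header line can be verified on a small, predictable input. In the sketch below, fake_ps is a hypothetical stand-in for ps aux that prints one header line and three invented process rows; it is not real ps output.

```shell
# Stand-in for ps aux: one header line followed by three processes.
fake_ps() {
    printf 'USER PID COMMAND\n'
    printf 'root 1 init\n'
    printf 'root 2 [keventd]\n'
    printf 'elvis 812 bash\n'
}

with_header=$(fake_ps | wc -l)                    # counts the header too
without_header=$(fake_ps | tail -n +2 | wc -l)    # starts at line 2

echo "lines: $with_header, processes: $without_header"
```

Starting the output at line 2 removes exactly one line from the count, which is the correction the nprocs script relies on.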
Online Exercises
Lab Exercise
Objective: Use the wc command as a counting tool.
Estimated Time: 10 mins.
Specification
1. Create the file ~/gplwords.txt, which contains the number of words (as reported by the wc
command) in the file /usr/share/doc/redhat-release-4ES/GPL as its only word.
2. Create the file ~/localusers.txt, which contains the number of locally defined users as its only
word.
3. Statically compiled libraries conventionally live in the /usr/lib directory, and have names that
start lib and end with a .a extension. Create the file ~/usrlibs.txt, which contains the number
of files whose name follows this convention in the /usr/lib directory as its only word. (Do not
include subdirectories.)
4. Create an executable script called ~/bin/nrecent. The script should expect a single argument,
which is the name of a directory. Upon execution, the script should return a single number, which is
the number of files in the directory which have been modified in the last 24 hours. The script should
generate no error messages about unaccessible directories on the standard error stream.
If you have implemented the exercises correctly, you should be able to reproduce output akin to the
following. (Do not be concerned if your actual numbers differ from those listed below).
[student@station student]$ head *.txt
==> gplwords.txt <==
2009

==> localusers.txt <==
89

==> usrlibs.txt <==
216
[student@station student]$ nrecent /var/log
22
Deliverables
1. A file called ~/gplwords.txt, which contains the number of words found in the file
/usr/share/doc/redhat-release-4ES/GPL.
2. A file called ~/localusers.txt, which contains the number of locally defined users on the Linux system.
3. A file called ~/usrlibs.txt, which contains the number of files that begin lib and end .a found in the
/usr/lib directory.
4. An executable script called ~/bin/nrecent, which expects the name of a directory as its single argument.
Upon execution, the script would return a single number which is the number of files under the specified
directory that have been modified in the past 24 hours. The script should return no error messages about
unaccessible directories on the standard error stream.
Hints
For the file ~/localusers.txt, recall that local users are defined in the /etc/passwd file, one user
per line.
For the script ~/bin/nrecent, recall that $1 dereferences to a bash script's first argument. Consider
using the find command to generate a list of files that match the criteria, and then count the number of
lines (or words) in the output. You might want to use the /etc or /var/log directories to test your
script.
Questions
1. Create an empty file using the touch foo command. How many characters does it contain?
( ) a. 0
( ) b. 2
( ) c. 1
( ) d. The wc command does not work on empty files.
( ) e. None of the above
2. Create a file using the echo > foo command. How many characters does it have?
( ) a. 2
( ) b. 0
( ) c. 1
( ) d. The wc command does not work on empty files.
3. Create a file using echo -e "\n\n\n\n" > foo; how many words does it have?
( ) a. 2
( ) b. 1
( ) c. 4
( ) d. 5
( ) e. 0
4. Which of the following command lines would generate a single word output, which is the sum of the number of
words found in the files /etc/services and /etc/hosts?
( ) a. cat /etc/services /etc/hosts | wc -w
( ) b. wc -w < /etc/hosts /etc/services
( ) c. wc -w /etc/hosts /etc/services
( ) d. A and C
( ) e. All of the above
5. Which of the following command lines would generate a single word output, which is the number of users logged
into the local machine (as reported by the w command)?
( ) a. w | wc -u
( ) b. w | tail -3 | wc -w
( ) c. w | tail -n +3 | wc -l
( ) d. w | tail +USER | wc -c
Use the following transcript to answer the next two questions.
[student@station student]$ cat /etc/adjtime
-9.359142 1064838378 0.000000
1064838378
UTC
6. What would you expect the command wc -w < /etc/adjtime to return?

( ) a. 5
( ) b. 6
( ) c. 7
( ) d. 8
7. What would you expect the command wc -l < /etc/adjtime to return?
( ) a. 0
( ) b. 5
( ) c. 3
( ) d. 4
Use the following transcript to answer the next two questions.

[student@station student]$ ls -s /etc/group
4 /etc/group
[student@station student]$ ls -l /etc/group
-rw-r--r--    1 root     root         2475 Aug 17 12:34 /etc/group
8. What would the command wc -l < /etc/group return?

( ) a. 4
( ) b. 2475
( ) c. 12
( ) d. An error, because wc requires at least one filename as an argument.
( ) e. Not enough information is provided.
9. What would the command wc -c < /etc/group return?
( ) a. 4
( ) b. 2475
( ) c. 12
( ) d. An error, because wc requires at least one filename as an argument.
( ) e. Not enough information is provided.
10. Which of the following commands can be used to distinguish a tab from a series of spaces in a text file?
( ) a. cat -A
( ) b. cat -t
( ) c. cat -uT
( ) d. A and B
Notes
1. While this may seem an obvious way to do things, some operating systems take more elaborate
approaches. The Macintosh operating system, for example, stores files using two arrays of
information, a data fork and a resource fork.
Chapter 2. Finding Text: grep

Key Concepts
grep is a command that prints lines that match a specified text string or pattern.
grep is commonly used as a filter to reduce output to only desired items.
grep -r will recursively grep files underneath a given directory.
grep -v prints lines that do NOT match a specified text string or pattern.
Many other command line switches allow users to specify grep's output format.
Discussion
Searching Text File Contents using grep
In an earlier Lesson, we saw how the wc program can be used to count the characters, words and lines in
text files. In this Lesson we introduce the grep program, a handy tool for searching text file contents for
specific words or character sequences.
The name grep comes from the ed editor command g/re/p: globally search for a regular expression and
print the matching lines. What, you may well ask, is a regular expression and why on earth should I
want to search for one? We will provide a more formal definition of
regular expressions in a later Lesson, but for now it is enough to know that a regular expression is simply
a way of describing a pattern, or template, to match some sequence of characters. A simple regular
expression would be Hello, which matches exactly five characters: H, e, two consecutive l
characters, and a final o. More powerful search patterns are possible and we shall examine them in the
next section.
The figure below gives the general form of the grep command line:
Figure 2-1. Form of the grep commands
There are actually three different names for the grep tool 1:
fgrep
Does a fast search for simple patterns. Use this command to quickly locate patterns without any
wildcard characters, useful when searching for an ordinary word.
grep
Pattern searches using ordinary regular expressions.
egrep
Pattern searches using more powerful extended regular expressions.
The pattern argument supplies the template characters for which grep is to search. The pattern is
expected to be a single argument, so if pattern contains any spaces, or other characters special to the
shell, you must enclose the pattern in quotes to prevent the shell from expanding or word splitting it.
The following table summarizes some of grep's more commonly used command line switches. Consult
the grep(1) man page (or invoke grep --help) for more.
Table 2-1. Common Command Line Switches for the grep Command
Switch          Effect
-c              Print a count of matching lines only.
-h              Suppress filename prefixes.
-e expression   Use expression as a search pattern. (Helpful for specifying several alternate patterns.)
-i              Ignore case when determining matches.
-l              Print filenames that contain matching pattern only.
-n              Include line numbers along with matching lines.
-q              "Quiet". Do not write anything to standard out. Instead, exit with a zero exit status if
                any match is found.
-r              Search all files, recursing through directories.
-w              Only match whole words.
-C              Include two lines of context before and after the matched line.
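A few of these switches can be compared side by side on a small sample. The following sketch keeps three lines of text in a shell variable and pipes them to grep on standard input; the variable names are illustrative only.

```shell
sample=$(printf 'Every cat has one more tail than no cat.\nNo cat has nine tails.\nTherefore, every cat has ten tails.\n')

case_sensitive=$(printf '%s\n' "$sample" | grep -c every)    # lowercase "every" only
ignore_case=$(printf '%s\n' "$sample" | grep -ci every)      # also matches "Every"
numbered=$(printf '%s\n' "$sample" | grep -n nine)           # prefix with line number
whole_word=$(printf '%s\n' "$sample" | grep -cw tail)        # "tails" is not the word "tail"
inverted=$(printf '%s\n' "$sample" | grep -v tails)          # lines without "tails"

echo "every: $case_sensitive, ignoring case: $ignore_case, whole word tail: $whole_word"
echo "$numbered"
echo "$inverted"
```

The -c and -i switches are combined here as -ci, in the same way that switches are combined for other commands.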
Show All Occurrences of a String in a File

Under Linux, there are often several ways of accomplishing the same task. For example, to see if a file
contains the word even, you could just visually scan the file:
[student@station student]$ cat file
This file has some words.

It also has even more words.
Reading the file, we see that the file does indeed contain the letters even. This method breaks down on
a large file, because we could easily miss one word in a file of several thousand, or even several
hundred thousand, words. We can use the grep tool to search through the file for us automatically:
[student@station student]$ grep even file
It also has even more words.
Here we searched for a word using its exact spelling. Instead of just a literal string, the pattern
argument can also be a general template for matching more complicated character sequences; we shall
explore that in a later Lesson.
Searching in Several Files at Once

An easy way to search several files is just to name them on the grep command line:
[student@station student]$ echo Every cat has one more tail than no cat. > general
[student@station student]$ echo No cat has nine tails. > specific
[student@station student]$ echo Therefore, every cat has ten tails. > fallacy
[student@station student]$ grep cat general specific fallacy
general:Every cat has one more tail than no cat.
specific:No cat has nine tails.
fallacy:Therefore, every cat has ten tails.
Perhaps we are more interested in just discovering which file mentions the word nine than actually
seeing the line itself. Adding the -l switch to the grep line does just that:
[student@station student]$ grep -l nine general specific fallacy
specific
Searching Directories Recursively

The grep command can also search all the files in a whole directory tree with a single command. This
can be handy when working with a large number of files.
The easiest way to understand this is to see it in action. In the directory /etc/sysconfig are text files
that contain much of the configuration information about a Linux system. The Linux name for the first
Ethernet network device on a system is eth0, so you can find which file contains the configuration for
eth0 by letting the grep -r command do the searching for you (see Note 2):
[student@station student]$ grep -r eth0 /etc/sysconfig 2>/dev/null
/etc/sysconfig/network-scripts/ifup-aliases:# Specify multiple ranges using \
multiple files, such as ifcfg-eth0-range0 and
/etc/sysconfig/network-scripts/ifup-aliases:# ifcfg-eth0-range1, etc. In these \
files, the following configuration variables
/etc/sysconfig/network-scripts/ifup-aliases:# The above example values create \
the interfaces eth0:0 through eth0:253 using
/etc/sysconfig/network-scripts/ifup-ipv6:# Example: \
IPV6TO4_ROUTING="eth0-:f101::0/64 eth1-:f102::0/64"
/etc/sysconfig/network-scripts/ifcfg-eth0:DEVICE=eth0
/etc/sysconfig/networking/devices/ifcfg-eth0:DEVICE=eth0
/etc/sysconfig/networking/profiles/default/ifcfg-eth0:DEVICE=eth0
Every file in /etc/sysconfig that mentions eth0 is shown in the results.

We can further limit the files listed to only those referring to an actual device by filtering the grep -r
output through a grep DEVICE:
[student@station student]$ grep -r eth0 /etc/sysconfig 2>/dev/null | grep DEVICE
/etc/sysconfig/network-scripts/ifcfg-eth0:DEVICE=eth0
/etc/sysconfig/networking/devices/ifcfg-eth0:DEVICE=eth0
/etc/sysconfig/networking/profiles/default/ifcfg-eth0:DEVICE=eth0
This shows a common use of grep as a filter to simplify the outputs of other commands.
If only the names of the files are of interest, the output can be simplified with the -l command line
switch.
[student@station student]$ grep -rl eth0 /etc/sysconfig 2>/dev/null
/etc/sysconfig/network-scripts/ifup-aliases
/etc/sysconfig/network-scripts/ifup-ipv6
/etc/sysconfig/network-scripts/ifcfg-eth0
/etc/sysconfig/networking/devices/ifcfg-eth0
/etc/sysconfig/networking/profiles/default/ifcfg-eth0
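Because /etc/sysconfig differs from system to system, the same three steps can be tried on a small tree built for the purpose. This is a sketch under stated assumptions: the directory layout and file contents below are made up for illustration and only loosely imitate real configuration files.

```shell
# Build a small, hypothetical configuration tree in a temporary directory.
tree=$(mktemp -d)
mkdir -p "$tree/network-scripts" "$tree/devices"
printf 'DEVICE=eth0\nONBOOT=yes\n' > "$tree/network-scripts/ifcfg-eth0"
printf 'DEVICE=eth1\nONBOOT=no\n'  > "$tree/network-scripts/ifcfg-eth1"
printf '# see ifcfg-eth0 for an example\n' > "$tree/devices/README"

# -r descends into every subdirectory; each hit is prefixed with its path.
grep -r eth0 "$tree"

# Filter the recursive results down to the lines that also mention DEVICE.
device_lines=$(grep -r eth0 "$tree" | grep DEVICE)
echo "$device_lines"

# -l lists each matching file once; sort makes the order predictable.
matching_files=$(grep -rl eth0 "$tree" | sort)
echo "$matching_files"
```

Note that the second grep in the pipeline sees the file name as part of each line, so a pattern that happens to occur in a path would also pass the filter.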
Inverting grep
By default, grep shows only the lines matching the search pattern. Usually, this is what you want, but
sometimes you are interested in the lines that do not match the pattern. In these instances, the -v
command line switch inverts grep's operation.
[student@station student]$ head -n 4 /etc/passwd
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:
daemon:x:2:2:daemon:/sbin:
adm:x:3:4:adm:/var/adm:
[student@station student]$ grep -v root /etc/passwd | head -n 3
bin:x:1:1:bin:/bin:
daemon:x:2:2:daemon:/sbin:
adm:x:3:4:adm:/var/adm:
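A quick way to convince yourself that -v is an exact complement is to run both searches on the same input. The sketch below uses a made-up four-line extract in /etc/passwd format (the operator entry is an assumption added to show a match that is not at the start of the line).

```shell
# A hypothetical extract in /etc/passwd format, not read from a real system.
sample=$(mktemp)
cat > "$sample" <<'EOF'
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:
daemon:x:2:2:daemon:/sbin:
operator:x:11:0:operator:/root:
EOF

# Lines that mention root anywhere (operator qualifies through /root).
with_root=$(grep root "$sample")
echo "$with_root"

# -v keeps exactly the lines that the search above left out.
without_root=$(grep -v root "$sample")
echo "$without_root"
```

Every input line lands in exactly one of the two results, so the two line counts always add up to the size of the file.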
Getting Line Numbers

Often you may be searching a large file that has many occurrences of the pattern. Grep will list each line
containing one or more matches, but how is one to locate those lines in the original file? Using the grep
-n command will also list the line number of each matching line.
The file /usr/share/dict/words contains a list of common dictionary words. Identify which line
contains the word dictionary:
[student@station student]$ fgrep -n dictionary /usr/share/dict/words
12526:dictionary
You might also want to combine the -n switch with the -r switch when searching all the files below a
directory:
[student@station student]$ fgrep -nr dictionary /usr/share/dict
linux.words:12526:dictionary
words:12526:dictionary
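The effect of -n, alone and combined with -r, can be checked on a word list small enough to count by hand. This is a minimal sketch; the two files below are illustrative stand-ins for the system dictionary, and their contents are assumptions.

```shell
# Two tiny stand-ins for the files under /usr/share/dict.
dictdir=$(mktemp -d)
printf '%s\n' apple banana dictionary > "$dictdir/words"
printf '%s\n' dictionary zebra > "$dictdir/extra.words"

# -n prefixes each matching line with its line number in the file.
numbered=$(grep -n dictionary "$dictdir/words")
echo "$numbered"

# With -r as well, each hit carries the file name, then the line number.
combined=$(grep -rn dictionary "$dictdir" | sort)
echo "$combined"
```

In the recursive form the fields are separated by colons in the order file, line number, matching text.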
Limiting Matching to Whole Words

Remember the file containing our nursery rhyme earlier?
[student@station student]$ cat rhyme
The cat
sat on
the mat
at home.
Suppose we wanted to retrieve all lines containing the word at. If we try the command:
[student@station student]$ fgrep at rhyme
The cat
sat on
the mat
at home.
Do you see what happened? We matched the "at" string, whether it was an isolated word or part of a
larger word. The grep command provides the -w switch to specify that the pattern should only
match entire words.
[student@station student]$ grep -w at rhyme
at home.
The -w switch considers a sequence of letters, numbers, and underscore characters, surrounded by
anything else, to be a word.
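The difference between substring and whole-word matching is easy to measure on the rhyme itself. The sketch below recreates the file in a temporary location (the variable names are illustrative) and compares the two searches.

```shell
# Recreate the nursery rhyme in a temporary file.
rhyme=$(mktemp)
printf '%s\n' 'The cat' 'sat on' 'the mat' 'at home.' > "$rhyme"

# Without -w, the pattern also matches inside cat, sat, and mat.
# -c reports the number of matching lines instead of the lines themselves.
substring_count=$(grep -c at "$rhyme")
echo "$substring_count"

# With -w, only the free-standing word qualifies.
whole_word=$(grep -w at "$rhyme")
echo "$whole_word"
```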
Ignoring Case
The string "Bob" has a meaning quite different from the string "bob". However, sometimes we
want to find either one, regardless of whether the word is capitalized or not. The grep -i command solves
just this problem.
Look again at our nursery rhyme:
The cat
sat on
the mat
at home.
See if the file contains the word the, all in lowercase letters:
[student@station student]$ grep the rhyme
the mat
Now see which lines contain the letters t, h, and e in any combination of lower- or upper-case
letters:
[student@station student]$ grep -in the rhyme
1:The cat
3:the mat
Notice that we also used the -n switch to add the line numbers to the output.
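Both searches can be captured side by side to make the effect of -i explicit. This is a small sketch that rebuilds the rhyme in a temporary file; the variable names are illustrative.

```shell
# Recreate the nursery rhyme in a temporary file.
rhyme=$(mktemp)
printf '%s\n' 'The cat' 'sat on' 'the mat' 'at home.' > "$rhyme"

# Case-sensitive by default: only the all-lowercase spelling matches.
case_sensitive=$(grep the "$rhyme")
echo "$case_sensitive"

# -i ignores case, and -n adds the line number of each match.
case_insensitive=$(grep -in the "$rhyme")
echo "$case_insensitive"
```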
Examples
Example 1. Finding Simple Character Strings
Verify that your computer has the system account lp, used for the line printer tools. Hint: the file
/etc/passwd contains one line for each user account on the system.
[student@station student]$ grep lp /etc/passwd
lp:x:4:7:lp:/var/spool/lpd:
Example 2. In That Case

Search for an exact copy of the pattern:
[student@station student]$ grep LP /etc/passwd
[student@station student]$
Nothing was matched because the pattern does not match the case for the account name. Search again
and ignore the case:
[student@station student]$ grep -i LP /etc/passwd
lp:x:4:7:lp:/var/spool/lpd:
Example 3. Matching Whole Words

We have seen that grep will match the pattern wherever the pattern is located, even in the middle of
words. Search for the pattern honey in the system word dictionary /usr/share/dict/words:
[student@station student]$ grep honey /usr/share/dict/words
honey
honeybee
honeycomb
honeycombed
honeydew
honeymoon
honeymooned
honeymooner
honeymooners
honeymooning
honeymoons
honeysuckle
Mahoney
Evidently, the dictionary contains several words using the string honey as a root word. We can limit
the matching to whole words by using the grep -w command. The grep command considers a word to be
a group of letters, digits, or underscores surrounded by anything else. The beginning and end of a line
also qualify as anything else, so the first or last word on a line is recognized correctly. Try to look up
honey in the dictionary again:
[student@station student]$ grep -w honey /usr/share/dict/words
honey
For lack of a better word: perfect.
Example 4. Combining grep and xargs

Suppose that you have been placed in charge of maintaining the help file documentation for the vim
editor. As you browse through the already existing files, you notice that in some places, the help files use
the two words command line, and in other places the single word commandline. You would like the help
files to be consistent, and decide the former is correct.
You would now like to find every occurrence of the text commandline, and change them to command
line. You start by identifying which files contain the text commandline.
[student@station student]$ grep -ril commandline /usr/share/doc/vim*
/usr/share/doc/vim-common-6.1/docs/message.txt
/usr/share/doc/vim-common-6.1/docs/options.txt
/usr/share/doc/vim-common-6.1/docs/os_risc.txt
/usr/share/doc/vim-common-6.1/docs/tags
/usr/share/doc/vim-common-6.1/docs/todo.txt
/usr/share/doc/vim-common-6.1/docs/various.txt
/usr/share/doc/vim-common-6.1/docs/version5.txt
You would now like to open each of these files in the gedit text editor, so that you can make your edits.
You pipe the results of your search into the gedit command.
[student@station student]$ grep -ril commandline /usr/share/doc/vim* | gedit
The gedit editor opens, but with an empty buffer titled "untitled". This is not what you had meant! You
had wanted gedit to open the filenames that the grep command supplied on stdin, not stdin itself.
Unfortunately, that's not how gedit works. gedit (like most text editors) expects filenames to be supplied
as arguments on the command line, not using stdin.
Fortunately, there is a standard Linux (and Unix) utility that helps in just such situations: xargs. The
xargs command will read stdin, and append any words found there to the supplied command line, as
additional arguments. Hopefully, the following example will clarify. With your knowledge of the xargs
command, you modify your previous approach.
[student@station student]$ grep -ril commandline /usr/share/doc/vim* | xargs gedit
Now, the gedit editor opens up with multiple buffers, one for each file output by the grep command.
Figure 2-2. Using xargs to Convert Standard In into Arguments for gedit
Notice that you never had to type in the individual file names. The words supplied on stdin were
exchanged for arguments on the command line, thus the name xargs. Nice.
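The same grep-into-xargs pattern can be tried without a graphical editor. In the sketch below, wc -l stands in for gedit so that the result is visible on the terminal; the three help files are hypothetical. The last command is an aside that assumes a grep supporting the --null option and an xargs supporting -0 (both GNU tools do), which keeps file names containing spaces intact.

```shell
# Three hypothetical help files in a temporary directory.
docs=$(mktemp -d)
printf 'use the commandline\n'  > "$docs/options.txt"
printf 'use the command line\n' > "$docs/intro.txt"
printf 'Commandline editing\n'  > "$docs/various.txt"

# grep -ril prints one matching file name per line on stdout ...
grep -ril commandline "$docs"

# ... and xargs appends those names to the command it is given.
# wc -l stands in for gedit here so the sketch runs without a display.
line_counts=$(grep -ril commandline "$docs" | xargs wc -l)
echo "$line_counts"

# NUL-separated names survive spaces and other odd characters.
file_count=$(grep -ril --null commandline "$docs" | xargs -0 ls | grep -c .)
echo "$file_count"
```

Plain xargs splits its input on whitespace, so the NUL-separated form is the safer habit whenever file names are not under your control.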
Online Exercises
Lab Exercise
Objective: Use the grep command to find occurrences of specified text.
Specification
1. Create the file ~/bashusers.txt, which contains lines from the /etc/passwd file which contain
the text /bin/bash.
2. Create the file ~/nostdhome.txt, which contains only lines from the /etc/passwd file which do
not contain the text home (implying that the associated user has a nonstandard home directory).
3. Create the file ~/ansiterms.txt, which contains every line from the /etc/termcap file which
contains the text ansi, using a case insensitive search. (In other words, ansi, ANSI, Ansi, and AnSi
would all count as matches).
4. Create the file ~/mayhemnum.txt, which contains the line number of the word mayhem from the
file /usr/share/dict/words as its only word.
5. Create the file ~/firstredhat.txt, which contains an alphabetically sorted list of all files
underneath the /usr/share/firstboot directory (and its subdirectories) that contain the text
redhat, using a case insensitive search. The files should be listed one per line using absolute
references.
Deliverables
1. The file ~/bashusers.txt, which contains lines from the /etc/passwd file which contain the text /bin/bash.
2. The file ~/nostdhome.txt, which contains lines from the /etc/passwd file which do not contain the text
home.
3. The file ~/ansiterms.txt, which contains every line from the /etc/termcap file which contains the text
ansi, using a case insensitive search.
4. The file ~/mayhemnum.txt, which contains the line number of the word mayhem from the file
/usr/share/dict/words as its only word.
5. The file ~/firstredhat.txt, which contains an alphabetically sorted list of all files underneath the
/usr/share/firstboot directory that contain the text redhat, using a case insensitive comparison. The files
should be listed one per line using absolute references.
Questions
1. Which of the following command lines would list lines from the file /etc/group which contain the text elvis?
( ) a. grep /etc/group elvis
( ) b. echo elvis | grep /etc/group
( ) c. echo /etc/group | grep elvis
( ) d. A and C
2. To allow the search pattern HELLO to match both hello and HELLO, you would use the grep command with
which command line switch?
( ) a. -i
( ) b. -r
( ) c. -w
( ) d. -k

3. To search an entire directory hierarchy, you would run grep using which command line switch?
( ) a. -i
( ) b. -f
( ) c. -n
( ) d. -r
4. Which of the following command lines would list all lines from /usr/share/dict/words which contain the
text sun, but only those lines, preceded by their line number?
( ) a. grep -n sun /usr/share/dict/words
( ) b. grep -N /usr/share/dict/words sun
( ) c. grep -r /usr/share/dict/words sun
( ) d. grep -r sun /usr/share/dict/words
5. Which of the following command lines would return the number of lines which contain the text freedom found in
the file /usr/share/doc/redhat-release-3ES/GPL?
( ) a. grep freedom /usr/share/doc/redhat-release-3ES/GPL | wc -w
( ) b. grep freedom /usr/share/doc/redhat-release-3ES/GPL | wc -l
( ) c. grep freedom /usr/share/doc/redhat-release-3ES/GPL | wc -c
( ) d. grep freedom /usr/share/doc/redhat-release-3ES/GPL | wc -n
6. Which of the following command lines would list the names of files (and only the names of files) found
underneath the /etc directory which contain the word network (i.e., the word networking would not count).
( ) a. grep -rwl network /etc
( ) b. grep -wl network /etc
( ) c. grep -rl network /etc
( ) d. grep -ilw network /etc
7. Which of the following command lines would reduce the output of the ps aux command to only those lines which
do not contain the text root?
( ) a. grep root < ps aux
( ) b. ps aux | grep -v root
( ) c. grep -x root | ps aux
( ) d. ps aux >> grep -k root
8. Which of the following would list lines from the file /etc/nsswitch.conf which contain the text nisplus?
( ) a. grep nisplus /etc/nsswitch.conf
( ) b. grep nisplus < /etc/nsswitch.conf
( ) c. grep -q nisplus /etc/nsswitch.conf
( ) d. All of the above
( ) e. A and B only
9. Which of the following would list every file under the /usr/share/gnome directory which contains the text
Free Software Foundation on a single line?
( ) a. grep -ril Free Software Foundation /usr/share/gnome
( ) b. ls /usr/share/gnome | grep -i Free Software Foundation
( ) c. grep -rc Free Software Foundation /usr/share/gnome
( ) d. grep -rl "Free Software Foundation" /usr/share/gnome
10. Which of the following command lines would list every line which contains the text cdrom from the file README,
along with two lines of context before and after the matching line?
( ) a. grep cdrom README | head -2 | tail -2
( ) b. grep -n2 cdrom README
( ) c. grep -n2 README cdrom
( ) d. grep -k2 cdrom README
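Copyright (c) 2003-2005 Red Hat, Inc. All rights reserved. For use only by a student enrolled in a Red Hat Academy course taught at a Red Hat Academy. Any other use is a
violation of U.S. and international copyrights. No part of this publication may be photocopied, duplicated, stored in a retrieval system, or otherwise
duplicated whether in electronic or print format without prior written consent of Red Hat, Inc. If you believe Red Hat course materials are being used, copied,
or otherwise improperly distributed please email training@redhat.com or phone toll-free (USA) +1 866 626 2994 or +1 (919) 754 3700.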
Notes
1. When the original grep program was written, computers did not have much memory, so having small
programs was very desirable. Having a single program that could efficiently implement the three
different types of searching was not possible unless the program were to be extraordinarily large, so
the functions were separated into three programs. When the GNU project was started, computers
could easily handle larger programs, so all three searching techniques were built into a single
program but all three program names were kept for compatibility with other UNIX-like systems.
2. There are some files under /etc/sysconfig that ordinary users cannot read. We have used the
2>/dev/null idiom to have all error messages be ignored.
Chapter 3. Introduction to Regular Expressions
Key Concepts
Regular expressions are a standard Unix syntax for specifying text patterns.
Regular expressions are understood by many commands, including grep, sed, vi, and many scripting
languages.
Within regular expressions, . and [] are used to match characters.
Within regular expressions, +, *, and ? specify a number of consecutive occurrences.
Within regular expressions, ^ and $ specify the beginning and end of a line.
Within regular expressions, (, ), and | specify alternative groups.
The regex(7) man page provides complete details.
Discussion
Introducing Regular Expressions
In the previous chapter you saw grep used to match either a whole word or part of a word. This by
itself is very powerful, especially in conjunction with arguments like -i and -v, but it is not appropriate for
all search scenarios. Here are some examples of searches that the grep usage you've learned so far would
not be able to do:
First, suppose you had a file that looked like this:
[biafra@station]$ cat people_and_pets.txt
==========================
Name: Joe Green
Age: 36
Pets:
    Name: Aida
    Age: 5
    Species: Cat
    -----------
    Name: Hawn
    Age: 1
    Species: Goldfish
==========================
Name: Sarah Jane
Age: 29
Pets:
    Name: Orfeus
    Age: 7
    Species: Dog
    ------------
    Name: Euridice
    Age: 8
    Species: Dog
What if you wanted to pull out just the names of the people in people_and_pets.txt? A command like
grep -w Name: would match the Name: line for each person, but also the Name: line for each
person's pet. How could we match only the Name: lines for people? Well, notice that the lines for pets'
names are all indented, meaning that those lines begin with whitespace characters instead of text. Thus,
we could achieve our goal if we had a way to say "Show me all lines that begin with Name:".
Another example: Suppose you and a friend both witnessed a hit-and-run car accident. You both got a
look at the fleeing car's license plate, and yet each of you recalls a slightly different number. You read the
license number as "4I35VBB" but your friend read it as "413SV88". It seems that what you read as an "I"
in the second character, your friend read as a "1". Similar differences appear in your interpretations of
other parts of the license, like "5" vs "S" and "BB" vs "88". The police, having taken both of your
statements, now need to narrow down the suspects by querying their database of license plates for plates
that might match what you saw.
One solution might be to do separate queries for "4I35VBB" and "413SV88", but doing so assumes that
one of you is exactly right. What if the perpetrator's license number was actually "4135VB8"? In other
words, what if you were right about some of the characters in question but your friend was right about
others? It would be more effective if the police could query for a pattern that effectively said: "Show me
all license numbers that begin with a 4, followed by an I or a 1, followed by a 3, followed by a 5
or an S, followed by a V, followed by two characters that are each either a B or an 8".
Query scenarios like these can be solved using regular expressions. While computer scientists sometimes
use the term "regular expression" (or "regex" for short) to describe any method of describing complex
patterns, in Linux and many programming languages the term refers to a very specific set of special
characters used for solving problems like the above. Regular expressions are supported by a large
number of tools including grep, vi, find and sed.
To introduce the usage of regular expressions, let's look at some solutions to the two problems introduced
earlier. Don't worry if these seem a bit complicated; the remainder of the unit will start from scratch and
cover regular expressions in great detail.
A regex that could solve the first problem, where we wanted to say "Show me all lines that begin with
Name:" might look like this:
[biafra@station]$ grep '^Name:' people_and_pets.txt
Name: Joe Green
Name: Sarah Jane
...that's it! Regular expressions are all about the use of special characters, called metacharacters, to
represent advanced query parameters. The caret ("^"), as shown here, means "Lines that begin with...".
Note, by the way, that the regular expression was put in single quotes. This is a good habit to get into
early on as it prevents bash from interpreting special characters that were meant for grep.
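The anchored search can be compared against the unanchored one on a shortened copy of the file. This is a minimal sketch; the abbreviated contents and the four-space indentation of the pet lines are assumptions made for illustration.

```shell
# A shortened, hypothetical version of people_and_pets.txt.
pets=$(mktemp)
cat > "$pets" <<'EOF'
Name: Joe Green
Age: 36
Pets:
    Name: Aida
    Species: Cat
Name: Sarah Jane
Pets:
    Name: Orfeus
    Species: Dog
EOF

# Unanchored: counts the Name: lines of owners and pets alike.
all_names=$(grep -c 'Name:' "$pets")
echo "$all_names"

# Anchored with ^: only lines that start with Name: are kept.
# The single quotes stop the shell from touching the pattern.
owner_names=$(grep '^Name:' "$pets")
echo "$owner_names"
```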
Ok, so what about the second problem? That one involved a much more complicated query: "Show me
all license numbers that begin with a 4, followed by an I or a 1, followed by a 3, followed by a 5
or an S, followed by a V, followed by two characters that are each either a B or an 8". This could
be represented by a regular expression that looks like this:
4[I1]3[5S]V[B8]{2}
Wow, that's pretty short considering how long it took to write out what we were looking for! There are
only two types of regex metacharacters used here: square brackets ([]) and curly braces ({}). When two
or more characters are shown within square brackets it means "any one of these". So [B8] near the end of
the expression means "B or 8". When a number is shown within curly braces it means "this many of
the preceding item". Thus, [B8]{2} means "two characters that are each either a B or an 8".
Pretty powerful stuff!
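To see the license plate expression at work, it can be run against a short list of plates. The sketch below is illustrative only: the list of plates is made up, and grep -E is used as the option form of egrep to select the extended syntax that the curly braces require.

```shell
# A hypothetical database of license plates, one per line.
plates=$(mktemp)
printf '%s\n' 4I35VBB 413SV88 4135VB8 4X35VBB 4I35VB 7I35VBB > "$plates"

# 4, then I or 1, then 3, then 5 or S, then V, then two of B or 8.
candidates=$(grep -E '4[I1]3[5S]V[B8]{2}' "$plates")
echo "$candidates"
```

Both witness statements and the mixed reading 4135VB8 are selected, while plates that differ in any fixed position, or that are one character short, are not.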
Now that you've gotten a taste of what regular expressions are and how they can be used, let's start from
scratch and cover them in depth.
Regular Expressions, Extended Regular Expressions, and the grep Command
As the Unix implementation of regular expression syntax has evolved, new metacharacters have been
introduced. In order to preserve backward compatibility, commands usually choose to implement either
basic regular expressions or extended regular expressions. In order to not become bogged down with the
differences, this Lesson will introduce the extended syntax, summarizing differences at the end of the discussion.
One of the most common uses for regular expressions is specifying search patterns for the grep
command. As was mentioned in the previous Lesson, there are three versions of the grep command.
Reiterating, the three differ in how they interpret regular expressions.
fgrep
The fgrep command is designed to be a "fast" grep. The fgrep command does not support regular
expressions, but instead interprets every character in the specified search pattern literally.
grep
The grep command interprets each pattern using the original, basic regular expression syntax.
egrep
The egrep command interprets each pattern using extended regular expression syntax.
Because we are not yet making a distinction between the basic and extended regular expression syntax,
the egrep command should be used whenever the search pattern contains regular expressions.
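The three interpretations can be compared by giving the same pattern to each variant. The sketch below uses grep -F and grep -E, the option forms of fgrep and egrep; the three input lines are chosen purely for illustration.

```shell
# Three lines that each variant treats differently.
lines=$(mktemp)
printf '%s\n' 'a.b+' 'axb+' 'axbb' > "$lines"

# Fixed strings: every character of the pattern is literal.
fixed_count=$(grep -F -c 'a.b+' "$lines")

# Basic syntax: . is a wildcard, but + is still a literal plus sign.
basic_count=$(grep -c 'a.b+' "$lines")

# Extended syntax: . is a wildcard and + means "one or more".
extended_count=$(grep -E -c 'a.b+' "$lines")

echo "$fixed_count $basic_count $extended_count"
```

Each step from fixed to basic to extended gives one more character a special meaning, so the same pattern matches progressively more lines.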
Anatomy of a Regular Expression

In our discussion of the grep program family, we were introduced to the idea of using a pattern to
identify the file content of interest. Our examples were carefully constructed so that the pattern contained
exactly the text for which we were searching. We were careful to use only literal characters in our
regular expressions; a literal character matches only itself. So when we used hello as the regular
expression, we were using a five-character regular expression composed only of literal characters. While
this let us concentrate on learning how to operate the grep program, it didn't allow us to get a full
appreciation of the power of regular expressions. Before we see regular expressions in use, we shall first
see how they are constructed.
A regular expression is a sequence of:
Literal Characters
Literal characters match only themselves. Examples of literals are letters, digits and most special
characters (see below for the exceptions).
Wildcards
Wildcard characters match any character. Within a regular expression, a period (.) matches any
character, be it a space, a letter, a digit, punctuation, anything.
Modifiers
A modifier alters the meaning of the immediately preceding pattern character. For example, the
expression ab*c matches the strings ac, abc, abbc, abbbc, and so on, because the
asterisk (*) is a modifier that means "any number of (including zero)". Thus, our pattern means to
match any sequence of characters consisting of one a, a (possibly empty) series of b characters,
and a final c character.
Anchors
Anchors establish the context for the pattern, such as "the beginning of a line", or "the end of a
word". For example, the expression cat would match any occurrence of the three letters, while
^cat would only match lines that begin with cat.
Each of these is discussed in more detail in the sections below.
Taking Literals Literally

Literals are straightforward because each literal character in a regular expression matches one, and only
one, copy of itself in the searched text. Uppercase characters are distinct from lowercase characters, so
that A does not match a.
Wildcards
The "dot" wildcard
The character . is used as a placeholder, to match one of any character. In the following example, the
pattern matches any occurrence of the literal characters x and s, separated by exactly two other
characters.
[student@station student]$ grep "x..s" /usr/share/dict/words | head -5
antitoxins
axers
axles
axons
boxers
Bracket Expressions: Ranges of Literal Characters

Normally a literal character in a regex pattern matches exactly one occurrence of itself in the searched
text. Suppose we want to search for the string hello regardless of how it is capitalized: we want to
match Hello and HeLLo as well. How might we do that?
A regex feature called a bracket expression solves this problem neatly. A bracket expression is a range of
literals enclosed in square brackets ([ and ]). For example, the regex pattern [Hh] is a character
range that matches exactly one character: either an uppercase H or a lowercase h letter. Notice that it
doesn't matter how large the set of characters within the range is; the set matches exactly one character,
if it matches any at all. A bracket expression that matches the set of lowercase vowels could be written
[aeiou] and would match exactly one vowel.
If the first character of a bracket expression is a ^, the interpretation is inverted, and the bracket
expression will match any single occurrence of a character not included in the range. For example, the
expression [^aeiou] would match any character that is not a lowercase vowel.
In the following example, bracket expressions are used to find words from the file
/usr/share/dict/words. In the first case, the first five words that contain three consecutive
(lowercase) vowels are printed. In the second case, the first five words that contain three consecutive
vowel-consonant pairs are printed, where "consonant" is shorthand for any character that is not a
lowercase vowel.
[student@station student]$ egrep '[aeiou][aeiou][aeiou]' /usr/share/dict/words | head -5
absenteeism
Achaean
Achaeans
acquaint
acquaintance
[student@station student]$ egrep '[aeiou][^aeiou][aeiou][^aeiou][aeiou][^aeiou]' /usr/share/dict/words | head -5
abased
abasement
abasements
abases
abasing
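Bracket expressions also answer the question that opened this section, namely how to match hello however it is capitalized. The sketch below is illustrative; the short word list is made up so that the results can be checked by eye, and grep -E is used as the option form of egrep.

```shell
# A small, hypothetical word list.
words=$(mktemp)
printf '%s\n' hello Hello HeLLo help queue rhythm abased > "$words"

# One bracket expression per letter accepts either case at each position.
any_case=$(grep -E '[Hh][Ee][Ll][Ll][Oo]' "$words")
echo "$any_case"

# Three consecutive lowercase vowels.
three_vowels=$(grep -E '[aeiou][aeiou][aeiou]' "$words")
echo "$three_vowels"

# A leading ^ inverts the set: vowel, non-vowel, three times over.
alternating=$(grep -E '[aeiou][^aeiou][aeiou][^aeiou][aeiou][^aeiou]' "$words")
echo "$alternating"
```

For a whole word the -i switch is the shorter route to case-insensitive matching; bracket expressions are the tool when only some positions should be flexible.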
Range Expressions vs. Character Classes: Old School and New School
Another way to express a character range is by giving the start- and end-letters of the sequence this way:
[a-d] would match any character from the set a, b, c or d. A typical usage of this form would be
[0-9] to represent any single digit, or [A-Z] to represent all capital letters.
How are the characters ordered? For example, does uppercase C come before or after lowercase b?
Recall the discussion about character encoding from the first Lesson. The encoded value of the letter is
used to determine if one character is "lesser" or "greater" than another. As long as the character set which
defines the encoding is ordered correctly, as is the case with ASCII, all is well. But what about the
Latin-1 (ISO-8859-1) character set? Does a non-ASCII letter, such as an accented one, really come after z?
As an alternative to such quandaries, modern regular expressions make use of character classes. Character
classes match any single character, using language specific conventions to decide if a given character is
uppercase or lowercase, or if it should be considered part of the alphabet or punctuation. The following
table lists some supported character classes, and the ASCII equivalent range expression, where
appropriate.
Table 3-1. Regular Expression Character Classes

Expression   Character Class                                            ASCII equivalent range
[:alnum:]    alphanumeric                                               A-Za-z0-9
[:alpha:]    alphabet character                                         A-Za-z
[:blank:]    space or tab
[:digit:]    numeric digit                                              0-9
[:lower:]    lowercase letters                                          a-z
[:punct:]    printable characters, excluding spaces and alphanumerics
[:space:]    whitespace character
[:upper:]    uppercase letter                                           A-Z

Character classes avoid problems you may run into when using regular expressions on systems that use
different character encoding schemes where letters are ordered differently. For example, suppose you
were to run the command:
[elvis@station]$ grep '[A-Z]' /usr/share/dict/words
On a Red Hat Enterprise Linux system, this would match every word in the file, not just those that
contain capital letters as one might assume. This is because in the Unicode (UTF-8) locales that RHEL
uses by default, characters are collated case-insensitively, so that [A-Z] is equivalent to
[AaBbCc...etc]. On older systems, though, a different character encoding scheme is used where
collation is done case-sensitively. On such systems [A-Z] would be equivalent to [ABC...etc].
Character classes avoid this pitfall. You can run:
[elvis@station]$ grep '[[:upper:]]' /usr/share/dict/words
on any system regardless of the encoding scheme being used and it will only match lines that contain
capital letters.
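Character classes are written inside a bracket expression, which is why the brackets appear doubled: the outer pair opens the bracket expression and the inner [:upper:] names the class. The following sketch tries two classes on a made-up list of lines.

```shell
# A few hypothetical lines mixing case, digits, and spaces.
mixed=$(mktemp)
printf '%s\n' alpha Bravo charlie 'delta 4' ECHO > "$mixed"

# Lines containing at least one uppercase letter.
upper_lines=$(grep '[[:upper:]]' "$mixed")
echo "$upper_lines"

# Lines containing at least one digit.
digit_lines=$(grep '[[:digit:]]' "$mixed")
echo "$digit_lines"
```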
For more details about the predefined range expressions, consult the grep manual page. For more
information on character encoding schemes under Linux, refer back to chapter 8.3. To learn about how
character encoding schemes are used to support other languages in Red Hat Enterprise Linux, begin with
the locale manual page.
Common Modifier Characters

We saw a common usage of a regex modifier in our earlier example ab*c to match an a and c
character with some number of b letters in between. The * character changed the interpretation of the
literal b character from matching exactly one letter to matching any number of b's.
Here is a list of some common modifier characters:
b?
The question mark (?) means "either one or none": the literal character is considered to be
optional in the searched text. For example, the regex pattern ab?c matches the strings ac and
abc, but not abbc.
b*
The asterisk (*) modifier means "any number (including zero) of the preceding literal
character". The regex pattern ab*c matches the strings ac, abc, abbc, and so on.
b+
The plus (+) modifier means "one or more", so the regex pattern b+ matches a non-empty
sequence of b's. The regex pattern ab+c matches the strings abc and abbc, but does not
match ac.
b{m,n}
The brace modifier is used to specify a range of between m and n occurrences of the preceding
character. The regex pattern ab{2,4}c would match abbc, abbbc, and abbbbc, but not
abc or abbbbbc.
b{n}
With only one integer, the brace modifier is used to specify exactly n occurrences of the preceding
character.
In the following example, egrep prints lines from /usr/share/dict/words that contain patterns
which start with a (capital or lowercase) a, might or might not next have a (lowercase) b, but then
definitely follow with a (lowercase) a.
[student@station student]$ egrep '[Aa]b?a' /usr/share/dict/words | head -5
Aarhus
Aaron
Ababa
aback
abaft
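The modifiers listed above can be compared directly by counting how many of a fixed set of strings each pattern accepts. This is a minimal sketch: the five sample strings are chosen for illustration, and grep -E is used as the option form of egrep.

```shell
# Strings with zero, one, two, three, and five b characters.
samples=$(mktemp)
printf '%s\n' ac abc abbc abbbc abbbbbc > "$samples"

optional=$(grep -E -c 'ab?c' "$samples")      # zero or one b
any_number=$(grep -E -c 'ab*c' "$samples")    # zero or more b
one_or_more=$(grep -E -c 'ab+c' "$samples")   # at least one b
bounded=$(grep -E -c 'ab{2,4}c' "$samples")   # between two and four b
exact=$(grep -E -c 'ab{3}c' "$samples")       # exactly three b

echo "$optional $any_number $one_or_more $bounded $exact"
```

The surrounding a and c matter here: they pin down where the run of b characters starts and stops, so a string with five b's cannot satisfy the bounded pattern by matching only part of its run.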
The following example prints lines which contain patterns which start with al, then use the .* expression
to specify 0 or more occurrences of any character, followed by the pattern bra.
[student@station student]$ egrep 'al.*bra' /usr/share/dict/words | head
algebra
algebraic
algebraically
algebras
calibrate
calibrated
calibrates
calibrating
calibration
calibrations
Notice we found variations on the words algebra and calibrate. For the former, the .* expression
matched ge, while for the latter, it matched the letter i.
The expression .*, which is interpreted as "0 or more of any character", shows up often in regex
patterns, acting as the "stretchable glue" between two patterns of significance.
As a subtlety, we should note that the modifier characters are greedy: they always match the longest
possible input string. For example, given the regex pattern:
t.*e
and the input stream:

now is the time
our pattern matches:

the time
instead of just the. When used in simple searches, such as grep, the difference is usually insignificant.
When regular expressions are used in find and replace operations, however, as is done with many
text editors, the difference becomes significant.
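To see why greediness matters in a find and replace operation, the same pattern can be handed to sed. This is only a sketch: it assumes a sed that accepts -E for extended syntax and a grep that supports -o (print only the matched text), both of which are common but not universal, and the variable names are invented here.

```shell
line='now is the time'

# grep -o prints only the text that the pattern matched.
matched=$(printf '%s\n' "$line" | grep -oE 't.*e')

# The greedy match swallows "the time", so both words are replaced.
replaced=$(printf '%s\n' "$line" | sed -E 's/t.*e/X/')

# A negated range stops at the first e, so only "the" is replaced.
short=$(printf '%s\n' "$line" | sed -E 's/t[^e]*e/X/')

printf '%s\n' "$matched" "$replaced" "$short"
```

Replacing .* with a negated range such as [^e]* is one common way to keep a match from stretching further than intended.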
Anchored Searches
Four additional search modifier characters are available:
^foo
A caret (^) matches the beginning of a line. Our example ^foo matches the string foo only
when it is at the beginning of a line.
foo$
A dollar sign ($) matches the end of a line. Our example foo$ matches the string foo only at
the end of a line, immediately before the newline character.
\<foo\>
By themselves, the less than sign (<) and the greater than sign (>) are literals. Using the
backslash character to escape them transforms them into meaning beginning of a word and end of a
word, respectively. Thus the pattern \<cat\> matches the word cat but not the word
catalog.
You will frequently see both ^ and $ used together. The regex pattern ^foo$ matches a whole line that
contains only foo and would not match that line if it contained any spaces.
The \< and \> are also usually used as pairs.
In the following example, the first search lists all lines that contain the letters ion anywhere on the
line. The second search only lists lines which end in ion.
[student@station student]$ egrep 'ion' /usr/share/dict/words | head -5
abbreviation
abbreviations
abduction
abductions
aberration
[student@station student]$ egrep 'ion$' /usr/share/dict/words | head -5
abbreviation
abduction
aberration
abjection
ablation
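The line anchors and the word anchors can be compared side by side on a few invented lines. This sketch assumes GNU grep, where \< and \> are supported as word anchors; grep -E selects the same extended syntax as egrep.

```shell
# Sample lines (made up for this illustration).
words=$(printf 'cat\ncatalog\nconcat\nthe cat sat\n')

# ^ and $ together: the line must consist of cat and nothing else.
whole_line=$(printf '%s\n' "$words" | grep -E '^cat$')

# \< and \> together: cat must appear as a word of its own.
whole_word=$(printf '%s\n' "$words" | grep -E '\<cat\>')

# ^ alone: the line merely has to start with cat.
line_start=$(printf '%s\n' "$words" | grep -E '^cat')

printf '%s\n' "$whole_line" "$whole_word" "$line_start"
```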
Coming to Terms with Regex Grouping

The same way that you can use parentheses to group terms within a mathematical expression, you also
use parentheses to collect regular expression pattern specifiers into groups. This lets the modifier
characters ?, * and + apply to groups of regex specifiers instead of only the immediately preceding
specifier.
Suppose we need a regular expression to match either foo or foobar. We could write the regex as
foo(bar)? and get the desired results. This lets the ? modifier apply to the whole string bar
instead of only the preceding r character.
Grouping regex specifiers using parentheses becomes even more flexible when the pipe symbol (|) is
used to separate alternative patterns. Using alternatives, we could rewrite our previous example as
(foo|foobar). Writing this as foo|foobar is simpler and works just as well, because just like
mathematics, regex specifiers have precedence. While you are learning, always enclose your groups in
parentheses.
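The effect of grouping can be checked on a few invented strings. In this sketch, grep -E selects the extended syntax and the -x switch asks grep to report only lines that the pattern matches in their entirety, which makes the differences easier to see; the sample data and variable names are assumptions made for the illustration.

```shell
# Sample lines (made up for this illustration).
input=$(printf 'foo\nfoobar\nfoobarbar\nfor\n')

# ? applies to the whole group bar, not just to the letter r.
optional_group=$(printf '%s\n' "$input" | grep -Ex 'foo(bar)?')

# + applies to the whole group as well: one or more copies of bar.
repeated_group=$(printf '%s\n' "$input" | grep -Ex 'foo(bar)+')

# The same two strings as the first search, written as alternatives.
alternatives=$(printf '%s\n' "$input" | grep -Ex '(foo|foobar)')

printf '%s\n' "$optional_group" "$repeated_group" "$alternatives"
```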
In the following example, the first search prints all lines from the file /usr/share/dict/words which
contain four consecutive vowels (compare the syntax to that used when first introducing range
expressions, above). The second search finds words that contain two consecutive letters which are each
an o or an e (such as oo or ee), followed (somewhere) by a double e.
[student@station student]$ egrep '[aeiou]{4}' /usr/share/dict/words | head -5
aqueous
dequeue
dequeued
dequeues
dequeuing
[student@station student]$ egrep '(o|e){2}.*ee' /usr/share/dict/words
bookkeeper
bookkeepers
bookkeeping
Chattahoochee
doorkeeper
freewheel
Greentree
Escaping Meta-Characters
Sometimes you need to match a character that would ordinarily be interpreted as a regular expression
wildcard or modifier character. To temporarily disable the special meaning of these characters, simply
escape them using the backslash (\) character. For example, the regex pattern cat. would match the
letters cat followed by any character: cats or catchup. To match only the letters cat. at the
end of a sentence, use the regex pattern cat\. to disable interpreting the period as a wildcard
character.
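The difference between cat. and cat\. shows up as soon as the input contains both kinds of text. The sketch below uses invented sample lines; it also shows grep -F, which treats the whole pattern as a fixed string, as an alternative when no regex features are wanted at all.

```shell
# Sample lines (made up for this illustration).
text=$(printf 'cats\ncatchup\nI saw the cat.\n')

# Unescaped: the period matches any character, so every line matches.
any_char=$(printf '%s\n' "$text" | grep -E 'cat.')

# Escaped: only a literal period after cat matches.
literal_dot=$(printf '%s\n' "$text" | grep -E 'cat\.')

# Fixed-string search: no character has a special meaning.
fixed=$(printf '%s\n' "$text" | grep -F 'cat.')

printf '%s\n' "$any_char" "$literal_dot" "$fixed"
```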
Note one distracting exception to this rule. When the backslash character precedes a < or >
character, it enables the special interpretation (anchoring the beginning or ending of a word) instead of
disabling the special interpretation. Shudder. It even gets worse - see the footnote at the bottom of the
following table.
Summary of Linux Regular Expression Syntax

The following table summarizes regular expression syntax, and identifies which components are found in
basic regular expression syntax, and which are found only in the extended regular expression syntax.
Table 3-2. Summary of Linux Regular Expression Syntax

Character       Role             Regex Syntax      Interpretation
.               wildcard         basic             match one of any character
[abc], [a-z]    inclusion range  basic             match one of any character included in range
[^abc], [^a-z]  exclusion range  basic             match one of any character not included in range
?               modifier         extended          match 0 or 1 of preceding term
*               modifier         basic             match 0 or more of preceding term
+               modifier         extended          match 1 or more of preceding term
{m,n}           modifier         extended          match between m and n (inclusively) occurrences of the preceding term
{n}             modifier         extended          match exactly n occurrences of the preceding term
^               anchor           basic             mark beginning of a line
$               anchor           basic             mark end of a line
\<              anchor           basic             mark beginning of a word
\>              anchor           basic             mark end of a word
(...)           grouping         extended          allow modifiers to act on a group of characters
(... | ...)     grouping         extended          allow alternate patterns to be specified
\               escape (a)       extended (basic)  escape (or enable) special interpretation of the following character
Notes:
a. When using extended regular expressions, the backslash (usually) strips special interpretation from
the following character. Red Hat Enterprise Linux uses GNU extensions when parsing basic regular
expressions, however, which use the backslash to enable extendedish interpretation of the following
character. For example, the expression e\{3\} would match eee when using basic regular
expressions. shudder-shudder.
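The reversed role of the backslash can be observed directly by running the same input through grep in both modes. This sketch assumes GNU grep (plain grep for basic syntax, grep -E for extended syntax, and -o to print only the matched text); the input line is invented so that it contains both three consecutive e characters and the literal text e{3}.

```shell
line='beeeep e{3}'

# Basic syntax, unescaped braces: the braces are literal characters.
bre_plain=$(printf '%s\n' "$line" | grep -o 'e{3}')

# Basic syntax, escaped braces: the backslash enables the modifier.
bre_escaped=$(printf '%s\n' "$line" | grep -o 'e\{3\}')

# Extended syntax, unescaped braces: the braces are a modifier.
ere_plain=$(printf '%s\n' "$line" | grep -oE 'e{3}')

# Extended syntax, escaped braces: the backslash makes them literal.
ere_escaped=$(printf '%s\n' "$line" | grep -oE 'e\{3\}')

printf '%s\n' "$bre_plain" "$bre_escaped" "$ere_plain" "$ere_escaped"
```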
Regular Expressions are NOT File Globbing

When first encountering regular expressions, students understandably confuse regular expressions with
pathname expansion (file globbing). Both are used to match patterns in text. Both share similar
metacharacters (*, ?, [...], etc.). However, they are distinctly different. The following table
compares and contrasts regular expressions and file globbing.
Table 3-3. Comparing and Contrasting Regular Expressions and File Globbing

Regular Expressions                          File Globbing
Implemented within search or search and      Implemented by the bash shell for the
replace utilities, such as grep, vi, sed,    purpose of matching filenames, and to a
and many scripting languages such as perl,   lesser extent is found in some
python, etc.                                 applications and scripting languages.

Uses the expression .* for stretchable       Uses the expression * for stretchable
glue.                                        glue.

Uses the expression . to match exactly one   Uses the expression ? to match exactly one
of any character.                            of any character.
In the following example, the first argument is a regular expression, specifying text which starts with an
l and ends .conf, while the second argument is a file glob which specifies all files in the /etc
directory whose filename starts with l and ends .conf.
[student@station student]$ egrep 'l.*\.conf' /etc/l*.conf
/etc/ldap.conf:# @(#)$Id: 087_warning.dbk,v 1.2 2004/01/07 16:39:53 bowe Exp $

/etc/libuser.conf:# Set this only if it differs from the default in /etc/krb5.conf.
/etc/ltrace.conf:; ltrace.conf
Take a close look at the second line of output. Why was it matched by the specified regular expression?
In a similar vein, when specifying regular expressions on the bash command line, care must be taken to
quote or escape the regex meta-characters, lest they be expanded away by the bash shell with unexpected
results. In all of the examples found in this discussion, the first argument to the egrep command is
protected with single quotes for just this reason.
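What "expanded away" means can be demonstrated in a scratch directory. The following is a sketch under stated assumptions: the directory is created with mktemp -d, and the file names abc and abbc are invented so that the unquoted pattern has something to expand to.

```shell
# Scratch directory holding two invented files.
workdir=$(mktemp -d)
touch "$workdir/abc" "$workdir/abbc"

# Unquoted: the shell replaces ab*c with matching filenames
# before the command ever sees it.
unquoted=$(cd "$workdir" && echo ab*c)

# Quoted: the pattern reaches the command untouched.
quoted=$(cd "$workdir" && echo 'ab*c')

# As a regex, ab*c also matches ac; as a glob, it never could.
regex_hits=$(printf 'ac\nabc\n' | grep -E 'ab*c')

rm -r "$workdir"
printf '%s\n' "$unquoted" "$quoted" "$regex_hits"
```

Had the unquoted form been given to egrep, the first filename would have become the pattern and the second a file to search, which is rarely what was meant.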
Where to Find More Information About Regular Expressions

We have barely scratched the surface of the usefulness of regular expressions. The explanation we have
provided will be adequate for your daily needs, but even so, regular expressions offer much more power,
making even complicated text searches simple to perform.
For more online information about regular expressions, you should check:
The regex(7) manual page.
The grep(1) manual page.
Examples
Example 1. Literal Searches
Now that we understand regular expressions in more detail, let us revisit some earlier examples and see
them in a new light.
Given the file rhyme that contains the text:
The cat sat on the mat at home.
The regular expression at matches:
the at in cat; and
the at in mat; and
the at in at.
The regular expression \<at\> matches only the individual word at.
Example 2. Range Expressions

A range expression matches exactly one instance of any one of the characters listed by the range
expression.
[student@station student]$ echo bar > file
[student@station student]$ echo car >> file
[student@station student]$ echo far >> file
[student@station student]$ echo are >> file
[student@station student]$ grep '[cf]ar' file
car
far
The range expression [cf]ar matches either a c or an f followed by ar.
Example 3. REGEX Modifiers

Modifiers control how many occurrences of the preceding regex specifier are matched:
[student@station student]$ echo ac > file
[student@station student]$ echo abc >> file
[student@station student]$ echo abbc >> file
[student@station student]$ echo abbbc >> file
The question mark (?) matches zero or one occurrence of the preceding specifier:
[student@station student]$ egrep 'ab?c' file
ac
abc
The plus sign (+) matches one or more of the preceding specifier:
[student@station student]$ egrep 'ab+c' file
abc
abbc
abbbc
The asterisk (*) matches any number, including zero, occurrences of the preceding specifier:
[student@station student]$ egrep 'ab*c' file
ac
abc
abbc
abbbc
Example 4. Anchored Searches

Anchored searches are used to match strings only at the beginning or ending of the input line.
[student@station student]$ echo "i am sam" > file
[student@station student]$ echo "sam i am" >> file
[student@station student]$ echo "am i sam" >> file
[student@station student]$ echo "sam" >> file
[student@station student]$ cat file
i am sam
sam i am
am i sam
sam
[student@station student]$ egrep '^sam' file
sam i am
sam
[student@station student]$ egrep 'sam$' file
i am sam
am i sam
sam
[student@station student]$ egrep '^sam$' file
sam
Where ^ and $ anchor to lines, the anchors \< and \> match the beginnings and ends of words:
[student@station student]$ egrep '\<am\>' file
i am sam
sam i am
am i sam
Example 5. REGEX Term Grouping

Use parentheses to group several regex specifiers into a single unit. Use the pipe symbol (|) to indicate
alternatives.
Suppose we are writing a letter. We could write a regular expression to match the greeting line like this:
^Dear (Dr|Mr|Ms)\.
This would match the lines:

Dear Dr. Smith
Dear Mr. Smith
Dear Ms. Smith
but not match a greeting such as:

Dear Miss Smith
A greeting might also fail to match because the writer left out the period after the abbreviation, as in
Dear Mr Smith. This regex pattern would match either way:
^Dear (Dr|Mr|Ms)\.?
whether or not the period was present.
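Both greeting patterns can be tested against a few sample lines. The greetings below are invented for the illustration, and grep -E is used as the equivalent of egrep.

```shell
# Sample greeting lines (made up for this illustration).
letters=$(printf 'Dear Dr. Smith\nDear Mr Smith\nDear Miss Smith\nDear Ms. Smith\n')

# The period is required after the title.
strict=$(printf '%s\n' "$letters" | grep -E '^Dear (Dr|Mr|Ms)\.')

# The period is optional after the title.
lenient=$(printf '%s\n' "$letters" | grep -E '^Dear (Dr|Mr|Ms)\.?')

printf '%s\n' "$strict" "$lenient"
```

Neither pattern matches Dear Miss Smith, because Miss is not one of the three alternatives in the group.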
Example 6. Is elvis in the House?

The user blondie would like to create a script which checks to see if someone is defined as a local user on
a Linux system. The script takes one argument, which is expected to be a username. She could use the id
command to confirm if a user named username existed, but the id command would include users who
might be defined by an NIS server, or some other type of network accessible database, instead of on the
local machine. She instead decides to examine the local user database (the /etc/passwd file) directly.
She creates the following script.
[blondie@station blondie]$ cat inhouse
#!/bin/bash
# Require exactly one argument: the username to look up.
if [ $# -ne 1 ]; then
    echo "usage: inhouse USERNAME"
    exit 1
fi
# Usernames are the first colon-delimited field of each line in /etc/passwd.
if grep -q "^$1:" /etc/passwd; then
    echo "$1 is in the house."
else
    echo "$1 is not in the house."
fi
The first if stanza ensures that the script was passed exactly one argument.
The grep line contains the interesting regular expression. The grep command looks for a line which
begins with the argument, trailed by a :. Recalling the structure of the /etc/passwd file,
usernames satisfy these conditions.
Saving the file, and making it executable, blondie tries out the script on the (existing) user elvis and
(non-existing) user barney.
[blondie@station blondie]$ mv inhouse bin/
[blondie@station blondie]$ chmod a+x bin/inhouse
[blondie@station blondie]$ inhouse elvis
elvis is in the house.

[blondie@station blondie]$ inhouse barney
barney is not in the house.
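One subtlety of the script above is that the argument is interpreted as part of a regular expression, so a name containing a period would be treated as containing a wildcard. The following sketch shows one possible alternative, not the workbook's method: a hypothetical helper named is_local_user that compares the first field exactly using awk. The helper name, its optional second argument, and the sample passwd-style lines are all assumptions made for the illustration.

```shell
# Hypothetical helper: succeeds if NAME is exactly the first
# colon-delimited field of some line in FILE (default /etc/passwd).
is_local_user() {
    name=$1
    file=${2:-/etc/passwd}
    awk -F: -v name="$name" '$1 == name { found = 1 } END { exit !found }' "$file"
}

# A small stand-in for /etc/passwd (invented entries).
sample=$(mktemp)
printf 'elvis:x:501:501::/home/elvis:/bin/bash\n' > "$sample"
printf 'blondie:x:502:502::/home/blondie:/bin/bash\n' >> "$sample"

# Exact comparison: elvis is found, the pattern-like name el.is is not.
if is_local_user elvis "$sample"; then elvis_result=yes; else elvis_result=no; fi
if is_local_user 'el.is' "$sample"; then dotted_result=yes; else dotted_result=no; fi

# Regex comparison, as in the script: el.is matches the line for elvis.
if grep -q '^el.is:' "$sample"; then regex_result=yes; else regex_result=no; fi

rm -f "$sample"
printf '%s %s %s\n' "$elvis_result" "$dotted_result" "$regex_result"
```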
Example 7. Searching for Telephone Numbers

The combination of regular expressions and the grep command creates a powerful tool for extracting
desired nuggets from large amounts of information. In the following, elvis recalls noticing a phone
number somewhere within the /usr/share/doc directory, but he has no recollection where. He begins
a process of searching for all phone numbers within the /usr/share/doc directory (which in this case
contains nearly 12000 files).
He notes that all United States phone numbers have at least 7 digits, conventionally
written with the first three digits separated from the last four with either a - or a space, such as
555-1212 or 555 1212. He begins by recursively searching through all files in the /usr/share/doc
directory for such a pattern.
[elvis@station doc]$ egrep -r '[[:digit:]]{3}(-| )[[:digit:]]{4}' .
./hwdata-0.75/COPYING: 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
./hwdata-0.75/COPYING: Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
./libart_lgpl-2.3.11/COPYING: 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
...
After observing the first few lines, elvis realizes that his regex pattern is too general. He is matching zip
codes as well as phone numbers. He refines his pattern, specifying that any preceding character or
trailing character must not be a number.
[elvis@station doc]$ egrep -r '[^[:digit:]][[:digit:]]{3}(-| )[[:digit:]]{4}[^[:digit:]]' .
./libjpeg-6b/README:642-4900, or from Global Engineering Documents at (800) 854-7179. (ANSI
./libjpeg-6b/README:phone (408) 944-6300, fax (408) 944-6314
./bash-2.05b/article.ms:\f(CRgnu@prep.ai.mit.edu\fP or call \f(CR+1-617-876-3296\fP
./esound-0.2.28/esound.ps:7 w(5)p Black 0 TeXcolorgray 795 1077 a Fm(esd)p Black
./esound-0.2.28/esound.ps:7 w(5)p Black 0 TeXcolorgray 795 1168 a(esdctl)p Black
./esound-0.2.28/esound.ps:0 TeXcolorgray 596 1554 a Fk(4.)20 b(Miscellaneous)d(Information)p
./esound-0.2.28/esound.ps:7 w(9)p Black 0 TeXcolorgray 795 1665 a Fm(New)k(Featur)o(es)p
...
This time, elvis's search proceeds much better, until he hits the file esound.ps. This file contains
PostScript, which routinely uses numbers written in ASCII text to specify coordinates. Knowing that the
number he remembers was not in a PostScript file, elvis devises a way to exclude all files that end with
the .ps extension from his search. He first uses the find command to list every file in the directory. He
next uses grep to filter the output down to all files that do not end .ps. He then uses the xargs command
to feed these filenames into his original grep command as arguments. Because his files are now being
specified individually as command line arguments, he no longer needs to grep recursively.
[elvis@station doc]$ find . | egrep -v '\.ps$' |
xargs egrep '[^[:digit:]][[:digit:]]{3}(-| )[[:digit:]]{4}[^[:digit:]]'
./libjpeg-6b/README:642-4900, or from Global Engineering Documents at (800) 854-7179. (ANSI
./libjpeg-6b/README:phone (408) 944-6300, fax (408) 944-6314
./bash-2.05b/article.ms:\f(CRgnu@prep.ai.mit.edu\fP or call \f(CR+1-617-876-3296\fP
./gawk-3.1.1/README_d/README.solaris:# P.O. Box 354 Home Phone: +972 8 979-0381 Fax: +1 603 761-6761
./gawk-3.1.1/README_d/README.solaris:Columbus, Ohio 43210-1174 1-614-292-5310 (Office/Answering Device)
...
After observing a few more lines of output, elvis realizes he should also exclude files that end .fig and
.pdf from his search, as they also contain many ASCII numbers and are cluttering his output.
Modifying the regular expression in his first grep command, he repeats his search.
[elvis@station doc]$ find . | egrep -v '\.(ps|fig|pdf)$' |
xargs egrep '[^[:digit:]][[:digit:]]{3}(-| )[[:digit:]]{4}[^[:digit:]]'
...
Now that the search seems to be going well, elvis revises the output formatting, asking grep to not
display filenames, and give 2 lines of context around each phone number.
[elvis@station doc]$ find . | egrep -v '\.(ps|pdf|fig)$' |
xargs egrep -h -C2 '[^[:digit:]][[:digit:]]{3}(-| )[[:digit:]]{4}[^[:digit:]]'
it's much cheaper and includes a great deal of useful explanatory material.)
In the USA, copies of the standard may be ordered from ANSI Sales at (212)
642-4900, or from Global Engineering Documents at (800) 854-7179. (ANSI
doesn't take credit card orders, but Global does.) It's not cheap: as of
1992, ANSI was charging $95 for Part 1 and $47 for Part 2, plus 7%
--
1778 McCarthy Blvd.
Milpitas, CA 95035
phone (408) 944-6300, fax (408) 944-6314
A PostScript version of this document is available by FTP at
ftp://ftp.uu.net/graphics/jpeg/jfif.ps.gz. There is also a plain text
--
The Free Software Foundation sells tapes and CD-ROMs
containing Bash; send electronic mail to
\f(CRgnu@prep.ai.mit.edu\fP or call \f(CR+1-617-876-3296\fP
for more information.
.PP
--
#
# Aharon (Arnold) Jones    arnold@sleeve.com [ <<=== NOTE: NEW ADDRESS!! ]
# P.O. Box 354             Home Phone: +972 8 989-0381 Fax: +1 603 761-6761
# Nof Ayalon               Cell Phone: +972 51 227-545 (See www.efax.com)
# D.N. Shimshon 97784      Laundry increases exponentially in the
--
The Ohio State University      http://www.math.ohio-state.edu/~nevai/
231 West Eighteenth Avenue     http://www.math.ohio-state.edu/~jat/
Columbus, Ohio 43210-1174      1-614-292-5310 (Office/Answering Device)
The United States of America   1-614-292-1479 (Math Dept Fax)
--
,-*~^~*-,._.,-*~^~*-,._.,-*~^~*-,._.,-*~^~*-,._.,-*~^~*-,
Joe Farwell           | phone 610-843-6020       | Platinum technology
Systems Administrator | vmail 800-123-9096 x7512 | 620 W. Germantown Pike
joe@platinum.com      | fax   610-872-6021       | Plymouth Meeting,Pa,19462
~*-,._.,-*~^~*-,._.,-*~^~*-,._.,-*~^~*-,._.,-*~^~*-,._.,-*~
delay needs to be calibrated using outside sources.
...
Note that names and numbers have been altered in this output.
All told, elvis ends up with 289 "hits", which he can skim in a reasonable amount of time.
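The same exclusion can also be expressed within the find command itself, which avoids one grep stage and copes with filenames that contain spaces. This is a sketch rather than the approach used above: it assumes a find that supports -print0 and an xargs that supports -0 (both are available in the GNU tools), and it runs against a small scratch directory of invented files instead of /usr/share/doc.

```shell
# Scratch directory with three invented files.
docs=$(mktemp -d)
printf 'call 555-1212 today\n' > "$docs/README"
printf '795 1077 a 555 1212 b\n' > "$docs/figure.ps"
printf 'fax 555 1313 (office)\n' > "$docs/my notes.txt"

# The phone number pattern from the discussion above.
pattern='[^[:digit:]][[:digit:]]{3}(-| )[[:digit:]]{4}[^[:digit:]]'

# List regular files, skipping the unwanted extensions, and pass the
# names to grep separated by NUL characters so that spaces are safe.
hits=$(find "$docs" -type f ! -name '*.ps' ! -name '*.fig' ! -name '*.pdf' -print0 |
    xargs -0 grep -E -h "$pattern" | sort)

rm -r "$docs"
printf '%s\n' "$hits"
```

The line in figure.ps would match the pattern, but the file is never handed to grep.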
Online Exercises
Lab Exercise
Objective: Use regular expressions to search for patterns of text.
Specification
1. Create a short executable bash script named ~/bin/ispython, which expects a single argument,
which is a filename. If the supplied file's first line is exactly #!/usr/bin/python (nothing
more, nothing less), the script should print the number 1. Otherwise, the script should print the
number 0.
2. You are looking for files in the /etc directory (but not subdirectories) that contain a standard United
States long distance phone number, written using the pattern of 1-###-###-####, where each # is
replaced with a numeric digit. Collect the filenames of every file in the /etc directory which
contains such a pattern of numbers, and place them in the file ~/etcphone.txt, one file name per
line, sorted alphabetically, using absolute references.
3. The file /usr/share/doc/bash-*/NEWS contains many itemized lists, with list items marked by
lines whose first characters are a series of one or more letters, followed by a period and space, as in
the following:
y.  New prompting expansions: \a, \e, \H, \T, \@, \v, \V.
z.  Variable expansion in prompt strings is now controllable via a shell
    option (shopt prompt vars).
aa. Bash now defaults to using command-oriented history.
bb. The history file ($HISTFILE) is now truncated to $HISTFILESIZE after
    being written.
Create the following files, each of which contains the number which answers the specified question
as its single word.

filename          question
newsitems.txt     How many lines begin with a series of one or more letters, followed by a period?
newsitems23.txt   How many lines begin with a series of two or three letters, followed by a period?
newsitems2.txt    How many lines begin with a series of exactly two letters, followed by a period?
newsitems3.txt    How many lines begin with a series of exactly three letters, followed by a period?
4. The file /usr/share/dict/words contains a collection of common dictionary words, stored one
per line. Both common words and proper names are included, each appropriately capitalized.
Using only the egrep command, determine which words start with a capital letter followed only by
vowels. Do not include single letter words. (For the purposes of this exercise, consider vowels as
only the letters A, E, I, O, or U, both uppercase and lowercase.)
List these words, one per line and sorted alphabetically, in the file ~/vowel2.txt.
Deliverables
1. A script called ~/bin/ispython, which, when executed with a single filename as an argument, will print 1 if
the specified file's first line is exactly #!/usr/bin/python. Otherwise, the script should print 0 (hint: This can
be accomplished by combining the head and grep commands).
2. The file ~/etcphone.txt, which contains a list of all files in the /etc directory (but not subdirectories) which
contain the pattern 1-###-###-####, where each # is replaced by a numeric digit. The files should be listed as
absolute references, one per line, alphabetized.
3. The files ~/newsitems.txt, ~/newsitems23.txt, ~/newsitems2.txt, and ~/newsitems3.txt, each
of which contains a single number as its only word. The number should be the answer to the respective
question about the file /usr/share/doc/bash-*/NEWS in the table above.
4. The file ~/vowel2.txt, which contains an alphabetically sorted list of all words from
/usr/share/dict/words which start with a capital letter followed only by vowels. (Exclude single letter
words).
Questions
In all of the following questions, the term regular expression implies extended regular expression syntax.
1. Which of the following characters is a regular expression literal character?
( ) a. ?
( ) b. _
( ) c. *
( ) d. $
2. Which of the following regular expressions would match the word bookkeeper?
( ) a. o+ke+
( ) b. o{2}[ke]{4}
( ) c. ^o+k+e+
( ) d. o+.*e+$
3. Which of the following regular expressions would match United States 5 digit or 9 digit zip codes, which have the
pattern of ##### or #####-#### respectively, with each # replaced with a numeric digit?
( ) a. [[:digit:]]-{5|4}
( ) b. [[:digit:]]{5}(-[[:digit:]]{4})?
( ) c. [[:digit:]]-{5}[[:digit:]]{4}?
( ) d. [[:digit:]]{9}[-[[:digit:]]]{4}?
4. Which of the following regular expressions would only match a line that contains entirely capital letters, spaces,
and tabs, regardless of the current character set?
( ) a. ^[A-Z]*$
( ) b. ^[A-Z[:blank:]]*$
( ) c. ^[[:upper:][:blank:]]*$
( ) e. B and C
Use the following transcript to answer the next 4 questions.
[student@station student]$ cat /etc/crontab
SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root
HOME=/
# run-parts
01 * * * * root run-parts /etc/cron.hourly
02 4 * * * root run-parts /etc/cron.daily
22 4 * * 0 root run-parts /etc/cron.weekly
42 4 1 * * root run-parts /etc/cron.monthly

5. Which of the following commands would print the first 4 lines only from the file /etc/crontab?
( ) a. egrep $[[:upper:]] /etc/crontab
( ) b. egrep ^[[:upper:]] /etc/crontab
( ) c. egrep [[:upper:]]$ /etc/crontab
( ) d. egrep [^[:upper:]] /etc/crontab
6. Which of the following commands would print the last 4 lines only from the file /etc/crontab?
( ) a. egrep .* /etc/crontab
( ) b. egrep \* /etc/crontab
( ) c. egrep * /etc/crontab
( ) d. egrep *{1} /etc/crontab
7. Which of the following commands would print the last 2 lines only from the file /etc/crontab?
( ) a. egrep cron.[weekly|monthly] /etc/crontab
( ) b. egrep cron.(weekly|monthly) /etc/crontab
( ) c. egrep cron.{weekly|monthly} /etc/crontab
( ) d. egrep cron.(weekly|monthly)? /etc/crontab
8. Which of the following would print only the line that contains the filename /etc/cron.hourly from the file
/etc/crontab?
( ) a. egrep (\* ){4} /etc/crontab

( ) b. egrep ^0[13579] /etc/crontab
( ) c. egrep ^[01[:punct:][:blank:]]{10} /etc/crontab
( ) e. A and B only
9. Which of the following regular expressions would match 3.14159?
( ) a. 3.14159
( ) b. 3\.14[:digit:]+
( ) c. [[:digit:]\.]{7}
( ) e. A and C only
The following is extracted from the procmailrc(5) man page. Ignore the line break between the words X-Envelope
and Apparently.
(^((Original-)?(Resent-)?(To|Cc|Bcc)|(X-Envelope
|Apparently(-Resent)?)-To):(.*[^-a-zA-Z0-9_.])?)
10. Which of the following lines would match the regular expression?
( ) a. Resent-Cc: elvis@example.com
( ) b. Original-Resent-Bcc: elvis@example.com
( ) c. To: elvis@example.com
( ) e. A and C only
(It could have been worse... the following regular expression is also found in the procmailrc(5) man
page.)
(^(Mailing-List:|Precedence:.*(junk|bulk|list)|To: Multiple
recipients of |(((Resent-)?(From|Sender)|X-Envelope-From):|>?From
)([^>]*[^(.%@a-z0-9])?(Post(ma?(st(e?r)?|n)|office)|(send)?Mail(er)?
daemon|m(mdf|ajordomo)|n?uucp|LIST(SERV|proc)|NETSERV|o(wner|ps)
|r(e(quest|sponse)|oot)|b(ounce|bs\.smtp)|echo|mirror|s(erv(ices?|er)
|mtp(error)?|ystem)|A(dmin(istrator)?|MMGR|utoanswer))(([^).!:az0-9][-_a-z0-9]*)?[%@>\t ][^<)]*($.*$.*)?)?$([^>]|$)))
Chapter 4. Everything Sorting: sort and uniq

Key Concepts
The sort command sorts data alphabetically.
sort -n sorts numerically.
sort -u sorts and removes duplicates.
sort -k and -t sort on a specific field in patterned data.
Discussion
In previous Workbooks, we have introduced the sort command in its simplest form: a tool for arranging
the lines of a file or output from a command alphabetically. This Lesson will present the sort command
in more detail.
The sort Command

Basic Sorting
Sorting is the process of arranging records into a specified sequence. Examples of sorting would be
arranging a list of usernames into alphabetical order, or a set of file sizes into numeric order.
In its simplest form, the sort command will alphabetically sort lines (including any whitespace or control
characters which are encountered). The sort command uses the current locale (language definition) to
determine the order of the characters (referred to as the collating order). In the following example,
madonna first displays the contents of the file /etc/sysconfig/mouse as is, and then sorts the
contents of the file alphabetically.
[madonna@station madonna]$ cat /etc/sysconfig/mouse
FULLNAME="Generic - 2 Button Mouse (PS/2)"

MOUSETYPE="ps/2"
XEMU3="yes"
XMOUSETYPE="PS/2"
DEVICE=/dev/psaux
[madonna@station madonna]$ sort /etc/sysconfig/mouse
DEVICE=/dev/psaux
FULLNAME="Generic - 2 Button Mouse (PS/2)"
MOUSETYPE="ps/2"
XEMU3="yes"
XMOUSETYPE="PS/2"
If called with arguments, the arguments are interpreted as (possibly multiple) filenames to be sorted. If
called without arguments, the sort command will sort whatever it reads from standard input.
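Because the collating order comes from the locale, the same input can sort differently on different systems. The sketch below reads from standard input and pins the locale to C (plain byte order, where all uppercase letters precede all lowercase letters) by setting LC_ALL for the sort command only; the names are sample data, and the -f switch used in the second command folds lowercase to uppercase before comparing.

```shell
# Sample names (made up for this illustration).
names=$(printf 'elvis\nBlondie\nmadonna\nalice\n')

# C locale: byte order, so the capitalized name sorts first.
c_order=$(printf '%s\n' "$names" | LC_ALL=C sort)

# Same locale, but ignoring case.
folded=$(printf '%s\n' "$names" | LC_ALL=C sort -f)

printf '%s\n' "$c_order" "$folded"
```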
Modifying the Sort Order

By default, the sort command sorts lines alphabetically. The following table lists command line switches
which can be used to modify this default sort order.
Table 4-1. Command Line Switches for Specifying Sort Order

Switch                        Effect
-b, --ignore-leading-blanks   Ignore spaces and tabs at the beginning of a line.
-d, --dictionary-order        Consider only blanks and alphanumeric characters.
-f, --ignore-case             Treat all characters as uppercase.
-g, --general-numeric-sort    Compare words as floating point numbers.
-n, --numeric-sort            Compare words as integers.
-r, --reverse                 Sort in descending rather than ascending order.
As an example, madonna is examining the file sizes of all files that start with an m in the /var/log
directory.
[madonna@station madonna]$ ls -s1 /var/log/m*
  20 /var/log/maillog
3104 /var/log/maillog.1
1552 /var/log/maillog.2
1952 /var/log/maillog.3
1236 /var/log/maillog.4
   4 /var/log/messages
 384 /var/log/messages.1
 636 /var/log/messages.2
 216 /var/log/messages.3
 560 /var/log/messages.4
She next sorts the output with the sort command.

[madonna@station madonna]$ ls -s /var/log/m* | sort
1236 /var/log/maillog.4
1552 /var/log/maillog.2
1952 /var/log/maillog.3
  20 /var/log/maillog
 216 /var/log/messages.3
3104 /var/log/maillog.1
 384 /var/log/messages.1
   4 /var/log/messages
 560 /var/log/messages.4
 636 /var/log/messages.2
Without being told otherwise, the sort command sorted the lines alphabetically (with 1952 coming
before 20). Realizing this is not what she intended, madonna adds the -n command line switch.
[madonna@station madonna]$ ls -s /var/log/m* | sort -n
   4 /var/log/messages
  20 /var/log/maillog
 216 /var/log/messages.3
 384 /var/log/messages.1
 560 /var/log/messages.4
 636 /var/log/messages.2
1236 /var/log/maillog.4
1552 /var/log/maillog.2
1952 /var/log/maillog.3
3104 /var/log/maillog.1
Better, but madonna would prefer to reverse the sort order, so that the largest files come first. She adds
the -r command line switch.
[madonna@station madonna]$ ls -s /var/log/m* | sort -nr
3104 /var/log/maillog.1
1952 /var/log/maillog.3
1552 /var/log/maillog.2
1236 /var/log/maillog.4
 636 /var/log/messages.2
 560 /var/log/messages.4
 384 /var/log/messages.1
 216 /var/log/messages.3
  20 /var/log/maillog
   4 /var/log/messages
Why ls -1?: Why was the -1 command line switch given to the ls command in the first example, but
not the others? By default, when the ls command is using a terminal for standard out, it will group the
filenames in multiple columns for easy readability. When the ls command is using a pipe or file for
standard out, however, it will print the files one file per line. The -1 command line switch forces this
behavior for terminal output as well.
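The three orderings used above can be reproduced without any log files by sorting a short list of numbers. In this sketch the numbers are sample data, and the locale is pinned to C for the alphabetical case so that the result does not depend on the system's language settings.

```shell
# Sample sizes (made up for this illustration).
sizes=$(printf '20\n3104\n4\n216\n')

# Alphabetical: compared character by character, so 3104 precedes 4.
alpha=$(printf '%s\n' "$sizes" | LC_ALL=C sort)

# Numeric, ascending.
numeric=$(printf '%s\n' "$sizes" | sort -n)

# Numeric, descending: largest first.
largest_first=$(printf '%s\n' "$sizes" | sort -nr)

printf '%s\n' "$alpha" "$numeric" "$largest_first"
```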
Specifying Sort Keys

In the previous examples, the sort command performed its sort based on the first characters found on a
line. Often, formatted data is not arranged so conveniently. Fortunately, the sort command allows users
to specify which column of tabular data to use for determining the sort order, or, more formally, which
column should be used as the sort key.
The following table of command line switches can be used to determine the sort key.
Table 4-2. Command Line Switches for Specifying Sort Keys

Switch                      Effect
-k, --key=POS               Use the key at POS to determine sort order.
-t, --field-separator=SEP   Use the character SEP to separate fields (instead of simply whitespace).
Sorting Output by a Particular Column

As an example, suppose madonna wanted to reexamine her log files, using the long format of the ls
command. She tries simply sorting her output numerically.
[madonna@station madonna]$ ls -l /var/log/m* | sort -n
-rw------- 1 root root 1260041 Sep 14 04:05 /var/log/maillog.4
-rw------- 1 root root 1581750 Sep 28 06:15 /var/log/maillog.2
-rw------- 1 root root 1993522 Sep 22 10:16 /var/log/maillog.3
-rw------- 1 root root  216885 Sep 22 10:22 /var/log/messages.3
-rw------- 1 root root   31187 Oct  5 06:05 /var/log/maillog
-rw------- 1 root root 3172217 Oct  5 04:05 /var/log/maillog.1
-rw------- 1 root root  387345 Oct  5 04:07 /var/log/messages.1
-rw------- 1 root root  567049 Sep 14 04:08 /var/log/messages.4
-rw------- 1 root root  644859 Sep 28 06:22 /var/log/messages.2
-rw------- 1 root root     651 Oct  5 05:40 /var/log/messages
Now that the sizes are no longer reported at the beginning of the line, she has difficulty. Instead, she
repeats her sort using the -k command line switch to sort her output by the 5th column, producing the
desired output.
[madonna@station madonna]$ ls -l /var/log/m* | sort -n -k5
-rw-------  1 root root     651 Oct  5 05:40 /var/log/messages
-rw-------  1 root root   31187 Oct  5 06:05 /var/log/maillog
-rw-------  1 root root  216885 Sep 22 10:22 /var/log/messages.3
-rw-------  1 root root  387345 Oct  5 04:07 /var/log/messages.1
-rw-------  1 root root  567049 Sep 14 04:08 /var/log/messages.4
-rw-------  1 root root  644859 Sep 28 06:22 /var/log/messages.2
-rw-------  1 root root 1260041 Sep 14 04:05 /var/log/maillog.4
-rw-------  1 root root 1581750 Sep 28 06:15 /var/log/maillog.2
-rw-------  1 root root 1993522 Sep 22 10:16 /var/log/maillog.3
-rw-------  1 root root 3172217 Oct  5 04:05 /var/log/maillog.1
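The effect of choosing the fifth column as a numeric key can be tried on a small scale. The following is only a sketch: the three lines imitate ls -l output, the file names and sizes are made up, and LC_ALL=C is set so that the result does not depend on the local collation rules. The key is written as -k5,5, which limits it to the fifth field alone.

```shell
# Three made-up lines in the shape of "ls -l" output (hypothetical files).
listing='-rw------- 1 root root 1260041 Sep 14 04:05 big.log
-rw------- 1 root root 651 Oct 5 05:40 tiny.log
-rw------- 1 root root 31187 Oct 5 06:05 small.log'

# -k5,5 restricts the key to the fifth field; -n compares it as a number.
by_size=$(printf '%s\n' "$listing" | LC_ALL=C sort -n -k5,5)
printf '%s\n' "$by_size"
```

The smallest file should be listed first and the largest last, mirroring the order madonna obtained.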
Specifying Multiple Sort Keys

Next, madonna is examining the file /etc/fdprm, which tables low level formatting parameters for
floppy drives. She uses the grep command to extract the data from the file, stripping away comments and
blank lines.
[madonna@station madonna]$ grep "^[[:alnum:]]" /etc/fdprm
360/360     720   9  2  40  0  0x2A  0x02  0xDF  0x50
1200/1200  2400  15  2  80  0  0x1B  0x00  0xDF  0x54
360/720     720   9  2  40  1  0x2A  0x02  0xDF  0x50
720/720    1440   9  2  80  0  0x2A  0x02  0xDF  0x50
720/1440   1440   9  2  80  0  0x2A  0x02  0xDF  0x50
360/1200    720   9  2  40  1  0x23  0x01  0xDF  0x50
720/1200   1440   9  2  80  0  0x23  0x01  0xDF  0x50
1440/1440  2880  18  2  80  0  0x1B  0x00  0xCF  0x6C
1440/1200  2880  18  2  80  0  ????  ????  ????  ???? # ?????
1680/1440  3360  21  2  80  0  0x0C  0x00  0xCF  0x6C # ?????
cbm1581    1600  10  2  80  2  0x2A  0x02  0xDF  0x2E
800/720    1600  10  2  80  0  0x2A  0x02  0xDF  0x2E
She next sorts the data numerically, using the 5th column as her key.
[madonna@station madonna]$ grep "^[[:alnum:]]" /etc/fdprm | sort -n -k5
360/1200    720   9  2  40  1  0x23  0x01  0xDF  0x50
360/360     720   9  2  40  0  0x2A  0x02  0xDF  0x50
360/720     720   9  2  40  1  0x2A  0x02  0xDF  0x50
1200/1200  2400  15  2  80  0  0x1B  0x00  0xDF  0x54
1440/1200  2880  18  2  80  0  ????  ????  ????  ???? # ?????
1440/1440  2880  18  2  80  0  0x1B  0x00  0xCF  0x6C
1680/1440  3360  21  2  80  0  0x0C  0x00  0xCF  0x6C # ?????
720/1200   1440   9  2  80  0  0x23  0x01  0xDF  0x50
720/1440   1440   9  2  80  0  0x2A  0x02  0xDF  0x50
720/720    1440   9  2  80  0  0x2A  0x02  0xDF  0x50
800/720    1600  10  2  80  0  0x2A  0x02  0xDF  0x2E
cbm1581    1600  10  2  80  2  0x2A  0x02  0xDF  0x2E
Her data is successfully sorted using the 5th column, with the formats specifying 40 tracks grouped at the
top, and 80 tracks grouped at the bottom. Within these groups, however, she would like to sort the data
by the 3rd column. She adds an additional -k command line switch to the sort command, specifying the
third column as her secondary key.
[madonna@station madonna]$ grep "^[[:alnum:]]" /etc/fdprm | sort -n -k5 -k3
360/1200    720   9  2  40  1  0x23  0x01  0xDF  0x50
360/360     720   9  2  40  0  0x2A  0x02  0xDF  0x50
360/720     720   9  2  40  1  0x2A  0x02  0xDF  0x50
720/1200   1440   9  2  80  0  0x23  0x01  0xDF  0x50
720/1440   1440   9  2  80  0  0x2A  0x02  0xDF  0x50
720/720    1440   9  2  80  0  0x2A  0x02  0xDF  0x50
800/720    1600  10  2  80  0  0x2A  0x02  0xDF  0x2E
cbm1581    1600  10  2  80  2  0x2A  0x02  0xDF  0x2E
1200/1200  2400  15  2  80  0  0x1B  0x00  0xDF  0x54
1440/1200  2880  18  2  80  0  ????  ????  ????  ???? # ?????
1440/1440  2880  18  2  80  0  0x1B  0x00  0xCF  0x6C
1680/1440  3360  21  2  80  0  0x0C  0x00  0xCF  0x6C # ?????
Now the data has been sorted primarily by the fifth column. For rows with identical fifth columns, the
third column has been used to determine the final order. An arbitrary number of keys can be specified by
adding more -k command line switches.
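Primary and secondary keys can be tried on a small table. This is a sketch under stated assumptions: the four rows and their values are made up, and LC_ALL=C is set for predictable results. A key given as -k3 runs from the third field to the end of the line, so the sketch writes -k3,3 and -k2,2 to confine each key to a single field.

```shell
# Made-up table: name, size, tracks (hypothetical values).
table='d 2880 80
a 720 40
c 1440 80
b 360 40'

# Primary key: field 3. Ties are broken by field 2. Both compare numerically.
sorted=$(printf '%s\n' "$table" | LC_ALL=C sort -n -k3,3 -k2,2)
printf '%s\n' "$sorted"
```

The rows with 40 in the third field should come first, each group ordered by the second field.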
Specifying the Field Separator
The above examples have demonstrated how to sort data using a specified field as the sort key. In all of
the examples, fields were separated by whitespace (i.e., a series of spaces and/or tabs). Often in Linux
(and Unix), some other method is used to separate fields. Consider, for example, the /etc/passwd file.
[madonna@station madonna]$ head /etc/passwd
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
news:x:9:13:news:/etc/news:
The lines are structured into seven fields each, but the fields are separated using a : instead of
whitespace. With the -t command line switch, the sort command can be instructed to use some specified
character (such as a :) to separate fields.
In the following, madonna uses the sort command with the -t command line switch to sort the first 10
lines of the /etc/passwd file by home directory (the 6th field).
[madonna@station madonna]$ head /etc/passwd | sort -t: -k6
bin:x:1:1:bin:/bin:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync
halt:x:7:0:halt:/sbin:/sbin/halt
daemon:x:2:2:daemon:/sbin:/sbin/nologin
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
The user bin, with a home directory of /bin, is now at the top, and the user mail, with a home directory
of /var/spool/mail, is at the bottom.
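The same idea can be tried without touching the real /etc/passwd file. In this sketch the three records are made up in the seven-field passwd layout, and LC_ALL=C is assumed so that the comparison is a plain byte-wise one.

```shell
# Made-up records in the seven-field /etc/passwd layout (hypothetical entries).
records='mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
bin:x:1:1:bin:/bin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin'

# -t: makes the colon the field separator; -k6,6 sorts on the home directory only.
by_home=$(printf '%s\n' "$records" | LC_ALL=C sort -t: -k6,6)
printf '%s\n' "$by_home"
```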
Summary
In summary, we have seen that the sort command can be used to sort structured data, using the -k
command line switch to specify the sort field (perhaps more than once), and the -t command line switch
to specify the field delimiter.
The -k command line switch can receive more sophisticated arguments, which serve to specify character
positions within a field, or customize sort options for individual fields. See the sort(1) man page for
details.
The uniq Command

The uniq program is used to identify, count, or remove duplicate records in sorted data. If given a
filename as a command line argument, the uniq command operates on that file. If no arguments
are provided, the uniq command operates on standard in. Because the uniq command only compares
adjacent lines, it detects duplicates reliably only in already sorted data, so it is almost always used in
conjunction with the sort command.
The uniq command uses the following command line switches to qualify its behavior.
Table 4-3. Command Line Arguments For uniq

-c, --count           Prefix line with the number of its occurrences; this is the length of the run.
-d, --repeated        Print only duplicated lines.
-f, --skip-fields=n   Avoid comparing the first n fields; fields are delimited by whitespace.
-i, --ignore-case     Ignore case.
-s, --skip-chars=n    Skip the first n characters.
-u, --unique          Print only unique lines.
-w, --check-chars=n   Compare no more than n characters in each line.
In order to understand the uniq command's behavior, we need repetitive data on which to operate. The
following Python script simulates rolling three six-sided dice 100 times, writing the sum of each roll on its
own line. The user madonna makes the script executable, and then records the output in a file called trial1.
[madonna@station madonna]$ cat three_dice.py
#!/usr/bin/python
from random import randint
for i in range(100): print(randint(1,6)+randint(1,6)+randint(1,6))
[madonna@station madonna]$ chmod 755 three_dice.py
[madonna@station madonna]$ ./three_dice.py > trial1
[madonna@station madonna]$ wc trial1
    100     100     260 trial1
[madonna@station madonna]$ head trial1
10
10
10
13
8
8
10
10
8
6
Reducing Data to Unique Entries

Now, madonna would like to analyze the data. She begins by sorting the data and piping the output
through the uniq command.
[madonna@station madonna]$ sort -n trial1 | uniq
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
Without any command line switches, the uniq command has removed duplicate entries, reducing the
data from 100 lines to only 15. madonna easily sees that the data looks reasonable: every possible sum
for three six-sided dice is represented, with the exception of 3. Because only one combination
of the dice would yield a sum of 3 (all ones), she expects it to be a relatively rare occurrence.
Counting Instances of Data

A particularly convenient command line switch for the uniq command is -c, or --count. This causes the
uniq command to count the number of occurrences of a particular record, prepending the result to the
record on output.
In the following example, madonna uses the uniq command to reproduce its previous output, this time
prepending the number of occurrences of each entry in the file.
[madonna@station madonna]$ sort -n trial1 | uniq -c
      1 4
      4 5
      6 6
     10 7
     10 8
     13 9
     13 10
      9 11
     13 12
      4 13
      8 14
      4 15
      1 16
      2 17
      2 18
As would be expected (by a statistician, at least), the largest and smallest numbers have relatively few
occurrences, while the intermediate numbers occur more often. The first column can be summed
to 100 to confirm that the uniq command identified every occurrence.
Identifying Unique or Repeated Data with uniq

Sometimes, people are just interested in identifying unique or repeated data. The -d and -u command line
switches allow the uniq command to do just that. In the first case, madonna identifies the dice
combinations that occur only once. In the second case, she identifies combinations that are repeated at
least once.
[madonna@station madonna]$ sort -n trial1 | uniq -u
4
16
[madonna@station madonna]$ sort -n trial1 | uniq -d
5
6
7
8
9
10
11
12
13
14
15
17
18
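The three behaviors of uniq can be compared side by side on a handful of values. This is a sketch only: the six dice totals are made up, and the variable names are chosen for illustration. The last command also shows what happens when the sort step is skipped.

```shell
# Made-up dice totals (hypothetical): 10 occurs three times, 8 twice, 13 once.
rolls='10
8
10
13
8
10'

sorted=$(printf '%s\n' "$rolls" | sort -n)
counts=$(printf '%s\n' "$sorted" | uniq -c)    # one line per value, prefixed by its count
repeated=$(printf '%s\n' "$sorted" | uniq -d)  # values that occur more than once
single=$(printf '%s\n' "$sorted" | uniq -u)    # values that occur exactly once
unsorted=$(printf '%s\n' "$rolls" | uniq)      # no sort: equal values are never adjacent

printf 'counts:\n%s\n' "$counts"
printf 'repeated:\n%s\n' "$repeated"
printf 'seen once:\n%s\n' "$single"
printf 'lines kept without sort: %s\n' "$(printf '%s\n' "$unsorted" | wc -l | tr -d ' ')"
```

Because no two equal values are neighbors in the unsorted input, uniq on its own removes nothing there, which is why the examples in this Lesson always sort first.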
Examples
Example 1. Sorting the Output of ps aux
The user madonna is examining the processes running on her local machine. She is familiar with the ps
aux command, which tables information about every running process.
[madonna@station madonna]$ ps aux | head -4
USER       PID %CPU %MEM   VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0  1380    76 ?        S    02:05   0:04 init [
root         2  0.0  0.0     0     0 ?        SW   02:05   0:00 [keventd]
root         3  0.0  0.0     0     0 ?        SW   02:05   0:00 [kapmd]
The following table identifies the selected columns.

Table 4-4. Selected Columns from the ps aux Command

Column Number   Title   Role
1               USER    The user that owns the process.
2               PID     The process ID of the process.
3               %CPU    The relative CPU utilization of the process.
4               %MEM    The relative memory utilization of the process.
5               VSZ     The "virtual size" of the process, or how much memory the process has requested.
6               RSS     The "resident size" of the process, or how much actual memory it is consuming.
The user madonna would like to order processes in terms of some of these parameters. She first orders
the processes by their virtual memory size, sorting numerically, and listing in descending order. Notice
the use of the tail -n +2 command to remove the header from the list of processes.
[madonna@station madonna]$ ps aux | tail -n +2 | sort -rn -k5 | head
gdm       1074  0.0  5.2 37828 13396 ?        S    02:06   0:01 /usr/bin/gdmgreeter
madonna   1844  0.0  0.2 19436   632 pts/0    S    03:42   0:00 sort -rn -k5
apache     909  0.0  0.5 18320  1464 ?        S    02:06   0:00 /usr/sbin/httpd
apache     908  0.0  0.5 18320  1464 ?        S    02:06   0:00 /usr/sbin/httpd
apache     907  0.0  0.5 18320  1464 ?        S    02:06   0:00 /usr/sbin/httpd
apache     906  0.0  0.5 18320  1464 ?        S    02:06   0:00 /usr/sbin/httpd
apache     905  0.0  0.5 18320  1464 ?        S    02:06   0:00 /usr/sbin/httpd
apache     904  0.0  0.5 18320  1464 ?        S    02:06   0:00 /usr/sbin/httpd
apache     903  0.0  0.5 18320  1464 ?        S    02:06   0:00 /usr/sbin/httpd
apache     902  0.0  0.5 18320  1468 ?        S    02:06   0:00 /usr/sbin/httpd
The gdmgreeter (which manages logins for the X graphical environment) and httpd daemon (which
implements the Apache Web Server) are the largest processes on her machine, in terms of the amount of
memory they are requesting. (Note also, the sort command made an appearance).
Next, madonna sorts the output by the sixth column, which tables the resident memory sizes of the
processes.
gdm       1074  0.0  5.2 37828 13396 ?        S    02:06   0:01 /usr/bin/gdmgreet
root      1066  0.0  2.5 17836  6444 ?        R    02:06   0:00 /usr/X11R6/bin/X
root       914  0.0  1.2  9916  3140 ?        S    02:06   0:00 cupsd
xfs        978  0.0  0.9  4816  2512 ?        S    02:06   0:00 xfs -droppriv
elvis     1664  0.0  0.7  6768  2020 ?        S    03:31   0:00 /usr/sbin/sshd
madonna   1748  0.0  0.7  6768  2008 ?        S    03:31   0:00 /usr/sbin/sshd
root      1662  0.0  0.6  6716  1736 ?        S    03:31   0:00 /usr/sbin/sshd
root      1746  0.0  0.6  6716  1724 ?        S    03:31   0:00 /usr/sbin/sshd
madonna   1752  0.0  0.5  4388  1472 pts/2    S    03:31   0:00 -bash
root       885  0.0  0.5 18248  1468 ?        S    02:06   0:00 /usr/sbin/httpd
Interestingly, a different collection of processes makes the top of the list, including the X server and
several instances of the sshd daemon (which implements the Secure Shell service). Presumably, these are
the processes that are currently active.
Next, madonna sorts by the third column, relative CPU activity.
elvis     1744 33.8  0.1  3408   400 pts/1    R    03:31   6:01 cat /dev/zero
elvis     1745 33.7  0.1  3412   400 pts/1    R    03:31   6:00 cat /dev/zero
blondie   1826 33.3  0.1  3412   400 pts/2    R    03:32   5:45 cat /dev/zero
xfs        978  0.0  0.9  4816  2512 ?        S    02:06   0:00 xfs -droppriv
smmsp      864  0.0  0.0  5732     4 ?        S    02:06   0:00 sendmail: Queue
rpc        586  0.0  0.0  1548     4 ?        S    02:05   0:00 portmap
root       914  0.0  1.2  9916  3140 ?        S    02:06   0:00 cupsd
root         9  0.0  0.0     0     0 ?        SW   02:05   0:00 [bdflush]
root       894  0.0  0.0  1572   172 ?        S    02:06   0:00 crond
root       885  0.0  0.5 18248  1468 ?        S    02:06   0:00 /usr/sbin/httpd
Her machine is not seeing much current activity, with the exception of three different cat processes,
which seem to be evenly dividing her CPU.
Example 2. Using sort and uniq to Collect Information on Running Processes
Continuing to examine the processes running on her machine, madonna next uses the ps command with
the -e switch, which lists every process, and the -o switch, which takes a list of column names
as an argument. The -o command line switch allows madonna to list only the information which
interests her. She finds the following entries in a table of format specifiers in the ps(1) man page.
Table 4-5. Selected Format Specifiers for the ps Command

Tag     Specifies
cmd     The short name of the command
pid     The process ID
state   The current state of the process (R=running, S=sleeping)
user    The user who owns the process
As some examples of using the -o command line switch, madonna first tables processes with their
process ID, the user who owns the process, and the command that is running.
[madonna@station madonna]$ ps -e -o pid,user,cmd | head -5
  PID USER     CMD
    1 root     init [
    2 root     [keventd]
    3 root     [kapmd]
    4 root     [ksoftirqd_CPU0]
Next, she merely tables process id and state.

[madonna@station madonna]$ ps -e -o pid,state | head -5
  PID S
    1 S
    2 S
    3 S
    4 S
Now that she has built up some familiarity with the ps command and the -o command line switch, she is
ready to begin asking some questions. She first wants to know who is running processes on the machine,
and how many processes they are running. She tables all processes, listing only the username of the
process owner. She then passes the output through sort and uniq -c. Notice again the use of the
tail -n +2 command to strip the header from the output of the ps command.
[madonna@station madonna]$ ps -e -o user | tail -n +2 | sort | uniq -c
      8 apache
      2 blondie
      1 daemon
      3 elvis
      1 gdm
      5 madonna
     48 root
      1 rpc
      1 smmsp
      1 xfs
She would prefer the output to be sorted, so she adds one more sort to the end of the pipe.
[madonna@station madonna]$ ps -e -o user | tail -n +2 | sort | uniq -c | sort -rn
     48 root
      8 apache
      6 madonna
      3 elvis
      2 blondie
      1 xfs
      1 smmsp
      1 rpc
      1 gdm
      1 daemon
Now madonna easily sees that root and apache are running the most processes (presumably daemons in the
background), followed by madonna, elvis, and blondie (presumably interactive users). How many of
these processes are currently running, and how many are sleeping? Using a similar trick, but this time
listing the process state instead of the user owner, she comes up with her answer.
[madonna@station madonna]$ ps -e -o state | tail -n +2 | sort | uniq -c | sort -rn
73 S
5 R
Most of the processes on her machine (73) are sleeping, while relatively few (5) are running (which
implies they are actively using the CPU).
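The counting pipeline used in this example can be rehearsed on canned input, so that the result does not depend on what happens to be running. The following sketch assumes a made-up list of owners in the shape of ps -e -o user output, header included; the variable names are illustrative only.

```shell
# Made-up output in the shape of "ps -e -o user" (hypothetical users), header included.
ps_output='USER
root
apache
root
madonna
root
apache'

# Drop the header, group identical names, count each group, then rank the counts.
ranking=$(printf '%s\n' "$ps_output" | tail -n +2 | sort | uniq -c | sort -rn)
printf '%s\n' "$ranking"
```

Each stage does one job: tail removes the header, the first sort makes equal names adjacent, uniq -c counts them, and the final sort -rn puts the largest count on top.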
Online Exercises
Lab Exercise
Objective: Use the sort and uniq commands to manage information efficiently.
Specification
1. The file /etc/fstab is used to predefine mount points on your system. The third column of this file
specifies the filesystem type of the device to be mounted.
Sort the contents of this file in alphabetically ascending order, using the third column as your
primary key. Store the output in the newly created file ~/fstab.byfs
2. The file /proc/modules lists currently loaded kernel modules, along with the module size (the
second column) and a current usage count (the third column).
Sort the contents of this file in numerically descending order, using the usage count (third column)
as your primary key, and the module size (second column) as your secondary key. Store the results
in the file ~/modules.byuc
3. Sort the /etc/passwd file in alphabetically ascending order, using the user's login shell as your
primary key. Store the results in the newly created file ~/passwd.bylogin
4. The stat command uses the --format command line switch to specify its output format. As seen in
the stat(1) man page (or stat --help), the following command line will list the permissions of a file
in octal notation.
[student@station student]$ stat --format="%a" /etc/passwd
644
Use this command to list the permissions on all files (and directories, etc.) in the /etc/ directory
(but not subdirectories). Use the sort and uniq commands to reduce this information into a simple
table, with the first column being the number of times that the octal mode specified in the second
column occurs. The table should be sorted in numerically descending order, using the number of
occurrences (the first column) as your primary key. Store the table in a newly created file called
~/etcmodes.txt
If completed correctly, your table should have a form similar to the following. (Do not be concerned
if the actual values of your table differ.)
[student@station student]$ cat etcmodes.txt
    127 644
     63 755
     16 600
     15 777
      6 640
      5 664
      3 444
      2 400
      1 775
      1 750
      1 440
5. The df command lists currently mounted disk partitions, along with the current disk usage. The
fourth column of this command's output lists the amount of available space in blocks.
Create an executable script called ~/bin/avail. When executed, the script should list available
partitions (the output of the df command), sorted in numerically descending order, using the amount
of available space (the fourth column) as the primary key. The header line generated from the df
command should be stripped from the output.
Deliverables
1. The file fstab.byfs, which contains the contents of the /etc/fstab file sorted in alphabetically ascending order,
using the third column as the primary key.
2. The file modules.byuc, which contains the contents of the /proc/modules file sorted in numerically
descending order, using the third column as the primary key, and the second column as the secondary key.
3. The file passwd.bylogin, which contains the contents of the /etc/passwd file sorted in alphabetically
ascending order, using the user's login shell as the primary key.
4. The file etcmodes.txt, which tables the octal permissions (modes) of all files in the /etc directory. The
second column of the table should be the octal mode, and the first column the number of files to which the mode
applies. The table should be sorted in numerically descending order, using the first column as a primary key.
5. The executable script ~/bin/avail, which when executed (without arguments) displays the output of the df
command sorted in numerically descending order, using the fourth column as its primary key. The header line
produced by the df command should be stripped from the output.
If you have performed the exercises correctly, you should be able to generate output similar to the following. Do not
be concerned if your actual values differ.
[student@station student]$ head -5 fstab.byfs modules.byuc passwd.bylogin etcmodes.txt
==> fstab.byfs <==
/dev/sda1             /mnt/camera       auto    noauto,user            0 0
none                  /dev/pts          devpts  gid=5,mode=620         0 0
/home/elvis/case.img  /home/elvis/case  ext2    noauto,loop,user,exec  0 0
LABEL=/               /                 ext3    defaults               1 1
LABEL=/boot           /boot             ext3    defaults               1 2

==> modules.byuc <==
ip_tables      15096   5 [iptable_filter ipt_MASQUERADE iptable_nat]
ext3           70784   2
jbd            51924   2 [ext3]
yenta_socket   13504   2
ds              8680   2

==> passwd.bylogin <==
alice:x:4021:4021::/home/alice:/bin/bash
apache:x:48:48:Apache:/var/www:/bin/bash
arnold:x:4012:4012::/home/arnold:/bin/bash
blondie:x:505:505::/home/blondie:/bin/bash

==> etcmodes.txt <==
    127 644
     63 755
     16 600
     15 777
      6 640
[student@station student]$ avail
/dev/hda3    5131108   4505312    365144  93% /
none          127616         0    127616   0% /dev/shm
/dev/hda1     124427     26268     91735  23% /boot
Questions
1. Which of the following are legitimate invocations of the sort command?
( ) a. sort -k5 -t: data
( ) b. sort -n data
( ) c. sort -rn -k3 data
( ) d. sort -k2 < data
2. Which of the following would sort the file data in numerically descending order, using the third column as the
primary key?
( ) a. sort -rn -k3 data
( ) b. sort -r -3 data
( ) c. sort -n -p3 data
( ) d. All of the Above
( ) e. A or C
3. Which of the following command lines would sort the file /etc/passwd in numerically ascending order, using
the third : separated field as a primary key?
( ) a. sort --numeric-sort -k3 -t: /etc/passwd
( ) b. sort -nk3 -t: /etc/passwd
( ) c. sort -n k3,: /etc/passwd
( ) e. A and B
Use the output from the following command to answer the next 2 questions.
[student@station hwdata]$ head -20 MonitorsDB
#
# Monitor information for use by Xconfigurator
# Supplies similar information to the data in the /usr/X11R6/lib/X11/Cards
# file about video cards.
#
# Each line has format:
#
# <Manufacturer>; <Monitor name>; <EISA ID (if any)>; <horiz sync in \
# Khz>; <verc sync in Hz>; DPMS support
#
# Horiz and vert sync can be a range; like 35.2-55.75; or 31.5,35.5
# BUT remember to use ; to separate fields

Aamazing; Aamazing CM-8426; cm-8426; 31.0-60.0; 40.0-80.0; 1
Aamazing; Aamazing MS-8431; ms-8431; 15.0-36.0; 50.0-70.0; 1
Acer; Acer 11D; API440B; 31.0-35.5; 50.0-90.0
Acer; Acer 1455; API5514; 30.0-54.0; 50.0-120.0
Acer; Acer 1555; API5515; 30.0-54.0; 50.0-120.0
Acer; Acer 15P; acer_15p; 15.0-70.0; 45.0-90.0; 1
Acer; Acer 211c; API9708; 30.0-107.0; 50.0-160.0
4. Which of the following bash command lines would sort the listed monitors in alphabetically ascending order by
EISA ID?
( ) a. grep -v # MonitorsDB | sort -t; -k3
( ) b. grep -v \# MonitorsDB | sort -t\; -k3
( ) c. grep -v \# MonitorsDB | sort -n -t\; -k2
( ) d. grep -v \# MonitorsDB | sort -r -t\; -k2
5. Which of the following bash command lines would sort the listed monitors in numerically descending order,
using the (first value of the) monitor's horizontal "sync frequency" as the primary key?
( ) a. grep -v \# MonitorsDB | sort -nr -t\; -k4
( ) b. grep -v # MonitorsDB | sort -r -t; -k4
( ) c. grep -v # MonitorsDB | sort -t; -k4
( ) d. grep -v \# MonitorsDB | sort -t\; -k4
Use the output from the following command to answer the next 3 questions.
[student@station log]$ ls -l
total 12532
-rw-------  1 root     root         2533 Oct  6 02:06 boot.log
drwxr-xr-x  2 servlet  servlet      4096 Jan 20  2003 ccm-core-cms
-rw-------  1 root     root         9795 Oct  6 05:50 cron
drwxr-xr-x  2 lp       sys          4096 Oct  5 04:07 cups
-rw-r--r--  1 root     root         5761 Oct  6 02:05 dmesg
drwxr-xr-x  2 root     root         4096 Oct  6 02:06 gdm
drwx------  2 root     root         4096 Oct  5 04:07 httpd
-r--------  1 root     root     19136220 Oct  6 03:31 lastlog
-rw-------  1 root     root       184228 Oct  6 05:08 maillog
-rw-------  1 root     root        20899 Oct  6 04:51 messages
-rwx------  1 postgres postgres        0 Apr  1  2003 pgsql
drwxrwsr-x  2 root     rha          4096 Aug 27 10:58 rha
-rw-r--r--  1 root     root        22166 Oct  6 04:10 rpmpkgs
drwxr-xr-x  2 root     root         4096 Oct  6 02:10 sa
drwx------  2 root     root         4096 Apr  5  2003 samba
-rw-r--r--  1 root     root        41382 Aug 21 15:47 scrollkeeper.log
-rw-------  1 root     root         1161 Oct  6 03:31 secure
-rw-------  1 root     root            0 Oct  5 04:07 spooler
drwxr-x---  2 squid    squid        4096 Aug 18 07:05 squid
-rw-rw-r--  1 root     root            0 Oct  5 04:08 up2date
drwxr-xr-x  2 root     root         4096 Feb  3  2003 vbox
-rw-------  1 root     root            0 Oct  5 04:08 vsftpd.log
-rw-rw-r--  1 root     utmp       122880 Oct  6 03:31 wtmp
-rw-r--r--  1 root     root        39015 Oct  6 05:56 xorg.0.log
6. Which of the following command lines would reorder this output in numerically descending order, using the file
size (the fifth column) as the primary key?
( ) a. ls -l | grep -v ^t | sort -rn -k5
( ) b. ls -l | grep -v ^t | sort -n -k4
( ) c. ls -l | grep -v ^t | sort -r -k5
( ) d. ls -l | grep -v ^t | sort -t: -k5
7. Which of the following command lines would reorder this output in numerically ascending order, using the link
count (the second column) as a primary key, and the file size (the fifth column) as the secondary key?
( ) a. ls -l | grep -v ^t | sort -rn -k5,2
( ) b. ls -l | grep -v ^t | sort -n -k2 -k5
( ) c. ls -l | grep -v ^t | sort -n -k5 -k2
( ) d. ls -l | grep -v ^t | sort -t: -k5 -k2
8. Which of the following command lines would reorder this output in alphabetically ascending order, using the
group owner (the fourth column) as the primary key, and the filename (the ninth column) as the secondary key?
( ) a. ls -l | grep -v ^t | sort -r -k9 -k4
( ) b. ls -l | grep -v ^t | sort -t- -k4,9
( ) c. ls -l | grep -v ^t | sort -k4 -k9
( ) d. ls -l | grep -v ^t | sort -rn -k4 -k9
9. Which of the following would print the number of occurrences of each record in the file data?
( ) a. sort -c data
( ) b. sort data | uniq
( ) c. sort data | uniq -c
( ) d. uniq data | sort -c
10. Which of the following would print repeated records from the file data?
( ) a. sort -r data
( ) b. uniq data | sort -r
( ) c. sort data | uniq -d
( ) d. sort data | uniq
Chapter 5. Extracting and Assembling Text: cut and paste
Key Concepts
The cut command extracts text from text files, based on columns specified by bytes, characters, or
fields.
The paste command merges two text files line by line.
Discussion
In this Lesson, we explore two commands that are used to extract columns from a stream of text, or
assemble columns into a wider stream: cut and paste.
The cut Command

Extracting Text with cut
The cut command extracts columns of text from a text file or stream. Imagine taking a sheet of paper that
lists rows of names, email addresses, and phone numbers. Rip the page vertically twice so that each
column is on a separate piece. Hold onto the middle piece which contains email addresses, and throw the
other two away. This is the mentality behind the cut command.
The cut command interprets any command line arguments as filenames of files on which to operate, or
operates on the standard in stream if none are provided. In order to specify which bytes, characters, or
fields are to be cut, the cut command must be called with one of the following command line switches.
Table 5-1. "Mandatory" Command Line Switches for the cut Command

Switch    Effect
-b list   Extract bytes specified in list
-c list   Extract characters specified in list
-f list   Extract fields specified in list
The list arguments are actually a comma-separated list of ranges. Each range can take one of the
following forms.
Table 5-2. Range Specifications

N      Only item number N.
N-     Items N through the end of the line.
N-M    Items N through M (inclusive).
-M     From the beginning of the line through item number M (inclusive).
All items from the beginning of the line through the end of the line.
Extracting text by Character Position with cut -c

With the -c command line switch, the list specifies character positions in a line of text, where the
first character is character number 1. As an example, the file /proc/interrupts lists device drivers,
the interrupt request (IRQ) line to which they attach, and the number of interrupts which have occurred
on that IRQ line. (Do not be concerned if you are not yet familiar with the concepts of a device driver or
IRQ line. Focus instead on how cut is used to manipulate the data).
[student@rosemont student]$ cat /proc/interrupts
           CPU0
  0:    4477340          XT-PIC  timer
  1:      25250          XT-PIC  keyboard
  2:          0          XT-PIC  cascade
  3:       7344          XT-PIC  ehci-hcd
  5:     310187          XT-PIC  usb-uhci, ohci1394
  8:          1          XT-PIC  rtc
 10:        166          XT-PIC  usb-uhci, eth1
 11:    6575295          XT-PIC  usb-uhci, eth0, Audigy
 12:     544632          XT-PIC  PS/2 Mouse
 14:      80379          XT-PIC  ide0
 15:     341407          XT-PIC  ide1
NMI:          0
ERR:          0
Because the characters in the file are formatted into columns, the cut command can extract particular
regions of interest. If just the IRQ line and the number of interrupts were of interest, the rest of the file
could be cut away, as in the following example. (Note the use of the grep command to first reduce the
file to just the lines pertaining to interrupt lines.)
[student@rosemont student]$ grep [[:digit:]]: /proc/interrupts | cut -c1-15
  0:    4512997
  1:      27954
  2:          0
  3:       7344
  5:     312095
  8:          1
 10:        166
 11:    6629756
 12:     545523
 14:      81025
 15:     344239
Alternately, if only the device drivers bound to particular IRQ lines were of interest, multiple ranges of
characters could be specified.
[student@rosemont student]$ grep [[:digit:]]: /proc/interrupts | cut -c1-5,34-
  0: timer
  1: keyboard
  2: cascade
  3: ehci-hcd
  5: usb-uhci, ohci1394
  8: rtc
 10: usb-uhci, eth1
 11: usb-uhci, eth0, Audigy
 12: PS/2 Mouse
 14: ide0
 15: ide1
If the character specifications were reversed, can the cut command be used to rearrange the ordering of
the data?
[student@rosemont student]$ grep [[:digit:]]: /proc/interrupts | cut -c34-,1-5
  0: timer
  1: keyboard
  2: cascade
...
The answer is no. Text will appear only once, in the same order it appears in the source, even if the range
specifications are overlapping or rearranged.
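This behavior is easy to confirm on a single line. The sketch below assumes a made-up fixed-width line (label in characters 1-4, a value in 6-10, a name from character 12 on) and gives the same two ranges to cut in both orders.

```shell
# A made-up fixed-width line (hypothetical): label in 1-4, value in 6-10, name from 12.
line=' 14: 80379 ide0'

forward=$(printf '%s\n' "$line" | cut -c1-4,12-)
reversed=$(printf '%s\n' "$line" | cut -c12-,1-4)

# Brackets make the leading blank visible; both results should be identical.
printf '[%s]\n[%s]\n' "$forward" "$reversed"
```

Rearranging columns therefore needs a different tool; the cut command only selects.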
Extracting Fields of Text with cut -f

The cut command can also be used to extract text that is structured not by character position, but by
some delimiter character, such as a TAB or :. The following command line switches can be used to
further qualify what is meant by a field, or to select source lines more selectively.
Table 5-3. Command Line Switches for cut -f

Switch                      Effect
-d DELIM                    Use DELIM to separate fields on input, instead of the default TAB character.
-s                          Do not include lines that do not contain the delimiter character (useful for
                            stripping comments and headers).
--output-delimiter=STRING   On output, use the text specified by STRING instead of the input field
                            delimiter.
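The -d and -s switches from the table can be seen working together in a small sketch. The input here is made up: two colon-separated records in the /etc/passwd layout, preceded by a comment line that contains no colon.

```shell
# Made-up colon-separated records (hypothetical), preceded by a line with no delimiter.
input='# comment line without the delimiter
bin:x:1:1:bin:/bin:/sbin/nologin
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin'

# -d: selects the delimiter, -f1,6 keeps fields 1 and 6, -s drops lines lacking a colon.
picked=$(printf '%s\n' "$input" | cut -s -d: -f1,6)
printf '%s\n' "$picked"
```

Without -s, the comment line would be passed through unchanged, because cut prints lines that lack the delimiter in full.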
For example, the file /usr/share/hwdata/pcitable lists over 3000 vendor IDs and device IDs
(which can be probed from PCI devices), and the kernel modules and text strings which should be
associated with them, separated by tabs.
[student@rosemont hwdata]$ head -15 pcitable
# This file is automatically generated from isys/pci. Edit
# it by hand to change a driver mapping. Other changes will
# be lost at the next merge - you have been warned.
# Edit by hand to change a driver mapping. Changes to descriptions
# will be lost at the next merge - you have been warned.
# If you run makeids, please make sure no entries are lost.
# The format is ("%d\t%d\t%s\t"%s"\n", vendid, devid, moduleName, cardDescription)
# or ("%d\t%d\t%d\t%d\t%s\t"%s"\n", vendid, devid, subvendid, subdevid, moduleName, cardDesc
0x0675  0x1700  "unknown"  "Dynalink|IS64PH ISDN Adapter"
0x0675  0x1702  "hisax"    "Dynalink|IS64PH ISDN Adapter"
0x09c1  0x0704  "unknown"  "Arris|CM 200E Cable Modem"
0x0e11  0x0001  "ignore"   "Compaq|PCI to EISA Bridge"
0x0e11  0x0002  "ignore"   "Compaq|PCI to ISA Bridge"
0x0e11  0x0046  "cciss"    "Compaq|Smart Array 64xx"
The following example extracts the third and fourth columns, using the default TAB character to separate
fields. Note the use of the -s command line switch, which effectively strips the header lines (which do not
contain any TABs).
[student@rosemont hwdata]$ cut -s -f3,4 pcitable | head
"unknown"       "Dynalink|IS64PH ISDN Adapter"
"hisax"         "Dynalink|IS64PH ISDN Adapter"
"unknown"       "Arris|CM 200E Cable Modem"
"ignore"        "Compaq|PCI to EISA Bridge"
"ignore"        "Compaq|PCI to ISA Bridge"
"cciss"         "Compaq|Smart Array 64xx"
"unknown"       "Compaq|NC7132 Gigabit Upgrade Module"
"unknown"       "Compaq|NC6136 Gigabit Server Adapter"
"tmspci"        "Compaq|Netelligent 4/16 Token Ring"
"ignore"        "Compaq|Triflex/Pentium Bridge, Model 1000"
As another example, suppose we wanted to obtain a list of the most commonly referenced kernel
modules in the file. We could use a similar cut command, along with tricks learned in the last Lesson, to
obtain a quick listing of the number of times each kernel module appears.
[student@rosemont hwdata]$ cut -s -f3 pcitable | sort | uniq -c | sort -rn | head
   1988 "unknown"
    148 "ignore"
     83 "aic7xxx"
     70 "gdth"
     37 "e100"
     37 "Card:ATI Rage 128"
     36 "3c59x"
     24 "Card:ATI Mach64"
     21 "tulip"
     20 "agpgart"
Many of the entries are obviously unknown, or intentionally ignored, but we do see that the aic7xxx SCSI
driver, and the e100 Ethernet card driver, are commonly used.
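The same counting pipeline can be tried without access to a real pcitable file. The following sketch builds a tiny, hypothetical stand-in (one comment line plus three TAB separated records borrowed from the listing above) and runs it through cut -s and the sort/uniq combination; the function name sample_table is an assumption made only for this illustration.

```shell
# Hypothetical stand-in for pcitable: a comment line with no TABs,
# followed by TAB separated records (vendor, device, module, description).
sample_table() {
    printf '# header comment without any TAB characters\n'
    printf '0x0e11\t0x0001\t"ignore"\t"Compaq|PCI to EISA Bridge"\n'
    printf '0x0e11\t0x0002\t"ignore"\t"Compaq|PCI to ISA Bridge"\n'
    printf '0x0e11\t0x0046\t"cciss"\t"Compaq|Smart Array 64xx"\n'
}

# -s drops the comment line (it has no delimiter); -f3 keeps the module column.
modules=$(sample_table | cut -s -f3)

# Count how often each module name occurs, most frequent first.
top_module=$(printf '%s\n' "$modules" | sort | uniq -c | sort -rn | head -1)

printf '%s\n' "$top_module"
```

With this sample data the most frequent module is "ignore", which appears twice. The exact amount of leading white space printed by uniq -c can vary between systems.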
Extracting Text by Byte Position with cut -b

The -b command line switch is used to specify which text to extract by bytes. Extracting text using the -b
command line switch is very similar in spirit to using -c. In fact, when dealing with text encoded using
ASCII or one of the ISO 8859 character sets (such as Latin-1), the two are identical. The -b switch
differs from -c, however, when using character sets with variable length encoding, such as UTF-8 (a
standard character set on which many people are converging, and the default in Red Hat Enterprise
Linux).
As a quick example, consider the following three characters of German text: für. When using UTF-8
encoding, the two characters which are part of the ASCII character set, f and r, are each encoded using a
single byte. The ü, however, is encoded using two bytes, as is evidenced by the wc command.
[elvis@station elvis]$ echo für | wc -c
5
Accounting for these five bytes, we have one byte each for the letters f and r, and one byte for the
newline which was appended to the output, leaving two bytes for the ü.
When using cut -c, the ü would be considered a single character, but when using cut -b, it would be
considered two bytes, as in the following example.
[elvis@station elvis]$ echo fü | cut -c 1-2
fü
[elvis@station elvis]$ echo fü | cut -b 1-2
f?
The first time, the cut command counted the two bytes used to encode the ü as a single character, but
the second time, it was considered two bytes. As a result, the character was "cut in half" by the cut
command, and the terminal was not able to display it correctly.
Usually, cut -c is the proper way to use the cut command, and cut -b will only be necessary for technical
situations.
Note: Notice the inconsistent nomenclature between wc and cut. With wc -c, the wc command
really returns the number of bytes contained in a string, while cut -c measures text in characters.
To count characters rather than bytes with wc, use the separate wc -m switch.
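The byte arithmetic can be checked without relying on the terminal's encoding. The sketch below is an illustration only: it writes the two UTF-8 bytes of the ü as octal escapes (\303\274) so that the script itself stays plain ASCII, and the variable names are made up for the example.

```shell
# "für" spelled with octal escapes for the two UTF-8 bytes of the u-umlaut.
word=$(printf 'f\303\274r')

# wc -c counts bytes: f (1) + u-umlaut (2) + r (1) + newline (1).
byte_count=$(printf '%s\n' "$word" | wc -c | tr -d ' ')

# cut -b1-2 keeps the f and only the first byte of the u-umlaut;
# counting the result shows two bytes plus the trailing newline.
half_bytes=$(printf '%s\n' "$word" | cut -b1-2 | wc -c | tr -d ' ')

printf 'bytes in line: %s\n' "$byte_count"
printf 'bytes kept by cut -b1-2, plus newline: %s\n' "$half_bytes"
```

In a UTF-8 locale, running wc -m on the same line would be expected to report four characters rather than five bytes; that part depends on the locale settings of the machine and is not exercised here.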
The paste Command

The paste command is used to combine multiple files into a single output. Recall the fictional piece of
paper which listed rows of names, email addresses, and phone numbers. After tearing the paper into three
columns, what if we had glued the first back to the third, leaving a piece of paper listing only names and
phone numbers? This is the concept behind the paste command.
The paste command expects a series of filenames as arguments. The paste command will read the first
line from each file, join the contents of each line inserting a TAB character in between, and write the
resulting single line to standard out. It then continues with the second line from each file.
Consider the following two files as an example.
[student@station student]$ cat file-1
File-1 Line 1
File-1 Line 2
File-1 Line 3
[student@station student]$ cat file-2
File-2 Line 1
File-2 Line 2
File-2 Line 3
The paste command would output this:

[student@station student]$ paste file-1 file-2
File-1 Line 1   File-2 Line 1
File-1 Line 2   File-2 Line 2
File-1 Line 3   File-2 Line 3

If we had more than two files, the first line of each file would become the first line of the output. The
second output line would contain the second lines of each input file, obtained in the order we gave them
on the command line. As a convenience, the filename - can be supplied on the command line. For this
"file", the paste command would read from standard in.
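As a small sketch of the - convention, the following pastes a file together with lines arriving on standard in. The temporary directory and the sample text are assumptions made for the example.

```shell
# Work in a throwaway directory so no existing files are touched.
workdir=$(mktemp -d)
printf 'File-1 Line 1\nFile-1 Line 2\n' > "$workdir/file-1"

# The "-" argument makes paste take its second column from standard in.
combined=$(printf 'stdin Line 1\nstdin Line 2\n' | paste "$workdir/file-1" -)

rm -r "$workdir"
printf '%s\n' "$combined"
```

Each output line holds a line from file-1, a TAB, and the corresponding line read from the pipe.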
Table 5-4. Command Line Switches for paste
Option          Description
-d list         Reuse characters from list for delimiters (instead of the default TAB character).
-s, --serial    Transpose the result, so that each line in the first file is pasted into a
                single line, each line of the second file is pasted into the next single
                line, etc.
Examples
Example 1. Handling Free-Format Records
In a free-format record layout, input record items are identified by their position on the line, not by their
character position. Input fields are expected to be separated by exactly one TAB character, but any
character that does not appear in the data items themselves may be used. Each occurrence of the
delimiter separates a field.
Our favorite example file /etc/passwd has fields separated by exactly one colon (:) character. Field 1
is the account name and field 7 gives the shell program used. Using the cut command, we could output a
new file with just the account name and the shell name:
[student@station student]$ cut -d: -f1,7 /etc/passwd
root:/bin/bash
bin:/sbin/nologin
daemon:/sbin/nologin
adm:/sbin/nologin
...
Notice that the output lines use the same field delimiters as do the input records. We can change that with
the --output-delimiter switch:
[student@station student]$ cut -d: -f7,1 --output-delimiter=, /etc/passwd
root,/bin/bash
bin,/sbin/nologin
daemon,/sbin/nologin
adm,/sbin/nologin
lp,/sbin/nologin
...
Example 2. Living With Fixed-Format Records

In a fixed-format record, data are assigned specific character positions, or columns, that are the same in
each input line. Use the -c switch to identify the input character positions copied to each output line.
[student@station student]$ cat fixed-data
abc123
def456
hij789
lkm012
We can clip out just characters 3 and 4 like this:

[student@station student]$ cut -c3-4 fixed-data
c1
f4
j7
m0
Example 3. Using (and Misusing) a Space as a Delimiter

The mount command, without arguments, returns a list of which devices are mounted to which mount
points, along with the filesystem type and relevant mount options.
[student@station student]$ mount
/dev/hda3 on / type ext3 (rw)

none on /proc type proc (rw)
usbdevfs on /proc/bus/usb type usbdevfs (rw)
/dev/hda1 on /boot type ext3 (rw)
none on /dev/pts type devpts (rw,gid=5,mode=620)
none on /dev/shm type tmpfs (rw)
automount(pid780) on /misc type autofs (rw,fd=5,pgrp=780,minproto=2,maxproto=3)
Because the words are separated by single spaces, the cut command can be used to easily extract
the third and fifth words (which contain the mount point and filesystem type, respectively). The
command must be supplied with the -d " " command line switch, which instructs it to treat spaces as
field delimiters.
[student@station student]$ mount | cut -d" " -f3,5
/ ext3
/proc proc
/proc/bus/usb usbdevfs
/boot ext3
/dev/pts devpts
/dev/shm tmpfs
/misc autofs
Will the same technique work for the df command?

[student@station student]$ df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/hda3              5131108   4502000    368456  93% /
/dev/hda1               124427     26268     91735  23% /boot
none                    127616         0    127616   0% /dev/shm
[student@station student]$ df | cut -d" " -f1,5
Filesystem
/dev/hda3
/dev/hda1
none
Apparently not. The cut command is using a space for a field delimiter. The catch is that the cut
command does not collapse multiple spaces into a single space, but treats them individually. Where does
the fifth "field" occur in the df command's output? Somewhere about halfway between the first two
columns. The cut command dutifully prints the first and (empty) fifth field.
Unfortunately, this is a commonly encountered limitation of the cut command. Fortunately, we will find
techniques in a later Lesson that can be used to overcome it.
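As a preview, and not necessarily the technique the later Lesson uses, one common approach is to squeeze each run of spaces into a single space with tr -s before handing the line to cut. The sketch below compares the two on one row copied from the df output above.

```shell
# One row of df output, with its runs of padding spaces intact.
row='/dev/hda3              5131108   4502000    368456  93% /'

# Every individual space is a delimiter, so field 5 is empty.
naive=$(printf '%s\n' "$row" | cut -d' ' -f1,5)

# tr -s ' ' collapses each run of spaces to one, so fields match words.
squeezed=$(printf '%s\n' "$row" | tr -s ' ' | cut -d' ' -f1,5)

printf 'naive:    [%s]\nsqueezed: [%s]\n' "$naive" "$squeezed"
```

The brackets in the output make the empty fifth field of the naive version visible as a trailing space.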
Example 4. Examples of Pasting

Our initial example showed the most common usage of paste, where the first lines from all the input files
are concatenated together and separated by a delimiter character; the process then repeats for the
subsequent lines. The -s option pastes all the lines from the first input file into the first output line, then
pastes all the lines from the second input file into the second output line, and so on:
[student@station student]$ paste -d: -s file-1 file-2
File-1 Line 1:File-1 Line 2:File-1 Line 3

File-2 Line 1:File-2 Line 2:File-2 Line 3
Recall that the -d switch list argument can take more than one character. This can be used to provide a
different delimiter between each pair of portions written to the output. The list characters are recycled
if necessary:
[student@station student]$ paste -d+-/ file-1 file-2 file-1 file-2 file-1
File-1 Line 1+File-2 Line 1-File-1 Line 1/File-2 Line 1+File-1 Line 1
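The -s and -d switches also combine with the - filename, which gives a compact way of joining lines that arrive on a pipe. The following sketch uses made-up input lines to show both a single delimiter and a recycled delimiter list.

```shell
# Join three lines from standard in into one line, separated by "+".
joined=$(printf '1\n2\n3\n' | paste -s -d+ -)

# With a two character list, the delimiters "+" and "-" are used in turn
# and then recycled.
alternating=$(printf 'a\nb\nc\nd\n' | paste -s -d+- -)

printf '%s\n%s\n' "$joined" "$alternating"
```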
Online Exercises
Lab Exercise
Objective: Use cut and paste to manage text.
Specification
1. Use the cut command to extract a list of usernames and login shells from the /etc/passwd file,
where the resulting usernames and login shells are separated by a single space. Sort the resulting list
in ascending alphabetical order, using the login shell as the primary key, and the username as a
secondary key. Store the result in the newly created file ~/usershells.txt.
2. The file /proc/cpuinfo contains information about your system's detected CPU. Use the cut
command to extract only the values, not the names or the : that is used to separate the names from
the values. Store the resulting list of values in the newly created file ~/cpuvalues.txt.
3. The file /etc/sysconfig/init is used to define parameters which configure your machine's
startup method. Parameters are defined using the same syntax used by the bash shell, i.e.,
NAME=value.
Use some combination of the grep and cut commands to generate a list of the parameter names
found in this file, one name per line. Do not include the parameter values or the = which is used to
separate them, or any of the comment or empty lines found in the original file. Sort the names in
alphabetically ascending order, and store them in the newly created file ~/initparams.txt.
4. The following script can be used to print a series of 10 random numbers.
#!/bin/bash
for i in $(seq 10); do
echo $RANDOM
done
Create the script in a file of your choosing, and make the file executable. Execute the script 5
separate times, each time recording the output in a file named ~/trial1, ~/trial2, ~/trial3,
etc.
Create a file called titles, which contains the words run1, run2, ... run10, one per line.
Use the paste command to combine the files named titles, trial1, trial2, trial3, trial4,
and trial5, in that order, into a file called trials. Use the default TAB character to separate the
columns.
If you have completed the lab correctly, you should be able to generate output similar to the following.
Do not be concerned if some of the values differ.
[student@station student]$ head -4 usershells.txt cpuvalues.txt initparams.txt titles trial[15] trials
==> usershells.txt <==
news
alice /bin/bash
apache /bin/bash
arnold /bin/bash
==> cpuvalues.txt <==
0
GenuineIntel
6
8
==> initparams.txt <==
BOOTUP
LOGLEVEL
MOVE_TO_COL
PROMPT
==> titles <==
run1
run2
run3
run4
==> trial1 <==
9739
9089
6712
25993
==> trial5 <==
32029
1474
8347
31709
==> trials <==
run1    9739    27486   12465   14282   32029
run2    9089    27496   13136   15835   1474
run3    6712    7089    7467    7969    8347
run4    25993   12188   31152   12746   31709
Deliverables
1. The file usershells.txt, which contains a list of all users and login shells defined in the /etc/passwd file,
separated by a space. The lines should be sorted in alphabetically ascending order, using login shells as the
primary key, and usernames as the secondary key.
2. The file cpuvalues.txt, which lists the values for parameters found in the file /proc/cpuinfo, one per line.
The values should appear in the same order as they appear in the file /proc/cpuinfo.
3. The file initparams.txt, which contains an alphabetically ascending list of parameter names found in the file
/etc/sysconfig/init.
4. Five files named ~/trial1, ~/trial2, ... ~/trial5, which list 10 random integers, one integer per line.
5. The file ~/titles, which contains the 10 words run1, run2, ... run10, one word per line.
6. The file trials, which contains the contents of the six files titles, trial1, ... trial5 pasted into a single
file, using the paste command.
Questions
1. Which of the following command lines would extract characters 10-20 from each line of the file /etc/xpdfrc?
( ) a. cut -c10-20 /etc/xpdfrc
( ) b. cut -c10,20 /etc/xpdfrc
( ) c. cut -f10-20 /etc/xpdfrc
( ) d. cut -f10,20 /etc/xpdfrc
2. Which of the following command lines would extract the second and fourth fields from the /etc/group file?
Recall that the /etc/group file uses a : to separate fields.
( ) a. cut -d: -f2,4 /etc/group
( ) b. cut -f:2,4 /etc/group
( ) c. cut -t: -f2-4 /etc/group
( ) d. cut -t: -f2,4 /etc/group
The file Web defines a palette of colors by listing RGB (Red, Green, and Blue) values for each color, one triplet per
line. Use the following transcript to answer the next 2 questions.
[student@station student]$ cat Web
GIMP Palette
# Netscape -- GIMP Palette file
255 255 255
255 255 204
255 255 153
255 255 102
255 255 051
255 255 000
255 204 255
255 204 204
3. Which of the following command lines would extract the green values (i.e., the second column), omitting the two
header lines?
( ) a. tail +3 Web | cut -d" " -f2
( ) b. cut -s -f2 Web
( ) c. grep "[[:digit:]]" Web | cut -c5-7
( ) e. A and C only
4. Which of the following command lines would extract the red and blue values (i.e., the first and third columns),
separating them with a : instead of a space on output? The two header lines should again be omitted.
( ) a. tail +3 Web | cut -d: -f1,3
( ) b. tail +3 Web | cut -c1-3,9-11 -o:
( ) c. tail +3 Web | cut -d" " -f1,3 --output-delimiter=:
( ) e. B and C only
The file /usr/share/gimp/1.2/gradients/Abstract_1 defines gradient parameters by listing 13 numbers,
separated by spaces. Use the following transcript to answer the following 2 questions.
[student@station gradients]$ cat Abstract_1
GIMP Gradient
6
0.000000 0.286311 0.572621 0.269543 0.259267 1.000000 1.000000 0.215635 0.407414 0.984953 1.000000 0
0.572621 0.657763 0.716194 0.215635 0.407414 0.984953 1.000000 0.040368 0.833333 0.619375 1.000000 0
0.716194 0.734558 0.749583 0.040368 0.833333 0.619375 1.000000 0.680490 0.355264 0.977430 1.000000 0
0.749583 0.784641 0.824708 0.680490 0.355264 0.977430 1.000000 0.553909 0.351853 0.977430 1.000000 0
0.824708 0.853088 0.876461 0.553909 0.351853 0.977430 1.000000 1.000000 0.000000 1.000000 1.000000 0
0.876461 0.943172 1.000000 1.000000 0.000000 1.000000 1.000000 1.000000 1.000000 0.000000 1.000000 0
5. Which of the following command lines would extract the even numbered columns, omitting the first two header
lines?
( ) a. tail +3 Abstract_1 | cut -d" " -c2,4,6,8,10,12
( ) b. tail +3 Abstract_1 | cut -d" " -f2-12:2
( ) c. tail +3 Abstract_1 | cut -d" " -f2,4,6,8,10,12
( ) e. B and C
6. Which of the following command lines would extract the first two columns, omitting the first two header lines?
( ) a. tail +3 Abstract_1 | cut -c1-18
( ) b. tail +3 Abstract_1 | cut -f1,2
( ) c. tail +3 Abstract_1 | cut -c1,18
( ) e. A and C
The file /proc/iomem lists physical memory ranges, each separated by a : from the device which is using it.
[student@station student]$ tail /proc/iomem
e8000000-ebffffff : ATI Technologies Inc Rage Mobility M3 AGP 2x

e8000000-e87fffff : vesafb
fbffd000-fbffd07f : 3Com Corporation Mini PCI 56k Winmodem
fbffd400-fbffd4ff : 3Com Corporation Mini PCI 56k Winmodem
fbffd800-fbffd87f : 3Com Corporation 3c556 Hurricane CardBus
fbffdc00-fbffdc7f : 3Com Corporation 3c556 Hurricane CardBus
fbffe000-fbffffff : ESS Technology ES1983S Maestro-3i PCI Audio Accelerator
fd000000-feffffff : PCI Bus #01
fdffc000-fdffffff : ATI Technologies Inc Rage Mobility M3 AGP 2x
ffe00000-ffffffff : reserved
7. Which of the following command lines would reliably display only the starting memory address from each range?
( ) a. cut -d- -f1 /proc/iomem
( ) b. cut -c 1-8 /proc/iomem
( ) c. cut -d: -f1 /proc/iomem | cut -d- -f1
( ) d. A and C
( ) e. None of the Above
The file /proc/mounts lists all currently mounted devices, along with their mount points, file systems, and mount
options, each separated by spaces.
[student@station student]$ cat /proc/mounts
rootfs / rootfs rw 0 0
/dev/root / ext3 rw 0 0
/proc /proc proc rw 0 0
usbdevfs /proc/bus/usb usbdevfs rw 0 0
/dev/hda1 /boot ext3 rw 0 0
none /dev/pts devpts rw 0 0
none /dev/shm tmpfs rw 0 0
automount(pid780) /misc autofs rw 0 0
8. Which of the following command lines would generate a list of currently used filesystems (listed in the third
column), and the number of times they are being used?
( ) a. cut -d" " -f3 /proc/mounts | sort | uniq -c
( ) b. cut -c3 /proc/mounts | sort -c
( ) c. cut -f3 /proc/mounts | sort -u | uniq -c
( ) d. cut -d" " -f3 /proc/mounts | sort -uc
9. Which of the following command lines would combine the files a.txt, b.txt, and c.txt?
( ) a. paste -a a.txt b.txt c.txt
( ) b. paste -j a.txt b.txt c.txt
( ) c. paste -m a.txt b.txt c.txt
( ) d. paste a.txt b.txt c.txt
10. Which of the following command lines would combine the files a.txt, b.txt, and c.txt, using a : to
separate contents of each?
( ) a. paste -d: a.txt b.txt c.txt
( ) b. paste -m -d: a.txt b.txt c.txt
( ) c. paste -t: a.txt b.txt c.txt
( ) d. paste -t: -j a.txt b.txt c.txt
Chapter 6. Tracking differences: diff

Key Concepts
The diff command summarizes the differences between two files.
The diff command supports a wide variety of output formats, which can be chosen using various
command line switches. The most commonly used of these is the unified format.
The diff command can be told to ignore certain types of differences, such as changes in white space or
capitalization.
diff -r recursively summarizes the differences between two directories.
When comparing directories, the diff command can be told to ignore files whose filenames match
specified patterns.
Discussion
The diff Command
The diff command is designed to compare two files that are similar, but not identical, and generate output
that describes exactly how they differ. The diff command is commonly used to track changes to text files,
such as reports, web pages, shell scripts, or C source code. Companion utilities exist as well,
so that given a version of a file, and the output of the diff command comparing it to some other version,
the file can be brought up to date automatically. Most notable of these is the patch command.
We first introduce the diff command by way of example. In the open source community, documentation
generally sacrifices correctness of spelling or grammar for timeliness, as demonstrated in the following
README.pam_ftp file.
[blondie@station blondie]$ cat README.pam_ftp
This is the README for pam_ftp
------------------------------

This module is an authentication module that does simple ftp
authentication.

Recognized arguments:

        "debug"         print debug messages
        "users="        comma separated list of users which
                        could login only with email adress
        "ignore"        allow invalid email adresses

Options for:
auth:   for authentication it provides pam_authenticate() and
        pam_setcred() hooks.

James Anderson <james@anderson.us>, 17. June 1999
Noticing that the words address and addresses are misspelled, blondie sets out to apply changes, first by
correcting the misspelled words, and secondly by appending a line recording her revisions. She first
makes a copy of the file, appending the .orig extension, and then makes her edits.
[blondie@station blondie]$ cp README.pam_ftp README.pam_ftp.orig
[blondie@station blondie]$ nano README.pam_ftp
She now uses the diff command to compare the two revisions of the file.
[blondie@station blondie]$ diff README.pam_ftp.orig README.pam_ftp
11,12c11,12
<                         could login only with email adress
<         "ignore"        allow invalid email adresses
---
>                         could login only with email address
>         "ignore"        allow invalid email addresses
18a19
> Spelling corrections applied by blondie, 22 Sep 2003
Without yet going into detail about diff's syntax, we see that the command has identified the differences
between the two files, exemplifying the essence of the diff command. The diff command is so commonly
used that its output is often referred to as a noun, as in "Here's the diff between those two files".
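The same kind of output can be produced on a small scale. The following sketch creates two hypothetical three and four line files in a temporary directory and captures both the diff output and the command's exit status, which is 1 when the files differ.

```shell
# Two small, made-up versions of a file in a throwaway directory.
workdir=$(mktemp -d)
printf 'one\ntwo\nthree\n' > "$workdir/old.txt"
printf 'one\n2\nthree\nfour\n' > "$workdir/new.txt"

# diff exits with status 1 when the files differ, so capture the status
# explicitly instead of treating it as a failure.
status=0
changes=$(diff "$workdir/old.txt" "$workdir/new.txt") || status=$?

rm -r "$workdir"
printf '%s\n' "$changes"
printf 'exit status: %s\n' "$status"
```

The output contains one change command for line 2 (2c2) and one append command after line 3 (3a4), mirroring the 11,12c11,12 and 18a19 annotations in blondie's diff.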
Output Formats for the diff Command

The diff command was conceived in the early days of the Unix community. Over time, improvements
have been made in how diff annotates changes. To preserve backward compatibility, however, older
formats are still available. The following lists commonly used diff formats.
"Standard" diff
Originally, the diff command was used to preserve bandwidth over slow network connections.
Rather than transferring a new version of a file, a summary of the revisions would be transferred
instead. This summary was in a format that was easily recognized by the ed command line editor,
which is seldom used today. Examining the previous output, one can imagine the ed editor being
asked to change lines 11 and 12, and append a line after line 18.
Soon, however, room for improvement was found. What if an administrator accidentally applied the
changes twice? The ed editor would happily make the changes, corrupting the contents of the file.
The solution is a context sensitive diff.
Context diff (diff -c)
The context sensitive diff is generated by specifying the -c or -C N command line switches. (The
second form is used to specify that exactly N lines of context should be generated.) Consider the
following example.
[blondie@station blondie]$ diff -c README.pam_ftp.orig README.pam_ftp
*** README.pam_ftp.orig 2003-10-07 15:30:05.000000000 -0400
--- README.pam_ftp      2003-10-07 15:30:17.000000000 -0400
***************
*** 8,18 ****
  
          "debug"         print debug messages
          "users="        comma separated list of users which
!                         could login only with email adress
!         "ignore"        allow invalid email adresses
  
  Options for:
  auth:   for authentication it provides pam_authenticate() and
          pam_setcred() hooks.
  
  James Anderson <james@anderson.us>, 17. June 1999
--- 8,19 ----
  
          "debug"         print debug messages
          "users="        comma separated list of users which
!                         could login only with email address
!         "ignore"        allow invalid email addresses
  
  Options for:
  auth:   for authentication it provides pam_authenticate() and
          pam_setcred() hooks.
  
  James Anderson <james@anderson.us>, 17. June 1999
+ Spelling corrections applied by blondie, 22 Sep 2003
Obviously, the context diff includes several lines of surrounding context before identifying changes.
Changes are annotated by using a ! to mark lines that have changed, + to mark lines that have
been added, and - to mark lines that have been removed. Using a context diff, utilities can
automatically detect when an administrator accidentally tries to update a file twice.
Unified diff (diff -u)
The unified diff is generated by specifying the -u or -U N command line switches. (The second
form is used to specify that exactly N lines of context should be generated.) Rather than duplicating
lines of context, the unified diff attempts to record changes all in one stanza, creating a more
compact, and arguably more readable, output.
[blondie@station blondie]$ diff -u README.pam_ftp.orig README.pam_ftp
--- README.pam_ftp.orig 2003-10-07 15:30:05.000000000 -0400
+++ README.pam_ftp      2003-10-07 15:30:17.000000000 -0400
@@ -8,11 +8,12 @@
 
         "debug"         print debug messages
         "users="        comma separated list of users which
-                        could login only with email adress
-        "ignore"        allow invalid email adresses
+                        could login only with email address
+        "ignore"        allow invalid email addresses
 
 Options for:
 auth:   for authentication it provides pam_authenticate() and
         pam_setcred() hooks.
 
 James Anderson <james@anderson.us>, 17. June 1999
+Spelling corrections applied by blondie, 22 Sep 2003
Rather than identifying a line as "changed", the unified diff annotates that the original version
should be deleted, and the new version added.
Side by side diff (diff -y)
The previous three formats were meant to be easy to read by some other utility, such as the ed editor
or the patch utility. In contrast, the "side by side" format is intended to be read by humans. As the
name implies, the two versions of the file are displayed side by side, with annotations in the middle
that help identify changes. The following example requests a side by side diff using the -y command
line switch, and further qualifies that the output should be formatted to 80 columns with -W80.
[blondie@station blondie]$ diff -y -W80 README.pam_ftp.orig README.pam_ftp

This is the README for pam_ftp          This is the README for pam_ftp
------------------------------          ------------------------------

This module is an authentication modu   This module is an authentication modu
authentication.                         authentication.

Recognized arguments:                   Recognized arguments:

        "debug"         print debug m           "debug"         print debug m
        "users="        comma separat           "users="        comma separat
                        could login o |                         could login o
        "ignore"        allow invalid |         "ignore"        allow invalid

Options for:                            Options for:
auth:   for authentication it provide   auth:   for authentication it provide
        pam_setcred() hooks.                    pam_setcred() hooks.

James Anderson <james@anderson.us>, 1   James Anderson <james@anderson.us>, 1
                                      > Spelling corrections applied by blond
While the output would be more effective using a wide terminal, it does provide an intuitive feel for
the differences between the two files.
Quiet diff (diff -q)
The quiet diff merely reports if two files differ, not the nature of the differences.
[blondie@station blondie]$ diff -q README.pam_ftp.orig README.pam_ftp
Files README.pam_ftp.orig and README.pam_ftp differ
if-then-else Macro diff (diff -D tag )

This format generates differences using a syntax recognized by the cpp pre-processor. It allows
either the original version or the new version to be included by defining the specified tag . While
beyond the scope of this course, it is included for the benefit of those familiar with the cpp C
preprocessor.
Other, less commonly used output formats exist as well. Which format is the right one? The answer
depends on the preferences of the generator of the "diff", or the expectations of whoever might be
receiving the "diff". The diff command is often used in the open source community to communicate
suggestions about exact changes to the source code of some program, in order to fix a bug or add a
feature. In this context, the unified diff format is almost always preferred.
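To make the workflow concrete, the following sketch records a spelling fix as a unified diff and then, if the patch utility happens to be installed, applies that diff to a second copy of the original. The file names and contents are hypothetical, and the patch step is skipped on systems where the command is not available.

```shell
# Hypothetical original and corrected files in a throwaway directory.
workdir=$(mktemp -d)
printf 'email adress\nemail adresses\n' > "$workdir/README.orig"
printf 'email address\nemail addresses\n' > "$workdir/README"

# Record the changes in unified format (diff exits 1 because the files differ).
(cd "$workdir" && diff -u README.orig README > spelling.diff) || true
hunk_header=$(grep '^@@' "$workdir/spelling.diff")

# If patch is installed, bring a second copy of the original up to date
# and confirm that it now matches the edited file.
patched=skipped
if command -v patch >/dev/null 2>&1; then
    patched=no
    cp "$workdir/README.orig" "$workdir/copy"
    patch "$workdir/copy" < "$workdir/spelling.diff" >/dev/null
    cmp -s "$workdir/copy" "$workdir/README" && patched=yes
fi

rm -r "$workdir"
printf 'hunk header: %s\npatched copy matches: %s\n' "$hunk_header" "$patched"
```

The hunk header line summarizes which line ranges of the old and new files the stanza covers, in the same way as the @@ -8,11 +8,12 @@ line in blondie's unified diff.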
The following table summarizes some of the various command line switches which can be used to
specify output format for the diff command.
Table 6-1. Command Line Switches for Specifying diff Output Format
Switch                    Effect
-c                        Generate the context sensitive format.
-C N, --context[=N]       Generate the context sensitive format, using N lines of context, if specified.
-u                        Generate the unified format.
-U N, --unified[=N]       Generate the unified format, using N lines of context, if specified.
-N                        Another way of specifying N lines of context, where N is a number. Only used
                          with -c or -u.
-y, --side-by-side        Generate the side by side format.
-W N, --width=N           Use N columns when generating side by side format.
--left-column             Print only the left column of lines common to both files when using the side
                          by side format.
-q, --brief               Only report if files differ, not the details of the difference.
How diff Interprets Arguments

The diff command expects to be called with two arguments, a from-file and a to-file (or, in other words,
an oldfile and a newfile). The output of the diff command describes what must be done to the from-file to
create the to-file.
If one of the filenames refers to a regular file, and the other a directory, the diff command will look for a
file of the same name in the specified directory. If both are directories, the diff command will compare
files in both directories, but will not recurse to subdirectories (unless the -r switch is specified, see
below). Additionally, the special file name - will cause the diff command to read from standard in
instead of a regular file.
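The two special cases described above, a file compared against a directory and - standing in for standard in, can be sketched as follows. The directory layout and file contents are assumptions made for the example.

```shell
# A made-up file and a backup directory holding an older copy of it.
workdir=$(mktemp -d)
mkdir "$workdir/backup"
printf 'alpha\nbeta\n'  > "$workdir/notes.txt"
printf 'alpha\ngamma\n' > "$workdir/backup/notes.txt"

# File versus directory: diff compares notes.txt with backup/notes.txt.
file_vs_dir=$(cd "$workdir" && diff notes.txt backup) || true

# "-" as the from-file: compare standard in against notes.txt.
stdin_status=0
printf 'alpha\nbeta\n' | diff - "$workdir/notes.txt" >/dev/null || stdin_status=$?

rm -r "$workdir"
printf '%s\n' "$file_vs_dir"
printf 'status when comparing stdin to an identical file: %s\n' "$stdin_status"
```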
Customizing diff to be Less Picky

If not told otherwise, the diff command will diligently track all differences between two files. Several
command line switches can be used to cause the diff command to have a more relaxed behavior. The
following table summarizes the relevant command line switches.
Table 6-2. Command Line Switches that Specify diff's Pickiness
Switch                              Effect
-b, --ignore-space-change           Ignore changes in the amount of white space when comparing lines.
-w, --ignore-all-space              Ignore all white space when comparing lines.
-B, --ignore-blank-lines            Ignore changes that insert or delete blank lines.
-i, --ignore-case                   Ignore changes in case (i.e., consider upper and lower case
                                    characters equivalent).
-I regex,                           Ignore changes that insert or delete lines which match the
--ignore-matching-lines=regex       mandatory argument regex.
As an example, consider the following two files.

[blondie@station blondie]$ cat cal.txt
   September 2003
Su Mo Tu We Th Fr Sa
    1  2  3  4  5  6
 7  8  9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30

[blondie@station blondie]$ cat cal_edited.txt
====================
==== This Month ====
====================

   September 2003
Su Mo Tu We Th Fr Sa
    1  2  3  4  5  6
 7  8  9 10 11 12 13
14 15 16 17 18 19 20
21 22 23 24 25 26 27
28 29 30

The file cal_edited.txt differs in two respects. First, a four line header was added to the top.
Secondly, an extra (empty) line was added to the bottom. An "ordinary" diff recognizes all of these
changes.
[blondie@station blondie]$ diff cal.txt cal_edited.txt
0a1,4
> ====================
> ==== This Month ====
> ====================
>
9a14
>
With the -B command line switch, however, the diff command ignores the new, empty line at the bottom.
[blondie@station blondie]$ diff -B cal.txt cal_edited.txt
0a1,4
> ====================
> ==== This Month ====
> ====================
>
With the -I command line switch, the diff command can be told to also ignore any lines that begin with a
=.
[blondie@station blondie]$ diff -B -I "^=" cal.txt cal_edited.txt
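The -i and -b switches from the table can be explored in the same spirit. The sketch below uses two throwaway files (a.txt and b.txt, names invented here) and relies on diff's exit status: 0 when no differences are reported, 1 when some are.

```shell
tmp=$(mktemp -d)
printf 'Hello World\n'   > "$tmp/a.txt"
printf 'hello   world\n' > "$tmp/b.txt"

# A picky comparison: case and spacing both count as differences.
strict=0
diff "$tmp/a.txt" "$tmp/b.txt" > /dev/null || strict=$?

# A relaxed comparison: ignore case (-i) and the amount of white space (-b).
relaxed=0
diff -i -b "$tmp/a.txt" "$tmp/b.txt" > /dev/null || relaxed=$?

echo "strict=$strict relaxed=$relaxed"
rm -rf "$tmp"
```

The strict comparison reports a difference (status 1), while the relaxed one reports none (status 0).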
Recursive diffs
The diff command can act recursively, descending two similar directory trees and annotating any
differences. The following table lists command line switches relevant to diff's recursive behavior.
Table 6-3. Command Line Switches for Using diff Recursively
Switch                     Effect
-r, --recursive            When comparing directories, recurse through subdirectories as well.
-x, --exclude=pattern      When comparing directories recursively, omit filenames that match pattern.
-X, --exclude-from=file    When comparing directories recursively, omit filenames that match patterns specified in file.

As an example, blondie is examining two versions of a project called vreader. The project involves
Python scripts which convert calendaring information from the vcal format to an XML format. She has
downloaded two versions of the project, vreader-1.2.tar.gz and vreader-1.3.tar.gz, and
expanded each of the archives into her local directory.
[blondie@station blondie]$ ls
vreader-1.2
vreader-1.2.tar.gz
vreader-1.3
vreader-1.3.tar.gz
The directories vreader-1.2 and vreader-1.3 have the following structure.

vreader-1.2/
|-- addressbook.vcard
|-- calendar.ics
|-- conv_db.py
|-- conv_db.pyc
|-- datebook.xml
|-- templates/
|   `-- datebook.xml
`-- vreader.py
vreader-1.3/
|-- calendar.ics
|-- conv_db.py
|-- conv_db.pyc
|-- datebook.out.xml
|-- datebook.xml
|-- templates/
|   `-- datebook.xml
`-- vreader.py
In order to summarize the differences between the two versions, she runs a recursive diff on the two
directories.
[blondie@station blondie]$ diff -r vreader-1.[23]
Binary files vreader-1.2/conv_db.pyc and vreader-1.3/conv_db.pyc differ

Only in vreader-1.3: datebook.out.xml
diff -r vreader-1.2/templates/datebook.xml vreader-1.3/templates/datebook.xml
15a16
> <event description="Linux users 331 dabney" categories="" uid="-1010079065" start="8732466
diff -r vreader-1.2/vreader.py vreader-1.3/vreader.py
6a7
> time_offset = 0 # in hours
348c349
< return utime
---
> return utime + time_offset*3600
The diff command recurses through the two directories, and notes the following differences.
1. The two binary files vreader-1.2/conv_db.pyc and vreader-1.3/conv_db.pyc differ.
Because they are not text files, however, the diff command does not try to annotate the differences.
2. The complementary file to vreader-1.3/datebook.out.xml is not found in the vreader-1.2
directory.
3. The files vreader-1.2/templates/datebook.xml and
vreader-1.3/templates/datebook.xml differ, and diff annotates the changes.
4. The files vreader-1.2/vreader.py and vreader-1.3/vreader.py differ, and diff annotates
the changes.
Often, when comparing more complicated directory trees, there are files that are expected to change, and
files that are not. For example, the file conv_db.pyc is compiled Python code automatically generated
from the text Python script file conv_db.py. Because blondie is not interested in differences between
the compiled versions of the file, she uses the -x command line switch to exclude the file from her
comparisons. Likewise, she is not interested in the files ending .xml, so she specifies them with an
additional -x command line switch.
[blondie@station blondie]$ diff -r -x "*.pyc" -x "*.xml" vreader-1.[23]
diff -r -x *.pyc -x *.xml vreader-1.2/vreader.py vreader-1.3/vreader.py

6a7
> time_offset = 0 # in hours
348c349
< return utime
---
> return utime + time_offset*3600
Now the output of the diff command is limited to only the file vreader-1.2/vreader.py and its
complement in vreader-1.3.
As an alternative to listing file patterns to exclude on the command line, they may be collected in a
simple text file which is specified instead, using the -X command line switch. In the following, blondie
has created and uses such a file.
[blondie@station blondie]$ cat diff_excludes.txt
*.pyc
*.xml
*.py
[blondie@station blondie]$ diff -r -X diff_excludes.txt vreader-1.[23]
Because blondie included *.py in her list of file patterns to exclude, the diff command is left with
nothing to say.
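The exclusion switches can be rehearsed on a pair of small, made-up directory trees before being trusted on a real project. In this sketch the directories v1 and v2, the files inside them, and excludes.txt are all hypothetical.

```shell
tmp=$(mktemp -d)
mkdir -p "$tmp/v1/sub" "$tmp/v2/sub"
printf 'old\n' > "$tmp/v1/sub/code.py"
printf 'new\n' > "$tmp/v2/sub/code.py"
printf 'old\n' > "$tmp/v1/cache.pyc"
printf 'new\n' > "$tmp/v2/cache.pyc"
printf '*.pyc\n*.py\n' > "$tmp/excludes.txt"

# Recursive comparison of everything.
everything=$(diff -r "$tmp/v1" "$tmp/v2" || true)

# Quote the pattern so the shell does not expand it before diff sees it.
no_pyc=$(diff -r -x '*.pyc' "$tmp/v1" "$tmp/v2" || true)

# Patterns collected in a file, one per line.
nothing=$(diff -r -X "$tmp/excludes.txt" "$tmp/v1" "$tmp/v2" || true)

printf '%s\n' "$no_pyc"
rm -rf "$tmp"
```

The first comparison mentions both changed files, the second only sub/code.py, and the third has nothing to say.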
Examples
Example 1. Using diff to Examine New Configuration Files
After updating her sendmail RPM package, blondie notices that she has a new configuration file in her
/etc/mail directory, sendmail.cf.rpmnew. She would like to see how this file compares to her
already existing configuration file, /etc/mail/sendmail.cf. She uses diff to summarize the
differences.
[blondie@station blondie]$ diff /etc/mail/sendmail.cf /etc/mail/sendmail.cf.rpmnew
19,21c19,21
< ##### built by root@station.example.com on Tue Apr 1 15:09:38 EST 2003
< ##### in /etc/mail
< ##### using /usr/share/sendmail-cf/ as configuration include directory
---
> ##### built by bhcompile@daffy.perf.redhat.com on Wed Sep 17 14:45:22 EDT 2003
> ##### in /usr/src/build/308253-i386/BUILD/sendmail-8.12.8/cf/cf
> ##### using ../ as configuration include directory
40d39
<
101c100
< DSnimbus.example.com
---
> DS
She is satisfied that the new version of the configuration file differs only by some comment lines, and the
lack of a local configuration she had added to her version of the file.
Example 2. Using diff to Examine Recent Changes to /etc/passwd
The system administrator on a machine has noticed that the useradd utility creates a backup of the
/etc/passwd file whenever it makes a change to it, named /etc/passwd-. As the user root, she would
like to view the most recent change to the /etc/passwd file.
[root@station root]# diff /etc/passwd- /etc/passwd
68a69
> desktop:x:80:80:desktop:/var/lib/menu/kde:/sbin/nologin
Apparently, a new system user has recently been added, probably as a result of adding new software
using an RPM package file.
Example 3. Creating a Patch

After downloading the vreader-1.3.tar.gz archive and expanding its contents, blondie decides that
she could improve on the project and would like to make changes to the file vreader.py. She first
makes a copy of the "pristine" source (the version she unpacked from the distributed archive), and then
edits her copy of the file.
[blondie@station blondie]$ tar xzf vreader-1.3.tar.gz
[blondie@station blondie]$ cp -a vreader-1.3 vreader-1.3.local
[blondie@station blondie]$ nano vreader-1.3.local/vreader.py
After editing her copy of the file, she would like to submit her changes to the person who coordinates
changes to the vreader project. In the open source community, this person is usually referred to as the
maintainer of the project. Rather than sending a full copy of her version, she records the differences
between her version and the original in a file called vreader-1.3.blondie.patch.
[blondie@station blondie]$ diff -ru vreader-1.3 vreader-1.3.local > vreader-1.3.blondie.patch
She now emails only the patch file to the project maintainer, who can easily use a command called patch
to apply the changes to the pristine version.
[blondie@station blondie]$ mail -s "my changes" maintainer@patch.org < vreader-1.3.blondie.patch
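The round trip from contributor to maintainer might look something like the following. This is a sketch under stated assumptions, not a record of what the vreader maintainer actually does: the proj directory and settings.py file are invented, and the patch step only runs if the patch command happens to be installed.

```shell
tmp=$(mktemp -d)
mkdir "$tmp/proj"
printf 'time_offset = 0\n' > "$tmp/proj/settings.py"

# Contributor: copy the pristine tree, edit the copy, record the changes.
cp -a "$tmp/proj" "$tmp/proj.local"
printf 'time_offset = 1\n' > "$tmp/proj.local/settings.py"
(cd "$tmp" && diff -ru proj proj.local > changes.patch || true)

# Maintainer: apply the changes inside the pristine tree.
# -p1 strips the leading directory name from the paths in the patch.
if command -v patch > /dev/null; then
    (cd "$tmp/proj" && patch -p1 < ../changes.patch > /dev/null)
fi

patch_body=$(cat "$tmp/changes.patch")
patched=$(cat "$tmp/proj/settings.py")
printf '%s\n' "$patch_body"
rm -rf "$tmp"
```

Running diff from the parent directory with relative names keeps absolute paths out of the patch file, which makes it easier to apply on another machine.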
Online Exercises
Lab Exercise
Objective: Use the diff command to track changes to files.
Specification
1. Use the diff command to annotate the differences between the files
/usr/share/doc/pinfo-0*/COPYING and /usr/share/doc/mtools-3*/COPYING, using the
context sensitive format. Record the output in the newly created file ~/COPYING.diff. When
specifying the filenames on the command line, list the pinfo file first, and use an absolute reference
for both.
2. Create a local copy of the directory /usr/share/gedit-2, using the following command (in your
home directory).
[student@station student]$ cp -a /usr/share/gedit-2 .
To your local copy of the gedit-2 directory, make the following changes.
a. Remove any two files.
b. Create an arbitrarily named file somewhere underneath the gedit-2 directory, with arbitrary
content.
c. Using a text editor, delete three lines from any file in the gedit-2/taglist directory.
Once you have finished, generate a recursive "diff" between /usr/share/gedit-2 and your copy,
gedit-2. Record the output in the newly created file ~/gedit.diff. When specifying the
directories on the command line, specify the original copy first, and use an absolute reference for
both. Do not modify the contents of your gedit-2 unless you also reconstruct your file
~/gedit.diff.
Deliverables
1. The file ~/COPYING.diff, which contains a context sensitive "diff" of the files
/usr/share/doc/pinfo*/COPYING and /usr/share/doc/mtools*/COPYING, where the pinfo version
of the file is used as the original, and each file is specified using an absolute reference.
2. The file ~/gedit.diff, which contains a recursive "diff" of the directories /usr/share/gedit-2 and
~/gedit-2. Both directories should be specified using absolute references, and the system directory should be
used as the original.
Questions
1. Which of the following command lines would generate a "diff" of the two files using the context sensitive format?
( ) a. diff -y origfile newfile
( ) b. diff -k origfile newfile
( ) c. diff -c origfile newfile
( ) d. diff --context-sensitive origfile newfile
2. Which of the following command lines would generate a "diff" of the two files using the unified format?
( ) a. diff -u origfile newfile
( ) b. diff -U2 origfile newfile
( ) c. diff --unified origfile newfile
( ) e. A and B only
3. Which of the following command lines would generate a "diff" of the two files using the side by side format?
( ) a. diff -s origfile newfile
( ) b. diff -y origfile newfile
( ) c. diff --side origfile newfile
( ) e. A and C only
Use the following two directory structures to answer the next 2 questions.
vreader-1.2/
|-- calendar.ics
|-- conv_db.py
|-- conv_db.pyc
|-- datebook.xml
|-- templates/
|   `-- datebook.xml
`-- vreader.py
vreader-1.3/
|-- calendar.ics
|-- conv_db.py
|-- conv_db.pyc
|-- datebook.out.xml
|-- datebook.xml
|-- templates/
|   `-- datebook.xml
`-- vreader.py
2 directories, 15 files
4. Which of the following command lines would compare the files vreader-1.2/datebook.xml and
vreader-1.3/datebook.xml?
( ) a. diff -u vreader-1.2/datebook.xml vreader-1.3

( ) b. diff -u vreader-1.2 vreader-1.3/datebook.xml
( ) c. diff -u datebook.xml vreader-1.2 vreader-1.3
( ) e. A and B only
5. Which of the following would include a summary of the differences between the files
vreader-1.2/templates/datebook.xml and vreader-1.3/templates/datebook.xml?
( ) a. diff vreader-1.2 vreader-1.3
( ) b. diff -r vreader-1.2 vreader-1.3
( ) c. diff -r vreader-1.2/templates vreader-1.3
( ) d. diff -r -x "*.xml" vreader-1.2 vreader-1.3
Use the output of the following command to answer the next 2 questions.
[student@station student]$ cal > cal.txt
[student@station student]$ cp cal.txt cal2.txt
[student@station student]$ cp cal.txt cal3.txt
[student@station student]$ echo "" >> cal2.txt
[student@station student]$ echo "hello world" >> cal3.txt
6. Which of the following command lines would report no differences between the files cal.txt and cal2.txt?
( ) a. diff --no-white-space cal.txt cal2.txt
( ) b. diff -B cal.txt cal2.txt
( ) c. diff -w cal.txt cal2.txt
( ) d. diff -I cal.txt cal2.txt
7. Which of the following command lines would report no differences between the files cal.txt and cal3.txt?
( ) a. diff -I "^world" cal.txt cal3.txt
( ) b. diff --ignore-regex "world$" cal.txt cal3.txt
( ) c. diff -i "world" cal.txt cal3.txt
( ) d. diff -r cal.txt cal3.txt
Use the output of the following command to answer the next 2 questions.
[student@station student]$ diff -u /etc/sysconfig/tux tux
--- /etc/sysconfig/tux 2003-01-28 20:11:34.000000000 -0500

+++ tux 2003-09-08 05:56:09.000000000 -0400
@@ -3,7 +3,6 @@
# TUXTHREADS sets the number of kernel threads (and associated daemon
# threads) that will be used. $TUXTHREADS defaults to the number of
# CPUs on the system.
-# TUXTHREADS=1
# DOCROOT is the document root; it works the same way as other web
# servers such as apache. /var/www/html/ is the default.
@@ -22,7 +21,7 @@
# are opened as user/group root, which means that the _init() function,
# if it exists, is run as root. This feature is only designed to help
# protect from programming mistakes; it is NOT really a security mechanism.
-# DAEMON_UID=nobody
+DAEMON_UID=tux
# DAEMON_GID=nobody
# CGIs can be started in a chroot environment by default.
8. This is an example of which diff output format?

( ) a. The "standard" format
( ) b. The unified format
( ) c. The context sensitive format
( ) d. The side by side format
( ) e. Not enough information is provided
9. Which of the following best describes the differences between the files /etc/sysconfig/tux and tux?
( ) a. Two lines have been added to the file tux.
( ) b. One line has been removed from and one line changed in the file /etc/sysconfig/tux.
( ) c. One line has been removed from and one line changed in the file tux.
( ) d. Two lines have been added to the file /etc/sysconfig/tux.
( ) e. Not enough information has been provided
10. Which of the following command lines would not report differences in capitalization between the two files?
( ) a. diff -i origfile newfile
( ) b. diff --ignore-capitalization origfile newfile
( ) c. diff -I origfile newfile
( ) d. diff -I [[:upper:]] origfile newfile
Chapter 7. Translating Text: tr

Key Concepts
The tr command performs translations on data read from standard in.
In its most basic form, the tr command performs byte for byte substitutions.
Using the -d command line switch, the tr command will delete specified characters from a stream.
Using the -s command line switch, the tr command will squeeze a series of repeated characters in a
stream into a single instance of the character.
Discussion
The tr Command
The tr command is a versatile utility that performs character translations on streams. Translating can
mean replacing one character for another, deleting characters, or "squeezing" characters (collapsing
repeated sequences of a character into one). Each of these uses will be examined in the following
sections.
Unlike all of the previous commands in this section, the tr command does not expect filenames as
arguments. Instead, the tr command operates exclusively on the standard in stream, reserving command
line arguments to specify transformations.
The following table specifies the various ways of invoking the tr command.
Table 7-1. Invocation Syntax for the tr Command
Syntax              Effect
tr SET1 SET2        Substitute the characters specified in SET2 for the complementary characters specified in SET1.
tr -d SET           Delete all characters specified in SET.
tr -s SET           Squeeze all characters specified in SET.
tr -s SET1 SET2     First substitute all characters found in SET2 for the complementary characters found in SET1, then squeeze all characters found in SET2.
tr -ds SET1 SET2    First delete all characters found in SET1, then squeeze all characters found in SET2.

Character Specification
As the above table makes clear, the tr command makes extensive use of characters defined in sets. The
syntax for defining a range of characters is based upon the range specifier found in regular expressions.
The following expressions may be used when specifying characters.
Table 7-2. Specifying Characters for the tr Command
Syntax       Character(s)
literal      Most characters match a literal translation of themselves.
\n           The new line character.
\r           The return character.
\t           The (horizontal) tab character.
\\           The \ character.
A-Z          The range of characters bounded by the specified characters. Deprecated, because how the ordering of the range is determined is dependent on the character set used to encode the data.
[:alnum:]    All letters and digits.
[:alpha:]    All letters.
[:blank:]    All horizontal white space.
[:digit:]    All digits.
[:lower:]    All lower case characters.
[:print:]    All printable characters.
[:punct:]    All punctuation characters.
[:space:]    All horizontal or vertical white space.
[:upper:]    All upper case characters.

The table is not meant to be a complete list. Consult the tr(1) man page, or tr --help, for more
information.
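As a quick illustration of the escape sequences and character classes tabled above, consider the following sketch. The input strings are made up, and the sets are protected with single quotes so that the shell passes them to tr untouched.

```shell
# An escape sequence in SET2: turn every colon into a tab.
tabbed=$(printf 'root:x:0:0\n' | tr ':' '\t')

# A character class: delete every digit.
no_digits=$(printf 'room 101, floor 3\n' | tr -d '[:digit:]')

printf '%s\n%s\n' "$tabbed" "$no_digits"
```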
Using tr to Translate Characters

Unless instructed otherwise (using command line switches), the tr command expects to be called with
two arguments, each of which specifies a set of characters. For each of the characters specified in the
first set, the tr command will substitute the character found in the same position in the second set.
Consider the following trivial example.
[madonna@rosemont madonna]$ echo "abcdefghi" | tr fed xyz
abczyxghi
Notice that in the output, the character d is replaced with the character z, e is replaced with the
character y, and f is replaced with the character x. The ordering of the sets is important. The third
letter from the first set is replaced with the third letter from the second set.
What happens if the two sets have unequal lengths? If the second set is shorter, it is extended to the
length of the first set by repeating its last character.
[madonna@rosemont madonna]$ echo "abcdefghi" | tr fed xy
abcyyxghi
A classic example of the tr command is to translate text into all upper case or all lower case letters. The
"old school" syntax for such a translation would use character ranges.
[madonna@rosemont madonna]$ cat /etc/hosts
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1       localhost.localdomain   localhost rha-server
192.168.0.254   rosemont.example.com    rosemont
192.168.0.51    hedwig.example.com      hedwig h
192.168.129.201 z
[madonna@rosemont madonna]$ tr a-z A-Z < /etc/hosts
# DO NOT REMOVE THE FOLLOWING LINE, OR VARIOUS PROGRAMS
# THAT REQUIRE NETWORK FUNCTIONALITY WILL FAIL.
127.0.0.1       LOCALHOST.LOCALDOMAIN   LOCALHOST RHA-SERVER
192.168.0.254   ROSEMONT.EXAMPLE.COM    ROSEMONT
192.168.0.51    HEDWIG.EXAMPLE.COM      HEDWIG H
192.168.129.201 Z
As mentioned in the Lesson on regular expressions, however, range specifications can produce odd
results when various character sets are considered. The "new school" approach is to use character classes.
[madonna@rosemont madonna]$ tr [:lower:] [:upper:] < /etc/hosts
# DO NOT REMOVE THE FOLLOWING LINE, OR VARIOUS PROGRAMS
# THAT REQUIRE NETWORK FUNCTIONALITY WILL FAIL.
127.0.0.1       LOCALHOST.LOCALDOMAIN   LOCALHOST RHA-SERVER
192.168.0.254   ROSEMONT.EXAMPLE.COM    ROSEMONT
192.168.0.51    HEDWIG.EXAMPLE.COM      HEDWIG H
192.168.129.201 Z
Recalling that the ordering of the character ranges is important to the tr command, the character classes
would need to generate consistently ordered ranges. Only the [:lower:] and [:upper:] character classes
are guaranteed to do so, implying that they are the only classes appropriate for use when using tr for
character translation.
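A small sketch of case folding in both directions follows. The host name is invented for the example, and the character classes are quoted to keep the shell from interpreting the square brackets.

```shell
host='Rosemont.Example.COM'

# Fold to lower case, then to upper case, using the paired classes.
lower=$(printf '%s\n' "$host" | tr '[:upper:]' '[:lower:]')
upper=$(printf '%s\n' "$host" | tr '[:lower:]' '[:upper:]')

echo "$lower $upper"
```

Characters that belong to neither class, such as the periods, pass through unchanged.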
Using tr to Delete Characters

When invoked with the -d command line switch, the tr command adopts a radically different behavior.
The tr command now expects a single argument (as opposed to two, above), which is again a set of
characters. The tr command will now filter the standard in stream, deleting each of the specified
characters before writing the stream to standard out.
Consider the following couple of examples.
[madonna@station madonna]$ echo abcdefghi | tr -d def
abcghi
[madonna@station madonna]$ echo hark, I hear an elephant! | tr -d [:upper:][:punct:]
hark  hear an elephant
In the first case, the specified literal characters d, e, and f were deleted. In the second case, all
characters that belonged to either the [:punct:] or [:upper:] character classes were deleted.
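Deleting by character class is handy for stripping formatting from data. As a minimal sketch (the phone number is fictitious), the following removes punctuation and horizontal white space, leaving only the digits.

```shell
# Two classes in one quoted SET: punctuation and blanks are both deleted.
digits=$(printf '(919) 555-0100\n' | tr -d '[:punct:][:blank:]')
echo "$digits"
```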
Using tr to Squeeze Characters

By using the -s command line switch, the tr command can be used to squeeze a continuous series of
repeated characters into a single character. If called with one argument, the tr command will simply
squeeze the specified set of characters, as in the following example.
[madonna@station madonna]$ echo "aaabbbcccdddeeefffggg" | tr -s bcf
aaabcdddeeefggg
If called with the -s command line switch and two arguments, the tr command will perform substitutions
(as if the -s had not been specified), but then squeeze any characters from the second set.
[madonna@station madonna]$ echo "aaabbbcccdddeeefffggg" | tr -s bcf xye
aaaxydddeggg
Notice that this is essentially the same as performing the two operations separately.
[madonna@station madonna]$ echo "aaabbbcccdddeeefffggg" | tr bcf xye
aaaxxxyyydddeeeeeeggg
[madonna@station madonna]$ echo "aaabbbcccdddeeefffggg" | tr bcf xye | tr -s xye
aaaxydddeggg
Lastly, the tr command can be called with both the -s and -d command line switches. In this case, the tr
command expects two arguments. The tr command will first delete the first set of characters, and then
squeeze the second set.
[madonna@station madonna]$ echo "aaabbbcccaaadddeeefffggg" | tr -ds bcf ae
adddeggg
Note the order of operations carefully. This command is essentially the same as a delete (tr -d) followed
by a squeeze (tr -s).
[madonna@station madonna]$ echo "aaabbbcccaaadddeeefffggg" | tr -d bcf
aaaaaadddeeeggg
[madonna@station madonna]$ echo "aaabbbcccaaadddeeefffggg" | tr -d bcf | tr -s ae
adddeggg
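Squeezing is more often applied to white space than to letters. The following sketch, on made-up input, collapses runs of spaces and runs of new lines in a single pass.

```shell
# The quoted SET holds two characters: a space and a new line.
tidy=$(printf 'a   b\n\n\nc\n' | tr -s ' \n')
printf '%s\n' "$tidy"
```

The three spaces become one, and the two empty lines disappear because consecutive new line characters are squeezed into one.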
Complementing Sets
Other than -s and -d, there are only two command line switches which modify tr's behavior, tabled
below.
Table 7-3. Command Line Switches for the tr Command
Switch                 Effect
-c, --complement       Complement SET1 before operating (i.e., use the set of characters excluded by SET1).
-t, --truncate-set1    Truncate the length of SET1 to that of SET2 before operating.

As a quick example of the -c command line switch, the following deletes every character that is not a
vowel or a white space character from standard in.
[madonna@station madonna]$ echo aaabbbcccdddeee | tr -cd aeiouAEIOU[:space:]
aaaeee
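The -c switch combines usefully with -s as well. A commonly seen idiom, sketched here on an invented sentence, turns every run of characters that are not letters or digits into a single new line, leaving one word per line.

```shell
# Complement of [:alnum:] is replaced by a new line, then squeezed.
words=$(printf 'It was the best of times,\n' | tr -cs '[:alnum:]' '\n')
count=$(printf '%s\n' "$words" | wc -l)

printf '%s\n' "$words"
```

Here the six words of the sentence end up on six separate lines.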
One Final Caution: Avoid File Globbing!

One final note before we leave our a's and e's and head for more practical examples.
In some of the previous examples, madonna was careful to protect expressions such as [:punct:] with
single quotes, and sometimes she was not. When she didn't, she got lucky. Consider the following
sequence.
[madonna@station madonna]$ echo hark, I hear an elephant! | tr -d [:punct:]
hark I hear an elephant

[madonna@station madonna]$ touch n
[madonna@station madonna]$ echo hark, I hear an elephant! | tr -d [:punct:]
hark, I hear a elephat!
Why did madonna get two very different results from the same command line? If you don't know the
answer, and even if you do, you should protect arguments to the tr command with quotes.
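The effect can be reproduced without cluttering a home directory. The sketch below works in a temporary directory (an assumption made for this illustration) and compares the unquoted and quoted forms once a file named n exists.

```shell
demo=$(mktemp -d)

# Unquoted: with a file named n present, the shell expands [:punct:]
# to n before tr ever runs, so tr deletes the letter n instead.
unquoted=$(cd "$demo" && touch n &&
           echo 'hark, I hear an elephant!' | tr -d [:punct:])

# Quoted: the shell leaves the argument alone and tr sees the class.
quoted=$(cd "$demo" &&
         echo 'hark, I hear an elephant!' | tr -d '[:punct:]')

printf '%s\n%s\n' "$unquoted" "$quoted"
rm -rf "$demo"
```

To the shell, an unquoted [:punct:] is a file glob matching any single-character filename drawn from the characters between the brackets, which is why the file n changes the outcome.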
Examples
Example 1. Using tr to Clean Up the df Command
Recall a few Lessons ago, when we were discussing the cut command, and its ability to extract fields of
text from a stream. We tried to use the cut command to extract the first and fifth fields from the df
command's output, specifying a space as the field delimiter.
[madonna@station madonna]$ df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/hda3              5131108   4499548    370908  93% /
/dev/hda1               124427     26268     91735  23% /boot
none                    127616         0    127616   0% /dev/shm
[madonna@station madonna]$ df | cut -d" " -f1,5
Filesystem
/dev/hda3
/dev/hda1
none
We previously identified the problem with this approach. The cut command does not treat a series
of spaces as a single separator between two fields. Instead, it treats every space as a delimiter, so a run
of spaces yields a series of empty fields. With her newfound knowledge of the tr command, madonna
knows how to solve the problem.
She first uses the tr command to squeeze multiple spaces into a single space.
[madonna@station madonna]$ df | tr -s ' '
Filesystem 1K-blocks Used Available Use% Mounted on

/dev/hda3 5131108 4499556 370900 93% /
/dev/hda1 124427 26268 91735 23% /boot
none 127616 0 127616 0% /dev/shm
Now, she can use the cut command to easily extract the appropriate columns.
[madonna@station madonna]$ df | tr -s ' ' | cut -d" " -f1,5
Filesystem Use%
/dev/hda3 93%
/dev/hda1 23%
none 0%
Example 2. Using tr to Convert DOS Text Files to Unix

The user madonna has recently discovered Project Gutenberg, an online repository for texts which have
entered the public domain. 1 She has downloaded one of her favorite texts, A Tale of Two Cities, and
stored it in the file 2city12.txt.
Upon examining the file, she realizes that it uses the DOS convention for separating lines (a carriage
return/new line pair), as illustrated by the ^M$ combination when using the cat -A command.
[madonna@station madonna]$ head -5 2city12.txt | cat -A
The Project Gutenberg Etext of A Tale of Two Cities, by Dickens^M$

^M$
Please take a look at the important information in this header.^M$
We encourage you to keep this file on your own disk, keeping an^M$
electronic path open for the next readers. Do not remove this.^M$
She would prefer the text to use the Unix convention (a single new line character). She uses the tr
command to delete all instances of the carriage return character, storing the result into the file
2city12unix.txt.
[madonna@station madonna]$ tr -d '\r' < 2city12.txt > 2city12unix.txt
In order to confirm that the conversion happened appropriately, she performs a couple of checks. She first
examines the file with cat -A, and notes that the ^M characters have been removed.
[madonna@station madonna]$ head -5 2city12unix.txt | cat -A
The Project Gutenberg Etext of A Tale of Two Cities, by Dickens$

$
Please take a look at the important information in this header.$
We encourage you to keep this file on your own disk, keeping an$
electronic path open for the next readers. Do not remove this.$
Secondly, she performs a word count on both files, using the wc command.
[madonna@station madonna]$ wc 2city12*
 16364  137667  787603 2city12.txt
 16364  137667  771239 2city12unix.txt
 32728  275334 1558842 total
She notes that the difference in the number of characters (bytes) in the two files is the same as the
number of lines in the files (787603 - 771239 = 16364). This is appropriate if the tr command deleted
one character per line, as expected.
Finally, because she is no longer interested in keeping the DOS formatted version of the file, she renames
the file 2city12unix.txt to 2city12.txt.
[madonna@station madonna]$ mv 2city12unix.txt 2city12.txt
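The same sanity check can be scripted. In the sketch below, a two-line DOS-style file is generated on the fly (its contents are invented), and the number of bytes removed by tr is compared against the number of lines.

```shell
dosfile=$(mktemp)
printf 'line one\r\nline two\r\n' > "$dosfile"

before=$(wc -c < "$dosfile")
after=$(tr -d '\r' < "$dosfile" | wc -c)
lines=$(wc -l < "$dosfile")

# One carriage return per line should have been deleted.
removed=$((before - after))
echo "removed=$removed lines=$lines"
rm -f "$dosfile"
```

Note that the \r is quoted. Unquoted, the shell would strip the backslash and tr would delete every letter r instead.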
Example 3. Using tr to Count Word Frequencies
Good writing often requires that authors avoid overusing certain key words. The user madonna would
like to put Charles Dickens to the test. She first uses a text editor to extract the opening paragraph from
the text.
[madonna@station madonna]$ cat para1
It was the best of times, it was the worst of times,

it was the age of wisdom, it was the age of foolishness,
it was the epoch of belief, it was the epoch of incredulity,
it was the season of Light, it was the season of Darkness,
it was the spring of hope, it was the winter of despair,
we had everything before us, we had nothing before us,
we were all going direct to Heaven, we were all going direct
the other way--in short, the period was so far like the present
period, that some of its noisiest authorities insisted on its
being received, for good or for evil, in the superlative degree
of comparison only.
She would now like to generate a count of how often particular words are used. In order to use the uniq
-c command, she would like to rearrange the text so that the words appear one per line. She outlines the
following plan.
1. Delete all punctuation marks.
2. Convert all uppercase characters into lowercase, so that It and it are considered the same word.
3. Convert every space character into a new line character, and squeeze multiple new line characters into
one.
She begins implementing her plan one step at a time, so that she can observe the intermediate results.
[madonna@station madonna]$ tr -d [:punct:] < para1 | head -5
It was the best of times it was the worst of times
it was the age of wisdom it was the age of foolishness
it was the epoch of belief it was the epoch of incredulity
it was the season of Light it was the season of Darkness
it was the spring of hope it was the winter of despair
[madonna@station madonna]$ tr -d [:punct:] < para1 | tr [:upper:] [:lower:] | head -5
it was the best of times it was the worst of times
it was the age of wisdom it was the age of foolishness
it was the epoch of belief it was the epoch of incredulity
it was the season of light it was the season of darkness
it was the spring of hope it was the winter of despair
[madonna@station madonna]$ tr -d [:punct:] < para1 | tr [:upper:] [:lower:] |
tr -s ' ' '\n' | head -5
it
was
the
best
of
At this point, madonna is comfortable enough with sort and uniq to finish off the process.
[madonna@station madonna]$ tr -d [:punct:] < para1 | tr [:upper:] [:lower:] |
tr -s ' ' '\n' | sort | uniq -c | sort -rn | head -5
     14 the
     12 of
     11 was
     10 it
      4 we
Inspired by her progress, she next repeats the technique on the entire text. (The process took about 8
seconds on a 700MHz processor).
[madonna@station madonna]$ tr -d [:punct:] < 2city12.txt | tr [:upper:] [:lower:] |
tr -s ' ' '\n' | sort | uniq -c | sort -rn | head -5
   8082 the
   4967 and
   4061 of
   3517 to
   2952 a
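For readers without a copy of the novel at hand, the same pipeline can be tried on a short string. The sentence below is invented for illustration, every tr argument is quoted, and awk is used only to trim the padding that uniq -c adds.

```shell
text='The cat and the dog and the bird.'

top=$(printf '%s\n' "$text" |
      tr -d '[:punct:]' |            # 1. delete punctuation
      tr '[:upper:]' '[:lower:]' |   # 2. fold to lower case
      tr -s ' ' '\n' |               # 3. one word per line
      sort | uniq -c | sort -rn | head -1 |
      awk '{print $1, $2}')

echo "$top"
```

The most frequent word in the sample is "the", with three occurrences.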
Example 4. Rot13
In the early days of Usenet newsgroups, people adopted a convention for obscuring text called rot13.
Suppose you were posting a joke, and wanted to include the punch line, but did not want the punch line to
be immediately obvious. The punch line could be transformed by rotating each letter by 13 places, so that
a would become n, b would become o, and z would become m, as in the following example.
Q: Why did the chicken cross the road?
A: Gb trg gb gur bgure fvqr.
How would someone find the answer? By piping the text through a tr implemented rot13 translator.
[madonna@station madonna]$ echo "Gb trg gb gur bgure fvqr." | tr A-Za-z N-ZA-Mn-za-m
To get to the other side.
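Because the alphabet has 26 letters, rotating by 13 places twice returns the original text, so the same translation both hides and reveals the punch line. The sketch below wraps it in a shell function; the name rot13 is simply a convenient choice made here.

```shell
# The same tr invocation encodes and decodes.
rot13() {
    tr 'A-Za-z' 'N-ZA-Mn-za-m'
}

encoded=$(printf 'To get to the other side.\n' | rot13)
decoded=$(printf '%s\n' "$encoded" | rot13)

printf '%s\n%s\n' "$encoded" "$decoded"
```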
Online Exercises
Lab Exercise
Objective: Gain familiarity with the tr command.
Specification
1. The /etc/passwd file uses colons as a field delimiter. Create the file ~/passwd.tsv, which is a
copy of the /etc/passwd file converted to use tabs as field delimiters (i.e., every : is converted
to a tab).
2. Create the file ~/file_roller.converted, which is a copy of the file
/usr/share/file-roller/glade/file_roller.glade, with the following transformations.
a. Convert all tabs to spaces.
b. Convert double quotes (") to single quotes ('). (Do not use backticks (`).)
3. Create a file called ~/openssl.converted, which is a copy of the file
/usr/share/ssl/openssl.cnf, with the following transformations.
a. All comment lines (lines whose first non-whitespace character is a #) are removed.
b. All empty lines are removed.
c. All upper case letters are folded into lower case letters.
d. All digits are replaced with the underscore character (_).
Deliverables
1. The file ~/passwd.tsv, which is a copy of the /etc/passwd file with tabs substituted for colons.
2. The file ~/file_roller.converted, which is a copy of the file
/usr/share/file-roller/glade/file_roller.glade, with all tabs converted to spaces, and all double
quotes (") converted to single quotes (').
3. The file ~/openssl.converted, which is a copy of the file /usr/share/ssl/openssl.cnf, with all
comment lines (those whose first non-whitespace character is a #) removed, all empty lines removed, all
upper case letters converted to lower case, and all numeric digits replaced with the underscore character (_).
Questions
1. Which of the following command lines would convert all ASCII carriage return characters in the file text.mac
to ASCII new line characters?
( ) a. tr -c CR LF text.mac
( ) b. tr \r \n text.mac
( ) c. tr CR LF < text.mac
( ) d. tr \r \n < text.mac
2. Which of the following command lines would squeeze a series of repeated space characters in the file df.out
into a single space?
( ) a. tr --squeeze \s df.out
( ) b. tr -s " " df.out
( ) c. tr -ds " " < df.out
( ) d. tr -s SPC < df.out
3. Which of the following command lines would delete the trailing slash (/) from the string etc/?
( ) a. echo etc/ | tr -d [:punct:]
( ) b. tr -d / etc/
( ) c. tr -d [:letter:] < /etc/fstab
( ) d. echo etc/ | tr -cd /
In the following transcript, madonna is trying to save the ls(1) man page, and then edit it, only to find that it is full of
control characters and other mess.
[madonna@station madonna]$ man ls > ls.man.out
[madonna@station madonna]$ head ls.man.out | cat -A
LS(1)                    FSF                    LS(1)$
$
$
$
N^HNA^HAM^HME^HE$
ls - list directory contents$
$
S^HSY^HYN^HNO^HOP^HPS^HSI^HIS^HS$
l^Hls^Hs [_^HO_^HP_^HT_^HI_^HO_^HN]... [_^HF_^HI_^HL_^HE]...$
$
4. Which of the following commands would effectively remove all of the ^H control sequences from the file
ls.man.out?
( ) a. tr -d ^H < ls.man.out
( ) b. tr -cd [:lower:] ls.man.out
( ) c. tr -d [:punct:] ls.man.out
( ) d. tr -cd [:print:][:space:] < ls.man.out
After successfully removing the ^H control sequences, and storing the results in the file ls.man.noh, madonna is
still left with a mess to clean up.
[madonna@station madonna]$ tail +5 ls.man.noh | head
NNAAMMEE
ls - list directory contents
SSYYNNOOPPSSIISS
llss [_O_P_T_I_O_N]... [_F_I_L_E]...
DDEESSCCRRIIPPTTIIOONN
List information about the FILEs (the current directory by default).
Sort entries alphabetically if none of --ccffttuuSSUUXX nor ----ssoorrtt.
Mandatory arguments to long options are mandatory for short options
5. Which of the following command lines would remove all underscores from the file ls.man.noh?
( ) a. tr -d [:alnum:] < ls.man.noh
( ) b. tr -d _ < ls.man.noh
( ) c. tr -cd _ ls.man.noh
( ) d. tr -d _ ls.man.noh
After successfully removing the _ characters, and storing the results in ls.man.noh_, madonna is still frustrated by
a large number of "doubled" letters and hyphens (-).
[madonna@station madonna]$ tail +5 ls.man.noh_ | head
NNAAMMEE
SSYYNNOOPPSSIISS
llss [OPTION]... [FILE]...
DDEESSCCRRIIPPTTIIOONN
Sort entries alphabetically if none of --ccffttuuSSUUXX nor ----ssoorrtt.
Mandatory arguments to long options are mandatory for short options
6. Which of the following command lines would convert the doubled letters and hyphens into a single instance?
( ) a. tr -s [:alpha:] ls.man.noh_
( ) b. tr -s [:lower:-] ls.man.noh_
( ) c. tr -cs [:alpha:]- < ls.man.noh_
( ) d. tr -s [:alpha:-] < ls.man.noh_
( ) e. tr -s [:alpha:]- < ls.man.noh_
She successfully removes the doubled characters, and stores the results in the file ls.man.clean.
[madonna@station madonna]$ tail +5 ls.man.clean | head
NAME
SYNOPSIS
ls [OPTION]... [FILE]...
DESCRIPTION
Sort entries alphabetically if none of -cftuSUX nor -sort.
7. After applying the correct answer from the previous question, what potential inaccuracies are in the text?
( ) a. Any line originally containing an underscore has been deleted from the text.
( ) b. Any line containing repeated punctuation characters now has only a single instance of the character.
( ) c. Any word originally containing a sequence of repeated letters now has only a single instance of the letter.
( ) d. Any line originally containing a bracket ([ ]) or colon has been deleted from the text.
( ) e. There should be no distortions to the original text.
8. Which of the following command lines could madonna have used to both delete the underscore characters from
the file ls.man.noh, and squeeze doubled letters and hyphens, using the tr command only once?
( ) a. tr -sd [:alpha:-] _ < ls.man.noh
( ) b. tr -ds _ [:alpha:]- < ls.man.noh
( ) c. tr -csd _ [:alpha:]- < ls.man.noh
( ) d. tr -ds [:alpha:]- [:punct:] < ls.man.noh
The user madonna is now having trouble using the tr command to delete punctuation characters from a file. While
trying to diagnose her problem, she runs the following commands.
[madonna@station madonna]$ ls
2city12.txt
ls.man.clean
ls.man.noh
ls.man.noh_
ls.man.out
para1
[madonna@station madonna]$ echo "test, one, two, three" | tr -d [:punct:]
es, one, wo, hree
9. What is the best advice you can give her?

( ) a. Avoid using character classes. Specify the punctuation characters literally.
( ) b. When using character classes, the tr command must be invoked with the -c command line switch.
( ) c. When specifying character classes on the command line, always protect the bracketed expressions with
quotes.
( ) d. She is using the wrong syntax for specifying the character class. She should use [[:punct:]] instead.
( ) e. None of the above adequately explain her problem.
Use the following transcript to answer the next question.
[madonna@station madonna]$ echo aaabbbcccdddeee | ???????
ebbbe
10. Which of the following expressions could replace the expression ????????
( ) a. tr -s acde
( ) b. tr -s acd e
( ) c. tr -d acd e
( ) d. tr -d acd | tr -s e
Notes
1. Project Gutenberg is based at the web site http://gutenberg.net.
2. This example is inspired by a similar example found in the coreutils info page (info coreutils).
Chapter 8. Spell Checking: aspell

Key Concepts
The aspell -c command performs interactive spell checks on files.
The aspell -l command performs a non-interactive spell check on the standard in stream.
The aspell dump command can be used to view the system's master or a user's personal dictionary.
The commands aspell create personal and aspell merge personal can be used to create or append to a
user's personal dictionary from a word list.
Discussion
In the Red Hat Enterprise Linux distribution, the aspell utility is the primary utility for checking the
spelling of text files. In this Lesson, we learn how to use aspell to interactively spell check a file and
customize the spell checker with a personal dictionary.
Using aspell
When running aspell, the first argument (other than possible command line switches) is interpreted as a
command, telling aspell what to do. The following commands are supported by aspell.
Table 8-1. Aspell commands
Command                        Action
-c file, check file            Perform an interactive spell check on the file file.
-l, list                       Print a list of misspelled words found in the standard in stream.
config                         Dump the current aspell configuration to standard out.
dump master|personal|repl      Dump a copy of the master word list, personal word list, or personal replacement list, respectively.
create master|personal|repl    Create the master word list, personal word list, or personal replacement list, respectively, reading entries from standard in.
merge master|personal|repl     Merge entries read from standard in into the master word list, personal word list, or personal replacement list, respectively.
The following table lists some of the more common command line switches that are used with the aspell
command.
Table 8-2. Command Line Switches for the aspell Command
Switch                     Effect
-W, --ignore=N             Ignore words of N characters or fewer. (By default, only single letters are ignored.)
--ignore-case              Ignore case when performing word comparisons.
-p, --personal=filename    Use the word list filename for the personal word list.
-x, --dont-backup          Do not create a backup file when performing the spell check.
Performing an Interactive Spell Check

The user prince has composed the following message, which he plans to email to the user elvis.
[prince@station prince]$ cat toelvis
Hey Elvis!
I heard that you were about to take the lab test for the string
procesing workbook in Red Hat Academy. IIRC, its prety
straightforward, if you've been keeping up with the exercises.
LOL, Prince
Before sending the message, prince uses aspell -c to perform an interactive spell check.
[prince@station prince]$ aspell -c toelvis
Upon execution, the aspell command opens an interactive session, highlighting the first recognized
misspelled word.
Hey Elvis!
I heard that you were about to take the lab test for the string
procesing workbook in Red Hat Academy. IIRC, its prety
straightforward, if you've been keeping up with the exercises.
LOL, Prince
=====================================================================
1) processing                    6) preceding
2) precessing                    7) professing
3) precising                     8) promising
4) proceeding                    9) proposing
5) prosing
i) Ignore                        I) Ignore all
r) Replace                       R) Replace all
a) Add                           x) Exit
=====================================================================
?
At this point, prince has a "live" keyboard, meaning that single key presses will take effect without him
needing to use the return key. He may choose from the following options.
Use Suggested Replacement
The aspell command will do its best to suggest replacements for the misspelled word from its
library. If it has found a correct suggestion (as it has in this case), that suggestion can be selected
by simply pressing the numeric key associated with it.
Ignore the Word
When prince presses i, aspell will simply ignore this instance of the word and move on. Pressing capital I will
cause aspell to ignore all instances of the word in the current file.
Replace the Word
If aspell was not able to generate an appropriate suggestion, prince may use r to manually replace
the word. When finished, aspell will pick up again, first rechecking the specified replacement. By
using capital R, aspell will remember the replacement and automatically replace other instances of
the misspelled word.
Add the Word to the Personal Dictionary
If prince would like aspell to learn a new word, so that it will not be flagged when checking future
files, he may press a to add the word to his personal dictionary.
Exit aspell
By pressing x, prince can immediately exit the interactive aspell session. Any spelling corrections
already implemented will be saved.
As prince proceeds through the interactive session, aspell flags procesing, prety, IIRC, and LOL as
misspelled. For the first two, prince accepts aspell's suggestions for the correct spelling. The last two
"words" are abbreviations that prince commonly uses in his emails, so he adds them to his personal
dictionary. Unfortunately, because its is a legitimate word, aspell does not report prince's misuse of it.
When finished, prince now has two files, the corrected version of toelvis, and an automatically
generated backup of the original, toelvis.bak.
[prince@station prince]$ ls
toelvis
toelvis.bak
[prince@station prince]$ diff toelvis.bak toelvis
4c4
< procesing workbook in Red Hat Academy. IIRC, its prety
---
> processing workbook in Red Hat Academy. IIRC, its pretty
Performing a Non-interactive Spell Check

Using the -l command line switch, the aspell command can be used to perform spell checks in a
non-interactive batch mode. Used this way, aspell simply reads standard in, and writes to standard out
every word it would flag as misspelled.
In the following, suppose prince performed a non-interactive spell check before he had run the aspell
session interactively.
[prince@station prince]$ aspell -l < toelvis
procesing
IIRC
prety
LOL
The aspell utility lists the four words it would flag as misspelled. After the interactive spell check, prince
performs a non-interactive spell check on his backup of the original file.
[prince@station prince]$ aspell -l < toelvis.bak
procesing
prety
Because the words IIRC and LOL were added to prince's personal dictionary, they are no longer flagged
as misspelled.
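Table 8-1 lists list as the command form equivalent to the -l switch. The following is a minimal sketch, not a definitive recipe, of a non-interactive check written with that form. The function name flagged_words is a hypothetical helper, and the sketch assumes an aspell that accepts the list command. Some later aspell releases are reported to use -l to select a language instead, in which case the list form is the safer spelling.

```shell
# flagged_words is a hypothetical helper name. It uses the "list" command
# form from Table 8-1 and prints each flagged word once, in sorted order.
flagged_words() {
    aspell list < "$1" | sort -u
}

# Only attempt the check when aspell is installed and the file is readable.
if command -v aspell >/dev/null 2>&1 && [ -r toelvis ]; then
    flagged_words toelvis
fi
```

The guard keeps the sketch harmless on a machine without aspell or without the toelvis file; on prince's machine it would print the same words that aspell -l reported, sorted and without repeats.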
Managing the Personal Dictionary

By default, the aspell command uses two dictionaries when performing spell checks: the system wide
master dictionary, and a user's personal dictionary. When prince chooses to add a word, the word gets
stored in his personal dictionary. He uses aspell's dump command to view his personal dictionary.
[prince@station prince]$ aspell dump personal
LOL
IIRC
Likewise, he could dump the system's master dictionary.

[prince@station prince]$ aspell dump master | wc -l
153675
[prince@station prince]$ aspell dump master | grep "^add.*ion$"
addiction
addition
adduction
The aspell command can also automatically create a personal dictionary (if it doesn't already exist), or
merge into it (if it does) using words read from standard in. Suppose prince has a previous email
message, in which he used many of his commonly used abbreviations. He would like to add all of the
abbreviations found in that email to his personal dictionary. He first uses aspell -l to extract the words
from the original message.
[prince@station prince]$ aspell -l < good_email.txt
FWIW
AFK
RSN
TTFN
After observing the results, he decides to add all of these words to his personal dictionary, using aspell
merge personal. When he finishes, he again dumps his (expanded) personal dictionary.
[prince@station prince]$ aspell -l < good_email.txt | aspell merge personal
[prince@station prince]$ aspell dump personal
TTFN
AFK
LOL
RSN
IIRC
FWIW
What happens if prince tries to create the personal dictionary instead?

[prince@station prince]$ echo "foo" | aspell create personal
Sorry I won't overwrite "/home/prince/.aspell.english.pws"
In aspell's unwillingness to clobber an already existing personal dictionary, we discover where it is
stored: ~/.aspell.language.pws.
Getting Help
Where would prince expect to find help for the aspell command?
[prince@station prince]$ man aspell
No manual entry for aspell
A reasonable first guess, but in this case wrong. Like most commands, aspell will generate a usage
summary when called with the --help command line switch. Additional documentation can be found in
the /usr/share/doc/aspell-0*/man-text/ directory (as simple text files), or
/usr/share/doc/aspell-0*/man-html/ in html format. The following command, when executed
from an X terminal, will start prince off in the html based documentation.
[prince@station prince]$ mozilla /usr/share/doc/aspell-0.33.7.1/man-html/index.html
Examples
Example 1. Adding Service Names to aspell's Personal Dictionary
The user prince often answers questions related to Linux's networking services in his emails,
and aspell consistently flags the conventional service names as misspelled words. He would like to add
the service names found in the file /etc/services to his personal dictionary.
He first spell checks the /etc/services file non-interactively, and stores the results in
~/services.maybe.
[prince@station prince]$ aspell -l < /etc/services > services.maybe
Using the less pager to browse the file services.maybe, he finds many duplicate entries. He makes life
easier for himself (and eventually aspell) by regenerating the list, removing duplicates.
[prince@station prince]$ aspell -l < /etc/services | sort | uniq > services.maybe
Browsing the file again, prince is satisfied that the list contains words he would rather not have flagged as
misspelled. He adds the word list to his personal dictionary.
[prince@station prince]$ aspell merge personal < services.maybe
For confirmation, he again spell checks, non-interactively, the /etc/services file.

[prince@station prince]$ aspell -l < /etc/services
As expected, no words were flagged as misspelled.
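The duplicate removal step in this example works on any word list, not just aspell output. The following sketch isolates it, using a made-up sample in place of the real spell check results; the function name dedupe_words and the sample words are assumptions for illustration only.

```shell
# dedupe_words is a hypothetical helper: it sorts standard in and drops
# duplicate lines, the same job "sort | uniq" does in the example above.
dedupe_words() {
    sort | uniq
}

# Made-up sample standing in for the output of a non-interactive spell check.
printf '%s\n' tcpmux rje tcpmux systat rje | dedupe_words
```

The sort comes first because uniq only collapses duplicate lines that are adjacent to one another.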
Online Exercises
Lab Exercise
Objective: Use the aspell command to perform routine spell checks.
Setup
In order to prepare for this Exercise, remove any personal dictionaries (or replacement lists) you have
accumulated using the following command.
[student@station student]$ rm .aspell*
Specification
1. Generate a list of all words that aspell flags as misspelled found in all files underneath the
/etc/sysconfig directory, and its subdirectories. The list should be sorted in ascending alphabetical
order, and duplicate words should be removed. Store the list (one word per line) in the file
~/sysconfig.spell.txt.
With creative use of the find, xargs, cat, sort, and uniq commands, this can be accomplished in one
command line.
2. Copy the file /usr/share/doc/which-*/README into your home directory. Perform an
interactive spell check on the file, using the aspell spell checker. Use the following policies for the
following misspelled words.
a. Add the following words to your personal dictionary: gcc, stdout, texinfo, stdin, usr, csh
b. Use aspell's suggestions to correct: explicity (should read explicitly)
c. Manually replace the word Litmaath with Smith.
d. Ignore all other flagged words.
If performed correctly, you should be able to reproduce output similar to the following.
[student@station student]$ diff README.bak README
20c20
< Maarten Litmaath called which-v6, he was using -i as option
---
> Maarten Smith called which-v6, he was using -i as option
59c59
<     to explicity search for normal binaries, while using
---
>     to explicitly search for normal binaries, while using
73c73
<     ful to explicity search for normal binaries, while
---
>     ful to explicitly search for normal binaries, while
[student@station student]$ aspell dump personal
stdout
usr
csh
texinfo
gcc
stdin
Deliverables
1. The file ~/sysconfig.spell.txt, which contains an alphabetically ascending sorted list of all words flagged
by aspell as misspelled found in all files underneath the /etc/sysconfig directory, and its subdirectories. The
file should not contain duplicate words.
2. The file ~/README, which is a copy of the file /usr/share/doc/which-*/README, which has been spell
checked with the aspell command. The word explicity should be replaced with explicitly, and the word Litmaath
with Smith.
3. An aspell personal dictionary that contains exactly the words gcc, stdout, texinfo, stdin, usr, and csh.
Questions
1. Which of the following command lines would start an interactive aspell spell check on the file report.txt?
( ) a. aspell report.txt
( ) b. aspell -c report.txt
( ) c. aspell -c < report.txt
( ) d. aspell < report.txt
2. Which of the following command lines would start a non-interactive aspell spell check on the file report.txt?
( ) a. aspell -l report.txt
( ) b. aspell < report.txt
( ) c. aspell -b report.txt
( ) d. aspell -l < report.txt
3. Which of the following cannot be performed when aspell flags an unrecognized word during an interactive spell
check?
( ) a. The unrecognized word can be added to the system's master dictionary.
( ) b. The unrecognized word can be added to the user's personal dictionary.
( ) c. The unrecognized word can be replaced from a list of suggested replacements.
( ) d. The unrecognized word can be manually replaced by the user.
( ) e. All of the above actions can be performed.
4. Assuming the file mywords.txt contains a series of whitespace separated words, which of the following
command lines could be used to add the words to a user's personal dictionary?
( ) a. aspell merge mywords.txt
( ) b. aspell merge personal mywords.txt
( ) c. aspell merge personal < mywords.txt
( ) d. aspell merge < mywords.txt
5. Which of the following command lines would dump the words contained in the master dictionary to standard out?
( ) a. aspell dump master
( ) b. aspell -d master
( ) c. aspell dump
( ) d. aspell -m
6. Which of the following actions can be performed directly by the aspell command when performing a
non-interactive spell check?
( ) a. The unrecognized words can be added to the system's master dictionary.
( ) b. The unrecognized words can be added to the user's personal dictionary.
( ) c. The unrecognized words can be replaced automatically with aspell's first suggested replacement.
( ) d. The unrecognized words can be decorated with a +++ character sequence, so they can be easily searched
for with a text editor.
( ) e. None of the above can be performed by aspell directly using a non-interactive spell check.
7. Which of the following command lines would effectively add all of the unrecognized words in the file
report.txt to a user's personal dictionary?
( ) a. aspell -l report.txt | sort | uniq | aspell merge personal
( ) b. aspell -l < report.txt | sort | uniq | aspell merge personal
( ) c. aspell < report.txt | sort | uniq | aspell merge personal
( ) d. aspell -l < report.txt | sort | uniq | aspell merge
8. Which of the following command lines would perform an interactive spell check of the file report.txt, but not
create a backup file?
( ) a. aspell -x report.txt
( ) b. aspell -x -c report.txt
( ) c. aspell -X < report.txt
( ) d. aspell -b -c < report.txt
9. Which of the following would list all unrecognized words in the file report.txt which are greater than 4
characters in length?
( ) a. aspell -W4 -l < report.txt
( ) b. aspell -W3 report.txt
( ) c. aspell -W5 -c < report.txt
( ) d. aspell -W3 report.txt
10. Which of the following would replace a user's (already existing) personal dictionary with the words found in the
file mywords.txt?
( ) a. aspell create personal < mywords.txt
( ) b. aspell -c merge personal < mywords.txt
( ) c. aspell -r create personal < mywords.txt
( ) d. aspell clobber personal < mywords.txt
( ) e. Once a personal dictionary exists, it cannot be removed by the aspell command directly.
Chapter 9. Formatting Text (fmt) and Splitting Files (split)
Key Concepts
The fmt command can reformat text to differing widths.
Using the -p command line switch, the fmt command will only reformat text that begins with the
specified prefix, preserving the prefix.
The split command can be used to split a single file into multiple files based on either a number of
lines or a number of bytes.
Discussion
The fmt Command
Motivation for the fmt Command
Hopefully, the Lessons in this Workbook encountered so far have demonstrated the powerful ways that
text can be manipulated using basic Linux (and Unix) command line utilities. Because Linux provides
such a useful toolkit of text manipulation commands, the data that people handle is often left as simple
text. The /etc/passwd file is the classic example. Rather than embedding user definitions in some
database that requires a custom utility for access, they are defined in a simple text file that anyone with
knowledge of the grep command can search.
The common use of the simple text editor follows as a natural result of the common occurrence of the
simple text file. We emphasize again that text editors are not word processors. Elaborate word processing
applications, such as OpenOffice or AbiWord, generally store information using elaborate markup or
binary formatting to define fonts, colors, and other such details about the text's appearance. In contrast,
simple text editors such as nano, vim, or gedit store just the data: what you see is what you get. As a
result, users can edit text files with a text editor with much more control and predictability.
One side effect of the variety of text editors in Linux, and in particular the coexistence of text editors and
word processors, is the inconsistencies with which word wrapping is handled. To a word processor, and
many HTML based text entry forms, new line characters are usually considered not worthy of the
concern of users. A user begins typing text, without ever using the RETURN key, and the application
decides when to wrap a line and where to insert a new line character. While this is not a problem, and
perhaps even desirable, for writing a letter to a friend, it can cause significant problems when editing a
line based configuration file (such as the /etc/passwd file, the /etc/hosts file, the /etc/fstab file,
etc..., etc...).
As an example of the inconsistencies of various text editors, the user elvis tries a simple experiment. He
types the first sentence from the previous paragraph using four different applications: the nano text
editor, the vim text editor, the gedit text editor, and the OpenOffice word processor. In each case, he
types the sentence without ever hitting the RETURN key, and saves the document as
side_effect.extension using the default settings. The only exception is the OpenOffice word
processor, whose default format uses binary encoding. For this application, elvis saved the file twice:
once using the "default" settings (the OpenOffice format), and once choosing the simplest "save as text"
setting possible.
Figure 9-1. Text Handling Using gedit
Figure 9-2. Text Handling Using gvim
Figure 9-3. Text Handling Using OpenOffice
Figure 9-4. Text Handling Using nano
What result does wc show? The four different applications used four different conventions for displaying
and saving the simple text sentence (five, if you include the binary OpenOffice format).
[elvis@station elvis]$ wc side_effect.* 2>/dev/null
      1      31     188 side_effect.gedit
      1      31     188 side_effect.gvim
     16     109    4950 side_effect.ooffice.sxw
      0      31     187 side_effect.ooffice.txt
      3      31     190 side_effect.nano
     21     233    5703 total
The nano text editor was the only application that implemented word wrapping by default. Although
elvis never hit the return key, three ASCII new line characters were inserted. The gedit and gvim
applications were consistent with Linux (and Unix) convention: they did not insert new line characters in
the middle of the text, but they would not let a text file end without a terminating new line character.
Although consistent with each other in terms of how the file was stored, they differed in how the text was
presented to the user: gedit wrapped the text at word boundaries, while gvim wrapped the text only when
it could fit no more on a line. Like gedit, the OpenOffice application wrapped the text while displaying
it, but did not add the conventional Linux new line to the end of the file while saving it to disk. We can't
even begin to discuss why the OpenOffice standard format took nearly 5000 bytes of binary data to store
about 200 characters.
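The line counts reported above follow from the fact that wc counts new line characters rather than visible lines of text. The following short sketch, using made-up sample strings, shows why a file saved without a terminating new line is reported as having zero lines.

```shell
# wc -l counts new line characters, not visible lines of text.
printf 'no terminating new line' | wc -l
printf 'with a terminating new line\n' | wc -l
```

The first command reports 0 and the second reports 1, even though each prints a single line of text to the screen.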
All of this is to say that how an application handles the word wrapping issues is not obvious to the casual
user, and often, when reading text with one utility that was written by another, word wrapping issues
cause problems.
Rewrapping Text with the fmt Command

The fmt command is used to rewrap text, inserting newlines at word boundaries to create lines of a
specified length (75 characters by default). As a quick example, consider how the fmt command reformats
the file side_effect.gvim.
[elvis@station elvis]$ cat side_effect.gvim
One side effect of the variety of text editors in Linux, and in particular the c
oexistence of text editors and word processors, is the inconsistencies with whic
h word wrapping is handled.
[elvis@station elvis]$ fmt side_effect.gvim
One side effect of the variety of text editors in Linux, and in
particular the coexistence of text editors and word processors, is the
inconsistencies with which word wrapping is handled.
[elvis@station elvis]$ fmt side_effect.gvim | wc
      3      31     188
The cat command, true to its nature, performed no formatting on the file when it displayed it. The fact
that the lines wrapped at 80 characters is a side effect of the terminal that was displaying it. The fmt
command, on the other hand, wrapped the text at word boundaries so that no line was over 75 characters
in length.
fmt Command Syntax

Like most of the text processing commands encountered in this Workbook, the fmt command interprets
arguments as filenames on which to operate, or operates on standard in if none are provided. Its output is
written to standard out. The following table lists command line switches that can be used to modify fmt's
behavior.
Table 9-1. Command Line Switches for the fmt Command
Switch                   Effect
-w, --width=N, -N        Format text to N columns.
-p, --prefix=STRING      Only format lines beginning with STRING.
-u, --uniform-spacing    Enforce spacing of one space between words, two spaces between sentences.
Formatting to a Specific Width

The maximum width of the resulting text can be specified with the -w N command line switch, or more
simply just -N , where N is the maximum line width measured in characters. In the following example,
elvis reapplies the format command to the file side_effect.gvim, formatting it first to a width of 60
characters, and then to a width of 40 characters.
[elvis@station elvis]$ fmt -w60 side_effect.gvim
One side effect of the variety of text editors in Linux,
and in particular the coexistence of text editors and
word processors, is the inconsistencies with which word
wrapping is handled.
[elvis@station elvis]$ fmt -40 side_effect.gvim
One side effect of the variety of text
editors in Linux, and in particular the
coexistence of text editors and word
processors, is the inconsistencies with
which word wrapping is handled.
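One way to confirm that fmt respects the requested width is to measure the longest line it produces. The following is a minimal sketch under that idea; the variable long_line and the helper longest_line are hypothetical names introduced for illustration, and the awk one-liner is just one of several ways to measure line length.

```shell
# A single long line, the same sentence stored in the side_effect files above.
long_line="One side effect of the variety of text editors in Linux, and in particular the coexistence of text editors and word processors, is the inconsistencies with which word wrapping is handled."

# longest_line is a hypothetical helper that prints the length of the
# longest line read from standard in.
longest_line() {
    awk '{ if (length($0) > max) max = length($0) } END { print max + 0 }'
}

echo "$long_line" | fmt | longest_line         # default width
echo "$long_line" | fmt -w 40 | longest_line   # narrower width
```

Neither number should exceed the width in effect (75 by default, 40 in the second command), and the word count of the text is unchanged because fmt only moves the line breaks.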
Formatting Text with a Prefix

Often, text is found with some sort of decoration or prefix. Particularly when commenting source code or
scripts, all of the text of the comment needs to be marked with the appropriate comment character. The
following snippet of text is found in the /usr/include/db_cxx.h header file for the C++
programming language.
//
// As a rule, each DbFoo object has exactly one underlying DB_FOO struct
// (defined in db.h) associated with it. In some cases, we inherit directly
// from the DB_FOO structure to make this relationship explicit. Often,
// the underlying C layer allocates and deallocates these structures, so
// there is no easy way to add any data to the DbFoo class. When you see
// a comment about whether data is permitted to be added, this is what
// is going on. Of course, if we need to add data to such C++ classes
// in the future, we will arrange to have an indirect pointer to the
// DB_FOO struct (as some of the classes already have).
//
Suppose a programmer edited the comment, adding the following few words on the second line.
[elvis@station elvis]$ cat cxx_comment.txt
//
// As a rule, each DbFoo object has exactly one underlying DB_FOO struct
// (defined in db.h) associated with it. In some cases, but we really don't expect many of them, we inherit directly
// from the DB_FOO structure to make this relationship explicit. Often,
// the underlying C layer allocates and deallocates these structures, so
// there is no easy way to add any data to the DbFoo class. When you see
// a comment about whether data is permitted to be added, this is what
// is going on. Of course, if we need to add data to such C++ classes
// in the future, we will arrange to have an indirect pointer to the
// DB_FOO struct (as some of the classes already have).
//
Because each line of the text begins with a //, and ends with an ASCII new line character, readjusting
the line to fit back into 80 characters would involve pushing some words to the next line, which would
then also need to be reformatted, and so on.
Fortunately, the fmt command with the -p command line switch makes life much easier.
[elvis@station elvis]$ fmt -70 -p"// " cxx_comment.txt
// As a rule, each DbFoo object has exactly one underlying DB_FOO
// struct (defined in db.h) associated with it. In some cases,
// but we really don't expect many of them, we inherit directly
// from the DB_FOO structure to make this relationship explicit.
// Often, the underlying C layer allocates and deallocates these
// structures, so there is no easy way to add any data to the
// DbFoo class. When you see a comment about whether data is
// permitted to be added, this is what is going on. Of course,
// if we need to add data to such C++ classes in the future, we
// will arrange to have an indirect pointer to the DB_FOO struct
// (as some of the classes already have).
The fmt command did all of the hard work, and preserved the prefix characters.
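The same technique can be tried on a much smaller input. The following is a minimal sketch, assuming the GNU version of fmt; the file name comment.txt and its text are made up for illustration.

```shell
# Build a small sample file in a scratch directory (names are illustrative only).
workdir=$(mktemp -d)
cat > "$workdir/comment.txt" <<'EOF'
// This comment line grew far too long after somebody inserted a few extra words into the middle of it.
// The next line is short.
EOF

# Reflow only the text that follows the "// " prefix, to at most 40 columns.
fmt -w 40 -p '// ' "$workdir/comment.txt" > "$workdir/comment.out"
cat "$workdir/comment.out"
```

Every line of the result should still begin with the comment marker, and the words should appear in their original order.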
The split Command

Dividing files with the split Command
Suppose someone has a file that is too large to handle as a single piece. For that, or for some other
reason, the split command will divide the file into smaller files, each a specified number of lines or bytes.
As an example, elvis generates the following pointless 1066-line file.
[elvis@station elvis]$ for i in $(seq 1066); do
> echo "this is line number $i of a pointless file." >> pointless.txt
> done
[elvis@station elvis]$ wc pointless.txt
 1066  9594 47929 pointless.txt
[elvis@station elvis]$ tail -5 pointless.txt
this is line number 1062 of a pointless file.
this is line number 1063 of a pointless file.
this is line number 1064 of a pointless file.
this is line number 1065 of a pointless file.
this is line number 1066 of a pointless file.
Now elvis uses the split command to divide the file into smaller files, each of 200 lines.
[elvis@station elvis]$ split -200 pointless.txt sub_pointless_
[elvis@station elvis]$ wc sub_pointless_a*
  200  1800  8892 sub_pointless_aa
  200  1800  9000 sub_pointless_ab
  200  1800  9000 sub_pointless_ac
  200  1800  9000 sub_pointless_ad
  200  1800  9001 sub_pointless_ae
   66   594  3036 sub_pointless_af
 1066  9594 47929 total
[elvis@station elvis]$ tail -5 sub_pointless_ad
this is line number 796 of a pointless file.
this is line number 797 of a pointless file.
this is line number 798 of a pointless file.
this is line number 799 of a pointless file.
this is line number 800 of a pointless file.
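The session above can be replayed in a scratch directory. This sketch assumes GNU split and uses -l 200, the long-hand spelling of the -200 switch; the final cmp check is an addition, confirming that the pieces put back together in name order match the original.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Recreate the 1066-line sample file.
for i in $(seq 1066); do
    echo "this is line number $i of a pointless file."
done > pointless.txt

# Divide it into pieces of 200 lines each.
split -l 200 pointless.txt sub_pointless_

# Five full pieces plus a 66-line remainder are expected.
wc -l sub_pointless_*

# Concatenating the pieces in name order should reproduce the original.
cat sub_pointless_* | cmp - pointless.txt && echo "pieces reassemble cleanly"
```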
split Command Syntax

In addition to any command line switches, the split command expects zero, one, or two arguments.
split [SWITCHES] [FILENAME [PREFIX]]
If called with one or two arguments, the first argument is the name of the file to split. If called with two
arguments, the second argument is used as a prefix for the newly created files. If called with no
arguments, or if the first argument is the special filename -, the split command will operate on
standard in.
The action of the split command is to split FILENAME into smaller files titled PREFIXaa, PREFIXab, etc.
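The three calling conventions can be compared side by side. This is a sketch assuming GNU split, whose defaults (a prefix of x and pieces of 1000 lines) are not stated in the text above; the file name numbers.txt is made up.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 2500 > numbers.txt

split numbers.txt              # one argument: pieces are named xaa, xab, xac
split numbers.txt part_        # two arguments: part_aa, part_ab, part_ac
split - stdin_ < numbers.txt   # "-" reads standard in: stdin_aa, stdin_ab, stdin_ac

ls
```

With 2500 input lines and the default piece size, each form should leave two full pieces and a third holding the last 500 lines.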
Table 9-2. Command Line Switches for the split Command

Switch                   Effect
-l, --lines=N, -N        Split input into files of N lines.
-b, --bytes=N            Split input into files of N bytes. (a)
-C, --line-bytes=N       Split input into files of at most N bytes, but perform split at the end of a line. (a)
-a, --suffix-length=N    Use suffixes of N characters (default N=2).

Notes:
a. When specifying N as a number of bytes, a single letter suffix can be included which acts as a
multiplier: b=512, k=1024, and M=1024*1024.
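As a small illustration of the byte switches, the sketch below splits 2500 bytes of filler into 1k pieces with three-letter suffixes. It assumes GNU split and head; the names filler.dat and piece_ are made up.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# 2500 bytes of filler data (zero bytes read from /dev/zero).
head -c 2500 /dev/zero > filler.dat

# 1k means 1024 bytes; -a 3 asks for three-letter suffixes.
split -b 1k -a 3 filler.dat piece_

# piece_aaa and piece_aab should hold 1024 bytes each, and piece_aac
# the remaining 2500 - 2048 = 452 bytes.
wc -c piece_*
```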
Splitting Standard In
In the previous Lesson, we saw that aspell's master dictionary can be dumped using the following
command.
[elvis@station elvis]$ aspell dump master | wc
 153675  153675 1502478
The user elvis would like to store a copy of the dictionary, but he would like to break it down into files of
100 lines each. Realizing that this will create 1537 files, he sees that the default two-letter suffixes
(26*26 = 676 names) would run out, so he bumps up the suffix length to 3. Because he wants to specify the
string dict_ as a prefix, he must supply two arguments, so he uses the special filename - to cause split to
read from standard in.
[elvis@station dict]$ aspell dump master | split -100 -a3 - dict_
[elvis@station dict]$ ls
dict_aaa  dict_aab  dict_aac  ...  dict_cha  dict_chb  dict_chc
[elvis@station dict]$ wc dict_*
    100     100     788 dict_aaa
    100     100     790 dict_aab
    100     100    1008 dict_aac
...
    100     100    1215 dict_cha
    100     100    1206 dict_chb
     75      75     917 dict_chc
 153675  153675 1502478 total
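The suffix arithmetic can be checked directly in the shell. In this sketch, seq stands in for aspell dump master (which may not be installed), producing the same number of lines; GNU split is assumed.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

lines=153675      # line count reported by wc for the dictionary
per_file=100

# Integer division rounded up gives the number of pieces.
pieces=$(( (lines + per_file - 1) / per_file ))
echo "pieces needed: $pieces"
echo "two-letter suffixes available: $(( 26 * 26 ))"
echo "three-letter suffixes available: $(( 26 * 26 * 26 ))"

# seq is a stand-in for "aspell dump master".
seq "$lines" | split -l "$per_file" -a 3 - dict_

ls | wc -l        # number of files created
ls | tail -n 1    # name of the last file
```

The last file created should be dict_chc, holding the 75 leftover lines, which agrees with the listing above.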
Examples
Example 1. Using fmt to Clean Email
While using the mutt terminal based mailer, elvis saves and then views the following email message.
[elvis@station elvis]$ cat email.txt
I believe the phone number of the rental property is
888-555-1212. If not, the phone number of the rental office
is 888-555-1313. I'll have my cellphone with me, also:
888-555-1414.
On September 24 (15:32 EDT), blondie wrote:
>
> What phone numbers will everyone have in case I get lost?
>
>
> >> So it turns out that mapquest gives more sane die wreck shuns than would I were I to have to produce them from memory. So here's how to get to the house assuming you have traveled to the eastern most end of I-92. This route will take you through the heart of downtown Springfield. In my opinion it's the best way to get there because you spend the most time on the superslb that is I-92. The stretch down Market Street is narrow so drive with care. Once you turn off of Market Street take your time and gawk at the lovely historic homes. If you've got time to kill on the way out, Springfield's a nice riverfront to take a stroll.
The email is composed of different included sections, each of which was presumably written by a
different author using a different text editor. The first few lines are fine, but then the last included
comment is all one long line.
Before replying, elvis cleans up the message using the fmt command.
[elvis@station elvis]$ fmt -p"> >> " -w60 email.txt
I believe the phone number of the rental property is
888-555-1212. If not, the phone number of the rental office
is 888-555-1313. I'll have my cellphone with me, also:
888-555-1414.
On September 24 (15:32 EDT), blondie wrote:
>
> What phone numbers will everyone have in case I get lost?
>
>
> >> So it turns out that mapquest gives more sane die
> >> wreck shuns than would I were I to have to produce
> >> them from memory. So here's how to get to the house
> >> assuming you have traveled to the eastern most end
> >> of I-92. This route will take you through the heart
> >> of downtown Springfield. In my opinion it's the best
> >> way to get there because you spend the most time on
> >> the superslb that is I-92. The stretch down Market
> >> Street is narrow so drive with care. Once you turn
> >> off of Market Street take your time and gawk at
> >> the lovely historic homes. If you've got time to
> >> kill on the way out, Springfield's a nice riverfront
> >> to take a stroll.
Notice that the fmt command only operated on lines that began with the > >> prefix. (In this case,
there was only one.) The rest of the text was left alone.
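The selective behavior can be confirmed on a two-line sample. This is a sketch assuming GNU fmt; the file name mail.txt and its text are made up for illustration.

```shell
workdir=$(mktemp -d)
cat > "$workdir/mail.txt" <<'EOF'
> A short reply that uses a different quoting prefix.
> >> A much longer quoted line that arrived as one single unbroken line of text from a mail program that never wraps anything at all.
EOF

# Only the line carrying the "> >> " prefix is a candidate for reformatting.
fmt -p '> >> ' -w 60 "$workdir/mail.txt" > "$workdir/mail.out"
cat "$workdir/mail.out"
```

The first line should pass through untouched, while the second is broken into several lines that each carry the prefix.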
Example 2. Using "String Processing" Tools to Manipulate Binary Data
Most of this Workbook has focused on developing a toolkit of commands for processing text. Many of
the commands work equally well on bytes, without attempting to interpret the bytes into text characters.
The user elvis has created an abstract image using the gimp image manipulation program, and saved the
file as clouds.pnm using the PNM format.
Figure 9-5. Elvis's Abstract Image of Clouds (clouds.pnm)
In the first Lesson, the PNM format was mentioned as a simple example of encoding images. The picture
is first reduced to an array of dots ("pixels"), and then the color of each pixel is encoded into three bytes
of raw data, its "redness", "greenness", and "blueness", each as a value from 0 to 255.
A few lines of ASCII text are prepended to the file, to identify the format, the number of pixels in each row,
the number of rows, and the "depth" of the image. (The depth is the largest integer value used to
encode each color component. Using the scheme described in the previous paragraph, the image would
have a depth of 255.)
After a little experimenting with the head command, elvis determines that his image file consists of four
lines of ASCII text, followed by binary data.
[elvis@station elvis]$ head -4 clouds.pnm
P6
# CREATOR: The GIMP's PNM Filter Version 1.0
256 256
255
Attempting to figure out the header, elvis assumes the following.
1. The text P6 probably acts as magic. Magic is the term for specific strings (or bytes) that identify
(often binary) file formats. A collection of "magic" identifiers is cataloged in the file
/usr/share/magic. (For the curious, try grep P6 /usr/share/magic.)
2. Apparently, any line in the ASCII header that begins with a # is interpreted as a comment.
3. These two numbers probably identify the number of pixels in a row, and the number of rows in the
image. His image is an array of 256x256 pixels.
4. The last number defines the depth of the image, elvis assumes.
The remainder of the file is raw data. The user elvis would like to split the image into four horizontal
slices. Of course, the right way to do this would be to use an image editor, such as gimp. Instead, elvis is
going to use command line tools.
He first separates the image into its header, and its raw data.
[elvis@station elvis]$ head -4 clouds.pnm > clouds.hdr
[elvis@station elvis]$ tail -n +5 clouds.pnm > clouds.dat
[elvis@station elvis]$ cat clouds.hdr
P6
# CREATOR: The GIMP's PNM Filter Version 1.0
256 256
255
[elvis@station elvis]$ wc clouds.dat
wc: clouds.dat:1: Invalid or incomplete multibyte or wide character
      0       8  196608 clouds.dat
While the number of lines and words reported by the wc command are meaningless, the number of
"characters" is really the number of bytes in the file. Performing a quick calculation, elvis determines
that an image of 256x256 pixels, with each pixel requiring 3 bytes of data, should be 256*256*3=196608
bytes in length. The wc command's character count agrees.
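The same check can be scripted. The sketch below builds a tiny stand-in image (4 pixels wide, 2 rows, with made-up filler data) rather than relying on clouds.pnm, reads the dimensions from the third header line, and compares the expected byte count with what wc reports.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a tiny 4x2 stand-in image: four header lines, then 4*2*3 = 24 data bytes.
{
    printf 'P6\n# made-up comment line\n4 2\n255\n'
    head -c 24 /dev/zero
} > tiny.pnm

# Separate the header from the raw data.
head -n 4 tiny.pnm > tiny.hdr
tail -n +5 tiny.pnm > tiny.dat

# The third header line holds "width height".
read -r width height <<EOF
$(sed -n 3p tiny.hdr)
EOF

# Three bytes (red, green, blue) per pixel.
expected=$(( width * height * 3 ))
actual=$(wc -c < tiny.dat)
echo "expected $expected bytes, found $actual"
```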
Next, elvis uses the split command to divide the image's raw data into four slices, each 196608/4=49152
bytes in size.
[elvis@station elvis]$ split -b49152 clouds.dat clouds_
[elvis@station elvis]$ wc clouds.dat clouds_* 2>/dev/null
      0       8  196608 clouds.dat
      0       1   49152 clouds_aa
      0       1   49152 clouds_ab
      0       1   49152 clouds_ac
      0       8   49152 clouds_ad
      0      19  393216 total
Now elvis has four slices of raw data from the original image, each of which contains one fourth of
the original number of rows. He uses a text editor to update the header information to reflect his change,
and stores the updated header in the file clouds.newhdr.
[elvis@station elvis]$ diff -u clouds.hdr clouds.newhdr
--- clouds.hdr      2003-10-10 04:40:28.000000000 -0400
+++ clouds.newhdr   2003-10-10 04:40:43.000000000 -0400
@@ -1,4 +1,4 @@
 P6
 # CREATOR: The GIMP's PNM Filter Version 1.0
-256 256
+256 64
 255
As the diff command reveals, elvis's only edit was to change the number which defines the number of
rows from 256 to 256/4=64. Now elvis creates 4 new PNM image files by prepending the modified
header to the split image data. When finished, he views his images with the "Eye of GNOME" viewer,
eog.
[elvis@station elvis]$ cat clouds.newhdr clouds_aa > clouds_row1.pnm
[elvis@station elvis]$ cat clouds.newhdr clouds_ab > clouds_row2.pnm
[elvis@station elvis]$ cat clouds.newhdr clouds_ac > clouds_row3.pnm
[elvis@station elvis]$ cat clouds.newhdr clouds_ad > clouds_row4.pnm
[elvis@station elvis]$ eog clouds_row*
Figure 9-6. Row 1 of Elvis's Split Image (clouds_row1.png)
Why would elvis want to use command line tools? One answer is precision. Most graphical image
editors use mouse selections to perform these types of operations, which can lead to frustration when
trying to perform exacting edits. The second answer is automation. Suppose elvis had 283 images to
which he needed to perform the same operation. The process used above could be easily automated by
recording the commands in a bash script. (While the need for this level of precision or automation is
hard to imagine when handling abstract images, consider someone who might be handling images
routinely created by a medical imaging device.)
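One way such a script might look is sketched below. The function name slice_pnm is hypothetical, and the sketch assumes every image has the same four-line header layout seen above (with one comment line) and a height that divides evenly by the number of slices. It is exercised on a made-up 8x8 image filled with placeholder bytes rather than on real picture data.

```shell
# slice_pnm IMAGE COUNT
# Hypothetical helper: cut IMAGE (a binary PNM with a four-line header)
# into COUNT horizontal slices named IMAGE_row1.pnm, IMAGE_row2.pnm, ...
# Assumes the height divides evenly by COUNT, and COUNT is at most 676
# so that the default two-letter suffixes suffice.
slice_pnm() {
    image=$1
    count=$2
    base=${image%.pnm}

    # Separate the header from the raw data.
    head -n 4 "$image" > "$base.hdr"
    tail -n +5 "$image" > "$base.dat"

    # Line 3 of the header holds "width height".
    read -r width height <<EOF
$(sed -n 3p "$base.hdr")
EOF
    rows=$(( height / count ))

    # Same header, but with the reduced row count.
    sed "3s/.*/$width $rows/" "$base.hdr" > "$base.newhdr"

    # Each slice holds rows * width pixels at 3 bytes per pixel.
    split -b $(( width * rows * 3 )) "$base.dat" "${base}_"

    # Prepend the new header to each piece of raw data.
    n=1
    for part in "${base}_"??; do
        cat "$base.newhdr" "$part" > "${base}_row$n.pnm"
        n=$(( n + 1 ))
    done
}

# Try it on a made-up 8x8 image whose "pixel" data is just filler bytes.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
{
    printf 'P6\n# made-up comment line\n8 8\n255\n'
    seq 1 100 | head -c 192
} > sample.pnm

slice_pnm sample.pnm 4
ls
```

Processing a whole directory of images would then be a matter of calling the function in a for loop over the file names.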
Online Exercises
Lab Exercise
Objective: Effectively use the fmt and split commands.
Specification
1. Use the grep command to print every word in the file /usr/share/dict/words which contains
the text ee. Use the fmt command to reformat the output into lines of (the default) 75 characters
width. Store the result in the file ee_lines.txt.
2. The file /usr/share/doc/bash*/loadables/cut.c contains a couple of large sections of
comment text, whose lines all begin with the text *. Use the fmt command to reformat only the
comment text to a width of 40 characters. Store the result in the file ~/cut40.c.
If performed correctly, you should be able to reproduce results similar to the following.
[student@station student]$ tail -n +62 cut40.c | head
* NEGLIGENCE OR OTHERWISE) ARISING
* IN ANY WAY OUT OF THE USE OF THIS
* SOFTWARE, EVEN IF ADVISED OF THE
* POSSIBILITY OF SUCH DAMAGE.
*/
#ifndef lint
static const char copyright[] =
"@(#) Copyright (c) 1989, 1993\n\
The Regents of the University of California.
All rights reserved.\n";
3. The file /usr/share/zoneinfo/zone.tab lists the locations of cities used to identify timezones
and locales. Use the split command to split this file into files of 80 lines each (except, of course, for
the last file, which will collect the remainder). The new files should exist in your home directory,
and all have the form ~/zone_aa, where the letters aa iterate with each file.
Deliverables
1. The file ee_lines.txt, which contains every word from the file /usr/share/dict/words which contains
the text ee, reformatted to a width of 75 characters per line.
2. The file ~/cut40.c, which contains the contents of the file /usr/share/doc/bash-*/loadables/cut.c,
where all lines beginning with the characters * have been reformatted to a width of 40 characters.
3. The contents of the file /usr/share/zoneinfo/zone.tab, split into files of 80 lines each, with each
resulting file named ~/zone_aa, where aa iterates for each file.
Questions
1. Which of the following command lines would reformat the contents of the file email.txt to a width of 40
characters?
( ) a. fmt -w40 email.txt
( ) b. format -w40 email.txt
( ) c. fmt -W40 email.txt
( ) d. format -W40 email.txt
2. Which of the following command lines would reformat all comment lines within the shell script conv.sh (all
lines that begin with #) to a width of 40 characters?
( ) a. fmt -p# -w40 conv.sh
( ) b. fmt -p\# -w40 conv.sh
( ) c. fmt --prefix=# -w40 conv.sh
( ) d. fmt --pre=# -w40 conv.sh
3. Which of the following command lines would reformat the contents of the file letter.txt to a width of 75
characters per line?
( ) a. fmt -75 letter.txt
( ) b. fmt < letter.txt
( ) c. fmt --width=75 letter.txt
( ) e. A and C only
Copyright (c) 2003-2005 Red Hat, Inc. All rights reserved. For use only by a student enrolled in a Red Hat Academy course taught at a Red Hat Academy. Any other use is a
violation of U.S. and international copyrights. No part of this publication may be photocopied, duplicated, stored in a retrieval system, or otherwise duplicated whether in
electronic or print format without prior written consent of Red Hat, Inc. If you believe Red Hat course materials are being used, copied, or otherwise improperly distributed

4. Which of the following commands would split standard in into files of 1000 lines each, whose names all start with data_?
( ) a. split -l1000 data_
( ) b. split - data_
( ) c. split --lines=1k data_
( ) d. split -p"data_"
5. Which of the following would split the binary file data.out into files of 1 kilobyte each?
( ) a. split -b1k data.out
( ) b. split -b1024 data.out
( ) c. split --bytes=1024 data.out
( ) e. B and C only
6. Which of the following would split the contents of the file data.txt into files named data_00.txt, where 00 is
replaced with a two digit file number?
( ) a. split -f "data_%02d.txt" data.txt
( ) b. split --format="data_%02d.txt" data.txt
( ) c. split --format="data_##.txt" data.txt
( ) d. split -f "data_##.txt" data.txt
7. Which of the following command lines would reformat the contents of the file report.txt to a width of 50
characters, and then split the reformatted content into files of 2000 lines each?
( ) a. fmt -50 report.txt | split -l 2000
( ) b. split -l 2000 report.txt | fmt -50
( ) c. split --lines=2000 report.txt | fmt -w50
( ) e. B and C only
8. Which of the following would split the contents of the file data.txt into files of no larger than 5000 bytes?
( ) a. split --bytes=5k data.txt
( ) b. split -b5k data.txt
( ) c. split -b5000 data.txt
( ) e. A and B only
9. Which of the following would split the contents of the file report.txt into files of 2000 lines each, where each
resulting file's filename starts with chapter_?
( ) a. split -l2k -p"chapter_" < report.txt
( ) b. split -l2k -p"chapter_" report.txt
( ) c. split -l2000 - chapter_ < report.txt
( ) e. A and C only
10. Which of the following would split the file output.dat into files of exactly 2048 bytes?
( ) a. split -b2k output.dat
( ) b. split -2048 output.dat
( ) c. split -b2b output.dat
( ) d. split -b2M output.dat