February 09, 2013

sed tricks

The stream editor, most commonly known as sed, is a wonderful tool for modifying data from files and stdin. In this article I use the BSD variant of sed, which differs slightly from the GNU variant; I will point out the differences where they affect my examples.

Usage

One of the most common ways of using sed is:

cat file.txt | sed COMMAND

The other is:

sed COMMAND FILE

It is possible to have multiple commands using the -e argument:

sed -e COMMAND -e COMMAND ..

Or even like this:

sed 'COMMAND;COMMAND;..'

The COMMAND can be many different things, but usually it is a substitution of the form s/regular expression/replacement/flags

Here is an example:

echo "aabb" | sed -e 's/a/A/' -e 's/b/B/g'
AaBB

Notice that the first command replaces only the first occurrence, while the second replaces every occurrence because of the g flag. Keep in mind that sed works line by line, so "first occurrence" means the first occurrence on each line.
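To make the line-by-line behaviour concrete, here is a small sketch that feeds two lines through the same pair of commands. The function name first_vs_all and the sample input are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: replace the first "a" on each line with "A"
# and every "b" on each line with "B".
first_vs_all() {
    sed -e 's/a/A/' -e 's/b/B/g'
}

# sed applies the commands to each input line separately, so the
# "first occurrence" is found again on the second line.
printf 'aabb\nabab\n' | first_vs_all
# prints:
# AaBB
# ABaB
```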

Another important aspect is in-place alteration of files:

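sed -i EXT COMMAND FILE

This command edits FILE using COMMAND and saves a backup of the original to FILE with EXT appended to the filename. BSD sed requires the EXT argument; pass an empty string (-i '') to skip the backup. GNU sed takes an optional suffix that must be attached to the flag, as in -i.bck, a form that BSD sed accepts too. It is generally recommended to produce backup files so nothing is lost unintentionally.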
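As a rough sketch of an in-place edit with a backup, the following works in a throwaway directory so no real file is touched. The function name demo_inplace and the file contents are invented for the example; the suffix is attached to the flag (-i.bak) because both BSD and GNU sed accept that spelling.

```shell
#!/bin/sh
# Hypothetical demo: edit a temporary file in place and keep a backup.
demo_inplace() {
    tmpdir=$(mktemp -d) || return 1
    printf 'color=red\n' > "$tmpdir/demo.txt"

    # Edit demo.txt in place; the original is saved as demo.txt.bak.
    sed -i.bak 's/red/blue/' "$tmpdir/demo.txt"

    # Show the edited file first, then the untouched backup.
    cat "$tmpdir/demo.txt" "$tmpdir/demo.txt.bak"
    rm -rf "$tmpdir"
}

demo_inplace
# prints:
# color=blue
# color=red
```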

Group matching

Often it is necessary to match some block of data and substitute some portions without removing other parts, or to alter the order of blocks. A group is captured using parentheses in the command and referenced with \1, \2, etc. For instance:

echo "foobar" | sed 's/\(foo\)\(bar\)/\2\1/'
barfoo

Two groups are matched (“foo” and “bar”) and their order is reversed. Notice that in basic regular expressions the parentheses have to be escaped; unescaped, they match the actual characters “(” and “)”.
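A slightly more practical use of the same idea is reordering fields. The sketch below turns "Last, First" into "First Last"; the function name swap_names and the input format are assumptions made for this example.

```shell
#!/bin/sh
# Hypothetical helper: capture the text before the comma as \1 and
# the text after it as \2, then emit them in the opposite order.
swap_names() {
    sed 's/^\([^,]*\), *\(.*\)$/\2 \1/'
}

echo "Torvalds, Linus" | swap_names
# prints: Linus Torvalds
```

Lines without a comma do not match the pattern and pass through unchanged.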

Extended regular expressions

The expressions I have used up until now were basic regular expressions, but a more powerful variant exists: extended regular expressions. In this mode, among other things, parentheses no longer need to be escaped to form groups, and +, ? and | work as operators. POSIX character classes such as [[:space:]] are available in both modes. Example:

printf 'foo \t bar\n' | sed -E 's/(foo)[[:space:]]*(bar)/\2 \1/'
bar foo

The character class [:space:] matches any whitespace character. I use printf here because not every shell's echo expands \t into a tab. GNU sed traditionally spells -E as -r; recent GNU versions accept -E as well.
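For comparison, here is the same swap written once in basic and once in extended syntax. This is only a sketch; the function names swap_bre and swap_ere are made up, and printf is used so the tab in the input is a real tab regardless of the shell.

```shell
#!/bin/sh
# Basic regular expression: groups need escaped parentheses.
swap_bre() {
    sed 's/\(foo\)[[:space:]]*\(bar\)/\2 \1/'
}

# Extended regular expression: plain parentheses form groups.
swap_ere() {
    sed -E 's/(foo)[[:space:]]*(bar)/\2 \1/'
}

printf 'foo \t bar\n' | swap_bre   # prints: bar foo
printf 'foo \t bar\n' | swap_ere   # prints: bar foo
```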

Here is a great reference of the different regular expressions, both basic and extended ones.

Case sensitivity

In GNU sed there is a flag to turn off case sensitivity, namely the “i” flag. Sadly this flag is not available in BSD sed, so one has to turn to other possibilities. One is to lowercase all letters before piping to sed, but that doesn’t work for files and in-place modifications. However, character classes already cover both cases, e.g. [[:alpha:]] matches “A” as well as “a”. Where part of a regular expression needs specific letters in either case, both can be spelled out in a bracket expression, like [a-cA-C] for instance.
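The bracket approach can be sketched like this: every letter of the word is written as a two-character bracket expression, so no case-insensitivity flag is needed. The function name mask_error and the sample lines are invented for illustration.

```shell
#!/bin/sh
# Hypothetical helper: match "error" in any mix of upper and lower
# case by listing both cases of each letter.
mask_error() {
    sed 's/[Ee][Rr][Rr][Oo][Rr]/<error>/g'
}

printf 'Error: disk full\nno ERROR here\n' | mask_error
# prints:
# <error>: disk full
# no <error> here
```

This gets tedious for long words, but it works in place and on files, unlike lowercasing the input first.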

Different approaches exist for lowercasing using pipes. Here is one using tr:

echo "Hello, World" | tr '[:upper:]' '[:lower:]'
hello, world

And here is one using awk:

echo "Hello, World" | awk '{print tolower($0)}'
hello, world

If all else fails, Perl has a sed-like substitution syntax that accepts the “i” flag:

echo "HeLlo" | perl -pe 's/l/./ig'
He..o

Greedy vs non-greedy matching

sed is greedy, meaning it matches as much as possible when using + or *, and unlike Perl it has no non-greedy operators. Here’s a greedy example:

echo "aaa(bbb)aaa(bbb)aaa" | sed -E 's/\(.*\)/./'
aaa.aaa

The above matches a “(”, then anything, greedily, up to the last “)” on the line. Suppose we wanted to match only the first “(bbb)” and not “(bbb)aaa(bbb)”; then we could do the following:

echo "aaa(bbb)aaa(bbb)aaa" | sed -E 's/\([^)]*\)/./'
aaa.aaa(bbb)aaa

As before it matches a “(”, but then only characters that are not “)”, so the match has to stop at the first closing “)”.
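The same trick applies to any delimiter: exclude the closing character from the middle of the match. Here is a sketch with double quotes; the function names greedy and bounded and the sample sentence are made up for the example.

```shell
#!/bin/sh
# Greedy: .* runs to the last double quote on the line.
greedy() {
    sed 's/".*"/<str>/'
}

# Bounded: [^"]* cannot cross a double quote, so the match ends at
# the first closing quote.
bounded() {
    sed 's/"[^"]*"/<str>/'
}

echo 'say "hi" and "bye"' | greedy    # prints: say <str>
echo 'say "hi" and "bye"' | bounded   # prints: say <str> and "bye"
```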

Examples

A friend of mine recently asked me how to replace entries of the form “<email address>” with “<--removed-->” using sed. The following was the solution:

cat file.txt | sed -E 's/(<).*@.*(>)/\1--removed--\2/g'

It’s not a correct regexp for matching email addresses, but given that <, > and @ were known to be present it fit his scenario.

A stricter way of doing it would be the following:

cat file.txt | sed -E 's/(<)[[:alpha:][:digit:]._%+-]+@[[:alpha:][:digit:].-]+\.[[:alpha:]]{2,4}(>)/\1--removed--\2/g'
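To see where the two patterns differ, here is a sketch that runs both over a line containing one address and one unrelated pair of angle brackets. The function names redact_loose and redact_strict and the sample line are invented; the strict pattern is written without backslashes inside the brackets, since POSIX bracket expressions treat a backslash as a literal character.

```shell
#!/bin/sh
# Loose pattern: anything between a "<" and a ">" that contains "@".
redact_loose() {
    sed -E 's/(<).*@.*(>)/\1--removed--\2/g'
}

# Stricter pattern: only address-like text directly inside "<" and ">".
redact_strict() {
    sed -E 's/(<)[[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,4}(>)/\1--removed--\2/g'
}

line='Jane <jane@example.com> wrote to <the list>'
echo "$line" | redact_loose    # prints: Jane <--removed-->
echo "$line" | redact_strict   # prints: Jane <--removed--> wrote to <the list>
```

The loose pattern swallows everything up to the last “>” on the line, which is fine only when each line holds a single bracketed entry.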

Another useful thing is to escape spaces in filenames and use them with commands:

#!/bin/sh
# Escape every space in the first argument with a backslash.
DIR=`echo "$1" | sed 's/ /\\\\ /g'`
eval echo "Listing ${DIR}:"
eval ls -l ${DIR}

I used double-escaped backslashes because the final output should be, for instance, “/test\ one\ seven/foobar.txt” and not “/test one seven/foobar.txt”, so that the commands can interpret the paths correctly (using the eval command). Inside backticks the shell halves the backslashes, and sed then reads the remaining \\ as one literal backslash. The effect is that the script can be called with either a quoted or an escaped path, i.e. "/test one seven" or /test\ one\ seven.
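For comparison, here is a sketch of a different technique that reaches the same result without sed or eval: keeping the variable in double quotes so the shell never splits it on spaces. This is not the script above rewritten, just an alternative; the function name list_dir is made up.

```shell
#!/bin/sh
# Hypothetical alternative: quote the variable instead of escaping
# the spaces inside it.
list_dir() {
    dir=$1
    echo "Listing ${dir}:"
    # "--" keeps a path that starts with "-" from being read as an option.
    ls -l -- "$dir"
}

# Example call (the path is illustrative):
#   list_dir "/test one seven"
```

The escaping approach is still useful when a path has to survive an eval or be pasted into a generated command line.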

When creating a Linux distribution of a program that is required to run out of the box, and where the source code might not be available, it is often necessary to collect all the shared library dependencies of certain binary files. The following script will copy these to the destination directory: (Linux only!)

#!/bin/sh
# Usage: getshared.sh <binary> <dest directory>
BIN=`echo "$1" | sed 's/ /\\\\ /g'`
DST=`echo "$2" | sed 's/ /\\\\ /g'`
eval mkdir -p ${DST}
eval ldd ${BIN} | grep "=>" | sed 's/.*\s*=>\s*\(.*\)(.*)/\1/' | awk '{print $1};' | eval xargs -I{} cp -vuL {} ${DST}

Here the sed command extracts the path of each dependency from the ldd output using the \1 group. A sample line from ldd could be:

libc.so.6 => /usr/x86_64-linux-gnu/libc.so.6 (0x00007f6eb8c000)
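The extraction step can be tried on its own against that sample line. The sketch below is a variation, not the exact command from the script: it uses [[:space:]] instead of the GNU-only \s and prints only lines that match. The function name ldd_path is made up.

```shell
#!/bin/sh
# Hypothetical helper: print the path that follows "=>" on each line
# of ldd-style output, skipping lines that have no such path.
ldd_path() {
    sed -n 's/.*=>[[:space:]]*\([^[:space:]]*\)[[:space:]]*(.*).*/\1/p'
}

echo 'libc.so.6 => /usr/x86_64-linux-gnu/libc.so.6 (0x00007f6eb8c000)' | ldd_path
# prints: /usr/x86_64-linux-gnu/libc.so.6
```

Because the pattern is a basic regular expression, the unescaped “(” and “)” match the literal parentheses around the load address.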

Assume we have a file we want to modify in-place, such as an INI configuration file like this:

[Section]
enableStuff=false  
someVar=1337
enableOther = false
dontTouchThis="wicked"
oldSchool: false

And let’s say we wanted to replace all instances of “false” with “true”:

sed -i .bck -E 's/([[:alnum:]]+)[[:space:]]*([=:])[[:space:]]*false/\1\2true/' conf.ini

You will get the following result in conf.ini (along with the backup in conf.ini.bck):

[Section]
enableStuff=true  
someVar=1337
enableOther=true
dontTouchThis="wicked"
oldSchool:true

Notice that the whitespace around the delimiter is removed and that both “=” and “:” are allowed as the delimiter, because some implementations use “:”, however uncommon it may be.
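Before committing to an in-place edit, it can help to preview the result by running the same substitution as a plain filter. A minimal sketch, using [[:space:]] for the optional whitespace; the function name enable_all and the inline sample lines are invented for illustration.

```shell
#!/bin/sh
# Hypothetical filter: turn "key = false" or "key: false" into
# "key=true" or "key:true" without touching any file.
enable_all() {
    sed -E 's/([[:alnum:]]+)[[:space:]]*([=:])[[:space:]]*false/\1\2true/'
}

printf '%s\n' '[Section]' 'enableOther = false' 'someVar=1337' 'oldSchool: false' | enable_all
# prints:
# [Section]
# enableOther=true
# someVar=1337
# oldSchool:true
```

Piping conf.ini through the filter and comparing the output with the original shows exactly which lines the in-place run would change.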

That concludes my tips and tricks using sed.

Posted with : sed, BSD, GNU