www.maxpagani.org

Shell Programming and Regular Expressions.


While writing the shell script that produces this site, I stumbled on some intricacies of dealing with regular expressions. I wanted to extract from a document all the strings that match a given regular expression.

Suppose you have a file containing text. Inside this text some words are marked with leading and trailing special character sequences, such as "%!word%" or "%?word%". The goal is to write the simplest script that extracts all these words and discards the rest.

Sounds simple enough, doesn't it? So simple that there seems to be no obvious direct way to do that.

I had this problem in the past and I always solved it with sed. Since sed is line oriented, you match everything from the beginning of the line up to the start of the text you want and cut it away; the same goes for the end of the line. For the example above I used the following command:

sed -e "s/^[^%]\+%/%/g" -e "s/%[^%]*$/%/g"

The first expression cuts the leading characters, while the second cuts the trailing characters. As you can see, it is not crystal clear: the additional mess about the sequence not containing '%' is needed to keep the regular expression from matching too much. Had I specified "s/^.\+%//g", the line would have been cut from the beginning up to the last % character!

Unfortunately the line isn't just obscure, it is wrong, too. Suppose you have the following line:

foo %!bar% baz %!quux% foo

After processing it through the above sed statement, you would get:

%!bar% baz %!quux%

Which is not the desired output.
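The two effects described above, the greedy match and the leftover text between markers, can be reproduced with a small sketch. It assumes GNU sed (for `\+`); the function names are made up for illustration, and the second command is a variant that puts the `%` delimiters back so that its output is the line shown above.

```shell
# Sample line from the text.
line='foo %!bar% baz %!quux% foo'

# Greedy version: ".\+" runs up to the LAST '%', so nearly everything is cut.
greedy_cut() {
    printf '%s\n' "$1" | sed -e 's/^.\+%//'
}

# Illustrative variant of the two-expression command: it trims what lies
# outside the first and last '%' and keeps the delimiters themselves.
trim_outside_markers() {
    printf '%s\n' "$1" | sed -e 's/^[^%]\+%/%/' -e 's/%[^%]*$/%/'
}

greedy_cut "$line"            # prints " foo"
trim_outside_markers "$line"  # prints "%!bar% baz %!quux%"
```

Neither command can get rid of " baz ", because each expression only touches one end of the line.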

There is another Unix command used to search for regular expressions: grep. So I browsed the man pages to find the right combination of switches and command line flags to do what I wanted.

It turned out that with grep (at least in the GNU incarnation) you can use the '-o' flag to print just the matched string and not the whole line:

grep -o "%![^%]\+%"

This solution also proved inapplicable because, with the grep I had, only the first matching string on the line was printed. Running it on the example above I got:

%!bar%
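The result may depend on the grep version. The GNU grep manual describes -o as printing each matched part on a separate output line, so a recent grep may well print both words for the sample line. A quick way to check on your own system, assuming GNU grep (the function name grep_matches is only illustrative):

```shell
# Sample line from the text.
line='foo %!bar% baz %!quux% foo'

# -o prints only the matched parts of a matching line.
# "\+" in a basic regular expression is a GNU extension.
grep_matches() {
    printf '%s\n' "$1" | grep -o '%![^%]\+%'
}

grep_matches "$line"
```

If both markers come out, one per line, then grep alone already covers the extraction step on that system.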

I was on the verge of trying other approaches, such as awk or even writing my own C program, when I read the end of the sed man page:
SEE ALSO
awk(1), ed(1), grep(1), tr(1), perlre(1), sed.info, any of various
books on sed, the sed FAQ (http://sed.sf.net/grabbag/tutorials/sed-
faq.html), http://sed.sf.net/grabbag/.

I quickly pointed my browser to http://sed.sf.net/grabbag/ and started to look through the large collection of scripts and tutorials, only to find that none of them did what I wanted. As a last resort I went to the links section and tried, nearly at random, Yao-Jen Chang's sed page. This site is a very well organized collection of sed and perl scripts.

Looking in the 'Working on a string/extraction' section, I found what I was looking for: List every string which matches PAT in a file, one per line.

The script is neither a one-liner nor a multi-page monster. Here it is (a couple of typos fixed and my pattern used):

[ 1] s/%![^%]\+%/\n&\n/g
[ 2] /\n/!d
[ 3] s/.*/\n&\n/
[ 4] s/\n[^\n]*\n/\n/g
[ 5] s/^\n\(.*\)\n/\1/g

It works by splitting the line around every match, basically converting it into multiple lines. After line [3] the split line is a sequence of:

\n DontMatch \n Match \n DontMatch ... \n Match \n DontMatch \n

Line [4] extracts the matching parts by replacing each non-matching part with a single newline.
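To try the five steps on the sample line, they can be wrapped in a shell function. This is a sketch under two assumptions: GNU sed (for `\+`, for `\n` in the replacement and inside a bracket expression), and a `g` flag on the first substitution so that every match on the line is split off. The name list_matches is made up for illustration.

```shell
# Print every %!word% marker found in the input, one per line.
# Reads the files given as arguments, or standard input if there are none.
list_matches() {
    sed -e 's/%![^%]\+%/\n&\n/g' \
        -e '/\n/!d' \
        -e 's/.*/\n&\n/' \
        -e 's/\n[^\n]*\n/\n/g' \
        -e 's/^\n\(.*\)\n/\1/' \
        "$@"
}
# Step by step: [1] wrap each match in newlines, [2] drop lines without a match,
# [3] add a newline at both ends, [4] collapse every non-matching chunk
# into a single newline, [5] strip the outermost newlines.

printf '%s\n' 'foo %!bar% baz %!quux% foo' | list_matches
# prints:
# %!bar%
# %!quux%
```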

Good, but overkill and quite far from my idea of small and simple. Still, I got the enlightening message: split the line around the matching pattern.

For my script I used a combination of sed and grep. Sed splits the lines around each matching string, then grep picks out the lines I'm interested in. I'm not completely satisfied with the solution because it involves two commands where one should be enough. I also need a third pass to remove the extra "%!" and "%". For this purpose I used the bash string functions, in the following way:

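    fileRegExp="%![^%]\+%"
    # split each line around the markers, then keep only the marker lines
    local files=$(sed "s/$fileRegExp/\n&\n/g" "$1" | grep "$fileRegExp")
    local i
    for i in $files
    do
        local name
        # skip the leading "%!" and drop the trailing "%"
        name=${i:2:${#i}-3}
        # ... and so on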
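What the rest of the loop does is not shown, so the following is only a self-contained sketch of the same pipeline that prints each bare word. It assumes GNU sed and grep; the function name extract_names is hypothetical. To keep it runnable in plain sh as well as bash, it strips the delimiters with prefix/suffix removal instead of the substring form used above.

```shell
# Print the bare word of every %!word% marker in the file given as $1.
extract_names() {
    local fileRegExp='%![^%]\+%'
    local marker name
    # sed puts each marker on a line of its own; grep keeps only those lines.
    sed "s/$fileRegExp/\n&\n/g" "$1" | grep "$fileRegExp" |
    while IFS= read -r marker; do
        name=${marker#%!}   # remove the leading "%!"
        name=${name%\%}     # remove the trailing "%"
        printf '%s\n' "$name"
    done
}
```

Called as `extract_names page.txt` (the file name is just an example), it would print one word per line; a file containing the sample line gives "bar" and "quux".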

Conclusions

I'm still not convinced that there is no simple and straightforward way to print matching strings. Anyway, this problem suggested two winning strategies:
- read your manual up to the last line,
- try to transform the problem into something you can solve.

But beware, the latter leads to "If all you have is a hammer, everything looks like a nail" :-).

Massimiliano Pagani


This site and its content is (C) by Massimiliano Pagani
Last modified 2008/05/12 13:52:28