Regular Expressions (RE) aka ‘Pattern Matching’



Note: Incomplete. The information might be wrong in some places. I copied and pasted it from an old file I wrote years ago while learning regular expressions.

(C) Prashant N Mhatre

This discussion doesn’t come anywhere close to explaining all of the flavors of regular expressions, but we’ll hit the high points. I feel it’s worth investing some time to practice REs. You’ll be amazed to see their uses and results in different programs/tools.

(Search, Find and Replace, Delete, Extract Particular Fields, Parsing, String Comparisons, Compiler Language Grammars, etc.)

I haven’t concentrated on theoretical aspects like lazy quantifiers, etc.

Just practical examples. No matter what your role is, I feel you MUST understand and use REs to reduce tedious and time-consuming text editing. REs are extensively used in awk, sed, grep, lex, perl, python, ruby, c#, javascript, tcl, vi and its clones, etc.


^

beginning of line

$

end of line

.

any single character EXCEPT A NEWLINE

Quantifiers: They influence quantity

?

0 or 1 occurrence of the preceding item

+

1 or more occurrences of the preceding item

*

0 or more occurrences of the preceding item

Interval Quantifier:

{min,max} range; an omitted minimum means zero, an omitted maximum means infinity

{2}

exact 2 matches

{2,}

at least two matches(or more)

{,5}

at least 0, at most 5 occurrences (some flavors need the explicit form {0,5})

{2,5}

2, 3, 4 or 5 occurrences

colou?r will match ‘color’ or ‘colour’
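To see the quantifiers side by side, here is a small sketch using Python's re module (my choice for illustration; the syntax for these quantifiers is the same in most Perl-style flavors). The word list and the helper name full_matches are made up for the example.

```python
import re

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# Hypothetical sample words, chosen only to vary the number of 'u' characters.
words = ["color", "colour", "colouur"]

optional_u = full_matches(r"colou?r", words)        # ?     : zero or one 'u'
one_or_more_u = full_matches(r"colou+r", words)     # +     : one or more 'u'
zero_or_more_u = full_matches(r"colou*r", words)    # *     : zero or more 'u'
exactly_two_u = full_matches(r"colou{2}r", words)   # {2}   : exactly two 'u'
one_to_two_u = full_matches(r"colou{1,2}r", words)  # {1,2} : one or two 'u'

print(optional_u)      # ['color', 'colour']
print(one_or_more_u)   # ['colour', 'colouur']
print(zero_or_more_u)  # ['color', 'colour', 'colouur']
print(exactly_two_u)   # ['colouur']
print(one_to_two_u)    # ['colour', 'colouur']
```

Note that a quantifier applies only to the item right before it (here the single letter u), not to the whole word.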

————-

Word Boundaries:

\< \>

\b \b

sed 's/\<[0-9][0-9]*//g' deletes the leading digits of words that start with digits

and (with no boundary markers) will match anything having ‘and’

and, andrew, sand, command, demanding

\band\b will match exactly ‘and’, not ‘andrew’, ‘sand’, ‘command’

\band anything beginning with ‘and’

will match and, andrew not sand, command, demanding

and\b anything ending with and

will match and, sand, command, not andrew, demanding
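The word boundary cases above can be checked with a short sketch. It uses Python's re module, which supports \b but not the \< \> form, so only \b is shown. The helper name containing is hypothetical.

```python
import re

# The same five words used in the examples above.
words = ["and", "andrew", "sand", "command", "demanding"]

def containing(pattern):
    """Return the words in which the pattern is found somewhere."""
    return [w for w in words if re.search(pattern, w)]

anywhere = containing(r"and")        # no boundaries: 'and' anywhere in the word
whole_word = containing(r"\band\b")  # boundary on both sides: the word 'and' only
at_start = containing(r"\band")      # boundary before: words beginning with 'and'
at_end = containing(r"and\b")        # boundary after: words ending with 'and'

print(anywhere)    # ['and', 'andrew', 'sand', 'command', 'demanding']
print(whole_word)  # ['and']
print(at_start)    # ['and', 'andrew']
print(at_end)      # ['and', 'sand', 'command']
```

The raw string prefix (r"...") matters in Python: without it, "\b" would be read as a backspace character instead of a word boundary.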

\ escape character

————-

[] character class - match any character among the group

a-z range

| Alternation (or)

Metacharacters have different meanings inside and outside a character class.

[0-5]

Range: ‘0’, ‘1’, ‘2’, ‘3’, ‘4’, ‘5’

[-05]

‘-’ or ‘0’ or ‘5’

[^0-5]

any character not in ‘0’, ‘1’, ‘2’, ‘3’, ‘4’, ‘5’

[a.z]

‘a’, or ‘.’ or ‘z’

[A-Z]

any character among all 26 uppercase letters


c[au]t

matches ‘cat’ or ‘cut’

cat|cut

matches ‘cat’ or ‘cut’

c(a|u)t

matches ‘cat’ or ‘cut’

c[a|u]t

matches ‘cat’, ‘c|t’ or ‘cut’ (inside a character class, | is a literal character)

[pP]rashant

matches ‘prashant’ or ‘Prashant’

(Prashant|Narayan)Mhatre

matches ‘PrashantMhatre’ or ‘NarayanMhatre’

(a|[bc])d

matches ‘ad’ or ‘bd’ or ‘cd’
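The character class and alternation examples above can be compared in a quick sketch with Python's re module (an illustration only; the candidate strings and the helper name full_matches are my own).

```python
import re

def full_matches(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

# Hypothetical candidates: two expected hits, one with a literal '|', one miss.
candidates = ["cat", "cut", "c|t", "cot"]

with_class = full_matches(r"c[au]t", candidates)     # class: 'a' or 'u'
with_alt = full_matches(r"cat|cut", candidates)      # alternation of whole words
with_group = full_matches(r"c(a|u)t", candidates)    # alternation inside a group
pipe_in_class = full_matches(r"c[a|u]t", candidates) # '|' is literal inside [ ]

mixed = full_matches(r"(a|[bc])d", ["ad", "bd", "cd", "dd"])

print(with_class)     # ['cat', 'cut']
print(with_alt)       # ['cat', 'cut']
print(with_group)     # ['cat', 'cut']
print(pipe_in_class)  # ['cat', 'cut', 'c|t']
print(mixed)          # ['ad', 'bd', 'cd']
```

The only pattern that accepts ‘c|t’ is the one where the pipe sits inside the brackets, which is the "different meaning inside a character class" point in action.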


If you understood the above examples, I guess the concept is clear to you. There are different ways to get the desired results.

[abc],[a-c],(a|b|c) are effectively the same but this CANNOT BE GENERALISED. (don’t get any shortcut ideas, this is just a coincidence)


Why Character Classes?

(1) Easy to specify ranges: [a-z] covers all 26 lowercase letters without writing them individually

(2) Easy translations: tr/A-Z/a-z/ (Perl) converts uppercase to lowercase

(3) Compact representation: [pnm] instead of (p|n|m)

^[A-Z].*

Any string beginning with an UPPERCASE letter

^[^a-c].*

Any string beginning with a character other than ‘a’, ‘b’, or ‘c’

^[A-Z]{2}.*

Any string beginning with 2 UPPERCASE letters
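A rough sketch of these three "beginning with" patterns, using Python's re module. It relies on re.match, which only tries the pattern at the start of the string, so it behaves as if the pattern were anchored with ^. The sample strings and the helper name starting_with are hypothetical.

```python
import re

# Hypothetical sample strings.
samples = ["Prashant", "prashant", "PNmhatre", "cat"]

def starting_with(pattern):
    """Return the samples that the pattern matches at position 0."""
    # re.match never scans forward, so the pattern must match at the start.
    return [s for s in samples if re.match(pattern, s)]

upper_first = starting_with(r"[A-Z].*")     # first character is uppercase
not_a_to_c = starting_with(r"[^a-c].*")     # first character is not a, b or c
two_upper = starting_with(r"[A-Z]{2}.*")    # first two characters are uppercase

print(upper_first)  # ['Prashant', 'PNmhatre']
print(not_a_to_c)   # ['Prashant', 'prashant', 'PNmhatre']
print(two_upper)    # ['PNmhatre']
```

With a tool that searches anywhere in the line (grep, or re.search in Python), the same patterns would need a leading ^ to get this behaviour.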


A few more examples to firm up the concept in your mind.

/[0-9]+/

line contains at least 1 digit ANYWHERE

/^[0-9]+/

line begins with at least 1 digit

/[0-9]+$/

line contains at least 1 digit at the END

/^[0-9]+$/

entire line contains ONLY DIGITS
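The effect of the ^ and $ anchors on the same digit pattern can be sketched like this in Python's re module (the sample lines and the helper name lines_matching are invented for the example).

```python
import re

# Hypothetical sample lines with digits in different positions.
lines = ["abc123def", "123abc", "abc123", "123", "abc"]

def lines_matching(pattern):
    """Return the lines in which the pattern is found."""
    return [ln for ln in lines if re.search(pattern, ln)]

anywhere = lines_matching(r"[0-9]+")       # digits anywhere in the line
at_start = lines_matching(r"^[0-9]+")      # line begins with digits
at_end = lines_matching(r"[0-9]+$")        # line ends with digits
only_digits = lines_matching(r"^[0-9]+$")  # nothing but digits

print(anywhere)     # ['abc123def', '123abc', 'abc123', '123']
print(at_start)     # ['123abc', '123']
print(at_end)       # ['abc123', '123']
print(only_digits)  # ['123']
```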

Whitespace: space (ascii 32), tab (\t), new line (\n), carriage return (\r)

Search and Replace: (Very useful in sed)


s/pnm/Prashant/

will replace the FIRST occurrence of pnm with Prashant

s/pnm/Prashant/g

will replace ALL occurrences of pnm with Prashant
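For comparison, a minimal sketch of first-versus-all replacement in Python. Here re.sub replaces every occurrence by default, and its count argument plays the role of leaving off sed's g flag. The sample sentence is made up.

```python
import re

# Hypothetical sample text containing 'pnm' twice.
text = "pnm wrote this; ask pnm"

# count=1 stops after the first replacement, like s/pnm/Prashant/ in sed.
first_only = re.sub(r"pnm", "Prashant", text, count=1)

# With no count, every occurrence is replaced, like s/pnm/Prashant/g in sed.
every_one = re.sub(r"pnm", "Prashant", text)

print(first_only)  # Prashant wrote this; ask pnm
print(every_one)   # Prashant wrote this; ask Prashant
```

One difference to keep in mind: sed applies the command to each line separately, while re.sub works on whatever string it is given.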

s/^[ \t]*//

Delete spaces and tabs in front of each line (ltrim)

s/[ \t]*$//

Delete spaces and tabs at the end of each line (rtrim)

s/^[ \t]*//;s/[ \t]*$//

Delete leading and trailing spaces (trim)
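The three trimming patterns can be tried out with Python's re.sub as a sketch (in everyday Python code the built-in str.strip methods would do the same job; the regex version is shown only to mirror the sed commands). The sample line is hypothetical.

```python
import re

# Hypothetical line with spaces and tabs on both sides.
line = " \t hello world \t "

ltrim = re.sub(r"^[ \t]*", "", line)  # remove leading spaces and tabs
rtrim = re.sub(r"[ \t]*$", "", line)  # remove trailing spaces and tabs

# Apply both substitutions one after the other for a full trim.
trim = re.sub(r"[ \t]*$", "", re.sub(r"^[ \t]*", "", line))

print(repr(ltrim))  # 'hello world \t '
print(repr(rtrim))  # ' \t hello world'
print(repr(trim))   # 'hello world'
```

The space between ‘hello’ and ‘world’ is untouched because both patterns are anchored to an end of the line.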

/^prashant$/

Line contains only ‘prashant’, nothing else

/^prashant$/i

Same as above, case insensitive match

/^$/d

Delete blank lines (sed)

/^a/d

Delete lines starting with character ‘a’ (sed)

/^.{5}ant/

Matches if ‘ant’ starts at the 6th character of the line
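A small sketch of the last few ideas in Python's re module: exact and case-insensitive line matches, dropping blank lines or lines starting with ‘a’, and matching at a fixed column. The flag re.IGNORECASE stands in for the trailing i, and filtering a list stands in for sed's delete command. All sample data is invented.

```python
import re

# Exact match of the whole line, with and without case sensitivity.
exact = re.search(r"^prashant$", "prashant") is not None
wrong_case = re.search(r"^prashant$", "Prashant") is not None
ignore_case = re.search(r"^prashant$", "Prashant", re.IGNORECASE) is not None

# Hypothetical lines; keeping the non-matching ones mimics deleting matches.
lines = ["apple", "", "banana", "avocado"]
no_blank = [ln for ln in lines if not re.search(r"^$", ln)]
no_leading_a = [ln for ln in lines if not re.search(r"^a", ln)]

# 'ant' must start at the 6th character: five arbitrary characters come first.
sixth_hit = re.search(r"^.{5}ant", "Prashant") is not None
sixth_miss = re.search(r"^.{5}ant", "antelope") is not None

print(exact, wrong_case, ignore_case)  # True False True
print(no_blank)                        # ['apple', 'banana', 'avocado']
print(no_leading_a)                    # ['', 'banana']
print(sixth_hit, sixth_miss)           # True False
```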

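Back referencing / tagging: Allows us to group and reuse matches

( ) In extended and Perl-style REs: groups text (e.g. for alternation, (Prashant|Mhatre)) and captures the result

\( \) In basic REs (sed and grep by default): groups text and captures the result for reuse as \1, \2, ...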

File: test_file

A|-|Prashant Mhatre|Washington & Tokyo</FONT>

sed 's/^A|.|\(.*\)|.*/\1/' test_file

Prashant Mhatre

sed 's/^.*|\(.*\)|\(.*\)<\/FONT>/\1 \2/' test_file

Prashant Mhatre Washington & Tokyo
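The same two extractions can be sketched with Python's re.sub. The syntax is flipped compared to sed's basic REs: in Python, plain ( ) captures and a literal pipe must be written as \|. This is my translation of the sed commands, not a replacement for them.

```python
import re

# The single line from test_file above.
line = "A|-|Prashant Mhatre|Washington & Tokyo</FONT>"

# Capture the third field; \1 in the replacement refers to the first group.
name_only = re.sub(r"^A\|.\|(.*)\|.*", r"\1", line)

# Capture the last two fields and drop the trailing </FONT> tag.
name_and_city = re.sub(r"^.*\|(.*)\|(.*)</FONT>", r"\1 \2", line)

print(name_only)      # Prashant Mhatre
print(name_and_city)  # Prashant Mhatre Washington & Tokyo
```

Both patterns depend on .* being greedy: the leading .* swallows as many fields as it can while still leaving enough pipes for the rest of the pattern to match.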

Perl specific

\d

any digit

\w

any word character (letter, digit or underscore)

\s

any whitespace character

\D

= [^\d] any nondigit character

\W

= [^\w] any nonword character

\S

= [^\s] any nonwhitespace character

[\d\D]

Any digit or any non-digit, i.e. match anything (including newline)

[^\d\D]

anything that is not digit, not non-digit, hence nothing
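These shorthand classes also exist in Python's re module, so here is a quick sketch with it. One assumption to flag: on Python text strings \d and \w also accept non-ASCII digits and letters, which may differ from what your Perl version does by default. The sample string is made up.

```python
import re

# Hypothetical sample text ending in a newline.
sample = "Room 42, Floor 7\n"

digits = re.findall(r"\d+", sample)      # runs of digits
words = re.findall(r"\w+", sample)       # runs of word characters
spaces = re.findall(r"\s", sample)       # single whitespace characters
non_digits = re.findall(r"\D+", sample)  # runs of non-digit characters

# '.' skips the newline by default, while [\d\D] matches every character.
dot_chars = re.findall(r".", "a\nb")
any_chars = re.findall(r"[\d\D]", "a\nb")
nothing = re.findall(r"[^\d\D]", "a\nb")

print(digits)      # ['42', '7']
print(words)       # ['Room', '42', 'Floor', '7']
print(spaces)      # [' ', ' ', ' ', '\n']
print(non_digits)  # ['Room ', ', Floor ', '\n']
print(dot_chars)   # ['a', 'b']
print(any_chars)   # ['a', '\n', 'b']
print(nothing)     # []
```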

