Note: Incomplete. Information might be wrong in some places. I copied and pasted this from an old file I wrote years back while learning regular expressions.
(C) Prashant N Mhatre
I haven’t concentrated on theoretical aspects like lazy quantifiers etc.
Just practical examples. No matter what your role is, I feel you MUST understand and use REs to reduce tedious and time-consuming text editing. REs are used extensively in awk, sed, grep, lex, perl, python, ruby, c#, javascript, tcl, vi and its clones etc.
Pattern    Meaning
^          beginning of line
$          end of line
.          any single character EXCEPT A NEWLINE
Quantifiers: they control how many times the preceding item may repeat

?          0 or 1 occurrence
+          1 or more occurrences
*          0 or more occurrences

Interval quantifier: {min,max}, where a missing minimum means zero and a missing maximum means no upper limit

{2}        exactly 2 occurrences
{2,}       at least 2 occurrences
{,5}       at least 0, at most 5 occurrences (some tools need this written as {0,5})
{2,5}      2, 3, 4 or 5 occurrences
colou?r will match ‘color’ or ‘colour’
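The quantifiers above can be tried out with a minimal Python sketch. It assumes Python's `re` module; the helper `full_match` and the sample words other than color/colour are made up for illustration.

```python
import re

def full_match(pattern, text):
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

# ? : the preceding item is optional (0 or 1 times)
optional_u = [w for w in ["color", "colour", "colouur"] if full_match(r"colou?r", w)]

# + : one or more times, * : zero or more times
one_or_more = [w for w in ["ct", "cat", "caaat"] if full_match(r"ca+t", w)]
zero_or_more = [w for w in ["ct", "cat", "caaat"] if full_match(r"ca*t", w)]

# {min,max} : interval quantifier
two_to_five = [s for s in ["a", "aa", "aaaaa", "aaaaaa"] if full_match(r"a{2,5}", s)]

print(optional_u)    # ['color', 'colour']
print(one_or_more)   # ['cat', 'caaat']
print(zero_or_more)  # ['ct', 'cat', 'caaat']
print(two_to_five)   # ['aa', 'aaaaa']
```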
————-
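Word Boundaries:
\< \>      start of word, end of word (GNU grep, GNU sed, vi)
\b \b      word boundary at either end of a word (perl, python and most newer tools)
sed 's/\<[0-9]*//g' deletes the leading digits from every word that starts with digits
/and/ will match anything containing ‘and’
and, andrew, sand, command, demanding
\band\b will match exactly the word ‘and’, not ‘andrew’, ‘sand’, ‘command’
\band will match anything beginning with ‘and’
and, andrew but not sand, command, demanding
and\b will match anything ending with ‘and’
and, sand, command but not andrew, demanding
\ escape character: removes the special meaning of the character that follows it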
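The word-boundary cases above can be checked with a small Python sketch. Python's `re` module has no `\<` and `\>`, so the sketch only uses `\b`; the helper name `matching` is an assumption made for the example.

```python
import re

words = ["and", "andrew", "sand", "command", "demanding"]

def matching(pattern):
    """Return the words in which the pattern is found anywhere."""
    return [w for w in words if re.search(pattern, w)]

contains_and = matching(r"and")      # no boundaries: every word qualifies
exact_and = matching(r"\band\b")     # boundary on both sides
starts_with = matching(r"\band")     # boundary only in front
ends_with = matching(r"and\b")       # boundary only behind

print(contains_and)  # ['and', 'andrew', 'sand', 'command', 'demanding']
print(exact_and)     # ['and']
print(starts_with)   # ['and', 'andrew']
print(ends_with)     # ['and', 'sand', 'command']
```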
————-
[]         character class - matches any one character from the group
a-z        range
|          alternation (or)

Pattern    Matches
[0-5]      range: ‘0’, ‘1’, ‘2’, ‘3’, ‘4’, ‘5’
[-05]      ‘-’ or ‘0’ or ‘5’
[^0-5]     any character not in ‘0’, ‘1’, ‘2’, ‘3’, ‘4’, ‘5’
[a.z]      ‘a’ or ‘.’ or ‘z’
[A-Z]      any one of the 26 uppercase letters

Pattern                      Matches
c[au]t                       ‘cat’ or ‘cut’
cat|cut                      ‘cat’ or ‘cut’
c(a|u)t                      ‘cat’ or ‘cut’
c[a|u]t                      ‘cat’, ‘c|t’ or ‘cut’
[pP]rashant                  ‘prashant’ or ‘Prashant’
(Prashant|Narayan)Mhatre     ‘PrashantMhatre’ or ‘NarayanMhatre’
(a|[bc])d                    ‘ad’ or ‘bd’ or ‘cd’
If you understood the above examples, I guess the concept is clear to you. There are different ways to get the desired results.
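To see that the different spellings really give the same result, here is a minimal Python sketch. It assumes Python's `re` module; the candidate list and the helper `full_matches` are invented for the example.

```python
import re

candidates = ["cat", "cut", "c|t", "cot"]

def full_matches(pattern):
    """Return the candidates that the pattern matches from start to end."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

by_class = full_matches(r"c[au]t")        # character class
by_alternation = full_matches(r"cat|cut") # plain alternation
by_group = full_matches(r"c(a|u)t")       # alternation inside a group
pipe_in_class = full_matches(r"c[a|u]t")  # inside [], | is an ordinary character

print(by_class)        # ['cat', 'cut']
print(by_alternation)  # ['cat', 'cut']
print(by_group)        # ['cat', 'cut']
print(pipe_in_class)   # ['cat', 'cut', 'c|t']

# A leading - and a . are literal inside a class; ^ at the front negates it.
print(re.findall(r"[-05]", "10-5"))   # ['0', '-', '5']
print(re.findall(r"[a.z]", "a.b.z"))  # ['a', '.', '.', 'z']
print(re.findall(r"[^0-5]", "0719"))  # ['7', '9']
```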
Why Character Classes?
(2) Easy translations tr/A-Z/a-z/
(3) Compact representation [pnm] instead of (p|n|m)
Pattern         Matches
^[A-Z].*        any string beginning with an UPPERCASE letter
^[^a-c].*       any string not beginning with ‘a’, ‘b’ or ‘c’
^[A-Z]{2}.*     any string beginning with 2 UPPERCASE letters
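A search tool looks for a match anywhere in the line, so a pattern such as `[A-Z]{2}.*` describes the beginning of a string only when it is tied to the start with `^`. A short Python sketch of the difference, assuming Python's `re` module and made-up sample strings:

```python
import re

pattern = r"[A-Z]{2}.*"
text = "abCDef"

unanchored = re.search(pattern, text)      # free to start anywhere in the text
anchored = re.search("^" + pattern, text)  # has to start at the first character

print(unanchored.group())  # CDef
print(anchored)            # None

not_abc_ok = re.search(r"^[^a-c].*", "delta")   # 'd' is outside a-c
not_abc_fail = re.search(r"^[^a-c].*", "alpha") # 'a' is inside a-c

print(not_abc_ok.group())  # delta
print(not_abc_fail)        # None
```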
A few more examples to firm up the concept in your mind.

Pattern        Meaning
/[0-9]+/       line contains at least 1 digit ANYWHERE
/^[0-9]+/      line has at least 1 digit at the BEGINNING
/[0-9]+$/      line has at least 1 digit at the END
/^[0-9]+$/     entire line contains ONLY DIGITS
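The four digit patterns above can be compared side by side with a small Python sketch. It assumes Python's `re` module; the sample lines and the helper `select` are invented for the example.

```python
import re

lines = ["abc123def", "123abc", "abc123", "123", "abc"]

def select(pattern):
    """Return the lines in which the pattern is found."""
    return [ln for ln in lines if re.search(pattern, ln)]

anywhere = select(r"[0-9]+")      # digits somewhere in the line
at_start = select(r"^[0-9]+")     # digits at the beginning
at_end = select(r"[0-9]+$")       # digits at the end
only_digits = select(r"^[0-9]+$") # nothing but digits

print(anywhere)     # ['abc123def', '123abc', 'abc123', '123']
print(at_start)     # ['123abc', '123']
print(at_end)       # ['abc123', '123']
print(only_digits)  # ['123']
```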
Whitespace: space (ASCII 32), tab (\t), newline (\n), carriage return (\r)
Search and Replace: (Very useful in sed)
Command                      Effect
s/pnm/Prashant/              replaces the FIRST occurrence of pnm on each line with Prashant
s/pnm/Prashant/g             replaces ALL occurrences of pnm with Prashant
s/^[ \t]*//                  deletes spaces and tabs at the front of each line (ltrim)
s/[ \t]*$//                  deletes spaces and tabs at the end of each line (rtrim)
s/^[ \t]*//;s/[ \t]*$//      deletes leading and trailing spaces and tabs (trim)
/^prashant$/                 line contains only ‘prashant’, nothing else
/^prashant$/i                same as above, case-insensitive match
/^$/d                        deletes blank lines
/^a/d                        deletes lines starting with the character ‘a’
/^.{5}ant/                   line has ‘ant’ starting at the 6th character
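For readers who want to experiment outside sed, the same substitutions can be sketched in Python with `re.sub`. This is only an approximation of the sed commands, not a replacement for them; the sample strings are made up, and the patterns `^$` and `^a` used for the two delete cases are assumptions of this sketch.

```python
import re

line = "pnm met pnm"
first_only = re.sub(r"pnm", "Prashant", line, count=1)  # like s/pnm/Prashant/
all_hits = re.sub(r"pnm", "Prashant", line)             # like s/pnm/Prashant/g

padded = " \t hello world \t "
ltrim = re.sub(r"^[ \t]*", "", padded)  # strip leading spaces and tabs
rtrim = re.sub(r"[ \t]*$", "", padded)  # strip trailing spaces and tabs
trim = re.sub(r"[ \t]*$", "", re.sub(r"^[ \t]*", "", padded))  # both, one after the other

text_lines = ["apple", "", "banana", "avocado", "prashant", "Prashant"]
no_blank = [ln for ln in text_lines if not re.search(r"^$", ln)]  # drop blank lines
no_a = [ln for ln in text_lines if not re.search(r"^a", ln)]      # drop lines starting with 'a'
exact_ci = [ln for ln in text_lines if re.search(r"^prashant$", ln, re.IGNORECASE)]

ant_at_6 = re.search(r"^.{5}ant", "elephant") is not None  # 'ant' begins at character 6

print(first_only)  # Prashant met pnm
print(all_hits)    # Prashant met Prashant
print(repr(trim))  # 'hello world'
print(exact_ci)    # ['prashant', 'Prashant']
print(ant_at_6)    # True
```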
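Back referencing / tagging: allows us to group and reuse matches
\( \) groups text and captures the result in basic REs (sed, grep)
( ) does the same in extended and Perl-style REs (egrep, awk, perl, python); it also limits the scope of alternation, as in (Prashant|Mhatre)
\1, \2, ... refer back to the text captured by the first, second, ... group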
File: test_file
A|-|Prashant Mhatre|Washington & Tokyo
sed 's/^A|.|\(.*\)|.*/\1/' test_file
Prashant Mhatre
Prashant Mhatre Washington &
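The same capture-and-reuse idea can be sketched in Python, assuming the `re` module. Two differences from the sed command: Python writes the group as plain `( )`, and a literal `|` has to be escaped as `\|`. The second substitution, which swaps two captured words, is an extra illustration and not part of the original notes.

```python
import re

record = "A|-|Prashant Mhatre|Washington & Tokyo"

# Capture the third field and replace the whole line with it.
name = re.sub(r"^A\|.\|(.*)\|.*", r"\1", record)
print(name)  # Prashant Mhatre

# Two groups can be reused in a different order in the replacement.
swapped = re.sub(r"^(\w+) (\w+)$", r"\2, \1", name)
print(swapped)  # Mhatre, Prashant
```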
Perl specific

Pattern    Meaning
\d         any digit
\w         any word character (letter, digit or underscore)
\s         any whitespace character
\D         = [^\d] any non-digit character
\W         = [^\w] any non-word character
\S         = [^\s] any non-whitespace character
[\d\D]     any digit or any non-digit, i.e. matches anything (including a newline)
[^\d\D]    anything that is neither a digit nor a non-digit, hence nothing
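These shorthand classes originated in Perl, but Python's `re` module accepts them as well, so they can be tried with a minimal sketch. The sample string is made up for the example.

```python
import re

sample = "id 42\nok_1"

digits = re.findall(r"\d", sample)      # single digits
words = re.findall(r"\w+", sample)      # runs of letters, digits and underscores
whitespace = re.findall(r"\s", sample)  # the space and the newline
non_word = re.findall(r"\W", sample)    # everything \w does not cover

# . stops at a newline, [\d\D] does not
dot_run = re.match(r".*", sample).group()
any_run = re.match(r"[\d\D]*", sample).group()

# [^\d\D] excludes digits and non-digits at once, so it can never match
never = re.search(r"[^\d\D]", sample)

print(digits)         # ['4', '2', '1']
print(words)          # ['id', '42', 'ok_1']
print(whitespace)     # [' ', '\n']
print(repr(dot_run))  # 'id 42'
print(any_run == sample)  # True
print(never)          # None
```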