Atoms in Regular Expressions

Posted in Perl on December 14, 2008 by vinaychilakamarri

I read about something interesting today. There is a term called ‘atom’ that comes up when you are dealing with regular expressions. An atom is the smallest unit a quantifier applies to: a single character, a character class, or a parenthesized group. How you formulate a regular expression matters, because some expressions end up matching something you did not anticipate. The importance of atoms becomes vivid once quantifiers are involved. Consider this example: ha{6} and (ha){6}. Although they look very similar, the parentheses in the second one treat ‘ha’ as a single atom, so it matches only where ‘ha’ occurs 6 times in a row. The first one applies the quantifier to ‘a’ alone, so it matches only an ‘h’ followed by 6 a’s.

Let’s say I wanted to match a word that starts with ‘pre’ and ends with ‘tion’.
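
Before working through that example, the ha{6} versus (ha){6} difference can be checked directly. This is only a small sketch using Python's re module (my choice for illustration; both patterns are written the same way in Perl), and the sample strings are made up for the demonstration.

```python
import re

# Without parentheses the quantifier binds only to the 'a' atom:
# an 'h' followed by exactly six 'a' characters.
single_char_atom = re.compile(r"ha{6}")

# With parentheses, 'ha' is one atom, repeated six times in a row.
grouped_atom = re.compile(r"(ha){6}")

samples = ["haaaaaa", "hahahahahaha", "hahaha"]  # made-up test strings

# For each sample, record (matches ha{6}, matches (ha){6}) on the whole string.
results = {
    s: (bool(single_char_atom.fullmatch(s)), bool(grouped_atom.fullmatch(s)))
    for s in samples
}

for s, (plain, grouped) in results.items():
    print(f"{s!r}: ha{{6}} -> {plain}, (ha){{6}} -> {grouped}")
```

Only ‘haaaaaa’ satisfies the first pattern and only ‘hahahahahaha’ satisfies the second; ‘hahaha’ satisfies neither. With that settled, back to the word that starts with ‘pre’ and ends with ‘tion’.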

These are some of the possibilities that I should be expecting:
premonition
predilection
prediction
etc.

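So, one possible expression is:

^(pre)+([a-z]*|[A-Z]*|\s)*(ion)+$

This matches any word that begins with the atom “pre”, which must occur at least once and may repeat right after the beginning. In the middle there can be 0 or more characters: each pass through the group takes a run of lowercase letters ([a-z]*), a run of uppercase letters ([A-Z]*), or a single whitespace character, and the outer * lets the group repeat so that the three kinds can be mixed in any order. Note that this does not make the match case-insensitive; it just allows both cases in the middle, while “pre” and “ion” must still be lowercase. Finally, the atom “ion” must occur at least once, and the concluding $ anchors it to the end of the string, which covers words ending in ‘tion’. Since the group is already repeated by the outer *, the inner quantifiers are redundant, and the middle part could be written more simply as a single character class covering a-z, A-Z and whitespace, followed by *.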
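
To see what this expression accepts in practice, here is a rough sketch in Python's re module (used only for illustration; the pattern is written as a raw string, so whitespace is \s and the end anchor is a plain $). The `simplified` pattern and the sample words beyond the three listed earlier are my own assumptions, not part of the original expression.

```python
import re

# The expression from the post, written as a Python raw string.
original = re.compile(r"^(pre)+([a-z]*|[A-Z]*|\s)*(ion)+$")

# Hypothetical simplification: one character class in place of the
# nested alternation. It should accept the same strings.
simplified = re.compile(r"^(pre)+[a-zA-Z\s]*(ion)+$")

# Sample inputs, kept short on purpose; the last three are made up.
words = ["premonition", "predilection", "prediction",
         "preDICTion", "Prediction", "prevent"]

matches = {w: bool(original.search(w)) for w in words}

# Check that the simplified pattern gives the same answer for every sample.
same_answers = all(
    bool(simplified.search(w)) == matches[w] for w in words
)

for w, ok in matches.items():
    print(f"{w}: {'match' if ok else 'no match'}")
print("simplified pattern agrees:", same_answers)
```

‘Prediction’ with a capital P is rejected, because nothing in the pattern ignores case for the leading “pre”; ‘preDICTion’ is accepted, because uppercase letters are allowed in the middle.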

Example 2:

^?[0-9]+(ord)+$?

This says that the atom ‘ord’ has to be preceded by at least one digit. The + quantifier after [0-9] means one or more digits, and the + after (ord) means ‘ord’ itself can repeat. The ^? means the digits need not start at the beginning of the string, and the $? means the string need not end right after the atom ‘ord’. An optional anchor places no constraint at all, though, and some regex engines reject a quantifier on an anchor, so the clearer way to say the same thing is to leave the anchors out: [0-9]+(ord)+
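
As a quick check of Example 2, here is a sketch in Python's re module (again only for illustration). It assumes the optional anchors can simply be dropped, since an anchor that may or may not apply does not restrict where the match sits; the sample strings are made up.

```python
import re

# Example 2 with the optional anchors left out; this is assumed to
# express the same intent: one or more digits, then one or more 'ord'.
pattern = re.compile(r"[0-9]+(ord)+")

# Made-up sample inputs.
samples = ["42ord", "item 7ordord shipped", "ord", "12 ord"]

found = {}
for s in samples:
    m = pattern.search(s)  # search() looks for a match anywhere in the string
    found[s] = m.group(0) if m else None
    print(f"{s!r} -> {found[s]}")
```

The first two strings match (‘42ord’ and ‘7ordord’), while ‘ord’ on its own and ‘12 ord’ do not, because the digits must come directly before the atom.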