regexp

What's gonna happen then?

The shortened regexp alphabet (~5 minutes)
Real-world practice (~55 minutes .. or it depends)

Regular expressions in everyday tasks

Text searching
Text validating
Text filtering
Text extracting
Text replacing
Text formatting
Text cleaning
Text playing
Text ...

Regular expressions in everything

Almost every IDE + powerful editor (Eclipse, Brackets, PhpStorm, Atom, Vim, NetBeans, Notepad++, Emacs, gedit ...)
Apache .htaccess
Linux sed, bash scripts, sh
Windows PowerShell
...

Regular expressions for developers

Perl, PHP
C, C++
.NET
Java
Python
Ruby
MySQL
JavaScript
...

Dive into

Let's take a look at the JavaScript RegExp and String objects first ..

Season 1

Regexp alphabet

Trimming whitespace

                        
// Stripping whitespace 
//   from the beginning 
//   and the end of the string 
var s = '  no-free-space Offspring Digital   '; // '  no-free-space Offspring Digital   '
s.replace(/^ +/, '').replace(/ +$/, '');        //   'no-free-space Offspring Digital'
s.replace(/^ +| +$/, '');                       //   'no-free-space Offspring Digital   '
s.replace(/^ +| +$/g, '');                      //   'no-free-space Offspring Digital'
s.replace(/^\s+|\s+$/g, '');                    //   'no-free-space Offspring Digital'
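The last variant can be wrapped in a tiny helper. This is only a sketch: the name trimWhitespace is made up for illustration, and the built-in String.prototype.trim() already covers the same case.

```javascript
// Hypothetical helper wrapping the last pattern above
function trimWhitespace(str) {
    // ^\s+ matches leading whitespace, \s+$ matches trailing whitespace;
    // the g flag lets replace() clean both ends in a single call
    return str.replace(/^\s+|\s+$/g, '');
}

var padded = '\t  no-free-space Offspring Digital \n';
var trimmed = trimWhitespace(padded);

console.log(JSON.stringify(trimmed));   // "no-free-space Offspring Digital"
console.log(trimmed === padded.trim()); // true
```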

File extension filter

                        
// Is this file an image?
// .. or ..
// Does this filename
//   end with the defined text?
var file = 'this-is-an-image.png'
  , rgx = /\.jpg$|\.png$|\.gif$/;
file.match(rgx); // [".png"]
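When only a yes/no answer is needed, test() reads better than match(). A small sketch, assuming a helper named isImage (not part of the original slide); the i flag is an extra so that upper-case extensions pass too.

```javascript
// Hypothetical helper: true when the filename ends with a known image extension
function isImage(filename) {
    // one escaped dot, a non-capturing group of extensions, then end of string
    return /\.(?:jpg|png|gif)$/i.test(filename);
}

console.log(isImage('this-is-an-image.png')); // true
console.log(isImage('PHOTO.JPG'));            // true
console.log(isImage('png-notes.txt'));        // false
```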

$_ email address validate

                        
// We need a regexp
//   to validate an email address
var email = 'mr.john-doe@offspringdigital.com'
  , rgx = ???;
email.match(rgx); // ["mr.john-doe@offspringdigital.com"]

{name}@{domain-name}.{domain-extension}
... is case insensitive
{name} starts with a letter (i.e. a-z)
... followed by alphanumerics or . _ -
... has at least 4 chars, at most 20 chars
{domain-name} starts with an alphanumeric (i.e. a-z or 0-9)
... then followed by alphanumerics or _ -
... has at least 2 chars, at most 20 chars
{domain-extension} is letters only
... and has at least 2 chars, at most 4 chars

var rgx = /^[a-zA-Z][a-zA-Z0-9._-]{3,19}@[a-zA-Z0-9][a-zA-Z0-9_-]{1,19}\.[a-zA-Z]{2,4}$/; // a literal translation of the rules
var rgx = /^[a-z][a-z0-9._-]{3,19}@[a-z0-9][a-z0-9_-]{1,19}\.[a-z]{2,4}$/i;               // add the i modifier to shorten the regexp
var rgx = /^[a-z][\w.-]{3,19}@[a-z0-9][\w-]{1,19}\.[a-z]{2,4}$/i;                         // \w equals [a-zA-Z0-9_], so let's use it
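A quick way to check the rules is to run a few addresses through the shortest pattern. In this sketch the helper name isValidEmail is made up, and the domain-name part is capped at 20 characters, which the 16-character offspringdigital needs in order to pass.

```javascript
// Shortest form of the pattern; the domain-name cap of 20 is an assumption
var emailRgx = /^[a-z][\w.-]{3,19}@[a-z0-9][\w-]{1,19}\.[a-z]{2,4}$/i;

// Hypothetical helper around the pattern
function isValidEmail(email) {
    return emailRgx.test(email);
}

console.log(isValidEmail('mr.john-doe@offspringdigital.com')); // true
console.log(isValidEmail('1abc@mail.com')); // false, {name} must start with a letter
console.log(isValidEmail('abc@mail.com'));  // false, {name} is shorter than 4 chars
console.log(isValidEmail('john@x.com'));    // false, {domain-name} is shorter than 2 chars
```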

$_ Australia phone number

http://en.wikipedia.org/wiki/%2B61

                        
// We need a regexp
//   to validate an Australian phone number
var phone = '+61 0 2345 6789'
  , rgx = ???;
phone.match(rgx); // ["+61 0 2345 6789"]

(0x)xxxxxxxx
+61xxxxxxxxx
x is a digit (0 - 9)
Free whitespaces (all whitespaces are ignored)

var rgx = /^\s*\(0\s*\d\s*\)\s*\d\s*\d\s*\d\s*\d\s*\d\s*\d\s*\d\s*\d\s*$|^\s*\+6\s*1\s*\d\s*\d\s*\d\s*\d\s*\d\s*\d\s*\d\s*\d\s*\d\s*$/;
var rgx = /^\s*\(0\s*\d\s*\)\s*(\d\s*){8}$|^\s*\+6\s*1\s*(\d\s*){9}$/;
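Here is the short form wired into a checker. It's a sketch: isAuPhone is a made-up name, and the groups are switched to non-capturing (?:...) so that match() returns just the full match.

```javascript
// Same two formats as above, with non-capturing groups
var phoneRgx = /^\s*\(0\s*\d\s*\)\s*(?:\d\s*){8}$|^\s*\+6\s*1\s*(?:\d\s*){9}$/;

// Hypothetical helper around the pattern
function isAuPhone(phone) {
    return phoneRgx.test(phone);
}

console.log(isAuPhone('+61 0 2345 6789')); // true
console.log(isAuPhone('(02) 1234 5678'));  // true
console.log(isAuPhone('(02) 1234 567'));   // false, one digit short
console.log('+61 0 2345 6789'.match(phoneRgx).length); // 1
```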

$_ remove duplicated spaces

                        
// We need a regexp
//   to remove duplicated
//   whitespaces
var s = 'It has   a lot     spaces';    // 'It has   a lot     spaces'
s.replace(???);                         // 'It has a lot spaces'

Replace 2 or more consecutive spaces with a single space

s.replace(/ +/g, ' ');    // only spaces are collapsed, tabs are left alone, so we don't use \s
s.replace(/ {1,}/g, ' '); // this is identical to the above
s.replace(/ {2,}/g, ' '); // this skips single spaces, so it does less work
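The g modifier is what makes the difference here: without it, replace() stops after the first run of spaces. A short sketch to see both behaviours side by side (the variable names are just for illustration):

```javascript
var s = 'It has   a lot     spaces';

// Without g only the first run of spaces is collapsed
var firstOnly = s.replace(/ {2,}/, ' ');

// With g every run of 2 or more spaces is collapsed
var everyRun = s.replace(/ {2,}/g, ' ');

console.log(firstOnly === everyRun); // false
console.log(everyRun);               // 'It has a lot spaces'
```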

$_ parse nodejs arguments

                        
// From the terminal, we type
$ node app.js --first-name=john --last-name=doe

                        
// This is the app.js file
//   and we need a regexp
//   to extract passed
//   values from cli

// This is the opts
//   and it should accept cli arguments
//   but in kebab-case (not camelCase)
var opts = {
    firstName: 'offspring',
    lastName: 'digital'
};

var args = process.argv.slice(2); // ['--first-name=john', '--last-name=doe'];
for(var optKey in opts) {
    var rgx = ???;

    // Let's search in the argument list
    //   for an argument that
    //   corresponds to this optKey,
    //   if any
    for(var i = 0; i < args.length; i++) {
        var argument = args[i];
        var match = rgx.exec(argument);

        // We've found it!
        if(match) {
            var value = ???;
            opts[optKey] = value; // .. and save it to our opts
            break; // .. then stop searching
        }
    }
}

console.log(opts); // {firstName: 'john', lastName: 'doe'}

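var rgx = new RegExp('^--' + optKey.replace(/[A-Z]/g, '-$&').toLowerCase() + '=(["\']?)(.+)\\1$'); // firstName -> --first-name; no g flag, so exec() always starts at index 0
..
var value = match[2];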
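Putting the pieces together, here's a minimal runnable sketch of the whole exercise. toKebabCase and parseArgs are names invented for this example; in a real Node script the argument list would come from process.argv.slice(2).

```javascript
// Hypothetical helper: firstName -> first-name
function toKebabCase(key) {
    return key.replace(/[A-Z]/g, '-$&').toLowerCase();
}

// Hypothetical helper: copies the defaults, then overrides them
// with any matching --key=value found in argv
function parseArgs(argv, defaults) {
    var opts = {};
    Object.keys(defaults).forEach(function (optKey) {
        opts[optKey] = defaults[optKey];
        // (["']?) captures an optional opening quote, \1 demands the same one at the end
        var rgx = new RegExp('^--' + toKebabCase(optKey) + '=(["\']?)(.+)\\1$');
        for (var i = 0; i < argv.length; i++) {
            var match = rgx.exec(argv[i]);
            if (match) {
                opts[optKey] = match[2]; // the value without the quotes
                break;
            }
        }
    });
    return opts;
}

var parsed = parseArgs(
    ['--first-name=john', '--last-name="doe"'],
    { firstName: 'offspring', lastName: 'digital' }
);
console.log(parsed); // { firstName: 'john', lastName: 'doe' }
```

The regexp is built without the g flag on purpose: a global regexp remembers lastIndex between exec() calls, which can make it skip a match when it is reused on the next argument.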

Season 2

Back-reference to a lazy & greedy group

Number range

                        
// We need a regexp
//   to validate that
//   1979 <= x <= 1991
var number = '1990'
  , rgx = /^19(?:79|8[0-9]|9[01])$/;
number.match(rgx); // ["1990"]
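A number-range regexp is easy to get wrong at the edges, so it's worth looping over the neighbourhood. This sketch (variable names are illustrative) shows that the pattern accepts 1979 and 1991 themselves:

```javascript
var yearRgx = /^19(?:79|8[0-9]|9[01])$/;
var accepted = [];

// Try every year around the range and keep the ones that match
for (var year = 1970; year <= 2000; year++) {
    if (yearRgx.test(String(year))) {
        accepted.push(year);
    }
}

console.log(accepted[0]);                   // 1979
console.log(accepted[accepted.length - 1]); // 1991
console.log(accepted.length);               // 13
```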

Remove duplicated words

                        
// So we don't need duplicated words
var s = 'A duplicated duplicated duplicated word';
s.replace(/(.+?\b)\1+/g, '$1'); // "A duplicated word"
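The same back-reference idea can be written word by word, which some find easier to read. This is an alternative sketch, not the pattern from the slide; dedupeWords is a made-up name.

```javascript
// Hypothetical helper: collapses a word repeated 2 or more times in a row
function dedupeWords(text) {
    // (\w+) captures a word, (?:\s+\1\b)+ eats every immediate repeat of it;
    // the trailing \b stops "the the theory" from eating into "theory"
    return text.replace(/\b(\w+)(?:\s+\1\b)+/g, '$1');
}

console.log(dedupeWords('A duplicated duplicated duplicated word')); // 'A duplicated word'
console.log(dedupeWords('the the theory'));                          // 'the theory'
```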

$_ ip address

http://en.wikipedia.org/wiki/IP_address

                        
// We need a regexp
//   to validate an ip address
var ip = '192.168.0.102';
var rgx = ???;
ip.match(rgx); // ["192.168.0.102"]

x.x.x.x
x is a number from 0 - 255
0, 00, 000 are all accepted

// Let's break the x down:
// x ~= (0-255) ~= (0-9) | (10-99) | (100-199) | (200-249) | (250-255)
// Translate it into regexp language (note that zero padding is accepted)
// 0{0,2}\d|0?\d{2}|1\d{2}|2[0-4]\d|25[0-5]
// Each x needs its own group, otherwise the | would split the whole pattern
var rgx = /^(?:(?:0{0,2}\d|0?\d{2}|1\d{2}|2[0-4]\d|25[0-5])\.){3}(?:0{0,2}\d|0?\d{2}|1\d{2}|2[0-4]\d|25[0-5])$/;
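Since the x part repeats four times, one way to keep it readable is to build the pattern from a string. A sketch, with octet and ipRgx as illustrative names; the key point is that each x is wrapped in (?:...) so the | stays inside it.

```javascript
// One octet (0-255, zero padding allowed), wrapped in a non-capturing group
var octet = '(?:0{0,2}\\d|0?\\d{2}|1\\d{2}|2[0-4]\\d|25[0-5])';

// x.x.x.x = one octet followed by three ".octet" parts
var ipRgx = new RegExp('^' + octet + '(?:\\.' + octet + '){3}$');

console.log(ipRgx.test('192.168.0.102'));   // true
console.log(ipRgx.test('010.001.000.255')); // true, padding is accepted
console.log(ipRgx.test('256.1.1.1'));       // false
console.log(ipRgx.test('1.2.3'));           // false
```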

$_ remove duplicated spaces + trim

                        
// We need a regexp
//   to remove duplicated whitespaces
//   as well as trim it
var s = '    It has   a lot     spaces   ';    // '    It has   a lot     spaces   '
s.replace(???);                                // 'It has a lot spaces'

Remove all spaces at the beginning and the end of the string
Replace 2 or more consecutive spaces with a single space

s.replace(/^ +| +$|( ) +/g, '$1'); // We're talking about space, not tab, so we don't use \s
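The trick in that one-liner is the '$1' replacement: when the match comes from the first or second alternative, group 1 never took part, so $1 is an empty string. A small sketch to try it out (squeezeSpaces is a made-up name):

```javascript
// Hypothetical helper: trims spaces and collapses inner runs in one pass
function squeezeSpaces(str) {
    // ^ +     leading spaces  -> $1 is empty, so they vanish
    //  +$     trailing spaces -> $1 is empty, so they vanish
    // ( ) +   inner run       -> $1 is the single captured space
    return str.replace(/^ +| +$|( ) +/g, '$1');
}

console.log(squeezeSpaces('    It has   a lot     spaces   ')); // 'It has a lot spaces'
console.log(squeezeSpaces('a\t\tb') === 'a\t\tb');              // true, tabs are untouched
```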

$_ linkify

                        
// We need 2 regexp,
//   one for link
//   one for username,
//   and their replacements
var text = 'We have a http://google.com?q=linkify link from @offspring';
var rgxLink = ???
  , replaceLink = ???
  , rgxTwitterUsername = ???
  , replaceTwitterUsername = ???;

text
.replace(rgxLink, replaceLink)
.replace(rgxTwitterUsername, replaceTwitterUsername);
// Same sentence, with the URL and @offspring turned into links

A link starts with http:// or https://
A link ends before a space, or EOL
A twitter username starts right after the @ character
A twitter username ends before a space, or EOL
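One possible answer, following the four rules above. Treat it as a sketch: the <a> markup and the https://twitter.com/ URL shape are assumptions, the (^|\s) in front of @ is an extra guard so that email addresses are not picked up, and nothing is HTML-escaped.

```javascript
// A link: http:// or https://, then everything up to the next whitespace
var rgxLink = /(https?:\/\/\S+)/g
  , replaceLink = '<a href="$1">$1</a>'
  // A username: @ at the start or after whitespace, then everything up to the next whitespace
  , rgxTwitterUsername = /(^|\s)@(\S+)/g
  , replaceTwitterUsername = '$1<a href="https://twitter.com/$2">@$2</a>';

// Hypothetical helper chaining the two replacements
function linkify(text) {
    return text
        .replace(rgxLink, replaceLink)
        .replace(rgxTwitterUsername, replaceTwitterUsername);
}

var linked = linkify('We have a http://google.com?q=linkify link from @offspring');
console.log(linked);
```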

Season 3

Look around

Mailinator email barrier

here's the non-existed@mailinator.com mailbox

                        
// We're gonna make
//   a simple temp email barrier
var wrongEmail = 'non-existed@mailinator.com'
  , rightEmail = 'john.doe@gmail.com';

// We already have the email validation regexp,
//   so let's assume we reuse it here (simplified below),
//   and add just a tiny change to the domain name part
var rgx = /^.+?@(?!mailinator|reallymymail).+?\..+$/;
rgx.test(wrongEmail); // false
rgx.test(rightEmail); // true
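The negative lookahead (?!...) peeks at what follows the @ without consuming it. A slightly stricter sketch: here the lookahead also requires the dot after the blocked name, so a domain that merely starts with "mailinator" is not rejected. The names tempMailBarrier and isAcceptedEmail are made up.

```javascript
// Reject only the exact domain names mailinator.* and reallymymail.*
var tempMailBarrier = /^.+?@(?!(?:mailinator|reallymymail)\.).+?\..+$/i;

// Hypothetical helper around the pattern
function isAcceptedEmail(email) {
    return tempMailBarrier.test(email);
}

console.log(isAcceptedEmail('non-existed@mailinator.com'));     // false
console.log(isAcceptedEmail('john.doe@gmail.com'));             // true
console.log(isAcceptedEmail('jane@mailinators-anonymous.org')); // true
```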

$_ file extension barrier

                        
// We're gonna accept every file
//   except images (jpg, png, gif)
// .. or ..
// The filename could end with anything
//   except .jpg, .png, .gif
var file = 'this-is-an-image.png'
  , rgx = ???;
file.match(rgx); // null

var rgx = /\.(?!jpg$|png$|gif$)[^.]+$/; // only the last dot counts, so archive.tar.png is rejected too
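A few filenames to try the barrier on. This sketch uses a variant that anchors on the last dot with [^.]+$ and adds the i flag; both are assumptions on top of the slide, and a file with no extension at all is rejected by it.

```javascript
// Last dot of the name, not followed by a banned extension
var notImageRgx = /\.(?!(?:jpg|png|gif)$)[^.]+$/i;

// Hypothetical helper around the pattern
function isAllowedFile(filename) {
    return notImageRgx.test(filename);
}

console.log(isAllowedFile('this-is-an-image.png')); // false
console.log(isAllowedFile('report.pdf'));           // true
console.log(isAllowedFile('archive.tar.png'));      // false
console.log(isAllowedFile('photo.png.txt'));        // true
```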

$_ clean HTML comment

http://www.quirksmode.org/css/condcom.html

                        
// Strip every html comment
//   except IE comment
var html = '...';   // The source of an html file
html.replace(???);  // HTML content with the comments stripped

An IE conditional comment must not be stripped
An IE comment starts with <!--
.. followed by [if IE], [if lt IE 8], [if gte IE 9] ..
An IE comment ends with -->

html.replace(/<!--(?!\s*\[if\s+(?:[lg]te?\s+)?IE)[\s\S]*?-->/gi, ''); // lookahead right after <!--, lazy [\s\S]*? so it spans lines and stops at the first -->
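To see the lookahead at work, here's a sketch on a tiny HTML string (stripComments and the sample markup are made up). It places the lookahead directly after <!-- and uses a lazy [\s\S]*? so that a comment can span several lines and the match stops at the first -->.

```javascript
// Hypothetical helper: removes HTML comments but keeps IE conditional comments
function stripComments(html) {
    return html.replace(/<!--(?!\s*\[if\s+(?:[lg]te?\s+)?IE)[\s\S]*?-->/gi, '');
}

var html = '<!-- plain -->\n' +
           '<p>text</p>\n' +
           '<!--[if lt IE 8]><p>old IE</p><![endif]-->\n' +
           '<!--\n multi\n line\n-->';

var cleaned = stripComments(html);
console.log(cleaned);
// The plain and the multi-line comments are gone, the [if lt IE 8] block is kept
```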

A few bonus ..

Wanna play?

Wanna try?

Wanna be a regexp guru?

Mastering Regular Expressions, Jeffrey Friedl or visit his blog for some beautiful photos http://regex.info/blog/
http://www.regular-expressions.info/
http://www.rexegg.com/
http://www.regexbuddy.com/
http://google.com
...

Thanks for sitting

Should you have any question,
feel free to ask Google

What else?

Hmm .. nope. Have a good evening, by the way! And if you've suddenly become another RegExp lover .. feel free to drop me a joke at minhnd.it@gmail.com

Character classes
[abc]	a or b or c
[o-s]	one of the characters in the range from o to s
[^a]	one character that is not a
[^o-s]	one character that is not in the range from o to s

Characters
\d	digit (equals [0-9])
\w	word character (equals [a-zA-Z0-9_])
\s	whitespace (space, \t, \r, \n, \f, \v and other Unicode spaces)
.	any character except line terminators such as \n
\	escape a special char

Quantifier
*	0 or more (greedy)
+	1 or more (greedy)
?	0 or 1
{3,7}	3 to 7 times
*?	0 or more (lazy ~ non-greedy)
+?	1 or more (lazy ~ non-greedy)

Anchors & boundaries
^	start of line/string
$	end of line/string
\b	word boundary (position with \w on one side only)

Logic
|	OR
()	capturing group
(?:)	non-capturing group

Modifier
g	global
i	case insensitive
m	multiline: ^ and $ also match at the start and end of each line

String replacement
$n	n-th captured group
$&	entire matched string
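A couple of entries from the alphabet are easier to remember once you've seen them run. A short sketch (variable names are illustrative) for the m modifier and the two replacement tokens:

```javascript
var lines = 'first\nsecond';

// Without m, ^ and $ only match at the very start and end of the string
var withoutM = lines.match(/^\w+$/g);

// With m, ^ and $ also match around each line break
var withM = lines.match(/^\w+$/gm);

console.log(withoutM); // null
console.log(withM);    // [ 'first', 'second' ]

// $& is the entire match, $1 and $2 are the captured groups
var wrapped = 'a-b'.replace(/\w/g, '[$&]');
var swapped = 'john doe'.replace(/(\w+) (\w+)/, '$2 $1');

console.log(wrapped); // '[a]-[b]'
console.log(swapped); // 'doe john'
```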