regexp

REGister to become an EXPert

What's gonna happen then?

  • The short regexp alphabet (~5 minutes)
  • Real practice (~55 minutes .. or it depends)

Regular expressions in everyday tasks

  • Text searching
  • Text validating
  • Text filtering
  • Text extracting
  • Text replacing
  • Text formatting
  • Text cleaning
  • Text playing
  • Text ...

Regular expressions in everything

  • Almost every IDE + powerful editor (Eclipse, Brackets, PhpStorm, Atom, Vim, NetBeans, Notepad++, Emacs, gedit ...)
  • Apache .htaccess
  • Linux sed, bash scripts, sh
  • Windows PowerShell
  • ...

Regular expressions for developers

  • Perl, PHP
  • C, C++
  • .NET
  • Java
  • Python
  • Ruby
  • MySQL
  • JavaScript
  • ...

Dive into

Let's take a look at the JavaScript RegExp and String first ..

Season 1

Regexp alphabet

Trimming whitespaces

                        
// Stripping whitespace 
//   from the beginning 
//   and the end of the string 
var s = '  no-free-space Offspring Digital   '; // '  no-free-space Offspring Digital   '
s.replace(/^ +/, '').replace(/ +$/, '');        //   'no-free-space Offspring Digital'
s.replace(/^ +| +$/, '');                       //   'no-free-space Offspring Digital   '
s.replace(/^ +| +$/g, '');                      //   'no-free-space Offspring Digital'
s.replace(/^\s+|\s+$/g, '');                    //   'no-free-space Offspring Digital'
                        
                    

File extension filter

                        
// Is this file an image?
// .. or ..
// Does this filename
//   end with the defined text?
var file = 'this-is-an-image.png'
  , rgx = /\.jpg$|\.png$|\.gif$/;
file.match(rgx); // [".png"]
                        
                    

$_ email address validate

                        
// We need a regexp
//   to validate an email address
var email = 'mr.john-doe@offspringdigital.com'
  , rgx = ???;
email.match(rgx); // ["mr.john-doe@offspringdigital.com"]
                        
                    
  • {name}@{domain-name}.{domain-extension}
  • ... is case insensitive
  • {name} starts with a letter (i.e. a-z)
  • ... followed by alphanumeric or . _ -
  • ... has at least 4 chars, at most 20 chars
  • {domain-name} starts with an alphanumeric (i.e. a-z and 0-9)
  • ... then followed by alphanumeric or _ -
  • ... has at least 2 chars, at most 10 chars
  • {domain-extension} is alpha only
  • ... and has at least 2 chars, at most 4 chars
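
One possible answer, sketched by translating the rules above one at a time. The name `rgxEmail` and the extra test addresses are illustrative and not part of the original slides. Note that the sample address uses a 16-character domain name (offspringdigital), so a pattern that applies the 10-character limit literally rejects it.

```javascript
// A sketch that follows the listed rules literally (rgxEmail is a hypothetical name).
//   {name}:             a letter, then 3 to 19 of [a-z0-9._-]      -> 4 to 20 chars
//   {domain-name}:      an alphanumeric, then 1 to 9 of [a-z0-9_-] -> 2 to 10 chars
//   {domain-extension}: 2 to 4 letters
//   i flag:             case insensitive
var rgxEmail = /^[a-z][a-z0-9._-]{3,19}@[a-z0-9][a-z0-9_-]{1,9}\.[a-z]{2,4}$/i;

rgxEmail.test('mr.john-doe@offspring.com');        // true
rgxEmail.test('Mr.John-Doe@Offspring.COM');        // true
rgxEmail.test('1john@offspring.com');              // false (name starts with a digit)
rgxEmail.test('joe@offspring.com');                // false (name has only 3 chars)
rgxEmail.test('mr.john-doe@offspringdigital.com'); // false (domain name has 16 chars)
```

Relaxing `{1,9}` to a larger upper bound would let the slide's own sample address through.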

$_ Australian phone number

http://en.wikipedia.org/wiki/%2B61
                        
// We need a regexp
//   to validate an Australian phone number
var phone = '+61 0 2345 6789'
  , rgx = ???;
phone.match(rgx); // ["+61 0 2345 6789"]
                        
                    
  • (0x)xxxxxxxx
  • +61xxxxxxxxx
  • x is a digit (0 - 9)
  • Free whitespace (all whitespace is ignored)
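
One possible answer, assuming the rules mean a prefix of either (0x) or +61x followed by exactly 8 more digits, with whitespace allowed anywhere. `rgxAuPhone` is a hypothetical name, and the extra sample numbers are made up for illustration.

```javascript
// A sketch: \s* in front of every token tolerates free whitespace.
//   \(\s*0\s*\d\s*\)     "(0x)"
//   \+\s*6\s*1\s*\d      "+61x"
//   (?:\s*\d){8}         the remaining 8 digits
var rgxAuPhone = /^\s*(?:\(\s*0\s*\d\s*\)|\+\s*6\s*1\s*\d)(?:\s*\d){8}\s*$/;

'+61 0 2345 6789'.match(rgxAuPhone); // ["+61 0 2345 6789"]
rgxAuPhone.test('(02) 3456 7890');   // true
rgxAuPhone.test('+61 2345 6789');    // false (one digit short)
rgxAuPhone.test('0234567890');       // false (no "(0x)" or "+61" prefix)
```

An alternative design is to strip whitespace first with `replace(/\s+/g, '')` and then test a much shorter pattern.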

$_ remove duplicated spaces

                        
// We need a regexp
//   to remove duplicated
//   whitespaces
var s = 'It has   a lot     spaces';    // 'It has   a lot     spaces'
s.replace(???);                         // 'It has a lot spaces'
                        
                    
  • Replace 2 or more consecutive spaces with a single space
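
A minimal sketch of one answer: the `{2,}` quantifier matches a run of two or more spaces, and the g flag repeats the replacement for every run. The variable name `collapsed` is illustrative.

```javascript
var s = 'It has   a lot     spaces';

// " {2,}" = a space repeated 2 or more times; each run becomes a single space
var collapsed = s.replace(/ {2,}/g, ' ');
// collapsed is 'It has a lot spaces'
```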

$_ parse nodejs arguments

                        
// From the terminal, we type
$ node app.js --first-name=john --last-name=doe
                        
                    
                        
// This is the app.js file
//   and we need a regexp
//   to extract passed
//   values from cli

// This is the opts
//   and it should accept cli arguments
//   but in kebab-case (not camelCase)
var opts = {
    firstName: 'offspring',
    lastName: 'digital'
};

var args = process.argv.slice(2); // ['--first-name=john', '--last-name=doe'];
for(var optKey in opts) {
    var rgx = ???;

    // Let's search in the argument list
    //   for an argument that
    //   corresponds to this optKey,
    //   if any
    for(var i in args) {
        var argument = args[i];
        var match = rgx.exec(argument);

        // We've found it!
        if(match) {
            var value = ???;
            opts[optKey] = value; // .. and save it to our opts
            break; // .. then stop searching
        }
    }
}

console.log(opts); // {firstName: 'john', lastName: 'doe'}
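
One way to fill in the two blanks, sketched as a standalone function so it can run without a real command line. `toKebabCase` and `parseArgs` are hypothetical helper names, and the sketch assumes the option keys contain only letters, so nothing needs escaping when the pattern is built.

```javascript
// Hypothetical helper: firstName -> first-name
function toKebabCase(key) {
    return key.replace(/[A-Z]/g, function (c) {
        return '-' + c.toLowerCase();
    });
}

// Hypothetical helper: overrides the defaults in opts with --key=value arguments
function parseArgs(opts, args) {
    for (var optKey in opts) {
        // e.g. /^--first-name=(.*)$/ ; group 1 captures the value
        var rgx = new RegExp('^--' + toKebabCase(optKey) + '=(.*)$');
        for (var i = 0; i < args.length; i++) {
            var match = rgx.exec(args[i]);
            if (match) {
                opts[optKey] = match[1]; // the captured value
                break;
            }
        }
    }
    return opts;
}

// In app.js the list would come from process.argv.slice(2)
var parsed = parseArgs(
    { firstName: 'offspring', lastName: 'digital' },
    ['--first-name=john', '--last-name=doe']
);
console.log(parsed); // { firstName: 'john', lastName: 'doe' }
```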
                        
                    

Season 2

Back-reference to a lazy & greedy group

Number range

                        
// We need a regexp
//   to validate that
//   1979 <= x <= 1991
var number = '1990'
  , rgx = /^19(?:79|8[0-9]|9[01])$/;
number.match(rgx); // ["1990"]
                        
                    

Remove duplicated words

                        
// We don't want duplicated words
var s = 'A duplicated duplicated duplicated word';
s.replace(/(.+?\b)\1+/g, '$1'); // "A duplicated word"
                        
                    

$_ ip address

http://en.wikipedia.org/wiki/IP_address
                        
// We need a regexp
//   to validate an ip address
var ip = '192.168.0.102';
var rgx = ???;
ip.match(rgx); // ["192.168.0.102"]
                        
                    
  • x.x.x.x
  • x is a number from 0 - 255
  • 0, 00, 000 are all accepted
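
One possible answer, sketched by describing a single octet and repeating it. The names `octet` and `rgxIp` are illustrative. The pattern is assembled from a string only to avoid writing the octet four times.

```javascript
// One octet, 0 to 255, leading zeros allowed (0, 00, 000):
//   25[0-5]      250-255
//   2[0-4]\d     200-249
//   [01]?\d?\d   0-199, one to three digits
var octet = '(?:25[0-5]|2[0-4]\\d|[01]?\\d?\\d)';
var rgxIp = new RegExp('^' + octet + '(?:\\.' + octet + '){3}$');

'192.168.0.102'.match(rgxIp); // ["192.168.0.102"]
rgxIp.test('000.00.0.255');   // true
rgxIp.test('256.1.1.1');      // false
rgxIp.test('1.2.3');          // false
```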

$_ remove duplicated spaces + trim

                        
// We need a regexp
//   to remove duplicated whitespaces
//   as well as trim it
var s = '    It has   a lot     spaces   ';    // '    It has   a lot     spaces   '
s.replace(???);                                // 'It has a lot spaces'
                        
                    
  • Remove all spaces at the beginning and the end of the string
  • Replace 2 or more consecutive spaces with a single space
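
A sketch of one single-pass answer that leans on a captured group in the replacement. It relies on the JavaScript behaviour that `$1` inserts an empty string when group 1 did not take part in the match. The name `cleaned` is illustrative.

```javascript
var s = '    It has   a lot     spaces   ';

// Three alternatives, tried in order at each position:
//   ^ +     leading spaces   -> group 1 is unset, so '$1' inserts nothing
//    +$     trailing spaces  -> same, the run is removed
//   ( ) +   2 or more spaces -> group 1 holds one space, which '$1' puts back
var cleaned = s.replace(/^ +| +$|( ) +/g, '$1');
// cleaned is 'It has a lot spaces'
```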

$_ linkify

                        
// We need 2 regexps,
//   one for link
//   one for username,
//   and their replacements
var text = 'We have a http://google.com?q=linkify link from @offspring';
var rgxLink = ???
  , replaceLink = ???
  , rgxTwitterUsername = ???
  , replaceTwitterUsername = ???;

text
.replace(rgxLink, replaceLink)
.replace(rgxTwitterUsername, replaceTwitterUsername);
// We have a http://google.com?q=linkify link from @offspring
//   (with the link and the username each wrapped in an <a> tag)
                        
                    
  • A link starts with http:// or https://
  • A link ends before a space, or EOL
  • A twitter username starts right after the @ character
  • A twitter username ends before a space, or EOL
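
One possible set of answers, sketched under the four rules above. The anchor markup and the https://twitter.com/ profile URL format are assumptions made for illustration, and `linked` is a hypothetical variable name.

```javascript
var text = 'We have a http://google.com?q=linkify link from @offspring';

// A link: http:// or https://, then everything up to the next whitespace or EOL
var rgxLink = /https?:\/\/\S+/g;
var replaceLink = '<a href="$&">$&</a>'; // $& is the entire match

// A username: captured right after the @, up to the next whitespace or EOL
var rgxTwitterUsername = /@(\S+)/g;
var replaceTwitterUsername = '<a href="https://twitter.com/$1">@$1</a>'; // assumed URL format

var linked = text
    .replace(rgxLink, replaceLink)
    .replace(rgxTwitterUsername, replaceTwitterUsername);
// linked is 'We have a <a href="http://google.com?q=linkify">http://google.com?q=linkify</a>'
//         + ' link from <a href="https://twitter.com/offspring">@offspring</a>'
```

Because the second replacement runs on the output of the first, a link that itself contains an @ would be rewritten twice; this sketch does not guard against that.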

Season 3

Look around

Mailinator email barrier

here's the non-existed@mailinator.com mailbox
                        
// We're gonna make
//   a simple temp email barrier
var wrongEmail = 'non-existed@mailinator.com'
  , rightEmail = 'john.doe@gmail.com';

// Since we already have the email validation regexp,
//   let's assume that we're gonna reuse it here,
//   and make just a tiny change to the domain name part
var rgx = /^.+?@(?!mailinator|reallymymail).+?\..+$/;
rgx.test(wrongEmail); // false
rgx.test(rightEmail); // true
                        
                    

$_ file extension barrier

                        
// We're gonna accept every file
//   except images (jpg, png, gif)
// .. or ..
// The filename could end with anything
//   except .jpg, .png, .gif
var file = 'this-is-an-image.png'
  , rgx = ???;
file.match(rgx); // null
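
// One possible answer, sketched with a negative look ahead. The name rgxNotImage, the extra filenames, and the i flag (so that .PNG is rejected too) are assumptions that go beyond the slide.

```javascript
// (?!.*\.(?:jpg|png|gif)$) looks ahead from the start of the string and
//   fails when the name ends with one of the image extensions;
//   .+ then consumes the whole filename
var rgxNotImage = /^(?!.*\.(?:jpg|png|gif)$).+$/i;

'this-is-an-image.png'.match(rgxNotImage); // null
'report.pdf'.match(rgxNotImage);           // ["report.pdf"]
'photo.png.txt'.match(rgxNotImage);        // ["photo.png.txt"]
```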
                        
                    

$_ clean HTML comment

http://www.quirksmode.org/css/condcom.html
                        
// Strip every html comment
//   except IE comment
var html = '...';   // The source of an html file
html.replace(???);  // HTML content with the comments stripped
                        
                    
  • An IE conditional comment must not be stripped
  • An IE comment starts with <!--
  • .. followed by [if IE], [if lt IE 8], [if gte IE 9] ..
  • An IE comment ends with -->
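
One possible answer, sketched with a negative look ahead placed right after the comment opener. It assumes a conditional comment is written with [if directly after <!-- and no whitespace in between. The name `rgxComment` and the sample markup are made up for illustration.

```javascript
// <!--          comment opener
// (?!\[if\b)    ...not immediately followed by "[if" (an IE conditional comment)
// [\s\S]*?      any characters, including newlines, as few as possible
// -->           comment closer
var rgxComment = /<!--(?!\[if\b)[\s\S]*?-->/g;

var html = '<p>Hi</p><!-- note -->\n<!--[if lt IE 9]><script src="shim.js"></script><![endif]-->';
var stripped = html.replace(rgxComment, '');
// stripped is '<p>Hi</p>\n<!--[if lt IE 9]><script src="shim.js"></script><![endif]-->'
```

`[\s\S]` is used instead of the dot because, without extra flags, the dot does not match line breaks and comments often span several lines.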

A few bonus ..

Wanna play?

Wanna try?

Wanna be a regexp guru?

Thanks for sitting

Should you have any questions,
feel free to ask Google

What else?

Hmm .. nope. Have a good evening, by the way! And if you've suddenly become another RegExp lover .. feel free to drop me a joke at minhnd.it@gmail.com

cheat sheet

Character classes
  [abc]     a or b or c
  [o-s]     one character in the range from o to s
  [^a]      one character that is not a
  [^o-s]    one character that is not in the range from o to s

Characters
  \d        digit (equal to [0-9])
  \w        word character (equal to [a-zA-Z0-9_])
  \s        whitespace (space, \t, \r, \n and similar)
  .         any character except a line break such as \n
  \         escapes a special char

Quantifiers
  *         0 or more (greedy)
  +         1 or more (greedy)
  ?         0 or 1
  {3,7}     3 to 7 times
  *?        0 or more (lazy ~ non-greedy)
  +?        1 or more (lazy ~ non-greedy)

Anchors & boundaries
  ^         start of line/string
  $         end of line/string
  \b        word boundary (position with \w on one side only)

Logic
  |         OR
  ()        capturing group
  (?:)      non-capturing group

Look around
  (?=)      look ahead
  (?!)      negative look ahead

Modifiers
  g         global
  i         case insensitive
  m         multiline (^ and $ also match at the start and end of each line)

String replacement
  $n        n-th captured group
  $&        entire matched string