Friday, December 4, 2015

RegExp - Quick Reference


Characters
\d
Most engines: one digit from 0 to 9
file_\d\d
file_25
\d
.NET, Python 3: one Unicode digit in any script
file_\d\d
file_9੩
\w
Most engines: "word character": ASCII letter, digit or underscore
\w-\w\w\w
A-b_1
\w
Python 3: "word character": Unicode letter, ideogram, digit, or underscore
\w-\w\w\w
字-ま_۳
\w
.NET: "word character": Unicode letter, ideogram, digit, or connector
\w-\w\w\w
字-ま‿۳
\s
Most engines: "whitespace character": space, tab, newline, carriage return, vertical tab
a\sb\sc
a b c
\s
.NET, Python 3, JavaScript: "whitespace character": any Unicode separator
a\sb\sc
a b c
\D
One character that is not a digit as defined by your engine's \d
\D\D\D
ABC
\W
One character that is not a word character as defined by your engine's \w
\W\W\W\W\W
*-+=)
\S
One character that is not a whitespace character as defined by your engine's \s
\S\S\S\S
Yoyo
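As a rough illustration of the engine differences listed above, here is a minimal Python 3 sketch using the standard `re` module. The sample strings are chosen for illustration only; the `re.ASCII` flag is what switches Python from the Unicode definitions to the "most engines" ASCII definitions.

```python
import re

# In Python 3, \d on a str pattern matches any Unicode decimal digit.
# "\u0663" is ARABIC-INDIC DIGIT THREE.
unicode_digit = re.fullmatch(r"file_\d\d", "file_9\u0663")

# With re.ASCII, \d is restricted to 0-9, so the same string no longer matches.
ascii_digit = re.fullmatch(r"file_\d\d", "file_9\u0663", flags=re.ASCII)

# \w and its negation \W, using the samples from the table.
word = re.fullmatch(r"\w-\w\w\w", "A-b_1")
non_word = re.fullmatch(r"\W\W\W\W\W", "*-+=)")

# \s matches a space, a tab and a newline alike.
spaces = re.fullmatch(r"a\sb\sc", "a\tb\nc")

print(bool(unicode_digit), bool(ascii_digit), bool(word), bool(non_word), bool(spaces))
```

Other engines may draw the line differently, so it is worth running the same check in the engine you actually target.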
Quantifiers
+
One or more
Version \w-\w+
Version A-b1_1
{3}
Exactly three times
\D{3}
ABC
{2,4}
Two to four times
\d{2,4}
156
{3,}
Three or more times
\w{3,}
regex_tutorial
*
Zero or more times
A*B*C*
AAACC
?
Once or none
plurals?
plural
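The quantifier rows above can be checked with a short Python sketch. It simply pairs each pattern from the table with its sample string and uses `re.fullmatch`, so a pattern only counts as matching when it consumes the whole sample.

```python
import re

# Pattern -> sample string, copied from the table above.
samples = {
    r"Version \w-\w+": "Version A-b1_1",
    r"\D{3}": "ABC",
    r"\d{2,4}": "156",
    r"\w{3,}": "regex_tutorial",
    r"A*B*C*": "AAACC",
    r"plurals?": "plural",
}

results = {p: re.fullmatch(p, s) is not None for p, s in samples.items()}

# {2,4} is an upper bound too: five digits cannot be matched in full.
too_long = re.fullmatch(r"\d{2,4}", "12345")

for pattern, ok in results.items():
    print(pattern, ok)
```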
More Characters
.
Any character except line break
a.c
abc
.
Any character except line break
.*
whatever, man.
\.
A period (special character: needs to be escaped by a \)
a\.c
a.c
\
Escapes a special character
\.\*\+\?    \$\^\/\\
.*+?    $^/\
\
Escapes a special character
\[\{\(\)\}\]
[{()}]
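A small Python sketch of the dot and of escaping, assuming the default mode in which the dot does not match a line break:

```python
import re

# The unescaped dot matches any character except a line break.
dot_any = re.fullmatch(r"a.c", "abc")
dot_newline = re.fullmatch(r"a.c", "a\nc")   # None: the dot skips "\n"

# Escaped, the dot matches only a literal period.
literal_miss = re.fullmatch(r"a\.c", "abc")  # None
literal_hit = re.fullmatch(r"a\.c", "a.c")

# Brackets, braces and parentheses need escaping to be taken literally.
brackets = re.fullmatch(r"\[\{\(\)\}\]", "[{()}]")

print(bool(dot_any), bool(dot_newline), bool(literal_miss), bool(literal_hit), bool(brackets))
```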
Logic
|
Alternation / OR operator
22|33
33
( … )
Capturing group
A(nt|pple)
Apple (captures "pple")
\1
Contents of Group 1
r(\w)g\1x
regex
\2
Contents of Group 2
(\d\d)\+(\d\d)=\2\+\1
12+65=65+12
(?: … )
Non-capturing group
A(?:nt|pple)
Apple
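Groups and backreferences from the rows above, sketched in Python. The variable names are illustrative only.

```python
import re

# A capturing group records what it matched.
capturing = re.fullmatch(r"A(nt|pple)", "Apple")
captured = capturing.group(1)            # "pple"

# \1 must repeat exactly what group 1 captured.
backref = re.fullmatch(r"r(\w)g\1x", "regex")

# \2 and \1 can be used in any order.
swapped = re.fullmatch(r"(\d\d)\+(\d\d)=\2\+\1", "12+65=65+12")

# A non-capturing group groups without recording anything.
non_capturing = re.fullmatch(r"A(?:nt|pple)", "Apple")

print(captured, backref.group(1), swapped.groups(), non_capturing.groups())
```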
More White-Space
\t
Tab
T\t\w{2}
T     ab
\r
Carriage return character
see below

\n
Line feed character
see below

\r\n
Line separator on Windows
AB\r\nCD
AB
CD
\N
Perl, PCRE (C, PHP, R…): one character that is not a line feed
\N+
ABC
\v
.NET, JavaScript, Python, Ruby: vertical tab


\v
Perl, PCRE (C, PHP, R…), Java: one vertical whitespace character: line feed, carriage return, vertical tab, form feed, paragraph or line separator


\V
Perl, PCRE (C, PHP, R…), Java: any character that is not a vertical whitespace


\R
Perl, PCRE (C, PHP, R…), Java: one line break (carriage return + line feed pair, and all the characters matched by \v)
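A minimal Python sketch of the whitespace escapes. Python is one of the engines in which `\v` means a single vertical tab, not the broader "vertical whitespace" class of Perl, PCRE and Java.

```python
import re

# Tab followed by two word characters.
tab = re.fullmatch(r"T\t\w{2}", "T\tab")

# A Windows line separator is a carriage return followed by a line feed.
crlf = re.fullmatch(r"AB\r\nCD", "AB\r\nCD")

# In Python, \v is the vertical tab character (U+000B) only.
vtab = re.fullmatch(r"\v", "\x0b")
vtab_vs_newline = re.fullmatch(r"\v", "\n")   # None in Python

print(bool(tab), bool(crlf), bool(vtab), bool(vtab_vs_newline))
```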


More Quantifiers
+
The + (one or more) is "greedy"
\d+
12345
?
Makes quantifiers "lazy"
\d+?
1 in 12345
*
The * (zero or more) is "greedy"
A*
AAA
?
Makes quantifiers "lazy"
A*?
empty in AAA
{2,4}
Two to four times, "greedy"
\w{2,4}
abcd
?
Makes quantifiers "lazy"
\w{2,4}?
ab in abcd
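Greedy versus lazy matching is easiest to see side by side. This Python sketch uses `re.match`, which anchors at the start of the string but does not require the whole string to be consumed:

```python
import re

greedy_plus = re.match(r"\d+", "12345").group()      # takes as much as possible
lazy_plus = re.match(r"\d+?", "12345").group()       # takes as little as possible

greedy_star = re.match(r"A*", "AAA").group()
lazy_star = re.match(r"A*?", "AAA").group()          # the empty string

greedy_range = re.match(r"\w{2,4}", "abcd").group()
lazy_range = re.match(r"\w{2,4}?", "abcd").group()

print(repr(greedy_plus), repr(lazy_plus))
print(repr(greedy_star), repr(lazy_star))
print(repr(greedy_range), repr(lazy_range))
```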
Character Classes
[ … ]
One of the characters in the brackets
[AEIOU]
One uppercase vowel
[ … ]
One of the characters in the brackets
T[ao]p
Tap or Top
-
Range indicator
[a-z]
One lowercase letter
[x-y]
One of the characters in the range from x to y
[A-Z]+
GREAT
[ … ]
One of the characters in the brackets
[AB1-5w-z]
One of either: A,B,1,2,3,4,5,w,x,y,z
[x-y]
One of the characters in the range from x to y
[ -~]+
Characters in the printable section of the ASCII table
[^x]
One character that is not x
[^a-z]{3}
A1!
[^x-y]
One of the characters not in the range from x to y
[^ -~]+
Characters that are not in the printable section of the ASCII table
[\d\D]
One character that is a digit or a non-digit
[\d\D]+
Any characters, including new lines, which the regular dot doesn't match
[\x41]
Matches the character at hexadecimal position 41 in the ASCII table, i.e. A
[\x41-\x45]{3}
ABE
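The character-class rows can be exercised with a few lines of Python. The input strings other than the table's own samples are made up for illustration.

```python
import re

# A simple set and a set inside a longer pattern.
vowel = re.fullmatch(r"[AEIOU]", "E")
tap_or_top = [w for w in ["Tap", "Top", "Tip"] if re.fullmatch(r"T[ao]p", w)]

# Ranges: [ -~] runs from space to tilde, i.e. printable ASCII.
printable = re.fullmatch(r"[ -~]+", "Hello, world!")
non_printable_runs = re.findall(r"[^ -~]+", "caf\u00e9\tbar")

# Negated range and hexadecimal escapes.
negated = re.fullmatch(r"[^a-z]{3}", "A1!")
hex_range = re.fullmatch(r"[\x41-\x45]{3}", "ABE")

# [\d\D] matches anything, including the line break the dot skips.
anything = re.fullmatch(r"[\d\D]+", "line1\nline2")

print(tap_or_top, non_printable_runs, bool(anything))
```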
Anchors and Boundaries
^
Start of string or start of line, depending on multiline mode. (But when used inside brackets, as in [^…], it means "not".)
^abc .*
abc (line start)
$
End of string or end of line, depending on multiline mode. Many engine-dependent subtleties.
.*? the end$
this is the end
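\A
Beginning of string
\Aabc[\d\D]*
abc (string......start)
\z
Very end of the string
the end\z
this is...\n...the end
\Z
End of string or (except Python) before a final line break
the end\Z
this is...\n...the end\n
\G
Beginning of string or end of the previous match

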
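\b
Word boundary. Most engines: position where only one side is an ASCII letter, digit or underscore
Bob.*\bcat\b
Bob ate the cat
\b
Word boundary. Python 3 and other Unicode-aware engines: position where only one side is a Unicode letter, digit or underscore
Bob.*\bкошка\b
Bob ate the кошка
\B
Not a word boundary
c.*\Bcat\B.*
copycats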
(?=…)
Positive lookahead
(?=\d{10})\d{5}
01234 in 0123456789
(?<=…)
Positive lookbehind
(?<=\d)cat
cat in 1cat
(?!…)
Negative lookahead
(?!theatre)the\w+
theme
(?<!…)
Negative lookbehind
\w{3}(?<!mon)ster
Munster
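Anchors, boundaries and lookarounds all match positions, not characters. The Python sketch below shows a few of them; the `text` variable and the extra "monster" and "theatre" inputs are illustrative additions, and the `\Z` behaviour shown is specific to Python.

```python
import re

text = "abc first\nabc second"

# In multiline mode ^ matches at every line start, \A only at the string start.
line_starts = re.findall(r"^abc", text, flags=re.MULTILINE)
string_starts = re.findall(r"\Aabc", text, flags=re.MULTILINE)

# $ tolerates one trailing newline; Python's \Z does not.
dollar = re.search(r"the end$", "this is the end\n")
big_z = re.search(r"the end\Z", "this is the end\n")     # None in Python

# Word boundaries and non-boundaries.
whole_word = re.search(r"Bob.*\bcat\b", "Bob ate the cat")
not_whole_word = re.search(r"\bcat\b", "copycats")       # None
inside_word = re.search(r"c.*\Bcat\B.*", "copycats")

# Lookarounds assert what follows or precedes without consuming it.
lookahead = re.search(r"(?=\d{10})\d{5}", "0123456789").group()
lookbehind = re.search(r"(?<=\d)cat", "1cat").group()
neg_lookahead = [w for w in ["theme", "theatre"] if re.fullmatch(r"(?!theatre)the\w+", w)]
neg_lookbehind = [w for w in ["Munster", "monster"] if re.fullmatch(r"\w{3}(?<!mon)ster", w)]

print(len(line_starts), len(string_starts), lookahead, lookbehind, neg_lookahead, neg_lookbehind)
```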
POSIX Classes
[:alpha:]
PCRE (C, PHP, R…): ASCII letters A-Z and a-z
[8[:alpha:]]+
WellDone88
[:alpha:]
Ruby 2: Unicode letter or ideogram
[[:alpha:]\d]+
кошка99
[:alnum:]
PCRE (C, PHP, R…): ASCII digits and letters A-Z and a-z
[[:alnum:]]{10}
ABCDE12345
[:alnum:]
Ruby 2: Unicode digit, letter or ideogram
[[:alnum:]]{10}
кошка90210
[:punct:]
PCRE (C, PHP, R…): ASCII punctuation mark
[[:punct:]]+
?!.,:;
[:punct:]
Ruby: Unicode punctuation mark
[[:punct:]]+
,:
Character Class Operations
[…-[…]]
.NET: character class subtraction. One character that is in those on the left, but not in the subtracted class.
[a-z-[aeiou]]
Any lowercase consonant
[…-[…]]
.NET: character class subtraction.
[\p{IsArabic}-[\D]]
An Arabic character that is not a non-digit, i.e., an Arabic digit
[…&&[…]]
Java, Ruby 2+: character class intersection. One character that is both in those on the left and in the && class.
[\S&&[\D]]
A non-whitespace character that is a non-digit.
[…&&[…]]
Java, Ruby 2+: character class intersection.
[\S&&[\D]&&[^a-zA-Z]]
A non-whitespace character that is a non-digit and not a letter.
[…&&[^…]]
Java, Ruby 2+: character class subtraction is obtained by intersecting a class with a negated class
[a-z&&[^aeiou]]
An English lowercase letter that is not a vowel.
[…&&[^…]]
Java, Ruby 2+: character class subtraction
[\p{InArabic}&&[^\p{L}\p{N}]]
An Arabic character that is not a letter or a number
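Python's standard `re` module has neither the .NET subtraction syntax nor the Java/Ruby `&&` intersection syntax. As a workaround sketch (a swapped-in technique, not the syntax shown above), a lookahead placed before the class gives the same effect for a single character:

```python
import re

# Emulates [a-z-[aeiou]] / [a-z&&[^aeiou]]: a lowercase letter that is not a vowel.
consonant = re.compile(r"(?![aeiou])[a-z]")
consonants = consonant.findall("regex tutorial")

# Emulates [\S&&[\D]]: a non-whitespace character that is also a non-digit.
non_space_non_digit = re.compile(r"(?=\S)\D")
kept = non_space_non_digit.findall("a 1\tb!")

print(consonants)
print(kept)
```

The lookahead checks the next character against one class, then the class that follows consumes it, so both conditions apply to the same character.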
Inline Modifiers
None of these are supported in JavaScript. In Ruby, beware of (?s) and (?m).
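(?i)
Case-insensitive mode
(?i)Monday
monDAY
(?s)
DOTALL mode (except Ruby): the dot matches line break characters
(?s)From A.*to Z
From A to Z
(?m)
Multiline mode (except Ruby): ^ and $ match at the beginning and end of every line
(?m)1\r\n^2$\r\n^3$
1 2 3
(?m)
In Ruby: the same as (?s) in other engines, i.e. DOTALL mode
(?m)From A.*to Z
From A to Z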
(?x)
Free-spacing mode: whitespace in the pattern is ignored and # starts a comment
(?x) # this is a # comment
abc # write on multiple # lines
[ ]d # spaces must be # in brackets
abc d
(?n)
.NET: named capture only (plain parentheses do not capture)

(?d)
Java: Unix line breaks only. The dot and the ^ and $ anchors are only affected by \n
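Python accepts (?i), (?s), (?m) and (?x) when they appear at the very start of the pattern. A minimal sketch, with made-up multi-line inputs to show the effect of each mode:

```python
import re

# (?i): case-insensitive matching.
case_insensitive = re.fullmatch(r"(?i)Monday", "monDAY")

# (?s): the dot also matches a line break.
dotall = re.fullmatch(r"(?s)From A.*to Z", "From A\nto Z")
no_dotall = re.fullmatch(r"From A.*to Z", "From A\nto Z")   # None

# (?m): ^ and $ match at the start and end of every line.
per_line = re.findall(r"(?m)^\d$", "1\n2\n3")

# (?x): whitespace is ignored and # starts a comment.
free_spacing = re.fullmatch(
    r"""(?x)  # free-spacing mode
    abc       # write on multiple lines
    [ ]d      # a literal space must be in brackets
    """,
    "abc d",
)

print(bool(case_insensitive), bool(dotall), bool(no_dotall), per_line, bool(free_spacing))
```

The same modes are also available as flags (`re.IGNORECASE`, `re.DOTALL`, `re.MULTILINE`, `re.VERBOSE`) if you prefer to keep them out of the pattern.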

Other Syntax
\Q…\E
Perl, PCRE (C, PHP, R…), Java: treat anything between the delimiters as a literal string. Useful to escape metacharacters.
\Q(C++ ?)\E
(C++ ?)
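Python's `re` module does not support \Q…\E. A comparable result, sketched here as a substitute technique, comes from `re.escape`, which backslash-escapes the metacharacters of a string before it is used as a pattern. The surrounding sentence is a made-up input.

```python
import re

literal = "(C++ ?)"

# re.escape returns a pattern that matches the string literally.
escaped = re.escape(literal)

exact = re.fullmatch(escaped, literal)
found = re.search(escaped, "Is it (C++ ?) or something else")

print(escaped)
print(found.group())
```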

