Regex Tutorial


Zeus
03-24-2008, 01:19 PM
Right, as I'm not about to waste my time going in depth on ALL of regex's functions (and God alone knows there are many), I decided to post Urza's tutorial on this network.

Urza is a member of the SwiftIRC scripting community, and you can check them out on Irc.SwiftIRC.net -j #mSL

What is regex?
A simple definition I use is that it's kinda like search and replace on steroids.
It lets you search for sections of text within a string that match a pattern.

How is it used? (in mIRC)
$regex([name], text, regex)
This is used to check if the given text matches the regex.
[name] is an optional name that can be used to refer to the regex matches later.
text is the text you want to search, and regex is the regex pattern.
Note that $regex returns the number of matches, NOT what the matches actually were. $regml is needed for that.

$regsub([name], text, regex, subtext, %var)
This is used to replace text matched by a regex with some other text, and is probably the most confusing way to use regex.
[name], text and regex are the same as above.
subtext is the text you want to replace the matches with.
%var will contain the final result after replacements have been made.
The most common use of this (in my experience at least) is to strip html tags:
alias striptags { var %x,%y = $regsub($1-,/(<[^>]+>)/g,$null,%x) | return %x }
%x contains the result after replacements have been made.
%y contains the number of replacements made (the value actually returned by the $regsub identifier).
If you have no previous experience, the regex probably makes absolutely no sense right now, but hopefully it will later :P

$regsubex([name], text, regex, subtext)
This is similar to $regsub above, but instead of returning the number of replacements made it simply returns the text after the replacements have been made.
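
If you want to play with the tag-stripping pattern outside mIRC, here is a minimal sketch in Python. This is not mSL: Python's re module is only a stand-in, and the function names are made up for illustration. re.subn behaves roughly like $regsub (you get the new text and the replacement count), while re.sub is closer to $regsubex (you get just the text).

```python
import re

# Same pattern as the striptags alias: "<", one or more non-">" characters, ">".
# Python replaces every match by default, so there is no /g switch to add.
TAG = re.compile(r"<[^>]+>")

def strip_tags_counted(text):
    """Roughly like $regsub: return (new text, number of replacements)."""
    return TAG.subn("", text)

def strip_tags(text):
    """Roughly like $regsubex: return only the new text."""
    return TAG.sub("", text)

if __name__ == "__main__":
    print(strip_tags_counted("<b>bold</b> and <i>italic</i>"))
    print(strip_tags("<a href='x'>link</a>"))
```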

$regml([name],N)
This is used to refer to the backreferenced (explained later) values found by a previous use of $regex or $regsub.
[name] is again the optional name of the regex. It is only required if the previous $regex/$regsub call was named.
N refers to which match you want to use.
$regml(1) = first match of an unnamed regex
$regml(SomeRegex,5) = 5th match of a regex named SomeRegex
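
The split between "how many matches" and "what was matched" can be tried out in Python as well. This is only a rough analogy, not mSL, and both helper names are hypothetical: the first counts matches the way $regex does with the /g switch, the second hands back the captured groups of the first match, which is the kind of value $regml(N) gives you.

```python
import re

def count_matches(pattern, text):
    """Number of non-overlapping matches, like $regex with the /g switch."""
    return sum(1 for _ in re.finditer(pattern, text))

def captured_groups(pattern, text):
    """Captured groups of the first match, in order (group 1, group 2, ...)."""
    m = re.search(pattern, text)
    return list(m.groups()) if m else []

if __name__ == "__main__":
    print(count_matches(r"\d+", "a1 b22 c333"))
    print(captured_groups(r"(\w+)@(\w+)", "nick@host"))
```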

Regex matchtext triggers
Regex can also be used as the matchtext in text/action events.
The format for this is much like a regular event, but it is prefixed with a $ sign and the matchtext is a regex expression.
A simple example is the ! or

Some examples:
These are just a couple of examples of places I've used regex in working scripts, with an explanation of how the regex in each works.
on *:text:*:#: {
  if ($regex($1-,/(.+?)\1{10}/)) kb 10m $chan $nick Repetition of $+(',$regml(1),') 10+ times

  if ($len($1-) > 10) && ($round($calc($regex($1-,/[A-Z]/g) / $len($remove($1-,$chr(32))) * 100),0) > 66) kb 10m $chan $nick Drop the caps - $+($v1,%) used

  var %regex = /#[^,]*,0/
  if ($regex($1-,%regex)) kb 10m $chan $nick Dont try and make people part channels kay
}
This is an extract from the scripts some of you will have seen in action in #swiftswitch and #mirc. (Note that kb is an alias I made, so if you plan on copying any of those you'll have to change that bit :P)

/(.+?)\1{9}/
This is used to match excessive repetition within one line.
(.+?) matches ANY character 1 or more times, lazily.
\1 matches whatever was put into the backreference (i.e. whatever was matched by the (.+?)).
{9} means the \1 must be repeated 9 times, so together with the original (.+?) the same text has to appear 10 times in a row. (The script above uses {10}, which asks for 11 in a row.)

The laziness in the initial backreferenced section is optional here. The same strings would be matched with or without it; only the captured text differs.
For example, if someone spammed with
abcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabcabc
the lazy version simply matches the repetition of 'abc', whereas the greedy version would match repetitions of 'abcabc'.
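
The lazy versus greedy difference is easy to check for yourself. Here is a small sketch in Python (a stand-in for mIRC's regex engine, which handles these particular constructs the same way), run against a line of 20 repeated 'abc's like the one above. The variable names are just for illustration.

```python
import re

# A spam line made of 'abc' repeated 20 times.
spam = "abc" * 20

# Lazy group: grows one character at a time until 9 more copies follow.
lazy = re.search(r"(.+?)\1{9}", spam)

# Greedy group: starts as long as possible and shrinks until 9 copies fit.
greedy = re.search(r"(.+)\1{9}", spam)

if __name__ == "__main__":
    print(lazy.group(1))    # the shortest unit that repeats 10 times
    print(greedy.group(1))  # the longest unit that still repeats 10 times
```

Both patterns find a match on this line, so the kick would fire either way. The only difference is what ends up in backreference 1, and so what the kick message shows.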

/[A-Z]/g
This is a simple one: it matches every capital letter.
Used in the $regex identifier, it returns the number of capital letters in the string.
(Note the /g is required, as without it the search stops at the first match and the result would never be more than 1.)
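
To see the arithmetic behind the caps check, here is a rough Python version of the same calculation (not mSL). The function names and the default thresholds are assumptions copied from the script extract above, and Python rounds exact halves differently from mIRC's $round, so borderline values may differ by one.

```python
import re

def caps_percentage(line):
    """Percentage of capital letters among the non-space characters."""
    capitals = len(re.findall(r"[A-Z]", line))  # what $regex(...,/[A-Z]/g) counts
    length = len(line.replace(" ", ""))         # like $len($remove($1-,$chr(32)))
    return round(capitals / length * 100) if length else 0

def too_many_caps(line, min_len=10, limit=66):
    """Mirror the script's check: longer than min_len and above limit percent."""
    return len(line) > min_len and caps_percentage(line) > limit

if __name__ == "__main__":
    print(caps_percentage("Hello World"))
    print(too_many_caps("STOP SHOUTING AT ME"))
```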

/#[^,]*,0/
This is used to catch people trying to make others part channels with the #*,0 trick.
# matches the character #
[^,]* matches anything that ISN'T a , 0 or more times
,0 matches the characters ,0

At first glance it may appear that this will not work if a person uses, for example, #1,000 with more than one 0 at the end.
The reason it does work is that there is no end-of-line anchor, so the expression doesn't care what comes before or after the match, as long as it actually finds a match.
If you wanted to be able to reference the 'channel' that was used later, you could use something like
/(#[^,]*,0[^ ]*)/ - This is the same as above, except it looks for any non-space character 0 or more times after matching the 0. The result is then stored in backreference 1, or $regml(1).

Also note that I declared the regex in a variable. In many cases this must be done, as otherwise the $regex identifier will think any , in the expression is a separator character.
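
Here is a quick Python sketch (again only a stand-in for mIRC, with made-up function names) showing that the unanchored pattern does catch #1,000, and what the capturing variant picks up.

```python
import re

# "#", then anything that is not a comma, then ",0". No anchors on either side.
PART_TRICK = re.compile(r"#[^,]*,0")

# Capturing variant: also grabs any non-space characters after the ",0".
PART_TRICK_CAPTURE = re.compile(r"(#[^,]*,0[^ ]*)")

def has_part_trick(line):
    """True if the line contains the #*,0 trick anywhere."""
    return PART_TRICK.search(line) is not None

def captured_channel(line):
    """The text that would land in backreference 1, or None if no match."""
    m = PART_TRICK_CAPTURE.search(line)
    return m.group(1) if m else None

if __name__ == "__main__":
    print(has_part_trick("everyone join #1,000"))
    print(captured_channel("everyone join #1,000 now"))
```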


alias striptags { var %x,%y = $regsub($1-,/(<[^>]+>)/g,$null,%x) | return %x }
Back to the striptags alias you saw above.
< matches the character <
[^>]+ matches anything that ISNT > 1 or more times
> matches the character >
/g forces it to continue checking after the first match

Used as it is in the $regsub identifier, all matches are replaced with $null, effectively stripping all full html tags from a string.

    if ($regex($2-,/^1 ?\+ ?1$/)) %send $c1(%n,%c,%send,$2- =) $c2(%n,%c,%send,3)
This is a line from the calc command in my bot to tell people that 1+1 is 3.
^ matches the beginning of the string (in this case $2-)
1 matches the character 1
' ?' (a space followed by ?) matches an optional space
\+ matches the character + (note it is prefixed with a \ as the + itself is a special character)
' ?' matches another optional space
1$ matches a 1 followed by the end of the string
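
The anchors and optional spaces can be tested with a few inputs. This is a minimal Python sketch, not mSL, and is_one_plus_one is a hypothetical helper name.

```python
import re

# Start of string, "1", optional space, literal "+", optional space, "1", end.
ONE_PLUS_ONE = re.compile(r"^1 ?\+ ?1$")

def is_one_plus_one(expr):
    """True only for 1+1 written with at most one space around the plus."""
    return ONE_PLUS_ONE.search(expr) is not None

if __name__ == "__main__":
    for expr in ("1+1", "1 + 1", "1  +  1", "11+1", "1+10"):
        print(repr(expr), is_one_plus_one(expr))
```

Because of the ^ and $ anchors, anything extra before or after (such as 11+1 or 1+10) is rejected, unlike the unanchored #*,0 pattern earlier.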

  if ($regex(%t,/<td class="tablebottom">(.*)</td>/)) {
    var %data = $regml(1)
Another extract from my bot. This looks for data in a cell of a table to be used in processing.
<td class="tablebottom"> simply matches the opening tag.
(.*) matches any string.
</td> again simply matches the closing tag.

Note that if the format of the page you are reading is not known, this regex would not be a good idea, as the greedy (.*) would cause unwanted results if two or more cells were written on the same line.
I have only used this format because I know that each line in the html being read will only ever contain one cell.
If this is not the case, a negated character class would be a better option than the .*
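
To make the greedy problem concrete, here is a small Python sketch (a stand-in for mIRC) with two cells on one line. The sample line is invented, and the negated class [^<]* is just one possible choice. It assumes the cell text contains no nested tags.

```python
import re

# Two cells written on the same line (made-up sample data).
line = '<td class="tablebottom">first</td><td class="tablebottom">second</td>'

# Greedy (.*) runs to the LAST </td> on the line.
greedy = re.search(r'<td class="tablebottom">(.*)</td>', line).group(1)

# Negated class stops at the first "<", so it stays inside one cell.
negated = re.search(r'<td class="tablebottom">([^<]*)</td>', line).group(1)

if __name__ == "__main__":
    print(greedy)
    print(negated)
```

The greedy version swallows the closing tag of the first cell and the opening tag of the second, while the negated class returns only the first cell's text.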

Questions? Comments? I'd be happy to answer any pertaining to regex. You may get help by joining #mIRC on the Dodian IRC network (Irc.Dodian.Com)

DarkHeath
03-24-2008, 01:26 PM
Ah, I remember reading this on SwiftIRC forums. Urza ftw?