From b49d8ebefe9b10c53a6a09ad564e22111b7b25c6 Mon Sep 17 00:00:00 2001 From: Stef Walter Date: Sat, 20 Sep 2003 07:12:49 +0000 Subject: Initial Import --- doc/language_v2.htm | 359 ++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 359 insertions(+) create mode 100644 doc/language_v2.htm (limited to 'doc/language_v2.htm') diff --git a/doc/language_v2.htm b/doc/language_v2.htm new file mode 100644 index 0000000..9e20493 --- /dev/null +++ b/doc/language_v2.htm @@ -0,0 +1,359 @@ + + + + The rep language + + + + +

The rep Language

+ +

Syntax
+Comments
+Commands
+Script Notes

+ +

Syntax

+

A rep script is made of various commands. The commands are detailed below, but here are a few basics:

+

Options: Certain commands can take various options. These are wrapped in parentheses and, when multiple options are present, separated by commas.

+

Text: Certain commands need a bit of text in order to do their thing. This could be a regular expression, or perhaps replacement text. This text follows any options if present. The first character of the text is the quote character and is used to detect the end of the data block. If the quote character is used inside the text it must be escaped with a backslash. You can use any of these characters as a quote character:

+ +
"~`!@#$%^&*[]|'><./?+=-;:
+ +

The examples in this document will use a double quote (") as the quote character.
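
The quoting rule above can be sketched in Python. This is not rep's own parser, only a minimal model of the rule as described: the first character is the quote, the same character ends the block, and a backslash escapes the quote. The name `read_text` is hypothetical, and leaving every other backslash untouched (so regular expression escapes survive) is an assumption.

```python
QUOTE_CHARS = "\"~`!@#$%^&*[]|'><./?+=-;:"


def read_text(source):
    """Split `source` into (text, remainder) using its first character as the quote."""
    quote = source[0]
    if quote not in QUOTE_CHARS:
        raise ValueError(f"{quote!r} is not a valid quote character")
    chars = []
    i = 1
    while i < len(source):
        ch = source[i]
        if ch == "\\" and i + 1 < len(source) and source[i + 1] == quote:
            chars.append(quote)  # escaped quote: keep the quote, drop the backslash
            i += 2
        elif ch == quote:
            return "".join(chars), source[i + 1:]  # closing quote found
        else:
            chars.append(ch)  # assumption: all other characters pass through as-is
            i += 1
    raise ValueError("unterminated text block")
```

Choosing a different quote character, such as `/`, avoids escaping when the text itself contains many double quotes.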

+ +

A general example of a rep command format would go something like this:

+ +
command (option, option) "data"
+ +

A command can extend to multiple lines. The following is valid:

+ +
command (options,
+options)
+"multiline
+data"
+ +

A command can be followed by curly brackets. Generally this means that the result of the command applies to whatever commands are inside the curly brackets. If the command fails (for example a match that doesn't match) then the stuff inside the curly brackets isn't executed:

+ +
command
+{
+    more commands
+}
+ +

Comments

+ +

A comment starts with a # sign and extends to the end of the line. Comments are not valid inside data. For example:

+ +
# Is a comment
+ +

Commands

+ +

match

+
syntax: match (not, once, find, tag, 0-9) "regexp"
+

Matches regular expression. A full study of regular expressions is outside the scope of this document. You can also match simple text, but you'll need to escape (with a backslash) any characters used by regular expressions. Those are:

+ +
.$%^*+?{}[]|()
+ +

Matches are not case sensitive unless the 'case' option is set with the options command (see below).

+ +

The statements inside the match are only executed if the match is successful. In addition, the statements inside the match can only operate on the text that was matched. The following matches the phrase 'Hi There!' in a document and then matches the 'Hi' part.

+ +
match "Hi There!"
+{
+    # This only matches the above 'Hi'
+    match "Hi"
+    {
+        # Do something with 'Hi'
+    }
+}
+ +

The match command can have several options:

+ +

not: Executes the contained statements if it doesn't match.

+ +
match (not) "can't find me"
+ +

once: Makes sure this match only matches once in a document.

+ +
match (once) "<title>"
+ +

find: Don't restrict statements inside the match to the text that was matched. This is useful for just verifying if something is there.

+ +
match (find) "check"
+ +

tag: Makes a tag match. This is explained further below.

+ +

0-9: Restricts the statements inside the match to the specified group (wrapped with parentheses) in the regular expression. 0 is the entire match and 1 through 9 are numbered groups.

+ +
match (1) "Johnny (Smith)"
+{
+    # Now we can do something with 'Smith'
+}
+ +

replace

+ +
syntax: replace "replace text"
+ +

Replaces the matched text with new text. For example the following replaces 'Hello' with 'Yo' anywhere in the document:

+ +
match "Hello"
+{
+    replace "Yo"
+}
+ +

You can include text groups (which were wrapped in parentheses) that were matched in the previous regular expression. These are specified by using a percent and the group number. %0 specifies all the matched text, and %1 - %9 are the numbered groups.

+ +

For example the following replaces all <img> tags with <image> tags:

+ +
match "<img(.*?)>"
+{
+    replace "<image%1>"
+}
+ +
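
As a rough illustration of how the %0 - %9 references expand, here is a Python sketch. It uses Python's `re` module rather than PCRE, and `expand_groups` is a hypothetical helper, not part of rep. Treating a missing group as empty text is an assumption, and variables such as %heading are not handled.

```python
import re


def expand_groups(template, match):
    """Replace %0-%9 in `template` with the corresponding group of `match`."""
    def lookup(ref):
        index = int(ref.group(1))
        if index > match.re.groups:
            return ""  # assumption: a group that does not exist expands to nothing
        return match.group(index) or ""
    return re.sub(r"%([0-9])", lookup, template)


document = 'A <img src="a.png"> and <img src="b.png">.'
pattern = re.compile(r"<img(.*?)>", re.IGNORECASE)  # rep matches ignore case by default
result = pattern.sub(lambda m: expand_groups("<image%1>", m), document)
print(result)
```

Here %1 carries the attributes of each tag over into the replacement unchanged.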

After text has been replaced it is locked. It cannot be matched again.

+ +

else

+ +
syntax: else
+ +

Executes the contained statements if the preceding statement failed. For example, the following else block executes if the match fails:

+ +
match "Yo"
+{
+    # Do whatever we do with "Yo"
+}
+else
+{
+    # Didn't match. Do something else
+}
+ +

lock

+ +
syntax: lock
+ +

Locks text so it cannot be matched again. Useful to exclude portions of the document from replacements later on. The following would lock all paragraphs in an HTML document.

+ +
match "<p>.*?</p>"
+{
+    lock
+}
+ +

loop

+ +
syntax: loop
+ +

Repeats the contained code until no more matches can be found. The following (dumb) example replaces all a's inside 'aardvark' with e's:

+ +
match "aardvark"
+{
+    loop
+    {
+        match "a"
+        {
+            replace "e"
+        }
+    }
+}
+ +

Note that the entire document is actually wrapped in an invisible loop command. The entire document loops until no more matches can be found. However because of locking, the same text will generally not match more than once. In the above example that specific instance of 'aardvark' wouldn't match again after a single 'a' inside had been replaced or locked. Because of this the loop command comes in quite handy.
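
The interaction between locking and loop can be modelled in a few lines of Python. This is only a sketch of the behaviour described above, not rep's implementation: the document is held as (text, locked) segments, each pass replaces one match and locks the result, and the loop ends when a pass finds nothing. The function names and the one-match-per-pass behaviour are assumptions.

```python
import re


def replace_once(segments, pattern, replacement):
    """Replace the first match found in an unlocked segment and lock the result.

    `segments` is a list of (text, locked) pairs. Returns True if a match was made.
    """
    for i, (text, locked) in enumerate(segments):
        if locked:
            continue  # locked text can never be matched again
        m = re.search(pattern, text)
        if m:
            pieces = [(text[:m.start()], False),
                      (replacement, True),
                      (text[m.end():], False)]
            segments[i:i + 1] = [p for p in pieces if p[0]]  # drop empty pieces
            return True
    return False


def loop(segments, pattern, replacement):
    """Model of the loop command: repeat until no more matches can be found."""
    passes = 0
    while replace_once(segments, pattern, replacement):
        passes += 1
    return passes


segments = [("aardvark", False)]
passes = loop(segments, "a", "e")
text = "".join(part for part, _ in segments)
print(passes, text)
```

Because every replacement is locked, the loop is guaranteed to run out of unlocked text to match, even when the replacement itself would match the pattern.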

+ +

function

+ +
syntax: function function_name
+ +

Defines a function which can be called with the call command. Function names are case sensitive, can be up to 40 characters long, and can consist of letters and the underscore.

+ +
function test
+{
+    # do whatever here
+}
+ +

call

+ +
syntax: call function_name
+ +

Calls a function. The function must have been defined earlier in the document.

+ +
call test 
+ +

The 'call' bit of the statement can be omitted, so the above can be shortened to:

+ +
test
+ +

For example the following function is used for multiple cases:

+ +
+function atoe
+{
+    loop
+    {
+        match "a"
+        {
+            replace "e"
+        }
+    }
+}
+
+match "aaron"
+{
+    atoe
+}
+
+match "aardvark"
+{
+    atoe
+}
+ +

return

+ +
syntax: return (number)
+ +

When inside a function, returns from that function to the code which called it. A function will normally return when its end is reached, but return can be used to return earlier.

+ +

You can also return a success code. This must be either 0 (for fail) or 1 (for success). In the code that called the function you can use else to take action on failure.

+ +

end

+ +
syntax: end
+ +

Stops the script at the current location. No more matches are done.

+ +

stop

+ +
syntax: stop "optional error message"
+ +

Stops the script as if an error had occurred. You can include an error message.

+ +

set

+ +
syntax: set variable_name "value"
+ +

Assigns a value to a variable. The value can include percent references to matched groups (%0 - %9) just like replace text can. Variables can be used in later match statements, in replaces, or in any text portion of a command. Variable names can consist of letters and the underscore. The name should also be less than 40 characters long. Variables are used like so:

+ +
%variable_name
+ +

The following example gets a heading from an HTML document and sets the title to it:

+ +
match "<h1>(.*?)</h1>"
+{
+    set heading "%1"
+}
+
+match "<head>"
+{
+    replace "<head><title>%heading</title>"
+}
+ +

An example of using a variable as a match would be the following. The title is matched, and then searched for in the document. If it's found it's made bold:

+ +
match (once) "<title>(.*?)</title>"
+{
+    set title "%1"
+}
+
+match "%title"
+{
+    replace "<b>%0</b>"
+}
+ +

If a variable is used whose value hasn't been set yet, its value is blank. You may also use environment variables in your document. This is how you'd pass values to your script from outside.

+ +

add

+ +
syntax: add variable_name "value"
+ +

Similar to set, but instead makes the variable an array and adds the value to it, so a variable can hold multiple values. This is only useful in matches, where any one of the variable's values will match. If a multiple variable is used anywhere else, an error will result.

+ +

The syntax for using a multiple variable is:

+ +
%(variable_name)
+ +

If used without the parentheses, the variable will act like a normal variable and the first of its values will be used.

+ +

The following example matches any HTML table element:

+ +
set telement "table"
+add telement "td"
+add telement "tr"
+add telement "tbody"
+add telement "thead"
+
+match "%(telement)"
+{
+    # Do whatever
+}
+ +
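
One way to picture a multiple variable is as a regular expression alternation. The Python sketch below models set, add, and the two reference forms under that assumption. It is not how rep is known to implement them, and `set_var`, `add_var`, and `expand` are hypothetical names.

```python
import re

variables = {}


def set_var(name, value):
    """Model of set: the variable holds exactly one value."""
    variables[name] = [value]


def add_var(name, value):
    """Model of add: append another value to the variable."""
    variables.setdefault(name, []).append(value)


def expand(pattern):
    """Expand %(name) to an alternation of all values and %name to the first value."""
    def multi(ref):
        return "(?:" + "|".join(variables.get(ref.group(1), [])) + ")"

    def single(ref):
        values = variables.get(ref.group(1), [])
        return values[0] if values else ""  # an unset variable is blank

    pattern = re.sub(r"%\(([A-Za-z_]+)\)", multi, pattern)
    return re.sub(r"%([A-Za-z_]+)", single, pattern)


set_var("telement", "table")
for name in ("td", "tr", "tbody", "thead"):
    add_var("telement", name)

print(expand("%(telement)"))
print(re.findall(expand("%(telement)"), "<table><tr><td>"))
```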

clr

+ +
syntax: clr variable_name
+ +

Clears a variable of its value.

+ +

options

+ +
syntax: options(case, line)
+ +

Sets various options for your script. These options apply only inside the set of curly braces in which they are set. To set global options, put the options command at the top of your file. The options are:

+ +

case: Makes the matches case sensitive.

+ +

line: Restricts matches to one line.

+ +

delimiter: Sets the delimiter for the data portions of commands. By default the double-quote is used, but if you need to match this often, then you can change the delimiter to something else.

+ +

The following makes matches case sensitive:

+ +
options(case)
+ +

Script Notes

+ +

First of all you'll need to know regular expressions to get anything useful out of rep scripting. The regular expressions used in rep are PCRE (Perl Compatible Regular Expressions). They're not explained here but you can get tons of info online for them.

+ +

Once your script gets a little more than just two or three commands, it's important to understand how it gets run:

+ +

The entire script is run over and over again in a loop until no more matches are made. Although it's difficult, it is possible to throw your script in an endless loop. If this is the case then execution stops after a million loops.
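
The outer loop and its safety cap can be sketched as follows. This is a Python model of the behaviour described above, not rep's code: `run_pass` is a hypothetical callable that runs the whole script once and reports whether anything matched.

```python
MAX_PASSES = 1_000_000  # the cap on loops described above


def run_script(run_pass, max_passes=MAX_PASSES):
    """Call run_pass() until it reports no matches; return the number of passes run."""
    passes = 0
    while passes < max_passes:
        passes += 1
        if not run_pass():
            break  # a pass with no matches ends the script
    return passes


state = {"text": "aaa"}


def one_pass():
    """Toy script pass: change one 'a' to 'b' and report whether it matched."""
    if "a" in state["text"]:
        state["text"] = state["text"].replace("a", "b", 1)
        return True
    return False


print(run_script(one_pass), state["text"])
```

A pass that always reports a match, such as `lambda: True`, is stopped by the cap instead of running forever.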

+ +

Portions of the document that have been locked or replaced cannot be matched again. Also if a portion of text to be matched has been locked it cannot be matched. Keep this in mind when making your scripts. If you wish to match and replace multiple items inside another match, you'll need to use a loop command to do so.

+ +

The rep processor can be used in a buffered mode where only a portion of the document is operated on at one time. This greatly increases the speed of the processor. But you have to be careful that any matches you make will fit inside that buffer. In many cases (for example, matching the entire <body> tag of an HTML document), you won't be able to use buffered mode reliably.

+ +

Tag Matches

+

Tag matches match a starting and closing tag of your choice. They also take into consideration that there may be other tags inside that could match. The opening and closing tag are separated by an equal sign ('=').

+ +

For example the following would match a set of <div> tags in an HTML document. It would pair up the correct set of <div> tags even if there were other contained tags that could match:

+ +
match (tag) "<div>=</div>"
+ +

The regular expression groups are handled differently when using tag matching:

+ +

0: This is the entire match as usual.
+1: The text of the opening tag.
+2: The contained text.
+3: The text of the closing tag.
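
The pairing of nested tags can be illustrated with a depth counter. The Python sketch below is one plausible way to get the behaviour and the four groups described above. It is not taken from rep's source, `match_tag` is a hypothetical name, and splitting the text at the first equal sign is an assumption.

```python
import re


def match_tag(text, open_pattern, close_pattern):
    """Find the first balanced open/close pair; return groups 0-3, or None."""
    start = re.search(open_pattern, text, re.IGNORECASE)
    if not start:
        return None
    either = re.compile(f"(?P<open>{open_pattern})|(?P<close>{close_pattern})",
                        re.IGNORECASE)
    depth = 1
    pos = start.end()
    while True:
        m = either.search(text, pos)
        if not m:
            return None  # unbalanced: the opening tag is never closed
        if m.group("open") is not None:
            depth += 1  # a nested opening tag
        else:
            depth -= 1
            if depth == 0:
                return [text[start.start():m.end()],  # 0: the entire match
                        start.group(0),                # 1: the opening tag
                        text[start.end():m.start()],   # 2: the contained text
                        m.group(0)]                    # 3: the closing tag
        pos = max(m.end(), pos + 1)  # always move forward


text = "<p><div>a<div>b</div>c</div><div>d</div></p>"
open_tag, close_tag = "<div>=</div>".split("=", 1)
groups = match_tag(text, open_tag, close_tag)
print(groups)
```

A plain non-greedy regular expression such as `<div>.*?</div>` would stop at the first closing tag, which is why the depth counter is needed.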
