Regular Expression Challenge - Discussion and Winners

The tip of the week for August 17, 2007 announces the winners of the second Lasso Programming Challenge. Congratulations to Bil Corry and Johan Sölve. The tip includes a description of the challenge, a downloadable archive of answers to the challenge, documentation of the answer which LassoSoft created for the challenge, and a discussion of how the entries which were received used different strategies to solve the challenge.

Regular Expression Challenge

The second Lasso Programming Challenge was to solve a number of problems using regular expressions. The problems were selected to represent a range of real world solutions. This discussion of the answers will help to show different methods of approaching the same problem using regular expressions. See the challenge page for full details of the challenge.

Winners!

All of the entries we received for the challenge were excellent so it was a difficult task to select two winners. All of the challengers included answers to all of the problems, so the answers to the more complex problems had to be used to decide the winner.

Bil Corry - Bil Corry is our grand prize winner. We liked in particular the way that several of his solutions used only one regular expression on the entire input text rather than making several passes.

Johan Sölve - Johan Sölve is our runner up. We liked in particular the caching which was used on the Lasso Reference problem to help cut down the number of database operations required.

Ke Carlton - Honorable mention goes to Ke Carlton. The custom type he created encapsulates the solutions so that they could be easily used within an actual Lasso solution.

The different implementations used by the entries show how different programmers can approach the same problem. Three of the entries and Fletcher's official LassoSoft answer have been collected, along with the original challenge files, into a download so you can see how the different entries were put together.

 

Download File 

Answers

This section contains a brief discussion of each problem and the answers we received. We highlight what we consider to be the best answer and also show interesting variations. The first six problems have one-line answers, but demonstrate many useful regular expressions. The last two problems required more complex programming and demonstrate more advanced techniques.

Before reading the answers, I would encourage you to take the original Challenge.Lasso file and attempt to solve the problems on your own. Then, take a look at these answers and at this discussion to see different methods of solving the problems.

1 - Phone Numbers

The problem was to convert phone numbers like (###) ###-#### to the format ###-###-####. Each of the solutions was similar, but there were a few variations.

This solution uses [String->Replace] with a [RegExp] object that supplies the -Find and -Replace pattern.

  $answerText = $originalText;
  $answerText -> (replace: (regexp:
      -find='\\((\\d{3})\\) (\\d{3}-\\d{4})',
      -replace='\\1-\\2'));

 

This solution uses the procedural [String_ReplaceRegExp] tag to replace the phone numbers within $originalText with the new format.

  $answerText = (string_replaceregexp: $originalText,
      -find='\\((\\d\\d\\d)\\) (\\d\\d\\d)\\-(\\d\\d\\d\\d)',
      -replace='\\1-\\2-\\3', -ignorecase);

 

The regular expressions themselves demonstrate two different methods. The first uses the {3} and {4} patterns to specify a particular number of repetitions of the \d digit placeholder. The second uses explicit repetition of the digit placeholder.
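
For readers who want to try the two patterns outside Lasso, here is a minimal sketch in Python rather than Lasso. The function names and the sample text are assumptions made for this illustration; only the two regular expressions come from the solutions above.

```python
import re

# Counted repetition ({3} and {4}), as in the first solution.
COUNTED = re.compile(r'\((\d{3})\) (\d{3}-\d{4})')
# Explicit repetition of the digit placeholder, as in the second solution.
EXPLICIT = re.compile(r'\((\d\d\d)\) (\d\d\d)-(\d\d\d\d)')

def reformat_counted(text):
    # Group 1 is the area code, group 2 is the rest of the number.
    return COUNTED.sub(r'\1-\2', text)

def reformat_explicit(text):
    # Three groups here, so the hyphens are supplied by the replacement.
    return EXPLICIT.sub(r'\1-\2-\3', text)

sample = 'Call (360) 555-0123 or (206) 555-0199.'
print(reformat_counted(sample))
```

Both functions should produce the same result on well-formed input; the choice between them is mostly a matter of readability.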

2 - White Space

The problem was to strip white space (returns, tabs, and spaces) from a block of text. Any sequence of white space which included a return or newline should be collapsed to a single return/newline pair. Any other sequence of white space should be collapsed to a single space.

Two different techniques were used for this problem. At least two regular expressions were required in order to work with the two different types of white space (with returns and without).

One technique replaces any string of returns or newlines with a token %RETURN%. Then, every remaining run of white space is collapsed to a single space, and finally the %RETURN% tokens (together with any spaces around them) are replaced by an actual return/newline pair. Care should be taken to ensure that the text does not already contain the token %RETURN% or extra returns will be inserted.

  $answerText = (string_replaceregexp: $originalText,
      -find='(\r\n)+|\r+|\n+', -replace='%RETURN%', -ignorecase);
  $answerText = (string_replaceregexp: $answerText,
      -find='\\s+', -replace=' ', -ignorecase);
  $answerText = (string_replaceregexp: $answerText,
      -find='(\\s*(%RETURN%)+\\s*)+', -replace='\r\n', -ignorecase);

The other technique uses two regular expressions. The first collapses any sequence of white space which includes a return or a newline to a single return/newline pair. The second collapses all other white space (sequences of spaces or tabs) to single spaces.

  var: 'answerText' = (string_replaceregexp: $originalText,
      -Find='\\s*[\r\n]\\s*',
      -Replace='\r\n');
  var: 'answerText' = (string_replaceregexp: $answerText,
      -Find='[ \t]+',
      -Replace=' ');
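
The two-pass technique can be sketched in Python as follows. This is an illustration only, not one of the submitted answers, and the function name collapse_whitespace is an assumption.

```python
import re

def collapse_whitespace(text):
    # Pass 1: any run of white space that contains a return or newline
    # becomes a single return/newline pair.
    text = re.sub(r'\s*[\r\n]\s*', '\r\n', text)
    # Pass 2: any remaining run of spaces or tabs becomes a single space.
    return re.sub(r'[ \t]+', ' ', text)

print(repr(collapse_whitespace('a  \t b \r\n\r\n  c\td')))
```

The order matters: running the second pass first would still work, but it would leave single spaces next to line breaks for the first pass to clean up.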

 

3 - Remove Duplicate Lines

The problem was to remove duplicate lines from a block of text, leaving only one instance. For example, a list of names could be reduced so it contains only one instance of each name. The problem assumes that the string is already sorted so duplicate lines will be adjacent.

The most elegant solution requires only a single regular expression. The second set of parentheses ([^\r\n]*) matches a line by finding a string of any characters except return or newline.

The third set of parentheses (\r\n\1)+ uses the \1 pattern to match the line which was just matched, preceded by a return/newline pair. That is a duplicate line. The + symbol allows one or more duplicates to be matched.

The first and fourth sets of parentheses contain a look-behind and a look-ahead assertion which ensure that the matched lines start and end with a return/newline pair or abut the start or end of the input text.

The replacement pattern \1 is the first capturing group, that is, the line itself. The look-behind and look-ahead assertions do not capture, so the second set of parentheses is group 1.

  $answerText = (string_replaceregexp: $originalText,
      -find='(?<=\r\n|^)([^\r\n]*)(\r\n\\1)+(?=\r\n|$)',
      -replace='\\1', -ignorecase);
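
A rough Python equivalent is sketched below. Python's re module only accepts fixed-width look-behind assertions, so the look-behind is written here as (?:(?<=\r\n)|^) instead of (?<=\r\n|^); that rewrite and the function name are assumptions of this sketch, not part of the original answer.

```python
import re

DUPLICATES = re.compile(
    r'(?:(?<=\r\n)|^)'   # start of input, or just after a return/newline pair
    r'([^\r\n]*)'        # group 1: one whole line
    r'(\r\n\1)+'         # one or more repeats of that same line
    r'(?=\r\n|$)'        # the last repeat must end at a line break or the end
)

def dedupe_sorted(text):
    # Keep only the first copy of each run of identical adjacent lines.
    return DUPLICATES.sub(r'\1', text)

print(repr(dedupe_sorted('Ann\r\nAnn\r\nBob\r\nCy\r\nCy\r\nCy')))
```

The look-ahead is what stops "Ann" followed by "Annie" from being treated as a duplicate.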

 

Another possibility is to use a simple regular expression which replaces a pair of identical adjacent lines with a single line and then to use a [While] ... [/While] loop until the replacement no longer changes the text. Each pass collapses adjacent pairs, so a run of duplicates is roughly halved each time, and one final pass is needed to confirm that nothing changed. If "John Doe" appeared four times, the first pass would leave two copies, the second would leave one, and the third would return the same string as was input.

Note that the originalText has an extra return/newline pair added to the end of it. This is necessary because the -Find pattern expects every line to be ended with a return/newline pair.

  var: 'tempText' = $originalText + '\r\n';
  while: $tempText != $answerText;
    var: 'answerText' = $tempText;
    var: 'tempText' = (string_replaceregexp: $tempText,
        -Find='(.*?)\r\n\\1\r\n',
        -Replace='\\1\r\n');
  /while;
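
The same repeat-until-unchanged idea is sketched below in Python. The pattern is a slightly stricter variant that is anchored to the start of a line, which is an assumption of this sketch and not the submitted pattern; the function name is also hypothetical.

```python
import re

# A pair of identical adjacent lines, each ended by a return/newline pair.
PAIR = re.compile(r'^([^\r\n]*)\r\n\1\r\n', re.MULTILINE)

def dedupe_by_looping(text):
    # Every line must end with a return/newline pair for the pattern to work.
    current = text + '\r\n'
    while True:
        reduced = PAIR.sub(r'\1\r\n', current)
        if reduced == current:
            # A pass that changes nothing means no adjacent pairs remain.
            return reduced
        current = reduced

print(repr(dedupe_by_looping('John Doe\r\nJohn Doe\r\nJohn Doe\r\nJohn Doe\r\nJane Roe')))
```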

 

One final variation uses the [RegExp] type to find each line within the original text. The tag remembers each line in a #last variable. The current line is only appended to the output if it does not match the #last line.

The [RegExp] type is initialized with the -Find pattern and the input. The [While] ... [/While] loop uses [RegExp->Find] to advance through each match in the input. The current match is inspected using [RegExp->MatchString]. If it does not match #last then the loop is allowed to advance (Lasso appends the match string unchanged in this case). Otherwise, an empty string is appended to the output. Finally, the [RegExp->AppendTail] tag appends the remainder of the input to the output. And, the [RegExp->Output] tag is used to replace the original text with the answer text.

  define_tag:'dedupe';
    local:'reg'  = regExp( -find = '(.*?\r\n|$)',
      -replace = '',
      -input  = self->'result',
      -ignorecase);
    local:'last' = '';
    local:'rep' = '';
    while:#reg->find;
      local:'match' = #reg->matchString;
      #match == #last ? #reg->appendReplacement('');
      #last=#match;
    /while;
    #reg->appendTail;
    self->'result' = #reg->output;
    return:self;
  /define_tag;
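
The scan-and-remember approach can be approximated in Python with re.finditer, as in the sketch below. Python has no direct counterpart to [RegExp->AppendReplacement], so the output is built up in a list instead; the names and the line pattern are assumptions of this sketch.

```python
import re

# One line of text together with its terminator (or the end of the input).
LINE = re.compile(r'[^\r\n]*(?:\r\n|$)')

def dedupe_scan(text):
    out = []
    last = None
    for match in LINE.finditer(text):
        line = match.group(0)
        if line == '':
            continue  # the empty match at the very end of the input
        content = line.rstrip('\r\n')
        if content != last:
            out.append(line)  # first copy of this line: keep it
        last = content
    return ''.join(out)

print(repr(dedupe_scan('Ann\r\nAnn\r\nBob\r\nCy\r\nCy')))
```

Comparing the line content without its terminator lets a final line with no trailing return/newline pair still be recognized as a duplicate.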

 

Finally, Lasso does have one great non-regexp way of doing this same operation. Create an array of lines by splitting the original text on the return/newline pair. Create a set and insert every element from the array into it. The set is only allowed to contain unique elements so this filters out duplicates. Then, join the set back into a string using a return/newline pair. This code has the advantage of not requiring that the original text be sorted.

  Var: 'answerArray' = $originalText->(Split: '\r\n');
  Var: 'answerSet' = Set;
  $answerSet->(InsertFrom: $answerArray->Iterator);
  Var: 'answerText' = $answerSet->(Join: '\r\n');

 

The same code can be written as a single line like this:

  Var: 'answerText' = Set->(InsertFrom: $originalText->(Split: '\r\n')->Iterator) & (Join: '\r\n');
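
For comparison, here is a minimal Python sketch of the same split, filter, and join idea. It uses dict.fromkeys as an order-preserving stand-in for the set, which is a choice made for this sketch; the ordering of the result in the Lasso version is not described above.

```python
def dedupe_unsorted(text):
    lines = text.split('\r\n')
    # dict.fromkeys keeps the first occurrence of each line, in original order.
    unique = dict.fromkeys(lines)
    return '\r\n'.join(unique)

print(repr(dedupe_unsorted('Bob\r\nAnn\r\nBob\r\nAnn')))
```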

 

4 - Strip HTML

The problem was to strip HTML tags out of a block of text. The basic answer requires only a single regular expression. It wasn't in the problem, but it seems beneficial to replace breaks with return/newline pairs to maintain the basic flow of the text. These two regular expressions do that.

  var: 'answerText' = (string_replaceregexp: $originalText,
      -Find='<br />',
      -Replace='\r\n');
  var: 'answerText' = (string_replaceregexp: $answerText,
      -Find='<[^>]+>',
      -Replace='');

Another variation is to use a single regular expression. This code uses a [RegExp] tag as the find/replace pattern for the [String->Replace] tag.

  $answerText = $originalText;
  $answerText -> (replace: (regexp:
      -find='(<[!/]?(--|\\w+)[^>]*>)', -replace=''));
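
The break-then-strip approach can be sketched in Python like this. The break pattern is slightly more permissive than the literal '<br />' used above (it also accepts <br> and <br/>), which is an assumption of this sketch, as is the function name.

```python
import re

def strip_html(text):
    # Turn line breaks into return/newline pairs before removing the tags.
    text = re.sub(r'<br\s*/?>', '\r\n', text, flags=re.IGNORECASE)
    # Remove every remaining tag: "<", anything except ">", then ">".
    return re.sub(r'<[^>]+>', '', text)

print(repr(strip_html('<p>Hello<br />big <b>world</b></p>')))
```

Like the Lasso versions, this is a text clean-up, not an HTML parser: a ">" inside an attribute value would end the match early.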

 

Finally, a third variation did a more complex operation which converted breaks, lists, and other HTML elements to ASCII equivalents. This is how the custom tag looks. Take a look at the implementation files to see how the other elements of the custom type which contains this tag are set up.

  define_tag:'plainText';
    self->'result' = (
    self->htmlStripTags->replaceAll(
    self->htmlStripLinks->replaceAll(
    self->htmlListToPlain->replaceAll(
    self->htmlParasToPlain->replaceAll(
    self->htmlBreaksToPlain->replaceAll(
    self->htmlStripSpaces->replaceAll(
    self->htmlStripWhite->replaceAll(
    -input=self->'result'))))))));
    return:self;
  /define_tag;

 

5 - Decorate Web URLs

The problem was to decorate Web URLs so that plain-text URLs like http://www.lassosoft.com become links like <a href="http://www.lassosoft.com">http://www.lassosoft.com</a>. This is another problem which requires only a single regular expression. The complexity of that regular expression depends on how complete you want to be about the URLs that you catch. Here are a couple of variations.

  $answerText = (string_replaceregexp: $originalText,
      -find='(http|https)://([a-z0-9:$-_@.&+-!*"\'(),%=;/#?]+)(\\.|\\s|\\?|\\!)',
      -replace='<a href="\\1://\\2">\\1://\\2</a>\\3', -ignorecase);

 

This variation uses a [RegExp] object in [String->Replace] to perform the work.

  $answerText = $originalText;
  $answerText -> (replace: (regexp:
      -find='(\\bhttps?://(&|\\w|&|[-_./?%#])+)',
      -replace='<a href="\\1">\\1</a>'));
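
A simplified Python sketch of URL decoration follows. The pattern is an assumption of this sketch, not either submitted pattern: it requires the URL to end in a word character or a slash so that a sentence-ending period is left outside the link.

```python
import re

# Scheme, then URL characters, ending on a word character or a slash.
URL = re.compile(r'\b(https?://[\w\-./?%#&=]*[\w/])')

def decorate_urls(text):
    return URL.sub(r'<a href="\1">\1</a>', text)

print(decorate_urls('See http://www.lassosoft.com. Bye'))
```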

 

See the included answer files for an additional variation which uses a custom [String_Convert] type and an ->HTML member tag to perform a series of search/replaces which link URLs and email addresses at once.

6 - Decorate Email Addresses

The problem was to decorate email addresses so that plain-text addresses like challenge@lassosoft.com become links like <a href="mailto:challenge@lassosoft.com">challenge@lassosoft.com</a>. This is another problem which requires only a single regular expression. The complexity of that regular expression depends on how complete you want to be about the email addresses that you catch. Here are a couple of variations.

  $answerText = (string_replaceregexp: $originalText,
      -find='\\b([A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4})\\b',
      -replace='<a href="mailto:\\1">\\1</a>', -ignorecase);


This variation uses a [RegExp] object in [String->Replace] to perform the work.

  $answerText = $originalText;
  $answerText -> (replace: (regexp:
      -find='(\\b(\\w|[-_.])+@(\\w|[-_])+\\.(\\w|[-_.])+)',
      -replace='<a href="mailto:\\1">\\1</a>'));
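
The first email pattern translates to Python almost unchanged, as this sketch shows. The function name and sample text are assumptions; note that the {2,4} limit on the final part of the domain excludes longer top-level domains.

```python
import re

EMAIL = re.compile(r'\b([A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4})\b', re.IGNORECASE)

def decorate_emails(text):
    # Group 1 is the whole address; it is used as both the target and the label.
    return EMAIL.sub(r'<a href="mailto:\1">\1</a>', text)

print(decorate_emails('Write to challenge@lassosoft.com today.'))
```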

See the included answer files for an additional variation which uses a custom [String_Convert] type and an ->HTML member tag to perform a series of search/replaces which link URLs and email addresses at once.

 

7 - Wiki Style Formatting (see separate article)

8 - Lasso Reference Links

The problem here is to mark up Lasso tags like [Field], -Database, or [Records] ... [/Records] with links to the Lasso Reference. We only want to link tags which are actually in the Lasso Reference and we want to use the proper syntax for those links. A tag [Encode_TagLink] was provided which requires a tag name as input and returns a URL in the proper format if the tag is included in the Lasso Reference.

The problem then boils down to finding each tag reference, checking [Encode_TagLink], and adding a link around the tag reference if a URL was returned.

Again, in the interest of not adding too many more thousands of bytes to this tip of the week, we will walk through Fletcher's solution. However, please do see the variations in the other answers to see several different methods of approaching the same problem. One solution uses only a single regular expression rather than the two required here. One solution caches the results of [Encode_TagLink] to cut down on database operations. And one solution uses custom tags to perform all the work, so it is packaged up and easy to incorporate into Lasso pages.

This code uses two regular expressions: one to find tags in square brackets like [RegExp] and a second to find parameters like -Database. The square bracket expression also handles the optional closing tag of a container tag, as in [Inline] ... [/Inline].

  var: 'answerText' = $originalText;
  iterate: (array: '\\[(.+?)\\](?: ... \\[/\\1\\])?',
        '-[A-Za-z0-9]+'), (var: 'markup_pattern');

 

A [RegExp] object is initialized with the current regular expression. A [While] ... [/While] loop is used to cycle through all the found patterns within the input text.

    var: 'tag_regexp' = (regexp: -input=$answerText, -find=$markup_pattern);
    while: $tag_regexp->find;

 

Each match is passed to [Encode_TagLink]. If the result is not empty then an anchor tag is inserted around the match. Otherwise, no action is taken and Lasso simply appends the match string to the output unchanged.

      var: 'match' = $tag_regexp->matchstring;
      var: 'taglink' = (encode_taglink: $match);
      if: $taglink != '';
        $tag_regexp->(appendreplacement:
            '<a href="' + $taglink + '">' + $match + '</a>');
      /if;
    /while;

 

At the end of the while loop we append the remainder of the input and we replace the answerText variable with its new output.

    $tag_regexp->appendtail;
    var: 'answerText' = $tag_regexp->output;
  /iterate;
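
The overall flow (find each candidate, look it up, wrap it only if a URL comes back) can be sketched in Python with a replacement callback. Everything named here is hypothetical: lookup_tag_link and the KNOWN_TAGS table stand in for [Encode_TagLink], and the example.com URLs are invented. The lru_cache decorator illustrates the caching idea mentioned above and is not Fletcher's code.

```python
import re
from functools import lru_cache

# Hypothetical stand-in for the Lasso Reference lookup.
KNOWN_TAGS = {
    '[Field]': 'https://example.com/ref/field',
    '[Records] ... [/Records]': 'https://example.com/ref/records',
    '-Database': 'https://example.com/ref/database',
}

@lru_cache(maxsize=None)
def lookup_tag_link(tag):
    # Returns a URL for known tags, or an empty string otherwise.
    return KNOWN_TAGS.get(tag, '')

PATTERNS = [
    # [Tag], optionally followed by " ... [/Tag]" (dots escaped in this sketch).
    re.compile(r'\[(.+?)\](?: \.\.\. \[/\1\])?'),
    # Parameters such as -Database.
    re.compile(r'-[A-Za-z0-9]+'),
]

def link_match(match):
    text = match.group(0)
    url = lookup_tag_link(text)
    # Returning the match unchanged leaves unknown tags as they were.
    return '<a href="%s">%s</a>' % (url, text) if url else text

def add_reference_links(text):
    for pattern in PATTERNS:
        text = pattern.sub(link_match, text)
    return text

print(add_reference_links('Use [Field] with -Database, not [Bogus].'))
```

Because the second pattern runs over text that already contains anchors, a real implementation would need to make sure the generated URLs cannot themselves match the parameter pattern.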

 

That's it! Again, take a look at the other solutions for significant variations on this code.

Conclusion

The submitted solutions show a number of different methods of solving the same problems. Some use [String_ReplaceRegExp] or [String_FindRegExp]; others make use of the Lasso 8.5 [RegExp] type to cut down on the amount of work required. Some make use of many simple regular expressions executed in order; others make use of one complex regular expression.

