Regular Expression Challenge - Wiki Style Formatting

This article continues from the tip of the week for August 17, 2007, which announced the winners of the second Lasso Programming Challenge. Congratulations to Bil Corry and Johan Sölve. The tip includes a description of the challenge, a downloadable archive of answers, documentation of the answer that LassoSoft created, and a discussion of the different strategies the submitted entries used to solve the challenge.

Wiki Style Formatting

This was one of the more complex problems, and it required some serious regular expression muscle. The four solutions differ significantly in how they approach the problem, but all of them work. See the challenge file for a full description of the Wiki-style syntax that was specified for this problem.

The basic idea was that codes like *Bold Text* could be used to mark up text, [http://www.apple.com] would create a link, !myimage.png! would insert an image, a hyphen on a line by itself would insert an <hr />, and an empty line would insert a <br />. Lines that start with H and a digit from 1 to 7 would be formatted as a header, lines that start with a hyphen and a space would become items in an unordered list, and lines that start with a number sign and a space would become items in an ordered list.

We're going to walk through Fletcher's solution to the problem, but please do look at the solutions provided by the others. They show different approaches, including using fewer regular expressions so the text is looped over only once, and using multiple simple regular expressions, one for each syntax type.

 

We create a [RegExp] object which finds lines that start with one of the prefixes. We also set up a number of variables that store the current state of our search/replace operation. The line_state variable tells us whether we are currently in an ordered list or an unordered list.

  // Copy the original text first; the regular expression below reads $answerText.
  var: 'answerText' = $originalText;
  var: 'line_regexp' = (regexp: -input=$answerText,
      -find='\r\n(H\\d|-|#) (.*?)(?=\r\n)');
  var: 'line_state' = 'none', 'line_prefix' = '', 'line_suffix' = '';
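
To see what this pattern captures, here is a minimal sketch in Python rather than Lasso; Python's re module accepts the same syntax for this particular expression. The sample text and the variable names are made up for illustration.

```python
import re

# Same pattern as the Lasso example: a CRLF, a prefix (H plus a digit, a
# hyphen, or a number sign), a space, then the rest of the line up to the
# next CRLF, which is checked by a lookahead and not consumed.
LINE_PATTERN = re.compile(r'\r\n(H\d|-|#) (.*?)(?=\r\n)')

# Hypothetical sample input. Every line that should match needs a CRLF
# both before and after it.
sample = '\r\nH1 Title\r\nplain text\r\n- first\r\n- second\r\n# one\r\n'

# Collect (prefix, line text) for each match.
matches = [(m.group(1), m.group(2)) for m in LINE_PATTERN.finditer(sample)]
```

Here matches holds the pairs ('H1', 'Title'), ('-', 'first'), ('-', 'second') and ('#', 'one'). The plain line is skipped, and a line with no CRLF in front of it (such as the very first line of an input that does not begin with one) would not be matched either.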

 

The originalText is copied into answerText, where it will be manipulated, before the regular expression is created. We also create a variable line_map which stores a mapping from line prefixes to the markup required for each prefix.

  var: 'line_map' = (map:
      '-' = (array: '<li>', '</li>'),
      '#' = (array: '<li>', '</li>'),
      'h1' = (array: '<h1>', '</h1>'),
      'h2' = (array: '<h2>', '</h2>'),
      'h3' = (array: '<h3>', '</h3>'),
      'h4' = (array: '<h4>', '</h4>'),
      'h5' = (array: '<h5>', '</h5>'),
      'h6' = (array: '<h6>', '</h6>'),
      'h7' = (array: '<h7>', '</h7>'),
    );
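
The lookup this map supports can be sketched in Python as follows. This is an illustration, not the Lasso code: wrap_line is a hypothetical helper, and because Python dictionary lookups are case-sensitive the sketch lower-cases the captured marker (the pattern matches an upper-case H) before looking it up.

```python
# Hypothetical Python counterpart of line_map: prefix -> (open tag, close tag).
line_map = {'-': ('<li>', '</li>'), '#': ('<li>', '</li>')}
line_map.update({f'h{n}': (f'<h{n}>', f'</h{n}>') for n in range(1, 8)})

def wrap_line(marker, text):
    """Wrap text in the tags for marker, or return it unchanged if unknown."""
    # Normalise the marker so 'H2' finds the 'h2' entry.
    tags = line_map.get(marker.lower())
    if tags is None:
        return text
    opening, closing = tags
    return opening + text + closing
```

For example, wrap_line('H2', 'Intro') gives '<h2>Intro</h2>', and a marker that is not in the map leaves the text untouched.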

 

We loop through all the matching lines in the text, checking which marker was found at the start of each line. If we are not currently in a list, we insert a <ul> tag when an unordered list item is seen and an <ol> tag when an ordered list item is seen. If the list type changes, we close one list and open the other. Otherwise, we peek at the start of the following line, and if it does not continue the current list we close the list after the current line.

 

If the current marker is contained in our line_map then we insert a replacement using the markup defined in the map. Otherwise, we don't do a replacement and the line remains unchanged.

  while: $line_regexp->find;
    var: 'marker' = $line_regexp->(matchstring: 1);
    if: $line_state == 'none';
      if: $marker == '-';
        var: 'line_prefix' = '<ul>';
        var: 'line_state' = 'unordered';
      else: $marker == '#';
        var: 'line_prefix' = '<ol>';
        var: 'line_state' = 'ordered';
      /if;
    else: $line_state == 'unordered' && $marker == '#';
      var: 'line_prefix' = '</ul><ol>';
      var: 'line_state' = 'ordered';
    else: $line_state == 'ordered' && $marker == '-';
      var: 'line_prefix' = '</ol><ul>';
      var: 'line_state' = 'unordered';
    else;
      var: 'line_pos' = $line_regexp->(matchposition);
      var: 'line_next' = $answerText->(substring: $line_pos->first +
          $line_pos->second + 2, 2);
      if: $line_state == 'ordered' && $line_next != '# ';
        var: 'line_suffix' = '</ol>';
        var: 'line_state' = 'none';
      else: $line_state == 'unordered' && $line_next != '- ';
        var: 'line_suffix' = '</ul>';
        var: 'line_state' = 'none';
      /if;
    /if;
    if: $line_map >> $marker;
      var: 'delimiters' = $line_map->(find: $marker);
      $line_regexp->(appendreplacement: $line_prefix +
          $delimiters->first + $line_regexp->(matchstring: 2) +
          $delimiters->second + $line_suffix);
    /if;
    var: 'line_prefix' = '', 'line_suffix' = '';
  /while;

 

At the end of the while loop we append the remainder of the input and we replace the answerText variable with its new output.

  $line_regexp->appendtail;
  var: 'answerText' = $line_regexp->output;
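
For readers who want to trace the state handling outside Lasso, the whole line pass can be approximated in Python. This is a sketch under assumptions, not the original implementation: format_lines is a hypothetical name, and a list of output pieces plus a last_end index stand in for the appendreplacement and appendtail calls.

```python
import re

LINE_PATTERN = re.compile(r'\r\n(H\d|-|#) (.*?)(?=\r\n)')

LINE_MAP = {'-': ('<li>', '</li>'), '#': ('<li>', '</li>')}
LINE_MAP.update({f'h{n}': (f'<h{n}>', f'</h{n}>') for n in range(1, 8)})

def format_lines(text):
    pieces = []      # output built so far (stands in for appendreplacement)
    last_end = 0     # end of the previous replaced match
    state = 'none'   # 'none', 'ordered', or 'unordered'
    for m in LINE_PATTERN.finditer(text):
        marker, body = m.group(1), m.group(2)
        prefix = suffix = ''
        if state == 'none':
            if marker == '-':
                prefix, state = '<ul>', 'unordered'
            elif marker == '#':
                prefix, state = '<ol>', 'ordered'
        elif state == 'unordered' and marker == '#':
            prefix, state = '</ul><ol>', 'ordered'
        elif state == 'ordered' and marker == '-':
            prefix, state = '</ol><ul>', 'unordered'
        else:
            # Peek at the two characters after the CRLF that ends this line.
            upcoming = text[m.end() + 2:m.end() + 4]
            if state == 'ordered' and upcoming != '# ':
                suffix, state = '</ol>', 'none'
            elif state == 'unordered' and upcoming != '- ':
                suffix, state = '</ul>', 'none'
        tags = LINE_MAP.get(marker.lower())
        if tags is not None:
            # Copy the untouched text before the match, then the replacement.
            pieces.append(text[last_end:m.start()])
            pieces.append(prefix + tags[0] + body + tags[1] + suffix)
            last_end = m.end()
    pieces.append(text[last_end:])   # stands in for appendtail
    return ''.join(pieces)
```

Note that each replacement swallows the CRLF in front of the matched line, as the pattern includes it. As far as this port can tell, the end-of-list check only runs from the second item of a list onward, so a one-item list is closed on a later matched line; the sketch reproduces that behaviour rather than changing it.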

 

Now, we move on to the markup in the text. We are going to look for strings which start and end with certain symbols or are surrounded by square brackets or curly brackets. We define a markup_map which tells us what markup corresponds to various markers.

  var: 'markup_map' = (map:
      '*' = (array: '<b>', '</b>'),
      '+' = (array: '<em>', '</em>'),
      '_' = (array: '<u>', '</u>'),
      '-' = (array: '<del>', '</del>'),
      '??' = (array: '<cite>', '</cite>'),
      '^' = (array: '<sup>', '</sup>'),
      '~' = (array: '<sub>', '</sub>'),
      '{{' = (array: '<tt>', '</tt>'),
    );

 

We iterate through three different regular expression patterns. The first finds markup that starts and ends with the same marker. The second finds markup surrounded by double curly brackets. The third finds markup surrounded by square brackets. We used three patterns rather than one in order to keep each regular expression simple.

  iterate: (array: '(\\!|\\*|\\+|_|~|\\^|-|\\?\\?)(.*?)\\1',
      '(\\{\\{)(.*?)\\}\\}', '(\\[)(.+?)\\]'), (var: 'markup_pattern');
    var: 'markup_regexp' = (regexp: -input=$answerText,
        -find=$markup_pattern);
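
The three patterns can be tried out in isolation. The sketch below uses Python rather than Lasso, with the string escaping adjusted for Python raw strings; find_markup and the sample sentence are hypothetical.

```python
import re

MARKUP_PATTERNS = [
    r'(!|\*|\+|_|~|\^|-|\?\?)(.*?)\1',  # same marker at both ends
    r'(\{\{)(.*?)\}\}',                 # {{ ... }}
    r'(\[)(.+?)\]',                     # [ ... ]
]

def find_markup(text):
    """Return one list of (marker, inner text) pairs per pattern."""
    return [[(m.group(1), m.group(2)) for m in re.finditer(p, text)]
            for p in MARKUP_PATTERNS]

sample = 'A *bold* word, {{code}}, ??source??, and [http://www.apple.com].'
found = find_markup(sample)
```

The backreference \1 in the first pattern is what forces the closing marker to be the same as the opening one, so *bold* matches but a stray *text+ would not.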

 

For each regular expression we loop through all the matches in the input. We find the marker and we perform a replacement appropriate to the marker.

    while: $markup_regexp->find;
      var: 'marker' = $markup_regexp->(matchstring: 1);

 

If the marker is an exclamation point then we process it as an image reference.

      if: $marker == '!';
        var: 'params' = $markup_regexp->(matchstring: 2)->(split: '|');
        var: 'output' = '<img src="' + $params->(get: 1) + '"';
        $params->(removefirst);
        if: $params >> 'right';
          $output += ' align="right"';
          $params->(removeall: 'right');
        else: $params >> 'left';
          $output += ' align="left"';
          $params->(removeall: 'left');
        else: $params >> 'center';
          $output += ' align="center"';
          $params->(removeall: 'center');
        /if;
        iterate: $params, (var: 'param');
          var: 'dimensions' = (string_findregexp: $param,
              -find='(\\d+)x(\\d+)');
          if: $dimensions->size == 3;
            $output += ' width="' +
                (integer: $dimensions->(get: 2)) +
                '" height="' +
                (integer: $dimensions->(get: 3)) + '"';
          /if;
        /iterate;
        $output += ' />';
        $markup_regexp->(appendreplacement: $output);
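
The image handling can be checked on its own with a small Python sketch. image_tag is a hypothetical helper, and the option string in the example ('myimage.png|right|100x50') is an assumed input built from what the code above parses: a source, an optional alignment, and optional WIDTHxHEIGHT dimensions, separated by pipes.

```python
import re

def image_tag(spec):
    """Build an <img> tag from 'src|option|option...'."""
    params = spec.split('|')
    output = '<img src="' + params.pop(0) + '"'
    # Only the first alignment found is used, checked in this fixed order.
    for align in ('right', 'left', 'center'):
        if align in params:
            output += ' align="' + align + '"'
            params = [p for p in params if p != align]
            break
    # Any remaining parameter that looks like 100x50 sets width and height.
    for param in params:
        m = re.search(r'(\d+)x(\d+)', param)
        if m:
            output += ' width="%d" height="%d"' % (int(m.group(1)),
                                                   int(m.group(2)))
    return output + ' />'
```

With that input the helper returns '<img src="myimage.png" align="right" width="100" height="50" />', and a bare 'myimage.png' yields '<img src="myimage.png" />'.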

 

If the marker is a square bracket then we process it as a link which needs an anchor tag.

      else: $marker == '[';
        var: 'params' = $markup_regexp->(matchstring: 2)->(split: '|');
        var: 'output' = '<a href="' + $params->(get: 1) + '">';
        if: $params->size >= 2;
          $output += $params->(get: 2);
        else;
          $output += $params->(get: 1);
        /if;
        $output += '</a>';
        $markup_regexp->(appendreplacement: $output);
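
The same branch in Python, as a minimal sketch (link_tag is a hypothetical name): the text before the first pipe is the address, and the optional second part is the link text. Like the code above, it does no HTML escaping.

```python
def link_tag(spec):
    """Build an anchor tag from 'url' or 'url|label'."""
    params = spec.split('|')
    href = params[0]
    # Fall back to showing the address itself when no label is given.
    label = params[1] if len(params) >= 2 else href
    return '<a href="' + href + '">' + label + '</a>'
```

So link_tag('http://www.apple.com|Apple') produces '<a href="http://www.apple.com">Apple</a>'.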

 

Otherwise, if the marker is in our markup_map we process it as markup which requires opening and closing tags defined in the map.

      else: $markup_map >> $marker;
        var: 'delimiters' = $markup_map->(find: $marker);
        $markup_regexp->(appendreplacement: $delimiters->first +
            $markup_regexp->(matchstring: 2) + $delimiters->second);
      /if;
    /while;

 

At the end of the while loop we append the remainder of the input and we replace the answerText variable with its new output.

    $markup_regexp->appendtail;
    var: 'answerText' = $markup_regexp->output;
  /iterate;
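
The map-driven part of this pass can be approximated in Python with re.sub and a callback, which takes the place of the find/appendreplacement/appendtail loop. This is a sketch under assumptions: the image and link branches are left out for brevity, and format_inline and replace_match are hypothetical names.

```python
import re

markup_map = {
    '*': ('<b>', '</b>'), '+': ('<em>', '</em>'), '_': ('<u>', '</u>'),
    '-': ('<del>', '</del>'), '??': ('<cite>', '</cite>'),
    '^': ('<sup>', '</sup>'), '~': ('<sub>', '</sub>'),
    '{{': ('<tt>', '</tt>'),
}

# Paired-marker pattern (without the image marker) and the {{ }} pattern.
PATTERNS = [r'(\*|\+|_|~|\^|-|\?\?)(.*?)\1', r'(\{\{)(.*?)\}\}']

def replace_match(m):
    """Return the replacement for one match, or the match itself if unknown."""
    marker, inner = m.group(1), m.group(2)
    if marker not in markup_map:
        return m.group(0)
    opening, closing = markup_map[marker]
    return opening + inner + closing

def format_inline(text):
    # One full pass over the text per pattern, as in the iterate loop.
    for pattern in PATTERNS:
        text = re.sub(pattern, replace_match, text)
    return text
```

For example, format_inline('*bold* and +em+ and {{code}}') returns '<b>bold</b> and <em>em</em> and <tt>code</tt>'.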

 

 

Finally, we do three straight search/replace operations: one converts lines which consist of a single hyphen into horizontal rules, one inserts breaks for double return/newline pairs, and one replaces any remaining return/newline pairs with a space.

  $answerText->(replace: '\r\n-\r\n', '<hr />');
  $answerText->(replace: '\r\n\r\n', '<br />');
  $answerText->(replace: '\r\n', ' ');
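
The order of these three replacements matters, which a short Python sketch makes easy to verify (finish is a hypothetical name; Python strings are immutable, so each result is reassigned instead of being changed in place).

```python
def finish(text):
    """Apply the three final replacements in the same order as above."""
    text = text.replace('\r\n-\r\n', '<hr />')   # a hyphen alone on a line
    text = text.replace('\r\n\r\n', '<br />')    # an empty line
    return text.replace('\r\n', ' ')             # any remaining line break
```

For the input 'one\r\n-\r\ntwo\r\n\r\nthree\r\nfour' this returns 'one<hr />two<br />three four'. If the single return/newline replacement ran first, the other two would never find anything to match.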

 

That's it! Take a look at the four implementations to see several variations. The code performs similar work in all four cases, but uses significantly different methods.

