[Zope-dev] Structured text issues (again?)

jlagarde@bigfoot.com jlagarde@bigfoot.com
Wed, 20 Mar 2002 13:40:37 -0800


How extensively is STX actually used? I've been looking at it myself
recently, and the whole system seems rather simplistic to me in how it
parses the text. I'm talking specifically of the STX version currently
standard in Zope 2.5 (and 2.4 I think), which I believe is STXNG; I
haven't gone back to look at how the previous version worked. I have
looked at some of the past mailing list posts on this topic, and
although it is clear that things have improved a lot, I am surprised
that more has not been done so far.

I explain the problems I see next, followed by a proposed algorithm
change and some rough code to make things "better". I'm doing this now
for two reasons: 

First, if I'm missing something important wrt the reasons things are as
they are, I'm obviously interested to know before I spend any more time
on this.

Second, maybe these proposed changes are actually a step in the right
direction, or might help someone else do what they need, so I'm
providing the code of what I've done so far, as is, in case it can be of
help (I am unlikely to have the time to come up with a polished set of
changes myself any time soon). Or I guess what I'm saying is (wink wink
nudge nudge): If someone else feels like picking up on this and
finishing it up so I don't have to, feel free ;-)

The biggest problem I see is that the various text types are given a
somewhat arbitrary precedence by the order in which they appear in the
text_types list. The patterns in text_types are tried in order, and the
first match breaks the raw string in two, so any other structure that
would have spanned a larger part of the string but is lower in the
text_types list is effectively ignored.

For example, since doc_strong is currently listed before doc_emphasize,
"*emphasized **strong** emphasized*" does enclose "strong" in <strong>
tags, but completely ignores the single *'s because they do not form a
matching pair (because the parsing of **strong** breaks the rest in two
separate strings "*emphasized " and " emphasized*").
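
A toy model of that ordered behaviour, to make the failure easy to
reproduce outside Zope. This is NOT the DocumentClass code: the two
patterns are simplified stand-ins, ordered_parse is a made-up name, and
matched content is not processed recursively here. It only shows how
the first pattern splits the string before the second one gets a look.

```python
import re

# Hypothetical stand-ins for doc_strong and doc_emphasize; order matters.
ORDERED = [
    ("strong", re.compile(r"\*\*([^*]+)\*\*")),
    ("em", re.compile(r"\*([^*]+)\*")),
]

def ordered_parse(text):
    """Apply each pattern in turn to whatever plain strings are left."""
    pieces = [text]
    for tag, pattern in ORDERED:
        next_pieces = []
        for piece in pieces:
            if not isinstance(piece, str):
                # Already recognized by an earlier pattern; leave it alone.
                next_pieces.append(piece)
                continue
            pos = 0
            for m in pattern.finditer(piece):
                if m.start() > pos:
                    next_pieces.append(piece[pos:m.start()])
                next_pieces.append((tag, m.group(1)))
                pos = m.end()
            if pos < len(piece):
                next_pieces.append(piece[pos:])
        pieces = next_pieces
    return pieces

print(ordered_parse("*emphasized **strong** emphasized*"))
# ['*emphasized ', ('strong', 'strong'), ' emphasized*']
```

The two leftover fragments each contain a single '*', so the emphasis
pattern never sees a matching pair.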

Following the same reasoning, I assumed that **strong *emphasized*
strong** would work better, but it did not! This time, it's because the
regexp for doc_strong is rather simplistic: it does not allow ANY '*'
inside (none is listed in strongem_punc), whereas it only needs to
disallow the specific pattern '**'. In this case, wouldn't the easiest
solution be to simply use non-greedy matching? I.e. replace:
   r'\*\*([%s%s%s\s]+?)\*\*' % (letters, digits, strongem_punc) 
with:
   r'\*\*(.*?)\*\*'
or actually better:
   r'(?!\*\*\*)\*\*(?<!\*\*\*)(.*?)(?!\*\*\*)\*\*(?<!\*\*\*)' 

I think the last pattern is best because it will not recognize the
middle **** as anything in "**this: **** does not matter**", as I think
should normally be expected. BTW, I make no claim that the regexp above
is either the most elegant or the most efficient; this is the first one
I came up with that did what I wanted ;-)
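
Here is a small standalone check of the three candidates using Python's
re module. The character class in OLD is only a simplified stand-in for
letters, digits and strongem_punc (an assumption; the real class lists
more punctuation), and the names OLD, LAZY, GUARDED and inner are just
for this example.

```python
import re

# Stand-in for the current expression: the real character class is built
# from letters, digits and strongem_punc; this short one is an assumption.
OLD = re.compile(r'\*\*([A-Za-z0-9\s.,]+?)\*\*')
LAZY = re.compile(r'\*\*(.*?)\*\*')
GUARDED = re.compile(
    r'(?!\*\*\*)\*\*(?<!\*\*\*)(.*?)(?!\*\*\*)\*\*(?<!\*\*\*)')

def inner(pattern, text):
    """Return the text captured between the ** markers, or None."""
    m = pattern.search(text)
    return m.group(1) if m else None

nested = "**strong *emphasized* strong**"
stars = "**this: **** does not matter**"

print(repr(inner(OLD, nested)))      # None
print(repr(inner(LAZY, nested)))     # 'strong *emphasized* strong'
print(repr(inner(LAZY, stars)))      # 'this: '
print(repr(inner(GUARDED, stars)))   # 'this: **** does not matter'
```

The plain non-greedy pattern stops at the first half of the ****, which
is why the lookahead/lookbehind guards are there.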

Now, back to the problem with the ordered nature of text_types (the
reason "*emphasized **strong** emphasized*" does not work as expected).
Besides the extra computing required, is there any reason why the
structures with the largest span shouldn't be recognized first,
regardless of the order of text_types? I.e. what I propose is to go
through all the text_types, collecting the matches, and only once this
is done choose the one with the largest span. Then proceed recursively
with the enclosed text until no pattern matches. This makes it possible
to quote structured text patterns: "'**not bold**'", and to bold quoted
text: "** some text '<this is quoted>' **". With the current
implementation, neither of those works: "'**not bold**'" ends up being
bolded and not quoted, and "** some text '<this is quoted>' **" is a
total mess because the text in <> is interpreted as SGML instead of
being quoted as requested.
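
Stripped of everything Zope-specific, the idea can be sketched as
follows. This is only an illustration under simplifying assumptions:
RULES and render are made-up names, only strong and emphasis are
handled, and the output is a plain HTML string rather than the document
objects that DocumentClass builds.

```python
import re

# Hypothetical rule table; group 1 of each pattern is the enclosed text.
RULES = [
    ("strong", re.compile(
        r"(?!\*\*\*)\*\*(?<!\*\*\*)(.*?)(?!\*\*\*)\*\*(?<!\*\*\*)")),
    ("em", re.compile(r"(?!\*\*)\*(?<!\*\*)(.*?)(?!\*\*)\*(?<!\*\*)")),
]

def render(text):
    """Pick the widest first match over all rules, then recurse."""
    best = None
    for tag, pattern in RULES:
        m = pattern.search(text)
        if m is None:
            continue
        if best is None or m.end() - m.start() > best[1].end() - best[1].start():
            best = (tag, m)
    if best is None:
        return text  # nothing matched: end of recursion
    tag, m = best
    # Recurse on the text before, inside and after the winning match.
    return (render(text[:m.start()])
            + "<%s>%s</%s>" % (tag, render(m.group(1)), tag)
            + render(text[m.end():]))

print(render("*emphasized **strong** emphasized*"))
# <em>emphasized <strong>strong</strong> emphasized</em>
print(render("**strong *emphasized* strong**"))
# <strong>strong <em>emphasized</em> strong</strong>
```

Both nestings come out right regardless of the order of RULES, because
the choice is made by span size and not by list position.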

The changes I have made so far (all in DocumentClass.py):

The simplest are the few regexp changes I have made for doc_strong,
doc_emphasize, and doc_literal. The doc_literal change probably doesn't
matter (I used the pattern proposed by someone else on this list to make
the quoting more obvious), but the changes to doc_strong and
doc_emphasize are required to make my other changes work.

doc_strong becomes: 
r'(?!\*\*\*)\*\*(?<!\*\*\*)(.*?)(?!\*\*\*)\*\*(?<!\*\*\*)'
doc_emphasize becomes:
r'(?!\*\*)\*(?<!\*\*)(.*?)(?!\*\*)\*(?<!\*\*)'
and doc_literal becomes:
r"(\W+|^)``([%s%s%s\s]+)''([%s]+|$)" % (letters, digits, literal_punc,
phrase_delimiters)
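
To see why the guarded patterns matter for the span comparison, here is
a quick standalone check with Python's re module (STRONG and EMPHASIZE
are just local names for the two expressions above):

```python
import re

STRONG = re.compile(
    r'(?!\*\*\*)\*\*(?<!\*\*\*)(.*?)(?!\*\*\*)\*\*(?<!\*\*\*)')
EMPHASIZE = re.compile(r'(?!\*\*)\*(?<!\*\*)(.*?)(?!\*\*)\*(?<!\*\*)')

text = "*emphasized **strong** emphasized*"

# The emphasis pattern skips over the inner ** markers, so its match
# covers the whole string, while the strong match covers only the middle.
print(STRONG.search(text).span())      # (12, 22)
print(EMPHASIZE.search(text).span())   # (0, 34)

# A lone ** pair is never mistaken for emphasis.
print(EMPHASIZE.search("**bold**"))    # None
```

With these spans, picking the largest one selects the emphasis first,
and the strong markup is then found when recursing on its content.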

The big changes are to the parse and color_text methods. 

Parse now only returns the first match found of a type rather than all
of them, and it returns the start and end indices of the match so that
the span size can be computed in color_text:

    def parse(self, raw_string, text_type,
              type=type, st=type(''), lt=type([])):
        
       """
       Parse accepts a raw_string and a text_type (an expr to test the
       raw_string, or the name of one).
       
       Unlike the original version, parse stops at the FIRST instance
       of expr found in raw_string instead of searching for all of them.
       
       If no instance of expr is found, (raw_string, 0, 0) is returned.
       Otherwise a list of substrings and A SINGLE instance is returned,
       together with the start and end indices of the match.
       """

       tmp = []    # the list to be returned if raw_string is split
       append=tmp.append
       
       if type(text_type) is st: text_type=getattr(self, text_type)

       start = end = 0 # because I'm returning those now
       while 1:
          t = text_type(raw_string)

          if not t: break
          #an instance of expr was found
          t, start, end    = t

          if start: append(raw_string[0:start])

          tt=type(t)
          if tt is st:
             # if we get a string back (when would this happen?),
             # add it to the text to be parsed
             raw_string = t+raw_string[end:len(raw_string)]
             # should I break or not here? If I break, it is the same
             # as removing the while
          else:
             if tt is lt:
                # if we get a list, append its elements
                tmp[len(tmp):]=t
             else:
                # normal case, an object
                append(t)
             #Do not keep processing once a match found!
             raw_string = raw_string[end:len(raw_string)]
             break

       if not tmp: return (raw_string,0,0) # nothing found
       
       if raw_string: append(raw_string)
       elif len(tmp)==1:
          return (tmp[0],start,end)
       
       return (tmp,start,end)

In color_text, instead of looping over the text_types only once, I loop
over all of them on every recursive pass. On each pass, I select the
match (no matter what type) with the largest span, and recurse on its
content. I think the code could be made a lot more efficient (among
other things, I should be able to collect more than one match in a
single pass as long as the matches do not overlap), but for now I just
wanted to see if the parsing would give the results I wanted, and it
seems it does (but I repeat, I haven't tested extensively so far).

    def color_text(self, str, types=None):
       """Search the paragraph for each special structure
       """
       if types is None: types=self.text_types

       if type(str) is StringType:
          max = 0
          parsed = 0
          finalres = str # fall back to the input if types is empty
          for text_type in types:
             res, start, end = self.parse(str,text_type)
             if res != str:
                parsed = 1
             # keep the option with the largest span only
             if end-start >= max:
                finalres = res
                max = end-start

          # *** this may cause other problems ***
          if parsed and type(finalres) is ListType:
             return self.color_text(finalres)
          else:
             return finalres # end recursion
        
       elif type(str) is ListType:
          res = []
          for sub in str:
             subres = self.color_text(sub)
             if type(subres) is not ListType: subres = [subres]
             res += subres
          return res

       else:
          res = map(self.color_text,str.getColorizableTexts())
          # To avoid stuff like
          # StructuredTextSGML(StructuredTextSGML('<I>'))
          if len(res) == 0 or type(res[0]) != type(str) or \
             res[0].getColorizableTexts() != str.getColorizableTexts():
             
             str.setColorizableTexts(res)
          return str
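
As for the efficiency idea of keeping several matches from one pass
when they do not overlap, one possible reading is a greedy selection by
span size. This is only a sketch and not part of the changes above:
select_non_overlapping is a made-up helper, and matches are represented
as plain (start, end, label) tuples instead of parse results.

```python
def select_non_overlapping(matches):
    """Keep the widest matches first and drop any that overlap them.

    matches is a list of (start, end, label) tuples, end exclusive.
    """
    chosen = []
    # Sorting on start - end puts the widest span first.
    for start, end, label in sorted(matches, key=lambda m: m[0] - m[1]):
        if all(end <= s or start >= e for s, e, _ in chosen):
            chosen.append((start, end, label))
    return sorted(chosen)

print(select_non_overlapping(
    [(0, 10, 'em'), (2, 6, 'strong'), (12, 20, 'strong')]))
# [(0, 10, 'em'), (12, 20, 'strong')]
```

The inner (2, 6) match is dropped at this level because it lies inside
the wider one; it would be found again when recursing on that match's
content.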

So there you have it. I find that the results produced by this code make
a lot more sense than what is produced by the current implementation. I
guess one problem with structured text may be that there are differences
of opinion as to what the actual rules and output should be (e.g. should
list items be singly spaced or double spaced), but I really don't see
who could argue that a preference in the order of markup like emphasis,
strong, and underline makes any sense. Or am I missing something?

Cheers,

Jean