[Zope] MSWordDocument and Logictran's R2NET

Bryan Capitano Bryan@capitanoweb.com
Mon, 16 Dec 2002 10:16:18 -0800


> -----Original Message-----
> From: zope-admin@zope.org [mailto:zope-admin@zope.org]On Behalf Of Johan
> Carlsson [EasyPublisher]
> Sent: Monday, December 16, 2002 8:12 AM
> To: Bryan Capitano
> Cc: zope@zope.org
> Subject: RE: [Zope] MSWordDocument and Logictran's R2NET
>
>
> At 07:55 2002-12-16 -0800, Bryan Capitano said:
>
> >I have not tried the MSWordDocument product before. That's interesting,
> >thanks for sharing it.
> >I am familiar with a commercial product from Logictran called 'R2NET'.
> >With this software you can easily convert Word (RTF) files to HTML or
> >XHTML or XML. I use the product extensively at the Linux command line.
> >It is easy to use, very powerful and robust. It gives you lots of control
> >over how documents are converted through a translation file which you can
> >customize if you want more custom output.  I think it would be easy to
> >plug into Zope.
> >Bryan
>
> How does Logictran's R2NET compare to wvWare (which is used by
> MSWordDocument on Unix)?
> It seems like they are quite similar.
>
> Regards,
> Johan Carlsson
>

Johan,

I evaluated wvWare a couple of months ago for a web-to-print project
(sharing documents between a website and a printed book publication). wvWare
wasn't nearly as feature-rich or robust as R2NET.
For example:
1. I was not able to use wvWare to convert DOC/RTF into XML using my own
DTD. (I can with R2NET).
2. wvWare did not recognize some of the more complex RTF control codes for
font "styles", tables, or anything much more complicated than basic
character formatting. It does recognize fonts, font sizes, and
italics/bold/etc. But in Word you can define named styles that you can
re-use or apply to sections of a document, and wvWare doesn't capture that
style information.
3. In the publishing world, documents often have hidden codes embedded in
them. In particular, I was concerned about the RTF codes \xe, \txe, and
\tc.  In the document these look like: {xe "this looks like an index code."}
or see-also entries like this: {xe "trees" \t "See also Shrubs"}. You might
also want to use hidden table-of-contents codes embedded in your document
like this: {tc "Chapter 1, Trees and Shrubs" \l 1}.  R2NET will extract
this information from RTF documents and put it in your XML if you tell it
HOW by using the translation files. As far as I know, wvWare can't do this.
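
To illustrate point 3, here is a minimal Python sketch that pulls the field
forms quoted above out of a block of text with a regular expression and wraps
them in XML. This is not how R2NET or its translation files work; it only
shows the kind of extraction involved. The element names index-entry and
toc-entry, the see and level attributes, and the function name fields_to_xml
are all hypothetical, and the pattern only covers the exact forms shown in
this message.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Matches only the field forms quoted in this message, e.g.
#   {xe "trees" \t "See also Shrubs"}
#   {tc "Chapter 1, Trees and Shrubs" \l 1}
FIELD_RE = re.compile(
    r'\{\s*(?P<kind>xe|tc)\s+"(?P<text>[^"]*)"'
    r'(?:\s+\\t\s+"(?P<see>[^"]*)")?'   # optional see-also text
    r'(?:\s+\\l\s+(?P<level>\d+))?'     # optional table-of-contents level
    r'\s*\}',
    re.IGNORECASE,
)


def fields_to_xml(text):
    """Return one XML string per index or table-of-contents field found."""
    out = []
    for m in FIELD_RE.finditer(text):
        body = escape(m.group("text"))
        if m.group("kind").lower() == "xe":
            see = m.group("see")
            attr = " see=" + quoteattr(see) if see is not None else ""
            out.append("<index-entry%s>%s</index-entry>" % (attr, body))
        else:
            # Assumption: a missing level is treated as level 1.
            level = m.group("level") or "1"
            out.append('<toc-entry level="%s">%s</toc-entry>' % (level, body))
    return out


if __name__ == "__main__":
    sample = ('Oaks {xe "trees" \\t "See also Shrubs"} are tall. '
              '{tc "Chapter 1, Trees and Shrubs" \\l 1}')
    for line in fields_to_xml(sample):
        print(line)
```

A real converter would read these entries from the RTF control words
themselves rather than from the displayed field text, so treat this as a
starting point only.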

For these reasons, I think wvWare is a good "basic" converter. It's a good
first step, and useful for basic doc-to-HTML needs. But if you need more
power and extensibility, and if you want to dump Word documents into your
own pre-defined XML DTD, then R2NET is worth the $69.

You could also write your own RTF parser in Perl by making use of
RTF::Tokenizer. I have done this too. It is a more difficult road, but it
gives you complete flexibility. There may be a similar RTF tokenizer for
Python?
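
For anyone who wants to try the do-it-yourself route in Python, here is a
rough sketch of what a small RTF tokenizer could look like. It is not a port
of RTF::Tokenizer and does not follow its API; the function name tokenize and
the (kind, value, parameter) tuple layout are my own assumptions. It splits
RTF into group braces, control words with an optional numeric parameter,
\'hh hex escapes, control symbols, and text runs, and it deliberately ignores
harder cases such as \bin data and the skip rules for \u.

```python
import re

# One alternative per RTF token type. The hex escape must come before the
# generic control symbol, because \' would otherwise match as a symbol.
TOKEN_RE = re.compile(
    r"(?P<open>\{)"
    r"|(?P<close>\})"
    r"|\\(?P<word>[A-Za-z]+)(?P<param>-?\d+)? ?"   # one trailing space is a delimiter
    r"|\\'(?P<hex>[0-9A-Fa-f]{2})"
    r"|\\(?P<symbol>[^A-Za-z])"
    r"|(?P<text>[^\\{}]+)"
)


def tokenize(rtf):
    """Yield (kind, value, parameter) tuples for a string of RTF."""
    pos = 0
    while pos < len(rtf):
        m = TOKEN_RE.match(rtf, pos)
        if m is None:
            # Only a lone backslash at the end of input gets here.
            raise ValueError("cannot tokenize at offset %d" % pos)
        pos = m.end()
        if m.group("open"):
            yield ("open", "{", None)
        elif m.group("close"):
            yield ("close", "}", None)
        elif m.group("word") is not None:
            param = m.group("param")
            yield ("control", m.group("word"),
                   int(param) if param is not None else None)
        elif m.group("hex") is not None:
            yield ("hex", m.group("hex"), None)
        elif m.group("symbol") is not None:
            yield ("symbol", m.group("symbol"), None)
        else:
            # Raw line breaks in RTF are not part of the document text.
            text = m.group("text").replace("\r", "").replace("\n", "")
            if text:
                yield ("text", text, None)


if __name__ == "__main__":
    for token in tokenize(r"{\rtf1 Hello {\xe trees}\par}"):
        print(token)
```

On top of a token stream like this you can track group depth and collect the
text inside \xe or \tc groups, which is where the flexibility comes from.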

Best regards,
Bryan


Bryan R. Capitano
President,
CAPITANO WEB CONSULTING
Tel: 541-344-0747
Email: Bryan@capitanoweb.com
URL: http://www.capitanoweb.com