[Zope] Strip all HTML

Dylan Reinhardt zope@dylanreinhardt.com
Tue Aug 5 14:39:11 EDT 2003


You didn't mention what problem you're having... but it would appear
that case-sensitive matching is one of them.

re.sub (sadly) doesn't take a flags argument in this version of Python,
so you can't hand it I (ignore case) or S (dot matches newline
character) directly.  I have no idea why not.

However, not all is lost.  re.compile supports flags and will give you
an object that has a sub method.  Go figure.

So you can try something like:

-----

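import re

# 'text' is assumed to be the HTML string passed in to the script.
# re.I = ignore case, re.S = let '.' match newlines too.
style = re.compile('<style.*?>.*?</style>', re.I | re.S)
script = re.compile('<script.*?>.*?</script>', re.I | re.S)
tags = re.compile('<.*?>', re.S)

# Remove style blocks first, then script blocks, then any remaining tags.
return tags.sub('', script.sub('', style.sub('', text)))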


-----
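
If you want to try this outside of a Script (Python) first, here is a
minimal self-contained sketch of the same three patterns wrapped in a
function.  The name strip_html and the sample input are made up for
illustration, not taken from any product.

```python
import re

# The same three patterns, compiled once with their flags.
STYLE = re.compile(r'<style.*?>.*?</style>', re.I | re.S)
SCRIPT = re.compile(r'<script.*?>.*?</script>', re.I | re.S)
TAGS = re.compile(r'<.*?>', re.S)


def strip_html(text):
    """Hypothetical helper: drop style/script blocks, then remaining tags."""
    text = STYLE.sub('', text)
    text = SCRIPT.sub('', text)
    return TAGS.sub('', text)


# Made-up sample with upper-case tags and a multi-line style block.
sample = (
    '<html><head><STYLE type="text/css">\n'
    'body { background: white }\n'
    '</STYLE><script>window.alert("hi")</script></head>\n'
    '<body><p>Hello <b>world</b></p></body></html>'
)
result = strip_html(sample)
```

The order matters: the style and script blocks have to go before the
generic tags pattern runs, otherwise only the tags are removed and the
CSS and JavaScript between them stay in the text.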

Note that in this case there is usually no need to check for comments
separately... they'll be matched by the tags pattern, as long as the
comment doesn't contain a '>' of its own.
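
A small sketch for checking comment handling on made-up strings.  It
runs the tags pattern over a plain comment and over a comment with an
embedded '>', then tries a dedicated comment pattern first (that extra
pattern is my own addition for comparison).

```python
import re

tags = re.compile(r'<.*?>', re.S)
comments = re.compile(r'<!--.*?-->', re.S)  # optional extra pass

simple = 'before<!-- a plain\nmulti-line comment -->after'
tricky = 'before<!-- if (a > b) -->after'

# The tags pattern swallows an ordinary comment whole.
simple_result = tags.sub('', simple)

# With a '>' inside the comment, the match ends early and leaves a tail.
tricky_result = tags.sub('', tricky)

# Stripping comments first handles both cases.
fixed_result = tags.sub('', comments.sub('', tricky))
```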

Once that works, you may want to do some other things like replace <br>
with line breaks, etc.  But this should be enough to make progress with.
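
As a rough sketch of that next step, assuming only <br> (in any case,
with or without a trailing slash) should become a newline.  The name
to_text and the sample string are just for illustration.

```python
import re

br = re.compile(r'<br\s*/?>', re.I)
tags = re.compile(r'<.*?>', re.S)


def to_text(html):
    # Turn <br>, <BR>, <br/> and <br /> into newlines before the
    # generic tags pattern removes everything else.
    html = br.sub('\n', html)
    return tags.sub('', html)


result = to_text('line one<BR>line two<br />line <i>three</i>')
```

The <br> replacement has to run before the tags pattern, since
afterwards there is nothing left to replace.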

HTH,

Dylan




On Tue, 2003-08-05 at 05:26, ken@practical.org wrote:
> Hi all,
> 
> I want to display a text-only version of a web page captured with the DocumentLibrary product (no longer supported).
> 
> This product uses the 'Catalog Support' HTML converter available here:
> 
> http://www.dieter.handshake.de/pyprojects/zope/CatalogSupport.html
> 
> However this converter, like the others I have tried (Strip-o-Gram, as well as an external method based on striphtml.py), seems unable to remove the content of <style></style> or <script></script> tags. So I get plenty of hits with a search for 'children' or 'window' or 'background'...
> 
> Has anyone else confronted this problem?
> 
> I have also made feeble attempts such as the following Script (Python), without success:
> 
> import string
> import re
> 
> text = re.sub('<STYLE.*?>.*?</STYLE>', '', data)
> text = re.sub('<STYLE.*?>.*?</STYLE>', '', text)
> text = re.sub('<style.*?>.*?</style>', '', text)
> text = re.sub('<script.*?>.*?</script>', '', text)
> text = re.sub('<!--.*?-->', '', text)
> text = re.sub('<.*?>', ' ', text)
> return text
> 
> I sure would appreciate some help on this...
> 
> Thanks,
> 
> Ken
> 
> 
> 




