[Zope-CMF] link checking a CMF site

Tres Seaver tseaver@zope.com
Sat, 2 Feb 2002 14:43:54 -0500 (EST)


On Thu, 31 Jan 2002, Ian Clatworthy wrote:

> We have a CMF-based portal with lots of content (both HTML and
> Structured Text) that I want to link check as soon as possible.
> The 1.2 documentation mentions a link checker as something to
> expect post 1.2. Has this development started yet? If so, is
> the code 1.2-specific or can I use it on a 1.1-based site?

I spent a fair amount of time looking for an existing Python
link-ripper (the first part of a checker) before I realized that
ripping links is so simple in Python that nobody had packaged it:

  #! /usr/bin/python
  import re
  import urlparse

  class LinkRipper:
      """
          Package utilities for ripping HTML and STX links from a
          string or a file.
      """
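      # '\b' keeps '<a' from matching '<abbr', '<area', etc.; '[^>]*?'
      # keeps each match inside a single tag, even one spanning lines.
      # Only double-quoted attribute values are matched.
      href = re.compile( r'href="(.*?)"', re.IGNORECASE )
      a_href = re.compile( r'<a\b[^>]*?href="(.*?)"', re.IGNORECASE )
      img_src = re.compile( r'<img\b[^>]*?src="(.*?)"', re.IGNORECASE )
      link_href = re.compile( r'<link\b[^>]*?href="(.*?)"', re.IGNORECASE )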

      def _rip( self, text, pattern ):

          if hasattr( text, 'read' ): # file-like object: read it all
              text = text.read()

          return pattern.findall( text )

      def rip_href( self, text ):
          """
              Extract all 'href=""' targets from text.
          """
          return self._rip( text, self.href )

      def rip_a_href( self, text ):
          """
              Extract all '<a href=""' targets from text.
          """
          return self._rip( text, self.a_href )

      def rip_img_src( self, text ):
          """
              Extract all '<img src=""' targets from text.
          """
          return self._rip( text, self.img_src )

      def rip_link_href( self, text ):
          """
              Extract all '<link href=""' targets from text.
          """
          return self._rip( text, self.link_href )

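  _ripper = None
  def rip_links( text ):
      """
          Return the '<a href=""' targets in text, normalized by a
          round trip through urlparse / urlunparse.
      """
      global _ripper
      if _ripper is None: # create the shared ripper lazily
          _ripper = LinkRipper()
      links = _ripper.rip_a_href( text )
      parsed = map( urlparse.urlparse, links )
      return map( urlparse.urlunparse, parsed )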

  if __name__ == '__main__':
      # Read HTML from stdin and print the parsed form of each link.
      import sys
      _ripper = LinkRipper()
      for link in _ripper.rip_a_href( sys.stdin ):
          print urlparse.urlparse( link )
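
For the second half (the actual checking), something along these
lines might do. It is only a sketch, written for Python 3 rather
than the Python 2 of the listing above. The names resolve_links,
check_links, and http_fetch are made up for illustration and are
not part of CMF or Zope. The fetch function is passed in as an
argument so the logic can be exercised without touching the network.

```python
import re
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.request import urlopen

# Same idea as LinkRipper.a_href: double-quoted href inside an <a> tag.
A_HREF = re.compile(r'<a\b[^>]*?href="(.*?)"', re.IGNORECASE)


def resolve_links(html, base_url):
    """Return absolute http(s) <a href> targets found in html.

    Relative links are resolved against base_url, fragments are
    dropped, and duplicates are removed (first occurrence wins).
    """
    found = []
    for href in A_HREF.findall(html):
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).scheme not in ('http', 'https'):
            continue  # skip mailto:, javascript:, and the like
        if absolute not in found:
            found.append(absolute)
    return found


def check_links(urls, fetch):
    """Map each URL to True (reachable) or False (fetch raised)."""
    results = {}
    for url in urls:
        try:
            fetch(url)
            results[url] = True
        except Exception:
            results[url] = False
    return results


def http_fetch(url, timeout=10):
    """One possible fetch: a plain GET; urlopen raises on HTTP errors."""
    with urlopen(url, timeout=timeout) as response:
        response.read(1)
```

Per document, one would then call something like
check_links(resolve_links(page_html, page_url), http_fetch) and
report the URLs that map to False.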

> If not, is there some generic Zope link checking code I can
> leverage? Likewise, if anyone has any design advice on the
> best way to do this or things to watch out for, I'd appreciate
> it. If I need to write something, it would be good if it was
> general enough for others to use.

I hope that is enough for a start;  I am unlikely to be able to
push it very far myself in the near term, but would be glad to
add a contributed 'portal_links' tool to the core.

Tres.
-- 
===============================================================
Tres Seaver                                tseaver@zope.com
Zope Corporation      "Zope Dealers"       http://www.zope.org