[Zope-CMF] Re: Searching multilingual CMF sites

Greg Ward gward@python.net
Fri, 7 Mar 2003 17:22:51 -0500


On 07 March 2003, To nuxeo-localizer@nongnu.org said:
> [sorry for posting to two lists, but I'm really not sure if the
> Localizer community or the CMF community is the right place to ask!]
> 
> What's the best way to implement searching on a multilingual site?  I've
> got a CMF site with bilingual content up-and-running thanks to
> Localizer, and managed to cobble together a fairly functional "search"
> box by stealing some scripts from Plone.  But it gets weird when you
> cross language boundaries.

OK, I'm well on the way to solving this problem.  Thought I'd share my
approach for posterity -- future archive-searchers will no doubt thank
me.  ;-)

Turns out this was a CMF/Zope question; Localizer barely enters into it.
(It's only needed to find out the user's current language at search
time.)  Here's what I did:

  * in $portal/portal_catalog, create two new vocabularies:
    vocab_en and vocab_fr

  * then pop over to the "Indexes" tab and create two new indexes:
    SearchableText_en and SearchableText_fr.  Use the corresponding
    language-specific vocabulary in each index.

  * I already had a SearchableText() method in LocDublinCore,
    which all of my content classes inherit from (shamelessly
    stolen from Rainer Thaden's LocCMFProduct); I extended it to
    have a language-neutral mode and language-specific modes,
    then added trivial SearchableText_en() and SearchableText_fr()
    wrappers.  Here's the code:

      def SearchableText (self, language=None):
          words = []
          for pty in self._local_properties.keys():
              pty_val = self._local_properties[pty]
              if language is None:        # index all languages
                  for (lang, val) in pty_val.items():
                      if lang and val:
                          words.append(val)
              else:                       # only index selected language
                  val = pty_val.get(language)
                  if val:
                      words.append(val)

          return " ".join(words)

      def SearchableText_en (self):
          return self.SearchableText(language="en")

      def SearchableText_fr (self):
          return self.SearchableText(language="fr")

    This is fairly evil, since it grubs rudely through data structures
    inherited from LocalPropertyManager (part of Localizer).  I didn't
    see a clean + efficient way to do this, so I went with rude +
    efficient.  ;-(

    Also, hard-coding the set of languages into those two wrapper
    methods is Just Wrong.  I think I can get around that with a clever
    __getattr__() method, but haven't done that yet.
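
For the archives, here is roughly what that __getattr__() idea might
look like.  It is an untested-in-Zope sketch, not what I'm running:
LocDublinCoreSketch is a made-up stand-in for the real class, and it
assumes _local_properties is a plain {property: {language: value}}
mapping.  How this interacts with Acquisition and with the way ZCatalog
looks up index attributes is an open question.

```python
class LocDublinCoreSketch:
    """Hypothetical stand-in for LocDublinCore (not the real class)."""

    _prefix = "SearchableText_"

    def __init__(self, local_properties):
        # Assumed shape: {property_name: {language_code: value}}
        self._local_properties = local_properties

    def SearchableText(self, language=None):
        words = []
        for pty_val in self._local_properties.values():
            if language is None:        # index all languages
                for (lang, val) in pty_val.items():
                    if lang and val:
                        words.append(val)
            else:                       # only index selected language
                val = pty_val.get(language)
                if val:
                    words.append(val)
        return " ".join(words)

    def __getattr__(self, name):
        # Python only calls this when normal attribute lookup fails,
        # so real attributes and methods are unaffected.
        if name.startswith(self._prefix) and len(name) > len(self._prefix):
            language = name[len(self._prefix):]
            return lambda: self.SearchableText(language=language)
        raise AttributeError(name)


doc = LocDublinCoreSketch({
    "title": {"en": "Hello", "fr": "Bonjour"},
    "description": {"en": "World", "fr": ""},
})
print(doc.SearchableText_en())   # Hello World
print(doc.SearchableText_fr())   # Bonjour
```

One catch: any suffix is accepted, so SearchableText_xx() quietly returns
an empty string instead of failing.  Checking the suffix against the
site's configured languages would be stricter.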

  * finally, I modified the search method to select the index to
    search based on the user's current language.  My search form
    looks (roughly) like this:

      <form name="searchform" action="search"
            tal:attributes="action string:${portal_url}/search" method="GET">
        <input id="searchGadget" 
               name="text" 
               type="text" 
               size="15" 
               value="">
      </form>

    And here's the Python Script that processes this form:

      text = context.REQUEST.get("text")
      if text:
          lang = context.Localizer.get_selected_language()
          key = "SearchableText_%s" % lang
          query = {key : text}
          return context.portal_catalog(query)
      else:
          return []

...and this works fine!  There are only two problems left:

  * search results are shown in the language that was current when
    the object was cataloged, presumably because of the way ZCatalog
    harvests meta-data at catalog-time.  I suspect I can fix this if
    I can persuade ZCatalog to harvest meta-data in all available
    languages.

  * searching for words with non-ASCII characters is tricky -- IMHO,
    searching for "francais" should yield the same as searching for
    "français", i.e. the index should take care of collapsing accented
    characters somehow.  But I'm no linguist -- that might just
    squeak by with accents in French, but whether the same approach
    would work for Nordic å or German ß, I don't know.  Anyways,
    this should be up to either the index or the vocabulary -- it's
    not my problem!
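
To make the "collapsing" idea concrete, here is a minimal sketch using
Python's standard unicodedata module: decompose each character, then
drop the combining marks.  The function name fold_accents is made up,
and where such a step would hook into a ZCatalog vocabulary or index is
not shown (I don't know).  It is purely mechanical, not linguistic.

```python
import unicodedata


def fold_accents(text):
    """Drop combining marks so accented and plain spellings compare equal."""
    # NFKD splits e.g. U+00E7 (c with cedilla) into "c" + a combining cedilla.
    decomposed = unicodedata.normalize("NFKD", text)
    # Keep everything except the combining marks themselves.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_accents("français"))   # francais
print(fold_accents("århus"))      # arhus
print(fold_accents("straße"))     # straße (unchanged)
```

So this handles French accents and turns å into a, but ß has no Unicode
decomposition and passes through untouched; it would need its own rule
(e.g. mapping it to "ss").  Whether folding å to a is what Nordic users
expect is a separate question.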

-- 
Greg Ward <gward@python.net>                         http://www.gerg.ca/