[Zope] Preventing duplicates in ZCatalog

Wankyu Choi wankyu@neoqst.com
Tue, 22 Apr 2003 18:51:19 +0900


Dear all,

I have a very elusive problem again ;-) with ZCatalog. I'm very sorry if
this topic has been dealt with; I couldn't find any info on this one.

Here's a very simple concept I'd like to implement in my applications using
ZCatalog:

	** Every entry in the built-in ZCatalog should be unique.  No
duplicates!**

However, every object's reference could get duplicated (that is, inserted
more than once) when the object is accessed via acquisition. Say I have two
folders, "dir1" and "dir2", and an object "obj1" in the dir1 folder:

	* Accessing obj1 via /dir1/obj1 catalogs it with the uid
"/dir1/obj1".

	* Accessing it again via "/dir1/obj1" either skips cataloging the
object or recatalogs it with the same uid.

	* Accessing it via "/dir2/obj1" or "/dir2/dir1/obj1" duplicates the
catalog entry with a different uid.
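
As a minimal sketch of the behaviour listed above, here is a toy model in
plain Python. It is not the real ZCatalog API: ToyCatalog and its
catalog_object method are made-up names for illustration, and the catalog is
assumed to be nothing more than a dictionary keyed by the path used to reach
the object.

```python
class ToyCatalog:
    """Toy stand-in for a catalog keyed by access path (not ZCatalog)."""

    def __init__(self):
        self.entries = {}  # uid (path string) -> object

    def catalog_object(self, obj, uid):
        # Same uid: the existing entry is overwritten (recataloged).
        # New uid: a new entry appears, even for the very same object.
        self.entries[uid] = obj


obj1 = object()  # the single real object living in /dir1
catalog = ToyCatalog()

catalog.catalog_object(obj1, "/dir1/obj1")       # first access
catalog.catalog_object(obj1, "/dir1/obj1")       # same uid, no new entry
catalog.catalog_object(obj1, "/dir2/dir1/obj1")  # acquired path, new entry

entry_count = len(catalog.entries)
object_count = len({id(o) for o in catalog.entries.values()})
print(entry_count, object_count)
```

The catalog ends up with two entries that both point to one object, which is
the duplication described in the list.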

It shouldn't be a serious problem when searching the catalog, apart from
slightly degraded (though negligible) performance due to too many duplicates,
but it is indeed a serious problem when presenting search results:
"/dir1/obj1", "/dir2/dir1/obj1", etc. are different references pointing to
the same object. (I found the portal_catalog in one of my virtual hosts had
tons of duplicates: more than 30,000 entries were removed when I recreated
the catalog.)

I have a message board application with a built-in catalog, and it recatalogs
an article whenever a user views it, because viewing increases its read count.
(Okay, it bloats the ZODB, but that's a whole different matter that should be
addressed separately. Any hints or tips on this one would also be **greatly**
appreciated!) The list of articles is built from the board's built-in catalog
on the fly. The read count is also an index, so accessing an article via
acquisition duplicates that very article's reference in the catalog over and
over again.

The problem gets worse when you use a Virtual Host Monster (VHM). I solved
this problem for my message board application by saving articles' relative
paths as uids and reconstructing their absolute paths from the article
container's (that is, the built-in catalog's) absolute_url() at the moment
of retrieving them from the catalog. It works just fine even with VHM, since
the built-in catalog is never used **outside of the application.**
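
The relative-path idea can be sketched roughly as follows. This is only an
illustration in plain Python, not the application's actual code: the helper
names to_relative_uid and to_absolute_url, the "board" container, and the
article path are all hypothetical, and the object path is assumed to be the
article's containment path.

```python
import posixpath


def to_relative_uid(container_path, object_path):
    """Return object_path relative to container_path (used as the uid)."""
    return posixpath.relpath(object_path, container_path)


def to_absolute_url(container_url, uid):
    """Rebuild a URL from the container's current URL and a stored uid."""
    return container_url.rstrip("/") + "/" + uid


# Hypothetical paths: the uid no longer contains the VHM base folder.
uid = to_relative_uid(
    "/root/www_example_net/board",
    "/root/www_example_net/board/2003/article42",
)

# The same stored uid resolves correctly under either view of the container.
public_url = to_absolute_url("http://www.example.net/board", uid)
direct_url = to_absolute_url("http://localhost:8080/root/www_example_net/board/", uid)
print(uid, public_url, direct_url)
```

Because the uid is stored relative to the container, it stays the same
whichever URL the container is reached through; only the prefix supplied at
retrieval time changes.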

But I couldn't figure out how to solve this with the portal_catalog tool,
for example, or any other application that should be accessed site-wide.
Although VHM is a great tool, it creates a lot of headaches in terms of
paths and stuff for me.

To address this problem, I put Squid as a frontend before my ZEO clients
(cache peers) and created a redirect script that removes VHM paths:

	* http://www.example.net is redirected to /root/www_example_net, a
VHM base

	* http://www.example.net/www_example_net is also valid, but leads to
duplicate entries in the catalog

	* The URL redirector removes any reference to the folder www_example_net
in the given URL
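
The path-stripping step of such a redirector might look like the sketch
below. It is an assumption about the logic only, not the actual script: the
function name strip_vhm_folder is made up, and how the function is wired into
Squid is left out.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_vhm_folder(url, folder="www_example_net"):
    """Drop every path segment equal to `folder` from the URL."""
    parts = urlsplit(url)
    # Keep all path segments except the VHM base folder.
    segments = [s for s in parts.path.split("/") if s != folder]
    path = "/".join(segments) or "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


cleaned = strip_vhm_folder("http://www.example.net/www_example_net/news/item1")
untouched = strip_vhm_folder("http://www.example.net/news/item1")
print(cleaned, untouched)
```

Both calls yield the same URL, so the catalog would only ever see one form of
the path for requests that pass through the redirector.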

It only works for VHM cases, but is no good with accesses via acquisition.

I want to make sure, at the source code level, that no duplicates get into
any catalog, even when using VHM or when users access objects by way of
acquisition.

(The problem is easy to reproduce. In a CMF site, create a couple of
folders and, in one of them, create a news item and publish it. It will
appear at the top of the news_slot box. Revisit the news item via
acquisition, that is, by putting the other folder in front of the item in
the URL, and edit it. Voila, the news_slot box shows two references to the
news item.)

Creating a property in the catalog that holds unique keys for every object
and checking the property before inserting a new entry... well... seems to be
very kludgy.

In short, can I catalog objects using acquisition-safe unique keys, say
their oids or an auto-generated md5 hash or something?
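
To make the question concrete, here is a toy sketch of what I mean, in plain
Python rather than against the real ZCatalog. Item, stable_uid and OidCatalog
are hypothetical names; Item.oid merely plays the role of a ZODB object id
(the kind exposed as _p_oid on persistent objects), and whether ZCatalog can
be made to work this way is exactly what I'm asking.

```python
import hashlib


class Item:
    """Hypothetical stand-in for a persistent object with a fixed oid."""

    def __init__(self, oid):
        self.oid = oid  # bytes; plays the role of a ZODB object id


def stable_uid(item):
    # Hash the oid so the key is a printable string that does not
    # depend on the path used to reach the object.
    return hashlib.md5(item.oid).hexdigest()


class OidCatalog:
    """Toy catalog keyed by a path-independent uid (not ZCatalog)."""

    def __init__(self):
        self.entries = {}  # stable uid -> (item, last access path)

    def catalog_object(self, item, access_path):
        # access_path is stored as metadata only; it never becomes the key.
        self.entries[stable_uid(item)] = (item, access_path)


article = Item(b"\x00\x00\x00\x00\x00\x00\x04\xd2")
oid_catalog = OidCatalog()
oid_catalog.catalog_object(article, "/dir1/obj1")
oid_catalog.catalog_object(article, "/dir2/dir1/obj1")

print(len(oid_catalog.entries))
```

However the object is reached, there is only ever one entry for it. The
stored path is simply whichever one was used last, so presenting results
would still need some canonical path kept alongside the key.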

(I wonder, if it's doable with little trouble, why the designer of ZCatalog
made it in such a way that objects' physical paths work as uids, which may
lead to the above-mentioned problems, since they can't be unique when objects
are acquired through different URLs... Please enlighten me if there's
something I should be aware of in this regard, Casey. Terribly sorry for
another newbish stupid question :-) )

Any hints or tips would be appreciated.

Thanks in advance.

Wankyu Choi
---------------------------------------------------------------
  Wankyu Choi
  CEO/President
  NeoQuest Communications, Inc.
  http://www.zoper.net
  http://www.neoboard.net
---------------------------------------------------------------