[ZODB-Dev] ZODB for spambayes server-side filter?

Simone Piunno pioppo at ferrara.linux.it
Mon Jan 12 09:44:43 EST 2004


Hello,

I'm working on a server-side spam filter based on spambayes.
After some prototyping with BDB4, I've started to look at ZODB.
I'm trying to understand if this is a good idea.

The project design is the following:

 - implemented as an SMTP proxy daemon

 - we'll keep a server-wide Word Probability Database (WPD), shared among
   all users or a group of them, and we'll also have a per-user WPD.  Users
   will be able to choose between them, or even disable the filter
   completely for their address.

 - the filter will see all the traffic, and for each email it can choose the
   right WPD based on the RCPT address (the envelope recipient).

 - training will mostly be done by forwarding messages as MIME attachments.
   If the filter is not sure about the trainer's identity (e.g. because no
   password-based authentication scheme is in place), it could send a
   request to the trainer's address, with a cookie in the subject, asking
   for a reply to confirm the identity (much like Mailman does for
   subscriptions).

 - all messages can be temporarily retained in a cache, so that training
   can be performed on the pristine copy instead of the forwarded one.  Old
   entries in the cache would be automatically expired.

 - users can choose to receive all the traffic, simply tagged, or they can
   choose to block spam and/or unsures.  They will receive a daily report
   on blocked email, so that just by skimming the from/subject list in the
   report they can decide whether a correction is needed.  Blocked email
   can be unblocked and/or trained manually through the web, as long as
   this is done before the automatic expiration timeout.

 - configuration will be done through the web.

 - accurate statistics will be kept per-user and server-wide.
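
The per-recipient WPD choice described in the list above could look
roughly like the following.  This is only a minimal sketch with plain
dicts standing in for the persistent objects: the names `UserPrefs` and
`select_wpd`, the three mode strings, lower-casing the address, and
falling back to the shared WPD for unknown recipients are all
assumptions made for illustration, not part of the design above.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# A WPD is modelled here as a plain word -> spam probability mapping.
WPD = Dict[str, float]


@dataclass
class UserPrefs:
    # Hypothetical per-user setting: "shared", "personal" or "disabled".
    mode: str = "shared"


def select_wpd(rcpt: str,
               prefs: Dict[str, UserPrefs],
               shared_wpd: WPD,
               personal_wpds: Dict[str, WPD]) -> Optional[WPD]:
    """Return the WPD to use for one RCPT address, or None to skip filtering."""
    # Assumption: addresses are compared case-insensitively.
    addr = rcpt.strip().lower()
    user = prefs.get(addr)
    if user is None:
        # Assumption: recipients without stored preferences use the shared WPD.
        return shared_wpd
    if user.mode == "disabled":
        return None
    if user.mode == "personal":
        # Fall back to the shared WPD until the user has trained a personal one.
        return personal_wpds.get(addr, shared_wpd)
    return shared_wpd
```

Returning None for a disabled user lets the proxy pass the message
through untouched without looking at any database.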

I believe plain BDB is too flat to persist such a complex data structure,
therefore I've started looking at ZODB.  I'm fairly convinced that a
transactional storage is required here, and that it will be mostly
read-only: writes will happen only for training, stats updates and
configuration.  After some benchmarking, I measured a 5-10x performance
increase.

One main question is: how do I avoid a collapse under collisions?  As a
first approximation, I think that in case of a transaction collision I can
safely abort the SMTP connection and wait for the sender to retry, but how
can I be sure that the retries won't accumulate and bring the database
down?
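
One way to keep retries bounded is sketched below, purely as an
illustration.  A transaction collision (ZODB reports it as a
ConflictError) is retried a small, fixed number of times with a random,
growing pause, and after that the proxy gives up and answers with a
temporary SMTP failure so that the sending MTA retries later on its own
schedule.  The local `ConflictError` class is a stand-in so that the
sketch runs without ZODB installed; `TemporaryFailure`,
`run_with_retries` and the default numbers are hypothetical.  This
limits the work done per message, but it is not a guarantee that the
database stays healthy under heavy write load.

```python
import random
import time


class ConflictError(Exception):
    """Stand-in for ZODB.POSException.ConflictError (keeps the sketch runnable)."""


class TemporaryFailure(Exception):
    """Hypothetical signal telling the proxy to reply with an SMTP 4xx code."""


def run_with_retries(work, max_attempts=3, base_delay=0.05, sleep=time.sleep):
    """Call work() until it succeeds or max_attempts collisions have occurred."""
    for attempt in range(max_attempts):
        try:
            # With real ZODB, work() would do its writes and commit, and the
            # failed transaction would be aborted before the next attempt.
            return work()
        except ConflictError:
            if attempt == max_attempts - 1:
                break
            # Random pause with a growing upper bound, so that colliding
            # writers do not all retry at the same moment.
            sleep(random.uniform(0, base_delay * (2 ** attempt)))
    # Give up: the sender's MTA queues the message and retries later.
    raise TemporaryFailure("451 temporary local problem, try again later")
```

Since the design is mostly read-only, only the write paths (training,
stats, configuration) would need to go through such a wrapper.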

TIA
  Simone
-- 
This signature intentionally left blank




