[Zope] Folder with one million Documents?

Casey Duncan casey_duncan@yahoo.com
Sun, 27 Jan 2002 22:21:07 -0800 (PST)


--- Joachim Werner <joe@iuveno-net.de> wrote:
> Hi!
> 
> Just my 2 eurocents:
> 
> > I am developing a simple DMS. Up to now I use a python product
> > with a BTreeFolder which contains all the documents. Every
> > document gets an ID with DateTime().millis(). There will be up to
> > 50 users working at the same time. And in the end I will have up
> > to 3 million documents.
> >
> > Is there a better class than BTreeFolder for such mass storage?
> 
> If it is mainly large documents (like MS Office or PDF files) you
> are trying to manage, the fastest way of handling this is using the
> filesystem for storage and serving. You could do the cataloging in
> Zope and hold link objects to the actual files in a Zope tree (and
> yes, if it is MANY objects, BTrees will be a good idea). These links
> could also manage the metadata.

I thoroughly agree. Having developed a DMS myself, my
cut-off point (which is really just engineering
intuition more than anything) is about 5000 documents;
beyond that, it would be best to store them directly in
the file system.

Now, since the DMS I developed (DocumentLibrary) was
for a target of < 5000 documents, I went for the
simpler route of storing them in a BTreeFolder.

What you will have to do to make an effective FS
storage system is write code that processes uploads
and places them in an arbitrary hierarchy. Putting
3 million documents in one FS directory will just
plain fail in most FSes and at best will perform
dismally. You'll need to devise a way for the system
to subdivide them amongst a shallow hierarchy of dirs,
something like Squid does with its cache directories.
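
As a rough illustration of that kind of subdivision, here is a
minimal sketch in plain Python. It is not Squid's actual layout and
not part of any Zope product: the names sharded_path and
store_upload, the two-level layout, and the use of a SHA-256 hash of
the document ID are all assumptions made for the example.

```python
import hashlib
import os


def sharded_path(root, doc_id, levels=2, width=2):
    """Return a path under root for doc_id, spread over a shallow tree.

    Hypothetical helper: the directory names are taken from the hex
    digest of the ID, so with levels=2 and width=2 there are
    256 * 256 = 65536 leaf directories. Three million documents then
    average roughly 46 files per directory.
    """
    digest = hashlib.sha256(str(doc_id).encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(root, *parts, str(doc_id))


def store_upload(root, doc_id, data):
    """Write uploaded bytes to their sharded location; return the path."""
    path = sharded_path(root, doc_id)
    # Create the intermediate directories on first use.
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(data)
    return path
```

Hashing the ID rather than using its leading digits keeps the
directories evenly filled even when the IDs are timestamps that
share a long common prefix. The same function finds a file again
later, so only the ID needs to be kept in the Zope-side link object.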

For serving the files you could use Apache, but I
might be tempted to try something simpler, like
micro_httpd or TUX or something else light-weight.

I agree that serving static binaries is not ZServer's
strong suit. I guess that choice will depend on the
frequency and size of downloads.

Another thought might be to store the files in the FS
and proxy them through Zope, like ExtFile does. Then
put Squid in front of Zope to cache them so that they
are only served the first time from Zope. Then you
don't have to worry about what stuff is getting served
from where.
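
For a front-end cache to hold on to a proxied file, the response
has to be marked as cacheable. The following is only a sketch of the
idea in plain Python; it does not show ExtFile's implementation or
any Zope API. The function names, the one-day max_age default, and
the chunk size are assumptions for illustration.

```python
import os
from email.utils import formatdate


def cache_headers(path, max_age=86400):
    """Build response headers that let a caching proxy keep the file.

    Hypothetical helper: max_age is in seconds (one day by default).
    """
    st = os.stat(path)
    return {
        "Content-Length": str(st.st_size),
        # HTTP dates are expressed in GMT.
        "Last-Modified": formatdate(st.st_mtime, usegmt=True),
        "Cache-Control": "public, max-age=%d" % max_age,
    }


def iter_file(path, chunk_size=64 * 1024):
    """Yield the file in chunks so it is never held in memory whole."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk
```

The application server would send these headers and stream the
chunks on the first request; repeat requests within max_age can then
be answered by the cache without touching the back end.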

BTW: If you do set up any nifty FS storage solution, I
would be interested in seeing it for a future version of
DocumentLibrary.

Good Luck!
-Casey


