[ZODB-Dev] polite advice request

Christian Tismer tismer at stackless.com
Sat Aug 17 05:49:03 CEST 2013


Hi Jim et al.!

I am struggling with a weird database, and my goal is to show off how
well this works with ZODB (or Durus; the latter already pretty much failed).

Just to give you an impression of the size of the problem:

There are about 25 tables, each currently holding 450,000 records.
Counting all the changes since 2012-01-01, some 700,000 records were
involved and morphed for each table.

These records carry some relevant data, but they extend to something like
95 additional columns, which are pretty cumbersome.

This database is pretty huge and contains lots of irrelevant data.

When I create the full database in naive dumb style (everything stored
as tuples), this crap becomes huge and nearly intractable for Python.

I managed to build some versions, but more on that below:

On top of the 25-table snapshot, this database mutates every two weeks!
Most of the time, there are a few thousand updates.
But sometimes, the whole database changes, because they decided to
remove and add some columns, which creates a huge update that changes
almost everything.

I am trying to cope with that in a better way.
I have examined lots of approaches to such structures and tried some
things with BTree forests.

In the end, it turned out that structural changes to the database
(2 columns removed, 5 inserted) result in huge updates with no real effect.
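
To make this concrete, here is a toy version of the row-wise layout
(the values are fake, and I use 1,000 records instead of the real 450,000):

    from BTrees.IOBTree import IOBTree

    table = IOBTree()    # record id -> tuple of ~100 column values
    for rid in range(1000):
        table[rid] = tuple('r%d-c%d' % (rid, col) for col in range(100))

    # A structural change (drop columns 10-11, insert 5 new ones) has to
    # rebuild every tuple, so every record and every BTree bucket gets
    # rewritten; the delta is practically a full copy of the table.
    for rid, old in list(table.items()):
        table[rid] = old[:10] + ('new',) * 5 + old[12:]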

Question:
Have you had this problem, and can you give me some advice?
I was thinking of switching the database to a column-oriented layout,
since that way I could probably get rid of the big deltas that merely
re-arrange very many columns.

But the overhead of doing this seems huge, again.
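
Concretely, the layout I have in mind looks roughly like this (a sketch
assuming plain ZODB BTrees; all the column names are invented):

    from persistent.mapping import PersistentMapping
    from BTrees.IOBTree import IOBTree

    columns = PersistentMapping()     # column name -> IOBTree(rid -> value)
    for name in ('isin', 'price', 'venue'):
        columns[name] = IOBTree()

    def set_row(rid, **values):
        for name, value in values.items():
            columns[name][rid] = value

    set_row(1, isin='XY0001', price=10.5, venue='XETRA')

    # Dropping a column is now one mapping operation instead of a
    # rewrite of 450,000 tuples, and adding one only touches new data:
    del columns['venue']
    columns['rating'] = IOBTree()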

Do you have a good implementation of a column store?
I would like to implement a database that tracks everything, but is
able to cope with such massive yet simple changes.

In effect, I don't want to keep all the modified records, but rather
have some function that creates the currently relevant tuples on demand.
Even that seems difficult. And the whole problem is quite trivial; it
just suffers from Python's tendency to create so very many objects.
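
Something along these lines, where versions[0] is a full base snapshot
and each later entry is a small delta dict (a plain-Python sketch; the
names are hypothetical):

    def row(versions, rid, upto):
        """Return the row for `rid` as of version index `upto`.

        versions[0] maps rid -> full tuple; versions[1:] are small
        delta dicts mapping rid -> new tuple (or None if deleted).
        """
        value = versions[0].get(rid)
        for delta in versions[1:upto + 1]:
            if rid in delta:
                value = delta[rid]
        return value

    base = {1: ('a', 'b'), 2: ('c', 'd')}
    delta1 = {2: ('c', 'D')}                   # version 1 touched record 2
    assert row([base, delta1], 2, upto=1) == ('c', 'D')
    assert row([base, delta1], 2, upto=0) == ('c', 'd')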

--------------------

So my question, again:

- you have 25 tables

- tables are huge (500,000 to 1,000,000 records)

- highly redundant (very many things could be resolved by a function
  with special cases)

- a new version comes every two weeks

- I need to be able to query every version

How would you treat this?

What would you actually store?

Would you generate a full DB every 2 weeks, or would you (as I do) try to
find a structure that knows about the differences?

Is Python still the way to go, or should I stop this and use something
like PostgreSQL? (I doubt that would give any real benefit, actually.)

Would you implement a column store, and how would you do that?
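
For completeness, my own tentative direction would combine the two
sketches above: one BTree per column, plus one small overlay BTree per
version (again, every name below is made up):

    from persistent import Persistent
    from persistent.list import PersistentList
    from BTrees.IOBTree import IOBTree

    class Column(Persistent):
        """One column: a base BTree plus one small overlay per version."""

        def __init__(self):
            self.base = IOBTree()             # rid -> value at version 0
            self.overlays = PersistentList()  # one IOBTree per later version

        def new_version(self):
            self.overlays.append(IOBTree())

        def set(self, rid, value):
            # Record a change in the newest version only.
            self.overlays[-1][rid] = value

        def get(self, rid, version):
            # The newest overlay up to `version` wins; fall back to base.
            for overlay in reversed(self.overlays[:version]):
                if rid in overlay:
                    return overlay[rid]
            return self.base.get(rid)

Schema changes would then only create or delete Column objects, and a
biweekly update would only fill the newest overlays.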


Right now everything gets too large, and I'm quite desperate.
Therefore, I'm asking the master, which you definitely are!

cheers -- Chris

-- 
Christian Tismer             :^)   <mailto:tismer at stackless.com>
Software Consulting          :     Have a break! Take a ride on Python's
Karl-Liebknecht-Str. 121     :    *Starship* http://starship.python.net/
14482 Potsdam                :     PGP key -> http://pgp.uni-mainz.de
phone +49 173 24 18 776  fax +49 (30) 700143-0023
PGP 0x57F3BF04       9064 F4E1 D754 C2FF 1619  305B C09C 5A3B 57F3 BF04
       whom do you want to sponsor today?   http://www.stackless.com/


