[ZODB-Dev] polite advice request

Jim Fulton jim at zope.com
Sun Aug 18 17:09:35 CEST 2013


On Fri, Aug 16, 2013 at 11:49 PM, Christian Tismer <tismer at stackless.com> wrote:
> Hi Jim et al.!
>
> I am struggling with a weird database, and my goal is to show off how
> well this works with (ZODB|Durus; the latter has already pretty much failed).
>
> Just to give you an impression of the size of the problem:
>
> There are about 25 tables, each currently holding 450,000 records.
> Counting all the changes since 2012-01-01, some 700,000 records per table
> have been involved and morphed.
>
> These records have some relevant data, but also carry something like 95
> additional columns, which are pretty cumbersome.
>
> This database is pretty huge and contains lots of irrelevant data.
>
> When I create the full database in native dumb style (create everything
> as tuples), this crap becomes huge and nearly intractable for Python.
>
> I managed to build some versions, but see further:
>
> In addition to the 25-table snapshot, this database mutates every 2 weeks!
> Most of the time, there are a few thousand updates.
> But sometimes, the whole database changes, because they decided to
> remove and add some columns, which creates a huge update that changes
> almost everything.
>
> I am trying to cope with that in a better way.
> I examined lots of approaches to handling such structures and tried some
> things with BTree forests.
>
> In the end, it turned out that structural changes to the database (2 columns
> removed, 5 inserted) result in huge updates with no real effect.
>
> Question:
> Did you have that problem, and can you give me some advice?
> I was thinking of switching the database to a column-oriented layout, since
> that way I could probably get rid of the big deltas that just re-arrange so
> many columns.
>
> But the overhead for doing this seems to be huge, again.
>
> Do you have a good implementation of a column store?
> I would like to implement a database that tracks everything, but is able
> to cope with such massive but simple changes.
>
> In effect, I don't want to keep all the modified records, but to have some
> function that creates the currently relevant tuples on demand.
> Even that seems difficult. And the whole problem is quite trivial; it just
> suffers from Python's tendency to create so very many objects.
>
> --------------------
>
> So my question, again:

I doubt I understand them. :)

> - you have 25 tables

Of course, ZODB doesn't have tables.

We have applications with many more data types.

We also have applications with many more collections,
which are often heterogeneous.

In ZODB data types and collections are generally
orthogonal.

Good OO database design tries to avoid
queries/joins in favor of object traversal.
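
For example (a made-up sketch; the class names are
illustrative, not from your application):

from persistent import Persistent
from BTrees.OOBTree import OOBTree

class Order(Persistent):
    def __init__(self, order_id, total):
        self.order_id = order_id
        self.total = total

class Customer(Persistent):
    def __init__(self, name):
        self.name = name
        self.orders = OOBTree()   # order_id -> Order

# What would be a join in SQL:
#   SELECT total FROM orders WHERE customer = 'alice' AND id = 42
# is a plain traversal here:
#   root['customers']['alice'].orders[42].total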

>
> - tables are huge (500,000 to 1,000,000 records)

We have larger collections. <shrug>

> - highly redundant (very many things could be resolved by a function with
> special cases)
>
> - a new version comes every two weeks
>
> - I need to be able to inquire every version

Not sure what this means.


> How would you treat this?

I don't know what you're referring to as
"this".

There are a number of strategies for schema
migration, ranging from something as simple as
providing defaults for new attributes in classes,
to custom __setstate__ methods, to in-place data
migration, to *potentially* database transformation
during replication.
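
For example, the first two in one rough sketch (the
attribute names are made up):

from persistent import Persistent

class Record(Persistent):
    # A class-level default covers objects stored before the
    # 'discount' attribute existed.
    discount = 0.0

    # A custom __setstate__ rewrites old pickled state as
    # objects are loaded: two columns dropped, one renamed.
    def __setstate__(self, state):
        state.pop('obsolete_col_1', None)
        state.pop('obsolete_col_2', None)
        if 'amount' in state and 'total' not in state:
            state['total'] = state.pop('amount')
        super(Record, self).__setstate__(state)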

> What would you actually store?

Um, that's too vague a question.

> Would you generate a full DB every 2 weeks, or would you (as I do) try to
> find a structure that knows about the differences?

I don't think I/we understand your problem well enough to
answer.  If data has a very low shelf life, then replacing it frequently
might make sense.  If the schema changes that frequently, I'd
ask why.  If this is a data analysis application, you might be better
served by tools designed for that.

> Is Python still the way to go, or should I stop this and use something like
> PostgreSQL? (And I doubt that this would give a benefit, actually).

Ditto.

> Would you implement a column store, and how would you do that?

Ditto.
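
For what it's worth, the obvious shape for one in ZODB terms
would be one BTree per column, keyed by record id, so that
adding or dropping a column never touches the row data.
A rough, untested sketch (all names made up):

from persistent import Persistent
from BTrees.LOBTree import LOBTree  # integer keys -> object values
from BTrees.OOBTree import OOBTree

class ColumnStore(Persistent):
    def __init__(self):
        self.columns = OOBTree()    # column name -> LOBTree of values

    def add_column(self, name):
        self.columns[name] = LOBTree()

    def drop_column(self, name):
        del self.columns[name]      # one small delta; rows untouched

    def set(self, row_id, name, value):
        self.columns[name][row_id] = value

    def row(self, row_id, names):
        # Materialize the currently relevant tuple on demand.
        return tuple(self.columns[n].get(row_id) for n in names)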

>
> Right now, everything gets too large, and I'm quite desperate. Therefore,
> I'm asking the master, which you definitely are!

"large" can mean many things. The examples you give don't
seem very large in terms of storage, at least not for ZODB.

Beyond that there are lots of dimensions of scale that ZODB
doesn't handle well (e.g. large transaction rates, very
high availability).

It's really hard to make specific recommendations without
knowing more about the problem. (And it's likely that someone
wouldn't be able to spend the time necessary to learn more
about the problem without a stake in it. IOW, don't assume I'll
read a much longer post getting into details. :)

Jim

-- 
Jim Fulton
http://www.linkedin.com/in/jimfulton

