[ZODB-Dev] polite advice request

Mon Aug 19 09:33:30 CEST 2013

In some ways the ZODB is less flexible than an SQL database: it requires you to understand more about how you will access the data before you import it. This is because in a ZODB the data structure defines how you can query it.
For example, if you need multiple indexes on your data, then to make it efficient you might choose a different data structure, whereas in SQL you can add indexes after the fact. Whichever way you go, however, you are always better off thinking about how you will access your data first. For example, when you reimport the data, do you need to look up each item to see if it's there and merge, or will you just delete the lot and start from scratch?
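
To make the point about indexes concrete, here is a minimal sketch. It is not ZODB code: plain dicts stand in for persistent mappings (in a real ZODB application you would probably use BTrees), and the `Catalog` class and the `sku`, `supplier` and `price` fields are invented for illustration. The point is only that every access path has to be built and kept in sync by hand, so it pays to decide on them before importing.

```python
from collections import defaultdict


class Catalog:
    """Records reachable by primary key and by one hand-maintained index."""

    def __init__(self):
        self.by_sku = {}                     # primary access path: sku -> record
        self.by_supplier = defaultdict(set)  # secondary index: supplier -> skus

    def upsert(self, record):
        """Insert a record, or merge it into an existing one with the same sku."""
        sku = record["sku"]
        old = self.by_sku.get(sku)
        if old is not None:
            # Drop the stale index entry before the record changes.
            self.by_supplier[old["supplier"]].discard(sku)
            record = {**old, **record}
        self.by_sku[sku] = record
        self.by_supplier[record["supplier"]].add(sku)

    def find_by_supplier(self, supplier):
        """Return the skus for one supplier without scanning every record."""
        return sorted(self.by_supplier.get(supplier, ()))


catalog = Catalog()
catalog.upsert({"sku": "a1", "supplier": "acme", "price": 10})
catalog.upsert({"sku": "b2", "supplier": "acme", "price": 12})
catalog.upsert({"sku": "a1", "supplier": "zenith"})  # reimport: merged, index updated
print(catalog.find_by_supplier("acme"))
```

If you go the delete-and-rebuild route instead, the merge branch in `upsert` is never taken and you simply fill a fresh `Catalog` on each import.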

Having said this, you might look at a project like souper that tries to support tabular type data without having to think too much about the data structures.

On 19/08/2013, at 1:09 AM, Jim Fulton <jim at zope.com> wrote:

> On Fri, Aug 16, 2013 at 11:49 PM, Christian Tismer <tismer at stackless.com> wrote:
>> Hi Jim et al.!
>> 
>> I am struggling with a weird database, and my goal is to show off how
>> great this works with (zodb|durus, the latter already failed pretty much).
>> 
>> Just to give you an impression of the size of the problem:
>> 
>> There are about 25 tables, each with currently 450,000 records.
>> After all the changes since 20120101, there were 700,000 records involved
>> and morphed for each table.
>> 
>> These records have some relevant data, but extend to something like 95
>> additional columns which are pretty cumbersome.
>> 
>> This database is pretty huge and contains lots of irrelevant data.
>> 
>> When I create the full database in native dumb style (create everything
>> as tuples), this crap becomes huge and nearly intractable by Python.
>> 
>> I managed to build some versions, but see further:
>> 
>> In addition to the 25-table snapshot, this database mutates every 2 weeks!
>> Most of the time, there are a few thousand updates.
>> But sometimes, the whole database changes, because they decided to
>> remove and add some columns, which creates a huge update that changes
>> almost everything.
>> 
>> I am trying to cope with that in a better way.
>> I examined lots of approaches to cope with such structures and tried some
>> things with btree forests.
>> 
>> After all, it turned out that structural changes of the database (2 columns
>> removed, 5 inserted) result in huge updates with no real effect.
>> 
>> Question:
>> Did you have that problem, and can you give me some advice?
>> I was thinking to switch the database to a column-oriented layout, since
>> this way I could probably get rid of big deltas which just re-arrange very
>> many columns.
>> 
>> But the overhead for doing this seems to be huge, again.
>> 
>> Do you have a good implementation of a column store?
>> I would like to implement a database that tracks everything, but is able to
>> cope with such massive but simple changes.
>> 
>> In effect, I don't want to keep all the modified records, but have some
>> function that creates the currently relevant tuples on-demand.
>> Even that seems difficult. And the whole problem is quite trivial, it just
>> suffers from Python's idea to create so very many objects.
>> 
>> --------------------
>> 
>> So my question, again:
> 
> I doubt I understand them. :)
> 
>> - you have 25 tables
> 
> Of course, ZODB doesn't have tables.
> 
> We have applications with many more data types.
> 
> We also have applications with many more collections,
> which are often heterogeneous.
> 
> In ZODB data types and collections are generally
> orthogonal.
> 
> Good OO database design tries to avoid
> queries/joins in favor of object traversal.
> 
>> 
>> - tables are huge (500,000 to 1,000,000 records)
> 
> We have larger collections. <shrug>
> 
>> - highly redundant (very many things could be resolved by a function with
>> special cases)
>> 
>> - a new version comes every two weeks
>> 
>> - I need to be able to inquire every version
> 
> Not sure what this means.
> 
> 
>> How would you treat this?
> 
> I don't know what you're referring to as
> "this".
> 
> There are a number of strategies
> for schema migration, from ones as simple
> as providing defaults for new attributes
> in classes, to custom __setstate__ methods,
> to in-place data migration, to *potentially*
> database transformation during replication.
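
A minimal sketch of the first two strategies in that list, in plain Python. The `Record` class and its `cost`, `price` and `currency` attributes are hypothetical, and `__setstate__` is called by hand here to stand in for what the unpickler does when it loads state written under an older schema.

```python
class Record:
    # Strategy 1: a class-level default covers instances that were
    # stored before the attribute existed.
    currency = "EUR"

    def __init__(self, price):
        self.price = price

    # Strategy 2: __setstate__ rewrites old state as it is loaded.
    def __setstate__(self, state):
        state = dict(state)  # do not mutate the caller's dict
        if "cost" in state:  # assumed old schema: the field was called "cost"
            state["price"] = state.pop("cost")
        self.__dict__.update(state)


# Simulate loading an object whose state was written under the old schema.
loaded = Record.__new__(Record)
loaded.__setstate__({"cost": 42})
print(loaded.price, loaded.currency)
```

Neither approach rewrites stored data until an object is next modified and saved, so a structural change need not produce a huge update on its own.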
> 
>> What would you actually store?
> 
> Um, that's too vague a question.
> 
>> Would you generate a full DB every 2 weeks, or would you (as I do) try to
>> find a structure that knows about the differences?
> 
> I don't think I/we understand your problem well enough to
> answer.  If data has a very low shelf life, then replacing it frequently
> might make sense.  If the schema changes that frequently, I'd
> ask why.  If this is a data analysis application, you might be better
> served by tools designed for that.
> 
>> Is Python still the way to go, or should I stop this and use something like
>> PostgreSQL? (And I doubt that this would give a benefit, actually).
> 
> Ditto.
> 
>> Would you implement a column store, and how would you do that?
> 
> Ditto.
> 
>> 
>> Right now, everything gets too large, and I'm quite desperate. Therefore,
>> I'm asking the master, which you definitely are!
> 
> "large" can mean many things. The examples you give don't
> seem very large in terms of storage, at least not for ZODB.
> 
> Beyond that there are lots of dimensions of scale that ZODB
> doesn't handle well (e.g. large transaction rates, very
> high availability).
> 
> It's really hard to make specific recommendations without
> knowing more about the problem. (And it's likely that someone
> wouldn't be able to spend the time necessary to learn more
> about the problem without a stake in it. IOW, don't assume I'll
> read a much longer post getting into details. :)
> 
> Jim
> 
> -- 
> Jim Fulton
> http://www.linkedin.com/in/jimfulton