[Zope-dev] NEO High Performance Distributed Fault Tolerant ZODB Storage
vincent at nexedi.com
Wed Apr 7 08:50:37 EDT 2010
Le mardi 6 avril 2010 18:54:24, Jim Fulton a écrit :
> Is there a document you can point to that provide a description of the
> approach used?
A document is in the works, in the while I'll try to describe the architecture
briefly (should be readable to people knowing a bit about TPC) and give some
typical use cases.
NEO is distributed, with different types of nodes involved.
The 3 main types are:
- Client nodes: what gets embedded in any application using NEO, so typically
Zope + ZODB
- Master nodes: one of them gets elected (to become the "primary master node",
other becoming "secondary master nodes"). The primary handles all
centralised tasks, such as:
- oid/tid generation
- cluster consistency checking (high level: "is there enough nodes to cover
- broadcasting cluster changes to all nodes (new storage node, storage node
getting disconnected, etc)
When a primary master falls, secondaries take over by restarting an
There is no persistent data on a master node, asides from its configuration
(the name of the NEO cluster it belongs to, the addresses of other master
nodes and its own listening address)
- Storage nodes: they "contain" object data, stored & accessed through them
and ultimately stored in a local storage backend (MySQL currently, but all
NEO really needs is something which can guarantee some atomicity and limited
Other nodes (existing or planned) are:
- Admin node
Cluster monitoring (human-level: book-keeping of cluster health)
- Control command
Administrator CLI tool to trigger cluster actions.
Technical note: for now, it uses the admin node as a stepping stone in all
cases, though it could be avoided in some cases.
- Backup node
Very similar technically to a storage node, but only dumping data out of
storage nodes, and being a CLI tool, probably not a daemon (not implemented
- there is currently no security in NEO (cryptography/authentication)
- expected cluster size is of around 1000 (client + storage) nodes, probably
more (depends on the node type ratio and usage pattern)
Some sample use cases:
"Client wants data"
Client connects to the primary master node (happens once upon startup,
and again if primary master dies). During the initial handshake, the client
receives the "partition table", which tells him which storage nodes are part
of the cluster and which one contains which object. Then, client connects to
one of the storage nodes which is expected to contain desired data, and asks
it to be served.
"Client wants to commit"
Client is already known to cluster and enters TPC in ZODB: it sends object
data to all candidate storage (known by looking up in its partition table) for
each object. Those storage locally handle locking at object level, a write
lock being taken upon such data transmission. Once the "store" phase is over,
vote phase starts and the vote results depends on storage refusing data (base
version of object not being the latest one at commit time, etc), resulting in
conflict resolution, and ultimately in ConflictError exception.
Then the client notified the primary master of ZODB decision (finish or
abort), which in turn asks all involved storage nodes to make changes
persistent or trash them (releasing object-level write locks).
If it chose to finish, involved storage take a read lock on objects ("barrier"
kind of use), answer master that they are done acquiring this lock. In turn,
master asks them to release all locks (barrier effect is achieved, write lock
get released to allow further transactions) and send invalidation requests to
"New storage enters cluster"
When a storage node enters an existing (and working) cluster, it will be
assigned to some partitions, in order to achieve load balancing. It will by
itself connect to storage nodes having data for those partitions, and start
replicating them. Those partitions might or might not be ultimately be dropped
from their original container nodes, depending on the data replication
constraints (how many copy of those partitions exist in the whole cluster).
When a storage dies (disconnected, or request timeout), it is set as
"temporarily down" in primary master partition table (change which is then
broadcast to all other nodes). it means that the storage node might still have
its data, but is currently unavailable. If the lost storage contained the last
copy of any partition, the cluster lost a part of its content: it asks all
nodes to interrupt service (they stay running, but refuse to serve further
requests until the cluster gets back to a running state).
Note that currently, loosing a storage doesn't trigger automatically the
symmetric action to adding a storage node: partitions are not added to
existing nodes. It can only be triggered by manual action, via the admin node
and its command line tool. This was chosen to avoid seeing the cluster waste
time balancing data over and over if a node starts flapping, and could be made
"Storage comes back"
When a storage comes back to life (after a power failure or whatever), it asks
other storage nodes for the transactions it missed, and replicate them.
> That's unfortunate. Why not a less restrictive license?
Because we have several FSF members in Nexedi, so we use the GPL for all our
products (with very few exceptions, notably monkey-patches reusing code under
> These seem to be very high level.
> You provide a link to the source, which is rather low level.
> Anything in between?
You mean, a presentation of NEO (maybe what I wrote above would fit) ?
Or on the "write scalable Zope application" topic ?
More information about the Zope-Dev