[ZODB-Dev] Pre-announce: Oscar 0.1

Greg Ward gward@mems-exchange.org
Mon, 20 Aug 2001 17:01:38 -0400


--6TrnltStXW4iwmi0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

Hi all --

several months ago, I cooked up a tool, Oscar, for rigorously
type-checking a Python object graph: you define an object schema
(currently through specially-formatted class docstrings), and Oscar
crawls a persistent object graph to ensure that every scrap of data in
it conforms to your schema.  We use this regularly in the MEMS Exchange
for integrity-checking our ZODB database; it's not the be-all-end-all to
checking that all is well with an object database, but it's a hell of a
lot better than nothing.

In the past few weeks, I finally got around to writing the scripts and
documentation necessary to release Oscar publicly.  Now I'm ready to do
so, pending approval by the CNRI brass (sigh).  There's nothing
available for download just yet, so no chest-thumping post to
python-announce.  But there is documentation describing the Oscar type
language, which I think is a fine way to describe Python data types.  So,
on the assumption that types-sig and zodb-dev readers are more likely
than most to want to rush out and try Oscar as soon as it's available,
I'm posting all that documentation right here.  I welcome feedback as to
whether this is a crazy idea or not, whether the type syntax is bogus or
excellent, whether the type-system is "good enough" or needs to be
all-encompassing, etc.

Attached you'll find:

  type-system.txt
    a description of Oscar's type system and the syntax for
    defining Oscar types
  schema.txt
    a description of what an object schema consists of and
    how you define one
  checking.txt
    how to use Oscar to type-check an existing persistent
    object graph

Enjoy!  Hopefully the real release will happen this week or next.

        Greg
-- 
Greg Ward - software developer                gward@mems-exchange.org
MEMS Exchange                            http://www.mems-exchange.org

--6TrnltStXW4iwmi0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="type-system.txt"

Oscar's type system
-------------------

Oscar's type system is a large, useful subset of Python's type system.
The major advantages of Oscar's type system are that it is explicit and
enforced.  Since Python types are implicit (determined at run-time) and
mostly unenforced, Oscar sits quite neatly on top of Python, bringing
order and structure to a potentially chaotic situation.

Oscar understands the following major classes of data types:

  * atomic types: anything with a distinct Python type object can
    be an atomic type in Oscar, but they're intended for types with
    a single, atomic value.  The built-in types int, string, and float
    are obvious candidates (and in fact these are present as atomic
    types by default in any Oscar schema, along with long and complex).
    You can add other built-in types (e.g. file, function) as atomic
    types, or any extension type.  For example, if you use the
    mx.DateTime module, you might add DateTime as an atomic type, so
    you can declare variables as being of type DateTime and have Oscar
    enforce that requirement.

    Examples:
      "string" denotes a string variable
      "int" denotes an integer variable
      "DateTime" denotes a DateTime variable; this only works if you
        have explicitly added an atomic type called "DateTime" to your
        schema

  * container types: Python's built-in list, dictionary, and tuple types.
    (Classes that act like lists, dictionaries, and tuples are
    "instance-container" types, and I haven't yet decided what to do
    about the type-class unification in Python 2.2.)

    Oscar enforces fairly stringent rules for container types:
      - lists must be homogeneous, i.e. all elements of the same
        type, and may be of any length
        Examples:
          "[string]" denotes a list of strings
          "[int|long]" denotes a list of either ints or longs
            (a union type; see below)
          "[any]" denotes a list of anything (ie., no enforcement)
            (see below for "any" types)

      - dictionaries must be separately homogeneous: all keys must
        be of the same type, and all values must be of the same type.
        (Incidentally, Oscar knows nothing about which types are
        hashable and allowed to be dictionary keys; that's enforced by
        Python at run-time.)  The key type and value type are specified
        separately.
        
        Examples:
          "{ string : int }" denotes a dictionary mapping strings to ints
          "{string : int|long}" denotes a dictionary mapping strings
            to either ints or longs
          "{long : [string]}" denotes a dictionary mapping longs to
            lists of strings

      - tuples are heterogeneous (mixed-type) but fixed in size, and each
        "slot" is fixed in type.

        Examples:
          "(int,)" denotes a tuple containing exactly one integer
          "(string, string)" denotes a pair of strings
          "([int|long], string, int)" denotes a triple:
            list of (int or long), string, int

        Tuple types have one exception to this rule: if a tuple type is
        "extended", then the rules change for its last slot: for
        example, the extended tuple type "(string, int*)" (note the "*")
        denotes a tuple with exactly one string followed by zero or more
        ints.  The following are all valid values of this type:
          ("foo", 3)
          ("foo", 3, 1)
          ("foo", 2, 5, 1, 6, 2, 1, 4, 5, 1, 15, 6, 2, 5)
          ("foo",)
        
        This is mainly used for tuples that act like lists, eg. if you
        want a list of strings to be usable as a dictionary key, you
        code it as a tuple of strings instead (lists aren't hashable).
        This practice is incompatible with Oscar's basic tuple
        definition, so extended tuples are provided as an escape.

    Note that "of the same type" refers to Oscar types, not Python
    types.  For example, if a variable is declared "[int|long]",
    each element is checked separately to make sure it is either
    an int or a long; [1, 2L, 3] is a valid value of the type
    "[int|long]".  (Again, union types are described below.)

  * instance types: used for class instances.  A class Foo defined in
    the module foo.bar has an associated instance type "foo.bar.Foo".
    Generally, it's not enough to say that a variable is of type
    "foo.bar.Foo"; you also want to specify the instance attributes of
    Foo (and their types!).  Each instance type has an associated class
    definition that stores this information.  This is where Oscar's real
    power shines through, because typically Python data is accessed via
    an instance of some class.  If your schema has a class definition
    for that "root class", and for the class of each object reachable
    from the root, Oscar will crawl your entire object graph, ensuring
    that every instance, every attribute of every instance, and every
    element of every container anywhere in that object graph is of the
    correct type.

    The essential ingredient of a class definition is its attribute
    list.  This is described in schema.txt, under "Defining an object
    schema: class docstrings".

    Examples:
      "FooBar" denotes an instance of class FooBar defined in
        the main program
      "thing.Thing" denotes an instance of class Thing defined
        in module thing

  * instance-container types: Python classes often implement the
    semantics of lists, tuples, or dictionaries.  You don't want to give
    up type-checking every attribute of instances of such classes, but
    you also want to make sure that they conform to the strict
    type-checking rules Oscar applies to containers.  Hence,
    instance-container types marry the two.

    Examples:
      "UserList.UserList [string]"
        denotes an instance of the UserList class, defined in the
        UserList module, that acts like a list of strings
      "MyDict { string : int|long }"
        denotes an instance of the MyDict class that acts like a
        dictionary mapping strings to either ints or longs

  * union types: any set of Oscar types may be combined to form a
    union type.  A candidate value is tested against each sub-type of
    the union type, and only rejected if all of the sub-types reject it.

    Examples:
      "int | long" denotes a value that may be either an int or a long
      "string | [string] | (string, string)"
        denotes a value that may be either a string, a list of strings,
        or a pair (tuple) of strings

  * wildcard type: used for variables that can be of any value.
    There is only one wildcard type, spelled "any".

  * boolean type: used for boolean (true/false) values.  Strictly
    speaking, any Python value can be interpreted in a boolean way:
    eg. 0, 0L, 0.0, "", and None are all false values, while 42,
    3.14159, and "foo!" are all true.  Oscar restricts this drastically:
    the only allowed values for boolean variables are 0, 1, and None.

  * alias types: used to define shorthand names for commonly-used 
    types.  The most common use of this is to alias the bare name of a
    class to its fully-qualified name -- e.g. if class Thing is defined
    in module project.util, then "Thing" might be an alias for
    "project.util.Thing".  ("project.util.Thing" is the instance type,
    and "Thing" is an alias type that expands to that instance type.)

    Aliases are also useful if you have a particular union type used
    frequently; instead of always spelling out "int | float | long", you
    can define "number" as an alias for this union type.  (This also
    makes it easy to change your definition of "number" if someday you
    have to extend it to handle, say, complex or rational numbers.)
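
As a rough illustration of the container, union, boolean, and
extended-tuple rules above, here is a minimal sketch in modern Python 3
(so there is no separate long type).  It is not Oscar's implementation:
the tuple-based type encoding, the ATOMIC table, and the conforms()
function are hypothetical names invented for this example, and instance
types and aliases are left out entirely.

```python
# Hypothetical encoding of a few Oscar types as plain Python data:
#   "int", "string", "any", "boolean"   -> the name itself (a str)
#   ("list", T)                         -> [T]
#   ("dict", K, V)                      -> {K : V}
#   ("tuple", [T1, T2, ...], extended)  -> (T1, T2, ...) or (T1, T2*)
#   ("union", [T1, T2, ...])            -> T1 | T2 | ...
ATOMIC = {"int": int, "string": str, "float": float, "complex": complex}


def conforms(value, t):
    """Return True if value matches the (hypothetical) type encoding t."""
    if t == "any":
        return True
    if t == "boolean":
        # The text allows only 0, 1, and None; this sketch takes that
        # literally, so Python 3's True and False are rejected too.
        return value is None or (type(value) is int and value in (0, 1))
    if isinstance(t, str):
        # Exact type comparison, so that a bool is not accepted as an int.
        return type(value) is ATOMIC[t]
    kind = t[0]
    if kind == "list":
        return type(value) is list and all(conforms(v, t[1]) for v in value)
    if kind == "dict":
        return type(value) is dict and all(
            conforms(k, t[1]) and conforms(v, t[2])
            for k, v in value.items())
    if kind == "union":
        # Rejected only if every sub-type rejects the value.
        return any(conforms(value, sub) for sub in t[1])
    if kind == "tuple":
        slots, extended = t[1], t[2]
        if type(value) is not tuple:
            return False
        if extended:
            # All slots but the last are fixed; the last one may occur
            # zero or more times.
            fixed, last = slots[:-1], slots[-1]
            if len(value) < len(fixed):
                return False
            return (all(conforms(v, s) for v, s in zip(value, fixed))
                    and all(conforms(v, last) for v in value[len(fixed):]))
        return len(value) == len(slots) and all(
            conforms(v, s) for v, s in zip(value, slots))
    raise ValueError("unknown type encoding: %r" % (t,))
```

With this encoding, the extended tuple type "(string, int*)" is written
("tuple", ["string", "int"], True), and conforms() accepts both ("foo",)
and ("foo", 3, 1) for it.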


Type grammar
------------

[taken from the type_parser.py module]

type : NAME                     # atomic, alias, instance, boolean, any
     | container_type           # list, tuple, dictionary
     | NAME container_type      # instance-container type
     | union_type

container_type : list_type
               | tuple_type
               | dictionary_type
list_type      : "[" type "]"
tuple_type     : "(" (type ",")* type "*"? ","? ")"
dictionary_type: "{" type ":" type "}"

union_type : type ("|" type)+

Tokens:
  NAME : [a-zA-Z_][a-zA-Z0-9_]*(\.[a-zA-Z_][a-zA-Z0-9_]*)*
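
The NAME token can be tried out directly with Python's re module.  This
is only a sketch: the pattern is copied from the grammar above, while
the is_name() helper is a hypothetical name and not something taken from
type_parser.py.

```python
import re

# Pattern copied from the NAME token: identifiers joined by single dots.
NAME = re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*(\.[a-zA-Z_][a-zA-Z0-9_]*)*")


def is_name(text):
    """Return True if the whole string is exactly one NAME token."""
    return NAME.fullmatch(text) is not None
```

So "int" and "foo.bar.Foo" are both single NAME tokens, while "foo..bar"
and "3d" are not.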


$Id: type-system.txt,v 1.1 2001/08/20 18:10:09 gward Exp $

--6TrnltStXW4iwmi0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: attachment; filename="schema.txt"
Content-Transfer-Encoding: 8bit

Object schemata
---------------

An object schema consists of the following components:

  * a set of atomic types, usually a subset of Python's builtin
    types.  The default atomic types are string, int, long, float, and
    complex.  In principle, you can add other builtin types (like
    function, class, or file) or extension types to a schema, but
    Oscar currently has problems with many builtin types.  (In
    particular, only types whose values can be pickled may be atomic
    types in Oscar.)

  * a type alias mapping, letting you define shorthand names for common
    types.

  * a set of class definitions.  A class definition maps instance
    attribute names to attribute types.  This serves two purposes: it
    defines the expected set of attributes for instances of a class, and
    it defines the type of each attribute.

In the current version of Oscar, an object schema is defined through a
project description file and the class docstrings in a set of source
files.  This is useful in practice, but it's kind of hard to talk about
object schemata without a simple, compact schema description language.
Thus, consider the following pseudo-schema:

  class Thing:
    name : string

  class Animal (Thing):
    num_legs : int
    furry : boolean

(Coincidentally, this is the syntax emitted by gen_schema's "-t" option.
However, this is currently a write-only language; Oscar has no way to
parse schemata created by "gen_schema -t".)

This defines an object schema with no additional atomic types (just the
default five: string, int, long, float, and complex), no aliases, and
two classes (both, presumably, in the __main__ module, since the class
names are unqualified).

If you ask Oscar to type-check an instance of Thing under this schema,
or if it comes across a Thing instance in the course of type-checking a
larger object graph, it does the following:
  * ensure that the instance has exactly one attribute, 'name'
  * ensure that the value of this attribute is a string

Similarly, Oscar type-checks an Animal instance under this schema as
follows:
  * ensure that it has exactly three attributes, 'name', 'num_legs',
    and 'furry' (note that 'name' is inherited from Thing)
  * ensure that the value of 'name' is a string, 'num_legs' an int,
    and 'furry' a boolean (i.e. 0, 1, or None)
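
The two checks above can be sketched in modern Python 3 roughly as
follows.  This is not Oscar's code: SCHEMA, CHECKS, expected_attrs(),
and check_instance() are hypothetical names, only three simple types are
handled, and the boolean rule (0, 1, or None) is taken literally.

```python
# Hypothetical in-memory form of the pseudo-schema: for each class name,
# its base class names and its own attribute -> type-name mapping.
SCHEMA = {
    "Thing": {"bases": [], "attrs": {"name": "string"}},
    "Animal": {"bases": ["Thing"],
               "attrs": {"num_legs": "int", "furry": "boolean"}},
}

CHECKS = {
    "string": lambda v: type(v) is str,
    "int": lambda v: type(v) is int,
    "boolean": lambda v: v is None or (type(v) is int and v in (0, 1)),
}


def expected_attrs(class_name):
    """Collect the attributes of a class, including inherited ones."""
    attrs = {}
    for base in SCHEMA[class_name]["bases"]:
        attrs.update(expected_attrs(base))
    attrs.update(SCHEMA[class_name]["attrs"])
    return attrs


def check_instance(obj):
    """Return a list of error strings; an empty list means obj conforms."""
    expected = expected_attrs(type(obj).__name__)
    actual = vars(obj)
    errors = []
    # First the attribute set must match exactly ...
    for name in sorted(set(expected) - set(actual)):
        errors.append("missing attribute %r" % name)
    for name in sorted(set(actual) - set(expected)):
        errors.append("unexpected attribute %r" % name)
    # ... then each attribute present must have the declared type.
    for name in sorted(set(expected) & set(actual)):
        if not CHECKS[expected[name]](actual[name]):
            errors.append("%s: expected %s, got %s (%r)" % (
                name, expected[name], type(actual[name]).__name__,
                actual[name]))
    return errors
```

Looking classes up by their bare name only works here because the
pseudo-schema uses unqualified names; a real schema would key on the
fully-qualified class name.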


Defining an object schema: class docstrings
-------------------------------------------

Currently, you define an object schema by writing specially-formatted
class docstrings.  (There is no separate schema description
language... yet.)  For example, the Thing class in the above
pseudo-schema might be documented as:

  class Thing:
      """A single thing, which may be an animal, vegetable, or mineral.
      The only property common to all things is a name.

      Instance attributes:
        name : string
          the name of the thing
      """

Oscar (specifically, the gen_schema script that parses these docstrings)
ignores everything in the docstring up to the "Instance attributes:"
line.  After that, things get fairly rigid:
    
  * the "Instance attributes:" line must be indented to the same depth
    as the main body of the docstring
    
  * each attribute name is indented two spaces relative to that,
    and followed by a colon (":") and the attribute's type
    
  * attribute descriptions (which are optional, and are ignored by
    Oscar) are indented a further two spaces
    
  * when indentation returns to the same level as the "Instance
    attributes:" line, Oscar stops processing the docstring and
    goes on to the next class in the module (thus, blank lines
    are allowed in the attribute list)

Here is a slightly more elaborate example:

  class Animal (Thing):
      """An animal, ie. a thing with multiple legs and possibly fur.

      Instance attributes:
        num_legs : int
          the number of legs this animal has
        furry : boolean
          whether this animal is furry or not

      Outsiders should use 'get_num_legs()' and 'is_furry()' to access
      these attributes.
      """

Here is a stripped-down version of this docstring that is exactly
equivalent as far as Oscar is concerned:

  class Animal (Thing):
      """
      Instance attributes:
        num_legs : int
        furry : boolean
      """

Sometimes a class will have no instance attributes of its own; Oscar has
special syntax for this:

  class Mammal (Animal):
    """Instance attributes: none"""

This is different from simply omitting the list of instance attributes,
or omitting the docstring entirely.  If Oscar sees a Mammal instance
with any attributes apart from those inherited from Animal, it will
complain.  However, if Mammal has no docstring or attribute list, Oscar
can't do detailed type-checking of instances of that class.  Instead, it
  * complains that the class has no docstring (or no attribute list)
  * excludes the class from the schema
  * when type-checking an object graph, complains about any instances of
    that class it discovers
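
The docstring rules in this section can be sketched as a small parser.
This is a guess at the behaviour described, written in modern Python 3,
and not the code in gen_schema: parse_attributes() is a hypothetical
name, and using inspect.cleandoc() to normalize the docstring's
indentation is this sketch's own shortcut.

```python
import inspect

HEADER = "Instance attributes:"


def parse_attributes(docstring):
    """Return {attr_name: type_string}, or None if there is no list."""
    if not docstring:
        return None
    attrs = None
    # cleandoc() strips the common indentation, so the header line ends
    # up at column 0 and attribute names at column 2.
    for line in inspect.cleandoc(docstring).splitlines():
        stripped = line.strip()
        indent = len(line) - len(line.lstrip())
        if attrs is None:
            # Everything before the header line is ignored.
            if indent == 0 and stripped.startswith(HEADER):
                attrs = {}
                if stripped[len(HEADER):].strip() == "none":
                    return attrs    # explicitly no attributes of its own
            continue
        if not stripped:
            continue                # blank lines are allowed in the list
        if indent == 0:
            break                   # back at header level: list is over
        if indent == 2:
            # Split on the first colon only, since the type itself may
            # contain colons, as in "{ string : int }".
            name, _, type_string = stripped.partition(":")
            attrs[name.strip()] = type_string.strip()
        # Deeper indentation is a description, which is ignored.
    return attrs
```

Returning None (no attribute list at all) rather than an empty
dictionary ("Instance attributes: none") keeps the two cases apart, which
matters because they are handled differently.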


Defining an object schema: the project description file
-------------------------------------------------------

Writing class docstrings that document every instance attribute is the
key part of defining an object schema.  However, you still have to tell
Oscar how to find those class docstrings and what to do with them.  This
is done with the gen_schema script and its project description file.


[Searching by directory]

At its simplest, the project description file contains a list of
directories to search for Python source files, and possibly a prefix to
use in turning source filenames into module names.  For example, the
project description file for Oscar itself (oscar.cfg in the top-level
Oscar directory) starts out with this:

  dirs = ["."]
  prefix = "oscar"

(The project description file is just Python code; it's execfile'd by
gen_schema.)  This instructs gen_schema to search for *.py in the
current directory, and to assume that all modules found actually live in
the "oscar" package.  Hence when it finds schema.py, it considers that
module to be "oscar.schema", and a class ObjectSchema in that file will
be called "oscar.schema.ObjectSchema".

gen_schema does *not* search recursively; if you want it to descend into
sub-directories, you must specify them explicitly:
  dirs = ["compiler", "compiler/parser", "compiler/optimizer"]

The directories in 'dirs' are interpreted relative to a base directory
supplied with the "-d" (or --base-dir) option to gen_schema.  If you run
gen_schema from Oscar's top directory (ie., where schema.py lives),
everything is fine -- the current directory is the right place to look
for Oscar's source files.  In that case,
  ./scripts/gen_schema -p oscar.cfg

is the right incantation.  (The resulting schema will be written (as a
pickle) to schema.pkl.)

If you're in /home/greg and Oscar is in /tmp/oscar, though, the above
incantation is wrong: Oscar will consider any *.py files in /home/greg to
be part of the "oscar" package, and will scan them for docstrings to
generate a schema.  This probably won't work; you need to specify the
base directory that 'dirs' is interpreted relative to:
  /tmp/oscar/scripts/gen_schema -p /tmp/oscar/oscar.cfg -d /tmp/oscar

(Obviously, it's easier just to run gen_schema from the right place!)
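
To make the roles of 'dirs', 'prefix', and the base directory concrete,
here is a minimal sketch of a non-recursive search, in modern Python 3.
The find_modules() function is a hypothetical name, not gen_schema's
code, and the way a sub-directory such as "compiler/parser" is turned
into part of the module name is purely an assumption of this sketch.

```python
import glob
import os


def find_modules(dirs, prefix="", base_dir="."):
    """Map module names to source files, without recursing."""
    modules = {}
    for d in dirs:
        # Each entry of 'dirs' is interpreted relative to base_dir.
        pattern = os.path.join(base_dir, d, "*.py")
        for path in sorted(glob.glob(pattern)):
            parts = []
            if prefix:
                parts.append(prefix)
            # Assumption: "compiler/parser" contributes the package
            # parts "compiler" and "parser"; "." contributes nothing.
            rel = os.path.normpath(d)
            if rel != os.curdir:
                parts.extend(rel.split(os.sep))
            parts.append(os.path.splitext(os.path.basename(path))[0])
            modules[".".join(parts)] = path
    return modules
```

Called as find_modules(["."], prefix="oscar", base_dir="/tmp/oscar"),
this would report /tmp/oscar/schema.py as the module "oscar.schema", no
matter what the current directory is.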


[Specifying individual modules]

If you don't want to search every "*.py" file in a list of directories,
you can supply a list of explicit module names, eg.:
  extra_modules = ["oscar.schema",
                   "oscar.valuetype"]

Note that extra_modules is a list of fully-qualified module names, *not*
filenames.

This variable is called 'extra_modules' because these modules are added
to the list of modules found by searching the directories named in
'dirs'.  If 'dirs' isn't supplied, the modules in 'extra_modules' are
Oscar's only source for class definitions.


[Excluding individual modules]

You can refine gen_schema's search for classes by excluding certain
modules.  As an example, Oscar includes a copy of SPARK (John Aycock's
nifty parser framework) as the "oscar.spark" module; since this is
really someone else's code, it doesn't have Oscar-style docstrings to
parse.  Also, the parser classes are transient and shouldn't wind up in
any persistent store of an Oscar object graph, so there's not much point
in type-checking them.  Thus, I exclude both oscar.spark and
oscar.type_parser (which provides classes derived from the SPARK
classes) from gen_schema's scan:
  exclude_modules = ["oscar.spark", "oscar.type_parser"]

Like extra_modules, exclude_modules is a list of fully-qualified module
names.


[Excluding individual classes]

You can also exclude specific classes from the search, instead of whole
modules.  This is useful if a particular module provides some transient
classes and other first-class persistent classes.  For example, I might
wish to exclude the TypecheckContext class, defined in oscar.context,
from schema generation:
  exclude_classes = ["oscar.context.TypecheckContext"]

Again, classes are specified as fully-qualified Python names.
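
The combined effect of the two exclusion lists can be sketched as a
simple filter over fully-qualified class names.  The filter_classes()
function is a hypothetical name used only for illustration; how
gen_schema applies the lists internally is not shown in this document.

```python
def filter_classes(class_names, exclude_modules=(), exclude_classes=()):
    """Drop classes excluded by name or by their defining module."""
    kept = []
    for full_name in class_names:
        # "oscar.context.TypecheckContext" lives in "oscar.context".
        module_name = full_name.rpartition(".")[0]
        if module_name in exclude_modules or full_name in exclude_classes:
            continue
        kept.append(full_name)
    return kept
```

Both tests are exact string comparisons, which is why the lists must
hold fully-qualified names rather than bare class or file names.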


[Adding atomic types]

If the five default atomic types aren't enough for your project, you'll
have to add new ones.  This might happen if you use extension types in
your application, or if you store slightly odd objects in your
persistent object graph, like functions or class objects.  New atomic
types are specified using an example value, not using the type object
itself.  (This is necessary because type objects can't be pickled, and
gen_schema pickles the schema for future use.  We can't store type
objects in the pickled schema, so we store sample values instead.)

For instance, to add Marc-André Lemburg's DateTime type to your schema,
add this to your project description file:
  import mx.DateTime
  atomic_types = [mx.DateTime.now()]

The structure of 'atomic_types' is a tad complex.  Most often, each
element of the list is simply a value of the atomic type you want to add
to your schema -- eg. here I created a sample DateTime object.  Since
these sample values go straight into the object schema, which is
subsequently pickled by gen_schema, these must be pickle-able values.
Oscar probably needs to grow a real schema definition language before
you can have, say, Python function or file objects as atomic types in an
object schema.  (In other words, I think this is an implementation
problem due to reliance on pickling rather than a fundamental problem.)

In this simple case, the name of the atomic type is implicit, because
the type itself supplies its name -- "DateTime" in the above example.
(Try "type(mx.DateTime.now()).__name__".)

In some cases, though, you may want to specify your own name for an
atomic type.  In that case, just supply a tuple (sample_value,
type_name) in atomic_types.  This is useful if you're dealing with
ExtensionClass, where every class is a new type.  (This is also the case
with classes derived from types in Python 2.2.)  For instance, a ZODB
application that needs "class" and "instance" types (for class objects
and generic instance objects) might do this:

   import ZODB
   from Persistence import Persistent
   # ...
   atomic_types = [(Persistent(), "instance"),
                   (Persistent, "class")]

If you don't understand why you might need this, you probably don't need
it.
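
The two forms of 'atomic_types' entry (a bare sample value, or a
(sample_value, type_name) tuple) can be illustrated with a short sketch
in modern Python 3.  The atomic_type_table() function is a hypothetical
name, and the standard library's datetime stands in for mx.DateTime so
that the example has no external dependencies.

```python
import datetime


def atomic_type_table(atomic_types):
    """Map type names to type objects, given a list of sample values."""
    table = {}
    for entry in atomic_types:
        if isinstance(entry, tuple):
            # Explicit form: (sample_value, type_name).
            sample, name = entry
        else:
            # Implicit form: the type supplies its own name.
            sample, name = entry, type(entry).__name__
        table[name] = type(sample)
    return table


# One implicit entry and one explicitly named entry.
table = atomic_type_table([datetime.datetime(2001, 8, 20),
                           (3.5, "real")])
```

Here table maps "datetime" to datetime.datetime and "real" to float.
Note that this naive test for a tuple means a sample value that is
itself a tuple could not be given in the bare form.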


Putting it all together
-----------------------

For a simple example of defining an object schema, take a look in the
"examples" sub-directory of Oscar's source distribution.  There, you'll
find:
  * the thing.py and animal.py modules, which provide the classes
    ThingCollection, Thing, Animal, and Mammal
  * the make_things script, which creates some things, bundles
    them in a collection, and pickles them to things.pkl
  * the things.proj project description file, which tells
    gen_schema how to generate a schema for this project

For now, we're just going to generate a schema from the Python source
files and things.proj.  Later (in "checking.txt", the document that
covers type-checking an object graph) we'll run make_things and
type-check the results.

If you haven't installed Oscar yet, you should either do so now or
perpetrate your favourite kludge for ensuring that it's available
through sys.path.  (If you don't have a favourite kludge, just install
it.)  Run
  python -c 'import oscar'
to make sure it worked -- if this command completes silently, all is
well.

Installing Oscar should also install the gen_schema and check_data
scripts.  I'll assume they're in your shell's PATH; you might have to
adjust your PATH or the commands here accordingly.

Before we run gen_schema, let's take a look at the ingredients of this
project.  First, the project description file, things.proj, is quite
simple:

  extra_modules = [("thing", "thing.py"), ("animal", "animal.py")]

There's no 'dirs' here, meaning gen_schema won't go searching for "*.py"
anywhere.  It just looks for the 'thing' module in thing.py, and the
'animal' module in animal.py.  Since explicit source filenames are
supplied, the 'thing' and 'animal' modules don't have to be in Python's
path -- gen_schema simply parses the source files.

Next, take a look at thing.py.  You'll see that it defines two classes,
Thing and ThingCollection, and that the instance attributes of each are
fully documented.  Similarly, animal.py provides the Animal and Mammal
classes.

Finally, let's run gen_schema.  We'll save the schema for this project
to things_schema.pkl and things_schema.osc -- the two files have the same
content, but only the latter is human-readable.  From the "examples"
directory:

  gen_schema -p things.proj -o things_schema.pkl -t things_schema.osc

If you're really curious about what's going on here, add the "-v"
option.  The output of gen_schema (without "-v") should look like this:

  looking for classes...
  found 4 classes
  parsing class docstrings...
  writing object schema to things_schema.osc...
  pickling object schema to things_schema.pkl...

Take a look at things_schema.osc for a human-readable representation of
the schema also saved in things_schema.pkl.

Now that we have an object schema for this project, we can use it later
to type-check a persistent object graph created by applications that use
this project, such as make_things.  This will be done in the next
document, "checking.txt".


$Id: schema.txt,v 1.5 2001/08/20 18:32:31 gward Exp $

--6TrnltStXW4iwmi0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: attachment; filename="checking.txt"

Type-checking an object graph
-----------------------------

Once you've gone to the trouble of defining your object schema by
documenting your instance attributes, creating a project description
file, and running gen_schema, it's quite simple to ensure that a
collection of objects conforms to your schema.  I'm going to assume
you've followed the steps at the end of the "schema.txt" document, and
created a schema for the "things" project in things_schema.pkl.  Once
you have a schema, you can type-check a saved object graph.  First, we
have to create an object graph using the make_things application.

This requires a bit more setup than running gen_schema, since now we
have to be able to import the 'thing' and 'animal' modules.  The easiest
way to do that is to switch into the "examples" subdirectory and force
Python to search it for modules:

  cd examples
  export PYTHONPATH=.

(That last line is for Bourne-like shells under Unix.  For csh-like
shells, say "setenv PYTHONPATH .".  For other operating systems, you're
on your own.)

Now run make_things:

  python make_things

You should now see the file "things.pkl" in the current directory.

Note that make_things has a (deliberate) type error in it, which means
the object graph captured in things.pkl is inconsistent with the object
schema in things_schema.pkl.  We're going to use the check_data script
to find the type error in the data.

(See if you can spot the error by reading the make_things script.  It
would be easy to modify the underlying class -- Animal, in this case --
to catch this particular error.  However, there's no limit to the number
of type errors you can make in Python code, and hardening your
underlying classes to catch every single one of them is a lot of work.
Thus, Oscar exists to catch such errors after-the-fact.  That is, Oscar
doesn't tell you immediately when a type error is made -- it only
detects it in data that has been saved to a persistent store.  I suspect
the Oscar machinery could be used for run-time type-checking, but that
would probably impose a pretty severe performance penalty, so I haven't
experimented in that direction... yet.)

Obviously, in order to load your pickled object graph, Oscar needs to be
able to import the 'thing' and 'animal' modules.  Since you already ran
make_things, that precondition is satisfied, so we can go ahead and run
check_data:

  check_data -f pickle things_schema.pkl things.pkl

The "-f" option tells check_data what format your object graph is stored
in.  (Currently, the only other option is "zodb", which is further
modified by the "-s" (storage) option.)  The first filename,
things_schema.pkl, is the file containing the object schema for this
project.  This is *always* a pickle, regardless of "-f".  The second
argument is the location of the data to be checked -- in this case, the
object graph created by make_things.

The one type error in the make_things script corresponds to one type
error in things.pkl, reported by check_data as:

  root.things['Tyrannosaurus rex'].num_legs:
    expected int, got string ('2 big, 2 small')

If you were unable to track down the error by reading the code
earlier, this error message should help a lot.  ;-)
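
The path at the start of that error message suggests how a crawler can
keep track of where it is in the object graph.  Here is a minimal sketch
in modern Python 3; walk() is a hypothetical name and this is not how
check_data is necessarily written.  It does no type-checking and, unlike
a real crawler, makes no attempt to detect cycles.

```python
def walk(value, path="root"):
    """Yield (path, value) for value and everything reachable from it."""
    yield path, value
    if isinstance(value, dict):
        for key, item in value.items():
            yield from walk(item, "%s[%r]" % (path, key))
    elif isinstance(value, (list, tuple)):
        for index, item in enumerate(value):
            yield from walk(item, "%s[%d]" % (path, index))
    elif hasattr(value, "__dict__") and not isinstance(value, type):
        # A class instance: descend into its instance attributes.
        for name, item in vars(value).items():
            yield from walk(item, "%s.%s" % (path, name))
```

Pairing each visited value with its path is what lets an error be
reported at a precise location such as
root.things['Tyrannosaurus rex'].num_legs.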

--6TrnltStXW4iwmi0--