[Zope-CMF] Dublin Core Subject Qualifier Implementation

sean.upton@uniontrib.com sean.upton@uniontrib.com
Wed, 20 Feb 2002 08:59:43 -0800


Seb,
Thanks for the ideas.

My concern, I guess, about multi-lingual vocabularies is that I would like
this to work without relying upon some presentation-specific thing like
looking at REQUEST, so the only two things I could think of were:

- the need for a tool in CMF that finds the desired language for the calling
object/component/tool/user
- optionally passing the language parameter explicitly

I'm not sure if there is another way to do this, but I'm open to ideas.
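
To make the two options concrete, here is a rough sketch of how they might combine. Everything in it is hypothetical: LanguageTool, getLanguageFor() and resolve_language() are names made up for illustration, not anything that exists in the CMF today.

```python
# Hypothetical sketch: none of these names exist in the CMF.
DEFAULT_LANGUAGE = 'en-US'


class LanguageTool:
    """Stand-in for a tool that knows each caller's preferred language."""

    def __init__(self, preferences=None):
        # Maps a caller identifier (user id, tool id, ...) to a language code.
        self._preferences = dict(preferences or {})

    def getLanguageFor(self, caller):
        """Return the caller's preferred language, or None if unknown."""
        return self._preferences.get(caller)


def resolve_language(language=None, caller=None, tool=None):
    """Pick a language without looking at REQUEST.

    An explicitly passed language wins; otherwise the tool is asked about
    the caller; otherwise the default is used.
    """
    if language is not None:
        return language
    if tool is not None and caller is not None:
        preferred = tool.getLanguageFor(caller)
        if preferred is not None:
            return preferred
    return DEFAULT_LANGUAGE
```

With something like this, the lookup methods could accept language=None and resolve it internally, so callers only pass a language when they want to override the tool's answer.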

I think I will borrow some of your ideas to refine an interface spec for
this, primarily to support hierarchy-like vocabularies.  One other thing I
was thinking about is having another method that returned a list of subjects
by walking from a leaf on a tree of subjects up its parent nodes to the root
for the given topic.
	getGeneralSubjects(id, vocabulary=None):
	    """gets all general subjects as list for a specific (detail) subject code"""

Here is an example of this pasted from IPTC's subject code spec:

Subject = 04000000     Name = Economy, Business & Finance     Explanation = All matters concerning the planning, production and exchange of wealth.
	SubjectMatter = 04003000     Name = Computing & Information Technology
		SubjectDetail = 04003005     Name = Software

So if I called getGeneralSubjects('04003005', vocabulary='IPTC'), I would
get back:
	[('04003000', 'Computing & Information Technology'),
	 ('04000000', 'Economy, Business & Finance')]
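
A minimal sketch of that walk, assuming the vocabulary is held as a plain mapping of code to (name, parent code). The VOCABULARIES registry and the choice of IPTC as the fallback when vocabulary is None are assumptions for illustration only; the three entries are just the ones from the IPTC example above.

```python
# Illustrative data only: the three IPTC entries from the example above.
# Each code maps to (name, parent code); root subjects have parent None.
IPTC = {
    '04000000': ('Economy, Business & Finance', None),
    '04003000': ('Computing & Information Technology', '04000000'),
    '04003005': ('Software', '04003000'),
}

# Hypothetical registry of vocabularies, keyed by vocabulary name.
VOCABULARIES = {'IPTC': IPTC}


def getGeneralSubjects(id, vocabulary=None):
    """Return (code, name) tuples for every ancestor of id, nearest first."""
    # Assumption: fall back to IPTC when no vocabulary is named.
    table = VOCABULARIES[vocabulary or 'IPTC']
    result = []
    seen = {id}
    parent = table[id][1]
    while parent is not None:
        if parent in seen:
            # Guard against a malformed vocabulary that loops back on itself.
            raise ValueError('cycle in vocabulary at %r' % parent)
        seen.add(parent)
        result.append((parent, table[parent][0]))
        parent = table[parent][1]
    return result
```

Calling it on a root subject such as '04000000' returns an empty list, since there is nothing more general to report.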


The vocabulary or tool implementation could walk this hierarchy in several
ways, but I think this is beyond the scope of my spec (though use of
ParsedXML with some caching comes to mind), as long as hierarchical as well
as purely tabular vocabularies are supported... I'll start thinking about
that when I start coding an implementation.

One other thing that I realize this example brings up: in IPTC, general
subjects have a long full-text 'Explanation' field that could be accessed;
also, sometimes you might want to shorten a name.  For example, I may want
to have a queryable short name for 'Economy, Business & Finance' that is
just 'Business'.

One implementation detail that could affect users of a tool like this,
though, is importing a 3rd party vocabulary.  If the internal representation
was in XML and exposed via DOM, external data in XML, text, or
relational/tabular lookup could be imported by a programmer simply knowing
the tool's DTD and writing DOM methods to populate the structure.  So
perhaps the interface should include a getVocabularyDOM(vocabname) that
returns a reference to the ParsedXML DOM for a vocabulary and perhaps a
createVocabulary(name) that creates the namespace for the vocabulary along
with an empty DOM object for it.  It might also make sense to create an
addToVocabulary(vocabulary, code, description=None, text=None,
parentcode=None), as well as deleteFromVocabulary(vocabulary, code).  This
would be similar to your setSubject() method below, except it would allow
you to specify a parent subject in a hierarchy.  Certain vocabularies should
be locked read-only: they would not support these methods, and could only
be edited as raw XML by an administrator.
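
Here is a rough sketch of how those methods might fit together. It uses the standard library's xml.dom.minidom as a stand-in for ParsedXML, and the element and attribute names ('vocabulary', 'subject', 'code', 'description') are placeholders, not a proposed DTD. VocabularyRegistry, lockVocabulary() and ReadOnlyVocabularyError are likewise made up for illustration.

```python
from xml.dom import minidom


class ReadOnlyVocabularyError(Exception):
    """Raised when a locked vocabulary is modified (hypothetical)."""


class VocabularyRegistry:
    """Hypothetical holder of one DOM document per vocabulary."""

    def __init__(self):
        self._docs = {}
        self._locked = set()

    def createVocabulary(self, name):
        """Create the namespace for a vocabulary with an empty DOM."""
        impl = minidom.getDOMImplementation()
        doc = impl.createDocument(None, 'vocabulary', None)
        doc.documentElement.setAttribute('name', name)
        self._docs[name] = doc
        return doc

    def getVocabularyDOM(self, vocabname):
        return self._docs[vocabname]

    def lockVocabulary(self, name):
        """Mark a vocabulary read-only for the add/delete methods."""
        self._locked.add(name)

    def _check_mutable(self, vocabulary):
        if vocabulary in self._locked:
            raise ReadOnlyVocabularyError(vocabulary)

    def _find(self, vocabulary, code):
        doc = self._docs[vocabulary]
        for node in doc.getElementsByTagName('subject'):
            if node.getAttribute('code') == code:
                return node
        return None

    def addToVocabulary(self, vocabulary, code, description=None,
                        text=None, parentcode=None):
        self._check_mutable(vocabulary)
        doc = self._docs[vocabulary]
        if parentcode is None:
            parent = doc.documentElement
        else:
            parent = self._find(vocabulary, parentcode)
            if parent is None:
                raise KeyError(parentcode)
        node = doc.createElement('subject')
        node.setAttribute('code', code)
        if description is not None:
            node.setAttribute('description', description)
        if text is not None:
            node.appendChild(doc.createTextNode(text))
        parent.appendChild(node)
        return node

    def deleteFromVocabulary(self, vocabulary, code):
        """Remove a subject, together with any subjects nested under it."""
        self._check_mutable(vocabulary)
        node = self._find(vocabulary, code)
        if node is None:
            raise KeyError(code)
        node.parentNode.removeChild(node)
```

An importer for a third-party vocabulary would then only need createVocabulary() plus repeated addToVocabulary() calls, or direct DOM calls on the object returned by getVocabularyDOM().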

If I do this in XML, it seems like the logical next step is to define a
mockup or schema of the XML structure, since this would be more than just an
implementation detail; it would also be an interface specification of sorts
for an import-vocabulary use-case.

Thoughts?

Thanks for the ideas,
Sean

-----Original Message-----
From: seb bacon [mailto:seb@jamkit.com]
Sent: Wednesday, February 20, 2002 2:33 AM
To: sean.upton@uniontrib.com
Cc: zope-cmf@zope.org
Subject: Re: [Zope-CMF] Dublin Core Subject Qualifier Implementation


Sean, 

That's a really interesting idea.  It would be a great thing to
integrate with the CMF.  

Here's some of my thoughts, since you asked ;-)

The namespace qualifier seems like a good idea.  

The language aspect should be dealt with by i18n structures rather than
on the application level, e.g. the system locale (I've never looked at
ZBabel etc so I'm not up on the accepted way of doing this).

The UI problem of selecting a subject from 1000s has been discussed on
the list before - have a search around for ideas.  My feeling is that
the best way of doing this is to arrange the subjects hierarchically. 
For example, there are 17 categories in the IPTC subjects.  The UI
should allow you to select an entire category as well as its
subdivisions.

The internal representation should be an XML-like tree, which you could
manipulate in a similar way to XML (like a SAX parser, for example). 
The tool could have an 'import' function, so people can load in
specialist vocabularies - possibly from an XML format?

The job of mapping between id and name shouldn't be tricky - you should
only ever specify an id to the tool, and it could always return (id,
name) tuples.  I noticed that a lot of subjects in the specs you mention
have descriptions too - you could make it a (id, name, description) tuple,
or something similar, to expose this.

Regarding vocabulary, you could optionally supply a vocabulary id to
each method, or you could rely on a default vocabulary which can be set
by the user.

I'd be tempted to miss out the icon thing, although it's a nice idea. 
It's only any use if the application requires it, and someone has the
time to generate 1000s of icons - wouldn't this be a minority of cases? 
Anyway, here's my take on the interface:

 getSubject(subject_id, vocabulary=None):
   "return (id, name) tuple"

 searchSubjects(search_term, vocabulary=None):
   """do a text search of subject names,
      return list of (id, name) tuples"""

 getChildSubjects(subject_id, vocabulary=None):
   "return list of children of subject_id"

 getParentSubject(subject_id, vocabulary=None):
   "return parent of subject_id"

 getSiblingSubjects(subject_id, vocabulary=None):
   "return siblings of subject_id"

 getRootSubjects(subject_id, vocabulary=None):
   "return list of root subjects"

 setDefaultVocabulary(vocabulary):
   "set a default vocabulary, return None if it doesn't exist"
 
 setSubject(subject_id, subject_name, vocabulary):
   "add a new subject to vocabulary"

 getVocab(subject_id):
   """return a (id, name, description) tuple for the vocabulary of the
      specified subject"""
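
As a rough illustration of the navigation methods above, here is a sketch over a flat mapping of id to (name, parent id). The SubjectTree class and its placeholder subjects ('A', 'A1', ...) are invented for the example, the vocabulary argument is left out for brevity, and getRootSubjects() takes no subject_id here since the roots do not depend on one.

```python
class SubjectTree:
    """Hypothetical single-vocabulary tree of subjects."""

    def __init__(self, subjects):
        # subjects maps id -> (name, parent id); roots have parent None.
        self._subjects = dict(subjects)

    def getSubject(self, subject_id):
        return (subject_id, self._subjects[subject_id][0])

    def getParentSubject(self, subject_id):
        parent = self._subjects[subject_id][1]
        return None if parent is None else self.getSubject(parent)

    def getChildSubjects(self, subject_id):
        # Sorted by id so results come back in a stable order.
        return [self.getSubject(code)
                for code, (name, parent) in sorted(self._subjects.items())
                if parent == subject_id]

    def getSiblingSubjects(self, subject_id):
        parent = self._subjects[subject_id][1]
        return [s for s in self.getChildSubjects(parent)
                if s[0] != subject_id]

    def getRootSubjects(self):
        # Roots are simply the children of "no parent".
        return self.getChildSubjects(None)

    def searchSubjects(self, search_term):
        term = search_term.lower()
        return [self.getSubject(code)
                for code, (name, parent) in sorted(self._subjects.items())
                if term in name.lower()]


# Placeholder subjects, not taken from any real vocabulary.
tree = SubjectTree({
    'A': ('Arts', None),
    'A1': ('Music', 'A'),
    'A2': ('Theatre', 'A'),
    'B': ('Business', None),
})
```

For example, tree.getSiblingSubjects('A1') gives the other children of 'A', and tree.getRootSubjects() gives the top-level categories a UI could start from.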



On Tue, 2002-02-19 at 20:46, sean.upton@uniontrib.com wrote:
> Hey everybody,
> I am looking at implementing Dublin Core Qualifiers for Subject metadata as
> a means of expressing subjects within multiple controlled and standardized
> vocabularies (namely, IPTC subjects for news and sports stuff, and NAICS or
> SIC codes for business information), in addition to supporting plain-text
> subject vocabularies as well.  Is there any established pattern or syntax
> for dealing with subject codes this way in the CMF?  I haven't found
> anything, so I have been thinking about a solution... my thoughts are below.
> 
> The Dublin Core Qualifiers spec has several recommended element encoding
> schemes for LC and medical subjects, but nothing excludes other
> industry-standard subject vocabularies, such as IPTC (news/media, worldwide)
> or NAICS (used by North American governments, business/economic news, and
> yellow pages), or market names (stock tickers).
> 	http://www.dublincore.org/documents/dcmes-qualifiers/#subject
> 	http://www.iptc.org/
> 	http://www.census.gov/epcd/www/naics.html
> 
> My first hunch is that the best way to convey a namespace/qualifier for a
> subject code system is with a colon in the text, separating the vocabulary
> "NAME" (in dcmes-qualifiers terms) from the code.  My second hunch is that
> I need to create a subject lookup tool that performs lookups for
> "human-readable" counterparts for codes, so that codes with qualifiers can
> get a description that makes sense to a content user.  I also think such a
> framework might be useful for content creators if the user interface for
> metadata entry enabled efficient lookup with these codes (the biggest UI
> issue is that the number of these codes may be on the order of thousands,
> so something like a popup search/browse dialog might be appropriate).
> 
> Example lookup/translation input/output:
> 
> 	NAICS:511110  --> "Newspaper Publishers"
> 	IPTC:01016000 --> "Television"
> 	NASDAQ:MSFT --> "Microsoft Corporation"
> 	Media Companies --> "Media Companies" (verbatim translation of unqualified text)
> 
> This tool should support internationalization (or is it localization?) of
> description lookup, because these vocabularies are often defined by
> multi-national organizations (thus multi-lingual lookup tables might exist;
> for example, IPTC supports most European languages, Turkish, and Arabic).
> This isn't to say that one need implement every language a vocabulary
> supports to satisfy this, but that the interface for this tool should
> support a language encoding parameter for this purpose, so that a
> multi-lingual site can support multiple languages with one vocabulary
> (SignOnSanDiego publishes content in English and some Spanish).
> 
> In use of this tool, there would still be interfacing issues to make this
> work with the metadata tool and content types, both in terms of supporting a
> user interface for massive amounts of subject codes, as well as determining
> when to display the code and when to display the lookup description...
> 
> I'd be interested to see what people think about this.  I wrote some
> interface documentation, pasted below, that might help in explaining
> my idea.  Thoughts?
> 
> Thanks,
> Sean
> 
> #####################################
> ##################################### 
> 
> import Interface
> 
> class portal_subjectlookup(Interface.Base):
>       """
>         Interface for registry of subject code qualifier
>         vocabularies.  Among other things, a tool
>         implementing this interface should provide the
>         ability to query with a code, language, and
>         vocabulary, and get descriptions.
>       """
> 
>       def getDescriptionFromCode(code, vocabulary=None, language='en-US'):
>           """
>             Lookup code in registry specified by vocabulary for language 
>             specified in language.
> 
>             Pre-condition:  code is a string object and is not None
>             Post-condition: a string is returned with a human-readable
>                             text description (string) for a code in
>                             the language specified, if available.
> 
>                             If a registry implementation in the tool is not
>                             available in the language specified, a default
>                             language should be used.
> 
>                             If no viable option can be found in lookup, 
>                             method should return None.
> 
>             Notes: sorry about the ethnocentrism in the language default.
>           """
> 
>       def findCodeByKeyword(query, vocabulary=None, language='en-US'):
>           """
>             Used primarily by content producers, or agents on their behalf.
> 
>             This method is used to find a correct code for a piece of
>             content when the code is unknown, but the subject matter is.
>             The query can be either a single string keyword or a sequence
>             of keyword strings.  The query is an "or" query, so that if
>             query == ['foo','bar'], topic codes with descriptions matching
>             either keyword should be returned.
> 
>             Pre-condition:  query is a string or a sequence of strings and
>                             is not None.  If query is a sequence, a query
>                             will be performed for all terms as specified
>                             above.  If vocabulary is specified, only search
>                             that vocabulary, otherwise a 'search all' is
>                             assumed.
> 
>             Assumptions:    it is assumed that the query that is passed to
>                             this method should match with a wildcard on the
>                             end of each keyword, so that a query of
>                             ['bio','tech','medi'] would find biotechnology,
>                             technology, medicine, technical, medical, etc...
> 
>             Post-condition: a sequence of matches is returned, where a match
>                             is a tuple of vocabulary, code, and description
>                             in the language of choice.
> 
>                             If a registry implementation in the tool is not
>                             available in the language specified, a default
>                             language should be used.
> 
>                             If no viable option can be found in lookup,
>                             method should return None.
>           """
> 
>       def listAllCodes(vocabulary=None, language='en-US'):
>           """
>             This method lists all entries in lookup tables for subject
>             vocabulary codes, either globally, or within a particular
>             vocabulary.  Output is similar to findCodeByKeyword()...
> 
>             Assumptions:    If vocabulary is None, then search
>                             globally across all vocabularies present
>                             in this tool.
> 
>             Post-condition: a sequence of entries is returned, where an
>                             entry is a tuple of vocabulary, code, and
>                             description in the language of choice.
> 
>                             If a registry implementation in the tool is not
>                             available in the language specified, a default
>                             language should be used.
> 
>                             If no viable option can be found in lookup,
>                             method should return None.
>           """
> 
>       def getIconPathForSubject(code, vocabulary=None):
>           """
>             Attempts to find an icon path registered for a code/vocabulary
>             combo.  Since vocab is optional, this could potentially need
>             to look through the registry for entries in all vocabularies.
> 
>             Returns a list of "wrapped-icons" where a wrapped-icon is a 
>             tuple containing the icon width, icon height, and icon path
>             as a list; example: 
>             [ (32,32,['path','to','images','subj32.png']),
>               (16,16,['path','to','images','tiny','subj16.png']) ]
>           """