[Zope3-dev] Translation question

Barry A. Warsaw barry@zope.com
Mon, 20 May 2002 20:01:15 -0400


[Sorry for the length, I know that means I'll lose 75% of my
audience. ;) -BAW]

>>>>> "JF" == Jim Fulton <jim@zope.com> writes:

    JF>    It's worth noting that message catalogs are an artifact of
    JF> a particular implementation idea and are, therefore, not very
    JF> core. *only* a particular translation service implementation
    JF> depends on message catalogs.

I like this separation of concerns.

    JF>   - Language tags, as defined by RFC 1766 (e.g. en-US), are used
    JF> to identify languages.

I've been told by i18n'ers that you might want a hierarchy of language
tags, e.g. you might make a request for a phrase in en_US but you'd be
happy if the translation were available in a shared catalog for US and
UK English.  Other languages may also have multiple dialects that
share many common translations, with just a few specializations in
each dialect.

This probably means that IMessageCatalogs should be hierarchical.  If
an ITranslationService doesn't use catalogs then it should still be
prepared to handle destination languages from general (en) to specific
(en_US).
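
A minimal sketch of that general-to-specific fallback, assuming catalogs
are plain dicts keyed by language tag.  The names fallback_chain and
lookup are made up for illustration and are not part of any proposed
interface:

```python
def fallback_chain(tag):
    """Return tags from most to least specific, e.g. en_US -> en."""
    # Accept both en-US and en_US spellings (an assumption of this sketch).
    parts = tag.replace("-", "_").split("_")
    return ["_".join(parts[:i]) for i in range(len(parts), 0, -1)]


def lookup(catalogs, msgid, tag):
    """Return the first translation found along the chain, else None."""
    for candidate in fallback_chain(tag):
        catalog = catalogs.get(candidate, {})
        if msgid in catalog:
            return catalog[msgid]
    return None


# Hypothetical data: a shared English catalog plus a US specialization.
catalogs = {
    "en": {"colour": "colour", "hello": "hello"},
    "en_US": {"colour": "color"},
}
```

Here a request for "hello" in en_US falls through to the shared en
catalog, while "colour" is answered by the en_US specialization.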

    JF>       translate("hello world", destination_language="fr")

How often will the destination language be explicitly passed into the
translate() method?  If the destination language is to be (usually)
decided on the basis of language negotiation, wouldn't the destination
language (usually) be part of the request?  Even when the user can
explicitly choose a language ("View this site in: English Dutch Polish
Russian Japanese") there will probably be some contextual object that
encapsulates this decision and I don't think you're going to want to
be passing this object explicitly all around the place.  Then again,
if it's part of the request, that should be fine, because you can
always get to that, right?
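
Roughly what I have in mind, as a sketch only.  The Request class, its
preferred_languages attribute, and the extra catalogs argument are all
assumptions for illustration; the real request object and the real
translate() signature may look nothing like this:

```python
class Request:
    """Hypothetical stand-in for a request carrying negotiated languages."""

    def __init__(self, preferred_languages):
        self.preferred_languages = list(preferred_languages)


def pick_language(request, available, default="en"):
    """Return the first preferred language that we have a catalog for."""
    for tag in request.preferred_languages:
        if tag in available:
            return tag
    return default


def translate(msgid, catalogs, request=None, destination_language=None):
    """Translate msgid; an explicit destination_language wins over the request."""
    if destination_language is None:
        if request is not None:
            destination_language = pick_language(request, catalogs)
        else:
            destination_language = "en"
    # Fall back to the message id itself when no translation exists.
    return catalogs.get(destination_language, {}).get(msgid, msgid)
```

The caller normally passes only the request, and destination_language
stays available as an override for the rare explicit case.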

    JF>     The two models are usually mixed up by using ids that are
    JF> actually phrases in some implicit source language, like
    JF> English. Translation of these ids to the source language is
    JF> accomplished by simply returning the id.

There's also the issue of a default translation.  One of the
advantages of using an implicit source language as your message ids is
that if no translation is available, you will still end up with
/something/ that maybe makes sense instead of an error, a cryptic
warning, or some incomprehensible message identifier.  Then again, an
English phrase may well be just as incomprehensible to any particular
viewer.

It will always be true that catalogs are out-of-date.  Translators (be
they volunteer or paid) simply can't keep up with the pace of software
change, and there's no hope of keeping a dozen or so language teams on
the same development or release schedule.  So I believe that you must
both plan for incomplete translations, and provide tools and services
to help make the translators' lives easier.  The former should be in
scope for this project, but the latter can be deferred.

The question is, is it better to return an untranslated message in
some default, guaranteed-to-exist source language (i.e. English) or to
not provide the message at all?  Another thing to consider is that if
you're going to use language-neutral message ids, you've now got to
essentially have a translation team for your implicit source language
to provide the catalog from message-id to that language.  You also
need a system for managing message-ids.  You probably never want to
recycle ids.  Once an id has been assigned a particular semantic
meaning it never changes.

OTOH, using source-language strings as your message ids has its own drawbacks.
I've found that there tends to be a higher inertia against changing
the strings in a source file because changing even one character (say
whitespace or punctuation) causes the catalog lookup to fail.

There's also the issue of output domain.  For example, you might have
a source string that has the same translation for HTML, email, and XML in
one language, but different translations in other languages.

So I think neither system is perfect (pure message-ids or source
language used as message ids), but it seems that there is consensus
around the approach of using English as your source language, that
your source messages are your message ids, and that untranslated
source strings get returned in the source language.
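
To make the trade-off concrete, here is a tiny sketch that assumes a
catalog is just a dict from source string to translation (catalog_fr
and this translate() are illustrative only, not a proposed API):

```python
# Hypothetical French catalog keyed by the English source string.
catalog_fr = {"Save changes": "Enregistrer les modifications"}


def translate(msgid, catalog):
    """Return the translation, or the source string itself if none exists."""
    return catalog.get(msgid, msgid)


translated = translate("Save changes", catalog_fr)
# Adding a single period to the source string misses the catalog entry,
# so the caller silently gets the English text back.
stale = translate("Save changes.", catalog_fr)
```

The untranslated case degrades to readable English, but the same
mechanism means any edit to the source string orphans its existing
translations.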

    JF>     I wonder if applications will commonly define their own
    JF> domains.

Probably.  It seems like an application framework like Zope might have
several message domains, perhaps even competing ones.

    JF>   - Because of grammatical differences among languages,
    JF> translations need to be aware of variable
    JF> interpolation. Variable interpolation is indicated with a
    JF> common string interpolation language. For example, in::

    JF>       You have purchased $count ${item}s.

It gets worse. :)

Planning for translatability means arranging your source messages as
whole sentences.  Some languages, such as Japanese, may find it
impossible to translate certain sentence fragments.

Then you have the problem of how to translate (or not translate :)
plural forms.  Your example above is broken because you've included
the plural form in the source string (it might even be broken just for
English).  Here's how bad it gets (from the GNU gettext manual):

    In Polish we use e.g. plik (file) this way:

    1 plik
    2,3,4 pliki
    5-21 plików
    22-24 pliki
    25-31 plików

Will the ITranslationService interface be able to handle this?
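
For reference, the Polish rule can be written as a small function.
This is a sketch of the selection logic only, following the usual
gettext-style plural rule for Polish; the function names and the FORMS
tuple are made up here, and how (or whether) ITranslationService would
expose something like this is exactly the open question:

```python
# Index 0: singular, 1: "few" form, 2: "many" form.
FORMS = ("plik", "pliki", "plików")


def polish_plural_index(n):
    """Pick the plural form index for a non-negative integer n."""
    if n == 1:
        return 0
    # 2-4, 22-24, 32-34, ... but not 12-14, which take the "many" form.
    if 2 <= n % 10 <= 4 and not 10 <= n % 100 < 20:
        return 1
    return 2


def format_count(n):
    """Render a count with the matching form of 'plik'."""
    return f"{n} {FORMS[polish_plural_index(n)]}"
```

Note that the translator has to supply three forms and a rule, and the
source string can't hard-code an English-style "s" suffix.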

    JF> Are translations done on a product-by-product basis, or system
    JF> wide?

Probably both.  On the one hand, a product designer might want to
arrange for their own translations and provide them as a part of their
product ("Zwizzy now supports 13 languages!").  OTOH, I can imagine
that that would lead to a lot of duplication of effort.  So
refactoring translations and catalogs, or providing some generic
shared domains would be useful.

    JF> It seems to me that using text in an implicit source language
    JF> as translation identifiers, rather than using explicit source
    JF> languages doesn't make sense if translations are shared among
    JF> multiple subsystems, but is acceptable when dealing with
    JF> subsystem-specific translations.

It's a trade-off.  Think about what using explicit message-ids means:
you've got to have a way to register unique ids globally, you've got
to carve out id space for product designers, you'll have a lot of
seemingly duplicate entries because message #403 is close, but not
quite right for my application, etc.  This approach also imposes a
greater burden on the software engineer because they always have to
have the message catalog handy or the code may make no sense to them
at all.

It may be true that one system works best for core Zope source code,
services, and products, while another system works better for 3rd
party stuff.  Does it even make sense to decide one way or the other?
Perhaps the ITranslationService can be defined in such a way that
either approach works (since it's essentially a fancy mapping), but
that you as Zope Pope decide the way you want to see core Zope code
work.

    [...and later...]

    JF> This gets back to the discussion below. Typically, the "ids"
    JF> people use are text in some implicit source language. This
    JF> works if the translation is done for a project but not if
    JF> translations are shared across projects because different
    JF> programmers may use different (implicit) source languages.

If I'm using English as my source language and you're using Russian, I
don't see how we can share the same translation domain.

-Barry