Internationalisation of database content (text columns)

Started by Wolfgang Kellerover 15 years ago2 messagesgeneral
Jump to latest
#1Wolfgang Keller
feliphil@gmx.net

(Since I got no replies on the novice list, I repost this question here)

Hello,

I'm working on a database schema which contains lots of "type code lookup" tables. The entries of these tables are also hierarchically related among themselves (subtype/supertype), to store rather large and quite complex taxonomies.

Currently the schema does not provide for translation of the entries in these tables, although it is meant to be used for inter-company data exchange, i.e. for use by many different people having all different native languages. When searching for "best practice" guides about the topic of transparently translating database content, I came across about this example:

http://rwec.co.uk/pg-atom/i18n.html

The basic method looks pretty intuitive to me, except that I don't understand why it wouldn't be more obvious to use a composite type, with language codes as column names, instead of an array for the custom "translated text" data type.

Any opinions from more experienced "mahouts"? Any obvious drawbacks of using such a composite type? Other than the fact that the schema has to be modified to add languages, which wouldn't be that desastrous in this case.

Are there any other (better?) "best practice" methods for the internationalisation of database content? Maybe methods that use capabilities (e.g. text search/thesaurus?) already built-into PostgreSQL?

BTW: Methods that use a single table to hold all translations for all strings in all tables of the entire schema are not very useful in this case, since it can't be excluded that depending on the context (i.e. the specific semantics of the specific "type code lookup" table) the translation of one and the same string in one language will be different in other languages.

TIA,

Sincerely,

Wolfgang Keller

#2Karsten Hilbert
Karsten.Hilbert@gmx.net
In reply to: Wolfgang Keller (#1)
Re: Internationalisation of database content (text columns)

On Fri, Oct 08, 2010 at 12:12:54PM +0200, Wolfgang Keller wrote:

I'm working on a database schema which contains lots of
"type code lookup" tables. The entries of these tables are
also hierarchically related among themselves
(subtype/supertype), to store rather large and quite complex
taxonomies.

.From my experience it depends. Either you've got a "coding
system" which you need to provide localized "translations"
for or else you've got arbitrary type code lookups.

With coding systems it is typically not really a translation
of the coded term but rather *another* term people attach to
the same code - incidentally when using another language.
Terms in one language change while they don't change in
another. Think of the code as defining a class with all the
local language terms being ever-so-slightly different things
all belonging into that class (eg. while "back pain" and
"Kreuzschmerz" aren't considered translations of each other
*medically* they can well be considered to group under the
same ICD-10 code). Thus I've found this general scheme to
work well:

create table coded_term (
pk serial primary key,
code text,
term text,
lang text,
fk_coding_system integer
references coding_system(pk),
unique(code, term, lang, fk_coding_system)
);

(it can be argued whether lang should fold into coding_system)

If it's about arbitrary lookup values for codes I am using a
gettext version rewritten in pgsql similar to this:

create table lut_colors (
pk serial primary key
color text
);

create view v_lut_colors as
select
pk
as pk_lut_color,
color
as color,
_(color)
as l10n_color
from
lut_colors;

(you don't need the view or you don't need it in this way but
it's useful)

Now you guessed it: _() is a plpgsql function which does a
translation table lookup based on the database account (or a
passed in user name) and a pre-configured (or passed in)
language per said account/user name. It falls back from,
say, "de_DE" to, say, "de" to returning the original string.

The translation table is filled this way:

select i18n_upd_tx('de_DE', 'blue', 'blau');
select i18n_upd_tx('de_DE', 'grey', 'grau');

It doesn't really matter which language is used as the
"original" lookup language as long as a translation exists
for the desired target language:

select i18n_upd_tx('en', '1�34kjafg8', 'yellow');

will properly make

select _('1�34kjafg8', 'en');

return "yellow".

All the code for this is to be found in the git repository
for GNUmed at gitorious:

http://gitorious.org/gnumed

BTW: Methods that use a single table to hold all
translations for all strings in all tables of the entire
schema are not very useful in this case, since it can't be
excluded that depending on the context (i.e. the specific
semantics of the specific "type code lookup" table) the
translation of one and the same string in one language will
be different in other languages.

Well, either add in a context field to the _()/i18n_upd_tx()
approach or consider using the coding system approach. You
might even figure out a way to use the tableoid in the
translation function:

create view v_lut_colors as
select
pk
as pk_lut_color,
color
as color,
_(color, lut_colors.tableoid)
as l10n_color
from
lut_colors;

This would require applying the tableoid when adding
translations though.

Karsten
--
GPG key ID E4071346 @ wwwkeys.pgp.net
E167 67FD A291 2BEA 73BD 4537 78B9 A9F9 E407 1346