Internationalization: plural forms, context, and more

Status:	unknown
Sponsored by:	no-one
Developed by:	Julian Maurice
Expected for:	2018
Bug number:	Bug 15395
Work in progress repository:	No URL given.
Description:	Make translation of Koha easier by using gettext the way it is meant to be used

Current behaviour

Currently the translation workflow is as follows:

A script extracts all the translatable strings from template files and puts them in PO files
Translators translate the PO files
A script uses the translated PO files and English templates to produce translated templates

This has several problems:

String extraction is buggy and imposes arbitrary rules on source files (no line breaks in TT directives, no TT directives inside HTML tags)
String extraction has to guess which strings are translatable and ends up putting strings like "%s %s%s%s %s%s %s %s %s%s %s" in PO files, which makes translation harder. It also forces translators to review strings where the only change is the number of '%s' (it looks like this problem can be fixed by bug 12221)
String replacement is buggy and imposes other arbitrary rules on source files (JS strings must be enclosed in double quotes)
It prevents us from using some essential features of Gettext (and of internationalization in general) like plural forms and context

New proposed behaviour

What this bug proposes is to mark all translatable strings by wrapping them in function calls. For example:

[% t('A translatable string') %]

That way, translatable strings are clearly identified and can be extracted more easily with standard tools (xgettext), which should fix all the problems mentioned above.
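To illustrate why explicitly marked strings are easier to extract, here is a minimal sketch in Python. It is not how xgettext works and is not part of the patch: the regular expression and the extract_msgids name are assumptions made for this example, and it only handles simple t('...') calls without escaped quotes.

```python
import re

# Hypothetical, simplified pattern: matches t('...') or t("...") with no
# escaped quotes inside. A real extractor such as xgettext handles far more.
T_CALL = re.compile(r"""\bt\(\s*(['"])(.*?)\1\s*\)""")

def extract_msgids(template_source):
    """Return the msgids of all t() calls, in order of appearance."""
    return [match.group(2) for match in T_CALL.finditer(template_source)]

template = """
<h1>[% t('A translatable string') %]</h1>
<p>[% t("Another one") %]</p>
<p>Not wrapped, so not extracted</p>
"""

print(extract_msgids(template))
```

Because the extractor only looks for the marker function, it no longer has to guess which parts of the HTML are translatable.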

This implies a change in how translations work: strings wrapped in t() calls will be translated at run time instead of when the translated template files are created.

However, bug 15395 does not wrap any strings (except in the example patch), so applying it will not break anything. It will just allow us to start wrapping translatable strings.

In the long term, when all translatable strings are wrapped, it will allow us to skip entirely the process of creating translated templates and all languages will be available without having to run a script on the server (`translate install`).

More on plural forms

Please read this first: https://www.gnu.org/software/gettext/manual/gettext.html#Plural-forms

Currently, to handle plural forms in Koha, we would do something like this:

[% IF items.count == 1 %]
    There is 1 item
[% ELSE %]
    There are [% items.count %] items
[% END %]

Or, easier:

Items: [% items.count %]

The 1st example assumes all languages have only one singular form and one plural form, which is wrong.

The 2nd example does not give us much flexibility to write messages.

Bug 15395 fixes this by letting us tell the string extractor which strings are singular and which are plural. This is done by wrapping strings like this:

[%# assumes that we have at least 1 item %]
[% tn('There is an item', 'There are several items', items.count) %]

The generic form is

[% tn(msgid, msgid_plural, n) %]

What happens here is that we tell gettext that:

the singular form is msgid ('There is an item')
the plural form is msgid_plural ('There are several items')

This will result in a PO entry like this:

msgid "There is an item"
msgid_plural "There are several items"
msgstr[0] "Here goes the translation for the singular form"
msgstr[1] "Translation for the 1st plural form"
msgstr[2] "Translation for the 2nd plural form, if any"
msgstr[X] "Translation for the Xth plural form, if any"

The n parameter of tn() is an integer that Gettext uses to determine which form should be used, depending on the Plural-Forms PO header (which is different for each language, see https://localization-guide.readthedocs.io/en/latest/l10n/pluralforms.html). For instance, for Slovak, Plural-Forms is

nplurals=3; plural=(n==1) ? 0 : (n>=2 && n<=4) ? 1 : 2;

which means the result of tn() will be:

msgstr[0] if n is 1
msgstr[1] if n is 2, 3 or 4
msgstr[2] for any other value of n
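The Slovak rule above can be transcribed by hand to check which form is selected for a given n. This is only a sketch in Python: the function names are made up for the example, and the msgstr values are placeholders, not real Slovak translations.

```python
def slovak_plural_index(n):
    """Transcription of: plural=(n==1) ? 0 : (n>=2 && n<=4) ? 1 : 2"""
    if n == 1:
        return 0
    if 2 <= n <= 4:
        return 1
    return 2

# Placeholder values standing in for the translated msgstr[0..2] entries.
MSGSTR = ["msgstr[0]", "msgstr[1]", "msgstr[2]"]

def tn_result(n):
    """Return the msgstr that would be chosen for the count n."""
    return MSGSTR[slovak_plural_index(n)]

for n in (0, 1, 2, 4, 5, 21):
    print(n, tn_result(n))
```

Note that 0 and 21 both select msgstr[2], which is why a simple "n == 1" test in the template cannot cover this language.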

More on context

Context is simply an additional string that helps disambiguate words in certain situations. For instance, "item" is a very generic term and has several translations in other languages. We can help the translator by attaching context to a string:

[% tp('Bibliographic record', 'item') %]

The resulting PO entry will be:

msgctxt "Bibliographic record"
msgid "item"
msgstr ""

The context will appear in Pootle so that translators will know what kind of item they are dealing with.

The context will not appear in the Koha interface.

The same string with different contexts can have different translations. For instance:

 [% tp('email', 'subject') %]
 [% tp('bibliographic record', 'subject') %]

will result in two different entries in the PO file:

msgctxt "bibliographic record"
msgid "subject"
msgstr ""

msgctxt "email"
msgid "subject"
msgstr ""
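One way to picture this is a catalogue keyed by the pair (msgctxt, msgid) rather than by msgid alone. The Python sketch below assumes an in-memory dictionary and example French translations chosen for illustration; it does not reflect how Koha or gettext stores the entries internally.

```python
# Illustrative French catalogue; the translations are example values only.
CATALOG = {
    ("email", "subject"): "objet",
    ("bibliographic record", "subject"): "sujet",
}

def tp(msgctxt, msgid):
    """Look up msgid within msgctxt; fall back to msgid if untranslated."""
    return CATALOG.get((msgctxt, msgid), msgid)

print(tp("email", "subject"))
print(tp("bibliographic record", "subject"))
print(tp("unknown context", "subject"))  # no entry, so the msgid is returned
```

The context is part of the lookup key only, so it never shows up in the returned string.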

It can also help in some cases where it is not clear if the string to translate is a verb, a noun, or something else. For instance:

[% tp('verb', 'Order') # to order %]
[% tp('noun', 'Order') # an order %]

Variable substitution in translation

Imagine you want to translate the following message

Item checked out on [% date %]

You could do something like this

[% t('Item checked out on') %][% issue.issuedate %]

But this has some drawbacks:

The message is incomplete: the translator cannot be sure that what follows the message is a date, and the translation might differ if it were something else
The date cannot be placed anywhere but at the end of the message, which could be a problem in some languages.

To fix that, you can use variable substitutions.

[% tx('Item checked out on {date}', { date = issue.issuedate }) %]

What we have done here:

Append an 'x' to the function name (every translation function has an 'x' variant)
Add a {date} placeholder in the msgid parameter, where date is the name of a key in the last parameter of the function
Add a hashref as the last parameter, containing the substitutions to perform after the translated message has been retrieved

You can add as many variable substitutions as you want. For instance:

[% tx("Hi {foo}! I'm {bar}", { foo = foo_value, bar = bar_value }) %]

Translations must keep those {NAME} placeholders unchanged:

msgid "Hi {foo}! I'm {bar}"
msgstr "Salut {foo}! Je suis {bar}"
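The substitution step happens after the translated message has been retrieved, which is what lets a translation move a placeholder around. Here is a minimal Python sketch of that step. The expand function is a name invented for this example, leaving unknown placeholders untouched is an assumption, and the reordered message is a made-up translation used only to show the idea.

```python
import re

PLACEHOLDER = re.compile(r"\{(\w+)\}")

def expand(message, variables):
    """Replace each {name} with variables[name]; leave unknown names as-is."""
    def replace(match):
        name = match.group(1)
        return str(variables[name]) if name in variables else match.group(0)
    return PLACEHOLDER.sub(replace, message)

# msgstr as it would be retrieved from the PO entry above
translated = "Salut {foo}! Je suis {bar}"
print(expand(translated, {"foo": "Alice", "bar": "Bob"}))

# A made-up translation that puts the date first instead of last
reordered = "{date}: item checked out"
print(expand(reordered, {"date": "2018-07-01"}))
```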

Combine them all

Of course, you can use plural forms, context and variable substitution at the same time:

[% tnpx(context, msgid, msgid_plural, n, vars) %]

[% tnpx('bibliographic record', 'there is {count} item', 'there are {count} items', items.count, { count = items.count }) %]
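To show how the three features fit together, here is a self-contained Python sketch of a tnpx-like function. It is an illustration under stated assumptions, not the patch's implementation: the catalogue is an in-memory dictionary, the French strings are example values, and the plural rule is the usual French header (nplurals=2; plural=(n > 1)).

```python
import re

# Hypothetical catalogue. Keys are (msgctxt, msgid); each value lists one
# msgstr per plural form of the target language.
CATALOG = {
    ("bibliographic record", "there is {count} item"): [
        "il y a {count} exemplaire",
        "il y a {count} exemplaires",
    ],
}

def plural_index(n):
    """French rule: nplurals=2; plural=(n > 1)"""
    return int(n > 1)

def tnpx(msgctxt, msgid, msgid_plural, n, variables):
    """Pick the message by context and count, then substitute {name}s."""
    forms = CATALOG.get((msgctxt, msgid))
    if forms is None:
        # Untranslated: fall back to the source strings (singular if n == 1).
        message = msgid if n == 1 else msgid_plural
    else:
        message = forms[plural_index(n)]
    return re.sub(
        r"\{(\w+)\}",
        lambda m: str(variables.get(m.group(1), m.group(0))),
        message,
    )

for n in (0, 1, 3):
    print(tnpx("bibliographic record", "there is {count} item",
               "there are {count} items", n, {"count": n}))
```

The order of operations matters: the context and count select the message first, and the placeholders are filled in last, so translators are free to move {count} wherever their language needs it.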

Additional benefits

This bug also makes it possible to translate strings in Perl code: CSV headers, messages returned by the API, ... (everything that doesn't need templates but currently uses them because they are the only way to translate things)
With a little work (see bug 21156), this new behaviour can be used in JS files too (no need to declare global variables in .inc files only to use them in JS code)

Downsides

As translation is done at run time, we should make sure that it doesn't degrade performance too much. Here's my attempt to measure the performance impact: https://gitlab.com/jajm/time-i18n. Please test and comment.

Proposed roadmap

Push bug 15395 as soon as possible (it will not affect existing translations)
Once bug 15395 is in main, start wrapping all strings (bug 20988 can help). Starting early in the release cycle will leave us time to deal with potential problems with Pootle integration, and to re-translate strings if needed.
Refuse new patches that introduce non-wrapped translatable strings (new rule in coding guidelines)
Once all strings are wrapped, remove all references to translated template directories (koha-tmpl/intranet-tmpl/prog/LANG and koha-tmpl/opac-tmpl/bootstrap/LANG) and make all languages available in sysprefs. This step might require some change in how translation in our XSL files works. It should be possible to translate them on-the-fly and cache them somewhere (to avoid having to run `translate install` only for XSL files), or maybe generate multilingual XML files with itstool, but it's outside the scope of this bug.
