Translation in templates fix RFC

From Koha Wiki
Jump to navigation Jump to search

Translation in templates fix

Status: unknown
Sponsored by: no-one
Developed by: Michał Kula
Expected for: 2024/2025
Bug number: Bug 38618
Work in progress repository: No URL given.
Description: Raw strings in templates shouldn't be matched for translations, move all to explicit template blocks


Currently, in .tt and .inc templates, raw strings are extracted and matched against .po files, with raw replacement occuring, which ignores the context. This is not a sane solution and should be rather looked in the categories of a historical artifact of how translations were retro-fitted into an originally English-only software. Why it's bad:

  • it ignores the context, since a string can be used in multiple different contexts, requiring multiple distinct rules for escaping them, leading to actual unexpected breakage with "bad" translations ("bad" as in using quotes inside of them!). There is no one-size-fits all:
    • in HTML text context: escape HTML elements <> into < >, probably replace newlines to <br>
    • in HTML raw html context: don't escape and allow elements like <br> (if such translations exist despite being risky)
    • in HTML entities context (inside of attributes): insert quotes around string, escape quotes inside of the string (as &quot, replace all other characters with required HTML entites/escapes as required by the HTML standard (including newlines)
    • in JavaScript string: insert quotes around string, escape quotes inside of the string with
  • no explicit control over what should be translated and what should not, it's anyone's guess (so this is not too intuitive for developers either)
  • for new developers "magic" implicit translations are actually very confusing, it's hard to understand at which stage and how the translation occur, only old developers can grasp the full chain of translations, this makes code review and auditing very hard (bad for security)


I want there to be a universal agreement that we should do away with the context-less string replacements for translations already and focus on how to fix them in a proper and future-proof way. And I want this solution to be as simple as possible, but while also being as proper as possible.


Why explicit templates (shortly):

  • no unexpected translations of stuff we don't want to translate
  • explicit is better than implicit, invoke a simple macro for a translation of user-facing text, simple as that (and no automatic guesswork)
  • can properly escape depending on context, guarantees no XSS instead of doing random escapes and crossing fingers (this isn't some theoretical blabbering, RIGHT NOW some intranet pages are broken in some languages due to Syntax Error in JS, this is URGENT and needs a proper fix to prevent it from resurfacing all the time)


I propose moving all of these strings into explicit t* invokations. Furthermore I propose diving the existing t* functions into separate explicit sub-routines that force proper usage:

  • t_html, tx_html, tn_html, etc. = loose HTML text
  • t_raw, tx_raw, tn_raw, etc. = raw insertion, potentially dangerous (if you wanna be really explicit, t_raw_dangerous if you wanted to be explicit about removing any HTML from translating strings, but then again _raw could be useful for chaining it into other funcs, but then again I18N.t could be used in such smart cases instead)
  • t_html_attr, tx_html_attr, tn_html_attr, etc. = inserts a quoted attribute inside of HTML element props
  • t_js_string, tx_js_string, tn_js_string, etc. = inserts a quoted JavaScript string inside of JavaScript code


The above avoids chains like t('') | html, which has several advantages:

  • enforces more consistent ways in how translations are embedded
  • performance: single call with full control over returning already escaped string allows introducing some type of caching if escaping at runtime every time is a concern, would be impossible with chained calls
  • an explicit call like t_html (but not t_html_attr) allows us to do very cool things like: optionally enabling a preference to display a button to open the translation in Weblate (insert such snippet only into HTML text, not entities or JS strings)
  • inserting quotes for HTML attrs and JS strings means the developer and translator is not to worry about which type of quotes should be used and can focus on more ambitious things


Auto-refactoring pattern

1. JavaScript explicit strings:

before:

const string1 = _("Heya");
const string2 = _('Something: "%s"').format("yay");

after:

const string1 = [% t_js_string('Heya') %]; // notice no quotes around [% %]!
const string2 = [% t_js_string('Something: "%s"') %].format("yay");

2. HTML loose text:

before:

<span>I am a loose string</span>

after:

<span>[% t_html('I am a loose string') %]</span>

3. HTML attribute:

before:

<span data-test="I am a loose attribute">...</span>

after:

<!-- notice no quotes -->
<span data-test=[% t_html_attr('I am a loose attribute') %]>...</span>

4. Existing invokations to be fixed (incomplete list; do analogous for other t* function variants here ofc):

before:

<span data-test="[% t('I am a loose attribute') | html_entity %]">...</span>
<span data-test="[% t('I am a loose attribute') | html %]">...</span><!-- it should be html_entity, yet many people use html, sigh!!! -->
<title>[% t('I am a loose string') | html %]</title>
<p>[% tnx('Bundle of {count} item', 'Bundle of {count} items', bundle_items_count, { count = bundle_items_count }) | html %].</p>
<script>var authtypecode = "[% authtypecode | html %]";</script><!-- yikes, this is unrelated to translations in this case, but fix this crap too by the opportunity, WARN: ensure they don't actually insert said var into raw HTML later, in which case do |html|js_string, but preferably just be smart at the moment of insertion, unfortunately requires individual look at these 182+ cases, actually might split this into a separate bug -.- -->

after:

<!-- notice no quotes -->
<span data-test=[% t_html_attr('I am a loose attribute') %]>...</span>
<span data-test=[% t_html_attr('I am a loose attribute') %]>...</span>
<title>[% t_html('I am a loose string') %]</title>
<p>[% tnx_html('Bundle of {count} item', 'Bundle of {count} items', bundle_items_count, { count = bundle_items_count }) %].</p>
<script>var authtypecode = [% authtypecode | js_string %];</script><!-- let's imagine "authtypecode" could be a translation too -->


Implementation path/goals

The loose template translation string matching will be replaced with only explicit template blocks, taking into consideration three current usages: HTML text, HTML entities, JS strings.

0. Solve this properly once and well within the defined confined scope, don't divide it into separate temporary stopgap solutions (nothing is more permanent than a temporary solution and we already seem to suffer from this problem for 6 years since the initial bug was opened!!!!!). I want to do it once and properly, please and thank you.

1. Implement said new explicit t*_* functions first. Also remove function _(s) { return s; } from JS, since we won't use that pattern anymore.

2. Create a script to automatically fix all the possible cases to the new rules, to make it easy to apply. Well-defined rules mean an old manual patch-set won't get stuck in limbo always waiting for rebasing, that's the only chance to get it ever merged.

  • first pass: fix various usages of existing t* functions migrating them to t*_* context-aware functions, try to detect common mis-usage of wrong escapes in wrong places and correct them, because later they'll get buried in sand; while developing it keep iterating until no more non-context-aware t invokations remain
  • second pass: match up JS _() function usage and replace it with [% t_js_string() %]
  • third pass: match up loose non-JS strings and insert the new t_* (depending on context) (yes, I think those will only be the base t function, not the plural/args ones etc.; see non-goals)
  • the above passes should be split into separate commits

3. Provide a short cheat-sheet page, so that people know what translation functions to use when (currently they're inconsistently/wrongly used all over the place), put it in code style guidelines, make QA most aware of this as the guardians (hey, the simplicity should make it intuitive and straightforward, so don't worry!), eventually implement some lint checks for obvious mistakes (distant future: if we end up having implemented Phabricator RFC, lint will warn contributor before QA person wastes their time on it)

TODO:

  • Figure out how to handle html_line_break macro.
  • Decide if t() should use single or double quotes by convention, since it's mix currently...


Goals

In TLDR, we do:

  • removal of auto-detected strings in templates for translation, all translations have to be explicit now, and as such:
  • removal of function _(s) { return s; } in JS in favor of [% t_js_string() %], cannot use _() anymore, either use template or __() runtime funcs
  • removal of loose HTML strings in favor of [% t_html() %] and [% t_html_attr() %]
  • turning t* functions into explicit specialized context-aware sub-functions depending on where they're used, by-chance fixup of wrong usages of escapes in existing code
  • this ensures no possibility of accidental XSS and syntax errors with translations, bringing some much-needed sanity


Non-goals

The scope is intentionally limited to make things simpler and more contained, to avoid dragging this out more than it has to be:

1. Migrating _() calls to __() runtime JS calls, that would be more tricky because:

  • would have to move said strings from messages to messages-js, would have to check if nothing else still uses it in messages before removal, would lose Weblate history
  • would immediately inflate the client translations JSON 2-3x times
  • instead, devs should gradually introduce __() as they code/refactor new stuff, optionally a different refactor could be proposed to mass-change those at different point in time

2. Migrating current singular translations like contain(s) to one that has separate contain (one), contains (few), contains (many):

  • this is a lot of manual work
  • requires developer understanding of how non-English languages works
  • requires making new strings and translators translating them accordingly
  • document how to do it, have someone else do it if and when they see fit separately

3. Unification of how template parameters are done:

  • again a lot of manual labour since they're invoked in different weird ways
  • stuff like JS calling _().format() has to remain this way because formatting is only done at runtime, those would have to be migrated to __() instead, but see point #1 here
  • if you wanna refactor those for consistency, propose it as separate bug/RFC and figure out and cover all the cases (such as for example: get rid of String.prototype.format for JS, use named or numbered parameters like {param} everywhere etc.)


See also

Former art: