Handling UTF8 in development

After bug 6554/bug 11944/... were pushed to main, we still needed to address some UTF-8 issues here and there in the code. In the meantime the change has (very unfortunately :) been reverted again.
Here is a first draft of how Koha should handle UTF-8 and of the conventions we should adhere to in our code.

General principles

Decode what comes in, encode what goes out. (In the middle: leave everything in Perl internal format.)

Encode::encode("UTF-8", ...) and Encode::decode("UTF-8", ...) are preferred over utf8::encode and utf8::decode.

In Perl, the encoding name UTF-8 is strict, whereas utf8 is liberal about what it accepts.

If you have any doubts about this, read Larry Wall on the subject here
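
As an illustration of "decode what comes in, encode what goes out", here is a minimal sketch using only the core Encode module. The sample strings and variable names are made up for this example and do not come from Koha code.

```perl
use strict;
use warnings;
use Encode ();

# Raw bytes as they might arrive from outside ("é" is 0xC3 0xA9 in UTF-8).
my $bytes_in = "caf\xC3\xA9";

# Decode what comes in: bytes become a Perl character string.
my $text = Encode::decode( 'UTF-8', $bytes_in );

# In the middle: work on characters, not on bytes.
my $char_count = length $text;        # counts characters
my $byte_count = length $bytes_in;    # counts bytes

# Encode what goes out: characters become UTF-8 bytes again.
my $bytes_out = Encode::encode( 'UTF-8', uc $text );

# With a CHECK argument, decoding malformed input croaks
# instead of silently substituting a replacement character.
my $malformed = "\xFF";
my $strict_ok = eval {
    Encode::decode( 'UTF-8', $malformed, Encode::FB_CROAK() );
    1;
};
```

Here $char_count and $byte_count differ because the decoded string is counted in characters, and $strict_ok stays false because the malformed byte is rejected.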

Decoding

We decode input by passing the -utf8 flag to the CGI module. Note that the CGI documentation warns against using that flag with binary uploads, but there is not much binary uploading in Koha.

As an alternative, we can decode specific CGI parameters. But note that this creates a risk of forgetting one!

Another alternative can be seen on bug report 10075, which wraps the CGI module in Koha::CGI.

If you read from a file, add the <:encoding(UTF-8) layer to your open statement, or declare a similar layer with a use open statement at the start of the script.
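
A minimal sketch of reading through the decoding layer follows. To stay self-contained it opens an in-memory scalar as a stand-in for a real file; the sample bytes and variable names are assumptions for illustration only.

```perl
use strict;
use warnings;

# Stand-in for a file on disk: the UTF-8 bytes for "Zoë\n" held in a scalar.
my $file_bytes = "Zo\xC3\xAB\n";

# The :encoding(UTF-8) layer decodes the bytes while reading.
open( my $fh, '<:encoding(UTF-8)', \$file_bytes )
    or die "Cannot open in-memory file: $!";
my $line = <$fh>;
close $fh;
chomp $line;

# $line now holds characters, so "ë" counts as one.
my $length = length $line;
```

For a real file, the reference \$file_bytes would be replaced by the file name; the mode string stays the same.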

Note that we now interpret our template files as UTF-8 via the corresponding parameter in Templates.pm.

Currently, we still have some deeper decode statements in modules such as Auth.pm. We should strive to eliminate these by taking care of decoding our data as soon as it comes in.

Encoding

When you write to a file, add the >:encoding(UTF-8) layer to your open statement, or declare a similar layer with a use open statement at the start of the script.

Normally, CGI scripts should not print to STDOUT themselves, but use the Output module. The Output module encodes our output in the routine output_with_http_headers.
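
The following sketch writes a character string through the encoding layer and then inspects the result. It uses a temporary file from the core File::Temp module; the sample text and variable names are assumptions for illustration.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Character string "Zoë 実験", written with code point escapes.
my $text = "Zo\x{EB} \x{5B9F}\x{9A13}";

# Temporary file that is removed when the program exits.
my ( $tmp_fh, $path ) = tempfile( UNLINK => 1 );
close $tmp_fh;

# The :encoding(UTF-8) layer encodes the characters while writing.
open( my $out, '>:encoding(UTF-8)', $path ) or die "Cannot write $path: $!";
print {$out} $text;
close $out;

# Read the file back raw to see the bytes that were actually stored.
open( my $raw, '<:raw', $path ) or die "Cannot read $path: $!";
my $bytes = do { local $/; <$raw> };
close $raw;

# Read it back through the decoding layer to recover the characters.
open( my $in, '<:encoding(UTF-8)', $path ) or die "Cannot read $path: $!";
my $round_trip = do { local $/; <$in> };
close $in;
```

The raw read returns more bytes than the string has characters, while the decoded read returns the original string.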

DBI

In Context.pm a new db connection is initialized with mysql_enable_utf8.

As a result, we do not need to decode data coming from the MySQL database, nor encode data going to it.

Note: on bug report 10110 we discovered that $sth->{NAME} did not return decoded data.

DBIC

See https://metacpan.org/pod/DBIx::Class::Manual::Cookbook#Using-Unicode

URI Escaping

Note that uri_escape_utf8 includes encoding your data. If you send it to Output.pm afterwards, you will have double encoding.

This area needs some further research. E.g. the URLs in Search.pm.

Unescaping is done with uri_unescape. However, there is no uri_unescape_utf8, so we need to use something like:

Encode::decode_utf8( uri_unescape($string) )
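
A small round trip makes the asymmetry visible. This sketch assumes the URI::Escape module from CPAN is installed, and it uses the Encode::decode("UTF-8", ...) spelling recommended under General principles; the sample string is made up.

```perl
use strict;
use warnings;
use Encode ();
use URI::Escape qw( uri_escape_utf8 uri_unescape );

# Character string "café".
my $text = "caf\x{E9}";

# uri_escape_utf8 encodes to UTF-8 first, then percent-escapes the bytes.
my $escaped = uri_escape_utf8($text);

# uri_unescape only undoes the percent-escaping: the result is bytes.
my $bytes = uri_unescape($escaped);

# So the bytes still have to be decoded to get characters back.
my $decoded = Encode::decode( 'UTF-8', $bytes );
```

After unescaping, $bytes is one element longer than $decoded because "é" is still two UTF-8 bytes until it is decoded.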

XSLT

The Koha XSLT files should contain something like

<xsl:output method = "html" indent="yes" omit-xml-declaration = "yes" encoding="UTF-8"/>

that specifies UTF-8.

If that is the case, there is no need to decode what you get back from LibXSLT.

Zebra

Data coming from Zebra should be decoded as soon as it comes in. Examples of this can be found in Search.pm.

TODO: Probably we need to extend this list to be more complete.

Known bugs

  • Right after cataloguing the record with control number 12158637 from LOC Z39.50, I deleted it. On the branch there is a web software error: "Can't call method "title" on an undefined value at /var/root-koha/koha-11944/cataloguing/addbiblio.pl line 826". On Sandbox 6 there is no error. In both cases the record is deleted.
  • On Circulation, if the currently selected library's name has diacritic characters, then when checking out to a patron with an outstanding fee and selecting "Make payment" [below "Attention"], the name of the current library is displayed incorrectly; by contrast, on Patrons, the "Fines" tab of the patron is OK [same koha/members/pay.pl]
  • On Catalog, on Checkout history, if the item has been checked out at a library whose code is "実験社", this library code is displayed incorrectly in the "Checked out from" column; by contrast, a "μμμ" library code is displayed correctly
  • On Circulation, if the currently selected library's name has diacritic characters, then when checking out, after searching twice for a partial name and getting "No patron matched ...", the name of the current library is displayed incorrectly
  • On Circulation, if the currently selected library's name has diacritic characters, then when checking out, after selecting a patron among a few [found by "partial name"], the name of the current library is displayed incorrectly
  • Search history (caused by the patch "Bug 11944: Facets and resort") (Jonathan: It should be fixed with the last patch set, could you confirm?)
  • Creating a new budget with diacritics in the budget name, Bug 12650. This bug is NOT connected with 11944; it is resolved with Perl 5.18
  • When you send an e-mail and the name of the list has a special character, the subject in TB has problems. The problem is also present in main, so it is NOT connected with 11944
  • Non-Latin characters when printing a basket group, Bug 12832. The problem is also present in main, so it is NOT connected with 11944
  • Insert a suggestion, see Bug 12664
  • Authority: To test.
  • Serials. When using Greek characters in numbering patterns and frequencies, there is a problem in the search results, Bug 12640
  • In lists, abort in send list
  • In Serials, the numbering pattern's name and label
  • Details of a new UniMARC authority
  • Fund name X in the "Already Receipt" orders' table, at "Subtotal for X"
  • Authentication
  • Facets
  • import records using the "Stage MARC records for import" tool

Tested successfully

  • Patron insert: OK
  • Patron search: OK
  • Patron print slip: OK
  • Patron print summary: OK
  • Patron print quick slip : OK
  • Patron - Child: OK
  • Patron - Extra attributes: OK [However -- See Bug 12638 to use UTF-8 in code name of patron attribute]
  • Patron: syspref 'alphabet' for the list of initial letters: OK
  • ACQ, bug 10180 resolved by fixes of 11944
  • Authorized values: OK
  • Tool - Patron list: OK [However -- See Bug 12637]
  • Tools -> Tags: OK (tested OPAC entry and moderation)
  • Tools -> Patron comments: OK (tested OPAC entry and moderation)
  • Tools - Stage MARC records for import : OK
  • Tools - Staged MARC record management : OK
  • Tools - Export data : OK
  • Reports - extract data from DB: OK
  • Reports - Use the authorized values list: OK
  • Administration - Z39.50 server administration (and use, via cataloging) OK
  • Administration - Circulation and Fine rules - OK (tested patron category, item type, and library name)
  • Serial - manage frequencies: OK
  • Serial - manage numbering patterns: OK
  • Serial - Check expiration (and Renew): OK
  • Serial - Claims: OK
  • Serial - Create/Update a subscription: OK
  • Serial - search and filter: OK
  • ACQ - Vendor details: OK
  • ACQ - Basket data and CSV export: OK
  • ACQ - Basket group Create/Update: OK
  • ACQ - Invoices: OK
  • ACQ - Receive / Transfer : OK
  • ACQ - Late Orders: OK
  • ACQ - Search results: OK
  • ACQ - Basket group printing: supports only Latin diacritics; the bug is also present on main, see bug 12832.
  • List and Cart: OK
  • OPAC: OK
  • Intranet Search: OK
  • CAT - Biblio: OK
  • CAT - Biblio framework: OK
  • CAT - Z39.50: OK
  • CIRC - Checkout, checkin, renewing, holds, Offline (13.1) : OK
  • AUTH : OK

Help to test the patch

  • Sandboxes 6 (MARC21) and 16 (UNIMARC) are set up for testing
  • Problems can be noted on Bug 11944 or on this page under 'Known bugs'. If you tested something and it worked nicely, you can note that under 'Tested successfully'.
  • The most current patch set can be found in BibLibre's repository, http://git.biblibre.com/biblibre/kohac.git
  • The branch is ft/bug_11944
  • Fetching the branch takes a rather complex sequence of steps, because the BibLibre server has many branches:

- git init
- git remote add -t ft/bug_11944 -m ft/bug_11944 koha-biblibre http://git.biblibre.com/biblibre/kohac.git
- [At this point a plain git fetch -p -f koha-biblibre ends with a fatal error, so instead run:]
- git fetch -p -f --depth=1 koha-biblibre
- git branch --track bug_11944 koha-biblibre/ft/bug_11944
- git reset --hard main
- git checkout bug_11944 : OK