ICU Chains Library

ICU Chains Configuration and Customization

After completing the steps in ICU_chains_configuration, you may need to modify words-icu.xml or phrases-icu.xml to suit the grammar of the language being searched, for example to strip a definite article prefix or to fold accented letters.

For the chain file format and the ICU rule syntax, see http://www.indexdata.com/zebra/doc/icuchain-files.html, http://www.indexdata.com/yaz/doc/yaz-icu.html, http://userguide.icu-project.org/transforms/general/rules and http://www.unicode.org/cldr/charts/latest/transforms/index.html.
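The chains on this page share a common skeleton. The annotated sketch below is not taken from a shipped file: it restates the steps that recur in the listings under Languages, with comments describing what each step appears to do. The locale "xx" is a placeholder, and the position suggested for language-specific rules simply follows the Arabic and Kurdish chains.

```xml
<icu_chain locale="xx"> <!-- "xx" is a placeholder locale -->
  <!-- replace apostrophes with a space -->
  <transliterate rule="\'>\ "/>
  <!-- drop a hyphen that directly follows a number -->
  <transliterate rule="[:Number:] { '-' > '' "/>
  <!-- remove control characters -->
  <transform rule="[:Control:] Any-Remove"/>
  <!-- split the input into tokens using the "l" (line break) rule -->
  <tokenize rule="l"/>
  <!-- remove whitespace and punctuation from each token -->
  <transform rule="[[:WhiteSpace:][:Punctuation:]] Remove"/>
  <!-- decompose, strip combining marks, then recompose -->
  <transform rule="NFD"/>
  <transform rule="[:Nonspacing Mark:] Remove"/>
  <transform rule="NFC"/>
  <!-- language-specific transliterate rules would go here -->
  <display/>
  <!-- lowercase the index term -->
  <casemap rule="l"/>
</icu_chain>
```

Any change to a chain should be checked against real records before it is used in production, since the order of the steps affects the result.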

Languages

Arabic

  • Developer: Yasserkad
  • File: words-icu.xml
  • Language: Arabic
  • Locale: ar
  • Status: Complete
  <icu_chain locale="ar">
    <transliterate rule="\'>\ "/>
    <transliterate rule="[:Number:] { '-' > '' "/>
    <transform rule="[:Control:] Any-Remove"/>
    <tokenize rule="l"/>
    <transform rule="[[:WhiteSpace:][:Punctuation:]] Remove"/>
    <transform rule="NFD"/>
    <transform rule="[:Nonspacing Mark:] Remove"/>
    <transform rule="NFC"/>
    <transliterate rule="{ الا > ا "/>
    <transliterate rule="{ الأ > أ "/>
    <transliterate rule="{ الإ > إ "/>
    <transliterate rule="{ الآ > آ "/>
    <transliterate rule="{ الب > ب "/>
    <transliterate rule="{ الت > ت "/>
    <transliterate rule="{ الث > ث "/>
    <transliterate rule="{ الج > ج "/>
    <transliterate rule="{ الح > ح "/>
    <transliterate rule="{ الخ > خ "/>
    <transliterate rule="{ الد > د "/>
    <transliterate rule="{ الذ > ذ "/>
    <transliterate rule="{ الر > ر "/>
    <transliterate rule="{ الز > ز "/>
    <transliterate rule="{ الس > س "/>
    <transliterate rule="{ الش > ش "/>
    <transliterate rule="{ الص > ص "/>
    <transliterate rule="{ الض > ض "/>
    <transliterate rule="{ الط > ط "/>
    <transliterate rule="{ الظ > ظ "/>
    <transliterate rule="{ الع > ع "/>
    <transliterate rule="{ الغ > غ "/>
    <transliterate rule="{ الف > ف "/>
    <transliterate rule="{ الق > ق "/>
    <transliterate rule="{ الك > ك "/>
    <transliterate rule="{ الل > ل "/>
    <transliterate rule="{ الم > م "/>
    <transliterate rule="{ الن > ن "/>
    <transliterate rule="{ اله > ه "/>
    <transliterate rule="{ الو > و "/>
    <transliterate rule="{ الي > ي "/>
    <display/>
    <casemap rule="l"/>
  </icu_chain>

Chinese / zh_TW

    <icu_chain locale="zh_TW.UTF-8">
      <transliterate rule="\'>\ "/>
      <transliterate rule="[:Number:] { '-' > '' "/>
      <transform rule="[:Control:] Any-Remove"/>
      <tokenize rule="l"/>
      <transform rule="[[:WhiteSpace:][:Punctuation:]] Remove"/>
      <transform rule="NFD"/>
      <transform rule="[:Nonspacing Mark:] Remove"/>
      <transform rule="NFC"/>
      <display/>
      <casemap rule="l"/>
    </icu_chain>

Kurdish (کوردی)

  • Developer: D.Roshani
  • File: words-icu.xml
  • Language: Kurdish (کوردی)
  • Locale: ku
  • Status: Untested
  <icu_chain locale="ku">
    <transliterate rule="\'>\ "/>
    <transliterate rule="[:Number:] { '-' > '' "/>
    <transform rule="[:Control:] Any-Remove"/>
    <tokenize rule="l"/>
    <transform rule="[[:WhiteSpace:][:Punctuation:]] Remove"/>
    <transform rule="NFD"/>
    <transform rule="[:Nonspacing Mark:] Remove"/>
    <transform rule="NFC"/>
    <transliterate rule="{ ئ > ئـ "/>
    <transliterate rule="{ ئا > ا "/>
    <transliterate rule="{ بێ > ب "/>
    <transliterate rule="{ پێ > پ "/>
    <transliterate rule="{ تێ > ت "/>
    <transliterate rule="{ جێ > ج "/>
    <transliterate rule="{ چێ > چ "/>
    <transliterate rule="{ حێ > ح "/>
    <transliterate rule="{ خێ > خ "/>
    <transliterate rule="{ دال > د "/>
    <transliterate rule="{ رێ > ر "/>
    <transliterate rule="{ ڕێ > ڕ "/>
    <transliterate rule="{ زێ > ز "/>
    <transliterate rule="{ ژێ > ژ "/>
    <transliterate rule="{ سێ > س "/>
    <transliterate rule="{ شێ > ش "/>
    <transliterate rule="{ عین > ع "/>
    <transliterate rule="{ غین > غ "/>
    <transliterate rule="{ فێ > ف "/>
    <transliterate rule="{ ڤێ > ڤ "/>
    <transliterate rule="{ قێ > ق "/>
    <transliterate rule="{ کێ > ک "/>
    <transliterate rule="{ گێ > گ "/>
    <transliterate rule="{ لێ > ل "/>
    <transliterate rule="{ ڵێ > ڵ "/>
    <transliterate rule="{ لام > م "/>
    <transliterate rule="{ نوون > ن "/>
    <transliterate rule="{ هە > ھ "/>
    <transliterate rule="{ ئه > ە "/>
    <transliterate rule="{ ئو > و "/>
    <transliterate rule="{ ئۆ > ۆ "/>
    <transliterate rule="{ ئوو > وو "/>
    <transliterate rule="{ ئی > ی "/>
    <transliterate rule="{ ئێ > ێ "/>
    <display/>
    <casemap rule="l"/>
  </icu_chain>

Polish

  • Developer: Fsomers
  • File: words-icu.xml
  • Language: Polish
  • Locale: pl
  • Status: Incomplete
  <transliterate rule="{ ą > a "/>
  <transliterate rule="{ Ą > a "/>
  <transliterate rule="{ ć > c "/>
  <transliterate rule="{ Ć > c "/>
  <transliterate rule="{ ę > e "/>
  <transliterate rule="{ Ę > e "/>
  <transliterate rule="{ ł > l "/>
  <transliterate rule="{ Ł > l "/>
  <transliterate rule="{ ń > n "/>
  <transliterate rule="{ Ń > n "/>
  <transliterate rule="{ ó > o "/>
  <transliterate rule="{ Ó > o "/>
  <transliterate rule="{ ś > s "/>
  <transliterate rule="{ Ś > s "/>
  <transliterate rule="{ ź > z "/>
  <transliterate rule="{ Ź > z "/>
  <transliterate rule="{ ż > z "/>
  <transliterate rule="{ Ż > z "/>

Note: the rules above are only a fragment. A complete <icu_chain locale="pl"> element has not yet been contributed for Polish.
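As a starting point, the Polish rules could be wrapped in the same skeleton the Arabic chain uses. The following is an untested sketch under that assumption, not a contributed or verified configuration; the placement of the rules after the NFC step is copied from the Arabic listing.

```xml
<icu_chain locale="pl">
  <transliterate rule="\'>\ "/>
  <transliterate rule="[:Number:] { '-' > '' "/>
  <transform rule="[:Control:] Any-Remove"/>
  <tokenize rule="l"/>
  <transform rule="[[:WhiteSpace:][:Punctuation:]] Remove"/>
  <transform rule="NFD"/>
  <transform rule="[:Nonspacing Mark:] Remove"/>
  <transform rule="NFC"/>
  <!-- Polish rules from the fragment above, unchanged -->
  <transliterate rule="{ ą > a "/>
  <transliterate rule="{ Ą > a "/>
  <transliterate rule="{ ć > c "/>
  <transliterate rule="{ Ć > c "/>
  <transliterate rule="{ ę > e "/>
  <transliterate rule="{ Ę > e "/>
  <transliterate rule="{ ł > l "/>
  <transliterate rule="{ Ł > l "/>
  <transliterate rule="{ ń > n "/>
  <transliterate rule="{ Ń > n "/>
  <transliterate rule="{ ó > o "/>
  <transliterate rule="{ Ó > o "/>
  <transliterate rule="{ ś > s "/>
  <transliterate rule="{ Ś > s "/>
  <transliterate rule="{ ź > z "/>
  <transliterate rule="{ Ź > z "/>
  <transliterate rule="{ ż > z "/>
  <transliterate rule="{ Ż > z "/>
  <display/>
  <casemap rule="l"/>
</icu_chain>
```

In this layout the NFD and Nonspacing Mark steps already strip the combining accents from most of these letters, so the explicit rules matter mainly for ł and Ł, which have no canonical decomposition.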

Swedish

  • Developer: Gaetan Boisson, Fridolin Somers
  • File: words-icu.xml
  • Language: Swedish
  • Locale: sv-SE
  • Status: Untested
  • Notes:
    <icu_chain locale="sv-SE">
      <transform rule="[^åäöÅÄÖ] NFD"/><!-- do not strip diacritics from these characters -->
    </icu_chain>
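The Swedish listing shows only the modified step. A possible full chain, assuming the filtered NFD rule simply replaces the plain NFD step of the skeleton used by the other languages on this page, might look like the sketch below. It is untested, and it assumes the indexed text stores å, ä and ö as precomposed characters.

```xml
<icu_chain locale="sv-SE">
  <transliterate rule="\'>\ "/>
  <transliterate rule="[:Number:] { '-' > '' "/>
  <transform rule="[:Control:] Any-Remove"/>
  <tokenize rule="l"/>
  <transform rule="[[:WhiteSpace:][:Punctuation:]] Remove"/>
  <!-- decompose everything except the Swedish letters to keep -->
  <transform rule="[^åäöÅÄÖ] NFD"/>
  <!-- strip the combining marks exposed by the step above -->
  <transform rule="[:Nonspacing Mark:] Remove"/>
  <transform rule="NFC"/>
  <display/>
  <casemap rule="l"/>
</icu_chain>
```

With this arrangement a letter such as é would be folded to e, while å, ä and ö are left intact because they are never decomposed.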

Thai

    <icu_chain locale="th">
      <transliterate rule="\'>\ "/>
      <transliterate rule="[:Number:] { '-' > '' "/>
      <transform rule="[:Control:] Any-Remove"/>
      <tokenize rule="l"/>
      <transform rule="[[:WhiteSpace:][:Punctuation:]] Remove"/>
      <transform rule="NFD"/>
      <transform rule="[:Nonspacing Mark:] Remove"/>
      <transform rule="NFC"/>
      <display/>
      <casemap rule="l"/>
    </icu_chain>