Koha::SearchEngine::Elasticsearch v2

From Koha Wiki

Koha::SearchEngine::Elasticsearch

Status: unknown
Sponsored by: Bulac, Frantiq
Developed by: Frédéric Demians
Expected for: 2020-05-11
Bug number: Bug 25453
Work in progress repository:
Description: New Search Engine for Elasticsearch


Abstract

I, Frédéric Demians, am planning to improve the Koha Elasticsearch (ES) implementation. I will write a new search engine named Koha::SearchEngine::ES. The way we currently deal with ES configuration, and its implementation in Koha, is complex, error-prone, without feedback, and somewhat obscure, to such an extent that using the full potential of ES and adding new functionality is now out of reach.

My goal is to write an alternative search engine with these objectives:

  • Improve searchability of Koha Catalogs
  • Code refactoring and simplification
  • No regression
  • Index authority non-heading forms
  • Index ISNI (http://isni.org) authors via web services, or IdRef (https://www.idref.fr) for French libraries.
  • Index full text of documents linked via 856 fields.
  • Simplify customization of ES indexes by giving feedback to the Koha administrator

How, when, who

At the beginning of the Koha 20.11 development cycle (June 2020), I will submit a patch implementing the new ES search engine. It will include an upgrade process that automatically converts the legacy ES configuration to the new configuration.

From that point on, I will ask for feedback and provide (1) help with testing the development and (2) quick fixes for any regression. The following Koha developers have actively participated in ES development (add your name to the list if I forgot you):

  • Robin Sheat: Initial implementation
  • Ere Maijala: Performance optimization; multi-process indexing; alt script indexing
  • David Gustafsson: MARC non-filing characters; multi-process indexing; OPAC/staff restrictions
  • Nick Clemens: Authority matching
  • Olli-Antti Kivilahti
  • Alex Arnaud
  • Jonathan Druart

Centralized JSON configuration

Rather than splitting the ES configuration across koha-conf.xml, several YAML files, and 3 tables, all aspects of the ES configuration will be centralized in a single JSON data structure. This makes it quick and easy to understand what the fields contain, how they are populated, and how they can be used in searches. Such a data structure is extensible by nature, unlike the current 3-table configuration, which requires a lot of coding just to add a new parameter. Consider, for example, this title field definition:

"title" : {
  "opac": 1,
  "staff": 1,
  "weight": 200,    
  "source" : [
    {
      "map" : "245a",
      "index" : ["search","sort","suggestible"]
    },
    {
      "map" : ["246","247","490a","505t","700t","710t","711t"],
      "index" : ["search","suggestible"]
    },
    {
      "map" : ["130","210","211","212","214","222","240","730","740","780","785"]
    }
  ]
},

It means that 3 groups of MARC21 fields are used to populate the title ES field. All of these MARC fields are searchable. Browsing (suggestible) by title shows 245a, 246, 247, 490a, 505t, 700t, 710t, and 711t, but not 130, 210, etc. Sorting by title uses only the content of 245a.
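The reading of this definition can be checked mechanically. The following Python sketch is only an illustration, not part of the planned Perl implementation: the function name `maps_by_index` is hypothetical, and it relies on the rule that a source without an index parameter feeds the default search index.

```python
# Illustrative sketch only; maps_by_index is a hypothetical helper name.
# The structure mirrors the "title" definition shown above.
title = {
    "opac": 1,
    "staff": 1,
    "weight": 200,
    "source": [
        {"map": "245a", "index": ["search", "sort", "suggestible"]},
        {"map": ["246", "247", "490a", "505t", "700t", "710t", "711t"],
         "index": ["search", "suggestible"]},
        {"map": ["130", "210", "211", "212", "214", "222", "240",
                 "730", "740", "780", "785"]},
    ],
}


def maps_by_index(field):
    """Return {index name: [map expressions]} for one field definition."""
    result = {}
    sources = field["source"]
    if isinstance(sources, dict):       # a single source may be a bare object
        sources = [sources]
    for source in sources:
        maps = source["map"]
        if isinstance(maps, str):       # "map" is a string or a list of strings
            maps = [maps]
        # A source without "index" goes to the default "search" index.
        for index in source.get("index", ["search"]):
            result.setdefault(index, []).extend(maps)
    return result


by_index = maps_by_index(title)
print(by_index["sort"])  # ['245a']
```

Grouping by index rather than by source makes it easy to answer the question an administrator actually asks: which MARC fields end up in the sort, facet, or suggestible index.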

Discussion with clients: Today, librarians who administer Koha have the false impression that they can tweak their ES configuration. They add new ES fields and link them to MARC fields. In reality, they cannot see the result of their experiments without help from their vendor. There is no feedback on how ES fields are populated: you cannot see what the ES fields will look like after modifying the configuration.

For a Koha vendor, the JSON configuration will facilitate discussions with clients about customizing searches in Koha. Currently, when someone wants to add a MARC field to an index, or create a new index, you have to _see_ the client's exact configuration to understand what's going on. That means accessing the Koha staff interface and then inspecting the Elasticsearch engine configuration, where indexed fields are split across several places. With a JSON-based configuration, you can just send a snippet of JSON containing a working field configuration. The client can compare that snippet with their own configuration and tweak it. You can say, for example: to make the Availability filter work on the OPAC, you need the `notforloan` field defined like this:

"notforloan" : {
  "type" : "number",
  "source" : { "map" : "9527" }
},

Plugins

The way ES fields are populated from MARC fields will be driven by several plugins. Currently, the process of extracting data from Koha records lives in one place, Koha::SearchEngine::Elasticsearch. The marc_records_to_documents method does all the work, with many special cases for, say, MARC 880 fields, ISBN expansion, etc. The new Elasticsearch implementation will provide a plugin mechanism to extract data from records, and even to get data from elsewhere (the semantic web).

Several standard plugins are provided to extract data the way the legacy search engine did. It is easy to add new plugins to accommodate specific needs. Say, for example, that I have an external thesaurus or taxonomy: I can develop a plugin that accesses this vocabulary and sends related terms to ES.

Default

The default plugin, Koha::SearchEngine::Plugin, copies fields from the MARC record into ES fields. You can specify several sources, with various parameters for each source. For example:

"author": {
  "type": "string",
  "source": [
    {
      "index": ["search","facet","suggestible"],
      "map": ["100a","110a","111a","700a"]
    },
    {
      "map": ["245c"]
    }
  ]
},

A source is a set of several MARC fields and subfields, and target indexes.

The map parameter is a string, or an array of strings, specifying the MARC fields and content to gather. The syntax is simple:

  • 245: get all subfields of the 245 field
  • 245a: get the $a subfield of 245
  • 245abcd: from the 245 field, group several subfields (a, b, c and d), separated by a space
  • 105a_/10-15: from 105$a, get the substring from position 10 to position 15. The first position is 0.
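To make the map syntax concrete, here is a small Python sketch of how such expressions could be parsed and applied. It is an assumption-laden illustration, not the plugin's code: `parse_map` and `extract` are hypothetical names, the end position of a substring is assumed to be inclusive, and subfields are assumed to be joined in record order.

```python
import re

# tag (3 digits), optional subfield codes, optional "_/start-end" substring
MAP_RE = re.compile(r"^(\d{3})([a-z0-9]*)(?:_/(\d+)-(\d+))?$")


def parse_map(expr):
    """Split a map expression into tag, subfield codes and substring range."""
    m = MAP_RE.match(expr)
    if m is None:
        raise ValueError(f"unrecognized map expression: {expr!r}")
    tag, codes, start, end = m.groups()
    return {
        "tag": tag,
        "subfields": list(codes),       # empty list means "all subfields"
        "range": (int(start), int(end)) if start is not None else None,
    }


def extract(expr, field):
    """Apply a map expression to one field occurrence.

    `field` is a list of (subfield code, value) pairs that is assumed
    to belong to the tag named in the expression.
    """
    spec = parse_map(expr)
    wanted = spec["subfields"]
    text = " ".join(v for code, v in field if not wanted or code in wanted)
    if spec["range"] is not None:
        start, end = spec["range"]
        text = text[start:end + 1]      # assumption: end position is inclusive
    return text


field_245 = [("a", "The Greek Museums /"), ("c", "Manolis Andronicos.")]
print(extract("245a", field_245))   # The Greek Museums /
print(extract("245", field_245))    # The Greek Museums / Manolis Andronicos.
```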

An index is a kind of ES field that provides a specific functionality. There are 4 ES field indexes:

  • search: Generic keyword and phrase search. Default index when no index parameter is specified for a source.
  • facet: Field used to generate facets in Koha OPAC/staff searches.
  • sort: For sorting. It is an automatically truncated field.
  • suggestible: For browsing.

ISBN

This plugin expands the content of ISBN MARC fields into multiple versions of the ISBN. For example, take this configuration:

"isbn": {
  "source": {
    "plugin": {
      "ISBN": {
        "map":["020az"]
      }
    }
   }
},

With this configuration, the plugin will:

  • Get the $a and $z subfields from the 020 MARC field.
  • Expand each value into different forms: ISBN-10 and ISBN-13.
  • Store the expanded values in the ES field `isbn`.

If you have the MARC field:

020 $a 0837170435

You will get this multi-valued ES field:

"isbn" : [
   "9780837170435",
   "978-0-8371-7043-5",
   "0837170435",
   "0-8371-7043-5"
],
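The conversion between the two unhyphenated forms relies only on the standard ISBN check-digit rules, which the Python sketch below illustrates. It is not the plugin's code: `expand_isbn` is a hypothetical name, input check digits are not validated, and the hyphenated forms are left out because producing them requires the ISBN registration-group range tables.

```python
def isbn10_check(first9):
    """Check character for the first 9 digits of an ISBN-10 (weights 10..2)."""
    total = sum((10 - i) * int(d) for i, d in enumerate(first9))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)


def isbn13_check(first12):
    """Check digit for the first 12 digits of an ISBN-13 (weights 1,3,1,3...)."""
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return str((10 - total % 10) % 10)


def expand_isbn(raw):
    """Return the unhyphenated ISBN-13 and ISBN-10 forms of `raw`."""
    digits = raw.replace("-", "").replace(" ", "").upper()
    if len(digits) == 10:
        core = digits[:9]
        return ["978" + core + isbn13_check("978" + core), digits]
    if len(digits) == 13:
        forms = [digits]
        if digits.startswith("978"):    # only 978 ISBNs have an ISBN-10 form
            core = digits[3:12]
            forms.append(core + isbn10_check(core))
        return forms
    return []                           # not an ISBN-10 or ISBN-13


print(expand_isbn("0837170435"))  # ['9780837170435', '0837170435']
```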

NonFillChar

MARC21 biblio records can use indicators 1 or 2 in order to specify the number of characters to skip in some fields for sorting. For example, in this Title field:

245  4 $a The Greek Museums / $c Manolis Andronicos.

Since indicator 2 contains the value 4, the title must be sorted as: "Greek Museums".

The NonFillChar plugin specifies an indicator to use when extracting data from a MARC field. The result is sent to the sort index (i.e. the ES __sort field). See this configuration for a title field:

"title": {
  "source": [
    {
      "map": "245abfgknpv",
      "index": ["search","suggestible"]
    },
    {
      "map": ["246","247","490a","505t","700t","710t","711t"],
      "index": ["search","suggestible"]
    },
    {
      "map": ["130","210","211","212","214","222","240","730","740","780","785"]
    },
    {
      "plugin": {
        "NonFillChar":[
           {
             "map":["245a"],
             "index":["sort"],
             "indicator": 2
           }
         ]
      }
    }
  ]
},
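The effect of the indicator on the sort value can be sketched in a few lines of Python. This is only an illustration of the non-filing rule described above; `sort_value` is a hypothetical name, and a blank or non-numeric indicator is assumed to mean that nothing is skipped.

```python
def sort_value(value, indicator):
    """Return `value` without its leading non-filing characters.

    `indicator` is the MARC indicator content, e.g. "4"; a blank or
    non-numeric indicator is treated as 0 (assumption).
    """
    text = str(indicator).strip()
    skip = int(text) if text.isdigit() else 0
    return value[skip:]


print(sort_value("The Greek Museums", "4"))  # Greek Museums
```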


MatchHeading

Implements a specialized match value for the Koha authority matching module.

Authority

In Koha, biblio records can be linked to authority records. There are authority records for Personal Name, Subject, etc. An authority record contains additional information not found in the biblio record. The authority search can find this information, but it is a 2-step search: 1) you search by authority, 2) then you gain access to the biblio records. OPAC users don't understand how to conduct searches like that, and it doesn't allow multi-criteria searches, by author and title for example.

Let's see an example. You have a biblio record like this:

200  4 $a The Greek Museums $f Manolis Andronicos
700    $a Andrónikos $b Manólīs $8 gre $f 1919-1992 $9 1000128202

The author Andrónikos, Manólīs is linked to the authority 1000128202. This authority record contains:

200 $a Andrónikos $b Manólīs $8 gre $f 1919-1992
300 $a Professeur d'art et d'archéologie classique à l'Université
       de Thessalonique (en 1992). - Spécialiste des fouilles de Vergina
       en Macédoine. - Membre de l'Académie des Lincei, Rome
400 $a Andronikos $b Manilos
400 $a Andronicos $b Manolis
400 $a Andronikos $b Manolis
400 $a Andronikos $b Manōlēs
400 $a Andronicos $b Manolis
700 $a Ανδρόνικος $b Μανόλης

You have see and see-also forms of the author name in the 400 and 700 fields. You also have a note in 300. If you search your catalog with the author name Andrónikos Manólīs, you will find the biblio record because that form of the name is present in the biblio record. But if you search for Andronikos Manilos or Ανδρόνικος Μανόλης, you won't find the record, because those forms are present neither in the biblio record nor in the ES field. And what about searching on the authority note field?

The Authority plugin populates ES fields with information extracted from the linked authorities. For example, take this configuration:

"author":{
  "source":[
    {
      "plugin":{
        "Authority":[
           {
             "map":["700ab","701ab","702ab"],
             "heading":"200ab",
             "index":["search","facet","suggestible"],
             "see":[
               {
                 "map":["400ab","700ab"],
                 "index":["search"]
               },
               {
                 "map":["300a","340a"],
                 "target":"note",
                 "index": ["search"]
               }
             ]
           },
           {
             "map":["710ab","711ab","712ab"],
             "heading":"210ab",
             "index":["search","facet"],
             "see":[
               {
                 "map":["410ab","710ab"],
                 "index":["search","facet"]
               },
               {
                 "map":["300a","340a"],
                 "target":"note",
                 "index":["search"]
               }
             ]
           }
         ]
      }
   },
   {
      "map":["200f","200g"]
   }
  ]
},

Without the Authority plugin the ES author field contains:

"author" : [
   "Manolis Andronicos",
   "Andrónikos Manólīs"
],

Those 2 values come from the 200 and 700 fields of the biblio record. With the Authority plugin, we have:

"author" : [
   "Manolis Andronicos",
   "Andronikos Manilos",
   "Andronikos Manōlēs",
   "Andronicos Manolis",
   "Ανδρόνικος Μανόλης",
   "Andrónikos Manólīs",
   "Andronikos Manolis"
],
"note" : [
   "Professeur d'art et d'archéologie classique à l'Université de Thessalonique (en 1992). - Spécialiste des fouilles de Vergina en Macédoine. - Membre de l'Académie des Lincei, Rome"
],

From the authority record, the 400 and 700 fields have been copied to the author field. The note field contains the authority 300 field. With this plugin you can choose how the extracted MARC fields are used. Here you can browse (suggestible) the author field with data coming from the 700, 701, and 702 fields (personal name), but not with data coming from 710, 711, and 712 (corporate name).
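The enrichment step can be pictured with the following Python sketch. It is a simplified illustration under stated assumptions, not the plugin itself: the in-memory `AUTHORITIES` table stands in for a real authority lookup, `values` and `expand_authority` are hypothetical names, the rule is reduced to its heading and see entries, and the index lists are ignored.

```python
# Stand-in for the authority database: id -> list of (tag, subfields).
AUTHORITIES = {
    "1000128202": [
        ("200", {"a": "Andrónikos", "b": "Manólīs"}),
        ("300", {"a": "Professeur d'art et d'archéologie classique"}),
        ("400", {"a": "Andronikos", "b": "Manilos"}),
        ("400", {"a": "Andronicos", "b": "Manolis"}),
        ("700", {"a": "Ανδρόνικος", "b": "Μανόλης"}),
    ],
}

# Reduced version of the first Authority rule of the configuration above.
RULE = {
    "map": ["700ab", "701ab", "702ab"],
    "heading": "200ab",
    "see": [
        {"map": ["400ab", "700ab"]},
        {"map": ["300a", "340a"], "target": "note"},
    ],
}


def values(record, expr):
    """Values of a map expression such as "400ab" in an authority record."""
    tag, codes = expr[:3], expr[3:]
    out = []
    for field_tag, subfields in record:
        if field_tag != tag:
            continue
        parts = [subfields[c] for c in codes if c in subfields]
        if parts:
            out.append(" ".join(parts))
    return out


def expand_authority(authid, rule, es_field, document):
    """Add heading, see forms and notes of one authority to `document`."""
    record = AUTHORITIES.get(authid)
    if record is None:                  # unlinked or unknown authority
        return document
    document.setdefault(es_field, []).extend(values(record, rule["heading"]))
    for see in rule["see"]:
        target = see.get("target", es_field)   # default: the field itself
        for expr in see["map"]:
            document.setdefault(target, []).extend(values(record, expr))
    return document


# The authority id comes from subfield $9 of the biblio 700 field.
doc = expand_authority("1000128202", RULE, "author", {})
print(doc["author"])
print(doc["note"])
```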

AGR - AlternateGraphicRepresentation

MARC 21 biblio records can contain 880 fields, i.e. alternate graphic representations (AGR) of other fields. For example:

100 1  $6 880-01 $a Koukoutsakē, Aggelikē K.
245 12 $6 880-02 $a O Kōstas Montēs pezografos / $c Aggelikē K. Koukoutsakē.
880 1  $6 100-01/ $a Κουκουτσάκη, Αγγελική Κ.
880 12 $6 245-02/(S $a Ο Κώστας Μόντης πεζογράφος / $c Αγγελική Κ. Κουκουτσάκη.

It may be interesting to put those 880 MARC fields into the appropriate ES fields. Some points need to be considered.

  • Some libraries get biblio records via Z39.50, but don't want to see or use 880 fields at all. They don't want to see Greek or Arabic characters in their OPAC facets or when browsing the catalog by title.
  • Not all AGRs have to be sent to ES. For example, you may want AGRs for titles but not for authors.
  • Depending on the MARC source, you may want ES fields to be searchable, facetable, sortable, or browsable. For example, you want an author facet for the 100 field but not for 110. Or you don't want to pollute author facets with AGRs.

The AGR plugin parameters let you specify precisely which 880 fields to take into account and which ES fields to put them in. For example:

"alternate-graphic-representation": {
  "source": {
    "plugin": {
      "AGR": [
        {
          "map": ["245abfgknpv","246abfgknpv"],
          "target": ["title"],
          "index": ["search"]
        },
        {
          "map": ["100a","110a","111a","700a"],
          "target": ["author"],
          "index": ["search","suggestible"]
        }
      ]
    }
   }
}

Here, 880 fields linked to 245 and 246 are copied into the title ES field, in the search index only. And 880 fields linked to 100, 110, 111, and 700 are copied into the author ES field (search and suggestible).
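The routing of an 880 field depends on the tag named at the start of its $6 linkage subfield. The Python sketch below illustrates that idea with the two rules of the configuration above; it is a sketch under assumptions, with `route_880` as a hypothetical name and a field represented as a list of (code, value) pairs.

```python
import re

# $6 in an 880 field starts with the linked tag and occurrence, e.g. "245-02/(S"
LINK_RE = re.compile(r"^(\d{3})-(\d{2})")

RULES = [
    {"map": ["245abfgknpv", "246abfgknpv"],
     "target": ["title"], "index": ["search"]},
    {"map": ["100a", "110a", "111a", "700a"],
     "target": ["author"], "index": ["search", "suggestible"]},
]


def route_880(subfields, rules=RULES):
    """Return (targets, indexes, text) for one 880 field, or None.

    None means that no rule covers the tag the 880 field is linked to,
    so the field is not sent to ES.
    """
    link = next((v for code, v in subfields if code == "6"), "")
    m = LINK_RE.match(link)
    if m is None:
        return None
    tag = m.group(1)
    for rule in rules:
        for expr in rule["map"]:
            if expr[:3] != tag:
                continue
            wanted = expr[3:]           # empty means all subfields (except $6)
            text = " ".join(
                v for code, v in subfields
                if code != "6" and (not wanted or code in wanted)
            )
            return rule["target"], rule["index"], text
    return None


author_880 = [("6", "100-01/"), ("a", "Κουκουτσάκη, Αγγελική Κ.")]
print(route_880(author_880))
```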

koha-es.pl

A new script, koha-es.pl, replaces rebuild_elasticsearch.pl. It offers similar functionality, with no regression and performance at the same level. Multi-process indexing is reimplemented. There are some new functions to help Koha administrators work on their ES configuration. Here is the perldoc:

Name:
    koha-es.pl - Manipulates biblio/authority Elasticsearch indexes

Usage:
    Manipulates Koha ElasticSearch indexes. There is one index for biblio
    records: biblios, and another one for authority records: authorities.
    The command syntax is: "koha-es.pl command param1 param2 param3 ...".
    Here are the available commands:

  index:
    The index command is used to populate a specific index. Syntax has this
    form: "koha-es.pl index biblios|authorities param1 param2 ...". For
    example: "koha-es.pl index biblios reset all"

    biblios all
        Index all biblio records.

    authorities all
        Index all authority records.

    biblios all noverbose
        Index silently all biblio records.

    biblios reset all
        Reset ES 'biblios' index: delete, then recreate with the current
        mapping. Index all biblio records.

    biblios reset all childs 4
        Reset ES biblios index: delete, then recreate with the current
        mapping. Index all biblio records. Create 4 childs (processes) in
        order to speed up the processing.

    biblios all commit 10000
        Index all biblio records. Commit records to ES per batch of 10000
        records. By default 5000.

    biblios 1-100,2000-3000
        Index biblio records with biblionumber in interval [1-100] and
        [2000-3000].

  document:
    Convert Koha biblio/authority records into the document sent to
    ElasticSearch for indexing. This way it's possible to 'see' the effect
    of modifying Koha ES configuration.

    "koha-es.pl document biblios 100-200,1000-1020"

  config:
    "koha-es.pl config" manipulates Koha ES configuration.

    config show
        Display the ES configuration, ie the content of ESConfig system
        preference.

    config check
        Check the JSON ES configuration.

    config mapping biblios|authorities
        Show the ES mapping derived from Koha configuration. It may help to
        diagnose ES malfunctions.

    config fromlegacy
        Generate ES configuration from legacy Koha ElasticSearch
        configuration which was split in 3 yaml configuration files and 3
        tables (search_marc_to_field, search_marc_map, search_field).

    config fromlegacy save
        Generate ES configuration from legacy Koha ElasticSearch
        configuration and save it in ESConfig system preference.