Elasticsearch


Elasticsearch Support

This is a work in progress. Details are on Bug 12478. There is also an RFC-style document for high-level descriptions. A kanban-ish TODO list also exists.

We also have a page of technical details to help you start working on it.

Goals/Status

Essentially, the short-term goal is to allow Zebra to be turned off and have everything still work.

Functional Goals

Short term:

  • Replace OPAC biblio search with ES [done]
  • Replace OPAC authority search with ES [done]
  • Replace staff client biblio search with ES [done]
  • Replace staff authority search with ES [done]
  • Improve general integration (packages, realtime indexing, etc.)
  • Ensure that Zebra still works properly [in progress]

Medium term:

  • Add a UI that makes it possible to edit the mappings in the staff client [in progress]
  • Add a browse interface [done]

Long term:

  • Ingest data from other sources into ES so Koha can search it natively
  • Any other neat ideas that are possible by having a flexible search engine

Non-functional Goals

Medium term:

  • Make code release-ready

Basic 'Get Started' Steps

Install elasticsearch

kohadevbox

If you are familiar with kohadevbox, you just need to run

 $ KOHA_ELASTICSEARCH=1 vagrant up

Manual install (main)

Have a Debian Jessie server.

Use the Koha community unstable repository, the Koha nightly unstable repository, and jessie-backports (for openjdk-8). Add the following to /etc/apt/sources.list:

# jessie-backports
deb http://ftp.debian.org/debian/ jessie-backports main
# koha unstable
deb http://debian.koha-community.org/koha unstable main
# koha nightly unstable
deb https://apt.abunchofthings.net/koha-nightly unstable main


You will need HTTPS transport support for apt:

$ sudo apt install apt-transport-https
$ sudo apt update


Install openjdk-8-jre-headless

$ sudo apt install -t jessie-backports openjdk-8-jre-headless


Install Elasticsearch 5.x. Short story:
Note: As of 20.05.00, 19.11.05 and 19.05.10 releases, we recommend Elasticsearch 6.x

$ wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
$ echo "deb https://artifacts.elastic.co/packages/5.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-5.x.list

Long story: https://www.elastic.co/guide/en/elasticsearch/reference/current/deb.html


Install the analysis-icu plugin

$ sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install analysis-icu


Install the koha-elasticsearch metapackage

$ sudo apt-get install koha-elasticsearch

If you get the following error during installation:

 Couldn't write '262144' to 'vm/max_map_count', ignoring: Permission denied

Try

$ export ES_SKIP_SET_KERNEL_PARAMETERS=true
$ apt-get install elasticsearch
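
The error above comes from Elasticsearch trying to raise the vm.max_map_count kernel parameter during installation. If you skipped that step, you can check and raise the value yourself; a minimal sketch (the 262144 threshold is the value from the error message):

```shell
# Elasticsearch needs vm.max_map_count >= 262144 (the value from the error above).
required=262144
current=$(cat /proc/sys/vm/max_map_count 2>/dev/null || echo 0)
if [ "$current" -lt "$required" ]; then
    echo "vm.max_map_count is $current; raise it with:"
    echo "  sudo sysctl -w vm.max_map_count=$required"
    echo "  echo 'vm.max_map_count=$required' | sudo tee -a /etc/sysctl.conf"
else
    echo "vm.max_map_count is $current; OK"
fi
```

The sysctl.conf entry makes the setting survive a reboot; the sysctl -w call applies it immediately.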

Add a configuration like this to your koha-conf.xml:

 <elasticsearch>
     <server>localhost:9200</server>      <!-- may be repeated to include all servers on your cluster -->
     <index_name>koha_robin</index_name>  <!-- should be unique amongst all the indices on your cluster. _biblios and _authorities will be appended. -->
 </elasticsearch>

Note: you can replace localhost with the hostname/IP of an external Elasticsearch server if you have one running elsewhere.
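
A quick way to verify the server address you put in koha-conf.xml is to query Elasticsearch directly. A sketch; ES_SERVER is a placeholder for the host:port from your own configuration:

```shell
# Check that the server configured in <elasticsearch>/<server> is reachable.
# ES_SERVER is a placeholder; use the value from your koha-conf.xml.
ES_SERVER=localhost:9200
if curl -s --max-time 5 "http://$ES_SERVER" >/dev/null; then
    echo "Elasticsearch is reachable at $ES_SERVER"
    # Once indexing has run, indices such as koha_robin_biblios and
    # koha_robin_authorities (for <index_name>koha_robin</index_name>) show up here:
    curl -s "http://$ES_SERVER/_cat/indices"
else
    echo "Elasticsearch is NOT reachable at $ES_SERVER"
fi
```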

Advanced Connection Settings

N.B. The following configuration options are only available from Koha version 19.05.

You may need to add other options e.g. if your Elasticsearch server is on a different network. Here are the available configuration keys and their default values:

 client: 5_0::Direct
 cxn_pool: Sniff
 request_timeout: 60

Example configuration for connecting to Elasticsearch 6.x server on another network:

 <elasticsearch>
     <server>some.other.network:9200</server>  <!-- may be repeated to include all servers on your cluster -->
     <index_name>koha_robin</index_name>       <!-- should be unique amongst all the indices on your cluster. _biblios and _authorities will be appended. -->
     <client>6_0::Direct</client>              <!-- Client version to use -->
     <cxn_pool>Static</cxn_pool>               <!-- Use Static connection pool (see https://metacpan.org/pod/Search::Elasticsearch#cxn_pool for more information) -->
 </elasticsearch>

Advanced Index Settings

By default, when you reset index mappings in Koha's administration or drop and recreate an index, the settings are read from the following files:

[intranetdir]/admin/searchengine/elasticsearch/index_config.yaml
[intranetdir]/admin/searchengine/elasticsearch/field_config.yaml
[intranetdir]/admin/searchengine/elasticsearch/mappings.yaml

If you'd like to use customized versions of these files, you can override any or all of the defaults by using the respective settings in koha-conf.xml (note that they reside outside the elasticsearch element):

 <elasticsearch>
     <server>localhost:9200</server>
     <index_name>koha_robin</index_name>
 </elasticsearch>
 <elasticsearch_index_config>/etc/koha/searchengine/elasticsearch/index_config.yaml</elasticsearch_index_config>
 <elasticsearch_field_config>/etc/koha/searchengine/elasticsearch/field_config.yaml</elasticsearch_field_config>
 <elasticsearch_index_mappings>/etc/koha/searchengine/elasticsearch/mappings.yaml</elasticsearch_index_mappings>
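
One way to set this up is to copy the defaults into /etc/koha and customize the copies there, matching the paths in the example above. A sketch, assuming a package install where the intranet scripts live under /usr/share/koha/intranet/cgi-bin (INTRANET_DIR is an assumption; adjust it for your setup) and run as root:

```shell
# Copy the default YAML files to the locations referenced in koha-conf.xml above,
# then customize the copies. INTRANET_DIR is an assumption; adjust for your install.
INTRANET_DIR=/usr/share/koha/intranet/cgi-bin
TARGET=/etc/koha/searchengine/elasticsearch
if [ "$(id -u)" -eq 0 ]; then
    mkdir -p "$TARGET"
    for f in index_config.yaml field_config.yaml mappings.yaml; do
        cp "$INTRANET_DIR/admin/searchengine/elasticsearch/$f" "$TARGET/" 2>/dev/null \
            || echo "could not copy $f (is Koha installed at $INTRANET_DIR?)"
    done
else
    echo "Re-run as root to copy the files into $TARGET"
fi
```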


Updating Elasticsearch

If Elasticsearch is updated to a new version, a corresponding analysis-icu plugin version will also need to be installed. Otherwise Elasticsearch may fail to start and an error such as the following will be logged in /var/log/elasticsearch/elasticsearch.log:

java.lang.IllegalArgumentException: plugin [analysis-icu] is incompatible with version [5.6.13]; was designed for version [5.6.12]

A new version of the plugin can be installed as follows:

$ sudo /usr/share/elasticsearch/bin/elasticsearch-plugin remove analysis-icu; sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install analysis-icu
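
To see what is currently installed before or after the upgrade, the plugin tool has a list subcommand. A sketch, assuming the standard Debian package path:

```shell
# List the plugins currently installed into this Elasticsearch node.
ES_PLUGIN=/usr/share/elasticsearch/bin/elasticsearch-plugin
if [ -x "$ES_PLUGIN" ]; then
    sudo "$ES_PLUGIN" list
else
    echo "elasticsearch-plugin not found at $ES_PLUGIN"
fi
```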

Configuration

Set the 'SearchEngine' system preference to 'Elasticsearch'

Cross fingers

Trigger indexing (from 17.11):

sudo koha-elasticsearch --rebuild -d -v koha

Trigger indexing (before 17.11):

sudo koha-shell kohadev
cd kohaclone
misc/search_tools/rebuild_elasticsearch.pl -v -d

On a manual setup, just run:

misc/search_tools/rebuild_elasticsearch.pl -v -d

You have to start Elasticsearch before you can do the indexing.
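
On a Debian package install, Elasticsearch runs as a systemd service. A sketch of starting it and waiting until the node answers before kicking off the rebuild:

```shell
# Start the Elasticsearch service and wait (up to ~10s) until it answers,
# before running the indexing commands above.
sudo systemctl enable elasticsearch 2>/dev/null || true   # start at boot
sudo systemctl start elasticsearch 2>/dev/null || true
for i in $(seq 1 10); do
    if curl -s --max-time 2 http://localhost:9200 >/dev/null; then
        echo "Elasticsearch is up"
        break
    fi
    sleep 1
done
```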

(You may get an error "Can't locate Catmandu/Importer/MARC.pm"; make sure to install the libcatmandu-marc-perl package and DO NOT install this module from CPAN - it doesn't work.)

Do a search in the OPAC and/or staff client, notice that it all works beautifully.

Search engine configuration

N.B. Many of the configuration options are only available from Koha version 18.11.

There is a default set of fields configured for the search indexes (see admin/searchengine/elasticsearch/mappings.yaml), but you can also use the "Search engine configuration" page, available from the Koha Administration page, to modify the mappings, field weights for relevance ranking, etc. There is also an elasticsearch_index_mappings setting in koha-conf.xml that can be used to set a customized mappings.yaml as the defaults file used when the mappings are reset.

You can always reset the mappings to Koha's defaults by using the "Reset Mappings" button. Note that any customizations will be lost forever if the mappings are reset.
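
Since a reset discards customizations for good, you may want a backup first. A hedged sketch, assuming the mappings live in the search_field, search_marc_map and search_marc_to_field tables (verify against your Koha version's schema) and that database name and credentials match your koha-conf.xml:

```shell
# Back up the search mapping tables before resetting them in the UI.
# Table names, database name and credentials are assumptions; adjust for your setup.
DB=koha_kohadev        # assumption: usually koha_<instancename>
DBUSER=koha_kohadev    # assumption: see <user> in your koha-conf.xml
DBPASS=changeme        # assumption: see <pass> in your koha-conf.xml
mysqldump -u "$DBUSER" -p"$DBPASS" "$DB" \
    search_field search_marc_map search_marc_to_field \
    > search_mappings_backup.sql \
    || echo "backup failed; check database name and credentials"
```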

Search fields

The Search fields tab allows you to view all the available search fields, their types and weights. Note that weight changes take effect immediately after they have been saved.

MARC to Elasticsearch mappings

The Bibliographic records and Authorities tabs contain the mappings from MARC fields to search fields. Any changes here will only be reflected on existing records in the search index after rebuild_elasticsearch.pl (see above) has been run.

The mapping column has the following syntax:

Control fields:

Syntax: TAG_/START-END
Example: 008_/35-37
Description: Takes positions 35-37 from field 008

Data fields:

For data fields there are multiple possibilities depending on what is needed.

Syntax: TAGsubfields
Example: 245ab
Description: Takes subfields a and b from field 245 and adds them to separate search fields. This treats the subfields as completely separate values.

Syntax: TAG(subfields)
Example: 245(ab)
Description: Takes subfields a and b from field 245 and adds them to a single search field separated by a space. This treats the subfields as belonging together so that e.g. phrase and proximity searches work properly and only a single value is displayed in facets. Note that a simple search still works with any of the words in the phrase, so there's normally no need to index related subfields separately.

Syntax: TAG(subfields)subfields
Example: 245(ab)ab
Description: A combination of both of the above.

Note that all data field mappings will automatically handle alternate script fields (880) for MARC 21 records.
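
After changing mappings and rerunning rebuild_elasticsearch.pl, you can query the index directly to check what a mapping produced. A sketch; the index name follows the koha_robin example above, and the title search field is an assumption based on the 245 examples:

```shell
# Query the biblio index directly to inspect what a mapping produced.
# Index name and search field are assumptions matching the examples above.
INDEX=koha_robin_biblios
curl -s --get "http://localhost:9200/$INDEX/_search" \
     --data-urlencode 'q=title:nutshell' \
    || echo "Elasticsearch not reachable"
```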

Advanced Field Configuration

There are two files that control the behavior of the search index: admin/searchengine/elasticsearch/index_config.yaml and admin/searchengine/elasticsearch/field_config.yaml. If necessary, these files can be copied e.g. to the etc directory, and the elasticsearch_index_config and elasticsearch_field_config settings in koha-conf.xml set to point to them.

For any changes to these files to take effect, rebuild_elasticsearch.pl will need to be run with the -d parameter that forces the index to be recreated.

field_config.yaml contains the field configuration used to set up the Elasticsearch index. There's normally no need to modify this file.

index_config.yaml contains the configuration for the analysis chain used to process any search terms for indexing or searching. You may, for example, need to modify the settings of the ICU folding filter to preserve the correct sort order. See the ICU Folding Filter documentation for an example of how to configure the index for Swedish/Finnish folding. See the Analysis documentation for more information on how analysis works in Elasticsearch.

Here is an example index_config.yaml for Swedish/Finnish.

---
index:
  analysis:
    analyzer:
      analyser_phrase:
        tokenizer: keyword
        filter:
          - finnish_folding
          - lowercase
        char_filter:
          - punctuation
      analyser_standard:
        tokenizer: icu_tokenizer
        filter:
          - finnish_folding
          - lowercase
        char_filter:
          - punctuation
      analyser_stdno:
        tokenizer: whitespace
        filter:
          - finnish_folding
          - lowercase
        char_filter:
          - punctuation
    normalizer:
      normalizer_keyword:
        type: custom
        filter:
          - finnish_folding
          - lowercase
    filter:
      finnish_folding:
        type: icu_folding
        unicode_set_filter: '[^\u00E5\u00E4\u00F6\u00C5\u00C4\u00D6]'
    char_filter:
      punctuation:
        type: mapping
        mappings:
          - '- => '
          - '+ => '
          - ', => '
          - '. => '
          - '= => '
          - '; => '
          - ': => '
          - '" => '
          - ''' => '
          - '! => '
          - '? => '
          - '% => '
          - '\/ => '
          - '( => '
          - ') => '
          - '[ => '
          - '] => '
          - '{ => '
          - '} => '


The idea is to use ICU folding for most characters except the Scandinavian ones (ÅÄÖåäö), which are excluded from ICU processing using the unicode_set_filter setting. The lowercase filter then makes sure they are still normalized to lowercase. An additional char_filter is used to remove unwanted punctuation.
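
You can test the analysis chain with Elasticsearch's _analyze API once the index has been created. A sketch, assuming the koha_robin_biblios index was built with the index_config.yaml above; with the unicode_set_filter in place, å/ä/ö should come through only lowercased, while e.g. é is folded to e:

```shell
# Run a sample string through the analyser defined above and inspect the tokens.
# Index name is an assumption; use your own <index_name> plus _biblios.
INDEX=koha_robin_biblios
curl -s -XPOST "http://localhost:9200/$INDEX/_analyze" \
     -H 'Content-Type: application/json' \
     -d '{"analyzer": "analyser_standard", "text": "Ålesund café"}' \
    || echo "Elasticsearch not reachable"
```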