
Elasticsearch Implementation

The Elasticsearch support is composed of two main parts: Elasticsearch itself, which handles the searching, and Catmandu, which handles the data conversion from MARC to JSON.

Basic Idea

Searching

There is a compatibility API that makes it straightforward to convert existing calls so that they work with both Zebra and Elasticsearch. A typical search implementation with Zebra might look like:

( $error,$query,$simple_query,$query_cgi,$query_desc,$limit,$limit_cgi,$limit_desc,$stopwords_removed,$query_type) = buildQuery(\@operators,\@operands,\@indexes,\@limits,\@sort_by, 0, $lang);
($error, $results_hashref, $facets) = getRecords($query,$simple_query,\@sort_by,\@servers,$results_per_page,$offset,$expanded_facet,$branches,$itemtypes,$query_type,$scan,1);

The equivalent with the new API, which automatically switches between Zebra and Elasticsearch for you, would be:

$builder  = Koha::SearchEngine::QueryBuilder->new({index => $Koha::SearchEngine::BIBLIOS_INDEX});
$searcher = Koha::SearchEngine::Search->new({index => $Koha::SearchEngine::BIBLIOS_INDEX});
( $error,$query,$simple_query,$query_cgi,$query_desc,$limit,$limit_cgi,$limit_desc,$stopwords_removed,$query_type) = $builder->build_query_compat(\@operators,\@operands,\@indexes,\@limits,\@sort_by, 0, $lang, { expanded_facet => $expanded_facet });
($error, $results_hashref, $facets) = $searcher->search_compat($query,$simple_query,\@sort_by,\@servers,$results_per_page,$offset,$expanded_facet,$branches,$itemtypes,$query_type,$scan,1);

This is all a bit ugly, but that's because the original interface to Zebra is a bit ugly.

The results from this should be the same as those from the original calls, so that the new API can be dropped in as a replacement.

A similar thing is done with the SimpleSearch interface:

( undef,$builtquery,undef,undef,undef,undef,undef,undef,undef,undef) = buildQuery(undef,\@operands);
( $error, $marcresults, $total_hits ) = SimpleSearch( $builtquery, $results_per_page * ( $page - 1 ), $results_per_page );

becomes:

my $builder  = Koha::SearchEngine::QueryBuilder->new({index => $Koha::SearchEngine::BIBLIOS_INDEX});
my $searcher = Koha::SearchEngine::Search->new({index => $Koha::SearchEngine::BIBLIOS_INDEX});
( undef,$builtquery,undef,undef,undef,undef,undef,undef,undef,undef) = $builder->build_query_compat(undef,\@operands);
( $error, $marcresults, $total_hits ) = $searcher->simple_search_compat($builtquery, $results_per_page * ($page - 1), $results_per_page);

Behind the scenes, this is routed to an appropriate concrete function depending on the SearchEngine system preference. On the Elasticsearch side, this is Koha::SearchEngine::Elasticsearch::(QueryBuilder|Search).
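
As a rough sketch of how that routing works (the subroutine below is hypothetical; the real dispatch lives inside Koha::SearchEngine::Search and may differ in detail), the SearchEngine system preference selects the backend class:

use C4::Context;

# Hypothetical helper: pick the concrete Search class based on the
# SearchEngine system preference ('Zebra' or 'Elasticsearch').
sub search_class_for_engine {
    my $engine = C4::Context->preference('SearchEngine') || 'Zebra';
    return "Koha::SearchEngine::${engine}::Search";
}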

Indexing

The actual reindex process is launched by misc/search_tools/rebuild_elastic_search.pl. This script is very generic and could quite readily be made to work with other indexing systems, so long as they have an equivalent module for indexing. Unlike with Zebra, it is not (currently) intended that this be run regularly on a schedule, as modified or added records are indexed in real time.
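
For example, a full reindex might be launched like this (the flag is an assumption; check the script's --help for the options your version supports):

perl misc/search_tools/rebuild_elastic_search.pl --verbose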

Most of the heavy lifting happens in Koha::ElasticSearch::Indexer and its superclass Koha::ElasticSearch (both should probably be moved in with the other Elasticsearch files). This runs the records through Catmandu to convert them into JSON (see Koha::ElasticSearch::Indexer::_convert_marc_to_json), and the results are passed to Catmandu::Store::ElasticSearch to load into Elasticsearch. This will automatically create the Elasticsearch indexes if needed.

Index creation is done based on a set of database tables, search_field, search_marc_map, and a join table for them, search_marc_to_field. These tables are used for two purposes: to generate the Elasticsearch mappings used on index creation, and to generate the Catmandu fixer rules used when converting each field from MARC to JSON.

In Depth

Indexing

Mapping

Mapping is the most important part of the indexing process. It serves two purposes:

  • to create the Elasticsearch mapping when a new index is created, and
  • to convert MARC data into JSON for Elasticsearch when a record is indexed.

The mapping is stored in the database with a set of tables (search_field, search_marc_map, and search_marc_to_field to join them).

search_field contains the fields that will be put into Elasticsearch, their type, their human-readable name, etc.

search_marc_map describes all the MARC fields that we care about.

search_marc_to_field joins them together. It says that each MARC field described in search_marc_map should go into a particular search field as defined in search_field. The joining also carries some attributes with it:

  • facet specifies that this relationship should generate facets. The facets go into Elasticsearch as a separate field: the regular name with __facet appended. This is then used when searching. Keeping this on the join gives us complete expressiveness: for example, we can say that we want to both search and facet on the 1xx and 71x fields, but only search on 245$c, while all of them go into the author field for searching.
  • suggestible causes this mapping to be available to the browse interface. It is saved as a field with __suggestion appended.
  • sort causes this mapping to be available for sorting. Note that this is really a tri-state boolean (yeah, I know). If any of the entries for a mapping have a non-null sort value, then all sorting will be done on the field name with __sort appended rather than on the plain field name. If the map has a true sort value, then its data will be copied into the __sort field. If it has a false or null sort value, then it will not be copied, and will be ignored when sorting on the field. This means we can, for example, sort on "author" using only the 1xx values, not the 71x ones.
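
Putting those attributes together, one logical search field can fan out into several Elasticsearch fields. The sketch below shows the shape an indexed record might take for the author example above (the values are illustrative, not actual output):

# Illustrative only: the Elasticsearch fields generated for one mapping.
my $indexed_document = {
    author             => [ 'Austen, Jane', 'Some Editor' ],   # everything searchable
    author__facet      => [ 'Austen, Jane' ],                  # mappings with facet set
    author__suggestion => [ 'Austen, Jane' ],                  # mappings with suggestible set
    author__sort       => [ 'Austen, Jane' ],                  # only mappings with a true sort value
};
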
Creating a new ES index

Whenever a new Catmandu::Store::ElasticSearch object is created, it is given the Elasticsearch mapping details, and it will automatically create a new index from these if one doesn't already exist. It uses the mapping that is generated from the search_* tables, in the Koha::ElasticSearch->get_elasticsearch_mappings function.
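
In outline, that looks something like the following (the parameter names are assumptions based on Catmandu::Store::ElasticSearch; the real call is made inside Koha::ElasticSearch):

use Catmandu::Store::ElasticSearch;
use Koha::ElasticSearch;

# Sketch: handing the generated mappings to the store; the index is
# created on first use if it doesn't already exist.
my $store = Catmandu::Store::ElasticSearch->new(
    index_name     => 'koha_kohadev_biblios',   # assumed index name
    index_mappings => Koha::ElasticSearch->get_elasticsearch_mappings(),
);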

The indexing process

Via rebuild_elastic_search:

  1. misc/search_tools/rebuild_elastic_search.pl creates an iterator that'll return the records we want to index. Currently, by default, this is all records.
  2. a number of MARC::Records are collected from the database and passed to Koha::ElasticSearch::Indexer->update_index (see the sketch after this list).
  3. these run through _convert_marc_to_json which creates a MARC record importer, and applies the fixer rules.
  4. the records are sent to Catmandu which applies the fixes and sends the records to Elasticsearch.
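
Condensed into code, the rebuild loop looks roughly like this (the iterator, batch size, and constructor arguments are assumptions; the real logic is in rebuild_elastic_search.pl):

use Koha::ElasticSearch::Indexer;

# Rough shape of the rebuild loop; error handling omitted.
my $indexer = Koha::ElasticSearch::Indexer->new({ index => 'biblios' });
my ( @biblionumbers, @records );
while ( my ( $biblionumber, $record ) = $iterator->() ) {    # $iterator: assumed record source
    push @biblionumbers, $biblionumber;
    push @records,       $record;                            # a MARC::Record
    if ( @records >= 100 ) {                                 # assumed batch size
        $indexer->update_index( \@biblionumbers, \@records );
        @biblionumbers = @records = ();
    }
}
$indexer->update_index( \@biblionumbers, \@records ) if @records;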

When records are modified:

  1. C4::Biblio::ModZebra is called
  2. Instead of adding a record to the database queue for Zebra, one of two functions is called: Koha::ElasticSearch::Indexer->update_index_background or Koha::ElasticSearch::Indexer->delete_index_background. (Note: these names seem bad. Maybe they should be s/index/record/ instead?)
  3. Right now, the _background functions call straight through to the functions they refer to. The intention is that they might fork and do the work in the background, send it to a daemon, or something like that. They need to get to the point where they can be smart, non-blocking, and able to handle errors by deferring the indexing (a condensed sketch of the dispatch follows).
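
A condensed sketch of that dispatch (the branch shown is the Elasticsearch one; arguments and error handling are simplified from the real C4::Biblio::ModZebra):

use C4::Context;
use Koha::ElasticSearch::Indexer;

# Simplified sketch of the ModZebra dispatch for Elasticsearch.
sub ModZebra {
    my ( $biblionumber, $op, $server ) = @_;
    if ( C4::Context->preference('SearchEngine') eq 'Elasticsearch' ) {
        my $indexer = Koha::ElasticSearch::Indexer->new({ index => $server });
        if ( $op eq 'recordDelete' ) {
            $indexer->delete_index_background( [$biblionumber] );
        }
        else {    # e.g. 'specialUpdate'
            $indexer->update_index_background( [$biblionumber] );
        }
        return;    # nothing is queued for Zebra
    }
    # ... the original Zebra queueing continues here ...
}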

Searching

There are two parts to searching, modelled on the way Zebra searching is done: QueryBuilder and Search.

In both of these cases, a mostly backwards-compatible API has been produced that replicates the old Zebra API, along with classes (Koha::SearchEngine::(QueryBuilder|Search)) that send the call on to whichever search engine implementation is required.

The compatibility functions end in _compat.

Query Builder

Most of the search-related work happens here. It teases apart the query components, rewrites parts of them, and restructures them into an Elasticsearch query structure. There are a lot of things in the Zebra system that have different names in different places; currently the remapping of these is hard-coded into the relevant parts of this file. For example, sorting by call number passes a "call_number" sort value, but we actually want to sort on the callnum field. This conversion is done in the _convert_sort_fields subroutine. (Actually, we probably want to sort on a normalised version, in the case of callnum.)
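
The remapping mechanism amounts to a lookup table like the one below (only the call_number to callnum pair comes from the text above; the subroutine itself is a sketch, not the real _convert_sort_fields):

# Sketch of the sort-field remapping; unknown names pass through unchanged.
my %sort_field_map = (
    call_number => 'callnum',    # from the example above
);

sub convert_sort_fields {
    my @sort_by = @_;
    return map { $sort_field_map{$_} // $_ } @sort_by;
}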

A lot of the conversion here is pretty ad hoc, and the goal is to end up with a Lucene-style query. These are not preferred in Elasticsearch, as they give too much power to the searcher, but the safe equivalent is far too limited. In the longer term, this should be replaced with a query parser function that generates native ES queries.
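
Concretely, the result is an Elasticsearch query_string query, roughly of this shape (the example input and options are illustrative):

# Illustrative shape of what gets sent to Elasticsearch: a query_string
# query, which accepts Lucene-style syntax from the user.
my $es_query = {
    query => {
        query_string => {
            query            => 'author:austen',    # assumed example input
            default_operator => 'AND',              # assumed option
        },
    },
};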

Search

Search pretty much just takes the provided query and runs it. In the _compat version, it also does some munging of the output results to make them match the Zebra version.

Authorities

Authority searching is done with its own _compat methods, but they tend to be simpler, as authority searching is more constrained.

Troubleshooting

You can generally view your indexes by browsing to:

Biblios: localhost:9200/koha_{instance}_biblios

Authorities: localhost:9200/koha_{instance}_authorities

You can also view a record from the command line like so:

curl 'localhost:9200/koha_kohadev_biblios/data/5?pretty=yes'

To clear your indexes:

curl -XDELETE localhost:9200/koha_{instance}_biblios

You can run queries directly against Elasticsearch using curl or a REST client (Simple REST Client, Postman, the HTTPRequester Firefox plugin, etc.). Adding the Marvel plugin for Elasticsearch is also helpful. There are more tools than can be covered here, but the gist is that simple requests can be made for testing and examples.
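
For example, a quick URI search against the biblios index (using the same kohadev instance name as above):

curl 'localhost:9200/koha_kohadev_biblios/_search?q=title:history&pretty=yes'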