Adoption of Elasticsearch as an alternative search engine

Status:	Draft
Sponsored by:	ByWater Solutions, Catalyst, EBSCO, AARome, CIN, ArcadiaPL, et al.
Developed by:	Robin Sheat (for now)
Expected for:	2015-03-18
Bug number:	Bug 12478
Work in progress repository:	http://git.catalyst.net.nz/gw?p=koha.git;a=shortlog;h=refs/heads/elastic_search
Description:	Adding support for Elasticsearch as an alternative to Zebra will both provide a more scalable system and allow additional features, for example incorporating non-MARC data into the system.

Abstract

This RFC describes the goals for implementing Elasticsearch as an alternative search engine in Koha. It explains which specific entry points will be implemented and which will not. It also covers the additional, related features that will be developed as part of this work.

Introduction

This RFC covers the planned implementation of Elasticsearch as an alternative search engine to Zebra in Koha. It includes both detail on the in-progress implementation phase and some documentation on how the finished implementation will work. It is not intended to be a reference on deploying Elasticsearch with Koha, though it may help with that. This document will be updated from time to time to indicate current progress.

Elasticsearch

Elasticsearch is a search engine designed with high scalability and robustness in mind. It also has a flexible schema system, allowing any sort of data that can be expressed in JSON to be stored and indexed. Libraries such as Catmandu make it easy to convert data from many forms, including MARC, into a serialisation suitable for Elasticsearch. Once the data is in Elasticsearch, searches against any part of it can be readily performed. In addition, for future use, it will be possible to store non-MARC data alongside regular bibliographic material. This will allow data ingested from other sources to appear alongside bibliographic records in the catalogue.

High-level Implementation Details

Code and Interfaces

The incorporation of Elasticsearch is first and foremost designed to allow an easy switch between Zebra and Elasticsearch, controlled by a system preference. This makes testing less of a commitment. To allow this, a set of wrapper APIs has been produced that provides "modern" access to Elasticsearch and also supports a "_compat" layer for the old-style calls. This means that calls to C4::Search and C4::AuthoritiesMarc can be kept with little change, while calls that go to the Elasticsearch code are converted into a more useful form for query building and searching. These wrappers live in the Koha::SearchEngine namespace.
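
The switching idea can be sketched as follows. This is only an illustration in Python, not the Koha code (which is Perl): the function names, the preference values "Zebra" and "Elasticsearch", and the result format are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for the two back ends; they only tag the query
# so that the dispatch is visible in the result.
def zebra_search(query: str) -> List[str]:
    return [f"zebra:{query}"]


def elasticsearch_search(query: str) -> List[str]:
    return [f"elasticsearch:{query}"]


# Assumed preference values; the real values may differ.
BACKENDS: Dict[str, Callable[[str], List[str]]] = {
    "Zebra": zebra_search,
    "Elasticsearch": elasticsearch_search,
}


def search_compat(query: str, preferences: Dict[str, str]) -> List[str]:
    """Old-style entry point: the caller's signature stays the same and
    the system preference decides which engine answers."""
    engine = preferences.get("searchengine", "Zebra")
    if engine not in BACKENDS:
        raise ValueError(f"unknown search engine: {engine!r}")
    return BACKENDS[engine](query)
```

With this shape, existing callers do not need to know which engine is active; changing the preference is enough to route the same call to the other back end.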

Mappings and Configuration

To configure Koha to use Elasticsearch, the searchengine system preference is used. Additionally, the koha-conf.xml must be configured to point to the Elasticsearch servers, and to namespace the indexes in them appropriately.

For Elasticsearch, the mappings from MARC to search fields are defined in the elasticsearch_mappings table in the database. Each row lists the name of a field stored in Elasticsearch, the MARC field it comes from (for each of the supported MARC types: MARC21, UNIMARC, and NORMARC), the index it belongs to (biblios or authorities), its data type (string, number, date, etc.), and whether it should be considered for faceting.
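
As a rough illustration of how such rows could drive the conversion of a record into a searchable document, here is a minimal Python sketch. The row layout, the simplified record structure (tag to list of subfield dictionaries), and all names are assumptions for the example; they do not describe the actual table columns or the Catmandu-based conversion.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class MappingRow:
    field: str                            # search field name in Elasticsearch
    sources: Dict[str, Tuple[str, str]]   # MARC type -> (tag, subfield codes)
    index: str                            # "biblios" or "authorities"
    data_type: str                        # "string", "number", "date", ...
    facet: bool                           # consider this field for faceting


# Simplified record: MARC tag -> occurrences, each a subfield code -> value map.
Record = Dict[str, List[Dict[str, str]]]


def build_document(record: Record, rows: List[MappingRow],
                   flavour: str, index: str) -> Dict[str, List[str]]:
    """Collect the mapped subfields of a record into search fields."""
    doc: Dict[str, List[str]] = {}
    for row in rows:
        if row.index != index or flavour not in row.sources:
            continue
        tag, codes = row.sources[flavour]
        for occurrence in record.get(tag, []):
            values = [occurrence[code] for code in codes if code in occurrence]
            if values:
                doc.setdefault(row.field, []).append(" ".join(values))
    return doc


# Illustrative rows and record, not real configuration.
ROWS = [
    MappingRow("title", {"marc21": ("245", "ab"), "unimarc": ("200", "ae")},
               "biblios", "string", False),
    MappingRow("author", {"marc21": ("100", "a"), "unimarc": ("700", "ab")},
               "biblios", "string", True),
    MappingRow("heading", {"marc21": ("150", "a")},
               "authorities", "string", False),
]
RECORD = {
    "245": [{"a": "Programming Perl", "c": "Larry Wall"}],
    "100": [{"a": "Wall, Larry"}],
}
```

Calling build_document(RECORD, ROWS, "marc21", "biblios") picks up only the biblio rows and the MARC21 sources, so the same table can serve several MARC types and both indexes.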

Elasticsearch Configuration

Koha requires no special configuration of Elasticsearch: a normally set up cluster will do the job perfectly well. When an index is first created, the Elasticsearch mapping (which specifies how Elasticsearch will handle each field) is generated from the elasticsearch_mapping table. This process also sets up pseudo-fields for certain types of searches.
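
The generation step might look something like the sketch below. It assumes the Elasticsearch 1.x field types current at the time of writing ("string", "integer", "date"), an assumed translation from the table's data types to those types, and an un-analysed "raw" sub-field as one possible kind of pseudo-field for faceting. None of this is taken from the actual Koha code.

```python
from typing import Any, Dict, List, Tuple

# Assumed translation from the table's data types to Elasticsearch 1.x types.
ES_TYPES = {"string": "string", "number": "integer", "date": "date"}


def build_properties(rows: List[Tuple[str, str, bool]]) -> Dict[str, Any]:
    """Turn (field, data_type, facet) rows into a mapping 'properties' block."""
    properties: Dict[str, Any] = {}
    for field, data_type, facet in rows:
        es_type = ES_TYPES.get(data_type, "string")
        definition: Dict[str, Any] = {"type": es_type}
        if facet and es_type == "string":
            # Keep an un-analysed copy so facet values are whole strings,
            # not individual tokens.
            definition["fields"] = {
                "raw": {"type": "string", "index": "not_analyzed"}
            }
        properties[field] = definition
    return {"properties": properties}
```

The resulting dictionary is the kind of structure that would be sent to Elasticsearch when the index is created; how Koha actually names and nests it is not specified here.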

Browse Interface

This is a special interface that allows sequential browsing of records, for example browsing through authors or subjects in alphabetical order. It doesn't directly use Elasticsearch, because Elasticsearch is a search engine, which is distinct from a tool for browsing. Instead, the relevant fields are put into a database table and linked with the records that they apply to.
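
A minimal sketch of the table-based approach, using an in-memory SQLite database so that it is self-contained. The table name, column names, and sample data are hypothetical; the real browse table is not yet implemented and may look quite different.

```python
import sqlite3
from typing import List, Tuple


def make_browse_table(entries: List[Tuple[str, int]]) -> sqlite3.Connection:
    """Create a hypothetical browse table of (term, biblionumber) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE browse (term TEXT NOT NULL, biblionumber INTEGER NOT NULL)"
    )
    # An index on the term keeps ordered range scans cheap.
    conn.execute("CREATE INDEX browse_term ON browse (term)")
    conn.executemany(
        "INSERT INTO browse (term, biblionumber) VALUES (?, ?)", entries
    )
    return conn


def browse(conn: sqlite3.Connection, start: str,
           limit: int = 3) -> List[Tuple[str, int]]:
    """Return the next `limit` entries in alphabetical order from `start`."""
    cursor = conn.execute(
        "SELECT term, biblionumber FROM browse "
        "WHERE term >= ? ORDER BY term, biblionumber LIMIT ?",
        (start, limit),
    )
    return cursor.fetchall()
```

Browsing is then an ordered range scan from a starting point rather than a relevance-ranked search, which is why a plain database table suits it better than the search engine.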

Indexing

Elasticsearch will initially have a system similar to Zebra's for indexing records that are added to or updated in Koha. However, unlike with Zebra, this can be done entirely online without affecting searches that happen while indexing is in progress. This means that a full reindex can be done while the system is live. The exception is a change to the mapping: this requires deleting the search index and rebuilding it from empty.
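
The rule about mapping changes can be summarised in a few lines. This is a sketch of the decision only; the step names are hypothetical labels and do not correspond to real Koha or Elasticsearch calls.

```python
from typing import Any, Dict, List, Optional


def plan_reindex(current_mapping: Optional[Dict[str, Any]],
                 desired_mapping: Dict[str, Any]) -> List[str]:
    """List the steps needed to bring the search index up to date."""
    if current_mapping is None:
        # No index yet: create it, then index everything.
        return ["create_index", "index_all_records"]
    if current_mapping != desired_mapping:
        # Mapping changed: the index must be rebuilt from empty.
        return ["delete_index", "create_index", "index_all_records"]
    # Mapping unchanged: records can be reindexed while the system is live.
    return ["index_all_records"]
```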

Current Progress

Note that no extensive testing has yet been done, so there will be issues even in the sections marked as complete. Help with finding and fixing those issues would be very welcome, both to accelerate development and as a way to learn how the new code works.

Complete

Biblio indexing
OPAC biblio searching
Authority indexing
OPAC authority searching
Staff client biblio searching
Staff client authority searching

Incomplete: not begun / in progress

Browse interface
Make code ready for release

Future goals with no plans for implementation

Staff interface for managing mappings
Ingestion of data from other sources alongside Koha catalogue data, interface for managing this
Supporting useful features unique to Elasticsearch, e.g.:
- Comparing saved searches against every newly catalogued record and emailing members when a record matches their interests
- Ingesting and searching the content of electronic resources

Appendix A - Non-developmental Plans

Training

ByWater Solutions is considering sponsoring web-based training courses from the Elasticsearch company (now known simply as Elastic) to teach people how to improve the Elasticsearch support, develop more advanced APIs, and generally make Koha's search more powerful.
