C & P Search Rewrite RFC

NOTE: THIS PROPOSAL IS ON THE WIKI SOLELY FOR HISTORICAL REASONS, IN THE HOPES THAT IT WILL BE HELPFUL TO OTHERS

Overview

Based on some of the many useful comments we received in response to the original proposal, we have revised our plan for the plumbing section of the development to allow a more incremental approach. In particular, the revised proposal has the following major changes:

We propose to start the development not with Data::SearchEngine::Zebra but by inserting the new search code in front of the existing search code, an approach that will allow us to replace the existing search parser gradually, in a very granular fashion (in fact, it would be possible to use the new search parser for only one small subset of queries as an initial proof-of-concept).
Rather than starting out from scratch we can collaborate with the Evergreen project to make their QueryParser suitable for both projects. By doing this, we will be able to share the work of maintaining the parser and both projects will be able to take advantage of search syntax refinements developed by either project.

We are particularly excited by the opportunities afforded by collaborating with Evergreen on the QueryParser project. Not only will Koha be better for the search rewrite, Evergreen will be, too.

The RFC reflects the updated proposal. Please note in particular that the numbering has changed, and many of the earlier components have been broken up into smaller chunks.

Development Summary

1. Plumbing

The items in this section represent the base development, and are required for any further work on Koha's search functionality.

(a) Inject new query parser in front of existing Zebra search code.

Because Koha's current query parser works somewhat (for some values of “somewhat”), it would be preferable to keep it functioning for searches that the new query parser cannot yet handle. In order to do that, we will simply (for some value of “simply”) try the new query parser before using the old one, and, if the new one is able to handle the query, don't use the old parser.

Initially, we will handle only “exploded” search terms (i.e. searches for a particular subject plus broader/narrower/related subjects), since this is what is covered by the (separate) project that will implement this step.
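The "try the new parser first" arrangement described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the function names and the "su-br:" prefix used to mark an exploded search are made up for the example and are not part of Koha.

```python
from typing import Optional


def new_parser(query: str) -> Optional[str]:
    """Return a parsed query, or None if this query is not handled yet."""
    # Assumption: exploded subject searches carry a made-up "su-br:" prefix.
    if query.startswith("su-br:"):
        return f"NEW({query})"
    return None


def old_parser(query: str) -> str:
    """Stand-in for the existing parser, which accepts every query."""
    return f"OLD({query})"


def parse(query: str) -> str:
    """Try the new parser first and fall back to the old one."""
    parsed = new_parser(query)
    return parsed if parsed is not None else old_parser(query)
```

Because the new parser signals "not handled" rather than raising an error, the set of queries it accepts can grow one small subset at a time without touching the old code path.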

(b) Adopt the Evergreen QueryParser

By collaborating with the Evergreen project, we can make the QueryParser independent of Evergreen and incorporate it into Koha. The shared QueryParser will be used to generate the data structure used to query Zebra and Solr. Information about the QueryParser from the Evergreen project can be found at [1]. After this step, the new query parser will handle non-fielded queries, queries using only one field, and queries that use the Evergreen QueryParser syntax.

This step will involve three distinct phases:

i. Transition Evergreen's QueryParser into a separate module

In cooperation with the Evergreen project, move the code out of the Evergreen source tree and into a separate Library::QueryParser (name subject to change) module. Much of this work will actually be submitted to the Evergreen project for inclusion in Evergreen prior to the branching of the QueryParser module. In the spirit of Koha, C & P will donate the labor necessary for this part of the project.

ii. Write a Zebra QueryParser driver

A driver that translates the expressive query data structure generated by the parser into a PQF query that Zebra can handle will enable us to continue taking advantage of Zebra's power while gaining all the expressiveness that Evergreen's QueryParser offers.
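As a rough illustration of what such a driver does, the sketch below walks a tiny query tree and emits PQF. The Term and BoolNode classes are hypothetical stand-ins, not the QueryParser's actual data structure; the numbers are Bib-1 use attributes, and the sketch assumes search terms contain no double quotes.

```python
from dataclasses import dataclass
from typing import Union

# Bib-1 "use" attribute numbers for a few common access points.
USE_ATTRIBUTES = {"title": 4, "subject": 21, "author": 1003, "keyword": 1016}


@dataclass
class Term:
    field: str  # key into USE_ATTRIBUTES
    text: str   # assumed to contain no double quotes


@dataclass
class BoolNode:
    op: str     # "and", "or", or "not" (PQF's @not means "and not")
    left: "Node"
    right: "Node"


Node = Union[Term, BoolNode]


def to_pqf(node: Node) -> str:
    """Translate the tree into PQF's prefix notation, operator first."""
    if isinstance(node, Term):
        return f'@attr 1={USE_ATTRIBUTES[node.field]} "{node.text}"'
    return f"@{node.op} {to_pqf(node.left)} {to_pqf(node.right)}"
```

For example, a tree for a title search on "napoleon's" combined with a subject search on "russia" comes out as @and followed by the two attribute-qualified terms.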

iii. Hook the QueryParser into Koha

The driver will allow us to use the QueryParser in Koha, but this step is required to give the QueryParser knowledge about all the indexes, facets, and filters supported by Koha.

(c) Change the Advanced Search screen to use the new QueryParser

We will retain the existing interface but use the improved query functionality. This will allow for more powerful and unambiguous queries from the advanced search pages on both the staff client and OPAC without changing the user interface.

(d) Adjust the query parser to use Koha::SearchEngine::Zebra

At the moment, Koha directly accesses Zebra for performing queries. This gets in the way of Koha using other search engines such as Solr. Paradoxically, this also makes it harder for us to take advantage of Zebra's more advanced features.

In order to integrate Solr support into the Koha codebase, we are using a library called Data::SearchEngine::Solr to interface with Solr. Work was started on an interface for Zebra called Data::SearchEngine::Zebra, but it has not been finished yet. By developing a minimalist version of this, it will be easy to integrate future search engines into Koha.
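To make the idea of a minimalist engine interface concrete, here is a sketch of the kind of abstraction involved. It is not the Data::SearchEngine API; the class and method names are assumptions chosen for the example, and the in-memory backend exists only so the interface can be exercised without Zebra or Solr.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class SearchEngine(ABC):
    """Hypothetical minimal interface that every backend would implement."""

    @abstractmethod
    def search(self, query: str, offset: int = 0, limit: int = 20) -> List[str]:
        """Return the identifiers of records matching the query."""


class InMemoryEngine(SearchEngine):
    """Toy backend: a record matches if it contains every query word."""

    def __init__(self, records: Dict[str, str]):
        self.records = records  # record id -> searchable text

    def search(self, query: str, offset: int = 0, limit: int = 20) -> List[str]:
        words = query.lower().split()
        hits = [
            rid
            for rid, text in sorted(self.records.items())
            if all(word in text.lower() for word in words)  # substring match
        ]
        return hits[offset:offset + limit]


def run_search(engine: SearchEngine, query: str) -> List[str]:
    """Calling code depends only on the interface, never on a backend."""
    return engine.search(query)
```

Adding a new search engine then means writing one more subclass, while code such as run_search stays unchanged.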

While there are many problems with the existing searching in Koha, we don't want to lose what we already have, so it is crucial to provide direct access to the following features of our search engines:

“Fuzzy” searching
Boolean logic
“Simple” relevance ranking (but see also the section below on relevancy)
Zebra-specific features: ICU (International Components for Unicode, the library used by Zebra for supporting non-Roman character sets; Solr supports Unicode natively)

(e) Implement search engine native faceting

At the moment Koha uses homegrown and inefficient routines for faceting, even though all modern search engines support faceting natively (for Zebra, facets can also be handled by Pazpar2). By using the native support, we can offer more meaningful facets faster.
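For Solr, asking the engine for facets is a matter of adding request parameters rather than counting values in Koha. The sketch below builds such a query string using Solr's standard facet parameters; the field names "author" and "itype" are assumptions and may not match a real Koha schema.

```python
from urllib.parse import urlencode

# Hypothetical index names; a real Koha/Solr schema may use different ones.
FACET_FIELDS = ["author", "itype"]


def facet_query_string(user_query: str) -> str:
    """Build the query string for a Solr select request with field facets."""
    params = [
        ("q", user_query),
        ("facet", "true"),        # turn on Solr's faceting component
        ("facet.mincount", "1"),  # omit facet values with no matching records
    ]
    # One facet.field parameter per field we want counts for.
    params += [("facet.field", name) for name in FACET_FIELDS]
    return urlencode(params)
```

The engine returns the per-value counts alongside the results, so no second pass over the result set is needed.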

(f) Better relevancy ranking

Although relevancy ranking similar to what is currently offered by Koha will be available after the creation of 1(b), it is generally agreed that Koha's relevancy ranking is not really up to snuff. For example, at the moment, the default relevancy ranking does not consistently rank exact matches in the author or title highest, nor privilege matches in author or title above other matches. There is a consensus that it should. This component would improve the relevancy ranking to better match expectations. This might be better done prior to 1(e) (native faceting), since it seems to be a higher priority for the community.
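The expectation described above (exact author or title matches first, then other author or title matches, then everything else) can be expressed as a simple scoring rule. The sketch below is a toy model with made-up weights, not a description of how Zebra or Solr compute relevance.

```python
from typing import Dict, List

# Illustrative numbers only; the RFC does not specify any weights.
EXACT_BONUS = 100
FIELD_WEIGHTS = {"title": 10, "author": 10, "subject": 3, "note": 1}


def score(record: Dict[str, str], query: str) -> int:
    """Score one record: exact title/author matches outrank partial ones."""
    wanted = query.lower().strip()
    total = 0
    for field, weight in FIELD_WEIGHTS.items():
        value = record.get(field, "").lower().strip()
        if field in ("title", "author") and value == wanted:
            total += EXACT_BONUS
        elif wanted in value:
            total += weight
    return total


def rank(records: List[Dict[str, str]], query: str) -> List[Dict[str, str]]:
    """Return the records ordered from most to least relevant."""
    return sorted(records, key=lambda record: score(record, query), reverse=True)
```

With these weights, a record titled exactly "Emma" ranks above "Emma and other stories", which in turn ranks above a record that mentions Emma only in a note.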

(g) Replace Koha's homegrown query parsing code

The QueryParser support above adds more expressive query options to the advanced search and simple searches, but Koha also needs to give users access to the full expressiveness of bibliographic data through command searches, where the user can type more complex searches using boolean logic, etc. In order to do that, Koha requires a more fully featured query syntax. Koha should support the following four search syntaxes:

i. CCL (Common Command Language)

Vaguely English-like query language currently used as the default language. Unlike in the current situation, boolean and phrase searches will be parsed correctly. Example queries:

A. computer repair => looking for basic guides to computer repair
B. ti:napoleon's and su:russia => looking for a vaguely remembered book called “Napoleon's [something]” about his campaign in Russia
C. ti:jane austen and biography => looking for a biography of Jane Austen called “Jane Austen : a wit before her time”
D. gothic and pb:pocket => looking for Gothic romances published by Pocket Books
E. tailoring or couture sewing => looking to learn about tailoring/couture sewing in order to make a tuxedo
F. murder not mystery => looking for non-fiction books about murder
G. ti:assassins not sondheim => looking for books about assassins that are not about the Stephen Sondheim musical
H. handbook “model t” => looking for a restoration handbook for the Model T car
I. jazz piano and (monk or waller) => looking for material on jazz piano by or about Thelonious Monk or Fats Waller
J. relationships dating not => looking for the book on relationships called “Dating not friendship”
K. guide A+ => looking for a guide to earning the A+ certification

ii. CQL (Contextual Query Language)

Query language used by SRU (Search/Retrieve via URL) and SRW (Search/Retrieve via Web service), the interfaces commonly used for searching OAI repositories. Required for Koha to act as an SRU/SRW server (see 2(a)). CQL queries look like the following:

A. title any "dinosaur bird reptile" => looking for a book which has at least one of the words “dinosaur,” “bird,” or “reptile” in its title
B. dc.author=(kern* or ritchie) and (bath.title exact "the c programming language" or dc.title=elements prox///4 dc.title=programming) and subject any/relevant "style design analysis" => looking for any book by Kern or Ritchie with either the title “The C programming language,” or the word “elements” within four words of the word “programming” in the title, and a subject related to “style,” “design,” or “analysis”

iii. PQF (Prefix Query Format)

Textual representation of the query format used by Z39.50. Required for Koha to act as a Z39.50 server (see 2(a)). PQF queries look like the following:

A. @attr 1=Title @attr 4=1 @attr 6=3 "beethoven bibliography database" => looking for a record with the exact title “Beethoven bibliography database”
B. @attr 1=4 @attr 5=102 @attr 2=102 "informat.* retrieval" => looking for books with titles relevant to information/informative/informatics/etc. retrieval

iv. QueryParser's native format

Currently used as the default query language by Evergreen, and will be used (optionally) as the default in Koha following the completion of 1(b). This is similar to CCL but uses a more expressive syntax and uses punctuation for operators. Following is an example query:

A. ("harry potter" && (stone || chamber)) && (author:rowling || subject:rowling) => looking for a book with the keywords “Harry Potter” and either “stone” or “chamber,” with the author or subject “Rowling” (i.e. either Harry Potter and the Sorcerer's/Philosopher's Stone, Harry Potter and the Chamber of Secrets, or a book about the writing of those books)

(h) Implement a QueryParser driver for Solr

In order to use all the features of QueryParser with Solr, we will also need a driver for Solr.

2. Basic functionality

Though possibly not, strictly speaking, core Koha functionality, in our view these represent the additions to Koha that should be considered basic requirements for this development.

(a) Z39.50/SRU/SRW server

At present a Z39.50/SRU/SRW server is available only if the Zebra search engine has been selected and configured to provide external Z39.50/SRU/SRW service. If Solr has been selected as the search engine, Z39.50/SRU/SRW is unavailable. Z39.50 is commonly used for ILL, at least in the United States where it provides a cost-effective alternative to ILLiad for state-wide ILL networks. SRU/SRW is used by OAI (the Open Archives Initiative) for sharing metadata. An SRU/SRW server integrated into Koha would simplify future enhancements to Koha's federated searching.

3. Additional functionality

Items in this section would add a great deal to Koha's functionality, but represent opportunities for further improvement that could be selected à la carte, rather than – as with the above – being basic requirements of any search rewrite. Many of these features could be implemented before all the above components are completed.

(a) “Did you mean?” spelling plugin

i. Basic plugin

A “Did you mean?” spelling plugin that uses a spelling database drawn from the catalog to generate a dictionary of suggestions.

ii. Consider word collocation when making suggestions

Add logic to the spelling plugin to consider word proximity when making suggestions. As an example of where this might be significant, consider the following searches:

A. dastardly dedd => intended search: dastardly deed (looking for mystery romance); recommendation: dastardly dead (finds horror)
B. duck goost => intended search: duck goose (looking for books on kids' games); recommendation: duck ghost (finds children's humor books)
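One way to picture the collocation logic in the examples above: collect candidate corrections by spelling similarity, then prefer the candidate that most often follows the previous word in the catalog. The sketch below uses Python's difflib for the similarity step; that choice, the function names, and the sample titles are assumptions for illustration only.

```python
import difflib
from collections import Counter
from typing import List, Optional


def build_bigrams(titles: List[str]) -> Counter:
    """Count how often each pair of adjacent words occurs in the titles."""
    pairs: Counter = Counter()
    for title in titles:
        words = title.lower().split()
        pairs.update(zip(words, words[1:]))
    return pairs


def suggest(previous: str, misspelled: str,
            vocabulary: List[str], bigrams: Counter) -> Optional[str]:
    """Pick the similar-looking word that most often follows `previous`."""
    candidates = difflib.get_close_matches(misspelled, vocabulary, n=5, cutoff=0.6)
    if not candidates:
        return None
    # max() returns the first of several equally frequent candidates.
    return max(candidates, key=lambda word: bigrams[(previous, word)])
```

Given titles in which "dastardly" is followed by "deed" but never by "dead", the misspelling "dedd" after "dastardly" resolves to "deed" even though both candidates are equally close in spelling.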

(b) Static relevancy ranking

Allow librarians to choose, for example, to make DVDs always float to the top or sink to the bottom of searches. This could be implemented any time after 1(f).
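A minimal sketch of how a static rule could sit on top of ordinary relevance scores: each item type is assigned a tier, results are ordered by tier first, and relevance only decides the order within a tier. The tier table and the shape of a "hit" are hypothetical.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, str, float]  # (record id, item type, relevance score)


def apply_static_ranking(hits: List[Hit], tiers: Dict[str, int]) -> List[Hit]:
    """Order by configured tier first, then by relevance within a tier.

    A positive tier floats an item type to the top, a negative tier sinks
    it to the bottom, and unlisted item types stay in tier 0.
    """
    return sorted(hits, key=lambda hit: (tiers.get(hit[1], 0), hit[2]), reverse=True)
```

For example, a tier table of {"DVD": -1} sends every DVD below every other result, however high its relevance score.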

(c) User-specified relevancy ranking

Rather than imposing a single set of relevancy ranking rules on all libraries, allow librarians to set the relevancy rules for searches on their library's catalog themselves.

(d) User-configurable indexes

Interface to enable the user to configure indexes for all supported search engines, not just Solr.

(e) “Snippets” on the results page

It can be hard to tell exactly why a given record is being suggested. By showing a snippet of the part of the record that matched the user's search, we would help users choose relevant results more quickly.
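The snippet idea can be illustrated with a few lines of string handling. This sketch assumes a single search term and plain substring matching; a real implementation would more likely rely on the search engine's own highlighting support.

```python
def snippet(text: str, term: str, width: int = 30) -> str:
    """Return the matched term with up to `width` characters on each side.

    Assumes that lowercasing does not change the length of the text,
    which holds for ASCII but not for every Unicode string.
    """
    position = text.lower().find(term.lower())
    if position < 0:
        return ""  # the term does not occur in this part of the record
    start = max(0, position - width)
    end = min(len(text), position + len(term) + width)
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + text[start:end] + suffix
```

The ellipses mark where the record text was cut, so the user can see that the snippet is an excerpt rather than the whole field.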

(f) Automatically follow spelling suggestions when no results are found

Add the ability to have the OPAC automatically follow the first suggestion in cases where no results are found, a la Google's “Showing results for 'syzygy'” feature when you search for “szygy” (which is, of course, a typo).
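The control flow for this feature is small enough to sketch directly. The search and suggest callables below are placeholders for whatever the search code and the spelling plugin end up providing; their signatures are assumptions.

```python
from typing import Callable, List, Optional, Tuple


def search_with_fallback(
    query: str,
    search: Callable[[str], List[str]],
    suggest: Callable[[str], Optional[str]],
) -> Tuple[str, List[str]]:
    """Run the query; if nothing is found, retry with the first suggestion.

    Returns the query that was actually used along with its results, so
    the OPAC can tell the user which search it is showing.
    """
    results = search(query)
    if results:
        return query, results
    suggestion = suggest(query)
    if suggestion is None:
        return query, []
    return suggestion, search(suggestion)
```

Returning the query that was actually run is what makes a "Showing results for ..." message possible.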

(g) Multi-lingual stemming

Add multi-lingual stemming so that Koha will recognize that “Толстой” and “Толстого” are actually the same word (both are forms of the last name of the famous author Lev Tolstoy, the first in the nominative case and the second in the genitive). Due to limitations in the available software, at least initially stemming would be limited to European languages.

(h) "Fuzzy" searching

Integrated support for “fuzzy” searching, eliminating the need for search engine support, and making it consistent. At present, “fuzzy” searching means something different depending on which search engine is in use:

i. When Solr is in use, “fuzzy” means either that accents are ignored, or that searches will match words with similar spellings based on one of two algorithms [2].
ii. When Zebra is in use and ICU is disabled, “fuzzy” means that any one character can be replaced with another character and the word will still match. When Zebra is in use and ICU is enabled, “fuzzy” searching works exactly the same as non-fuzzy searching. This could conceivably be fixed in Zebra, as Zebra is an open source system as well.
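To make the Zebra behavior in item ii concrete, the function below models "any one character can be replaced with another character" as a match that tolerates at most one substituted character. It is a model of the behavior as described here, not Zebra's implementation.

```python
def matches_with_one_substitution(term: str, word: str) -> bool:
    """True if `word` equals `term` with at most one character replaced.

    Insertions and deletions are not tolerated, so the lengths must match.
    """
    if len(term) != len(word):
        return False
    differences = sum(1 for a, b in zip(term, word) if a != b)
    return differences <= 1
```

Under this model "tolstoi" matches "tolstoy", but "tolstoys" does not, because a length difference is an insertion rather than a substitution.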

(i) Federated searching using the QueryParser

By replacing the current interface to Pazpar2 with one based on the new search code, we could greatly improve search matching, and offer more possibilities for federated searching, including better integration of ILL directly into the OPAC (the BorrowDirect program could use a single OPAC using this feature).

(j) Additional query languages

i. A Google-style query syntax

A simplistic query syntax that uses punctuation rather than words for operators. This would have the benefit of being more familiar than the QueryParser syntax, at the expense of being less expressive.

ii. Solr's native query languages

Note that this would not be necessary to take advantage of Solr's functionality. This would probably be useful only to computer programmers who had grown accustomed to the syntax. Most likely the QueryParser syntax would be sufficient for most uses of Koha, but for completeness' sake it is worth noting that these languages could be implemented. The two query languages most commonly used by Solr are:

The Lucene Query Parser syntax
DisMax (Disjunction Max) syntax

iii. Other special-purpose query syntaxes

A. The Bloomberg query syntax
B. The DIALOG query syntax
C. The LexisNexis query syntax
D. The OCLC query syntax
E. The Westlaw query syntax
F. The query syntax used by any other database you may have used