Switch to Solr RFC



This article is obsolete




From Zebra to Solr

Status: unknown
Sponsored by: St Etienne University
Developed by: BibLibre
Expected for: 2011-05-01
Bug number: Bug 5360
Work in progress repository: No URL given.
Description: We propose to switch from Zebra to Solr as the indexing engine. See our mail on koha-devel and the following thread: http://lists.koha-community.org/pipermail/koha-devel/2010-October/034447.html

Zebra is fast and embeds a native Z39.50 server. But it also has some major drawbacks that we have to cope with every day and that make it quite difficult to maintain.

  • Zebra config files are a nightmare. You can't drive the configuration easily: indexes cannot be edited via HTTP or through configuration, because everything is hardcoded in files on disk. ⇒ You can't list indexes, change them, or edit them, and you can't say "I want this index in the OPAC, that one in the intranet". (It could be done by scraping ccl.properties, and then record.abs and bib1.att…. But that is a bit awkward, though less of a HELL than having no configuration file.) So you cannot easily customize the configuration to define the indexes you want. And people do not get a translation of the indexes, since all of them are hardcoded in ccl.properties and we do not have a translation process that would let ccl attributes be translated into different languages. (Could the translation process be improved?)
  • No real-time indexing: relying on a crontab is poor. When you add an authority while creating a biblio, you have to wait some minutes before you can finish your biblio. (This might be solved, since Zebra has a way to index biblios via Z39.50 extended services, but it is hard and should be tested; when the community first tested it, a performance problem was raised on indexing.) - This is http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=5165 and seems our fault, not zebra's.
  • No way to access, process, or delete data easily. If something is wrong in your indexes or your data, you have to reindex everything, and indexing errors are quite difficult to detect. - test case?
  • While indexing a file, if there is a problem in your data, zebraidx just fails silently… And this is NOT safe. And you have no way to know WHICH biblio made the process crash. We had a LOT of trouble with the Aix-Marseille universities, which have some transliterated Arabic biblios that make zebra/icu crash completely! We had to write a recursive script to find the 14 biblios out of 730 000 that make Zebra crash (even though they are properly stored & displayed). - test case?
  • Facets are not working properly: they are built from the displayed results, because there are problems with diacritics & facets that can't be solved as of today. And no one can provide a solution (we spoke about that with Index Data and no clear solution was really provided). - test case?
  • Zebra does not evolve anymore. There is no real community around it; it's just open-source software from Index Data. We sent many questions to the list and never got answers. We could pay for better support, but the fee required is quite a deterrent and the benefit is still questionable. - we could make it part of Koha's community - is there interest?
  • ICU & Zebra are colleagues, not really friends: right truncation does not work, fuzzy search does not work, and neither do facets. - test case?
  • We use a deprecated way to define indexes for biblios (GRS-1), and the tool developed by Index Data to move to DOM has many flaws. We could manage and make do with it. But is it worth the effort?

I think everyone agrees that we have to refactor C4::Search. Indeed, the query parser is not able to manage all the configuration options independently. And using USMARC as the internal format for biblios comes with a serious limitation of 99999 bytes per record (the ISO 2709 record length limit), which is not enough for big biblios with many items. - this limitation is Koha's fault, not Zebra's.

BibLibre has investigated a catalogue based on Solr.

A university in France contracted us for that development. This university is in contact with the whole community here in France, and Solr will certainly be adopted by libraries France-wide (citation required - web searching did not find any statement of intent for all French libraries to adopt Solr). We are planning to release the code on our git in early spring next year and to rebase it on whatever Koha version has been released by then, 3.4 or 3.6.


Why?

Solr indexes data over HTTP.

  • It can provide fuzzy search, search on synonyms, suggestions, faceted search, and stemming.
  • UTF-8 support is built in.
  • The community is large, impressively responsive, and efficient.
  • The documentation is very good and exhaustive.
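To illustrate what "over HTTP" means in practice, here is a minimal sketch that builds a Solr select URL using core Perl only. The base URL is a hypothetical local endpoint, and the helper names uri_escape_utf8 and solr_select_url are invented for this example. The request parameters q, wt, facet and facet.field are standard Solr ones, and the facet field names are taken from the sample results further down this page.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Percent-encode a string for a URL query string (UTF-8 bytes;
# only the RFC 3986 unreserved characters are left as is).
sub uri_escape_utf8 {
    my ($text) = @_;
    my $bytes = encode_utf8($text);
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

# Build a select URL. $base is an assumed Solr endpoint, not a real server.
sub solr_select_url {
    my ($base, $query, @facet_fields) = @_;
    my @params = ( [ q => $query ], [ wt => 'json' ] );
    if (@facet_fields) {
        push @params, [ facet => 'true' ];
        push @params, [ 'facet.field' => $_ ] for @facet_fields;
    }
    my @pairs;
    for my $p (@params) {
        push @pairs, uri_escape_utf8($p->[0]) . '=' . uri_escape_utf8($p->[1]);
    }
    return "$base/select?" . join('&', @pairs);
}

print solr_select_url(
    'http://localhost:8983/solr',       # hypothetical endpoint
    "encyclop\x{e9}die poterie",
    'str_author', 'str_subject',
), "\n";
```

The resulting URL can be fetched with any HTTP client, which is what makes the index easy to inspect and script against.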

You can see the results on solr.biblibre.com and catalogue.solr.biblibre.com:

  • http://catalogue.solr.biblibre.com/cgi-bin/koha/opac-search.pl?q=jean
  • http://solr.biblibre.com/cgi-bin/koha/admin/admin-home.pl (you can log in there with the login/password demo/demo)

http://solr.biblibre.com/cgi-bin/koha/solr/indexes.pl is the page where people can manage their indexes and links.

a) Librarians can define their own indexes, and there is a plugin that fetches data from rejected authorities and from authorised_values (this could/should have been achieved with Zebra, but only with major work on XSLT).

b) The line count of C4/Search.pm could be cut tenfold. You can test from the poc_solr branch on git://git.biblibre.com/koha_biblibre.git, but you have to install Solr.


Concerns

Meetings

How To Help

Documentation for Solr

Record Indexing and Retrieval Options for Koha

The comparative options content originally posted here has been moved to Record Indexing and Retrieval Options for Koha.

Options Advantages and Disadvantages

See Options Advantages and Disadvantages.

Configuration

Record Indexing Server Aspect of Configuration

Solr/Lucene Record Indexing Server Aspect of Configuration

See the summary description of BibLibre work on Solr/Lucene Record Indexing Server Aspect of Configuration.

Z39.50/SRU Server Aspect of Configuration

SimpleServer and YAZ with Solr/Lucene Z39.50/SRU Server Aspect of Configuration

See the summary description of SimpleServer and YAZ with Solr/Lucene Z39.50/SRU Server Aspect of Configuration.

Z39.50/SRU Client Aspect of Configuration

Metasearch (Federated Search) Aspect of Configuration

Planning is still needed for any Koha OPAC metasearch functionality with Solr/Lucene.

Code Functionality

Record Indexing Aspect of Code Functionality

Data::SearchEngine::Solr and Solr/Lucene Record Indexing Aspect of Code Functionality

See the summary description of BibLibre work on Data-SearchEngine-Solr and Solr/Lucene Record Indexing Aspect of Code Functionality.

Local Record Retrieval Aspect of Code Functionality

Data::SearchEngine::Solr and Solr/Lucene Local Record Retrieval Aspect of Code Functionality

See the summary description of BibLibre work on Data-SearchEngine-Solr and Solr/Lucene Local Record Retrieval Aspect of Code Functionality.

Z39.50/SRU Server Aspect of Code Functionality

SimpleServer with Data::SearchEngine::Solr and Solr/Lucene Z39.50/SRU Server Aspect of Code Functionality

See the summary description of BibLibre work on SimpleServer with Data-SearchEngine-Solr and Solr/Lucene Z39.50/SRU Server Aspect of Code Functionality.

Metasearch (Federated Search) Aspect of Code Functionality

Planning is still needed for any Koha OPAC metasearch functionality with Solr/Lucene.

#kohahack12

Debrief of #kohahack12

Big work in progress.

Different tasks:

  • inventory of what we have in the BibLibre Solr version and the community Zebra version
  • move Zebra-specific code from opac-search.pl into .pm files
  • write a Data::SearchEngine::Zebra module, like the Solr CPAN modules
  • implement these in an abstract search layer

To do next:

  • think about queries and how to handle both engines in the Koha UI
  • begin to add use cases to see how the layer responds


References:

how it works today

External libs

C4::Search::Engine.pm

  • find_searchengine
  • search: if C4::Context->preference("SearchEngine") eq "Solr", calls C4::Search::Engine::Solr::SimpleSearch
  • index
  • add_to_index_queue
  • escape_string

C4::Search.pm

(should not be like that)

  • AddToIndexQueue
  • SimpleSearch: calls C4::Search::Engine::find_searchengine, then C4::Search::Engine::search
  • IndexRecord: same scheme as SimpleSearch, but calls "index"
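The dispatch described above, where the SearchEngine preference selects a backend, can be sketched as follows. This is only an illustration under stated assumptions, not the actual C4::Search::Engine code: the %PREFERENCES hash stands in for C4::Context->preference, the backends are stubs, and falling back to Zebra when the preference is unset is a guess. Note that engine names are strings, so they are matched with eq or a hash lookup rather than with ==.

```perl
use strict;
use warnings;

# Hypothetical stand-in for C4::Context->preference("SearchEngine").
my %PREFERENCES = ( SearchEngine => 'Solr' );
sub preference { return $PREFERENCES{ $_[0] } }

# Stub backends; real ones would talk to Solr or Zebra.
my %ENGINES = (
    Solr  => { search => sub { return "solr search: $_[0]" } },
    Zebra => { search => sub { return "zebra search: $_[0]" } },
);

# Pick the backend named by the preference (default is an assumption).
sub find_searchengine {
    my $name = preference('SearchEngine') // 'Zebra';
    die "Unknown search engine '$name'\n" unless exists $ENGINES{$name};
    return $ENGINES{$name};
}

# Callers never name a backend: they go through the selected engine.
sub simple_search {
    my ($query) = @_;
    return find_searchengine()->{search}->($query);
}

print simple_search('poterie'), "\n";
```

With a table like this, adding another engine means registering one more entry, and the calling code stays unchanged.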

C4::Search::Engine::Solr.pm

sub SimpleSearch $q, $filters, $params, $caller

parameters:

  • $q : the query
  • $filters : filters
  • $params : $page (page number), $count (maximum number of results to return), $sort (indexes to sort on, as "idx_name (asc|desc)")
  • $facets

steps:

  • open a connection: C4::Search::Engine::Solr creates a new one (Data::SearchEngine::Solr->new)
  • get facets
  • apply filters
  • launch the query: Data::SearchEngine::Query->new (page, count, query), which returns Data::SearchEngine::Solr::Results
  • get the results: Data::SearchEngine::Solr::Results
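As a rough illustration of how the page, count and sort parameters might map onto a Solr request, the sketch below converts them into Solr's standard start, rows and sort parameters. The function name paging_params, the default of 20 results per page, and the default sort direction are assumptions made for this example. The real code goes through Data::SearchEngine::Query rather than building these values by hand.

```perl
use strict;
use warnings;

# Translate page/count/sort into Solr-style start/rows/sort values.
sub paging_params {
    my ($params) = @_;
    my $count = $params->{count} // 20;   # default page size is an assumption
    my $page  = $params->{page}  // 1;
    $page = 1 if $page < 1;

    my %solr = (
        start => ($page - 1) * $count,    # zero-based offset of the first result
        rows  => $count,
    );

    if (defined $params->{sort}) {
        # Expected form: "idx_name" or "idx_name asc|desc".
        my ($index, $dir) = $params->{sort} =~ /^\s*(\S+)(?:\s+(asc|desc))?\s*$/i
            or die "Bad sort spec '$params->{sort}'\n";
        $solr{sort} = $index . ' ' . lc($dir // 'asc');
    }
    return \%solr;
}

my $p = paging_params({ page => 3, count => 30, sort => 'str_author DESC' });
print "start=$p->{start} rows=$p->{rows} sort=$p->{sort}\n";
```

Validating the sort string up front keeps a malformed value from reaching the search engine as part of the request.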

sub IndexRecord

  • loop over identifiers
  • call getauthority or getbiblio (depending on parameters)
  • instantiate a Solr document: Data::SearchEngine::Item->new
  • loop over indexes (cf. the indexes table). If the index is linked to a plugin, this plugin returns the data to index. If it's a controlfield, it is indexed as is; otherwise we loop over subfields. If it's a date, normalize it to ISO format.
  • if it's a bibliographic record, get authorised fields and index labels
  • create the index, passing it its values: Data::SearchEngine::Item->set_value
  • send it to Solr: Data::SearchEngine::Solr->add (uses WebService::Solr to send everything to Solr)
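The index loop and the date normalization step above could look roughly like the sketch below. It is a simplified illustration and not the BibLibre implementation: the accepted input date formats (including day-first dd/mm/yyyy), the index name date_pubdate, and the shape of the @indexes list are all assumptions, and a plain hash stands in for Data::SearchEngine::Item. The target format is the ISO 8601 UTC form that Solr date fields expect.

```perl
use strict;
use warnings;

# Normalize a few date spellings to ISO 8601 UTC.
# Returns nothing when the value is not recognized.
sub normalize_date {
    my ($raw) = @_;
    my ($y, $m, $d);
    if    ($raw =~ /^(\d{4})-(\d{2})-(\d{2})$/)  { ($y, $m, $d) = ($1, $2, $3) }
    elsif ($raw =~ /^(\d{4})(\d{2})(\d{2})$/)    { ($y, $m, $d) = ($1, $2, $3) }
    elsif ($raw =~ m{^(\d{2})/(\d{2})/(\d{4})$}) { ($y, $m, $d) = ($3, $2, $1) }
    elsif ($raw =~ /^(\d{4})$/)                  { ($y, $m, $d) = ($1, '01', '01') }
    else                                         { return }
    return if $m < 1 || $m > 12 || $d < 1 || $d > 31;
    return sprintf '%04d-%02d-%02dT00:00:00Z', $y, $m, $d;
}

# Hypothetical index definitions with already-extracted raw values.
my @indexes = (
    { name => 'str_author',   type => 'str',  values => ['Peter Cosentino'] },
    { name => 'date_pubdate', type => 'date', values => ['1990', 'n.d.'] },
);

# A plain hash stands in for the Solr document object.
my %document = ( id => 'biblio_16204', recordid => '16204' );
for my $index (@indexes) {
    my @values = @{ $index->{values} };
    # Date indexes keep only the values that could be normalized.
    @values = grep { defined } map { normalize_date($_) } @values
        if $index->{type} eq 'date';
    $document{ $index->{name} } = \@values if @values;
}

print "$_: @{ $document{$_} }\n" for grep { ref $document{$_} } sort keys %document;
```

Dropping values that cannot be normalized keeps one bad date from blocking the indexing of the whole record, which is one of the failure modes described earlier for Zebra.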

examples:

one result: Data::SearchEngine::Solr::Results

$VAR1 = bless( {
  'spell_suggestions' => {},
  'query' => bless( {
    'count' => '30',
    'page' => 1,
    'query' => "encyclop\x{e9}die poterie",
    'filters' => {},
    'facets' => {}
  }, 'Data::SearchEngine::Query' ),
  'spell_frequencies' => {},
  'pager' => bless( {
    'total_entries' => 1,
    'entries_per_page' => 30,
    'current_page' => 1
  }, 'Data::SearchEngine::Paginator' ),
  'items' => [
    bless( {
      'values' => {
        'recordid' => '16204',
        'id' => 'biblio_16204'
      },
      'score' => 0,
      'id' => 'biblio_16204'
    }, 'Data::SearchEngine::Item' )
  ],
  'facets' => {
    'str_title-series' => [],
    'str_author' => [
      '30387', 1,
      '39966', 1,
      'Cosentino', 1,
      'Cosentino Peter', 1,
      'Peter', 1,
      'Peter Cosentino', 1
    ],
    'str_subject' => [
      '1166', 1,
      '373', 1,
      "C\x{e9}ramique", 1,
      'technique', 1
    ],
    'str_holdingbranch' => [
      'CENTRE', 1
    ],
    'str_ccode' => [
      '1', 1
    ]
  },
  'elapsed' => '0.0799260139465332'
}, 'Data::SearchEngine::Solr::Results' );
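In the dump above, each facet is a flat list that alternates a value and its count. A minimal sketch for turning such a list into ordered value/count pairs follows. The function name facet_pairs is made up for this example, and the sample data is copied from the str_author facet of the one-result dump.

```perl
use strict;
use warnings;

# Turn a flat [value, count, value, count, ...] list into ordered pairs.
sub facet_pairs {
    my ($flat) = @_;
    die "Odd number of elements in facet list\n" if @$flat % 2;
    my @pairs;
    for (my $i = 0; $i < @$flat; $i += 2) {
        push @pairs, { value => $flat->[$i], count => $flat->[$i + 1] };
    }
    return \@pairs;
}

# Sample copied from the 'str_author' facet shown above.
my @str_author = (
    '30387', 1, '39966', 1, 'Cosentino', 1,
    'Cosentino Peter', 1, 'Peter', 1, 'Peter Cosentino', 1,
);

printf "%s (%d)\n", $_->{value}, $_->{count} for @{ facet_pairs(\@str_author) };
```

An array of pairs is used instead of a hash so that the order returned by the search engine is preserved.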

6 results: Data::SearchEngine::Solr::Results

$VAR1 = bless( {
  'spell_suggestions' => {},
  'query' => bless( {
    'count' => '30',
    'page' => 1,
    'query' => 'poterie',
    'filters' => {},
    'facets' => {}
  }, 'Data::SearchEngine::Query' ),
  'spell_frequencies' => {},
  'pager' => bless( {
    'total_entries' => 6,
    'entries_per_page' => 30,
    'current_page' => 1
  }, 'Data::SearchEngine::Paginator' ),
  'items' => [
    bless( {
      'values' => {
        'recordid' => '16501',
        'id' => 'biblio_16501'
      },
      'score' => 0,
      'id' => 'biblio_16501'
    }, 'Data::SearchEngine::Item' ),
    bless( {
      'values' => {
        'recordid' => '16204',
        'id' => 'biblio_16204'
      },
      'score' => 0,
      'id' => 'biblio_16204'
    }, 'Data::SearchEngine::Item' ),
    bless( {
      'values' => {
        'recordid' => '9861',
        'id' => 'biblio_9861'
      },
      'score' => 0,
      'id' => 'biblio_9861'
    }, 'Data::SearchEngine::Item' ),
    bless( {
      'values' => {
        'recordid' => '18307',
        'id' => 'biblio_18307'
      },
      'score' => 0,
      'id' => 'biblio_18307'
    }, 'Data::SearchEngine::Item' ),
    bless( {
      'values' => {
        'recordid' => '18535',
        'id' => 'biblio_18535'
      },
      'score' => 0,
      'id' => 'biblio_18535'
    }, 'Data::SearchEngine::Item' ),
    bless( {
      'values' => {
        'recordid' => '19763',
        'id' => 'biblio_19763'
      },
      'score' => 0,
      'id' => 'biblio_19763'
    }, 'Data::SearchEngine::Item' )
  ],
  'facets' => {
    'str_title-series' => [
      "Dialogues c\x{e9}ramiques", 2,
      "Dialogues c\x{e9}ramiques (Dunkerque).", 2
    ],
    'str_author' => [
      '2980', 2,
      '33518', 2,
      'Dunkerque, Nord', 2,
      "Mus\x{e9}e d'art contemporain", 2,
      '19913', 1,
      '19914', 1,
      '24720', 1,
      '24721', 1,
      '30387', 1,
      '30807', 1,
      '35619', 1,
      '35620', 1,
      '39966', 1,
      '41085', 1,
      '48684', 1,
      '48685', 1,
      'Cosentino', 1,
      'Cosentino Peter', 1,
      'Daniel de', 1,
      'Daniel de Montmollin (textes)', 1,
      "Dunkerque, Mus\x{e9}e d'art contemporain", 1,
      "Fr\x{e8}re de Taiz\x{e9}", 1,
      'Hamid', 1,
      'Montmollin', 1,
      'Montmollin Daniel de', 1,
      "Mus\x{e9}e d'art contemporain, Dunkerque", 1,
      'Ouazzani', 1,
      'Ouazzani Thami', 1,
      'Peter', 1,
      'Peter Cosentino', 1,
      'Pierre-Yves Videlier (illustrations )', 1,
      'Potter', 1,
      'Potter Tony', 1,
      'Thami', 1,
      'Tony', 1,
      'Triki', 1,
      'Triki Hamid', 1,
      'Videlier', 1,
      'Videlier Yves', 1,
      'Yves', 1,
      "avec la participation de Brigitte Barberi Daum ; photographies Christian Lignon, G\x{e9}rard Dufrene ; conception et r\x{e9}alisation graphique Amina Bennani", 1,
      'textes Hamid Triki, Thami Ouazzani', 1
    ],
    'str_subject' => [
      "C\x{e9}ramique", 2,
      '1166', 1,
      '18355', 1,
      '18384', 1,
      '1947-....', 1,
      '1958-....', 1,
      '19802', 1,
      '33519', 1,
      '33824', 1,
      '35622', 1,
      '373', 1,
      'Camille', 1,
      "Chagu\x{e9}", 1,
      'Expositions', 1,
      'Maroc', 1,
      'Safi', 1,
      "Thi\x{e9}baut", 1,
      'Virot', 1,
      'technique', 1
    ],
    'str_holdingbranch' => [
      'CENTRE', 5,
      'AURENCE', 1
    ],
    'str_ccode' => [
      '1', 6
    ]
  },
  'elapsed' => '0.0832719802856445'
}, 'Data::SearchEngine::Solr::Results' );

C4::Search::Query::Solr.pm