Understanding Zebra indexing

From Koha Wiki

Warning for those who follow: this page is largely outdated, as all examples refer to indexing using GRS-1. The file record.abs is obsolete, and all examples below need to be re-written to use biblio-koha-indexdefs.xml.



Adding additional fields to Zebra search

You will have to modify the following files to add a previously unsupported field (856$u in our example) to Zebra:


zebradb/biblios/etc/bib1.att

List of search indexes and their corresponding Z39.50 use attributes

Documentation for bib1 is available at http://www.loc.gov/z3950/agency/bib1.html. According to it, our 856$u field should map to use attribute 1032:

dpavlin@koha-dev:/etc/koha/zebradb$ grep 1032 biblios/etc/bib1.att
att 1032    Doc-id

zebradb/marc_defs/marc21/biblios/record.abs

Mapping of MARC fields to indexes

dpavlin@koha-dev:/etc/koha/zebradb$ grep Doc-id marc_defs/marc21/biblios/record.abs
melm 856$u      Doc-id

zebradb/ccl.properties

Maps the CCL qualifier used in search queries to the use attribute

dpavlin@koha-dev:/etc/koha/zebradb$ grep Doc-id ccl.properties
Doc-id 1=1032

/usr/share/koha/lib/C4/Search.pm

(The path to your Search.pm may be different.)

Add the new index name to my @indexes, under the # biblio indexes comment:

 'Doc-id'

After that, you will have to reindex Zebra so that the new index gets populated.
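The three Zebra configuration files above have to agree on the index name and on the use attribute number. As a rough sketch only, the following Python function checks that agreement on file contents passed in as strings. The function name check_index and the line patterns it expects are assumptions based on the grep output shown above, not part of Koha or Zebra, and it does not look at C4/Search.pm.

```python
import re

def check_index(name, bib1_att, record_abs, ccl_properties):
    """Return a list of problems for one index name (an empty list means consistent)."""
    problems = []

    # bib1.att: lines look like "att 1032    Doc-id"
    att = re.search(r"^att[ \t]+(\d+)[ \t]+" + re.escape(name) + r"[ \t]*$",
                    bib1_att, re.M)
    if att is None:
        problems.append("no att line in bib1.att")

    # record.abs: lines look like "melm 856$u      Doc-id" (comma-separated index list)
    used = False
    for line in record_abs.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "melm":
            # drop a ":w" / ":p" style suffix before comparing names
            indexes = [entry.split(":")[0] for entry in parts[2].split(",")]
            used = used or name in indexes
    if not used:
        problems.append("no melm line in record.abs")

    # ccl.properties: lines look like "Doc-id 1=1032"
    ccl = re.search(r"^" + re.escape(name) + r"[ \t]+1=(\d+)", ccl_properties, re.M)
    if ccl is None:
        problems.append("no qualifier in ccl.properties")
    elif att is not None and ccl.group(1) != att.group(1):
        problems.append("use attribute differs between bib1.att and ccl.properties")

    return problems


# Sample contents copied from the grep output above
bib1_att = "att 1032    Doc-id\n"
record_abs = "melm 856$u      Doc-id\n"
ccl_properties = "Doc-id 1=1032\n"
print(check_index("Doc-id", bib1_att, record_abs, ccl_properties))
```

With the sample lines the list of problems is empty. In practice you would read the three files from your own zebradb directory and pass their contents in.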

An alternative example of modifying Koha to support additional fields is available at http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=4506

Examine Zebra index

This example was provided by Adam Dickmeiss <adam@indexdata.dk> on the koha-devel mailing list.

You should be able to view the indexing for any record by doing a 'present' of the record with the element set zebra::index and the syntax set to XML or SUTRS; see https://www.indexdata.com/zebra/doc/special-retrieval.html

Most Z39.50 clients should be able to do that.

yaz-client example:

yaz-client z3950.indexdata.com/marc
Connecting...OK.
Sent initrequest.
Connection accepted by v3 target.
ID : 81
Name : Zebra Information Server/GFS/YAZ
Version: 4.1.3 3ec3ad2bd1e248022bbb7c9d762844bef7f8a461
Options: search present delSet triggerResourceCtrl scan sort
extendedServices namedResultSets
Elapsed: 0.055151
Z> f computer
Sent searchRequest.
Received SearchResponse.
Search was a success.
Number of hits: 10, setno 1
SearchResult-1: term=computer cnt=10
records returned: 0
Elapsed: 0.029065
Z> s
Sent presentRequest (1+1).
Records: 1
[marc]Record type: USmarc
00366nam 22001698a 4504
001 11224466
003 DLC
005 00000000000000.0
008 910710c19910701nju 00010 eng
010 $a 11224466
040 $a DLC $c DLC
050 00 $a 123-xyz
100 10 $a Jack Collins
245 10 $a How to program a computer
260 1 $a Penguin
263 $a 8710
300 $a p. cm.

nextResultSetPosition = 2
Elapsed: 0.026880
Z> form xml
Z> elem zebra::index
Z> s 1+1
Sent presentRequest (1+1).
Records: 1
Record type: XML
<record xmlns="http://www.indexdata.com/zebra/" sysno="2" set="zebra::index">
<index name="Author-name-personal" type="w" seq="1">@^</index>
<index name="Author-name-personal" type="w" seq="1"></index>
<index name="Author-name-personal" type="w" seq="2">jack</index>
<index name="Author-name-personal" type="w" seq="3">collins</index>
<index name="Author-name-personal" type="p" seq="4">jack collins</index>
<index name="Author" type="w" seq="4">@^</index>
<index name="Author" type="w" seq="1"></index>
<index name="Author" type="w" seq="5">jack</index>
<index name="Author" type="w" seq="6">collins</index>
<index name="Author" type="p" seq="7">jack collins</index>
<index name="any" type="w" seq="7">@^</index>
<index name="any" type="w" seq="1"></index>
<index name="any" type="w" seq="8">jack</index>
<index name="any" type="w" seq="9">collins</index>
<index name="title" type="w" seq="10">@^</index>
<index name="title" type="w" seq="1"></index>
<index name="title" type="w" seq="11">how</index>
<index name="title" type="w" seq="12">to</index>
<index name="title" type="w" seq="13">program</index>
<index name="title" type="w" seq="14">a</index>
<index name="title" type="w" seq="15">computer</index>
<index name="title" type="p" seq="16">how to program a computer</index>
<index name="any" type="w" seq="16">@^</index>
<index name="any" type="w" seq="17">how</index>
<index name="any" type="w" seq="18">to</index>
<index name="any" type="w" seq="19">program</index>
<index name="any" type="w" seq="20">a</index>
<index name="any" type="w" seq="21">computer</index>
<index name="Place-publication" type="w" seq="22">@^</index>
<index name="Place-publication" type="w" seq="1"></index>
<index name="Place-publication" type="w" seq="23">penguin</index>
<index name="any" type="w" seq="24">@^</index>
<index name="any" type="w" seq="25">penguin</index>
<index name="_ALLRECORDS" type="w" seq="2"></index>
</record>
nextResultSetPosition = 2
Elapsed: 0.027786

In the XML output, "seq" is the sequence number, i.e. the position of the token.
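If you save such a zebra::index response to a file, it can be easier to read once the terms are grouped by index. The following is a minimal sketch using Python's standard xml.etree.ElementTree. The function name terms_by_index is hypothetical, and the sample is a subset of the output above. Empty entries and the "@^" entries are simply skipped, without any claim about what they mean.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Namespace taken from the <record> element in the output above
ZEBRA_NS = "{http://www.indexdata.com/zebra/}"

def terms_by_index(xml_text):
    """Group indexed terms by (index name, type), ordered by their seq value."""
    root = ET.fromstring(xml_text)
    grouped = defaultdict(list)
    for element in root.iter(ZEBRA_NS + "index"):
        term = element.text or ""
        if term in ("", "@^"):
            continue  # skip empty entries and "@^" entries
        key = (element.get("name"), element.get("type"))
        grouped[key].append((int(element.get("seq")), term))
    return {key: [term for _, term in sorted(pairs)] for key, pairs in grouped.items()}


SAMPLE = """<record xmlns="http://www.indexdata.com/zebra/" sysno="2" set="zebra::index">
<index name="title" type="w" seq="10">@^</index>
<index name="title" type="w" seq="1"></index>
<index name="title" type="w" seq="11">how</index>
<index name="title" type="w" seq="12">to</index>
<index name="title" type="w" seq="13">program</index>
<index name="title" type="w" seq="14">a</index>
<index name="title" type="w" seq="15">computer</index>
<index name="title" type="p" seq="16">how to program a computer</index>
</record>"""

for (name, index_type), terms in terms_by_index(SAMPLE).items():
    print(name, index_type, terms)
```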

Recover MARC records from Zebra index

Sometimes you drop the database, but you would really like to keep the biblio records which are still in Zebra.

The -m option of yaz-client dumps the retrieved records into a MARC file, which you can later import back into Koha:

dpavlin@koha-dev:~$ yaz-client -m ~/backup.marc unix:/var/run/koha/gimpoz/bibliosocket
Connecting...OK.
Sent initrequest.
Connection accepted by v3 target.
ID     : 81
Name   : Zebra Information Server/GFS/YAZ
Version: 4.2.29 94b8ca343cc7084394526bd26c80c208aa16c9f9
Options: search present delSet triggerResourceCtrl scan sort extendedServices namedResultSets
Elapsed: 0.000950
Z> base biblios

To select all records, I'm assuming that every record in the database contains the string gimpoz:

Z> find gimpoz
Sent searchRequest.
Received SearchResponse.
Search was a success.
Number of hits: 6022, setno 1
SearchResult-1: term=gimpoz cnt=6022
records returned: 0
Elapsed: 0.006662

The number of hits seems OK, so we start retrieving records. You will not be able to retrieve all records using a single show command, so we do it in batches of 1000 records:

Z> show 1+1000
# record output...
nextResultSetPosition = 1001
Elapsed: 0.112413
Z> show 1001+1000
...
nextResultSetPosition = 2001
Elapsed: 0.107331
Z> show 2001+1000
...
nextResultSetPosition = 3001
Elapsed: 0.105962
Z> show 3001+1000
...
nextResultSetPosition = 4001
Elapsed: 0.106484
Z> show 4001+1000
...
nextResultSetPosition = 5001
Elapsed: 0.105443
Z> show 5001+1000
...
nextResultSetPosition = 6001
Elapsed: 0.105816
Z> show 6001+22
...
nextResultSetPosition = 6023
Elapsed: 0.002511
Z> exit
See you later, alligator.
$ ls -al ~/backup.marc 
-rw-r--r-- 1 dpavlin dpavlin 1395577 2012-04-24 23:50 /home/dpavlin/backup.marc
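The batch boundaries typed by hand in the session above follow directly from the number of hits. As a small sketch, this Python function (the name show_commands is made up for illustration) generates the list of show commands for a given hit count and batch size, which you could then paste into yaz-client.

```python
def show_commands(hits, batch=1000):
    """Return yaz-client 'show start+count' commands covering records 1..hits."""
    commands = []
    start = 1
    while start <= hits:
        # the last batch is shortened so that it ends exactly at the last record
        count = min(batch, hits - start + 1)
        commands.append(f"show {start}+{count}")
        start += count
    return commands


for command in show_commands(6022):
    print(command)
```

For 6022 hits this produces the same seven commands as in the session, ending with show 6001+22.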

Remember to use koha-dump to create a backup before you start destructive operations!

"Abandon hope, all ye who enter"

(jcamins originally pasted this in the pastebin, copying it here for more people to see it.)

To give a concrete example of what I'm talking about, consider the following data:

=610  20$aGoddard College$vFaculty publications.

And this configuration:

melm 610$a      Name-and-title
melm 610$t      Name-and-title,Title
melm 610$9      Koha-Auth-Number
melm 610        Name,Subject,Subject:p,Corporate-name

From that, you get these indexes:

Name-and-title = {"Goddard College"}
Name = {"Faculty publications"}
Subject = {"Faculty publications"}
Subject:p = {"Faculty publications"}
Corporate-name = {"Faculty publications"}

What we wanted, obviously, was for the Subject, Name, and Corporate-name indexes to include the entire 610 field. But that doesn't happen, because the moment the "melm 610$a" line matches subfield $a, the parser stops processing that subfield, and the "melm 610" line is never reached for it. In order to get the behavior we want, we need the following configuration (actually, I'm not including subfield $t or subfield $9 in those indexes):

melm 610$a      Name-and-title,Subject,Subject:p,Name,Corporate-name
melm 610$t      Name-and-title,Title
melm 610$9      Koha-Auth-Number
melm 610        Name,Subject,Subject:p,Corporate-name

From this correct configuration, we get these indexes:

Name-and-title = {"Goddard College"}
Name = {Goddard, College, Faculty, publications}
Subject = {Goddard, College, Faculty, publications}
Subject:p = {"Goddard College", "Faculty publications"}
Corporate-name = {Goddard, College, Faculty, publications}

Hopefully someone will be able to do something useful with this information.
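The first-match behaviour described above can be mimicked with a toy model. This is only a sketch of the behaviour as reported here, not Zebra's actual GRS-1 parser: the function build_indexes and the rule representation are invented for illustration, and tokenisation into words and punctuation normalisation are not modelled, so each index just collects whole subfield values.

```python
def build_indexes(rules, tag, subfields):
    """Apply melm-style rules with first-match-wins semantics per subfield.

    rules     -- list of (pattern, comma-separated index names) in file order
    tag       -- MARC tag of the field, e.g. "610"
    subfields -- list of (code, value) pairs of that field
    """
    indexes = {}
    for code, value in subfields:
        for pattern, names in rules:
            # "610$a" matches only subfield a; a bare "610" matches any subfield
            if pattern in (tag, f"{tag}${code}"):
                for name in names.split(","):
                    indexes.setdefault(name, []).append(value)
                break  # first matching rule wins; later rules are never reached
    return indexes


field = [("a", "Goddard College"), ("v", "Faculty publications.")]

broken = [
    ("610$a", "Name-and-title"),
    ("610$t", "Name-and-title,Title"),
    ("610$9", "Koha-Auth-Number"),
    ("610", "Name,Subject,Subject:p,Corporate-name"),
]
fixed = [("610$a", "Name-and-title,Subject,Subject:p,Name,Corporate-name")] + broken[1:]

print(build_indexes(broken, "610", field)["Subject"])
print(build_indexes(fixed, "610", field)["Subject"])
```

With the first configuration the Subject index only receives the $v value. With the second one it receives both $a and $v, which matches the two results listed above.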

Following discussion

  • <hdl> oh that... Yes. Indeeed. We had that kind of problem.
  • <cait> looks like something that should go on the wiki
  • <hdl> we made a patch on record.abs in order to index items.
  • <hdl> And yes record.abs Is a nightmare... But changing DOM indexing is also quite promising in terms of fun... There is no "any" index by default. you have to add that.
  • <jcamins> hdl: the "any index doesn't work the way everyone thinks it does with GRS-1, either.
  • <jcamins> And if we use xsltproc to compile our DOM configuration, it's easy enough to add a "match='*'" template.
  • <hdl> Well at least, it searches the indexed data
  • <jcamins> hdl: yes, but I am only certain that two people understand that.
  • <hdl> jcamins: I much more believe in Solr than doing overcomplex XSLT just to index stuff.
  • <jcamins> (probably more do, but the only two people who I *know* understand what "all any" does are me, and you, since you just told me you did;)
  • <hdl> count paul and gmcharlt and chris and chris_n too in I guess fredericd is also a good guess for that... That makes quite a lot ;)
  • <hdl> (a kind of community of geeks working on the same stuff...)
  • <hdl> any is gathering all the indexed elements. Not all the elements of the record.
  • <jcamins> Right.
  • <hdl> nor all the elements that have not been indexed.
  • <hdl> can be compared to all_fields in solr

Another real life example

Here are the changes that I skipped explaining in the Kohacon2012 presentation "Koha in language centres"; maybe they will help someone. This example is for MARC21. The idea was to index four fields, map them to strings for the search query and make them searchable in the OPAC. The fields are:

MARC    strings
653$a   ln-class
521$a   lex
658$a   curriculum
655$a   index-term-genre

We had to edit four files:

/usr/share/koha/lib/C4/Search.pm
/etc/koha/zebradb/biblios/etc/bib1.att
/etc/koha/zebradb/ccl.properties
/etc/koha/zebradb/marc_defs/marc21/biblios/record.abs

+ means we added a line

- means we deleted a line

= means we kept the line; it's in the example so you have an idea which section of the file is edited


1. /usr/share/koha/lib/C4/Search.pm


We added the strings that we wanted to map to the MARC fields to @indexes here


= sub getIndexes{
=     my @indexes = (
=                     # biblio indexes

[…]

+                     'Index-term-genre',
+                     'lex',
+                     'ln-class',
+                     'curriculum',


2. /etc/koha/zebradb/biblios/etc/bib1.att


We added two lines here, the other two were in there already.


 = ## Fixed Fields and other special indexes

[…]

= att 9903    lex
+ att 9653    ln-class
+ att 9655    Index-term-genre
= att 9658    curriculum


3. /etc/koha/zebradb/ccl.properties


Again, two lines were already in here and two were added.


= lex 1=9903 r=r
= curriculum 1=9658
+ ln-class 1=9653
+ Index-term-genre 1=9655


4. /etc/koha/zebradb/marc_defs/marc21/biblios/record.abs


Here we changed one line and added two more.


- melm 521$a      lex:n
+ melm 521$a      lex:w,lex:p

[…]

+ melm 653$a      ln-class:w,ln-class:p,Subject,Subject:p
= melm 653$9      Koha-Auth-Number
= melm 653        Subject,Subject:p

[…]

+ melm 655$a      Index-term-genre:w,Index-term-genre:p,Subject,Subject:p
= melm 655$9      Koha-Auth-Number
= melm 655        Subject,Subject:p

[…]

 = melm 658$a      curriculum:w,curriculum:p,Subject,Subject:p


You have to reindex your records after the changes. An example search query using some of these changes is

 opac-search.pl?idx=ln-class&q=English&limit=curriculum:TOEFL
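To make the structure of that query string explicit, here is a small Python sketch that splits it into its parts. Reading idx as the index to search, q as the search term and limit as an index:value restriction is my interpretation of the example above, and the function name describe_search is hypothetical.

```python
from urllib.parse import parse_qsl

def describe_search(url):
    """Split an opac-search.pl style URL into index, term and limit parts."""
    query = url.partition("?")[2]           # everything after the first "?"
    params = dict(parse_qsl(query))
    limit_index, _, limit_value = params.get("limit", "").partition(":")
    return {
        "index": params.get("idx"),
        "term": params.get("q"),
        "limit_index": limit_index or None,
        "limit_value": limit_value or None,
    }


print(describe_search("opac-search.pl?idx=ln-class&q=English&limit=curriculum:TOEFL"))
```

Here the search runs against the new ln-class index and is restricted by the curriculum index, both of which only exist after the configuration changes and the reindex.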

Related pages