Understanding Zebra indexing
Warning for those who follow: this page is largely outdated, as all examples refer to indexing using GRS-1. The file record.abs is obsolete, and all examples below need to be re-written to use biblio-koha-indexdefs.xml.
Adding additional fields to Zebra search
To add a previously unsupported field (856$u in our example) to Zebra, you will have to modify the following files:
zebradb/biblios/etc/bib1.att
List of search indexes and their corresponding Z39.50 use attributes
Documentation for bib1 is available at http://www.loc.gov/z3950/agency/bib1.html and according to it, our 856$u field should map to 1032
dpavlin@koha-dev:/etc/koha/zebradb$ grep 1032 biblios/etc/bib1.att
att 1032 Doc-id
zebradb/marc_defs/marc21/biblios/record.abs
mapping of MARC fields to indexes
dpavlin@koha-dev:/etc/koha/zebradb$ grep Doc-id marc_defs/marc21/biblios/record.abs
melm 856$u Doc-id
zebradb/ccl.properties
for searching purposes
dpavlin@koha-dev:/etc/koha/zebradb$ grep Doc-id ccl.properties
Doc-id 1=1032
(The path to your Search.pm may be different.)
In Search.pm, add the new index name to my @indexes, under the # biblio indexes comment:
'Doc-id'
After that, you will have to reindex Zebra so that the new index gets populated.
An alternative example of modifying Koha to support additional fields is available at http://bugs.koha-community.org/bugzilla3/show_bug.cgi?id=4506
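The three configuration files above have to agree with each other, and a mismatch is easy to miss. The following is a minimal sketch, not part of Koha, that checks one index name against the contents of bib1.att, record.abs and ccl.properties. The function name check_index and the returned keys are hypothetical, and the parsing assumes the simple line formats shown in the grep output above.

```python
def check_index(name, bib1_att, record_abs, ccl_properties):
    """Report where an index name is defined in the three file contents.

    Assumes the simple formats shown in the grep output above:
    "att <number> <name>", "melm <field> <index>[,<index>...]" and
    "<name> 1=<number>". Real files may contain other directives.
    """
    att_number = None
    for line in bib1_att.splitlines():
        parts = line.split()
        if len(parts) >= 3 and parts[0] == "att" and parts[2] == name:
            att_number = parts[1]

    melm_fields = []
    for line in record_abs.splitlines():
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[0] == "melm":
            # Drop register suffixes such as ":w" or ":p" before comparing.
            indexes = [i.strip().split(":")[0] for i in parts[2].split(",")]
            if name in indexes:
                melm_fields.append(parts[1])

    ccl_number = None
    for line in ccl_properties.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == name:
            for part in parts[1:]:
                if part.startswith("1="):
                    ccl_number = part[2:]

    return {
        "att": att_number,
        "melm": melm_fields,
        "ccl": ccl_number,
        # Consistent means: defined everywhere, with the same use attribute.
        "consistent": att_number is not None
        and att_number == ccl_number
        and bool(melm_fields),
    }


result = check_index(
    "Doc-id",
    "att 1032 Doc-id\n",
    "melm 856$u Doc-id\n",
    "Doc-id 1=1032\n",
)
print(result)
```

In practice you would pass the text read from your own copies of the three files instead of the inline strings.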
Examine Zebra index
This example was provided by Adam Dickmeiss <adam@indexdata.dk> on the koha-devel mailing list.
You should be able to view indexing for any record by doing a 'present' of the record with element set zebra::index and syntax being XML or SUTRS, see https://www.indexdata.com/zebra/doc/special-retrieval.html
Most Z39.50 clients should be able to do that.
yaz-client example:
yaz-client z3950.indexdata.com/marc
Connecting...OK.
Sent initrequest.
Connection accepted by v3 target.
ID : 81
Name : Zebra Information Server/GFS/YAZ
Version: 4.1.3 3ec3ad2bd1e248022bbb7c9d762844bef7f8a461
Options: search present delSet triggerResourceCtrl scan sort extendedServices namedResultSets
Elapsed: 0.055151
Z> f computer
Sent searchRequest.
Received SearchResponse.
Search was a success.
Number of hits: 10, setno 1
SearchResult-1: term=computer cnt=10
records returned: 0
Elapsed: 0.029065
Z> s
Sent presentRequest (1+1).
Records: 1
[marc]Record type: USmarc
00366nam 22001698a 4504
001 11224466
003 DLC
005 00000000000000.0
008 910710c19910701nju 00010 eng
010 $a 11224466
040 $a DLC $c DLC
050 00 $a 123-xyz
100 10 $a Jack Collins
245 10 $a How to program a computer
260 1 $a Penguin
263 $a 8710
300 $a p. cm.
nextResultSetPosition = 2
Elapsed: 0.026880
Z> form xml
Z> elem zebra::index
Z> s 1+1
Sent presentRequest (1+1).
Records: 1
Record type: XML
<record xmlns="http://www.indexdata.com/zebra/" sysno="2" set="zebra::index">
<index name="Author-name-personal" type="w" seq="1">@^</index>
<index name="Author-name-personal" type="w" seq="1"></index>
<index name="Author-name-personal" type="w" seq="2">jack</index>
<index name="Author-name-personal" type="w" seq="3">collins</index>
<index name="Author-name-personal" type="p" seq="4">jack collins</index>
<index name="Author" type="w" seq="4">@^</index>
<index name="Author" type="w" seq="1"></index>
<index name="Author" type="w" seq="5">jack</index>
<index name="Author" type="w" seq="6">collins</index>
<index name="Author" type="p" seq="7">jack collins</index>
<index name="any" type="w" seq="7">@^</index>
<index name="any" type="w" seq="1"></index>
<index name="any" type="w" seq="8">jack</index>
<index name="any" type="w" seq="9">collins</index>
<index name="title" type="w" seq="10">@^</index>
<index name="title" type="w" seq="1"></index>
<index name="title" type="w" seq="11">how</index>
<index name="title" type="w" seq="12">to</index>
<index name="title" type="w" seq="13">program</index>
<index name="title" type="w" seq="14">a</index>
<index name="title" type="w" seq="15">computer</index>
<index name="title" type="p" seq="16">how to program a computer</index>
<index name="any" type="w" seq="16">@^</index>
<index name="any" type="w" seq="17">how</index>
<index name="any" type="w" seq="18">to</index>
<index name="any" type="w" seq="19">program</index>
<index name="any" type="w" seq="20">a</index>
<index name="any" type="w" seq="21">computer</index>
<index name="Place-publication" type="w" seq="22">@^</index>
<index name="Place-publication" type="w" seq="1"></index>
<index name="Place-publication" type="w" seq="23">penguin</index>
<index name="any" type="w" seq="24">@^</index>
<index name="any" type="w" seq="25">penguin</index>
<index name="_ALLRECORDS" type="w" seq="2"></index>
</record>
nextResultSetPosition = 2
Elapsed: 0.027786
In the XML output, "seq" is the sequence number, i.e. the position of the token.
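The zebra::index output above is easier to read when the tokens are grouped per index. Here is a minimal sketch that does this with Python's standard library. The helper name tokens_by_index is hypothetical, SAMPLE is a shortened copy of the record above, and skipping the empty and "@^" entries is a presentation choice of this sketch, not a statement about what those entries mean to Zebra.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Namespace taken from the <record> element in the output above.
ZEBRA_NS = "{http://www.indexdata.com/zebra/}"

# Shortened copy of the record shown above (some title entries only).
SAMPLE = """<record xmlns="http://www.indexdata.com/zebra/" sysno="2" set="zebra::index">
<index name="title" type="w" seq="10">@^</index>
<index name="title" type="w" seq="1"></index>
<index name="title" type="w" seq="11">how</index>
<index name="title" type="w" seq="12">to</index>
<index name="title" type="p" seq="16">how to program a computer</index>
</record>"""


def tokens_by_index(xml_text):
    """Group tokens by (index name, type), ordered by their seq attribute."""
    grouped = defaultdict(list)
    for elem in ET.fromstring(xml_text).iter(ZEBRA_NS + "index"):
        token = elem.text or ""
        # Skip the empty entries and the "@^" entries seen in the output above.
        if token in ("", "@^"):
            continue
        key = (elem.get("name"), elem.get("type"))
        grouped[key].append((int(elem.get("seq")), token))
    return {key: [tok for _, tok in sorted(pairs)] for key, pairs in grouped.items()}


for (name, kind), tokens in tokens_by_index(SAMPLE).items():
    print(name, kind, tokens)
```

Saving the XML part of a yaz-client session to a file and passing its contents to the function gives a quick overview of which indexes a record actually feeds.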
Recover MARC records from Zebra index
Sometimes you drop the database but still want to keep the biblio records that remain in Zebra.
The -m option dumps the retrieved records into a MARC file, which you can later import back into Koha:
dpavlin@koha-dev:~$ yaz-client -m ~/backup.marc unix:/var/run/koha/gimpoz/bibliosocket
Connecting...OK.
Sent initrequest.
Connection accepted by v3 target.
ID : 81
Name : Zebra Information Server/GFS/YAZ
Version: 4.2.29 94b8ca343cc7084394526bd26c80c208aa16c9f9
Options: search present delSet triggerResourceCtrl scan sort extendedServices namedResultSets
Elapsed: 0.000950
Z> base biblios
To select all records, I'm assuming that every record in the database contains the string gimpoz:
Z> find gimpoz
Sent searchRequest.
Received SearchResponse.
Search was a success.
Number of hits: 6022, setno 1
SearchResult-1: term=gimpoz cnt=6022
records returned: 0
Elapsed: 0.006662
The number of hits seems OK, so we start retrieving records. You will not be able to retrieve all records using a single show command, so we do it in batches of 1000 records:
Z> show 1+1000
# record output...
nextResultSetPosition = 1001
Elapsed: 0.112413
Z> show 1001+1000
...
nextResultSetPosition = 2001
Elapsed: 0.107331
Z> show 2001+1000
...
nextResultSetPosition = 3001
Elapsed: 0.105962
Z> show 3001+1000
...
nextResultSetPosition = 4001
Elapsed: 0.106484
Z> show 4001+1000
...
nextResultSetPosition = 5001
Elapsed: 0.105443
Z> show 5001+1000
...
nextResultSetPosition = 6001
Elapsed: 0.105816
Z> show 6001+22
...
nextResultSetPosition = 6023
Elapsed: 0.002511
Z> exit
See you later, alligator.
$ ls -al ~/backup.marc
-rw-r--r-- 1 dpavlin dpavlin 1395577 2012-04-24 23:50 /home/dpavlin/backup.marc
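The batch arithmetic in the session above (start position plus count, with a shorter last batch) can be generated rather than typed by hand. This is a small sketch. The function name show_commands is hypothetical, and the batch size of 1000 is simply the value used above.

```python
def show_commands(hits, batch_size=1000):
    """Return yaz-client "show start+count" commands covering all hits."""
    commands = []
    start = 1  # result set positions are 1-based, as in the session above
    while start <= hits:
        # The last batch only covers whatever is left.
        count = min(batch_size, hits - start + 1)
        commands.append(f"show {start}+{count}")
        start += count
    return commands


for command in show_commands(6022):
    print(command)
```

For 6022 hits this yields the same seven commands as the session above, ending with show 6001+22.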
Remember to use koha-dump to create a backup before you start destructive operations!
"Abandon hope, all ye who enter"
(jcamins originally pasted this in the pastebin, copying it here for more people to see it.)
To give a concrete example of what I'm talking about, consider the following data:
=610 20$aGoddard College$vFaculty publications.
And this configuration:
melm 610$a Name-and-title
melm 610$t Name-and-title,Title
melm 610$9 Koha-Auth-Number
melm 610 Name,Subject,Subject:p,Corporate-name
From that, you get these indexes:
Name-and-title = {"Goddard College"}
Name = {"Faculty publications"}
Subject = {"Faculty publications"}
Subject:p = {"Faculty publications"}
Corporate-name = {"Faculty publications"}
What we wanted, obviously, was for the Subject, Name, and Corporate-name indexes to include the entire 610 field. But that doesn't happen, because the moment the "melm 610$a" line matches subfield $a, the parser stops, and the "melm 610" line is never reached for that subfield. In order to get the behavior we want, we need the following configuration (actually, I'm not including subfield $t or subfield $9 in those indexes):
melm 610$a Name-and-title,Subject,Subject:p,Name,Corporate-name
melm 610$t Name-and-title,Title
melm 610$9 Koha-Auth-Number
melm 610 Name,Subject,Subject:p,Corporate-name
From this correct configuration, we get these indexes:
Name-and-title = {"Goddard College"}
Name = {Goddard, College, Faculty, publications}
Subject = {Goddard, College, Faculty, publications}
Subject:p = {"Goddard College", "Faculty publications"}
Corporate-name = {Goddard, College, Faculty, publications}
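The rule described above (a subfield-specific melm line wins, and the generic melm line for the tag is then never applied to that subfield) can be illustrated with a toy model. This is only a sketch of the behavior as described on this page, not Zebra's actual GRS-1 parser. The names parse_melm and index_field are hypothetical, and the model ignores word/phrase registers and punctuation normalisation.

```python
from collections import defaultdict

BROKEN = """
melm 610$a Name-and-title
melm 610$t Name-and-title,Title
melm 610$9 Koha-Auth-Number
melm 610 Name,Subject,Subject:p,Corporate-name
"""

FIXED = """
melm 610$a Name-and-title,Subject,Subject:p,Name,Corporate-name
melm 610$t Name-and-title,Title
melm 610$9 Koha-Auth-Number
melm 610 Name,Subject,Subject:p,Corporate-name
"""


def parse_melm(config):
    """Map each melm element ("610$a" or "610") to its list of index names."""
    rules = {}
    for line in config.strip().splitlines():
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[0] == "melm":
            rules[parts[1]] = [name.strip() for name in parts[2].split(",")]
    return rules


def index_field(tag, subfields, rules):
    """Toy model: a subfield rule, if present, shadows the tag-level rule."""
    result = defaultdict(list)
    for code, value in subfields:
        targets = rules.get(f"{tag}${code}", rules.get(tag, []))
        for index in targets:
            result[index].append(value)
    return dict(result)


# 610 field from the example above (trailing period left out for simplicity).
FIELD = [("a", "Goddard College"), ("v", "Faculty publications")]

print(index_field("610", FIELD, parse_melm(BROKEN)))
print(index_field("610", FIELD, parse_melm(FIXED)))
```

With the first configuration only Name-and-title receives "Goddard College". With the second, Name, Subject and Corporate-name receive both subfields, which matches the index listings above.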
Hopefully someone will be able to do something useful with this information.
Following discussion
- <hdl> oh that... Yes. Indeeed. We had that kind of problem.
- <cait> looks like something that should go on the wiki
- <hdl> we made a patch on record.abs in order to index items.
- <hdl> And yes record.abs Is a nightmare... But changing DOM indexing is also quite promising in terms of fun... There is no "any" index by default. you have to add that.
- <jcamins> hdl: the "any" index doesn't work the way everyone thinks it does with GRS-1, either.
- <jcamins> And if we use xsltproc to compile our DOM configuration, it's easy enough to add a "match='*'" template.
- <hdl> Well at least, it searches the indexed data
- <jcamins> hdl: yes, but I am only certain that two people understand that.
- <hdl> jcamins: I much more believe in Solr than doing overcomplex XSLT just to index stuff.
- <jcamins> (probably more do, but the only two people who I *know* understand what "all any" does are me, and you, since you just told me you did;)
- <hdl> count paul and gmcharlt and chris and chris_n too in I guess fredericd is also a good guess for that... That makes quite a lot ;)
- <hdl> (a kind of community of geeks working on the same stuff...)
- <hdl> any is gathering all the indexed elements. Not all the elements of the record.
- <jcamins> Right.
- <hdl> nor all the elements that have not been indexed.
- <hdl> can be compared to all_fields in solr
Another real life example
Here are the changes that I did not explain in the Kohacon2012 presentation "Koha in language centres"; maybe they will help someone. This example is for MARC21. The idea was to index four fields, map them to strings for the search query and make them searchable in the OPAC. The fields are:
MARC     strings
653$a    ln-class
521$a    lex
658$a    curriculum
655$a    index-term-genre
We had to edit four files:
/usr/share/koha/lib/C4/Search.pm
/etc/koha/zebradb/biblios/etc/bib1.att
/etc/koha/zebradb/ccl.properties
/etc/koha/zebradb/marc_defs/marc21/biblios/record.abs
+ means we added a line
- means we deleted a line
= means we kept the line; it's in the example so you have an idea which section of the file is edited
1. /usr/share/koha/lib/C4/Search.pm
- We added the strings that we wanted to map to the MARC fields to @indexes here
= sub getIndexes{
= my @indexes = (
= # biblio indexes
[…]
+ 'Index-term-genre',
+ 'lex',
+ 'ln-class',
+ 'curriculum',
2. /etc/koha/zebradb/biblios/etc/bib1.att
- We added two lines here, the other two were in there already.
= ## Fixed Fields and other special indexes
[…]
= att 9903 lex
+ att 9653 ln-class
+ att 9655 Index-term-genre
= att 9658 curriculum
3. /etc/koha/zebradb/ccl.properties
- again, two lines were already in here, two were added
= lex 1=9903 r=r
= curriculum 1=9658
+ ln-class 1=9653
+ Index-term-genre 1=9655
4. /etc/koha/zebradb/marc_defs/marc21/biblios/record.abs
- and here we changed one and added two more lines
- melm 521$a lex:n
+ melm 521$a lex:w, lex:p
[…]
+ melm 653$a ln-class:w,ln-class:p,Subject,Subject:p
= melm 653$9 Koha-Auth-Number
= melm 653 Subject,Subject:p
[…]
+ melm 655$a Index-term-genre:w,Index-term-genre:p,Subject,Subject:p
= melm 655$9 Koha-Auth-Number
= melm 655 Subject,Subject:p
[…]
= melm 658$a curriculum:w,curriculum:p,Subject,Subject:p
You have to reindex your records after the changes. An example search query using some of these changes is:
opac-search.pl?idx=ln-class&q=English&limit=curriculum:TOEFL
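The query string above combines an index (idx), a search term (q) and a limit on a second index. As a minimal sketch, it can be assembled with Python's standard library. The helper name opac_search_url is hypothetical, and only the three parameters that appear in the example are used.

```python
from urllib.parse import urlencode


def opac_search_url(index, query, limits=()):
    """Build an opac-search.pl query string like the example above."""
    params = [("idx", index), ("q", query)]
    # Each limit is written as "<index>:<value>", as in the example above.
    params += [("limit", f"{name}:{value}") for name, value in limits]
    # Keep ":" unescaped so the result reads like the hand-written example.
    return "opac-search.pl?" + urlencode(params, safe=":")


print(opac_search_url("ln-class", "English", [("curriculum", "TOEFL")]))
```

Search terms containing spaces or non-ASCII characters are escaped by urlencode, which is the main reason to build the string this way instead of by concatenation.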
Related pages
- Troubleshooting Koha as a Z39.50 server - test Zebra server with yaz-client and more
- Developing Zebra - Develop Zebra using Docker and your C skills