Easy integration of external data RFC

From Koha Wiki

Background

The Web is starting to create some new identifiers that are highly relevant for libraries, but they are hard to use in our online presence because we are tied to the fossilized MARC format. In which MARC field do you put a MusicBrainz ID or DBpedia URI? And how do you make use of them? This RFC describes a novel way to tie identifiers to bibliographic records, and utilize data from the web in the OPAC.

Two main approaches

There are two things we might like to do:

  1. Link to external websites - if we have the MusicBrainz ID for an album, we can link to the page for that album in MusicBrainz
  2. Mash data from external sources into the OPAC ("mashin"), so we can enrich the user experience with data from the web

Basic mechanics of a mashin

The second point above can be implemented as follows:

  • Construct a URL, based on the identifier and a URL template
  • Fetch data from the URL
  • Parse the data into a data structure
    • JSON can be parsed directly
    • XML can be parsed with a generic parser, or one tailored to the data in question, e.g. RSS
  • Pass the data structure and a template for rendering it to the template engine
  • Render the template with the data from the external source as part of the OPAC detail template, using the "eval" filter
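
The first steps above could be sketched roughly as follows. Koha itself is written in Perl; Python is used here only for illustration. The function names build_url and parse_payload are hypothetical, URL-encoding the identifier is an assumption, and the fetch step is left out so that the sketch runs without network access.

```python
import json
import xml.etree.ElementTree as ET
from urllib.parse import quote


def build_url(url_template, identifier):
    """Replace the {{ID}} placeholder with the URL-encoded identifier."""
    return url_template.replace("{{ID}}", quote(identifier, safe=""))


def parse_payload(body, datatype):
    """Turn a fetched response body into a Python data structure."""
    if datatype == "json":
        return json.loads(body)
    if datatype in ("xml", "rss"):
        # Generic XML parser; a parser tailored to RSS could be swapped in here.
        return ET.fromstring(body)
    raise ValueError("unsupported datatype: %s" % datatype)


url = build_url(
    "http://musicbrainz.org/ws/2/release/{{ID}}?inc=artists+labels+recordings",
    "f4d4ba9a-bb17-49fd-91de-360c3b8f9a78",
)
data = parse_payload('{"title": "example"}', "json")
```

The parsed result would then be handed to the template engine together with the template from identifiertype.template.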

Tables and columns

Table: identifiers

Columns:

identifierid
Primary key, integer, auto increment
biblionumber
Foreign key referencing biblio.biblionumber
identifier
This can be any kind of ID, including a URI
identifiertype
Foreign key for the identifiertype table

Table: identifiertype

Columns:

identifiertypeid
Primary key, integer, auto increment
targeturl
The URL to which an HTTP request will be made, in order to fetch data. This can be treated as a template where e.g. {{ID}} will be replaced with the value from identifiers.identifier
sparql
A SPARQL template
datatype
Possible values: json, xml, rss (or we might skip this and autodetect the format of the data)
template
A TT template that can display the data
locationonpage
An identifier used for selecting where on the detail view to display this data, e.g. belowitems, aboveitems
cachetime
How long data fetched from remote sites should be cached
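
As a rough illustration, the two tables could be created and joined as shown below. SQLite is used only so that the sketch runs standalone; the column types, the NOT NULL constraints and the unit of cachetime (seconds) are assumptions that are not specified in this RFC.

```python
import sqlite3

SCHEMA = """
CREATE TABLE identifiertype (
    identifiertypeid INTEGER PRIMARY KEY AUTOINCREMENT,
    targeturl        TEXT,     -- URL template, may contain {{ID}}
    sparql           TEXT,     -- SPARQL template
    datatype         TEXT,     -- json, xml or rss
    template         TEXT,     -- TT template for display
    locationonpage   TEXT,     -- e.g. belowitems, aboveitems
    cachetime        INTEGER   -- assumed to be seconds
);
CREATE TABLE identifiers (
    identifierid   INTEGER PRIMARY KEY AUTOINCREMENT,
    biblionumber   INTEGER NOT NULL,  -- points at the bibliographic record
    identifier     TEXT NOT NULL,     -- any kind of ID, including a URI
    identifiertype INTEGER NOT NULL REFERENCES identifiertype(identifiertypeid)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# One identifier type (a simple link) and one identifier for record 1.
type_id = conn.execute(
    "INSERT INTO identifiertype (template, locationonpage) VALUES (?, ?)",
    ('<a href="http://musicbrainz.org/release/{{ID}}">MusicBrainz</a>', "belowitems"),
).lastrowid
conn.execute(
    "INSERT INTO identifiers (biblionumber, identifier, identifiertype) VALUES (?, ?, ?)",
    (1, "f4d4ba9a-bb17-49fd-91de-360c3b8f9a78", type_id),
)

# Everything the detail page needs for one record comes from a single join.
rows = conn.execute(
    "SELECT i.identifier, t.locationonpage FROM identifiers i "
    "JOIN identifiertype t ON t.identifiertypeid = i.identifiertype "
    "WHERE i.biblionumber = ?",
    (1,),
).fetchall()
```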

Use cases

Simple links

When identifiertype.targeturl is empty, no data is fetched. The value of identifiers.identifier is simply passed to the template, which renders it as one or more links.

identifiers.identifier:

f4d4ba9a-bb17-49fd-91de-360c3b8f9a78

identifiertype.template:

<ul>
<li><a href="http://musicbrainz.org/release/{{ID}}">MusicBrainz</a></li>
</ul>

This would replace {{ID}} with the value from identifiers.identifier and render a link to MusicBrainz, using the ID to create an exact link.

REST API

identifiers.identifier:

f4d4ba9a-bb17-49fd-91de-360c3b8f9a78

identifiertype.targeturl:

http://musicbrainz.org/ws/2/release/{{ID}}?inc=artists+labels+recordings

Make an HTTP GET request to the resulting URL, and retrieve XML data:

http://musicbrainz.org/ws/2/release/f4d4ba9a-bb17-49fd-91de-360c3b8f9a78?inc=artists+labels+recordings

Parse the XML into a data structure and pass that to the template engine, along with the template from identifiertype.template, which might look something like this:

Tracks on this album:

<ol>
[% FOREACH track IN mashin.belowitems.data.metadata.release.item('medium-list').medium.item('track-list') %]
<li>[% track.recording.title %]</li>
[% END %]
</ol>
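
To make the parsing step more concrete, here is a minimal sketch that pulls track titles out of a release document. The sample XML is hand-written and heavily abbreviated in the general shape of a MusicBrainz release response; it is not real data from the service, and the namespace URI and the function name track_titles are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written, abbreviated sample; not an actual MusicBrainz response.
SAMPLE = """<metadata xmlns="http://musicbrainz.org/ns/mmd-2.0#">
  <release id="00000000-0000-0000-0000-000000000000">
    <medium-list count="1">
      <medium>
        <track-list count="2">
          <track><recording><title>First track</title></recording></track>
          <track><recording><title>Second track</title></recording></track>
        </track-list>
      </medium>
    </medium-list>
  </release>
</metadata>"""

# Prefix mapping for the (assumed) default namespace of the response.
NS = {"mb": "http://musicbrainz.org/ns/mmd-2.0#"}


def track_titles(xml_text):
    """Return the recording titles of all tracks, in document order."""
    root = ET.fromstring(xml_text)
    path = (
        "mb:release/mb:medium-list/mb:medium/"
        "mb:track-list/mb:track/mb:recording/mb:title"
    )
    return [element.text for element in root.findall(path, NS)]


titles = track_titles(SAMPLE)
```

A flat list like this is easier to loop over in a template than the nested structure a generic XML parser produces, which may be an argument for parsers tailored to specific services.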

SPARQL

identifiers.identifier:

http://dbpedia.org/resource/Master_of_Puppets

identifiertype.sparql:

select ?last ?next where { 
 <{{ID}}> <http://dbpedia.org/property/lastAlbum> ?last .
 <{{ID}}> <http://dbpedia.org/property/nextAlbum> ?next .
} 

These two are combined into this SPARQL query:

select ?last ?next where { 
 <http://dbpedia.org/resource/Master_of_Puppets> <http://dbpedia.org/property/lastAlbum> ?last .
 <http://dbpedia.org/resource/Master_of_Puppets> <http://dbpedia.org/property/nextAlbum> ?next .
} 

This query is then combined with the URL template in identifiertype.targeturl:

http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query={{SPARQL}}&format=application/sparql-results+json&timeout=0&debug=on

This is the complete request that is sent, using e.g. RDF::Query::Client:

http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org&query=select+%3Flast+%3Fnext+where+{+%0D%0A++%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FMaster_of_Puppets%3E+%3Chttp%3A%2F%2Fdbpedia.org%2Fproperty%2FlastAlbum%3E+%3Flast+.%0D%0A++%3Chttp%3A%2F%2Fdbpedia.org%2Fresource%2FMaster_of_Puppets%3E+%3Chttp%3A%2F%2Fdbpedia.org%2Fproperty%2FnextAlbum%3E+%3Fnext+.%0D%0A}+&format=application/sparql-results+json&timeout=0&debug=on
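
The two substitutions can be sketched as follows. The function name build_request_url is hypothetical, and the exact percent-encoding produced here will differ slightly from the URL shown above (for example in how line breaks and braces are encoded). The identifier is inserted into the query without validation, which a real implementation would have to address.

```python
from urllib.parse import parse_qs, quote_plus, urlsplit

SPARQL_TEMPLATE = """select ?last ?next where {
 <{{ID}}> <http://dbpedia.org/property/lastAlbum> ?last .
 <{{ID}}> <http://dbpedia.org/property/nextAlbum> ?next .
}"""

TARGETURL = (
    "http://dbpedia.org/sparql?default-graph-uri=http%3A%2F%2Fdbpedia.org"
    "&query={{SPARQL}}&format=application/sparql-results+json&timeout=0&debug=on"
)


def build_request_url(targeturl, sparql_template, identifier):
    """Fill {{ID}} into the query, then the encoded query into {{SPARQL}}."""
    query = sparql_template.replace("{{ID}}", identifier)
    return targeturl.replace("{{SPARQL}}", quote_plus(query))


request_url = build_request_url(
    TARGETURL, SPARQL_TEMPLATE, "http://dbpedia.org/resource/Master_of_Puppets"
)

# Decoding the query parameter again should give back the filled-in query.
sent_query = parse_qs(urlsplit(request_url).query)["query"][0]
```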

This returns JSON data like this:

{ "head": { "link": [], "vars": ["last", "next"] },
 "results": { "distinct": false, "ordered": true, "bindings": [
   { "last": { "type": "literal", "xml:lang": "en", "value": "Ride the Lightning" }, "next": { "type": "literal", "xml:lang": "en", "value": "...And Justice for All" } } ] } }
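
In this format the values sit under results.bindings, with one entry per result row. A small sketch of flattening them into plain variable/value pairs (the function name binding_values is hypothetical):

```python
import json

RESPONSE = """{ "head": { "link": [], "vars": ["last", "next"] },
 "results": { "distinct": false, "ordered": true, "bindings": [
   { "last": { "type": "literal", "xml:lang": "en", "value": "Ride the Lightning" },
     "next": { "type": "literal", "xml:lang": "en", "value": "...And Justice for All" } } ] } }"""


def binding_values(response_text):
    """Return one {variable: value} dict per result row."""
    document = json.loads(response_text)
    return [
        {variable: cell["value"] for variable, cell in row.items()}
        for row in document["results"]["bindings"]
    ]


rows = binding_values(RESPONSE)
```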

The JSON data is then parsed into a data structure and passed along for rendering, together with the template from identifiertype.template, which might look something like this:

<ul>
<li>Search for previous album: <a href="?q=[% mashin.aboveitems.data.results.bindings.0.last.value | uri %]">[% mashin.aboveitems.data.results.bindings.0.last.value | html %]</a></li>
<li>Search for next album: <a href="?q=[% mashin.aboveitems.data.results.bindings.0.next.value | uri %]">[% mashin.aboveitems.data.results.bindings.0.next.value | html %]</a></li>
</ul>

Passing stuff to TT

Conceptually, the data structure passed to TT should look something like this:

mashin.[locationonpage].data.[data from json or xml]
                       .template.[template from identifiertype.template]

Or if we want to have multiple pieces of data in the same location, [locationonpage] could be an array of hashes:

mashin.[locationonpage][0].data.[data from json or xml]
                       [0].template.[template from identifiertype.template]
                       [1].data.[data from json or xml]
                       [1].template.[template from identifiertype.template]
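
The array-of-hashes variant could be assembled along these lines. This is only a sketch: the function name build_mashin and the sample data and template strings are made up for the example.

```python
from collections import defaultdict


def build_mashin(entries):
    """Group (locationonpage, data, template) triples by page location."""
    mashin = defaultdict(list)
    for location, data, template in entries:
        mashin[location].append({"data": data, "template": template})
    return dict(mashin)


mashin = build_mashin([
    ("belowitems", {"tracks": ["First track"]}, "track list template"),
    ("belowitems", {"links": ["MusicBrainz"]}, "link list template"),
    ("aboveitems", {"last": "Ride the Lightning"}, "related albums template"),
])
```

Using a list for every location, even when it holds only one entry, means the rendering code never has to distinguish between the single and the multiple case.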

Locating things on the page

In the template for the OPAC detail page we could put things like this in the locations we wanted to "open up" to data from external sources:

[% PROCESS mashin location='belowitems' %]

And then define one BLOCK that processes the data and template for that location, if there is any:

[% BLOCK mashin %]
 [% IF mashin.$location %]
 [% mashin.$location.template | eval %]
 [% END %]
[% END %]

(The TT code here has not actually been tested, but something similar to it should hopefully work.)

Serverside or AJAX?

The description above assumes that all processing happens on the server, so that a complete page is sent to the client. This can cause delays, because the server has to wait for one or more external services that may be slow or may not respond at all. An alternative approach is to send the basic page to the client first, then use AJAX to fetch HTML fragments from the server, produced by methods similar to the ones described above, and insert them into the page as they arrive.
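
With either approach, caching responses for identifiertype.cachetime limits how often a request has to wait for a remote service. The following is a minimal in-memory sketch, assuming cachetime is given in seconds. The class name MashinCache is hypothetical, the fetch function is injected so that the example runs without network access, and serving stale data when a fetch fails is a design choice of this sketch rather than something this RFC specifies.

```python
import time


class MashinCache:
    """Tiny in-memory cache keyed by URL."""

    def __init__(self, fetch, clock=time.monotonic):
        self._fetch = fetch    # callable(url) -> data; may raise on failure
        self._clock = clock    # injectable so that expiry can be tested
        self._store = {}       # url -> (expires_at, data)

    def get(self, url, cachetime):
        now = self._clock()
        hit = self._store.get(url)
        if hit and hit[0] > now:
            return hit[1]      # still fresh, no remote request needed
        try:
            data = self._fetch(url)
        except Exception:
            # Slow or unreachable service: use stale data if any, else nothing.
            return hit[1] if hit else None
        self._store[url] = (now + cachetime, data)
        return data


calls = []


def fake_fetch(url):
    """Stand-in for a real HTTP request."""
    calls.append(url)
    return {"url": url}


now = [0.0]
cache = MashinCache(fake_fetch, clock=lambda: now[0])
first = cache.get("http://example.org/a", cachetime=60)
second = cache.get("http://example.org/a", cachetime=60)  # served from the cache
now[0] = 61.0
third = cache.get("http://example.org/a", cachetime=60)   # expired, fetched again
```

When the fetch fails and nothing is cached, get returns None, so the page location can simply be left empty.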

Other stuff

  • Display records that share the same ID as "related"