Linked Data RFC

Semantic data

Status: unknown
Description: The Library of Congress has issued statements that point to a future where MARC and its dialects are replaced by RDF/Linked Data/Semantic Web technologies. This RFC collects some thoughts on how we can start to take advantage of these technologies in Koha.


Background

At the moment it is too early to even think about throwing MARC out of Koha completely, so we should find an approach that lets us keep MARC while exploring and taking advantage of new technologies. Here goes...

Goals

  • Keep MARC, for now
  • Explore and take advantage of RDF/Linked Data/Semantic Web technologies
  • Reuse stuff that others have made, so that we can focus our time and energy on the core things we need in Koha
  • Keep things flexible and lightweight - these are early days, so we should try not to paint ourselves into any corners
  • Focus on making something that is useful in the OPAC - all other functionality in Koha should remain unaffected

First steps

This is a suggested minimum of what we should do to take advantage of RDF/Linked Data/Semantic Web technologies in Koha:

  • Transform MARC to RDF.
    • There are multiple ways to do this, and to save time we should start with existing solutions (e.g. marc2rdf from Oslo Public Library, which is built in Ruby and can fetch records from Koha via OAI-PMH, transform them to RDF, and push them to a triplestore). A minimal sketch of this kind of transformation follows this list.
  • Save the resulting RDF in a triplestore
  • Create an interface in Koha for enriching the RDF in the triplestore with
    • the knowledge of librarians and
    • data from external sources.
  • Create an interface in the OPAC for exploring the data in the triplestore
    • This will probably be more about exploring links and relationships than the free text search we are used to with MARC. SemantiKoha is a half-baked prototype of this, built just for the fun of it.
    • MetaQuery is a proof of concept prototype for a highly flexible public interface for exploring semantic data
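
To make the first two steps concrete, here is a minimal sketch of the kind of MARC-to-RDF transformation a tool like marc2rdf performs, written in Perl with MARC::Batch. The dc: property choices, the /bib/ URI pattern and the input filename are illustrative assumptions, and literal escaping is omitted for brevity; a real mapping would follow the conventions of the chosen tool.

#!/usr/bin/perl
# Sketch: read binary MARC records and print a few Dublin Core triples
# as Turtle. Property choices and URI pattern are assumptions.
use strict;
use warnings;
use MARC::Batch;

my $batch = MARC::Batch->new( 'USMARC', 'records.mrc' );

print "\@prefix dc: <http://purl.org/dc/elements/1.1/> .\n\n";

while ( my $record = $batch->next ) {
    # Koha stores the biblionumber in 999$c by default (MARC21).
    my $f999 = $record->field('999') or next;
    my $biblionumber = $f999->subfield('c') or next;

    my $title  = $record->title;     # from field 245
    my $author = $record->author;    # from field 100/110/111

    print "<http://example.org/bib/$biblionumber>\n";
    print "    dc:title \"$title\"";
    print " ;\n    dc:creator \"$author\"" if $author;
    print " .\n\n";
}

The resulting Turtle could then be loaded into the triplestore with its bulk loader or a SPARQL 1.1 Update request.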

Further steps

Once the four basic steps described above are implemented, there are lots of other cool things we can do:

Describe non-physical stuff with RDF

A lot of the functionality of traditional ILSes is tied to the management of physical items: loan periods, reminders, fines, receiving copies of physical journals, etc. Currently, we are seeing more and more non-physical "stuff" that is relevant to libraries: ebooks, ejournals and "stuff" that lives on the web. We need to describe these resources so they can be discovered (by patrons or by Google), and then we need to provide links to them. Why should we bother with describing them in old-fashioned MARC, when we can describe them and make them findable with RDF?

Describe things in new ways

The traditional ILS has discovery that is based on "search results" and "detail views". Why are there no "pages" for authors that list their works, showing which works are represented in the library and which are not? With RDF we can describe more things, on more levels, than we can with MARC, and make the ILS a true platform for discovery and mediation.
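
As a sketch of what such an author "page" could be built on: a single SPARQL query that lists every work the triplestore links to one creator. The endpoint URL and the flat dc:creator modelling are assumptions for illustration.

#!/usr/bin/perl
# Sketch: ask the triplestore for all works by one author.
# Endpoint URL and vocabulary are assumptions.
use strict;
use warnings;
use LWP::UserAgent;
use URI;

my $sparql = <<'SPARQL';
PREFIX dc: <http://purl.org/dc/elements/1.1/>
SELECT ?work ?title WHERE {
    ?work dc:creator "Hamsun, Knut" ;
          dc:title   ?title .
}
SPARQL

my $uri = URI->new('http://localhost:8890/sparql');    # assumed endpoint
$uri->query_form( query => $sparql );

my $response = LWP::UserAgent->new->get( $uri, Accept => 'text/csv' );
print $response->is_success ? $response->decoded_content : $response->status_line;

Combined with holdings data, the same kind of query could drive the "in the library or not" distinction.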

Save as RDF

Canonical URL/URI for records

Semantic data needs good URIs for the subjects of its triples, even if they are only locally unique. Thanks to this directive:

RewriteRule ^/bib/([^\/]*)/?$ /cgi-bin/koha/opac-detail\.pl?bib=$1 [PT]

in the default koha-httpd.conf we already have this candidate:

http://example.org/bib/1

But this only serves up an HTML representation of the record, and it would be Very Cool if the URI used as the subject could do content negotiation based on the Accept header provided by the user agent. Changing the behaviour of the URL above might break things in subtle ways for some installations, so it's probably better to introduce a new directive that creates a new URI, e.g.:

http://example.org/record/1

and let this pass requests on to an enhanced version of opac-export.pl, as described below.
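
Mirroring the existing directive, the new one could look something like this (untested, and it assumes the enhanced opac-export.pl keeps its current "bib" parameter):

RewriteRule ^/record/([^\/]*)/?$ /cgi-bin/koha/opac-export.pl?bib=$1 [PT]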

Content negotiation for opac-export.pl

Today, opac-export.pl serves different formats based on the "format" parameter. This could be expanded to serve different formats (including RDF in various serializations) based on the Accept header provided by the user agent.

To make this backwards compatible, the "format" parameter should take precedence over the Accept header.

For user agents requesting text/html or similar, it should redirect to opac-detail.pl.
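
A minimal sketch of that precedence logic, using CGI.pm (which opac-export.pl already uses); the Accept-to-format mapping is invented for illustration, and a production version would need full Accept parsing with q-values:

#!/usr/bin/perl
# Sketch of the proposed precedence: explicit "format" parameter first,
# then the Accept header, then a default. Mapping table is invented.
use strict;
use warnings;
use CGI;

my $cgi = CGI->new;

my %accept_to_format = (
    'application/rdf+xml'     => 'rdfxml',
    'text/turtle'             => 'turtle',
    'application/marcxml+xml' => 'marcxml',
);

sub negotiate_format {
    my ($cgi) = @_;

    # 1. Backwards compatible: an explicit format parameter always wins.
    my $format = $cgi->param('format');
    return $format if $format;

    my $accept = $cgi->http('Accept') || '';

    # 2. Browsers asking for HTML get bounced to the normal detail page.
    if ( $accept =~ m{text/html} ) {
        print $cgi->redirect( '/cgi-bin/koha/opac-detail.pl?biblionumber='
                . ( $cgi->param('bib') // '' ) );
        exit;
    }

    # 3. Otherwise pick a format from the Accept header (naive match).
    for my $type ( keys %accept_to_format ) {
        return $accept_to_format{$type} if index( $accept, $type ) >= 0;
    }

    return 'marcxml';    # assumed default
}

my $format = negotiate_format($cgi);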

SPARQL over SQL

There are tools that provide a gateway between SPARQL and relational databases, e.g. Triplify. Data that could be exposed in this way might be:

  • Basic bibliographic data from the biblio, biblioitems and items tables
  • Circulation numbers ("record x has been issued y times")
  • User generated content like comments, tags, star ratings and lists
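
Taking the circulation-numbers bullet as an example, here is a rough sketch of the row-to-triple mapping such a gateway performs (the table and column names follow Koha's schema as far as I know; the timesIssued predicate and connection details are invented):

#!/usr/bin/perl
# Sketch: count issues per record in Koha's statistics table and print
# them as triples. Connection details and predicate are assumptions.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect( 'DBI:mysql:database=koha', 'kohauser', 'password',
    { RaiseError => 1 } );

my $sth = $dbh->prepare(q{
    SELECT items.biblionumber, COUNT(*) AS issues
    FROM statistics
    JOIN items USING (itemnumber)
    WHERE statistics.type = 'issue'
    GROUP BY items.biblionumber
});
$sth->execute;

while ( my $row = $sth->fetchrow_hashref ) {
    print "<http://example.org/bib/$row->{biblionumber}> ",
          "<http://example.org/vocab/timesIssued> ",
          "$row->{issues} .\n";
}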

Push RDF'ized records into a triplestore when they are saved/updated

This would work similarly to how records are indexed by Zebra today: Koha itself would do the transformation and push the result to a triplestore. This could be implemented some way down the line, if and when we get tired of running external tools to do the RDF'ization.
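
A sketch of what such a push could look like over the standard SPARQL 1.1 Update protocol, using LWP; the endpoint and graph URIs are assumptions:

#!/usr/bin/perl
# Sketch: push triples for a saved/updated record to a triplestore via
# SPARQL 1.1 Update over HTTP. Endpoint and graph URIs are assumptions.
use strict;
use warnings;
use LWP::UserAgent;

sub push_record_triples {
    my ( $biblionumber, $turtle_triples ) = @_;

    # Drop any existing triples for this record, then insert fresh ones.
    my $update = <<"SPARQL";
WITH <http://example.org/graph/koha>
DELETE { <http://example.org/record/$biblionumber> ?p ?o }
WHERE  { <http://example.org/record/$biblionumber> ?p ?o };
INSERT DATA { GRAPH <http://example.org/graph/koha> { $turtle_triples } }
SPARQL

    my $response = LWP::UserAgent->new->post(
        'http://localhost:8890/sparql-update',    # assumed endpoint
        Content_Type => 'application/sparql-update',
        Content      => $update,
    );
    die $response->status_line unless $response->is_success;
}

push_record_triples( 1,
    '<http://example.org/record/1> <http://purl.org/dc/elements/1.1/title> "Sult" .' );

In Koha this would presumably hang off the same hooks that trigger Zebra reindexing today.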

An interface for editing/enriching semantic data

What can be done

  • Encode relationships between records
  • Model authors, subjects, publishers etc
  • Pull in data from external sources (see the example below)

Examples
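
One hedged example, combining the last two points: pull an author abstract from DBpedia's public SPARQL endpoint so a librarian can attach it to the local author resource. The local author URI is invented for illustration.

#!/usr/bin/perl
# Sketch: fetch an English abstract for an author from DBpedia.
# The local author URI mentioned below is invented.
use strict;
use warnings;
use LWP::UserAgent;
use URI;

my $sparql = <<'SPARQL';
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?abstract WHERE {
    <http://dbpedia.org/resource/Knut_Hamsun> dbo:abstract ?abstract .
    FILTER ( lang(?abstract) = "en" )
}
SPARQL

my $uri = URI->new('http://dbpedia.org/sparql');
$uri->query_form( query => $sparql );

my $response = LWP::UserAgent->new->get( $uri, Accept => 'text/csv' );
print $response->decoded_content if $response->is_success;

# The result could then be stored locally, together with a link back to
# the source, e.g.:
#   <http://example.org/author/42> dbo:abstract "..."@en ;
#                                  owl:sameAs <http://dbpedia.org/resource/Knut_Hamsun> .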

Triplestore

Some of the steps involve a triplestore for saving and querying semantic data. Wikipedia has a long List of triplestore implementations - the key is probably to stick with some version of SPARQL, so we can give users a choice of any triplestore that supports that version.

Goal: Koha should be able to support any triplestore, as long as it supports the required version of SPARQL (which would be 1.1 at the moment).
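
Since the SPARQL 1.1 Protocol standardizes how queries and updates travel over HTTP, staying triplestore-agnostic largely comes down to keeping the endpoint URLs in configuration. A sketch of that indirection (the configuration keys are invented):

#!/usr/bin/perl
# Sketch: talk only the SPARQL 1.1 Protocol and take endpoint URLs from
# configuration, so any conforming triplestore can be plugged in.
use strict;
use warnings;
use LWP::UserAgent;
use URI;

# In Koha these would live in koha-conf.xml or a system preference;
# the keys here are invented for illustration.
my %config = (
    sparql_query_endpoint  => 'http://localhost:8890/sparql',
    sparql_update_endpoint => 'http://localhost:8890/sparql-update',
);

sub sparql_query {
    my ($query) = @_;
    my $uri = URI->new( $config{sparql_query_endpoint} );
    $uri->query_form( query => $query );
    return LWP::UserAgent->new->get( $uri,
        Accept => 'application/sparql-results+json' );
}

sub sparql_update {
    my ($update) = @_;
    return LWP::UserAgent->new->post(
        $config{sparql_update_endpoint},
        Content_Type => 'application/sparql-update',
        Content      => $update,
    );
}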
