I was honoured to lead a workshop and speak at this year's edition of
Semantic Web in Bibliotheken (SWIB) in Bonn, Germany. It was an amazing
experience: so many rich projects were described, with obvious dividends for
the users of libraries. Once again, the European library community fills me
with hope for the future success of the semantic web.
The subject of my talk, "Cataloguing for the open web with RDFa and
schema.org" (slides and video recording - gulp),
pivoted while I was preparing materials for the workshop. I was searching
library catalogues around Bonn looking for a catalogue with persistent URIs
that I could use for an example. To my surprise, catalogue after catalogue used
session-based URLs; it took me quite some time before I was able to find ULB,
which hosts a VuFind front end for its catalogue. Even then, the
robots.txt restricted crawling by any user agent. This reminded me,
rather depressingly, of my findings about current "discovery layers", which
entirely restrict crawling and therefore put libraries into a black hole on the web.
These findings in the wild are so antithetical to the basic principles of
enabling discovery of web resources that, in a conference about the semantic
web, I opted to spend over half of my talk making the argument that libraries
need to pay attention to the old-fashioned web of documents first and foremost.
The basic building blocks that I advocated were, in priority order:
- Persistent URIs, on which everything else is built
- Sitemaps, to facilitate discovery of your resources
- A robots.txt file to exclude portions of your website that should not be
crawled (for example, search results pages)
- RDFa, microdata, or JSON-LD only after you've sorted out the first three
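To make the first three building blocks concrete, here is a minimal sketch of a robots.txt for a hypothetical catalogue at example.org; it excludes search results pages while pointing crawlers at a sitemap of persistent record URIs:

```text
# robots.txt for a hypothetical catalogue at example.org
User-agent: *
# Search results pages generate near-infinite parameter
# combinations, so keep crawlers out of them
Disallow: /search
# Record detail pages with persistent URIs stay crawlable;
# the Sitemap directive tells crawlers where to find them
Sitemap: https://example.org/sitemap.xml
```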
Only after setting that foundation did I feel comfortable launching into my
rationale for RDFa and schema.org as a tool for enabling discovery on the web:
a mapping of the access points that cataloguers create to the world of HTML
and aggregators. The key point for SWIB was that RDFa and schema.org can enable
full RDF expressions in HTML; that is, we can, should, and must go beyond
surfacing structured data to surfacing linked data through @resource
attributes and links to entities elsewhere on the web.
> The Semantic Web is an extension of the current web in which information is
> given well-defined meaning, better enabling computers and people to work in
> cooperation. (Tim Berners-Lee, Scientific American, 2001)
I also argued that using RDFa to enrich the document web was, in fact, truer to
Berners-Lee's 2001 definition of the semantic web, and that we should focus on
enriching the document web so that both humans and machines can benefit before
investing in building an entirely separate and disconnected semantic web.
I was worried that my talk would not be well received; that it would be
considered obvious, or scolding, or just plain off-topic. But to my relief
I received a great deal of positive feedback. And on the next day, both Eric Miller
and Richard Wallis gave talks on a similar, but more refined, theme:
that libraries need to do a much, much better job of enabling their resources
to be found on the web--not by people who already use our catalogues, but by
people who are not library users today.
There were also some requests for clarification, which I'll try to address
generally here (for the benefit of anyone who wasn't able to talk with me, or
who might watch the livestream in the future).
"When you said anything could be described in schema.org, did you mean we should throw out MARC and BIBFRAME and EAD?"
tl;dr: I meant "and", not "instead of"!
The first question I was asked was whether there was anything that I had not
been able to describe in schema.org, to which I answered "No"--especially since
the work that the W3C SchemaBibEx group had done to ensure that some of the core
bibliographic requirements were added to the vocabulary. It was not as
coherent or full a response as I would have liked to have made.
But combined with a part of the presentation where I countered a myth about
schema.org being a very coarse vocabulary by pointing out that it actually
contained 600 classes and over 800 properties, a number of the attendees
interpreted one of the takeaways of my talk as suggesting that libraries should
adopt schema.org as their sole descriptive vocabulary, and that MARC,
BIBFRAME, EAD, RAD, RDA, and other approaches for describing library resources
were no longer necessary.
This is not at all what I'm advocating! To expand on my response, you
can describe anything in schema.org, but you might lose significant
amounts of richness in your description. For example, a short story or a poem
would best be described in schema.org as a generic CreativeWork; you would
have to look at the associated description or keywords properties to
figure out the form of the work.
What I was advocating was that you should map your rich bibliographic
description into corresponding schema.org classes and properties in RDFa at the
time you generate the HTML representation of that resource and its associated
entities. So your poem might be represented as a CreativeWork, with
author and about values and relationships. Ideally,
author will include at least one link (for example, via @resource) to an
entity on the web; and you could do the same with about if you are
using a controlled vocabulary.
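As a rough sketch of that mapping, the record detail page for the poem might carry RDFa along these lines (the title, names, and URIs here are invented for illustration):

```html
<!-- Hypothetical record detail page for a poem, marked up with RDFa -->
<div vocab="http://schema.org/" typeof="CreativeWork">
  <h1 property="name">An Example Poem</h1>
  <!-- author links to an entity for the person elsewhere on the web -->
  <a property="author" typeof="Person"
     href="https://example.org/authority/person/123">Jane Author</a>
  <!-- about points at a controlled vocabulary term via @resource -->
  <span property="about"
        resource="https://example.org/authority/subject/poetry">Poetry</span>
</div>
```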
If you take that approach, then you can serve up schema.org descriptions of works
in HTML that most web-oriented clients will understand (such as search engines)
and provide basic access points such as name / author / keywords, while
retaining and maintaining the full richness of the underlying bibliographic
description--and potentially providing access to that, too, as part of the
embedded RDFa, via content negotiation, or via alternate links for
clients that can interpret richer formats.
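One well-established way to advertise those richer serializations (the URLs here are hypothetical) is a set of alternate links in the page's head:

```html
<!-- Advertise richer representations of the same record -->
<link rel="alternate" type="application/rdf+xml"
      href="https://example.org/record/123.rdf"/>
<link rel="alternate" type="application/marc"
      href="https://example.org/record/123.mrc"/>
```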
"What makes you think Google will want to surface library holdings in search results?"
There is a perception that Google and other search engines just want to sell
ads, or their own products (such as Google Books). While Google certainly does
want to sell ads and products, they also want to be the most useful tool for
satisfying users' information needs--possibly so they can learn more about those
users and put more effective ads in front of them--but nonetheless, the
motivation is there.
Imagine marking up your resources with the Product / Offer portion of
schema.org: you would provide search engines with availability information in
the same way that Best Buy, AbeBooks, and other online retailers do (and as
Evergreen, Koha, and VuFind already do). That makes it much easier for the search engines
to use everything they may know about their users, such as their current
location, their institutional affiliations, their typical commuting patterns,
their reading and research preferences... to provide a link to a library's
electronic or print copy of a given resource in a knowledge graph box as one of
the possible ways of satisfying that person's information needs.
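Concretely, a record detail page could mark up a held copy with the Product / Offer pattern along these lines (the title, branch name, and URIs are invented for illustration):

```html
<!-- Hypothetical holdings markup using the schema.org Product / Offer pattern -->
<div vocab="http://schema.org/" typeof="Product">
  <span property="name">An Example Poem</span>
  <div property="offers" typeof="Offer">
    <!-- Borrowing is free, so the price is 0 -->
    <meta property="price" content="0"/>
    <!-- Availability drawn from the item's circulation status -->
    <link property="availability" href="http://schema.org/InStock"/>
    <span property="seller" typeof="Library">
      <span property="name">Example Public Library, Main Branch</span>
    </span>
  </div>
</div>
```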
We don't see it happening with libraries running Evergreen, Koha, and VuFind
yet, realistically because the open source library systems don't have enough
penetration to make it worth a search engine's effort to add that to their
set of possible sources. However, if we as an industry make a concerted effort
to implement this as a standard part of crawlable catalogue or discovery record
detail pages, then it wouldn't surprise me in the least to see such suggestions
start to appear. The best proof that we have that Google, at least, is
interested in supporting discovery of library resources is the continued
investment in Google Scholar.
And as I argued during my talk, even if the search engines never add direct
links to library resources from search results or knowledge graph sidebars,
having a reasonably simple standard like the GoodRelations product / offer
pattern for resource availability enables new web-based approaches for building
applications. One example could be a fulfillment system that uses sitemaps to
intelligently crawl all of its participating libraries, normalizes an
item request to a work URI, and checks availability by parsing the offers on
the relevant record detail pages.