Linked Data and the Semantic Web: are GLAMs betting on the wrong horse?
Since the advent of the internet, many GLAMs (Galleries, Libraries, Archives and Museums) seem to be struggling with the same issue: what’s the best way to publish data online in a format that can be easily used, reused and linked to?
Two technologies that are often named to solve this problem are Linked (Open) Data and the Semantic Web. The first is a set of principles, the second a set of technologies implementing those principles. Both are published and developed mostly by the World Wide Web Consortium (W3C) and promise to solve the ‘data problem’ on the web.
Because the W3C is a non-profit organization, it is of little surprise that many GLAMs see these technologies as the logical solution to their problems.
Although the intentions of the W3C are good, I don’t think the solutions as envisioned by the W3C are the best match for GLAMs. Frankly, the proposed technologies are of little to no use for most GLAMs. Let me explain why.
A solution looking for a problem
Here’s the main problem: the semantic web technologies are solutions looking for a problem. Instead of solving an actual problem (how can GLAMs make their data easily available on the web?), the semantic web repurposes mostly obsolete technologies under the new moniker of ‘linked open data’ as the silver bullet to every technological problem.
Why are these technologies useless? Here’s the simple reason: how many successful ‘semantic’ projects can you name that have broad adoption, not only among GLAMs but also in other web projects?
Well, I actually know of a few. But they’re not the ones advocated by the W3C.
Here’s one: Open Graph, also known as ‘the Facebook tags’, is a way to annotate a webpage with some simple metadata, like a thumbnail image and a type (such as ‘video.movie’). It’s widely used: almost 50% of all shared webpages include it.
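For illustration, here’s roughly what those tags look like in a page’s head, and how easily a developer can pull them out; a minimal sketch in Python with made-up values (a real parser such as BeautifulSoup would be more robust):

import re

html = '''
<meta property="og:title" content="The Night Watch" />
<meta property="og:type" content="article" />
<meta property="og:image" content="https://example.org/night-watch.jpg" />
'''

# Collect every og:* property into a plain dictionary.
og = dict(re.findall(r'<meta property="og:(\w+)" content="([^"]*)"', html))
print(og["title"], og["image"])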
Do webmasters add these tags because they want to add ‘semantic meaning’ and ‘linked data’ to their websites? Of course not; they simply want a thumbnail and a proper description whenever somebody shares their pages on Facebook.
So, why does Open Graph work where the semantic technologies fail? It’s very simple: it solves a problem. The fact that it’s very easy to implement, and that the #2 website in the world actively supports it, probably helps as well.
The hell of XML
Here’s another problem with the semantic technologies. All of the semantic technologies are based on XML, and developers hate XML.
So, what do developers want? Developers simply want what we all want: something simple and easy that works most of the time. That’s why virtually all new APIs have settled on JSON rather than XML as their data format.
Yes, there are semantic formats in JSON, such as JSON-LD. Unfortunately, the origin of the format, which is XML, is clearly visible in the ‘translated syntax’, which is still unwieldy and unnecessarily verbose.
The semantic technologies are tricky to implement, so highly skilled developers are necessary. Unfortunately, most GLAMs don’t employ those in abundance. GLAMs that actually build their sites in-house are a minority; most outsource to external web developers. Do you think the nerds there have ever heard of OWL or SPARQL? Nope. But they do know how to parse JSON and do HTTP calls for sure.
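To make that concrete: this is about all it takes for any developer to consume a JSON endpoint (the URL below is hypothetical):

import json
import urllib.request

# Fetch a single record and parse it; no toolkits or ontologies required.
with urllib.request.urlopen("https://api.example-museum.org/work/000747.json") as resp:
    work = json.load(resp)

print(work["title"])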
Standards and the W3C
But the semantic web standards are written by the W3C, home of Sir Tim Berners-Lee, the inventor of the World Wide Web. Surely they must be good?
Unfortunately, most of the standards the W3C develops have little practical use on the ‘real web’. You know HTML5, the standard that gave us useful things like native video and app-like features on the web? None of that came from the W3C. If the W3C had had its way, we would be writing strictly valid, non-backwards-compatible XHTML (HTML written as XML). And we would probably still be living in the dark ages of Flash and Silverlight to get real stuff done.
On the web, new technologies are usually taken up within months of release. Anything that hasn’t been widely used for a year or two is considered ‘legacy’ (think Flash or Silverlight again).
Look at the semantic technologies again in that context: the first version of the RDF spec is from 1999, and OWL is from 2002. How many interesting posts on OWL or RDF do you think appear on heavily visited developer sites like Hacker News?
Misconceptions about web services
Another thing that most people propagating semantic web technologies tend to forget is that making a web service (such as an API) work properly is hard work. It doesn’t come for free with your technology.
This means that your service:
- Should be fast and responsive, even with many people using it at the same time
- Should be available all the time, 24/7
- Should be properly documented
Linked data is all nice and dandy, but if your SPARQL endpoint is only up 50% of the time and it takes a minute to do a query, how do you suppose a developer builds a stable app on top of it?
What you should do
Obviously, all of my ranting serves little purpose if I don’t give you an alternative. Most of the principles of the linked data movement are actually good; it’s in the implementation that the wrong decisions are made.
The most frustrating aspect of the whole ‘everyone should use semantic web technologies’ mantra is that organizations are spending money on useless technologies like triple stores when they could be spending it on stuff that’s actually useful.
Actually, the very first thing you should do is take a hard look at your website, not at the data behind it. Developers are one of your customers, but your very first priority should be your regular customers.
So think of the visitors to your museum, or the people in your library. Can they view your website properly on their smartphones? Does the site load fast enough (that means under 3 seconds)? Is everything easily findable? Are the texts up to date? What about the design?
If your website still looks like it’s 1999, maybe it’s time to update that instead of thinking about the ideal world of SPARQL endpoints and structured RDF.
Actually, rebuild your website as the first customer of your API. The two can be developed in tandem, and your web developers will give you invaluable feedback on the use of your API.
Putting it into practice
So, let’s get back to your API. Take the vision of linked data according to Wikipedia:
(…) a method of publishing structured data so that it can be interlinked and become more useful (…) to share information in a way that can be read automatically by computers.
Here are a few things you can do to get to that vision:
Permalinks
Have permanent URLs for your items. If you have a website with paintings, don’t use a URL like
http://www.musee-orsay.fr/en/collections/index-of-works/resultat-collection.html?no_cache=1&zoom=1&tx_damzoom_pi1%5Bzoom%5D=0&tx_damzoom_pi1%5BxmlId%5D=000747&tx_damzoom_pi1%5Bback%5D=en%2Fcollections%2Findex-of-works%2Fresultat-collection.html%3Fno_cache%3D1%26zsz%3D9
Don’t laugh, that’s an actual link. Here’s a better solution:
http://www.musee-orsay.fr/work/000747
Machine-readable data
Offer some kind of machine-readable data. If you don’t have the money to develop a full-fledged API, that’s fine; CSV dumps are better than nothing.
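A dump really is trivial to produce; here’s a minimal sketch using Python’s standard library (the records and field names are made up):

import csv

works = [
    {"id": "000747", "title": "Starry Night Over the Rhône", "artist": "Vincent van Gogh"},
    {"id": "000748", "title": "Another Painting", "artist": "Unknown"},
]

# Write the whole collection to a single CSV file with a header row.
with open("collection.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "title", "artist"])
    writer.writeheader()
    writer.writerows(works)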
If you are building an API, at the most basic level, something like
http://www.musee-orsay.fr/work/000747.json
is okay. If you can deliver a ‘real’ API, that’s cool too. Keep it simple, and don’t force people to use an API key. Want to offer a search option? What about something like:
http://api.musee-orsay.fr/search/?q=rembrandt
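To show how little is needed, here’s a minimal sketch of such an API using Flask; the routes mirror the URLs above, the data is made up, and a real service would of course query a database:

from flask import Flask, jsonify, request

app = Flask(__name__)

# A stand-in for your collection database.
WORKS = {
    "000747": {"id": "000747", "title": "Starry Night Over the Rhône",
               "artist": "Vincent van Gogh"},
}

@app.route("/work/<work_id>.json")
def work(work_id):
    # Return one record as JSON, or a 404 if the id is unknown.
    item = WORKS.get(work_id)
    if item is None:
        return jsonify({"error": "not found"}), 404
    return jsonify(item)

@app.route("/search/")
def search():
    # Naive substring search over titles and artists: /search/?q=rembrandt
    q = request.args.get("q", "").lower()
    hits = [w for w in WORKS.values()
            if q in w["title"].lower() or q in w["artist"].lower()]
    return jsonify({"query": q, "results": hits})

if __name__ == "__main__":
    app.run()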
JSON
JSON should be the only output format. Everybody is doing it, and so can you. You really don’t need an XML schema. Nobody will use it. Not quite sure how to translate your existing XML schema to JSON? Take the JSON output of the Europeana API as an example.
Documentation
Instead of trying to shoehorn your data into some metadata format, put all your effort into documenting your metadata fields. Does your ‘format’ field usually contain the width and height of a painting, but sometimes also ‘jpeg’? Fine, tell us about that weird stuff.
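To make that concrete, here’s a made-up record illustrating the kind of quirk worth writing down:

# Hypothetical record: one 'format' field mixing physical dimensions
# and a file type. Don't hide it behind a grand metadata format;
# just document it so developers know what to expect.
work = {
    "id": "000747",
    "format": ["65.0 x 92.0 cm", "jpeg"],
}

A single line in your documentation (‘format may contain physical dimensions and/or a file type’) saves a developer an hour of head-scratching.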
Code
If the only way you can deliver documentation on your website is in the form of a PDF file, you’re doing it wrong. For developers, the main place to find code and documentation is GitHub. If you’re writing an example library for your API, that’s also the place to host it. You can also use GitHub Pages to host your documentation. For a nice example, see the Rijksmuseum API docs and their GitHub profile.
To conclude
Thanks for reading this article. If you think it’s useful, please share it using your favourite medium. Remarks or questions? Add a comment. And for more rambling about stuff like this, follow me on Twitter.
Marcus Smith
This is an interesting take on things, and makes a couple of good points, but is marred by a few misconceptions or misunderstandings, which I hope you’ll permit me to comment on and, in places, correct.
– “Because the W3C is a non-profit organization, it is of little surprise that many GLAMs see these technologies as the logical solution to their problems.” – I don’t think that the non-profit status – or otherwise – of the W3C has anything to do with whether people think linked data or the semantic web are good ideas. How does the one relate to the other?
– “the semantic web repurposes mostly obsolete technologies under the new moniker of ‘linked open data’ as the silver bullet to every technological problem” – Straw man. The semantic web does not claim to be a technological panacea, but seeks to solve a particular set of problems related to publishing rich, descriptive data in an interoperable way. Which are the “obsolete” technologies to which you refer? As far as I know, HTTP, URIs, and RDF are all alive and well.
– “Do webmasters add these tags because they want to add ‘semantic meaning’ and ‘linked data’ to their websites?” – I don’t know about Open Graph, but adding RDFa or schema.org semantic metadata to your website is a good way to provide more accurate data to search engines and improve your ranking relevance.
– “All of the semantic technologies are based on XML, and developers hate XML.” – Firstly, linked data and the semantic web are not based on XML; they’re (usually) based on RDF. These are two completely different data models: XML is hierarchical, RDF is graph-based. RDF *can* be serialised as XML, but they’re not a very good match: you have to jump through some hoops to get it to work, and the resulting XML can be inconsistent and is not easy to work with using normal XML tools. It’s an easy mistake to make, though: sadly, RDF has somehow become strongly associated with XML in the minds of many web developers. When RDF came out, XML was the Next Big Thing, and so of course RDF had to serialise to XML; it’s proven to be a hard image to shake. Note, for example, that the developers of JSON-LD deliberately avoided mentioning RDF at all for a long time, because when web developers think RDF, they think XML, and then they get the urge to run away very fast.
Secondly, “developers” don’t hate XML: front-end *web* developers hate XML; everyone else has a mature toolchain for dealing with it sensibly. But front-end web devs have only JavaScript (pity them, poor souls!) and so it’s no surprise that they prefer JSON. JSON is much more convenient to work with than XML when you’re working in JavaScript, but if (other) developers really hated XML that much, they’d probably be using something like YAML instead (JSON is actually a proper subset of YAML, minus the support for complex data structures). As it is, YAML is rarely seen in the wild outside of specific developer communities.
– “Unfortunately, the origin of the format, which is XML, is clearly visible in the ‘translated syntax’” – I’m not sure where you got this from. Actually, the origins of JSON-LD do not lie with XML at all, or indeed with the semantic web or RDF. It would be fair to say that the developers of JSON-LD are not particularly pro-semantic web (they actually seem to be more frustrated with it!), and the fact that JSON-LD allows the serialisation of RDF graphs to JSON seems to have been a late development, albeit rather a useful one.
– “The semantic technologies are tricky to implement, so highly skilled developers are necessary.” – Not especially tricky; just less familiar than other methods in wider use (relational databases, HTML).
– “Do you think the nerds there have ever heard of OWL or SPARQL? Nope. But they do know how to parse JSON and do HTTP calls for sure.” – This is the need that JSON-LD fills. It’s JSON! Just like your front-end web devs are used to! :D
– “Unfortunately, most of the standards the W3C develops have little practical use on the ‘real web’.” … “And we would probably still be living in the dark ages of Flash and Silverlight to get real stuff done.” – Are you for real?!
– “Another thing that most people propagating semantic web technologies tend to forget is that making a web service (such as an API) work properly is hard work. It doesn’t come for free with your technology.” – I don’t think anyone forgets that. But a simple REST-based API doesn’t *have* to be hard: there are plenty of frameworks out there that will do the heavy lifting for you. I think this is, again, more a question of lack of familiarity than actual difficulty.
– “Linked data is all nice and dandy, but if your SPARQL endpoint is only up 50% of the time and it takes a minute to do a query, how do you suppose a developer builds a stable app on top of it?” – Here you totally have a point. It’s almost become an in-joke that six simultaneous users of a SPARQL endpoint constitute a DDoS attack, and keeping an endpoint up for any reliable length of time seems to be a real challenge. SPARQL endpoints and triple stores are still relatively young technologies, and are nowhere near as quick or stable as relational databases, which have a few decades of development lead on them! (And even with that lead, how long do you think your SQL server would last if you opened it up to queries from the web?) This explains why institutions have been generally slow to adopt SPARQL endpoints, preferring instead to use web APIs to allow querying and access to their data, which by contrast tend to be easier to implement, much more reliable, and much quicker.
– “Can they view your website properly on their smartphones? Does the site load fast enough (that means under 3 seconds)? Is everything easily findable?” – All sound advice.
– “Actually, rebuild your website as the first customer of your API.” – Also good advice, but… weren’t you just saying that web APIs were *hard*? (“Let’s go shopping!”, right?)
– “Have permanent URLs for your items.” – Yes! Better still, make them URIs!
– “Offer some kind of machine-readable data.” – Also yes! With this, plus the permanent URIs, you just described linked data! I thought you were trying to say that linked data was the “wrong horse” to bet on…?
– “http://www.musee-orsay.fr/work/000747.json” – Don’t use file extensions in your URIs; that’s what ‘Accept’ headers are for. :)
– “JSON should be the only output format.” – Why shouldn’t you offer a variety of output formats and let the consumer decide? Why shouldn’t you use JSON-LD to give your data’s URLs meaning and make them interoperable with other people’s services? JSON has no built-in support for links, and no way to standardise fields across different services, but these are problems that JSON-LD solves – in a backwards-compatible way, no less. Seems like a no-brainer: you keep the web-devs happy by providing JSON, and expose your data in an interoperable way by making that JSON linked data. :)
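To illustrate that last point, here’s a minimal sketch (with illustrative vocabulary URIs): perfectly ordinary JSON that any front-end developer can consume, which is at the same time valid linked data, because the @context maps the field names onto shared vocabularies:

import json

work = {
    "@context": {
        "title": "http://purl.org/dc/terms/title",
        "creator": "http://purl.org/dc/terms/creator",
    },
    "@id": "http://www.musee-orsay.fr/work/000747",
    "title": "Starry Night Over the Rhône",
    "creator": "Vincent van Gogh",
}

# To a web developer this is just a dict; to a linked data consumer it is
# a set of RDF triples about the resource identified by "@id".
print(json.dumps(work, indent=2, ensure_ascii=False))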
I don’t agree with your conclusion that linked data technologies are a poor choice for GLAMs; in fact, I think your assessment of the problem is over-simplistic. But neither do I deny that there are implementation difficulties which can make the switch to linked data problematic and which have yet to be satisfactorily resolved. It seems likely that these issues will become less significant as the technologies mature. In the meantime, there are plenty of successful examples of GLAMs using linked open data and the semantic web to do great things! :)
Ethan Gruber
I agree with most of Marcus’s points here, but there is one where I disagree: content negotiation. I think it’s a great thing that can be implemented optionally, but it would be a mistake for it to be the only means of accessing alternative data models. Here are some practical examples:
One of my software applications, xEAC, is designed to create, edit, and manage collections of EAC-CPF, an XML schema for authority, context, and semantic relationships of corporate, personal, and familial entities. At the moment, xEAC delivers derivative RDF in three different models: a basic CIDOC-CRM model, one that conforms to the Standards for Networking Ancient Prosopographies ontology, and one that follows emerging standards from the archival community. It is impossible to deliver all three through content negotiation, because all three could be application/rdf+xml or text/turtle. I created an API to request specific models, but I still have a ‘.rdf’ URI for getting the default archival model.
Another example of why it can be very useful to deliver a model through a stable URI instead of through content negotiation only: KML. Suppose you have a gazetteer entry or an object with geographic information that could be rendered as KML. A casual, non-developer user would never use content negotiation to get the KML, but they might want to download it. They could be given a link, which would automatically launch Google Earth, if installed. A URI of a KML file can be copied and pasted into Google Maps and rendered. There are other web services that can render KML made available through a URI. A content-negotiation-only approach caters only to developer-consumers rather than non-developers. Ideally, you want your system to be usable by as large an audience as possible. Content negotiation is picking up steam in the semantic web community, but it is actually incongruous with REST architecture.
Nils Breunese
Great article, but I cringe when I read that developers hate XML and “You really don’t need an XML schema. Nobody will use it.” I’ll take data with a schema over schemaless data any day, and I’m convinced that any developer worth their salt would too. Yes, I am aware that schemaless JSON looks easy and simple, but it means you have to go and write your own parser and domain model. So much waste and potential for problems.
* shakes fist and mumbles something about getting off some lawn *
Rotimi
Reading this in JUNE 2016:
Really awesome discussions here, and a most interesting article.
I am a current (sluggish) student of Semantic Web technologies (JSON-LD, SPARQL, Turtle, RDFS and recently OWL – not yet sure where to focus, or which offers the most financial reward).
I would really love to know how the author and the commenters feel about the current state of the semantic web and GLAMs these days.