Open data quality – the next shift in open data?

This blog post is part of our Global Open Data Index blog series. It is a call to recalibrate our attention to the many different elements that contribute to the ‘good quality’ of open data, the trade-offs between them, and how they support data usability (see, for instance, vital work by the World Wide Web Consortium). Focusing on these elements could help governments publish data that can be easily used. The blog post was jointly written by Danny Lämmerhirt and Mor Rubinstein.


Some years ago, open data was heralded as a way to unlock information that would otherwise remain closed to the public. In the pre-digital age, information was locked away, and an array of mechanisms was necessary to bridge the knowledge gap between institutions and people. So when the open data movement demanded “Openness By Default”, many data publishers followed the call by releasing vast amounts of data in its existing form to bridge that gap.

To date, it seems that opening this data has not reduced but rather shifted and multiplied the barriers to the use of data, as Open Knowledge International’s research around the Global Open Data Index (GODI) 2016/17 shows. Together with data experts and a network of volunteers, our team searched, accessed, and verified more than 1400 government datasets around the world.

We found that data is often stored in many different places on the web, sometimes split across documents, or hidden many pages deep on a website. Data is also provided under widely varying access conditions. It can be presented in many forms and file formats, sometimes using uncommon signs or codes that are, in the worst case, understandable only to their producer.

As the Open Data Handbook states, these emerging open data infrastructures resemble the myth of the ‘Tower of Babel’: more information is produced, but it is encoded in different languages and forms, preventing data publishers and their publics from communicating with one another. What makes data usable under these circumstances? How can we close the loop in this information chain? The short answer: by providing ‘good quality’ open data.


Understanding data quality – from quality to qualities

The open data community needs to shift its focus from mass data publication towards an understanding of good data quality. Yet there is no shared definition of what constitutes ‘good’ data quality.

Research shows that there are many different interpretations and ways of measuring data quality. They include data interpretability, data accuracy, timeliness of publication, reliability, trustworthiness, accessibility, discoverability, processability, and completeness. Since people use data for different purposes, certain data qualities matter more to one user group than to another. Some of these areas are covered by the Open Data Charter, but the Charter does not explicitly name them as ‘qualities’ that together add up to high quality.


Current quality indicators are not complete – and miss the opportunity to highlight quality trade-offs

Existing indicators also assess data quality very differently, potentially framing our language and thinking about data quality in conflicting ways. For example:

  • Some indicators focus on the content of data portals (the number of published datasets) or on access to data.
  • Only a small fraction focuses on the datasets themselves: their content, structure, understandability, or processability.

Even GODI and the Open Data Barometer from the World Wide Web Foundation do not share a common definition of data quality.

Arguably, the diversity of existing quality indicators prevents a targeted and strategic approach to improving data quality.

At the moment GODI sets out the following indicators for measuring data quality:

  • Completeness of dataset content
  • Accessibility (access-controlled or public access?)
  • Findability of data
  • Processability (machine-readability and amount of effort needed to use data)
  • Timely publication

This leaves out other qualities. We could ask, for instance, whether data is actually understandable to people. Is there a description of what each part of the data content means (metadata)?
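As a rough illustration of the kind of metadata this question points to, the sketch below attaches a short description to each column of a tabular dataset using the Table Schema format from the Frictionless Data specifications. The dataset, field names, and descriptions are assumptions invented for this example, not a prescription.

```python
import json

# A minimal, illustrative Table Schema descriptor: every field carries a
# human-readable description so users know what each column means.
# The field names and descriptions below are made up for this sketch.
schema = {
    "fields": [
        {
            "name": "year",
            "type": "year",
            "description": "Budget year the figure applies to",
        },
        {
            "name": "expenditure",
            "type": "number",
            "description": "Approved expenditure in millions of the national currency",
        },
    ]
}

# Publishing this descriptor alongside the data file documents its content.
with open("tableschema.json", "w") as f:
    json.dump(schema, f, indent=2)
```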


Improving quality by improving the way data is produced

Many data quality metrics are (rightfully so) user-focussed. However, it is critical that governments, as data producers, better understand, monitor, and improve the inherent quality of the data they produce. Measuring data quality can incentivise governments to design data for impact, by raising awareness of the quality issues that would otherwise make data files practically impossible to use.

At Open Knowledge International, we address data producers and the quality issues of data files mostly through the Frictionless Data project. Notable projects include the Data Quality Spec, which defines some essential quality aspects for tabular data files; GoodTables, which provides structural and schema validation of government data; and the Data Quality Dashboard, which enables open data stakeholders to see data quality metrics for entire data collections “at a glance”, including the number of errors in a data file. These tools help to develop a more systematic assessment of the technical processability and usability of data.
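To make this concrete, here is a minimal sketch of what such a structural check might look like with the goodtables Python library. It assumes goodtables is installed (for example via pip), uses a placeholder file name, and the report fields shown reflect the library at the time of writing; they may differ between versions.

```python
from goodtables import validate

# Validate the structure of a tabular file: blank or duplicate headers,
# ragged rows, missing values and (if a schema is supplied) type mismatches.
# 'budget.csv' is a placeholder file name for this example.
report = validate('budget.csv')

print(report['valid'])        # True if no errors were found
print(report['error-count'])  # total number of errors detected

# Each error records where it occurred and why, which is what makes the
# feedback actionable for the data producer.
for table in report['tables']:
    for error in table['errors']:
        print(error.get('row-number'), error['code'], error['message'])
```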

A call for joint work towards better data quality

We are aware that good data quality requires many solutions working together. Therefore, we would love to hear your feedback. What are your experiences with open data quality? Which quality issues hinder you from using open data? How do you define these data qualities? What could the GODI team improve? Please let us know by joining the conversation about GODI on our forum.