What to cite: versioning and granularity of research data for effective citation

Versions of the Bible http://www.flickr.com/photos/56380734@N05/7686975008/

The Bible in multiple versions

Deciding what constitutes a citable unit of data is a fundamental question for research data management, particularly if data is to become a measurable scholarly output in its own right.

Unlike scholarly articles, datasets take many forms and are often fluid and not clearly defined. The challenges of implementing DOIs (or other identifiers) so that such data is usable, meaningful and manageable were addressed at our 4th workshop, What to cite: versioning and granularity of research data for effective citation, which was held on Monday 3rd December at the British Library.

Roy Lowry from the British Oceanographic Data Centre (BODC) was the first speaker of the day.

The origins of the BODC date back to 1969, so it has long-established data management practices in place. The scarcity and cost of oceanographic data also mean that the community has a long history of data sharing. This puts oceanography ahead of many disciplines that lack a pre-existing culture of data sharing but, after so many years of doing things in a certain way, it is not always easy to persuade researchers to adopt new practices. Roy emphasised that, while the wealth of experience within the organisation brings clear benefits, successfully mapping these existing practices to the new data publication paradigm is a significant challenge.

Roy also introduced us all to the concept of the “plastic DOI”: in other words, a DOI that is attached to data that is not particularly meaningful and unlikely to ever be used or cited.  A useful concept!

The BODC today…

A particular issue for the BODC is that data and metadata currently undergo significant processing at the data centre (in accordance with International Oceanographic Data and Information Exchange (IODE) Policy), with the result that the data ultimately bears little resemblance to what was ingested.

At present, the data centre does not have a policy of preserving ‘snapshots’ of the data while it is under processing/revision.  The user is simply served with the best available data at the time of their request.

In the new data publication paradigm…

In contrast with current BODC practices, a cited dataset should be a ‘fixed’ item (with a unique identifier and citation). Roy suggested that, to ensure fixity, the checksum should be a metadata item, with any change triggering the creation of a new version. Previous versions should be preserved and remain accessible.
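The checksum idea can be sketched in a few lines of Python. This is only an illustration, assuming SHA-256 as the checksum and an in-memory list as the version store; the names `DatasetRecord` and `deposit` are hypothetical and do not describe the BODC's actual systems.

```python
import hashlib
from dataclasses import dataclass, field


def checksum(data: bytes) -> str:
    """Return the SHA-256 hex digest of the dataset bytes."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class DatasetRecord:
    """Hypothetical record that keeps every version of a dataset."""
    # Each entry is (version number, checksum, data); nothing is ever deleted.
    versions: list = field(default_factory=list)

    def deposit(self, data: bytes) -> int:
        """Store data, creating a new version only if the checksum changed."""
        digest = checksum(data)
        if self.versions and self.versions[-1][1] == digest:
            return self.versions[-1][0]  # unchanged: still the same version
        version = len(self.versions) + 1
        self.versions.append((version, digest, data))
        return version


record = DatasetRecord()
v1 = record.deposit(b"temp,salinity\n10.1,35.0\n")
v1_again = record.deposit(b"temp,salinity\n10.1,35.0\n")  # identical bytes
v2 = record.deposit(b"temp,salinity\n10.2,35.0\n")        # one value revised
print(v1, v1_again, v2)
```

Because earlier entries are never removed, a citation to version 1 can still be resolved after version 2 has been created.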

The nature of oceanographic data means that it needs to remain usable over decadal timescales.  Adhering to standards – OAIS, Dublin Core … – will therefore be an essential part of future data management plans.

Paradigm mapping challenges and potential solutions

1.      Mapping dynamic datasets to a static equivalent suitable for publication:

The BODC are introducing the concept of the “discovery dataset”: pre-defined aggregations of data atoms (i.e. the smallest available units of data) with accompanying descriptive and discovery metadata. They are also looking at the possibility of introducing a “request publication” option whereby a DOI is minted for an on-the-fly instantiation of a dynamic dataset upon user request.

2.     Ensuring that datasets served to the user can be replicated:

The necessary infrastructure for storage of and access to previous versions of datasets is under construction.

3.     Workflow timing mismatches (e.g. time taken for in-house processing of data could delay publication):

The BODC are considering some form of “publication without ingestion” service which would allow submission of data in support of a research article. The data would be given a DOI and subjected to basic quality control but would not undergo full processing.

Roy stressed that the BODC never mints a DOI on a promise – they always ask to see the hard data first.
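The first of the mappings above, freezing a dynamic dataset into a static aggregation of data atoms, might look something like the Python sketch below. Everything in it is an assumption made for illustration: the function `freeze_snapshot`, the atom identifiers and the use of a SHA-256 digest over a sorted JSON serialisation are hypothetical, and no DOI is actually minted.

```python
import hashlib
import json
from datetime import datetime, timezone


def freeze_snapshot(atoms: dict, selected_ids: list) -> dict:
    """Build a static snapshot from the selected data atoms.

    The atoms are serialised in a deterministic order so that the same
    selection of the same data always yields the same checksum.
    """
    chosen = {atom_id: atoms[atom_id] for atom_id in sorted(selected_ids)}
    payload = json.dumps(chosen, sort_keys=True).encode("utf-8")
    return {
        # Re-parsing the payload gives a copy that is independent of the live data.
        "atoms": json.loads(payload),
        "checksum": hashlib.sha256(payload).hexdigest(),
        "frozen_at": datetime.now(timezone.utc).isoformat(),
    }


# Hypothetical live holdings: atom identifier -> measured values.
live_atoms = {"cast-001": [10.1, 35.0], "cast-002": [9.8, 34.9]}

snapshot = freeze_snapshot(live_atoms, ["cast-002", "cast-001"])

# Later reprocessing changes the live data but not the frozen snapshot.
live_atoms["cast-001"][0] = 10.2
later = freeze_snapshot(live_atoms, ["cast-001", "cast-002"])
print(snapshot["checksum"] == later["checksum"])
```

The snapshot, rather than the live holdings, is what an identifier would point to, so a reader following the citation retrieves exactly the data that was served at the time of the request.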

Progress towards the new data publication paradigm

Work is ongoing at the BODC to meet the challenges of managing oceanographic data effectively for publication.  Currently, an operational prototype Published Data Library hosts a small number of datasets that have been made available under the new data model.

Roy finished on an optimistic note – there is still a lot of work to be done but he is confident that they will get there.

Next up, Neil Jefferies from the Bodleian Library, Oxford looked at the challenges of DOI implementation at a large, multi-disciplinary institution.

Data Catalogue

The University of Oxford issues UUIDs to all holdings in its repository.  The advantage of UUIDs is that they can be minted in multiple locations and are generally recognised by Google search.  Some items will also be assigned DOIs or other identifiers.  If a researcher wants to publish their data, they must meet the criteria for a DOI. One problem they are currently facing is that data are not always online, or even digitised, and these data may also need to be referenced.
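The point about minting UUIDs in multiple locations can be illustrated with Python's standard `uuid` module. Version-4 UUIDs are generated from random bits, so separate systems can create them without consulting a central registry. The helper `mint_identifier` is hypothetical and is not a description of how Oxford's repository works.

```python
import uuid


def mint_identifier() -> str:
    """Return a new random (version 4) UUID as a string."""
    return str(uuid.uuid4())


# Two independent "locations" mint identifiers with no coordination.
site_a = mint_identifier()
site_b = mint_identifier()
print(site_a)
print(site_b)
```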

The Oxford data catalogue, DataFinder, will contain information on all institutional research data and resources – even those not held in the University’s own repository.  This reflects the need to be able to locate all relevant data and safeguards against data loss if, for example, an external repository ceases to exist.

Metadata as data?

Oxford increasingly store digital objects that consist only of metadata: for example, information about a person. The distinction between data and metadata is therefore becoming less clear. The use of data aggregations, which combine data from multiple sources into a new entity, is another example of a ‘metadata only’ object within the repository.

Rebecca Lawrence, of new online OA journal F1000 Research, presented the publisher’s perspective and introduced us to the novel approach to versioned research articles that F1000 is pioneering.

A problem for scholarly publishers in the digital age is that multiple versions of an article can co-exist in a variety of locations – preprint, author version, institutional repository etc.  The traditional role of the journal is to hold the ‘version of record’ which, in principle, should not change after publication.  BUT, we know that this is not really how science works.

The F1000 model

F1000 aims to more closely reflect the dynamic nature of scientific research by operating a transparent, post-publication peer review process that allows for the ongoing revision of articles in response to referee reports – all carried out in full public view.  This process requires a robust versioning policy to ensure that the latest version of an article is easily identifiable.

F1000 also requires full data deposit and sharing for all articles (a policy that, Rebecca says, all authors have complied with so far). This throws up additional challenges for the journal:

  •  What do we do when data associated with a published article changes?
  •  Error correction: when is a change big enough to warrant a new version?

The journal also accepts “Data only articles” (defined as “datasets + protocols, without analyses or conclusions”).  An example can be found here.

Citing versioned articles 

To cope with this new model of versioned publishing, F1000 recommends a standard way of citing an article which includes the article version number and refereeing status.  For example,

Stephen Senn (2012) Misunderstanding publication bias: editors are not blameless after all. [v1; ref status: awaiting peer review, http://f1000r.es/YvAwwD] F1000 Research 1:59 (doi: 10.3410/f1000research.1-59.v1)

Ayman Shabana, et al. (2012) Termination of mid-trimester pregnancies: misoprostol versus concurrent weighted Foley catheter and misoprostol. [v2; ref status: Indexed, http://f1000r.es/UbkmSv] F1000 Research 1:36 (doi: 10.3410/f1000research.1-36.v2)

Most metadata fields can be changed between versions. The core items that should remain the same are:  first author, article number, title.

How will indexing services deal with this new approach?

F1000 have worked with Scopus and Thomson (Web of Knowledge) to ensure that the multiple article versions can be managed effectively. For indexing purposes, both Scopus and Web of Science will only retain the latest version of an article.  They will also combine the citations to all versions of an article by using the core part of the DOI as the identifier (F1000 DOIs have a version number appended to the ‘core’ DOI, e.g. 10.3410/f1000research.2012.1-45.v2).
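As a rough illustration of combining citations by core DOI, the Python sketch below strips the version suffix and counts citations per core identifier. It assumes the suffix always has the form `.vN`, as in the DOIs quoted in this post; the function names and the list of cited DOIs are made up for the example, and the indexing services' actual methods were not described.

```python
import re
from collections import Counter

# Assumed suffix pattern: a dot, the letter v and a version number at the end.
VERSION_SUFFIX = re.compile(r"\.v\d+$")


def core_doi(doi: str) -> str:
    """Strip the trailing version suffix to recover the core DOI."""
    return VERSION_SUFFIX.sub("", doi)


def combine_citations(cited_dois: list) -> Counter:
    """Count citations per article, pooling all versions of each article."""
    return Counter(core_doi(doi) for doi in cited_dois)


# Illustrative list of DOIs as they might appear in citing papers.
cited = [
    "10.3410/f1000research.1-36.v1",
    "10.3410/f1000research.1-36.v2",
    "10.3410/f1000research.1-59.v1",
]
totals = combine_citations(cited)
for doi, count in totals.items():
    print(doi, count)
```

A DOI without a version suffix passes through `core_doi` unchanged, so versioned and unversioned identifiers can be pooled in the same count.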

Unresolved issues 

Rebecca highlighted some key challenges that article versioning has posed, and the way F1000 have chosen to approach them:

  • Additional authors can be added to later versions of an article. Should they benefit from citations to earlier versions?  At present, all authors receive equal credit.
  • Is the publication date updated when a new version of an article is issued?  Current policy is to retain the publication date of the original article for subsequent versions.
  • Should changes to the title be permitted?  Once an article is indexed (i.e. once it has been successfully peer reviewed), the title is fixed.  Changes prior to this point may be allowed.
  • Referee reports and comments form part of the article itself so, whenever a new report or comment is added, the article has changed. Should a new DOI be issued at this point? No new DOI is issued under F1000’s present policy.

The final presentation of the day was from Simon Coles of the University of Southampton, whose experience as both an active researcher and Director of the National Crystallography Service gives him a unique view of the data sharing needs and motivations of academics.

Researchers are beginning to feel the pressure from research councils to improve their data management practices and openness, although the “why should I bother” attitude is still common.  However, competition and recognition from peers are significant motivators for most researchers so, as data sharing is increasingly seen as an important part of the research process (e.g. through recognition in the REF), this will become an important driver towards data sharing and openness.

Chemical data sharing initiatives

The eCrystals archive was an early (established in 2003) experiment in online crystallographic data sharing.  The nature of chemistry research means that published papers are often very ‘data heavy’.  eCrystals aimed to bypass this practice of sharing data via the article by providing a central repository for these datasets.  DOIs were adopted early on (2005) and allowed authors to embed the data DOI in their paper, rather than including the data itself.

Simon also introduced the group to ChemSpider (described as “Google for chemistry”), which harvests chemistry data from multiple sources and makes it searchable via a single interface. ChemSpider operates using RDF and demonstrates what is possible with open, well-managed data.

Well-structured vs. less structured data

The examples above work for data that is structured and already recognised in the community as a valid research output. But what about less structured data, such as lab notes or instrumental data streams? This ‘grey’ material contains a lot of useful information, about the experimental process for example, that is currently not being exploited.

One project that Simon is currently working on is LabTrove, which aims to provide a framework for the beginning-to-end documentation of an experiment. The tool provides a means of structuring and organising lab notes (with user-added metadata) and also enables the publication and sharing of results.  A “multi layer” approach to the data allows lab notes to be represented in a number of different ways, with each layer playing a different role in the management or discoverability of the data.  Examples of LabTrove in action can be seen at http://www.ourexperiment.org/ and http://biolab.isis.rl.ac.uk/.

Simon also spoke about how his institution (U. Southampton) views the various data management activities that he has been involved in. In general, the University has been supportive but, as ever, they are concerned about how much it will cost.

Future projects

Simon is involved in projects to introduce DataCite into the eCrystals and LabTrove services.  The DataCite/eCrystals integration is already underway, using Southampton data as an exemplar.  Implementing DOIs for LabTrove will require more testing but, ultimately, it would be desirable to be able to cite both an entire lab notebook and individual components (or collections of components).


The workshop ended with a group discussion centred around some key issues relating to the topics of versioning and granularity.  The debates stimulated during this session highlighted some very interesting differences of opinion about the best approach to DOI implementation but also, reassuringly, a great deal of consensus on many issues.  As this post is already much too long, I’ll blog some of the most interesting points raised during the discussions separately. Watch this space!

You can view the presentations from the workshop here.

