Scholarly Output Notification and Exchange (SONEX): 2011

Tuesday 13 December 2011

Thematic parallel session on metadata - actions to be taken

On Day II of the JISC MRD Programme 2011-13 launch event in Nottingham, on Dec 2nd, subject-based discussion sessions were held among the different JISCMRD02 research data management projects in order to promote synergies and joint work on common issues. This is a brief report on the outcomes of the parallel session on metadata. Other sessions were held simultaneously for Institutional, Life Sciences, Engineering and Archaeology MRD projects, whose discussions have been reported elsewhere (and there are also other posts summarizing the talks at this one).

It was really hard for some of us to pick just one of those groups, since many projects actually belonged to several strands (some lucky ones also had two representatives at the event, it should be noted). The session on metadata was attended, among others, by:

- Anna Clements (U St Andrews)
- Simon Kerridge (U Sunderland)
- Kevin Ginty (U Sunderland)
- Charlotte Pascoe (British Atmospheric Data Centre)
- Pablo de Castro (SONEX Workgroup)
- Simon Hodson (JISC MRD Programme manager)
- David Shotton (U Oxford)
- Louise Corti (UK Data Archive)
- Marco Fabiani (Queen Mary U London)
...

Discussion

Metadata standards were repeatedly discussed throughout the session - there was a joint (and unsuccessful) attempt to recall whether anyone knew of a registry of metadata standards for different disciplines. Representatives from the CERIF4Datasets Project, University of Sunderland, mentioned they were using the MEDIN metadata standard for their work in marine sciences data management. The Core Scientific Metadata Model (CSMD) standard, developed at STFC for the I2S2 Project, was also mentioned as an interesting approach to a multi-disciplinary metadata standard for structural sciences such as Chemistry, Materials Sciences, Earth Sciences or Biochemistry. Finally, the PIMMS Project (BADC/U Reading) mentioned Metafor as a Climate Science metadata standard and their goal of using the PIMMS software tool to generate CIM-based content.

At some point the idea caught on that metadata standards should perhaps be mandated by publishers in order to harmonise discipline-specific data description procedures. Publishers are actually involved in several very successful international RDM projects, such as Dryad, but -save for REWARD- are notably missing from JISCMRD02 projects.

Having previously developed the Semantic Publishing and Referencing (SPAR) Ontologies, David Shotton said he was now working on their extension to CERIF-based metadata description of datasets, which is closely linked to dataset CERIFication work being carried out at the CERIF4Datasets Project.

Actions

The following actions were proposed for improving the chances of metadata standard harmonisation - hence enhancing dataset discoverability:

Trying to locate (or otherwise collect) an already existing registry of metadata standards for different disciplines, in order to offer researchers from a given discipline an already tested metadata schema they can re-use,

Mapping metadata standards to each other, aiming to produce a minimum-sufficient-information metadata set that may be widely applicable across disciplines,

Taking steps towards organising a workshop in order to have metadata issues discussed among relevant stakeholders. The ANDS Metadata Workshop in 2010, with all its discipline-based approaches to metadata standards, might be a potential source of inspiration for this. Proposed dates for this Metadata WS were spring-summer 2012.
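The mapping action above can be illustrated with a minimal sketch of a metadata crosswalk. Everything in it is an assumption made for illustration: the standard names and field names are hypothetical and are not taken from MEDIN, CSMD, CIM or any other standard mentioned at the session. It only shows the general idea of projecting discipline-specific records onto a small shared field set and spotting what is missing.

```python
# Hypothetical crosswalks: common field -> field name used in each standard.
# None of these names come from a real metadata standard.
CROSSWALKS = {
    "standard_a": {"title": "title", "creator": "originator", "date": "dataset_date"},
    "standard_b": {"title": "name", "creator": "investigator", "date": "created"},
}


def to_common(record, standard):
    """Project a discipline-specific record onto the shared minimal field set."""
    crosswalk = CROSSWALKS[standard]
    # Fields absent from the source record are kept as None so gaps stay visible.
    return {common: record.get(local) for common, local in crosswalk.items()}


def missing_fields(common_record):
    """List the common fields a record could not fill."""
    return sorted(key for key, value in common_record.items() if value is None)


marine = {"title": "Sea temperatures", "originator": "A. Researcher",
          "dataset_date": "2011-12-02"}
print(to_common(marine, "standard_a"))
print(missing_fields(to_common({"name": "Ice cores"}, "standard_b")))
```

A registry of standards, as proposed in the first action, would in practice be the place where such crosswalk tables are collected and maintained.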

Finally, there was a wrap-up by the different subject-based project groups, which showed strong possibilities for more stable cooperation among them (Biomedical/Healthcare projects even discussed the possibility of building a common wiki). Some cooperation frameworks (googlegroups, mailing lists) might be set up to promote this discipline-based cross-project collaboration. Regarding the metadata strand, it should be noted that metadata was also an issue in the discussions held at most subject-specific workgroups, so the strand could potentially take contributions from all of them.

Friday 2 December 2011

The dawn of a new JISC MRD programme - Day I

After a successful first stage of the JISC Managing Research Data (MRD) Programme (2009-2011), a second phase of JISC MRD was launched yesterday at the NCSL Conference Centre in Nottingham, during a 2-day event that will continue today. The JISC MRD02 Programme includes 27 projects classified in three different strands:

Strand A. Research Data Management Infrastructure: 17 projects, to be completed from Mar to Jul 2013, comprising Institutional Pilot projects, Institutional Embedding and Transition to Service projects, Disciplinary projects for creative arts and archaeology, and a Metadata project,

Strand B. RDM Planning: 8 projects running until Mar 2012, aiming to design and implement data management plans and supporting services for researchers,

Strand C. Enhancing DMPOnline projects: 2 projects, aiming to customize and enhance the DCC DMPOnline Tool to improve its interaction with institutional/disciplinary information systems.

It is worth noting that a number of the RDM projects funded in this 2nd programme stage are building upon previous pilot work (projects carried out during the JISC MRD programme 2007-2011) in order to, for instance, extend and embed data management services across the whole institution.

When describing the research data management programme, Simon Hodson, JISC MRD programme manager, mentioned there will be two further JISC MRD calls as early as Jan 2012, dealing with:

- Research data publications, aiming to build partnerships among involved stakeholders and encouraging data citation and publication,

- RDM Train, aiming to design and implement data management training strategies for specific disciplines and support roles (including librarians), to be performed by linking to professional bodies.

Emphasis will also be placed during this 2nd JISC MRD programme stage on evidence gathering for project benefits and impact. A session devoted to these issues will be held on Dec 2nd, with practical work with both the Benefits Framework Tool and the Value Chain Impacts Tool. Developing metrics for measuring project impact is a specific programme goal in this 2nd implementation stage.

Project blogging

Another JISCMRD02 main objective -and closely related to impact measurement- is promoting the dissemination of project work, and interaction among projects and with the broader community, via blogging. A specific presentation on 'blogging practices to support project work' was delivered for the purpose by Brian Kelly, UKOLN. The presentation highlighted the relevance of publishing project blogposts as an alternative means of expression to writing research papers or code, and engaged the audience in finding shared views regarding the potential benefits blogging may bring to RDM projects, also providing some useful technical advice along the way.

Subsequent discussion focused on pros and cons of blogging as a communication technique (both from regular bloggers' and researchers' viewpoint), as well as on potential advantages of JISCMRD project blog aggregation, with a common RSS feed embedded back into the JISC site.

Parallel sessions and poster-session networking

Two parallel sessions came afterwards, dealing with two principal RDM issues. The first one, on DCC Tools, introduced the Data Asset Framework (DAF), DMPOnline and CARDIO, and was summarized by Paul Stainthorp, U Lincoln, in his JISCMRD02 Day I blogpost.

The 2nd parallel session dealt with UMF Tools and related RDM projects. It featured presentations by John Milner on JANET Brokerage and Andy Powell on the Eduserv Cloud Pilot, in which the strategy for Academic Cloud service implementation was described - based on the "work with the willing" driving line. The Dynamic Purchasing System (DPS) -originally developed for utilities such as water or light- will be re-used as the purchasing framework for cloud-related services. Regarding Eduserv, a 2-month 'introductory tier' will be available (just for institutions) during the gradual implementation of the service (storage being currently single-site, with no backups at this pilot stage, though there are plans for offering tape backup for part of the stored infrastructure).

After an interesting Q&A time, in which backup was suggested to be an absolute requirement for the success of the initiative and there were questions on various Eduserv use mode details (such as the possibility of using departmental orders/purchase order instead of credit cards for academic use), five projects from the UMF strand were briefly presented which are already working either based on a SaaS approach or in the cloud, or both: these were BRISSkit (Jonathan Tedds, U Leicester), DataFlow (David Shotton, U Oxford), Smart Research Framework (or ELB software as a service, Tim Parkinson, U Southampton), VIDaaS and YouShare Projects. Slides for these presentations will shortly be available and will be linked from here.

Finally, the Day I official programme ended with a poster session and networking event, which was a really good opportunity for RDM projects to interact with each other and with 'fellow travellers'. Synergies among projects became quite evident with all of them displayed together on a set of panels. Having their representatives available and willing to discuss each project's aims, challenges and similarities to others offered a very good chance to get the general picture along with the details, as well as to establish inter-project liaisons that went well past closing time.

Sunday 6 November 2011

euroCRIS Membership Meeting – Autumn 2011, Lille, France

On Nov 2-3 the autumn 2011 euroCRIS membership meeting was held at the University of Lille 3 in Lille, France. Attendees from 14 countries (13 European nations plus Canada) met for two days at the Univ-Lille3 Maison de Recherche to learn about the new CERIF 1.3 version (to be released Dec 2011) and the growing number of CERIF-based CRIS implementations in Europe, with a special focus on French ones (see event programme).

Brigitte Joerg, euroCRIS CERIF Task Group Leader and German Research Center for Artificial Intelligence (DFKI), delivered a CERIF v1.3 tutorial at the beginning of the membership meeting. After a general-purpose introduction to CERIF, CRIS Systems and the euroCRIS Group for first-time meeting attendees, the tutorial went into describing new features in the new CERIF 1.3 release (CERIF versions will no longer be named by their year of release as they were so far). Such features include the so-called Infrastructure entities (Facility, Equipment, Service) that have been added to the already existing CERIF Entity Types, namely Base entities (Project, Person, Organisational Unit), Result entities (ResultPublication, ResultPatent, ResultProduct), Second Level entities and Link entities.

Furthermore, the outcomes of the JISC RIM2 MICE Project (Measuring Impact Under CERIF) have also been brought into the CERIF 1.3 release under the Measurement & Indicator section. MICE was one of the RIM2 projects –together with CERIFy, BRUCE and IRIOS- presented last September at the JISC programme workshop in Manchester. MICE finished in July 2011 and aimed to “examine the potential for encoding systematic and structured information on research impact in the context of the CERIF schema. MICE aims to build on previous work on impact by producing a comprehensive set of indicators which will then be mapped both to the CERIF standard and the CERIF4REF schema created by the previous Readiness for REF (R4R) Project”. MICE-inspired CERIF 1.3 updates include the creation of a new CERIF table, namely the impact measure table, as well as a set of impact indicators: categories that include such concepts as improving performance of existing businesses, improved health outcomes and cultural enrichment. euroCRIS was also involved in the RIM2 UKOLN-led CERIFy Project, dealing with measures of esteem, whose results also fed into the new CERIF Measurement & Indicator definition.

Another new feature in this CERIF release is geographic bounding boxes, which will allow displayed information to be restricted to a given geographic area. Geographic bounding boxes are presently defined as squares, thus leaving room for geolocation improvement in future CERIF versions. Finally, a new Linked Open Data (LOD) CERIF Task Group is being planned by euroCRIS.
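As a rough illustration of what restricting information to a geographic area involves, here is a minimal sketch of a bounding-box filter. It does not use CERIF entity or attribute names: the `BoundingBox` class, its four edge fields and the `lon`/`lat` record keys are all hypothetical, and the box is a plain latitude/longitude range that does not handle areas crossing the 180° meridian.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Axis-aligned box in decimal degrees (hypothetical, not CERIF's model)."""
    west: float
    south: float
    east: float
    north: float

    def contains(self, lon, lat):
        # A point is inside when it lies between both pairs of edges.
        return self.west <= lon <= self.east and self.south <= lat <= self.north


def restrict(records, box):
    """Keep only the records whose coordinates fall inside the box."""
    return [r for r in records if box.contains(r["lon"], r["lat"])]


# Approximate coordinates, for illustration only.
records = [
    {"name": "Lille", "lon": 3.06, "lat": 50.63},
    {"name": "Helsinki", "lon": 24.94, "lat": 60.17},
]
northern_france = BoundingBox(west=1.0, south=49.0, east=5.0, north=52.0)
print([r["name"] for r in restrict(records, northern_france)])
```

A box of this kind is cheap to test against but coarse, which is consistent with the remark that there is room for geolocation improvement in later versions.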

As a result of these new features, changes in the CERIF 1.3 release include a whole set of new entities (such as cfMedium as a new Document Type) and new attributes, as well as the removal of some outdated attributes. The new CERIF version described at the tutorial was a preview, with features such as the XML Data Exchange Format Specification and CERIF Formal Semantics still being worked on until version 1.3 is finally released in December.

A euroCRIS Overview Session followed the CERIF Tutorial, in which different members of the euroCRIS Board reported on recent activity. Keith Jeffery highlighted the euroCRIS Rome Declaration on CRIS/IR integration issued earlier this year and mentioned that, while CERIF can generate multiple metadata standards such as DC, MODS, etc, the usual qDC-based OAR metadata model was insufficiently accurate, so some integration should be sought along the lines of the model CRIS-Publications OAR-Data/Software OAR.

Other euroCRIS-related activity includes the EU FP7 OpenAIRE Project moving from qDC to some semi-CERIF standard, as well as the fact that the OpenAIRE+ Project will use CERIF. By definition, CERIF serves a multiple-institution scheme (thus allowing for wider context-related information sharing for purposes such as the Research Excellence Framework assessment in the UK), so there’s also a clear need to operate internationally so as to demonstrate CERIF interoperability capabilities.

Harry Lalieu, euroCRIS Secretary, announced the CRIS2012 Conference, to be held in Prague next June, and the 2012 euroCRIS membership meetings, which will take place in Prague just before the CRIS2012 event and possibly in Spain later next year.

Anne Asserson, Universitetet i Bergen and responsible for euroCRIS strategy, announced dataset management as the next environment CERIF will be moving into (with projects such as the University of Sunderland-led CERIF for Datasets paving the way for such a move).

Speaking on behalf of Ed Simons, Universiteit Nijmegen and euroCRIS website manager, Keith Jeffery informed the audience that a test CRIS is being planned for inclusion at the euroCRIS site, thus allowing for future live demos and functionality analysis.

Within the euroCRIS Task Group reports, Brigitte Joerg mentioned the euroCRIS Board-authored paper “Towards a Sharable Research Vocabulary (SRV) - A Model-driven Approach” having been presented at the Metadata and Semantics Research Conference (MTSR 2011) held last October in Izmir, Turkey. A preliminary meeting with Virtual Open Access Agriculture & Aquaculture Repository (VOA3R) Project was also recently held in Madrid in order to plan the future euroCRIS Linked Open Data (LOD) Task Group.

Nikos Houssos, NDC Athens and Task Group Projects leader, mentioned the running EC FP7 Projects euroCRIS is involved in, such as ENGAGE, dealing with Open Access to Public Sector Information, EuroRIs-Net, one of whose outputs is an online CERIF database of RI stakeholders, and OpenAIRE+. UK/JISC Projects such as CERIFy, CRISPool, BRUCE, IRIOS, MICE or RMAS were also cited as proof of CERIF gradually becoming a common standard for RIM Programme Projects. euroCRIS is actively involved in many of those projects.

Danica Zendulková, CVTISR and CRIS-IR Interoperability Task Group leader, announced upcoming TG work along lines such as defining use cases for CRIS/IR interoperability, defining a model of integration interface (including XML data exchanges and web services), implementing an authority file model with attached persistent ID and promoting cooperation between CRIS/OAR communities.

Finally, David Baker, CASRAI and euroCRIS Architecture Task Group manager, explained the way towards the Reference CRIS implementation. According to implementation plans, a test CRIS should be available at the euroCRIS site in June 2012.

Several sessions –see euroCRIS meeting presentations- followed the euroCRIS Overview, summarizing recent and forthcoming developments in CRIS and CERIF implementation. An interesting discussion was also held, led by Joachim Schöpfel, on teaching CRIS Systems to his Information Science students at Université de Lille and on potential CERIF application to the teaching environment and scholarly activities beyond research.

A particularly relevant presentation –as it described CERIF-based CRIS implementation in the UK, where CERIF standard adoption has been most successful so far– was “CERIF UK landscape” by Rosemary Russell, UKOLN (final report to be formally published later this year by UKOLN-University of Bath). Some figures were mentioned at the presentation: 17 PURE/Atira CERIF-based CRIS were implemented in the UK over the last year, plus 5 Converis/Avedas CRISes and a large number of Symplectic Elements.

The CERIF UK Landscape Project carried out a set of seven interviews among ‘CRIS Project managers’ from different institutions - based at the institutional Research Office (2), Library/Info Services (4) or IT Department (1)- in order to gather their views on the implementation process, CRIS reception by end-users (researchers) and staff, plus experience on CERIF and integration with Institutional Repositories. A summary of the –often not so encouraging– answers is available at the presentation, CERIF being perceived by many as a far too complicated standard whose management would rather be handed over to the CRIS commercial provider. It is a fact however that institutions running a CERIF-based CRIS are in a much better position to deal with the REF requirements.

Wednesday 19 October 2011

MaDAM: A JISC MRD Project for Research Data Management in the Biosciences... on the move

Being in Manchester for the JISC Research Information Management (RIM2) event, Sonex didn’t miss the opportunity it provided for paying a visit to the University of Manchester John Rylands University Library and meeting the JISC MRD MaDAM Project team. The 'MaDAM Pilot data management infrastructure for biomedical researchers at University of Manchester' was funded by the JISC Managing Research Data Programme from Oct 2009 to Jun 2011 and has provided an inspiring example of how to start building an institutional research data management infrastructure almost from scratch.

In order to start developing this RDM infrastructure (see the Project Final Report for details), MaDAM focused on a set of research groups from the biomedical sciences strand aiming to learn about the ways they dealt with data management and to provide them -with their own close involvement- with tools to improve and standardise such practices. Selected research groups -Electron and Standard Microscopy group and Magnetic Resonance Imaging (MRI) Neuropsychiatry Unit- were chosen due to their common need to deal with large images as their main source of research data.

The project's focus on a rather narrow research scope was one of the keys to its success, since it made it possible to define common ways of dealing with the information, eg at metadata level. The MaDAM planning included further extension of the RDM strategy to other research groups within the UoM, based on the lessons learnt from its application to the few initially selected groups. The MiSS Project (MaDAM into Sustainable Service), funded by the JISC MRD Programme 2011-2013, will deal with extending and widening the RDM strategy to the whole of UoM research over the coming years.

An Oracle APEX-based research data management application was developed by MaDAM for the UoM research groups concerned -later to be revamped in order to adapt it to the regular software standards applied at UoM. Frequent meetings were held with researchers during the application development so their feedback could be collected to ensure it would meet their needs. Storage needs per researcher per year were estimated (at around 500 GB), a metadata standard for specific data description was devised and stored in the RDM application, and work was carried out with interoperability issues in mind, both with the University CRIS, in order to automatically populate Grant and Project information attached to datasets, and with the UoM Fedora-based eScholar IR, where final-version datasets would be transferred via SWORD for dissemination, sharing and re-use.

During the MaDAM Project several conceptual needs regarding the implementation of a solid RDM infrastructure across the UoM (and beyond) were identified -and later included in the Project Final Report- the main two of which are the following:

- Some means of academic recognition of data-related work by researchers should be put in place in order to promote their involvement in RDM schemas and the adoption of common practices,

- A research data management policy should be adopted by the University of Manchester similar to the one issued at U Edinburgh so that some guidelines are established for providing support to researcher RDM tasks.

MaDAM's gradual roll-out to other UoM research groups will face a set of challenges, research data being so discipline-specific. However, plans for such an extension and for ensuring the required institutional support were designed during MaDAM development -which saw a number of additional UoM research groups show interest in taking part in the pilot project- and extension work will start soon.

Friday 14 October 2011

CERIFying Research Information Systems... and Research Data

A couple of weeks ago Sonex attended the JISC Research Information Management (RIM2) event at MCC Manchester. It was a very good opportunity to review the four JISC-funded projects (BRUCE at Brunel, IRIOS at Sunderland, CERIFy at UKOLN and MICE at KCL) dealing with CERIF implementation for research information management purposes. A report on the event should be available shortly, along with the slides presented there.

During this one-day meeting the CERIF for Datasets (C4D) Project was mentioned as an extension of the IRIOS Project to dataset management at the University of Sunderland. As stated in the project presentation, C4D aims to 'CERIFy' existing research dataset metadata conventions, and hence provide access to research data in an environment which also holds information on research projects and research outputs. C4D will also explore the commonality of research dataset metadata, and how much can be represented in CERIF.

Saturday 17 September 2011

Progress on Researcher ID initiatives: IRISC 2011 Helsinki

The problem with names...

Prof. Carlos Martínez-Alonso is a renowned Spanish senior biochemist. He was actually President of the Spanish National Research Council (CSIC) when the Berlin Declaration was signed by the institution in January 2006. Prof. Martínez-Alonso has published hundreds of papers in high impact factor journals. However, when retrieving a complete list of his publications from PubMed database, you find out it is not possible unless several parallel author queries are carried out: there is a Martinez-A C entry under which most of his publications get listed [222]. But then there's also Martinez-Alonso C [21] and even Alonso CM [1].

It might be argued it's all about funny Spanish names with two surnames in them. That's a problem alright. Not just for Spanish names though: it's quite the same for Portuguese/Brazilian authors as well. Not to mention transliteration of Asian author names (see "Which Wei Wang?" Phys Rev 2007 editorial). PubMed is presently running its Author ID project in order to tackle this problem, which is by no means exclusive to them: around 2/3 of the over 6 million authors in MEDLINE share a last name and first initial with at least one other author, and an ambiguous name refers to 8 persons on average (Torvik and Smalheiser, "Author name disambiguation in MEDLINE").
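Both sides of the problem can be seen in a minimal sketch of the naive matching key (last name plus first initial) that the MEDLINE figures refer to. The helper names below are hypothetical and this is not how PubMed or any identifier scheme works internally; the sample names are the PubMed variants quoted above plus two made-up authors sharing a surname and initial.

```python
import unicodedata
from collections import defaultdict


def strip_accents(text):
    """Remove diacritics so that 'Martínez' and 'Martinez' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def name_key(surname, given):
    """Naive matching key: accent-free lowercase surname plus first initial."""
    return (strip_accents(surname).lower(), given[:1].upper())


def group_by_key(authors):
    """Group (surname, given) pairs that the naive key cannot tell apart."""
    groups = defaultdict(list)
    for surname, given in authors:
        groups[name_key(surname, given)].append((surname, given))
    return dict(groups)


# One person, three keys: the key splits a single author.
variants = [("Martinez-A", "C"), ("Martinez-Alonso", "C"), ("Alonso", "CM")]
print(len(group_by_key(variants)))

# Two made-up people, one key: the key merges different authors.
namesakes = [("Wang", "Wei"), ("Wang", "Wenjie")]
print(len(group_by_key(namesakes)))
```

No amount of string normalisation fixes both failure modes at once, which is the case for a persistent identifier assigned to the person rather than derived from the name.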

Name disambiguation and proper attribution is a well-known problem in the scholarly publishing ecosystem. There have been and there are lots of initiatives trying to tackle this complex issue at subject, institutional or even national level - with remarkable success in the case of the Dutch Digital Author Identifier (DAI).

However, this is not an issue to be tackled at national or subject level, but globally. Commercial stakeholders such as Thomson Reuters or Elsevier-Scopus are thus in a privileged position to implement some international unique author identification schema. From a knowledge discovery viewpoint there are however some problems with this commercial-stakeholder approach: the ResearcherID, Thomson Reuters' author identifier, will provide seamless integration with ISI Web of Knowledge and show all author publications registered in that database, but will otherwise leave out most of the research output.

Some joint effort between public institutions and private stakeholders (notably publishers) must therefore be attempted to unify the multiple author identification standards and devise a single, comprehensive one at a global level. And that's where ORCID comes in.

... and strategies to tackle it: IRISC 2011 workshop

The Open Researcher & Contributor ID (ORCID) initiative started in Dec 2009 as a non-profit organisation. Currently over 240 participants have joined the project to develop a single researcher identifier which is not limited to discipline, institution or geographical area. Many other projects are working on this issue at the same time (such as the abovementioned discipline-based PubMed Author ID and Cornell University's VIVO initiative, which started as an institutional project and has since grown to national scale).

ORCID and VIVO were two of the main topics of the IRISC 2011 Workshop on Identity in Research Infrastructure and Scientific Communication held this week (Sep 12-13) in Helsinki - see the event programme with attached presentations. Gudmundur "Mummi" Thorisson, Research Associate at University of Leicester and member of ORCID Technical Working Group, was IRISC 2011 main organizer.

There were two major IRISC 2011 strands: identity regarding knowledge discovery and identity for security & access control (focusing mainly on identity federation). A third big cross-issue along the Helsinki event was research data management, from three different perspectives:

i) dealing with a rapidly increasing amount of biomedical research data (Andrew Lyall, EMBL, ELIXIR Project)

ii) dealing with clinical research sensitive data (see Tony Brookes GEN2PHEN Project presentation)

iii) benefits the ORCID implementation might bring to research data attribution and management (mentioned in most ORCID-related presentations and discussions along the workshop)

There were several presentations dealing both with ORCID and closely resembling VIVO initiatives. Martin Fenner, Hannover Medical School and member of ORCID Board of Directors announced the ORCID registration service will start operating in spring 2012. ORCID will be open: researchers will be able to manage & maintain their profiles, filed data will be openly available, ORCID-related software will be released as open source, and researchers will control their privacy settings (with a chance too to share with particular members). Finally, for ORCID identity definition purposes, self-claim as well as external claiming sources will be used.

Brian Lowe, Cornell University, presented the NIH-funded, institutionally managed VIVO initiative, which is already running. VIVO aims for a more comprehensive approach than ORCID, based on an extensible semantic model. However, links have already been established between both initiatives and ORCID is hoping to build upon VIVO's success in the US.

Breakout sessions were held on IRISC Day 2 on the workshop's two main strands: "Unique identifiers and the Digital Scholar" (led by Cameron Neylon and Jason Priem) and "What do researchers need from the authentication and authorisation infrastructure (AAI)?" (chaired by Michael Linden, CSC). Breakout session #1 was devoted to discussing potential tools and services that ORCID could provide to researchers in the short term (6 months from adoption). Several groups were set up for the purpose, and the proposed ideas were later discussed and voted on in order to select three main future lines of work for ORCID. The proposed use cases were the following (selected ones are marked with an arrow):

-> data submission to repositories (multiple task attribution)

service to enable attribution or comment

pre-populate ORCID data

-> manuscript/grant tracking system

ORCID app gallery

-> automatic CV maintenance (potentially including data citations in CVs)

connecting different author research & social network profiles

Selected ORCID use cases were later introduced by Cameron Neylon in his talk 'ORCID and researchers' at the second annual ORCID Outreach Meeting held at CERN on Sep 16th, 2011.

Monday 29 August 2011

Research data management in crystallography at the XXII IUCr Congress

On Aug 29th a session on research data management will be held at the XXII Congress of the International Union of Crystallography (IUCr2011). The session will feature talks by Brian McMahon (IUCr), Brian Matthews (I2S2 Project), Peter Murray-Rust (CrystalEye), John Westbrook (wwPDB) and Nick Spadaccini (DDLm). Peter Murray-Rust's talk will be on Open Crystallography.

Friday 26 August 2011

STM research data management and the Quixote Project

A one-day seminar was held yesterday Thu Aug 25th at the Zaragoza Scientific Center for Advanced Modeling (ZCAM) on research data management and the Quixote Project for data management in Computational Chemistry. The session, entitled “Research data management: The experience of the Quixote project for Quantum Chemistry data. Can it be extended into a collection of research data management repositories?”, was attended by a rather diverse group of researchers (both computational chemists and from other disciplines) and repository managers, aiming to learn about research data management initiatives and specifically about the progress of the Quixote Project, in which two researchers from the University of Zaragoza and the CSIC Institute of Physical Chemistry "Rocasolano" are involved.

The Quixote Project (see paper "The Quixote project: Collaborative and Open Quantum Chemistry data management in the Internet age", in press with the J Chem Inf) is developing the infrastructure required to convert output from a number of different molecular quantum chemistry (QC) packages -such as NWChem or Gaussian- to a common semantically rich, machine-readable format and to build repositories of QC data results.

The session started with an introduction to "STM Research data management initiatives in Spain and abroad" delivered by SONEX member Pablo de Castro, in which different national approaches to RDM were presented based mainly on the information collected at the JISC MRD Programme International Workshop held last March in Birmingham.

The different approaches to data management taken by the JISC and the SURF Foundation were discussed at Q&A time: for the JISC, datasets are assets per se, regardless of whether they are attached to a research paper as supplementary material, whereas the 'Enhanced publication' approach from the SURF Foundation in the Netherlands regards datasets mainly as digital objects connected to research publications. Some emphasis was placed on the fact that the upcoming OpenAIREPlus European project shares the SURF approach.

Two presentations on the Quixote Project followed, "From Databases in QC 2010, ZCAM, Sep 2010 onwards: a brief history of Quixote" by Jorge Estrada and "The Quixote Project: a pioneering work in managing Computational Chemistry research data" by Pablo Echenique. Both Quixote project members explained the results, the challenges and the cooperation opportunities of this non-specifically-funded RDM project, engaging in a fruitful dialogue with the attending researchers and repository managers on how the QC data assets could be best managed.

Finally Peter Murray-Rust closed the morning interventions with some reflections on the subject "Entering a new era in data management" - see his blogpost for a summary of his ideas.

In the afternoon there were joint debates on how to improve implementation of research data management initiatives. Researcher motivation for dataset sharing was extensively debated: this motivation should ideally not just arise from a given funding agency actually requiring those data to be made available, but from the sheer advantages (as summarized by Peter Murray-Rust) that doing so would bring to the research practice and communication ("improving methodology").

An independent debate session was held for discussing how to start developing some kind of research data management infrastructure in those countries where work in this area is presently beginning. These are some recommendations that were put together by the participants in the debate:

- Some workgroup of (not just library-based) IT professionals should be put together for analysing the current infrastructure and the opportunities for launching new initiatives upon potentially reusable pre-existing ones,

- It would be advisable to analyze the researcher behaviour and needs in terms of storing their datasets into international platforms for data sharing (in case they are available for their specific discipline),

- It would be interesting to examine the motivation for data sharing from research groups in different research areas, so that initial efforts to develop data management infrastructures can start working with those areas more willing to share their data (Earth Sciences recurrently showing up when analysing the international perspective),

- Pioneering initiatives for providing services to STM researchers regarding data handling and storing from given Institutional Repositories (such as eSpacio UNED and Digital.CSIC) should be highlighted as a role model to be spread,

- The OpenAIREPlus/SURF Enhanced papers approach could be a good starting point for Institutional Repositories to work on, by finding out which of their presently filed papers have supplementary data attached at the journal site and trying to manage those independently,

- During the session talks with researchers, a need was detected for a dataset management system at research centres for basic internal organisation purposes. Datasets filed in this internal storage system may or may not be intended for publication,

- Production and publication of potentially citable datasets should be acknowledged as a relevant scientific contribution for research assessment purposes,

- There are big differences in needs, procedures and required infrastructure regarding data management between Big Science and long-tail science (the greater part actually being groups of three researchers in a lab with specific needs of their own),

- The Library is a potential supplier of know-how on data processing and storing for researchers, and that role should be promoted within the institutions,

- The Spanish e-Research National Network, mostly dealing with Grid and supercomputing initiatives, might be a good workgroup infrastructure for pioneering data management initiatives in Spain,

- There are real collaboration opportunities between the Quixote Project and the research information management infrastructure at the University of Zaragoza (two IRs being currently available, Zaguan at the University and Digital.CSIC at the Spanish National Research Council, CSIC),

- Research staff (mainly PhD students) getting involved in the management and operation of the dataset information management systems (such as the Chempound data repository at the University of Cambridge) seems a prerequisite for the success of the data management initiatives,

- Due to the specific data features for various research areas, the incipient data management infrastructure available is more developed for the Social Sciences and Humanities than for STM research areas.

Saturday 20 August 2011

Repositories and CRIS: Working Smartly Together

Due to recent involvement in other OA repository-related activities at the University of Khartoum, reports at this blog on recent events such as the 'Repositories and CRIS: Working Smartly Together' workshop organised by RSP last Jul 19th in Nottingham and the 4th edition of the Repository Fringe in Edinburgh were slightly delayed. Good news about it is that interesting reports on these events have been published in the meantime (see the RSP event review by Gareth J. Johnson at UKCoRR blog). This will allow Sonex to take a different approach to the reporting, making it more of a reflection than of a description, as well as covering the conference followup.

One of the subjects discussed during the RePosit project session within the Conference at EMCC was what mailing list or discussion group should replace the reposit@googlegroups.com forum for discussing IR and CRIS-related issues once the RePosit project comes to an end. Several options were considered, from using already existing lists such as UKCoRR's or ARMA's, to creating a new Super-CRIS list at JISC mail such as cris-super@jiscmail.ac.uk. Steps are being taken after the workshop to make this new list available.

The REF is working as a very strong driver towards CRIS implementation (with the CERIF format being extensively considered as a potential standard, see Marc Cox's presentation). A good number of HEIs now operate a CRIS as a result (either commercial, built in-house or an extension of their EPrints repository). That is the good news. The not so good news may be that, because CRIS systems offer an enhanced collection of features, RIM infrastructure managers are starting to wonder whether an Open Access repository (usually managed by the Library) isn't becoming a somewhat redundant piece of software, with most of its functionalities being increasingly covered by the CRIS (managed at the Research Offices). Repository phase-out is thus beginning to be discussed at some institutions for integration and optimization purposes. However, as Janet Aucock (University of St. Andrews) writes in the reposit@googlegroups list, even if the degree of overlap between repositories and CRIS systems may be large and growing, there are still features a CRIS will not be able to deliver:

"(...) Another point is to do your homework really well and make absolutely sure that the CRIs can deliver everything that a repository can do. Can it provide established permanent identifiers for items? Can it handle embargoes effectively? What about stats? Does the discovery interface in the portal display all the metadata that you need with regard to open access full text eg rights statements etc. These are small details which we take for granted but are not always embedded into the CRIS. CRIS software is still evolving too, and perhaps not all the functionality necessary is there yet. Another aspect of this is the question of the interfaces for users and discovery. Is the CRIS successfully harvested or crawled by search engines. Is it ranked appropriately. Can it expose metadata appropriately to other services where required? Can it isolate metadata with full text attached/open access full text attached and allow that set to be harvested and reused? We know that our own CRIS supplier is still working on adding all the "repository" functionality that they think is needed for their product. But at the moment I don't know the fine detail of this".

Besides R4R/CERIF4REF Project at KCL mentioned by Marc Cox, other projects also dealing with CERIF implementation regarding CRISes were mentioned such as MICE for Measuring Impact under CERIF, or the BRUCE Project (Brunel Research Under a CERIF Environment) that was presented at the 2011 euroCRIS meeting in Bologna last May (see Sonex post on the two recent euroCRIS meetings in Italy).

Another interesting outcome of this RSP event was the opportunity to learn from local SHERPA RoMEO team about the RoMEO API new v2.8 version and the release of the SHERPA RoMEO Publisher's Policy Tool, that will allow publishers to directly define their RoMEO policies via an embedded portal in SHERPA (actually presented next day, Jul 20th, at the 'RoMEO for Publishers' event in London).

Finally, a poster was featured in the event poster section called “SICA: A CRIS with an embedded Repository working for the innovation in Andalusia Region (Spain)”. With this integrated system for recording scientific production of the researchers belonging to nine universities, research organizations, technology centres and other scientific institutions of the Andalusia region in Spain, the National & Regional CRIS/IR integration initiatives (as recorded by Sonex in its May'2010 post) keep growing. This particular CRIS initiative is being developed within the European SISOB Project on -yet again- how to measure the impact of science in society.

Besides this -neither thorough nor systematically updated- Sonex list of National & Regional CRIS/IR integration initiatives, a comprehensive list of 'CRIS + Repositories in the UK' is being put together as a Conference followup. When complete (it's open for any missing one to be filled in) the list will join the RSP Wiki where Institutional Repositories in the UK are already listed, so as to provide a clear picture of existing infrastructure.

Sunday 17 July 2011

KULTURising research repositories

"...I can only add that research for art, craft and design needs a great deal of further research. Once we get used to the idea that we don't need to be scared of 'research' - or in some way protected from it - the debate can really begin."
(Christopher Frayling, RCA Rector (1996-2010), from: "Research in Art and Design" (Royal College of Art Research Papers, Vol 1, No 1, 1993/4). Royal College of Art, London).

At the Jul 6th meeting at JISC Brettenham House some planning was done as well for the Sonex extension besides Swordv2's. In the framework of this project extension, Sonex is expected inter alia to further support the JISC Deposit Projects and continue to gather international deposit use cases, as well as to provide some recommendations on how to improve deposit.

As part of this further involvement with JISC Deposit Projects, Sonex attended the Kultivate Project Conference on Jul 15th at the Royal Institute of British Architects (RIBA).

Based at the Visual Arts Data Service (VADS), a research centre at the University for the Creative Arts, and funded by the JISC from late November 2010 to the end of July 2011 within the JISC Deposit strand, the Kultivate Project aims to "share and support the application of best practice in the development of institutional repositories that are appropriate to the specific needs and behaviours of creative and visual arts researchers". Kultivate builds upon the knowledge and experience of the Kultur II group, which grew out of the JISC funded Kultur project (2007-2009). The Group currently consists of over forty institutions and projects and is led by the VADS.

Specific goals of the Kultivate project are:

- to increase the rate of arts research deposit,
- to enhance the user experience for researchers, and
- to develop and sustain a sector-wide community of shared best practice in arts research repositories.

There are significant differences between Kultivate and the rest of the JISCdepo projects (RePosit, DURA and DepositMO): while the other three deal specifically with semi-automation of widely-recognised content ingest into repositories (mainly by fostering platform interoperability), Kultivate seeks to extend the coverage of institutional repositories to the creative arts environment, which is rather different in nature from such well-accepted research and which hasn't been specifically addressed so far as scholarly output. In this regard, Kultivate can be seen both as something of an outlier project and as the most challenging of the four.

After eight months of hard work, the Kultivate Project Conference put together a model set of talks and presentations (see programme and updated presentations) to introduce the project outcomes.

Several talks made introductory reflections on what creative arts research should be - with its specific peculiarities. The fact that the output from activities in the creative arts is or is not called research (artists themselves sound a bit surprised sometimes on being called researchers) doesn't seem that relevant anyway - main thing actually being it's scholarly output from many HEIs and Arts Schools, and as such it should be subject to standard deposit into institutional repositories.

However, it is often hard to persuade artists to have their work filed into repositories ("the repo doesn't fit the needs of creative artists" being a frequent reason given for not taking part in the project). In this regard, advocacy is particularly critical for institutional projects being carried out in the area - they are breaking through in a discipline where no such thing as PubMed, Chemical Abstracts or arXiv could possibly exist (so far).

See examples of effective advocacy under the Kultivate project umbrella at Goldsmiths Research Online and UAL Research Online, plus Kultivate's own Advocacy Toolkit, one of the project's main outputs.

Another relevant advance Kultivate is promoting is the setting of metadata standards for the description of creative artworks (something that incidentally brings the project closer to the data management strand rather than to the deposit one, making it quite a heterodox one). See for instance 'The listening room' item at UAL Research Online with its four-tabbed description including metadata as well as images and videos (thus effectively delivering an answer to frequent complaints by artists on work documentation: "I did a performance, not a video" or "Fine, but where am I?").

Performance Art Data Structure (PADS), for which the unit subject to description is the 'work' not the 'digital object', is yet another solution for complex description of creative arts output, developed by the University of Bristol within the JISC-funded CAiRO Project for Complex Archive Ingest for Repository Objects (see the PADS example record for the 'Becoming snail' performance by Paul Hurley at JISC Digital Media).
PADS is also involved in the Europeana attempt to standardise performance metadata across the EU.

Finally, a good (and growing) number of EPrints-based implementations of the Kultur enhancements for designing creative arts output-focussed institutional repositories were presented at the project conference (incidentally raising questions by DSpace-based IR managers on when something similar will be developed for DuraSpace). Kultivate has also provided (in cooperation with the University of Southampton team) a set of technical enhancements to the EPrints platform, among them enhancements to the MePrints application and the IRStats package.

Implementation of those enhancements by different institutions (either arts-focussed or general-purpose ones with Arts Departments within them) is giving way to a wave of repository KULTURisation (ie being adapted to deal with creative arts output) across the UK that might well spread beyond that once working standards are consolidated. In the meantime the VADS-led eNova project is already building upon the outputs of both Kultivate and Kultur projects.

Monday 11 July 2011

Sword-Sonex project extension

"Data deposit nowadays... is mainly based upon submission by email... and remains labour-intensive"
(Simon Hodson, JISCMRD Programme manager, on present data deposit workflows)

Representatives of the JISC-funded Sword and Sonex projects met Balviar Notay and Simon Hodson (JISC) on July 6th at Brettenham House, London for further dealing with Sword v2 extension to automated transfer of research data (see reference to last meeting on the issue on Nov 20th).

Now that the first round of JISCMRD Phase I projects is over and final reports have been published, the Sword-Sonex workteam is already working to put together a data transfer use case document where different project solutions are listed, with their advantages and shortcomings, so that some analysis can be carried out on how Sword might aid the automation of dataset transfer into repositories (or similar target resources for research data). The team will liaise with several JISCMRD projects in order to find out their specific approach to the data transfer issue. The schedule for the extended Sword project (coordinated by Paul Walk, UKOLN) is as follows:

WP1: Identify key projects & individuals who have relevant information and skills regarding datasets [Jul 6-13]

WP2: Document the dataset use cases in collaboration with Sonex [Jul 18-end Aug]

WP3: Interpret the data set use cases as processes carried out with Sword [Sep 5-24]

WP4: Carry out gap analysis on dataset use cases on Sword and recommend future work, and produce a web resource for any new or existing JISC projects (such as those in JISCMRD2 Programme) to refer to, which will provide all the relevant information regarding dataset deposit [Sep 27-Oct 21]

WP5: Identify key Sword clients and potential client environments, accept and evaluate proposals, issue development contracts [Jul 6-Aug 15]

WP6: Development of 1, 2 or 3 client environments [Sep 5-end Oct]

WP7: Project management and administration [Jul 6-end Oct]

Sunday 10 July 2011

CRIS and OAR 2011: "Integrating research information"

Two important euroCRIS events were held in Italy at the end of May: the 2nd workshop on CRIS and OAR (Rome, May 23-24) and the euroCRIS membership meeting 2011 (Bologna, May 26-27). Following last year's euroCRIS meetings in Aalborg for euroCRIS 2010 and Rome for the 1st workshop on CRIS and OAR (link to Sonex posts), these two 2011 workshops offered the international research information community the opportunity to debate the current state of development of CRIS systems and their integration with Open Access Repositories to best serve institutional needs in different countries.

Bernard Rentier was a keynote speaker at the meeting at CNR in Rome (presentations available here), where he presented the 'à la liégeoise' mandate he has promoted at the Université de Liège for populating the ORBI institutional repository (currently holding nearly 65,000 items). The ORBI-generated report is actually the only official document for research evaluation at ULg.

Keith Jeffery (STFC) -the embedded milestones, roadmap and workshop purpose slides are taken from his presentation- introduced the 2011 Rome workshop by describing the progress made in the CERIF implementation since the last euroCRIS meeting, the 2010 and 2011 milestones (CERIF spreading to several continents, adoption of Avedas Converis at ERC and the ENGAGE Project on Open Government Data and the JISC 'Measuring Impact under CERIF' (MICE) Project in the UK), the CERIF roadmap for 2011 and 2012-12 and the purpose of the CNR workshop.

After lots of interesting presentations on the workshop day 1 (with a special mention to CRIS/OAR integration examples in the UK by Simon Kerridge, U Sunderland and ARMA), day 2 was devoted to joint work by workshop attendees on updating the white paper on CRIS and OAR integration. This work resulted in the recently published (July 8th) Rome Declaration on CRIS and OAR, consensus on which was reached after extensive debate via mail.

A few days after the CNR workshop, the 2011 euroCRIS Spring Meeting was held at CINECA, Bologna (watch meeting presentation by Nicola Bertazzoni), with special emphasis on the topic 'CRIS in a University IT environment', for which Italian (Politecnico di Torino Research Information System) and British (BRUCE Project - Brunel Research Under a CERIF Environment) examples were presented.

Gettin' on...

After quite a long, not totally intended silence - schedules get so hectic every now and then- it is the purpose of the Sonex workgroup to update the project blog by briefly reporting on recently held workshops we have attended since last post. These have been, inter alia, the 2nd euroCRIS/CNR-IRPPS workshop on CRIS and OAR (Rome, May 23-24), euroCRIS membership meeting 2011 (Bologna, May 26-27), CERN Workshop on Innovations in Scholarly Communication (OAI7, Geneva, June 22-24) and LIBER 40th Annual Conference 2011 (Barcelona, Jun 29-Jul 2).

Sunday 15 May 2011

A first analysis of data management

As previously mentioned in this blog, the Sonex workgroup is now trying to extend its use case scenario analysis on 'Deposit opportunities into repositories' to the realm of research data. A first meeting held at EDINA on Mar 30th served the purpose of drawing a general picture of the data management landscape.

Stress should be put on the fact that the way of handling SSH and STM data may substantially differ. Given the strong IASSIST-attachment of some Sonex members, the workgroup initial approach to data management may therefore be a bit biased towards procedures in the area of Social Science and Humanities. However, attention will be paid as well to specific ways of dealing with STM datasets as the analysis gets fine-tuned.

Moving along the same lines as we did for research articles, we first try to tackle the ACTIONS scope. Data deposit is certainly an issue, but there's more to data-related processes than just deposit. It's also about Access to data and also about Data Notification/Register.

Next we get on to the WHAT and the WHO. Answer to WHAT? is a data set. Previous analysis by Peter Burnhill shows -at least- three different types of research data (see image below).

Dealing mainly with the data file itself, this data type classification is somewhat narrow for the general picture of data management, so Sonex would rather set a new and more generic data classification for answering the question WHAT is there to deposit:

Metadata record

Codebook or user guide, where all necessary information is provided to allow for data re-use*

Raw data or dataset file(s)

* See a DCMI-based description at: Inter-university Consortium for Political and Social Research (ICPSR). (2009). Guide to Social Science Data Preparation and Archiving: Best Practice Throughout the Data Life Cycle (4th ed.). Ann Arbor, MI. Section 'Important documentation elements', p. 22

These three elements should ideally be supplied as a single package.

As to the question of WHO performs each data-related operation (Notification-Deposit-Grant access), a handful of running projects within the JISC MRD (phase I) programme should serve to test the different use cases resulting from a double-entry 'Action/What' table as featured below.
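A double-entry 'Action/What' table of this kind can be sketched in a few lines of Python. This is only an illustrative sketch built from the three actions and the three package elements named in this post; it is not the Sonex table itself, and the function names and the example actor filled in at the end are hypothetical.

```python
from itertools import product

# The three data-related actions and the three package elements named in this post.
ACTIONS = ("notify/register", "deposit", "grant access")
ITEMS = ("metadata record", "codebook or user guide", "raw data or dataset file(s)")


def build_use_case_table(actions=ACTIONS, items=ITEMS):
    """Return one empty cell per (action, item) pair; the value is WHO performs it."""
    return {(action, item): None for action, item in product(actions, items)}


def unanswered_cells(table):
    """List the (action, item) use cases for which no actor has been recorded yet."""
    return [cell for cell, who in table.items() if who is None]


table = build_use_case_table()

# Hypothetical entry, for illustration only: a survey answer from some project.
table[("deposit", "metadata record")] = "researcher"

for (action, item), who in table.items():
    print(f"{action:16} | {item:28} | {who or '?'}")
```

Each empty cell is a use case still to be tested against a running project, which is the kind of information the survey mentioned next would be expected to fill in.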

Next step as we proceed to further development of this preliminary analysis should be a survey for gathering information on procedures for data handling as carried out in specific JISC MRD projects.

Wednesday 27 April 2011

National initiatives for promoting data management strategies: an overview

- "Hello, I want to deposit my data"
- "Sir, this is a library!"
- "Sorry" -he whispers- "I want to deposit my data".
(as told by Brian Hole, British Library, along his presentation of the DRYAD UK initiative)

The main objective of the JISC MRD International Workshop held last month was to review progress achieved by the JISC Managing Research Data Programme and to discuss this in the context of broader international developments.

As stated in the workshop programme overview, "this dimension reflects key partnerships which JISC, the JISCMRD Programme and the DCC has been building through the IDCC Conference, the Knowledge Exchange and other initiatives. They include the Australian National Data Service, the NSF funded DataNet Projects, institutions in the US and Australia, the DFG, SURF, DANS etc".

Within the broader context, besides a couple of preliminary talks on the European Union approach to (and future funding of) data management initiatives -by John Wood, on the EU 'Riding the wave' report, and by Carlos Morais-Pires on the Digital Agenda for Europe- the workshop featured a specific session on "National and international infrastructure initiatives" whose first panel was called "Approaches and strategies in the UK, US, and Germany". Australian and Dutch national or specific approaches were also discussed, either at this session or later during the event.

Besides the national initiatives featured in this and further sessions along the meeting -it was reassuring to see such a broad scope of strategies or already running projects taking place at the same time in so many different countries- there are also additional, sometimes preliminary initiatives for promoting data management policies at national or institutional level in other countries such as Finland, Portugal, France, Poland or South Africa.

As new initiatives for research data management keep steadily coming up, this session was an opportunity to get an informal update on DCC's report 'Comparative Study of International Approaches to Enabling the Sharing of Research Data' - see its summary and main findings here as of Nov 2008.

Digital Curation Centre - UK
Kevin Ashley, Digital Curation Centre (DCC), described the present picture of data management in the UK as "a new context", where Universities are increasingly willing to take responsibility for data management (especially in areas not covered by Data Centres).
Now that UK funder and NSF rules for Data Management Planning are being implemented, this in-advance planning is becoming very important for funders, researchers, institutions, collaborators and reusers. DCC current tasks include integrating different Data Discovery Services plus building institutional capacities: skills, policies, etc. Besides that, DCC is providing the new DMP Online service aimed at producing and maintaining Data Management Plans.
The good news is that, despite varying degrees of involvement, institutions in the UK have accepted their role in RDM.

NSF-funded DataNet Projects - US
A summary of present state of research data management in the US was provided by presentations of the DataONE and DataConservancy initiatives, resp. delivered by William Michener (University Libraries at U New Mexico) and Sayeed Choudhury (Johns Hopkins University).

After stating that "researchers are presently using 90% of their time managing data instead of interpreting them", W. Michener presented the Data Observation Network for Earth (DataONE) initiative (a live DataONE presentation at U of Tennessee is available). This NSF-supported initiative aims to ensure preservation of and access to multi-scale, multi-discipline, and multi-national science data. DataONE Coordinating Nodes around the world will help achieve the international collaboration needed for solving the grand science and data challenges, particularly with regard to education.

The DataConservancy initiative aims to research, design, implement, deploy, and sustain data curation infrastructure for cross-disciplinary discovery with an emphasis on observational data. S. Choudhury's presentation stressed the need for data preservation as a necessary condition for data reuse and introduced the recent connection of data and publications through arXiv.org as one of the pilot projects that build upon the Project APIs.

DFG - Germany
New DFG information infrastructure projects in Germany were presented by Dr Stefan Winkler-Nees, who mentioned both the Jan 2009 DFG Recommendations for Secure Storage and Availability of Digital Primary Research Data, as a base report for promoting standardized work in the data management area, and the DFG's running call for proposals "Information infrastructures for research data". Selected projects at this call are due to be shortly announced and will start in May/Jun'2011. Finally, in a common line of thought with other initiatives, Dr. Winkler-Nees mentioned DFG is aiming for teaching and qualification of both researchers and data curators.

SURF Foundation & DANS - The Netherlands
Later on in the workshop, John Doove presented the SURF Enhanced Publications initiative within the SURFshare programme 2007-2011. Six new projects funded during 2011 by the SURF Foundation will allow researchers from a variety of disciplines to share datasets, illustrations, audio files, and musical scores with fellow researchers in the context of Enhanced Publications (programme video available on YouTube). There were already two previous grant rounds for Enhanced Publications. The six running projects, whose results are due in May 2011, take place within five disciplines: Economics (Open Data and Publications, Tilburg University), Linguistics (Lenguas de Bolivia, Radboud University Nijmegen, and Enhanced NIAS Publications, KNAW-Royal Netherlands Academy of Arts and Sciences), Musicology (The Other Josquin, University Utrecht), Communication sciences (Enhancing Scholarly Publishing in the Humanities and Social Sciences, KNAW) and Geosciences (VPcross, KNAW).

The Dutch strategy for increasing research data available online was completed with the presentation "Sustainable and Trusted Data Management" delivered by Laurent Sesink (DANS-Data Archiving and Networked Services). DANS, est. 2005, deals with storage and continuous accessibility of research data in the social sciences and humanities and promotes the 'Data Seal of Approval' for certification of data repositories, guaranteeing via a series of required criteria a qualitatively high and reliable way of managing research data.

Australian National Data Service (ANDS) - Australia
Finally, Andrew Treloar, Director of Technology, Australian National Data Service (ANDS), supplied a comprehensive perspective from a national infrastructure provider and in a way summarized previous talks by saying that, despite differences, there are common themes emerging in national approaches to data management, as there are things only they can do. In his plenary presentation "Data: Its origins in the past, what the problems are in the present, and how national responses can help fix the future" he mentioned for instance that Hubble Space Telescope-related publication statistics show that twice as much research is being done thanks to data reuse. Efficiency, validation, integrity of scholarly records, value for money and self-interest were listed as (non-altruistic) arguments for data reuse.

Having the chance to attend this series of brilliant presentations and checking out how policies for opening access to research data keep spreading over institutions and countries were undoubtedly part of the Birmingham workshop highlights. Next opportunity for keeping up with it all will be next November at the Knowledge Exchange Workshop on Research Data Management in Bonn, Germany.

Monday 25 April 2011

Could external cooperation improve collection of specific JISC MRD project-related information?

In forthcoming days SONEX will be publishing some posts on the JISC MRD Programme International Workshop held last March 28-29th at Aston Business School Conference Centre, Birmingham. Certain aspects debated at this comprehensive meeting were very useful for establishing an approach for dealing with research data management from a SONEX viewpoint, as debated in a SONEX meeting at EDINA on Mar 30th whose outcome will also be shortly blogged.

See IUCr Brian McMahon's report for a general review on the JISC MRD workshop.

One of the most visible disciplinary approaches to data management presented at the JISC MRD event -which featured all kinds of institutional and subject-based initiatives in the area- was the one coming from meteorology, palaeoclimatology and climate-related sciences: there was a presentation of the PEG-BOARD Project (U of Bristol) at the Subject-Oriented Approaches session on Monday, followed by ACRID (U of East Anglia & STFC) and Metafor (BADC & STFC) Project presentations on Tuesday afternoon.

One of the most relevant features of these climate-related projects is interdisciplinarity. The PEG-BOARD Project in particular aims to serve the archaeology research community by supplying it with palaeoclimate data.

A few specific aspects of PEG-BOARD were discussed after the project presentation. The interesting thing about them is that they were not mentioned during the talk, nor are they reported on the project site:

- Due to the project's interdisciplinarity, there are two clearly different user groups for the palaeoclimatology data it produces: climatologists, who understand the nature of the datasets involved because they are central to their discipline, and archaeologists, who do not, and need not, know much about the data format but need the information it contains for their own purposes. Archaeologists thus act as regular non-technical users of the project rather than as researchers. However, since they are indeed researchers, the feedback they may provide on the project outcome could be all the more valuable.

- What archaeologists care about in the end is the data plots, and Data Centres will not provide such processing. So what PEG did was implement specific software capabilities that address the needs of non-technical data users (i.e. archaeologists), so as to allow them to search for the plots or false-colour graphics they need. This piece of middleware is a key conceptual feature of the project in terms of deliverables.

- Climate data is usually archived in binary format, so it's often not easy to process. The UK Met Office provided lots of information, often incomplete or in old formats. The process of adapting raw data to the project's needs was very interesting and worth disseminating.

- Climate models were written in FORTRAN. When re-written or translated into C++, the results would vary for the same data arrays because of the way each code handles them. That poses quite a challenge in terms of model interpretation.

- When asked whether researchers provided enriched metadata for their data, the answer was that there is usually an input in terms of past experiments, i.e. "this is the data outcome of such and such experiment when changing initial conditions in such a way". That earlier experiment would in turn be described in the same way, and so on back along the chain until one was reached that wasn't described at all.
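The cause of the FORTRAN versus C++ discrepancy mentioned in the list above was not detailed in the discussion. One common source of such differences, assumed here purely for illustration, is that floating-point addition is not associative, so two codes that accumulate the same values in a different order may not agree. A minimal Python sketch with made-up values (the function names are hypothetical and not taken from the project):

```python
import math


def sum_forward(values):
    """Accumulate left to right, as a straightforward loop would."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_reversed(values):
    """Accumulate the same values right to left."""
    total = 0.0
    for v in reversed(values):
        total += v
    return total


# Made-up values chosen so that rounding is visible in double precision:
# adding 1.0 to 1.0e16 is lost, because neighbouring doubles are 2 apart there.
values = [1.0, 1.0e16, -1.0e16]

forward = sum_forward(values)    # (1.0 + 1.0e16) - 1.0e16
backward = sum_reversed(values)  # (-1.0e16 + 1.0e16) + 1.0
reference = math.fsum(values)    # correctly rounded sum, for comparison

print(forward, backward, reference)
```

The two loops see exactly the same array yet return different totals, which is the kind of effect that can surface when a numerical model is ported between languages or compilers.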

The fact that none of these project aspects is recorded or discussed on the project blog raises the question of whether an external approach to data management projects might collect and disseminate very interesting information that researchers may not consider relevant enough to discuss on project blogs. Such an external approach to running projects might be carried out by data librarians in order to share these specific project details with the data management community.

For whatever it may be worth, SONEX would be keen to do this kind of job for the MRD community.

Tuesday 5 April 2011

I2S2 Project workshop at RAL-STFC

During a busy week for research data management events (due to be reported shortly on this blog), last Friday, Apr 1st, SONEX had the opportunity, thanks to Simon Hodson, JISC MRD programme manager, to attend the I2S2 Project workshop at the Rutherford-Appleton Laboratory (RAL) at STFC in Didcot. I2S2, standing for 'Infrastructure for Integration in Structural Sciences', is a JISC MRD project ending in Mar 2011 that aims to "identify requirements for a data-driven research infrastructure in 'Structural Science', focusing on the domain of Chemistry, but with a view towards inter-disciplinary application".

Several presentations were delivered during the meeting: Brian Matthews on the I2S2 project achievements, ICAT architecture and CSMD metadata standard; Brian McMahon, International Union of Crystallography (IUCr), on 'Information Management and Publication in Crystallography'; Tom Griffin on the TopCAT GUI for management of data coming out of the STFC ISIS and DIAMOND facilities; Steve Androulakis on the TARDIS ANDS-supported project at Monash University; Mark Borkum on OreCHEM files; Chris Morris on PiMS (Protein Information Management System); and Juan Bicarregui on the EU PANData project.

During the IUCr presentation the need was identified for filing and preserving different data categories such as raw measurements, processed numerical data, derived information and the parameters. The convenience of providing access to raw diffraction images was also stressed; these files are a few GB in size, and thus not large enough for Data Centres but too big for sites such as CCDC. A review of Crystallographic Information Framework (CIF) file formats was provided: imgCIF is used for storing raw data out of the experiment and .fcf for including structure factors after data reduction, with a final stage of structure solution and refinement being performed in the lab before the author starts formatting the results into an IUCr paper. At that point CIF is translated into SGML to produce the final fcf, cif, pdf and html versions.

Raw data was said to be kept for 183 days at STFC and 3 months at the Australian Synchrotron (in which TARDIS is involved), and a discussion followed on the fact that some agreement should be reached on the kind of data that ought to be stored and preserved. The process of attaching DOIs to datasets was also discussed, IUCr being presently involved in projects such as XYZ or Open Bibliography in order to promote this objective.

A TopCAT demo was provided by Tom Griffin. This open source GUI (see image above) is being used for storing raw data from STFC facilities such as ISIS and DIAMOND. TopCAT provides access to its contents through an open registration system, thus operating as a sort of STFC institutional data repository, and would be potentially applicable to other institutions, facilities and disciplines.

The TARDIS presentation by Steve Androulakis, Monash Univ, Australia, mentioned the use of XML/METS metadata standards for research data description on this federated institutional repository platform. It was initially meant to store X-ray diffraction images and later evolved into a much larger initiative with applications in microscopy (MicroTARDIS), particle physics and gene processing through the Squirrel software.

Finally, extra presentations were delivered on PiMS (Protein Information Management System) by Chris Morris, STFC, and on the European PANData project by Juan Bicarregui, STFC e-Science. PANData aims to build a Photon and Neutron Data Infrastructure through a consortium of European synchrotron facilities and neutron sources.

A final summary was made of the whole set of presented I2S2-related features (imgCIF, CIF, IUCr/XML/RDF BIBLIO, PDBML, CML, ICAT, TopCAT, ICAT Lite/CSMD, TARDIS, PiMS, PANData, NeXuS) by mapping them onto the I2S2 Idealized Scientific Research Activity Lifecycle Model (see image above; click on it for an updated version). References were also made to other initiatives not represented at the meeting, such as the Quixote Project for Computational Chemistry CML data management or Protein Production and Crystallization.