Data management at the LBMC

guide
Author

Laurent Modolo

Published

January 16, 2023

Introduction

This document is a summary of the information that you can find in the biowiki and in the guide of good practices of the LBMC.

Nowadays, digital data are at the core of scientific activities, and their management and safekeeping are a recurring concern. In this guide you will find a list of the storage facilities that you have access to as a member of the LBMC, along with guidelines on how to use them.

Not all data are equal. For example, some data need to be shared while others need to be accessible to only one user, or even encrypted. In this document, we first classify data according to their size and nature:

  • documents: small files

  • codes: small files with complex history

  • experimental data: small to huge files

The experimental data category is quite broad. The data backup community often further categorizes experimental data as:

  • hot: data you are currently working on; you want rapid access to them

  • warm: data you may be working on soon; you want easy access to them

  • cold: data you will not be working on in the foreseeable future; you don’t mind if it takes some time to retrieve them.

The hot-to-cold categorization is closely related to the monetary and energy cost of the underlying storage facilities (the colder, the cheaper).

For all of the above categories, we need to distinguish between backed-up data and archived data. The data that you are working on can have anywhere from zero to multiple backups. Increasing the number of backups increases the resilience of your data, but also the physical cost of storage and the management time spent keeping all the copies up to date. Data that will not change in the future can be archived. In this case the data need to be deposited in an archive facility along with the correct metadata, where they will get a unique identifier and stay accessible forever (which may require a potentially large number of multi-site backups). The H2020 recommendations to make research data findable, accessible, interoperable and reusable (FAIR) encourage the use of data management plans to structure these metadata.

Data Management Plans (or DMPs) are a key element of good data management. A DMP describes the data management life cycle for the data to be collected, processed and/or generated. As part of making research data FAIR, a DMP should include information on:

  • the handling of research data during & after the end of the project

  • what data will be collected, processed and/or generated

  • which methodology & standards will be applied

  • whether data will be shared/made open access and

  • how data will be curated & preserved (including after the end of the project).

The ANR has written a document of recommendations concerning DMPs.

The DMP may need to be updated over the course of the project whenever significant changes arise, such as (but not limited to) new data, changes in consortium policies or composition, and external factors.

We will now go over the solutions that you have access to for storing, backing up, and archiving your documents, codes and experimental data.

Documents

There are several solutions to backup and share your documents:

Automatic backup for workstations

If your computer is correctly configured, daily backups of your documents are made over a wired connection. Different snapshots of your documents stay accessible, and you can restore your documents from any of these snapshots. Note that some types of files are excluded from these backups.

Data Backup and Synchronization Tools

Backup and synchronization tools allow you to continuously synchronize a list of folders with a remote server. In addition to providing a backup of these folders, they let you easily share some of them with other users or between different computers (which also increases the number of backups). You also get a short history of recent modifications, from which you can restore a given file to an earlier version.

  • The CNRS provides a synchronization service called MyCore (100 GB), which should be accessible to all members of the LBMC.

  • The EU provides a synchronization service called b2drop (20 GB), which should be accessible to all members of the LBMC.

For both services, the stored data can be considered heavily backed up (the data should not be lost on their end).

Well-known US companies also provide similar services, but despite the Safe Harbor and Privacy Shield agreements, the Court of Justice of the EU ruled on July 16, 2020 that US privacy law cannot be made compatible with EU privacy law. Therefore, you should not use these services for work (or at the very least you should heavily encrypt the content stored on them).
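If you do need to put a file on such a service, one common approach is to encrypt it symmetrically before upload. A minimal sketch with openssl follows; the file name and passphrase are placeholders.

```shell
# Encrypt a file before uploading it to a third-party cloud service
# (file name and passphrase are placeholders).
echo "confidential results" > report.txt
openssl enc -aes-256-cbc -pbkdf2 -salt \
    -in report.txt -out report.txt.enc \
    -pass pass:use-a-strong-passphrase

# Only report.txt.enc should be uploaded; decrypt it later with:
openssl enc -d -aes-256-cbc -pbkdf2 \
    -in report.txt.enc -out report_restored.txt \
    -pass pass:use-a-strong-passphrase
```

Remember that the encrypted copy is only as safe as the passphrase, which must be stored separately from the cloud service.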

Shared Network Volumes

A shared network volume is seen by your computer as an external hard disk that is only available when your computer is connected to the corresponding network. Even if they look like external hard disks, shared network volumes don’t offer the same level of accessibility as local storage: their performance and availability vary with the load of the network, whose speed will always be lower than that of your local storage.

On the ENS network, you have access to the BIODATA network volume. Your BIODATA space is only accessible by your team’s members and from the ENS network (not through the VPN).

The BIODATA storage space is managed by Stéphane Janczarski, hosted by the ENS DSI, and allows you to store raw data directly from scientific platforms. Each team has access to two folders:

  • nameofteam/: (2 TB for the whole LBMC), with daily snapshots on another server in the SLING room

  • nameofteam2/: (12 TB for the whole LBMC), backed up monthly by Stéphane

Your team can buy more storage to add to BIODATA.

Codes

Most bioinformatics work results in the production of lines of code or text. While important, such data are often quite small in size and should be copied to other places as often as possible. Your documentation is also a valuable set of files. Git is nowadays the reference system to store code and its history of modifications. It can be seen as the digital equivalent of a lab notebook (cahier de laboratoire). You can even digitally sign your contributions.

Gitbio

All LBMC members have access to the Gitbio server to back up and share their codes.

Using git means that a copy of these files exists at least on your computer (and on the computer of every collaborator in the project), on the gitbio server, and on the backup of the gitbio server (updated every 24 h). The details of code and documentation management within your project are developed in the src and doc paragraphs of Section 1 of the guide of good practices.

When using a version control system (see Section 3 of the guide of good practices), making regular pushes to the LBMC gitbio server will not only save you time when dealing with different versions of your project but also keep a copy of your code on the server.
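For newcomers, the basic cycle looks like the sketch below. The identity and the gitbio remote URL are placeholders to adapt to your own account and repository.

```shell
# Minimal git cycle; identity and remote URL are placeholders.
git init -q my_project
git -C my_project config user.name "Your Name"
git -C my_project config user.email "you@ens-lyon.fr"
echo "# analysis scripts" > my_project/README.md
git -C my_project add README.md
git -C my_project commit -q -m "initial commit"

# Then link the project to the LBMC gitbio server and push regularly
# (replace the URL with your actual repository on gitbio):
# git -C my_project remote add origin git@<gitbio-host>:<team>/my_project.git
# git -C my_project push -u origin master
```

Every commit becomes a restorable snapshot of the project, and every push duplicates the whole history on the server.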

Code archive

The EU, the CNRS and various French ministries support the softwareheritage project, which can automatically archive git code repositories. Upon publication of your work, you can therefore add your git repository to the softwareheritage project to archive it.

Experimental Data

In this section we present some rules to manage your project data. Given the size of current experimental data sets, one must find a balance between securing the data of a project and avoiding the needless replication of gigabytes of data.

From the time spent obtaining the materials to the cost of the reagents and of acquisition, your data are precious. Moreover, for reproducibility concerns you should always keep a raw version of your data to go back to. These two points mean that you must make an archive of your raw data as soon as possible (the external hard drive or thumb drive on which you received them doesn’t count).
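A cheap way to make sure that any copy of your raw data is still identical to the original is to record checksums next to the files when you receive them. A sketch with sha256sum, where the file names are examples:

```shell
# Record checksums next to raw data so any later copy can be verified
# (file names are examples).
mkdir -p raw_data
printf 'ACGTACGT\n' > raw_data/sample_run1.fastq
( cd raw_data && sha256sum *.fastq > checksums.sha256 )

# Later, on any copy of the data, verify the files:
( cd raw_data && sha256sum -c checksums.sha256 )
```

If a file has been corrupted or truncated during a transfer, the check reports a FAILED line instead of OK.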

When you receive data, it is also always important to document them. Write a simple description.txt file in the same folder that describes your data and how they were generated. These metadata are important to archive and index your data. There are numerous conventions for metadata terms that you can follow, like the Dublin Core. Metadata will also be useful to the people who will reuse your data (in meta-analyses, for example) and to cite them.
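A minimal description.txt skeleton using Dublin Core terms could look like the following; the fields shown and all the values are placeholders to adapt to your own data.

```shell
# Write a minimal description.txt skeleton next to the data; the Dublin
# Core fields and values below are placeholders to adapt.
cat > description.txt <<'EOF'
dc:title       <short title of the dataset>
dc:creator     <person who generated the data>
dc:date        <acquisition date, YYYY-MM-DD>
dc:description <protocol, instrument and settings used>
dc:format      <file format, e.g. fastq.gz>
dc:rights      <license or access restrictions>
EOF
```

Keeping this file in the same folder as the data means it follows the data through every copy, backup and archive.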

Public Archives

Public archives like the ebi (EU) or the ncbi (USA) are free to use for academic purposes. These institutions propose different services for different types of data.

Once your raw data are deposited on a public archive, you can consider that they have a level of backup that you cannot reasonably reach yourself and that they are safe. The archiving procedure requests metadata on the author of the data and on the nature of the data. Filling in the forms of the archiving procedure is akin to writing a DMP with an infinite lifetime for the data.

Public archives propose an embargo system during which your dataset stays private. You will get an automatic alert before the end of the embargo, and you can renew it as many times as you need. Therefore, you should systematically archive your raw data.

  • Once a dataset is archived, it will never be deleted.

  • These archives support a wide array of data types.

  • The embargo can be extended for as long as you want.

  • You will get a reminder when the end of the embargo is near, so your precious data won’t go public inadvertently.

BIODATA

For many kinds of raw data, the storage available on BIODATA (described above) may be enough to hold a backup. Moreover, your team can buy more storage if needed.


PSMN

The PSMN (Pôle Scientifique de Modélisation Numérique) is the preferred high-performance computing (HPC) center that the LBMC has access to. LBMC members have access to a storage volume in the PSMN facilities, accessible once connected with a PSMN account. Access to these volumes is preferably done from the command line with ssh, but can also be done with a graphical interface like Filezilla. You can request a training course from Cerasela Iliana Calugaru to learn how to use these resources.

A copy of your data can be placed in your PSMN team folder /Xnfs/site/lbmcdb/team_name, with up to 600 TB of storage for the biology department. You can contact Helene Polveche or Laurent Modolo if you need help with this procedure. This will also facilitate access to your data for the people working on your project if they use the PSMN computing facilities.

CCIN2P3

The CCIN2P3 (Centre de Calcul de l’Institut national de physique nucléaire et de physique des particules) gives biologists access to a percentage of its resources. In addition to the computing resources, you can also make long-term backups of your data in this center with a PSMN account.

The CCIN2P3 doesn’t know you and doesn’t provide archiving services; therefore you must write a DMP to define information such as the owner of the data, their nature and their lifetime. You will need to create an account on dmp.opidor.fr, where you can find a DMP template for the CCIN2P3.

You can then contact the PSMN staff to send this DMP to the CCIN2P3. Once the DMP is validated by the CCIN2P3 staff, you will be able to upload your data from the PSMN.