Biocomputing at LBMC
Introduction
We spend an increasing amount of time building and using software. However, most of us were never taught how to do this correctly and efficiently. The resulting problems are multiple and easily avoidable. This document summarizes a set of good practices in bioinformatics.
- Section 1 presents the organization of your working folder for a given bioinformatics project.
- Section 2 lists the resources available to manage and secure the data in your project.
- Section 3 presents the git code versioning system and some examples of how to use it.
- Section 4 enumerates some rules to follow when you write code. These rules will ease the reproducibility of your analyses and collaborative development in your project.
These good practices were compiled from different sources, often overlapping, listed in the References of this document.
Project organization
The first step at the start of a bioinformatics project is to plan the structure of the project. In this section we present a guide for organizing your project that should cover the requirements of most bioinformatics projects. Following this structure will facilitate collaboration with other bioinformaticians in the LBMC, or even with your future self. You are strongly encouraged to follow it and to enforce its policies in your team.
The project must have the following structure:
project_name/
bin/
data/
doc/
results/
src/
CITATION
CONTRIBUTING
README
LICENSE
todo.txt
You can get a template of this organization from the following git repository: https://gitbio.ens-lyon.fr/LBMC/hub/minimal_git_repo.
Text files at the root of the project directory
The README file must contain various pieces of information about your project, such as the project title, a short description and contact information for the project lead. You should also provide some examples of how to run tasks, so that others can reproduce your work. This includes the dependencies that need to be installed.
The CONTRIBUTING file points visitors to the ways they can help, the tests they can run and the guidelines the project adheres to.
The LICENSE file must contain the license you wish your work to be published under. Lack of an explicit license implies that the author keeps all rights and that others are not allowed to re-use or modify the material.
For source code, the Loi pour une République numérique du 7 octobre 2016 directs you to use the GNU Affero General Public License v3.0 or later, and the Creative Commons Attribution 4.0 license for your documentation.
The CITATION file must contain information about how to cite the project as a whole, and where to find and how to cite any data sets, code, figures or documents in the project.
You can use a reputable DOI-issuing repository such as figshare, datadryad or zenodo to facilitate this step.
The todo.txt file. If you don't use tools like the issues in GitLab, you can maintain a to-do list with a clear description of its items, so they make sense to newcomers. This will also help you keep track of the work progress and timetable.
data folder
The data folder must contain a .gitignore file whose content is simply *. Therefore, git will ignore files in this folder (See Section 2 for more information on data management).
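For example, a minimal way to set this up from the shell, run from the project root:
mkdir -p data
echo "*" > data/.gitignore  # git will now ignore everything under data/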
A general rule for data management is to have a single authoritative representation of every piece of data in the system.
The data folder must contain only the raw data for your project. No script must write to it (except the ones used to get the data in the first place). This point is crucial for the reproducibility of your work: one must be able to go back to the first step of your analysis and play it back again step by step. Another advantage of keeping the raw data untampered is that it gives you more freedom to experiment with your analysis pipeline, without side effects between the different strategies.
The names of data files in this folder, and of files in general, should contain some metadata, like a time stamp and a few biologically meaningful keywords. We advise you to use the following naming convention: 2020_12_31-informative_name.file. An informative file name can, for example, combine the species, lineage, replicate and sequencing technology. With this format, sorting the files by name will also sort them by date, and the most important metadata are kept in the file name. Avoid spaces and special characters (use _ instead). When possible, use open file formats, which are easier to handle with standard tools and help to promote open science.
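As an illustration (the file names below are hypothetical), the date prefix makes alphabetical order match chronological order:
ls data/ | sort
# 2020_11_02-mus_musculus_rep1_rnaseq.fastq.gz
# 2020_12_31-mus_musculus_rep2_rnaseq.fastq.gz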
When writing a script or code, it's important to be able to test it. You can create a data/examples folder that contains small toy data sets to test your scripts or software, as described in the README file. This will give others the possibility to validate your work, and allow you to check that new modifications work correctly, hence saving you a lot of time (See Section 4 for more information on testing).
src folder
If you are developing a new tool, its source code must be in src. If you are developing or using an analysis pipeline, we advise you to put the functional part of your code in a src/func folder, and the pipelines or scripts that contain the commands called to run your functions in the src/ folder. If you are developing a web tool, you can create the src/model, src/view and src/controller folders if you use the MVC (model-view-controller) pattern.
If you are using online resources, documenting every link used and the value of every form field filled can be fastidious. Instead, you have access to a palette of commands like wget or curl to automate your requests. Most online tools also provide APIs (and associated documentation) that facilitate command-line interaction with them.
The goal of this folder is, first, to let the computer do the work and, second, to automate every step of your analysis. By saving commands in a file, it's easier to re-use them and to build tools to automate workflows. This means that others will be able to run the same analysis on their own data, that you will have a publication-ready pipeline at the end of your project, and that your work can easily be integrated into another project.
The content of the src/ folder must be regularly committed to your git repository to ensure the traceability of your code (nobody is going to judge you if you have crappy code somewhere in your git history, as long as it's gone in the final version).
tests folder
The tests folder must contain the test files that can be executed to check your code. This will be explained in more detail in Section 4 on test-driven development.
doc folder
You like to write stuff? Put it in the doc folder. Even if you don't like to write, write about what you are doing anyway and put it in the doc folder. The doc folder contains the documents associated with the project. This includes files for your publication and documentation for your source code. Thorough documentation adds huge value to your project, as others will find it easier to comprehend and reuse your work, and will cite it instead of starting something new.
We advise you to keep an electronic lab notebook in a doc/reports subfolder to track your experiments and to describe your workflow. This notebook can be easily generated using tools like knitr or Sweave. Those tools can call code or functions from the src/func folder to compute results and generate figures. You can use a Makefile to automate the generation of the documents in the doc folder.
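For instance, a report can be regenerated from the command line (the report name below is hypothetical):
# knit an R Markdown report from the shell
Rscript -e 'knitr::knit("doc/reports/2020_12_31-qc_report.Rmd")'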
results folder
The results folder must contain a .gitignore file whose content is simply *. Therefore, git will ignore files in this folder (See Section 2 for more information on data management).
Every generated result or temporary file must go to the results folder. This also means that the entirety of the results folder can be regenerated from the data, bin and src folders. If this is not the case for a given result file, delete it and write the necessary code in src to regenerate it.
We advise you to use the same naming convention in the results folder as in the data folder. Files whose names have a variable part (date and time) are easy to load with wildcard characters like *. Adding time stamps to your result files will help you track down errors in your analysis.
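For example, a downstream script can pick up the most recent time-stamped output thanks to this naming scheme (the file names below are hypothetical):
latest=$(ls results/*-gene_counts.tsv | sort | tail -n 1)
echo "using ${latest}"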
Even if we don't enforce a backup policy for the results folder, keep in mind that computation time is not free, and that days or weeks of computations (even if easily reproducible with the guidelines of this document) are valuable. Moreover, keeping intermediate files to be able to restart an analysis at any point can save you a lot of time. It's up to you to discriminate between valuable final or intermediate results that could ease the reviewing process of your work, and temporary files that only consume space. You can use a results/tmp folder to make this distinction.
bin folder
The bin folder, which historically contains compiled binary files, must also contain third-party scripts and software. You should be able to fill this folder from the information contained in the dependencies section of the README file or the doc/ folder. The compiled files from your work can be recompiled, and the third-party material can be retrieved from the internet or other sources. This folder can also be filled automatically, if necessary, by executing the content of the src folder.
Data Management
In this section we present some rules to manage your project data. Given the size of current NGS data sets, one must find the balance between securing the data of the project and avoiding the needless replication of gigabytes of data.
Your code and documentation are also valuable sets of files. Using git means that a copy of these files exists at least on your computer (and on the computer of every collaborator in the project), on the gitbio server and on the backup of the gitbio server (updated every 24h). The details of code and documentation management within your project are developed in the src and doc paragraphs of Section 1. In this section, we focus on replicating the content of the data and results folders of your project on multiple sites, in order to secure it.
From the time spent to get the material, to the cost of the reagents and of the sequencing, your data are precious. Moreover, for reproducibility concerns, you should always keep a raw version of your data to go back to. These two points mean that you must make a backup of your raw data as soon as possible (the external hard drive or thumb drive on which you received them doesn't count). When you receive data, it's also important to document them. Write a simple description.txt file in the same folder that describes your data and how they were generated. These metadata are important to archive and index your data. There are numerous conventions for metadata terms that you can follow, like the Dublin Core. Metadata will also be useful to the people who are going to reuse your data (in meta-analyses for example) and to cite them.
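A minimal sketch of such a file, loosely following the Dublin Core terms (all values below are hypothetical):
cat > data/description.txt <<EOF
title: RNA-seq of mus musculus liver, replicate 1
creator: first_name last_name
date: 2020-12-31
description: paired-end 100 bp reads, Illumina HiSeq sequencing
EOF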
Public archives
Public archives like the EBI (EU) or the NCBI (USA) are free to use for academic purposes. Public archives offer an embargo period during which your data set will stay private. Therefore, you should use them as soon as you get your raw data.
- Once a dataset is archived, it will never be deleted.
- These archives support a wide array of data types.
- The embargo can be extended for as long as you want.
- You will get a reminder when the end of the embargo is near. Thus your precious data won’t go public inadvertently.
PSMN
The PSMN (Pôle Scientifique de Modélisation Numérique) is the preferred high-performance computing (HPC) center the LBMC has access to. The LBMC has access to a storage volume in the PSMN facilities, accessible once connected with your PSMN account.
A second copy of the raw data can be placed in your PSMN team folder /Xnfs/site/lbmcdb/team_name. You can contact Helene Polveche or Laurent Modolo if you need help with this procedure. This will also facilitate access to your data for the people working on your project if they use the PSMN computing facilities.
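Such a transfer can be scripted, for example with rsync (the login and host below are placeholders; ask the PSMN staff for the actual connection details):
rsync -av data/ \
    your_login@psmn_host:/Xnfs/site/lbmcdb/team_name/project_name/data/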
Code safety
Most hands-on bioinformatics work will result in the production of lines of code or text. While important, such files are often quite small and should be copied to other places as often as possible.
When using a version control system (See Section 3), making regular pushes to the LBMC gitbio server will not only save you time when dealing with different versions of your project, but also save a copy of your code on the server. You can also make instantaneous or daily backups in your home directory at the PSMN.
Within the LBMC you can also use the Silexe server. The CNRS provides a synchronization service called MyCore to synchronize folders on their servers (100 GB). The EU provides a synchronization service called b2drop to synchronize folders on their servers (20 GB).
Versioning
Biologists keep their lab journal up to date so that their future self or other people can check and reproduce their work. In bioinformatics, versioning can be seen as a bioinformatics journal in which you can comment on the addition of new functions to your project. This also means that you can go back to any point of this journal and revert your code to an earlier state.
Moreover, where a lab journal is linear, you can start new paths (branches) to try new ideas and test new features, while your main working branch is left undisturbed. With versioning software, you can even make progress on different branches at the same time. Successful branches can then be merged back into the main branch to include new, working and tested functionalities.
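With git, this branch-and-merge cycle looks like the following (the branch name is illustrative):
git checkout -b new_normalization  # start a branch to try an idea
# ... edit, add and commit as usual ...
git checkout master                # go back to the main branch
git merge new_normalization        # bring the tested changes back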
The strength of a code versioning system is that it does all of the above transparently. You don't have to keep different versions of your files; that's the versioning software's job. By going to another branch or time point, your working directory is changed to match the status of the files at that point. If you jump back, the files are changed back to the state you came from.
The flexibility of the version control software to jump to a given time point of your project relies on the granularity of those time points. Therefore, you should try to make incremental changes to your project and record them with the version control software as often as possible. This will also help you to comply with the recommendations of the Section 4 on coding.
You can find the LBMC course on git at https://gitbio.ens-lyon.fr/LBMC/hub/formations/git_basis/.
Installing git
We chose git as the version control software to use at the LBMC. git can be easily installed on most operating systems with the following instructions.
On Linux you can type:
# on debian/ubuntu
apt-get install git
# on redhat/centOS
yum install git
To install git on macOS with homebrew, you can type:
brew install git
When using git for the first time, you need to give it your identity so it can sign your entries (commits). To do that you can use the commands:
git config --global user.name "first_name last_name"
git config --global user.email first_name.last_name@ens-lyon.fr
Using git
To start recording your bioinformatics journal, you simply need to place yourself in your project directory and use the command:
git init
Then you can record the status of a given file or a list of files with the commands:
git add file_a
git add file_b
git commit -m "creation of file a and b"
Each new commit will create a new entry in your bioinformatics journal with the current status of your project. If you missed something or made an error, you can easily amend the last commit with:
git rm file_b
git add file_c
git commit --amend
This will open your favorite text editor to let you edit the commit message and amend it. At any time, to see the status of your repository, you can use the command:
git status
One strength of git is its decentralized structure. This means that you can keep your own journal on your computer without the need to push your changes to a central repository. This also means that git has powerful tools to merge differences between different repositories of the same project. To facilitate collaborative work (like your supervisor checking on your progress), you can use a central shared repository. One such instance is available for the LBMC at https://gitbio.ens-lyon.fr/.
To push your local repository to the LBMC gitlab server you can use these commands:
git remote add origin url_to_your_repository
git push -u origin master
The full documentation of every command and possibility of git is well beyond the scope of this document. However, you can access a complete and well-written documentation on the website git-scm.com. There is also a huge community around git, so most of your problems with it should find their answer online or in the LBMC. Also, don't forget to attend the git training organized at the LBMC!
Coding
In this section we introduce some concepts and rules to follow and implement. The goal of this section is to help you write better code and scripts in your projects, with validated and reproducible results.
Write programs for people, not computers
The first goal to follow in your project is to write code for people, not for computers. We are limited, and there is only so much information that we can keep in mind at the same time. Thus a program should not require its readers to hold more than a handful of facts in memory at once.
This means that you should use very simple control flow constructs. Split logical units of code into functions. No function should exceed about 60 lines of code, with one line per statement and one line per declaration. A function is a logical unit that is understandable and verifiable as a unit. Simple control flows are easier to verify and result in improved code clarity.
Keep the number of parameters of your functions small: they will be easier to track throughout the function, and this will also help you debug and test it. Also try to keep small the number of memory objects modified by your function; if you need to modify lots of items, write more functions. The validity of the parameters must be checked inside each function, and the return values of your functions must be checked as well.
Don't keep a block of code more than once in your project. If you need the same block of lines at different points of your program, transform it into a function and call it. This will keep your code small and avoid the problem of maintaining different versions of the same code. If you are using an object-oriented language, use inheritance.
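As an illustration, here is a minimal bash sketch of a small function that validates its parameter and whose return value is checked by the caller (the file names are hypothetical):
# count the number of reads in a (possibly gzipped) FASTQ file
count_reads() {
  local fastq="$1"
  if [ ! -f "${fastq}" ]; then  # check the validity of the parameter
    echo "count_reads: file not found: ${fastq}" >&2
    return 1
  fi
  # a FASTQ record spans 4 lines
  echo $(( $(zcat -f "${fastq}" | wc -l) / 4 ))
}

count_reads data/2020_12_31-sample.fastq.gz || exit 1  # check the return value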
Apply a naming convention
Define a naming convention for your variables, functions, objects, and files at the start of your project, and keep to it. We advise you to use lower-case characters separated by underscores for variable and function names. Object template names should start with an upper-case character to differentiate them from functions. We also advise you to configure your editor to use two space characters instead of the tabulation character (so-called soft tabs).
A brief summary of those rules should be written in the CONTRIBUTING file of your project. This kind of naming convention will allow you to use informative variable and function names. Informative names will clarify your code for you and others. This will encourage collaborative work and help you debug or refactor your code.
Modern editors can also provide add-ons that automatically check whether your code syntax follows coding conventions that are widely recognized for a language. Those add-ons call upon software like the lintr package for R, g++ for C/C++ or pep8 for python. Another advantage of using these tools is to ensure that your code will remain valid for future evolutions of the language.
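These checkers can also be run directly from the shell (the file names below are hypothetical):
Rscript -e 'lintr::lint("src/func/normalize.R")'  # R
pep8 src/run_pipeline.py                          # python
g++ -Wall -Wextra -fsyntax-only src/main.cpp      # C/C++, warnings only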
Iterative development and continuous integration
Aim to have a short development cycle from the conception of your code to its test. The conception will be simpler, and your code will evolve by the addition of small new functional units.
With a short development cycle, the addition of new functionalities or improvements will result in small independent changes to your code. Those small changes will be easier to track with a version control system and can be published daily or many times a day. To enforce this policy, you should try to make incremental changes to your project. This means working in small steps with frequent feedback and course correction rather than trying to plan months of work in advance.
To achieve this development rhythm you need to apply another rule: don’t optimize prematurely. Write a code that is simple, clear and works. Keep in mind that a source code is rewritable at any point in time. You can later try to rewrite the suboptimal sections of your code. If you followed the previous points and the following section on testing, this will involve small changes with minimal side effects. You can make those changes in a new branch while keeping a working (if suboptimal) main branch.
Test-driven development
The value of your code resides in its number of working lines, not in its total number of lines. Thus, each modification must be tested. The easiest way to do that is to build and code the tests before the new functionality. Instead of writing your own testing code (which would itself need to be tested), use testing libraries like the testthat package for R or the unittest module for python to facilitate the addition of tests to your code. Test-driven development will also provide you with a complete set of tests to check for side effects in the non-modified parts of your code after a modification.
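For example, a whole test suite can be run with a single command from the project root:
Rscript -e 'testthat::test_dir("tests")'  # R, with the testthat package
python -m unittest discover -s tests      # python, with the unittest module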
There are different kinds of tests that you can use, like unit tests or integration tests.
Unit tests are simple tests that check a functionality or a module. When you write a function, you first write one or more unit tests that check whether the return value of your function corresponds to what you expect. Then you can write your function and test it (see the sketch after the following list). Unit tests are beneficial at many points of the code development:
- Before: they force you to detail your code requirements
- Writing: they keep you from over-coding; when all the test cases pass, you are done.
- Refactoring: everything keeps working while you improve your code.
- Maintaining: instead of reconsidering everything, you can just say: "no sir, the code still passes all our tests".
- Working with others: you can check that your additions don't break the other developers' tests.
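Continuing the count_reads sketch from earlier in this section, a hypothetical unit test in plain shell could look like this:
# tests/test_count_reads.sh -- a toy unit test
. src/func/count_reads.sh  # load the function under test

printf '@r1\nACGT\n+\nIIII\n' > /tmp/toy.fastq  # a one-read FASTQ file
result=$(count_reads /tmp/toy.fastq)
if [ "${result}" -eq 1 ]; then
  echo "PASS: count_reads"
else
  echo "FAIL: expected 1, got ${result}" >&2
  exit 1
fi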
Integration tests are one level of complexity above unit tests. They aim at checking the assembly of elementary components in your code. Integration tests can be run on the content of your data/examples folder to check, after each step of your pipeline, whether you get the expected results.
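For instance (the script and file names below are hypothetical), the whole pipeline can be run on the toy data set and its output compared to a reference:
bash src/pipeline.sh data/examples/toy.fastq.gz results/tmp/toy_counts.tsv
diff results/tmp/toy_counts.tsv data/examples/toy_counts_expected.tsv \
    && echo "PASS: integration test"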