PEMA: a Pipeline for Environmental DNA Metabarcoding Analysis
Haris Zafeiropoulos¹,², Katerina Vasileiadou¹,³, Ha Quoc Viet¹, Christos Arvanitidis¹, Pantelis Topalis⁴, Christina Pavloudi¹, Evangelos Pafilis¹
¹ Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC), Hellenic Centre for Marine Research (HCMR), Heraklion, Crete, Greece
² School of Biology, University of Crete, Heraklion, Greece
³ Charles University, Prague, Czechia
⁴ Institute of Molecular Biology and Biotechnology (IMBB), Foundation for Research and Technology (FORTH), Heraklion, Greece
Metabarcoding is a next-generation sequencing (NGS)-based method for biodiversity assessment that uses marker genes to detect the organisms in an environmental sample and determine its community composition. The marker gene employed depends on the targeted group of organisms. Together, environmental DNA (eDNA), i.e. DNA collected from a variety of environmental samples, and metabarcoding aim to change the way we explore biodiversity.
PEMA is a pipeline for two marker genes, 16S rRNA (prokaryotes) and COI (eukaryotes). As input, PEMA accepts fastq files as returned by Illumina sequencing platforms. PEMA processes the reads from each sample and returns an OTU-table with the taxonomies of the organisms found and their abundances in each sample. It also returns statistics and a FastQC diagram on the quality of the reads for each sample. Finally, in the case of 16S, PEMA returns alpha and beta diversities and computes correlations between samples. PEMA thus attempts to address all three main issues of metabarcoding: sequence pre-processing, OTU-clustering and taxonomy assignment.
PEMA is written in the BigDataScript (BDS) programming language and is meant to be executed on HPC systems such as “Zorba”, HCMR’s cluster.
In the COI case, PEMA can run either of two clustering algorithms (CROP and SWARM), while in the 16S case two approaches to taxonomy assignment are supported: alignment-based and phylogeny-based. For the latter, a reference tree of 1000 taxa was built using SILVA_132_SSURef, EPA-ng and RAxML-NG, as shown in Figure 2. In the 16S case, the analysis of the OTU-table relies on the “phyloseq” R package. To verify PEMA’s efficiency and accuracy, it was tested with previously published datasets*.
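To make the idea of an OTU-table and its diversity analysis concrete, here is a minimal, standalone Python sketch. It is not PEMA's implementation: PEMA performs this step with the phyloseq R package. The toy table, the sample names and the choice of indices (Shannon for alpha diversity, Bray-Curtis for beta diversity) are assumptions made only for illustration.

```python
import math

# Hypothetical toy OTU-table: one row per OTU, one read count per sample.
# The OTU and sample names are invented for this illustration.
otu_table = {
    "OTU_1": {"sampleA": 10, "sampleB": 0},
    "OTU_2": {"sampleA": 10, "sampleB": 20},
    "OTU_3": {"sampleA": 0, "sampleB": 20},
}


def sample_counts(table, sample):
    """Return the read counts of every OTU for one sample, in table order."""
    return [row[sample] for row in table.values()]


def shannon_index(counts):
    """Alpha diversity: Shannon index (natural log) of one sample."""
    total = sum(counts)
    if total == 0:
        return 0.0
    # OTUs with zero reads contribute nothing to the sum.
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def bray_curtis(counts_a, counts_b):
    """Beta diversity: Bray-Curtis dissimilarity between two samples."""
    shared_total = sum(counts_a) + sum(counts_b)
    if shared_total == 0:
        return 0.0
    difference = sum(abs(a - b) for a, b in zip(counts_a, counts_b))
    return difference / shared_total


counts_a = sample_counts(otu_table, "sampleA")
counts_b = sample_counts(otu_table, "sampleB")
alpha_a = shannon_index(counts_a)
alpha_b = shannon_index(counts_b)
beta_ab = bray_curtis(counts_a, counts_b)
```

In this toy table each sample holds two equally abundant OTUs, so both samples have the same Shannon index, while the Bray-Curtis value reflects that they share only one OTU.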
For the 16S dataset, PEMA found 4,457 OTUs, whereas Pavloudi et al. (2017) reported 7,050. It needed about 3 hours to produce the final OTU-table, but another 24 hours to build the tree required by Rhea, using 2 nodes of “Zorba”.
Likewise, when tested on the COI case, PEMA ended up with 81 Animalia species, while Bista et al. (2017) had 73 OTUs assigned to species level. This time 6 nodes were used, and PEMA took 12 hours to complete.
PEMA produced results similar to those of the publications whose datasets were used. Differences do remain, however, and they reflect the main problem of metabarcoding: there is not yet a standardized protocol, especially in the case of COI.
Thanks to software containerization technologies, PEMA is available for all types of OS; its only dependency is Singularity, which must be installed on the host system. PEMA is easy to use and needs no installation of its own; a tutorial can be found in its GitHub repository.