Multivariate statistical techniques for metagenomic analysis of microbial community recovered from environmental samples
High-throughput datasets generated by next-generation sequencing (NGS) of DNA samples help identify key differences in function and taxonomy between microbial communities and shed light on microbial diversity, cooperation and evolution in a particular ecosystem. In this study, three statistical techniques, namely Random Forest (RF), Multidimensional Scaling (MDS) and Linear Discriminant Analysis (LDA), were employed for functional analysis of 212 publicly available metagenomic datasets within and between 10 environments against 27 metabolic functions. RF, used alongside MDS and LDA, identified the 8 most important metabolic variables: Photosynthesis had the highest score (70.20), Phages, prophages the second highest (61.31), and Membrane Transport the eighth highest (45.29). The MDS plot was useful for visualizing the separation of microbes from human or animal hosts from the other samples along the first dimension, and the separation of the aquatic and mat communities along the second dimension. The LDA grouped the microbial samples into three broad groups: the human- and animal-associated samples, the microbial mats, and the aquatic samples. RF showed that phage activity is a major difference between host-associated and free-living microbial communities. The MDS and LDA results suggest that the mat communities were distinct from both the animal-associated metagenomes and the aquatic samples, with differences in vitamin and cofactor metabolism.
J. bio-sci. 24: 45-53, 2016
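The three-technique workflow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it assumes Python with NumPy and scikit-learn, and it uses random placeholder values in place of the real 212 x 27 table of metabolic-function abundances, so the output carries no biological meaning. The variable names (X, y, top8) are hypothetical. scikit-learn's impurity-based importances are normalized to sum to 1, so they are not on the same scale as the scores quoted above (for example 70.20).

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.manifold import MDS

rng = np.random.default_rng(0)

# Placeholder data only (assumption): 212 samples x 27 metabolic functions,
# each sample labelled with one of 10 environments.
n_samples, n_functions, n_envs = 212, 27, 10
X = rng.random((n_samples, n_functions))
y = np.arange(n_samples) % n_envs  # every environment gets some samples

# Random Forest: rank metabolic functions by importance, keep the top 8.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top8 = np.argsort(rf.feature_importances_)[::-1][:8]

# MDS: unsupervised 2-D embedding of the samples for plotting.
mds_coords = MDS(n_components=2, n_init=4, random_state=0).fit_transform(X)

# LDA: supervised 2-D projection that maximizes between-group separation.
lda_coords = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print("Top-8 function indices:", top8.tolist())
print("MDS shape:", mds_coords.shape, "LDA shape:", lda_coords.shape)
```

With real data, X would hold the per-sample abundance of each metabolic function and y the environment of origin. The two coordinate arrays could then be drawn as scatter plots coloured by environment to inspect the group separation.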