Basic Research at Sequence Analysis team

Next Generation Sequence Analysis

Apart from the conventional bioinformatics solutions reflected in the above-mentioned projects, we provide solutions for analyzing Next Generation Sequencing (NGS) data. We have experience in handling NGS data for both prokaryotes and eukaryotes, independent of the sequencing platform used. We carry out de novo assembly, transcriptome and reference genome assembly, SNP and indel identification, and variant detection, covering a wide range of downstream analyses. Efforts are also ongoing to develop tools for high-throughput studies of such data.

Next Generation Sequence
Screen-shot of reference genome mapping

Population stratification analysis of the human genome data

Population-specific variants improve our understanding of the landscape of genetic diversity. Genomic heterogeneity across ethnic groups also indicates the interplay of multiple genetic and environmental factors leading to natural selection. Reliable identification of variants is an important prerequisite for building a Single Nucleotide Polymorphism (SNP) profile, and the use of allele frequency differences for studying variation between populations has shown promising results in the recent past. In this study, a joint genotyping approach was used to derive variants for the Gujarati Indians in Houston (GIH) and Indian Telugu in the UK (ITU) populations from the 1000 Genomes Project (1KGP). The objective was to unravel the variation in allele frequencies (of normal individuals) for Very Important Pharmacogenes (VIPs) and for genes involved in vitiligo, cancer and cardiovascular diseases vis-à-vis global population data. SNPs in both populations whose allele frequencies differ significantly from those of the super-populations in the 1000 Genomes Project and gnomAD were identified using the Chi-square test, population stratification and fixation indices (Fst). The SNPs identified in this study can be used as putative sites for analysing natural selection, which may better explain the observed genetic variation in the GIH and ITU populations and help in understanding polygenic traits.
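
As a rough illustration of the allele-frequency comparison described above, the sketch below computes a Pearson chi-square statistic and a simple Wright-style Fst for a single biallelic SNP in two populations. The function names, the example allele counts and the particular Fst estimator (equal population weights, no sample-size correction) are assumptions made for illustration; the estimators and significance thresholds actually used in the study are not stated here.

```python
def chi_square_2x2(alt_a, ref_a, alt_b, ref_b):
    """Pearson chi-square statistic (1 degree of freedom, no continuity
    correction) for alternate/reference allele counts in two populations."""
    table = [[alt_a, ref_a], [alt_b, ref_b]]
    total = alt_a + ref_a + alt_b + ref_b
    row_sums = [alt_a + ref_a, alt_b + ref_b]
    col_sums = [alt_a + alt_b, ref_a + ref_b]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            if expected > 0:
                stat += (table[i][j] - expected) ** 2 / expected
    return stat


def fst_two_pops(p_a, p_b):
    """Fst = (H_T - H_S) / H_T for one biallelic SNP, where p_a and p_b are
    the alternate-allele frequencies in two equally weighted populations."""
    p_bar = (p_a + p_b) / 2
    h_t = 2 * p_bar * (1 - p_bar)                           # pooled heterozygosity
    h_s = (2 * p_a * (1 - p_a) + 2 * p_b * (1 - p_b)) / 2   # mean within-population
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t


# Hypothetical counts: 60 of 200 alleles are alternate in population A,
# 20 of 200 in population B.
stat = chi_square_2x2(60, 140, 20, 180)
fst = fst_two_pops(60 / 200, 20 / 200)
print(f"chi-square = {stat:.2f}, Fst = {fst:.4f}")
```

A chi-square statistic above 3.841 corresponds to p < 0.05 at one degree of freedom; a genome-wide scan would additionally need a multiple-testing correction.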

GIH-ITU
Flow-chart for identification of significant SNPs

Population structure analysis of pharmacogenes

Breast Cancer study with in-silico analysis of histone modifications using chromatin immunoprecipitation sequencing (ChIP-Seq) and RNA-seq data

Epigenetic regulation of genes (DNA methylation, histone modification, non-coding RNA) plays an important role in various developmental stages and in disease pathogenesis. Epigenetic influence on the differential expression of genes through non-coding RNA plays a crucial role in cancer regulation. For breast cancer, we carried out an in-silico analysis of histone modifications using chromatin immunoprecipitation sequencing (ChIP-Seq) and RNA-Seq data. H3K4me3 histone modification data from one normal-like and four breast cancer cell lines were used to predict miRNA expression at the promoter level. Predicted miRNA promoters (based on ChIP-Seq) were used as probes to identify gene targets. We proposed an integrative approach using ChIP-Seq and RNA-Seq to identify the epigenetic influence on genes via the miRNA-mRNA axis.

Chip and RNA Sequence
ChIP-Seq and RNA sequencing (RNA-Seq) data integration workflow for prediction of miRNA-mRNA interaction via 3′-untranslated region (3′-UTR) binding target prediction
© Genomics & Informatics Journal

Variant Calling Using Consensus Approach

Consensus-TTP (thresholds from true positive dataset) improves on popular variant calling protocols in terms of accuracy for human genomes, using a benchmark dataset from Genome in a Bottle (GIAB). The data were mapped to the GRCh37 reference genome using three popular read aligners, viz. BWA-MEM, Bowtie2 and NovoAlign. Variant calling was performed using three widely used variant callers, viz. samtools-bcftools, VarScan2 and GATK-HC. The predicted variants were assessed in terms of true positives, false positives and false negatives. SNPs in the false positive dataset that had relatively high read depth and mapping quality scores were re-analysed, using the read depth and mapping quality scores derived from the true positive dataset as thresholds; this is one of the novel features of this work. A consensus approach was then used to identify SNPs in the false positive dataset that were predicted by more than one variant calling pipeline and satisfied the threshold criteria, so that they could be included as true positives. Precision and recall were measured to evaluate each of the variant calling pipelines. Predicted variants were validated by comparison with gold standard data.

Improved identification of actionable high confidence variants
Reduced prediction of false positives
Valuable for prioritization of actionable variants
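
The consensus step described above can be sketched roughly as follows. This is a minimal illustration, not the Consensus-TTP implementation: the data layout, the pipeline labels, and the rule used to derive thresholds (mean read depth and mean mapping quality of the true positives) are all assumptions, since the exact threshold derivation is not given here.

```python
from collections import Counter
from statistics import mean


def thresholds_from_true_positives(true_positives):
    """Assumed rule: mean read depth and mean mapping quality of true positives."""
    return (mean(v["depth"] for v in true_positives),
            mean(v["mapq"] for v in true_positives))


def rescue_by_consensus(false_positives_by_pipeline, min_depth, min_mapq,
                        min_pipelines=2):
    """Return sites that pass both thresholds in at least `min_pipelines`
    of the pipelines that reported them."""
    support = Counter()
    for calls in false_positives_by_pipeline.values():
        for site, call in calls.items():
            if call["depth"] >= min_depth and call["mapq"] >= min_mapq:
                support[site] += 1
    return {site for site, n in support.items() if n >= min_pipelines}


def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical toy data: sites are (chromosome, position) tuples.
true_positives = [{"depth": 30, "mapq": 60}, {"depth": 50, "mapq": 60}]
false_positives = {
    "bwa_gatk": {("chr1", 1000): {"depth": 45, "mapq": 60},
                 ("chr1", 2000): {"depth": 10, "mapq": 20}},
    "bowtie2_varscan": {("chr1", 1000): {"depth": 42, "mapq": 60}},
    "novoalign_bcftools": {("chr1", 3000): {"depth": 80, "mapq": 60}},
}

min_depth, min_mapq = thresholds_from_true_positives(true_positives)
rescued = rescue_by_consensus(false_positives, min_depth, min_mapq)
print(rescued)  # only ("chr1", 1000) is supported by two pipelines
```

In this toy example the site at position 3000 passes the thresholds but is reported by a single pipeline, so it is not rescued; requiring agreement is what keeps the rescue step from reintroducing false positives.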

Flow-chart for Consensus variant prediction

Mycobacterium project

Tuberculosis (TB) is one of the leading causes of death due to infectious disease globally. Prevention and reduction of transmission are the key strategies for improving the control of TB, and both require sensitive diagnosis at the early stages of the disease. This is the most challenging issue, because specimens for the detection of Mycobacterium are not always readily obtainable. Additionally, sputum culture takes several weeks, and the results are not always sensitive. Therefore, non-invasive biomarkers with high sensitivity, specificity and reproducibility are important for the early diagnosis of TB. The types of biomarkers studied and identified include antibodies, cytokines, metabolic activity markers, mycobacterial antigens and volatile organic compounds. No meta-analyses have been performed to date owing to the between-study heterogeneity of the samples.

Hence, the primary objective of this project is to build a robust methodology to identify lineage-specific single nucleotide polymorphisms (SNPs) for Mycobacterium tuberculosis complex (MTBC) isolates, which would help in better disease management. This study would enable identification of novel lineage-specific SNPs for each MTBC lineage and sub-lineage using whole genome sequences of MTBC isolates available in the public domain.

With Next Generation Sequencing (NGS) technologies now readily available, the SNPs identified will help predict the type and lineage of a Mycobacterium isolate accurately, which will in turn aid in its accurate treatment.
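
One simple way to express the idea of a lineage-specific SNP is as a set operation: a SNP is specific to a lineage if every isolate of that lineage carries it and no isolate of any other lineage does. The sketch below illustrates this under that assumed definition; the function name, isolate names and genome positions are hypothetical, and the project's actual methodology may use different criteria.

```python
def lineage_specific_snps(isolate_snps, isolate_lineage):
    """Map each lineage to the SNP positions carried by all of its isolates
    and by no isolate of any other lineage."""
    members = {}
    for isolate, lineage in isolate_lineage.items():
        members.setdefault(lineage, []).append(isolate)

    result = {}
    for lineage, isolates in members.items():
        # SNPs shared by every isolate of this lineage
        shared = set.intersection(*(set(isolate_snps[i]) for i in isolates))
        # SNPs seen in any isolate of any other lineage
        elsewhere = set()
        for isolate, other in isolate_lineage.items():
            if other != lineage:
                elsewhere |= set(isolate_snps[isolate])
        result[lineage] = shared - elsewhere
    return result


# Hypothetical toy input: SNP positions per isolate and a lineage label each.
isolate_snps = {
    "iso1": {101, 202, 303},
    "iso2": {101, 202},
    "iso3": {202, 404},
    "iso4": {202, 404, 505},
}
isolate_lineage = {"iso1": "L5.1", "iso2": "L5.1", "iso3": "L5.2", "iso4": "L5.2"}

print(lineage_specific_snps(isolate_snps, isolate_lineage))
```

Position 202 is carried by every isolate and therefore marks neither sub-lineage, while 303 and 505 are carried by only one isolate each and are excluded as isolate-specific.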

Phylogeny of L5 isolates

Population stratification of L5.1 lineage at k=6

Phylogenetic tree of Mycobacterium

Population stratification of L5.2 lineage at k=4

SARS-CoV-2 Genomics

Since its emergence in the city of Wuhan in late 2019, the SARS-CoV-2 virus has spread widely, causing a global pandemic. Despite all efforts, the virus has continued to mutate rapidly and spread across the globe, infecting hundreds of millions of individuals.

The likelihood of mutation in a virus increases when it circulates widely in a population and causes many infections. SARS-CoV-2 is an RNA virus and is prone to frequent genetic variation, which gives rise to diversity and to different lineages of the virus. Multiple lineages of SARS-CoV-2 have been reported from across the world and in India. Initial studies reported two ancestral lineages of SARS-CoV-2, A and B, of which lineage B is reported to have higher transmissibility and infectivity. Lineage B has acquired several further mutations in the receptor-binding domain of the spike protein and has diversified into new sub-lineages, viz., B.1.1.7 (Alpha), B.1.351 (Beta) and B.1.617.2 (Delta), to name a few of high concern. These recently emerged variants have distinct biological properties that confer higher infectivity, increased transmission, more severe disease, re-infection and immune escape, and are therefore a cause for concern. Genomic studies of SARS-CoV-2 in India have revealed the introduction of several lineages and their spread. However, predicting a newly emerging strain or lineage of the virus has remained difficult.

Population stratification analysis can help in understanding the diversification of SARS-CoV-2 into the recently reported lineages and can aid in predicting emerging strains that show evidence of genomic admixture. Hence, to understand the different lineages of SARS-CoV-2 circulating across India and the emergence of new lineages, a population stratification study of the virus has been initiated. It uses large-scale computational facilities capable of handling the large volumes of genomic data deposited in publicly available sources such as GISAID. This study would also facilitate a better understanding of the evolutionary dynamics, along with prediction of emerging strains of SARS-CoV-2 in India, based on the underlying variation across the sequenced viral genomes.
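
To give a feel for a non-model-based grouping of viral genomes, the sketch below clusters samples by pairwise Hamming distance between their variant profiles, using threshold-based single linkage. This is a deliberately simple stand-in chosen for illustration; it is not the method used in the study, which is not detailed here, and real stratification analyses typically rely on approaches such as principal component analysis or admixture models. The sample names, profiles and distance threshold are hypothetical.

```python
def hamming(a, b):
    """Number of variant sites at which two equal-length profiles differ."""
    if len(a) != len(b):
        raise ValueError("profiles must cover the same variant sites")
    return sum(x != y for x, y in zip(a, b))


def single_linkage_groups(profiles, max_distance):
    """Group samples so that two samples share a group whenever they are
    linked by a chain of pairs whose distance is <= max_distance."""
    names = list(profiles)
    parent = {n: n for n in names}

    def find(n):
        # Union-find root lookup with path halving
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if hamming(profiles[a], profiles[b]) <= max_distance:
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical profiles: 1 = variant allele present at a site, 0 = absent.
profiles = {
    "s1": [0, 0, 1, 1, 0, 0],
    "s2": [0, 0, 1, 1, 0, 1],
    "s3": [1, 1, 0, 0, 1, 1],
    "s4": [1, 1, 0, 0, 1, 0],
}
print(single_linkage_groups(profiles, max_distance=1))
```

With these toy profiles the samples fall into two groups, s1 with s2 and s3 with s4. A sample carrying a mixture of both groups' variants would sit at an intermediate distance from each, which is the kind of signal an admixture analysis is meant to quantify.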

Population stratification using model and non-model based approaches

Sequence-based classifier

Angiotensin-converting enzyme 2 (ACE2) has been established as the host receptor of SARS-CoV-2, the causative agent of the COVID-19 pandemic. Under normal conditions, the ACE2 receptor is found in lung, heart, kidney and gut cells and regulates blood pressure and inflammation. Increased expression of ACE2 is reported to protect against cardiac injury in adult humans. ACE2 inhibitors are being explored as potential drugs for combating COVID-19. In this scenario, our study aims to understand the population-based variant profile of this gene using the genomic data available in the 1000 Genomes Project, with a special emphasis on Indian populations.

Artificial intelligence (AI) methods comprise various machine learning (ML) algorithms that hold promise for assisting humans in the analysis of large, complex data sets such as human genomic data. ML methods can be divided into supervised and unsupervised methods. Supervised methods are trained on labelled examples and are then used to predict labels for other examples, whereas unsupervised methods find patterns in data sets without the use of labels. To analyze the ACE2 receptor gene in the human genome, we have designed an AI framework that uses semi-supervised ML layers, which combine supervised and unsupervised ML approaches. This technique would help in leveraging patterns in ACE2 variant data to improve the power of training and prediction when classifying ACE2 genomic data with neural networks or deep neural networks.
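
The semi-supervised idea can be illustrated with self-training, in which a model fitted on a few labelled examples assigns pseudo-labels to the unlabelled examples it is confident about and is then refitted. The sketch below uses a nearest-centroid classifier in place of the neural networks mentioned above, purely to keep the example short; the feature vectors (imagined as allele counts at ACE2 variant sites), the class labels and the confidence margin are all hypothetical.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(points) for col in zip(*points)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_centroids(labeled):
    """Compute one centroid per class from (vector, label) pairs."""
    by_label = {}
    for vec, lab in labeled:
        by_label.setdefault(lab, []).append(vec)
    return {lab: centroid(vecs) for lab, vecs in by_label.items()}


def predict(centroids, vec):
    """Return the label of the nearest centroid."""
    return min(centroids, key=lambda lab: sq_dist(vec, centroids[lab]))


def self_train(labeled, unlabeled, rounds=5, margin=1.0):
    """Pseudo-label unlabelled vectors whose nearest centroid beats the
    runner-up by at least `margin` (in squared distance), then refit."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        centroids = fit_centroids(labeled)
        remaining = []
        for vec in pool:
            ranked = sorted((sq_dist(vec, c), lab) for lab, c in centroids.items())
            if len(ranked) == 1 or ranked[1][0] - ranked[0][0] >= margin:
                labeled.append((vec, ranked[0][1]))   # confident: pseudo-label
            else:
                remaining.append(vec)                 # ambiguous: leave unlabelled
        if len(remaining) == len(pool):
            break                                     # nothing new was labelled
        pool = remaining
    return fit_centroids(labeled), labeled


# Hypothetical toy data with two features per sample.
seed = [([0, 0], "group_1"), ([4, 4], "group_2")]
unlabeled = [[0, 1], [4, 3], [2, 2]]
centroids, expanded = self_train(seed, unlabeled)
print(predict(centroids, [1, 0]))
```

In this toy run the point [2, 2] is equidistant from both classes and is never pseudo-labelled, which shows how the margin keeps uncertain samples from reinforcing a wrong guess.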

Flow-chart of Sequence-based classifier

Rule-based integration of RNA-Seq analysis tools for identification of novel transcripts

Recent evidence suggests that a larger portion of the genome is transcribed than previously anticipated, giving rise to a large number of unknown or novel transcripts. Identification of novel transcripts can provide key insights into important cellular functions as well as the molecular mechanisms underlying complex diseases such as cancer. RNA-Seq has emerged as a powerful tool to detect novel transcripts that previous profiling techniques failed to identify. A number of tools are available for identifying novel transcripts at different levels. Read mappers such as TopHat, MapSplice and SOAPsplice predict novel junctions, which are indicators of novel transcripts. Cufflinks assembles novel transcripts based on alignment information, and Oases performs de novo construction of transcripts. A common limitation of all these tools is over-prediction, that is, the prediction of a sizable number of spurious or false positive novel transcripts. We propose an approach that integrates information from all the above sources and simultaneously scrutinizes false positives to identify true novel transcripts with high confidence. To demonstrate this approach, simulated datasets with varying read lengths and coverage were created, including a target set of 200 transcripts to be identified as novel. Of these, 114 novel transcripts from the target set were recovered. The approach was also tested on the breast cancer cell line MCF-7, which led to the identification of novel transcribed regions that mapped well to long non-coding RNA (lncRNA) in the recent annotation Homo sapiens GRCh37.67.gtf, thereby affirming this approach for detection of high-confidence novel transcripts.
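
A rule of the kind described above might be sketched as follows. The specific rule shown (a candidate must be reported by at least one assembler and have novel junctions from at least two read mappers) is an assumption made for illustration, not the published rule set; the candidate names and their tool support are hypothetical. The last lines work out the recovery rate implied by the simulation figures quoted above.

```python
JUNCTION_TOOLS = {"TopHat", "MapSplice", "SOAPsplice"}   # novel-junction evidence
ASSEMBLERS = {"Cufflinks", "Oases"}                      # transcript-level evidence


def high_confidence_novel(candidates, min_junction_tools=2):
    """Keep candidates reported by at least one assembler and supported by
    novel junctions from at least `min_junction_tools` read mappers."""
    kept = []
    for name, tools in candidates.items():
        tools = set(tools)
        has_assembly = bool(tools & ASSEMBLERS)
        junction_support = len(tools & JUNCTION_TOOLS)
        if has_assembly and junction_support >= min_junction_tools:
            kept.append(name)
    return sorted(kept)


# Hypothetical candidates mapped to the tools that support them.
candidates = {
    "tx_A": ["TopHat", "MapSplice", "Cufflinks"],
    "tx_B": ["TopHat", "Oases"],
    "tx_C": ["TopHat", "MapSplice", "SOAPsplice"],
    "tx_D": ["MapSplice", "SOAPsplice", "Oases", "Cufflinks"],
}
print(high_confidence_novel(candidates))

# Recovery rate from the simulation: 114 of the 200 target transcripts.
recovered, target = 114, 200
print(f"recovered fraction = {recovered / target:.2f}")
```

Tightening `min_junction_tools` trades recovery for fewer spurious calls, which is the balance the integrated approach is meant to manage.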

Flow-chart for identification of novel transcripts