GIW/InCoB 2015 Talk Abstracts

Morning, Day 1, Sept. 9, 2015

Spectral Processing

J73 ⏰ Preprocess and condensation of Raman spectrum for single-cell phenotype analysis

Xuetao Wang, CUDA Research Centre, Qingdao, Shandong, China
Shiwei Sun, Key Lab of Intelligent Information Processing, Institute of Computing Technology of the Chinese Academy of Sciences, Beijing, China
Lihui Ren, CAS Key Laboratory of Biofuels and Shandong Key Laboratory of Energy Genetics, Bioinformatics Group of Single-Cell Center, Qingdao Institute of Bioenergy and Bioprocess Technology, Chinese Academy of Sciences, Qingdao, Shandong, China
Xiaoquan Su, CUDA Research Centre, Qingdao, Shandong, China
Dongbo Bu^†, Key Lab of Intelligent Information Processing, Institute of Computing Technology of the Chinese Academy of Sciences, Beijing, China
Kang Ning^†, CUDA Research Centre, Qingdao, Shandong, China
Xin Gao, Computational Bioscience Research Center, King Abdullah University of Science and Technology, Saudi Arabia

Background: In recent years, high-throughput, non-invasive Raman spectrometry has matured into an effective approach for identifying individual cells by species, even in complex, mixed populations. Raman profiling is an appealing optical microscopic method to achieve this. To fully utilize Raman profiling for single-cell analysis, an extensive understanding of Raman spectra is necessary to answer questions such as which filtering methodologies are effective for pre-processing Raman spectra, which strains can be distinguished by Raman spectra, and which features serve best as Raman-based biomarkers for single cells.
Results: In this work, we propose an approach called rDisc that discretizes the original Raman spectrum into only a few (usually fewer than 20) representative peaks (Raman shifts). The approach removes noise and condenses the original spectrum. In particular, effective signal processing procedures were designed to eliminate noise, utilising wavelet transform denoising, baseline correction, and signal normalization. In the discretizing process, representative peaks were selected to significantly decrease the Raman data size. More importantly, the selected peaks are suitable to serve as key biological markers to differentiate species and other cellular features. Additionally, the classification performance of discretized spectra was found to be comparable to that of the full spectrum, which has more than 1000 Raman shifts. Overall, the discretized spectrum needs about 5% of the storage space of a full spectrum and the processing speed is considerably faster. This makes rDisc clearly superior to other methods for single-cell classification.
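
The processing chain described above (denoise, correct the baseline, normalize, keep a few representative peaks) can be sketched as follows. This is a minimal illustration and not the rDisc implementation: a moving average stands in for wavelet denoising, the baseline is assumed to be a straight line between the end points, and all function names and parameters are hypothetical.

```python
def smooth(values, window=5):
    """Moving-average smoothing (a simple stand-in for wavelet denoising)."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def correct_baseline(values):
    """Subtract a straight line joining the first and last points (assumed baseline)."""
    n = len(values)
    first, last = values[0], values[-1]
    return [v - (first + (last - first) * i / (n - 1)) for i, v in enumerate(values)]


def normalize(values):
    """Scale so that the largest absolute intensity is 1."""
    peak = max(abs(v) for v in values)
    return [v / peak for v in values] if peak else list(values)


def top_peaks(shifts, values, k=20):
    """Return the k strongest local maxima as (shift, intensity), sorted by shift."""
    maxima = [
        (shifts[i], values[i])
        for i in range(1, len(values) - 1)
        if values[i - 1] < values[i] >= values[i + 1]
    ]
    strongest = sorted(maxima, key=lambda p: p[1], reverse=True)[:k]
    return sorted(strongest)


def discretize(shifts, intensities, k=20, window=5):
    """Condense a full spectrum into at most k representative peaks."""
    cleaned = normalize(correct_baseline(smooth(intensities, window)))
    return top_peaks(shifts, cleaned, k)
```

Calling `discretize(shifts, intensities, k=20)` on a spectrum with more than 1000 Raman shifts returns at most 20 (shift, intensity) pairs, which is where the storage saving comes from.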

Background: Mass Spectrometry (MS) is a ubiquitous analytical tool in biological research and is used to measure the mass-to-charge ratio of bio-molecules. Peak detection is the essential first step in MS data analysis. Precise estimation of peak parameters such as peak summit location and peak area is critical to identify underlying bio-molecules and to estimate their abundances accurately. We propose a new method to detect and quantify peaks in mass spectra. It uses dual-tree complex wavelet transformation along with Stein’s unbiased risk estimator for spectra smoothing. Then, a new method, based on the modified Asymmetric Pseudo-Voigt (mAPV) model and hierarchical particle swarm optimization, is used for peak parameter estimation.
Results: Using simulated data, we demonstrated the benefit of using the mAPV model over Gaussian, Lorentz and Bi-Gaussian functions for MS peak modelling. The proposed mAPV model achieved the best fitting accuracy for asymmetric peaks, with lower percentage errors in peak summit location estimation, which were 0.17% to 4.46% less than those of the other models. It also outperformed the other models in peak area estimation, delivering lower percentage errors, which were about 0.7% less than those of its closest competitor, the Bi-Gaussian model. In addition, using data generated from a MALDI-TOF computer model, we showed that the proposed overall algorithm outperformed the existing methods, mainly in terms of sensitivity. It achieved a sensitivity of 85%, compared to 77% and 71% for the two benchmark algorithms, the continuous wavelet transformation based method and Cromwell, respectively.
Conclusions: The proposed algorithm is particularly useful for peak detection and parameter estimation in MS data with overlapping peak distributions and asymmetric peaks. The algorithm is implemented using MATLAB and the source code is freely available.
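
To make the idea of an asymmetric peak model concrete, the sketch below evaluates a generic asymmetric pseudo-Voigt profile: a weighted sum of a Lorentzian and a Gaussian with a different width on each side of the summit. The exact form of the authors' mAPV model is not given in the abstract, so this parameterisation (and the names `width_left`, `width_right`, `eta`) is an assumption, and the area is obtained here by simple trapezoidal integration rather than by the authors' method.

```python
import math


def pseudo_voigt(x, center, height, width_left, width_right, eta):
    """Asymmetric pseudo-Voigt: eta * Lorentzian + (1 - eta) * Gaussian.

    width_left / width_right are the full widths at half maximum used on the
    left and right of the summit; eta in [0, 1] is the Lorentzian fraction.
    """
    fwhm = width_left if x < center else width_right
    d = (x - center) / fwhm
    gauss = math.exp(-4.0 * math.log(2.0) * d * d)
    lorentz = 1.0 / (1.0 + 4.0 * d * d)
    return height * (eta * lorentz + (1.0 - eta) * gauss)


def peak_area(params, lo, hi, steps=10000):
    """Trapezoidal estimate of the area under the peak between lo and hi."""
    h = (hi - lo) / steps
    total = 0.5 * (pseudo_voigt(lo, *params) + pseudo_voigt(hi, *params))
    for i in range(1, steps):
        total += pseudo_voigt(lo + i * h, *params)
    return total * h
```

With `params = (center, height, width_left, width_right, eta)`, the profile equals `height` at the summit and half of it at `center - width_left / 2` and `center + width_right / 2`, so unequal widths produce the skewed peaks that symmetric Gaussian or Lorentz models cannot fit.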

Transcription Factor Pairs

Background: Biologists are puzzled by the extremely low percentage (3%) of the binding targets of a yeast transcription factor (TF) affected when the TF is knocked out, a phenomenon observed by comparing the TF binding dataset and TF knockout effect dataset.
Results: This study gives a plausible biological explanation of this counterintuitive phenomenon. Our analyses find that TFs with high functional redundancy show significantly lower percentage than do TFs with low functional redundancy. This suggests that functional redundancy may lead to one TF compensating for another, thus masking the TF knockout effect on the binding targets of the knocked-out TF. In addition, we show that seven classes of genes (lowly expressed genes, TATA box-less genes, genes containing a nucleosome-free region immediately upstream of the transcriptional start site (TSS), genes with low transcriptional plasticity, genes with a low number of bound TFs, genes with a low number of TFBSs, and genes with a short average distance of TFBSs to the TSS) are insensitive to the knockout of their promoter-binding TFs, providing clues for finding other biological explanations of the surprisingly low percentage of the binding targets of a TF affected when the TF is knocked out.
Conclusions: This study shows that one property of TFs (functional redundancy) and seven properties of genes (expression level, TATA box, nucleosome, transcriptional plasticity, the number of bound TFs, the number of TFBSs, and the average distance of TFBSs to the TSS) may be useful for explaining a counterintuitive phenomenon: most binding targets of a yeast transcription factor are not affected when the transcription factor is knocked out.
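
The percentage discussed above is simply the share of a TF's binding targets that also appear among the genes affected by its knockout. A minimal sketch, assuming both datasets have already been reduced to sets of gene identifiers (the function and argument names are hypothetical):

```python
def percent_affected(binding_targets, knockout_affected):
    """Percentage of a TF's binding targets that are affected in its knockout."""
    if not binding_targets:
        raise ValueError("TF has no binding targets")
    hit = binding_targets & knockout_affected  # genes present in both datasets
    return 100.0 * len(hit) / len(binding_targets)
```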

NOTE: This talk will combine results from two related papers with the following abstracts:
Background: Computational identification of cooperative transcription factor (TF) pairs helps understand the combinatorial regulation of gene expression in eukaryotic cells. Many advanced algorithms have been proposed to predict cooperative TF pairs in yeast. However, it is still difficult to conduct a comprehensive and objective performance comparison of different algorithms because of lacking sufficient performance indices and adequate overall performance scores. To solve this problem, in our previous study (published in BMC Systems Biology 2014), we adopted/proposed eight performance indices and designed two overall performance scores to compare the performance of 14 existing algorithms for predicting cooperative TF pairs in yeast. Most importantly, our performance comparison framework can be applied to comprehensively and objectively evaluate the performance of a newly developed algorithm. However, to use our framework, researchers have to put a lot of effort to construct it first. To save researchers time and effort, here we develop a web tool to implement our performance comparison framework, featuring fast data processing, a comprehensive performance comparison and an easy-to-use web interface.
Results: The developed tool is called PCTFPeval (Predicted Cooperative TF Pair evaluator), written in PHP and Python programming languages. The friendly web interface allows users to input a list of predicted cooperative TF pairs from their algorithm and select (i) the compared algorithms among the 15 existing algorithms, (ii) the performance indices among the eight existing indices, and (iii) the overall performance scores from two possible choices. The comprehensive performance comparison results are then generated in tens of seconds and shown as both bar charts and tables. The original comparison results of each compared algorithm and each selected performance index can be downloaded as text files for further analyses.
Conclusions: Allowing users to select eight existing performance indices and 15 existing algorithms for comparison, our web tool benefits researchers who are eager to comprehensively and objectively evaluate the performance of their newly developed algorithm. Thus, our tool greatly expedites the progress in the research of computational identification of cooperative TF pairs.

Background: Transcriptional regulation of gene expression in eukaryotes is usually accomplished by cooperative transcription factors (TFs). Computational identification of cooperative TF pairs has become a hot research topic and many algorithms have been proposed in the literature. A typical algorithm for predicting cooperative TF pairs has two steps. (Step 1) Define the targets of each TF under study. (Step 2) Design a measure for calculating the cooperativity of a TF pair based on the targets of these two TFs. While different algorithms have distinct sophisticated cooperativity measures, the targets of a TF are usually defined using ChIP-chip data. However, there is an inherent weakness in using ChIP-chip data to define the targets of a TF. ChIP-chip analysis can only identify the binding targets of a TF but it cannot distinguish the true regulatory from the binding but non-regulatory targets of a TF.
Results: This work is the first study which investigates whether the performance of computational identification of cooperative TF pairs could be improved by using a more biologically relevant way to define the targets of a TF. For this purpose, we propose four simple algorithms, all of which consist of two steps. (Step 1) Define the targets of a TF using (i) ChIP-chip data in the first algorithm, (ii) TF binding data in the second algorithm, (iii) TF perturbation data in the third algorithm, and (iv) the intersection of TF binding and TF perturbation data in the fourth algorithm. Compared with the first three algorithms, the fourth algorithm uses a more biologically relevant way to define the targets of a TF. (Step 2) Measure the cooperativity of a TF pair by the statistical significance of the overlap of the targets of these two TFs using the hypergeometric test. By adopting four existing performance indices, we show that the fourth proposed algorithm (PA4) significantly outperforms the other three proposed algorithms. This suggests that the computational identification of cooperative TF pairs is indeed improved when using a more biologically relevant way to define the targets of a TF. Strikingly, the prediction results of our simple PA4 are more biologically meaningful than those of the 12 existing sophisticated algorithms in the literature, all of which used ChIP-chip data to define the targets of a TF. This suggests that properly defining the targets of a TF may be more important than designing sophisticated cooperativity measures. In addition, our PA4 has the power to predict several experimentally validated cooperative TF pairs, which have not been successfully predicted by any existing algorithms.
Conclusions: This study shows that the performance of computational identification of cooperative TF pairs could be improved by using a more biologically relevant way to define the targets of a TF. The main contribution of this study is not to propose another new algorithm but to provide a new way of thinking about research on computational identification of cooperative TF pairs. Researchers should put more effort into properly defining the targets of a TF rather than focusing solely on designing sophisticated cooperativity measures.
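
The two-step scheme described above can be illustrated with a short sketch: targets are defined as the intersection of binding and perturbation data (as in the fourth algorithm), and cooperativity is scored by the upper-tail hypergeometric p-value of the target overlap. The function names and the `genome_size` argument are assumptions made for this example; the abstract does not specify the authors' implementation.

```python
from math import comb


def define_targets(binding, perturbation):
    """Targets supported by both TF binding and TF perturbation data."""
    return binding & perturbation


def overlap_p_value(targets_a, targets_b, genome_size):
    """P(overlap >= observed) under the hypergeometric null.

    The null model draws len(targets_b) genes without replacement from a
    genome of genome_size genes, len(targets_a) of which are targets of TF A.
    """
    a, b = len(targets_a), len(targets_b)
    observed = len(targets_a & targets_b)
    tail = sum(
        comb(a, i) * comb(genome_size - a, b - i)
        for i in range(observed, min(a, b) + 1)
    )
    return tail / comb(genome_size, b)
```

A smaller p-value indicates a target overlap that is less likely to arise by chance, and hence a stronger candidate cooperative pair.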

Software Demos

Motivation: Increasing evidence suggests that most of the genome is transcribed into RNAs, but many of them are not translated into proteins. All those RNAs that do not become proteins are called “non-coding RNAs (ncRNAs)”, which outnumber protein-coding genes. Interestingly, these ncRNAs are shown to be more tissue-specifically expressed than protein-coding genes. Given that tissue-specific expression of transcripts suggests their importance in the expressing tissue, researchers are conducting biological experiments to elucidate the function of such ncRNAs. Owing greatly to the advancement of next generation techniques, especially RNA-seq, the amount of high-throughput data is increasing rapidly. However, due to the complexity of the data as well as its high volume, it is not easy to re-analyze such data to extract tissue-specific expression of ncRNAs from published datasets.
Results: Here, we introduce a new knowledge database called “C-It-Loci”, which allows a user to screen for tissue-specific transcripts across three organisms: human, mouse, and zebrafish. C-It-Loci is very intuitive and easy to use to identify not only protein-coding genes but also ncRNAs from various tissues. C-It-Loci defines homology through sequence and positional conservation to allow for the extraction of species-conserved loci. C-It-Loci can be used as a starting point for further biological experiments.
Availability: C-It-Loci is freely available online without registration.

RNA editing is a process in which RNA molecules are modified after transcription by RNA polymerase. It is understood that RNA editing events result in diversification of RNAs and proteins without affecting the genomic sequence. In humans, the most common type of editing is the conversion of adenosine residues into inosine (A-to-I RNA editing), which is modulated by adenosine deaminases acting on RNA (ADARs). It is reported that A-to-I editing occurs mostly in the noncoding regions that contain repetitive elements (Ramaswami, Nat Methods, 2012). Furthermore, a recent study shows that a family member of ADARs, "ADAR1", binds to Dicer to facilitate the cleavage of pre-microRNAs and the loading of miRNAs onto RNA-induced silencing complexes (Ota, Cell, 2013), highlighting the importance of RNA editing in the context of noncoding RNAs. Since RNA editing is a common phenomenon in various cell types, there are numerous occasions in which the results from deep sequencing do not match the target genomic sequence exactly, and the mismatches are not known to be results of mutations (e.g. SNPs). Up until now, several strategies have been proposed to analyze RNA-seq data for RNA editing events. All these reported strategies require extensive knowledge of programming and rely on various other programs, which makes it hard for non-programmers to fully utilize such strategies. To overcome these difficulties, we introduce an easy-to-use bioinformatics tool called "RNAeditor". Without extensive knowledge of bioinformatics and programming, users can analyze their RNA-seq data to extract RNA editing sites. RNAeditor utilizes various tools that are well validated and commonly used in deep sequencing analyses (e.g. BWA, GATK), and is equipped with various filters to separate RNA editing events from mutations of the genome. Furthermore, graphical generators are implemented in RNAeditor to create figures that can be used for publications.

Cancerogenesis is driven by mutations leading to aberrant functioning of a complex network of molecular interactions and simultaneously affecting multiple cellular functions. Therefore, the successful application of bioinformatics and systems biology methods for analysis of high-throughput data in cancer research heavily depends on the availability of global and detailed reconstructions of signaling networks amenable to computational analysis. We present here the Atlas of Cancer Signaling Network (ACSN), an interactive and comprehensive map of molecular mechanisms implicated in cancer. The resource includes tools for map navigation, visualization and analysis of molecular data in the context of signaling network maps. Constructing and updating ACSN involves careful manual curation of molecular biology literature and participation of experts in the corresponding fields. The cancer-oriented content of ACSN is completely original and covers major mechanisms involved in cancer progression, including DNA Repair, Cell Survival, Apoptosis, Cell Cycle, EMT and Cell Motility. Cell signaling mechanisms are depicted in detail, together creating a seamless ‘geographic-like’ map of molecular interactions frequently deregulated in cancer. The map is browsable using the NaviCell technology (a web interface and APIs based on Google Maps) and includes the semantic zooming principle. The associated web-blog provides a forum for commenting on and curating the ACSN content. ACSN allows users to upload heterogeneous omics data on top of the maps for visualization and functional analyses. We suggest several scenarios for ACSN application in cancer research, particularly for visualizing high-throughput data, starting from siRNA-based screening results or mutation frequencies to innovative ways of exploring transcriptomes and phosphoproteomes.
Integration and analysis of these data in the context of ACSN may help interpret their biological significance and formulate mechanistic hypotheses. ACSN may also support patient stratification, prediction of treatment response and resistance to cancer drugs, as well as design of novel treatment strategies.

A recent eMarketer report estimated that by the year 2015, the total number of people using mobile devices will reach 4.77 billion. Mobile phones are no longer devices which you use only to make phone calls. They are everything you need and everything you will need. Encapsulated within the lightweight portable gadget are your diary, alarm clock, “shopping mall”, bank, entertainment centre and even personalized healthcare advisor. With millions of mobile apps available for download from the iTunes and Google Play stores, the power of the modern smartphone is almost limitless. As modern man transcends Maslow’s hierarchy of needs beyond mere survival, there is an increased emphasis on healthy lifestyle and healthy living. As such, there is an increased interest in the development and usage of health applications for managing an individual’s health and wellness. Such tools not only allow you to track your daily activities and calories burnt, they could also be digital buddies to monitor specific diseases such as type 2 diabetes mellitus. When combined with the outcomes of large data analytics, we believe that such applications can be made more relevant and useful by providing more tools, based on historical data, for managing one’s health. Thus, in this demo, we will be demonstrating one of the two mobile health applications, developed by our group, which can be used to enhance the health of modern man. The first is a sexual health education and HIV risk modelling app called THINK, which provides users with sexual health information and a mobile tool to estimate one’s risk of contracting HIV (stratified into Low, Moderate & High risk categories), accompanied by suitable suggestions and recommendations to lower the risk. The second is a mobile diabetic buddy app named BG-PRED, which helps diabetic patients manage their diabetes by predicting the subsequent 4 hour blood glucose levels and providing timely reminders to prevent hypo- or hyper-glycemia.

High Dimensional Data / Feature Selection

Background: Principal component analysis is used to summarize matrix data, such as found in transcriptome, proteome or metabolome and medical examinations, into fewer dimensions by fitting the matrix to orthogonal axes. Although this methodology is frequently used in multivariate analyses, it has disadvantages when applied to experimental data. First, the identified principal components have poor generality; since the size and directions of the components are dependent on the particular data set, the components are valid only within the data set. Second, the method is sensitive to experimental noise and bias between sample groups. It cannot reflect the experimental design that is planned to manage the noise and bias; rather, it estimates the same weight and independence to all the samples in the matrix. Third, the resulting components are often difficult to interpret. To address these issues, several options were introduced to the methodology. The resulting components were scaled to unify their size unit. Also, the axes directions were identified using training data sets and shared among experiments. This training data reflects the design of experiments, and its preparation allows noise to be reduced and group bias to be removed.
Results: The effects of these options were observed in microarray experiments, and showed an improvement in the separation of groups and robustness to noise. The range of scaled scores was unaffected by the number of items. Additionally, unknown samples were appropriately classified using pre-arranged axes. Furthermore, these axes well reflected the characteristics of groups in the experiments. As was observed, the scaling of the components and sharing of axes enabled comparisons of the components beyond experiments. The use of training data reduced the effects of noise and bias in the data, facilitating the physical interpretation of the principal axes.
Conclusions: Together, the introduced options resulted in improved generality and objectivity of the analytical results.
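
The two options described above, learning axis directions from training data and scaling the resulting scores, can be sketched as follows. This is an illustration under stated assumptions and not the authors' procedure: the first axis is found by power iteration on the training covariance matrix, and scores are divided by the square root of the number of items so that their range does not grow with the number of items. That particular scaling factor, and the names `fit_axis` and `score`, are assumptions.

```python
import math


def fit_axis(training, iterations=100):
    """Return (mean, axis): item means and the first principal axis of the training set."""
    n, m = len(training), len(training[0])
    mean = [sum(row[j] for row in training) / n for j in range(m)]
    centered = [[row[j] - mean[j] for j in range(m)] for row in training]
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(m)]
        for i in range(m)
    ]
    # Arbitrary start vector; it must not be orthogonal to the leading axis.
    axis = [1.0] + [0.0] * (m - 1)
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * axis[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(v * v for v in nxt))
        axis = [v / norm for v in nxt]
    return mean, axis


def score(sample, mean, axis):
    """Project a sample (training or unseen) onto the shared axis, scaled by sqrt(m)."""
    m = len(axis)
    raw = sum((sample[j] - mean[j]) * axis[j] for j in range(m))
    return raw / math.sqrt(m)  # assumed scaling, see the note above
```

Because `mean` and `axis` are fixed once from the training set, samples from a later experiment can be scored on the same axis and compared directly with earlier ones.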

Background: Transgenerational epigenetics is currently considered important in disease, but its mechanisms are not yet fully understood. Transgenerational epigenetic abnormalities expected to cause disease are likely to be initiated during development and be mediated by aberrant gene expression associated with aberrant promoter methylation that is heritable between generations. However, because methylation is removed and then re-established during development, it is not easy to identify promoter methylation abnormalities by comparing normal lineages with those expected to exhibit transgenerational epigenetic abnormalities.
Methods: This study applied the recently proposed principal component analysis based unsupervised feature extraction to the previously reported and publicly available gene expression/promoter methylation profiles of rat primordial germ cells between E13 and E16 of the F3 generation vinclozolin lineage to identify multiple genes that exhibited aberrant gene expression/promoter methylation during development.
Results: The identified genes were globally related to tumors, the prostate, kidney, testis and the immune system that were previously reported to be related to various diseases caused by transgenerational epigenetics.
Conclusions: Among the genes reported by principal component analyses based unsupervised feature extraction, we propose that chemokine signaling pathways and leucine rich repeat proteins are key factors that can initiate transgenerational epigenetic-mediated diseases, because multiple genes included in these two categories were identified in this study.

Alzheimer’s disease (AD) is a multifactorial disorder that may be diagnosed earlier using a combination of tests rather than any single test. Search algorithms and optimization techniques in combination with model evaluation techniques have been used previously to perform the selection of suitable feature sets. Previously we successfully applied a genetic algorithm (GA) with logistic regression (LR) to neuropsychological data contained within the Australian Imaging, Biomarkers and Lifestyle (AIBL) study of aging, to select cognitive tests for prediction of progression of AD. This research addresses an Adaptive Genetic Algorithm (AGA) in combination with LR for identifying the best biomarker combination for prediction of the progression to AD. The model has been explored in terms of parameter optimization to predict conversion from healthy stage to AD with high accuracy. The results have shown consistency with some of the medical research. The algorithm presented here is generic and can be extended to other data sets generated in projects that seek to identify combinations of biomarkers or other features that are predictive of disease onset or progression.

Afternoon, Day 1, Sept. 9, 2015

Chemical Informatics

Identification of compound-protein interactions (CPIs) is an important but challenging task in biomedical research. Machine learning based methods have been developed to predict new CPIs based on the known ones. Existing machine learning based approaches typically use the known CPIs as positive training samples and randomly selected unknown interactions as negative training samples (there is as yet no benchmark set of non-interacting compound-protein pairs) to build classifiers for identifying new CPIs. However, such classifiers are actually built from a noisy negative set where positive interactions may exist but are not yet identified or validated. As a result, these classifiers cannot perform as well as they should.
Instead of simply treating the unknown CPIs as negative examples, we treat them as unlabeled samples. We propose a novel method called PUCPI (an abbreviation of PU learning for Compound Protein Interaction identification) that employs biased-SVM to identify CPIs using only positive and unlabeled examples. To the best of our knowledge, this is the first work that identifies CPIs using only positive and unlabeled examples. We first collect known CPIs as positive samples and then randomly select compound-protein pairs not in the positive set as unlabeled examples. For each CPI/compound-protein pair, we extract protein domains as protein features and compound substructures as chemical features, then take the tensor product of the corresponding compound vector and protein vector as the feature vector of the CPI/compound-protein pair. After that, biased-SVM is employed to train classifiers on different datasets of CPIs and compound-protein pairs. Experimental results over various datasets show that our method outperforms six typical classifiers, including random forest, L1- and L2-regularized logistic regression, naive Bayes, SVM and k-nearest neighbor (kNN), and three existing CPI prediction models.
Source code, datasets and related documents of PUCPI are available.
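
Two ingredients of the method described above lend themselves to a short illustration: the tensor-product feature vector of a compound-protein pair, and the biased-SVM objective, which penalises errors on positive examples and on unlabeled examples with two different costs. The sketch below only evaluates that objective for a given linear model; it is not the PUCPI code, and the names `c_pos` and `c_unl` are assumptions for the two penalty parameters.

```python
def tensor_product(compound_vec, protein_vec):
    """Flattened outer product: one feature per (substructure, domain) combination."""
    return [c * p for c in compound_vec for p in protein_vec]


def biased_svm_objective(weights, bias, positives, unlabeled, c_pos, c_unl):
    """0.5 * ||w||^2 + c_pos * (hinge loss on positives) + c_unl * (hinge loss on unlabeled).

    Unlabeled pairs are treated as negatives with a smaller penalty, so that
    hidden positives among them distort the classifier less.
    """
    def hinge(x, label):
        margin = label * (sum(w * v for w, v in zip(weights, x)) + bias)
        return max(0.0, 1.0 - margin)

    regulariser = 0.5 * sum(w * w for w in weights)
    loss_pos = sum(hinge(x, +1) for x in positives)
    loss_unl = sum(hinge(x, -1) for x in unlabeled)
    return regulariser + c_pos * loss_pos + c_unl * loss_unl
```

For a compound with binary substructure vector `[1, 0, 1]` and a protein with domain vector `[1, 1]`, `tensor_product` yields a six-element vector with a 1 wherever a substructure and a domain co-occur.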

J64 ⏰ Cocktail Multiple Drug Targets Design by Attacking on the Core Network Markers of Four Cancers with Ligand-Based and Structure-Based Virtual Screening Methods

Yung-Hao Wong, NTHU, Taiwan
Chih-Lung Lin, ITRI, Taiwan
Ting-Shou Chen, ITRI, Taiwan
Chien-An Chen, ITRI, Taiwan
Pei-Shin Jian, ITRI, Taiwan
Yi-Hua Lai, NCHU, Taiwan
Lichieh Julie Chu, Molecular Medicine Research Center, Chang Gung University, Taoyuan, Taiwan
Cheng-Wei Li, NTHU, Taiwan
Jeremy Chen^†, NCHU, Taiwan
Bor-Sen Chen^†, NTHU, Taiwan

Background: Computer-aided drug design has long been applied to various cancers, but it has almost always focused on a single target. The development of systems biology has let scientists reveal more hidden mechanisms of cancers, yet attempts to apply systems biology to cancer therapy are still at the trial stage. Our lab has successfully developed various systems biology models, especially for several cancers. Based on these achievements, this is our first attempt to combine multiple-target therapy with systems biology.
Methods: In our previous study, we identified 28 significant proteins in the common core network markers of four types of cancers as the housekeeping proteins of these cancers. In this study, we rank these proteins by summing their carcinogenesis relevance values (CRV) and then use docking and pharmacophore methods to perform virtual screening of the NCI anti-cancer drug library. We also carry out further pathway analysis of these proteins with Panther and Metacore to reveal more mechanisms of these cancer housekeeping proteins.
Results: We designed several scenarios for cocktail multiple-target therapy. In the first, we identified the top 20 drugs for each of the 28 cancer housekeeping proteins and analyzed the docking poses to better understand the interaction mechanisms of these drugs. By looking for duplicates, we obtained 13 drugs that target 11 proteins simultaneously. In the second scenario, we chose the top 5 proteins with the highest summed CRV as drug targets. The pharmacophore we built was also applied to virtual screening of another drug library, “Life-Chemical”. Based on these results, wet-lab bio-scientists are free to combine these drugs for multiple-target therapy of cancers, in contrast to traditional single-target therapy.
Conclusions: Combining systems biology with computer-aided drug design could help us develop novel cocktail multiple-target therapies. We believe this will enhance efficiency and lead to a new direction for cancer therapy.

J88 ⏰ Privacy-preserving search for chemical compound databases

Kana Shimizu^†, AIST, Japan
Koji Nuida, AIST, Japan
Hiromi Arai, The University of Tokyo, Japan
Shigeo Mitsunari, Cybozu, Japan
Nuttapong Attrapadung, AIST, Japan
Michiaki Hamada, Waseda University, Japan
Koji Tsuda, The University of Tokyo, Japan
Takatsugu Hirokawa, AIST, Japan
Jun Sakuma, AIST, Japan
Goichiro Hanaoka, AIST, Japan
Kiyoshi Asai, The University of Tokyo, Japan

Background: Searching for similar compounds in a database is the most important process for in-silico drug screening. Since a query compound is an important starting point for the new drug, a query holder, who is afraid of the query being monitored by the database server, usually downloads all the records in the database and uses them in a closed network. However, a serious dilemma arises when the database holder also wants to output no information except for the search results, and such a dilemma prevents the use of many important data resources.
Results: In order to overcome this dilemma, we developed a novel cryptographic protocol that enables database searching while keeping both the query holder’s privacy and database holder’s privacy. Generally, the application of cryptographic techniques to practical problems is difficult because versatile techniques are computationally expensive while computationally inexpensive techniques can perform only trivial computation tasks. In this study, our protocol is successfully built only from an additive-homomorphic cryptosystem, which allows only addition performed on encrypted values but is computationally efficient compared with versatile techniques such as general purpose multi-party computation. In an experiment searching ChEMBL, which consists of more than 1,200,000 compounds, the proposed method was 36,900 times faster in CPU time and 12,000 times as efficient in communication size compared with general purpose multi-party computation.
Conclusion: We proposed a novel privacy-preserving protocol for searching chemical compound databases. The proposed method, easily scaling for large-scale databases, may help to accelerate drug discovery research by making full use of unused but valuable data that includes sensitive information.
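
To show what "only addition performed on encrypted values" makes possible, the toy example below uses the Paillier cryptosystem, a well-known additive-homomorphic scheme, to let a server count the bits shared between an encrypted query fingerprint and one of its own plaintext fingerprints without ever seeing the query. This is an illustration only and not the authors' protocol: the choice of Paillier, the tiny hard-coded primes, the fingerprint representation and all function names are assumptions, and the key size is far too small to be secure.

```python
import math
import random


def keygen(p=1009, q=1013):
    """Toy Paillier key pair from two small primes (insecure, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse, valid because the generator is g = n + 1
    return n, (lam, mu)


def encrypt(n, message):
    """Encrypt an integer 0 <= message < n with a fresh random factor."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return ((1 + message * n) % n2) * pow(r, n, n2) % n2


def decrypt(n, private_key, cipher):
    lam, mu = private_key
    return ((pow(cipher, lam, n * n) - 1) // n) * mu % n


def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return c1 * c2 % (n * n)


def encrypted_shared_bits(n, encrypted_query, db_fingerprint):
    """Server side: encrypted count of positions where query bit and database bit are both 1."""
    acc = 1  # a valid encryption of 0
    for cipher, bit in zip(encrypted_query, db_fingerprint):
        if bit:
            acc = add_encrypted(n, acc, cipher)
    return acc
```

The query holder encrypts each fingerprint bit with `encrypt`, sends the ciphertexts, and decrypts the single value returned by `encrypted_shared_bits`; the server only ever multiplies ciphertexts. A full protocol must additionally limit what the returned value reveals about the database, which this sketch does not attempt.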

Genomics & NGS

J77 ⏰ ARG-walker: Inference of Individual Specific Strengths of Meiotic Recombination Hotspots by Population Genomics Analysis

Chee Keong Kwoh, School of Computer Engineering, Nanyang Technological University, Singapore
Hao Chen, Singapore Immunology Network (SIgN), A*STAR, Singapore
Peng Yang, Institute for Infocomm Research (I2R), A*STAR, Singapore
Jing Guo, School of Computer Engineering, Nanyang Technological University, Singapore
Teresa Przytycka, NCBI, NLM, NIH, USA
Jie Zheng^†, Nanyang Technological University, Singapore

Background: Meiotic recombination hotspots play important roles in various aspects of genomics, but the underlying mechanisms for regulating the locations and strengths of recombination hotspots are not yet fully revealed. Most existing algorithms for estimating recombination rates from sequence polymorphism data can only output average recombination rates of a population, although there is evidence for heterogeneity in recombination rates among individuals. For genome-wide association studies (GWAS) of recombination hotspots, an efficient algorithm that estimates the individualized strengths of recombination hotspots is highly desirable.
Results: In this work, we propose a novel graph mining algorithm named ARG-walker, based on random walks on ancestral recombination graphs (ARG), to estimate individual-specific recombination hotspot strengths. Extensive simulations demonstrate that ARG-walker is able to distinguish the hot allele of a recombination hotspot from the cold allele. Integrated with output of ARG-walker, we performed GWAS on the phased haplotype data of the 22 autosome chromosomes of the HapMap Asian population samples of Chinese and Japanese (JPT+CHB). Significant cis-regulatory signals have been detected, which is corroborated by the enrichment of the well-known 13-mer motif CCNCCNTNNCCNC of PRDM9 protein. Moreover, two new DNA motifs have been identified in the flanking regions of the significantly associated SNPs (single nucleotide polymorphisms), which are likely to be new cis-regulatory elements of meiotic recombination hotspots of the human genome.
Conclusions: Our results on both simulated and real data suggest that ARG-walker is a promising new method for estimating the individual recombination variations. In the future, it could be used to uncover the mechanisms of recombination regulation and human diseases related with recombination hotspots.
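As a small illustration of how the degenerate 13-mer motif CCNCCNTNNCCNC quoted in the Results can be located in a sequence, the sketch below expands each N to any nucleotide and scans the forward strand only. The function names and the example sequence are hypothetical; the enrichment analysis itself is not reproduced here.

```python
import re

MOTIF = "CCNCCNTNNCCNC"  # 13-mer motif as given in the abstract

def motif_to_regex(motif):
    """Expand N to [ACGT]; the lookahead lets overlapping matches be reported."""
    return re.compile("(?=(" + motif.replace("N", "[ACGT]") + "))")

def find_motif(seq, motif=MOTIF):
    """Return 0-based start positions of motif matches on the forward strand."""
    pattern = motif_to_regex(motif)
    return [m.start() for m in pattern.finditer(seq.upper())]

print(find_motif("AACCTCCCTAACCACGG"))
```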

We present a new pair-wise genome alignment method, based on a simple concept of finding an optimal set of local alignments. It gains accuracy by not masking repeats, and by using a statistical model to quantify the (un)ambiguity of each alignment part. Compared to previous animal genome alignments, it aligns thousands of locations differently and with much higher similarity, strongly suggesting that the previous alignments are non-orthologous. The previous methods suffer from an overly-strong assumption of long un-rearranged blocks. The new alignments should help find interesting and unusual features, such as fast-evolving elements and micro-rearrangements, which are confounded by alignment errors.

J40 ⏰ Sprites: detection of deletions from low-coverage sequencing data by re-aligning split reads

Zhen Zhang, Central South University, China
Jianxin Wang^†, Central South University, China
Junwei Luo, Central South University, China
Xiaojun Ding, Central South University, China
Jiancheng Zhong, Central South University, China
Jun Wang, Baylor College of Medicine, USA
Fang-Xiang Wu, University of Saskatchewan, Canada
Yi Pan, Georgia State University, USA

Advances in next-generation sequencing technologies and the availability of short-read data enable the detection of structural variations (SVs). Deletions, an important type of SV, have been associated with genetic diseases. There are three types of deletions: blunt deletions, deletions with microhomologies and deletions with microinsertions. The last two types are very common in the human genome, but they are difficult to detect. Furthermore, finding deletions from low-coverage data remains challenging. It is highly desirable to develop sensitive and accurate methods to detect deletions from low-coverage data, especially deletions with microhomologies and deletions with microinsertions.
We present a novel method called Sprites, which finds deletions from low-coverage data. It aligns a whole soft-clipped read, rather than only its clipped part, to the target sequence, a segment of the reference that is determined by spanning reads, in order to find the longest prefix or suffix of the read that has a match in the target sequence. This alignment addresses the problem of deletions with microhomologies and deletions with microinsertions. Using both simulated and real data, we show that Sprites detects deletions better than other current methods in terms of F-score.
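The idea of finding the longest prefix or suffix of a read that matches the target sequence can be sketched as follows. This is a simplified illustration that assumes exact matching; the alignment actually used by Sprites is not described in the abstract, and the function names and example strings are hypothetical.

```python
def longest_prefix_match(read, target):
    """Return (length, position) of the longest read prefix found exactly in target."""
    best_len, best_pos = 0, -1
    for length in range(1, len(read) + 1):
        pos = target.find(read[:length])
        if pos == -1:
            break  # longer prefixes cannot match if this one does not
        best_len, best_pos = length, pos
    return best_len, best_pos

def longest_suffix_match(read, target):
    """Same as above for suffixes, by searching the reversed strings."""
    length, pos = longest_prefix_match(read[::-1], target[::-1])
    if length == 0:
        return 0, -1
    # Convert the position in the reversed target back to forward coordinates.
    return length, len(target) - pos - length

target = "ACGTTTGACCA"
print(longest_prefix_match("TTGAGGG", target))
print(longest_suffix_match("GGGACCA", target))
```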

Background: The potential utility of the Burrows-Wheeler transform (BWT) of a large amount of short-read data ("reads") has not been fully studied. The BWT basically serves as a lossless dictionary of reads, unlike the heuristic and lossy reads-to-genome mapping results conventionally obtained in the first step of sequence analysis. Thus, it is naturally expected to lead to the development of sensitive methods for the analysis of short-read data. Recently, one of the most active areas of research in sequence analysis has been the sensitive detection of rare genomic rearrangements from whole-genome sequencing (WGS) data of heterogeneous cancer samples. This study addresses the application of the BWT of reads to the analysis of genomic rearrangements.
Results: A new method for sensitive detection of genomic rearrangements using the BWT of reads is proposed. It consists of three steps: first, breakpoint regions, which contain breakpoints and are joined together by rearrangement, are predicted from the distribution of so-called discordant pairs by using a variant of the conjugate gradient method; second, reads partially matching the breakpoint regions are collected from the BWT of reads; and third, breakpoints are detected as branching points among the collected reads, and their precise positions are determined. The method is experimentally implemented, and its performance (i.e., sensitivity and specificity) is evaluated by using simulated data with known artificial rearrangements. It is also applied to publicly available real biological WGS data of cancer patients, and the detection results are compared with published results.
Conclusions: The BWT of short-read data, serving as a lossless dictionary of reads, enables sensitive analysis of genomic rearrangements in heterogeneous cancer-genome samples.
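The lossless nature of the BWT can be seen in a minimal sketch on a single string. This is the naive sorted-rotations construction and is only meant to illustrate the transform and its inverse; the BWT of a large read collection, as used in the study, requires dedicated construction algorithms that are not shown. The function names are illustrative.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform via sorted rotations (quadratic, for illustration)."""
    s = text + sentinel  # the sentinel must not occur in text and sorts first
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def inverse_bwt(last, sentinel="$"):
    """Rebuild the original text by repeatedly prepending the last column and sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    row = next(r for r in table if r.endswith(sentinel))
    return row[:-1]

transformed = bwt("banana")
print(transformed, inverse_bwt(transformed))
```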

Pathway and Gene Association

Background: Cellular function is represented using molecular interaction networks. Function is organised and pathways are identified based on network topology; however, this approach often fails to account for the dynamic nature of molecular interactions. Nodes engaging in spatially or temporally dependent interactions may result in a functionally diverse set of molecules being clustered into a single module. To capture biologically realistic sets of interacting molecules, we use experimentally defined pathways as spatial/temporal units of cellular activity.
Results: We defined functional profiles of yeast pathways based on a minimal set of Gene Ontology terms sufficient to represent each pathway’s genes. Gene Ontology terms were used to annotate 271 pathways, accounting for pathway multifunctionality and gene pleiotropy. Pathways were then arranged into a network, linked by shared functionality. Of the genes in our data set, 44% appeared in multiple pathways performing a diverse set of functions. Linking pathways by overlapping functionality revealed a modular network with energy metabolism forming a sparse centre, surrounded by several dense clusters of genetic and metabolic pathways. Signalling pathways formed a highly discrete branch connected to the centre of the network. Inter-pathway GIs were enriched by a factor of 5.5, indicating that these clusters are of real biological significance.
Conclusions: This representation of cellular function enables analysis of gene/protein activity in the context of specific functional roles, as an alternative to typical molecule-centric graph-based analysis. The network demonstrates the cooperation of multiple pathways to perform biological processes, grouping pathways into functionally organised clusters with interdependent outcomes.

Network analysis is a common approach for studying the genetics of diseases and biological pathways. When a set of genes is identified as being of interest in relation to a disease, say through a genome-wide association study (GWAS) or a different gene expression study, these genes are typically analyzed in the context of their protein-protein interaction (PPI) networks. Further analysis is carried out to compute the enrichment of known pathways and disease associations in the network. Having tools for such analysis at the fingertips of biologists, without the requirement for computer programming or curation of data, would accelerate the characterization of such genes of interest. Currently available tools do not integrate network and enrichment analysis with their visualizations, and most of them present results in formats that are not well suited to human cognition. The Lens for Enrichment and Network Studies of human proteins (LENS) is a web-based tool that does not require software or plugin downloads and performs network, pathway and disease enrichment analyses on genes of interest to users. The tool creates a visualization of the network, provides easy-to-read statistics on network connectivity, and displays Venn diagrams with statistical significance values of the network’s association with drugs, diseases, pathways, and GWASs. LENS is free and does not require login for use.

Background: Investigating association between genes can be used in understanding the relations of genes in biological processes. STRING and GeneMANIA are two well-known web tools which can provide a list of associated genes of a query gene based on diverse biological associations such as co-expression, co-localization, co-citation and so on. However, the transcriptional regulation association and mutant phenotype association have not been used in these two web tools. Since the comprehensive transcription factor (TF)-gene binding data, TF-gene regulation data and mutant phenotype data are available in yeast, we developed a web tool called YAGM (Yeast Associated Genes Miner) which constructed the transcriptional regulation association, mutant phenotype association and five commonly used biological associations to mine a list of associated genes of a query yeast gene.
Description: In YAGM, we collected seven kinds of datasets including TF-gene binding (TFB) data, TF-gene regulation (TFR) data, mutant phenotype (MP) data, functional annotation (FA) data, physical interaction (PI) data, genetic interaction (GI) data, and literature evidence (LE) data. Then by using the hypergeometric test to calculate the association scores of all gene pairs in yeast, we constructed seven biological associations including two transcriptional regulation associations (TFB association and TFR association), MP association, FA association, PI association, GI association, and LE association. Moreover, the expression profile association from SPELL database was also included in YAGM. When using YAGM, users can input a query gene and choose any possible subsets of the eight biological associations, then a list of associated genes of the query gene will be returned based on the chosen biological associations.
Conclusions: In this study, we presented YAGM, which provides eight biological associations for mining associated genes of a query gene in yeast. Among the eight biological associations constructed in YAGM, three (TFB association, TFR association, and MP association) are novel. By comparing its query results with those of two well-known web tools (STRING and GeneMANIA), we found that YAGM can find distinct associated genes of a query gene. That is, YAGM can provide alternative candidate associated genes for biologists to investigate experimentally. We believe that YAGM will be a useful web tool for yeast biologists. YAGM is available online.
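The hypergeometric test mentioned in the Description can be sketched as an upper-tail probability. How YAGM converts the test into an association score is not stated in the abstract, so the setup below (two genes sharing k annotations out of a universe of N) and the function name are assumptions for illustration.

```python
from math import comb

def hypergeom_pvalue(N, K, n, k):
    """P(X >= k) when n items are drawn without replacement from N, of which K are 'successes'."""
    upper = min(K, n)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

# Hypothetical example: of N = 200 TFs, gene A is bound by K = 10 and gene B by n = 8;
# they share k = 5. A small p-value suggests the overlap is unlikely to arise by chance.
print(hypergeom_pvalue(200, 10, 8, 5))
```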

Bioconductor

Bioconductor is a collection of close to 1,000 individually code-reviewed software packages, hundreds more annotation and experiment data packages, and specialized data structures for various domains. This talk will provide a high-level, up-to-date overview of Bioconductor’s offerings and recent developments in the domains of gene expression, DNA variant calling, flow cytometry, proteomics and metabolomics. It will also review options available to users wishing to analyze public data from sources such as The Cancer Genome Atlas, the Gene Expression Omnibus, and ArrayExpress, and data distributed by Bioconductor itself.

DNA methylation is an epigenetic modification of DNA that is involved in the regulation of gene expression. There are many high-throughput assays for studying DNA methylation, each of which has its own set of bioinformatic challenges. However, there are also common statistical themes, such as the strong spatial correlation of DNA methylation along the genome and that measurements from these assays are aggregates from a population of heterogeneous cells.
The Bioconductor project currently includes 38 packages for analysing DNA methylation data. This talk will introduce some of these packages and help users identify appropriate tools and methodology for their own analyses.

Transcriptome sequencing is a popular application in functional genomics research, and the Bioconductor project hosts a wide collection of tools (76+ packages) that are capable of performing a complete analysis, from read mapping, normalization and exploratory data analysis through to differential expression and pathway analysis. Another exciting application is single cell gene expression analysis, where a number of methods are evolving. This talk will provide an overview of some of the most popular packages and showcase complete workflows for RNA-seq analysis from raw data through to biologically relevant gene lists and pathways.

The ability to analyze protein expression at the single-cell level has become invaluable for identifying numerous cellular subsets. Mass cytometry (also known as cytometry by time-of-flight, CyTOF) allows for measurements of cellular heterogeneity with unprecedented dimensionality by simultaneous analysis of more than 40 proteins per cell. The cytofkit package is designed to facilitate the analysis of mass cytometry data. It automates subset discovery by using dimensionality reduction (t-SNE) and density-based clustering algorithms (DensVM). Subsequently, it utilizes ISOMAP to map progressions between different subsets. It adds the low-dimensional maps of t-SNE (ISOMAP) and cluster designations to the list of mass cytometry parameters and creates FCS (.fcs) files that include these additional parameters for subsequent analysis using FlowJo software. In addition, cytofkit provides down-sampling functions to help uncover rare cell populations. Overall, cytofkit presents a general approach for mapping cellular heterogeneity and progression from mass cytometry data.

Morning, Day 2, Sept. 10, 2015

Protein Classification

J63 ⏰ Markov chain based semi-supervised Multi-instance Multi-labeled method for protein function prediction

Qingyao Wu^†, School of Software Engineering, South China University of Technology, China
Chao Han, School of Software Engineering, South China University of Technology, China
Jian Chen, School of Software Engineering, South China University of Technology, China
Shuai Mu, School of Software Engineering, South China University of Technology, China
Huaqing Min, School of Software Engineering, South China University of Technology, China

Automated assignment of protein function has received considerable attention in recent years for genome-wide study. With the rapid accumulation of genome sequencing data sets produced by high-throughput experimental techniques, manually predicting the functional properties of proteins has become increasingly cumbersome. Such a vast amount of proteomics and genomics data can only be annotated computationally. However, automatically assigning functions to unknown proteins is challenging due to the inherent difficulty and complexity of the task. Previous work has shown that solving problems involving complicated objects with multiple semantic meanings using the Multi-Instance Multi-Label (MIML) framework can lead to good performance. In protein function prediction, each protein may be associated with distinct functional and structural units (instances) and multiple functional properties (class labels), where each unit is described by an instance and each functional property is considered a class label. It is therefore convenient and natural to tackle protein function prediction using the MIML framework. In this paper, we propose a sparse Markov chain based semi-supervised MIML method, called Sparse-Markov. A transductive probability graph is constructed to encode the affinity information of the data based on an ensemble of Hausdorff distance metrics. Our goal is to exploit the affinity between protein objects in the sparse transductive probability graph to seek a sparse steady-state probability of the Markov chain model for protein function prediction, such that two proteins are given similar functional labels if they are close to each other in terms of an ensemble Hausdorff distance in the graph.
Experimental results on seven real-world organism data sets covering the biological three-domain system show that our proposed Sparsity-based Markov method is able to achieve better performance than four state-of-the-art MIML learning algorithms.
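To make the notion of a Hausdorff distance between two proteins concrete, the sketch below treats each protein as a bag of instance feature vectors and computes three commonly used variants (maximal, minimal and average). Which variants the authors combine in their ensemble is not stated in the abstract, so this selection, the Euclidean base distance and the function names are assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def directed_hausdorff(bag_a, bag_b, dist=euclidean):
    """Largest distance from an instance in bag_a to its nearest instance in bag_b."""
    return max(min(dist(a, b) for b in bag_b) for a in bag_a)

def max_hausdorff(bag_a, bag_b, dist=euclidean):
    """Classical (symmetric) Hausdorff distance."""
    return max(directed_hausdorff(bag_a, bag_b, dist),
               directed_hausdorff(bag_b, bag_a, dist))

def min_hausdorff(bag_a, bag_b, dist=euclidean):
    """Distance between the closest pair of instances."""
    return min(dist(a, b) for a in bag_a for b in bag_b)

def avg_hausdorff(bag_a, bag_b, dist=euclidean):
    """Average nearest-neighbour distance over all instances of both bags."""
    to_b = sum(min(dist(a, b) for b in bag_b) for a in bag_a)
    to_a = sum(min(dist(a, b) for a in bag_a) for b in bag_b)
    return (to_b + to_a) / (len(bag_a) + len(bag_b))

protein_a = [(0.0, 0.0), (1.0, 0.0)]  # hypothetical instance vectors
protein_b = [(0.0, 0.0), (4.0, 0.0)]
print(max_hausdorff(protein_a, protein_b),
      min_hausdorff(protein_a, protein_b),
      avg_hausdorff(protein_a, protein_b))
```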

The nuclear receptor (NR) superfamily plays an important role in key biological, developmental and physiological processes. Developing a method for the classification of NR proteins is an important step towards understanding the structure and functions of newly discovered NR proteins. Recent studies on NR classification either fail to achieve optimum accuracy or are not designed for all the known NR subfamilies. In this study we developed RF-NR, a Random Forest based approach for improved classification of nuclear receptors. RF-NR discriminates NRs from non-NR proteins and, for NRs, also predicts the subfamily of the NR protein. RF-NR uses spectrum-like features, namely Amino Acid Composition, Di-peptide Composition and Tripeptide Composition. Benchmarking on two independent datasets with varying sequence redundancy reduction criteria showed that RF-NR had better (or comparable) accuracy than other existing methods. An added advantage of our approach is that we can also obtain biological insights about the important features that are required to classify NR subfamilies.
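The spectrum-like features named above can be sketched as normalized k-mer frequencies over the 20 standard amino acids, with k = 1, 2 and 3 giving vectors of length 20, 400 and 8000. The exact normalization used by RF-NR is not given in the abstract; the function below is a minimal illustration with a hypothetical name, and it skips windows containing non-standard residues.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def kmer_composition(seq, k):
    """Fraction of each possible k-mer (in fixed alphabetical order) within seq."""
    keys = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(keys, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # ignore windows with non-standard residues
            counts[kmer] += 1
            total += 1
    return [counts[key] / total if total else 0.0 for key in keys]

seq = "MKTAYIAKQR"  # hypothetical sequence fragment
features = kmer_composition(seq, 1) + kmer_composition(seq, 2)
print(len(features))
```

Feature vectors of this kind can then be passed to any standard classifier, such as a random forest.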

Determining specific enzymatic functions is a fundamental step in reconstructing metabolic networks. The biological functions of genes are characterized by a relatively small number of functional domains and domain patterns. A weighted mapping from domain architectures to EC numbers is used to score the association between functional domains and enzyme families (DEAS). This mapping performed better than a combination of standard mappings. However, the new mapping, together with other direct top-down methods, can cover only a small portion of known enzymes. Bottom-up methods can overcome this issue by rebuilding HMM profiles for enzymes. Those methods share a common classification protocol, in which training enzymes are clustered into subgroups and each subgroup is represented by a sequence profile. We improved this protocol with a stringent strategy and a proper subgroup clustering procedure that uses our DEAS association score instead of the traditional BLAST similarity score. The clustering procedure explicitly utilizes enzyme functional domain architecture to score the similarity between enzymes. In addition, our classifier (EnzDP) focuses on calibrating HMM profile thresholds and employs an enhanced classification procedure, including active site checking. Analysis showed that EnzDP achieved a micro-accuracy of 94.5% in 5-fold cross-validation. EnzDP also outperformed other bottom-up methods in many testing experiments. It can serve as a reliable automatic tool for enzyme annotation.

J59 ⏰ SCMMTP: Identifying and characterizing membrane transport proteins using propensity scores of dipeptides

Yi-Fan Liou, National Chiao Tung University, Taiwan
Tamara Vasylenko, National Chiao Tung University, Taiwan
Chia-Lun Yeh, National Chiao Tung University, Taiwan
Wei-Chun Lin, National Chiao Tung University, Taiwan
Shih-Hsiang Chiu, National Chiao Tung University, Taiwan
Phasit Charoenkwan, National Chiao Tung University, Taiwan
Shinn-Ying Ho^†, National Chiao Tung University, Taiwan
Hui-Ling Huang^†, National Chiao Tung University, Taiwan
Li-Sun Shu, Overseas Chinese University, Taiwan

Background: Identifying putative membrane transport proteins (MTPs) and understanding the transport mechanisms involved remain important challenges for the advancement of structural and functional genomics. However, transporter characteristics are mainly derived from crystal structures of MTPs, which are hard to crystallize. Therefore, it is desirable to develop bioinformatics tools for the effective large-scale analysis of available sequences to identify novel transporters and characterize them.
Results: This work proposes a novel method (SCMMTP) based on the scoring card method (SCM) using dipeptide composition to identify and characterize MTPs from an existing dataset containing 900 MTPs and 660 non-MTPs, which are separated into a training dataset consisting of 1,380 proteins and an independent dataset consisting of 180 proteins. SCMMTP produced estimated propensity scores of amino acids and dipeptides for being MTPs. The SCMMTP training and test accuracy levels reached 83.81% and 76.11%, respectively. A support vector machine (SVM) using the position-specific substitution matrix (PSSM) as the protein feature, a more complicated classifier that offers little biological interpretability, reaches a test accuracy of 80.56%; SCMMTP is thus comparable to SVM-PSSM. To identify MTPs, SCMMTP is applied to three datasets: 1) human transmembrane proteins, 2) a photosynthetic protein dataset, and 3) a human protein database. The finding that MTPs have α-helix-rich structures agrees with previous studies. MTPs use residues with low hydration energy. It is hypothesized that, after filtering substrates, the hydrated water molecules need to be released from the pore regions.
Conclusions: SCMMTP yields estimated propensity scores of amino acids and dipeptides for being MTPs, which can be used to identify novel MTPs and characterize transport mechanisms for use in further experiments.

Virus Classification & Metagenomics

J09 ⏰ Obtaining long 16S rDNA sequences using multiple primers and its application on dioxin-containing samples

Tsunglin Liu^†, Institute of Bioinformatics and Biosignal Transduction, National Cheng Kung University, Tainan, Taiwan
Yi-Lin Chen, Molecular Diagnostic Laboratory, Department of Pathology, National Cheng Kung University Hospital, Tainan, Taiwan
Chuan-Chun Lee, Molecular Diagnostic Laboratory, Department of Pathology, National Cheng Kung University Hospital, Tainan, Taiwan
Ya-Lan Lin, Molecular Diagnostic Laboratory, Department of Pathology, National Cheng Kung University Hospital, Tainan, Taiwan
Kai-Min Yin, Environmental Analysis Laboratory, Environmental Protection Administration, Executive Yuan, Taiwan
Chung-Liang Ho^†, Institute of Bioinformatics and Biosignal Transduction, National Cheng Kung University, Tainan, Taiwan

Background: Next-generation sequencing (NGS) technology has transformed metagenomics because the high-throughput data allow an in-depth exploration of a complex microbial community. However, accurate species identification with NGS data is challenging because NGS sequences are relatively short. Assembling 16S rDNA segments into longer sequences has been proposed for improving species identification. Current approaches, however, suffer either from amplification bias due to the use of a single primer or from insufficient 16S rDNA reads in whole-genome sequencing data.
Results: Multiple primers were used to amplify different 16S rDNA segments for 454 sequencing, followed by 454 read classification and assembly. This permitted targeted sequencing while reducing primer bias. For test samples containing four known bacteria, accurate and near full-length 16S rDNAs of three known bacteria were obtained. For real soil and sediment samples containing dioxins in various concentrations, 16S rDNA sequences were lengthened by 50% for about half of the non-rare microbes, and 16S rDNAs of several microbes reached more than 1000 bp. In addition, reduced primer bias using multiple primers was illustrated.
Conclusions: A new experimental and computational pipeline for obtaining long 16S rDNA sequences was proposed. The capability of the pipeline was validated on test samples and illustrated on real samples. For dioxin-containing samples, the pipeline revealed several microbes suitable for future studies of dioxin chemistry.

J74 ⏰ Precise Genotyping and Recombination Detection of Enterovirus

Chieh-Hua Lin, Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Taiwan
Yu-Bin Wang, Institute of Information Science, Academia Sinica, Taiwan
Shu-Hwa Chen, Institute of Information Science, Academia Sinica, Taiwan
Chao Agnes Hsiung^†, Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Taiwan
Chung-Yen Lin^†, Institute of Information Science, Academia Sinica, Taiwan

Enteroviruses (EV) with different genotypes cause diverse infectious diseases in humans and mammals. A correct EV typing result is crucial for effective medical treatment and disease control; however, the emergence of novel viral strains has impaired the performance of available diagnostic tools. Here, we present a web-based tool, EVIDENCE (EnteroVirus In DEep conception), for EV genotyping and recombination detection. We introduce the idea of using mixed-ranking scores to evaluate the fitness of prototypes based on relatedness and on the genome regions of interest. Using phylogenetic methods, the most probable genotype is determined based on the closest neighbor among the selected references. To detect possible recombination events, EVIDENCE calculates the sequence distance and phylogenetic relationship among sequences in all sliding windows scanned over the whole genome. Detected recombination events are plotted in an interactive figure for viewing of fine details. In addition, all EV sequences available in GenBank were collected and revised in EVIDENCE using the latest classification and nomenclature of EV. These sequences are built into the database and can be retrieved from an indexed catalog or searched for by keywords or by sequence similarity. EVIDENCE is the first web-based tool containing pipelines for genotyping and recombination detection, with updated, built-in, and complete reference sequences to improve sensitivity and specificity. The use of EVIDENCE can accelerate genotype identification, aiding clinical diagnosis and enhancing our understanding of EV evolution.

J95 ⏰ g-FLUA2H: A web-based application to study the dynamics of animal-to-human mutation transmission for influenza viruses

Muhammad Farhan Sjaugi^†, Perdana University - Centre for Bioinformatics, Malaysia
Swan Tan, Perdana University - Centre for Bioinformatics, Malaysia
Hadia Syahirah Raman, Perdana University - Centre for Bioinformatics, Malaysia
Wan Ching Lim, Perdana University - Centre for Bioinformatics, Malaysia
Nik Elena Mohamed, Perdana University - Centre for Bioinformatics, Malaysia
J. Thomas August, Department of Pharmacology and Molecular Sciences, The Johns Hopkins University School of Medicine, USA
Mohammad Asif Khan^†, Perdana University - Centre for Bioinformatics, Malaysia

g-FLUA2H is a web-based application to study the dynamics of influenza A virus animal-to-human (A2H) mutation transmissions. The application requires the viral protein sequences of the animal and human viruses as input. The comparative analyses between the co-aligned sequences of the animal and human host populations are based on a sliding window approach of size nine, chosen for statistical significance and for data application to the major histocompatibility complex (MHC) and T-cell receptor (TCR) immune response mechanisms. The sequences at each of the aligned overlapping nonamer positions of the respective hosts are classified into four patterns of characteristic diversity motifs, as a basis for quantitative analyses: (i) "index", the most prevalent sequence; (ii) "major" variant, the second most common sequence and the single most prevalent variant of the index, with at least one amino acid mutation; (iii) "minor" variants, multiple different sequences, each with an incidence (percent occurrence) less than that of the major variant; and (iv) "unique" variants, each observed only once. The diversity motifs and their incidences at each of the nonamer positions allow evaluation of the mutation transmission dynamics and selectivity of the sequences in relation to the animal or the human hosts. g-FLUA2H greatly benefits from the grid back-end for analysis of massively large influenza sequence datasets and is publicly available. This application can be used for a detailed proteome-wide characterization of the composition and incidence of mutations present in the animal and human host populations for a better understanding of host tropism.
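The four diversity motifs can be sketched as a simple classification of the nonamers observed at one aligned position. This is an illustration of the definitions in the abstract, not the g-FLUA2H implementation. Two choices are assumptions made here: ties in count are broken by order of first appearance, and any non-index sequence seen only once is labelled "unique". Function names are hypothetical.

```python
from collections import Counter

def nonamer_windows(aligned_seqs, k=9):
    """Yield (start, [k-mer from each sequence]) for every overlapping window."""
    length = len(aligned_seqs[0])
    for start in range(length - k + 1):
        yield start, [seq[start:start + k] for seq in aligned_seqs]

def classify_position(kmers):
    """Map each distinct k-mer at one position to (motif label, incidence in percent)."""
    counts = Counter(kmers)
    total = len(kmers)
    result = {}
    for rank, (kmer, n) in enumerate(counts.most_common()):
        if rank == 0:
            label = "index"    # most prevalent sequence
        elif n == 1:
            label = "unique"   # observed only once
        elif rank == 1:
            label = "major"    # most prevalent variant of the index
        else:
            label = "minor"    # remaining variants seen more than once
        result[kmer] = (label, 100.0 * n / total)
    return result

# Hypothetical nonamers observed at one position across 11 sequences.
position = (["AAAAAAAAA"] * 5 + ["AAAAAAAAC"] * 3 +
            ["AAAAAAAAG"] * 2 + ["AAAAAAAAT"])
for kmer, (label, incidence) in classify_position(position).items():
    print(kmer, label, round(incidence, 1))
```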

Background: Estimating the number of different species (richness) in a mixed microbial population has been a main focus in metagenomic research. Existing methods of species richness estimation rely on the assumption that the reads in each assembled contig correspond to only one of the microbial genomes in the population. This assumption and the underlying probabilistic formulations of existing methods are not useful for quasispecies populations, where the strains are highly genetically related.
The lack of knowledge of the number of different strains in a quasispecies population is observed to hinder the precision of existing Viral Quasispecies Spectrum Reconstruction (QSR) methods due to the uncontrolled reconstruction of a large number of in silico false positives. In this work, we formulated a novel probabilistic method for strain richness estimation specifically targeting viral quasispecies. By using this approach, we improved our recently proposed spectrum reconstruction pipeline ViQuaS to achieve higher levels of precision in reconstructed quasispecies spectra without compromising the recall rates. We also discuss how another popular QSR method, ShoRAH, can be improved using this new approach.
Results: On benchmark data sets, our estimation method provided accurate richness estimates (< 0.2 median estimation error) and improved the precision of ViQuaS by 2–13% and F-score by 1–9% without compromising the recall rates. We also demonstrate that our estimation method can be used to improve the precision and F-score of ShoRAH by 0–7% and 0–5% respectively.
Conclusions: The proposed probabilistic estimation method can be used to estimate the richness of viral populations with a quasispecies behavior and to improve the accuracy of the quasispecies spectra reconstructed by the existing methods ViQuaS and ShoRAH in the presence of a moderate level of technical sequencing errors.

Assorted Topics

In metabolic network modification, enzymes are added and/or genes are knocked out to maximize biomass production with minimum side effects. Although this problem has been studied in various settings via mathematical models including flux balance analysis, elementary modes, and Boolean models, some important settings remain to be studied. In this paper, we consider the Boolean Reaction Modification (BRM) problem, where a host metabolic network and a reference metabolic network are given in the Boolean model. The host network initially produces some toxic compounds and cannot produce some necessary compounds, whereas the reference network can produce the necessary compounds. The task is to minimize the total number of reactions removed from the host network and reactions added from the reference network so that, in the resulting host network, the toxic compounds are not producible but the necessary compounds are. We developed integer linear programming (ILP)-based methods for BRM and compared them with SimOptStrain. The results show that our method is good at reducing the total number of added and removed reactions, while SimOptStrain is good at optimizing the production of the target compound. Our software is freely available.
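The BRM objective can be illustrated on a toy instance. The sketch below is not the authors' ILP formulation: it uses exhaustive search over edit sets of increasing size, and it assumes a simple Boolean semantics in which a reaction fires once all its substrates are available and producibility is computed by forward propagation from seed compounds. All names and the example network are hypothetical.

```python
from itertools import combinations

def producible(reactions, seeds):
    """Forward-propagate availability: a reaction fires when all substrates are available."""
    reactions = list(reactions)
    available = set(seeds)
    changed = True
    while changed:
        changed = False
        for substrates, products in reactions:
            if set(substrates) <= available and not set(products) <= available:
                available |= set(products)
                changed = True
    return available

def is_feasible(reactions, seeds, toxic, necessary):
    """No toxic compound is producible and every necessary compound is."""
    avail = producible(reactions, seeds)
    return not (avail & set(toxic)) and set(necessary) <= avail

def smallest_modification(host, reference, seeds, toxic, necessary):
    """Try edit sets in order of size; return the first feasible one (brute force)."""
    candidates = ([("remove", name) for name in host] +
                  [("add", name) for name in reference if name not in host])
    for size in range(len(candidates) + 1):
        for edits in combinations(candidates, size):
            network = dict(host)
            for action, name in edits:
                if action == "remove":
                    del network[name]
                else:
                    network[name] = reference[name]
            if is_feasible(network.values(), seeds, toxic, necessary):
                return list(edits)
    return None

# Toy networks: reaction name -> (substrates, products).
host = {"r1": ({"glc"}, {"pyr"}), "r2": ({"pyr"}, {"tox"})}
reference = {"r3": ({"pyr"}, {"prod"})}
print(smallest_modification(host, reference, {"glc"}, {"tox"}, {"prod"}))
```

Exhaustive search grows exponentially with the number of candidate reactions, which is one reason an ILP formulation is attractive for realistic networks.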

Motivation: The majority of metazoan parasites known to invade vertebrate hosts are mainly represented in 3 phyla: Platyhelminthes, Nematoda and Acanthocephala. Many of the parasite members of these phyla are collectively known as helminths and are causative agents of many debilitating, deforming and deadly diseases of humans and animals. The project “North-East Parasite Information Database” (NEPID) aims at characterizing helminth parasite biodiversity to identify different geographical isolates and host associations and to unravel cryptic, emergent, exotic and invasive pathogens. The primary parasite specimens, host information, and spatial and temporal data, along with results of analyses, diagnostic capacity, images, etc., and synoptic summaries of parasite and host associations, form the basis of educational materials for end users and researchers. The database is the outcome of a combined approach of wet-lab and in silico parasite research based on next generation sequencing of selected platyhelminth transcriptomes, whole genomes, mitochondrial genomes and species-specific diagnostic molecular markers, developed and collated into a compendium of helminth parasite information in North-east India.
Results: Here, we present a web-based research database of information on parasites endemic to North East India, which would serve as a first-hand reference for studying exotic helminth parasites, especially those that are zoonotic in nature. The database houses information on classical taxonomy, disease information, morphological classification and literature references relevant to the research, along with molecular sequences that serve as genus/species-specific markers and can aid in accurate diagnosis of the parasite strains. In addition, NGS data of 3 selected helminth parasites are housed in the database. The database would also be helpful as study material for students interested in parasites and their biology. Users can easily search the database by location, species or host, and the sites from which the specimens were collected are shown through Keyhole Markup Language (KML) files linked to Google Maps. The outcome of the integrated research network gave us the impetus to collate the information into a searchable database, which is made freely available on a common portal dedicated to parasite information, analysis and research.

Background: Next-generation sequencing (NGS) technologies have brought an unprecedented scale of genomic data for analysis. Unlike array-based profiling technologies, NGS can reveal the expression profile across a transcript at the base level. Such base-level read coverage provides further insights into alternative mRNA splicing, single-nucleotide polymorphisms (SNPs), novel transcript discovery, etc. However, to the best of our knowledge, none of the existing NGS viewers can visualize genome-wide base-level read coverage in a timely manner in an interactive environment.
Results: This study proposes an efficient visualization pipeline and implements a lightweight read coverage viewer, Light-RCV, with the proposed pipeline. Light-RCV consists of four featured designs on the path from raw NGS data to the final visualized read coverage: i) read coverage construction algorithm, ii) multi-resolution profiles, iii) two-stage architecture and iv) storage format. With these designs, Light-RCV achieves < 0.5s response time on any scale of genomic ranges, including whole chromosomes. Finally, a case study was conducted to show the importance of visualizing read coverage at the base level and the value of Light-RCV.
Conclusions: Compared with multi-functional genome viewers such as Artemis, Savant, Tablet and Integrative Genomics Viewer (IGV), Light-RCV is solely devoted to visualization without advanced analyses. However, its backend technology provides an efficient kernel for base-level visualization that can be easily embedded in other viewers. It is the first viewer that can visualize genome-wide read coverage at the base level in a timely manner in an interactive environment. Light-RCV is free software and available at http://zoro.ee.ncku.edu.tw/light-rcv/ and http://merry.ee.ncku.edu.tw/light-rcv/.
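The ideas of read coverage construction and multi-resolution profiles can be illustrated with a short sketch. This is not Light-RCV's actual algorithm or storage format: the difference-array construction, the choice of keeping the maximum of adjacent bins at each coarser level, and the function names are all assumptions made for illustration.

```python
def base_coverage(reads, length):
    """Per-base coverage from half-open read intervals [start, end),
    built with a difference array in O(reads + length) time."""
    diff = [0] * (length + 1)
    for start, end in reads:
        diff[start] += 1
        diff[end] -= 1
    coverage, running = [], 0
    for delta in diff[:length]:
        running += delta
        coverage.append(running)
    return coverage

def multi_resolution(coverage, factor=2):
    """Build coarser profiles; each level keeps the maximum of `factor`
    adjacent bins of the previous level (an assumed summary statistic)."""
    levels = [coverage]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([max(prev[i:i + factor]) for i in range(0, len(prev), factor)])
    return levels

# Three hypothetical reads on an 8-base region.
reads = [(0, 4), (2, 6), (5, 8)]
cov = base_coverage(reads, 8)
levels = multi_resolution(cov)
for level in levels:
    print(level)
```

With precomputed levels, a viewer can answer a whole-chromosome query from a coarse level and switch to the base-level profile only when the user zooms in, which keeps the amount of data read per request roughly constant.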

Afternoon, Day 2, Sept. 10, 2015

Cancer

J86 ⏰ An integrated bioinformatics analysis to dissect kinase dependency in triple negative breast cancer

Karen Ryall, University of Colorado Anschutz Medical Campus, USA
Jihye Kim, University of Colorado Anschutz Medical Campus, USA
Peter Klauck, University of Colorado Anschutz Medical Campus, USA
Jimin Shin, University of Colorado Anschutz Medical Campus, USA
Minjae Yoo, University of Colorado Anschutz Medical Campus, USA
Anastasia Ionkina, University of Colorado Anschutz Medical Campus, USA
Todd Pitts, University of Colorado Anschutz Medical Campus, USA
John Tentler, University of Colorado Anschutz Medical Campus, USA
Jennifer Diamond, University of Colorado Anschutz Medical Campus, USA
Gail Eckhardt, University of Colorado Anschutz Medical Campus, USA
Lynn Heasley, University of Colorado Anschutz Medical Campus, USA
Jaewoo Kang, Korea University, Korea
Aik Choon Tan^†, University of Colorado Anschutz Medical Campus, USA

Background: Triple-Negative Breast Cancer (TNBC) is an aggressive disease with a poor prognosis. Clinically, TNBC patients have limited treatment options besides chemotherapy. The goal of this study was to determine the kinase dependency in TNBC cell lines and to predict compounds that could inhibit these kinases using integrative bioinformatics analysis.
Results: We integrated publicly available gene expression data, high-throughput pharmacological profiling data, and quantitative in vitro kinase binding data to determine the kinase dependency in 12 TNBC cell lines. We employed Kinase Addiction Ranker (KAR), a novel bioinformatics approach, which integrated these data sources to dissect kinase dependency in TNBC cell lines. We then used the kinase dependency predicted by KAR for each TNBC cell line to query K-Map for compounds targeting these kinases. We validated our predictions using published and new experimental data.
Conclusions: In summary, we implemented an integrative bioinformatics analysis that determines kinase dependency in TNBC. Our analysis revealed candidate kinases as potential targets in TNBC for further pharmacological and biological studies.

Epithelial-to-mesenchymal transition (EMT) initiates metastases in cancer and represents an attractive target mechanism for interference with the disease. However, the key players of the process are still debated. We hypothesized that the control of EMT initiation requires the activity of several synergistically interacting genes. To reveal these synthetic interactions, we first constructed a comprehensive map of the EMT signaling network and performed a structural analysis that highlighted the network organization principles and reduced its complexity to core regulatory routes. Using the reduced network, we compared combinations of single and double mutants for achieving the EMT phenotype, predicted that a combination of p53 knock-out and overexpression of Notch would induce EMT, and suggested the molecular mechanism of deregulation of EMT inducer control. This prediction led to the generation of a colon cancer mouse model with metastases in distant organs. We confirmed in invasive human colon cancer samples that EMT markers are associated with modulation of Notch and p53 gene expression in a similar manner as in the mouse model, supporting a synergy between these genes to permit EMT induction. Our prediction of a synthetic interaction between Notch and p53 demonstrated that there are ways to reach permissive conditions that induce EMT in addition to those already described in the literature. This idea was not intuitive, and this combination of mutations most probably would not have been proposed without the compelling evidence provided by the analysis of the comprehensive signaling network, which inspired the experimental design that led to the discovery of an alternative mechanism for EMT induction. In addition, the comprehensive EMT signaling network is a rich resource of information that can be used in further studies. Finally, the new EMT mouse model is a relevant model mimicking invasive human colon cancer and a system for therapeutic drug discovery.

Background: Recently, a wide range of diseases has been associated with changes in DNA methylation levels, which play a vital role in gene expression regulation. With ongoing developments in technology, a growing number of disease studies benefit from both epigenetics and transcriptomics experiments. In this work, we used expression and methylation data of thyroid carcinoma as a case study and explored how to optimally incorporate expression and methylation information into a disease study when both types of data are available. Moreover, we investigated whether there are important post-translational modifiers which may be crucial for gaining insights into thyroid cancer.
Results: In this study, we conducted a threshold analysis for varying methylation levels to determine whether setting a methylation level threshold increases the performance of functional enrichment. Moreover, in order to decide on the best-performing analysis strategy, we performed a data integration analysis comparing 9 different analysis strategies. As a result, combining methylation and expression significances and using genes with more than 15% methylation change led to a better detection rate of thyroid cancer-associated pathways in the top 20 functional enrichment results. Furthermore, pooling the data from different experiments increased analysis confidence by improving the data range. In addition, we identified 207 transcription factors and 245 post-translational modifiers with more than 15% methylation change which may be important in understanding the underlying mechanisms of thyroid cancer.
Conclusion: While expression or methylation information alone would not clearly reveal both primary and secondary mechanisms involved in the disease state, combining expression and methylation information led to a better detection of disease-related genes and pathways that are found in the literature. Moreover, setting a valid methylation level threshold improved the functional enrichment results, revealing the core pathways involved in disease development such as endocytosis, apoptosis, glutamatergic synapse, MAPK, erbB, TGF-beta and toll-like receptor pathways. Overall, in addition to a novel analysis framework, our study reveals important thyroid cancer-related mechanisms and secondary molecular alterations, and contributes to better knowledge of thyroid cancer aetiology, which probably would not have been possible using only expression or only methylation information.

Transcriptome profiling is one of the most widely used methods for characterizing a tumor sample in order to detect important biological differences between the functioning of tumor and normal cells, to refine the diagnosis and to suggest personalized treatment. Recently, large-scale efforts (such as The Cancer Genome Atlas, TCGA) were undertaken to produce transcriptome profiles for a large number of tumor samples for various cancer types and to make these data available for analysis. Currently, we have access to several tens of thousands of tumoral transcriptomes, most of which are collected for the prevailing cancers such as breast, colon, lung and ovarian cancers.
There exists an important question of what can be learned from these “big data”: in particular, what kind of signals shape the transcriptomes in many cancer types or are specific to a particular cancer type. We’ve approached this question by deconvoluting into independent components 22 different datasets collecting transcriptome profiles for various types of cancer (6671 tumor samples in total), focusing on comparing bladder and breast cancer in more detail. For each dataset, we’ve computed components which seemed to be sufficient to capture the most important biological or technical signals. We’ve made a particular effort to understand the meaning of the components for our own transcriptomic dataset for bladder cancer.
We systematically compared 440 components recapitulating the deconvolution results into a correlation graph. We analysed this graph for existence of communities of tightly connected nodes, which we interpreted as highly reproducible signals. We identified communities containing the nodes of only one cancer type (cancer type-specific signals) as well as the communities collecting components from cancers of different types. Among such generic signals we’ve identified the changes in the transcriptome connected to infiltration of lymphocytes or myofibroblasts, to the cell cycle or to the functioning of mitochondria. We identified a community associated with GC-content of the gene sequences, which can indicate specific technological biases. Among cancer type-specific signals, we observed a clear signal connected to existence of the basal-like breast cancer subtype and to the recently characterized basal-like subtype of bladder cancer. We also characterized the components specific to two bladder cancer progression pathways.
The urothelial differentiation component, found in all bladder cancer data sets studied, was specifically associated with bladder luminal tumors. We looked for altered genomic regions associated with the urothelial differentiation component and selected the regions of gains that were found in tumors associated with the differentiation. To demonstrate the functional involvement of PPARG in bladder tumors, we studied the effect of siRNA-mediated PPARG knockdown on the growth of nine bladder cancer-derived cell lines. The experimental results indicated that PPARG was involved in tumor cell growth. The presence of PPAR target genes among the contributing genes of the urothelial differentiation component suggested that PPARG could control the expression of several contributing genes of this component. We tested this hypothesis by comparing the transcriptome of the SD48 cell line treated with three different siRNAs targeting PPARG.

Epitope

J03 ⏰ A computational method for identification of viral vaccine targets from protein regions of conserved HLA binding

Lars R. Olsen, University of Copenhagen, Denmark
Christian Simon, University of Copenhagen, Denmark
Ulrich Johan Kudahl, University of Copenhagen, Denmark
Frederik Otzen Bagger, University of Cambridge, UK
Ole Winther, University of Copenhagen, Denmark
Ellis L Reinherz, Dana-Farber Cancer Institute, USA
Guang Lan Zhang, Boston University, USA
Vladimir Brusic^†, Nazarbayev University, Kazakhstan

Computational methods for T cell-based vaccine target discovery focus on selection of highly conserved peptides identified across pathogen variants, followed by prediction of their binding of HLA molecules. However, experimental studies have shown that T cells often target diverse regions in highly variable viral pathogens and this diversity may need to be addressed through redefinition of suitable peptide targets. We have developed a method for antigen assessment and target selection for polyvalent vaccines, with which we identified immune epitopes from variable regions, in which all variants still bind HLA. These regions, although variable, can thus be considered genetically stable in terms of HLA binding and represent valuable vaccine targets. We applied this method to predict CD8+ T-cell targets in influenza A H7N9 HA and significantly increased the number of potential vaccine targets compared to the number of targets discovered using the traditional approach where low-frequency peptides are excluded. We developed an intuitive visualization scheme for summarizing the T cell-based antigenic potential of any given proteome or protein using clear and easy to interpret graphics, and implemented this as freely available software.

Hepatitis C virus (HCV) belongs to the Flaviviridae family of viruses. HCV represents a major challenge to public health, with an estimated global prevalence of 2.8% of the world’s population. The design and development of an HCV vaccine has been hampered by the rapid evolution of viral quasispecies, resulting in antibody escape variants. The HCV envelope glycoproteins E1 and E2, which mediate fusion and entry of the virus into host cells, are primary targets of host immune responses. Structural characterization of the E2 core protein and the broadly neutralizing antibody AR3C, together with E1E2 sequence information, enabled the analysis of B-cell epitope variability. The AR3C binding site on E2 and its surrounding area were identified from the crystal structure of the E2c-AR3C complex. We clustered HCV strains using the concept of “discontinuous motif/peptide” and classified B-cell epitopes based on their similarity. The assessment of antibody neutralizing coverage provides insights into potential cross-reactivity of the AR3C neutralizing antibody across HCV variants.

J66 ⏰ Prediction of linear B-cell epitopes of hepatitis C virus for vaccine development

Ming-Ju Tsai, Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
Wen-Lin Huang, Department of Management Information System, Asia Pacific Institute of Creativity, Miaoli, Taiwan
Kai-Ti Hsu, Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
Jyun-Rong Wang, Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
Yi-Hsiung Chen, Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan
Shinn-Ying Ho^†, Institute of Bioinformatics and Systems Biology, National Chiao Tung University, Hsinchu, Taiwan, Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, Taiwan

Background: High genetic heterogeneity in the hepatitis C virus (HCV) is the major challenge in the development of an effective vaccine. Existing studies for developing HCV vaccines have mainly focused on the T-cell immune response. However, identification of linear B-cell epitopes that can stimulate a B-cell response is one of the major tasks of peptide-based vaccine development. Owing to the variability in B-cell epitope length, the prediction of B-cell epitopes is much more complex than that of T-cell epitopes. Furthermore, the motifs of linear B-cell epitopes in different pathogens are quite different (e.g. HCV and hepatitis B virus). To cope with this challenge, this work proposes an HCV-customized sequence-based prediction method to identify B-cell epitopes of HCV.
Results: This work establishes an experimentally verified HCV dataset of B-cell responses, consisting of 774 linear B-cell epitopes and 774 non-B-cell epitopes from the Immune Epitope Database. An interpretable rule mining system of B-cell epitopes (IRMS-BE) is proposed to select informative physicochemical properties (PCPs) and then extract several if-then rules for identifying B-cell epitopes. A web server, Bcell-HCV, was implemented using an SVM with the 34 informative PCPs, which achieved a training accuracy of 79.7% and a test accuracy of 70.7%, better than the SVM-based methods for identifying B-cell epitopes of HCV and the two general-purpose methods. This work performs an advanced analysis of the 34 informative properties, and the results indicate that the most effective property is the alpha-helix structure of epitopes, which influences the connection between host cells and the E2 proteins of HCV. Furthermore, 12 interpretable rules are acquired from the top five PCPs and achieve a sensitivity of 75.6% and a specificity of 71.3%. Finally, a conserved promising vaccine candidate, PDREMVLYQE, is identified for inclusion in a vaccine against HCV.
Conclusions: This work proposes an interpretable rule mining system, IRMS-BE, for extracting interpretable rules using informative physicochemical properties, and a web server, Bcell-HCV, for predicting linear B-cell epitopes of HCV. IRMS-BE may also be applied, without significant modification, to predict B-cell epitopes for other viruses, which would benefit vaccine development for these viruses. Bcell-HCV is useful for identifying B-cell epitopes of HCV antigens to help vaccine development.

Motivation: The incomplete ground truth of training data for B-cell epitopes is a tough issue in computational epitope prediction. The challenge is that only a small fraction of the surface residues of an antigen have been confirmed as antigenic residues (positive training data); the remaining residues are unlabeled and uncertain. As some of these uncertain residues can possibly be grouped to form novel but currently unknown epitopes, uniformly classifying all the unlabeled residues as negative training data, as in the traditional supervised learning scheme, introduces bias.
Method and Results: We propose a positive-unlabeled learning algorithm to address this problem. The key idea is to distinguish between epitope-likely residues and reliable negative residues in unlabeled data. The method has two steps: (1) identify reliable negative residues using a weighted SVM with a high recall; and (2) construct a classification model on the positive and the reliable negative residues. A complex-based 10-fold cross-validation was conducted to show that our method outperforms the commonly used predictors DiscoTope 2.0, ElliPro and SEPPA 2.0 in every aspect. As case studies, our method was tested on antigens of West Nile virus, dihydrofolate reductase and beta-lactamase, and on two Ebola antigens whose epitopes are currently unknown. All the results were achieved on a newly established data set of unbound structures of antigens, instead of on bound structures, which may contain unfair binding information such as bound-state B-factors and protrusion index that exaggerates the epitope prediction performance.

Protein-{Protein,RNA,DNA} Interface

Protein-RNA interactions (PRIs) are essential for many biological processes, so understanding aspects of the sequence and structure in PRIs is important for understanding those processes. Due to the expensive and time-consuming processes required for experimental determination of complex protein-RNA structures, various computational methods have been developed to predict PRIs. However, most of these methods focus on predicting only RNA-binding regions in proteins or only protein-binding motifs in RNA. Methods for predicting entire residue-base contacts in PRIs have not yet achieved sufficient accuracy. Furthermore, some of these methods require 3D structures or homologous sequences, which are not available for all protein and RNA sequences.
We propose a prediction method for residue-base contacts between proteins and RNAs using only sequence information and structural information predicted from only sequences. The method can be applied to any protein-RNA pair, even when rich information such as 3D structure is not available. Residue-base contact prediction is formalized as an integer programming problem. We predict a residue-base contact map that maximizes a scoring function based on sequence-based features such as k-mers of sequences and predicted secondary structure. The scoring function is trained by a max-margin framework from known PRIs with 3D structures. To verify our method, we conducted several computational experiments. The results suggest that our method, which is based on only sequence information, is comparable with RNA-binding residue prediction methods based on known binding data.

Background: Protein-protein interaction (PPI) is essential for molecular functions in biological cells. Investigation on protein interfaces of known complexes is an important step towards deciphering the driving forces of PPIs. Each PPI complex is specific, sensitive and selective to binding. Therefore, we have estimated the relative difference in percentage of polar residues between surface and the interface for each complex in a non-redundant heterodimer dataset of 278 complexes to understand the predominant forces driving binding.
Results: Our analysis showed that ~60% of protein complexes have surface polarity greater than interface polarity (designated as class A). However, a considerable number of complexes (~40%) have interface polarity greater than surface polarity (designated as class B) and differ significantly from class A (p-value of 1.66E-45). Comprehensive analyses of protein complexes show that interface features such as interface area, the relative abundance of polar and non-polar residues, solvent free energy gain upon interface formation, binding energy and the percentage of interface charged residues distinguish between class A and class B complexes, while electrostatic visualization maps also help differentiate interface classes among complexes.
Conclusions: Class A complexes are classical with abundant non-polar interactions at the interface; however class B complexes have abundant polar interactions at the interface, similar to protein surface characteristics. Five physicochemical interface features analyzed from the protein heterodimer dataset are discriminatory among the interface residue-level classes. These novel observations find application in developing residue-level models for protein-protein binding prediction, protein-protein docking studies and interface inhibitor design as drugs.
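The class A / class B assignment described above can be sketched as a comparison of the percentage of polar residues on the surface and at the interface. The polar residue set, the use of a plain percentage-point difference, the handling of ties and the function names are assumptions made for illustration; the abstract does not specify them.

```python
# Assumed polar/charged residue set (one-letter codes); definitions vary between studies.
POLAR = set("STNQYCHKRDE")

def percent_polar(residues):
    """Percentage of residues in the sequence that belong to the POLAR set."""
    return 100.0 * sum(r in POLAR for r in residues) / len(residues)

def interface_class(surface, interface):
    """Return ('A', diff) when surface polarity exceeds interface polarity,
    ('B', diff) when interface polarity is higher, and ('tie', 0.0) otherwise.
    diff is surface minus interface polarity, in percentage points."""
    s, i = percent_polar(surface), percent_polar(interface)
    if s > i:
        return "A", s - i
    if i > s:
        return "B", s - i
    return "tie", 0.0

# Hypothetical residue lists for one complex.
print(interface_class("KDESTLAV", "LAVFIWSK"))
```

In a real analysis the surface and interface residue lists would come from solvent accessibility and contact calculations on the complex structure, which are outside the scope of this sketch.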

J81 ⏰ Characterizing informative sequence descriptors and predicting binding affinities of heterodimeric protein complexes

Ming-Ju Tsai, National Chiao-Tung University, Taiwan
Srinivasulu Yerukala Sathipati, National Chiao-Tung University, Taiwan
Jyun-Rong Wang, National Chiao-Tung University, Taiwan
Kai-Ti Hsu, National Chiao-Tung University, Taiwan
Phasit Charoenkwan, National Chiao-Tung University, Taiwan
Wen-Lin Huang, National Chiao-Tung University, Taiwan
Hui-Ling Huang, National Chiao-Tung University, Taiwan
Shinn-Ying Ho^†, National Chiao-Tung University, Taiwan

Protein–protein interactions (PPIs) are involved in various biological processes, and the underlying mechanism of the interactions plays a crucial role in therapeutics and protein engineering. Most machine learning approaches have been developed for predicting the binding affinity of protein-protein complexes based on structure and functional information. This work aims to predict the binding affinity of heterodimeric protein complexes from sequences only.
This work proposes a support vector machine (SVM) based binding affinity classifier, called SVM-BAC, to classify heterodimeric protein complexes based on the prediction of their binding affinity. SVM-BAC identified 14 of 580 sequence descriptors, including 531 physicochemical properties of amino acids, to classify 216 heterodimeric protein complexes into low and high binding affinity. SVM-BAC yielded a training accuracy, sensitivity, specificity, AUC and test accuracy of 85.80%, 0.88, 0.82, 0.86 and 83.33%, respectively, better than existing machine learning algorithms. The 14 features and support vector regression were further used to predict the binding affinities (Pkd) of 200 heterodimeric protein complexes. In a jackknife test, the prediction achieved a correlation coefficient of 0.34 and a mean absolute error of 1.4. We further analyzed three informative physicochemical properties according to their contribution to prediction performance. The results reveal that the following properties are effective in predicting the binding affinity of heterodimeric protein complexes: apparent partition energy based on buried molar fractions, relations between chemical structure and biological activity in principal component analysis IV, and normalized frequency of beta turn.
The proposed sequence-based prediction method SVM-BAC uses an optimal feature selection method to identify 14 informative features to classify and predict the binding affinity of heterodimeric protein complexes. The characterization analysis revealed that the average numbers of beta turns and hydrogen bonds at protein-protein interfaces are higher in high binding affinity complexes than in low binding affinity complexes.

Background: Transcription factors, which regulate the expression inventory of a cell, interact with their respective DNA through a specific recognition pattern which, if well exploited, may enable targeted genome engineering. The most widely studied transcription factors are zinc finger proteins (ZFPs), which bind to their target DNA via direct and indirect recognition at the interaction interface. Exploiting the binding specificity and affinity of the interaction between zinc fingers and their respective DNA can help in generating engineered zinc fingers for therapeutic applications. Experimental evidence substantiates the effect of indirect interactions, such as DNA deformation and desolvation kinetics, in enabling ZFPs to achieve partial sequence specificity based on structural properties of DNA. Exploring the structure-function relationships of existing zinc finger-DNA complexes at the indirect recognition level can aid in predicting the probable zinc fingers that could bind to any target DNA. Deformation energy, defined as the energy required to bend DNA from its native shape to its shape when bound to the ZFP, is an effect of the indirect recognition mechanism. Water is treated as a co-reactant in the affinity studies of ZFP-DNA binding equilibria, which takes into account the unavoidable change in hydration that occurs when these two solvated surfaces come into contact.
Results: Aspects like desolvation and DNA deformation have been theoretically investigated based on simulations and free energy perturbation data, revealing a consensus in correlating affinity and specificity as well as stability for ZFP-DNA interactions. Greater loss of water at the interaction interface of the DNA corresponds to binding with higher affinity, eventually distorting the DNA to a greater extent, as reflected by the change in major groove width and DNA tilt, stretch and rise.
Conclusion: Most prediction algorithms for ZFPs do not account for water loss at the interface. The sequence-dependent deformation in the DNA upon binding with ZFP, as well as the preference of bases at the 2nd and 3rd positions of the repeating triplet reported in this study, provide new insight into the indirect interactions in DNA-protein complexation.

Morning, Day 3, Sept. 11, 2015

miRNA & other topics

Micro-RNAs (miRNAs) are small RNA molecules known to participate in important regulatory mechanisms through the targeting of mRNAs by sequence specific interactions, leading to targeted inhibition of gene expression. Initial studies have highlighted the importance of miRNA in normal mammary gland development processes associated with the lactation cycle but a role in lactation is not yet completely clear. However, the recent identification of significant quantities of miRNA in the milk of a number of mammals, together with the reporting of functional plant food miRNA in the blood of people, have precipitated speculations about the potential role of miRNAs in milk production as well as immune protection and development of the young.
To examine the role of milk miRNAs as informative markers of lactation, maternal physiology or information-carrying signals for the timely delivery of development signals to the young, we deploy a comparative framework across the mammalian kingdom. The identification of conserved milk miRNAs allows the characterization of essential miRNAs, while differences in milk miRNA composition may reveal specific adaptations of a putative secretory miRNA signalling pathway. Towards this, we produce high-throughput sequencing data from the milk of a number of species, including more distantly related monotremes and marsupials. Marsupials, such as the tammar wallaby, present one of the most interesting lactation phenotypes due to their adoption of a short gestation with a relatively long lactation cycle following the birth of an immature neonate, with significant development during lactation. Continuous changes in tammar milk composition have been reported and may contribute to development and immune protection of pouch young. Therefore the marsupial model presents a unique opportunity to address the putative contribution of secretory milk miRNAs in these processes. Profiling of miRNAs collected from tammar milk at different time points of lactation was conducted by high-throughput sequencing¹. The results show that miRNAs are also secreted at relatively high levels in marsupial milk and that milk miRNA composition changes significantly during the course of lactation in this species. In addition, the difference in miRNA profiles obtained from maternal blood and milk indicates that passive transfer of serum is not a major contributor to miRNA secretion, suggesting that the mammary gland is the more likely major contributor to milk miRNA biogenesis.
In contrast, highly expressed milk miRNAs could be detected at significantly higher levels in the blood serum of the neonate in comparison to adult blood, suggesting that milk miRNAs may be absorbed through the gut of the young, at least during the early postnatal phase of development. Therefore the results support the notion that milk miRNAs may have evolved to contribute protective and developmental signals to the young.

1. “Differential temporal expression of milk miRNA during the lactation cycle of the marsupial tammar wallaby (Macropus eugenii)”. Vengamanaidu Modepalli, Amit Kumar, Lyn A. Hinds, Julie A Sharp, Kevin R Nicholas, Christophe Lefevre. BMC Genomics 15:1012, 2014.

J80 ⏰ Investigation of microRNAs in mouse macrophage responses to lipopolysaccharide-stimulation by combining gene expression with microRNA-target information
Chia-Chun Chiu, National Cheng Kung University, Taiwan
Wei-Sheng Wu^†, National Cheng Kung University, Taiwan

Background: Toll-like receptors (TLRs), which are stimulated by pathogen-associated molecular patterns such as lipopolysaccharides (LPS), induce the release of many kinds of proinflammatory cytokines to activate subsequent immune responses. Many studies have also indicated the importance of TLR-signalling in the avoidance of excessive inflammation, in tissue repair, and in the return to homeostasis after infection and tissue injury. The significance of TLR-signalling has drawn considerable attention to its regulatory mechanisms in recent years. However, it is still unclear how, and how many, different microRNAs (miRNAs), as newly discovered regulators, regulate the TLR-signalling pathway.
Results: By integrating several microarray datasets and miRNA-target information datasets, we identified 431 miRNAs and 498 differentially expressed target genes in bone marrow-derived macrophages (BMDMs) with LPS-stimulation. Cooperative miRNA networks were constructed by calculating target overlap scores, and a subnetwork finding algorithm was used to identify cooperative miRNA modules. Finally, 17 and 8 modules were identified in the cooperative miRNA networks composed of miRNAs whose targets are up-regulated and down-regulated genes, respectively.
Conclusions: We used gene expression data of mouse macrophages stimulated by LPS, together with miRNA-target information, to infer the regulatory mechanism of miRNAs on the LPS-induced signalling pathway. Our results also suggest that miRNAs can be important regulators of the LPS-induced innate immune response in BMDMs.
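The network construction step can be illustrated with a small sketch. The abstract does not define the target overlap score, so the Jaccard index is used here as an assumed stand-in; the miRNA names, gene sets, and threshold below are hypothetical and not taken from the study.

```python
from itertools import combinations


def overlap_score(targets_a, targets_b):
    """Jaccard index of two target-gene sets (an assumed stand-in score)."""
    union = targets_a | targets_b
    if not union:
        return 0.0
    return len(targets_a & targets_b) / len(union)


def cooperative_edges(mirna_targets, threshold):
    """Return (miRNA, miRNA, score) edges whose overlap score meets the threshold."""
    edges = []
    for a, b in combinations(sorted(mirna_targets), 2):
        score = overlap_score(mirna_targets[a], mirna_targets[b])
        if score >= threshold:
            edges.append((a, b, score))
    return edges


# Hypothetical toy input: miRNA -> set of differentially expressed target genes.
toy_targets = {
    "miR-A": {"Tnf", "Il6", "Nfkb1", "Irak1"},
    "miR-B": {"Tnf", "Il6", "Nfkb1", "Traf6"},
    "miR-C": {"Socs1"},
}
edges = cooperative_edges(toy_targets, threshold=0.5)
for a, b, score in edges:
    print(a, b, round(score, 2))
```

The resulting edge list is the kind of input a module-finding (subnetwork) algorithm would then operate on; that step is not sketched here.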

J37 ⏰ A model for gene deregulation detection using expression data
Etienne Birmele^†, Université Paris Descartes, France
Thomas Picchetti, Université Paris Descartes, France
Julien Chiquet, Université d'Evry, France
Mohamed Elati, Université d'Evry, France
Pierre Neuvial, CNRS, France
Remy Nicolle, Université d'Evry, France

In tumoral cells, gene regulation mechanisms are severely altered, and these modifications of regulation may be characteristic of different subtypes of cancer. However, these alterations do not necessarily induce differential expression between the subtypes. To address this issue, we propose a statistical methodology to identify the misregulated genes given a reference network and gene expression data.
Our model is based on a regulatory process in which all genes are allowed to be deregulated. We derive an EM algorithm where the hidden variables correspond to the status (under/over/normally expressed) of the genes and where the E-step is solved thanks to a message passing algorithm. Our procedure provides posterior probabilities of deregulation in a given sample for each gene. We assess the performance of our method by numerical experiments on simulations and on a bladder cancer data set.

H151 ⏰ Functional basis of microorganism classification
Chengsheng Zhu^†, Rutgers University, USA
Tom Delmont, Marine Biological Laboratory, USA
Timothy Vogel, Université de Lyon, France
Yana Bromberg^†, Rutgers University, USA

Correctly identifying nearest “neighbors” of a given microorganism is important in industrial and clinical applications, where close relationships imply similar treatment. Microbial classification based on similarity of physiological and genetic organism traits (polyphasic similarity) is experimentally difficult and, arguably, subjective. Evolutionary relatedness, inferred from phylogenetic markers, facilitates classification but does not guarantee functional identity between members of the same taxon or lack of similarity between different taxa. Using over thirteen hundred sequenced bacterial genomes we built a novel function-based microorganism classification scheme, functional-repertoire similarity-based organism network (FuSiON; flattened to fusion). Our scheme is phenetic, based on a network of quantitatively defined organism relationships across the known prokaryotic space. It correlates significantly with the current taxonomy, but the observed discrepancies reveal both (1) the inconsistency of functional diversity levels among different taxa and (2) an (unsurprising) bias towards prioritizing, for classification purposes, relatively minor traits of particular interest to humans. Our dynamic network-based organism classification is independent of the arbitrary pairwise organism similarity cut-offs traditionally applied to establish taxonomic identity. Instead, it reveals natural, functionally defined, organism groupings and is thus robust in handling organism diversity. Additionally, fusion can use organism meta-data to highlight the specific environmental factors that drive microbial diversification. Our approach provides a complementary view to cladistic assignments and holds important clues for further exploration of microbial lifestyles. 
Fusion is a more practical fit for biomedical, industrial, and ecological applications, as many of these rely on understanding the functional capabilities of the microbes in their environment, and are less concerned with phylogenetic descent.

T2 ⏰ Recent Development of Deep Learning Technology and its Application to Quantitative Structure-Activity Relationship
Kenta Oono^†, Preferred Networks

Deep Learning (DL) is a field of machine learning that utilizes Neural Networks (NN) or other computational models with "Deep" structures. Recent breakthroughs in this field have made DL successful in various tasks such as Image Recognition, Natural Language Processing, and Speech Recognition.
DL is also applicable to some biochemistry tasks. In 2012, Merck organized a data-mining competition on Quantitative Structure-Activity Relationship (QSAR). The team that employed a DL-based approach won the competition. Their approach was a combination of multitask learning and DL, in which one NN is shared among multiple tasks that are solved simultaneously.
To extend their approach and solve many more tasks simultaneously, we have developed an algorithm for distributed multitask DL called Community Learning. Our experiment shows that the algorithm scales well up to 8 machines and improves the prediction accuracy, as measured by AUC, of 5 assay results.
In this talk, we will introduce recent developments in DL and our approach to applying distributed multitask DL to QSAR.
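As a reminder of how per-assay AUC can be computed in a multitask setting, here is a minimal sketch. It uses the pairwise-ranking definition of AUC (the probability that a randomly chosen active compound is scored above a randomly chosen inactive one, with ties counting one half). The assay names, labels, and scores are hypothetical and are not results from the talk.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical predictions for two assays (tasks); 1 = active, 0 = inactive.
task_labels = {"assay_1": [1, 0, 1, 0], "assay_2": [1, 1, 0, 0]}
task_scores = {"assay_1": [0.9, 0.2, 0.4, 0.6], "assay_2": [0.8, 0.7, 0.3, 0.1]}

per_task_auc = {t: roc_auc(task_labels[t], task_scores[t]) for t in task_labels}
mean_auc = sum(per_task_auc.values()) / len(per_task_auc)
print(per_task_auc, mean_auc)
```

The quadratic pairwise loop is chosen for clarity; a rank-based formulation gives the same value more efficiently on large assays.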

Protein Sequence Analysis

T1 ⏰
Kentaro Tomii^†, AIST, Japan

We have developed several bioinformatics tools to facilitate biomedical research. Among them, we present three recently developed tools for protein sequence/structure analysis. i) PoSSuM v.2.0 is a resource for investigating ligand analogs and target proteins of small-molecule drugs. You can explore the binding pocket universe within PoSSuM, which is a database of similar ligand-binding and putative pockets. ii) MitoFates is a novel method that incorporates recent developments in proteomics data, sequence features, and position weight matrices for mitochondrial presequence and cleavage site prediction. MitoFates attains better performance than existing predictors both in detecting presequences and in predicting their cleavage sites. iii) ScreenCap3 is a reliable prediction method for discovering novel caspase-3 substrates. ScreenCap3, which is based on machine learning and uses information not only from experimentally verified positive examples but also from negative examples, provides a substantial improvement in precision compared with existing methods. These three tools are available at their respective websites and will be useful in high-throughput biology such as proteomics research.

J41 ⏰ Prediction of neddylation sites from protein sequences and sequence-derived properties
Ahmet Sinan Yavuz^†, Biological Sciences and Bioengineering Program, Faculty of Engineering and Natural Sciences, Sabanci University, Turkey
Namik Berk Sozer, Department of Genetics and Bioengineering, Faculty of Engineering and Architecture, Yeditepe University, Turkey
Osman Ugur Sezerman^†, Department of Biostatistics and Medical Informatics, Faculty of Medicine, Acibadem University, Turkey

Background: Neddylation is a reversible post-translational modification that plays a vital role in maintaining cellular machinery. It has been shown to affect the localization, binding partners and structure of target proteins. Disruption of protein neddylation has been observed in various diseases such as Alzheimer’s and cancer. Therefore, understanding the neddylation mechanism and determining neddylation targets is likely to be of great importance for further understanding of cellular processes. This study is the first attempt to predict neddylated sites from protein sequences by using several sequence and sequence-based structural features.
Results: We have developed a neddylation site prediction method using a support vector machine based on various sequence properties, position-specific scoring matrices, and disorder. Using 21 amino acid long lysine-centred windows, our model was able to predict neddylation sites successfully, with an average 5-fold stratified cross validation performance of 0.91, 0.91, 0.75, 0.44, 0.95 for accuracy, specificity, sensitivity, Matthews correlation coefficient and area under curve, respectively. Independent test set results validated the robustness of the reported method. Additionally, we observed that neddylation sites are commonly flexible and that positively charged amino acids are significantly enriched at neddylation sites.
Conclusions: In this study, a neddylation site prediction method was developed for the first time in the literature. Common characteristics of neddylation sites and their discriminative properties were explored for further in silico studies on neddylation. Lastly, an up-to-date neddylation dataset is provided for researchers working on post-translational modifications.
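Two pieces of the pipeline described above lend themselves to a short sketch: extracting 21-residue lysine-centred windows and computing the reported evaluation metrics from a confusion matrix. The padding character `X` for windows that run past the sequence termini, the function names, the toy sequence, and the confusion counts are all assumptions made for illustration; the abstract does not state how termini were handled.

```python
import math


def lysine_windows(sequence, width=21, pad="X"):
    """Yield (1-based position, window) for every lysine, padding past the termini."""
    half = width // 2
    padded = pad * half + sequence + pad * half
    for i, residue in enumerate(sequence):
        if residue == "K":
            # Residue i sits at padded index i + half, so this slice is centred on it.
            yield i + 1, padded[i:i + width]


def binary_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity and Matthews correlation coefficient."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "mcc": mcc}


# Hypothetical toy sequence; every window is 21 residues with K at the centre.
windows = list(lysine_windows("MSKAAGELKRT"))
for position, window in windows:
    print(position, window)

# Hypothetical confusion counts, not the study's results.
metrics = binary_metrics(tp=3, tn=9, fp=1, fn=1)
print(metrics)
```

Each window would then be encoded (for example with sequence properties, PSSM columns, and disorder scores) before being passed to a classifier; that encoding is outside the scope of this sketch.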

J75 ⏰ A two-layered machine learning method to identify protein O-GlcNAcylation sites with O-GlcNAc transferase substrate motifs
Hui-Ju Kao, Yuan Ze University, Taiwan
Chien-Hsun Huang, Tao-Yuan Hospital, Taiwan
Neil Arvin Bretaña, University of New South Wales, Australia
Cheng-Tsung Lu, Yuan Ze University, Taiwan
Kai-Yao Huang, Yuan Ze University, Taiwan
Shun-Long Weng^†, Hsinchu Mackay Memorial Hospital, Taiwan
Tzong-Yi Lee^†, Yuan Ze University, Taiwan

Protein O-GlcNAcylation, involving the β-attachment of a single N-acetylglucosamine (GlcNAc) to the hydroxyl group of serine or threonine residues, is an O-linked glycosylation catalyzed by O-GlcNAc transferase (OGT). Molecular-level investigation of the basis for OGT’s substrate specificity should aid understanding of how O-GlcNAc contributes to diverse cellular processes. Owing to the increasing number of O-GlcNAcylated peptides with site-specific information identified by mass spectrometry (MS)-based proteomics, we were motivated to characterize substrate site motifs of O-GlcNAc transferases. In this investigation, a non-redundant dataset of 410 experimentally verified O-GlcNAcylation sites was manually extracted from dbOGAP, OGlycBase and UniProtKB. After detection of conserved motifs using maximal dependence decomposition, a profile hidden Markov model (profile HMM) was adopted to learn a first-layer model for each identified OGT substrate motif. A Support Vector Machine (SVM) was then used to generate a second-layer model learned from the output values of the profile HMMs in the first layer. The two-layered predictive model was evaluated using five-fold cross validation, which yielded a sensitivity of 85.4%, a specificity of 84.1%, and an accuracy of 84.7%. Additionally, an independent testing set from PhosphoSitePlus, which was truly non-homologous to the training data of the predictive model, was used to demonstrate that the proposed method could provide a promising accuracy (84.05%) and outperform other O-GlcNAcylation site prediction tools. A case study indicated that the proposed method could be a feasible means of conducting preliminary analyses of protein O-GlcNAcylation. The method has been implemented as a freely available web-based system, OGTSite.

T3 ⏰ International Society for Computational Biology (ISCB) － Invest in You, Your Research Field: An Open Forum to Learn More about ISCB and Its Programs
Bruno Gaeta^†, University of New South Wales, Australia; Treasurer Elect & Member of the Board of Directors, International Society for Computational Biology

ISCB has emerged as the leading professional society for participants in the field of computational biology and bioinformatics. This diverse group of nearly 3400 global members includes researchers, practitioners, technicians, students, and suppliers. ISCB Members find their experience and our benefits far exceed the investment. Join us for this special presentation about ISCB to learn more about the Society, our programs and initiatives, and how to get involved with shaping our future.

Image Processing

J53 ⏰ SPF-CellTracker: Tracking multiple cells with strongly-correlated moves using a spatial particle filter
Osamu Hirose^†, Kanazawa University, Japan

Tracking many cells in time-lapse 3D image sequences is an important and challenging task in bioimage informatics. Motivated by a study of brain-wide 4D imaging of neural activity in C. elegans, we present a new method of multi-cell tracking. The data types to which the method is applicable are characterized as follows: (i) cells are imaged as globular-like objects, (ii) it is difficult to distinguish cells based only on shape and size, (iii) the number of imaged cells is in the range of several hundred, (iv) cells move while interacting strongly with one another, and (v) cells do not divide. We developed tracking software, SPF-CellTracker, based on the widely used particle filter approach to tracking. Incorporating covariation and relative positions among moving cells into the prediction model is the key to reducing two kinds of tracking error: identity switching and coalescence of tracked positions. By incorporating a Markov random field into the state transition model, we describe the interactions among cells’ movements, e.g. (1) covariation of closely located cells, (2) preservation of relative positions among cells, and (3) collision avoidance. We also derive a fast computation algorithm, called the spatial particle filter. Using live-imaging data of neural activity in C. elegans in which approximately 120 neuronal nuclei are imaged, we demonstrate an advantage of the proposed method over the standard particle filter and the method reported by Tokunaga et al. (2014).
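For readers unfamiliar with the baseline, the following is a minimal sketch of a standard (bootstrap) particle filter tracking a single object in one dimension. It is not the spatial particle filter of SPF-CellTracker and contains no Markov random field coupling between cells; the random-walk motion model, Gaussian observation model, parameter values, and observations are all assumptions chosen for illustration.

```python
import math
import random


def particle_filter(observations, n_particles=500, process_sd=1.0, obs_sd=0.5, seed=0):
    """Bootstrap particle filter for a 1D random-walk state with Gaussian noise."""
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    estimates = []
    for y in observations:
        # Predict: move every particle according to the random-walk motion model.
        particles = [x + rng.gauss(0.0, process_sd) for x in particles]
        # Weight: Gaussian likelihood of the observation given each particle.
        weights = [math.exp(-0.5 * ((y - x) / obs_sd) ** 2) for x in particles]
        total = sum(weights)
        if total == 0.0:
            weights = [1.0 / n_particles] * n_particles
        else:
            weights = [w / total for w in weights]
        # Estimate: posterior mean of the state.
        estimates.append(sum(w * x for w, x in zip(weights, particles)))
        # Resample: draw particles in proportion to their weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates


# Hypothetical observed positions of one object over four frames.
observed = [0.0, 1.0, 2.0, 3.0]
estimates = particle_filter(observed)
print([round(e, 2) for e in estimates])
```

Tracking each cell with an independent filter of this kind is what allows identity switching when neighbouring cells look alike, which is the failure mode the abstract's coupled state transition model is designed to reduce.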

J82 ⏰ Automatic Genotyping from DNA Gel Electrophoresis Images using Bio-image Processing Technique
Saowaluck Kaewkamnerd, National Electronics and Computer Technology Center, Thailand
Apichart Intarapanich, National Electronics and Computer Technology Center, Thailand
Kittipat Ukosakit, Thammasat University Rangsit Campus, Thailand
Sissades Tongsima^†, National Center for Genetic Engineering and Biotechnology, Thailand
Somvong Tragoonrung, National Center for Genetic Engineering and Biotechnology, Thailand
Philip Shaw, National Center for Genetic Engineering and Biotechnology (BIOTEC), National Science and Technology Development Agency (NSTDA), Thailand

DNA gel electrophoresis is a molecular biology technique for separating DNA fragments of different sizes. Applications of DNA gel electrophoresis include DNA fingerprinting (genetic diagnosis), estimation of DNA size, and DNA size separation for Southern blotting. When an electric field is applied to a gel matrix, DNA molecules of different sizes migrate differently through the gel and accumulate as DNA bands on the gel lane, i.e., smaller molecules move faster and farther than larger ones. Through a special staining procedure, bands of DNA can be observed on the gel matrix under ultraviolet light. As the technology has matured, many laboratories have adopted DNA gel electrophoresis as a standard routine for DNA fingerprinting. Manually reading these DNA bands is a laborious and error-prone task when a large number of bands are to be interrogated. Although many bio-imaging techniques have been proposed, none of them can truly automate the typing of DNA.

J28 ⏰ Automated identification of copepods using digital image processing and artificial neural network
Lee Kien Leow^†, Institute of Biological Sciences, Faculty of Science, University Malaya, Kuala Lumpur, Malaysia
Li-Lee Chew, Institute of Ocean & Earth Sciences, Faculty of Science, University Malaya, Kuala Lumpur, Malaysia
Ving Ching Chong, Institute of Biological Sciences, Faculty of Science, University Malaya, Kuala Lumpur, Malaysia
Sarinder Kaur Dhillon^†, Bioinformatics, Institute of Biological Sciences, Faculty of Science, University Malaya, Kuala Lumpur, Malaysia

Background: Copepods are planktonic organisms that play a major role in the marine food chain. Studying the community structure and abundance of copepods in relation to the environment is essential to evaluate their contribution to mangrove trophodynamics and coastal fisheries. The routine identification of copepods can be very technical, requiring taxonomic expertise, experience and much effort, and can be very time consuming. Hence, there is an urgent need to introduce novel methods and approaches to automate the identification and classification of copepod specimens. This study aims to apply digital image processing and machine learning methods to build an automated identification and classification technique.
Results: We developed an automated technique to extract morphological features of copepod specimens from captured images using digital image processing techniques. An Artificial Neural Network (ANN) was used to classify copepod specimens from the species Acartia spinicauda, Bestiolina similis, Oithona aruensis, Oithona dissimilis, Oithona simplex, Parvocalanus crassirostris, Tortanus barbatus and Tortanus forcipatus based on the extracted features. 60% of the dataset was used to train a two-layer feed-forward network and the remaining 40% was used as the testing dataset for system evaluation. Our approach demonstrated an overall classification accuracy of 93.13% (100% for A. spinicauda, B. similis and O. aruensis, 95% for T. barbatus, 90% for O. dissimilis and P. crassirostris, 85% for O. simplex and T. forcipatus).
Conclusions: The methods presented in this study enable automated classification of copepods to the species level. Future studies should include more classes in the model, improve the selection of features, and reduce the time needed to capture the copepod images.

Special Session on NGS

S2a ⏰ CloudDOE: A User-Friendly Tool for Speeding up Hadoop Cloud Deployment and Genomic Data Analysis Using MapReduce
Yu-Jung Chang, Institute of Information Science, Academia Sinica, Taipei, Taiwan

Hadoop/MapReduce-based cloud computing has been successfully adopted in large-scale data analysis of bioinformatics, such as genome assembly, mapping reads to genomes, finding single nucleotide polymorphisms, etc. However, the prerequisite procedures of running MapReduce programs pose considerable challenges for biological research laboratories that are interested in using MapReduce. CloudDOE encapsulates technical details behind a user-friendly graphical interface and provides smart wizards for deploying a Hadoop cloud, and running bioinformatics applications, including CloudBurst read mapping, CloudBrush genome assembly, and CloudRS error correction.

S2b ⏰ Multi-Omics OnLine Analysis System (MOLAS)
Shu-Hwa Chen, Institute of Information Science, Academia Sinica, Taipei, Taiwan

Here we present MOLAS, the Multi‐Omics onLine Analysis System, a robust web application that takes gene expression data (FPKM/RPKM) from different libraries as input, maps the expressed genes to built-in annotations for further analyses (enrichment analysis, involved pathways, etc.) and helps unlock the biological meaning of complex data through an intuitive interface.

S2c ⏰ The investigation of genome wide DNA methylation
Pao-Yang Chen, Institute of Plant and Microbial Biology, Academia Sinica, Taipei, Taiwan

DNA methylation is an important epigenetic modification involved in many biological processes. Bisulfite treatment coupled with high-throughput sequencing (BS-seq) provides an effective approach for studying genome-wide DNA methylation at base resolution. My talk will give an overview of a comprehensive pipeline for the epigenomic data analysis of genome wide DNA methylation, including the biology, alignment, and bioinformatic analysis. I will cover a few case studies with integrative (epi)genomic analyses in plants and animals.

S2d ⏰ Studying RNA processing of small silencing RNAs by NGS
Jui-Hung Hung, Bioinformatics and systems biology institute, National Chiao Tung University, Taiwan

In my talk I will:
A. Introduce the detailed steps and caveats of a typical analysis of three major types of small silencing RNAs: miRNA, endo-siRNA, and piRNA.
B. Describe how to retrieve sequence features that reflect the outcomes of RNA processing in the biogenesis of small silencing RNAs.
C. Demonstrate the utility of the introduced analyses with examples and success stories.

Afternoon, Day 3, Sept. 11, 2015

Cancer

H148 ⏰ Mutation signatures implicate aristolochic acid in bladder cancer development
Steve Rozen^†, Duke-NUS Graduate Medical School Singapore
Song Ling Poon, Laboratory of Cancer Epigenome, Division of Medical Sciences, National Cancer Centre Singapore
Mi Ni Huang, Duke-NUS Graduate Medical School, Singapore
Yang Choo, Duke-NUS Graduate Medical School, Singapore
John R. McPherson, Duke-NUS Graduate Medical School, Singapore
Willie Yu, Duke-NUS Graduate Medical School, Singapore
Hong Lee Heng, National Cancer Centre Singapore
Anna Gan, National Cancer Centre Singapore
Swe Swe Myint, National Cancer Centre Singapore
Ee Yan Siew, National Cancer Centre Singapore
Lian Dee Ler, National Cancer Centre Singapore
Lay Guat Ng, Singapore General Hospital
Wen-Hui Weng, National Taipei University of Technology, Taiwan
Cheng-Keng Chuang, Chang Gung Memorial Hospital, Taiwan
John Yuen, Singapore General Hospital
See-Tong Pang, Chang Gung Memorial Hospital, Taiwan
Patrick Tan, Duke-NUS Graduate Medical School, Singapore; Cancer Sciences Institute of Singapore, Singapore; Genome Institute of Singapore, Singapore
Bin Tean Teh, National Cancer Centre Singapore

This paper was published in Genome Medicine, 7:38, 2015.
Background: Aristolochic acid (AA) is a natural compound found in many plants of the Aristolochia genus, and these plants are widely used in traditional medicines for numerous conditions and for weight loss. Previous work has connected AA-mutagenesis to upper-tract urothelial cell carcinomas and hepatocellular carcinomas. We hypothesize that AA may also contribute to bladder cancer.
Methods: Here, we investigated the involvement of AA-mutagenesis in bladder cancer by sequencing bladder tumor genomes from two patients with known exposure to AA. After detecting strong mutational signatures of AA exposure in these tumors, we exome-sequenced and analyzed an additional 11 bladder tumors and analyzed publicly available somatic mutation data from a further 336 bladder tumors.
Results: The somatic mutations in the bladder tumors from the two patients with known AA exposure showed overwhelming AA signatures. We also detected evidence of AA exposure in 1 out of 11 bladder tumors from Singapore and in 3 out of 99 bladder tumors from China. In addition, 1 out of 194 bladder tumors from North America showed a pattern of mutations that might have resulted from exposure to an unknown mutagen with a heretofore undescribed pattern of A > T mutations. Besides the signature of AA exposure, the bladder tumors also showed the CpG > TpG and activated-APOBEC signatures, which have been previously reported in bladder cancer.
Conclusions: This study demonstrates the utility of inferring mutagenic exposures from somatic mutation spectra. Moreover, AA exposure in bladder cancer appears to be more pervasive in the East, where traditional herbal medicine is more widely used. More broadly, our results suggest that AA exposure is more extensive than previously thought both in terms of populations at risk and in terms of types of cancers involved. This appears to be an important public health issue that should be addressed by further investigation and by primary prevention through regulation and education. In addition to opportunities for primary prevention, knowledge of AA exposure would provide opportunities for secondary prevention in the form of intensified screening of patients with known or suspected AA exposure.

H129 ⏰ Atlas of Cancer Signaling Network: a systems biology resource for integrative analysis of cancer data with Google Maps
Inna Kuperstein^†, Institut Curie, France
Eric Bonnet, Institut Curie, France
Hien-Anh Nguyen, Institut Curie, France
David Cohen, Institut Curie, France
Eric Viara, SYSRA, France
Luca Grieco, University College London, UK
Simon Fourquet, Institut Curie, France
Laurence Calzone, Institut Curie, France
Christophe Russo, Institut Curie, France
Maria Kondratova, Institut Curie, France
Marie Dutreix, Institut Curie, France
Emmanuel Barillot, Institut Curie, France
Andrei Zinovyev, Institut Curie, France

Cancerogenesis is driven by mutations leading to aberrant functioning of a complex network of molecular interactions and simultaneously affecting multiple cellular functions. Therefore, the successful application of bioinformatics and systems biology methods for analysis of high-throughput data in cancer research heavily depends on the availability of global and detailed reconstructions of signaling networks amenable to computational analysis. We present here the Atlas of Cancer Signaling Network (ACSN), an interactive and comprehensive map of molecular mechanisms implicated in cancer. The resource includes tools for map navigation, visualization and analysis of molecular data in the context of signaling network maps. Constructing and updating ACSN involves careful manual curation of the molecular biology literature and participation of experts in the corresponding fields. The cancer-oriented content of ACSN is completely original and covers major mechanisms involved in cancer progression, including DNA Repair, Cell Survival, Apoptosis, Cell Cycle, EMT and Cell Motility. Cell signaling mechanisms are depicted in detail, together creating a seamless ‘geographic-like’ map of molecular interactions frequently deregulated in cancer. The map is browsable through the NaviCell web interface, which uses the Google Maps engine and the semantic zooming principle. The associated web-blog provides a forum for commenting on and curating the ACSN content. ACSN allows users to upload heterogeneous omics data on top of the maps for visualization and functional analyses. We suggest several scenarios for ACSN application in cancer research, particularly for visualizing high-throughput data, ranging from siRNA-based screening results or mutation frequencies to innovative ways of exploring transcriptomes and phosphoproteomes. Integration and analysis of these data in the context of ACSN may help interpret their biological significance and formulate mechanistic hypotheses.
ACSN may also support patient stratification, prediction of treatment response and resistance to cancer drugs, as well as design of novel treatment strategies.

H138 ⏰ CSNK1E/CTNNB1 Are Synthetic Lethal to TP53 in Colorectal Cancer and are Markers for Prognosis
Grace S. Shieh^†,
Jan-Gowth Chang, Institute of Statistical Science, Academia Sinica, Taiwan
Khong-Loon Tiong, Center of RNA Biology and Clinical Application, China Medical University Hospital, China Medical University, Taiwan
Kuo-Ching Chang, Institute of Statistical Science, Academia Sinica, Taiwan
Kun-Tu Yeh, Institute of Statistical Science, Academia Sinica, Taiwan
Ting-Yuan Liu, Department of Pathology, Changhua Christian Hospital, Taiwan
Jia-Hong Wu, Graduate Institute of Medicine, College of Medicine, Kaohsiung Medical University, Taiwan
Ping-Heng Hsieh, Institute of Statistical Science, Academia Sinica, Taiwan
Shu-Hui Lin, Institute of Statistical Science, Academia Sinica, Taiwan
Wei-Yun Lai, Department of Pathology, Changhua Christian Hospital, Taiwan
Yu-Chin Hsu, Institute of Biochemistry and Molecular Biology, School of Life Sciences, National Yang-Ming University, Taiwan
Jeou-Yuan Chen, Institute of Statistical Science, Academia Sinica, Taiwan

Two genes are called synthetic lethal (SL) if their simultaneous mutations lead to cell death, but each individual mutation does not. Targeting SL partners of mutated cancer genes can kill cancer cells specifically while leaving normal cells intact. We present an integrated approach to uncovering SL pairs in colorectal cancer (CRC). Screening verified SL pairs using microarray gene expression data of cancerous and normal tissues, we first identified potential functionally relevant (simultaneously differentially expressed) gene pairs. From the top-ranked pairs, ~20 genes were chosen for immunohistochemistry (IHC) staining in 171 CRC patients. To find novel SL pairs, all 169 combined pairs from the individual IHC were synergistically correlated to five clinicopathological features, e.g. overall survival. Of the 11 predicted SL pairs, MSH2-POLB and CSNK1E-MYC were consistent with the literature, and we validated the top two pairs, CSNK1E-TP53 and CTNNB1-TP53, using RNAi knockdown and small molecule inhibitors of CSNK1E in isogenic HCT-116 and RKO cells. Furthermore, synthetic lethality of CSNK1E and TP53 was verified in a mouse model. Importantly, multivariate analysis revealed that CSNK1E-P53, CTNNB1-P53, MSH2-RB1, and BRCA1-WNT5A were prognostic markers independent of stage, with CSNK1E-P53 applicable to early-stage disease and the remaining three applicable throughout all stages. Our findings suggest that CSNK1E is a promising target for TP53-mutant CRC patients, who constitute ~40% to 50% of patients, while to date the safety of inhibiting TP53 remains controversial. Thus the integrated approach is useful in finding novel SL pairs for cancer therapeutics, and it is readily accessible and applicable to other cancers.

J51 ⏰ Constructing a molecular interaction network for thyroid cancer via large-scale text mining of gene and pathway events
Jean-Marc Schwartz, University of Manchester, UK
Chengkun Wu^†, National University of Defense Technology, China
Georg Brabant, University of Manchester, UK
Shaoliang Peng, National University of Defense Technology, China
Goran Nenadic^†, School of Computer Science, University of Manchester, UK

Background: Biomedical studies need assistance from automated tools and easily accessible data to address the problem of the rapidly accumulating literature. Text-mining tools and curated databases have been developed to address such needs and they can be applied to improve the understanding of molecular pathogenesis of complex diseases like thyroid cancer.
Results: We present a systematic approach to reconstruct the molecular interaction context of thyroid cancer from the literature. We first developed a system, PWTEES, which extracts pathway interactions from the literature using event extraction and pathway named entity recognition. We then applied the system to a thyroid cancer corpus and systematically extracted molecular interactions involving either genes or pathways. With the extracted information, we constructed a molecular interaction network taking genes and pathways as nodes. Using curated pathway information and network topological analysis, we were able to highlight key genes and pathways that were not prominent without curated pathway knowledge.
Conclusions: Mining events involving genes and pathways from the literature and integrating curated pathway knowledge can help improve the understanding of molecular interactions of complex diseases. The system developed for this study can be applied in studies other than thyroid cancer. The source code is freely available online.

DNA Methylation

J65 ⏰ Detection of differentially methylated regions from bisulfite-seq data by hidden Markov models incorporating genome-wide methylation level distributions
Yutaka Saito^†, Computational Regulatory Genomics Research Group, National Institute of Advanced Industrial Science and Technology (AIST), Japan
Toutai Mituyama, Computational Regulatory Genomics Research Group, National Institute of Advanced Industrial Science and Technology (AIST), Japan

Background: Detection of differential methylation between biological samples is an important task in bisulfite-seq data analysis. Several studies have attempted de novo finding of differentially methylated regions (DMRs) using hidden Markov models (HMMs). However, there is room for improvement in the design of HMMs, especially on emission functions that evaluate the likelihood of differential methylation at each cytosine site.
Results: We describe a new HMM for DMR detection from bisulfite-seq data. Our method utilizes emission functions that combine binomial models for aligned read counts, and beta mixtures for incorporating genome-wide methylation level distributions. We also develop unsupervised learning algorithms to adjust parameters of the beta-binomial models depending on differential methylation types (up, down, and not changed). In experiments on both simulated and real datasets, the new HMM improves DMR detection accuracy compared with HMMs in our previous study. Furthermore, our method achieves better accuracy than other methods using Fisher's exact test and methylation level smoothing.
Conclusions: Our method enables accurate DMR detection from bisulfite-seq data. The implementation of our method is named ComMet, and distributed as a part of the Bisulfighter package.
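
The combination of a binomial model for read counts with a beta mixture over methylation levels can be illustrated with a beta-binomial mixture likelihood. The following is a minimal sketch under stated assumptions, not ComMet's actual emission function: the function names and the two-component mixture parameters are hypothetical and chosen only for illustration.

```python
from math import exp, lgamma


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def log_choose(n, k):
    """Log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def beta_binomial_pmf(k, n, a, b):
    """P(k methylated reads out of n) when the site's methylation
    level follows Beta(a, b) and reads are binomial given that level."""
    return exp(log_choose(n, k) + log_beta(k + a, n - k + b) - log_beta(a, b))


def mixture_emission(k, n, components):
    """Likelihood under a beta mixture; components is a list of
    (weight, a, b) tuples whose weights sum to 1."""
    return sum(w * beta_binomial_pmf(k, n, a, b) for w, a, b in components)


# Hypothetical bimodal genome-wide distribution (assumed values):
# 30% of sites mostly unmethylated, 70% mostly methylated.
components = [(0.3, 1.0, 9.0), (0.7, 9.0, 1.0)]
p_high = mixture_emission(9, 10, components)  # 9 of 10 reads methylated
p_mid = mixture_emission(5, 10, components)   # 5 of 10 reads methylated
```

Under such a bimodal prior, intermediate read counts receive low likelihood, which is one way genome-wide methylation level distributions could inform a per-site emission score.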

J55 ⏰ An Integrative Approach for Efficient Analysis of Whole Genome Bisulfite Sequencing Data
Jonghun Lee, Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, the University of Tokyo, Japan
Sung-Joon Park, Human Genome Center, the Institute of Medical Science, the University of Tokyo, Japan
Kenta Nakai^†, Human Genome Center, the Institute of Medical Science, the University of Tokyo, Japan

Background: Whole genome bisulfite sequencing (WGBS) is a high-throughput technique for profiling genome-wide DNA methylation at single-nucleotide resolution. However, the applications of WGBS are limited by low accuracy resulting from bisulfite-induced damage to DNA fragments. Although many computer programs have been developed for accurate methylation detection, most have barely succeeded in improving either the quantity or the quality of the methylation results. To improve both, we set out to develop a novel integration of the most widely used bisulfite-read mappers: Bismark, BSMAP, and BS-Seeker2.
Results: A comprehensive analysis of the three mappers revealed that their mapping results were mutually complementary under diverse read conditions. Therefore, we sought to integrate the characteristics of the mappers by scoring them to gain robustness against artifacts. As a result, the integration significantly increased detection accuracy compared with the individual mappers. In addition, the number of detected cytosines was higher than that obtained with Bismark. Furthermore, the integration successfully reduced the fluctuation of detection accuracy induced by read conditions. We applied the integration to real WGBS samples and succeeded in classifying the samples according to their tissues of origin by both CpG and CpH methylation patterns.
Conclusions: In this study, we improved both the quality and the quantity of methylation results from WGBS data by integrating the mapping results of three bisulfite-read mappers. We also succeeded in combining and comparing WGBS samples by reducing the effects of read heterogeneity on methylation detection. This study contributes to DNA methylation research by improving the efficiency of methylation detection from WGBS data and facilitating the comprehensive analysis of public WGBS data.

J70 ⏰ MethGo: a comprehensive tool for analyzing whole genome bisulfite sequencing data
Ming-Ren Yen, Institute of Plant and Microbial Biology, Academia Sinica, Taiwan
Wen-Wei Liao, Institute of Plant and Microbial Biology, Academia Sinica, Taiwan
Evaline Ju, Department of Electrical and Computer Engineering, Carnegie Mellon University, USA
Fei-Man Hsu, Institute of Plant and Microbial Biology, Academia Sinica, Taiwan
Larry Lam, Department of Molecular, Cell and Developmental Biology, University of California, Los Angeles, USA
Pao-Yang Chen^†, Institute of Plant and Microbial Biology, Academia Sinica, Taiwan

Background: DNA methylation is a major epigenetic modification regulating several biological processes. Genome-wide sequencing of bisulfite-converted DNA (BS-seq) can profile DNA methylation at single-base resolution. Customized aligners have been developed for mapping BS-seq reads, but bioinformatic pipelines are still required for downstream data analysis. Most post-alignment programs generate methylation calls, and our tool carries out subsequent genomic and epigenomic analyses to comprehensively explore BS-seq datasets.
Results: Here we developed MethGo, software specifically designed for analyzing data from whole genome bisulfite sequencing (WGBS) and reduced representation bisulfite sequencing (RRBS). MethGo provides a genome-wide view of DNA methylation, estimates of DNA methylation levels at both the global and gene levels, SNP calling, copy number variation (CNV) profiling, and analysis of DNA methylation at transcription factor binding sites.
Conclusions: MethGo is a simple and fast tool for BS-seq data, including both WGBS and RRBS. It contains four major modules for analyzing the (epi)genome. It profiles genome-wide DNA methylation alone or coupled with transcription factor binding sites, and assesses genetic variations such as SNPs and CNVs. The Python program is publicly available.

J90 ⏰ Subset Selection of High-Depth Next Generation Sequencing Reads for De Novo Genome Assembly Using MapReduce Framework
Yu-Jung Chang^†, Institute of Information Science, Academia Sinica, Taiwan
Chih-Hao Fang, Institute of Information Science, Academia Sinica, Taiwan
Wei-Chun Chung, Institute of Information Science, Academia Sinica, Taiwan
Ping-Heng Hsieh, Institute of Information Science, Academia Sinica, Taiwan
Chung-Yen Lin, Institute of Information Science, Academia Sinica, Taiwan
Jan-Ming Ho, Institute of Information Science, Academia Sinica, Taiwan

Background: Recent progress in next-generation sequencing technology has afforded several improvements such as ultra-high throughput at low cost, very high read quality, and substantially increased sequencing depth. State-of-the-art high-throughput sequencers, such as the Illumina MiSeq system, can generate ~15 Gbp sequencing data per run, with >80% bases above Q30 and a sequencing depth of up to several 1000x for small genomes. Illumina HiSeq 2500 is capable of generating up to 1 Tbp per run, with >80% bases above Q30 and often >100x sequencing depth for large genomes. To speed up otherwise time-consuming genome assembly and/or to obtain a skeleton of the assembly quickly for scaffolding or progressive assembly, methods for noise removal and reduction of redundancy in the original data, with almost equal or better assembly results, are worth studying.
Results: We developed two subset selection methods for single-end reads and one method for paired-end reads, based on base quality scores and other read analytic tools, using the MapReduce framework. We proposed two strategies to select reads: MinimalQ and ProductQ. MinimalQ selects reads whose minimal base quality is above a threshold. ProductQ selects reads whose probability of containing no incorrect base is above a threshold. In the single-end experiments, we used Escherichia coli and Bacillus cereus datasets from MiSeq, the Velvet assembler for genome assembly, and the GAGE benchmark tools for result evaluation. In the paired-end experiments, we used a grouper dataset from HiSeq, the ALLPATHS-LG genome assembler, and the QUAST quality assessment tool to compare genome assemblies of the original set and the subset. The results show that subset selection can not only speed up genome assembly but also produce substantially longer scaffolds.
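
The two selection rules can be sketched as per-read predicates. This is an illustrative sketch only, not the authors' MapReduce implementation: it assumes Phred+33 quality strings, uses the standard Phred relation p_error = 10^(-Q/10), treats "above a threshold" as greater than or equal to, and all function names are hypothetical.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(c) - offset for c in quality_string]


def passes_minimal_q(quals, min_q):
    """MinimalQ-style rule: the lowest base quality must reach min_q."""
    return min(quals) >= min_q


def prob_no_error(quals):
    """Probability that every base is correct, assuming independent
    per-base errors with p_error = 10 ** (-Q / 10)."""
    p = 1.0
    for q in quals:
        p *= 1.0 - 10 ** (-q / 10)
    return p


def passes_product_q(quals, min_prob):
    """ProductQ-style rule: P(no incorrect base) must reach min_prob."""
    return prob_no_error(quals) >= min_prob


# Toy reads: the second has one very low-quality base ('#' is Q2).
reads = {"read1": "IIIIIIII", "read2": "IIII#III"}
selected_minimal = [name for name, qs in reads.items()
                    if passes_minimal_q(phred_scores(qs), 30)]
selected_product = [name for name, qs in reads.items()
                    if passes_product_q(phred_scores(qs), 0.99)]
```

In a MapReduce setting, a predicate like these would naturally sit in the map step, since each read can be judged independently of all others.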

Dynamic Network Inference

H150 ⏰ Positive feedback within a kinase signaling complex functions as a switch mechanism for NF-κB activation
Kentaro Inoue^†, RIKEN, Japan
Hisaaki Shinohara^†, RIKEN, Japan
Mariko Okada^†, RIKEN, Japan

Nuclear factor-κB (NF-κB) is a key transcription factor that regulates the expression of a variety of genes with important roles in cell fate decisions. NF-κB activity shows a switch-like dose response to extracellular stimuli. Here, using quantitative experiments and mathematical modeling, we identified a positive feedback loop within the CARMA1-TAK1-IKKβ module, which lies upstream of NF-κB, that is necessary for the switch-like activation of NF-κB in B cell receptor (BCR) signaling.
CARMA1 is an essential modulator of NF-κB activation in BCR signaling and is activated at multiple sites on the protein itself (Shinohara, 2007, JEM). After BCR stimulation, PKCβ induces phosphorylation of CARMA1 on serine 668, allowing activation of TAK1 and IKKβ. IKKβ activation induces IκB phosphorylation and degradation, resulting in nuclear translocation of NF-κB. Because CARMA1 is activated at serine 578 by IKKβ, we measured time courses and dose responses of TAK1 and IKKβ activity. TAK1 activity showed two peaks in the time course, and the dose responses of TAK1 and IKKβ were switch-like. To investigate the cause of the second peak of TAK1 activity, we employed a CARMA1 mutant at serine 578 (S578A). In this mutant, the second peak of TAK1 activity and the switch-like responses were absent. To analyze the dynamics in detail, we constructed an ordinary differential equation-based model that recapitulates the dynamics of TAK1 and IKKβ activity. The model explained that the positive feedback loop from IKKβ to TAK1 via phosphorylation of CARMA1 at serine 578 regulates the second peak of TAK1 activity. To determine whether the positive feedback loop also functions to induce the switch-like activation of NF-κB, we examined nuclear translocation of NF-κB at the cell population level and the single-cell level. Peaks of NF-κB activity at both levels exhibited a switch-like response in the wild type and a graded response in the S578A mutant, suggesting a lack of positive cooperativity in this mutant.
Our results indicate that the positive feedback loop from IKKβ to TAK1 via phosphorylation of CARMA1 at serine 578 serves as the basis for switch-like activation of NF-κB, thereby determining an activation threshold in BCR signaling.
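
How positive feedback can turn a graded dose response into a switch-like one can be illustrated with a generic one-variable toy model. This is emphatically not the authors' CARMA1-TAK1-IKKβ model: the single lumped activity variable, the Hill-type feedback term, and every parameter value below are assumptions made purely for illustration. Setting the feedback strength to zero plays the role of a feedback-deficient mutant.

```python
def steady_state(stimulus, feedback=4.0, K=0.5, n=4, decay=1.0,
                 dt=0.01, t_end=200.0):
    """Integrate dx/dt = (stimulus + feedback * hill(x)) * (1 - x) - decay * x
    from x = 0 with forward Euler and return the final activity x in [0, 1].
    All parameters are arbitrary toy values, not fitted to any data."""
    x = 0.0
    for _ in range(int(t_end / dt)):
        hill = x ** n / (K ** n + x ** n)  # self-activation (positive feedback)
        dxdt = (stimulus + feedback * hill) * (1.0 - x) - decay * x
        x += dt * dxdt
    return x


doses = [0.05, 0.10, 0.20, 0.40]
with_feedback = [steady_state(s) for s in doses]
without_feedback = [steady_state(s, feedback=0.0) for s in doses]
```

With these toy parameters the feedback model stays at low activity for the two smaller doses and jumps to high activity for the two larger ones, whereas the feedback-free model rises smoothly as stimulus / (stimulus + decay).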

J76 ⏰ Identification of network-based biomarkers of cardioembolic stroke using a systems biology approach with time series data
Yung-Hao Wong, NTHU, Taiwan
Chia-Chou Wu, NTHU, Taiwan
Hsien-Yong Lai, Institute of Review Board[IRB], Christian Mennonite Hospital, Taiwan
Bo-Ren Jheng, NTHU, Taiwan
Hsing-Yu Weng, Graduate Institute of Clinical Medicine, Taipei Medical University, Taiwan
Tzu-Hao Chang^†, Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taiwan
Bor-Sen Chen^†, NTHU, Taiwan

Background: Molecular signaling of post-stroke angiogenesis begins within hours of the onset of a stroke, with sequential increases in the expression of initially destabilizing combinations of growth factor receptors and vascular growth factors. Growth factor combinations provide the ability to promote endothelial cell stabilization. Recent studies have further revealed coordinated patterns of gene expression associated with stroke and the relationships between neurodegenerative and neural repair processes after a stroke.
Results: Differential protein-protein interaction networks (PPINs) were constructed at three post-stroke time points, and proteins with a significant stroke relevance value (SRV) were discovered. Genes, including UBC, CUL3, APP, NEDD8, JUP, and SIRT7, showed high associations with time after the stroke, and Ingenuity Pathway Analysis results showed that these post-stroke time series-associated genes were related to molecular and cellular functions of cell death and survival, the cell cycle, cellular development, cellular movement, and cell-to-cell signaling and interactions. These biomarkers may be helpful for the early detection, diagnosis, and prognosis of ischemic stroke.
Conclusions: This is our first attempt to apply our systems biology framework to stroke. We focused on three key post-stroke time points and identified the network and corresponding network biomarkers for each of them. Further studies are needed to experimentally confirm the findings and compare them with the causes of ischemic stroke. Our findings showed that stroke-associated biomarker genes at the different time points are significantly involved in cell cycle processes, including G2-M, G1-S, and meiosis, which contributes to the current understanding of the etiology of stroke. We hope this work helps scientists reveal more hidden cellular mechanisms of stroke etiology and repair processes.

J52 ⏰ Inference of gene interaction networks using conserved subsequential patterns from multiple time course gene expression datasets
Renhua Song, University of Technology, Sydney, Australia
Qian Liu^†, University of Technology, Sydney, Australia
Jinyan Li, Advanced Analytics Institute, University of Technology, Sydney, Australia

Deciphering gene interaction networks (GINs) from time-course gene expression (TCGx) data is highly valuable for understanding gene behaviors (e.g., activation, inhibition, time-lagged causality) at the system level. Existing methods usually use a global or local proximity measure to infer GINs from a single dataset. Because the noise contained in a single dataset can hardly be resolved from that dataset alone, the results are sometimes unreliable. Moreover, these proximity measures cannot handle the co-existence of the various in vivo positive, negative, and time-lagged gene interactions. In this work, we propose to infer reliable GINs from multiple TCGx datasets using a novel conserved subsequential pattern of gene expression. A subsequential pattern is a maximal subset of genes sharing positive, negative, or time-lagged correlations with one expression template on their own subsets of time points. Based on these patterns, a GIN can be built from each of the datasets. We assume that reliable gene interactions will be detected repeatedly, and therefore use conserved gene pairs from the individual GINs of the multiple TCGx datasets to construct a reliable GIN for a species. We apply our method to six TCGx datasets related to the yeast cell cycle, and validate the reliable GINs using protein interaction networks, biopathways, and transcription factor-gene regulations. We also compare the reliable GINs with GINs reconstructed from single datasets by a global proximity measure, the Pearson correlation coefficient. Our reliable GINs achieve much better prediction performance, especially much higher precision. Functional enrichment analysis also suggests that gene sets in a reliable GIN are more functionally significant. Our method is especially useful for deciphering GINs from multiple TCGx datasets related to less-studied organisms, for which little knowledge is available beyond gene expression data.
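
The final step, keeping only gene pairs that recur across the per-dataset networks, can be sketched as follows. This is a simplified illustration, not the authors' implementation: the abstract does not state how many datasets a pair must appear in, so the `min_support` parameter, the function name, and the toy gene names are all assumptions. Pairs are treated as undirected.

```python
from collections import Counter


def conserved_edges(networks, min_support):
    """Return gene pairs present in at least min_support of the networks.
    Each network is an iterable of (gene, gene) pairs; pairs are treated
    as undirected and counted at most once per network."""
    counts = Counter()
    for edges in networks:
        unique = {frozenset(pair) for pair in edges if len(set(pair)) == 2}
        counts.update(unique)
    return {tuple(sorted(edge)) for edge, c in counts.items()
            if c >= min_support}


# Toy networks with made-up gene names, one per hypothetical dataset.
networks = [
    {("geneA", "geneB"), ("geneC", "geneD")},
    {("geneB", "geneA"), ("geneE", "geneF")},
    {("geneA", "geneB"), ("geneD", "geneC")},
]
reliable = conserved_edges(networks, min_support=2)
```

Raising `min_support` trades recall for precision: requiring a pair in all three toy networks keeps only the geneA-geneB pair.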

J13 ⏰ Detecting the shifts of gene regulatory networks during time-course experiments with a single time point temporal resolution
Yoichi Takenaka^†, Osaka University, Japan
Shigeto Seno^†, Osaka University, Japan
Hideo Matsuda^†, Osaka University, Japan

Background: Comprehensively understanding the dynamics of biological systems is currently one of the biggest challenges in biology. Vastly improved biological technologies have provided huge amounts of information, which must be handled by bioinformatics and systems biology research. Current state-of-the-art computational approaches for analyzing time-course gene expression profiles assume a single model throughout the experiment. Moreover, these methods cannot easily analyze gene regulation at single time-point resolution.
Results: We propose a score that analyzes the gene regulations at each time point. The score is based on the information gains of information criterion values. The method detects the shifts in gene regulatory networks during time-course experiments to single time-point resolution. The effectiveness of the method is evaluated on the diauxic shift from glucose to lactose in E. coli. Gene regulation shifts were detected at two time points; the first corresponding to the time at which the growth of E. coli ceased and the second corresponding to the end of the experiment, when the nutrient sources (glucose and lactose) had become exhausted. According to these results, the proposed score and method can appropriately detect the times of gene regulation shifts.
Conclusions: The method based on the proposed score provides a new tool for analyzing dynamic biological systems. As the score value indicates the strength of gene regulation at each time point on a gene expression profile, it can potentially infer hidden gene regulatory networks throughout time-course experiments.