List of Accepted Posters
01 – Disease | 02 – Database and Tools | 03 – Sequence Analysis | 04 – Mathematical Model | 05 – Genome | 06 – Transcriptome | 07 – Proteome | 08 – Metabolome | 09 – Glycan and Compounds | 10 – Network |
|1||Feng-Chi Chen, Yu-Chieh Liao, Jie-Mao Huang, Chieh-Hua Lin, Yih-Yuan Chen, Horng-Yunn Dou and Chao Agnes Hsiung.
Empirical and experimental validation of the tuberculosis drugome
Abstract: Drug resistance of Mycobacterium tuberculosis (MTB) is a serious threat to public health. The development of new drugs against MTB has been slow despite the severity of global tuberculosis infections. One shortcut to tackle this issue is drug repositioning, namely application of approved drugs to an unapproved medical usage – anti-tuberculosis in this case. Computational methods have been proposed to prioritize drugs for such repositioning. The “TB drugome” approach, first proposed in 2009, is one of these methods. This approach evaluates the local structural similarities between drug target proteins and MTB proteins, and predicts the number of MTB proteins that may be bound (and presumably inhibited) by each examined drug. The drugs are then ranked according to these predicted numbers. Here we update the TB drugome by adding structural information accumulated in the past three years, and experimentally examine the effectiveness of twenty-three of the drugs in inhibiting the growth of MTB. We report that two of the examined drugs – tamoxifen and 4-hydroxytamoxifen – alone or in combination with first-line anti-tuberculosis drugs, can effectively inhibit the growth of MTB in vitro. The effectiveness of the TB drugome approach is thus experimentally supported. We also discuss how the approach can be further improved.
APPLICATION OF GENOME WIDE ASSOCIATION STUDIES IN THE INVESTIGATION OF THE GENETIC ARCHITECTURE OF SOME KILLER DISEASES IN NIGERIA
Abstract: Nigeria is the most populous country in Africa with a population of more than 170 million people. There are more than 250 ethnic groups, 380 languages, and a diverse range of cultural and religious beliefs and practices in Nigeria. There are many health challenges in the country leading to millions of deaths annually. The top ten causes of death in Nigeria are: Malaria 20%, Lower Respiratory Infections 19%, HIV 9%, Diarrheal Diseases 5%, Road Injuries 5%, Protein-Energy Malnutrition 4%, Cancer 3%, Meningitis 3%, Stroke 3%, and Tuberculosis 2%. The study of human genetics provides an avenue to identify genetic risk factors for common and complex diseases. There are many different technologies and tools for identifying genetic risk factors, one of which is Genome Wide Association Studies (GWAS). To identify genetic risk factors for common diseases in a given population, GWAS measures and analyzes DNA sequence variations across the human genome. GWAS gives us the opportunity to predict who is at risk and to develop new prevention and treatment strategies based on the identified biological foundations.
Given the high rates of the above-mentioned killer diseases in Nigeria, investigating the genetic variation underlying such diseases within the population will play an important role in pharmacologic therapies, drug discovery and personalized medicine, which tailors healthcare to individual patients based on their genetic background and other biological features. GWAS is capable of revealing the genetic effects of non-genic DNA regions and causal genes implicated in disease etiology. Genome Wide Association Studies of cancer, diabetes and stroke in Nigeria will help to estimate, in a balanced approach, the comparatively complete additive, non-additive and pleiotropic genetic effects of these diseases.
This work aims to consider the potential impact of GWAS on three killer diseases in Nigeria: cancer, diabetes and stroke. Study designs such as Case Control versus Quantitative Designs, Standardized Phenotype Criteria and Phenotype Extraction from Electronic Medical Records will be reviewed. For the Association test, Single Locus Analysis, Covariate Adjustment and Population Stratification, Corrections for Multiple Testing, and Multi-Locus Analysis will be considered.
|3||Gianfranco Alpini, Shannon Glaser and Fanyin Meng.
Definitive endoderm differentiation of biliary-committed progenitor cells during cholestatic liver injury
Abstract: BACKGROUND & AIMS: The biliary tree is a complex network of interconnected ducts that increase in diameter from small to large bile ducts. Biliary-committed progenitor cells (small cholangiocytes, SMCCs) from small bile ducts are more resistant to hepatobiliary injury than large cholangiocytes (LGCCs) from large bile ducts. The definitive endoderm marker, FoxA2, is the key transcription factor that regulates cell differentiation and tissue regeneration. Our aim was to characterize the functional role of FoxA2 in biliary progenitor cells during cholestatic liver injury.
METHODS: Murine biliary committed progenitors (SMCCs) and control LGCCs were isolated from mouse liver based on size distribution by counterflow elutriation. mRNA expression in SMCCs and LGCCs was assessed by PCR array analysis. Bile duct ligation (BDL) and MDR2 knockout mice (MDR2-/-) were used as animal models of cholestatic liver injury. We also performed studies to determine whether suitable biliary support would permit hepatic repair and regrowth of the damaged liver with healthy transplanted cholangiocytes in NOD/SCID mice with BDL injury.
RESULTS: By PCR array analysis, we identified definitive endoderm markers, including FoxA2 and Sox17, as well as BMP1, that are differentially expressed in SMCCs compared with control LGCCs. FoxA2 expression was also notably higher in murine liver progenitor cells than in SMCCs and LGCCs. Activated FoxA2 expression was observed in murine small bile ducts in the livers of BDL and MDR2-/- mice, suggesting that it is an important mediator of biliary remodeling. Because biliary progenitors/small cholangiocytes can be transplanted in large numbers in the peritoneal cavity, readily equaling or exceeding those in the liver, we examined the effect of the transplanted cells on serum chemistry. The expansion of engrafted biliary progenitors/cholangiocytes in the liver was confirmed by PKH26 red fluorescent labeling and detection inside the specific liver sections after cell therapy. Serum ALT and AST levels in NOD/SCID mice engrafted with SMCCs and liver stem cells (3 × 10⁷, i.p.) showed significant changes compared with vehicle-treated mice (n = 5), along with significantly improved sirius red staining. Enhanced expression of the definitive endoderm differentiation marker FoxA2 was observed in BDL mouse liver after SMCC cell therapy. Furthermore, activation of MMP-9/MMP-2 and α-SMA was observed in BDL/Mdr2 knockout mouse liver, and recovered after SMCC engraftment.
CONCLUSION: The therapeutic effect of biliary-committed progenitor cells during cholestatic liver injury is mediated by definitive endoderm marker FoxA2, the known positive regulator of biliary development and injury. The findings provide new insight into the therapeutic potentials of small cholangiocytes during cholestatic liver injury and fibrosis.
An Immunoinformatics approach for designing epitope based vaccine strategy against S protein of mysterious new Middle East Respiratory Syndrome Coronavirus (MERS-CoV)
Abstract: In 2012, a deadly virus, Middle East Respiratory Syndrome Coronavirus (MERS-CoV), emerged from the Arabian Peninsula, and it is striking fear in the hearts of public health officials throughout the world. Recent studies find that, similar to SARS-CoV, the spike (S) protein of MERS-CoV plays important roles in receptor binding and viral entry, which affect viral host range. As the major protein causing virus infection, the S protein can be an ideal target for both vaccines and MERS-CoV entry inhibitors. Hence, analyzing the properties of the MERS-CoV S protein is a high research priority [1,2]. As there is no effective drug available, novel approaches to epitope prediction for vaccine development were applied in this study. We identified several immunodominant sites on the S protein using immunoinformatics tools. Epitopes, or peptide fragments as nonamers, of these antigenic S proteins were analyzed according to their proteasomal cleavage sites, TAP scores and IC50>250 nM, and the predictions were scrutinized. Furthermore, the epitope sequences were examined by in silico docking simulation with different specific HLA receptors. This study suggests that the S protein is highly immunogenic and induces protection against MERS-CoV challenge and that neutralizing antibodies alone may be able to suppress virus proliferation, further supporting the rationale that vaccines against MERS-CoV can be developed based on the S protein, which can provide high population coverage.
1. de Groot RJ, Baker SC, Baric RS, Brown CS, Drosten C, Enjuanes L, Fouchier RA, Galiano M, Gorbalenya AE, Memish ZA, Perlman S, Poon LL, Snijder EJ, Stephens GM, Woo PC, Zaki AM, Zambon M, Ziebuhr J. Middle East respiratory syndrome coronavirus (MERS-CoV): announcement of the Coronavirus Study Group. J Virol. 2013;87(14):7790-2.
2. Memish ZA, Zumla AI, Al-Hakeem RF, Al-Rabeeah AA, Stephens GM. Family cluster of Middle East respiratory syndrome coronavirus infections. N Engl J Med. 2013;368(26):2487-94.
|5||Yi-Ting Wang, Pei-Yuan Sung, Peng-Lin Lin and Ren-Hua Chung.
A Gene Set Association Test for Complex Diseases Incorporating an Optimal Threshold Algorithm in Nuclear Families
Abstract: Genome-wide association studies (GWAS) have become a common approach to identifying single nucleotide polymorphisms (SNPs) associated with complex diseases. However, complex diseases, such as hypertension, diabetes, and Alzheimer disease, are caused by the joint effects of multiple genes, while the effect of an individual gene or SNP is modest. Therefore, a method considering the joint effects of multiple SNPs can be more powerful than testing individual SNPs. SNPs in a gene or a pathway are often used for multi-SNP analysis. Large genes or pathways can result in a large set of SNPs for the analysis. A challenge for multi-SNP analysis is how to effectively select a subset of SNPs with promising association signals. Moreover, current multi-SNP analysis methods were developed mainly for designs with unrelated cases and controls, and only a few methods are available for family-based studies. We developed the Optimal P-value Threshold Pedigree Disequilibrium Test (OTPDT). The OTPDT uses general nuclear families. A variable p-value threshold algorithm is used to determine an optimal p-value threshold for selecting a subset of SNPs. A permutation procedure is used to assess the significance of the test. We used simulations to verify that the OTPDT has correct type I error rates. Our power studies showed that the OTPDT can be more powerful than the set-based test in PLINK and the multi-SNP FBAT test. We applied the OTPDT to a family-based GWAS dataset for hypertension. Some candidate genes for hypertension were identified. The method will be useful for the secondary analysis of existing GWAS datasets.
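The general idea of a variable p-value threshold combined with a permutation test can be illustrated with a small sketch. This is not the OTPDT itself, whose details are not given in the abstract. It assumes a hypothetical input `diffs` (one row per family, one column per SNP, each entry a transmitted-minus-untransmitted allele difference), a PDT-like per-SNP statistic, a sum of squared statistics over the selected SNPs, and whole-family sign flipping as the permutation scheme. All function names are illustrative.

```python
import math
import random

def pdt_z(diffs):
    """Per-SNP statistic sum(D) / sqrt(sum(D^2)) over family differences D."""
    z = []
    for j in range(len(diffs[0])):
        column = [family[j] for family in diffs]
        denom = math.sqrt(sum(d * d for d in column))
        z.append(sum(column) / denom if denom > 0 else 0.0)
    return z

def two_sided_p(z):
    """Two-sided p-value of z under a standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def threshold_stats(z, thresholds):
    """For each threshold, sum z^2 over SNPs whose p-value is at or below it."""
    p = [two_sided_p(v) for v in z]
    return [sum(v * v for v, pv in zip(z, p) if pv <= t) for t in thresholds]

def optimal_threshold_test(diffs, thresholds=(0.01, 0.05, 0.1, 0.5, 1.0),
                           n_perm=200, seed=0):
    """Return (best threshold, permutation p-value adjusted over thresholds)."""
    rng = random.Random(seed)
    pool = [threshold_stats(pdt_z(diffs), thresholds)]  # index 0 = observed data
    for _ in range(n_perm):
        flipped = []
        for family in diffs:
            sign = rng.choice((-1, 1))  # flip a whole family to keep SNP correlation
            flipped.append([sign * d for d in family])
        pool.append(threshold_stats(pdt_z(flipped), thresholds))

    def min_p(stats):
        # Empirical p-value at each threshold, then the smallest one.
        per_threshold = [
            sum(other[k] >= stats[k] for other in pool) / len(pool)
            for k in range(len(thresholds))
        ]
        best = min(range(len(thresholds)), key=per_threshold.__getitem__)
        return per_threshold[best], thresholds[best]

    observed_min_p, best_threshold = min_p(pool[0])
    # Compare the observed minimum p-value with the same quantity in each replicate.
    p_value = sum(min_p(s)[0] <= observed_min_p for s in pool) / len(pool)
    return best_threshold, p_value
```

Taking the minimum p-value over thresholds and then comparing it against the same minimum computed in every permuted replicate is what keeps the search over thresholds from inflating the type I error in this sketch.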
|6||Ningning He and Sukjoon Yoon.
Integrated Smart Screening Platform for Systems Medicine in Cancers
Abstract: A systematic understanding of genotype-dependent drug sensitivity in cancer cell lines will provide therapeutic benefits for cancer chemotherapy. Here we present a statistical framework to associate the response to anticancer agents with major genotypes of cancers. Multi-level omics data such as transcriptome, proteome and phosphatome data were integrated with drug and siRNA screening data based on the genotypic classification of cancer cell lines and human tissue samples. The present mutation-oriented integrative approach was able to reproduce the known patterns of mechanism-based drug response in cancers. Furthermore, it revealed novel patterns for new targets and drug repositioning from gene perturbation assays using siRNA libraries. Our platform provides valuable tools for accelerating hypothesis generation and target validation to optimize the therapeutic window for single or combined anticancer agents.
|7||Tse-Yi Wang, Yen-Ho Chen and Kuang-Chi Chen.
Identifying SNP-SNP interactions with CNAs in lymphoma susceptibility
Abstract: Genome-wide association studies (GWAS) identify many single nucleotide polymorphisms (SNPs) that are associated with diseases and are involved in their pathogenesis. The currently identified SNP variants explain only a portion of the heritability underlying complex diseases. Some recent studies have focused on investigating SNP-SNP interactions to explore the missing aspects of the heritability. Several have looked for genetic variations other than SNPs such as copy number alterations (CNAs). However, few have considered both SNP-SNP interactions and CNAs comprehensively. In our study, we employed a fusion approach that incorporated copy number information to identify SNP-SNP interactions in an analysis of a public dataset of 214 lymphoma cases. The copy numbers were first examined by clustering analysis, and then the SNP-SNP interactions were detected by the multifactor dimensionality reduction (MDR) method. The results showed that the identified SNP-SNP interactions with CNAs were more significantly associated with lymphoma susceptibility than the SNP-SNP interactions detected without considering CNAs. Therefore, we conclude that combining SNP-SNP interactions with CNAs provides a more comprehensive strategy for disease association studies.
|8||Heidi Tessmer and Kimihito Ito.
Tracking Influenza Epidemics using Bioinformatics Techniques
Abstract: Influenza is a viral disease with annual epidemics resulting in 3 to 5 million cases and between 250,000 and 500,000 deaths each year. In addition to the large loss of human life, influenza has a detrimental impact on societies and economies due to increased absenteeism and productivity loss associated with the disease and recovery.
Currently, the most effective way to control influenza is the administration of a once-yearly trivalent vaccine. This vaccine is distributed at the beginning of each influenza season, in the late fall for each hemisphere. However, the influenza virus has a unique characteristic among viruses in that it mutates rapidly. These rapid mutations often make one year’s vaccine ineffectual the following year.
To be best prepared for influenza epidemics, it is necessary to have an awareness of historic mutations in the viral genome, a model for predicting changes in the virus and the spread of future epidemics, and an understanding of the factors which affect the speed and path the disease takes as it progresses. The more knowledge we have about the mutational behavior of the influenza virus, the better we will understand the mechanisms which cause and promote the spread of this disease and the better chance we have to curtail its propagation and limit its overall impact on humans. With this knowledge, we can improve strain selection for vaccines by creating high-probability, high-confidence models which will improve our ability to control the disease.
This research focuses on the analysis of changes in nucleotide and amino acid sequences in human H1N1 and H3N2 influenza viruses over time. More than 20,000 hemagglutinin (HA) segments of human influenza A isolates were downloaded from the NCBI Influenza Virus Resource Database. These nucleotide sequences were then sorted by year and HA subtype, and converted to their corresponding amino acids. Once pre-processed, the sequences were aligned, equalized for length, and a dominant strain for each year was identified. Comparing dominant strains from previous years, changes in nucleotide composition and amino acid substitutions were observed.
Our analyses found patterns emerging in the nucleotide and amino acid composition and sequences of influenza A viruses. By mapping the temporal changes in the disease with respect to nucleotide and amino acid composition and sequence frequencies we hope to identify simple, novel techniques to predict future evolution of the virus. The goal of follow-on research will be to incorporate the information discovered here into variables and concepts within a model of influenza mutation and spread. We anticipate these techniques could be used in the selection of vaccine strains to improve the effectiveness of annual influenza vaccines and our ability to control this disease.
 WHO Influenza Fact Sheet. http://www.who.int/mediacentre/factsheets/fs211/en/
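The pre-processing described in this abstract (translating HA nucleotide sequences, grouping them by year, choosing a dominant strain per year, and comparing dominant strains across years) might be sketched as follows. The abstract does not define "dominant strain", so the sketch assumes it is the most frequent amino acid sequence in a given year. It also assumes the input sequences are already aligned, in frame and of equal length. The function names and the `(year, sequence)` record layout are illustrative only.

```python
from collections import Counter
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(_BASES, repeat=3), _AMINO)
}

def translate(nucleotides):
    """Translate an in-frame nucleotide sequence; unknown codons become 'X'."""
    seq = nucleotides.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3))

def dominant_by_year(records):
    """Map each year to its most frequent translated sequence.

    `records` is an iterable of (year, nucleotide_sequence) pairs.
    """
    by_year = {}
    for year, nucleotides in records:
        by_year.setdefault(year, Counter())[translate(nucleotides)] += 1
    return {year: counts.most_common(1)[0][0]
            for year, counts in sorted(by_year.items())}

def substitutions(older, newer):
    """List amino acid changes between two aligned sequences, e.g. 'K2R'."""
    return [f"{a}{pos}{b}"
            for pos, (a, b) in enumerate(zip(older, newer), start=1) if a != b]
```

For example, `substitutions(dominant[2008], dominant[2009])` would list the positions at which the dominant sequence changed between those two years.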
|9||Nipawit Karnbunchob, Yonezawa Kouki, Keisuke Ueno, Manabu Igarashi and Kimihito Ito.
Use of reciprocal best hits to explore the interspecies transmission of influenza A viruses
Abstract: Background: Influenza A virus is a zoonotic pathogen that infects avian and mammalian hosts. Genetic reassortment between human and avian viruses in pigs is thought to be responsible for the emergence of pandemic influenza viruses. Monitoring the interspecies transmission of avian influenza viruses to pig populations will play an important role in the prevention of pandemic influenza. Phylogenetic analysis of viral sequences is a conventional approach to studying the transmission of viruses. Nucleotide sequences of avian and swine influenza viruses are rapidly accumulating in the public databases because of the continuing effort in the surveillance of animal influenza. The growth in viral genetic information increases the size of phylogenetic trees, creating technical difficulties in detecting transmission.
Materials and Methods: A total of 33,587 nucleotide sequences of hemagglutinin and neuraminidase from avian and swine influenza viruses were downloaded from the Influenza Virus Resource at the National Center for Biotechnology Information. To clarify the interspecies transmission of influenza viruses between pigs and birds, we employed the reciprocal best BLAST hits algorithm. First we constructed two BLAST databases, one from nucleotide sequences of avian isolates and the other from those of swine isolates. By performing BLAST searches using avian sequence queries against the swine virus database and swine sequence queries against the avian virus database, we looked for pairs of avian and swine isolates that shared the same sequences. Pairs sharing genes of 100% identity were regarded as evidence of interspecies transmission between avian and swine viruses.
Results: Our method detected one hundred six possible interspecies transmissions between avian and swine viruses. Our results are consistent with previously published results, suggesting that our method can correctly detect interspecies transmission.
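The reciprocal best hits step described in the Materials and Methods can be sketched in a few lines. The sketch assumes the two BLAST searches have already been run and parsed into `(query, subject, percent_identity, bit_score)` tuples. That layout, the function names and the choice of bit score to rank hits are assumptions for illustration, not details taken from the abstract.

```python
def best_hits(hits):
    """Map each query to (top bit score, {subject: percent identity}).

    Subjects that tie for the top bit score are all kept.
    """
    best = {}
    for query, subject, pident, score in hits:
        top_score, subjects = best.get(query, (float("-inf"), {}))
        if score > top_score:
            best[query] = (score, {subject: pident})
        elif score == top_score:
            subjects[subject] = pident
    return best

def reciprocal_best_hits(avian_vs_swine, swine_vs_avian, min_identity=100.0):
    """Return sorted (avian, swine) pairs that are each other's best hit.

    Both directions must also reach `min_identity` percent identity.
    """
    forward = best_hits(avian_vs_swine)
    backward = best_hits(swine_vs_avian)
    pairs = []
    for avian, (_, swine_hits) in forward.items():
        for swine, pident in swine_hits.items():
            back_hits = backward.get(swine, (None, {}))[1]
            if pident >= min_identity and back_hits.get(avian, 0.0) >= min_identity:
                pairs.append((avian, swine))
    return sorted(pairs)
```

With the default `min_identity=100.0`, only pairs whose best hits are identical in both directions are reported, which mirrors the 100% identity criterion in the abstract.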
|10||Abdul Musaweer Habib, Md. Habibul Hasan Mazumder and Md. Saiful Islam.
Mining the Proteome of Fusobacterium nucleatum for Potential Therapeutics Discovery
Abstract: The plethora of bacterial genome sequence information available in recent times has ushered in many novel strategies for antibacterial drug discovery and helped medical science take up the challenge of the increasing resistance of pathogenic bacteria to current antibiotics. The subtractive genomics approach is one of the groundbreaking strategies for finding the subset of genes that are likely to be essential for the pathogen but are not present in the host. In this study, we employed this strategy to analyze the whole genome sequence of Fusobacterium nucleatum, a human oral pathogen associated with colorectal cancer. Our study revealed 1499 proteins of Fusobacterium nucleatum that have no homologs in the human genome. These proteins were screened further using the Database of Essential Genes (DEG), which resulted in the identification of 32 vitally important proteins for the bacterium. Subsequent analysis of the identified pivotal proteins using the KEGG Automated Annotation Server (KAAS) singled out 3 key enzymes of F. nucleatum that may be good candidates as potential drug targets, since they are unique to the bacterium and absent in humans. In addition, we have modeled the 3-D structures of these three proteins. Finally, the ligand binding sites of the key proteins were determined, and functional inhibitors that best fitted the ligand sites were screened to discover effective novel therapeutic compounds against Fusobacterium nucleatum.
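The core filtering logic of a subtractive genomics screen reduces to set operations, as in the minimal sketch below. It assumes that the similarity searches against the human proteome and against DEG have already been run, and that their results are available as sets of pathogen protein identifiers. The function name and the identifiers in the usage note are hypothetical.

```python
def subtractive_screen(pathogen_proteins, human_homolog_ids, essential_ids):
    """Return (non-homologous proteins, essential non-homologous candidates).

    pathogen_proteins: all protein identifiers of the pathogen
    human_homolog_ids: identifiers with a significant hit in the human proteome
    essential_ids:     identifiers with a significant hit among essential genes
    """
    # Step 1: subtract everything that has a homolog in the host.
    non_homologous = set(pathogen_proteins) - set(human_homolog_ids)
    # Step 2: keep only the remaining proteins that look essential.
    candidates = non_homologous & set(essential_ids)
    return sorted(non_homologous), sorted(candidates)
```

Called with four made-up identifiers, one human homolog and three essential hits, `subtractive_screen(["P1", "P2", "P3", "P4"], {"P2"}, {"P1", "P2", "P4"})` keeps `P1`, `P3` and `P4` as non-homologous and `P1` and `P4` as candidates.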
|11||Shah Md. Shahik, Md. Saiful Islam, Naman Patwary, Md. Sohel and Mohd. Sikder.
In silico structure analysis and epitope prediction of E3 CR1-beta protein of Human Adenovirus E for vaccine design
Description: We used several bioinformatics and immunoinformatics tools, including sequence and structure analysis tools, to construct a 3D model and predict epitopes. The 3D structure of the E3 CR1-beta protein was generated, and a total of 10 antigenic B-cell epitopes, 6 MHC class I and 11 MHC class II binding peptides were predicted.
Conclusions: The study was carried out to predict antigenic determinants/epitopes of the E3 CR1-beta protein of Human adenovirus type 4, along with 3D protein modeling. The study revealed potential T-cell and B-cell epitopes that can raise the desired immune response against the E3 CR1-beta protein and may be useful in developing effective vaccines against HAdVs-E.
|12||Myungguen Chung, Seok Won Jeong, Soo-Jung Park, Seong Beom Cho and Kyung-Won Hong.
Genome-Wide Association Study and Network Analysis of Metabolic Syndrome in Korean
Abstract: Metabolic syndrome (METS) is a disorder of energy utilization and storage that increases the risk of developing cardiovascular disease and diabetes. To identify the genetic risk factors of METS, we carried out a genome-wide association study (GWAS) of 2,657 cases and 5,917 controls in Korean populations. As a result, we identified 2 SNPs with genome-wide significant p-values (< 5 × 10⁻⁸), 8 SNPs with genome-wide suggestive p-values (5 × 10⁻⁸ ≤ p-values < 1 × 10⁻⁵), and 2 SNPs of more functional variants with borderline p-values (5 × 10⁻⁵ ≤ p-values < 1 × 10⁻⁴). On the other hand, the multiple-testing correction criteria of conventional GWASs exclude false-positive loci but simultaneously discard many true-positive loci. To reconsider the discarded true-positive loci, we attempted to include the functional variants [nonsynonymous SNPs (nsSNPs) and expression quantitative trait loci (eQTLs)] among the top 5000 SNPs based on the proportion of phenotypic variance explained by genotypic variance. A total of 159 eQTLs and 18 nsSNPs were present in the top 5000 SNPs. Although they should be replicated in other independent populations, 6 eQTLs and 2 nsSNP loci were located in the molecular pathways of LPL, APOA5 and CHRM2, which were the significant or suggestive loci in the METS GWAS. In conclusion, our approach, which combines conventional GWAS, reconsidered functional variants and pathway-based interpretation, suggests a useful method for understanding the GWAS results of complex traits and can be extended to other genome-wide association studies.
|13||Neeraja Krishnan, Saurabh Gupta and Binay Panda.
Mutational context derived from high-throughput cancer sequencing data
Abstract: Rapid advancements in the field of genome sequencing are aiding our understanding of many human diseases, especially cancer. In the last five years, high-throughput sequencing data have been made available for many human cancer types, making it possible to identify driver mutations in those cancers. Computational biologists and bioinformatics specialists use various tools to discover, analyze and interpret somatic changes in human cancers. This is followed by experiments to attribute functions to specific mutations/genes. So far, studies on cancer genomes have focused on somatic mutations and how they might play a role in the process of carcinogenesis. Using a machine learning approach followed by validation, which we refer to as the Mutation Microenvironment Test (MMT), we show that the context of somatic mutations plays an important role in human cancer. We have tested MMT using both in-house data and data from multiple cancer types obtained from TCGA and ICGC, and show that the MMT is very sensitive in identifying cancer-specific signatures. We plan to present details of the MMT methodology along with the results in the poster.
Characterization of the Sputum Microbiome in Chronic Obstructive Pulmonary Disease by Ion Torrent 16S rRNA-based Sequencing
Abstract: Chronic obstructive pulmonary disease (COPD) is a progressive lung disease caused primarily by cigarette smoking and other airway irritants. The microbial community composition also contributes to disease progression in COPD patients. However, the bacterial profile in the airways of patients with COPD is not well established. To determine the sputum microbial pattern in COPD and its relationship with disease progression, we establish a 16S rRNA gene-based sequencing platform using the Ion Torrent Personal Genome Machine (PGM) system. DNA is isolated from four sputum samples, and multiple amplicons targeting bacterial 16S rRNA genes are sequenced. An average of 800,000 reads (ranging from 740,000 to 1,035,000) is obtained per sample. Downstream sequence processing and quality filtering using the Ion Reporter™ software remove about 30% of the raw sequencing data. Subsequently, the valid reads are analyzed by a two-step mapping process against the MicroSEQ and GreenGenes databases. On average, 85% of reads are mapped per sample. Finally, taxonomic classification is performed and operational taxonomic units (OTUs) in the samples at 97% identity are determined. Our data show that, at the genus level, Pseudomonas, Corynebacterium, and Bacteroides are the most abundant bacteria in the COPD patients. We will enroll more patients and study the correlation between sputum microbiome and disease progression.
|15||Thitima Benjachat, Pumipat Tongyoo, Nattiya Hirankarn, Asada Leelahavanichkul and Yingyos Avihingsanon.
Biomarker Discovery in Lupus Nephritis by Transcriptomics Approach
Abstract: Lupus nephritis (LN) is the severe form of systemic lupus erythematosus (SLE). Kidney biopsy is necessary for diagnosis of relapse. This study attempts to find biomarkers for predicting therapeutic resistance in kidney biopsies of LN patients using gene expression microarrays and bioinformatics analysis. Forty renal biopsy samples were collected from LN patients who were in a flare stage. The samples were collected right before treatment. All patients underwent a standard of care for 6 months or longer. Treatment responses were classified as responder (R) or nonresponder (NR). RNA was extracted from all biopsies and used for either a gene expression microarray study (training set samples) or real-time PCR validation (validation set samples). In the training set, the transcriptional profiles of 23 renal RNA samples (R = 14, NR = 9) were analyzed using Illumina Human HT-12 BeadChips gene expression microarrays, open source bioinformatics software (the R lumi package) and the Database for Annotation, Visualization and Integrated Discovery (DAVID). There were 442 up-regulated and 374 down-regulated probe sets in the NR patients compared with R patients. The expression levels of the selected candidate genes (p-value < 0.01 and fold change > 2) were analyzed in the validation set (N = 40; R = 24, NR = 16) using real-time PCR. Three candidate markers still differed significantly between NR and R LN patients: ILMN_13072, ILMN_9808 and ILMN_6731 (p-value = 0.0007, 0.023 and 0.027, respectively). Moreover, functional annotation clustering analysis with DAVID showed that these 3 significant genes were classified into clusters including the immune response and tight junction pathways. In conclusion, intra-renal mRNA levels of the candidate biomarkers may be potential biomarkers of poor therapeutic response in LN. A larger clinical study is warranted.
02 – Database and Tools
|16||Hiroyuki Kurata, Yurie Sugimoto and Kazuhiro Maeda.
BioFNet Database for Rational Design of Biological Systems
Abstract: Systems biology and synthetic biology aim to reveal how the complex, modular, hierarchical structures of biological networks generate a variety of functions. A functional network is defined as a subnetwork of biomolecules that performs a particular function such as adaptation, bistability, or oscillation. In this report, we present BioFNet, a biological functional network database, which stores a broad range of functional networks within the whole cell at the level of molecular interactions. BioFNet allows users to simulate the mathematical models of the functional networks and to visualize the simulated results. BioFNet contributes to the rational design of biochemical networks and to understanding how functional networks are assembled to create complex, high-level functions, which would reveal design principles underlying biochemical networks in terms of engineering science.
|17||Takashi Abe, Hachiro Inokuchi, Yuko Yamada, Akira Muto and Toshimichi Ikemura.
tRNADB-CE: tRNA gene database curated manually by experts
Abstract: We constructed the tRNADB-CE by analyzing 1966 complete and 5272 draft genomes of prokaryotes and eukaryotes, 151 complete virus genomes, 121 complete chloroplast genomes and approximately 230 million sequences obtained by metagenome analyses of 210 environmental samples. This exhaustive search for tRNA genes was performed by running three computer programs used for tRNA gene search (tRNAscan-SE, ARAGORN, and tRNAfinder) to enhance completeness and accuracy of the prediction. The discordant cases were manually checked by experts in the tRNA experimental field. In addition, tRNA genes of Archaea obtained from SPLITSdb were included.
In total, 595,115 tRNA genes are compiled, twice as many as were compiled previously; for each gene, sequence information, the clover-leaf structure and the results of sequence similarity and oligonucleotide-pattern searches can be browsed. To pool collective knowledge with help from experts in the tRNA research field, we included a column in which comments can be added for each tRNA gene.
By compiling tRNAs of known prokaryotes with identical sequences, we found high phylogenetic preservation of tRNA sequences, especially at the phylum level. Furthermore, a large number of tRNAs obtained by metagenome analyses of environmental samples had sequences identical to those found in known prokaryotes. The tRNADB-CE provides functions with which users can obtain phylotype-specific markers (e.g. genus-specific markers) by themselves and clarify the microbial community structures of ecosystems in detail.
tRNADB-CE can be accessed freely at http://trna.ie.niigata-u.ac.jp.
|18||Seok-Won Kim, Naveen Kumar and Todd Taylor.
Integrated database and sample tracking system for various types of experimental data
Abstract: There are many bioinformatic platforms, e.g. Galaxy, and Laboratory Information Management Systems (LIMS) which can be used to both automate and track laboratory experiments and data processing. However, freely available systems lack several important features, such as the ability to manage massive datasets or various types of data. Because many experimentalists do not have sufficient bioinformatics skills and the available tools are too complicated and do not meet their exact needs, they often end up managing their data using some sort of spreadsheet application. For small datasets this may make sense, but for larger and more varied datasets this will soon lead to mayhem. To overcome these limitations, we have developed an integrated database and sample tracking system for the storage and distribution of massive and various types of experimental data. This system supports several features including data imputation and an integrated view of different types of data. Furthermore, we are also establishing a common protocol for transferring massive data among different data management systems. This protocol will support secure and stable communication in the current “big data” era. Detailed information about our system will be presented and a live demonstration will be given at the meeting.
|19||Eli Kaminuma, Yukino Baba, Takatomo Fujisawa, Asao Fujiyama, Hisashi Kashima and Yasukazu Nakamura.
Performance Evaluation between Crowdworkers and Biocurators towards Constructing a CrowdR&D Platform
Abstract: High-performance next-generation sequencing (NGS) technologies are advancing genomics and molecular biological research. In 2010, we released an automatic high-throughput annotation pipeline for NGS data, the “DDBJ Read Annotation Pipeline”, which runs its analyses on the supercomputer of Japan’s National Institute of Genetics. After automatic annotation, human curation is performed to correct errors. However, the massive amount of NGS sequence data has made manual curation a bottleneck. To resolve this problem, we are investigating a crowdsourcing approach to curation tasks. First, we compared the performance of non-professional crowdworkers on a commercial crowdsourcing platform with that of our expert biocurators. Two tasks were attempted: image-based gene structural annotation, and text annotation of gene names, which requires technical knowledge. In the image annotation task, all answers from the crowd were incorrect, whereas all answers from the experts were correct. This indicates that task instructions should be clarified with informative sentences reflecting professional knowledge, although doing so may be costly. In the text annotation task, a comparison between three biocurators and 17 crowdworkers confirmed that several crowdworkers performed at a level equivalent to the curators. Next, we propose a crowdsourcing research platform, currently under development, named CrowdR&D. Researchers can use the CrowdR&D site as a portal to generate crowdsourcing tasks. It provides quantitative evaluation of individual tasks and manages separate tasks as a workflow. It also includes user authentication and data sharing functions. Finally, we provide information on the ethical review for protecting crowdworkers that is required when submitting papers to life science research journals.
|20||Meng-Pin Weng and Ben-Yang Liao.
modPhEA: model organism phenotype enrichment analyzer
Abstract: Meng-Pin Weng and Ben-Yang Liao*
Division of Biostatistics & Bioinformatics, Institute of Population Health Sciences,
National Health Research Institutes, Zhunan, Miaoli County 350, Taiwan, R.O.C.
Advances in high-throughput sequencing technologies have enabled researchers to perform “omics” analyses on diverse species, even those without a sequenced genome. To facilitate knowledge discovery from the results of large-scale experiments, we developed modPhEA (model organism phenotype enrichment analyzer), which aims to identify enriched or depleted mutant phenotypes of model organisms for given gene sets. The model organisms supported by modPhEA so far include mouse (Mus musculus), zebrafish (Danio rerio), fruit fly (Drosophila melanogaster), nematode (Caenorhabditis elegans), and budding yeast (Saccharomyces cerevisiae). Sequencing-based experiments on poorly understood organisms often produce assembled nucleotide sequences, known as contigs/scaffolds, as their results. Therefore, in addition to gene sets given as lists of predefined gene IDs/symbols (often from species with an annotated genome), the input “gene sets” can be lists of nucleotide or amino acid sequences. When gene lists from non-model organisms are given, modPhEA searches for orthologs, conducts the enrichment analysis, and reports to the user the result based on the most closely related model organism. modPhEA is flexible in specifying the types of mutations from which the phenotypes were observed and the format of the output. modPhEA is available at: http://evol.nhri.org.tw/modPhEA.
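The abstract does not state which statistical test modPhEA uses. As an illustration of what a phenotype enrichment/depletion test can look like, the following minimal sketch assumes a hypergeometric model, which is a common choice for gene-set enrichment; the function and parameter names are hypothetical and are not taken from modPhEA.

```python
from math import comb


def hypergeom_pmf(hits, background, annotated, set_size):
    """P(X = hits) when set_size genes are drawn from `background` genes,
    `annotated` of which carry the mutant phenotype of interest."""
    return (comb(annotated, hits)
            * comb(background - annotated, set_size - hits)
            / comb(background, set_size))


def enrichment_p(hits, background, annotated, set_size):
    """Upper tail P(X >= hits): small values suggest enrichment."""
    upper = min(annotated, set_size)
    return sum(hypergeom_pmf(i, background, annotated, set_size)
               for i in range(hits, upper + 1))


def depletion_p(hits, background, annotated, set_size):
    """Lower tail P(X <= hits): small values suggest depletion."""
    lower = max(0, set_size - (background - annotated))
    upper = min(hits, annotated, set_size)
    return sum(hypergeom_pmf(i, background, annotated, set_size)
               for i in range(lower, upper + 1))
```

In practice a tool of this kind would also correct for testing many phenotypes at once; that step is omitted here.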
PDBnet: Integration of Structural and Biochemical Protein-Protein Interaction Network
Abstract: Protein-protein interactions (PPIs) have mostly been studied by structural and biochemical methods. Each type of interaction data has advantages and disadvantages, and the two can complement each other. There is a huge amount of biochemical protein-protein interaction data, obtained by several methods such as two-hybrid assays, affinity chromatography, pull-down assays and other biochemical methods. However, the quality of these biochemical experimental data is not necessarily uniform, which makes PPI networks more difficult to analyze. It is also difficult to infer protein function from these data alone. Structural data, on the other hand, are important for understanding protein function. They include domain information, which represents the functional units of a protein. From the viewpoint of the proteome, however, it is difficult to identify which proteins, and which parts of each protein, have had their structures determined. Here, we integrated the structural and biochemical information on PPIs. The structural information was based on PDB. The biochemical interaction data were collected from several PPI databases: BioGRID, DIP, BOND, HPRD, IntAct and MINT. We then integrated the data into PDBnet by using UniProtKB IDs and PDB chain IDs. Owing to this integration, PDBnet now contains a more comprehensive interaction network. Many biochemical interactions are not present among the structural interactions. This would help researchers working on structures to identify potential interactions that are still missing from structural analysis. Conversely, many interactions are present only in the structural information. This is because PDB contains many proteins from various species whose structures are relatively easy to analyze. The integration of structural information into the biochemical network also helps researchers to understand the molecular mechanisms of protein interactions. PDBnet is available at http://dna00.bio.kyutech.ac.jp/pdbnet/
|22||Shu-Hwa Chen, Chi-Wei Huang and Chung-Yen Lin.
Electronic Laboratory Notebook (Elegance)
Abstract: For a long time, scientists have used pens and glue to record their findings, musings, ideas and inferences in paper-bound laboratory notebooks. This old-fashioned practice persists even in the 21st century. However, handwritten, paper-based recording cannot keep up with data of increasing volume and complexity, and it makes data sharing difficult in collaborative projects that span disciplines and research communities. As more and more output is generated in the digital deluge, a knowledge repository platform with functions such as search, backup and reconstruction is becoming an important issue for daily record keeping in today's laboratories.
In our conception, an electronic laboratory notebook (ELN) should not only help scientists keep everything on record, but also raise the possibility of new discoveries and problem solving, which may significantly increase the competitiveness of the whole research team. Although some ELNs are available on the market and in the public domain, their interfaces and prices are not user-friendly, and they have a steep learning curve. The essential functions of our ELN will include simple installation with a few clicks, note creation with attached digital experimental outputs, full-text search with an image gallery, succinct user management with digital signatures, automatic system backup, a calendar with notification of coming events, a personalized interface with privacy, data sharing and exchange via the web, duplication and backup of the whole ELN, easy extension of functions, and the features common to Web 2.0. We have developed a draft of a purely web-based ELN (Windows/Mac version) that can be deployed on most available PCs and portable devices, instead of a server/client ELN architecture that requires considerable manpower. Two kinds of robust ELN are to be released: a group ELN designed as a collaboration platform for research teams, and a portable ELN suitable for personal use as a mobile web blog. Currently, we have developed the ELN in English, Traditional Chinese and Japanese (Windows/Mac version). We have also received support from Microsoft Inc. to migrate the ELN to the Azure cloud for the research community. Through dissemination at international and domestic conferences, we plan to present our ELN to the research community on schedule, with the aim of revolutionizing the lab notebook.
In brief, we believe the ELN developed by our team will really help the research community by supporting interventions, sharing information, re-organizing knowledge, and making actual laboratory work visible. In turn, feedback from users will prompt new developments on the IT issues raised by the emerging mass of experimental results.
Screen casts and prototype of ELN: http://eln.iis.sinica.edu.tw
|23||Shu-Hwa Chen, Yueh-Hsia Tang, Ming-Hsin Tsai, Chao A. Hsiung and Chung-Yen Lin.
AfterGenBank: Assembly of the Sequence Features and Annotations Derived from Genbank
Abstract: With recent advances in sequencing technologies, the sequences accumulated in GenBank are increasing faster than ever. This big data is an important treasure for the biomedical community. The feature table of each sequence entry describes the roles and locations of higher-order sequence domains and elements within the genome of an organism, such as mRNA, gene, transcript, exon, intron, ncRNA and 5’UTR. However, there is no systematic and flexible way to compile this information from the huge flat files by specified feature terms and taxonomic categories in GenBank.
By integrating big data, a system framework, self-developed programs/scripts and an intuitive graphical interface, we constructed a web database with an automatic update mechanism, named AfterGenBank, to assemble annotated sequences with specific features from GenBank. Two major parallel-computing programs, Features Extractor and Sequences Conjunctor, were developed by our team to compose the meaningful fragments from some hundred million sequence records.
AfterGenBank is a value-added web database for accessing, searching and managing sequences with specific features from all organisms in GenBank CoreNucleotide (the main collection). Presently, sixty-one feature terms from twelve sequence divisions (Bacterial, Plant, Primate, etc.) are available. Users can employ full-text searches or sequence-similarity searches to identify sequences with specific features and retrieve the results in FASTA or CSV format via the intuitive web interface. Depending on researchers’ interests, the datasets fetched from AfterGenBank can be combined with existing tools for various purposes, such as summarizing consensus sequence patterns, performing phylogenetic analysis, designing sequence-specific primers/probes, or re-formatting into a novel database of specific features. AfterGenBank and its related analysis services are freely accessible at http://aftergenbank.nhri.org.tw.
Flash Demonstration: http://aftergenbank.nhri.org.tw/AfterGenbank/demo.html
|24||Yushi Takahashi and Kiyoko Aoki-Kinoshita.
Refactoring of NeuronDB, a database of neuronal experimental data
Abstract: 1 Introduction
In order to elucidate various brain functions, researchers around the world perform electrophysiological experiments of neural cells. In these experiments, researchers focus on the action potential of neuronal cells. As is well known, the brain consists of a mass of neuronal cells. Therefore, to understand the brain mechanisms, obviously it is important to understand the functions of neuronal cells, and action potential is a key feature of neuronal cells.
NeuronDB is a database that stores electrophysiological experiment data from neural cells in the mouse brain cortex; it has been developed in our laboratory since 2012 in collaboration with Prof. Hideki Derek Kawai, a neurophysiologist at Soka University. NeuronDB uses a MySQL database to store data. NeuronDB is not only a database system: it also provides data analysis programs that allow users to analyze experimental data quickly. NeuronDB aims to be a foundation for analyzing large amounts of electrophysiological experiment data and to reduce the cost of data management for researchers. However, the system is still a prototype and currently contains only the minimal functionality needed to store test data provided by Prof. Kawai.
In this study, we made a review of the system architecture of NeuronDB and implemented additional features in this system. Then, we refined the user interface layout of this system to make it more user-friendly.
2 Methods and Results
The user interface of NeuronDB was built using HTML5 with Bootstrap3, and all data is stored in a MySQL database. Then, the main server side program of NeuronDB was developed with PHP and CakePHP2 framework. These PHP scripts serve as an interface to communicate between the user interface and the MySQL database.
Figure 1 depicts the main interface of NeuronDB. At present, NeuronDB has three main functions: data input function, experiment query analysis function and group query analysis function. We describe each of these functions in detail.
Using the data input function, users can submit their data in Microsoft Excel worksheet format, in CSV file format, or from the clipboard. We also added a simple data validation function to this system. With the Microsoft Excel format, users can upload one workbook at a time, and each sheet is uploaded for confirmation separately.
The experiment query analysis function provides users with a data analysis tool for each individual experiment, including plotting the data in a graph. In contrast, the group query analysis function enables users to compare multiple datasets simultaneously. Additionally, we added a user information management function to this system. Thus, users can modify their own user information such as their password from the web interface.
|25||Ai Muto, Katsuhisa Ozaki and Masaaki Kotera.
Development of insect orthologue search database
Abstract: There have not been many cooperative studies between genome biology and insect biology. To gain substantial social benefit from long-term insect research, providing the relevant genomic information is essential. We have been conducting new sequencing with next generation sequencers and incorporating the vast number of insect genes available from public complete genomes, draft genomes, cDNA libraries, RNA-seq, and cloned genes, aiming at the development of a cross-search database of insect genomics.
Complete genomes have been determined for only a small number of specific groups across insect taxonomy, and the sequence information for other insect species is scattered across the Internet. As a result, it is not yet possible to search for insect orthologues. We believe that enabling cross-search of the vast number of insect genes, even where complete genome sequences are not available, is beneficial for better understanding the biodiversity of insects. We are thus developing a database for searching insect orthologues together with additional information such as the taxonomic classification of insects, their feeding habits, and their symbionts. We hope that this database will help bring genomic-level understanding to entomology.
|26||Naoki Yamamoto, Tomoyuki Takano, Shin Terashima, Masaaki Kobayashi, Hajime Ohyanagi, Youhei Sasaki, Maasa Kanno, Kyoko Morimoto, Hiromi Kanegae, Misa Saito, Satomi Asano, Koji Yokoyama, Koichiro Aya, Keita Suwabe, Go Suzuki, Toshio Sugimoto, Takehiro Masumura, Masao Watanabe, Makoto Matsuoka and Kentaro Yano.
Plant Omics Data Center (PODC): a knowledge-based transcriptomic database for exploring functional gene modules in plants
Abstract: Establishing a useful transcriptomic database is one of the important themes in plant bioinformatics for enhancing bioproduction. Next-generation sequencing technology has made it possible to monitor genome-wide expression in various types of tissues in several plant species. Abundant plant RNA-seq data are currently available at the NCBI Sequence Read Archive. By combining data from various tissues and developmental stages of a plant species, spatiotemporal gene expression patterns can be produced, which give hints for understanding the molecular and physiological function of each gene. The gene expression patterns can then be compared among genes to construct a gene expression network (GEN). Given reliable functional annotations of genes, functional gene modules could emerge from the GEN. Functional gene modules provide clues not only for elucidating genes that are involved in the same biological process and that form a protein complex, but also for disclosing gene regulatory mechanisms. In addition, the conservation and divergence of functional gene modules among multiple organisms could clarify key genetic information determining the biological characteristics of plants.
Here, we introduce the Plant Omics Data Center (PODC, http://bioinf.mind.meiji.ac.jp/podc/), which contains plant GENs integrated with knowledge-based functional annotation of genes. The GENs were generated from the genomic sequences and RNA-seq data of 8 representative plant species (Arabidopsis, rice, tomato, Sorghum, grape, potato, barrel medic and soybean) by using correspondence analysis. In the database, a GEN can be accessed by BLAST or keyword search and viewed in an interactive graphical interface. Detailed gene annotations provided by Natural Language Processing and manual curation are also presented on the GEN web page.
We applied PODC to the analysis of the physiological role of phosphoenolpyruvate carboxylase (PEPC) in immature soybean seeds. Divergent expression patterns of ten PEPC isogenes during seed development suggested functional partitioning among the isogenes. In addition, the gene expression pattern of one isogene implied its involvement in the accumulation of seed storage proteins. To validate the physiological function of this PEPC isogene, we are comparing gene expression in two Japanese cultivars whose seed protein contents differ.
The ongoing updates of the GEN and the functional gene annotation in PODC should accelerate studies in plant biology. We welcome any requests for the addition of target plant species and for functional annotation of genes of interest in particular research fields.
Acknowledgements: Grants-in-Aid for Scientific Research on Innovative Areas (No. 24113518). Research Funding for Computational Software Supporting Program from Meiji University. Takano Life Science Research Foundation.
|27||Seong-Jin Park, Gunhwan Ko and Byungwook Lee.
HASV: Large scale NGS data analysis Hadoop platform -based system for the genomic structure variations
Abstract: Recently, the cost of whole-genome sequencing has decreased dramatically due to the development of next generation sequencing (NGS) technology, and a huge amount of sequencing data has been generated and released by research laboratories worldwide. This paper proposes an improved algorithm for detecting structural variation in next generation sequencing data. HSV is a Hadoop-based implementation of a structural variation detection algorithm. HSV uses the open-source Hadoop implementation of MapReduce and an extended partitioning method to maintain load balancing across the nodes in the cloud computing environment. To verify the superiority of our approach, we performed extensive experiments using publicly available data from the Yoruba in Ibadan, Nigeria. The experimental results show that our HSV method efficiently finds structural variation in enormous NGS data.
|28||Gunhwan Ko, Pan-Gyu Kim and Byungwook Lee.
Hybrid analytic platform of cloud service for bio-big data
Abstract: New genome sequencing techniques such as next generation sequencing (NGS) have caused exponential growth of biological data. To extract meaningful information from these massive data, computational analysis is essential and is becoming more and more important. In general, the analysis of bioinformatics data requires a complicated procedure that combines different types of tools for searching databases, extracting and analyzing data, annotating results, and so on. Researchers therefore have difficulty composing and performing complicated bioinformatics analyses. We developed CLOSHA2, an integrated and automatic workflow system, which helps researchers find meaningful information in bioinformatics data by simplifying a complex analytical process into a batch workflow that combines various tools.
CLOSHA2 is an automatic workflow modeling system in which researchers can represent a bio-data analysis process as a workflow composed of a sequence of analysis tools, connecting the output of each tool to the input of the next tool with the same format. Users can easily analyze complicated and massive bio-data through CLOSHA2. It offers real-time monitoring and features for cooperation with other researchers. CLOSHA2 provides a stand-alone version for private research groups and a web service version, based on an elastic high-performance computing cluster, for cloud-service users.
CLOSHA2 is offered as a client version for individual users and research groups, as a web service based on an elastic high-performance computing cluster, and as open-source software. The CLOSHA2 web service, which supports the analysis of massive data, is available at http://closha.kobic.re.kr. The open-source program and manual can be downloaded from https://code.google.com/p/kobic-closha
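The workflow model described in the CLOSHA2 abstract, in which the output of each tool feeds the input of the next tool in the same format, can be illustrated with a small sketch. Everything here is hypothetical: the `Tool` class, the `run_workflow` function and the format labels are invented for illustration and are not the CLOSHA2 API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Tool:
    """A hypothetical analysis step with declared input and output formats."""
    name: str
    input_format: str
    output_format: str
    run: Callable[[str], str]


def validate(workflow: List[Tool]) -> None:
    """Check that each tool's output format matches the next tool's input."""
    for previous, following in zip(workflow, workflow[1:]):
        if previous.output_format != following.input_format:
            raise ValueError(
                f"{previous.name} produces {previous.output_format}, "
                f"but {following.name} expects {following.input_format}"
            )


def run_workflow(workflow: List[Tool], data: str) -> str:
    """Validate the chain, then pass the data through every tool in order."""
    validate(workflow)
    for tool in workflow:
        data = tool.run(data)
    return data
```

Validating the whole chain before running anything means a format mismatch is reported up front rather than after earlier, possibly expensive, steps have already finished.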
|91||Kunie Sakurai, Junko Yamane, Kenta Kobayashi, Koji Yamanegi, Takeaki Taniguchi, Yuki Kato and Wataru Fujibuchi.
Stem Cell Informatics Database: a framework for a new repository on single cell assay data and diverse knowledge of human cells.
Abstract: Researchers have actively investigated potential applications of the induced pluripotent stem cell (iPS cell) for disease modeling, drug screening, and regenerative medicine since its discovery in 2006, and one of the projects is now about to start clinical trials. In parallel with medical practice, a need has arisen for more precise knowledge of the cell and its disposition, yet we lack an information system that enables us to store such complex biomedical knowledge about cells in a well-organized way.
Here we introduce our cell knowledge repository, the “Stem Cell Informatics Database”, which is an extension of the previous SHOGoiN database. It has been designed to integrate information comprehensively for defining cells, drawing on diverse knowledge and scientific data from biomedical research. The Stem Cell Informatics Database has several indispensable contents, such as gene expression profiles and images of cells, curated assay metadata, and a cell taxonomy associated with anatomical location information. The Stem Cell Informatics Database is now under development, and we are currently working on i) creating our own ontology to formally describe and model knowledge about cells, and ii) developing analysis tools for gene expression data produced in single cell experiments. By depositing all of these in one database, we will provide the framework of an integrative system that serves as a cell knowledge dictionary.
|29||Evaline Ju, Wen-Wei Liao, Larry Lam and Pao-Yang Chen.
MethGo: a tool for analysis of bisulfite sequencing data
Abstract: DNA methylation is an important epigenetic modification involved in many biological processes. Bisulfite treatment coupled with high-throughput sequencing (BS-seq) is used to examine DNA methylation and epigenomic variants. Whole-genome bisulfite sequencing (WGBS) is commonly used to profile genome-wide DNA methylation, demanding convenient ways to evaluate bisulfite sequencing data. In order to analyze the data in depth, MethGo was developed in Python with five modules, including single nucleotide polymorphism (SNP) calling, epimutation identification, copy number variation (CNV) calling, gene body and promoter methylation level identification, and comparisons of methylation at transcription factor binding sites. SNP calling involves identifying homozygous and heterozygous SNPs within the genome. Examining the reads in which there are heterozygous SNPs aids in epimutation identification. The epimutation module reports differentially methylated cytosines between the two parents. CNV calling uses the assumption of a Poisson distribution in order to find areas in the genome likely to have large-scale genome rearrangement. The methylation levels module uses information from methylation calls and gene annotations in order to return methylation levels of genes and methylation levels of their respective promoters. Lastly, the txn module plots the comparisons of methylation among the transcription factor binding sites. Identifying SNPs and epimutations has important implications for studying methylation levels as well as genetic diseases, and investigating CNVs has influence on cancer research. MethGo comprehensively utilizes BS-seq data and provides analyses that are important in epigenomics. The software is available as a standalone package and can also be accessed online as a web tool.
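The abstract states only that CNV calling assumes a Poisson distribution. The following minimal sketch shows one way such an assumption can be used: read counts per genomic window are compared with a Poisson model whose rate is the median window count, and windows with very unlikely counts are flagged. This is not MethGo's implementation; the function names, the use of the median as the rate, and the `alpha` threshold are all assumptions made for illustration.

```python
from math import exp, lgamma, log
from statistics import median


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam), computed in log space for stability."""
    return exp(k * log(lam) - lam - lgamma(k + 1))


def lower_tail(k, lam):
    """P(X <= k)."""
    return sum(poisson_pmf(i, lam) for i in range(k + 1))


def upper_tail(k, lam):
    """P(X >= k), clamped at zero to absorb rounding error."""
    return max(0.0, 1.0 - lower_tail(k - 1, lam))


def flag_windows(read_counts, alpha=1e-3):
    """Return {window index: 'gain' or 'loss'} for windows whose read count
    is improbable under a Poisson model with the median count as its rate.
    Assumes the median count is greater than zero."""
    lam = median(read_counts)
    calls = {}
    for index, count in enumerate(read_counts):
        if count > lam and upper_tail(count, lam) < alpha:
            calls[index] = "gain"
        elif count < lam and lower_tail(count, lam) < alpha:
            calls[index] = "loss"
    return calls
```

For example, with window counts `[20, 19, 21, 20, 60, 18, 22, 2, 20, 21]` the median is 20, so the window with 60 reads is flagged as a gain and the window with 2 reads as a loss, while the remaining windows are left uncalled.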
|30||Wen-Wei Liao and Pao-Yang Chen.
HETGEN: a bioinformatic tool to assess genome-wide heterogeneity of DNA methylation
Abstract: DNA methylation plays critical roles in transcriptional regulation, development, and imprinting. High-throughput DNA sequencing coupled with bisulfite treatment can profile genome-wide DNA methylation at single-nucleotide resolution. The analysis of genome-wide methylation data starts with the alignment of bisulfite-converted reads. After alignment, methods are employed to identify differentially methylated regions (DMRs) among samples. Identification of DMRs considers only the change in methylation levels, and the heterogeneity of methylation patterns in a cell population is often ignored. In cancer research, such information may suggest that subsets of cells are progressing differently at certain regions. The more heterogeneous a region is, the more likely it is to be involved in such a process. A few metrics have been developed to measure methylation heterogeneity. For example, the Hamming distance has been used to evaluate the non-randomness of methylation patterns at each locus, although with low sensitivity. In another study, a methylation entropy modified from Shannon entropy was used to measure the variation of methylation patterns, but it needs at least 16x coverage to ensure that all possible patterns can be considered for a given set of four contiguous CpG dinucleotides. Here we present an approach based on string kernels to assess the heterogeneity of DNA methylation patterns from alignments. It was applied to both simulated and real data to identify genomic regions exhibiting heterogeneity. This approach provides insight into epigenetic regulation by identifying potential epigenetic regulatory regions.
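To make the two earlier metrics mentioned in this abstract concrete, the sketch below computes a plain Shannon entropy and a mean pairwise Hamming distance over the methylation patterns that reads show at a window of four CpG sites. It is only an illustration: the abstract refers to a methylation entropy "modified from" Shannon entropy whose exact form is not given, so the unmodified entropy is used here, and the string-kernel approach proposed by the authors is not reproduced. Patterns are assumed to be strings of `1` (methylated) and `0` (unmethylated), one per read.

```python
from collections import Counter
from itertools import combinations
from math import log2


def pattern_entropy(patterns):
    """Shannon entropy (in bits) of the methylation patterns seen in reads."""
    counts = Counter(patterns)
    total = len(patterns)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def mean_pairwise_hamming(patterns):
    """Average number of differing CpG sites over all pairs of reads.
    Requires at least two reads of equal length."""
    pairs = list(combinations(patterns, 2))
    distances = [sum(a != b for a, b in zip(p, q)) for p, q in pairs]
    return sum(distances) / len(pairs)
```

With four CpG sites there are 2^4 = 16 possible patterns, so the entropy can reach its maximum of 4 bits only when at least 16 reads cover the window, which is consistent with the 16x coverage requirement quoted above.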
|31||Akihito Kikuchi, Shigehiko Kanaya, Toshimichi Ikemura and Takashi Abe.
Development of Self-Compress BLSOM for comprehending big sequence data
Abstract: With the remarkable increase in genomic sequence data from various organisms, novel tools are needed for comprehensive analyses of the big sequence data available. We previously developed a Batch Learning Self-Organizing Map (BLSOM), which can cluster genomic fragment sequences according to phylotype based solely on oligonucleotide composition, and applied it to genome studies. BLSOM is suitable for high-performance parallel computing and can analyze big data, such as billions of genomic sequences, simultaneously. On the other hand, such a large-scale BLSOM needs large computing resources.
We have developed Self-Compress BLSOM (SC-BLSOM) to reduce computation time and enable comprehensive analyses of big sequence data. SC-BLSOM is constructed hierarchically from BLSOMs according to the data class. First, a BLSOM is constructed for each subset of the divided input data to represent the distribution of that data subclass and to compress the number of data points. Then, a second BLSOM is constructed from the set of first-stage BLSOMs to summarize the class distribution.
We compared SC-BLSOM with BLSOM by analyzing complete bacterial genome sequences. SC-BLSOM can be constructed faster than BLSOM and clusters sequences according to phylotype with high accuracy. This method is therefore more suitable for efficient knowledge discovery from big sequence data.
LAST: statistically-rigorous, large-scale sequence comparison
Abstract: This poster presents LAST, open-source software for general-purpose, large-scale sequence comparison and alignment (http://last.cbrc.jp/).
* It combines a traditional substitution score matrix (which models sequence divergence) with per-base uncertainty (e.g. fastq) in a rigorous way. This is useful for alignments with non-negligible divergence (e.g. cross-species alignment, ancient DNA) or unusual base frequencies (e.g. malaria, bisulfite converted DNA).
* It uses the statistical (pair HMM) basis of alignment to annotate the reliability of every column in an alignment.
* It is the only tool that can align DNA to proteins, *allowing frameshifts*, for genome-scale data. This is useful for: annotating pseudogenes, and analyzing metagenomic DNA (where frameshifts are surprisingly common).
* It can do “split alignment” of a query sequence to a genome, where it looks for a unique best match for each part of the query. It rigorously calculates the reliability (uniqueness) of each part of the alignment. This is useful for: cancer (DNA reads that cross rearrangement breakpoints), spliced RNA (where it models splice signals, intron sizes, and allows trans-splicing), and whole genome comparison (where different parts of one query chromosome match different parts of the target genome).
* Using newly-optimized transition seeds, LAST found ~20,000 new alignments between the human and mouse genomes, which are missing in the standard UCSC genome alignments.
|33||Mizuya Kusunoki and Satoshi Mizuta.
Further Development of Chargaff’s Second Parity Rule
Abstract: Chargaff’s first parity rule states that the occurrence frequencies of bases in a double-stranded DNA molecule are equal between adenine and thymine, and cytosine and guanine. This rule is approximately valid in each strand of a DNA molecule of genomes of a wide range of species, which is known as Chargaff’s second parity rule. The second parity rule is naturally extended to a word symmetry, in which the occurrence frequencies of a given word (or a k-mer) and its reverse complement are almost equal, because their occurrence probabilities are calculated to be equal when the second parity rule is realized and the four kinds of bases are independent and identically distributed. In this study, we further develop the word symmetry, investigating the correlation between the discrepancy between the observed occurrence frequencies of words and their expectation values, and that discrepancy for the reverse complements. We observed positive correlations between the discrepancies for some bacterial genomes, which would provide clues to genome evolution.
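The quantities discussed in this abstract can be sketched in a few lines. The code below counts k-mers on a single strand, computes the expected frequency of each word under the independent and identically distributed base model mentioned above (the product of its base frequencies), and pairs the discrepancy of each word with that of its reverse complement so that the two can be correlated. This is an illustrative sketch, not the authors' code; the function names are hypothetical, and the input is assumed to be an uppercase string over A, C, G and T.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Reverse complement of a DNA word."""
    return word.translate(COMPLEMENT)[::-1]


def kmer_frequencies(seq, k):
    """Relative frequency of every k-mer observed on the given strand."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def expected_frequency(word, base_freq):
    """Expected word frequency if bases are independent and identically
    distributed: the product of the single-base frequencies."""
    probability = 1.0
    for base in word:
        probability *= base_freq.get(base, 0.0)
    return probability


def discrepancies(seq, k):
    """Observed minus expected frequency for each observed word and for the
    reverse complement of each observed word."""
    base_freq = kmer_frequencies(seq, 1)
    observed = kmer_frequencies(seq, k)
    words = set(observed) | {reverse_complement(w) for w in observed}
    return {w: observed.get(w, 0.0) - expected_frequency(w, base_freq)
            for w in words}


def paired_discrepancies(seq, k):
    """(word, reverse complement) discrepancy pairs, one per pair, skipping
    words that are their own reverse complement."""
    d = discrepancies(seq, k)
    return [(d[w], d[reverse_complement(w)])
            for w in sorted(d) if w < reverse_complement(w)]
```

The list returned by `paired_discrepancies` can be passed to any correlation routine; a positive correlation would correspond to the observation reported in the abstract.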
|34||Yu-Chieh Liao and Hsin-Hung Lin.
Patch: upgrading microbial genome assemblies using third-generation sequences
Abstract: Background: Despite the ever-increasing output of next-generation sequencing data and progressively better assemblers, dozens to hundreds of gaps still exist in de novo microbial assemblies because of biased coverage and the unavoidable fundamental limitations of assembly from short reads. Third-generation single-molecule, real-time (SMRT) sequencing technology avoids amplification artifacts and generates kilobase-long reads with the potential to complete microbial genome assembly. However, because of the low accuracy of third-generation sequences, a considerable amount of long reads (100X) is required for self-correction prior to de novo assembly of a microbial genome. Although a hybrid approach has been proposed to combine second- and third-generation data, its assembly completeness is inferior to that of the non-hybrid method (using long reads only). Here, we provide a new method that exploits a small amount of long reads to improve off-the-shelf draft genomes assembled from short reads.
Results: We developed a tool (Patch) that takes corrected long reads and pre-assembled contigs, along with raw next-generation short reads, as inputs to improve microbial genome assemblies. We demonstrated that, with the addition of long reads (~20X), three draft bacterial genomes of Escherichia coli K12 MG1655, Meiothermus ruber DSM1279 and Pedobacter heparinus DSM2366 can be significantly upgraded by using Patch. In addition, we applied Patch to assemble the S. cerevisiae W303 genome and showed that Patch drastically reduced the number of contigs from 2782 to 81 using the additional long read sequence information.
Conclusions: Our evaluations show that Patch outperforms scaffolders that use single-molecule long reads (e.g., AHA, Cerulean and SSPACE-LongRead) and successfully improves the completeness of microbial genomes. Patch therefore provides a great opportunity to make full use of already sequenced next-generation data and to upgrade microbial genomes with additional third-generation sequences.
|35||Shuichiro Ishikawa, Yusei Kobori and Satoshi Mizuta.
A novel method of feature extraction of DNA sequences by graphical representation
Abstract: In recent years, alignment-free methods of sequence comparison have been actively studied. Graphical representation of biological sequences is one of those methods, where a biological sequence is converted to a numerical array and represented in a two- or three-dimensional space. This method is advantageous over other sequence comparison methods in that sequence similarities can be not only quantitatively analyzed but also visually estimated. In this study, we extract the global features of DNA sequences by fitting each graph to a straight line and a spherical surface; the straight line and the spherical surface provide the direction and the curvature of the graph, respectively, which together constitute a four-dimensional feature vector. We reconstructed the phylogeny of beta-globin genes of mammalian species based on the feature vectors, and obtained a phylogenetic tree that is basically consistent with that given by ClustalW.
|36||Mariko Morita, David R. Nelson and Osamu Gotoh.
Application of Famaln towards comprehensive identification of eukaryotic P450 genes
Researchers often want to collect a set of protein or coding nucleotide sequences belonging to a particular family of genes for structural, functional, and evolutionary studies. However, it has been a laborious procedure to obtain a desired gene set from widely distributed resources. Moreover, computationally identified genes are known to be highly error prone, hampering effective use of the massive information available from genomic sequences of a large number of divergent organisms.
Famaln is a tool to find the genes on multiple genomes that are homologous to the given set of seed sequences. Through iterative refinement mediated by gene-structure-aware multiple protein sequence alignments (GSA-MPSAs), Famaln refines the initial gene models to an extent at which nearly 99% of exon-intron boundaries of presumably genuine genes are correctly predicted. We applied Famaln to a total of 647 animal, plant, fungal, and protist complete or near-complete genomes with ca. 1,000 seed sequences selected from our own cytochrome P450 sequence database. The numbers and quality of the genes thus identified considerably exceed those of P450 genes retrievable from various public sequence databases.
Famaln proved highly effective at finding, near-comprehensively and in an automated manner, the P450 genes encoded in a large number of divergent eukaryotic genomes. As P450 is one of the largest gene super-families distributed in all kingdoms of life, the present results indicate that Famaln will be applicable to various gene (super-) families to discover new members and to remedy errors in existing annotations.
|37||Kazunori Yamada, Kazutaka Katoh and Kentaro Tomii.
The effect of a novel amino acid substitution matrix, MIQS on the MAFFT multiple sequence aligner
Abstract: Multiple sequence alignment plays an important role in comparative sequence analysis. As more and more biological sequences become available, it is more crucial than ever for multiple aligners to align sequences rapidly and accurately. MAFFT, one of the most popular multiple sequence aligners, is applicable to large-scale data. Meanwhile, we previously developed a novel amino acid substitution matrix, MIQS, which shows excellent homology detection performance and decent alignment quality at the pairwise alignment level compared with other existing substitution matrices. Therefore, in this study, we combined MAFFT and MIQS to examine whether MIQS could improve the performance of MAFFT, using the large-scale database HomFam. As a result, MIQS had a positive effect on MAFFT, especially when aligning extremely large datasets.
|38||Yu Bai, Yuki Iwasaki, Tetsuo Sato, Naoaki Ono, Ming Huang, Tadao Sugiura, Md. Altaf-Ul-Amin, Yue Zhao, Shigehiko Kanaya and Toshimichi Ikemura.
Systematization of genomes based on occurrence of penta-nucleotide sequences
Abstract: Genome sequences, both the protein-coding and non-coding parts, contain a wealth of information. The G + C content (G + C%) is a fundamental characteristic of individual genomes and has long been used as a basic phylogenetic parameter to characterize inter- and intragenomic differences. The G + C%, however, is too simple to differentiate wide varieties of genomes. Novel tools are needed for comprehensive analyses of the big genomic sequence data of a wide range of species. The Self-Organizing Map (SOM), an unsupervised neural network algorithm, is an effective tool for clustering and visualizing high-dimensional data such as oligonucleotide composition on a two-dimensional map. By modifying the conventional SOM, Batch-Learning SOM (BLSOM) was previously developed to make the output independent of the order of input data, which allows classification of sequence fragments according to species, solely depending on the oligonucleotide composition. In the present work, BLSOM was used for characterization of vertebrate genomes based on the occurrences of penta-nucleotides.
At first, we analyzed penta-nucleotide compositions in 100 kb sequences derived from a wide range of vertebrate genomes in order to investigate an efficient method for detecting differences between closely related genomes. Then the compositions of penta-nucleotides in 100 kb sequences of the human and mouse genomes were analyzed by BLSOM. The species-specific key combination of oligonucleotide frequencies in each genome, which is called a “genome signature”, was recognized by BLSOM, and the “genome signatures” are associated with transcription-factor-binding sequences. Specifically, the regions of the human genome enriched in transcription-factor-binding sequences were located in a small zone within the mouse territory and surrounded by white lattice points, which do not have any sequences.
Because its classification and visualization power is very high, BLSOM is an efficient and powerful tool for extracting a wide range of information from massive amounts of genomic sequences (i.e., big sequence data).
|39||Anish Man Singh Shrestha, Martin C. Frith, Kiyoshi Asai and Hugues Richard.
Identifying structural variations by jointly aligning a group of reads to a reference genome
Abstract: Accurately detecting structural variations in a genome from its raw, high-throughput sequencing data remains a challenge. Most of the currently available methods treat each read individually and infer variations by considering pairwise alignments between reads and a reference genome. In order to better leverage the information from a set of reads from a possible variant site, we present a strategy to simultaneously align such a group to a reference. We extend the traditional probabilistic model of pairwise alignment to incorporate the case of multiple sequences and also the possibility of genomic rearrangements (we consider only deletions for now). Based on our model, we can not only compute the most probable alignment, but also derive important information such as the probability of a deletion or the reconstruction of the genomic sequence of the sample. We present some preliminary results.
|40||Yasunobu Okamura, Takeshi Obayashi and Kengo Kinoshita.
RNA-seq profile classification by machine learning
Abstract: Thousands of RNA-seq datasets have been published to date. Although these data should be useful for meta-analysis or re-analysis, their annotations are insufficient. Since those annotations are written in natural language, it is hard to classify or extract them by machine. Comprehensive machine-friendly annotations are required to perform large-scale meta-analysis. In this study, we focused on adding annotations to RNA-seq data because some data do not have enough annotation in natural text.
To classify RNA-seq runs, we selected characteristic genes by testing for differences in gene expression. First, we classified 844 RNA-seq runs into 16 organs by hand. We tested for a difference in gene expression between the samples of an organ and all human samples with the Wilcoxon test (Holm correction). We found 32.2 characteristic genes for each organ on average. We also applied a Support Vector Machine to add machine-friendly annotation to poorly annotated gene expression profiles. We succeeded in classifying the 844 RNA-seq runs into 16 organs with a 76.4% success rate (5-fold cross validation).
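The Holm correction mentioned above can be illustrated with a short Python sketch. It assumes the per-gene p-values from the Wilcoxon test are already available; the function names, the gene labels and the 0.05 threshold are hypothetical and not taken from the abstract.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        value = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity of the adjusted values.
        running_max = max(running_max, value)
        adjusted[idx] = running_max
    return adjusted


def characteristic_genes(gene_p_values, alpha=0.05):
    """Select genes whose Holm-adjusted p-value is at most alpha.

    gene_p_values: dict mapping a gene name to its raw p-value (hypothetical input).
    """
    genes = list(gene_p_values)
    adjusted = holm_adjust([gene_p_values[g] for g in genes])
    return [g for g, p in zip(genes, adjusted) if p <= alpha]
```

For example, `characteristic_genes({"geneA": 0.001, "geneB": 0.04, "geneC": 0.5})` keeps only `geneA`, because the adjusted value of `geneB` is 0.08.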
|41||Takeaki Taniguchi, Yuki Kato, Susumu Goto and Wataru Fujibuchi.
Development of a pipeline for analysis of meta- and single cell genomic sequences
Abstract: Metagenomics has attracted much attention because it can help us understand marine environments, which consist mostly of uncultured microbes. However, it is known that complete assembly of reads for metagenomes is often hard to carry out. Recently, single cell genomic data have become increasingly available with the development of next-generation experimental techniques. Therefore, integrating single cell data into metagenomes from environmental samples can be a reasonable approach to analyzing such uncultured microbes. In this abstract, we present a pipeline for assembly and for structural and functional gene annotation when both metagenomic and single cell sequence data are given. More precisely, we consider a series of pipelines specialized for pre-processing reads, assembly and annotation, resulting in visualized maps of gene annotation along with links to KEGG maps.
|42||Thomas M. Poulsen, Martin C. Frith and Paul Horton.
Utilizing higher order distributions and information in low-complexity regions to detect homologous regions in biological sequences
Abstract: Nucleotide sequence alignment and detection of homologous regions are fundamental topics in biological research. One difficulty in accurately identifying homologs is that they may contain tandem repeats and low complexity regions that produce alignments that are not homologous. Standard methods attempt to address this issue by masking such regions, but masking does not distinctively model the difference between alignment of low complexity and ‘standard’ regions, and thus risks disregarding important information. In this study, a hidden Markov model (HMM) based architecture was developed to distinguish alignment of repetitive and non-repetitive regions and to utilize region-dependent alignment probabilities for the purpose of homolog detection. Our HMM is also able to employ higher order probability distributions for matching multiple nucleotides, and a comparison with other methods shows how our model exploits nucleotide distributions and region complexity for sequence alignment.
|43||Norio Shinkai, Makoto Ikeda, Masaki Takazawa, Thomas Poulsen, Martin Frith, Osamu Ohara and Paul Horton.
Computational Methods for indel and fusion gene detection from cancer sequencing data
Abstract: Genome mutations are central to the mechanism of cancer, and therefore the recent rapid increase in the availability of tumor sequencing data promises to reveal new insights into cancer and cancer drug resistance. Unfortunately, current data analysis methods are not able to consistently detect genome insertions and deletions, which may play key roles in cancer.
We are adapting the LAST sequence alignment program (Kielbasa et al. Genome Research 2011) to the problems of genome insertion/deletion (indel) and fusion gene detection. Both of these problems involve effectively using information from reads which map to the genome in a “split” manner, i.e. part of the read maps to one locus and the remaining part to a distinct locus in the reference genome.
Our study explores the effectiveness of a particular version of LAST (LAST SPLIT) recently developed by Martin Frith. Using a modified version of the KDRI Somatic Pipeline (Ikeda et al. unpublished) as a computing platform, we compare standard tools, such as the Genome Analysis Toolkit (GATK) (McKenna et al. Genome Research 2010), to the same pipeline using LAST on the problems of indel and fusion gene detection from cancer sequencing data.
|44||Ernesto Borrayo and Masaru Takeya.
Signal Processing Tools in Core Collection Selection
Abstract: This research was partially supported by JST/JICA, SATREPS: Diversity Assessment and development of Sustainable Use of Mexican Genetic Resources.
The Core Collection (CC) concept has become one of the fundamental approaches in Genetic Resources management to exploit the potential of a complete collection (MC) in terms of viability of both data management and monetary expenses.
Although several algorithms have been successfully implemented to construct CCs [1,4] based on genotypic, phenotypic, passport and geographical data, either by individual data-sets or by consensus, to our knowledge a single comprehensive data-set has not yet been explored. We hypothesize that a feasible solution for this multiple data-set evaluation is to manage all available data as a discrete signal, which allows the implementation of Signal Processing Tools in data analysis.
In this work, we present a proof of concept that any kind of data from MC accessions can be mapped to a discrete signal, in order to take advantage of Signal Processing Tools for CC construction.
2 Method and Results
Genotypic data and agromorphological traits (AT) from rice (Oryza sativa (L.)) and foxtail millet (Setaria italica subsp. italica (L.) P. Beauv.) accessions were retrieved from the National Institute of Agrobiological Sciences (NIAS) to test the algorithm’s CC versus MC scores [2,5,6]. By means of substitution tables, each accession’s data was mapped into a discrete signal, and the power spectra of the signals were obtained by Fast Fourier Transform. A Distance Matrix (DM) was constructed from the mean squared errors of pairwise power spectrum comparisons.
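The mapping from accession data to a distance matrix can be sketched as follows. This is only an illustration under stated assumptions: the substitution table is hypothetical (the tables actually used are not given here), all records are assumed to have the same length, and a direct discrete Fourier transform stands in for the Fast Fourier Transform, which yields the same spectrum more efficiently.

```python
import cmath

# Hypothetical substitution table mapping each symbol to a signal value.
SUBSTITUTION = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0}


def to_signal(record, table=SUBSTITUTION):
    """Map a sequence of symbols to a discrete signal."""
    return [table[symbol] for symbol in record]


def power_spectrum(signal):
    """Power spectrum |X_k|^2 / n computed with a direct O(n^2) DFT."""
    n = len(signal)
    spectrum = []
    for k in range(n):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


def mean_squared_error(a, b):
    """Mean squared error between two equal-length spectra."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distance_matrix(records):
    """Pairwise MSE between the power spectra of all records."""
    spectra = [power_spectrum(to_signal(r)) for r in records]
    return [[mean_squared_error(p, q) for q in spectra] for p in spectra]
```

Because the power spectrum discards phase, two records that are circular shifts of each other receive a distance of zero under this construction, which is a property of the sketch rather than a claim about the poster's results.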
K elements were selected in a Jackknife-like iterative process, selecting and eliminating r elements on the DM in each iteration: a) the ith sample with most lower distance values among jth elements, b) the ith sample with most higher distance values among jth elements, c) the ith sample with lower distance average, d) the ith sample with higher distance average, e) the ith sample with lower overall distance and f) the ith sample with higher overall distance. This process continued until r(1)+r(2)+…+r(iterations required) ≥ K.
With the K list, it is possible to perform evaluation of the CC. Eight criteria were established for this purpose: a) average distance between each MC sample and nearest CC sample (ANE), b) average distance between each CC sample and the nearest CC sample (ENE), c) average distance between CC samples (E), d) means homogeneity t-test (MD), e) variances homogeneity F-test (VD), f) coincidence rate (CR), g) variable rate (VR) and h) alleles coverage (CA) [2,5,6].
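The three distance-based criteria (ANE, ENE and E) follow directly from their definitions above. The sketch below assumes a symmetric distance matrix given as a list of lists and a core collection given as a list of row indices; it also assumes that ANE averages over all MC samples, so samples that are themselves in the CC contribute a distance of zero. The function names are hypothetical.

```python
def ane(dm, core):
    """Average distance between each MC sample and its nearest CC sample."""
    return sum(min(dm[i][j] for j in core) for i in range(len(dm))) / len(dm)


def ene(dm, core):
    """Average distance between each CC sample and the nearest other CC sample."""
    return sum(min(dm[i][j] for j in core if j != i) for i in core) / len(core)


def e_score(dm, core):
    """Average pairwise distance between CC samples."""
    pairs = [(i, j) for pos, i in enumerate(core) for j in core[pos + 1:]]
    return sum(dm[i][j] for i, j in pairs) / len(pairs)
```

A low ANE indicates that every accession has a close representative in the CC, while high ENE and E indicate that the selected entries are spread far apart.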
The evaluations of CCs with different K values for the rice and foxtail millet MCs are presented in Table 1.
Table 1. Evaluation scores for rice and foxtail millet CCs with both genotype-only and genotype-AT signal constructions.
      | Rice                                  | Foxtail Millet                        |
      | Genotype          | Gen&AT            | Genotype          | Gen&AT            |
CM    | 780     | 780     | 273     | 273     | 423     | 423     | 141     | 141     |
K     | 24      | 96      | 24      | 24      | 24      | 48      | 24      | 24      |
ANE   | 0.6081  | 0.4438  | 0.574   | 0.3914  | 0.6491  | 0.5915  | 0.5677  | 0.5276  |
ENE   | 0.4654  | 0.5583  | 0.2961  | 0.5492  | 0.3364  | 0.3993  | 0.4254  | 0.4268  |
E     | 0.8794  | 0.8729  | 0.8917  | 0.7895  | 0.9022  | 0.9031  | 0.8661  | 0.8704  |
MD    | 12.2396 | 3.9062  | 2.1201  | 0       | 0.885   | 0.295   | 2.5723  | 0.7067  |
VT    | 45.0521 | 32.0312 | 51.2367 | 31.5789 | 40.118  | 41.5929 | 30.5466 | 45.583  |
CR    | 62.931  | 75.2941 | 68.9046 | 62.3252 | 77.4905 | 83.8585 | 89.0691 | 90.0526 |
VR    | 0.0001  | 0.0001  | 63.4782 | 1.9284  | 0.3534  | 0.202   | 76.4806 | 49.1965 |
CA    | 90.73   | 98.06   | 84.45   | 99.39   | 92.37   | 95.76   | 94.05   | 91.39   |
The results show good diversity representation of MC elements in the CCs selected from each collection, which implies that the generated DM and the selection procedures are efficient for CC selection. With single-signal construction from both AT and genotypic mapped data, regardless of its origin, it is possible to construct a comprehensive data-set for each analyzed element of the MC. The use of these comprehensive signals in CC selection improved the scores in several parameters. This suggests that the algorithm’s implementation based on signals generated from data of different origins is adequate for MC multiple data-set evaluation in CC construction.
The possibility of combining genotypic data with phenotypic traits, geographical locations, climates, habitats, nutritional requirements, symbiotic relationships, etc. provides the opportunity to determine which information is best included in the selection process to meet the particular objectives for which the CC is being selected. This concept, together with adequate scoring systems, may prove useful for the design of tailored CCs that comply with specific research/breeding objectives, and it is worth exploring in the future.
The successful implementation of Signal Processing Tools in CC selection encourages the idea that this concept may also be useful in MC analysis for other purposes, which we believe will be an important asset for the exploitation of Genetic Resources.
[1] De Beukelaer, H., Smýkal, P., Davenport, G.F., and Fack, V., Core Hunter II: fast core subset selection based on multiple genetic diversity measures using Mixed Replica search, BMC Bioinformatics, 13:312, 2012.
[2] Franco, J., Crossa, J., Ribaut, J.M., Betran, J., Warburton, M.L., Khairallah, M., A method for combining molecular markers and phenotypic attributes for classifying plant genotypes, Theor Appl Genet, 103:944-952, 2001.
[3] Guo, Y., Li, Y., Hong, H., Qiu, L., Establishment of the integrated applied core collection and its comparison with mini core collection in soybean (Glycine max), The Crop Journal, 1:38-45, 2014.
[4] Jansen, J., van Hintum, Th., Genetic Distance sampling: a novel sampling method for obtaining core collections using genetic distances with an application to cultivated lettuce, Theor Appl Genet, 114:421-428, 2007.
[5] Hu, J., Zhu, J., Xu, H.M., Methods of construction core collections by stepwise clustering with three sampling strategies based on the genotypic values of crops, Theor Appl Genet, 101:264-268, 2000.
[6] Odong, T.L., Jansen, J., van Eeuwijk, F.A., van Hintum, T.J.L., Quality of core collections for effective utilization of genetic resources review, discussion and interpretation, Theor Appl Genet, 126:289-305, 2013.
Heuristic principal component analysis based unsupervised feature extraction and its application to bioinformatics
Abstract: Feature extraction (FE) is a difficult task when the number of features is much larger than the number of samples, although that is a typical situation when biological (big) data are analyzed. This is especially true when stable FE, i.e., FE whose outcome is independent of the samples considered, is required, as is often the case. However, the stability of FE has not been considered seriously. In this poster, we demonstrate that principal component analysis (PCA) based unsupervised FE functions as stable FE. Three bioinformatics applications of PCA based unsupervised FE are discussed: 1. detection of aberrant DNA methylation associated with diseases [1-4], 2. biomarker identification using circulating microRNA [5-7], and 3. proteomic analysis of bacterial culturing processes.
In the first application, we have treated three examples: identification of genes with aberrant promoter methylation commonly associated with three autoimmune diseases, identification of genes with genotype-specific aberrant DNA methylation associated with esophageal squamous cell carcinoma, and identification of genes associated with aberrant gene expression and aberrant promoter methylation that are negatively correlated with each other in non-small cell lung cancer cell lines [3-4]. For all applications, we have successfully identified genes with significant probabilities.
In the second application, we have also treated three examples: identification of blood miRNAs discriminating between three liver inflammatory diseases and healthy controls, identification of blood miRNAs discriminating 14 diseases and healthy controls, and proposal of universal disease biomarkers using miRNAs identified in Reference 6. For these three applications, we have successfully identified sets of a limited number (about ten) of miRNAs that discriminate samples with accuracies from 0.8 to 0.9.
In the third application, we have applied PCA based unsupervised FE to the culturing processes of the bacterium S. pyogenes, which often causes life-threatening diseases. In this application, our method successfully identified critical proteins in the culturing processes of the bacteria.
In conclusion, PCA based unsupervised FE is a promising method that can be applied to a wide range of bioinformatics applications.
1) S. Ishida et al. (2014) Bioinformatic screening of autoimmune disease genes and protein structure prediction with FAMS for drug discovery, Protein Pept Lett., in press (PMID: 23855671).
2) R. Kinoshita et al. (2014) Genes associated with genotype-specific DNA methylation in squamous cell carcinoma as candidate drug targets, BMC Syst Biol., 8(S1):S4.
3) H. Umeyama, M. Iwadate, Y-h. Taguchi, TINAGL1 and B3GALNT1 are potential therapy target genes to suppress metastasis in non-small cell lung cancer, BMC Genomics, in press.
4) Y-h. Taguchi (2014) Integrative Analysis of Gene Expression and Promoter Methylation during Reprogramming of a Non-Small-Cell Lung Cancer Cell Line Using Principal Component Analysis-Based Unsupervised Feature Extraction, in “Intelligent Computing in Bioinformatics”, Lecture Notes in Computer Science, Vol. 8590, pp. 445-455.
5) Y. Murakami et al. (2012) Comprehensive miRNA expression analysis in peripheral blood can diagnose liver disease, PLoS ONE, 7(10):e48366.
6) Y-h. Taguchi and Y. Murakami (2013) Principal Component Analysis Based Feature Extraction Approach to Identify Circulating microRNA Biomarkers, PLoS ONE, 8(6):e66714.
7) Y-h. Taguchi, Y. Murakami, Universal disease biomarker: can a fixed set of blood microRNAs diagnose multiple diseases?, BMC Research Notes, in press.
8) Y-h. Taguchi, A. Okamoto (2012) Principal Component Analysis for Bacterial Proteomic Analysis, in “Pattern Recognition in Bioinformatics 2012”, Lecture Notes in Computer Science, Vol. 7632, pp. 141-152.
|46||Nusrat Jahan, Yu Matsuoka and Hiroyuki Kurata.
Dynamic modeling and simulation of the central metabolism in Escherichia coli
Abstract: One of the main objectives of systems biology is to understand the capabilities of biological systems and to modify and advance metabolic and regulatory systems on the basis of quantitative predictions supported by mathematical models. Mathematical models are widely used and valuable tools that allow researchers to reconstruct and predict complex cellular systems and thereby understand living cells. A common type of mathematical model is the differential equation model, which is particularly suitable for investigating the dynamic behavior of metabolic networks and molecular interactions. From both a scientific (for example, biochemistry) and an engineering (for example, metabolic engineering) perspective, it is essential to understand the overall metabolic regulation mechanism of bacterial cells such as Escherichia coli (E. coli). It is therefore highly desirable to construct a mathematical model that can describe the dynamic behavior of the cell in response to changes in the culture environment and/or specific genomic changes. Various kinds of models have been proposed for analyzing the dynamic behavior of cells, but most of them concentrate on specific metabolic pathways. Kinetic equations have been studied and proposed for the glycolysis pathway, the pentose phosphate (PP) pathway, and the tricarboxylic acid (TCA) cycle of E. coli. These models do not focus on enzyme activity and transcription factor (TF) activity. In the current study, we integrated the Kadir model and the Kotte model to build a large-scale metabolic and regulatory model of the central metabolism in E. coli, in which we included the glycolysis pathway, the PP pathway, the TCA cycle, anaplerotic enzymes, and the glyoxylate shunt. For transcriptional regulation, we included the cyclic AMP receptor protein (Crp), the catabolite repressor/activator (Cra) protein, the pyruvate dehydrogenase complex repressor (PdhR) protein and the acetate operon repressor (IclR) protein.
Our integrated model contains a total of 61 differential equations, whose variables include biomass, the concentrations of extracellular carbon sources (glucose and acetate), the concentrations of metabolites, the concentrations and phosphorylation states of enzymes and PTS proteins, and the binding states of transcription factors. We thus propose a dynamic model that simulates the central metabolic pathways of E. coli, such as glycolysis, the TCA cycle, the PP pathway and the anaplerotic pathways. This model contains transcriptional and enzymatic regulation and focuses on the combination of the genetic and metabolic layers, which is implemented through TF–metabolite interactions. To demonstrate the applicability of the integrated model, we simulated extracellular and intracellular metabolite concentrations in batch culture and compared them with experimental data. The comparison of the simulation results with the experimental data indicates that the integrated model can accurately simulate and reproduce the experimental data.
|47||Yuanyuan Peng, Yoshihiko Hasegawa, Nasimul Noman and Hitoshi Iba.
Nonlinear protein degradation for temperature compensation
Abstract: Robustness of genetic circuits is very important, so it is necessary to know which mechanisms are responsible for their robust phenotypes. We adopt cooperative stability for genetic oscillators to realize cooperation in the process of protein degradation, where cooperative stability denotes that high-order oligomers are more stable than the monomeric components. The linear programming method is then applied to analyze the influence of protein degradation on the temperature sensitivity of the period of the oscillators. One notable property of circadian oscillators is temperature compensation, meaning that their period is insensitive to variations in temperature, but the mechanism underlying temperature compensation is still unclear. Our theoretical results show that nonlinear protein degradation by cooperative stability is more beneficial for realizing temperature compensation of circadian clocks than the linear protein degradation model.
|48||Jiun-Huang Ju and Jung-Hsien Chiang.
Identifying drug-gene-disease interactions and an application to explore the undiscovered networks
Abstract: The automatic identification of drug-gene-disease interactions (DGDi) is an important part of genomic drug discovery. However, the rapid growth of biomedical information and the labor-intensive nature of manual annotation leave most of these interactions undiscovered. This study aims to explore the undiscovered drug-gene-disease interactions from large-scale biomedical literature as accurately as possible. The proposed approach includes: 1) pattern-based relation extraction; 2) trigger word learning; 3) inference of indirect relations. Rapidly exploring more undiscovered interactions could enable greater insights into drug discovery, such as drug repurposing, for further exploration.
A Formalized Notation of Gene Kinetic Events
Abstract: A notation for gene kinetic events is developed that bypasses verbal information and lifts to the foreground the essential results of gene regulation in scientific literature in a manner that is easily searchable using binary strings. The information compression thus achieved not only makes it easier to find the relevant data irrespective of research focus, but also presents the data in a form that appeals to the intellect by its simplicity. The notation is exemplified by its application to genes involved in the regulation of cell proliferation and cell differentiation by way of the regulatory proteins p300 and cyclin D1.
Other Information: The original paper submission is hereby transferred from presentation to poster session by invitation of the GIW/ISCB 2014 PC co-chairs, whereby the last sentence above has been added in order to provide further information about details in the original text. Thanks are extended to the reviewers for taking the time to write such thorough and helpful comments!
|50||Hirotaka Matsumoto and Hisanori Kiryu.
A stochastic process for time series analysis of gene expression dynamics.
Abstract: Dynamic biological processes such as cellular differentiation and cancer evolution can be regarded as transitions of cellular states along a lineage. Modeling such processes by integrating sequencing data such as RNA-Seq, ChIP-Seq, and BS-Seq will shed some light on the mechanisms of differentiation and malignancy. We are developing a stochastic model to describe the transition of multivariate traits by using the Ornstein-Uhlenbeck process.
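For readers unfamiliar with the Ornstein-Uhlenbeck process, the following Python sketch simulates a single trait with the Euler-Maruyama scheme applied to dX = theta (mu - X) dt + sigma dW. It is a univariate illustration only, whereas the abstract concerns multivariate traits, and the parameter names and values are assumptions chosen for the example.

```python
import math
import random


def simulate_ou(x0, theta, mu, sigma, dt, n_steps, rng):
    """Simulate one Ornstein-Uhlenbeck path with the Euler-Maruyama scheme.

    theta: strength of mean reversion, mu: long-run mean,
    sigma: noise intensity, rng: a random.Random instance.
    """
    path = [x0]
    x = x0
    for _ in range(n_steps):
        drift = theta * (mu - x) * dt
        noise = sigma * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        x += drift + noise
        path.append(x)
    return path


# Example: an expression level relaxing from 0 toward a hypothetical optimum of 1.
trajectory = simulate_ou(x0=0.0, theta=1.0, mu=1.0, sigma=0.2,
                         dt=0.01, n_steps=1000, rng=random.Random(0))
```

The drift term pulls the trait toward `mu` while the noise term spreads it, which is why the process is a common choice for modeling traits that fluctuate around a state-dependent optimum.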
|51||Ryunosuke Itasaki, Hiroyuki Masunaga and Hiroyuki Kurata.
Dynamic sensitivity analysis of ErbB signaling system
Abstract: ErbB signaling plays an important role in the regulation of cell proliferation, survival, metastasis, and invasion in various tumors. To quantitatively understand ErbB signaling, we employ Marc and Okada’s mathematical model, which describes how stimulation of all four ErbB receptors with epidermal growth factor (EGF) and heregulin (HRG) activates two critical downstream proteins, extracellular-signal-regulated kinase (ERK) and Akt. We used dynamic sensitivity analysis to identify the critically important reactions for regulating the ErbB signaling network.
The dynamic sensitivity is an important measure for estimating robustness and for finding critically important kinetic parameters. We simulated the time courses of protein concentrations or their activities and the dynamic sensitivities of the activities of ERK (ERK*) and Akt (Akt*) with respect to kinetic parameters at different sets of initial concentrations of EGF and HRG. We calculated the time-integrated value and peak value of the dynamic sensitivities of ERK* and Akt* with respect to each kinetic parameter. We found some critical parameters that show high time-integrated and peak values at most sets of the EGF and HRG initial concentrations. These parameters involve the dissociation of the complex of ligand-bound ErbB receptors and PTP-1B. In addition, we suggest some critical parameters that provide a high time-integrated dynamic sensitivity specifically to Akt*, without being affected by the initial ligand concentrations.
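The time-integrated and peak values of a dynamic sensitivity can be illustrated on a toy system. The sketch below is not the ErbB model: it uses the one-variable decay model dx/dt = -k x as a stand-in, approximates the sensitivity of x(t) with respect to k by central finite differences, and integrates its absolute value with the trapezoidal rule. All names and step sizes are assumptions.

```python
def simulate(k, x0=1.0, dt=0.001, t_end=5.0):
    """Integrate the toy model dx/dt = -k * x with the explicit Euler method."""
    n_steps = int(round(t_end / dt))
    xs = [x0]
    x = x0
    for _ in range(n_steps):
        x += dt * (-k * x)
        xs.append(x)
    return xs


def dynamic_sensitivity(k, rel_step=1e-4, **sim_kwargs):
    """Approximate dx(t)/dk at every time point by central finite differences."""
    h = k * rel_step
    upper = simulate(k + h, **sim_kwargs)
    lower = simulate(k - h, **sim_kwargs)
    return [(u - l) / (2.0 * h) for u, l in zip(upper, lower)]


def summarize(sensitivity, dt):
    """Return (time-integrated absolute sensitivity, peak absolute sensitivity)."""
    abs_s = [abs(s) for s in sensitivity]
    integrated = sum((a + b) / 2.0 * dt for a, b in zip(abs_s, abs_s[1:]))
    return integrated, max(abs_s)


integrated, peak = summarize(dynamic_sensitivity(1.0, dt=0.001, t_end=5.0), dt=0.001)
```

For this toy model the exact sensitivity is -t x0 exp(-k t), so with k = 1 the peak magnitude is close to exp(-1) at t = 1. Ranking parameters of a larger model by these two summaries is the kind of comparison the abstract describes.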
|52||Junko Yamane, Toru Maruyama, Michihiro Ito, Haruko Takeyama and Wataru Fujibuchi.
Kernel CCA for Microbial Meta-transcriptome Data to Investigate Coral Survivability in Ryukyu Sea
Abstract: We apply the kernel canonical correlation analysis (KCCA) method to microbial meta-transcriptome data to investigate marine conditions for improving coral living environment. The results show that the KCCA with RBF kernel generates better associations between the meta-transcriptome and KEGG pathways than the linear CCA. The obtained Pearson correlations are r=0.99 and r=0.56 for KCCA and linear CCA, respectively. Currently, we are analyzing substantial pathways for marine conditions based on the canonical coefficients.
|53||Yusuke Azuma and Shuichi Onami.
Evaluation of the effectiveness of simple nuclei-segmentation methods on Caenorhabditis elegans embryogenesis images
Abstract: Advances in microscopy and molecular labeling have enabled us to understand embryogenesis in terms of cellular dynamics. To analyze these dynamics, various automated processing methods have been developed for nuclei segmentation. These methods tend to be complex when segmenting images with crowded nuclei, which prevents their simple reapplication to other problems. On the other hand, simple methods do exist, and if they can provide sufficiently accurate segmentation, researchers unfamiliar with image processing would be able to trace cellular dynamics. Once the dynamics are traced, many techniques from the bioinformatics field can be applied to analyze them. For example, a clustering technique may find similar motions by comparing the traced trajectories, and gene expression analysis of the cells in the same cluster may find commonly expressed genes. Thus, it is useful to evaluate the applicability of the simple methods.
Here, we selected six simple methods from various watershed-based and local-maxima-detection-based methods that are frequently used for nuclei segmentation, and created a system to evaluate their segmentation accuracy by applying them to spatio-temporal images of Caenorhabditis elegans embryos. By applying the methods to image data from the 50- to the 500-cell developmental stages at 50-cell intervals, the error rate for nuclei detection could be reduced to ≤ 2.1% at every stage until the 350-cell stage. The fractions of total errors throughout the stages could be reduced to ≤ 2.4%. The error rates improved at most of the stages and the total errors improved when a 4D noise filter was used. The methods with the least errors were two watershed-based methods with 4D noise filters. For all the other methods, the error rate and the fraction of errors could be reduced to ≤ 4.2% and ≤ 4.1%, respectively. The minimum error rate for each stage between the 400- and 500-cell stages ranged from 6.0% to 8.4%. However, similarities between the computational and manual segmentations measured by volume overlap and Hausdorff distance varied from ~10% to ~70% and from ~2 to ~4.5 μm, respectively. The methods were also applied to Drosophila and zebrafish embryos and found to be effective.
The simple segmentation methods were found to be useful for detecting nuclei until the 350-cell stage, but not very useful after the 400-cell stage. The 350-cell stage is the second-to-last stage of embryonic cell division, and cell tracing up to this stage has been used to measure reporter expression at cellular resolution and has yielded biological insights in some studies, suggesting that tracing until this stage can produce useful results.
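The two similarity measures mentioned above can be sketched for a single nucleus as follows. The abstract does not say how volume overlap was defined; the Jaccard index used here, the representation of a segmentation as a set of voxel coordinates, and the `spacing` parameter (physical voxel size in μm) are assumptions made for illustration.

```python
import math


def volume_overlap(voxels_a, voxels_b):
    """Jaccard index (intersection over union) of two voxel sets."""
    a, b = set(voxels_a), set(voxels_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def hausdorff_distance(voxels_a, voxels_b, spacing=(1.0, 1.0, 1.0)):
    """Symmetric Hausdorff distance between two non-empty voxel sets.

    Brute force, O(len(a) * len(b)); adequate for single nuclei only.
    """
    def dist(p, q):
        return math.sqrt(sum(((pi - qi) * s) ** 2
                             for pi, qi, s in zip(p, q, spacing)))

    def directed(src, dst):
        # Largest distance from a point in src to its nearest point in dst.
        return max(min(dist(p, q) for q in dst) for p in src)

    return max(directed(voxels_a, voxels_b), directed(voxels_b, voxels_a))


# Hypothetical manual and computed segmentations of one nucleus.
manual = {(0, 0, 0), (1, 0, 0)}
computed = {(1, 0, 0), (2, 0, 0)}
print(volume_overlap(manual, computed))
print(hausdorff_distance(manual, computed, spacing=(0.5, 0.5, 0.5)))
```

A detection can count as correct while still scoring poorly on both measures, which is consistent with the gap the abstract reports between low detection error rates and modest overlap values.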
Image processing of microscopy images accelerates understanding of biological phenomena and will create new research fields in bioinformatics.
Nutritional Systems Biology
Abstract: “Prevention is better than cure” and when it comes to human health, this strategy translates into many socioeconomic benefits. Practically all the cellular processes, including every step in the flow of genetic information from gene expression to protein synthesis and degradation, can be affected by diet and lifestyle. Similar to the role of pharmaceuticals, nutrients contain a number of different compounds that act as modifiers of network function and stability.
However, the level of complexity in nutrition studies is further increased by the simultaneous presence of a variety of nutrients, with diverse chemical structures that can have numerous targets with different affinities and specificities. Obviously, this differentiates the nutritional from the pharmacological studies, where single elements are used at low concentrations and with a relatively high affinity and specificity in a small number of thoroughly selected targets. Our need for fundamental understanding of the building blocks of the complex biological systems had been the main reason for the reductionist approach that was mainly applied in the past to elucidate these systems.
We used advanced data-mining tools to construct a database of available, state-of-the-art information on the interactions of food and its molecular components with biological systems and their connection to health and disease. The database was enriched with predicted interactions between food components and protein targets, based on their structural and pharmacophore similarity to known small-molecule ligands. Furthermore, the associations of bioactive food components with metabolic pathways were investigated from a chemical-protein network perspective, while their effects on network robustness were confirmed by proteome analyses and high-throughput genotype-phenotype characterization.
We introduce NutriChem, the world's first ‘exhaustive’ resource on the health benefits associated with specific dietary interventions; no prior resource covers the broad molecular content of food.
NutriChem, available at http://cbs.dtu.dk/services/NutriChem-1.0, is a database generated by text mining of 21 million MEDLINE abstracts to collect all available information that links plant-based foods with their small-molecule components and human disease phenotypes. NutriChem contains text-mined data for 18,478 pairs of 1,772 plant-based foods and 7,898 phytochemicals, and 6,242 pairs of 1,066 plant-based foods and 751 diseases. In addition, it includes predicted associations for 548 phytochemicals and 252 diseases. To the best of our knowledge, this database is the only resource linking the chemical space of plant-based foods with human disease phenotypes, and it provides a foundation for understanding mechanistically the consequences of eating behaviors on health.
|55||Robert Cox, Masahiko Nakatsui, Hiroki Makiguchi, Teppei Ogawa, Akihiko Kondo and Michihiro Araki.
Meta-synthetic metabolism: in silico design of novel amino acids from all-organism data
Abstract: We explored the metabolism of amino acids by predicting an expansion of the all-organism amino acid biosynthesis network as curated by KEGG. Starting with the KEGG network, we assigned putative enzymatic reactions to new amino acid derivatives. In addition to finding many natural amino acid derivatives–which are not part of the core network–we identified thousands of putative amino acid derivatives that might be accessible with suitable choice of promiscuous or engineered enzymes.
|56||Chun-Ying Yu, Hsiao-Jung Liu, Li-Yuan Hung, Hung-Chih Kuo and Trees-Juen Chuang.
Reconfirmation and evolutionary analysis of previously-annotated chimeric and circular RNA products from EST/RNA-seq data
Abstract: Global transcriptome investigations often result in the detection of an enormous number of transcripts composed of non-co-linear sequence fragments. Such “aberrant” transcript products may arise from post-transcriptional events or genetic rearrangements, or may otherwise be false positives (sequencing/alignment errors or in vitro artifacts). Moreover, post-transcriptionally non-co-linear (“PtNcl”) transcripts can arise from trans-splicing or back-splicing in cis (to generate so-called “circular RNA”). Here, we collected previously-predicted human non-co-linear RNA candidates, and designed a validation procedure integrating in silico filters with multiple experimental validation steps to examine their authenticity. We showed that >50% of the tested candidates were in vitro artifacts, even though some had been previously validated by RT-PCR. After excluding the possibility of genetic rearrangements, we distinguished between trans-spliced and circular RNAs, and confirmed that these two splicing forms can share the same non-co-linear junction. Importantly, the experimentally-confirmed PtNcl RNA events and their corresponding PtNcl splicing types (i.e., trans-splicing, circular RNA, or both sharing the same junction) were all expressed in rhesus macaque, and some were even expressed in mouse. Our study thus describes an essential procedure for confirming PtNcl transcripts, and provides further insight into the evolutionary role of PtNcl RNA events, opening up this important, but understudied, class of post-transcriptional events for comprehensive characterization.
|57||Shih-Wen Huang, Hsin-Yeh Yang, Chia-Rui Yen, Sheng-Jou Hung, Shuen-Lin Jeng and Tsunglin Liu.
In vivo relationship between differential DNA methylation and differential gene expression in hepatocellular carcinoma
Abstract: DNA methylation is involved in various key processes, including carcinogenesis. Several studies have shown that promoter methylation is associated with gene silencing while gene-body methylation correlates positively with expression; these findings were obtained by comparing different genes in normal tissues. However, it is not clear whether the relationship holds for the aberrant methylation and aberrant expression of the same genes across patients with hepatocellular carcinoma (HCC). A suitable statistical method that integrates methylation in multiple gene regions to explain expression is also missing, because methylation correlates with expression in a non-linear and composite manner.
We probed global methylation at CpG islands in both promoter and gene-body regions for twelve liver tumor tissues and the adjacent non-tumor tissues using an mDIP-Chip approach. Global expression levels were also measured by microarray. Combining the two datasets, we found a position-dependent relationship between differential methylation and differential expression in HCC tissues, which resembles the previously discovered relationship between methylation and expression. We further proposed a piecewise-linear statistical model to identify genes whose differential expression could be explained by the combination of differential methylation in multiple gene regions. Via permutation and simulation analysis, we identified 122 significant genes, with an estimated false positive rate of ~34%.
Our data indicate that the position-dependent relationship between differential methylation and differential expression in HCC tissues is similar to the previous finding in normal tissues. The relationship was captured by our model with statistical significance, but the signal was moderate, suggesting a moderate impact of DNA methylation in HCC. For some of the identified genes, the aberrant expressions have been reported and our statistical models suggest that differential methylation in these genes is involved in regulating the expression.
|58||Takahiro Nukui and Kiyoshi Asai.
More Precise Detection of Horizontal Gene Transfer
Abstract: Detecting horizontal gene transfer (HGT) from a given species tree and gene tree is known to be computationally difficult. The problem is called the Maximum Agreement Forest problem and is known to be NP-hard. An approximation algorithm that guarantees a worst-case approximation ratio of 3 was devised previously, but it never achieves an approximation ratio better than 2. We improved the algorithm: the new algorithm guarantees a solution at least as good as that of the previous algorithm at the same computational complexity, O(n^2). Moreover, according to computational simulations, the new algorithm sometimes returns solutions that are two times better than those of the previous one. We hope that this algorithm will make HGT analysis more precise and accurate.
|59||Trupti Joshi, Saad Khan, Yang Liu, Joao V. Maldonado Dos Santos, Juexin Wang, Mats Rynge, Nirav Merchant, Dong Xu and Henry Nguyen.
Next Generation Resequencing of Soybean Germplasm for Trait Discovery
Abstract: With advances in next generation sequencing (NGS) technology and a significant reduction in sequencing costs, it is now possible to sequence large sets of crop germplasm and generate whole-genome-scale structural variation and genotypic data. In-depth informatics analysis of the genotypic data can provide a better understanding of the links with the observed phenotypic changes. This approach can be used to further understand and study different traits for the improvement of crops by design.
We have resequenced several soybean germplasm lines selected for major traits including oil, protein, soybean cyst nematode (SCN) resistance, abiotic stress resistance (drought, heat and salt) and root system architecture. We performed bioinformatics analysis and identified SNPs, insertions and deletions using GATK 3.0. We also conducted copy number variation (CNV) analysis and annotated SNPs with SnpEff. We conducted a 25-genome case study of SCN resistance and classified the genomes into four categories of resistance and susceptibility levels. GWAS analysis identified major SNPs associated with the phenotypic changes between these lines. We performed linkage disequilibrium and haplotype analysis using Haploview. We also applied generalized linear models (GLM) and mixed linear models (MLM) using TASSEL to identify SNPs significant for phenotypic changes.
Analysis was conducted using XSEDE as the computing infrastructure, iPlant as the data and cloud infrastructure, and the Pegasus workflow system to control and coordinate the data management and computational tasks. All data, including GWAS results and SNP annotations, can be accessed through the Soybean Knowledge Base (SoyKB) at http://soykb.org.
|60||Tony Chien-Yen Kuo, Chuan-Hung Chen, Chien-Yu Chen and Long-Fang O. Chen.
Genome-Wide Analysis of DNA methylation and Gene Expression in Agarwood to Investigate Cucurbitacin Biosynthesis under Far Red Light and Red Light Conditions
Abstract: Agarwood is derived from Aquilaria trees, an endangered species, the trade of which has come under strict control. Many secondary metabolites of agarwood are known to have medicinal value to humans, including compounds that have been shown to elicit sedative effects and exhibit anti-cancer properties. However, little is known about the genome, transcriptome, and the biosynthetic pathways responsible for producing such secondary metabolites in agarwood.
Recently, we published a draft genome of Aquilaria agallocha (BMC Genomics 2014, 15:578). In this study, we have additionally constructed various genome-wide profiles to investigate cucurbitacins, important secondary metabolites in agarwood with medicinal value.
In vitro samples of A. agallocha were grown under far red light and red light conditions to stimulate anti-pathogen pathways. DNA and RNA data were utilized to annotate genes and protein functions in the draft genome. Time-course RNA-seq, sRNA, and bisulfite sequence data were utilized to construct expression and methylation profiles. The expression changes for cucurbitacin are shown to be consistent with known responses of A. agallocha to biotic stress, and a set of homologous genes in Arabidopsis thaliana related to cucurbitacin biosynthesis is presented. Methylation and sRNA profiles provide evidence that these factors are involved in the biotic stress response in A. agallocha.
Our recent publication was the first attempt to identify cucurbitacin from in vitro agarwood and the first draft genome for any species of Aquilaria. The draft genome and the results of this study will aid in future investigations of secondary metabolite pathways in Aquilaria and other non-model medicinal plants.
Regulation of transcription by chromatin dynamics upon LPS stimulation
Abstract: Background: Gene expression changes in response to stimuli are regulated in concert by epigenetic modifications and transcription factors. However, the complex interactions between these levels of regulation are still poorly understood. Here, we studied the genome-wide dynamics in epigenetic markers and their interplay with transcription factor binding, in mouse dendritic cells (DCs) following stimulation with lipopolysaccharide (LPS), a potent stimulator of the innate immune response.
Results: We generated ChIP-seq data for several histone modifications and RNA polymerase II in DCs in a time series after LPS stimulation (0, 0.5, 1, 2, 3, 4, 6, 8, 16, and 24 hrs). These data were integrated with gene expression data and genome-wide transcription initiation data taken from the same cell type stimulated by the same stimulus. Considerable stimulus-induced changes in histone modifications at promoters and enhancers were observed. Focusing on the promoters of induced genes, we found that some epigenetic markers are induced early after stimulation, roughly simultaneously with gene induction times. However, others increase during a later time frame, and independently of gene induction times. Modeling the epigenetic dynamics of enhancers using a 14-state Hidden Markov model (HMM) suggested the existence of a specific state for stimulus-activated enhancers. Enhancers in this state transiently increase in number shortly after stimulation. A 12-state HMM for promoters revealed, amongst others, widespread shifts in Pol2 binding immediately following stimulation. Many of these phenomena are significantly correlated with transcription factor binding events.
Conclusions: Our time series epigenetic data and its analysis reveal a high level of dynamic complexity in the stimulus-induced regulation of gene expression.
|62||James Perkins, Jose Antonio Cornejo-Garcia and Miguel Blanca.
Investigating the effects of genetic variation in 5’ flanking regions using fpFun
Abstract: Most SNPs found by recent genome wide association studies (GWAS) are located in non-exonic regions of the genome. Elucidating their function and unravelling their link with disease is difficult and requires the use of various techniques [1].
Here we present a bioinformatics workflow, fpFun, which takes a given set of genes and investigates the putative function of SNPs in the 5’ upstream region of each gene. It uses data from the ENCODE project relating to transcription factor binding sites, areas of the genome involved in chromatin modifications, and regions of DNA methylation [2]. It also looks for SNPs located in eQTLs and SNPs in genetic linkage with disease-associated SNPs identified by GWAS experiments taken from the NHGRI GWAS catalogue [3].
We applied the workflow to analyse eicosanoid receptor genes, involved in inflammation and immune system function, and implicated in many pathologies. We found SNPs located in important regulatory regions, affecting gene expression, and associated with various pathologies including asthma and various cancers.
This meta-analysis pipeline will complement current efforts looking for the effect of exonic SNPs on protein structure by adding contextual information regarding their function, expression and regulation.
[1] Edwards, S.L. et al. (2013) Beyond GWASs: illuminating the dark road from association to function. Am. J. Hum. Genet. 93, 779–797
[2] Dunham, I. et al. (2012) An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74
[3] Welter, D. et al. (2014) The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res. 42, D1001–1006
|63||Chieh-Hua Lin, Yu Bin Wang, Chao A. Hsiung and Chung Yen Lin.
EVIDENCE: a web-based tool for genotyping and recombination detection of enterovirus
Abstract: Different genotypes within the genus Enterovirus (EV) of the family Picornaviridae cause diverse infectious diseases in humans and other mammals. The 13 coding regions of the single-stranded RNA genome of EV undergo frequent recombination events between and within genotypes, causing a continuously growing number of novel genotypes. With the increasing number of EV sequences, more sensitive and specific molecular typing tools are needed for rapid clinical diagnosis and treatment. To this end, a web-based tool named EVIDENCE (Enterovirus in deep conception) was developed for EV genotyping and recombination detection. The 366 currently known genotypes were collected as a reference set for query and download. To genotype a new genome or sequence fragment, the user can submit the sequence in FASTA format for comparison, then select a single coding region or combine several coding regions to infer plausible genotypes by summing the similarity scores between the submission and the references. Furthermore, the phylogenetic relationships among the selected genotypes may provide clues to genome recombination and indicate which genotypes are possibly involved. By comparing the sequences of the selected genotypes with the submission using bootstrap and genetic similarity analyses, the user can try to detect the recombination site through a friendly web interface.
In addition, EVIDENCE re-classified the genotypes of all EV sequences in GenBank using the latest classification and nomenclature of EV. Eighty-six genotypes are either not included in GenBank or need to be updated there. The automatic genotyping and recombination detection tools of EVIDENCE can facilitate genotype assignment and the identification of recombinants, thus increasing the number of typable EV isolates and improving the understanding of EV evolution.
EVIDENCE is available at http://symbiont.iis.sinica.edu.tw/evidence.
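The genotyping step described above, summing per-region similarity scores over the coding regions the user selects, can be sketched as follows. This is not the EVIDENCE implementation: the similarity score (fraction of identical positions in pre-aligned sequences of equal length), the genotype labels `typeA` and `typeB`, and the toy sequences are all hypothetical.

```python
def identity(seq_a, seq_b):
    """Fraction of identical positions in two pre-aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(seq_a, seq_b)) / len(seq_a)


def rank_genotypes(query, references, regions):
    """Rank reference genotypes by similarity summed over the selected regions.

    query:      {region: sequence}
    references: {genotype: {region: sequence}}
    regions:    coding regions chosen by the user
    """
    scores = {
        genotype: sum(identity(query[r], ref[r]) for r in regions)
        for genotype, ref in references.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data: two coding regions, two made-up reference genotypes.
query = {"VP1": "ACGTACGT", "3D": "GGCC"}
references = {
    "typeA": {"VP1": "ACGTACGA", "3D": "GGCA"},
    "typeB": {"VP1": "ACGAACGA", "3D": "GGCC"},
}
print(rank_genotypes(query, references, ["VP1"]))
print(rank_genotypes(query, references, ["VP1", "3D"]))
```

In this toy case the best match changes depending on which regions are included, which is the kind of disagreement between regions that would prompt a closer look for recombination.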
|64||Yoshitaka Onishi, Naohisa Goto, Kosuke M. Teshima and Teruo Yasunaga.
Comprehensive analysis of genes that experienced positive selection in Hominoidea evolution
Abstract: We investigated whether adaptive evolution of genes has occurred during the speciation process by finding phylogenetic branches that show high ka/ks ratios (ω > 1). In this research, we used the Ensembl database [1] to search for orthologous genes among human (H), chimpanzee (C), gorilla (G), orangutan (O), and macaque (M). The species tree for these species is well established: ((((H, C), G), O), M). We used this tree topology to calculate the number of synonymous substitutions per synonymous site (ks), the number of nonsynonymous substitutions per nonsynonymous site (ka), and the ka/ks ratio (ω) for each branch using PAML [2]. By conducting statistical tests, genes were classified into groups based on the pattern of ω values (ω > 1, ω = 1, ω < 1) along the branches of the phylogeny. In particular, we focused on the pattern of branches that show positive selection (ω > 1). We will investigate all orthologous genes that show positive selection. We are also generating a database of orthologous genes to facilitate such research.
[1] Hubbard, T., Barker, D., Birney, E., Cameron, G., Chen, Y., Clark, L., … & Clamp, M. (2002). The Ensembl genome database project. Nucleic acids research, 30(1), 38-41.
[2] Yang, Z. (1997). PAML: a program package for phylogenetic analysis by maximum likelihood. Computer applications in the biosciences: CABIOS, 13(5), 555-556.
|65||Jan Engelhardt and Peter F. Stadler.
Evolution of the Unspliced Transcriptome
Abstract: Despite their abundance, unspliced EST data have received little attention as a source of information on non-coding RNAs. Very little is known, therefore, about the genomic distribution of unspliced non-coding transcripts and their relationship with the much better studied regularly spliced products. In particular, their evolution has remained virtually unstudied.
We systematically study the evidence on unspliced transcripts available in EST annotation tracks for human and mouse, comprising 104,980 and 66,109 unspliced EST clusters, respectively. Roughly one third of these are located totally inside introns of known genes (TINs) and another third overlap exonic regions (PINs). 11% are “intergenic”, far away from any annotated gene. Direct evidence for the independent transcription of many PINs and TINs is obtained from CAGE tag and chromatin data. 15-20% of the unspliced EST clusters are conserved between human and mouse. With the exception of TINs, the sequences of unspliced EST clusters evolve significantly more slowly than the genomic background, and conservation between human and mouse is not uncommon.
Unspliced long non-coding RNAs are an important, rapidly evolving, component of mammalian transcriptomes. Their analysis is complicated by their preferential association with complex transcribed loci that usually also harbor a plethora of spliced transcripts.
The analysis of unspliced ESTs uncovers a largely unexplored realm of long transcripts, including many that were previously unknown. The frequently postulated connection between lack of splicing and nuclear retention suggests that this class of transcripts might be involved in chromatin organization and possibly other mechanisms of epigenetic control. Further experimental investigation of the long transcripts could provide great insight into these mechanisms.
|66||Kyung-Won Hong, Seok Won Jeong, Myungguen Chung and Seong Beom Cho.
eQTL SNPs and metabolic traits in Koreans
Abstract: Most genome-wide association studies consider genes that are located closest to single nucleotide polymorphisms (SNPs) that are highly significant for those studies. However, the significance of the associations between SNPs and candidate genes has not been fully determined. An alternative approach that used SNPs in expression quantitative trait loci (eQTL) was reported previously for Crohn’s disease; it was shown that eQTL-based preselection for follow-up studies was a useful approach for identifying risk loci from the results of moderately sized GWAS. In this study, we propose an approach that uses eQTL SNPs to support the functional relationships between an SNP and a candidate gene in a genome-wide association study. The genome-wide SNP genotypes and 10 biochemical measures (fasting glucose levels, BUN, serum albumin levels, AST, ALT, gamma GTP, total cholesterol, HDL cholesterol, triglycerides, and LDL cholesterol) were obtained from the Korean Association Resource (KARE) consortium. The eQTL SNPs were isolated from the SNP dataset based on the RegulomeDB eQTL-SNP data from the ENCODE projects and two recent eQTL reports. A total of 25,658 eQTL SNPs were tested for their association with the 10 metabolic traits in 2 Korean populations (Ansung and Ansan). The proportion of phenotypic variance explained by eQTL and non-eQTL SNPs showed that eQTL SNPs were more likely than non-eQTL SNPs to be genetically associated with the metabolic traits. Finally, via a meta-analysis of the two Korean populations, we identified 14 eQTL SNPs that were significantly associated with metabolic traits. These results suggest that our approach can be expanded to other genome-wide association studies.
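Combining per-SNP association results from two populations can be sketched as below. The abstract does not state which meta-analysis method was used; the inverse-variance-weighted fixed-effect model shown here is a common choice and is an assumption, as are the effect sizes and standard errors in the example.

```python
import math


def fixed_effect_meta(betas, standard_errors):
    """Inverse-variance-weighted fixed-effect meta-analysis of one SNP.

    Returns (pooled beta, pooled standard error, z score, two-sided p-value).
    """
    weights = [1.0 / se ** 2 for se in standard_errors]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    # Two-sided p-value under a standard normal null: 2 * (1 - Phi(|z|)).
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, z, p


# Hypothetical effect estimates for one eQTL SNP in two cohorts.
beta, se, z, p = fixed_effect_meta(betas=[0.10, 0.14],
                                   standard_errors=[0.02, 0.02])
print(beta, se, z, p)
```

With 25,658 SNPs and 10 traits tested, the resulting p-values would still need a multiple-testing threshold, which the abstract does not specify.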
|67||Zing Tsung-Yeh Tsai, Shin-Han Shiu and Huai-Kuang Tsai.
Impacts of sequence motif, chromatin state, and DNA structure features for predicting transcription factor binding
Abstract: Transcription factor binding is determined by multiple factors including sequence specificity and chromatin accessibility, where the latter is influenced by both chromatin state and DNA structural properties. Although these features can be used to predict TF binding sites, their relative and joint contributions remain unclear. In particular, given that some of these features can be predicted from genomic sequence alone, it remains an open question how well they can be applied to predicting binding regions. A systematic assessment of the impact of jointly considering 23 features in predicting TF binding preference shows that chromatin state and DNA structural properties are better predictors of binding than the sequence motif of a TF. In addition, simultaneously considering chromatin state and DNA structural properties further improves the accuracy of TF binding prediction, indicating that these two feature sets are highly synergistic. However, their relative contributions differ greatly between TFs. Most importantly, we show that three DNA intrinsic properties are particularly critical in predicting TF binding. Using the intrinsic model, we can predict binding regions not only across TFs but also across DNA-binding domain families with distinct structural folds. The intrinsic property model allows TF binding predictions across DNA-binding domain families that are present in most eukaryotes, suggesting that the model is likely universal and can be used across species. Thus, our findings demonstrate the feasibility of establishing a universal model for identifying regulatory regions in any sequenced genome.
|68||Paweł Bednarz and Bartek Wilczynski.
Predicting chromatin boundaries genome-wide from ChIP-Seq data
Abstract: Three-dimensional chromatin structure has a substantial impact on transcription regulation. Although predicting the exact three-dimensional chromatin conformation is hard, new methods have recently been developed that make it possible to determine which chromatin fragments tend to be close to each other. As a result, chromatin can be divided into disjoint domains with small fragments constituting the boundaries between them. There is evidence in the literature that some insulator proteins tend to bind near those boundaries. To better understand which factors drive the partition into domains, we developed models able to predict domain boundaries from insulator binding sites and histone modification locations. We chose a set of insulators and histone modifications which, to our knowledge, may play a role in shaping the chromatin structure. We then created three models that predict boundaries based on k-means clustering, Hidden Markov Models and Bayesian Networks. The model based on a Bayesian Network outperformed the two others. Over recent years, many attempts (using different approaches) have been made to find a reliable segmentation of chromatin, which made it possible to validate our model. Besides precisely capturing boundaries predicted by other frameworks, our model is also able to make new predictions that can elucidate observed biological phenomena. Furthermore, by inspecting the structure of the resulting Bayesian network, we can state which of the considered factors play a major role in giving rise to domain boundaries. We also assayed the robustness of the classifier between the set of domain boundaries reported using DamID data and Hidden Markov Models and the set derived from a Hi-C experiment. Not only are the predictors largely the same, but the classifier trained on the first set of boundaries also generalizes to the second one.
In comparison to sequence-based predictors, our method has much higher accuracy with comparable predictive power.
|69||Theresa Tsun-Hui Tsao, Cheng-Yan Kao, Kuang-Chi Chen, Ko-Chun Yang and Sheng-An Lee.
Gene Neighbourhood (GeNei), a Novel Concept for Analyzing the Collective mRNA Expression of Neighbouring Genes on a Protein-Protein Interaction Network
Abstract: The concept of “gene neighbourhood”, abbreviated as GeNei, was introduced to describe a central gene and its immediate level 1 interactors on protein-protein interaction (PPI) networks. It was proposed based on the observation that neighbouring genes usually have similar mRNA expression patterns, so that the abundance of neighbouring genes may be considered collectively, rather than individually. In this study, we applied the concept of GeNei to analyze microarray datasets of Arabidopsis thaliana under high salt stress. High salt stress has been studied as the model osmotic stress. There are many microarray studies on the effect of high salt on mRNA expression. However, the stress response has not been well investigated from the PPI perspective, partly because of the lack of sufficient A. thaliana PPI data until recent years. Although the GeNei method was not specifically designed to identify protein functional modules and complexes, it revealed many potential salt-regulated protein functional modules and complexes that were not seen using conventional single-gene analytical methods.
|70||Euna Jeong and Sukjoon Yoon.
Analysis of intrinsic behavior of gene expression for the improved interpretation of transcriptome data
Abstract: Individual genes exhibit various ranges of expression levels to perform their biological functions. Thus, considering the intrinsic variability of gene expression can improve geneset-based functional analyses, which are typically used to interpret transcriptome data. We applied a new data analysis strategy to capture the expressional variability of individual genes using large collections of transcriptome and proteome data. The analysis revealed the intrinsic variability of gene expression at the transcriptional level and different levels of regulation at the transcription and translation stages. In this study, we present a method to redefine pathway genesets by excluding genes with low transcriptional variation for transcriptome data analysis. This approach provides improved resolution in geneset enrichment analysis for prioritizing target functional pathways in several different experimental datasets.
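The idea of redefining a pathway geneset by excluding genes with low transcriptional variation can be sketched as follows. The abstract does not specify the variability measure or cutoff; the population variance across samples, the `min_variance` threshold, and the gene names and values below are assumptions for illustration only.

```python
from statistics import pvariance


def filter_geneset(geneset, expression, min_variance):
    """Keep genes whose expression variance across samples meets the cutoff.

    geneset:      iterable of gene identifiers in a pathway
    expression:   {gene: [expression value per sample]}
    min_variance: hypothetical cutoff; genes without data are dropped
    """
    return [
        gene for gene in geneset
        if gene in expression and pvariance(expression[gene]) >= min_variance
    ]


# Made-up expression values across four samples.
expression = {
    "G1": [1.0, 1.0, 1.0, 1.0],   # no variation
    "G2": [1.0, 5.0, 2.0, 8.0],   # high variation
    "G3": [2.0, 2.1, 1.9, 2.0],   # low variation
}
pathway = ["G1", "G2", "G3", "G4"]  # G4 has no expression data
print(filter_geneset(pathway, expression, min_variance=0.1))
```

The filtered geneset would then replace the original one as input to a standard enrichment analysis.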
|71||Bahtiyor Nosirov, Daron Standley, Shunsuke Teraguchi and Alexis Vandenbon.
Improving miRNA target predictions by integrating gene expression
Abstract: Background: MicroRNAs (miRNAs) are widely recognized as important post-transcriptional regulators of gene expression. However, the prediction of their target genes remains a difficult problem. Current computational target prediction approaches rely on features such as seed region complementarity, binding site conservation and binding site accessibility. However, such approaches tend to produce many false positive predictions.
Results: Here, we propose a target prediction methodology which incorporates gene expression data of candidate target genes in a cell or tissue type of interest. We applied our approach to human serum miRNAs and their target genes predicted by selected commonly used target prediction algorithms. Integrating these data with gene expression levels of immune cells in resting and stimulated states, we filtered out biologically irrelevant target genes.
Conclusion: We believe that our approach can help us to better understand the functional mechanisms of serum miRNAs and their putative target genes in the development of immune cells and immune response.
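The filtering step in the Results can be sketched as follows. This is only an illustration under an assumed rule: a predicted target is kept if its gene reaches a minimum expression level in at least one cell state. The abstract does not give the actual criterion, and the function name, threshold, and all identifiers in the toy data are hypothetical.

```python
def filter_targets(predicted_targets, expression, min_expression=1.0):
    """Keep (miRNA, gene) pairs whose gene is expressed in at least one condition.

    predicted_targets: list of (miRNA, gene) pairs from any prediction algorithm.
    expression: {gene: {condition: expression level}} for the cell type of interest.
    min_expression: assumed cut-off; genes absent from `expression` are dropped.
    """
    kept = []
    for mirna, gene in predicted_targets:
        levels = expression.get(gene, {})
        if any(level >= min_expression for level in levels.values()):
            kept.append((mirna, gene))
    return kept

# Hypothetical toy input (not data from the poster).
predicted_targets = [("miR-A", "GENE1"), ("miR-A", "GENE2"), ("miR-B", "GENE3")]
expression = {
    "GENE1": {"resting": 0.2, "stimulated": 5.1},  # induced on stimulation
    "GENE2": {"resting": 0.0, "stimulated": 0.1},  # not expressed in either state
}

relevant = filter_targets(predicted_targets, expression)
print(relevant)
```

A stricter variant could require expression in both states, or a change between resting and stimulated states; which rule suits a given study is a design decision.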
|72||Trupti Joshi, Shiyuan Chen, Jiaojiao Wang, Gary Stacey, Henry Nguyen and Dong Xu.
Soybean Knowledge Base (SoyKB): Bridging the gap between soybean translational genomics and breeding
Abstract: Many genome-scale datasets are available for soybean, including genomic sequence, transcriptomics (microarray, RNA-seq), proteomics and metabolomics datasets, together with growing knowledge of soybean genes, microRNAs, pathways, and phenotypes. This rich information can provide valuable insights if mined in an innovative and integrative manner, hence the need for informatics resources to achieve that.
To this end, we have developed the Soybean Knowledge Base (SoyKB), a comprehensive, all-inclusive web resource for soybean translational genomics and breeding. SoyKB handles the management and integration of soybean genomics and multi-omics data along with gene function annotations, biological pathway and trait information. It has many useful tools, including Affymetrix probeID search, gene family search, multiple gene/metabolite analysis, a motif analysis tool, a protein 3D structure viewer and download/upload capacity for experimental data and annotations. It has a user-friendly web interface together with a genome browser and pathway viewer, which display data in an intuitive manner to soybean researchers, breeders and consumers.
SoyKB has new tools for soybean breeding, including a graphical chromosome visualizer designed for ease of navigation by breeders. It integrates QTLs, traits and germplasm information along with genomic variation data such as single nucleotide polymorphisms (SNPs) and genome-wide association studies (GWAS) data from multiple genotypes, cultivars and G. soja. QTLs for multiple traits can be queried and visualized in the chromosome visualizer simultaneously and overlaid on top of the genes and other molecular markers, as well as multi-omics experimental data, for meaningful inferences. SoyKB can be publicly accessed at http://soykb.org.
|73||Anton Kratz, Pascal Beguin, Megumi Kaneko, Takahiko Chimura, Ana Maria Suzuki, Atsuko Matsunaga, Sachiko Kato, Nicolas Bertin, Timo Lassmann, Rejan Vigot, Piero Carninci, Charles Plessy and Thomas Launey.
Digital expression profiling of the compartmentalized translatome of Purkinje neurons
Abstract: Underlying the complexity of the mammalian brain is its network of neuronal connections, but also the molecular networks of signaling pathways, protein interactions, and regulated gene expression within each individual neuron. The diversity and complexity of the spatially intermingled neurons pose a serious challenge to the identification and quantification of single neuron components. To address this challenge, we present a novel approach for the study of the ribosome-associated transcriptome – the translatome – from selected subcellular domains of specific neurons, and apply it to the Purkinje cells (PCs) in the rat cerebellum. We combined microdissection, translating ribosome affinity purification (TRAP) in nontransgenic animals, and quantitative nanoCAGE sequencing to obtain a snapshot of RNAs bound to cytoplasmic or rough endoplasmic reticulum (rER)-associated ribosomes in the PC and its dendrites. This allowed us to discover novel markers of PCs, to determine structural aspects of genes, to find hitherto uncharacterized transcripts, and to quantify biophysically relevant genes of membrane proteins controlling ion homeostasis and neuronal electrical activities.
|74||Jednipit Borthong, Ryo Nakao, Yongjin Qiu, Aiko Ohnuma, Manabu Igarashi, Chihiro Sugimoto, Orasa Suthienkul and Kimihito Ito.
Determination of bacterial population in aquatic samples using bioinformatics analysis of 16S rRNA fragments
Abstract: Bacteria are abundant microorganisms in aquatic environments and play significant roles in element cycles and ecological systems. Most of them truly inhabit aquatic environments, but only some species can be successfully cultured by conventional methods. Because of this limitation, it is difficult to use culture-based methods for comprehensive determination of all bacteria in a sample. Over the last decades, advances in sequencing technologies have revealed bacterial gene sequences, especially that of 16S rRNA. This gene contains highly variable regions, has been used to identify bacterial species, and has become an important gene for hierarchical classification in metagenomic strategies. In order to comprehensively study the bacterial population in aquatic environments, we used a metagenomic strategy based on 454 pyrosequencing to analyze the bacterial population in ten aquatic samples collected from the Hokkaido University campus. The genetic material was extracted and the 16S rRNA region was then amplified. The reads were subsequently submitted to the Ribosomal Database Project (RDP) Classifier and analyzed with an 80% similarity threshold and the recommended parameters for hierarchical classification. Based on the 16S rRNA alignment, the samples could be divided into three groups according to their overall bacterial populations. In addition, sequences of bacteria related to human infectious diseases were also detected, including Acinetobacter, Aeromonas, Arcobacter, Bacillus, Brevundimonas, Campylobacter, Chitinophaga, Chryseobacterium, Clostridium, Corynebacterium, Enterococcus, Erwinia, Helicobacter, Legionella, Mycobacterium, Pseudomonas, Serratia, Sphingomonas, Staphylococcus, Stenotrophomonas, and Streptococcus. In conclusion, 16S rRNA information is a powerful means of determining bacterial populations. Moreover, it can simultaneously be used to detect pathogenic bacteria in aquatic samples through a metagenomic strategy.
|75||Midori Iida, Shunpei Tateno, Yuki Yamaguchi and Satoshi Fujii.
Significance Analysis of shRNA-seq Applied to the Pooled Lentiviral shRNA Expression Libraries
Abstract: Ribonucleic acid interference (RNAi) screening has become an indispensable genetic research tool, allowing determination of phenotypic effects after silencing entire suites of genes. As the catalog of fully sequenced genomes and transcriptomes grows, production of small interfering/short-hairpin RNA (siRNA/shRNA) libraries that are designed to target complete genomes is achievable, allowing high-throughput “genome-wide” RNAi screening. One of the main objectives in the analysis of RNAi screening is the identification of genes that are differentially expressed under two experimental conditions. For broad genome-wide coverage, there are at least 5-6 shRNAs targeting each transcript in Cellecta’s genome-wide pooled lentiviral bar-coded shRNA libraries. However, the efficiency of the shRNAs against their target is non-uniform. Therefore, we applied the rank product technique to Cellecta’s genome-wide pooled lentiviral bar-coded shRNA library screening data to determine the differentially expressed genes (NRP.geo). NRP.geo is based on calculating rank products (RP) for each probe and averaging the RP for each gene from replicate experiments. At the same time, it provides a straightforward and statistically stringent way to determine the significance level for each gene and allows for flexible control of the false-discovery rate and familywise error rate in the multiple testing situation of an shRNA-seq experiment. We applied the NRP.geo technique to biological data sets and demonstrated that it performs more reliably than average fold change.
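The two steps named in the abstract, a rank product per probe followed by averaging per gene, can be sketched as below. This is not the NRP.geo implementation: it assumes the rank product is the geometric mean of a probe's ranks across replicates, ranks the most negative log fold change first, and omits the significance estimation entirely. The probe names, gene names and values are hypothetical.

```python
from collections import defaultdict
from math import prod

def rank_products(log_fc):
    """Geometric mean of per-replicate ranks for each probe.

    log_fc: {probe: [log fold change in replicate 1, replicate 2, ...]}.
    Rank 1 is the most negative value in a replicate (most depleted shRNA).
    Ties are broken by input order, which is an arbitrary simplification.
    """
    probes = list(log_fc)
    n_reps = len(next(iter(log_fc.values())))
    ranks = {p: [] for p in probes}
    for r in range(n_reps):
        ordered = sorted(probes, key=lambda p: log_fc[p][r])
        for rank, p in enumerate(ordered, start=1):
            ranks[p].append(rank)
    return {p: prod(ranks[p]) ** (1.0 / n_reps) for p in probes}

def gene_scores(rp, probe_to_gene):
    """Average the probe-level rank products of all shRNAs targeting each gene."""
    grouped = defaultdict(list)
    for probe, value in rp.items():
        grouped[probe_to_gene[probe]].append(value)
    return {gene: sum(vals) / len(vals) for gene, vals in grouped.items()}

# Hypothetical toy screen: four shRNAs, two replicates.
log_fc = {
    "sh1": [-2.0, -1.5],
    "sh2": [-1.0, -2.5],
    "sh3": [0.5, 0.2],
    "sh4": [1.0, 0.8],
}
probe_to_gene = {"sh1": "GENE_A", "sh2": "GENE_A", "sh3": "GENE_B", "sh4": "GENE_B"}

rp = rank_products(log_fc)
scores = gene_scores(rp, probe_to_gene)
print(scores)
```

A small gene score means that the gene's shRNAs rank consistently near the top across replicates. In practice a significance level would be attached to each score, for example by permuting the values within replicates, which this sketch leaves out.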
|76||Takaya Saito and Marc Rehmsmeier.
A fast microRNA target prediction tool that provides a single-entry interface to multiple algorithms
Abstract: MicroRNAs (miRNAs) are a class of small non-coding RNAs that post-transcriptionally repress the expression of protein-coding mRNAs. Finding effective miRNA targets – mRNAs with their protein expression repressed by miRNA binding – has been one of the most challenging problems in bioinformatics. In recent years, tools that combine several algorithms to predict miRNA targets have gained popularity over tools that rely on a single algorithm. Nonetheless, the majority of such tools are web-based and provide pre-calculated predictions. Here, we have re-implemented six popular miRNA target prediction algorithms in C++ using template libraries. Each re-implemented algorithm is 3-40 times faster than the original. The main goal of our tool is to provide a single-entry interface to multiple algorithms with fast computation time.
|77||Takashi Matsuda, Michiaki Hamada and Kiyoshi Asai.
Prediction of joint RNA secondary structure by using their homologous sequence information
Abstract: RNA-RNA interactions contribute to the specific functions of siRNAs, snoRNAs, piRNAs and other ncRNAs. Because a joint secondary structure between two RNA sequences is useful for clarifying the molecular mechanisms of RNAs, improving the accuracy of joint secondary structure prediction is important. In conventional RNA secondary structure prediction from a single sequence, accuracy was improved by utilizing information on the structure conserved between a target RNA sequence and its homologous sequences. Motivated by this, in this study we develop a method to predict the joint secondary structure of two target RNA sequences by utilizing information from their homologous sequences. Our method is an extension of RactIP, a program for predicting the joint secondary structure of two RNA sequences, and is based on stochastic models that consider the conserved interaction structure of RNA joint secondary structures and alignments.
RactIP: fast and accurate prediction of RNA-RNA interaction using integer programming. Yuki Kato et al. Bioinformatics (2010).
Predictions of RNA secondary structure by combining homologous sequence information. Michiaki Hamada et al. Bioinformatics (2009).
|78||Yu-Shu Lo, Chun-Yu Lin and Jinn-Moon Yang.
The origin of species as revealed by molecular interfaces and interactomes
Abstract: A major goal of evolutionary biology is to understand the mechanisms by which novel phenotypes originate and diversify. Some works have proposed the concept of “evolutionary novelties”, which highlights the genetic changes involved in evolutionary processes. However, genetic changes may not lead to species evolution, because proteins produced by the changed genes may not affect molecular interfaces, which are essential for carrying out biological functions. Therefore, we will utilize molecular interactions and interactomes to study the linkage between evolutionary novelties and genetic changes. Currently, scientists often identify genes to study evolutionary novelties without a detailed mechanism. Here, we will use molecular interactions and interactomes to close this gap by the following steps: 1) Identify genetic changes inducing functional diversity via molecular interfaces; 2) Analyze functional modules by molecular interfaces across multiple species; 3) Derive interactome evolution and functions based on conserved and specific modules; 4) Provide insights into the mechanisms linking interactomes and evolutionary novelties across multiple species. We believe that this work, based on molecular interfaces and module families, will help in understanding the essential elements of life and the basic mechanism by which a new species arises.
|79||Nagarajan Raju and Micheal Gromiha M.
Structure and function based selection of best predictors for identifying the binding sites in RNA binding proteins
Abstract: Protein-RNA complexes play key roles in several cellular processes through the interactions of amino acids with RNA. To understand the recognition mechanism, it is important to identify the specific amino acids involved in RNA binding. Various computational methods have been developed for predicting RNA-binding residues from protein sequence. However, their performance depends mainly on the training dataset, the feature selection used for developing the model and the learning capacity of the model. Hence, it is important to reveal the correspondence between the performance of the methods and the properties of RNA-binding proteins (RBPs). In this work, we have collected all available methods for predicting RNA-binding residues and evaluated their performance on unbiased, stringent and diverse datasets of RBPs with less than 25% sequence identity, grouped by structural class, fold, superfamily, family, protein function, RNA type, RNA strand and RNA conformation. We identified the best methods for each type of RBP, as well as the types of RBPs for which prediction requires further refinement. We also analyzed the performance of these methods on disordered regions, on structures not included in the training datasets and on recently solved structures. The reliability of prediction is better than that of randomly choosing any method or combination of methods. This approach would be a valuable resource for biologists to choose the best method based on the type of RBP when designing their experiments, and the tool is freely accessible online at www.iitm.ac.in/bioinfo/RNA-protein/.
|80||Kenichiro Imai, Yoshinori Fukasawa, Kentaro Tomii and Paul Horton.
Human mitochondrial proteome analysis by novel mitochondrial targeting signal prediction
Abstract: Mitochondria provide numerous essential functions for cells, and their dysfunction leads to very diverse diseases. Thus, obtaining a complete mitochondrial proteome should be a crucial step towards understanding the roles of mitochondria. Many mitochondrial proteins have been identified experimentally, but a complete list is not available. According to recent estimates, animal mitochondria contain about 1,500 different genes, and 50-70% of them are translocated by N-terminal cleavable targeting signals (presequences).
We recently developed a novel predictor of mitochondrial presequences named MitoFates, which is the first major advance in the last decade. We used MitoFates to look for undiscovered mitochondrial proteins among 42,217 human proteins (including isoforms such as alternative splice or translation initiation variants). MitoFates predicts 1167 genes to have at least one isoform with a presequence, and 580 of these genes were annotated as “mitochondria” in neither UniProt nor Gene Ontology. Interestingly, these predictions include 42 candidate regulators of parkin translocation to damaged mitochondria, and also many genes with known disease mutations. This suggests that careful investigation of MitoFates predictions will be helpful in elucidating the functions of mitochondria in health and disease. In addition, we performed presequence prediction on about 400 other available eukaryotic proteomes. Although the presequence pathway is a common and conserved import system from fungi to metazoa and plants, some organisms show an apparently low fraction of predicted presequences in their proteomes. We will also discuss this decrease of the presequence fraction in the evolutionary history of eukaryotic cells.
|81||Masahito Ohue, Takehiro Shimoda, Shuji Suzuki, Yuri Matsuzaki, Takashi Ishida and Yutaka Akiyama.
MEGADOCK 4.0: an ultra-high-performance protein-protein docking software for heterogeneous supercomputers
Abstract: The application of protein-protein docking in large-scale interactome analysis is a major challenge in structural bioinformatics and requires huge computing resources. In this work, we present MEGADOCK 4.0, an FFT-based docking software that makes extensive use of recent heterogeneous supercomputers and shows powerful, scalable performance of >97% strong scaling. MEGADOCK 4.0 is written in C++ with OpenMPI and NVIDIA CUDA 5.0 (or later) and is freely available to all academic and non-profit users at: http://www.bi.cs.titech.ac.jp/megadock.
|82||Shah Md. Shahik and Md. Saiful Islam.
A systematic Study on Structure and function of ATPase of Wuchereria bancrofti
Abstract: Background: Analyzing the structures and functions of the different proteins of Wuchereria bancrofti is very important because, to date, no effective drug or vaccine has been discovered to treat lymphatic filariasis. ATPase is one of the most important proteins of Wuchereria bancrofti. ATPase enzymes convert adenosine triphosphate (ATP) into adenosine diphosphate (ADP) and a free phosphate ion. The energy released by these dephosphorylation reactions drives other chemical reactions in the cell.
Methods: In this study we worked on the ATPase protein of Wuchereria bancrofti, as annotated in NCBI. Various computational tools and databases were used to determine characteristics of the enzyme such as physicochemical properties, secondary structure, 3D structure, conserved domains, epitopes and their molecular evolutionary relationships.
Result: The subcellular localization of ATPase was identified, and we found that 55.5% are localized in the cytoplasm. The secondary and three-dimensional structures of this protein were also predicted. Both structure and function analysis of the ATPase of Wuchereria bancrofti showed unique non-homologous epitope sites and non-homologous antigenicity sites. Moreover, the analysis identified 15 ligand (drug) binding sites in its tertiary structure.
Conclusion: The structure prediction of these proteins and the detection of binding sites and antigenicity sites in this study indicate a potential target and may aid docking studies for therapeutic design against filariasis.
|83||Makio Shiraishi, Naoaki Ono, Tetsuo Sato, Md. Altaf-Ul-Amin, Tadao Sugiura and Shigehiko Kanaya.
Comparison of Phylogenetic Relationship and Physical Features of Membrane Proteins in the Respiratory Chain of Microorganisms
Abstract: The power plant of a cell is the respiratory chain, a key component of energy metabolism. It has long been studied, but its origin and contribution to the evolutionary history of organisms are still a subject of discussion. Moreover, most of these studies focus on individual species and on a single component of the respiratory chain, such as a protein or a complex; comprehensive research across various microorganisms has been mostly neglected despite its evolutionary importance. Although the diversity of respiratory chains in terms of chemical compounds is quite large, they are restricted by evolutionary and environmental constraints to maintain the efficiency of energy metabolism. This implies that the respiratory chains of a wide range of microorganisms can be evaluated from the viewpoint of their physical features, for example membrane potential, instead of their components. We therefore study this subject comprehensively by comparing reported measurements of the respiratory chain among molecular phylogenetic groups in bacteria and archaea. First, we gathered from the literature 216 data points on physical features of the respiratory chain, such as membrane potential, proton potential and proton motive force, together with the environmental conditions (temperature, pH), in 46 species. By analyzing the relations among them, we confirmed that the proton motive force (PMF), which drives ATP synthesis, is maintained over a wide range of environments, namely temperature (20 °C to 75 °C) and pH (1 to 11). Then, we compared the 16S rRNA phylogenetic tree with phylogenetic protein groups defined by the amino acid sequences of each protein in the respiratory chain, using multiple sequence analysis and a clustering method. As a result, we found structural differences between the 16S rRNA tree and some protein groups, such as ATPase subunit beta and NADH dehydrogenase subunit F.
Furthermore, we also found that the physical feature (proton potential) of these protein groups is significantly different from that of the others. This result suggests that the evolution of the enzyme proteins of the respiratory chain is independent of that of rRNA, and that the efficiency of the whole energy metabolism can act as an environmental constraint and affect the evolution of proteins. To clarify these relationships between the physical features and the protein groups in more detail, we are further analyzing their connection with the parameters using principal component analysis (PCA).
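The physical quantities compared in this abstract (membrane potential, pH and temperature) are linked through the standard chemiosmotic expression for the proton motive force. The sketch below evaluates it under one common sign convention, with the membrane potential taken as inside minus outside; the abstract does not state which convention or which formula the authors used, and the example values are hypothetical rather than taken from the 216 collected measurements.

```python
import math

R = 8.314462618   # gas constant, J mol^-1 K^-1
F = 96485.33212   # Faraday constant, C mol^-1

def nernst_factor_mV(temp_c):
    """Z = 2.303 * R * T / F in millivolts per pH unit at temperature temp_c (deg C)."""
    return 1000.0 * math.log(10) * R * (temp_c + 273.15) / F

def proton_motive_force_mV(delta_psi_mV, ph_in, ph_out, temp_c):
    """PMF = delta_psi - Z * (pH_in - pH_out), assuming delta_psi = psi_in - psi_out."""
    return delta_psi_mV - nernst_factor_mV(temp_c) * (ph_in - ph_out)

# Hypothetical neutrophile at 25 deg C: inside-negative potential, slightly alkaline cytoplasm.
pmf = proton_motive_force_mV(delta_psi_mV=-140.0, ph_in=7.6, ph_out=7.0, temp_c=25.0)
print(round(nernst_factor_mV(25.0), 2), round(pmf, 1))
```

Because Z grows with absolute temperature, the same pH gradient contributes more to the PMF at 75 °C than at 20 °C, which is one reason temperature and pH need to be considered together when comparing species.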
|84||Md. Bahadur Badsha and Hiroyuki Kurata.
Dynamic flux balance analysis for CHO cell metabolism
Abstract: Metabolic engineering aims at finding drug targets in metabolic networks and at enhancing the production of useful compounds and proteins. Mathematical modeling and dynamic simulation are important tools for investigating the cellular behavior of biochemical networks and for extracting valuable results from such investigations. Many different methods have been suggested for analyzing metabolic networks, including metabolic control analysis (MCA), metabolic flux analysis (MFA), biochemical systems theory (BST) and flux balance analysis (FBA). Most of these methods need a functional form of kinetics for the cellular reactions. FBA solutions, in contrast, typically indicate an instantaneous change of the metabolic fluxes and require no kinetic information, which is the greatest advantage of FBA. However, FBA predictions may not always be accurate, and FBA provides no information about metabolite concentrations or the dynamic behavior of metabolic fluxes. Therefore, it is very important to study the system and characterize the function of metabolic networks with respect to time. Dynamic FBA (dFBA) is now used to build dynamic models that combine FBA with kinetic equations for substrate uptake and product secretion. To rationally design metabolic networks, kinetics can be integrated into metabolic network models, and dFBA achieves this integration effectively by solving an optimization problem. Chinese hamster ovary (CHO) cells are extensively used in biological and medical research and commercially in the production of recombinant protein therapeutics, and they have become the standard industry host. CHO cells are widely used to produce antibody drugs, which are very popular as anticancer drugs. It would be an enormous benefit for a drug company to reduce the manufacturing cost by even a few percent.
A novel framework has been studied and proposed for simulating and analyzing the dynamic behavior of the metabolic and biosynthetic pathways in a kinetic model of CHO cell metabolism in fed-batch culture. However, modeling the complete cell metabolism greatly increases the challenges for the researcher, and it is very difficult to comprehend and interpret the results of all extracellular reactions whose concentrations depend on the rate equations. In the present study, we propose a dynamic model for CHO cells in which the kinetic model is replaced by dFBA. The fed-batch cultivation was simulated, and the extracellular and intracellular metabolite concentrations were compared with experimental data. The comparison indicates that the proposed method can precisely predict the flux distributions, antibody production and cell growth, and reproduce the experimental data.
|85||Donghan Li, Naoaki Ono, Tetsuo Sato, Tadao Sugiura, Md. Altaf-Ul-Amin, Masanori Arita, Ken Tanaka, Zhiqiang Ma and Shigehiko Kanaya.
Targeted Integration Between RNA-Seq and Metabolomic Data to Elucidate Curcuminoid Biosynthesis Flux in Four Curcuma Species
Abstract: Turmeric, a rhizomatous herbaceous perennial plant of the ginger family, has been used as a spice and herbal medicine in many Asian countries for centuries. Curcuminoids (namely curcumin and its analogs) are the primary active constituents accumulated in the rhizome of turmeric. The increasing demand for natural products as additives for functional foods and beverages makes turmeric an ideal candidate, and global consumption of this spice has been increasing. It has also been found to be of great medicinal value (e.g. anticancer, antiphlogistic and antioxidant properties) and has been increasingly utilised recently. Although curcumin has great medical and pharmaceutical potential, its production currently depends largely on natural plant growth, and the mechanism of curcuminoid synthesis is not fully understood. Here we compared two wild strains and two cultivars to understand differences in the synthesis of curcuminoids by analysing metabolite concentrations using gas chromatography-mass spectrometry and the expression of genes in the curcumin biosynthesis pathway using next-generation sequencers. We developed a method using the RNA-seq analysis approach that focused on a specific set of genes to detect expression differences between samples in detail. Using this approach, we found that the differences in curcuminoid content among the species were consistent with the changes in the expression of genes encoding diketide-CoA synthase and the curcumin synthase enzyme at the branching point of curcuminoid biosynthesis.
|86||Yuki Otana, Shigehiko Kanaya, Tsuyoshi Shirai, Takaaki Nishioka, Md Altaf-Ul-Amin, Tadao Sugiura, Naoaki Ono, Tetsuo Sato, Ming Huang, Tetsuo Katsuragi and Yukiko Nakamura.
Clustering of 3D structure similarity based network of secondary metabolites to reveal their relationships with biological activities
Abstract: A database describing the relationships between species and their metabolites would be highly useful for metabolomics research, because the field targets the systematic analysis of an enormous number of organic compounds with known or unknown structures. We constructed a species-metabolite DB, the KNApSAcK Core DB, which contains 101,500 species-metabolite relationships encompassing 20,741 species and 50,048 metabolites. To systematize the relations between secondary metabolites and biological activities and to facilitate a comprehensive understanding of the relationships between the metabolites of organisms and the chemical-level contribution of metabolites to biological activity, we constructed a metabolite activity DB known as the KNApSAcK Metabolite Activity DB. It comprises 9,584 triplet relationships (metabolite-biological activity-target species), including 2,356 metabolites, 140 activity categories, 2,963 specific descriptions of biological activities and 778 target species. In a previous study, we obtained evidence that approximately 46% of the activities described in the DB are related to chemical ecology, most of which are attributed to antimicrobial agents and plant growth regulators. Over half of the DB contents are related to human health care and medicine. The five largest groups are toxins, anticancer agents, nervous system agents, cardiovascular agents and non-therapeutic agents, such as flavors and fragrances. The KNApSAcK Metabolite Activity DB is integrated within the KNApSAcK Family DBs to facilitate further systematized research in various omics fields, especially metabolomics, nutrigenomics and foodomics. The KNApSAcK Metabolite Activity DB could also be utilized for developing novel drugs and materials, as well as for identifying viable drug resources and other useful compounds. Using the KNApSAcK Metabolite Activity DB, we examined relationships between 3D structure and biological activity.
Initially, we examined 3D similarities between 2072 secondary metabolites using COMPLIG, a fast heuristic graph-matching algorithm developed by Shirai et al. (J. Mol. Biol., 424, 379-390, 2012). The similarity threshold was tentatively set to 0.80. The pairs of secondary metabolites meeting this threshold (50,228 pairs) were represented as a network. We then applied the graph clustering algorithm DPClusO (ISRN Biomathematics, Vol 2012, Article ID 726329), which was developed for overlapping graph clustering using concepts similar to those of the DPClus algorithm (BMC Bioinformatics, 207, 2006), and generated 671 densely connected clusters of secondary metabolites. Here, two parameters of DPClusO, cp and density, were tentatively set to 0.5 and 0.9, respectively. DPClusO can generate clusters characterized by high density and identified by periphery. Secondly, we examined the relationship between clusters and biological activities based on a chi-squared test controlled by the false discovery rate (FDR). As a result, we obtained 501 cluster-biological activity pairs. These pairs are statistically characterized by p-values between 0 and 0.001, corresponding to an FDR of 0.007. Interestingly, the present analysis reveals a comprehensive structure-activity relation in which antimicrobial activities are highly correlated with flavonoids and phenylpropanoids. Relations between metabolite structure and biological activity are discussed based on those statistically significant cluster-activity pairs. The KNApSAcK Metabolite Activity DB is integrated within the KNApSAcK Family DBs to facilitate further systematic research in various fields, especially metabolomics, nutrigenomics and foodomics. The resources of the KNApSAcK Metabolite Activity DB can also be utilized for developing novel drugs and materials.
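The cluster-activity test described above can be sketched as follows. The sketch assumes a 2x2 contingency table per cluster-activity pair (in cluster or not, has the activity or not), a Pearson chi-squared test without continuity correction, and Benjamini-Hochberg adjustment as the FDR procedure. The abstract names only "chi-squared test" and "FDR", so these specifics are assumptions, and the counts and p-values are hypothetical.

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic and p-value (1 degree of freedom) for [[a, b], [c, d]].

    a: in cluster with activity      b: in cluster without activity
    c: outside cluster with activity d: outside cluster without activity
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0, 1.0  # degenerate table: no evidence of association
    stat = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values (q-values) in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):          # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values

# Hypothetical counts for one cluster-activity pair.
stat, p = chi2_2x2(30, 10, 10, 30)
# Hypothetical p-values for four cluster-activity pairs.
q = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
print(stat, p, q)
```

Pairs whose adjusted value falls below a chosen FDR level would be reported as significant cluster-activity relations. For tables with small expected counts, an exact test is usually preferred over the chi-squared approximation.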
|87||Riyanto Heru Nugroho, Katsunori Yoshikawa and Hiroshi Shimizu.
Metabolic profiling of proline contribution for improving cell density in Saccharomyces cerevisiae
Abstract: Proline is an amino acid that also contributes to increasing cell tolerance against stresses such as osmotic pressure, freezing and desiccation in Saccharomyces cerevisiae. We observed an effect of proline on the cell growth of S. cerevisiae: the addition of proline increased the maximum cell density without apparent consumption as a carbon or nitrogen source. Even under acid stress caused by the presence of lactic acid, the addition of proline improved the specific growth rate and maximum cell density. In this study, metabolome analysis was performed to examine how proline addition affects cell metabolism and improves maximum cell density.
The wild-type strain Saccharomyces cerevisiae BY4739 (MATα leu2∆0 lys2∆0 ura3∆0) was cultured in minimal medium containing 2% glucose, 0.67% yeast nitrogen base (YNB) without amino acids, 0.0076% L-leucine and 0.038% L-lysine monochloride, with the addition of 0.9% lactic acid and/or 0.16% L-proline. Metabolome analysis was performed using capillary electrophoresis time-of-flight mass spectrometry (CE-TOFMS). Samples for metabolome analysis were taken during the mid-log growth phase (OD660 ≈ 0.5).
To confirm the increase in maximum cell density upon proline addition, various concentrations of proline (0.05, 0.1 and 0.16%) were added to S. cerevisiae cultures; the maximum densities increased in proportion to the amount of proline added. The extracellular proline concentration did not change in any case. To understand the effect of proline addition on yeast metabolism, metabolome analysis was conducted using CE-TOFMS on four culture conditions: control culture, lactic-acid-added culture, proline-added culture, and culture with both lactic acid and proline added. The metabolome data indicated that the intracellular concentrations of most metabolites of central carbon metabolism were increased, and those of amino acids were decreased, in the proline-added culture. These results indicate that proline addition caused drastic changes in metabolism. Further analysis is being performed to understand the increase in cell density upon addition of proline.
|88||Mnv Prasad Gajula, Anuj Kumar, Ak Polimetla and H-J Steinhoff.
A computational study on displacement of the tyrosyl radical RNR enzyme
Abstract: Amino-acid radicals are involved in the catalytic cycles of a number of enzymes. Advanced spectroscopic and structural studies have been performed for several years to investigate the origin of these protein-based radicals. TYR122* is a stable radical that is generated in the R2 subunit of ribonucleotide reductase (RNR), the enzyme responsible for the synthesis of deoxyribonucleotides. EPR experiments performed by Lendzian et al. have shown the orientation of the tyrosyl radical in the active state of E. coli RNR. The different g-tensor values obtained indicate a displacement of the tyrosyl radical in the vicinity of the di-iron center, but they failed to identify its exact location and orientation. We applied molecular dynamics (MD) simulations and further in silico mathematical calculations to analyze and interpret the experimental data. The results helped to identify the location and orientation of the tyrosyl radical, clarifying the experimental results. For the first time, the g-factor-based orientations obtained from EPR experiments were compared directly with those from MD simulations. This kind of methodology offers comprehensive knowledge of the structural dynamics of biomolecules.
09-Glycan and Compounds
|89||Yuki Ushioda, Shinichiro Tsuchiya and Kiyoko Aoki-Kinoshita.
Modification of the RINGS Glycan Structure Conversion Tool for Utilization in the International Glycan Structure Repository
Abstract: Several glycan structure databases have been developed during the last ten years, and as a result, many glycan structure formats have been developed in parallel. Although several attempts to integrate these data have been made, integration has been difficult, mainly because of the complexity of glycan structures and because no glycan structure repository has existed to assign accession numbers to glycans.
Just as there is the Protein Data Bank (PDB) for protein structures and GenBank for genes, an international glycan structure repository called “GlyTouCan” has been developed by glycoscientists in Japan and the US. GlyTouCan allows users to register glycan structures and assigns unique IDs to each individual glycan structure. With this functionality, when researchers publish papers referencing a GlyTouCan ID, their papers will be annotated by databases which share the same glycan IDs, and therefore they will be linked with the glycan structure in GlyTouCan.
When users register new glycan structures in GlyTouCan, they must use a text format that can accommodate any glycan structure, including ambiguous structures, those with repeating units, and even monosaccharide compositions. Therefore, the WURCS Working Group developed a new format called “WURCS” for GlyTouCan, as the existing glycan formats did not provide all of the required functionality. However, because the majority of researchers worldwide use the GlycoCT format, various glycan formats including GlycoCT must be converted to WURCS.
Meanwhile, our laboratory has been developing RINGS as a resource for glycoinformatics analysis tools and utilities. Among the RINGS tools is a glycan structure conversion tool that converts between various glycan formats, centering on the KEGG Chemical Function format (KCF).
Our work thus aims to incorporate new conversion functions to accommodate WURCS and to extend the conversion pathway of conversion tools in RINGS. Moreover, we aim to integrate this tool into GlyTouCan.
|90||Masaaki Shiota, Michiko Ehara and Kiyoko Aoki-Kinoshita.
Analysis of lectin array data using the Glycan Kernel Tool
Abstract: Glycans are molecules made from monosaccharides that form complex tree structures. Glycans constitute one of the most important protein modifications, and identification of the glycome is a pressing problem in biology. Glycan structure analysis is often based on lectins, which bind specifically to certain glycans. We used lectin microarray data related to precancerous lesions of the tongue, comprising four sample types: normal, dysplasia, CIS (carcinoma in situ), and SCC (squamous cell cancer). However, simple clustering analysis of the microarray data does not clearly distinguish between the sample types because the variation of the array results is large even among samples of the same type.
Therefore, we narrowed down the subject of analysis to lectins whose binding intensity is relatively high and analysed the data using the Glycan Kernel Tool in RINGS (Resource for Informatics of Glycomes at Soka). The Glycan Kernel Tool can be used to extract characteristic glycan substructures by comparing and classifying two glycan structure datasets.
Although these results are preliminary, we were able to apply glycoinformatics techniques to experimental data and extract interesting results that were otherwise difficult to obtain. We currently plan to perform further experiments to verify the validity of these results. If this is successful, new insights into the relationship between glycans and oral carcinoma may be revealed.
|92||Shinichiro Tsuchiya and Kiyoko Aoki-Kinoshita.
Implementation of WURCS converter tool
Abstract: Glycans are related to various important biological functions such as cell-cell communication and signaling, and they are biomolecules that are often attached to lipids and proteins. Many analyses have attempted to elucidate glycan function, and much of the experimental data from these analyses has been published. However, these data are not integrated across the databases in which they are registered.
Various glycan databases have been developed in different countries, and they require glycans to be input in a text format such as GlycoCT. The GlycoCT format, proposed in the EuroCarbDB project, can correctly represent complicated glycan structures. This text format is the one primarily used in glycomics; however, GlycoCT cannot represent some monosaccharides that are found in bacteria.
Recently, an international glycan repository has been developed to assign unique IDs to all glycans, which in turn enables the integration of glycan data across databases. Data integration is accomplished by utilizing Semantic Web technologies, which require glycan structures to be uniquely written as a string. However, because glycans are very diverse due to their branched structures, repeating subunits and various molecular modifications, it is difficult to represent glycans uniquely and accurately as a string. Therefore, we proposed a new glycan structure representation called Web3 Unique Representation of Carbohydrate Structures (WURCS), which is able to represent any glycan structure as a string. Our work aims to develop a WURCS conversion tool, which will be used in the glycan repository web interface to search for glycan information.
|93||Risa Sekimoto, Yushi Takahashi and Kiyoko Aoki-Kinoshita.
Enhancements of GlycomeAtlas: incorporation of glycan metadata
Abstract: It is known that glycans have a key role in many biological functions such as cell-cell adhesion and various recognition processes. These occur due to the large conformational diversity of glycan structures. Because of this, glycomics technologies such as mass spectrometry are often used to understand the glycome and its diversity. Consequently, many researchers have reported the glycan structures found in specific tissues and organisms. These datasets are called glycan profiles, and they are obtained from matrix-assisted laser desorption/ionization mass spectrometry (MALDI-MS) analysis. A major provider of such data is the Consortium for Functional Glycomics (CFG).
We have been developing GlycomeAtlas on RINGS to visualize glycan structures and their localizations. This tool allows users to see the localization of each glycan structure by clicking organs or selecting glycan structures from the list. Moreover, users can search the distributions of glycan structures across human and mouse. When users select the glycan from the detail list, the display for glycan information shows the glycan structure in detail including their glycosidic linkage conformations. However, users are not able to see its experimental background from the GlycomeAtlas display. Therefore, we further modified GlycomeAtlas to provide the metadata for each glycan to support users in gaining a better understanding of glycan structures.
IDs are assigned to each glycan in CFG, and because the ID is a part of the URL of the glycan detail pages of CFG, we can access the metadata for any glycan using this ID. Thus, it is possible to display the page by specifying the ID. First, we collected the IDs assigned to each glycan from the experimental data of CFG, and we correlated them with each glycan structure in Linear Code format included in the existing database (MySQL). At the same time, in order to obtain information correctly from the database, we modified the JSP (JavaServer Pages) server program to obtain the CFG IDs. In addition, we made changes to the user interface to display the CFG web pages with Adobe Flash. For the experimental data, we have created pages in HTML.
By clicking on the display for the glycan information of the tool, it is now possible to display a new window for the web page of CFG corresponding to the particular glycan. In this page, in addition to the glycan profiling data, there are links to other databases referring to the glycan and references. Thus, it is now possible to check the details of each glycan registered in GlycomeAtlas. Moreover, localization and structures of glycan of the data can be visualized in this tool by inputting text files containing species, organs and glycan structure information in Linear Code format.
For future work, in order to enhance GlycomeAtlas, we will consider methods for the tool to utilize Semantic Web technologies to better share the data with related databases and to add glycan profile visualizations of other organisms. Therefore, we will continue to develop GlycomeAtlas so that it can truly become an atlas for glycomes.
Properties of Chemical Spaces about Natural and Commercial Compounds
Abstract: “Chemical space” is an important concept for characterizing datasets of compounds. This study examines the differences and similarities between natural and commercial compounds. The selection of datasets for comparison is essential because a biased dataset leads to misleading results. In this study, we constructed our dataset based on the book “ROMPP Encyclopedia of Natural Products (ROMPP)”, a dictionary of natural compounds. This book was chosen because its contents are not specialized to any particular group of compounds. To compare compound properties, the commercially available compounds in the ZINC database [1] were also analyzed.
Many descriptors of molecular properties, shape, and energies based on the 3D structures were calculated with MOE 2011.10 [2] to determine similarities and differences between the ZINC and ROMPP datasets. The average values of many descriptors (e.g., weight, ASA, logP(o/w), logS, and others) were similar. However, their standard deviations were larger for ROMPP than for ZINC. This suggests that natural products are much more diverse than commercial compounds in their molecular properties. When the distributions of individual descriptors were compared, the distribution of rsynth (synthesis plausibility score) was significantly different. The rsynth values of the ZINC set showed a Gaussian-type distribution with a peak around 0.8. In contrast, the values of the ROMPP set did not form a Gaussian shape: 22% of the data were concentrated in the range 0.1 to 0.2, and the rest were distributed evenly from 0.2 to 0.8. About 20% of ROMPP entries had rsynth values of 0.6 to 1.0 (the range in which most ZINC entries were distributed). We also detected duplicated structures between ZINC and ROMPP. The overlap is approximately 1,300 entries (around 20% of ROMPP entries). We consider that the structures with rsynth values of 0.6 to 1.0 may be these entries. Other results will also be discussed on the poster.
1. Irwin JJ et al. J. Chem. Inf. Model., 52(7), 1757-1768, 2012, http://zinc.docking.org/.
2. MOE 2011.10, Chemical Computing Group, http://www.chemcomp.com/.
Statistical stage transition detection method for small sample gene expression time series data
Abstract: In terms of their internal (genetic) and external (phenotypic) states, living cells are always changing at varying rates. Periods of stability or low rates of change are often called States, Stages, or Phases, whereas high-rate periods are called Transitions or Transients. States and transitions that are observed phenotypically, such as those in cell differentiation or cancer progression, are related to gene expression levels. On the other hand, stages of gene expression are definable based on changes of expression levels. Analyzing relations between state changes of phenotypes and stage transitions of gene expression levels is a general approach to elucidating the mechanisms of life phenomena.
Herein, we propose an algorithm to detect stage transitions in a time series of expression levels of a gene by defining statistically optimal division points. The algorithm was able to detect stage transitions in simulated datasets. An annotation-based analysis of the detection results for a dataset of the initial development of Caenorhabditis elegans agrees with findings presented in the literature.
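The abstract does not state which statistical criterion defines an "optimal" division point. As a minimal sketch of the general idea, assuming a least-squares criterion (each stage is summarized by its mean, and the division points minimize the total within-stage squared error), a dynamic-programming search could look as follows. The function names `segment_cost` and `optimal_divisions` are hypothetical and are not taken from the poster.

```python
from itertools import accumulate


def segment_cost(prefix, prefix_sq, i, j):
    """Sum of squared deviations from the mean for values[i:j] (j > i)."""
    n = j - i
    s = prefix[j] - prefix[i]
    sq = prefix_sq[j] - prefix_sq[i]
    return sq - s * s / n


def optimal_divisions(values, n_stages):
    """Return the indices at which a new stage starts.

    Assumption: the best division minimizes the total within-stage
    squared error; the poster's actual criterion may differ.
    """
    n = len(values)
    if not 1 <= n_stages <= n:
        raise ValueError("n_stages must be between 1 and len(values)")
    prefix = [0.0] + list(accumulate(values))
    prefix_sq = [0.0] + list(accumulate(v * v for v in values))
    inf = float("inf")
    # best[k][j]: minimal cost of splitting values[:j] into k stages
    best = [[inf] * (n + 1) for _ in range(n_stages + 1)]
    back = [[0] * (n + 1) for _ in range(n_stages + 1)]
    best[0][0] = 0.0
    for k in range(1, n_stages + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                cost = best[k - 1][i] + segment_cost(prefix, prefix_sq, i, j)
                if cost < best[k][j]:
                    best[k][j] = cost
                    back[k][j] = i
    # Walk the back-pointers from the last stage to the first.
    cuts = []
    j = n
    for k in range(n_stages, 1, -1):
        j = back[k][j]
        cuts.append(j)
    return sorted(cuts)


# Toy expression series with three visually distinct stages.
expression = [1.0, 1.1, 0.9, 5.0, 5.2, 4.9, 9.0, 9.1]
print(optimal_divisions(expression, 3))
```

For small-sample time series the cubic cost of this search is negligible; choosing the number of stages would need an additional model-selection step that is not shown here.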
|96||A.B.M. Shamim Ul Hasan and Hiroyuki Kurata.
A Study of Genetic Noise in Biochemical Network
Abstract: Biological systems function successfully in spite of existing in a stochastic environment and in spite of the probabilistic nature of biochemical reactions. Scientists are just beginning to discover the complicated interplay of noise with determinism in systems biology, aided by an increasing number of theoretical, computational, and experimental tools. These tools have proven successful in many strands of biology, including neural and genetic networks. Gene expression is a complex process: many biochemical processes in the cell involve low molecule numbers or rare interactions and consequently give rise to stochastic fluctuations.
The aim of this work is to study the mechanism of genetic noise in biochemical networks using the Gillespie algorithm.
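The abstract does not describe the network that was simulated. As an illustration of the Gillespie algorithm itself, here is a minimal sketch for the simplest gene expression model, assuming constant production of a molecule at rate `k_prod` and first-order degradation at rate `k_deg` per molecule. The model, the parameter names and the function name are assumptions made for this example.

```python
import random


def gillespie_birth_death(k_prod, k_deg, x0, t_end, seed=0):
    """Simulate production (0 -> X) and degradation (X -> 0) exactly.

    Returns the event times and the molecule count after each event.
    """
    rng = random.Random(seed)
    t, x = 0.0, x0
    times, counts = [t], [x]
    while True:
        a_prod = k_prod        # propensity of the production reaction
        a_deg = k_deg * x      # propensity of the degradation reaction
        a_total = a_prod + a_deg
        if a_total == 0:
            break              # nothing can happen any more
        # Waiting time to the next reaction is exponentially distributed.
        t += rng.expovariate(a_total)
        if t > t_end:
            break
        # Choose which reaction fires, in proportion to its propensity.
        if rng.random() * a_total < a_prod:
            x += 1
        else:
            x -= 1
        times.append(t)
        counts.append(x)
    return times, counts


times, counts = gillespie_birth_death(k_prod=10.0, k_deg=1.0, x0=0, t_end=50.0)
print(len(times), "events; final count", counts[-1])
```

With these toy rates the count fluctuates around `k_prod / k_deg`, and the size of those fluctuations relative to the mean is one common measure of noise.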
|97||Yu Matsuoka and Hiroyuki Kurata.
S-system based sensitivity analysis for the central metabolic network in Escherichia coli
Abstract: It is important to promote a quantitative understanding of complex and highly interrelated cellular behavior. This may be achieved with the help of modeling and computer simulation by integrating different levels of the ever-increasing amount of experimental data and biological knowledge. Quantitative models enable us to predict changes in dynamic behaviors. In addition, they enable us to investigate the effects of perturbations on the overall system. It is important to understand such perturbations in response to changes in the cellular internal and external environments. In the present study, we developed a dynamic model that includes transcriptional regulation by regulators such as Crp and Cra as well as the enzymatic reactions of pathways such as glycolysis, the TCA cycle, and the pentose phosphate pathway. S-system based sensitivity analysis was performed to grasp the underlying metabolic phenomena at different dilution rates. MPS (multi-parameter sensitivity) was calculated as the sum of the squared magnitudes of single-parameter sensitivities, to quantify the robustness of the central metabolic pathways. The MPSs of glycolytic fluxes had local minima with respect to dilution rate. Part of the glycolytic pathway becomes most reversible at that dilution rate, because the glycolytic pathways that lead to the TCA cycle are dominant at high dilution rates, while the gluconeogenic pathway is dominant at low dilution rates.
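The MPS definition above (the sum of the squared single-parameter sensitivities) can be illustrated with a small sketch. It assumes that each single-parameter sensitivity is a normalized (relative) sensitivity estimated by a forward finite difference, and it uses a one-reaction Michaelis-Menten flux as a stand-in for the model. The poster's S-system formulation and its E. coli model are not reproduced here, and all function names are hypothetical.

```python
def relative_sensitivities(model, params, rel_step=1e-4):
    """Estimate d ln(output) / d ln(parameter) for each parameter."""
    base = model(params)
    sensitivities = []
    for i, p in enumerate(params):
        perturbed = list(params)
        perturbed[i] = p * (1.0 + rel_step)
        relative_change = (model(perturbed) - base) / base
        sensitivities.append(relative_change / rel_step)
    return sensitivities


def multi_parameter_sensitivity(sensitivities):
    """MPS: sum of the squared single-parameter sensitivities."""
    return sum(s * s for s in sensitivities)


def toy_flux(params):
    """Stand-in model: Michaelis-Menten flux with (vmax, km, substrate)."""
    vmax, km, substrate = params
    return vmax * substrate / (km + substrate)


sens = relative_sensitivities(toy_flux, [2.0, 1.0, 1.0])
print([round(s, 3) for s in sens], round(multi_parameter_sensitivity(sens), 3))
```

A smaller MPS means that the output changes less when all parameters are perturbed, which is why the abstract uses it as a measure of robustness.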
|98||Daisuke Koishi, Cuncun Chen, Noorlin Mohd Ali and Hiroyuki Kurata.
Automatic construction of calculable metabolic networks from public database
Abstract: Major human diseases such as diabetes, obesity, cardiovascular disease and cancer involve failures of human metabolic systems. Metabolic network maps represent complex chains of chemical reactions and are closely associated with genes, proteins and enzymes; consequently, mathematical and/or computational approaches are necessary to integrate them.
At present, large-scale metabolic networks are being constructed. There are many public databases that store network maps for each human organ with different diseases. Constructing a metabolic network requires constant refinement. The networks should be analyzed using elementary mode analysis and flux balance analysis to understand the relationships between network structures and functions, but the existing network map data cannot be used directly for such calculations. One of the major problems is the blocked reaction problem, which hampers computational simulation. In principle, this is caused by dead-end metabolites, that is, metabolites that cannot be created or cannot be consumed. In addition, there are missing metabolites and reactions, which leave gaps that may need to be filled.
To solve the above problems, we have been developing a network conversion algorithm that suggests missing metabolites and reactions. We have extracted information on experimental conditions, reactions, metabolites and compartments from various databases to reconstruct computable large-scale network maps. These data are prepared in the Excel format, which can be used as an input file for many application programs.
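The dead-end problem described above can be made concrete with a small sketch. It assumes a network stored as a dictionary of reactions, each with a stoichiometry (negative coefficients for substrates, positive for products) and a reversibility flag; this data layout and the function name are illustrative assumptions, not the authors' conversion algorithm.

```python
def find_dead_end_metabolites(reactions):
    """Report metabolites that no reaction produces or no reaction consumes.

    reactions: {name: (stoichiometry dict, reversible flag)}
    """
    produced, consumed = set(), set()
    for stoichiometry, reversible in reactions.values():
        for metabolite, coefficient in stoichiometry.items():
            if coefficient == 0:
                continue
            # A reversible reaction can run in both directions.
            if coefficient > 0 or reversible:
                produced.add(metabolite)
            if coefficient < 0 or reversible:
                consumed.add(metabolite)
    everything = produced | consumed
    return {
        "never_produced": sorted(everything - produced),
        "never_consumed": sorted(everything - consumed),
    }


# Toy network: A is taken up, D is exported, C has no consuming reaction.
toy_network = {
    "uptake_A": ({"A": 1}, False),
    "R1": ({"A": -1, "B": 1}, False),
    "R2": ({"B": -1, "C": 1}, False),
    "R3": ({"B": -1, "D": 1}, False),
    "export_D": ({"D": -1}, False),
}
print(find_dead_end_metabolites(toy_network))
```

At steady state a reaction that touches a dead-end metabolite, such as R2 here, cannot carry flux, which is what makes it a blocked reaction. Finding every blocked reaction in a large network needs repeated passes or a flux-based test, which this sketch does not attempt.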
|99||Nelson Kibinge, Naoaki Ono, Masafumi Horie, Ming Huang, Tetsuo Sato, Tadao Sugiura, Md. Altaf-Ul-Amin, Saito Akira and Shigehiko Kanaya.
A systems mapping of transcription regulation in genes and modules of genes in lung cancer pathways
Abstract: Lung cancer is the most frequent type of cancer. In order to characterize this disease fully, current research focuses on understanding it from the perspective of transcription regulation of individual and multi-tiered genes as well as the molecular processes involved. Recent transcriptome profiling technologies have proven superior to older technologies such as microarrays in detecting low-abundance transcripts. In transcription genomics, next-generation sequencing technologies yield large amounts of data, hence the need for computational and statistical strategies for systematic analysis. Mechanisms of transcription regulation, while well studied, have still not conclusively accounted for specific differences in the expression patterns of lung cancer. In addition, computational workflows for examining data from newer expression profiling technologies need to be further assembled and customized to comprehend diseases such as cancer. In the present study, we developed a framework for the analysis of digital gene expression data from CAGE (cap analysis of gene expression) technology. The pipeline incorporates a series of 8 steps: expression quantification, normalization, explorative visualization, regulation pattern identification, differential expression analysis, gene set enrichment analysis, mapping of transcription factor binding sites, and regulation network visualization of the whole lung cancer genome. We tested our method using two expression datasets. One set was mouse lung tissue at different developmental stages, and the second was human lung cell lines comprising 264 normal cell lines and 19 cancer cell lines. From this work we were able to identify significantly expressed genes and 26 out of 202 KEGG pathways significantly regulated in lung cancer. These modules were mainly associated with DNA replication, energy metabolism, cell growth and cell death.
We also mapped binding sites of 128 transcription factors by promoter motif searching throughout the whole genome landscape of lung cancer to examine its transcriptional regulation network. As a whole, we identified vital machinery of transcriptional regulation in the lung.
|100||Shu Tadaka, Takeshi Obayashi and Kengo Kinoshita.
Detection of functional modules in protein networks by near-clique extraction
Abstract: Identification of functional modules in protein-protein interaction networks (PINs) is one of the fundamental steps toward understanding the biological features of PINs. Detection of functional modules in PINs is mainly performed by searching for densely connected sub-networks, and many methods have been proposed. Proteins included in a functional module are categorized into “core proteins”, which perform the central role of the module, and “attachment proteins”, which are used when the module performs specific biological functions. Consideration of the core-attachment structure of functional modules is important for understanding the functions of modules; however, there is no method that clearly elucidates the structure of functional modules.
Here we propose NCMine, a novel network clustering method with visualization of the core-attachment structure of functional modules. It extracts complete-graph-like structures from PINs based on a node-weighting scheme using degree centrality, and reports them as functional modules. We implemented this method as a plugin for Cytoscape, which is widely used to analyze biological networks. The plugin allows users to extract functional modules from PINs and to filter modules of interest using proteins or annotations such as gene ontology. We applied the method to a human PIN and confirmed the core-attachment structure of modules that seem to be related to cancer development. We also report an analysis of functional modules extracted from the human PIN.
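The abstract does not give the details of NCMine's node-weighting scheme, so the following is not that method. It is only a generic sketch of what "near-clique" (complete-graph-like) extraction can mean, assuming a greedy strategy: start from a seed node and keep adding the neighbor that preserves the highest edge density, stopping once the density would fall below a threshold. The function names and the `min_density` parameter are hypothetical.

```python
def density(nodes, adjacency):
    """Fraction of possible edges that are present among the given nodes."""
    n = len(nodes)
    if n < 2:
        return 1.0
    # Each undirected edge is seen from both endpoints, hence the division.
    edges = sum(1 for u in nodes for v in adjacency[u] if v in nodes) / 2
    return edges / (n * (n - 1) / 2)


def grow_near_clique(adjacency, seed, min_density=0.8):
    """Greedily grow a dense module around the seed node."""
    module = {seed}
    while True:
        candidates = set().union(*(adjacency[u] for u in module)) - module
        best, best_density = None, 0.0
        for candidate in sorted(candidates):
            d = density(module | {candidate}, adjacency)
            if d > best_density:
                best, best_density = candidate, d
        if best is None or best_density < min_density:
            return module
        module.add(best)


# Toy network: a, b, c, d form a clique; e hangs off a, and f hangs off e.
toy_pin = {
    "a": {"b", "c", "d", "e"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c"},
    "e": {"a", "f"},
    "f": {"e"},
}
seed = max(sorted(toy_pin), key=lambda node: len(toy_pin[node]))
print(sorted(grow_near_clique(toy_pin, seed)))
```

Seeding from the highest-degree node is one simple way to use degree centrality. In this toy network the clique a, b, c, d is recovered, and e is left out because adding it would drop the density to 0.7.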
Systematic model construction and simulation analysis for regulatory networks in cellular signal transduction systems
Abstract: Some typical cellular signal transduction systems, such as MAPK cascade systems, have been studied intensively for the last several decades to elucidate their various interesting dynamic and control characteristics, including transient responses such as oscillatory behavior, and on-off switch-like responses and bi-stability due to ultra-sensitivity in the stimulus-response curve. On the other hand, it is known that cellular signaling systems exist not as simple MAPK-like cascades, but as complex mutually regulatory networks. Cellular signal transduction systems are composed of enzymatic reaction cascades and are organized as complex reaction networks in which regulation operates through enzymatic activation-inactivation mechanisms such as allosteric reactions or phosphoryl modifications. The primary building block of the network is an enzymatic activation-inactivation cyclic reaction system.
In this study, a method is proposed to analyze the effects of the architectures of complex regulatory networks on their control characteristics, and simulation analysis is performed to obtain preliminary results on the emergence of bi-stability. A procedure is devised for the systematic construction of cellular mutual regulatory networks from the primary building blocks. Each node in the network represents one enzymatic cyclic reaction system. The activated enzyme in a node acts on another node as an activating enzyme or an inactivating enzyme.
A specified node in the network is defined as the output node, in which the activation level of the enzyme is regarded as the output of the network. The procedure starts with the construction of the complete set of feed-forward regulatory networks, in which the nodes other than the output node regulate the output node directly or indirectly. Then, feedback regulations are added to each feed-forward network in a systematic manner. This procedure has two parameters: one is the number of nodes in every network, and the other is the number of nodes regulated by every regulator node. The adjacency matrix is well suited to representing these enzymatic regulatory networks. The values of the matrix components are zero, unity, or negative unity, representing no regulation, activation, or inactivation, respectively.
Given the specific mechanism for the enzymatic cyclic reactions and the mutual regulations, the signal transduction process can be simulated and analyzed systematically for each regulatory network constructed. The required condition for bi-stability is explored for the complete set of regulatory networks of up to 6 nodes constructed by the proposed procedure. It is demonstrated that the simple first-order regulation mechanism yields a single steady state, but that positive cooperativity due to higher-order regulations leads to multiple steady states.
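The signed matrix representation described above can be sketched in a few lines. The example assumes the convention that entry `matrix[i][j]` is the effect of node `i` on node `j` (0 for no regulation, 1 for activation, -1 for inactivation); the abstract does not state which index is the regulator, so this orientation and the function names are assumptions. The check follows the abstract's description of a feed-forward network: there is no feedback loop, and every node other than the output regulates the output directly or indirectly.

```python
def regulated_by(matrix, i):
    """Nodes that node i activates or inactivates."""
    return [j for j, value in enumerate(matrix[i]) if value != 0]


def has_feedback(matrix):
    """True if the regulation graph contains a cycle (including self-loops)."""
    state = [0] * len(matrix)  # 0 = unvisited, 1 = on current path, 2 = done

    def visit(i):
        state[i] = 1
        for j in regulated_by(matrix, i):
            if state[j] == 1 or (state[j] == 0 and visit(j)):
                return True
        state[i] = 2
        return False

    return any(state[i] == 0 and visit(i) for i in range(len(matrix)))


def reaches(matrix, start, target):
    """True if start regulates target directly or indirectly."""
    seen, stack = {start}, [start]
    while stack:
        i = stack.pop()
        if i == target:
            return True
        for j in regulated_by(matrix, i):
            if j not in seen:
                seen.add(j)
                stack.append(j)
    return False


def is_feed_forward(matrix, output):
    """No feedback, and every non-output node reaches the output node."""
    others = [i for i in range(len(matrix)) if i != output]
    return not has_feedback(matrix) and all(
        reaches(matrix, i, output) for i in others
    )


# Node 0 activates nodes 1 and 2; node 1 inactivates node 2 (the output).
feed_forward = [[0, 1, 1], [0, 0, -1], [0, 0, 0]]
# Same network with an added feedback: the output activates node 0.
with_feedback = [[0, 1, 1], [0, 0, -1], [1, 0, 0]]
print(is_feed_forward(feed_forward, 2), is_feed_forward(with_feedback, 2))
```

Enumerating all matrices that pass this check for a given number of nodes would give a set of feed-forward candidates to which feedback entries could then be added, in the spirit of the procedure described above.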
|102||Keiko Tokunaga, Hiromu Takematsu and Kiyoko Aoki-Kinoshita.
Simulation analysis of signaling in B cell activation with CellDesigner
Abstract: Glycans often exist on cell surfaces, and they have roles in many functions such as cell-cell communications and signaling. Glycans are synthesized by enzymes called glycosyltransferases. In mammalian glycans, sialic acid residues are frequently assumed to play a key role in complex immune systems. In this study, we focus on sialyltransferases, which are necessary to produce sialic acid-containing glycans.
ST3GalI is a glycosyltransferase that adds a sialic acid onto galactose with an “α2-3” glycosidic linkage. ST3GalI mediates the transfer of sialic acid residues to a “Gal β1-3 GalNAc” sequence. This enzyme is expressed in non-activated B cells. B cells are one type of lymphocyte in the immune system; they are able to proliferate when they come into contact with matching antigens such as foreign proteins, bacteria or viruses. Some activated B cells form the germinal center, which can be bound by peanut agglutinin (PNA). PNA recognizes glycans with “Gal β1-3 GalNAc”, and it is known that the expression of ST3GalI mRNA and the binding of PNA are significantly but inversely correlated. However, the functional relevance between the result of PNA staining and the signaling mechanisms in B cell activation is still unknown.
To understand the functional relevance in activation signaling, we established a B cell line transfected with the ST3GalI gene, and control cells. We observed clear differences in the cellular profile of Ca2+ concentrations upon activation. However, we were not able to identify the signaling molecule responsible for the difference in the calcium influx. We thought that a systems biology approach could identify the responsible molecule in the signaling systems of B cell activation. In this study, we focused on simulation analysis to reproduce the signaling pathway and to compare it with experimental results.
We used the CellDesigner software for our simulation. CellDesigner is a powerful tool for calculating and simulating biological pathway models. It allows users to download models for simulating pathway models contained in databases, such as BioModels.net and PantherDB.org. We used a pathway model of B cell activation registered in PantherDB.org. It consists of the mitogen-activated protein kinase (MAPK) signaling pathway including extracellular signal-regulated kinase (ERK) transfer and Ca2+ concentration in B cell. Furthermore, we extracted reaction formulae and their parameters from BioModels.net and added them to our pathway model. At the time of this writing, we successfully simulated the Ca2+ concentration fluctuation that we obtained from our control experiment. We will next simulate the Ca2+ concentration fluctuation in activated B cells with a transfected ST3GalI gene and attempt to understand the mechanism of this fluctuation from our results.
|103||Tetsuo Katsuragi, Naoaki Ono, Shigehiko Kanaya, Tetsuo Sato and Md. Altaf-Ul-Amin.
Comparison between genetic algorithm and distributed genetic algorithm in the context of parameter estimation for dynamic simulation of metabolic networks
Abstract: Simulation tools are useful for obtaining further information on the dynamic behavior of metabolites, which can explain the data observed in experiments. For dynamic simulation of metabolites, all the initial metabolite concentrations and the reaction rates should be known beforehand. However, it is difficult to obtain this information, since the metabolite concentrations measured by mass spectrometry are generally obtained as relative levels, and reaction rates that fit the experimental conditions are not always available from the literature. In the previous work, 56 reactions including 59 metabolites in Arabidopsis thaliana were selected according to the experimental data, and the time course of the metabolite concentrations was simulated based on estimated parameters that were fitted to the experimental data (Plant Cell Phys. 54, 728-739, 2013). Parameters were estimated using a genetic algorithm (GA). The results showed that parameter estimation by GA is useful, but the stochastic approach in GA may cause dispersion of the results.
In this study, we propose a tool, SS-dGA, that uses a distributed genetic algorithm (DGA) to estimate parameters that can reproduce the dynamic behavior of metabolites to match the observed experimental data. DGA is a derivative of GA designed to improve it by introducing the concept of multiple populations, which keeps the parameters diverse so that solutions are searched for over a wider area. Using this tool, the dynamics of amino acid biosynthesis observed in Arabidopsis thaliana were reproduced. The parameters estimated by the GA in the previous work and those estimated by the DGA were then compared. The estimation capability of DGA is better than that of GA: the mean fitness over all test runs is higher for DGA than for GA, and the standard deviation is smaller, which implies that the dispersion of the results is smaller. It may be concluded that the diversity of parameters kept in each population in DGA played an important role in improving the fitness of the estimation. Based on time series of metabolite concentration data generated by mass spectrometry, SS-dGA can estimate the parameters needed for dynamic simulation and simulate the metabolite concentrations to match the data. Users of SS-dGA can obtain further information on the behavior of metabolites that is not directly visible in the experimental data.
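The multiple-population idea can be sketched as an island-model genetic algorithm. The example below is not SS-dGA: it assumes a toy problem (fitting the initial level and decay rate of a single exponentially decaying metabolite), real-valued parameters, tournament selection, blend crossover, Gaussian mutation, and ring migration of each island's best individual every few generations. All names and settings are illustrative assumptions.

```python
import math
import random

TIMES = [0, 1, 2, 3, 4]
DATA = [2.0 * math.exp(-0.5 * t) for t in TIMES]  # synthetic "measurements"


def fit_error(params):
    """Squared error between the toy model x0 * exp(-k t) and the data."""
    x0, k = params
    return sum((x0 * math.exp(-k * t) - y) ** 2 for t, y in zip(TIMES, DATA))


def evolve_island(population, error, rng, mutation_sd):
    """One generation: keep the elite, breed the rest."""
    ranked = sorted(population, key=error)
    next_population = [ranked[0]]  # elitism keeps the island's best
    while len(next_population) < len(ranked):
        parent1 = min(rng.sample(ranked, 2), key=error)  # tournament
        parent2 = min(rng.sample(ranked, 2), key=error)
        w = rng.random()
        child = [w * a + (1 - w) * b + rng.gauss(0.0, mutation_sd)
                 for a, b in zip(parent1, parent2)]
        next_population.append(child)
    return next_population


def distributed_ga(error, n_params, n_islands=4, island_size=20,
                   generations=60, migrate_every=10, bounds=(0.0, 5.0),
                   mutation_sd=0.1, seed=0):
    """Island-model GA; returns the best individual and the error history."""
    rng = random.Random(seed)
    low, high = bounds
    islands = [[[rng.uniform(low, high) for _ in range(n_params)]
                for _ in range(island_size)] for _ in range(n_islands)]
    history = []
    for generation in range(1, generations + 1):
        islands = [evolve_island(pop, error, rng, mutation_sd)
                   for pop in islands]
        if generation % migrate_every == 0:
            bests = [min(pop, key=error) for pop in islands]
            for k, pop in enumerate(islands):
                worst = max(range(len(pop)), key=lambda i: error(pop[i]))
                pop[worst] = list(bests[k - 1])  # migrant from the neighbor
        history.append(min(error(ind) for pop in islands for ind in pop))
    best = min((ind for pop in islands for ind in pop), key=error)
    return best, history


best, history = distributed_ga(fit_error, n_params=2)
print(best, history[-1])
```

Because islands exchange individuals only occasionally, each one can explore a different region of parameter space before good solutions spread, which is the diversity argument made in the abstract. Setting `n_islands=1` reduces the sketch to a single-population GA for comparison.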