Identifying the Interaction between Genes and Gene Products Based on Frequently Seen Verbs in Medline Abstracts

Takeshi Sekimizu[1] (sekimizu@is.s.u-tokyo.ac.jp)
Hyun S. Park[1][3] (hsp20@is.s.u-tokyo.ac.jp)
Jun'ichi Tsujii[1][2] (tsujii@is.s.u-tokyo.ac.jp)

[1] Department of Information Science, University of Tokyo
7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8654, Japan
[2] Department of Language Engineering, UMIST, PO Box 88, Manchester M60 1QD, United Kingdom
[3] Department of Computer Science, Sungshin Women's University
249-1 Dongsun-dong, Sungbuk-gu, Seoul, Korea


Abstract

We selected the most frequently occurring verbs from a raw-text corpus of one million words of Medline abstracts, and we identified (bracketed) the noun phrases in the corpus with a precision of 90%. Then, using the noun-phrase-bracketed corpus, we attempted to find the subject and object terms of some of the verbs that occur frequently in the domain. The precision of finding the correct subject and object for each verb was about 73%. This task was possible only because we were able to linguistically analyze (parse) a large raw corpus. Our approach will be useful for classifying genes and gene products and for identifying the interactions between them. It is the first step in our effort to build a genome-related thesaurus and hierarchies fully automatically.
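
The subject and object step described above can be illustrated with a minimal sketch. It is not the parser used in this work. It assumes that noun phrases have already been bracketed with a hypothetical `[NP ...]` notation, that the verb list `TARGET_VERBS` is given (the abstract selects verbs by corpus frequency, and these four are placeholders), and that the nearest noun phrase on each side of the verb is taken as its subject and object, which is a much cruder heuristic than full parsing.

```python
import re
from typing import List, Tuple

# Hypothetical verb list for illustration only.
TARGET_VERBS = {"activates", "binds", "inhibits", "regulates"}

# Assumed notation: a noun phrase is written as "[NP some words]".
# Anything else is treated as a plain whitespace-delimited word.
TOKEN_RE = re.compile(r"\[NP ([^\]]+)\]|(\S+)")


def extract_triples(sentence: str) -> List[Tuple[str, str, str]]:
    """Return (subject, verb, object) triples using a nearest-NP heuristic."""
    items = []
    for match in TOKEN_RE.finditer(sentence):
        if match.group(1) is not None:
            items.append(("NP", match.group(1).strip()))
        else:
            # Normalize plain words: drop trailing punctuation, lowercase.
            items.append(("W", match.group(2).strip(".,;").lower()))

    triples = []
    for i, (kind, text) in enumerate(items):
        if kind != "W" or text not in TARGET_VERBS:
            continue
        # Nearest noun phrase to the left is taken as the subject.
        subj = next((t for k, t in reversed(items[:i]) if k == "NP"), None)
        # Nearest noun phrase to the right is taken as the object.
        obj = next((t for k, t in items[i + 1:] if k == "NP"), None)
        if subj is not None and obj is not None:
            triples.append((subj, text, obj))
    return triples


if __name__ == "__main__":
    print(extract_triples("[NP protein A] activates [NP gene B]."))
```

A heuristic of this kind fails on passives, coordination, and relative clauses, which is one reason full linguistic analysis of the corpus matters for this task.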


Japanese Society for Bioinformatics