Detecting Gene Symbols and Names in Biological Texts: A First Step toward Pertinent Information Extraction

Denys Proux[1] (proux@xrce.xerox.com)
François Rechenmann[2] (Francois.Rechenmann@inria.fr)
Laurent Julliard[1] (julliard@xrce.xerox.com)
Violaine Pillet[3] (violaine@crrm.univ-mrs.fr)
Bernard Jacq[4] (jacq@lgpd.univ-mrs.fr)

[1]Xerox Research Centre Europe
6 chemin de Maupertuis, 38240 Meylan, France
[2]INRIA Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique
655 avenue de l'Europe, 38330 Montbonnot Saint Martin, France
[3]CRRM, Centre de Recherche Rétrospective de Marseille
Faculté des Sciences et Techniques de St-Jérôme, Université Aix-Marseille, Marseille, France
[4]LGPD, Laboratoire de Génétique et Physiologie du Développement
Parc Scientifique de Luminy, CNRS case 907, 13288 Marseille cedex 9, France


Abstract

Gathering data on molecular interactions to feed a specialized database has motivated the development of a computer system that helps extract pertinent information from texts, relying on advanced linguistic tools complemented by object-oriented knowledge modeling capabilities. As a first step toward this challenging objective, a program that identifies gene symbols and names within sentences has been devised. The main difficulty is that these names and symbols do not appear to follow construction rules. The program is therefore built as a series of sieves of different kinds (lexical, morphological and semantic) that single out, among the words of a sentence, those that are potential gene symbols or names. Its performance has been evaluated, in terms of coverage and precision ratios, on a corpus of texts concerning D. melanogaster, for which the list of known gene names is available for checking.
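
The sieve idea can be illustrated with a minimal Python sketch. It is not the authors' program: the word lists, the symbol-shape pattern, the context window and all function names are hypothetical, chosen only to show how lexical, morphological and semantic filters might be chained. Coverage is assumed here to mean the fraction of known genes that are found, and precision the fraction of detected tokens that are known genes.

```python
import re

# Hypothetical stand-in for a general-language lexicon (assumption).
COMMON_WORDS = {"the", "gene", "is", "expressed", "in", "and", "of",
                "with", "interacts", "embryo", "protein"}

# Hypothetical context cues hinting that a nearby token is a gene (assumption).
CONTEXT_CUES = {"gene", "mutant", "allele", "expression"}

# Assumed symbol shape: a letter followed by 1 to 9 letters, digits, '-' or parentheses.
SYMBOL_SHAPE = re.compile(r"^[A-Za-z][A-Za-z0-9\-()]{1,9}$")


def lexical_sieve(tokens):
    """Drop tokens found in the general-language lexicon; keep positions."""
    return [(i, t) for i, t in enumerate(tokens) if t.lower() not in COMMON_WORDS]


def morphological_sieve(candidates):
    """Keep candidates whose surface form matches the assumed symbol shape."""
    return [(i, t) for i, t in candidates if SYMBOL_SHAPE.match(t)]


def semantic_sieve(candidates, tokens):
    """Keep candidates with a context cue within two tokens on either side."""
    kept = []
    for i, t in candidates:
        window = tokens[max(0, i - 2): i + 3]
        if any(w.lower() in CONTEXT_CUES for w in window):
            kept.append(t)
    return kept


def detect(sentence):
    """Run the three sieves in sequence over a whitespace-tokenized sentence."""
    tokens = sentence.split()
    return semantic_sieve(morphological_sieve(lexical_sieve(tokens)), tokens)


def coverage_and_precision(predicted, known):
    """Return (coverage, precision) of predicted names against a known list."""
    predicted, known = set(predicted), set(known)
    hits = predicted & known
    coverage = len(hits) / len(known) if known else 0.0
    precision = len(hits) / len(predicted) if predicted else 0.0
    return coverage, precision


found = detect("The hb gene interacts with Kr in the embryo")
print(found)
print(coverage_and_precision(found, ["hb", "Kr"]))
```

In this toy sentence the symbol hb passes all three sieves, while Kr is discarded by the semantic sieve because no cue word occurs near it. That shows how a strict filter can raise precision at the expense of coverage.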


Japanese Society for Bioinformatics