IIIT Hyderabad Publications
Chemical Named Entity Recognition of Role Labelled Synthetic Chemical Procedures from Patents

Author: Shubhangi Dutta 2018113004
Date: 2024-04-06
Report no: IIIT/TH/2024/27
Advisor: Prabhakar Bhimalapuram

Abstract

Discovering new reaction pathways lies at the heart of drug discovery and chemical experimentation. Chemical patent texts describe new reactions and reaction pathways, so a large amount of drug reaction data sits in unannotated patent text that is not machine-readable. Reaction roles play an important part in analysing chemical pathways and in tracing chemicals through them. Although a vast body of chemical data is available, the lack of data annotated with reaction roles prevents deep learning methods from being deployed effectively for reaction discovery.

To overcome this hurdle, this work introduces a new dataset, WEAVE 2.0, an expansion of the existing WEAVE dataset, a chemical NER dataset obtained from chemical patents. WEAVE 2.0 augments the WEAVE named entities with full manual annotations of novel chemical reactions, including reaction role information. We provide baseline models for chemical entity recognition on our raw dataset, using simple architectures commonly used for chemical NER and for related tasks such as biomedical NER. As the dataset is small, we introduce dataset augmentation techniques to improve learning. These techniques can also be used to generate further data from other patent-based datasets, such as WEAVE [17]. We also introduce improved models, which structure the problem as two smaller parts instead of one, and test them on both the raw dataset and the data augmented using the above methods. Further, we evaluate our best models on other similar chemical NER datasets, showing their performance across multiple similar tasks.

Our dataset and associated models form a foundation for neural understanding of chemical reaction pathways via reaction roles, and will allow models trained for downstream tasks to use this information to make better predictions.

Full thesis: pdf

Centre for Computational Natural Sciences and Bioinformatics