IIIT Hyderabad Publications
Hybrid Tokenization and Datasets for Solving Mathematics and Science Problems Using Transformers

Authors: Pratik Mandlecha, Snehith Kumar Chatakonda, Neeraj Kollepara, Pawan Kumar
Conference: SIAM International Conference on Data Mining (SDM22)
At: Virginia, US
Pages: 1-9
Date: 2021-12-01
Report no: IIIT/TR/2021/116

Abstract

Transformers, which were introduced for the task of machine translation, have since proved useful in many other domains. A recent application of transformers is solving elementary mathematics problems. In this paper, we use a hybrid tokenization technique to encode mathematics and science problems and their answers, and we train a transformer on the encoded data. We compare the performance of our tokenization with that of char-to-char tokenization in solving various types of mathematics and science problems. We discuss the accuracy, memory usage, and training time of the model with the proposed tokenization. The proposed tokenization shows higher accuracy for some problems and requires less memory than char-to-char tokenization. We propose an extended dataset of science and mathematics problems that consists of billions of samples in question-answer format in raw text.

Code and Dataset: https://github.com/misterpawan/scimat2

Centre for Security, Theory and Algorithms