Detecting and correcting real-word errors in Tamil sentences
Authors:
R. Sakuntharaj,
Eastern University, LK
Centre for Information and Communication Technology
S. Mahesan
University of Jaffna, LK
Department of Computer Science, Faculty of Science
Abstract
Spell checkers deal with two types of errors: non-word errors and real-word errors. Non-word errors fall into two sub-categories: in the first, the word itself is invalid; in the second, the word is valid but not present in the lexicon. A real-word error is a word that is valid but inappropriate in the context of the sentence. This paper proposes an approach to correcting real-word errors in the Tamil language. A bigram probabilistic model, constructed from a 3 GB corpus of Tamil text, is used to determine whether a valid word is appropriate in the context of the sentence. If it is not, the word is marked as a real-word error, the minimum edit distance technique is used to find lexically similar words, and the appropriateness of those words is measured by a word-level n-gram probabilistic language model. A hash table with word length as the key speeds up the search for lexically similar words: only words whose length differs by less than two from that of the ‘inappropriate’ word are looked up in the hash table. Test results show that the suggestions generated by the system are 98% accurate, as assessed by a scholar of the Tamil language.
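The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the toy English corpus, the zero-probability detection threshold, the edit-distance cut-off of 2, and measuring word length in Unicode code points are all assumptions made for illustration. The abstract does not state these details, and Tamil text may call for a different unit of length.

```python
from collections import Counter, defaultdict


def edit_distance(a, b):
    """Levenshtein (minimum edit) distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def build_model(sentences):
    """Collect unigram and bigram counts and a hash table keyed by word length."""
    unigrams, bigrams = Counter(), Counter()
    by_length = defaultdict(set)
    for sentence in sentences:
        words = sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
        for w in words:
            by_length[len(w)].add(w)
    return unigrams, bigrams, by_length


def bigram_prob(prev, word, unigrams, bigrams):
    """Maximum-likelihood estimate of P(word | prev); 0.0 if prev is unseen."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0


def is_real_word_error(prev, word, model):
    """Assumed rule: flag a valid word whose bigram probability is zero."""
    unigrams, bigrams, _ = model
    return bigram_prob(prev, word, unigrams, bigrams) == 0.0


def suggest(prev, word, model, max_dist=2):
    """Rank lexically similar words by bigram probability in this context."""
    unigrams, bigrams, by_length = model
    scored = []
    # Only buckets whose length differs by less than two are searched.
    for n in (len(word) - 1, len(word), len(word) + 1):
        for cand in by_length.get(n, ()):
            if cand != word and edit_distance(word, cand) <= max_dist:
                p = bigram_prob(prev, cand, unigrams, bigrams)
                if p > 0:
                    scored.append((p, cand))
    return [cand for p, cand in sorted(scored, key=lambda t: (-t[0], t[1]))]


# Toy corpus standing in for the Tamil corpora (hypothetical data).
model = build_model([
    "i ate a piece of cake",
    "she ate a piece of bread",
    "we want peace in the world",
    "peace in the region",
])
print(is_real_word_error("a", "peace", model))
print(suggest("a", "peace", model))
```

In this toy setting, "peace" is a valid word that never follows "a" in the corpus, so it is flagged, and the length-keyed table limits the edit-distance comparisons to words of four to six characters. A real system would need smoothing, or a threshold above zero, to avoid flagging every unseen but valid bigram.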
How to Cite:
Sakuntharaj, R. and Mahesan, S., 2018. Detecting and correcting real-word errors in Tamil sentences. Ruhuna Journal of Science, 9(2), pp.150–159. DOI: http://doi.org/10.4038/rjs.v9i2.43
Published on 27 Dec 2018.
Peer Reviewed