Detecting and correcting real-word errors in Tamil sentences

Authors:

R. Sakuntharaj

Centre for Information and Communication Technology, Eastern University, LK

S. Mahesan

Department of Computer Science, Faculty of Science, University of Jaffna, LK

Abstract

Spell checkers deal with two types of errors: non-word errors and real-word errors. Non-word errors fall into two sub-categories: in the first, the word itself is invalid; in the second, the word is valid but not present in a valid lexicon. A real-word error is a word that is valid but inappropriate in the context of the sentence. This paper proposes an approach to correcting real-word errors in the Tamil language. A bigram probabilistic model, built from 3 GB of Tamil text corpora, is used to determine whether a valid word is appropriate in the context of the sentence. If it is not, the word is marked as a real-word error, the minimum edit distance technique is used to find lexically similar words, and the appropriateness of those words is measured by a word-level n-gram probabilistic language model. A hash table keyed by word length speeds up the search for lexically similar words: only words whose length differs by less than two from that of the ‘inappropriate’ word are looked up. Test results show that the suggestions generated by the system are 98% accurate, as judged by a scholar of the Tamil language.
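The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made for illustration only: the function names, the toy English corpus (used in place of Tamil text for readability), a zero bigram probability as the test for "inappropriate", a maximum edit distance of 1, and a bigram model for ranking candidates. Word length is measured with Python's `len`, which counts code points; for Tamil a different unit of length may be the right one.

```python
from collections import Counter, defaultdict

def edit_distance(a, b):
    """Minimum edit (Levenshtein) distance with unit costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def build_length_index(lexicon):
    """Hash table mapping word length -> words of that length."""
    index = defaultdict(list)
    for w in lexicon:
        index[len(w)].append(w)
    return index

def candidates(word, index, max_dist=1):
    """Lexically similar words whose length differs by less than two."""
    found = []
    for n in (len(word) - 1, len(word), len(word) + 1):
        for w in index.get(n, []):
            if w != word and edit_distance(word, w) <= max_dist:
                found.append(w)
    return found

def train_bigrams(sentences):
    """Count unigrams and bigrams over whitespace-tokenised sentences."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = s.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams):
    """Unsmoothed estimate of P(word | prev)."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

def suggest(prev, word, index, unigrams, bigrams):
    """Return ranked replacements if `word` looks inappropriate after `prev`."""
    if bigram_prob(prev, word, unigrams, bigrams) > 0:
        return []  # the word is considered appropriate in this context
    scored = [(bigram_prob(prev, c, unigrams, bigrams), c)
              for c in candidates(word, index)]
    return [c for p, c in sorted(scored, reverse=True) if p > 0]

# Toy data (hypothetical, stands in for the Tamil corpus and lexicon).
corpus = ["the cat sat on the mat", "the cat ate the fish", "i cut the rope"]
unigrams, bigrams = train_bigrams(corpus)
index = build_length_index(unigrams.keys())

print(suggest("the", "cut", index, unigrams, bigrams))  # ['cat']
print(suggest("the", "cat", index, unigrams, bigrams))  # []
```

Grouping the lexicon by length means only three buckets are scanned per query, so the edit distance is never computed against words that are too short or too long to be close matches.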
How to Cite: Sakuntharaj, R. and Mahesan, S., 2018. Detecting and correcting real-word errors in Tamil sentences. Ruhuna Journal of Science, 9(2), pp.150–159. DOI: http://doi.org/10.4038/rjs.v9i2.43
Published on 27 Dec 2018.
Peer Reviewed
