(1) Consider a simple language that allows two words: yes and no. Develop a statistical model for spelling errors based on observations of the following data:

    yes, yis, yis, yas, yes, yes, no, mo, do, mo, do, no

This is the type of input you might see from an optical character recognition system.

solution:

The OCR system considered here should use non-word error detection and isolated-word error correction (Kukich). OCR errors are grouped into five classes:

    substitutions       - caused by visual similarity
    multi-substitutions - framing errors
    space deletions
    space insertions
    failures            - OCR algorithm does not select any letter with sufficient accuracy

The most probable word w given some observation O is the one that maximizes the product of two probabilities: p(w) [prior probability] and p(O|w) [likelihood].

Assume that the two words, yes and no, are equally likely. The prior probabilities are then p(yes) = 0.50 and p(no) = 0.50.

Likelihood probabilities: (Kernighan)

-----------------------------------------------------------------------
Error    Correction    Correct    Error     Position    Type
                       letter     letter
-----------------------------------------------------------------------
yis      yes           e          i         2           substitution
yas      yes           e          a         2           substitution
-----------------------------------------------------------------------
mo       no            n          m         1           substitution
do       no            n          d         1           substitution
-----------------------------------------------------------------------

statistical model:

likelihood probabilities (each word was observed 6 times):

    p(yis | yes) = 2 / 6 = 1/3
    p(yas | yes) = 1 / 6 = 1/6
    p(y*s | yes) = 3 / 6 = 1/2    (any substitution in position 2)

    p(mo | no)   = 2 / 6 = 1/3
    p(do | no)   = 2 / 6 = 1/3
    p(*o | no)   = 4 / 6 = 2/3    (any substitution in position 1)

-----------------------------------------------------------------------
correct      p(c)      p(t | c)      p(t | c) * p(c)      %
word (c)
-----------------------------------------------------------------------
yes          1/2       1/2           1/4                  100%
-----------------------------------------------------------------------
no           1/2       2/3           1/3                  100%
-----------------------------------------------------------------------

The confusion matrix for the characters i, e, y, n, m, d, o, s, a is drawn below. Rows are the wrong (observed) character and columns are the correct character; rows marked **** contain errors.

confusion matrix:

wrong              correct character
character          i   e   y   n   m   d   o   s   a
-----------------------------------------------------------------------
i   ****           -   2   0   0   0   0   0   0   0
e                  0   -   0   0   0   0   0   0   0
y                  0   0   -   0   0   0   0   0   0
n                  0   0   0   -   0   0   0   0   0
m   ****           0   0   0   2   -   0   0   0   0
d   ****           0   0   0   2   0   -   0   0   0
o                  0   0   0   0   0   0   -   0   0
s                  0   0   0   0   0   0   0   -   0
a   ****           0   1   0   0   0   0   0   0   -
-----------------------------------------------------------------------

The confusion matrix shows that only the character e is confused (as i or a) and only the character n is confused (as m or d). For this data, a statistical model is therefore not strictly necessary: replacing each occurrence of i or a with e, and each occurrence of m or d with n, corrects every error.

In my opinion, we do not need a statistical model for spelling errors here, because the OCR output already separates the two words, yes and no, by length: three characters for yes and two characters for no. So a simple model that just counts the number of characters is the best model for the given data.

reference: sections 5.1 and 5.2 of the textbook.
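
The counts in the solution above can be reproduced with a short script. The following is a minimal sketch rather than a definitive implementation: the names OBSERVED, PRIOR, likelihoods, confusion_counts, and correct are hypothetical, and grouping the twelve observations by intended word (six for yes, six for no) is an assumption that matches the tables in the solution. It estimates p(t | c) as a relative frequency, counts single-letter substitutions, and picks the correction c that maximizes p(t | c) * p(c).

```python
from collections import Counter
from fractions import Fraction

# Observations from the problem, grouped by the intended word.
# The grouping itself is an assumption made for this sketch.
OBSERVED = {
    "yes": ["yes", "yis", "yis", "yas", "yes", "yes"],
    "no": ["no", "mo", "do", "mo", "do", "no"],
}

# Equal priors, as assumed in the solution.
PRIOR = {"yes": Fraction(1, 2), "no": Fraction(1, 2)}


def likelihoods(observed):
    """Estimate p(t | c) as the relative frequency of each observed form t."""
    table = {}
    for word, forms in observed.items():
        counts = Counter(forms)
        table[word] = {t: Fraction(n, len(forms)) for t, n in counts.items()}
    return table


def confusion_counts(observed):
    """Count substitutions as (wrong character, correct character) pairs."""
    counts = Counter()
    for word, forms in observed.items():
        for form in forms:
            if len(form) != len(word):
                continue  # only same-length substitutions are handled here
            for wrong_ch, correct_ch in zip(form, word):
                if wrong_ch != correct_ch:
                    counts[(wrong_ch, correct_ch)] += 1
    return counts


def correct(typo, table, prior):
    """Return the word c maximizing p(typo | c) * p(c), or None if unseen."""
    scores = {
        word: prior[word] * forms.get(typo, Fraction(0))
        for word, forms in table.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    table = likelihoods(OBSERVED)
    for word, forms in table.items():
        for typo, p in sorted(forms.items()):
            print(f"p({typo} | {word}) = {p}")
    print(dict(confusion_counts(OBSERVED)))
    print(correct("yis", table, PRIOR), correct("do", table, PRIOR))
```

Exact fractions are used so the results can be compared directly with the hand-computed values. An observed form that never appeared in the data gets a likelihood of zero in this sketch, so correct returns None for it; a smoothed estimate would be needed to handle unseen errors.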