(1) Consider a simple language that allows two words: yes and no.
    Develop a statistical model for spelling errors based on
    observations of the following data:

     yes, yis, yis, yas, yes, yes
     no, mo, do, mo, do, no

    This is the type of input you might see from an optical
    character recognition system.

solution: 

The OCR system considered here calls for non-word error detection
followed by isolated-word error correction (Kukich).

OCR errors are grouped into five classes: 

substitutions       - caused by visual similarity 
multi-substitutions - framing errors 
space deletions 
space insertions 
failures            - the OCR algorithm does not select any letter with
                      sufficient accuracy.

The most probable word w given some observation O is the one that
maximizes the product of two probabilities: p(w) [prior probability]
and p(O|w) [likelihood].

Assume that the two words, yes and no, are equally likely.

The prior probabilities are then p(yes) = 0.50 and p(no) = 0.50.

Likelihood probabilities: (Kernighan)
-----------------------------------------------------------------------

Error      Correction     Correct    Error    Position   Type
                          letter     letter
-----------------------------------------------------------------------
yis        yes             e          i         2        substitutions
yas        yes             e          a         2        substitutions
-----------------------------------------------------------------------

-----------------------------------------------------------------------
mo         no             n           m         1        substitutions
do         no             n           d         1        substitutions
-----------------------------------------------------------------------

statistical model:

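likelihood probabilities (relative frequencies, 6 observations per word):

p(yes | yes) = 3 / 6 = 1/2   (no error)
p(yis | yes) = 2 / 6 = 1/3
p(yas | yes) = 1 / 6 = 1/6
p(y*s | yes) = 3 / 6 = 1/2   (any substitution in position 2)

p(no | no)  = 2 / 6 = 1/3    (no error)
p(mo | no)  = 2 / 6 = 1/3
p(do | no)  = 2 / 6 = 1/3
p(*o | no)  = 4 / 6 = 2/3    (any substitution in position 1)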
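
The relative frequencies above can be reproduced with a short script.
This is only a minimal sketch: it assumes that the first row of the
data was intended as yes and the second as no, and the names
observations and estimate_likelihoods are made up for illustration.

```python
from collections import Counter
from fractions import Fraction

# Observed OCR output, keyed by the word assumed to be intended.
observations = {
    "yes": ["yes", "yis", "yis", "yas", "yes", "yes"],
    "no": ["no", "mo", "do", "mo", "do", "no"],
}

def estimate_likelihoods(samples):
    """Relative frequency of each observed string for one intended word."""
    counts = Counter(samples)
    total = len(samples)
    return {typed: Fraction(n, total) for typed, n in counts.items()}

# likelihood[w][t] approximates p(t | w).
likelihood = {w: estimate_likelihoods(s) for w, s in observations.items()}

for word, table in likelihood.items():
    for typed, p in table.items():
        print(f"p({typed} | {word}) = {p}")
```

Fractions are used instead of floats so that the values match the
hand-computed ones exactly.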

-----------------------------------------------------------------------
-----------------------------------------------------------------------
correct      p(c)     p(t | c)       p(t|c) * p(c)   %
word (c)
-----------------------------------------------------------------------
yes          1/2      1/2            1/4             100%

-----------------------------------------------------------------------
no           1/2      2/3            1/3             100%
-----------------------------------------------------------------------
-----------------------------------------------------------------------
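
The table scores each word with the aggregate error probability. As a
rough sketch of how the same prior-times-likelihood rule could pick a
correction for a single observed string, the following uses the
per-string relative frequencies from the data. The function name
correct and the choice to return None for unseen strings are
assumptions made for illustration, not part of the original solution.

```python
from fractions import Fraction

# Priors: both words assumed equally likely.
prior = {"yes": Fraction(1, 2), "no": Fraction(1, 2)}

# p(observed | intended), taken from the counts in the data.
likelihood = {
    "yes": {"yes": Fraction(3, 6), "yis": Fraction(2, 6), "yas": Fraction(1, 6)},
    "no": {"no": Fraction(2, 6), "mo": Fraction(2, 6), "do": Fraction(2, 6)},
}

def correct(observed):
    """Return (best word, scores), scoring each word by p(w) * p(observed | w)."""
    scores = {
        w: prior[w] * likelihood[w].get(observed, Fraction(0)) for w in prior
    }
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        return None, scores  # string never observed for either word
    return best, scores

for typed in ["yis", "yas", "mo", "do"]:
    word, scores = correct(typed)
    print(typed, "->", word, dict(scores))
```

With no smoothing, any string that was never observed gets a score of
zero for both words, so this sketch cannot correct unseen errors.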

The confusion matrix is drawn for the characters i, e, y, n, m, d, o, s, a.

confusion matrix:

wrong           correct character ------>
character       i    e    y    n    m    d    o    s    a
i ****          -    2    0    0    0    0    0    0    0
e               0    -    0    0    0    0    0    0    0
y               0    0    -    0    0    0    0    0    0
n               0    0    0    -    0    0    0    0    0
m ****          0    0    0    2    -    0    0    0    0
d ****          0    0    0    2    0    -    0    0    0
o               0    0    0    0    0    0    -    0    0
s               0    0    0    0    0    0    0    -    0
a ****          0    1    0    0    0    0    0    0    -
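
The non-zero cells of the matrix can be counted by comparing each
observed string with its intended word letter by letter. This is a
sketch that handles only same-length strings (substitution errors);
the name substitution_counts is hypothetical, and the mapping of data
rows to intended words is assumed as in the problem statement.

```python
from collections import Counter

# Observed OCR output, keyed by the word assumed to be intended.
observations = {
    "yes": ["yes", "yis", "yis", "yas", "yes", "yes"],
    "no": ["no", "mo", "do", "mo", "do", "no"],
}

def substitution_counts(data):
    """Count (wrong letter, correct letter) pairs over all observations."""
    counts = Counter()
    for intended, samples in data.items():
        for typed in samples:
            if len(typed) != len(intended):
                continue  # insertions and deletions are out of scope here
            for wrong, right in zip(typed, intended):
                if wrong != right:
                    counts[(wrong, right)] += 1
    return counts

confusions = substitution_counts(observations)
for (wrong, right), n in sorted(confusions.items()):
    print(f"{right} read as {wrong}: {n}")
```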

The confusion matrix shows that only the character e is confused with
i or a, and only the character n is confused with m or d. This
suggests that a statistical model is not necessary: it is enough to
replace every occurrence of i or a with e, and every occurrence of m
or d with n.

In my opinion, we do not need a statistical model for spelling errors
here, because the OCR output already separates the two words by
length: three characters for yes and two characters for no. So a
simple model that just counts the number of characters is the best
model for the given data.
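
The character-counting model can be written in a few lines. This is a
sketch only; the function name classify_by_length is made up here, and
returning None for any other length is an assumption, since the data
contains no such case.

```python
def classify_by_length(observed):
    """Map OCR output to a word using only its number of characters."""
    by_length = {3: "yes", 2: "no"}
    return by_length.get(len(observed))  # None if the length is unexpected

data = ["yes", "yis", "yis", "yas", "yes", "yes",
        "no", "mo", "do", "mo", "do", "no"]
for typed in data:
    print(typed, "->", classify_by_length(typed))
```

On the twelve observations given, this rule recovers the intended word
every time, but it relies on the language having exactly one word of
each length.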

reference: sections 5.1 and 5.2 of the textbook.