N-Grams
Q1. Trigram Probability Derivation and Calculation
A trigram model is a second-order Markov model: the probability of each word depends only on the two words that precede it.
(a) Derive the general formula for calculating the probability of a word sequence using a trigram model.
(b) For the following corpus, in which (eos) marks a sentence boundary, construct the trigram count table and use it to calculate the probability of the sentence "Can I sit near you":
(eos) Can I sit near you (eos) You can sit (eos) Sit near him (eos) I can sit you (eos)
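One way to check a hand-built table is a short script. The sketch below uses the usual trigram approximation P(w1 ... wn) ≈ product over i of P(wi | wi-2, wi-1), with the unsmoothed estimate P(wi | wi-2, wi-1) = C(wi-2 wi-1 wi) / C(wi-2 wi-1). Two conventions are assumptions, not part of the question: tokens are lowercased (so "Can" and "can" count as the same word), and each sentence is padded with two (eos) markers in front and one behind. The function names are illustrative.

```python
from collections import Counter
from fractions import Fraction

EOS = "(eos)"

# Corpus from part (b), copied verbatim.
corpus = "(eos) Can I sit near you (eos) You can sit (eos) Sit near him (eos) I can sit you (eos)"

def sentences(text):
    """Split the corpus at the (eos) markers into lists of lowercased tokens."""
    # Assumption: case is folded, so "Sit" and "sit" are the same token.
    return [s.lower().split() for s in text.split(EOS) if s.strip()]

def pad(tokens):
    """Assumed boundary convention: two markers before the sentence, one after."""
    return [EOS, EOS] + tokens + [EOS]

trigram_counts = Counter()   # C(w1 w2 w3)
context_counts = Counter()   # C(w1 w2), counted as a trigram context
for sent in sentences(corpus):
    padded = pad(sent)
    for i in range(2, len(padded)):
        context = (padded[i - 2], padded[i - 1])
        trigram_counts[context + (padded[i],)] += 1
        context_counts[context] += 1

def trigram_prob(w1, w2, w3):
    """Unsmoothed estimate C(w1 w2 w3) / C(w1 w2); zero for an unseen context."""
    if context_counts[(w1, w2)] == 0:
        return Fraction(0)
    return Fraction(trigram_counts[(w1, w2, w3)], context_counts[(w1, w2)])

def sentence_prob(sentence):
    """Multiply the trigram probabilities over the padded sentence."""
    padded = pad(sentence.lower().split())
    prob = Fraction(1)
    for i in range(2, len(padded)):
        prob *= trigram_prob(padded[i - 2], padded[i - 1], padded[i])
    return prob

p = sentence_prob("Can I sit near you")
print(p, float(p))
```

Under these assumptions the script prints 1/8 0.125: the only factors below 1 are P(can | (eos), (eos)) = 1/4 and P(you | sit, near) = 1/2. Case-sensitive matching or a different boundary convention would change the counts and therefore the result.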
Q2. Character-Based N-Grams in String Similarity
A character-based N-Gram is a sequence of n consecutive characters extracted from a word. Character N-Grams are widely used in spellcheckers, stemming, and OCR error correction.
(a) List all possible trigrams for each of the following words:
- quote
- patient
- patent
- impatient
(b) Explain how character N-Grams can help in identifying spelling errors or similarities between words.
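As a minimal sketch of both parts, the snippet below extracts character trigrams without any boundary padding and compares words by the Jaccard overlap of their trigram sets. The choice of Jaccard and the names `char_ngrams` and `jaccard` are assumptions made for illustration; Dice or cosine overlap would serve the same purpose.

```python
def char_ngrams(word, n=3):
    """Return the contiguous character n-grams of `word`, left to right (no padding)."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def jaccard(a, b, n=3):
    """Size of the shared n-gram set divided by the size of the combined set."""
    ga, gb = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    if not ga and not gb:
        return 0.0  # both words shorter than n: no evidence either way
    return len(ga & gb) / len(ga | gb)

for w in ["quote", "patient", "patent", "impatient"]:
    print(w, char_ngrams(w))

print(jaccard("patient", "impatient"))  # 5 shared out of 7 distinct trigrams
print(jaccard("patient", "patent"))     # 2 shared out of 7 distinct trigrams
print(jaccard("quote", "patient"))      # no shared trigrams
```

Words that differ by a small edit keep most of their trigrams, so a high overlap with a dictionary word suggests a likely correction for a misspelled or misrecognized string.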
Q3. N-Gram Probability and Spelling Correction
Given the following candidate words, calculate the probability of each using a character bigram or trigram model (assume you have access to a suitable corpus or N-Gram table). Which word is most likely to be the correct spelling?
- qotient
- quotent
- quotient
Explain your reasoning based on N-Gram probabilities.
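The mechanics of such a calculation can be sketched with an unsmoothed character bigram model. Since no corpus is supplied, the training list here is only the four words from Q2, which is an assumption for illustration and far too small to give trustworthy rankings. The word-boundary symbols `^` and `$` and the function name `word_prob` are also illustrative.

```python
from collections import Counter
from fractions import Fraction

START, END = "^", "$"

# Toy training list (assumption): only the four words from Q2.
# A real answer needs counts from a large corpus or dictionary.
training_words = ["quote", "patient", "patent", "impatient"]

bigram_counts = Counter()   # C(prev cur)
context_counts = Counter()  # C(prev), counted as a bigram context
for word in training_words:
    padded = START + word + END
    for prev, cur in zip(padded, padded[1:]):
        bigram_counts[(prev, cur)] += 1
        context_counts[prev] += 1

def word_prob(word):
    """Unsmoothed product of P(cur | prev) over the boundary-padded word."""
    padded = START + word + END
    prob = Fraction(1)
    for prev, cur in zip(padded, padded[1:]):
        if bigram_counts[(prev, cur)] == 0:
            return Fraction(0)  # one unseen bigram makes the whole product zero
        prob *= Fraction(bigram_counts[(prev, cur)], context_counts[prev])
    return prob

for candidate in ["qotient", "quotent", "quotient"]:
    print(candidate, word_prob(candidate))
```

On this toy list the sketch assigns probability 0 to qotient, because q is never followed by o in the training words. It does not separate quotent from quotient reliably: every bigram of both strings is attested, and the shorter string has fewer factors in its product. Counts from a large corpus, a trigram model, smoothing, or length normalization would be needed before drawing a conclusion about the correct spelling.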