POS Tagging - Hidden Markov Model
Core Markov Model and POS Tagging Terms
Markov Model:
A statistical model of sequences built on the assumption that each item depends only on the item immediately before it (the Markov assumption).
Hidden Markov Model (HMM):
A Markov model in which the sequence of states (such as POS tags) is hidden; only the outputs the states emit (the words) are observed.
Part-of-Speech (POS) Tagging:
The process of assigning grammatical categories (noun, verb, adjective, etc.) to each word in a sentence.
Transition Probability:
The probability of moving from one state (POS tag) to another in a sequence, e.g., P(VERB | NOUN).
Emission Probability:
The probability of observing a word given a particular POS tag, e.g., P(dog | NOUN).
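The transition and emission tables can be pictured as simple lookup structures. Below is a minimal sketch in Python; the tags and numbers are made-up illustrations, not values estimated from a real corpus.

```python
# Toy transition and emission tables for an HMM POS tagger.
# All numbers are illustrative, not learned from real data.
transition = {
    "NOUN": {"VERB": 0.4, "NOUN": 0.3, "ADJ": 0.3},  # P(next tag | NOUN)
    "VERB": {"NOUN": 0.5, "VERB": 0.3, "ADJ": 0.2},
    "ADJ":  {"NOUN": 0.8, "ADJ": 0.1, "VERB": 0.1},
}
emission = {
    "NOUN": {"dog": 0.02, "can": 0.001},  # P(word | NOUN)
    "VERB": {"can": 0.01, "runs": 0.005},
    "ADJ":  {"big": 0.03},
}

print(transition["NOUN"]["VERB"])  # P(VERB | NOUN) = 0.4
print(emission["NOUN"]["dog"])     # P(dog | NOUN)  = 0.02
```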
Viterbi Algorithm:
A dynamic programming algorithm for finding the most likely sequence of hidden states (POS tags) in an HMM.
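A compact sketch of the algorithm, assuming transition and emission tables shaped like the toy dictionaries above plus a start-probability table; it is meant to show the dynamic-programming structure, not a production implementation.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most likely tag sequence for `words` under a simple HMM."""
    # V[t][tag] = probability of the best tag path ending in `tag` at position t
    V = [{}]
    back = [{}]
    for tag in tags:
        V[0][tag] = start_p.get(tag, 0.0) * emit_p[tag].get(words[0], 0.0)
        back[0][tag] = None

    for t in range(1, len(words)):
        V.append({})
        back.append({})
        for tag in tags:
            # Pick the previous tag that maximizes the path probability.
            prob, prev = max(
                (V[t - 1][p] * trans_p[p].get(tag, 0.0) * emit_p[tag].get(words[t], 0.0), p)
                for p in tags
            )
            V[t][tag] = prob
            back[t][tag] = prev

    # Trace back from the best final tag to recover the full path.
    best = max(V[-1], key=V[-1].get)
    path = [best]
    for t in range(len(words) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path
```

In practice, log probabilities are used instead of raw products to avoid numerical underflow on long sentences.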
Ambiguity:
The property of words that can belong to multiple grammatical categories depending on context (e.g., "can" as a verb or noun).
Training Data:
Annotated sentences used to learn the parameters of a statistical model.
Corpus:
A large collection of written or spoken texts used for linguistic analysis and training statistical models.
N-gram:
A contiguous sequence of n items (words or tags) from a given sequence of text or speech.
Unigram, Bigram, Trigram:
A single word/tag, a sequence of two, or a sequence of three, respectively.
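A short sketch of extracting n-grams from a token sequence; the helper name and sample sentence are just for illustration.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams (as tuples) from a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["the", "dog", "can", "run"]
print(ngrams(tokens, 1))  # unigrams: ('the',), ('dog',), ('can',), ('run',)
print(ngrams(tokens, 2))  # bigrams:  ('the', 'dog'), ('dog', 'can'), ('can', 'run')
print(ngrams(tokens, 3))  # trigrams: ('the', 'dog', 'can'), ('dog', 'can', 'run')
```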
Smoothing:
Techniques used to handle zero probabilities in statistical models by redistributing probability mass.
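One common choice is add-one (Laplace) smoothing; the sketch below applies it to bigram tag transitions with made-up counts. Other methods (add-k, backoff, interpolation) exist as well.

```python
from collections import Counter

def laplace_transition_prob(prev_tag, tag, bigram_counts, unigram_counts, tagset_size):
    """P(tag | prev_tag) with add-one smoothing, so unseen transitions keep a small nonzero probability."""
    return (bigram_counts[(prev_tag, tag)] + 1) / (unigram_counts[prev_tag] + tagset_size)

# Toy counts; in practice these come from a tagged training corpus.
bigram_counts = Counter({("NOUN", "VERB"): 8, ("VERB", "NOUN"): 5})
unigram_counts = Counter({"NOUN": 10, "VERB": 7})

print(laplace_transition_prob("NOUN", "VERB", bigram_counts, unigram_counts, tagset_size=3))
print(laplace_transition_prob("NOUN", "ADJ", bigram_counts, unigram_counts, tagset_size=3))  # unseen, still > 0
```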
Sequence Labeling:
The task of assigning labels (such as POS tags) to elements in a sequence.
Observation:
The visible outputs (words) generated by the hidden states (POS tags) in an HMM.
State:
A condition of the system at a given step; in an HMM for POS tagging, the states are the hidden grammatical categories (tags).
Lexical Category:
The grammatical class of a word (noun, verb, adjective, etc.).
Word Tokenization:
The process of breaking text into individual words or tokens.
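A minimal regex-based sketch; real tokenizers handle contractions, hyphenation, and punctuation with many more rules.

```python
import re

def tokenize(text):
    """Split text into word and punctuation tokens using a simple regex."""
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("The dog runs fast."))  # ['The', 'dog', 'runs', 'fast', '.']
```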
Out-of-Vocabulary (OOV):
Words that appear in test data but were not seen during training.
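A common remedy, sketched below with an assumed <UNK> token, is to map rare or unseen words to a shared unknown-word symbol so the model can still assign them emission probabilities.

```python
def replace_oov(tokens, vocabulary, unk="<UNK>"):
    """Replace any word not seen during training with a shared unknown-word token."""
    return [w if w in vocabulary else unk for w in tokens]

vocab = {"the", "dog", "runs"}
print(replace_oov(["the", "platypus", "runs"], vocab))  # ['the', '<UNK>', 'runs']
```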
Probability Matrix:
A matrix containing probability values, such as transition or emission probabilities in an HMM.
Statistical Model:
A mathematical model that uses probability distributions to represent data and make predictions.
Decoding:
The process of finding the most likely sequence of hidden states (POS tags) given the observed sequence (words), typically performed with the Viterbi algorithm.
Maximum Likelihood Estimation:
A method of estimating model parameters by finding values that maximize the likelihood of the observed data.
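For HMM transition probabilities, MLE reduces to relative counts: P(t_i | t_{i-1}) = count(t_{i-1}, t_i) / count(t_{i-1}). A small sketch with made-up tagged sentences:

```python
from collections import Counter

def mle_transition_probs(tagged_sentences):
    """Estimate P(tag | prev_tag) as count(prev_tag, tag) / count(prev_tag)."""
    bigrams, unigrams = Counter(), Counter()
    for sent in tagged_sentences:
        tags = [tag for _, tag in sent]
        unigrams.update(tags[:-1])
        bigrams.update(zip(tags, tags[1:]))
    return {(prev, t): c / unigrams[prev] for (prev, t), c in bigrams.items()}

data = [[("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
        [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]]
print(mle_transition_probs(data))  # {('DET', 'NOUN'): 1.0, ('NOUN', 'VERB'): 1.0}
```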
Natural Language Processing (NLP):
A field of computer science and artificial intelligence concerned with interactions between computers and human language.