POS Tagging - Hidden Markov Model
Core Markov Model and POS Tagging Terms
Markov Model:
A statistical model of sequences built on the assumption that each item depends only on the item immediately before it (the Markov assumption).
Hidden Markov Model (HMM):
A Markov model in which the sequence of states (such as POS tags) is hidden; only the outputs the states emit (the words) are observed.
Part-of-Speech (POS) Tagging:
The process of assigning grammatical categories (noun, verb, adjective, etc.) to each word in a sentence.
Transition Probability:
The probability of moving from one state (POS tag) to another in a sequence, e.g., P(VERB | NOUN).
Emission Probability:
The probability of observing a word given a particular POS tag, e.g., P(dog | NOUN).
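The transition and emission tables can be pictured as simple lookup structures. Below is a minimal sketch in Python; the tags and numbers are made-up illustrations, not values estimated from a real corpus.

```python
# Toy transition and emission tables for an HMM POS tagger.
# All numbers are illustrative, not learned from real data.
transition = {
    "NOUN": {"VERB": 0.4, "NOUN": 0.3, "ADJ": 0.3},  # P(next tag | NOUN)
    "VERB": {"NOUN": 0.5, "VERB": 0.3, "ADJ": 0.2},
    "ADJ":  {"NOUN": 0.8, "ADJ": 0.1, "VERB": 0.1},
}
emission = {
    "NOUN": {"dog": 0.02, "can": 0.001},  # P(word | NOUN)
    "VERB": {"can": 0.01, "runs": 0.005},
    "ADJ":  {"big": 0.03},
}

print(transition["NOUN"]["VERB"])  # P(VERB | NOUN) = 0.4
print(emission["NOUN"]["dog"])     # P(dog | NOUN)  = 0.02
```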
Viterbi Algorithm:
A dynamic programming algorithm for finding the most likely sequence of hidden states (POS tags) in an HMM.
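A compact sketch of the algorithm, assuming transition and emission tables shaped like the toy dictionaries above plus a start-probability table; it is meant to show the dynamic-programming structure, not a production implementation.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most likely tag sequence for `words` under a simple HMM."""
    # V[t][tag] = probability of the best tag path ending in `tag` at position t
    V = [{}]
    back = [{}]
    for tag in tags:
        V[0][tag] = start_p.get(tag, 0.0) * emit_p[tag].get(words[0], 0.0)
        back[0][tag] = None

    for t in range(1, len(words)):
        V.append({})
        back.append({})
        for tag in tags:
            # Pick the previous tag that maximizes the path probability.
            prob, prev = max(
                (V[t - 1][p] * trans_p[p].get(tag, 0.0) * emit_p[tag].get(words[t], 0.0), p)
                for p in tags
            )
            V[t][tag] = prob
            back[t][tag] = prev

    # Trace back from the best final tag to recover the full path.
    best = max(V[-1], key=V[-1].get)
    path = [best]
    for t in range(len(words) - 1, 0, -1):
        path.insert(0, back[t][path[0]])
    return path
```

In practice, log probabilities are used instead of raw products to avoid numerical underflow on long sentences.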
Ambiguity:
The property of words that can belong to multiple grammatical categories depending on context (e.g., "can" as a verb or noun).
Training Data:
Annotated sentences used to learn the parameters of a statistical model.
Corpus:
A large collection of written or spoken texts used for linguistic analysis and training statistical models.
N-gram:
A contiguous sequence of n items (words or tags) from a given sequence of text or speech.
Unigram, Bigram, Trigram:
A single word/tag, a sequence of two, or a sequence of three, respectively.
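A short sketch of extracting n-grams from a token sequence; the helper name and sample sentence are just for illustration.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams (as tuples) from a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["the", "dog", "can", "run"]
print(ngrams(tokens, 1))  # unigrams: ('the',), ('dog',), ('can',), ('run',)
print(ngrams(tokens, 2))  # bigrams:  ('the', 'dog'), ('dog', 'can'), ('can', 'run')
print(ngrams(tokens, 3))  # trigrams: ('the', 'dog', 'can'), ('dog', 'can', 'run')
```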
Smoothing:
Techniques used to handle zero probabilities in statistical models by redistributing probability mass.
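One common choice is add-one (Laplace) smoothing; the sketch below applies it to bigram tag transitions with made-up counts. Other methods (add-k, backoff, interpolation) exist as well.

```python
from collections import Counter

def laplace_transition_prob(prev_tag, tag, bigram_counts, unigram_counts, tagset_size):
    """P(tag | prev_tag) with add-one smoothing, so unseen transitions keep a small nonzero probability."""
    return (bigram_counts[(prev_tag, tag)] + 1) / (unigram_counts[prev_tag] + tagset_size)

# Toy counts; in practice these come from a tagged training corpus.
bigram_counts = Counter({("NOUN", "VERB"): 8, ("VERB", "NOUN"): 5})
unigram_counts = Counter({"NOUN": 10, "VERB": 7})

print(laplace_transition_prob("NOUN", "VERB", bigram_counts, unigram_counts, tagset_size=3))
print(laplace_transition_prob("NOUN", "ADJ", bigram_counts, unigram_counts, tagset_size=3))  # unseen, still > 0
```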
Sequence Labeling:
The task of assigning labels (such as POS tags) to elements in a sequence.
Observation:
The visible outputs (words) generated by the hidden states (POS tags) in an HMM.
State:
A condition of the system at a given step; in an HMM for POS tagging, the states are the hidden grammatical categories (tags).
Lexical Category:
The grammatical class of a word (noun, verb, adjective, etc.).
Word Tokenization:
The process of breaking text into individual words or tokens.
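A minimal regex-based sketch; real tokenizers handle contractions, hyphenation, and punctuation with many more rules.

```python
import re

def tokenize(text):
    """Split text into word and punctuation tokens using a simple regex."""
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("The dog runs fast."))  # ['The', 'dog', 'runs', 'fast', '.']
```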
Out-of-Vocabulary (OOV):
Words that appear in test data but were not seen during training.
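A common remedy, sketched below with an assumed <UNK> token, is to map rare or unseen words to a shared unknown-word symbol so the model can still assign them emission probabilities.

```python
def replace_oov(tokens, vocabulary, unk="<UNK>"):
    """Replace any word not seen during training with a shared unknown-word token."""
    return [w if w in vocabulary else unk for w in tokens]

vocab = {"the", "dog", "runs"}
print(replace_oov(["the", "platypus", "runs"], vocab))  # ['the', '<UNK>', 'runs']
```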
Probability Matrix:
A matrix containing probability values, such as transition or emission probabilities in an HMM.
Statistical Model:
A mathematical model that uses probability distributions to represent data and make predictions.
Decoding:
The process of finding the most likely sequence of hidden states (POS tags) given the observed sequence (words), typically performed with the Viterbi algorithm.
Maximum Likelihood Estimation:
A method of estimating model parameters by finding values that maximize the likelihood of the observed data.
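For HMM transition probabilities, MLE reduces to relative counts: P(t_i | t_{i-1}) = count(t_{i-1}, t_i) / count(t_{i-1}). A small sketch with made-up tagged sentences:

```python
from collections import Counter

def mle_transition_probs(tagged_sentences):
    """Estimate P(tag | prev_tag) as count(prev_tag, tag) / count(prev_tag)."""
    bigrams, unigrams = Counter(), Counter()
    for sent in tagged_sentences:
        tags = [tag for _, tag in sent]
        unigrams.update(tags[:-1])
        bigrams.update(zip(tags, tags[1:]))
    return {(prev, t): c / unigrams[prev] for (prev, t), c in bigrams.items()}

data = [[("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
        [("a", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")]]
print(mle_transition_probs(data))  # {('DET', 'NOUN'): 1.0, ('NOUN', 'VERB'): 1.0}
```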
Natural Language Processing (NLP):
A field of computer science and artificial intelligence concerned with interactions between computers and human language.