Tutorial on Hidden Markov Model
Loc Nguyen
Sunflower Soft Company, Ho Chi Minh city, Vietnam
Email address:
To cite this article:
Loc Nguyen. Tutorial on Hidden Markov Model. Applied and Computational Mathematics. Special Issue: Some Novel Algorithms for Global Optimization and Relevant Subjects. Vol. 6, No. 4-1, 2017, pp. 16-38. doi: 10.11648/j.acm.s.2017060401.12
Received: September 11, 2015; Accepted: September 13, 2015; Published: June 17, 2016
Abstract: Hidden Markov model (HMM) is a powerful mathematical tool for prediction and recognition. Many computer software products implement HMM and hide its complexity, which assist scientists to use HMM for applied researches. However comprehending HMM in order to take advantages of its strong points requires a lot of efforts. This report is a tutorial on HMM with full of mathematical proofs and example, which help researchers to understand it by the fastest way from theory to practice. The report focuses on three common problems of HMM such as evaluation problem, uncovering problem, and learning problem, in which learning problem with support of optimization theory is the main subject.
Keywords: Hidden Markov Model, Optimization, Evaluation Problem, Uncovering Problem, Learning Problem
1. Introduction
There are many real-world phenomena (so-called states) that we would like to model in order to explain our observations. Often, given sequence of observations symbols, there is demand of discovering real states. For example, there are some states of weather: sunny, cloudy, rainy [1, p. 1]. Suppose you are in the room and do not know the weather outside but you are notified observations such as wind speed, atmospheric pressure, humidity, and temperature from someone else. Basing on these observations, it is possible for you to forecast the weather by using hidden Markov model (HMM). Before discussing about HMM, we should glance over the definition of Markov model (MM). First, MM is the statistical model which is used to model the stochastic process. MM is defined as below [2]:
- Given a finite set of state S={s_{1}, s_{2},…, s_{n}} whose cardinality is n. Let ∏ be the initial state distribution where π_{i}∏ represents the probability that the stochastic process begins in state s_{i}. In other words π_{i} is the initial probability of state s_{i}, where
- The stochastic process which is modeled gets only one state from S at all time points. This stochastic process is defined as a finite vector X=(x_{1}, x_{2},…, x_{T}) whose element x_{t} is a state at time point t. The process X is called state stochastic process and x_{t} S equals some state s_{i} S. Note that X is also called state sequence. Time point can be in terms of second, minute, hour, day, month, year, etc. It is easy to infer that the initial probability π_{i} = P(x_{1}=s_{i}) where x_{1} is the first state of the stochastic process. The state stochastic process X must meet fully the Markov property, namely, given previous state x_{t}_{–1} of process X, the conditional probability of current state x_{t} is only dependent on the previous state x_{t}_{–1}, not relevant to any further past state (x_{t}_{–2}, x_{t}_{–3},…, x_{1}). In other words, P(x_{t} | x_{t}_{–1}, x_{t}_{–2}, x_{t}_{–3},…, x_{1}) = P(x_{t} | x_{t}_{–1}) with note that P(.) also denotes probability in this report. Such process is called first-order Markov process.
- At time point, the process changes to the next state based on the transition probability distribution a_{ij}, which depends only on the previous state. So a_{ij} is the probability that the stochastic process changes current state s_{i} to next state s_{j}. It means that a_{ij} = P(x_{t}=s_{j} | x_{t–}_{1}=s_{i}) = P(x_{t+}_{1}=s_{j} | x_{t}=s_{i}). The probability of transitioning from any given state to some next state is 1, we have
All transition probabilities a_{ij} (s) constitute the transition probability matrix A. Note that A is n by n matrix because there are n distinct states. It is easy to infer that matrix A represents state stochastic process X. It is possible to understand that the initial probability matrix ∏ is degradation case of matrix A.
Briefly, MM is the triple 〈S, A, ∏〉. In typical MM, states are observed directly by users and transition probabilities (A and ∏) are unique parameters. Otherwise, hidden Markov model (HMM) is similar to MM except that the underlying states become hidden from observer, they are hidden parameters. HMM adds more output parameters which are called observations. Each state (hidden parameter) has the conditional probability distribution upon such observations. HMM is responsible for discovering hidden parameters (states) from output parameters (observations), given the stochastic process. The HMM has further properties as below [2]:
- Suppose there is a finite set of possible observations Φ = {φ_{1}, φ_{2},…, φ_{m}} whose cardinality is m. There is the second stochastic process which produces observations correlating with hidden states. This process is called observable stochastic process, which is defined as a finite vector O = (o_{1}, o_{2},…, o_{T}) whose element o_{t} is an observation at time point t. Note that o_{t} Φ equals some φ_{k}. The process O is often known as observation sequence.
- There is a probability distribution of producing a given observation in each state. Let b_{i}(k) be the probability of observation φ_{k} when the state stochastic process is in state s_{i}. It means that b_{i}(k) = b_{i}(o_{t}=φ_{k}) = P(o_{t}=φ_{k} | x_{t}=s_{i}). The sum of probabilities of all observations which observed in a certain state is 1, we have
All probabilities of observations b_{i}(k) constitute the observation probability matrix B. It is convenient for us to use notation b_{ik} instead of notation b_{i}(k). Note that B is n by m matrix because there are n distinct states and m distinct observations. While matrix A represents state stochastic process X, matrix B represents observable stochastic process O.
Thus, HMM is the 5-tuple ∆ = 〈S, Φ, A, B, ∏〉. Note that components S, Φ, A, B, and ∏ are often called parameters of HMM in which A, B, and ∏ are essential parameters. Going back weather example, suppose you need to predict how weather tomorrow is: sunny, cloudy or rainy since you know only observations about the humidity: dry, dryish, damp, soggy. The HMM is totally determined based on its parameters S, Φ, A, B, and ∏ according to weather example. We have S = {s_{1}=sunny, s_{2}=cloudy, s_{3}=rainy}, Φ = {φ_{1}=dry, φ_{2}=dryish, φ_{3}=damp, φ_{4}=soggy}. Transition probability matrix A is shown in table 1.
From table 1, we have a_{11}+a_{12}+a_{13}=1, a_{21}+a_{22}+a_{23}=1, a_{31}+a_{32}+a_{33}=1.
Initial state distribution specified as uniform distribution is shown in table 2.
From table 2, we have π_{1}+π_{2}+π_{3}=1.
Observation probability matrix B is shown in table 3.
From table 3, we have b_{11}+b_{12}+b_{13}+b_{14}=1, b_{21}+b_{22}+b_{23}+b_{24}=1, b_{31}+b_{32}+b_{33}+b_{34}=1.
The whole weather HMM is depicted in fig. 1.
There are three problems of HMM [2] [3, pp. 262-266]:
1. Given HMM ∆ and an observation sequence O = {o_{1}, o_{2},…, o_{T}} where o_{t} Φ, how to calculate the probability P(O|∆) of this observation sequence. Such probability P(O|∆) indicates how much the HMM ∆ affects on sequence O. This is evaluation problem or explanation problem. Note that it is possible to denote O = {o_{1} → o_{2} →…→ o_{T}} and the sequence O is aforementioned observable stochastic process.
2. Given HMM ∆ and an observation sequence O = {o_{1}, o_{2},…, o_{T}} where o_{t} Φ, how to find the sequence of states X = {x_{1}, x_{2},…, x_{T}} where x_{t} S so that X is most likely to have produced the observation sequence O. This is uncovering problem. Note that the sequence X is aforementioned state stochastic process.
3. Given HMM ∆ and an observation sequence O = {o_{1}, o_{2},…, o_{T}} where o_{t} Φ, how to adjust parameters of ∆ such as initial state distribution ∏, transition probability matrix A, and observation probability matrix B so that the quality of HMM ∆ is enhanced. This is learning problem.
These problems will be mentioned in sections 2, 3, and 4, in turn.
2. HMM Evaluation Problem
The essence of evaluation problem is to find out the way to compute the probability P(O|∆) most effectively given the observation sequence O = {o_{1}, o_{2},…, o_{T}}. For example, given HMM ∆ whose parameters A, B, and ∏ specified in tables 1, 2, and 3, which is designed for weather forecast. Suppose we need to calculate the probability of event that humidity is soggy and dry in days 1 and 2, respectively. This is evaluation problem with sequence of observations O = {o_{1}=φ_{4}=soggy, o_{2}=φ_{1}=dry, o_{3}=φ_{2}=dryish}. There is a complete set of 3^{3}=27 mutually exclusive cases of weather states for three days: {x_{1}=s_{1}=sunny, x_{2}=s_{1}=sunny, x_{3}=s_{1}=sunny}, {x_{1}=s_{1}=sunny, x_{2}=s_{1}=sunny, x_{3}=s_{2}=cloudy}, {x_{1}=s_{1}=sunny, x_{2}=s_{1}=sunny, x_{3}=s_{3}=rainy}, {x_{1}=s_{1}=sunny, x_{2}=s_{2}=cloudy, x_{3}=s_{1}=sunny}, {x_{1}=s_{1}=sunny, x_{2}=s_{2}=cloudy, x_{3}=s_{2}=cloudy}, {x_{1}=s_{1}=sunny, x_{2}=s_{2}=cloudy, x_{3}=s_{3}=rainy}, {x_{1}=s_{1}=sunny, x_{2}=s_{3}=rainy, x_{3}=s_{1}=sunny}, {x_{1}=s_{1}=sunny, x_{2}=s_{3}=rainy, x_{3}=s_{2}=cloudy}, {x_{1}=s_{1}=sunny, x_{2}=s_{3}=rainy, x_{3}=s_{3}=rainy}, {x_{1}=s_{2}=cloudy, x_{2}=s_{1}=sunny, x_{3}=s_{1}=sunny}, {x_{1}=s_{2}=cloudy, x_{2}=s_{1}=sunny, x_{3}=s_{2}=cloudy}, {x_{1}=s_{2}=cloudy, x_{2}=s_{1}=sunny, x_{3}=s_{3}=rainy}, {x_{1}=s_{2}=cloudy, x_{2}=s_{2}=cloudy, x_{3}=s_{1}=sunny}, {x_{1}=s_{2}=cloudy, x_{2}=s_{2}=cloudy, x_{3}=s_{2}=cloudy}, {x_{1}=s_{2}=cloudy, x_{2}=s_{2}=cloudy, x_{3}=s_{3}=rainy}, {x_{1}=s_{2}=cloudy, x_{2}=s_{3}=rainy, x_{3}=s_{1}=sunny}, {x_{1}=s_{2}=cloudy, x_{2}=s_{3}=rainy, x_{3}=s_{2}=cloudy}, {x_{1}=s_{2}=cloudy, x_{2}=s_{3}=rainy, x_{3}=s_{3}=rainy}, {x_{1}=s_{3}=rainy, x_{2}=s_{1}=sunny, x_{3}=s_{1}=sunny}, {x_{1}=s_{3}=rainy, x_{2}=s_{1}=sunny, x_{3}=s_{2}=cloudy}, {x_{1}=s_{3}=rainy, x_{2}=s_{1}=sunny, x_{3}=s_{3}=rainy}, {x_{1}=s_{3}=rainy, x_{2}=s_{2}=cloudy, x_{3}=s_{1}=sunny}, {x_{1}=s_{3}=rainy, x_{2}=s_{2}=cloudy, x_{3}=s_{2}=cloudy}, {x_{1}=s_{3}=rainy, x_{2}=s_{2}=cloudy, x_{3}=s_{3}=rainy}, {x_{1}=s_{3}=rainy, x_{2}=s_{3}=rainy, x_{3}=s_{1}=sunny}, {x_{1}=s_{3}=rainy, x_{2}=s_{3}=rainy, x_{3}=s_{2}=cloudy}, {x_{1}=s_{3}=rainy, x_{2}=s_{3}=rainy, x_{3}=s_{3}=rainy}.
According to total probability rule [4, p. 101], the probability P(O|∆) is:
We have:
(Because observations o_{1}, o_{2}, and o_{3} are mutually independent)
(Because an observation is only dependent on the day when it is observed)
(Due to multiplication rule [4, p. 100])
(Due to Markov property, current state is only dependent on right previous state)
(Due to multiplication rule [4, p. 100])
(According to parameters A, B, and ∏ specified in tables 1, 2, and 3)
Similarly, we have:
It implies
It is easy to explain that given weather HMM modeled by parameters A, B, and ∏ specified in tables 1, 2, and 3, the event that it is soggy, dry, and dryish in three successive days is rare because the probability of such event P(O|Δ) is low (1.3%). It is easy to recognize that it is impossible to browse all combinational cases of given observation sequence O = {o_{1}, o_{2},…, o_{T}} as we knew that it is necessary to survey 3^{3}=27 mutually exclusive cases of weather states with a tiny number of observations {soggy, dry, dryish}. Exactly, given n states and T observations, it takes extremely expensive cost to survey n^{T} cases. According to [3, pp. 262-263], there is a so-called forward-backward procedure to decrease computational cost for determining the probability P (O|Δ). Let α_{t}(i) be the joint probability of partial observation sequence {o_{1}, o_{2},…, o_{t}} and state x_{t}=s_{i} where , specified by (1).
(1)
The joint probability α_{t}(i) is also called forward variable at time point t and state s_{i}. The product α_{t}(i)a_{ij} where a_{ij} is the transition probability from state i to state j counts for probability of join event that partial observation sequence {o_{1}, o_{2},…, o_{t}} exists and the state s_{i} at time point t is changed to s_{j} at time point t+1.
(Due to multiplication rule [4, p. 100])
(Because the partial observation sequence {o_{1}, o_{2},…, o_{t}} is independent from next state x_{t+}_{1} given current state x_{t})
(Due to multiplication rule [4, p. 100])
Summing product α_{t}(i)a_{ij} over all n possible states of x_{t} produces probability of join event that partial observation sequence {o_{1}, o_{2},…, o_{t}} exists and the next state is x_{t+}_{1}=s_{j} regardless of the state x_{t}.
The forward variable at time point t+1 and state s_{j} is calculated on α_{t}(i) as follows:
(Due to multiplication rule)
(Due to observations are mutually independent)
Where b_{j}(o_{t+}_{1}) is the probability of observation o_{t+}_{1} when the state stochastic process is in state s_{j}, please see an example of observation probability matrix shown in table 3. In brief, please pay attention to recurrence property of forward variable specified by (2).
(2)
The aforementioned construction of forward recurrence equation (2) is essentially to build up Markov chain, illustrated by fig. 2 [3, p. 262].
According to the forward recurrence equation (2), given observation sequence O = {o_{1}, o_{2},…, o_{T}}, we have:
The probability P(O|Δ) is sum of α_{T}(i) over all n possible states of x_{T}, specified by (3).
(3)
The forward-backward procedure to calculate the probability P(O|Δ), based on forward equations (2) and (3), includes three steps as shown in table 4 [3, p. 262].
It is required to execute n+2n^{2}(T–1)+n–1 = 2n^{2}(T–1)+2n–1 operations for forward-backward procedure based on forward variable due to:
- There are n multiplications at initialization step.
- There are n multiplications, n–1 additions, and 1 multiplication over the expression . There are n cases of values α_{t+}_{1}(j) for all at time point t+1. So, there are (n+n–1+1) n = 2n^{2} operations over values α_{t+}_{1}(j) for all at time point t. The recurrence step runs over T–1 times and so, there are 2n^{2}^{ }(T–1) operations at recurrence step.
- There are n–1 additions at evaluation step.
Inside 2n^{2}(T–1)+2n–1 operations, there are n+(n+1)n(T–1) = n+(n^{2}+n)(T–1) multiplications and (n–1)n(T–1)+n–1 = (n^{2}+n)(T–1)+n–1 additions.
Going back example with weather HMM whose parameters A, B, and ∏ are specified in tables 1, 2, and 3. We need to re-calculate the probability of observation sequence O = {o_{1}=φ_{4}=soggy, o_{2}=φ_{1}=dry, o_{3}=φ_{2}=dryish} by forward-backward procedure shown in table 4 (based on forward variable). According to initialization step of forward-backward procedure based on forward variable, we have:
According to recurrence step of forward-backward procedure based on forward variable, we have:
According to evaluation step of forward-backward procedure based on forward variable, the probability of observation sequence O = {o_{1}=s_{4}=soggy, o_{2}=s_{1}=dry, o_{3}=s_{2}=dryish} is:
The result from the forward-backward procedure based on forward variable is the same to the one from aforementioned brute-force method that browses all 3^{3}=27 mutually exclusive cases of weather states.
There is interesting thing that the forward-backward procedure can be implemented based on so-called backward variable. Let β_{t}(i) be the backward variable which is conditional probability of partial observation sequence {o_{t}, o_{t+}_{1},…, o_{T}} given state x_{t}=s_{i} where , specified by (4).
(4)
We have
(Because observations o_{t}_{+1}, o_{t}_{+2},…, o_{T} are mutually independent)
(Because partial observation sequence o_{t}_{+1}, o_{t}_{+2},…, o_{T} is independent from state x_{t} at time point t)
(Due to multiplication rule [4, p. 100])
Summing the product a_{ij}b_{j}(o_{t}_{+1})β_{t}_{+1}(j) over all n possible states of x_{t}_{+1}=s_{j}, we have:
(Due to the total probability rule [4, p. 101])
In brief, the recurrence property of backward variable specified by (5).
(5)
Where b_{j}(o_{t+}_{1}) is the probability of observation o_{t+}_{1} when the state stochastic process is in state s_{j}, please see an example of observation probability matrix shown in table 3. The construction of backward recurrence equation (5) is essentially to build up Markov chain, illustrated by fig. 3 [3, p. 263].
According to the backward recurrence equation (5), given observation sequence O = {o_{1}, o_{2},…, o_{T}}, we have:
The product π_{i}b_{i}(o_{1})β_{1}(i) is:
(Because observations o_{1}, o_{2},…, o_{T} are mutually independent)
It implies that the probability P(O|Δ) is:
(Due to the total probability rule [4, p. 101])
Shortly, the probability P(O|Δ) is sum of product π_{i}b_{i}(o_{1})β_{1}(i) over all n possible states of x_{1}=s_{i}, specified by (6).
(6)
The forward-backward procedure to calculate the probability P(O|Δ), based on backward equations (5) and (6), includes three steps as shown in table 5 [3, p. 263].
It is required to execute (3n–1)n(T–1)+2n+n–1 = 3n^{2}(T–1)–n(T–4)–1 operations for forward-backward procedure based on forward variable due to:
- There are 2n multiplications and n–1 additions over the sum . So, there are (2n+n–1)n = (3n–1)n operations over values β_{t}(i) for all at time point t. The recurrence step runs over T–1 times and so, there are (3n–1)n(T–1) operations at recurrence step.
- There are 2n multiplications and n–1 additions over the sum at evaluation step.
Inside 3n^{2}(T–1)–n(T–4)–1 operations, there are 2n^{2}(T–1)+2n multiplications and (n–1)n(T–1)+n–1 = n^{2}(T–1)–n(T–2)–1 additions.
Going back example with weather HMM whose parameters A, B, and ∏ are specified in tables 1, 2, and 3. We need to re-calculate the probability of observation sequence O = {o_{1}=φ_{4}=soggy, o_{2}=φ_{1}=dry, o_{3}=φ_{2}=dryish} by forward-backward procedure shown in table 5 (based on backward variable). According to initialization step of forward-backward procedure based on backward variable, we have:
According to recurrence step of forward-backward procedure based on backward variable, we have:
According to evaluation step of forward-backward procedure based on backward variable, the probability of observation sequence O = {o_{1}=φ_{4}=soggy, o_{2}=φ_{1}=dry, o_{3}=φ_{2}=dryish} is:
The result from the forward-backward procedure based on backward variable is the same to the one from aforementioned brute-force method that browses all 3^{3}=27 mutually exclusive cases of weather states and the one from forward-backward procedure based on forward variable.
The evaluation problem is now described thoroughly in this section. The uncovering problem is mentioned particularly in successive section.
3. HMM Uncovering Problem
Recall that given HMM ∆ and observation sequence O = {o_{1}, o_{2},…, o_{T}} where o_{t} Φ, how to find out a state sequence X = {x_{1}, x_{2},…, x_{T}} where x_{t} S so that X is most likely to have produced the observation sequence O. This is the uncovering problem: which sequence of state transitions is most likely to have led to given observation sequence. In other words, it is required to establish an optimal criterion so that the state sequence X leads to maximizing such criterion. The simple criterion is the conditional probability of sequence X with respect to sequence O and model ∆, denoted P(X|O,∆). We can apply brute-force strategy: "go through all possible such X and pick the one leading to maximizing the criterion P(X|O,∆)".
This strategy is impossible if the number of states and observations is huge. Another popular way is to establish a so-called individually optimal criterion [3, p. 263] which is described right later.
Let γ_{t}(i) be joint probability that the stochastic process is in state s_{i} at time point t with observation sequence O = {o_{1}, o_{2},…, o_{T}}, equation (7) specifies this probability based on forward variable α_{t} and backward variable β_{t}.
(7)
The variable γ_{t}(i) is also called individually optimal criterion with note that forward variable α_{t} and backward variable β_{t} are calculated according to (2) and (5), respectively.
Following is proof of (7).
(Due to Bayes’ rule [4, p. 99])
(Due to multiplication rule [4, p. 100])
(Because observations o_{1}, o_{2},…, o_{T} are observed independently)
(According to (1) and (4) for determining forward variable and backward variable)
The state sequence X = {x_{1}, x_{2},…, x_{T}} is determined by selecting each state x_{t} S so that it maximizes γ_{t}(i).
(Due to Bayes’ rule [4, p. 99])
(Due to (7))
Because the probability is not relevant to state sequence X, it is possible to remove it from the optimization criterion. Thus, equation (8) specifies how to find out the optimal state x_{t} of X at time point t.
(8)
Note that index i is identified with state according to (8). The optimal state x_{t} of X at time point t is the one that maximizes product α_{t}(i) β_{t}(i) over all values s_{i}. The procedure to find out state sequence X = {x_{1}, x_{2},…, x_{T}} based on individually optimal criterion is called individually optimal procedure that includes three steps, shown in table 6.
It is required to execute n+(5n^{2}–n)(T–1)+2nT operations for individually optimal procedure due to:
- There are n multiplications for calculating α_{1}(i) (s).
- The recurrence step runs over T–1 times. There are 2n^{2}(T–1) operations for determining α_{t+}_{1}(i) (s) over all and . There are (3n–1)n(T–1) operations for determining β_{t}(i) (s) over all and t=T–1, t=T–2,…, t=1. There are nT multiplications for determining γ_{t}(i)=α_{t}(i)β_{t}(i) over all and . There are nT comparisons for determining optimal state over all and . In general, there are 2n^{2}(T–1)+ (3n–1)n(T–1) + nT + nT = (5n^{2}–n)(T–1) + 2nT operations at the recurrence step.
Inside n+(5n^{2}–n)(T–1)+2nT operations, there are n+(n+1)n(T–1)+2n^{2}(T–1)+nT = (3n^{2}+n)(T–1)+nT+n multiplications and (n–1)n(T–1)+(n–1)n(T–1) = 2(n^{2}–n)(T–1) additions and nT comparisons.
For example, given HMM ∆ whose parameters A, B, and ∏ specified in tables 1, 2, and 3, which is designed for weather forecast. Suppose humidity is soggy and dry in days 1 and 2, respectively. We apply individual optimal procedure into solving the uncovering problem that finding out the optimal state sequence X = {x_{1}, x_{2}} with regard to observation sequence O = {o_{1}=φ_{4}=soggy, o_{2}=φ_{1}=dry, o_{3}=φ_{2}=dryish}. According to (2) and (5), forward variable and backward variable are calculated as follows:
According to recurrence step of individually optimal procedure, individually optimal criterion γ_{t}(i) and optimal state x_{t} are calculated as follows:
As a result, the optimal state sequence is X = {x_{1}=rainy, x_{2}=sunny, x_{3}=sunny}.
The individually optimal criterion γ_{t}(i) does not reflect the whole probability of state sequence X given observation sequence O because it focuses only on how to find out each partially optimal state x_{t} at each time point t. Thus, the individually optimal procedure is heuristic method. Viterbi algorithm [3, p. 264] is alternative method that takes interest in the whole state sequence X by using joint probability P(X,O|Δ) of state sequence and observation sequence as optimal criterion for determining state sequence X. Let δ_{t}(i) be the maximum joint probability of observation sequence O and state x_{t}=s_{i} over t–1 previous states. The quantity δ_{t}(i) is called joint optimal criterion at time point t, which is specified by (9).
(9)
The recurrence property of joint optimal criterion is specified by (10).
(10)
The semantic content of joint optimal criterion δ_{t} is similar to the forward variable α_{t}. Following is the proof of (10).
(Due to multiplication rule [4, p. 100])
(Due to observations are mutually independent)
(The probability b_{j}(o_{t}_{+1}) is moved out of the maximum operation because it is independent from states x_{1}, x_{2},…, x_{t})
(Due to multiplication rule [4, p. 100])
(Due to multiplication rule [4, p. 100])
(Because observation x_{t}_{+1} is dependent from o_{1}, o_{2},…, o_{t}, x_{1}, x_{2},…, x_{t}_{–1})
(Due to multiplication rule [4, p. 100])
Given criterion δ_{t+}_{1}(j), the state x_{t+}_{1}=s_{j} that maximizes δ_{t+}_{1}(j) is stored in the backtracking state q_{t+}_{1}(j) that is specified by (11).
(11)
Note that index i is identified with state according to (11). The Viterbi algorithm based on joint optimal criterion δ_{t}(i) includes three steps described in table 7.
The total number of operations inside the Viterbi algorithm is 2n+(2n^{2}+n)(T–1) as follows:
- There are n multiplications for initializing n values δ_{1}(i) when each δ_{1}(i) requires 1 multiplication.
- There are (2n^{2}+n)(T–1) operations over the recurrence step because there are n(T–1) values δ_{t+}_{1}(j) and each δ_{t+}_{1}(j) requires n multiplications and n comparisons for maximizing plus 1 multiplication.
- There are n comparisons for constructing the state sequence X, .
Inside 2n+(2n^{2}+n)(T–1) operations, there are n+(n^{2}+n)(T–1) multiplications and n^{2}(T–1)+n comparisons. The number of operations with regard to Viterbi algorithm is smaller than the number of operations with regard to individually optimal procedure when individually optimal procedure requires (5n^{2}–n)(T–1)+2nT+n operations. Therefore, Viterbi algorithm is more effective than individually optimal procedure. Besides, individually optimal procedure does not reflect the whole probability of state sequence X given observation sequence O.
Going backing the weather HMM ∆ whose parameters A, B, and ∏ are specified in tables 1, 2, and 3. Suppose humidity is soggy and dry in days 1 and 2, respectively. We apply Viterbi algorithm into solving the uncovering problem that finding out the optimal state sequence X = {x_{1}, x_{2}, x_{3}} with regard to observation sequence O = {o_{1}=φ_{4}=soggy, o_{2}=φ_{1}=dry, o_{3}=φ_{2}=dryish}. According to initialization step of Viterbi algorithm, we have:
According to recurrence step of Viterbi algorithm, we have:
According to state sequence backtracking of Viterbi algorithm, we have:
As a result, the optimal state sequence is X = {x_{1}=rainy, x_{2}=sunny, x_{3}=sunny}. The result from the Viterbi algorithm is the same to the one from aforementioned individually optimal procedure described in table 6.
The uncovering problem is now described thoroughly in this section. Successive section will mention the learning problem of HMM which is the main subject of this tutorial.
4. HMM Learning Problem
The learning problem is to adjust parameters such as initial state distribution ∏, transition probability matrix A, and observation probability matrix B so that given HMM ∆ gets more appropriate to an observation sequence O = {o_{1}, o_{2},…, o_{T}} with note that ∆ is represented by these parameters. In other words, the learning problem is to adjust parameters by maximizing probability of observation sequence O, as follows:
The Expectation Maximization (EM) algorithm is applied successfully into solving HMM learning problem, which is equivalently well-known Baum-Welch algorithm [3]. Successive sub-section 4.1 describes EM algorithm in detailed before going into Baum-Welch algorithm.
4.1. EM Algorithm
Expectation Maximization (EM) is effective parameter estimator in case that incomplete data is composed of two parts: observed part and missing (or hidden) part. EM is iterative algorithm that improves parameters after iterations until reaching optimal parameters. Each iteration includes two steps: E(xpectation) step and M(aximization) step. In E-step the missing data is estimated based on observed data and current estimate of parameters; so the lower-bound of likelihood function is computed by the expectation of complete data. In M-step new estimates of parameters are determined by maximizing the lower-bound. Please see document [5] for short tutorial of EM. This sub-section focuses on practice general EM algorithm; the theory of EM algorithm is described comprehensively in article "Maximum Likelihood from Incomplete Data via the EM algorithm" by authors [6].
Suppose O and X are observed data and missing (hidden) data, respectively. Note O and X can be represented in any form such as discrete values, scalar, integer number, real number, vector, list, sequence, sample, and matrix. Let represent parameters of probability distribution. Concretely, includes initial state distribution ∏, transition probability matrix A, and observation probability matrix B inside HMM. In other words, represents HMM Δ itself. EM algorithm aims to estimate by finding out which maximizes the likelihood function = .
Where is the optimal estimate of parameters which is called usually parameter estimate. Because the likelihood function is product of factors, it is replaced by the log-likelihood function LnL() that is natural logarithm of the likelihood function , for convenience. We have:
Where,
The method finding out the parameter estimate by maximizing the log-likelihood function is called maximum likelihood estimation (MLE). Of course, EM algorithm is based on MLE.
Suppose the current parameter is after the t^{th} iteration. Next we must find out the new estimate that maximizes the next log-likelihood function . In other words it maximizes the deviation between current log-likelihood and next log-likelihood with regard to .
Where is the deviation between current log-likelihood and next log-likelihood with note that is function of when was determined.
Suppose the total probability of observed data can be determined by marginalizing over missing data:
The expansion of is total probability rule [4, p. 101]. The deviation is re-written:
(Due to multiplication rule [4, p. 100])
Because hidden X is the complete set of mutually exclusive variables, the sum of conditional probabilities of X is equal to 1 given O and .
Applying Jensen’s inequality [5, pp. 3-4]
into deviation , we have:
Where,
Because C is constant with regard to , it is possible to eliminate C in order to simplify the optimization criterion as follows:
The expression is essentially expectation of given conditional probability distribution when is totally determined. Let denote this conditional expectation, equation (12) specifies EM optimization criterion for determining the parameter estimate, which is the most important aspect of EM algorithm.
(12)
Where,
If is continuous density function, the continuous version of this conditional expectation is:
Finally, the EM algorithm is described in table 8.
General EM algorithm is simple but please pay attention to the concept of lower-bound and what the essence of EM is. Recall that the next log-likelihood function is current likelihood function plus the deviation . We have:
Where,
Let denote the lower-bound of the log-likelihood function given current parameter [5, pp. 7-8]. The lower-bound is the function of as specified by (13):
(13)
Determining is to calculate the EM conditional expectation because terms and C were totally determined. The lower-bound has a feature where its evaluation at equals the log-likelihood function .
In fact,
(Due to multiplication rule [4, p. 100])
It implies
Fig. 4. [8, p. 7] shows relationship between the log-likelihood function and its lower-bound .
The essence of maximizing the deviation is to maximize the lower-bound with respect to . For each iteration the new lower-bound and its maximum are computed based on previous lower-bound. A single iteration in EM algorithm can be understood as below:
1. E-step: the new lower-bound is determined based on current parameter according to (13). Of course, determining is to calculate the EM conditional expectation .
2. M-step: finding out the estimate so that reaches maximum at . The next parameter is assigned by the estimate , we have:
Of course becomes current parameter for next iteration. Note, maximizing is to maximize the EM conditional expectation .
In general, it is easy to calculate the EM expectation but finding out the estimate based on maximizing such expectation is complicated optimization problem. It is possible to state that the essence of EM algorithm is to determine the estimate . Now the EM algorithm is introduced with full of details. How to apply it into solving HMM learning problem is described in successive sub-section.
4.2. Applying EM Algorithm into Solving Learning Problem
Now going back the HMM learning problem, the EM algorithm is applied into solving this problem, which is equivalently well-known Baum-Welch algorithm [3]. The parameter becomes the HMM model Δ = (A, B, ∏). Recall that the learning problem is to adjust parameters by maximizing probability of observation sequence O, as follows:
Where , , are parameter estimates and so, the purpose of HMM learning problem is to determine them.
The observation sequence O = {o_{1}, o_{2},…, o_{T}} and state sequence X = {x_{1}, x_{2},…, x_{T}} are observed data and missing (hidden) data within context of EM algorithm, respectively. Note O and X is now represented in sequence. According to EM algorithm, the parameter estimate is determined as follows:
Where Δ_{r} = (A_{r}, B_{r}, ∏_{r}) is the known parameter at the current iteration. Note that we use notation Δ_{r} instead of popular notation Δ_{t} in order to distinguish iteration indices of EM algorithm from time points inside observation sequence O and state sequence X. The EM conditional expectation in accordance with HMM is:
(Because observations o_{1}, o_{2},…, o_{T} are mutually independent)
(Because each observations o_{t} is only dependent on state x_{t})
(Because each state x_{t} is only dependent on previous state x_{t}_{–1})
(Due to recurrence on probability P(x_{1}, x_{2},…, x_{t}))
It is conventional that where x_{0} is pseudo-state, equation (14) specifies general EM conditional expectation for HMM:
(14)
Let and are two index functions so that
We have:
Because of the convention , matrix ∏ is degradation case of matrix A at time point t=1. In other words, the initial probability π_{j} is equal to the transition probability a_{ij} from pseudo-state x_{0} to state x_{1}=s_{j}.
Note that n=|S| is the number of possible states and m=|Φ| is the number of possible observations.
Shortly, the EM conditional expectation for HMM is specified by (15).
(15)
Where,
Note that the conditional expectation is function of Δ. There are two constraints for HMM as follows:
Maximizing with subject to these constraints is optimization problem that is solved by Lagrangian duality theorem [7, p. 8]. Original optimization problem mentions minimizing target function but it is easy to infer that maximizing target function shares the same methodology. Let l(Δ, λ, μ) be Lagrangian function constructed from together with these constraints [9, p. 9], we have (16) for specifying HMM Lagrangian function as follows:
(16)
Where λ is n-component vector λ = (λ_{1}, λ_{2},…, λ_{n}) and μ is m-component vector μ = (μ_{1}, μ_{2},…, μ_{m}). Factors λ_{i} ≥ 0 and μ_{j} ≥ 0 are called Lagrange multipliers or Karush-Kuhn-Tucker multipliers [10] or dual variables. The expectation is specified by (15).
The parameter estimate is extreme point of the Lagrangian function. According to Lagrangian duality theorem [11, p. 216] [7, p. 8], we have:
The parameter estimate is determined by setting partial derivatives of l(Δ, λ, μ) with respect to a_{ij} and b_{j}(k) to be zero. The partial derivative of l(Δ, λ, μ) with respect to a_{ij} is:
Setting the partial derivative to be zero:
The parameter estimate is solution of equation , we have:
It is required to estimate the Lagrange multiplier λ_{i}. The multiplier estimate is determined by setting the partial derivative of l(Δ, λ, μ) with respect to λ_{i} to be zero as follows:
Substituting for a_{ij}, we have:
It implies:
Where, is index function.
Substituting for λ_{i} inside
We have:
Evaluating the numerator, we have:
(Due to total probability rule [4, p. 101])
(Due to multiplication rule [4, p. 100])
Evaluating the denominator, we have:
(Due to total probability rule [4, p. 101])
(Due to multiplication rule [4, p. 100])
It implies
Because of the convention , the estimate is fixed as follows:
The estimate of initial probability is known as specific estimate from pseudo-state x_{0} to state x_{1}=s_{j}. It means that
Recall that the parameter estimate is determined by setting partial derivatives of l(Δ, λ, μ) with respect to a_{ij} and b_{j}(k) to be zero. The parameter estimate was determined. Now it is required to calculate the parameter estimate . The partial derivative of Lagrangian function l(Δ, λ, μ) with respect to b_{j}(k) is:
Setting the partial derivative to be zero:
The parameter estimate is solution of equation , we have:
It is required to estimate the Lagrange multiplier μ_{j}. The multiplier estimate is determined by setting the partial derivative of l(Δ, λ, μ) with respect to μ_{j} to be zero as follows:
Substituting for b_{j}(k) we have:
It implies:
Where, is index function.
Substituting for μ_{j} inside
We have:
Evaluating the numerator, we have:
(Due to total probability rule [4, p. 101])
(Due to multiplication rule [4, p. 100])
Note, the expression expresses the sum of probabilities over T time points with condition o_{t} = φ_{k}.
Evaluating the denominator, we have:
(Due to total probability rule [4, p. 101])
(Due to multiplication rule [4, p. 100])
It implies
In general, the parameter estimate is totally determined as follows:
As a convention, we use notation Δ instead of Δ_{r} for denoting known HMM at current iteration of EM algorithm. We have (17) for specifying HMM parameter estimate given current parameter Δ = (a_{ij}, b_{j}(k), π_{j}) as follows:
(17)
The parameter estimate is the ultimate solution of the learning problem. As seen in (17), it is necessary to calculate probabilities P(O, x_{t–}_{1}=s_{i}, x_{t}=s_{j}) and P(O, x_{t–}_{1}=s_{i}) when other probabilities P(O, x_{t}=s_{j}), P(O, x_{1}=s_{i}), and P(O, x_{1}=s_{j}) are represented by the joint probability γ_{t} specified by (7).
Let ξ_{t}(i, j) is the joint probability that the stochastic process receives state s_{i} at time point t–1 and state s_{j} at time point t given observation sequence O [3, p. 264].
Given forward variable α_{t} and backward variable β_{t}, if , we have:
(Due to multiplication rule [4, p. 100])
(Because the partial observation sequence {o_{1}, o_{2},…, o_{t}} is independent from current state x_{t} given previous state x_{t}_{–1})
(Due to multiplication rule [4, p. 100])
(Because observations o_{t}, o_{t}_{+1}, o_{t}_{+2},…, o_{T} are mutually independent)
(Due to multiplication rule [4, p. 100])
(Due to multiplication rule [4, p. 100])
In general, equation (18) determines the joint probability ξ_{t}(i, j) based on forward variable α_{t} and backward variable β_{t}.
(18)
Where forward variable α_{t} and backward variable β_{t} are calculated by previous recurrence equations (2) and (5).
Shortly, the joint probability ξ_{t}(i, j) is constructed from forward variable and backward variable, as seen in fig. 5 [3, p. 264].
Recall that γ_{t}(j) is the joint probability that the stochastic process is in state s_{j} at time point t with observation sequence O = {o_{1}, o_{2},…, o_{T}}, specified by (7).
According to total probability rule [4, p. 101], it is easy to infer that γ_{t} is sum of ξ_{t} over all states with , as seen in (19).
(19)
Deriving from (18) and (19), we have:
By extending (17), we receive (20) for specifying HMM parameter estimate given current parameter Δ = (a_{ij}, b_{i}(k), π_{i}) in detailed.
(20)
Followings are interpretations relevant to the joint probabilities ξ_{t}(i, j) and γ_{t}(j) with observation sequence O.
- The sum expresses expected number of transitions from state s_{i} to state s_{j} [3, p. 265].
- The double sum expresses expected number of transitions from state s_{i} [3, p. 265].
- The sum expresses expected number of times in state s_{j} and in observation φ_{k} [3, p. 265].
- The sum expresses expected number of times in state s_{j} [3, p. 265].
Followings are interpretations of the parameter estimate :
- The transition estimate is expected frequency of transitions from state s_{i} to state s_{j}.
- The observation estimate is expected frequency of times in state s_{j} and in observation φ_{k}.
- The initial estimate is (normalized) expected frequency of state s_{j} at the first time point (t=1).
It is easy to infer that the parameter estimate is based on joint probabilities ξ_{t}(i, j) and γ_{t}(j) which, in turn, are based on current parameter Δ = (a_{ij}, b_{j}(k), π_{j}). The EM conditional expectation is determined by joint probabilities ξ_{t}(i, j) and γ_{t}(j); so, the main task of E-step in EM algorithm is essentially to calculate the joint probabilities ξ_{t}(i, j) and γ_{t}(j) according to (18) and (7). The EM conditional expectation gets maximal at estimate and so, the main task of M-step in EM algorithm is essentially to calculate according to (20). The EM algorithm is interpreted in HMM learning problem, as shown in table 9.
The algorithm to solve HMM learning problem shown in table 9 is known as Baum-Welch algorithm [3]. Please see document "Hidden Markov Models Fundamentals" by [9, pp. 8-13] for more details about HMM learning problem. As aforementioned in sub-section 4.1, the essence of EM algorithm applied into HMM learning problem is to determine the estimate .
As seen in table 9, it is not difficult to run E-step and M-step of EM algorithm but how to determine the terminating condition is considerable problem. It is better to establish a computational terminating criterion instead of applying the general statement "EM algorithm stops when it meets the terminating condition, for example, the difference of current parameter Δ and next parameter is insignificant". Going back the learning problem that EM algorithm solves, the EM algorithm aims to maximize probability P(O|Δ) of given observation sequence O=(o_{1}, o_{2},… , o_{T}) so as to find out the estimate . Maximizing the probability P(O|Δ) is equivalent to maximizing the conditional expectation. So it is easy to infer that EM algorithm stops when probability P(O|Δ) approaches to maximal value and EM algorithm cannot maximize P(O|Δ) any more. In other words, the probability P(O|Δ) is terminating criterion. Calculating criterion P(O|Δ) is evaluation problem described in section 2. Criterion P(O|Δ) is determined according to forward-backward procedure; please see tables 4 and 5 for more details about forward-backward procedure.
At the end of M-step, the next criterion P(O|) that is calculated based on the next parameter (also estimate) is compared with the current criterion P(O|Δ) that is calculated in the previous time. If these two criterions are the same or there is no significantly difference between them then, EM algorithm stops. This implies EM algorithm cannot maximize P(O|Δ) any more. However, calculating the next criterion P(O|) according to forward-backward procedure causes EM algorithm to run slowly. This drawback is overcome by following comment and improvement. The essence of forward-backward procedure is to determine forward variables α_{t} while EM algorithm must calculate all forward variables and backward variables in its learning process (E-step). Thus, the evaluation of terminating condition is accelerated by executing forward-backward procedure inside the E-step of EM algorithm. In other words, when EM algorithm results out forward variables in E-step, the forward-backward procedure takes advantages of such forward variables so as to determine criterion P(O|Δ) the at the same time. As a result, the speed of EM algorithm does not decrease. However, there is always a redundant iteration; suppose that the terminating criterion approaches to maximal value at the end of the r^{th} iteration but the EM algorithm only stops at the E-step of the (r+1)^{th} iteration when it really evaluates the terminating criterion. In general, the terminating criterion P(O|Δ) is calculated based on the current parameter Δ at E-step instead of the estimate at M-step. Table 10 shows the proposed implementation of EM algorithm with terminating criterion P(O|Δ). Pseudo-code like programming language C is used to describe the implementation of EM algorithm. Variables are marked as italic words and comments are followed by the signs // and /*. Please pay attention to programming language keywords: while, for, if, [], ==, !=, &&, //, /*, */, etc. For example, notation [] denotes array index operation; concretely, α[t][i] denotes forward variable α_{t}(i) at time point t with regard to state s_{i}.
According to table 10, the number of iterations is limited by a pre-defined maximum number, which aims to solve a so-called infinite loop optimization. Although it is proved that EM algorithm always converges, maybe there are two different estimates and at the final convergence. This situation causes EM algorithm to alternate between and in infinite loop. Therefore, the final estimate or is totally determined but the EM algorithm does not stop. This is the reason that the number of iterations is limited by a pre-defined maximum number.
Going back given weather HMM ∆ whose parameters A, B, and ∏ are specified in tables 1, 2, and 3, suppose observation sequence is O = {o_{1}=φ_{4}=soggy, o_{2}=φ_{1}=dry, o_{3}=φ_{2}=dryish}, the EM algorithm and its implementation described in tables 9 and 10 are applied into calculating the parameter estimate which is the ultimate solution of the learning problem, as below.
At the first iteration (r=1) we have:
Within the E-step of the first iteration (r=1), the terminating criterion P(O|Δ) is calculated according to forward-backward procedure (see table 4) as follows:
Within the E-step of the first iteration (r=1), the joint probabilities ξ_{t}(i,j) and γ_{t}(j) are calculated based on (18) and (7) as follows:
Within the M-step of the first iteration (r=1), the estimate is calculated based on joint probabilities ξ_{t}(i,j) and γ_{t}(j) determined at E-step.
At the second iteration (r=2), the current parameter Δ = (a_{ij}, b_{j}(k), π_{j}) is received values from the estimate above. By repeating the similar calculation, it is easy to determine HMM parameters at the second iteration. Table 11 summarizes HMM parameters resulted from the first iteration and the second iteration of EM algorithm.
As seen in table 11, the EM algorithm does not converge yet when it produces two different terminating criterions (0.013 and 0.0776) at the first iteration and the second iteration. It is necessary to run more iterations so as to gain the most optimal estimate. Within this example, the EM algorithm converges absolutely after 10 iterations when the criterion P(O|Δ) approaches to the same value 1 at the 9^{th} and 10^{th} iterations. Table 12 shows HMM parameter estimates along with terminating criterion P(O|Δ) at the 9^{th} and 10^{th} iterations of EM algorithm.
As a result, the learned parameters A, B, and ∏ are shown in table 13:
Such learned parameters are more appropriate to the training observation sequence O = {o_{1}=φ_{4}=soggy, o_{2}=φ_{1}=dry, o_{3}=φ_{2}=dryish} than the original ones shown in tables 1, 2, and 3 when the terminating criterion P(O|Δ) corresponding to its optimal state sequence is 1.
Now three main problems of HMM are described; please see an excellent document "A tutorial on hidden Markov models and selected applications in speech recognition" written by the author Rabiner [3] for advanced details about HMM.
5. Conclusion
In general, there are three main problems of HMM such as evaluation problem, uncovering problem, and learning problem. For evaluation problem and uncovering problem, researchers should pay attention to forward variable and backward variable. Most computational operations are relevant to them. They reflect unique aspect of HMM. The Viterbi algorithm is very effective to solve the uncovering problem. The Baum-Welch algorithm is often used to solve the learning problem. It is easier to explain Baum-Welch algorithm by combination of EM algorithm and optimization theory, in which the Lagrangian function is maximized so as to find out optimal parameters of EM algorithms when such parameters are also learned parameters of HMM.
Observations of normal HMM described in this report are quantified by discrete probability distribution which is observation probability matrix B. In the most general case, observation is represented by continuous variable and matrix B is replaced by probability density function. At that time the normal HMM becomes continuously observational HMM. Readers are recommended to research continuously observational HMM, an enhanced variant of normal HMM.
References