The '''forward–backward algorithm''' is an [[inference]] [[algorithm]] for [[hidden Markov models]] which computes the [[posterior probability|posterior]] [[marginal probability|marginals]] of all hidden state variables given a sequence of observations/emissions <math>o_{1:t}:= o_1,\dots,o_t</math>, i.e. it computes, for all hidden state variables <math>X_k \in \{X_1, \dots, X_t\}</math>, the distribution <math>P(X_k\ |\ o_{1:t})</math>. This inference task is usually called ''smoothing''. The algorithm makes use of the principle of [[dynamic programming]] to efficiently compute the values that are required to obtain the posterior marginal distributions in two passes. The first pass goes forward in time while the second goes backward in time; hence the name ''forward–backward algorithm''.

The term ''forward–backward algorithm'' is also used to refer to any algorithm belonging to the general class of algorithms that operate on sequence models in a forward–backward manner. In this sense, the descriptions in the remainder of this article refer to only one specific instance of this class.
== Overview ==
In the first pass, the forward–backward algorithm computes a set of forward probabilities which provide, for all <math>k \in \{1, \dots, t\}</math>, the probability of ending up in any particular state given the first <math>k</math> observations in the sequence, i.e. <math>P(X_k\ |\ o_{1:k})</math>. In the second pass, the algorithm computes a set of backward probabilities which provide the probability of observing the remaining observations given any starting point <math>k</math>, i.e. <math>P(o_{k+1:t}\ |\ X_k)</math>. These two sets of probability distributions can then be combined to obtain the distribution over states at any specific point in time given the entire observation sequence:

:<math>P(X_k\ |\ o_{1:t}) = P(X_k\ |\ o_{1:k}, o_{k+1:t}) \propto P(o_{k+1:t}\ |\ X_k) P(X_k\ |\ o_{1:k})</math>

The last step follows from an application of [[Bayes' rule]] and the [[conditional independence]] of <math>o_{k+1:t}</math> and <math>o_{1:k}</math> given <math>X_k</math>.

As outlined above, the algorithm involves three steps:
# computing forward probabilities
# computing backward probabilities
# computing smoothed values.

The forward and backward steps may also be called "forward message pass" and "backward message pass"; these terms are due to the ''message-passing'' used in general [[belief propagation]] approaches. At each single observation in the sequence, probabilities to be used for calculations at the next observation are computed. The smoothing step can be calculated simultaneously during the backward pass. This step allows the algorithm to take into account all past observations when computing more accurate results.

The forward–backward algorithm can be used to find the most likely state for any point in time. It cannot, however, be used to find the most likely ''sequence'' of states (see [[Viterbi algorithm]]).
== Forward probabilities ==
The following description will use matrices of probability values rather than probability distributions, although in general the forward–backward algorithm can be applied to continuous as well as discrete probability models.

We transform the probability distributions related to a given [[hidden Markov model]] into matrix notation as follows. The transition probabilities <math>\mathbf{P}(X_t\mid X_{t-1})</math> of a given random variable <math>X_t</math> representing all possible states in the hidden Markov model will be represented by the matrix <math>\mathbf{T}</math>, where the row index ''i'' represents the start state and the column index ''j'' represents the target state. The example below represents a system where the probability of staying in the same state after each step is 70% and the probability of transitioning to the other state is 30%. The transition matrix is then:

:<math>\mathbf{T} = \begin{pmatrix}
 0.7 & 0.3 \\
 0.3 & 0.7
 \end{pmatrix}
</math>
In a typical Markov model we would multiply a state vector by this matrix to obtain the probabilities for the subsequent state. In a hidden Markov model the state is unknown, and we instead observe events associated with the possible states. An event matrix of the form:

:<math>\mathbf{B} = \begin{pmatrix}
 0.9 & 0.1 \\
 0.2 & 0.8
 \end{pmatrix}
</math>

provides the probabilities for observing events given a particular state. In the above example, event 1 will be observed 90% of the time if we are in state 1 while event 2 has a 10% probability of occurring in this state. In contrast, event 1 will only be observed 20% of the time if we are in state 2 and event 2 has an 80% chance of occurring. Given a state vector (<math>\mathbf{\pi}</math>), the probability of observing event j is then:

:<math>\mathbf{P}(O = j)=\sum_{i} \pi_i b_{i,j}</math>

This can be represented in matrix form by multiplying the state vector (<math>\mathbf{\pi}</math>) by an observation matrix (<math>\mathbf{O_j} = \mathrm{diag}(b_{*,o_j})</math>) containing only diagonal entries. Each entry is the probability of the observed event given each state. Continuing the above example, an observation of event 1 would be:

:<math>\mathbf{O_1} = \begin{pmatrix}
 0.9 & 0.0 \\
 0.0 & 0.2
 \end{pmatrix}
</math>
This allows us to calculate the probabilities associated with transitioning to a new state and observing the given event as:

:<math>
\mathbf{f_{0:1}} = \mathbf{\pi} \mathbf{T} \mathbf{O_1}
</math>

The resulting probability vector contains entries indicating the probability of transitioning to each state and observing the given event. This process can be carried forward with additional observations using:

:<math>
\mathbf{f_{0:t}} = \mathbf{f_{0:t-1}} \mathbf{T} \mathbf{O_t}
</math>

This value is the forward probability vector. The ''i''-th entry of this vector provides:

:<math>
\mathbf{f_{0:t}}(i) = \mathbf{P}(o_1, o_2, \dots, o_t, X_t=x_i | \mathbf{\pi} )
</math>
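The forward recursion can be sketched in plain Python (a minimal sketch with illustrative names; each diagonal matrix <math>\mathbf{O_t}</math> is stored as the vector of its diagonal entries):

<source lang="python">
def forward_step(f, T, e):
    # One step of f_{0:t} = f_{0:t-1} T O_t, where O_t = diag(e) and
    # e[j] is the probability of the observed event given state j.
    n = len(f)
    return [e[j] * sum(f[i] * T[i][j] for i in range(n)) for j in range(n)]

# The two-state example from above
T = [[0.7, 0.3], [0.3, 0.7]]
pi = [0.5, 0.5]        # uniform prior state vector
e1 = [0.9, 0.2]        # diagonal of O_1 (event 1 observed)

f01 = forward_step(pi, T, e1)   # unnormalized forward vector, ~ [0.45, 0.10]
</source>

Applying <code>forward_step</code> repeatedly to the previous result carries the recursion forward through the observation sequence.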
Typically, we will normalize the probability vector at each step so that its entries sum to 1. A scaling factor is thus introduced at each step such that:

:<math>
\mathbf{\hat{f}_{0:t}} = c_t^{-1}\ \mathbf{\hat{f}_{0:t-1}} \mathbf{T} \mathbf{O_t}
</math>

where <math>\mathbf{\hat{f}_{0:t-1}}</math> represents the scaled vector from the previous step and <math>c_t</math> represents the scaling factor that causes the resulting vector's entries to sum to 1. The product of the scaling factors is the total probability for observing the given events irrespective of the final states:

:<math>
\mathbf{P}(o_1, o_2, \dots, o_t|\mathbf{\pi}) = \prod_{s=1}^t c_s
</math>

This allows us to interpret the scaled probability vector as:

:<math>
\mathbf{\hat{f}_{0:t}}(i) =
 \frac{\mathbf{f_{0:t}}(i)}{\prod_{s=1}^t c_s} =
 \frac{\mathbf{P}(o_1, o_2, \dots, o_t, X_t=x_i | \mathbf{\pi} )}{\mathbf{P}(o_1, o_2, \dots, o_t|\mathbf{\pi})} =
 \mathbf{P}(X_t=x_i | o_1, o_2, \dots, o_t, \mathbf{\pi} )
</math>

We thus find that the product of the scaling factors provides us with the total probability for observing the given sequence up to time t and that the scaled probability vector provides us with the probability of being in each state at this time.
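The scaled forward pass can be sketched as follows (plain Python with illustrative names; each <math>\mathbf{O_t}</math> is again stored as the vector of its diagonal entries):

<source lang="python">
def scaled_forward(pi, T, emissions):
    # emissions[t] is the diagonal of O_{t+1}: the per-state probability
    # of the event observed at step t+1.
    n = len(pi)
    f, scales = pi[:], []
    for e in emissions:
        un = [e[j] * sum(f[i] * T[i][j] for i in range(n)) for j in range(n)]
        c = sum(un)                  # scaling factor c_t
        scales.append(c)
        f = [u / c for u in un]      # normalized vector \hat{f}_{0:t}
    return f, scales

T = [[0.7, 0.3], [0.3, 0.7]]
f, scales = scaled_forward([0.5, 0.5], T, [[0.9, 0.2], [0.9, 0.2]])
# f ~ [0.8834, 0.1166]; the product of the scales is P(o_1, o_2 | pi)
</source>

The running product of the returned scaling factors gives the likelihood of the observations seen so far, as in the equation above.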
== Backward probabilities ==
A similar procedure can be constructed to find backward probabilities. These are intended to provide the probabilities:

:<math>
\mathbf{b_{t:T}}(i) = \mathbf{P}(o_{t+1}, o_{t+2}, \dots, o_{T} | X_t=x_i )
</math>

That is, we now assume that we start in a particular state (<math>X_t=x_i</math>) and are interested in the probability of observing all future events from this state. Since the initial state is taken as given (i.e. its prior probability is 100%), we begin with:

:<math>
\mathbf{b_{T:T}} = [1\ 1\ 1\ \dots]^T
</math>

Notice that we are now using a column vector while the forward probabilities used row vectors. We can then work backwards using:

:<math>
\mathbf{b_{t-1:T}} = \mathbf{T}\mathbf{O_t}\mathbf{b_{t:T}}
</math>
While we could normalize this vector as well so that its entries sum to one, this is not usually done. Noting that each entry contains the probability of the future event sequence given a particular initial state, normalizing this vector would be equivalent to applying Bayes' theorem to find the likelihood of each initial state given the future events (assuming uniform priors for the final state vector). However, it is more common to scale this vector using the same <math>c_t</math> constants used in the forward probability calculations. <math>\mathbf{b_{T:T}}</math> is not scaled, but subsequent operations use:

:<math>
\mathbf{\hat{b}_{t-1:T}} = c_t^{-1} \mathbf{T}\mathbf{O_t}\mathbf{\hat{b}_{t:T}}
</math>

where <math>\mathbf{\hat{b}_{t:T}}</math> represents the previous, scaled vector. The result is that the scaled probability vector is related to the backward probabilities by:

:<math>
\mathbf{\hat{b}_{t:T}}(i) =
 \frac{\mathbf{b_{t:T}}(i)}{\prod_{s=t+1}^T c_s}
</math>
This is useful because it allows us to find the total probability of being in each state at a given time, t, by multiplying these values:

:<math>
\mathbf{\gamma_t}(i) =
 \mathbf{P}(X_t=x_i | o_1, o_2, \dots, o_T, \mathbf{\pi}) =
 \frac{ \mathbf{P}(o_1, o_2, \dots, o_T, X_t=x_i | \mathbf{\pi} ) }{ \mathbf{P}(o_1, o_2, \dots, o_T | \mathbf{\pi} ) } =
 \frac{ \mathbf{f_{0:t}}(i) \cdot \mathbf{b_{t:T}}(i) }{ \prod_{s=1}^T c_s } =
 \mathbf{\hat{f}_{0:t}}(i) \cdot \mathbf{\hat{b}_{t:T}}(i)
</math>

To understand this, we note that <math>\mathbf{f_{0:t}}(i) \cdot \mathbf{b_{t:T}}(i)</math> provides the probability for observing the given events in a way that passes through state <math>x_i</math> at time t. This probability includes the forward probabilities covering all events up to time t as well as the backward probabilities which include all future events. This is the numerator we are looking for in our equation, and we divide by the total probability of the observation sequence to normalize this value and extract only the probability that <math>X_t=x_i</math>. These values are sometimes called the "smoothed values" as they combine the forward and backward probabilities to compute a final probability.

The values <math>\mathbf{\gamma_t}(i)</math> thus provide the probability of being in each state at time t. As such, they are useful for determining the most probable state at any time. Note, however, that the term "most probable state" is somewhat ambiguous. While the most probable state is the most likely to be correct at a given point, the sequence of individually probable states is not likely to be the most probable sequence. This is because the probabilities for each point are calculated independently of each other. They do not take into account the transition probabilities between states, and it is thus possible to get states at two moments (t and t+1) that are both most probable at those time points but which have very little probability of occurring together, i.e. <math> \mathbf{P}(X_t=x_i,X_{t+1}=x_j) \neq \mathbf{P}(X_t=x_i) \mathbf{P}(X_{t+1}=x_j) </math>. The most probable sequence of states that produced an observation sequence can be found using the [[Viterbi algorithm]].
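Putting the two passes together, a complete smoother can be sketched in plain Python (a sketch under the conventions above; names are illustrative, and each <math>\mathbf{O_t}</math> is stored as the vector of its diagonal entries):

<source lang="python">
def smooth(pi, T, emissions):
    n, L = len(pi), len(emissions)
    # Scaled forward pass: f_hat[t] holds \hat{f}_{0:t+1}
    f_hat, f = [], pi[:]
    for e in emissions:
        un = [e[j] * sum(f[i] * T[i][j] for i in range(n)) for j in range(n)]
        c = sum(un)
        f = [u / c for u in un]
        f_hat.append(f)
    # Backward pass: bs[t] holds b_{t+1:L} (unscaled; normalized when combined)
    b, bs = [1.0] * n, [[1.0] * n]
    for e in reversed(emissions[1:]):
        b = [sum(T[i][j] * e[j] * b[j] for j in range(n)) for i in range(n)]
        bs.insert(0, b)
    # Combine and renormalize to obtain the smoothed values gamma_t
    gammas = []
    for f, b in zip(f_hat, bs):
        g = [fi * bi for fi, bi in zip(f, b)]
        s = sum(g)
        gammas.append([x / s for x in g])
    return gammas

# Umbrella-world observations {u, u, no u, u, u} from the example below:
T = [[0.7, 0.3], [0.3, 0.7]]
em = [[0.9, 0.2], [0.9, 0.2], [0.1, 0.8], [0.9, 0.2], [0.9, 0.2]]
gammas = smooth([0.5, 0.5], T, em)
# gammas[2] ~ [0.3075, 0.6925]: "no rain" is most probable on day 3
</source>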
== Example ==
This example takes as its basis the umbrella world in [[#RussellNorvig10|Russell & Norvig 2010 Chapter 15 pp. 566]], in which we would like to infer the weather given observations of a man either carrying or not carrying an umbrella. We assume two possible states for the weather: state 1 = rain, state 2 = no rain. We assume that the weather has a 70% chance of staying the same each day and a 30% chance of changing. The transition probabilities are then:

:<math>\mathbf{T} = \begin{pmatrix}
 0.7 & 0.3 \\
 0.3 & 0.7
 \end{pmatrix}
</math>

We also assume each state generates one of two events: event 1 = umbrella, event 2 = no umbrella. The conditional probabilities for these occurring in each state are given by the probability matrix:

:<math>\mathbf{B} = \begin{pmatrix}
 0.9 & 0.1 \\
 0.2 & 0.8
 \end{pmatrix}
</math>

We then observe the following sequence of events: {umbrella, umbrella, no umbrella, umbrella, umbrella}, which we will represent in our calculations as:

:<math>
\mathbf{O_1} = \begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix}~~\mathbf{O_2} = \begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix}~~\mathbf{O_3} = \begin{pmatrix}0.1 & 0.0 \\ 0.0 & 0.8 \end{pmatrix}~~\mathbf{O_4} = \begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix}~~\mathbf{O_5} = \begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix}
</math>

Note that <math>\mathbf{O_3}</math> differs from the others because of the "no umbrella" observation.
In computing the forward probabilities we begin with:

:<math>
\mathbf{f_{0:0}}= \begin{pmatrix} 0.5 & 0.5 \end{pmatrix}
</math>

which is our prior state vector indicating that we don't know which state the weather is in before our observations. While a state vector should be given as a row vector, we will use the transpose so that the calculations below are easier to read. Our calculations are then written in the form:

:<math>
(\mathbf{\hat{f}_{0:t}})^T = c^{-1}\mathbf{O_t}(\mathbf{T})^T(\mathbf{\hat{f}_{0:t-1}})^T
</math>

instead of:

:<math>
\mathbf{\hat{f}_{0:t}} = c^{-1}\mathbf{\hat{f}_{0:t-1}} \mathbf{T} \mathbf{O_t}
</math>

Notice that the transition matrix is also transposed, but in our example the transpose is equal to the original matrix. Performing these calculations and normalizing the results provides:
:<math>
(\mathbf{\hat{f}_{0:1}})^T =
 c_1^{-1}\begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix}\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}\begin{pmatrix}0.5000 \\ 0.5000 \end{pmatrix}=
 c_1^{-1}\begin{pmatrix}0.4500 \\ 0.1000\end{pmatrix}=
 \begin{pmatrix}0.8182 \\ 0.1818 \end{pmatrix}
</math>

:<math>
(\mathbf{\hat{f}_{0:2}})^T =
 c_2^{-1}\begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix}\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}\begin{pmatrix}0.8182 \\ 0.1818 \end{pmatrix}=
 c_2^{-1}\begin{pmatrix}0.5645 \\ 0.0745\end{pmatrix}=
 \begin{pmatrix}0.8834 \\ 0.1166 \end{pmatrix}
</math>

:<math>
(\mathbf{\hat{f}_{0:3}})^T =
 c_3^{-1}\begin{pmatrix}0.1 & 0.0 \\ 0.0 & 0.8 \end{pmatrix}\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}\begin{pmatrix}0.8834 \\ 0.1166 \end{pmatrix}=
 c_3^{-1}\begin{pmatrix}0.0653 \\ 0.2772\end{pmatrix}=
 \begin{pmatrix}0.1907 \\ 0.8093 \end{pmatrix}
</math>

:<math>
(\mathbf{\hat{f}_{0:4}})^T =
 c_4^{-1}\begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix}\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}\begin{pmatrix}0.1907 \\ 0.8093 \end{pmatrix}=
 c_4^{-1}\begin{pmatrix}0.3386 \\ 0.1247\end{pmatrix}=
 \begin{pmatrix}0.7308 \\ 0.2692 \end{pmatrix}
</math>

:<math>
(\mathbf{\hat{f}_{0:5}})^T =
 c_5^{-1}\begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix}\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}\begin{pmatrix}0.7308 \\ 0.2692 \end{pmatrix}=
 c_5^{-1}\begin{pmatrix}0.5331 \\ 0.0815\end{pmatrix}=
 \begin{pmatrix}0.8673 \\ 0.1327 \end{pmatrix}
</math>
For the backward probabilities we start with:

:<math>
\mathbf{b_{5:5}} = \begin{pmatrix} 1.0 \\ 1.0\end{pmatrix}
</math>

We are then able to compute (using the observations in reverse order and normalizing with different constants):

:<math>
\mathbf{\hat{b}_{4:5}} = \alpha\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}\begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix}\begin{pmatrix}1.0000 \\ 1.0000 \end{pmatrix}=\alpha\begin{pmatrix}0.6900 \\ 0.4100\end{pmatrix}=\begin{pmatrix}0.6273 \\ 0.3727 \end{pmatrix}
</math>

:<math>
\mathbf{\hat{b}_{3:5}} = \alpha\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}\begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix}\begin{pmatrix}0.6273 \\ 0.3727 \end{pmatrix}=\alpha\begin{pmatrix}0.4175 \\ 0.2215\end{pmatrix}=\begin{pmatrix}0.6533 \\ 0.3467 \end{pmatrix}
</math>

:<math>
\mathbf{\hat{b}_{2:5}} = \alpha\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}\begin{pmatrix}0.1 & 0.0 \\ 0.0 & 0.8 \end{pmatrix}\begin{pmatrix}0.6533 \\ 0.3467 \end{pmatrix}=\alpha\begin{pmatrix}0.1289 \\ 0.2138\end{pmatrix}=\begin{pmatrix}0.3763 \\ 0.6237 \end{pmatrix}
</math>

:<math>
\mathbf{\hat{b}_{1:5}} = \alpha\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}\begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix}\begin{pmatrix}0.3763 \\ 0.6237 \end{pmatrix}=\alpha\begin{pmatrix}0.2745 \\ 0.1889\end{pmatrix}=\begin{pmatrix}0.5923 \\ 0.4077 \end{pmatrix}
</math>

:<math>
\mathbf{\hat{b}_{0:5}} = \alpha\begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}\begin{pmatrix}0.9 & 0.0 \\ 0.0 & 0.2 \end{pmatrix}\begin{pmatrix}0.5923 \\ 0.4077 \end{pmatrix}=\alpha\begin{pmatrix}0.3976 \\ 0.2170\end{pmatrix}=\begin{pmatrix}0.6469 \\ 0.3531 \end{pmatrix}
</math>
Finally, we will compute the smoothed probability values. These results must also be scaled so that their entries sum to 1, because we did not scale the backward probabilities with the <math>c_t</math>'s found earlier. The backward probability vectors above thus actually represent the likelihood of each state at time t given the future observations. Because these vectors are proportional to the actual backward probabilities, the result has to be scaled an additional time.

:<math>
(\mathbf{\gamma_0})^T = \alpha\begin{pmatrix}0.5000 \\ 0.5000 \end{pmatrix}\circ \begin{pmatrix}0.6469 \\ 0.3531 \end{pmatrix}=\alpha\begin{pmatrix}0.3235 \\ 0.1765\end{pmatrix}=\begin{pmatrix}0.6469 \\ 0.3531 \end{pmatrix}
</math>

:<math>
(\mathbf{\gamma_1})^T = \alpha\begin{pmatrix}0.8182 \\ 0.1818 \end{pmatrix}\circ \begin{pmatrix}0.5923 \\ 0.4077 \end{pmatrix}=\alpha\begin{pmatrix}0.4846 \\ 0.0741\end{pmatrix}=\begin{pmatrix}0.8673 \\ 0.1327 \end{pmatrix}
</math>

:<math>
(\mathbf{\gamma_2})^T = \alpha\begin{pmatrix}0.8834 \\ 0.1166 \end{pmatrix}\circ \begin{pmatrix}0.3763 \\ 0.6237 \end{pmatrix}=\alpha\begin{pmatrix}0.3324 \\ 0.0728\end{pmatrix}=\begin{pmatrix}0.8204 \\ 0.1796 \end{pmatrix}
</math>

:<math>
(\mathbf{\gamma_3})^T = \alpha\begin{pmatrix}0.1907 \\ 0.8093 \end{pmatrix}\circ \begin{pmatrix}0.6533 \\ 0.3467 \end{pmatrix}=\alpha\begin{pmatrix}0.1246 \\ 0.2806\end{pmatrix}=\begin{pmatrix}0.3075 \\ 0.6925 \end{pmatrix}
</math>

:<math>
(\mathbf{\gamma_4})^T = \alpha\begin{pmatrix}0.7308 \\ 0.2692 \end{pmatrix}\circ \begin{pmatrix}0.6273 \\ 0.3727 \end{pmatrix}=\alpha\begin{pmatrix}0.4584 \\ 0.1003\end{pmatrix}=\begin{pmatrix}0.8204 \\ 0.1796 \end{pmatrix}
</math>

:<math>
(\mathbf{\gamma_5})^T = \alpha\begin{pmatrix}0.8673 \\ 0.1327 \end{pmatrix}\circ \begin{pmatrix}1.0000 \\ 1.0000 \end{pmatrix}=\alpha\begin{pmatrix}0.8673 \\ 0.1327 \end{pmatrix}=\begin{pmatrix}0.8673 \\ 0.1327 \end{pmatrix}
</math>
Notice that the value of <math>\mathbf{\gamma_0}</math> is equal to <math>\mathbf{\hat{b}_{0:5}}</math> and that <math>\mathbf{\gamma_5}</math> is equal to <math>\mathbf{\hat{f}_{0:5}}</math>. This follows naturally because both <math>\mathbf{\hat{f}_{0:5}}</math> and <math>\mathbf{\hat{b}_{0:5}}</math> begin with uniform priors over the initial and final state vectors (respectively) and take into account all of the observations. However, <math>\mathbf{\gamma_0}</math> will only be equal to <math>\mathbf{\hat{b}_{0:5}}</math> when our initial state vector represents a uniform prior (i.e. all entries are equal). When this is not the case, <math>\mathbf{\hat{b}_{0:5}}</math> needs to be combined with the initial state vector to find the most likely initial state. We thus find that the forward probabilities by themselves are sufficient to calculate the most likely final state. Similarly, the backward probabilities can be combined with the initial state vector to provide the most probable initial state given the observations. The forward and backward probabilities need only be combined to infer the most probable states between the initial and final points.

The calculations above reveal that the most probable weather state on every day except for the third one was "rain". They tell us more than this, however, as they now provide a way to quantify the probabilities of each state at different times. Perhaps most importantly, our value at <math>\mathbf{\gamma_5}</math> quantifies our knowledge of the state vector at the end of the observation sequence. We can then use this to predict the probability of the various weather states tomorrow as well as the probability of observing an umbrella.
== Performance ==
The brute-force procedure for the solution of this problem is the generation of all possible <math>N^T</math> state sequences and calculating the joint probability of each state sequence with the observed series of events. This approach has [[time complexity]] <math> O(T \cdot N^T) </math>, where <math>T</math> is the length of sequences and <math>N</math> is the number of symbols in the state alphabet. This is intractable for realistic problems, as the number of possible hidden node sequences typically is extremely high. However, the forward–backward algorithm has time complexity <math> O(N^2 T)\, </math>.

An enhancement to the general forward–backward algorithm, called the [[Island algorithm]], trades smaller memory usage for longer running time, taking <math> O(N^2 T \log T)\, </math> time and <math> O(N^2 \log T)\, </math> memory. On a computer with an unlimited number of processors, this can be reduced to <math> O(N^2 T)\, </math> total time, while still taking only <math> O(N^2 \log T)\, </math> memory.<ref>J. Binder, K. Murphy and S. Russell. Space-Efficient Inference in Dynamic Probabilistic Networks. Int'l, Joint Conf. on Artificial Intelligence, 1997.</ref>

In addition, algorithms have been developed to compute <math>\mathbf{f_{0:t+1}}</math> efficiently through online smoothing, such as the fixed-lag smoothing (FLS) algorithm ([[#RussellNorvig10|Russell & Norvig 2010 Figure 15.6 pp. 580]]).
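For comparison, the brute-force computation of a smoothed value can be sketched by enumerating every state sequence (a sketch with illustrative names, using the umbrella-world model from the example above; practical only for tiny <math>N</math> and <math>T</math>):

<source lang="python">
from itertools import product

def brute_force_smoothed(pi, T, emissions, t, state):
    # P(X_t = state | o_1..o_T) by summing over all state sequences
    # (x_0, x_1, ..., x_T): O(T * N^T) work instead of O(N^2 T).
    n, L = len(pi), len(emissions)
    num = den = 0.0
    for seq in product(range(n), repeat=L + 1):
        p = pi[seq[0]]
        for s in range(1, L + 1):
            p *= T[seq[s - 1]][seq[s]] * emissions[s - 1][seq[s]]
        den += p
        if seq[t] == state:
            num += p
    return num / den

T = [[0.7, 0.3], [0.3, 0.7]]
em = [[0.9, 0.2], [0.9, 0.2], [0.1, 0.8], [0.9, 0.2], [0.9, 0.2]]
p_rain_day3 = brute_force_smoothed([0.5, 0.5], T, em, 3, 0)
# ~ 0.3075, matching the smoothed value computed in the example above
</source>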
== Pseudocode ==
<pre>
ForwardBackward(guessState, sequenceIndex):
    if sequenceIndex is past the end of the sequence, return 1
    if (guessState, sequenceIndex) has been seen before, return saved result
    result = 0
    for each neighboring state n:
        result = result + (transition probability from guessState to n
                           given observation element at sequenceIndex)
                          * ForwardBackward(n, sequenceIndex + 1)
    save result for (guessState, sequenceIndex)
    return result
</pre>
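The pseudocode above can be translated into Python along the following lines (a sketch; <code>trans_obs</code> is a hypothetical helper giving the probability of moving from one state to another while emitting the observation at a given position):

<source lang="python">
from functools import lru_cache

def make_forward_backward(states, trans_obs, seq_len):
    @lru_cache(maxsize=None)          # the "seen before" memoization
    def fb(guess_state, t):
        if t >= seq_len:              # past the end of the sequence
            return 1.0
        return sum(trans_obs(guess_state, n, t) * fb(n, t + 1)
                   for n in states)
    return fb

# Umbrella-world instance: T[s][n] * (emission prob. of observation t in n)
T = [[0.7, 0.3], [0.3, 0.7]]
em = [[0.9, 0.2], [0.9, 0.2], [0.1, 0.8], [0.9, 0.2], [0.9, 0.2]]
fb = make_forward_backward((0, 1),
                           lambda s, n, t: T[s][n] * em[t][n],
                           len(em))

# Weighting by a uniform prior gives the total observation likelihood
likelihood = 0.5 * fb(0, 0) + 0.5 * fb(1, 0)
</source>

Note that this recursion computes the backward quantities of the article; memoization keeps the running time polynomial rather than exponential.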
== Python example ==
Given an HMM (just like in the [[Viterbi algorithm]]) represented in the [[Python programming language]]:

<source lang="python">
states = ('Healthy', 'Fever')
end_state = 'E'

observations = ('normal', 'cold', 'dizzy')

start_probability = {'Healthy': 0.6, 'Fever': 0.4}

transition_probability = {
    'Healthy': {'Healthy': 0.69, 'Fever': 0.3, 'E': 0.01},
    'Fever': {'Healthy': 0.4, 'Fever': 0.59, 'E': 0.01},
}

emission_probability = {
    'Healthy': {'normal': 0.5, 'cold': 0.4, 'dizzy': 0.1},
    'Fever': {'normal': 0.1, 'cold': 0.3, 'dizzy': 0.6},
}
</source>
We can write the implementation like this:

<source lang="python">
def fwd_bkw(x, states, a_0, a, e, end_st):
    L = len(x)

    fwd = []
    f_prev = {}
    # forward part of the algorithm
    for i, x_i in enumerate(x):
        f_curr = {}
        for st in states:
            if i == 0:
                # base case for the forward part
                prev_f_sum = a_0[st]
            else:
                prev_f_sum = sum(f_prev[k] * a[k][st] for k in states)
            f_curr[st] = e[st][x_i] * prev_f_sum
        fwd.append(f_curr)
        f_prev = f_curr

    p_fwd = sum(f_curr[k] * a[k][end_st] for k in states)

    bkw = []
    b_prev = {}
    # backward part of the algorithm
    for i, x_i_plus in enumerate(reversed(x[1:] + (None,))):
        b_curr = {}
        for st in states:
            if i == 0:
                # base case for the backward part
                b_curr[st] = a[st][end_st]
            else:
                b_curr[st] = sum(a[st][l] * e[l][x_i_plus] * b_prev[l] for l in states)
        bkw.insert(0, b_curr)
        b_prev = b_curr

    p_bkw = sum(a_0[l] * e[l][x[0]] * b_curr[l] for l in states)

    # merging the two parts
    posterior = {}
    for st in states:
        posterior[st] = [fwd[i][st] * bkw[i][st] / p_fwd for i in range(L)]

    # both passes must yield the same total probability; compare with a
    # tolerance rather than == to allow for floating-point rounding
    assert abs(p_fwd - p_bkw) < 1e-12
    return fwd, bkw, posterior
</source>
The function <code>fwd_bkw</code> takes the following arguments: <code>x</code> is the sequence of observations, e.g. <code>['normal', 'cold', 'dizzy']</code>; <code>states</code> is the set of hidden states; <code>a_0</code> is the start probability; <code>a</code> are the transition probabilities; and <code>e</code> are the emission probabilities.

For simplicity of code, we assume that the observation sequence <code>x</code> is non-empty and that <code>a[i][j]</code> and <code>e[i][j]</code> are defined for all states i, j.
In the running example, the forward–backward algorithm is used as follows:

<source lang="python">
def example():
    return fwd_bkw(observations,
                   states,
                   start_probability,
                   transition_probability,
                   emission_probability,
                   end_state)

for line in example():
    print(' '.join(map(str, line)))
</source>
== See also ==
* [[Baum-Welch algorithm]]
* [[Viterbi algorithm]]
* [[BCJR algorithm]]

== References ==
{{reflist}}
* [[Lawrence Rabiner|Lawrence R. Rabiner]], A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. ''Proceedings of the [[IEEE]]'', 77 (2), p. 257–286, February 1989. [http://dx.doi.org/10.1109/5.18626 10.1109/5.18626]
* {{cite journal |author=Lawrence R. Rabiner, B. H. Juang |title=An introduction to hidden Markov models |journal=IEEE ASSP Magazine |month=January |pages=4–15 |year=1986}}
* {{cite book |author=Eugene Charniak |title=Statistical Language Learning |publisher=MIT Press |publication-place=Cambridge, Massachusetts |year=1993 |isbn=978-0-262-53141-2}}
<cite id=RussellNorvig10>
* {{cite book |author=Stuart Russell and Peter Norvig |title=Artificial Intelligence: A Modern Approach |edition=3rd |publisher=Pearson Education/Prentice-Hall |publication-place=Upper Saddle River, New Jersey |year=2010 |isbn=978-0-13-604259-4}}
</cite>

== External links ==
* [http://www.cs.jhu.edu/~jason/papers/#eisner-2002-tnlp An interactive spreadsheet for teaching the forward–backward algorithm] (spreadsheet and article with step-by-step walk-through)
* [http://www.cs.brown.edu/research/ai/dynamics/tutorial/Documents/HiddenMarkovModels.html Tutorial of hidden Markov models including the forward–backward algorithm]
* [http://code.google.com/p/aima-java/ Collection of AI algorithms implemented in Java] (including HMM and the forward–backward algorithm)

{{DEFAULTSORT:Forward-backward algorithm}}
[[Category:Dynamic programming]]
[[Category:Error detection and correction]]
[[Category:Machine learning algorithms]]
[[Category:Markov models]]
Revision as of 23:44, 24 June 2013
The forward–backward algorithm is an inference algorithm for hidden Markov models which computes the posterior marginals of all hidden state variables given a sequence of observations/emissions , i.e. it computes, for all hidden state variables , the distribution . This inference task is usually called smoothing. The algorithm makes use of the principle of dynamic programming to compute efficiently the values that are required to obtain the posterior marginal distributions in two passes. The first pass goes forward in time while the second goes backward in time; hence the name forward–backward algorithm.
The term forward–backward algorithm is also used to refer to any algorithm belonging to the general class of algorithms that operate on sequence models in a forward–backward manner. In this sense, the descriptions in the remainder of this article refer but to one specific instance of this class.
Overview
In the first pass, the forward–backward algorithm computes a set of forward probabilities which provide, for all , the probability of ending up in any particular state given the first observations in the sequence, i.e. . In the second pass, the algorithm computes a set of backward probabilities which provide the probability of observing the remaining observations given any starting point , i.e. . These two sets of probability distributions can then be combined to obtain the distribution over states at any specific point in time given the entire observation sequence:
The last step follows from an application of Bayes' rule and the conditional independence of and given .
As outlined above, the algorithm involves three steps:
- computing forward probabilities
- computing backward probabilities
- computing smoothed values.
The forward and backward steps may also be called "forward message pass" and "backward message pass" - these terms are due to the message-passing used in general belief propagation approaches. At each single observation in the sequence, probabilities to be used for calculations at the next observation are computed. The smoothing step can be calculated simultaneously during the backward pass. This step allows the algorithm to take into account any past observations of output for computing more accurate results.
The forward–backward algorithm can be used to find the most likely state for any point in time. It cannot, however, be used to find the most likely sequence of states (see Viterbi algorithm).
Forward probabilities
The following description will use matrices of probability values rather than probability distributions, although in general the forward-backward algorithm can be applied to continuous as well as discrete probability models.
We transform the probability distributions related to a given hidden Markov model into matrix notation as follows. The transition probabilities of a given random variable representing all possible states in the hidden Markov model will be represented by the matrix where the row index, i, will represent the start state and the column index, j, represents the target state. The example below represents a system where the probability of staying in the same state after each step is 70% and the probability of transitioning to the other state is 30%. The transition matrix is then:
In a typical Markov model, we would multiply a state vector by this matrix to obtain the probabilities for the subsequent state. In a hidden Markov model the state is unknown, and we instead observe events associated with the possible states. An event matrix of the form:
:<math>\mathbf{B} = \begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix}</math>
provides the probabilities for observing events given a particular state. In the above example, event 1 will be observed 90% of the time if we are in state 1, while event 2 has a 10% probability of occurring in this state. In contrast, event 1 will only be observed 20% of the time if we are in state 2, and event 2 has an 80% chance of occurring. Given a state vector (<math>\mathbf{\pi}</math>), the probability of observing event <math>j</math> is then:
:<math>\mathbf{P}(O = j) = \sum_{i} \pi_i B_{i,j}</math>
This can be represented in matrix form by multiplying the state vector (<math>\mathbf{\pi}</math>) by an observation matrix (<math>\mathbf{O_j}</math>) containing only diagonal entries. Each entry is the probability of the observed event given each state. Continuing the above example, an observation of event 1 would be:
:<math>\mathbf{O_1} = \begin{pmatrix} 0.9 & 0 \\ 0 & 0.2 \end{pmatrix}</math>
This allows us to calculate the probabilities associated with transitioning to a new state and observing the given event as:
:<math>\mathbf{f_{0:1}} = \mathbf{\pi} \mathbf{T} \mathbf{O_1}</math>
The probability vector that results contains entries indicating the probability of transitioning to each state and observing the given event. This process can be carried forward with additional observations using:
:<math>\mathbf{f_{0:t}} = \mathbf{f_{0:t-1}} \mathbf{T} \mathbf{O_t}</math>
This value is the forward probability vector. The <math>i</math>-th entry of this vector provides:
:<math>\mathbf{f_{0:t}}(i) = \mathbf{P}(o_1, o_2, \dots, o_t, X_t = x_i \mid \mathbf{\pi})</math>
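As a sanity check, a single forward update (prior state vector times the transition matrix, then weighted by the observed event's probabilities) can be carried out directly in plain Python. This is our own minimal sketch using the two-state matrices above; the variable names are ours:

```python
# Two-state example from the text: 70%/30% transitions, and the diagonal
# observation matrix for event 1 (90% in state 1, 20% in state 2).
T = [[0.7, 0.3], [0.3, 0.7]]   # transition matrix
O1 = [0.9, 0.2]                # diagonal of the observation matrix for event 1
pi = [0.5, 0.5]                # uniform prior over the two states

# pi . T : state distribution after one transition
after_transition = [sum(pi[i] * T[i][j] for i in range(2)) for j in range(2)]
# ... . O1 : weight each state by the probability of the observed event
f1 = [after_transition[j] * O1[j] for j in range(2)]
```

The resulting unnormalized vector is (0.45, 0.10); dividing by its sum, 0.55, gives the normalized forward vector (0.8182, 0.1818).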
Typically, we will normalize the probability vector at each step so that its entries sum to 1. A scaling factor is thus introduced at each step such that:
:<math>\mathbf{\hat{f}_{0:t}} = c_t^{-1}\ \mathbf{\hat{f}_{0:t-1}} \mathbf{T} \mathbf{O_t}</math>
where <math>\mathbf{\hat{f}_{0:t-1}}</math> represents the scaled vector from the previous step and <math>c_t</math> represents the scaling factor that causes the resulting vector's entries to sum to 1. The product of the scaling factors is the total probability for observing the given events irrespective of the final states:
:<math>\mathbf{P}(o_1, o_2, \dots, o_t) = \prod_{s=1}^t c_s</math>
This allows us to interpret the scaled probability vector as:
:<math>\mathbf{\hat{f}_{0:t}}(i) = \frac{\mathbf{f_{0:t}}(i)}{\prod_{s=1}^t c_s} = \frac{\mathbf{P}(o_1, o_2, \dots, o_t, X_t = x_i \mid \mathbf{\pi})}{\mathbf{P}(o_1, o_2, \dots, o_t)} = \mathbf{P}(X_t = x_i \mid o_1, o_2, \dots, o_t)</math>
We thus find that the product of the scaling factors provides us with the total probability for observing the given sequence up to time <math>t</math>, and that the scaled probability vector provides us with the probability of being in each state at this time.
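The scaled forward pass can be sketched in a few lines of plain Python. This is our own illustrative implementation, run on the two-state model above with an arbitrary five-event observation sequence (the sequence and variable names are our assumptions):

```python
T = [[0.7, 0.3], [0.3, 0.7]]   # transition matrix
B = [[0.9, 0.1], [0.2, 0.8]]   # B[state][event]: event probabilities per state
events = [0, 0, 1, 0, 0]       # an illustrative observation sequence

f = [0.5, 0.5]                 # uniform prior over the two states
scaling = []                   # the constants c_t
for ev in events:
    # unnormalized update: (f . T) weighted by the observed event's probabilities
    f = [sum(f[i] * T[i][j] for i in range(2)) * B[j][ev] for j in range(2)]
    c = sum(f)                 # scaling factor c_t
    scaling.append(c)
    f = [v / c for v in f]     # renormalize so the entries sum to 1

# product of the scaling factors = total probability of the observations
p_obs = 1.0
for c in scaling:
    p_obs *= c
```

After the loop, `f` holds the probability of each state given the observations so far, and `p_obs` holds the total probability of the observation sequence.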
==Backward probabilities==
A similar procedure can be constructed to find backward probabilities. These intend to provide the probabilities:
:<math>\mathbf{b_{t:T}}(i) = \mathbf{P}(o_{t+1}, o_{t+2}, \dots, o_{T} \mid X_t = x_i)</math>
That is, we now want to assume that we start in a particular state (<math>X_t = x_i</math>), and we are now interested in the probability of observing all future events from this state. Since the initial state is assumed as given (i.e. the prior probability of this state = 100%), we begin with:
:<math>\mathbf{b_{T:T}} = [1\ 1\ 1\ \dots]^T</math>
Notice that we are now using a column vector, while the forward probabilities used row vectors. We can then work backwards using:
:<math>\mathbf{b_{t-1:T}} = \mathbf{T}\mathbf{O_t}\mathbf{b_{t:T}}</math>
While we could normalize this vector as well so that its entries sum to one, this is not usually done. Noting that each entry contains the probability of the future event sequence given a particular initial state, normalizing this vector would be equivalent to applying Bayes' theorem to find the likelihood of each initial state given the future events (assuming uniform priors for the final state vector). However, it is more common to scale this vector using the same <math>c_t</math> constants used in the forward probability calculations. <math>\mathbf{b_{T:T}}</math> is not scaled, but subsequent operations use:
:<math>\mathbf{\hat{b}_{t-1:T}} = c_t^{-1}\mathbf{T}\mathbf{O_t}\mathbf{\hat{b}_{t:T}}</math>
where <math>\mathbf{\hat{b}_{t:T}}</math> represents the previous, scaled vector. The result is that the scaled probability vector is related to the backward probabilities by:
:<math>\mathbf{\hat{b}_{t:T}}(i) = \frac{\mathbf{b_{t:T}}(i)}{\prod_{s=t+1}^T c_s}</math>
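A matching sketch of the backward pass, again in plain Python on the same illustrative two-state model (here normalizing each step to sum to 1 rather than reusing the forward constants; the observation sequence and variable names are our assumptions):

```python
T = [[0.7, 0.3], [0.3, 0.7]]   # transition matrix
B = [[0.9, 0.1], [0.2, 0.8]]   # B[state][event]
events = [0, 0, 1, 0, 0]       # an illustrative observation sequence

b = [1.0, 1.0]                 # base case: no remaining observations
betas = [b]                    # betas[t] will be proportional to b_{t:T}
for ev in reversed(events):
    # b_{t-1:T} = T O_t b_{t:T}, renormalized at each step
    b = [sum(T[i][j] * B[j][ev] * b[j] for j in range(2)) for i in range(2)]
    c = sum(b)
    b = [v / c for v in b]
    betas.insert(0, b)
```

Because each vector is renormalized, the entries of `betas[t]` are only proportional to the true backward probabilities, exactly as discussed in the text.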
This is useful because it allows us to find the total probability of being in each state at a given time, <math>t</math>, by multiplying these values:
:<math>\mathbf{\gamma_t}(i) = \mathbf{P}(X_t = x_i \mid o_1, o_2, \dots, o_T) = \frac{\mathbf{P}(o_1, o_2, \dots, o_T, X_t = x_i)}{\mathbf{P}(o_1, o_2, \dots, o_T)} = \mathbf{\hat{f}_{0:t}}(i) \cdot \mathbf{\hat{b}_{t:T}}(i)</math>
To understand this, we note that <math>\mathbf{f_{0:t}}(i) \cdot \mathbf{b_{t:T}}(i)</math> provides the probability for observing the given events in a way that passes through state <math>x_i</math> at time <math>t</math>. This probability includes the forward probabilities covering all events up to time <math>t</math> as well as the backward probabilities which include all future events. This is the numerator we are looking for in our equation, and we divide by the total probability of the observation sequence to normalize this value and extract only the probability that <math>X_t = x_i</math>. These values are sometimes called the "smoothed values", as they combine the forward and backward probabilities to compute a final probability.
The values <math>\mathbf{\gamma_t}(i)</math> thus provide the probability of being in each state at time <math>t</math>. As such, they are useful for determining the most probable state at any time. The term "most probable state" is, however, somewhat ambiguous. While the most probable state is the most likely to be correct at a given point, the sequence of individually probable states is not likely to be the most probable sequence. This is because the probabilities for each point are calculated independently of each other. They do not take into account the transition probabilities between states, and it is thus possible to get states at two moments (<math>t</math> and <math>t+1</math>) that are both most probable at those time points but which have very little probability of occurring together, i.e. <math>\mathbf{P}(X_t = x_i, X_{t+1} = x_j)</math> may be small. The most probable sequence of states that produced an observation sequence can be found using the Viterbi algorithm.
==Example==
This example takes as its basis the umbrella world in Russell & Norvig 2010, Chapter 15, pp. 566, in which we would like to infer the weather given observation of a man either carrying or not carrying an umbrella. We assume two possible states for the weather: state 1 = rain, state 2 = no rain. We assume that the weather has a 70% chance of staying the same each day and a 30% chance of changing. The transition probabilities are then:
:<math>\mathbf{T} = \begin{pmatrix} 0.7 & 0.3 \\ 0.3 & 0.7 \end{pmatrix}</math>
We also assume each state generates one of two events: event 1 = umbrella, event 2 = no umbrella. The conditional probabilities for these occurring in each state are given by the probability matrix:
:<math>\mathbf{B} = \begin{pmatrix} 0.9 & 0.1 \\ 0.2 & 0.8 \end{pmatrix}</math>
We then observe the following sequence of events: {umbrella, umbrella, no umbrella, umbrella, umbrella}, which we will represent in our calculations as:
:<math>\mathbf{O_1} = \mathbf{O_2} = \mathbf{O_4} = \mathbf{O_5} = \begin{pmatrix} 0.9 & 0 \\ 0 & 0.2 \end{pmatrix} \qquad \mathbf{O_3} = \begin{pmatrix} 0.1 & 0 \\ 0 & 0.8 \end{pmatrix}</math>
Note that <math>\mathbf{O_3}</math> differs from the others because of the "no umbrella" observation.
In computing the forward probabilities we begin with:
:<math>\mathbf{f_{0:0}} = \begin{pmatrix} 0.5 & 0.5 \end{pmatrix}</math>
which is our prior state vector indicating that we don't know which state the weather is in before our observations. While a state vector should be given as a row vector, we will use the transpose of the matrix so that the calculations below are easier to read. Our calculations are then written in the form:
:<math>(\mathbf{\hat{f}_{0:t}})^T = c_t^{-1}\mathbf{O_t}(\mathbf{T})^T(\mathbf{\hat{f}_{0:t-1}})^T</math>
instead of:
:<math>\mathbf{\hat{f}_{0:t}} = c_t^{-1}\mathbf{\hat{f}_{0:t-1}}\mathbf{T}\mathbf{O_t}</math>
Notice that the transformation matrix is also transposed, but in our example the transpose is equal to the original matrix. Performing these calculations and normalizing the results provides:
:<math>(\mathbf{\hat{f}_{0:1}})^T = c_1^{-1}\mathbf{O_1}(\mathbf{T})^T(\mathbf{f_{0:0}})^T = \begin{pmatrix} 0.8182 \\ 0.1818 \end{pmatrix}</math>
:<math>(\mathbf{\hat{f}_{0:2}})^T = c_2^{-1}\mathbf{O_2}(\mathbf{T})^T(\mathbf{\hat{f}_{0:1}})^T = \begin{pmatrix} 0.8834 \\ 0.1166 \end{pmatrix}</math>
:<math>(\mathbf{\hat{f}_{0:3}})^T = c_3^{-1}\mathbf{O_3}(\mathbf{T})^T(\mathbf{\hat{f}_{0:2}})^T = \begin{pmatrix} 0.1907 \\ 0.8093 \end{pmatrix}</math>
:<math>(\mathbf{\hat{f}_{0:4}})^T = c_4^{-1}\mathbf{O_4}(\mathbf{T})^T(\mathbf{\hat{f}_{0:3}})^T = \begin{pmatrix} 0.7308 \\ 0.2692 \end{pmatrix}</math>
:<math>(\mathbf{\hat{f}_{0:5}})^T = c_5^{-1}\mathbf{O_5}(\mathbf{T})^T(\mathbf{\hat{f}_{0:4}})^T = \begin{pmatrix} 0.8673 \\ 0.1327 \end{pmatrix}</math>
For the backward probabilities we start with:
:<math>\mathbf{b_{5:5}} = \begin{pmatrix} 1.0 \\ 1.0 \end{pmatrix}</math>
We are then able to compute (using the observations in reverse order and normalizing with different constants, here denoted <math>\alpha</math>):
:<math>\mathbf{\hat{b}_{4:5}} = \alpha\mathbf{T}\mathbf{O_5}\mathbf{b_{5:5}} = \begin{pmatrix} 0.6273 \\ 0.3727 \end{pmatrix}</math>
:<math>\mathbf{\hat{b}_{3:5}} = \alpha\mathbf{T}\mathbf{O_4}\mathbf{\hat{b}_{4:5}} = \begin{pmatrix} 0.6533 \\ 0.3467 \end{pmatrix}</math>
:<math>\mathbf{\hat{b}_{2:5}} = \alpha\mathbf{T}\mathbf{O_3}\mathbf{\hat{b}_{3:5}} = \begin{pmatrix} 0.3763 \\ 0.6237 \end{pmatrix}</math>
:<math>\mathbf{\hat{b}_{1:5}} = \alpha\mathbf{T}\mathbf{O_2}\mathbf{\hat{b}_{2:5}} = \begin{pmatrix} 0.5923 \\ 0.4077 \end{pmatrix}</math>
:<math>\mathbf{\hat{b}_{0:5}} = \alpha\mathbf{T}\mathbf{O_1}\mathbf{\hat{b}_{1:5}} = \begin{pmatrix} 0.6469 \\ 0.3531 \end{pmatrix}</math>
Finally, we will compute the smoothed probability values. These results must also be scaled so that their entries sum to 1 because we did not scale the backward probabilities with the <math>c_t</math>'s found earlier. The backward probability vectors above thus actually represent the likelihood of each state at time <math>t</math> given the future observations. Because these vectors are proportional to the actual backward probabilities, the result has to be scaled an additional time (with <math>\circ</math> denoting the element-wise product):
:<math>(\mathbf{\gamma_0})^T = \alpha(\mathbf{f_{0:0}})^T \circ (\mathbf{\hat{b}_{0:5}})^T = \begin{pmatrix} 0.6469 \\ 0.3531 \end{pmatrix}</math>
:<math>(\mathbf{\gamma_1})^T = \alpha(\mathbf{\hat{f}_{0:1}})^T \circ (\mathbf{\hat{b}_{1:5}})^T = \begin{pmatrix} 0.8673 \\ 0.1327 \end{pmatrix}</math>
:<math>(\mathbf{\gamma_2})^T = \alpha(\mathbf{\hat{f}_{0:2}})^T \circ (\mathbf{\hat{b}_{2:5}})^T = \begin{pmatrix} 0.8204 \\ 0.1796 \end{pmatrix}</math>
:<math>(\mathbf{\gamma_3})^T = \alpha(\mathbf{\hat{f}_{0:3}})^T \circ (\mathbf{\hat{b}_{3:5}})^T = \begin{pmatrix} 0.3075 \\ 0.6925 \end{pmatrix}</math>
:<math>(\mathbf{\gamma_4})^T = \alpha(\mathbf{\hat{f}_{0:4}})^T \circ (\mathbf{\hat{b}_{4:5}})^T = \begin{pmatrix} 0.8204 \\ 0.1796 \end{pmatrix}</math>
:<math>(\mathbf{\gamma_5})^T = \alpha(\mathbf{\hat{f}_{0:5}})^T \circ (\mathbf{b_{5:5}})^T = \begin{pmatrix} 0.8673 \\ 0.1327 \end{pmatrix}</math>
Notice that the value of <math>\mathbf{\gamma_0}</math> is equal to <math>\mathbf{\hat{b}_{0:5}}</math> and that <math>\mathbf{\gamma_5}</math> is equal to <math>\mathbf{\hat{f}_{0:5}}</math>. This follows naturally because both <math>\mathbf{\hat{f}_{0:5}}</math> and <math>\mathbf{\hat{b}_{0:5}}</math> begin with uniform priors over the initial and final state vectors (respectively) and take into account all of the observations. However, <math>\mathbf{\gamma_0}</math> will only be equal to <math>\mathbf{\hat{b}_{0:5}}</math> when our initial state vector represents a uniform prior (i.e. all entries are equal). When this is not the case, <math>\mathbf{\hat{b}_{0:5}}</math> needs to be combined with the initial state vector to find the most likely initial state. We thus find that the forward probabilities by themselves are sufficient to calculate the most likely final state. Similarly, the backward probabilities can be combined with the initial state vector to provide the most probable initial state given the observations. The forward and backward probabilities need only be combined to infer the most probable states between the initial and final points.
The calculations above reveal that the most probable weather state on every day except for the third one was "rain". They tell us more than this, however, as they now provide a way to quantify the probabilities of each state at different times. Perhaps most importantly, our value at <math>\mathbf{\gamma_5}</math> quantifies our knowledge of the state vector at the end of the observation sequence. We can then use this to predict the probability of the various weather states tomorrow, as well as the probability of observing an umbrella.
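The smoothed values of the umbrella example can be reproduced with a short plain-Python script that runs both passes and multiplies them elementwise. This is our own sketch (variable names are ours), not code from Russell & Norvig:

```python
T = [[0.7, 0.3], [0.3, 0.7]]   # weather transition matrix
B = [[0.9, 0.1], [0.2, 0.8]]   # B[state][event], event 0 = umbrella
events = [0, 0, 1, 0, 0]       # umbrella, umbrella, no umbrella, umbrella, umbrella

# scaled forward pass; fs[t] is the state distribution given o_1..o_t
f = [0.5, 0.5]
fs = [f]
for ev in events:
    f = [sum(f[i] * T[i][j] for i in range(2)) * B[j][ev] for j in range(2)]
    s = sum(f)
    fs.append([v / s for v in f])
    f = fs[-1]

# scaled backward pass; bs[t] is proportional to the backward probabilities
b = [1.0, 1.0]
bs = [b]
for ev in reversed(events):
    b = [sum(T[i][j] * B[j][ev] * b[j] for j in range(2)) for i in range(2)]
    s = sum(b)
    bs.insert(0, [v / s for v in b])
    b = bs[0]

# smoothed values: elementwise product, renormalized one more time
gammas = []
for ft, bt in zip(fs, bs):
    g = [ft[i] * bt[i] for i in range(2)]
    s = sum(g)
    gammas.append([v / s for v in g])
```

Each entry `gammas[t][0]` is the smoothed probability of rain on day <math>t</math>; only day 3 (the "no umbrella" day) comes out below 0.5.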
==Performance==
The brute-force procedure for the solution of this problem is the generation of all possible state sequences and calculating the joint probability of each state sequence with the observed series of events. This approach has time complexity <math>O(T \cdot N^T)</math>, where <math>T</math> is the length of sequences and <math>N</math> is the number of symbols in the state alphabet. This is intractable for realistic problems, as the number of possible hidden node sequences typically is extremely high. However, the forward–backward algorithm has time complexity <math>O(N^2 T)</math>.
An enhancement to the general forward–backward algorithm, called the Island algorithm, trades smaller memory usage for longer running time, taking <math>O(T \log T)</math> time and <math>O(\log T)</math> memory in the sequence length. On a computer with an unlimited number of processors, this can be reduced to <math>O(T)</math> total time, while still taking only <math>O(\log T)</math> memory.[1]
In addition, algorithms have been developed to compute the smoothed estimates efficiently through online smoothing, such as the fixed-lag smoothing (FLS) algorithm (Russell & Norvig 2010, Figure 15.6, pp. 580).
==Pseudocode==
 ForwardBackward(guessState, sequenceIndex):
     if sequenceIndex is past the end of the sequence:
         return 1
     if (guessState, sequenceIndex) has been seen before:
         return saved result
     result = 0
     for each neighboring state n:
         result = result + (transition probability from guessState to n
                            given observation element at sequenceIndex)
                         * ForwardBackward(n, sequenceIndex + 1)
     save result for (guessState, sequenceIndex)
     return result
==Python example==
Given an HMM (just like in the Viterbi algorithm) represented in the Python programming language:
states = ('Healthy', 'Fever')
end_state = 'E'
observations = ('normal', 'cold', 'dizzy')
start_probability = {'Healthy': 0.6, 'Fever': 0.4}
transition_probability = {
    'Healthy': {'Healthy': 0.69, 'Fever': 0.3, 'E': 0.01},
    'Fever': {'Healthy': 0.4, 'Fever': 0.59, 'E': 0.01},
}
emission_probability = {
    'Healthy': {'normal': 0.5, 'cold': 0.4, 'dizzy': 0.1},
    'Fever': {'normal': 0.1, 'cold': 0.3, 'dizzy': 0.6},
}
We can write the implementation like this:
def fwd_bkw(x, states, a_0, a, e, end_st):
    """Return the forward, backward and posterior (smoothed) probabilities."""
    L = len(x)

    fwd = []
    f_prev = {}
    # forward part of the algorithm
    for i, x_i in enumerate(x):
        f_curr = {}
        for st in states:
            if i == 0:
                # base case for the forward part
                prev_f_sum = a_0[st]
            else:
                prev_f_sum = sum(f_prev[k] * a[k][st] for k in states)
            f_curr[st] = e[st][x_i] * prev_f_sum
        fwd.append(f_curr)
        f_prev = f_curr

    p_fwd = sum(f_curr[k] * a[k][end_st] for k in states)

    bkw = []
    b_prev = {}
    # backward part of the algorithm
    for i, x_i_plus in enumerate(reversed(tuple(x[1:]) + (None,))):
        b_curr = {}
        for st in states:
            if i == 0:
                # base case for the backward part
                b_curr[st] = a[st][end_st]
            else:
                b_curr[st] = sum(a[st][l] * e[l][x_i_plus] * b_prev[l] for l in states)
        bkw.insert(0, b_curr)
        b_prev = b_curr

    p_bkw = sum(a_0[l] * e[l][x[0]] * b_curr[l] for l in states)

    # merging the two parts
    posterior = {}
    for st in states:
        posterior[st] = [fwd[i][st] * bkw[i][st] / p_fwd for i in range(L)]

    # both are P(x); compare with a tolerance to allow for floating-point error
    assert abs(p_fwd - p_bkw) < 1e-10
    return fwd, bkw, posterior
The function fwd_bkw takes the following arguments: x is the sequence of observations, e.g. ['normal', 'cold', 'dizzy']; states is the set of hidden states; a_0 is the start probability; a are the transition probabilities; and e are the emission probabilities.
For simplicity of code, we assume that the observation sequence x is non-empty and that a[i][j] and e[i][j] are defined for all states i, j.
In the running example, the forward–backward algorithm is used as follows:
def example():
    return fwd_bkw(observations,
                   states,
                   start_probability,
                   transition_probability,
                   emission_probability,
                   end_state)

for line in example():
    print(' '.join(map(str, line)))
==See also==
==References==
- Lawrence R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition". Proceedings of the IEEE, 77 (2), pp. 257–286, February 1989. doi:10.1109/5.18626
==External links==
- An interactive spreadsheet for teaching the forward–backward algorithm (spreadsheet and article with step-by-step walk-through)
- Tutorial of hidden Markov models including the forward–backward algorithm
- Collection of AI algorithms implemented in Java (including HMM and the forward–backward algorithm)
- ↑ J. Binder, K. Murphy and S. Russell, "Space-Efficient Inference in Dynamic Probabilistic Networks". Int'l Joint Conf. on Artificial Intelligence, 1997.