Generic Stream Encapsulation - Revision history

en>Yobot: /* Products Supporting GSE */WP:CHECKWIKI error fixes using AWB (10072)

2014-04-05T17:19:00Z

Products Supporting GSE: WP:CHECKWIKI error fixes using AWB (10072)

← Older revision		Revision as of 19:19, 5 April 2014

82.228.88.181: /* External links */ add link to opensource implementation

2013-09-09T13:17:45Z

External links: add link to opensource implementation

← Older revision		Revision as of 15:17, 9 September 2013
			{{redirect\|MEMM\|the German Nordic combined skier\|Silvio Memm}}

			In [[machine learning]], a '''maximum-entropy Markov model''' ('''MEMM'''), or '''conditional Markov model''' ('''CMM'''), is a [[graphical model]] for [[sequence labeling]] that combines features of [[hidden Markov model]]s (HMMs) and [[Maximum entropy probability distribution\|maximum entropy]] (MaxEnt) models. An MEMM is a [[discriminative model]] that extends a standard [[maximum entropy classifier]] by assuming that the unknown values to be learnt are connected in a [[Markov chain]] rather than being [[conditionally independent]] of each other. MEMMs find applications in [[natural language processing]], specifically in [[part-of-speech tagging]]<ref>{{Cite conference\|last = Toutanova\|first = Kristina\|last2 = Manning\|first2 = Christopher D.\|title = Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger\|booktitle = Proc. J. SIGDAT Conf. on Empirical Methods in NLP and Very Large Corpora (EMNLP/VLC-2000)\|year = 2000\|pages = 63–70}}</ref> and [[information extraction]].<ref name=orig>{{Cite conference\|last = McCallum\|first = Andrew\|last2 = Freitag\|first2 = Dayne\|last3 = Pereira\|first3 = Fernando\|title = Maximum Entropy Markov Models for Information Extraction and Segmentation\|booktitle = Proc. ICML 2000\|year = 2000\|pages = 591–598\|url=http://www.ai.mit.edu/courses/6.891-nlp/READINGS/maxent.pdf}}</ref>

			==Model==

			Suppose we have a sequence of observations <math>O_1, \dots, O_n</math> that we seek to tag with the labels <math>S_1, \dots, S_n</math> that maximize the conditional probability <math>P(S_1, \dots, S_n \| O_1, \dots, O_n)</math>. In an MEMM, this probability is factored into Markov transition probabilities, where the probability of transitioning to a particular label depends only on the observation at that position and the previous position's label:
			:<math>P(S_1, \dots, S_n \| O_1, \dots, O_n) = \prod_{t = 1}^nP(S_{t}\|S_{t-1},O_t).</math>
			Each of these transition probabilities comes from the same general distribution <math>P(s\|s',o)</math>. For each possible value of the previous label <math>s'</math>, the probability of a certain label <math> s </math> is modeled in the same way as a [[maximum entropy classifier]]:<ref>{{Cite journal \|title = A maximum entropy approach to natural language processing\|
			author=Berger, A.L. and Pietra, V.J.D. and Pietra, S.A.D.\|
			journal=Computational Linguistics\|
			volume=22\| issue=1\| pages=39–71\|
			year=1996\|
			publisher=MIT Press\|
			}}</ref>
			:<math>P(s\|s',o) = P_{s'}(s\|o) = \frac{1}{Z(o,s')}\exp\left(\sum_a\lambda_af_a(o,s)\right).</math>
			Here, the <math>f_a(o,s)</math> are real-valued or categorical feature-functions, and <math> Z(o,s') </math> is a normalization term ensuring that the distribution sums to one. This form for the distribution corresponds to the [[maximum entropy probability distribution]] satisfying the constraint that the empirical expectation of each feature equals its expectation under the model:
			:<math> \operatorname{E}_e\left[f_a(o,s)\right] = \operatorname{E}_p\left[f_a(o,s)\right] \quad \text{ for all } a .</math>
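The factorization and the per-state exponential distribution above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the label set, the three binary features, the weight values and the start symbol `<s>` are all assumptions made up for the example. One weight table is kept per previous label, mirroring the notation <math>P_{s'}</math>.

```python
import itertools
import math

LABELS = ["N", "V"]  # hypothetical label set


def features(o, s):
    """Hypothetical binary features f_a(o, s); the key plays the role of a."""
    return {
        ("cap", s): 1.0 if o[0].isupper() else 0.0,
        ("ends_s", s): 1.0 if o.endswith("s") else 0.0,
        ("bias", s): 1.0,
    }


def transition_dist(weights, prev, o):
    """P_{s'}(s | o) for every s, using the weight table of previous label s'."""
    lam = weights[prev]
    scores = {}
    for s in LABELS:
        total = sum(lam.get(a, 0.0) * v for a, v in features(o, s).items())
        scores[s] = math.exp(total)
    z = sum(scores.values())  # normalization term Z(o, s')
    return {s: scores[s] / z for s in LABELS}


def sequence_prob(weights, obs, labels, start="<s>"):
    """Product of the transition probabilities P(S_t | S_{t-1}, O_t)."""
    p, prev = 1.0, start
    for o, s in zip(obs, labels):
        p *= transition_dist(weights, prev, o)[s]
        prev = s
    return p


# Made-up weights: one table of lambda_a values per previous label.
WEIGHTS = {"<s>": {("cap", "N"): 1.0}, "N": {("ends_s", "V"): 0.5}, "V": {}}
obs = ["Dogs", "barks"]
dist = transition_dist(WEIGHTS, "<s>", "Dogs")
total = sum(sequence_prob(WEIGHTS, obs, seq)
            for seq in itertools.product(LABELS, repeat=len(obs)))
```

Because every transition distribution is normalized on its own, the probabilities of all label sequences for a fixed observation sequence add up to one, which is what `total` checks.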
			The parameters <math>\lambda_a</math> can be estimated using [[generalized iterative scaling]].<ref>{{Cite journal \| title=Generalized iterative scaling for log-linear models\|
			author=Darroch, J.N. and Ratcliff, D.\|
			journal=The Annals of Mathematical Statistics\|
			volume=43\| issue=5\| pages=1470–1480\|
			year=1972\|
			publisher=Institute of Mathematical Statistics\|
			url=http://projecteuclid.org/download/pdf_1/euclid.aoms/1177692379 \|
			}}</ref> Furthermore, a variant of the [[Baum–Welch algorithm]], which is used for training HMMs, can be used to estimate parameters when training data has [[Semi-supervised learning\|incomplete or missing labels]].<ref name="orig"/>
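As a rough sketch of how generalized iterative scaling fits one such distribution, the following Python code applies the standard update <math>\lambda_a \leftarrow \lambda_a + \tfrac{1}{C}\log(\operatorname{E}_e[f_a] / \operatorname{E}_p[f_a])</math>. It assumes non-negative features whose values sum to the same constant <math>C</math> for every pair <math>(o,s)</math>; the indicator features and the toy data are invented for the example, and features never seen in the data simply keep weight zero.

```python
import math


def gis(data, labels, features, iters=50):
    """Generalized iterative scaling for one conditional maxent model.

    data: list of (o, s) pairs observed after one fixed previous label.
    features(o, s): dict of non-negative values assumed to sum to the same
    constant C for every (o, s), as GIS requires.
    """
    n = len(data)
    C = sum(features(*data[0]).values())

    # Empirical expectation of each feature.
    emp = {}
    for o, s in data:
        for a, v in features(o, s).items():
            emp[a] = emp.get(a, 0.0) + v / n
    lam = {a: 0.0 for a in emp}

    def dist(o):
        scores = {}
        for s in labels:
            total = sum(lam.get(a, 0.0) * v for a, v in features(o, s).items())
            scores[s] = math.exp(total)
        z = sum(scores.values())
        return {s: scores[s] / z for s in labels}

    for _ in range(iters):
        # Expectation of each feature under the current model.
        model = {a: 0.0 for a in emp}
        for o, _s in data:
            p = dist(o)
            for s in labels:
                for a, v in features(o, s).items():
                    if a in model:
                        model[a] += p[s] * v / n
        for a in lam:
            lam[a] += math.log(emp[a] / model[a]) / C
    return lam, dist


def indicator(o, s):
    """Hypothetical feature set: one indicator per (observation, label) pair."""
    return {(o, s): 1.0}


DATA = [("x", "A"), ("x", "A"), ("x", "B"),
        ("y", "B"), ("y", "B"), ("y", "B"), ("y", "A")]
lam, dist = gis(DATA, ["A", "B"], indicator)
p_a_given_x = dist("x")["A"]
p_b_given_y = dist("y")["B"]
```

With one indicator per (observation, label) pair the model is saturated, so the fitted probabilities reproduce the relative frequencies in `DATA` (2/3 for label A after "x", 3/4 for label B after "y").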

			The optimal state sequence <math>S_1, \dots, S_n</math> can be found using a [[Viterbi algorithm]] very similar to the one used for HMMs. The dynamic program is based on the forward probability:
			:<math>\alpha_{t+1}(s) = \sum_{s' \in S} \alpha_t(s') P_{s'}(s\|o_{t+1}).</math>
			Viterbi decoding replaces the sum over <math>s'</math> with a maximum.
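The recursion can be illustrated with a short Python sketch. The transition table below is a hand-written stand-in for trained distributions <math>P_{s'}(s\|o)</math>; its numbers, the two labels and the start symbol `<s>` are assumptions for the example only. `forward` implements the sum recursion, and `viterbi` uses the same recursion with a maximum and back-pointers.

```python
LABELS = ["A", "B"]

# Hypothetical table of P_{s'}(s | o), keyed by (previous label, observation).
TABLE = {
    ("<s>", "x"): {"A": 0.6, "B": 0.4},
    ("<s>", "y"): {"A": 0.5, "B": 0.5},
    ("A", "x"): {"A": 0.7, "B": 0.3},
    ("A", "y"): {"A": 0.2, "B": 0.8},
    ("B", "x"): {"A": 0.4, "B": 0.6},
    ("B", "y"): {"A": 0.1, "B": 0.9},
}


def trans(prev, o):
    return TABLE[(prev, o)]


def forward(obs, start="<s>"):
    """alpha_{t+1}(s) = sum over s' of alpha_t(s') * P_{s'}(s | o_{t+1})."""
    alpha = dict(trans(start, obs[0]))
    for o in obs[1:]:
        alpha = {s: sum(alpha[sp] * trans(sp, o)[s] for sp in LABELS)
                 for s in LABELS}
    return alpha


def viterbi(obs, start="<s>"):
    """Same recursion with max instead of sum, plus back-pointers."""
    delta = dict(trans(start, obs[0]))
    back = []
    for o in obs[1:]:
        step, ptr = {}, {}
        for s in LABELS:
            best = max(LABELS, key=lambda sp: delta[sp] * trans(sp, o)[s])
            ptr[s] = best
            step[s] = delta[best] * trans(best, o)[s]
        delta = step
        back.append(ptr)
    last = max(LABELS, key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[last]


alpha = forward(["x", "y", "x"])
path, prob = viterbi(["x", "y", "x"])
```

For this table the best path is A, B, B with probability 0.6 × 0.8 × 0.6 = 0.288, and the forward values sum to one because each transition distribution is locally normalized.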

			==Strengths and weaknesses==
			An advantage of MEMMs over HMMs for sequence tagging is that they offer more freedom in choosing features to represent observations. In sequence tagging, it is often useful to apply domain knowledge to design special-purpose features. In the original paper introducing MEMMs, the authors write that "when trying to extract previously unseen company names from a newswire article, the identity of a word alone is not very predictive; however, knowing that the word is capitalized, that is a noun, that it is used in an appositive, and that it appears near the top of the article would all be quite predictive (in conjunction with the context provided by the state-transition structure)."<ref name="orig"/> Useful sequence tagging features such as these are often not independent. Maximum entropy models do not assume independence between features, but generative observation models used in HMMs do.<ref name="orig"/> Therefore, MEMMs allow the user to specify many correlated but informative features.

			Another advantage of MEMMs versus HMMs and [[conditional random field]]s (CRFs) is that training can be considerably more efficient. In HMMs and CRFs, one needs to use some version of the [[forward–backward algorithm]] as an inner loop in training. However, in MEMMs, estimating the parameters of the maximum-entropy distributions used for the transition probabilities can be done for each transition distribution in isolation.
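One way to picture training "in isolation" is to partition the labeled data by previous label, so that each transition distribution gets its own independent set of (observation, label) pairs. The Python sketch below does only this partitioning step; the toy corpus, the tag names and the start symbol `<s>` are assumptions for the example.

```python
from collections import defaultdict


def split_by_previous_label(tagged_sequences, start="<s>"):
    """Group (observation, label) pairs by the label that preceded them.

    Each bucket is the training set for one transition distribution
    P_{s'}(s | o) and can be fitted without looking at the other buckets.
    """
    buckets = defaultdict(list)
    for seq in tagged_sequences:
        prev = start
        for o, s in seq:
            buckets[prev].append((o, s))
            prev = s
    return dict(buckets)


# Hypothetical tagged corpus.
CORPUS = [
    [("The", "D"), ("dog", "N"), ("barks", "V")],
    [("Dogs", "N"), ("bark", "V")],
]
buckets = split_by_previous_label(CORPUS)
```

Each entry of `buckets` can then be handed to any maximum-entropy trainer separately, which is why no forward–backward pass over whole sequences is needed when all labels are observed.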

			A drawback of MEMMs is that they potentially suffer from the "label bias problem," where states with low-entropy transition distributions "effectively ignore their observations." Conditional random fields were designed to overcome this weakness,<ref name="crf">{{cite conference\|last = Lafferty\|first = John\|last2 = McCallum\|first2 = Andrew\|last3 = Pereira\|first3 = Fernando\|title = Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data\|booktitle = Proc. ICML 2001\|year = 2001}}</ref>
			which had already been recognised in the context of neural network-based Markov models in the early 1990s.<ref>{{cite thesis \|author=Léon Bottou \|year=1991 \|title=Une Approche théorique de l'Apprentissage Connexionniste: Applications à la Reconnaissance de la Parole \|publisher=Université de Paris XI \|type=Ph.D. \|url=http://leon.bottou.org/papers/bottou-91a}}</ref><ref name="crf"/>
			Another source of label bias is that training is always done with respect to known previous tags, so the model struggles at test time when there is uncertainty in the previous tag.
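The label bias effect can be seen in a small numerical sketch. The scores below are invented; they stand for the weighted feature sums <math>\sum_a\lambda_af_a(o,s)</math> of the successors allowed from one state. When a state has a single allowed successor, per-state normalization assigns it probability one whether the observation fits well or badly.

```python
import math


def local_dist(scores):
    """Per-state normalization over the successors allowed from one state."""
    z = sum(math.exp(v) for v in scores.values())
    return {s: math.exp(v) / z for s, v in scores.items()}


# One allowed successor: the observation's score cancels out entirely.
good_fit = local_dist({"next": 5.0})
bad_fit = local_dist({"next": -5.0})

# Two allowed successors: the observation can still shift probability mass.
two_way = local_dist({"a": 5.0, "b": -5.0})
```

Here `good_fit` and `bad_fit` are identical, so a path through such a state is never penalized for a poorly matching observation, whereas `two_way` still reacts to the scores.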

			==References==
			{{reflist\|2}}
			[[Category:Markov models]]
			[[Category:Statistical natural language processing]]

en>Wbm1058: Unlinked: Hardware

2012-08-31T17:11:36Z

Unlinked: Hardware

New page

