Insertion device: Difference between revisions

Latest revision as of 04:52, 4 December 2013

Bayesian spam filtering (/ˈbeɪziən/ BAY-zee-ən; after Rev. Thomas Bayes) is a statistical technique of e-mail filtering. In its basic form, it makes use of a naive Bayes classifier on bag of words features to identify spam e-mail, an approach commonly used in text classification.

Naive Bayes classifiers work by correlating the use of tokens (typically words, or sometimes other things) with spam and non-spam e-mails and then using Bayesian inference to calculate the probability that an email is or is not spam.

Naive Bayes spam filtering is a baseline technique for dealing with spam that can tailor itself to the email needs of individual users and give low false positive spam detection rates that are generally acceptable to users. It is one of the oldest ways of doing spam filtering, with roots in the 1990s.

History

The first known mail-filtering program to use a naive Bayes classifier was Jason Rennie's ifile program, released in 1996. The program was used to sort mail into folders.^[1] The first scholarly publication on Bayesian spam filtering was by Sahami et al. in 1998.^[2] That work was soon thereafter deployed in commercial spam filters. However, in 2002 Paul Graham greatly decreased the false positive rate, so that it could be used on its own as a single spam filter.^[3]^[4]

Variants of the basic technique have been implemented in a number of research works and commercial software products.^[5] Many modern mail clients implement Bayesian spam filtering. Users can also install separate email filtering programs. Server-side email filters, such as CRM114, DSPAM, SpamAssassin,^[6] SpamBayes,^[7] Bogofilter and ASSP, make use of Bayesian spam filtering techniques, and the functionality is sometimes embedded within mail server software itself.

Process

Particular words have particular probabilities of occurring in spam email and in legitimate email. For instance, most email users will frequently encounter the word "Viagra" in spam email, but will seldom see it in other email. The filter doesn't know these probabilities in advance, and must first be trained so it can build them up. To train the filter, the user must manually indicate whether a new email is spam or not. For all words in each training email, the filter will adjust the probabilities that each word will appear in spam or legitimate email in its database. For instance, Bayesian spam filters will typically have learned a very high spam probability for the words "Viagra" and "refinance", but a very low spam probability for words seen only in legitimate email, such as the names of friends and family members.
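The training step described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the names `tokenize` and `WordCounts` are hypothetical, the tokenizer just splits on whitespace, and each word is counted at most once per message.

```python
from collections import Counter


def tokenize(text):
    """Return the set of lowercase words in a message (a deliberately crude tokenizer)."""
    return set(text.lower().split())


class WordCounts:
    """Per-class message counts for each word, built from user-labelled mail."""

    def __init__(self):
        self.spam_messages = 0
        self.ham_messages = 0
        self.spam_word = Counter()  # word -> number of spam messages containing it
        self.ham_word = Counter()   # word -> number of ham messages containing it

    def train(self, text, is_spam):
        """Record one message that the user has labelled as spam or ham."""
        words = tokenize(text)
        if is_spam:
            self.spam_messages += 1
            self.spam_word.update(words)
        else:
            self.ham_messages += 1
            self.ham_word.update(words)


counts = WordCounts()
counts.train("cheap viagra refinance now", is_spam=True)
counts.train("viagra discount offer", is_spam=True)
counts.train("lunch with alice tomorrow", is_spam=False)
```

After these three messages, "viagra" has been seen in two spam messages and no ham messages, while "alice" has been seen only in ham. Counts of this kind are the raw material from which the word probabilities are estimated.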

After training, the word probabilities (also known as likelihood functions) are used to compute the probability that an email with a particular set of words in it belongs to either category. Each word in the email (or only the most interesting words) contributes to the email's spam probability. This contribution is called the posterior probability and is computed using Bayes' theorem. Then, the email's spam probability is computed over all words in the email, and if the total exceeds a certain threshold (say 95%), the filter will mark the email as spam.

As in any other spam filtering technique, email marked as spam can then be automatically moved to a "Junk" email folder, or even deleted outright. Some software implements quarantine mechanisms that define a time frame during which the user is allowed to review the software's decision.

The initial training can usually be refined when wrong judgements from the software are identified (false positives or false negatives). That allows the software to dynamically adapt to the ever-evolving nature of spam.

Some spam filters combine the results of both Bayesian spam filtering and other heuristics (pre-defined rules about the contents, looking at the message's envelope, etc.), resulting in even higher filtering accuracy, sometimes at the cost of adaptiveness.

Mathematical foundation

Bayesian email filters utilize Bayes' theorem. Bayes' theorem is used several times in the context of spam:

a first time, to compute the probability that the message is spam, knowing that a given word appears in this message;
a second time, to compute the probability that the message is spam, taking into consideration all of its words (or a relevant subset of them);
sometimes a third time, to deal with rare words.

Computing the probability that a message containing a given word is spam

Let's suppose the suspected message contains the word "replica". Most people who are used to receiving e-mail know that this message is likely to be spam, more precisely a proposal to sell counterfeit copies of well-known brands of watches. The spam detection software, however, does not "know" such facts; all it can do is compute probabilities.

The formula used by the software to determine that is derived from Bayes' theorem:

\Pr(S|W)={\frac {\Pr(W|S)\cdot \Pr(S)}{\Pr(W|S)\cdot \Pr(S)+\Pr(W|H)\cdot \Pr(H)}}

where:

$\Pr(S|W)$ is the probability that a message is spam, knowing that the word "replica" is in it;
$\Pr(S)$ is the overall probability that any given message is spam;
$\Pr(W|S)$ is the probability that the word "replica" appears in spam messages;
$\Pr(H)$ is the overall probability that any given message is not spam (is "ham");
$\Pr(W|H)$ is the probability that the word "replica" appears in ham messages.

(For a full demonstration, see Bayes' theorem#Extended form.)
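As a worked illustration of the formula above, the following Python sketch evaluates $\Pr(S|W)$ for given inputs. The function name `prob_spam_given_word` and the example frequencies (0.30 and 0.01) are made up for this illustration and do not come from any measured corpus.

```python
def prob_spam_given_word(p_w_given_s, p_w_given_h, p_s=0.5):
    """Pr(S|W) from Bayes' theorem, with Pr(H) = 1 - Pr(S)."""
    p_h = 1.0 - p_s
    numerator = p_w_given_s * p_s
    denominator = numerator + p_w_given_h * p_h
    return numerator / denominator


# Hypothetical inputs: "replica" appears in 30% of spam and 1% of ham,
# and 80% of all incoming mail is assumed to be spam.
p = prob_spam_given_word(0.30, 0.01, p_s=0.8)
```

With these assumed inputs the numerator is 0.24 and the denominator is 0.242, so the result is roughly 0.992.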

The spamicity of a word

Recent statistics^[8] show that the current probability of any message being spam is 80%, at the very least:

\Pr(S)=0.8;\Pr(H)=0.2

However, most Bayesian spam detection software makes the assumption that there is no a priori reason for any incoming message to be spam rather than ham, and considers both cases to have equal probabilities of 50%:

\Pr(S)=0.5;\Pr(H)=0.5

The filters that use this hypothesis are said to be "not biased", meaning that they have no prejudice regarding the incoming email. This assumption permits simplifying the general formula to:

\Pr(S|W)={\frac {\Pr(W|S)}{\Pr(W|S)+\Pr(W|H)}}

This quantity is called the "spamicity" (or "spaminess") of the word "replica", and can be computed. The number $\Pr(W|S)$ used in this formula is approximated by the frequency of messages containing "replica" among the messages identified as spam during the learning phase. Similarly, $\Pr(W|H)$ is approximated by the frequency of messages containing "replica" among the messages identified as ham during the learning phase. For these approximations to make sense, the set of learned messages needs to be big and representative enough. It is also advisable that the learned set of messages conforms to the 50% hypothesis about the split between spam and ham, i.e. that the datasets of spam and ham are of the same size.^[9]
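The spamicity estimate can be sketched directly from message counts gathered during training. The function name `spamicity` and the counts in the example are hypothetical; returning `None` for a word that was never seen is one possible convention, not something the formula prescribes.

```python
def spamicity(spam_with_word, spam_total, ham_with_word, ham_total):
    """Estimate Pr(S|W) under the 50% prior from training-set message counts.

    Returns None when the word was never seen, since the ratio is then 0/0.
    """
    p_w_given_s = spam_with_word / spam_total  # frequency among spam messages
    p_w_given_h = ham_with_word / ham_total    # frequency among ham messages
    if p_w_given_s + p_w_given_h == 0:
        return None
    return p_w_given_s / (p_w_given_s + p_w_given_h)


# Hypothetical counts: "replica" in 150 of 1000 spam and 5 of 1000 ham messages.
replica_spamicity = spamicity(150, 1000, 5, 1000)
```

Here the two frequencies are 0.15 and 0.005, giving a spamicity of about 0.97.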

Of course, determining whether a message is spam or ham based only on the presence of the word "replica" is error-prone, which is why Bayesian spam software tries to consider several words and combine their spamicities to determine a message's overall probability of being spam.

Combining individual probabilities

Most Bayesian spam filtering algorithms are based on formulas that are strictly valid (from a probabilistic standpoint) only if the words present in the message are independent events. This condition is not generally satisfied (for example, in natural languages like English the probability of finding an adjective is affected by the probability of having a noun), but it is a useful idealization, especially since the statistical correlations between individual words are usually not known. On this basis, one can derive the following formula from Bayes' theorem:^[10]

p={\frac {p_{1}p_{2}\cdots p_{N}}{p_{1}p_{2}\cdots p_{N}+(1-p_{1})(1-p_{2})\cdots (1-p_{N})}}

where:

$p$ is the probability that the suspect message is spam;
$p_{1}$ is the probability $\Pr(S|W_{1})$ that it is spam, knowing it contains a first word (for example "replica");
$p_{2}$ is the probability $\Pr(S|W_{2})$ that it is spam, knowing it contains a second word (for example "watches");
etc.
$p_{N}$ is the probability $\Pr(S|W_{N})$ that it is spam, knowing it contains an Nth word (for example "home").

This is the formula referenced by Paul Graham in his 2002 article. Some early commentators stated that "Graham pulled his formulas out of thin air",^[11] but Graham had actually referenced his source,^[12] which included a detailed explanation of the formula, and the idealizations on which it is based.

Spam filtering software based on this formula is sometimes referred to as a naive Bayes classifier. The result p is typically compared to a given threshold to decide whether the message is spam or not. If p is lower than the threshold, the message is considered as likely ham, otherwise it is considered as likely spam.
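The combining formula and the threshold test can be sketched as follows. The names `combine` and `classify`, the 0.95 threshold, and the example spamicities are assumptions made for illustration only.

```python
import math


def combine(probabilities):
    """Combine per-word spamicities p_1..p_N into one message probability p."""
    spam_product = math.prod(probabilities)
    ham_product = math.prod(1.0 - p for p in probabilities)
    return spam_product / (spam_product + ham_product)


def classify(probabilities, threshold=0.95):
    """Label the message 'ham' if p is below the threshold, otherwise 'spam'."""
    return "spam" if combine(probabilities) >= threshold else "ham"


# Hypothetical spamicities for three words found in one message.
p = combine([0.97, 0.9, 0.2])
label = classify([0.97, 0.9, 0.2])
```

For these three values the products are 0.1746 and 0.0024, so p is about 0.986 and the message is labelled spam. Note that `math.prod` requires Python 3.8 or later.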

Alternative expression of the formula for combining individual probabilities

In practice, p is usually not computed directly with the above formula, because multiplying many small probabilities leads to floating-point underflow. Instead, p can be computed in the log domain by rewriting the original equation as follows:

{\frac {1}{p}}-1={\frac {(1-p_{1})(1-p_{2})\dots (1-p_{N})}{p_{1}p_{2}\dots p_{N}}}

Taking logs on both sides:

\ln \left({\frac {1}{p}}-1\right)=\sum _{i=1}^{N}\left[\ln(1-p_{i})-\ln p_{i}\right]

Let $\eta =\sum _{i=1}^{N}\left[\ln(1-p_{i})-\ln p_{i}\right]$. Therefore,

{\frac {1}{p}}-1=e^{\eta }

Hence the alternate formula for computing the combined probability:

p={\frac {1}{1+e^{\eta }}}
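A minimal Python sketch of the log-domain computation follows. The function name `combine_log`, the clamping constant `eps`, and the overflow guard at 700 are implementation assumptions added here to keep the logarithm and exponential well defined; they are not part of the formula itself.

```python
import math


def combine_log(probabilities, eps=1e-9):
    """Compute p = 1 / (1 + exp(eta)) with eta accumulated in the log domain."""
    eta = 0.0
    for p in probabilities:
        # Clamp away from 0 and 1 so both logarithms are defined (an assumption).
        p = min(max(p, eps), 1.0 - eps)
        eta += math.log(1.0 - p) - math.log(p)
    if eta > 700.0:
        # exp(eta) would overflow a double; p is indistinguishable from 0 here.
        return 0.0
    return 1.0 / (1.0 + math.exp(eta))


# 400 strongly "hammy" words and 400 strongly "spammy" words cancel out.
balanced = combine_log([0.01] * 400 + [0.99] * 400)
```

In this example both products in the direct formula would underflow to zero in double precision, leaving an undefined 0/0, whereas the log-domain sum stays near zero and yields a probability of about 0.5.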

Dealing with rare words

If a word was never encountered during the learning phase, both the numerator and the denominator are equal to zero, in the general formula as well as in the spamicity formula. The software can decide to discard such words, for which no information is available.

More generally, words that were encountered only a few times during the learning phase cause a problem, because it would be an error to blindly trust the information they provide. A simple solution is to ignore such unreliable words as well.

Applying Bayes' theorem again, and assuming that the classification between spam and ham of the emails containing a given word ("replica") is a random variable with a beta distribution, some programs use a corrected probability:

\Pr '(S|W)={\frac {s\cdot \Pr(S)+n\cdot \Pr(S|W)}{s+n}}

where:

$\Pr '(S|W)$ is the corrected probability for the message to be spam, knowing that it contains a given word;
$s$ is the strength we give to background information about incoming spam;
$\Pr(S)$ is the probability of any incoming message to be spam;
$n$ is the number of occurrences of this word during the learning phase;
$\Pr(S|W)$ is the spamicity of this word.

(Demonstration:^[13])

This corrected probability is used instead of the spamicity in the combining formula.

$\Pr(S)$ can again be taken equal to 0.5, to avoid being too suspicious about incoming email. 3 is a good value for s, meaning that the learned corpus must contain more than 3 messages with that word to put more confidence in the spamicity value than in the default value.

This formula can be extended to the case where n is equal to zero (and where the spamicity is not defined), and evaluates in this case to $\Pr(S)$.
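The corrected probability can be sketched as a small function. The name `corrected_spamicity` is hypothetical; the defaults s = 3 and Pr(S) = 0.5 follow the values suggested above, and treating an undefined spamicity as `None` is a convention chosen for this example.

```python
def corrected_spamicity(spamicity, n, s=3.0, p_s=0.5):
    """Blend a word's spamicity with the prior Pr(S), weighted by n sightings."""
    if n == 0 or spamicity is None:
        # No evidence at all: fall back to the background probability.
        return p_s
    return (s * p_s + n * spamicity) / (s + n)


seen_once = corrected_spamicity(1.0, n=1)     # a word seen once, only in spam
seen_often = corrected_spamicity(1.0, n=100)  # the same word seen 100 times
unseen = corrected_spamicity(None, n=0)       # a word never seen in training
```

A word seen once and only in spam gets 0.625 instead of a fully confident 1.0, while the same word seen 100 times gets about 0.985, so the correction fades as evidence accumulates.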

Other heuristics

"Neutral" words like "the", "a", "some", or "is" (in English), or their equivalents in other languages, can be ignored. More generally, some Bayesian filters simply ignore all the words which have a spamicity close to 0.5, as they contribute little to a good decision. The words taken into consideration are those whose spamicity is close to 0.0 (distinctive signs of legitimate messages) or close to 1.0 (distinctive signs of spam). One method is, for example, to keep only the ten words in the examined message with the greatest absolute value $|0.5-p_{i}|$.

Some software products take into account the fact that a given word appears several times in the examined message,^[14] others don't.
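The selection rule mentioned earlier in this section, keeping only the words whose spamicity lies farthest from 0.5, can be sketched as follows. The function name `most_interesting` and the example spamicities are hypothetical.

```python
def most_interesting(word_spamicities, limit=10):
    """Return the (word, spamicity) pairs farthest from the neutral value 0.5."""
    ranked = sorted(
        word_spamicities.items(),
        key=lambda item: abs(0.5 - item[1]),
        reverse=True,
    )
    return ranked[:limit]


# Hypothetical spamicities for the words of one message.
message_words = {"the": 0.5, "viagra": 0.99, "alice": 0.02, "offer": 0.7}
top_two = most_interesting(message_words, limit=2)
```

With these values the two retained words are "viagra" and "alice": one strong spam indicator and one strong ham indicator, while the neutral "the" is dropped.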

Some software products use patterns (sequences of words) instead of isolated natural-language words.^[15] For example, with a "context window" of four words, they compute the spamicity of "Viagra is good for", instead of computing the spamicities of "Viagra", "is", "good", and "for". This method is more sensitive to context and eliminates Bayesian noise better, at the expense of a bigger database.
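One simple way to produce such multi-word tokens is a sliding window over the message, sketched below. The function name `phrase_tokens` is hypothetical, and real products may build their patterns differently (for example with gaps between words), so this shows only the basic idea.

```python
def phrase_tokens(text, window=4):
    """Return every run of `window` consecutive words as a single token."""
    words = text.split()
    if not words:
        return []
    if len(words) < window:
        # Too short for a full window: treat the whole text as one phrase.
        return [" ".join(words)]
    return [
        " ".join(words[i:i + window])
        for i in range(len(words) - window + 1)
    ]


tokens = phrase_tokens("Viagra is good for you")
```

For the five-word example this yields the two phrases "Viagra is good for" and "is good for you", each of which would then be given its own spamicity during training.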

Mixed methods

There are other ways of combining individual probabilities for different words than using the "naive" approach. These methods differ from it on the assumptions they make on the statistical properties of the input data. These different hypotheses result in radically different formulas for combining the individual probabilities.

For example, assuming that the statistic $-2\ln(p_{1}p_{2}\cdots p_{N})$ follows a chi-squared distribution with 2N degrees of freedom (as it would for independent, uniformly distributed probabilities), one could use the formula:

p=C^{-1}(-2\ln(p_{1}p_{2}\cdots p_{N}),2N)

where $C^{-1}$ is the inverse of the chi-squared function.
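A minimal sketch of this chi-squared combination is given below. It rests on an interpretation that should be read as an assumption: $C^{-1}(x, 2N)$ is taken to mean the upper-tail probability of a chi-squared variable with 2N degrees of freedom evaluated at x. For an even number of degrees of freedom that tail has a closed form, which avoids any external library. The function names are hypothetical.

```python
import math


def chi2_upper_tail(x, dof):
    """P(X >= x) for a chi-squared variable with an even number of degrees of freedom."""
    if dof % 2 != 0:
        raise ValueError("this closed form only holds for even dof")
    m = x / 2.0
    term = math.exp(-m)
    total = term
    # Sum exp(-m) * m**i / i! for i = 0 .. dof/2 - 1.
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combine_chi2(probabilities):
    """Combine per-word probabilities via -2 ln(p_1 ... p_N) and the chi-squared tail."""
    x = -2.0 * sum(math.log(p) for p in probabilities)
    return chi2_upper_tail(x, 2 * len(probabilities))


spammy = combine_chi2([0.99] * 5)
hammy = combine_chi2([0.01] * 5)
```

Under this interpretation a message whose words all have probabilities near 1 yields a combined value near 1, and a message whose words all have probabilities near 0 yields a value near 0. A single word with probability 0.5 gives exactly the neutral value 0.5.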

Individual probabilities can also be combined using the techniques of Markovian discrimination.

Discussion

Advantages

One of the main advantages of Bayesian spam filtering is that it can be trained on a per-user basis.

The spam that a user receives is often related to the online user's activities. For example, a user may have been subscribed to an online newsletter that the user considers to be spam. This online newsletter is likely to contain words that are common to all newsletters, such as the name of the newsletter and its originating email address. A Bayesian spam filter will eventually assign a higher probability based on the user's specific patterns.

The legitimate e-mails a user receives will tend to be different. For example, in a corporate environment, the company name and the names of clients or customers will be mentioned often. The filter will assign a lower spam probability to emails containing those names.

The word probabilities are unique to each user and can evolve over time with corrective training whenever the filter incorrectly classifies an email. As a result, Bayesian spam filtering accuracy after training is often superior to pre-defined rules.

It can perform particularly well in avoiding false positives, where legitimate email is incorrectly classified as spam. For example, if the email contains the word "Nigeria", which is frequently used in Advance fee fraud spam, a pre-defined rules filter might reject it outright. A Bayesian filter would mark the word "Nigeria" as a probable spam word, but would take into account other important words that usually indicate legitimate e-mail. For example, the name of a spouse may strongly indicate the e-mail is not spam, which could overcome the use of the word "Nigeria."

Disadvantages

Depending on the implementation, Bayesian spam filtering may be susceptible to Bayesian poisoning, a technique used by spammers in an attempt to degrade the effectiveness of spam filters that rely on Bayesian filtering. A spammer practicing Bayesian poisoning will send out emails with large amounts of legitimate text (gathered from legitimate news or literary sources). Spammer tactics include insertion of random innocuous words that are not normally associated with spam, thereby decreasing the email's spam score, making it more likely to slip past a Bayesian spam filter. However, with Paul Graham's scheme, for example, only the most significant probabilities are used, so padding the text out with non-spam-related words does not affect the detection probability significantly.

Words that normally appear in large quantities in spam may also be transformed by spammers. For example, "Viagra" would be replaced with "Viaagra" or "V!agra" in the spam message. The recipient of the message can still read the changed words, but each of these words is encountered more rarely by the Bayesian filter, which hinders its learning process. As a general rule, this spamming technique does not work very well, because the derived words end up recognized by the filter just like the normal ones.^[16]

Another technique used to try to defeat Bayesian spam filters is to replace text with pictures, either directly included or linked. The whole text of the message, or some part of it, is replaced with a picture where the same text is "drawn". The spam filter is usually unable to analyze this picture, which would contain the sensitive words like "Viagra". However, since many mail clients disable the display of linked pictures for security reasons, the spammer sending links to distant pictures might reach fewer targets. Also, a picture's size in bytes is bigger than the equivalent text's size, so the spammer needs more bandwidth to send messages directly including pictures. Some filters are more inclined to decide that a message is spam if it has mostly graphical contents. Finally, a probably more efficient solution has been proposed by Google and is used by its Gmail email system: performing OCR (optical character recognition) on every mid- to large-size image and analyzing the text inside.^[17]

General applications of Bayesian filtering

While Bayesian filtering is used widely to identify spam email, the technique can classify (or "cluster") almost any sort of data. It has uses in science, medicine, and engineering. One example is a general purpose classification program called AutoClass which was originally used to classify stars according to spectral characteristics that were otherwise too subtle to notice. There is recent speculation that even the brain uses Bayesian methods to classify sensory stimuli and decide on behavioral responses.^[18]

References

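[1] Jason Rennie (1996), ifile. http://people.csail.mit.edu/jrennie/ifile/old/README-0.1A
[2] M. Sahami, S. Dumais, D. Heckerman, E. Horvitz (1998), A Bayesian approach to filtering junk e-mail, AAAI'98 Workshop on Learning for Text Categorization. http://robotics.stanford.edu/users/sahami/papers-dir/spam.pdf
[3] Paul Graham (2003), Better Bayesian filtering. http://www.paulgraham.com/better.html
[4] Brian Livingston (2002), Paul Graham provides stunning answer to spam e-mails. http://www.infoworld.com/t/business/paul-graham-provides-stunning-answer-spam-e-mails-295
[5] Junk Mail Controls, MozillaZine, November 2009. http://kb.mozillazine.org/Junk_Mail_Controls
[6] Installation, Ubuntu manuals, 2010-09-18. http://manpages.ubuntu.com/manpages/gutsy/man1/sa-learn.1p.html
[7] Background Reading, SpamBayes project, 2010-09-18. http://spambayes.sourceforge.net/background.html
[8] Template:Cite web
[9] Process Software, Introduction to Bayesian Filtering
[10] Template:Cite web at MathPages
[11] Tim Peters' comment on the algorithm used by Graham. http://mail.python.org/pipermail/python-dev/2002-August/028216.html
[12] Template:Cite web
[13] Template:Cite web
[14] Template:Cite web
[15] Template:Cite web
[16] Paul Graham (2002), A Plan for Spam
[17] Template:Cite web
[18] Trends in Neuroscience, 27(12):712-9, 2004 (pdf)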

External links

Guide to Bayesian spam filters: part 1, part 2.
Gary Robinson's spam blog


@@ Line 1: / Line 1: @@
+{{Bayesian statistics}}
+'''Bayesian spam filtering''' ({{IPAc-en|ˈ|b|eɪ|z|i|ə|n}} {{respell|BAY|zee-ən}}; after Rev. [[Thomas Bayes]]) is a [[statistics|statistical]] [[scientific technique|technique]] of [[e-mail filtering]]. In its basic form, it makes use of a [[naive Bayes classifier]] on [[bag of words]] features to identify [[Spam (electronic)|spam]] e-mail, an approach commonly used in [[Document classification|text classification]].
+Naive Bayes classifiers work by correlating the use of tokens (typically words, or sometimes other things), with spam and non-spam e-mails and then using [[Bayesian inference]] to calculate a probability that an email is or is not spam.
-You've finally decided think about that giant step and propose to your woman enjoy. You're seeking decide where, when and how to propose. Proposing for marriage is a precious moment, and well-built to make sure everything goes just immediately. But there's a person problem - you haven't a clue as into the type of diamond gemstone you should purchase for your future woman! There are a lot of colors, styles and shapes to choose from with diamond rings, so we've whip up this helpful guide for choosing diamond engagement rings - simply for you.<br><br>Bags, Shoes, and Expensive jewelry. Most, if not all women love the equivalent of one on the. But make positive that before giving one, invariably exactly what she likes and what she will actually wear. Tastes can vary greatly with bags and shoes. Jewelry of course is the best. You can not go wrong with earrings, bracelets, Pendants, necklace, and diamonds or pearl. Beneficial thing is, as long as it's a jewel, the cost is really not important.<br><br><br><br>Round Brilliant - The round brilliant is the modern day version of the round by using a lot more sparkle and shine. Occasion the preferred choice in diamond jewelry.<br><br>Mating. No, not the act of mating itself, but rather the hunt for the perfect mate, that an individual to support furthering the species. Method idea behind jewelry is it increases somebody's attractiveness. An individual who is wearing lots of pricey jewels and diamonds is obviously making decent money, due to the they will improve able to provide for a child. This works in both directions, for both men and women. While physical beauty still reigns supreme thinking about to choosing the best mate possible, will be possible to augment that beauty with indications of status, wealth and health.<br><br>So today, the creation and manufacturing of pet jewelry proves regarding a lucrative business. In fact, canine owners love to spruce up their pets, and themselves, in pet jewelry. 
Instead of buying jewelry only because of pets, or themselves, they now venture into buying matching jewelry each of their own selves. You can create any type of jewelry; made from sterling silver, pewter, or gemstones. Be as creative as easy to create jewelry designs that will entice each pet and pet rider!<br><br>Your price is really going to alter a lot just depending on kind of jewelry that 1 does choose to go with. It really depends on how expensive distinct birthstone is definitely. In fact, if you're fortunate enough to be an April baby you are exactly going to have to go out and select a diamond. Of course, motivating one that is expensive gemstones out truth be told there. This may mean that you'll have to either stretch your allowance and conserve for a few years or just find methods to purchase it. Of course diamonds often in style but would likely not even recognize some all those pieces as birthstone pieces just considering they are so quite popular.<br><br>Surely, these tips won't make an individual regret. Through the means, you can get pendants, chain, earrings and other fashion jewellery online in India.<br><br>If you enjoyed this post and you would certainly like to obtain additional information concerning [http://www.quantumpendants.org/ Scalar Pendant Fake] kindly go to our website.
+Naive Bayes spam filtering is a baseline technique for dealing with spam that can tailor itself to the email needs of individual users and give low [[false positive]] spam detection rates that are generally acceptable to users. It is one of the oldest ways of doing spam filtering, with roots in the 1990s.
+==History==
+The first known mail-filtering [[computer program|program]] to use a naive Bayes classifier was Jason Rennie's ifile program, released in 1996. The program was used to sort mail into [[Directory (file systems)|folders]].<ref>{{cite web|url=http://people.csail.mit.edu/jrennie/ifile/old/README-0.1A|author=Jason Rennie|title=ifile|year=1996}}</ref> The first scholarly publication on Bayesian spam filtering was by Sahami et al. in 1998.<ref>{{cite web|url=http://robotics.stanford.edu/users/sahami/papers-dir/spam.pdf|author=M. Sahami, S. Dumais, D. Heckerman, E. Horvitz|title=A Bayesian approach to filtering junk e-mail|publisher=AAAI'98 Workshop on Learning for Text Categorization|year=1998}}</ref> That work was soon thereafter deployed in commercial spam filters.{{Citation needed|date=September 2010}} However, in 2002 [[Paul Graham (computer programmer)|Paul Graham]] greatly decreased the false positive rate, so that it could be used on its own as a single spam filter.<ref>Paul Graham (2003), [http://www.paulgraham.com/better.html Better Bayesian filtering]</ref><ref>Brian Livingston (2002), [http://www.infoworld.com/t/business/paul-graham-provides-stunning-answer-spam-e-mails-295 Paul Graham provides stunning answer to spam e-mails]</ref>
+Variants of the basic technique have been implemented in a number of research works and commercial [[Computer software|software]] products.<ref>{{cite web|url=http://kb.mozillazine.org/Junk_Mail_Controls|title=Junk Mail Controls|publisher=MozillaZine|date=November 2009}}</ref> Many modern mail [[Client (computing)|clients]] implement Bayesian spam filtering. Users can also install separate [[E-mail filtering|email filtering programs]]. [[Server-side]] email filters, such as [[CRM114 (program)|CRM114]], [[DSPAM]], [[SpamAssassin]],<ref name=twsSep14yy>{{cite web
+ |title= Installation
+ |publisher= ''Ubuntu manuals''
+ |quote= Gary Robinson’s f(x) and combining algorithms, as used in SpamAssassin
+ |date= 2010-09-18
+ |url= http://manpages.ubuntu.com/manpages/gutsy/man1/sa-learn.1p.html
+ |accessdate= 2010-09-18
+| archiveurl= http://web.archive.org/web/20100929165032/http://manpages.ubuntu.com/manpages/gutsy/man1/sa-learn.1p.html| archivedate= 29 September 2010 <!--DASHBot-->| deadurl= no}}</ref> [[SpamBayes]],<ref name=twsSep2>{{Cite news
+ |title= Background Reading
+ |publisher= ''SpamBayes project''
+ |quote= Sharpen your pencils, this is the mathematical background (such as it is).* The paper that started the ball rolling: Paul Graham's A Plan for Spam.* Gary Robinson has an interesting essay suggesting some improvements to Graham's original approach.* Gary Robinson's Linux Journal article discussed using the chi squared distribution.
+ |date= 2010-09-18
+ |url= http://spambayes.sourceforge.net/background.html
+ |accessdate= 2010-09-18
+| archiveurl= http://web.archive.org/web/20100906031341/http://spambayes.sourceforge.net/background.html| archivedate= 6 September 2010 <!--DASHBot-->| deadurl= no}}</ref> [[Bogofilter]] and [[Anti-Spam SMTP Proxy|ASSP]], make use of Bayesian spam filtering techniques, and the functionality is sometimes embedded within [[mail server]] software itself.
+==Process==
+Particular words have particular [[probability|probabilities]] of occurring in spam email and in legitimate email. For instance, most email users will frequently encounter the word "[[Viagra]]" in spam email, but will seldom see it in other email. The filter doesn't know these probabilities in advance, and must first be trained so it can build them up. To train the filter, the user must manually indicate whether a new email is spam or not. For all words in each training email, the filter will adjust the probabilities that each word will appear in spam or legitimate email in its database. For instance, Bayesian spam filters will typically have learned a very high spam probability for the words "Viagra" and "refinance", but a very low spam probability for words seen only in legitimate email, such as the names of friends and family members.
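The training step described above can be sketched as simple bookkeeping. The following is a minimal illustration, not the method of any particular filter: the class name `WordCounts`, the whitespace tokenizer, and the sample messages are all assumptions made for the example. Each word is counted once per message, so that the counts can later be turned into per-message frequencies.

```python
from collections import Counter


class WordCounts:
    """Hypothetical helper: per-class message counts for each word."""

    def __init__(self):
        self.spam_messages = 0
        self.ham_messages = 0
        self.spam_word_counts = Counter()
        self.ham_word_counts = Counter()

    def train(self, text, is_spam):
        # Assumption: lowercase whitespace tokenization; each distinct
        # word is counted once per message.
        words = set(text.lower().split())
        if is_spam:
            self.spam_messages += 1
            self.spam_word_counts.update(words)
        else:
            self.ham_messages += 1
            self.ham_word_counts.update(words)


# The user labels each training message as spam or not.
counts = WordCounts()
counts.train("Cheap Viagra refinance now", is_spam=True)
counts.train("Refinance your home today", is_spam=True)
counts.train("Lunch with Alice tomorrow", is_spam=False)
```

Correcting a wrong judgement later amounts to calling `train` again with the right label, which is how such a sketch would adapt over time.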
+After training, the word probabilities (also known as [[likelihood function]]s) are used to compute the probability that an email with a particular set of words in it belongs to either category. Each word in the email, or only each of the most interesting words, contributes to the email's spam probability. This contribution is called the [[posterior probability]] and is computed using [[Bayes' theorem]]. Then, the email's spam probability is computed over all words in the email, and if the total exceeds a certain threshold (say 95%), the filter will mark the email as spam.
+As in any other [[spam filtering]] technique, email marked as spam can then be automatically moved to a "Junk" email folder, or even deleted outright. Some software implements [[quarantine]] mechanisms that define a time frame during which the user is allowed to review the software's decision.
+The initial training can usually be refined when wrong judgements from the software are identified (false positives or false negatives). That allows the software to dynamically adapt to the ever evolving nature of spam.
+Some spam filters combine the results of both Bayesian spam filtering and other [[metaheuristic|heuristics]] (pre-defined rules about the contents, looking at the message's envelope, etc.), resulting in even higher filtering accuracy, sometimes at the cost of adaptiveness.
+==Mathematical foundation==
+Bayesian [[email filter]]s utilize [[Bayes' theorem]]. Bayes' theorem is used several times in the context of spam:
+* a first time, to compute the probability that the message is spam, knowing that a given word appears in this message;
+* a second time, to compute the probability that the message is spam, taking into consideration all of its words (or a relevant subset of them);
+* sometimes a third time, to deal with rare words.
+===Computing the probability that a message containing a given word is spam===
+Let's suppose the suspected message contains the word "[[replica]]". Most people who are used to receiving e-mail know that this message is likely to be spam, more precisely a proposal to sell counterfeit copies of well-known brands of watches. The spam detection software, however, does not "know" such facts; all it can do is compute probabilities.
+The formula used by the software to determine that is derived from [[Bayes' theorem]]
+:<math>\Pr(S|W) = \frac{\Pr(W|S) \cdot \Pr(S)}{\Pr(W|S) \cdot \Pr(S) + \Pr(W|H) \cdot \Pr(H)}</math>
+where:
+* <math>\Pr(S|W)</math> is the probability that a message is a spam, knowing that the word "replica" is in it;
+* <math>\Pr(S)</math> is the overall probability that any given message is spam;
+* <math>\Pr(W|S)</math> is the probability that the word "replica" appears in spam messages;
+* <math>\Pr(H)</math> is the overall probability that any given message is not spam (is "ham");
+* <math>\Pr(W|H)</math> is the probability that the word "replica" appears in ham messages.
+(For a full demonstration, see [[Bayes' theorem#Extended form]].)
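As a rough illustration, the formula above translates directly into a few lines of Python. The function name and the numbers in the usage line are assumptions chosen for the example, not measured values.

```python
def prob_spam_given_word(p_word_given_spam, p_word_given_ham, p_spam):
    """Pr(S|W) from Pr(W|S), Pr(W|H) and the prior Pr(S), with Pr(H) = 1 - Pr(S)."""
    p_ham = 1.0 - p_spam
    numerator = p_word_given_spam * p_spam
    denominator = numerator + p_word_given_ham * p_ham
    if denominator == 0.0:
        # The word was never seen in either class: the formula is undefined.
        raise ValueError("word has zero probability in both spam and ham")
    return numerator / denominator


# Illustrative inputs only: the word appears in 20% of spam and 1% of ham,
# and 80% of all messages are assumed to be spam.
example = prob_spam_given_word(0.20, 0.01, p_spam=0.8)
```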
+===The spamicity of a word===
+Recent statistics<ref>{{cite web|url=http://eval.symantec.com/mktginfo/enterprise/other_resources/b-state_of_spam_report_09-2009.en-us.pdf|author=Dylan Mors and Dermot Harnett|title=State of Spam, a Monthly Report - Report #33|year=2009}}</ref> show that the current probability of any message being spam is 80%, at the very least:
+:<math> \Pr(S) = 0.8 ;  \Pr(H) = 0.2</math>
+However, most Bayesian spam detection software makes the assumption that there is no ''a priori'' reason for any incoming message to be spam rather than ham, and considers both cases to have equal probabilities of 50%:{{citation needed|date=July 2012}}
+:<math> \Pr(S) = 0.5 ;  \Pr(H) = 0.5</math>
+The filters that use this hypothesis are said to be "not biased", meaning that they have no prejudice regarding the incoming email. This assumption permits simplifying the general formula to:
+:<math>\Pr(S|W) = \frac{\Pr(W|S)}{\Pr(W|S) + \Pr(W|H)}</math>
+This quantity is called the "spamicity" (or "spaminess") of the word "replica", and can be computed. The number <math>\Pr(W|S)</math> used in this formula is approximated by the frequency of messages containing "replica" among the messages identified as spam during the learning phase. Similarly, <math>\Pr(W|H)</math> is approximated by the frequency of messages containing "replica" among the messages identified as ham during the learning phase. For these approximations to make sense, the set of learned messages needs to be big and representative enough. It is also advisable that the learned set of messages conforms to the 50% hypothesis about the split between spam and ham, i.e. that the datasets of spam and ham are of the same size.<ref>Process Software, [http://www.process.com/precisemail/bayesian_filtering.htm Introduction to Bayesian Filtering]</ref>
+Of course, determining whether a message is spam or ham based only on the presence of the word "replica" is error-prone, which is why Bayesian spam software tries to consider several words and combine their spamicities to determine a message's overall probability of being spam.
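The spamicity of a single word under the 50% hypothesis can be sketched as follows. The function name and the message counts are hypothetical; the frequencies are estimated from counts gathered during the learning phase, as described above.

```python
def spamicity(spam_with_word, spam_total, ham_with_word, ham_total):
    """Unbiased spamicity Pr(W|S) / (Pr(W|S) + Pr(W|H)) from training counts.

    Returns None when the word was never seen, since the ratio is then 0/0.
    """
    p_word_given_spam = spam_with_word / spam_total
    p_word_given_ham = ham_with_word / ham_total
    total = p_word_given_spam + p_word_given_ham
    if total == 0.0:
        return None
    return p_word_given_spam / total


# Hypothetical corpus: "replica" appears in 40 of 1000 spam messages
# and in 2 of 1000 ham messages.
replica_spamicity = spamicity(40, 1000, 2, 1000)
```

With these made-up counts the word comes out strongly spam-leaning, but a single word is still weak evidence on its own.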
+===Combining individual probabilities===
+Most Bayesian spam filtering algorithms are based on formulas that are strictly valid (from a probabilistic standpoint) only if the words present in the message are [[Statistical independence|independent events]]. This condition is not generally satisfied (for example, in natural languages like English the probability of finding an adjective is affected by the probability of having a noun), but it is a useful idealization, especially since the statistical correlations between individual words are usually not known. On this basis, one can derive the following formula from Bayes' theorem:<ref>{{cite web|url=http://www.mathpages.com/home/kmath267.htm|title=Combining probabilities}} at MathPages</ref>
+:<math>p = \frac{p_1 p_2 \cdots p_N}{p_1 p_2 \cdots p_N + (1 - p_1)(1 - p_2) \cdots (1 - p_N)}</math>
+where:
+* <math>p</math> is the probability that the suspect message is spam;
+* <math>p_1</math> is the probability <math>\Pr(S|W_1)</math> that it is spam, knowing that it contains a first word (for example "replica");
+* <math>p_2</math> is the probability <math>\Pr(S|W_2)</math> that it is spam, knowing that it contains a second word (for example "watches");
+* etc.
+* <math>p_N</math> is the probability <math>\Pr(S|W_N)</math> that it is spam, knowing that it contains an ''N''th word (for example "home").
+This is the formula referenced by Paul Graham in his 2002 article. Some early commentators stated that "Graham pulled his formulas out of thin air",<ref>http://mail.python.org/pipermail/python-dev/2002-August/028216.html Tim Peters's comment on the algorithm used by Graham</ref> but Graham had actually referenced his source,<ref>{{cite web|url=http://www.paulgraham.com/naivebayes.html|title=Graham's web page referencing the MathPages article for the probability formula used in his spam algorithm.}}</ref> which included a detailed explanation of the formula, and the idealizations on which it is based.
+Spam filtering software based on this formula is sometimes referred to as a [[naive Bayes classifier]].  The result ''p'' is typically compared to a given threshold to decide whether the message is spam or not. If ''p'' is lower than the threshold, the message is considered as likely ham, otherwise it is considered as likely spam.
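A direct transcription of the combining formula and the threshold test might look like the sketch below. The function names, the example spamicities, and the default threshold of 0.95 (taken from the "say 95%" figure mentioned earlier) are assumptions for illustration.

```python
from math import prod


def combine(probabilities):
    """Combine per-word spam probabilities p_1..p_N, assuming independent words."""
    spam_product = prod(probabilities)
    ham_product = prod(1.0 - p for p in probabilities)
    denominator = spam_product + ham_product
    if denominator == 0.0:
        # Happens when the inputs contain both 0.0 and 1.0 (or underflow).
        raise ValueError("combined probability is undefined for these inputs")
    return spam_product / denominator


def is_spam(probabilities, threshold=0.95):
    # At or above the threshold: likely spam; below it: likely ham.
    return combine(probabilities) >= threshold


# Three hypothetical word spamicities.
p = combine([0.9, 0.8, 0.3])
```

This direct form multiplies many numbers smaller than one, so for long messages the products can underflow to zero.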
+===Other expression of the formula for combining individual probabilities===
+Usually ''p'' is not directly computed using the above formula due to [[Arithmetic underflow|floating-point underflow]]. Instead, ''p'' can be computed in the log domain by rewriting the original equation as follows:
+:<math> \frac{1}{p} - 1 = \frac{(1-p_1)(1-p_2)\dots(1-p_N)}{p_1 p_2 \dots p_N} </math>
+Taking logs on both sides:
+:<math>  \ln \left ( \frac{1}{p} - 1  \right ) = \sum_{i=1}^N \left[ \ln(1-p_i) - \ln p_i \right]</math>
+Let <math>\eta = \sum_{i=1}^N \left[ \ln(1-p_i) -\ln p_i \right] </math>. Therefore,
+:<math> \frac{1}{p} - 1 = e^\eta </math>
+Hence the alternate formula for computing the combined probability:
+:<math> p = \frac{1}{1 + e^\eta} </math>
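The log-domain form can be sketched as follows. The function name is an assumption, and the sketch requires every input to lie strictly between 0 and 1 so that both logarithms are defined. The branch on the sign of η is an implementation detail added here to avoid overflow in the exponential; it evaluates the same expression.

```python
import math


def combine_log(probabilities):
    """Compute p = 1 / (1 + e^eta), with eta = sum(ln(1 - p_i) - ln(p_i))."""
    eta = sum(math.log(1.0 - p) - math.log(p) for p in probabilities)
    if eta > 0:
        # Equivalent form e^-eta / (1 + e^-eta), so exp() never overflows.
        decay = math.exp(-eta)
        return decay / (1.0 + decay)
    return 1.0 / (1.0 + math.exp(eta))


# Same hypothetical inputs as a direct product-based computation would use.
p_log = combine_log([0.9, 0.8, 0.3])

# Hundreds of strongly ham-leaning words still give a finite answer.
p_many = combine_log([0.01] * 500)
```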
+===Dealing with rare words===
+If a word has never been encountered during the learning phase, both the numerator and the denominator are equal to zero, both in the general formula and in the spamicity formula. The software can decide to discard such words, for which there is no information available.
+More generally, words that were encountered only a few times during the learning phase cause a problem, because it would be an error to trust blindly the information they provide. A simple solution is to avoid taking such unreliable words into account as well.
+Applying Bayes' theorem again, and assuming the classification between spam and ham of the emails containing a given word ("replica") is a [[random variable]] with a [[beta distribution]], some programs decide to use a corrected probability:
+:<math>\Pr'(S|W) = \frac{s \cdot \Pr(S) + n \cdot \Pr(S|W)}{s + n }</math>
+where:
+* <math>\Pr'(S|W)</math> is the corrected probability for the message to be spam, knowing that it contains a given word;
+* <math>s</math> is the ''strength'' we give to background information about incoming spam;
+* <math>\Pr(S)</math> is the probability of any incoming message to be spam;
+* <math>n</math> is the number of occurrences of this word during the learning phase;
+* <math>\Pr(S|W)</math> is the spamicity of this word.
+(Demonstration:<ref>{{cite web|url=http://www.linuxjournal.com/article/6467|publisher=Linux Journal|author=[[Gary Robinson]]|title=A statistical approach to the spam problem|year=2003}}</ref>)
+This corrected probability is used instead of the spamicity in the combining formula.
+<math>\Pr(S)</math> can again be taken equal to 0.5, to avoid being too suspicious about incoming email. A good value for ''s'' is 3, meaning that the learned corpus must contain more than 3 messages with that word to put more confidence in the spamicity value than in the default value.
+This formula can be extended to the case where ''n'' is equal to zero (and where the spamicity is not defined), and evaluates in this case to <math>\Pr(S)</math>.
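The corrected probability can be sketched as below, using the defaults suggested in the text (''s'' = 3 and Pr(S) = 0.5). The function and parameter names are assumptions for illustration.

```python
def corrected_spamicity(spamicity, n, s=3.0, p_spam=0.5):
    """Blend a word's spamicity with the prior, weighted by how often it was seen.

    spamicity: Pr(S|W), ignored when n == 0 because it is undefined there.
    n: number of occurrences of the word during the learning phase.
    s: strength given to the background information.
    """
    if n == 0:
        # Unseen word: the formula reduces to the prior Pr(S).
        return p_spam
    return (s * p_spam + n * spamicity) / (s + n)


# A word seen once, only in spam, is pulled strongly towards 0.5 ...
seen_once = corrected_spamicity(1.0, n=1)
# ... while the same spamicity backed by 97 occurrences is largely trusted.
seen_often = corrected_spamicity(1.0, n=97)
```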
+===Other heuristics===
+"Neutral" words like "the", "a", "some", or "is" (in English), or their equivalents in other languages, can be ignored. More generally, some Bayesian filters simply ignore all the words which have a spamicity close to 0.5, as they contribute little to a good decision. The words taken into consideration are those whose spamicity is close to 0.0 (distinctive signs of legitimate messages), or close to 1.0 (distinctive signs of spam). One method, for example, is to keep only the ten words in the examined message that have the greatest [[absolute value]]&nbsp;|0.5&nbsp;&minus;&nbsp;''p<sub>i</sub>''|.
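Selecting the most distinctive words can be sketched as a sort on the distance from 0.5. The function name, the `limit` parameter, and the sample spamicities are hypothetical; the default of ten follows the example given in the text.

```python
def most_interesting(word_spamicities, limit=10):
    """Return the words whose spamicity is farthest from the neutral value 0.5."""
    ranked = sorted(
        word_spamicities,
        key=lambda word: abs(0.5 - word_spamicities[word]),
        reverse=True,
    )
    return ranked[:limit]


# Hypothetical spamicities for the words of one message.
message_words = {"the": 0.5, "replica": 0.98, "alice": 0.02, "some": 0.55}
selected = most_interesting(message_words, limit=2)
```

Here the neutral words "the" and "some" are dropped, and only the two strong indicators (one for spam, one for ham) are kept.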
+Some software products take into account the fact that a given word appears several times in the examined message,<ref>{{cite web|url=http://spamprobe.sourceforge.net/paper.html|author=Brian Burton|title=SpamProbe - Bayesian Spam Filtering Tweaks|year=2003}}</ref> others don't.
+Some software products use ''patterns'' (sequences of words) instead of isolated natural-language words.<ref>{{cite web|url=http://bnr.nuclearelephant.com/l|author=Jonathan A. Zdziarski|title=Bayesian Noise Reduction: Contextual Symmetry Logic Utilizing Pattern Consistency Analysis|year=2004}}</ref> For example, with a "context window" of four words, they compute the spamicity of "Viagra is good for", instead of computing the spamicities of "Viagra", "is", "good", and "for". This method gives more sensitivity to context and eliminates the [[Bayesian noise]] better, at the expense of a bigger database.
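One simple way to produce such patterns is a sliding window over the words of the message, as in the sketch below. This is an assumption about how a context window could be implemented, not a description of any specific product, which may build its patterns differently.

```python
def word_patterns(text, window=4):
    """Return every run of `window` consecutive words as one pattern string."""
    words = text.split()
    if len(words) <= window:
        # Short messages yield a single pattern (or none if empty).
        return [" ".join(words)] if words else []
    return [
        " ".join(words[i:i + window])
        for i in range(len(words) - window + 1)
    ]


patterns = word_patterns("Viagra is good for you")
```

Each pattern is then treated like a single token when counting and computing spamicities, which is why the database grows compared with isolated words.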
+===Mixed methods===
+There are other ways of combining individual probabilities for different words than using the "naive" approach. These methods differ from it on the assumptions they make on the statistical properties of the input data. These different hypotheses result in radically different formulas for combining the individual probabilities.
+For example, assuming that &minus;2 times the logarithm of the product of the individual probabilities follows a [[chi-squared distribution|chi-squared]] distribution with 2''N'' degrees of freedom, one could use the formula:
+:<math>p = C^{-1}(-2 \ln(p_1 p_2 \cdots p_N), 2N) \, </math>
+where ''C''<sup>&minus;1</sup> is the [[Inverse-chi-squared distribution|inverse of the chi-squared function]].
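The sketch below shows one possible reading of this formula. It assumes that ''C''<sup>&minus;1</sup> denotes the upper-tail probability of the chi-squared distribution, which is the interpretation used in Fisher's method for combining probabilities; that reading, and both function names, are assumptions rather than something the formula above states. For an even number of degrees of freedom the tail probability has a closed-form series, so no external library is needed.

```python
import math


def chi2_upper_tail_even(x, degrees_of_freedom):
    """P(X >= x) for a chi-squared variable with an even number of degrees of freedom."""
    if degrees_of_freedom <= 0 or degrees_of_freedom % 2 != 0:
        raise ValueError("degrees_of_freedom must be a positive even integer")
    half = x / 2.0
    term = math.exp(-half)
    total = term
    # Series: e^(-x/2) * sum_{i=0}^{k/2 - 1} (x/2)^i / i!
    for i in range(1, degrees_of_freedom // 2):
        term *= half / i
        total += term
    return min(total, 1.0)


def combine_chi2(probabilities):
    """Assumed reading: p = upper tail at -2 ln(p_1 ... p_N) with 2N degrees of freedom."""
    statistic = -2.0 * sum(math.log(p) for p in probabilities)
    return chi2_upper_tail_even(statistic, 2 * len(probabilities))


p_spammy = combine_chi2([0.99, 0.99])
p_hammy = combine_chi2([0.01, 0.01])
```

Under this reading, word probabilities close to 1 give a combined value close to 1, and word probabilities close to 0 give a combined value close to 0.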
+Individual probabilities can be combined with the techniques of the [[Markovian discrimination]] too.
+==Discussion==
+===Advantages===
+{{disputed-section|date=May 2013}}
+One of the main advantages{{citation needed|date=May 2013}} of Bayesian spam filtering is that it can be trained on a per-user basis.
+The spam that a user receives is often related to that user's online activities. For example, a user may have been subscribed to an online newsletter that the user considers to be spam. This online newsletter is likely to contain words that are common to all of its issues, such as the name of the newsletter and its originating email address. A Bayesian spam filter will eventually assign a higher spam probability to such words, based on the user's specific patterns.
+The legitimate e-mails a user receives will tend to be different. For example, in a corporate environment, the company name and the names of clients or customers will be mentioned often. The filter will assign a lower spam probability to emails containing those names.
+The word probabilities are unique to each user and can evolve over time with corrective training whenever the filter incorrectly classifies an email. As a result, Bayesian spam filtering accuracy after training is often superior to pre-defined rules.
+It can perform particularly well in avoiding false positives,{{citation needed|date=May 2013}} where legitimate email is incorrectly classified as spam. For example, if the email contains the word "Nigeria", which is frequently used in [[Advance fee fraud]] spam, a pre-defined rules filter might reject it outright. A Bayesian filter would mark the word "Nigeria" as a probable spam word, but would take into account other important words that usually indicate legitimate e-mail. For example, the name of a spouse may strongly indicate the e-mail is not spam, which could overcome the use of the word "Nigeria."
+===Disadvantages===
+Depending on the implementation, Bayesian spam filtering may be susceptible to [[Bayesian poisoning]], a technique used by spammers in an attempt to degrade the effectiveness of spam filters that rely on Bayesian filtering. A spammer practicing Bayesian poisoning will send out emails with large amounts of legitimate text (gathered from legitimate news or literary sources). [[e-mail spam|Spammer]] tactics include insertion of random innocuous words that are not normally associated with spam, thereby decreasing the email's spam score, making it more likely to slip past a Bayesian spam filter. However, with Paul Graham's scheme, for example, only the most significant probabilities are used, so padding the text out with non-spam-related words does not affect the detection probability significantly.
+Words that normally appear in large quantities in spam may also be transformed by spammers. For example, «Viagra» would be replaced with «Viaagra» or «V!agra» in the spam message. The recipient of the message can still read the changed words, but each of these words is met more rarely by the Bayesian filter, which hinders its learning process. As a general rule, this spamming technique does not work very well, because the derived words end up recognized by the filter just like the normal ones.<ref>Paul Graham (2002), [http://www.paulgraham.com/spam.html A Plan for Spam]</ref>
+Another technique used to try to defeat Bayesian spam filters is to replace text with pictures, either directly included or linked. The whole text of the message, or some part of it, is replaced with a picture where the same text is "drawn". The spam filter is usually unable to analyze this picture, which would contain the sensitive words like «Viagra». However, since many mail clients disable the display of linked pictures for security reasons, the spammer sending links to remote pictures might reach fewer targets. Also, a picture's size in bytes is bigger than the equivalent text's size, so the spammer needs more bandwidth to send messages directly including pictures. Some filters are more inclined to decide that a message is spam if it has mostly graphical contents. Finally, a probably more efficient solution has been proposed by Google and is used by its [[Gmail]] email system: performing [[Optical character recognition|optical character recognition (OCR)]] on every mid- to large-size image and analyzing the text inside.<ref>{{cite web|url=http://www.google.com/mail/help/fightspam/spamexplained.html|title=Gmail uses Google's innovative technology to keep spam out of your inbox}}</ref>
+==General applications of Bayesian filtering==
+While Bayesian filtering is used widely to identify spam email, the technique can classify (or "cluster") almost any sort of data. It has uses in science, medicine, and engineering. One example is a general purpose classification program called [http://ti.arc.nasa.gov/tech/rse/synthesis-projects-applications/autoclass/ AutoClass] which was originally used to classify stars according to spectral characteristics that were otherwise too subtle to notice. There is recent speculation that even the brain uses Bayesian methods to classify sensory stimuli and decide on behavioral responses.<ref>[http://www.bcs.rochester.edu/people/alex/pub/articles/KnillPougetTINS04.pdf Trends in Neuroscience, 27(12):712-9, 2004] (pdf)</ref>
+==See also==
+* [[Bayesian poisoning]]
+* [[Bayesian programming]]
+* [[Bayesian inference]]
+* [[Bayes's theorem]]
+* [[Email filtering]]
+* [[Markovian discrimination]]
+* [[Naive Bayes classifier]]
+* [[Recursive Bayesian estimation]]
+* [[Stopping e-mail abuse]]
+==References==
+<references />
+==External links==
+* Guide to Bayesian spam filters: [http://lwn.net/Articles/172491/ part 1], [http://lwn.net/Articles/173910/ part 2].
+* [http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html Gary Robinson's spam blog]
+{{DEFAULTSORT:Bayesian Spam Filtering}}
+[[Category:Applications of Bayesian inference|Spam filtering]]
+[[Category:Estimation theory]]
+[[Category:Spam filtering]]
