Triple modular redundancy

From formulasearchengine
Revision as of 06:05, 11 July 2013 by en>DavidCary (mention data scrubbing)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

Cosine similarity is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any other angle. It is thus a judgement of orientation and not magnitude: two vectors with the same orientation have a Cosine similarity of 1, two vectors at 90° have a similarity of 0, and two vectors diametrically opposed have a similarity of -1, independent of their magnitude. Cosine similarity is particularly used in positive space, where the outcome is neatly bounded in [0,1].

Note that these bounds apply for any number of dimensions, and Cosine similarity is most commonly used in high-dimensional positive spaces. For example, in Information Retrieval and text mining, each term is notionally assigned a different dimension and a document is characterised by a vector where the value of each dimension corresponds to the number of times that term appears in the document. Cosine similarity then gives a useful measure of how similar two documents are likely to be in terms of their subject matter.[1]

The technique is also used to measure cohesion within clusters in the field of data mining.[2]

Cosine distance is a term often used for the complement in positive space, that is: DC(A,B)=1SC(A,B). It is important to note, however, that this is not a proper distance metric as it does not have the triangle inequality property and it violates the coincidence axiom; to repair the triangle inequality property whilst maintaining the same ordering, it is necessary to convert to Angular distance (see below.)

One of the reasons for the popularity of Cosine similarity is that it is very efficient to evaluate, especially for sparse vectors, as only the non-zero dimensions need to be considered.

Definition

The cosine of two vectors can be derived by using the Euclidean dot product formula:

ab=abcosθ

Given two vectors of attributes, A and B, the cosine similarity, cos(θ), is represented using a dot product and magnitude as

similarity=cos(θ)=ABAB=i=1nAi×Bii=1n(Ai)2×i=1n(Bi)2

The resulting similarity ranges from −1 meaning exactly opposite, to 1 meaning exactly the same, with 0 usually indicating independence, and in-between values indicating intermediate similarity or dissimilarity.

For text matching, the attribute vectors A and B are usually the term frequency vectors of the documents. The cosine similarity can be seen as a method of normalizing document length during comparison.

In the case of information retrieval, the cosine similarity of two documents will range from 0 to 1, since the term frequencies (tf-idf weights) cannot be negative. The angle between two term frequency vectors cannot be greater than 90°.

Angular similarity

The term "cosine similarity" has also been used on occasion to express a different coefficient, although the most common use is as defined above. Using the same calculation of similarity, the normalised angle between the vectors can be used as a bounded similarity function within [0,1], calculated from the above definition of similarity by:

1cos1(similarity)π

in a domain where vector coefficients may be positive or negative, or

12cos1(similarity)π

in a domain where the vector coefficients are always positive.

Although the term "cosine similarity" has been used for this angular distance, the term is oddly used as the cosine of the angle is used only as a convenient mechanism for calculating the angle itself and is no part of the meaning. Anyway this coefficient can not be used as a proper distance metric (by subtracting it from 1), consider 2 vectors with angle 0 but different l2-norms. A proper distance metric would expect these 2 vectors are the same, but this is not the case. However for most uses this is not an important property. For any use where only the relative ordering of similarity or distance within a set of vectors is important, then which function is used is immaterial as the resulting order will be unaffected by the choice.

Confusion with "Tanimoto" coefficient

On occasion, cosine similarity has been confusedPotter or Ceramic Artist Truman Bedell from Rexton, has interests which include ceramics, best property developers in singapore developers in singapore and scrabble. Was especially enthused after visiting Alejandro de Humboldt National Park. as a specialised form of a similarity coefficient with a similar algebraic form:

T(A,B)=ABA2+B2AB

In fact, this algebraic form was first defined by Tanimoto as a mechanism for calculating the Jaccard coefficient in the case where the sets being compared are represented as bit vectors. While the formula extends to vectors in general, it has quite different properties from cosine similarity and bears little relation other than its superficial appearance.

Ochiai coefficient

This coefficient is also known in biology as Ochiai coefficient, or Ochiai-Barkman coefficient, or Otsuka-Ochiai coefficient:[3][4]

K=n(AB)n(A)×n(B)

Here, A and B are sets, and n(A) is the number of elements in A. If sets are represented as bit vectors, the Ochiai coefficient can be seen to be the same as the cosine similarity.

Properties

Cosine similarity is related to Euclidean distance as follows. Denote Euclidean distance by the usual |A - B|, and observe that

|AB|2=(AB)(AB)=|A|+|B|2AB

by expansion. When A and B are normalized to unit length, |A| = |B| = 1 so the previous is equal to

2(1cos(A,B))

See also

References

43 year old Petroleum Engineer Harry from Deep River, usually spends time with hobbies and interests like renting movies, property developers in singapore new condominium and vehicle racing. Constantly enjoys going to destinations like Camino Real de Tierra Adentro.

External links

  1. Singhal, Amit (2001). "Modern Information Retrieval: A Brief Overview". Bulletin of the IEEE Computer Society Technical Committee on Data Engineering 24 (4): 35–43.
  2. P.-N. Tan, M. Steinbach & V. Kumar, "Introduction to Data Mining", , Addison-Wesley (2005), ISBN 0-321-32136-7, chapter 8; page 500.
  3. Ochiai A. Zoogeographical studies on the soleoid fishes found Japan and its neighboring regions. II // Bull. Jap. Soc. sci. Fish. 1957. V. 22. № 9. P. 526-530.
  4. Barkman J.J. Phytosociology and ecology of cryptogamic epiphytes, including a taxonomic survey and description of their vegetation units in Europe. – Assen. Van Gorcum. 1958. 628 p.