S transform: Difference between revisions

en>David Eppstein
rm orphan tag
 
{{Distinguish|matrix similarity}}
 
{{merge to|Similarity measure|discuss=Talk:Similarity measure#Proposed merge with Similarity matrix|date=October 2013}}
A '''similarity matrix''' is a [[matrix (mathematics)|matrix]] of scores that represent the similarity between a number of data points. Each ''element'' of the similarity matrix contains a [[Similarity measure|measure of similarity]] between two of the data points. Similarity matrices are strongly related to their counterparts, [[distance matrix|distance matrices]] and [[substitution matrix|substitution matrices]].
 
== Uses ==
 
Similarity matrices have a wide range of uses:
# To find [[Cluster analysis|clusters]] of data points.
# To [[Sequence alignment|align sequences]] of DNA.
 
==Use in clustering==
 
In [[spectral clustering]], a similarity, or affinity, matrix is used to transform the data so as to overcome difficulties caused by a lack of convexity in the shape of the clusters.<ref name="Ng">{{Citation
| last1 = Ng | first1 = A.Y.
| last2 = Jordan | first2 = M.I.
| last3 = Weiss | first3 = Y.
| title = On Spectral Clustering: Analysis and an Algorithm
| journal = Advances in Neural Information Processing Systems
| volume = 14
| pages = 849-856
| publisher = MIT Press
| year = 2001
| url = http://books.nips.cc/papers/files/nips14/AA35.pdf
}}</ref>  The value of entry <math>(i,j)</math> in the matrix can be based simply on the Euclidean distance between points <math>i</math> and <math>j</math>, or it can be a more complex measure such as the Gaussian <math>e^{-\|s_i - s_j\|^2/(2\sigma^2)}</math>, where <math>s_i</math> and <math>s_j</math> are the two data points and <math>\sigma</math> is a scaling parameter.<ref name="Ng" /> Further modifying this result with network analysis techniques is also common.<ref>{{Citation
| last1 = Li | first1 = Xin-Ye
| last2 = Guo | first2 = Li-Jie
| title = Constructing affinity matrix in spectral clustering based on neighbor propagation
| journal = Neurocomputing
| volume = 97
| pages = 125-130
| publisher = Elsevier
| year = 2012
| url = http://dx.doi.org/10.1016/j.neucom.2012.06.023
| doi = 10.1016/j.neucom.2012.06.023
}}</ref>
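
As an illustration of the Gaussian measure, the following minimal sketch builds an affinity matrix with entries <math>e^{-\|s_i - s_j\|^2/(2\sigma^2)}</math> for a small list of points. The function name <code>gaussian_affinity</code>, the default <code>sigma</code> and the choice to leave the diagonal at zero are assumptions made for this example, not part of any particular library.

```python
import math

def gaussian_affinity(points, sigma=1.0):
    """Return a symmetric affinity matrix for a list of equal-length points.

    Entry (i, j) is exp(-||s_i - s_j||^2 / (2 * sigma^2)) for i != j.
    The diagonal is left at 0.0, which is an assumption of this sketch.
    """
    n = len(points)
    affinity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Squared Euclidean distance between point i and point j.
            sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            value = math.exp(-sq_dist / (2.0 * sigma ** 2))
            affinity[i][j] = value
            affinity[j][i] = value  # the matrix is symmetric
    return affinity

# Hypothetical example data: three points in the plane.
example_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
example_affinity = gaussian_affinity(example_points, sigma=1.0)
```

Nearby points receive values close to 1 and distant points values close to 0. A smaller <code>sigma</code> makes the affinity fall off more quickly with distance.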
 
==Use in sequence alignment==
 
Similarity matrices are used in [[sequence alignment]]. Higher scores are given to more similar characters, and lower or negative scores to dissimilar characters.
 
[[Nucleotide]] similarity matrices are used to align [[nucleic acid]] sequences. Because there are only four nucleotides commonly found in [[DNA]] ([[Adenine]] (A), [[Cytosine]] (C), [[Guanine]] (G) and [[Thymine]] (T)), nucleotide similarity matrices are much simpler than [[protein]] similarity matrices. For example, a simple matrix will assign identical bases a score of +1 and non-identical bases a score of −1. A more complicated matrix would give a higher score to transitions (changes from a [[pyrimidine]] such as C or T to another pyrimidine, or from a [[purine]] such as A or G to another purine) than to transversions (from a pyrimidine to a purine or vice versa).
The match/mismatch ratio of the matrix sets the target evolutionary distance.<ref>{{cite journal | journal=Methods: a companion to methods in enzymology | volume=3 | issue=1 | pages=66 | year=1991  | author=States, D | title=Improved sensitivity of nucleic acid database searches using application-specific scoring matrices | pmid= | doi = 10.1016/S1046-2023(05)80165-3 | last2=Gish | first2=W | last3=Altschul | first3=S }}</ref><ref>{{cite journal | url=http://informatics.umdnj.edu/bioinformatics/courses/5020/notes/BLOSUM62%20primer.pdf |journal=Nature Biotechnology | title=Where did the BLOSUM62 alignment score matrix come from? | author=Sean R. Eddy | doi=10.1038/nbt0804-1035 | pmid=15286655 | year=2004 | volume=22 | pages=1035 | issue=8}}</ref>  The +1/−3 DNA matrix used by BLASTN is best suited for finding matches between sequences that are 99% identical; a +1/−1  (or +4/−4) matrix is much more suited to sequences with about 70% similarity.  Matrices for lower similarity sequences require longer sequence alignments.
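
The nucleotide matrices described above can be sketched as a small lookup table. The function names and the transition-aware scores (+2, −1, −2) are hypothetical values chosen for illustration. The +1/−1 and +1/−3 settings are the ones mentioned in the text.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def nucleotide_matrix(match=1, transition=-1, transversion=-1):
    """Build a DNA similarity matrix as a dict keyed by (base, base).

    With transition == transversion this is a plain match/mismatch matrix.
    """
    matrix = {}
    for a in "ACGT":
        for b in "ACGT":
            if a == b:
                matrix[(a, b)] = match
            elif ({a, b} <= PURINES) or ({a, b} <= PYRIMIDINES):
                # Purine <-> purine or pyrimidine <-> pyrimidine change.
                matrix[(a, b)] = transition
            else:
                # Purine <-> pyrimidine change.
                matrix[(a, b)] = transversion
    return matrix

def score_ungapped(seq1, seq2, matrix):
    """Sum the per-position scores of two equal-length, ungapped sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(matrix[(a, b)] for a, b in zip(seq1, seq2))

simple = nucleotide_matrix(match=1, transition=-1, transversion=-1)
strict = nucleotide_matrix(match=1, transition=-3, transversion=-3)
# Hypothetical scores that penalise transversions more than transitions.
graded = nucleotide_matrix(match=2, transition=-1, transversion=-2)
```

For example, <code>score_ungapped("ACGT", "ACGA", simple)</code> sums three matches and one mismatch. The same pair scores lower under the +1/−3 matrix, which is why that matrix favours nearly identical sequences.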
 
[[Amino acid]] similarity matrices are more complicated, because there are 20 amino acids coded for by the [[genetic code]], and so a larger number of possible substitutions. Therefore, the similarity matrix for amino acids contains 400 entries (although it is usually [[symmetric matrix|symmetric]]). The first approach scored all amino acid changes equally. A later refinement was to determine amino acid similarities based on how many base changes were required to change a codon to code for that amino acid. This model is an improvement, but it does not take into account the selective pressure acting on amino acid changes. Better models took into account the chemical properties of amino acids.
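
The codon-based refinement can be illustrated by counting the minimum number of base changes needed to turn a codon for one amino acid into a codon for another. The sketch below uses only a small, hand-picked subset of the standard genetic code, and the names <code>CODONS</code> and <code>min_base_changes</code> are assumptions of this example. A full matrix would need all 20 amino acids and a rule for turning counts into scores.

```python
from itertools import product

# Partial codon table (DNA alphabet), limited to six amino acids for brevity.
CODONS = {
    "Phe": ["TTT", "TTC"],
    "Leu": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
    "Met": ["ATG"],
    "Trp": ["TGG"],
    "Asp": ["GAT", "GAC"],
    "Glu": ["GAA", "GAG"],
}

def base_changes(codon1, codon2):
    """Number of positions at which two codons differ."""
    return sum(a != b for a, b in zip(codon1, codon2))

def min_base_changes(aa1, aa2):
    """Fewest base changes between any codon of aa1 and any codon of aa2."""
    return min(
        base_changes(c1, c2)
        for c1, c2 in product(CODONS[aa1], CODONS[aa2])
    )

# Pairwise table over the subset; fewer changes suggests higher similarity.
codon_distance = {
    (a, b): min_base_changes(a, b) for a in CODONS for b in CODONS
}
```

In this subset, Asp and Glu are one base change apart while Asp and Trp are three apart, so a codon-based matrix would score the first pair as more similar.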
 
One approach has been to empirically generate the similarity matrices. The [[Margaret Oakley Dayhoff|Dayhoff]] method used phylogenetic trees and sequences taken from species on the tree. This approach has given rise to the [[point accepted mutation|PAM]] series of matrices. PAM matrices are labelled based on how many accepted amino acid replacements have occurred per 100 amino acids.
While the PAM matrices benefit from having a well understood evolutionary model, they are most useful at short evolutionary distances (PAM10 - PAM120).  At long evolutionary distances, for example PAM250 or 20% identity, it has been shown that the [[BLOSUM]] matrices are much more effective.
 
The BLOSUM series were generated by comparing a number of divergent sequences. The BLOSUM series are labeled by the percent identity threshold above which sequences were clustered together when the matrix was built, so a lower BLOSUM number corresponds to a higher PAM number.
 
== See also ==
* [[Recurrence plot]], a visualisation tool of recurrences in dynamical (and other) systems
* [[Self-similarity matrix]]
* [[Similarity testing]]
 
==Notes and references==
<references />
 
{{DEFAULTSORT:Similarity Matrix}}
[[Category:Bioinformatics]]
[[Category:DNA]]
[[Category:Matrices]]
[[Category:Computational phylogenetics]]
[[Category:Statistical distance measures]]
[[Category:Multivariate statistics]]

Latest revision as of 15:30, 8 May 2014
