en>Incnis Mrsi: not restricted to points anymore (+ fixed a hell of typos, apparently also a bit of German accent?)

2014-01-29T17:58:34Z

The '''Datafly algorithm''' is an algorithm for providing anonymity in medical data. The algorithm was developed by [[L. Sweeney]] in 1997−98.<ref>{{cite web|last=Latanya Sweeney|title=Datafly: a system for providing anonymity in medical data|url=http://dataprivacylab.org/datafly/|accessdate=19 January 2014}}</ref><ref>L. Sweeney, Datafly: a system for providing anonymity in medical data. Database Security, XI: Status and Prospects, T. Lin and S. Qian (eds), Elsevier Science, Amsterdam, 1998.[http://dataprivacylab.org/datafly/paper2.pdf]</ref> Anonymization is achieved by automatically generalizing, substituting, inserting and removing information as appropriate, without losing many of the details found within the data. The method can be used on the fly in role-based security within an institution, and in batch mode for exporting data from an institution.
Organizations release and receive medical data with all explicit identifiers, such as names, removed, in the mistaken belief that patient confidentiality is maintained because the resulting data look anonymous. However, the remaining data can be used to re-identify individuals by linking or matching the data to other databases, or by looking at unique characteristics found in the fields and records of the database itself.
The Datafly algorithm has been criticized for trying to achieve anonymization by over-generalization. The algorithm selects the attribute with the greatest number of distinct values as the one to generalize first.<ref>{{cite web|last=Li Xiong|title=Data Anonymization - Generalization Algorithms|url=http://www.mathcs.emory.edu/~lxiong/cs573_s12/share/slides/0131_generalization_slawek.pdf|accessdate=19 January 2014}}</ref>
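
The selection heuristic described above can be illustrated with a minimal Python sketch. The function name <code>attribute_to_generalize</code>, the sample rows and the attribute names are assumptions made for illustration only; they do not come from Sweeney's description.

```python
def attribute_to_generalize(rows, quasi_identifier):
    """Return the quasi-identifier attribute with the greatest number of distinct values."""
    distinct = {attr: len({row[attr] for row in rows}) for attr in quasi_identifier}
    # Ties are broken in favour of the attribute listed first.
    return max(quasi_identifier, key=lambda attr: distinct[attr])


# Hypothetical sample data: 3 distinct birth years, 2 ZIP codes, 2 sexes.
rows = [
    {"birth_year": "1964", "zip": "02138", "sex": "F"},
    {"birth_year": "1965", "zip": "02138", "sex": "M"},
    {"birth_year": "1967", "zip": "02139", "sex": "F"},
]
chosen = attribute_to_generalize(rows, ["birth_year", "zip", "sex"])
print(chosen)  # birth_year
```

Because the choice depends only on the number of distinct values, an attribute may be generalized even when a different choice would lose less information, which is one way to read the over-generalization criticism.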

==Core algorithm==
An outline of the Datafly algorithm is presented below.<ref>{{cite book|last=Latanya Sweeney|title=Computational Disclosure Control: A Primer on Data Privacy Protection|publisher=MIT|page=113|url= http://hdl.handle.net/1721.1/8589}}</ref>

'''Input''':
Private table PT; quasi-identifier QI = ( ''A''<sub>1</sub>, ..., ''A''<sub>''n''</sub> ); ''k''-anonymity constraint ''k''; domain generalization hierarchies DGH<sub>''A''<sub>''i''</sub></sub>, where ''i'' = 1, ..., ''n'', with accompanying functions ''f''<sub>''A''<sub>''i''</sub></sub>; and loss, which is a limit on the percentage of tuples that can be suppressed. PT[id] is the set of unique identifiers (keys) for each tuple.

'''Output''':
MGT, a generalization of PT[QI] that enforces ''k''-anonymity.

'''Assumes''': | PT | ≥ ''k'', and loss * | PT | = ''k''
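
As an illustration of the output condition, the following Python sketch checks whether a table is ''k''-anonymous over a quasi-identifier, that is, whether every combination of quasi-identifier values occurs at least ''k'' times. The function name <code>is_k_anonymous</code> and the sample rows are hypothetical.

```python
from collections import Counter


def is_k_anonymous(rows, quasi_identifier, k):
    """True if every quasi-identifier value combination occurs at least k times."""
    counts = Counter(tuple(row[attr] for attr in quasi_identifier) for row in rows)
    return all(count >= k for count in counts.values())


# Hypothetical generalized table: two groups of two indistinguishable tuples.
released = [
    {"birth_year": "196*", "zip": "02138"},
    {"birth_year": "196*", "zip": "02138"},
    {"birth_year": "196*", "zip": "02139"},
    {"birth_year": "196*", "zip": "02139"},
]
print(is_k_anonymous(released, ["birth_year", "zip"], 2))  # True
print(is_k_anonymous(released, ["birth_year", "zip"], 3))  # False
```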

'''algorithm Datafly''':

// Construct a frequency list containing unique sequences of values across the quasi-identifier in PT,

// along with the number of occurrences of each sequence.

:1. let freq be an expandable and collapsible Vector with no elements initially. Each element is of the form ( QI, frequency, SID ), where SID = { ''id''<sub>''i''</sub> : ∃ ''t''[id] ∈ PT[id] ⇒ ''t''[id] = ''id''<sub>''i''</sub> }; and, frequency = |SID|. Therefore, freq is also accessible as a table over (QI, frequency, SID).

:2. let pos <math> \gets</math> 0, total <math> \gets</math> 0

:3. while total ≠ |PT| do

::3.1 freq[pos] <math> \gets</math> ( ''t''[QI], occurs, SID ) where ''t''[QI] ∈ PT[QI], ( ''t''[QI], __, __ ) <math>\not\in</math> freq; occurs = |PT| − |PT[QI] − {''t''[QI]}|; and, SID = { ''id''<sub>''i''</sub> : ∃ ''t''[id] ∈ PT[id] ⇒ ''t''[id] = ''id''<sub>''i''</sub> }

::3.2 pos <math> \gets</math> pos + 1, total <math> \gets</math> total + occurs

:// Make a solution by generalizing the attribute with the greatest number of distinct values

:// and suppressing no more than the allowed number of tuples.

:4. let belowk <math> \gets</math> 0

:5. for pos <math> \gets</math> 1 to |freq| do

::5.1 ( __, count ) <math> \gets</math> freq[pos]

::5.2 if count < ''k'' then do

:::5.2.1 belowk <math> \gets</math> belowk + count

:6. if belowk > ''k'' then do: // Note. loss * |PT| = ''k''.

::6.1 freq <math> \gets</math> generalize(freq)

::6.2 go to step 4

:7. else do

:// assert: the number of tuples to suppress in freq is ≤ loss * |PT|

::7.1 freq <math> \gets</math> suppress( freq, belowk )

::7.2 MGT <math> \gets</math> reconstruct(freq)

:8. return MGT.
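
The outline above can be approximated by the following Python sketch. It is a simplified reading under stated assumptions, not Sweeney's implementation: the table is a list of dictionaries, each domain generalization hierarchy is represented by a hypothetical function that maps a value one level up, the tuple identifiers (SID) are omitted, and the helper <code>mask_last_char</code> and the sample data are invented for illustration.

```python
from collections import Counter


def mask_last_char(value):
    """Hypothetical one-level generalization: mask the rightmost unmasked character."""
    kept = value.rstrip("*")
    if not kept:
        return value  # already fully generalized
    return kept[:-1] + "*" * (len(value) - len(kept) + 1)


def datafly(rows, quasi_identifier, k, generalizers):
    """Sketch of steps 1-8: generalize until at most k tuples need suppressing."""
    table = [tuple(row[attr] for attr in quasi_identifier) for row in rows]
    if len(table) < k:
        raise ValueError("the table must hold at least k tuples")
    while True:
        freq = Counter(table)  # steps 1-3: frequency list over QI
        # Steps 4-5: count the tuples whose value combination occurs fewer than k times.
        below_k = sum(count for count in freq.values() if count < k)
        if below_k <= k:  # step 7: few enough tuples left to suppress
            break
        # Step 6: generalize the attribute with the greatest number of distinct values.
        distinct = [len({t[i] for t in table}) for i in range(len(quasi_identifier))]
        idx = max(range(len(distinct)), key=distinct.__getitem__)
        step = generalizers[quasi_identifier[idx]]
        generalized = [t[:idx] + (step(t[idx]),) + t[idx + 1:] for t in table]
        if generalized == table:
            raise ValueError("hierarchy exhausted before k-anonymity was reached")
        table = generalized
    # Steps 7.1-8: suppress the remaining outliers and return the rest.
    return [t for t in table if freq[t] >= k]


# Hypothetical private table with quasi-identifier (birth_year, zip).
private_table = [
    {"birth_year": "1964", "zip": "02138"},
    {"birth_year": "1965", "zip": "02138"},
    {"birth_year": "1964", "zip": "02139"},
    {"birth_year": "1967", "zip": "02139"},
    {"birth_year": "1981", "zip": "02141"},
]
generalizers = {"birth_year": mask_last_char, "zip": mask_last_char}
released = datafly(private_table, ["birth_year", "zip"], 2, generalizers)
print(released)
# [('196*', '02138'), ('196*', '02138'), ('196*', '02139'), ('196*', '02139')]
```

In this example one pass generalizes the birth year to its decade, after which only the single 1981 tuple falls below ''k'' = 2 and is suppressed.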

==External links==
* [http://cs.utdallas.edu/dspl/cgi-bin/toolbox/javadoc/datafly/Datafly.html Details of the Datafly algorithm]

==References==
{{reflist}}

[[Category:Privacy]]
[[Category:Anonymity]]
