This blog post examines the principles and techniques of de-identification, which aims to balance data usability with the protection of personal information. It also explains the characteristics and limitations of k-anonymity and l-diversity.
Big data generated across diverse fields like finance, marketing, and healthcare often contains personal information, posing a risk of sensitive data leaks during utilization. Therefore, during big data construction, personal information de-identification techniques are employed. These techniques delete or replace all or part of the personal information to prevent individual identification while maximizing the data’s usability.
The smallest unit representing information in a dataset is called an attribute, and a single piece of information expressed through a combination of various attributes is called a record. A dataset is a collection of these records. De-identification techniques classify attributes into identifiers, quasi-identifiers, general attributes, and sensitive attributes. An identifier is an attribute that can identify an individual on its own, such as a resident registration number. Conversely, a quasi-identifier is an attribute that cannot directly identify an individual on its own, such as gender, age, or address, but when combined with other attributes, enables identification.
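To make the classification concrete, here is a minimal Python sketch of how a dataset and its attribute classification might be represented. The column names, record values, and category labels are purely illustrative assumptions, not a fixed standard.

```python
# Illustrative dataset: each dict is one record, each key is one attribute.
dataset = [
    {"resident_id": "900101-1234567", "gender": "M", "age": 35, "address": "Seoul", "disease": "gastritis"},
    {"resident_id": "880505-2345678", "gender": "F", "age": 29, "address": "Busan", "disease": "asthma"},
]

# Assumed classification of attributes for this example.
attribute_classes = {
    "identifier":       ["resident_id"],               # identifies a person on its own
    "quasi_identifier": ["gender", "age", "address"],  # identifying only in combination
    "sensitive":        ["disease"],                   # private-life information
    "general":          [],                            # everything else
}
```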
Suppose there is an original dataset consisting of gender, name, and age, and we create an anonymized dataset by retaining only the surname from each name. Although the full name is masked, if the dataset contains only one ‘male’ individual, or only one person with the surname ‘Lee’ who is ‘35 years old’, then someone who knows that these two individuals were included in the original dataset and already knows their unique combinations of attribute values can re-identify them. Personal information is generally used as a combination of an individual’s multiple attributes, and even anonymous data can, once attributes are combined, produce new unique combinations of attribute values. The result is an imperfectly de-identified dataset in which specific individuals can still be re-identified.
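This re-identification risk can be sketched in a few lines of Python: any record whose combination of quasi-identifier values appears only once can be singled out by someone who already knows those values. The records below are hypothetical.

```python
from collections import Counter

# Hypothetical anonymized dataset: names reduced to surnames.
anonymized = [
    {"gender": "F", "surname": "Kim",  "age": 35},
    {"gender": "F", "surname": "Kim",  "age": 35},
    {"gender": "M", "surname": "Park", "age": 42},  # the only male -> unique combination
    {"gender": "F", "surname": "Lee",  "age": 35},  # the only 35-year-old Lee -> unique combination
]

combo = lambda r: (r["gender"], r["surname"], r["age"])
counts = Counter(combo(r) for r in anonymized)

# Records whose quasi-identifier combination is unique remain re-identifiable
# to anyone who already knows that combination belongs to a specific person.
re_identifiable = [r for r in anonymized if counts[combo(r)] == 1]
print(re_identifiable)
```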
k-anonymity is a de-identification technique that reduces the probability of inferring a specific individual to 1/k or less. It does so by applying masking, categorization, or similar processes only to the identifiers and quasi-identifiers in the original dataset, so that similar quasi-identifier values become identical. Masking changes ‘Hong Gil-dong’ to ‘Hong**’, while categorization changes ‘35 years old’ to ‘30s’. In the resulting de-identified dataset, a set of records whose quasi-identifier values are all identical is called a homogeneous set, and the number of records in that set is called the size of the homogeneous set. k-anonymity is satisfied by deleting every homogeneous set with fewer than k records, ensuring that all remaining homogeneous sets contain at least k records.
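As a rough illustration rather than a production implementation, the Python sketch below generalizes quasi-identifiers by masking and categorization, groups records into homogeneous sets, and drops every set smaller than k. The field names and generalization rules are assumptions made for this example.

```python
from collections import defaultdict

def generalize(record):
    """Apply masking and categorization to the quasi-identifiers (assumed fields)."""
    return {
        "name":   record["name"].split()[0] + "**",      # masking: 'Hong Gil-dong' -> 'Hong**'
        "age":    f"{(record['age'] // 10) * 10}s",      # categorization: 35 -> '30s'
        "gender": record["gender"],
    }

def k_anonymize(records, k):
    generalized = [generalize(r) for r in records]
    groups = defaultdict(list)
    for r in generalized:
        # all quasi-identifier values together define the homogeneous set
        groups[(r["name"], r["age"], r["gender"])].append(r)
    # keep only homogeneous sets whose size is at least k
    return [r for group in groups.values() if len(group) >= k for r in group]

people = [
    {"name": "Hong Gil-dong",  "gender": "M", "age": 35},
    {"name": "Hong Sang-su",   "gender": "M", "age": 38},
    {"name": "Lee Young-hee",  "gender": "F", "age": 52},
]
print(k_anonymize(people, k=2))  # the lone 'Lee**'/'50s'/'F' set is dropped
```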
When k=2, even someone who knows a specific individual’s quasi-identifiers in the original dataset beforehand cannot re-identify that individual from the anonymized dataset alone. However, individual estimation remains possible: if the homogeneous set containing the target individual has size k, that individual can be estimated to be one of its k members, so individual estimation succeeds with a probability of 1/k.
A drawback of k-anonymity is that if all records within a homogeneous set share the same value for a sensitive attribute (as opposed to a quasi-identifier), that information can be leaked. Sensitive attributes are attributes related to an individual’s private life, such as medical conditions or income. For example, if a homogeneous set contains three records and all three individuals have stomach cancer, then someone who knows that Hong Gil-dong is one of the three can know with certainty that Hong Gil-dong has stomach cancer, even without knowing which of the three records is his. To compensate for this weakness of k-anonymity, l-diversity is additionally applied.
l-diversity requires that the sensitive attribute within each homogeneous set take at least l distinct values; homogeneous sets that fail this condition are removed from the de-identified dataset. In the previous example, the disease attribute of the homogeneous set has only the single value ‘stomach cancer’, so for any l of 2 or more it fails l-diversity and the set is deleted.
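A minimal sketch of the l-diversity check, under the same assumptions as the earlier sketches (plain Python dictionaries and illustrative field names):

```python
from collections import defaultdict

def l_diversify(records, quasi_identifiers, sensitive, l):
    """Keep only homogeneous sets whose sensitive attribute has at least l distinct values."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[q] for q in quasi_identifiers)].append(r)
    kept = []
    for group in groups.values():
        distinct_sensitive = {r[sensitive] for r in group}
        # a set where everyone has 'stomach cancer' has 1 distinct value and fails for l >= 2
        if len(distinct_sensitive) >= l:
            kept.extend(group)
    return kept
```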
While de-identification techniques reduce the possibility of personal identification, they also cause information loss, diminishing the value of the data for those who use the constructed big data. Original similarity is an indicator of the usability of the de-identified dataset, showing how similar it is to the original dataset. It is measured by record retention rate and record similarity. Record retention rate expresses the number of records in the de-identified dataset as a percentage of the number of records in the original dataset. Record similarity, meanwhile, expresses the statistical similarity between an original record and its de-identified counterpart, for each original record that remains in the de-identified dataset, as a value between 0 and 1.
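The retention-rate calculation below follows the description above directly. Because the post does not specify how the 0-to-1 record similarity is computed, the per-record measure shown here is only an assumed placeholder (the fraction of attribute values left unchanged), not the actual metric.

```python
def record_retention_rate(original, deidentified):
    # percentage of original records that survive de-identification
    return len(deidentified) / len(original) * 100

def record_similarity(original_record, deidentified_record):
    # Assumed placeholder measure: share of attributes whose values are unchanged.
    # 1.0 means the record is identical to the original, 0.0 means fully altered.
    attrs = original_record.keys()
    unchanged = sum(original_record[a] == deidentified_record.get(a) for a in attrs)
    return unchanged / len(attrs)
```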