Huawei Innovation Research Project (IRP-2013-12-03); Program of State Key Laboratory of High-End Server & Storage Technology (2014HSSA10); Key Research Project of Jiangxi Science and Technology Normal University (2016XJZD002)
To prevent privacy disclosure, table data generally needs to be anonymized before it is published. Existing anonymization methods seldom distinguish between different types of quasi-identifiers during generalization, and few investigate how to optimize information loss and time efficiency together. In this paper, a greedy clustering-anonymity method is proposed that combines the ideas of greedy algorithms and clustering algorithms. The method generalizes each quasi-identifier according to its type and calculates information loss accordingly, which both reduces information loss and yields a more reasonable estimate of it. Moreover, two distance definitions are put forward, one for the distance between two tuples and one for the distance between a tuple and an equivalence class, in order to achieve minimum information loss when tuples are merged and generalized. When a new cluster is being established, the tuple with the minimum distance to the ongoing cluster is always chosen to be added, which keeps the total information loss close to the minimum. Since the number of tuples used in establishing each cluster is bounded by k and the size of every cluster is equal to or just greater than k, the number of distance calculations, and therefore the running time, is reduced. Experimental results show that the proposed method is effective in reducing both information loss and running time.
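The greedy clustering idea summarized above can be illustrated with a minimal Python sketch. It is not the paper's actual algorithm: the loss formulas, the seed choice (the first remaining tuple), the handling of leftover tuples, and the names `cluster_loss` and `greedy_k_clusters` are all assumptions made for illustration. Numeric quasi-identifiers are assumed to be generalized to a range and categorical ones to a value set, and the distance between a tuple and a cluster is taken to be the loss of the cluster after that tuple is merged in.

```python
from typing import List, Sequence, Tuple, Union

Value = Union[int, float, str]
Record = Tuple[Value, ...]


def cluster_loss(cluster: Sequence[Record], data: Sequence[Record],
                 numeric: Sequence[bool]) -> float:
    """Mean normalized generalization loss over all quasi-identifiers.

    Assumed formulas (not taken from the paper):
      numeric     -> (cluster range) / (domain range)
      categorical -> (distinct values in cluster - 1) / (distinct in domain - 1)
    """
    total = 0.0
    for j, is_num in enumerate(numeric):
        col = [r[j] for r in cluster]
        dom = [r[j] for r in data]
        if is_num:
            span = max(dom) - min(dom)
            total += (max(col) - min(col)) / span if span else 0.0
        else:
            distinct = len(set(dom))
            total += (len(set(col)) - 1) / (distinct - 1) if distinct > 1 else 0.0
    return total / len(numeric)


def greedy_k_clusters(data: Sequence[Record], numeric: Sequence[bool],
                      k: int) -> List[List[int]]:
    """Greedily build clusters of record indices, each of size at least k."""
    if k < 1 or len(data) < k:
        raise ValueError("need 1 <= k <= number of records")

    def merged_loss(members: List[int], i: int) -> float:
        # Tuple-to-cluster distance: loss of the cluster once record i joins.
        return cluster_loss([data[m] for m in members] + [data[i]], data, numeric)

    remaining = list(range(len(data)))
    clusters: List[List[int]] = []
    while len(remaining) >= k:
        cluster = [remaining.pop(0)]  # assumed seed: first remaining record
        while len(cluster) < k:
            # Always add the remaining tuple closest to the ongoing cluster.
            best = min(remaining, key=lambda i: merged_loss(cluster, i))
            remaining.remove(best)
            cluster.append(best)
        clusters.append(cluster)
    # Fewer than k records are left: attach each to its cheapest cluster,
    # so every cluster ends up with size equal to or just greater than k.
    for i in remaining:
        min(clusters, key=lambda c: merged_loss(c, i)).append(i)
    return clusters


if __name__ == "__main__":
    # Toy table with quasi-identifiers (age, group); values are made up.
    records = [(25, "A"), (26, "A"), (60, "B"), (62, "B"), (27, "A"), (61, "B")]
    for members in greedy_k_clusters(records, numeric=(True, False), k=3):
        print(sorted(members), [records[m] for m in members])
```

Because each cluster stops growing as soon as it holds k tuples, only the tuples still unassigned are compared against it, which is the source of the reduced distance computation mentioned above. The sketch makes no claim about how close its result is to the minimum total loss.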