Journal of Computer Science and Technology ›› 2022, Vol. 37 ›› Issue (1): 231-251. doi: 10.1007/s11390-021-1754-5

Topic: Data Management and Data Mining



Correlated Differential Privacy of Multiparty Data Release in Machine Learning

Jian-Zhe Zhao¹ (赵建喆), Xing-Wei Wang²,³,* (王兴伟), Senior Member, CCF, Ke-Ming Mao¹ (毛克明), Chen-Xi Huang¹ (黄辰希), Yu-Kai Su¹ (苏昱恺), and Yu-Chen Li¹ (李宇宸)

  ¹ Software College, Northeastern University, Shenyang 110169, China
  ² State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University, Shenyang 110819, China
  ³ College of Computer Science and Engineering, Northeastern University, Shenyang 110819, China
  • Received:2021-07-01 Revised:2021-11-04 Accepted:2021-11-12 Online:2022-01-28 Published:2022-01-28
  • Contact: Xing-Wei Wang E-mail:wangxw@mail.neu.edu.cn
  • About author:Xing-Wei Wang received his B.E., M.S., and Ph.D. degrees in computer science from Northeastern University, Shenyang, in 1989, 1992, and 1998, respectively. He is currently a professor with the School of Computer Science and Engineering, Northeastern University, Shenyang. His research interests include cloud computing and future Internet. He has published more than 100 journal articles, book chapters, and refereed conference papers.
  • Supported by:
    This work is supported by the National Natural Science Foundation of China under Grant Nos. 62102074 and 62032013, the Liaoning Revitalization Talents Program under Grant No. XLYC1902010, the Natural Science Foundation of Liaoning Province of China under Grant No. 2020-MS-091, and Fundamental Research Funds for the Central Universities of China under Grant No. N2017015.

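Differential privacy is now widely used for privacy-preserving data release in single-party scenarios. Studies show, however, that ubiquitous data correlation introduces additional noise and thereby reduces data utility. Correlated differential privacy raises utility by lowering sensitivity through data correlation analysis. Yet a growing number of multiparty data release applications pose new challenges to existing methods. In this paper, we propose a new multiparty data release method based on correlated differential privacy. The method improves data utility by selecting important features and reducing correlated sensitivity. We also propose a multiparty data correlation analysis method that effectively lowers correlated sensitivity, which reduces the injected noise and improves data utility. In addition, by adding query noise to the released data and noise to the weights of machine learning algorithms, the method provides low-noise, differentially private release of both multiparty data and machine learning algorithms. Comprehensive experiments on real datasets demonstrate the effectiveness and practicality of the method.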
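1. Context:
Balancing privacy and utility, both theoretically and empirically, is a current focus in machine learning and artificial intelligence. In multiparty data scenarios, multidimensional data increase computational complexity, and redundant features lower data utility. However, because data correlation is ubiquitous, dimensionality reduction introduces additional differential privacy noise, which lowers data utility further. Studying multiparty data release based on correlated differential privacy is therefore important for improving data utility in machine learning.
2. Objective:
We focus on the multiparty data release scenario and, taking data correlation into account, study a differentially private multiparty data release method that balances privacy and utility for both machine learning algorithms and query data.
3. Method:
We propose a multiparty data release method based on correlated differential privacy that improves data utility in two steps: feature selection, and reducing correlated sensitivity by relaxing the number of features. Through the design of its privacy mechanism, the method also provides private release of both data and machine learning algorithms. Specifically, we analyze the relationship between the number of features and the correlated sensitivity and design a utility-optimal feature selection method. We also propose a multiparty data correlation analysis method that considers not only the degree of correlation but also uses the prior knowledge available in the multiparty setting to define a more objective and stricter measure of correlated degree, which effectively lowers correlated sensitivity and reduces the injected noise.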
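The link between correlated degree, correlated sensitivity, and injected noise can be sketched as follows. This is a minimal illustration, not the MP-CRDP algorithm itself: it assumes the definition of correlated sensitivity commonly used in the correlated differential privacy literature (for each record, sum the per-record sensitivities weighted by the absolute correlated degree, then take the maximum over records). The names `correlated_sensitivity`, `laplace_noise`, `noisy_count`, and the `threshold` parameter are hypothetical.

```python
import random


def correlated_sensitivity(delta, record_sens, threshold=0.0):
    """Assumed definition: max over records i of sum_j |delta[i][j]| * record_sens[j].

    delta[i][j] is the correlated degree between records i and j, in [-1, 1],
    with delta[i][i] == 1. Degrees whose magnitude is below `threshold`
    (a hypothetical cut-off) are treated as uncorrelated.
    """
    def weight(d):
        return abs(d) if abs(d) >= threshold else 0.0

    return max(
        sum(weight(d_ij) * s_j for d_ij, s_j in zip(row, record_sens))
        for row in delta
    )


def laplace_noise(scale, rng=random):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, delta, epsilon, rng=random):
    # For a count query, changing one record moves the answer by at most 1.
    record_sens = [1.0] * len(delta)
    cs = correlated_sensitivity(delta, record_sens)
    return true_count + laplace_noise(cs / epsilon, rng)


# Three records: the first two are partially correlated, the third is independent.
delta = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
print(correlated_sensitivity(delta, [1.0, 1.0, 1.0]))
print(noisy_count(10, delta, epsilon=1.0, rng=random.Random(0)))
```

Treating all three records as fully correlated would give a sensitivity of 3 for this count query, so a stricter measure of correlated degree translates directly into a smaller Laplace scale.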
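4. Results & Findings:
We extend correlated differential privacy to the multiparty data release scenario and implement MP-CRDP, a utility-optimized multiparty private data release method, and we verify its effectiveness and practicality through extensive experiments. The results show that (1) on commonly used machine learning datasets, MP-CRDP identifies the best feature set and significantly improves data utility; (2) compared with similar methods, the proposed multiparty correlation analysis method effectively lowers the correlated sensitivity of the data, which reduces the injected noise and improves data utility; (3) by injecting noise into the query data and into the weights of machine learning algorithms, the method provides, when a trusted server exists, a release mechanism for both private data and machine learning algorithms.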
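Injecting noise into the weights of a trained model is often done by output perturbation. The sketch below assumes the generic Laplace mechanism for vector outputs, where each weight receives independent Laplace noise scaled by the L1 sensitivity of the weight vector divided by epsilon. The function names and the `l1_sensitivity` value are hypothetical; how MP-CRDP derives the sensitivity of the weights is not stated here.

```python
import random


def laplace(scale, rng=random):
    # Difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def perturb_weights(weights, l1_sensitivity, epsilon, rng=random):
    """Return a noisy copy of `weights` (output perturbation).

    `l1_sensitivity` is assumed to bound how much the weight vector can
    change, in L1 norm, between neighbouring training sets. It must be
    derived for the specific learning algorithm; it is an input here.
    """
    if epsilon <= 0 or l1_sensitivity <= 0:
        raise ValueError("epsilon and l1_sensitivity must be positive")
    scale = l1_sensitivity / epsilon
    return [w + laplace(scale, rng) for w in weights]


# Hypothetical weights of a trained linear classifier.
trained = [0.5, -1.2, 3.0]
released = perturb_weights(trained, l1_sensitivity=0.1, epsilon=1.0,
                           rng=random.Random(1))
print(released)
```

Only the perturbed weights are released; a smaller sensitivity or a larger epsilon keeps them closer to the trained values.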
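5. Conclusions:
(1) The dimensionality of multiparty data is related to data utility. The proposed method analyzes how model accuracy changes with dimensionality and thereby determines the best feature set to release, improving machine learning utility. (2) Correlation analysis in the multiparty release scenario should consider not only the degree of correlation; prior knowledge of the multiparty data also yields a more objective and stricter measure, and the proposed multiparty correlation analysis method effectively lowers correlated sensitivity. (3) Comprehensive experiments on commonly used machine learning datasets verify the effectiveness of the method for releasing query data and machine learning algorithms. This work assumes a trusted server; future work will consider data correlation in federated learning scenarios.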
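Choosing the feature set by tracking how utility changes with dimensionality can be sketched as a simple search over prefix lengths of a ranked feature list. This is an illustrative assumption, not the paper's selection procedure: `select_feature_set`, `toy_utility`, and the feature names are hypothetical, and the toy score merely mimics accuracy that saturates while the noise penalty grows with the number of features.

```python
def select_feature_set(ranked_features, evaluate):
    """Return the prefix of `ranked_features` with the highest utility score.

    `ranked_features` is assumed to be sorted from most to least important.
    `evaluate` maps a list of features to a utility score, for example the
    accuracy of a model trained on the noisy data restricted to them.
    """
    best_k, best_score = 0, float("-inf")
    for k in range(1, len(ranked_features) + 1):
        score = evaluate(ranked_features[:k])
        if score > best_score:
            best_k, best_score = k, score
    return ranked_features[:best_k], best_score


def toy_utility(features):
    # Hypothetical stand-in for a real evaluation: the accuracy gain
    # saturates with more features, the noise penalty grows linearly.
    k = len(features)
    return 1.0 - 0.5 ** k - 0.05 * k


features = ["f1", "f2", "f3", "f4", "f5", "f6"]
selected, score = select_feature_set(features, toy_utility)
print(selected, score)
```

In practice `evaluate` would train and score a model on the differentially private data, so the search cost grows with the number of candidate feature counts.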


Abstract: Differential privacy (DP) is widely employed for private data release in the single-party scenario. Ubiquitous data correlation, however, requires additional noise and thus degrades data utility; this is often addressed by reducing sensitivity through correlation analysis. Increasing multiparty data release applications present new challenges for existing methods. In this paper, we propose MP-CRDP, a novel correlated differential privacy method for multiparty data release. It reduces the merged dataset's dimensionality and correlated sensitivity in two steps to optimize utility. We also propose a multiparty correlation analysis technique. Based on prior knowledge of the multiparty data, a more reasonable and rigorous standard is designed to measure the correlated degree, which reduces the correlated sensitivity and thus improves data utility. Moreover, by adding noise to the weights of machine learning algorithms and query noise to the released data, MP-CRDP provides release techniques for both low-noise private data and private machine learning algorithms. Comprehensive experiments on the Adult and Breast Cancer datasets demonstrate the effectiveness and practicability of the proposed method.
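For orientation, the standard definitions that the abstract relies on can be written as below. The notation ($\mathcal{M}$ for the mechanism, $D$ and $D'$ for neighbouring datasets, $Q$ for the query, $\Delta Q$ for its sensitivity) is assumed here for illustration and may differ from the paper's own.

```latex
% epsilon-differential privacy: for all neighbouring datasets D, D'
% and every set S of possible outputs
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\epsilon} \, \Pr[\mathcal{M}(D') \in S]

% Laplace mechanism for a numeric query Q with sensitivity \Delta Q
\mathcal{M}(D) = Q(D) + \mathrm{Lap}\!\left(\frac{\Delta Q}{\epsilon}\right),
\qquad
\mathrm{Var}\!\left[\mathrm{Lap}(b)\right] = 2b^{2}
```

Because the noise scale is $\Delta Q / \epsilon$, any reduction in sensitivity at a fixed $\epsilon$ lowers the noise variance quadratically, which is why sensitivity reduction is the lever for utility.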

Key words: correlated differential privacy, multiparty data release, machine learning
