Data-Centric Security Research Actions
This section discusses the research actions needed to mitigate the threats, gaps, and challenges identified and reported in Appendix A.4 of document D4.3.
- RA4.1 – Decentralized and blockchain-based solutions. Although centralized storage remains popular among some large companies, it exposes data to single points of failure and data breaches. To mitigate these issues, blockchain solutions can be used to move data from big data silos to distributed data storage. Combining blockchain with big data can ensure the trustworthiness and integrity of generated data and, because the origin of the data is known, reduce the likelihood of interference. This follows from data immutability, which is enabled by blockchain’s consensus mechanism and secure hash functions [1]. There have recently been several endeavours in this area of big data security research. Yue et al. [2] developed a credible platform based on blockchain and smart contracts for data sharing between data producers and customers. The authors used blockchain to ensure data traceability and transparency, and smart contracts to ensure security while sharing data. Similarly, Xia et al. [3] proposed an auditing platform for controlling shared medical data in cloud repositories. The proposed platform enables data transfer between different sources in a tamper-resistant fashion. Uchibeke et al. [4] developed a blockchain access control ecosystem for managing access to big data and safeguarding it against data breaches, while also ensuring data auditability, transparency, and owner self-sovereignty. The platform is loosely based on Identity-Based Access Control (IBAC) and Role-Based Access Control (RBAC), and each access control implementation features request, grant, revoke, verify, and view asset operations. Although the number of research works on blockchain data security keeps increasing, there are still open challenges in decentralized and context-aware data warehousing that have to be solved.
Threats: T4.1.2 – Inadequate design and planning or incorrect adaptation, T4.2.2 – Unauthorized acquisition of information (data breach), T4.2.3 – Conversation Eavesdropping/Hijacking – COVID19, T4.4.2 – Denial of service, T4.4.6 – Failures of business processes, T4.5.1 – Violation of laws or regulations, T4.6.1 – Skill shortage
Gaps: G4.3 – Gaps on computing and storage models and infrastructures, G4.5 – Gaps on data trustworthiness, G4.6 – Gaps on decision support systems, G4.8 – Gaps on videoconferencing tools, G4.10 – Gaps on the distributed data and frameworks
- RA4.2 – Access control and data encryption. Security issues may emerge during the transmission of big data to the cloud. To prevent data from ending up in the wrong hands, encryption and access control techniques arise as possible solutions. Moreover, data that has to be decrypted for transmission is exposed to security vulnerabilities. One of the common solutions involves data masking schemes. Several works proposed data encryption schemes based on Fully Homomorphic Encryption (FHE) [5][6][7]. Although these schemes encrypt data before transmission, they were limited to numerical data. Other popular recent research efforts on data encryption improve Attribute-Based Encryption (ABE) [8] and format-preserving encryption techniques, or develop novel lightweight schemes such as Light-weight Encryption using Scalable Sketching (LESS) [9], which aims to optimize big data processing while encrypting the data. There has also been recent research on access control and privacy of big data. Gupta et al. [10] proposed a big data compliance system for ensuring secure real-time big data analysis, relying on its web directory and a self-assurance framework for identifying genuine users. The framework proposed by Al-Shomrani et al. [11] uses a security policy manager, a fragmentation approach, an encryption approach, and a security manager to analyze and secure sensitive data received from customers, while the work of Lee et al. [12] protects the confidentiality and integrity of patients’ private data through digital signature encryption and a Diffie-Hellman session key.
Even though not as popular as data encryption, access control solutions remain important in protecting big data security.
Threats: T4.1.1 – Information leakage/sharing due to human errors, T4.1.3 – Information leakage/sharing due to the hostile home network – COVID19, T4.2.1 – Interception of information, T4.2.2 – Unauthorized acquisition of information (data breach), T4.2.3 – Conversation Eavesdropping/Hijacking – COVID19, T4.4.1 – Identity fraud
Gaps: G4.1 – Gaps on data protection, G4.2 – Gaps on the use of cryptography in applications and back-end data-intensive services, G4.8 – Gaps on videoconferencing tools, G4.9 – Gaps on data management across borders, G4.11 – Gaps on the use of non-relational databases
- RA4.3 – ML/AI-based solutions. Unsupervised learning and deep learning algorithms such as clustering, linear regression, and neural networks have been successfully used for malware and intrusion detection. However, there are still challenges related to these techniques that have to be resolved when it comes to protecting big data. One such challenge is adaptability, which attackers can exploit to trick the ML model into producing a different result. Research efforts so far have focused on feature squeezing, which reduces the search space available to attackers by merging samples that correspond to multiple feature vectors into a single one [13][14]. Similar issues are found in AI solutions, hence organizations and end-users should not consider ML or AI as the sole means of defending against malware. The rise of Generative Adversarial Networks calls for combining both humans and AI in malware detection. In some other cases, data has to be protected from the people who work with it. Such situations require the complete removal of human intervention and the introduction of automation. One such solution was provided by Pissanetzky [15], who proposed a causal set as the universal language for all information for ML and computer intelligence applications. Furthermore, this paper also includes experiments and computational verifications of the theory and proposes applications of this approach to science and technology, computer intelligence, and machine learning.
Threats: T4.2.1 – Interception of information, T4.2.2 – Unauthorized acquisition of information (data breach), T4.2.3 – Conversation Eavesdropping/Hijacking – COVID19, T4.3.1 – Data poisoning, T4.3.2 – Model poisoning, T4.4.3 – Malicious code/software/activity, T4.4.5 – Misuse of assurance tools, T4.4.6 – Failures of business processes, T4.4.7 – Code execution and injection (unsecured APIs)
Gaps: G4.1 – Gaps on data protection, G4.2 – Gaps on the use of cryptography in applications and back-end data-intensive services, G4.5 – Gaps on data trustworthiness, G4.6 – Gaps on decision support systems, G4.7 – Gaps on ethics
- RA4.4 – Self-destructing data. The sheer number of recent data breaches has led to regulations such as the Breach of Security Safeguards Regulations and the GDPR, which provides the right to be forgotten. This right enables end-users to enforce the deletion of information related to them. To resolve data privacy issues, it is expected that future research will focus on self-destructing data solutions. One such research effort has already been conducted by Geambasu et al. [16]. The authors proposed an architecture in which all copies of old private data become unreadable after a user-specified time and cannot resurface. More research on this topic is expected in the near future, but it will have to deal with big data regulation challenges and policies [1].
Threats: T4.1.1 – Information leakage/sharing due to human errors, T4.1.3 – Information leakage/sharing due to the hostile home network – COVID19, T4.2.2 – Unauthorized acquisition of information (data breach), T4.3.3 – Unreliable data, T4.4.3 – Malicious code/software/activity, T4.6.2 – Malicious insider
Gaps: G4.1 – Gaps on data protection, G4.7 – Gaps on ethics
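The self-destructing data idea in RA4.4 can be illustrated with a minimal Python sketch. It assumes, following the general design of Vanish [16], that data is encrypted under a random key, that the key is split into shares with threshold secret sharing, and that the shares are kept in storage that forgets entries after a time-to-live. The names ExpiringStore, split_secret, recover_secret, and fetch_key are hypothetical, the in-memory store stands in for a distributed one, and the data encryption step itself is omitted; this is an illustration under these assumptions, not the authors' implementation.

```python
import secrets
import time

# Field for share arithmetic: 2**127 - 1 is a Mersenne prime (assumed key size < 127 bits).
PRIME = 2**127 - 1


def split_secret(secret, n, k):
    """Split an integer secret (< PRIME) into n shares; any k of them recover it."""
    # Random polynomial of degree k-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(k - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation of the polynomial at x
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def recover_secret(shares):
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


class ExpiringStore:
    """Hypothetical stand-in for storage that drops entries after a TTL."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._items = {}

    def put(self, key, value):
        self._items[key] = (value, self._clock() + self._ttl)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired: the share is gone for good
            return None
        return value


def fetch_key(store, n, k):
    """Return the key if at least k shares survive, otherwise None."""
    shares = []
    for x in range(1, n + 1):
        y = store.get(x)
        if y is not None:
            shares.append((x, y))
    if len(shares) < k:
        return None
    return recover_secret(shares[:k])


if __name__ == "__main__":
    now = [0.0]  # fake clock so the demo does not have to sleep
    store = ExpiringStore(ttl_seconds=60, clock=lambda: now[0])
    key = secrets.randbelow(PRIME)
    for x, y in split_secret(key, n=5, k=3):
        store.put(x, y)
    print("before expiry:", fetch_key(store, n=5, k=3) == key)
    now[0] = 61.0
    print("after expiry:", fetch_key(store, n=5, k=3))
```

Once fewer than k shares survive, the key cannot be rebuilt, so every copy of the ciphertext becomes unreadable without anyone having to locate and delete it.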
Highlights on Identified Research Actions
There are four main areas on which future data cybersecurity research actions should focus, namely decentralized and blockchain-based solutions, access control and data encryption, ML/AI-based solutions, and self-destructing data. Blockchain solutions can be used to move data from big data warehouses to distributed storage, thus mitigating the risks of data breaches and single points of failure. Moreover, blockchain’s immutability property can provide trustworthiness, auditability, transparency, and integrity of big data. Encryption and access control remain powerful solutions for ensuring security during big data transmission to the cloud. Popular recent data encryption solutions include data masking schemes, Fully Homomorphic Encryption (FHE), lightweight cryptography variations, and improvements of techniques such as Attribute-Based Encryption (ABE). Unsupervised learning and deep learning algorithms have been used fairly successfully for malware and intrusion detection during the past few years. Current research efforts in this area focus mostly on feature squeezing to reduce the search space available to potential adversaries. One of the main challenges of ML solutions lies in adaptability, through which adversaries can trick the ML model into producing wrong results. A large number of recent data breaches have inspired the establishment of regulations that enable end-users to enforce information deletion. Consequently, future research should focus on developing reliable self-destructing data solutions with privacy in mind. Finally, increasing the robustness of ML models at both training and inference time is fundamental to strengthening modern distributed systems against training poisoning and adversarial attacks.
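The immutability property attributed to blockchain above rests on hash chaining: each record stores the hash of its predecessor, so changing any earlier record invalidates every later one. The following Python sketch illustrates only this integrity check. The names GENESIS, record_hash, append_record, and verify_chain are hypothetical, and the sketch deliberately leaves out consensus, networking, and smart contracts, which a real blockchain platform would add.

```python
import hashlib
import json

# Placeholder "previous hash" for the first record (an assumption of this sketch).
GENESIS = "0" * 64


def record_hash(prev_hash, payload):
    """SHA-256 over the previous hash and the payload, serialized deterministically."""
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_record(chain, payload):
    """Append a record that commits to the hash of the record before it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({
        "prev": prev_hash,
        "payload": payload,
        "hash": record_hash(prev_hash, payload),
    })


def verify_chain(chain):
    """Return True only if every link and every stored hash is consistent."""
    prev_hash = GENESIS
    for record in chain:
        if record["prev"] != prev_hash:
            return False
        if record["hash"] != record_hash(record["prev"], record["payload"]):
            return False
        prev_hash = record["hash"]
    return True


if __name__ == "__main__":
    chain = []
    append_record(chain, {"owner": "producer-1", "dataset": "sensor-batch-7"})
    append_record(chain, {"owner": "producer-2", "dataset": "sensor-batch-8"})
    print("untouched chain valid:", verify_chain(chain))
    chain[0]["payload"]["owner"] = "attacker"  # tamper with an early record
    print("tampered chain valid:", verify_chain(chain))
```

In a deployed system the chain is replicated across participants and extended through a consensus mechanism, which is what prevents an attacker from simply recomputing all hashes after a modification.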
[1] D. B. Rawat, R. Doku and M. Garuba, “Cybersecurity in big data era: From securing big data to data-driven security,” IEEE Transactions on Services Computing, 2019.
[2] L. Yue, H. Junqin, Q. Shengzhi and W. Ruijin, “Big data model of security sharing based on blockchain,” 2017 3rd International Conference on Big Data Computing and Communications (BIGCOM), pp. 117-121, 2017.
[3] Q. Xia, E. B. Sifah, K. O. Asamoah, J. Gao, X. Du and M. Guizani, “MeDShare: Trust-less medical data sharing among cloud service providers via blockchain,” IEEE Access, vol. 5, pp. 14757-14767, 2017.
[4] U. U. Uchibeke, K. A. Schneider, S. H. Kassani and R. Deters, “Blockchain access control Ecosystem for Big Data security,” 2018 IEEE International Conference on Internet of Things (iThings) and IEEE Green Computing and Communications (GreenCom) and IEEE Cyber, Physical and Social Computing (CPSCom) and IEEE Smart Data (SmartData), pp. 1373-1378, 2018.
[5] J. Kepner, V. Gadepally, P. Michaleas, N. Schear, M. Varia, A. Yerukhimovich and R. K. Cunningham, “Computing on masked data: a high performance method for improving big data veracity,” 2014 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1-6, 2014.
[6] D. Wang, B. Guo, Y. Shen, S.-J. Cheng and Y.-H. Lin, “A faster fully homomorphic encryption scheme in big data,” 2017 IEEE 2nd International Conference on Big Data Analysis (ICBDA), pp. 345-349, 2017.
[7] T. B. Patil, G. K. Patnaik and A. T. Bhole, “Big data privacy using fully homomorphic non-deterministic encryption,” 2017 IEEE 7th International Advance Computing Conference (IACC), pp. 138-143, 2017.
[8] T. Yang, P. Shen, X. Tian and C. Chen, “A Fine-Grained Access Control Scheme for Big Data Based on Classification Attributes,” 2017 IEEE 37th International Conference on Distributed Computing Systems Workshops (ICDCSW), pp. 238-245, 2017.
[9] A. Kulkarni, C. Shea, H. Homayoun and T. Mohsenin, “Less: Big data sketching and encryption on low power platform,” Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017, pp. 1631-1634, 2017.
[10] A. Gupta, A. Verma, P. Kalra and L. Kumar, “Big Data: A security compliance model,” 2014 Conference on IT in Business, Industry and Government (CSIBIG), pp. 1-5, 2014.
[11] A. Al-Shomrani, F. Fathy and K. Jambi, “Policy enforcement for big data security,” 2017 2nd international conference on anti-cyber crimes (icacc), pp. 70-74, 2017.
[12] N.-Y. Lee and B.-H. Wu, “Privacy protection technology and access control mechanism for medical big data,” 2017 6th IIAI International Congress on Advanced Applied Informatics (IIAI-AAI), pp. 424-429, 2017.
[13] N. Papernot, P. McDaniel, A. Sinha and M. Wellman, “Towards the science of security and privacy in machine learning,” arXiv preprint arXiv:1611.03814, 2016.
[14] W. Xu, D. Evans and Y. Qi, “Feature squeezing: Detecting adversarial examples in deep neural networks,” arXiv preprint arXiv:1704.01155, 2017.
[15] S. Pissanetzky, “On the future of information: Reunification, computability, adaptation, cybersecurity, semantics,” IEEE Access, vol. 4, pp. 1117-1140, 2016.
[16] R. Geambasu, T. Kohno, A. A. Levy and H. M. Levy, “Vanish: Increasing Data Privacy with Self-Destructing Data,” in USENIX security symposium, vol. 316, 2009.