Data-Centric Security

Security Threat Landscape

The ability to share, manage, distribute, and access data quickly and remotely is at the basis of the digital revolution that started several decades ago. The role of data in today’s technology is even more important now that we have entered the so-called data-driven economy. Data management and the inferences drawn from data are fundamental for any enterprise, from micro to large, to create value and compete in the global market, and have taken over the central role once held by means of communication. The data domain has seen important changes at all layers of the IT chain: i) data layer: from data to Big Data, ii) database layer: from SQL to NoSQL, iii) platform layer: from the data warehouse and DBMS to Big Data platforms, iv) analytics layer: from data mining to machine learning and artificial intelligence. For instance, data mining focuses on discovering unknown patterns and relationships in large data sets. Machine learning aims to discover patterns in data by learning pattern parameters directly from the data; it relies on a training step rather than on an algorithm explicitly programmed to handle such patterns, and it builds and maintains a model of system behavior. Artificial intelligence mimics human intelligence and tries to reason on data to produce new knowledge. In this context, Big Data has recently become a major trend attracting academia, research institutions, and industry alike. According to IDC,[1] revenues for Big Data and business analytics will reach $260 billion in 2020, at a CAGR of 11.9% over the 2017-19 forecast period. Today’s pervasive and interconnected world, where billions of resource-constrained devices are connected and people are put at the center of a continuous sensing process, generates and collects an enormous amount of data (estimated at 2.5 quintillion bytes each day[2]). The Big Data revolution fosters the so-called data-driven ecosystem, where better decisions are supported by enhanced analytics and data management.
Big Data is not only characterized by the huge amount of data; it also points to scenarios where data are diverse, arrive at high rates, and must be proven trustworthy, as clarified by the 5V storyline[3]. Big Data is defined according to the 5Vs: i) Volume (huge amount of data), ii) Velocity (high speed of data in and out), iii) Variety (wide range of data types and sources), iv) Veracity (data authenticity, since the quality of captured data can vary greatly and an accurate analysis depends on the veracity of the data source), and v) Value (the potential revenue of Big Data). Big Data has been defined in different ways, from Gartner’s “Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.” to the McKinsey Global Institute’s “Big Data as data sets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.”
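
To put the Volume and Velocity dimensions in perspective, the daily estimate quoted above can be converted into an average sustained rate. The following is only a back-of-the-envelope sketch: it assumes decimal (SI) units and a uniform rate over the day, and the function and variable names are illustrative.

```python
# Back-of-the-envelope conversion of the daily data volume estimate.
BYTES_PER_DAY = 2.5e18          # 2.5 quintillion bytes, the estimate cited in the text
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def daily_volume_to_rate(bytes_per_day: float) -> float:
    """Return the average sustained rate in bytes per second."""
    return bytes_per_day / SECONDS_PER_DAY


rate = daily_volume_to_rate(BYTES_PER_DAY)
exabytes_per_day = BYTES_PER_DAY / 1e18   # decimal (SI) exabytes
terabytes_per_second = rate / 1e12        # decimal (SI) terabytes

print(f"{exabytes_per_day:.1f} EB/day ~ {terabytes_per_second:.1f} TB/s on average")
```

Under these assumptions the estimate corresponds to roughly 28.9 TB of new data per second, averaged over the day.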

Assets

According to ENISA’s Guideline on Threats and Assets, published in the context of ENISA’s Security framework for Article 4 and 13a proposal, an asset is defined as “anything of value. Assets can be abstract assets (like processes or reputation), virtual assets (for instance, data), physical assets (cables, a piece of equipment), human resources, money”. An item of our taxonomy either describes data itself or describes assets that generate, process, store, or transmit data chunks and, as such, is exposed to cyber-security threats. In addition to the ENISA Big Data Threat Landscape,[4] a major source of information for this study is the work undertaken by the NIST Big Data Public Working Group (NBD-PWG), which resulted in two draft volumes (Volume 1 on Definitions and Volume 2 on Taxonomy). Another source of information is the report “Big Data Taxonomy”, issued by the Cloud Security Alliance (CSA) Big Data Working Group in September 2014, which introduces a six-dimensional taxonomy for Big Data built around the nature of the data.

Assets can be categorized into 5 different classes as follows:

  • Data – It is the core class and includes all types of data, from metadata to structured, semi-structured, and unstructured data, as well as data streams.
  • Infrastructure – It comprises software, hardware resources denoting both physical and virtualized devices, computing infrastructure with batch and streaming processes, and storage infrastructure with various database management systems.
  • Big Data analytics – It includes protocols and algorithms for Big Data analysis, as well as all processing algorithms for data routing and parallelization. It points to the design and implementation of procedures, models, and algorithms, as well as to analytics results.
  • Security and privacy techniques – It refers to all security techniques that are a target for an attacker. These are the components that, if compromised, would result in unauthorized data disclosure and leakage. Examples are security best practice documents, cryptography algorithms and methods, information about the access control model used, and the like.
  • Roles – Introduced by the NIST Big Data Public Working Group, it includes human resources and related assets.

Data assets can be summarized as follows:

  • Metadata – Schemas, indexes, data dictionaries, and stream grammar data.
  • Structured data – Traditional structured data in database records defined according to a data model, for instance a relational or hierarchical schema; structured identification data, for example users’ profiles and preferences; linked open data; inferences and re-linking data structured according to standard formats.
  • Semi-structured and unstructured data – It includes logs, messages, and web (un)formatted data (Web and Wiki pages, e-mail messages, SMSs, tweets, posts, blogs, etc.), files and documents (e.g. PDF files and Office suite data in Repositories and File Servers), multimedia data (photos, videos, maps, etc.), and other non-textual material besides multi-media (medical data, bio-science data, and raw satellite data before radiometric/geometric processing, etc.).
  • Streaming data – Single-medium streaming (for example in-motion sensor data) and multimedia streaming (remote sensing data streams, etc.).
  • Volatile data – Data that is either in motion or temporarily stored. For instance, network routing data or data stored in the device RAM.
  • Variable data – Permanent data instances, which may change over time.

Infrastructure assets can be summarized as follows:

  • Software – It includes operating systems, device drivers, firmware, server-side software packages, and applications. Applications include software-as-a-service and functionalities that utilize other assets to fulfill a defined task, such as for example asset management tools, requirements gathering applications, billing services, and tools to monitor performances and SLAs.
  • Hardware (physical and virtual) – Servers (including physical devices and hardware nodes, virtualized systems and virtual data centers with management consoles, virtual machine monitors, and virtual machines), clients, network devices, media and storage devices, smart devices, Human Interface Devices (HID), and mobile devices.
  • Computing infrastructure models – It includes architectures, models, and paradigms for data processing. It refers to batch processing (for example, MapReduce), real-time/near real-time processing of streaming data (for example, Sketch or Hash-based models), and unified approaches supporting both (for example, Cloud Dataflow).
  • Storage infrastructure models – It includes architectures, models, and paradigms for storage.
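
The difference between the batch and streaming computing models mentioned above can be illustrated with a toy example. The sketch below is not taken from any Big Data platform: it imitates the MapReduce phases in a single process and uses a Count-Min Sketch as one well-known instance of a hash-based streaming summary. All function and class names, the table sizes, and the sample documents are assumptions chosen for illustration.

```python
from collections import defaultdict
import hashlib


# --- Batch model: a single-process imitation of MapReduce word counting ---
def map_phase(document):
    """Emit (word, 1) pairs for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def batch_word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle(pairs))


# --- Streaming model: Count-Min Sketch, a hash-based frequency summary ---
class CountMinSketch:
    """Fixed-size counter table; estimates never undercount but may overcount."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, derived by salting the item with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))


documents = ["big data needs big platforms", "data streams never stop"]
exact = batch_word_count(documents)   # needs the whole data set at once

sketch = CountMinSketch()
for doc in documents:                 # items are consumed one at a time, never stored
    for word in doc.split():
        sketch.add(word.lower())

print(exact["big"], exact["data"])
print(sketch.estimate("big") >= exact["big"])
```

The batch version produces exact counts but must hold all input, whereas the sketch uses constant memory and returns an upper-bound estimate, which is the usual trade-off between the two models.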

Analytics assets can be summarized as follows:

  • Data analytics algorithms and procedures – It includes algorithm source code with all parameters, configurations, thresholds, metrics, and models. It also includes advanced techniques that streamline the data preparation stage of the analytical process.
  • Analytical results – It covers the results of an analytics process, in textual or graphical form.

Security and privacy techniques assets can be summarized as follows:

  • Infrastructure security – It considers the security of the distributed computation systems and the data stores, including security best practices and policy set-ups. It also focuses on the new IaaS paradigm in the cloud.
  • Data management – It considers all documents and techniques presenting the approaches implemented for maintaining and protecting Data Storage and Logs, and documentation relating to granular audits and the tracing of data through its life cycle (Data provenance).
  • Integrity and reactive security – It considers endpoint security focusing on implemented practices, techniques, and documents. It focuses on solutions for security validation and filtering, as well as real-time security monitoring, including incident handling and information forensics.
  • Data privacy – It includes all techniques for data protection (e.g., encryption, signature, access control) as mandated by law.
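
As one possible illustration of the granular audits and data provenance mentioned under Data management, the sketch below keeps a tamper-evident log in which every life-cycle event is chained to the previous one by a SHA-256 hash. It is a minimal example under stated assumptions, not a technique prescribed by the sources cited here: the record fields, function names, and sample events are hypothetical.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(prev_hash, record):
    """Hash the previous link together with a canonical JSON form of the record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_event(log, actor, action, dataset):
    """Append one life-cycle event and chain it to the previous entry."""
    record = {"actor": actor, "action": action, "dataset": dataset}
    prev_hash = log[-1]["hash"] if log else GENESIS_HASH
    log.append({"record": record, "prev": prev_hash,
                "hash": _entry_hash(prev_hash, record)})


def verify(log):
    """Return True only if every entry still matches its recorded hash chain."""
    prev_hash = GENESIS_HASH
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True


# Hypothetical life cycle of a data set.
log = []
append_event(log, "sensor-gateway", "ingest", "readings-raw")
append_event(log, "etl-job", "transform", "readings-clean")
append_event(log, "analyst", "read", "readings-clean")
print(verify(log))   # chain is intact

log[1]["record"]["action"] = "delete"   # simulate tampering with a past event
print(verify(log))   # the modified entry no longer matches its hash
```

A plain hash chain only detects edits by someone who cannot rewrite the whole log; in practice the latest hash would also need to be signed or stored out of the attacker's reach.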

Roles assets can be summarized as follows:

  • Data provider – Enterprises, organizations, public agencies, academia, network operators and end-users providing data to data consumers.
  • Data consumer – Enterprises, organizations, public agencies, academia, and end-users, consuming data produced by data providers.
  • Operational roles – System orchestrators (business leaders, data scientists, architects, etc.), Big Data application providers (application and platform specialists), Big Data framework providers (Cloud provider personnel), security and privacy specialists, and technical management (in-house staff, etc.).

We note that most of the categories and sub-categories listed above remain relevant for generic data, of which Big Data is just a specialization. For example, relational databases are a common resource in every enterprise infrastructure and do not necessarily store large data volumes. Even when relational databases hold large data volumes, they are often manageable through traditional hardware clusters, appliances, and software tools.
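
For readers who want to use the taxonomy in an asset inventory, the five classes and their sub-categories can be captured in a small lookup structure. The following is a sketch only: the class and sub-category labels are copied from the lists in this section, while the Python names and the sample inventory entries are invented for illustration.

```python
from enum import Enum


class AssetClass(Enum):
    DATA = "Data"
    INFRASTRUCTURE = "Infrastructure"
    ANALYTICS = "Big Data analytics"
    SECURITY_PRIVACY = "Security and privacy techniques"
    ROLES = "Roles"


# Sub-categories exactly as listed in this section.
TAXONOMY = {
    AssetClass.DATA: [
        "Metadata", "Structured data", "Semi-structured and unstructured data",
        "Streaming data", "Volatile data", "Variable data",
    ],
    AssetClass.INFRASTRUCTURE: [
        "Software", "Hardware (physical and virtual)",
        "Computing infrastructure models", "Storage infrastructure models",
    ],
    AssetClass.ANALYTICS: [
        "Data analytics algorithms and procedures", "Analytical results",
    ],
    AssetClass.SECURITY_PRIVACY: [
        "Infrastructure security", "Data management",
        "Integrity and reactive security", "Data privacy",
    ],
    AssetClass.ROLES: [
        "Data provider", "Data consumer", "Operational roles",
    ],
}


def classify(sub_category):
    """Return the asset class a sub-category belongs to, or None if unknown."""
    for asset_class, sub_categories in TAXONOMY.items():
        if sub_category in sub_categories:
            return asset_class
    return None


# Hypothetical inventory: the asset names are invented for illustration.
inventory = {
    "customer-db schema": "Metadata",
    "hadoop-worker-07": "Hardware (physical and virtual)",
    "churn-model thresholds": "Data analytics algorithms and procedures",
}

for asset, sub_category in inventory.items():
    print(f"{asset}: {classify(sub_category).value} / {sub_category}")
```

Tagging each inventory entry with a sub-category rather than a class keeps the finer granularity and still lets the class be derived when threats are mapped to assets.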


[1] IDC, Worldwide Semiannual Big Data and Analytics Spending Guide, 2018, https://www.idc.com/getdoc.jsp?containerId=prUS44215218

[2] Bernard Marr, How Much Data Do We Create Every Day? The Mind-Blowing Stats Everyone Should Read, May 2018, https://www.forbes.com/sites/bernardmarr/2018/05/21/how-much-data-do-we-create-every-day-the-mind-blowing-stats-everyone-should-read/#363f1fcf60ba

[3] Y. Demchenko, P. Membrey, P. Grosso, and C. de Laat, "Addressing Big Data Issues in Scientific Data Infrastructure," in Proc. of CTS 2013, San Diego, CA, USA, May 2013.

[4] Baseline Security Recommendations for IoT,