The ability to share, manage, distribute, and access data quickly and remotely is at the basis of the digital revolution that started several decades ago. The role of data in today’s technology is even more important now that we have entered the so-called data-driven economy. Data management and the inferences drawn from data are fundamental for any enterprise, from micro to large, to create value and compete in the global market, and have taken over the central role once held by communication means. The data domain has observed important changes at all layers of the IT chain: i) data layer: from data to Big Data; ii) database layer: from SQL to NoSQL; iii) platform layer: from data warehouses and DBMSs to Big Data platforms; iv) analytics layer: from data mining to machine learning and artificial intelligence. For instance, data mining focuses on discovering unknown patterns and relationships in large data sets. Machine learning aims to discover patterns in data by learning the parameters of those patterns directly from the data: it relies on a training step, rather than being explicitly programmed to handle such patterns, and it builds and maintains a model of system behavior (a minimal sketch follows this introduction). Artificial intelligence mimics human intelligence and tries to reason on data to produce new knowledge.

In this context, Big Data has recently become a major trend, attracting academia, research institutions, and industry alike. According to IDC,[1] “revenues for Big Data and business analytics will reach $260 billion in 2020, at a CAGR of 11.9% over the 2017-19 forecast period”. Today’s pervasive and interconnected world, where billions of resource-constrained devices are connected and people are put at the center of a continuous sensing process, results in an enormous amount of generated and collected data (estimated at 2.5 quintillion bytes each day[2]). The Big Data revolution fosters the so-called data-driven ecosystem, where better decisions are supported by enhanced analytics and data management.

Big Data is not characterized by the huge amount of data alone; it points to scenarios where data are diverse, arrive at high rates, and must be proven trustworthy, as clarified by the 5V storyline.[3] The 5Vs are: i) Volume (the huge amount of data); ii) Velocity (the high speed of data in and out); iii) Variety (the wide range of data types and sources); iv) Veracity (data authenticity, since the quality of captured data can vary greatly and an accurate analysis depends on the veracity of the data source); and v) Value (the potential revenue of Big Data). Big Data has been defined in different ways, from Gartner’s “Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization” to the McKinsey Global Institute’s “Big Data as data sets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.”
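To make concrete what it means for a machine-learning algorithm to learn pattern parameters from data rather than being programmed with them, the following minimal Python sketch fits a linear pattern to synthetic observations during a training step. The synthetic data, the linear-model choice, and the use of NumPy are illustrative assumptions, not part of this report.

```python
# Minimal sketch of the machine-learning idea above: the pattern's
# parameters (slope and intercept of a linear trend) are not programmed
# by hand but estimated from data during a training step.
# Assumption: the data here are synthetic, for illustration only.
import numpy as np

rng = np.random.default_rng(42)

# Synthetic observations following an (unknown to the learner) linear
# pattern plus noise.
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + 1.5 + rng.normal(0, 0.5, size=200)

# Training step: learn the pattern's parameters via least squares.
A = np.column_stack([x, np.ones_like(x)])
slope, intercept = np.linalg.lstsq(A, y, rcond=None)[0]

# The learned model of system behavior can now make predictions.
print(f"learned slope={slope:.2f}, intercept={intercept:.2f}")
print(f"prediction at x=4: {slope * 4 + intercept:.2f}")
```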
According to ENISA’s Guideline on Threats and Assets, published in the context of ENISA’s Security framework for Article 4 and 13a proposal, an asset is defined as “anything of value. Assets can be abstract assets (like processes or reputation), virtual assets (for instance, data), physical assets (cables, a piece of equipment), human resources, money”. An item of our taxonomy either describes data itself or describes assets that generate, process, store, or transmit chunks of data and that, as such, are exposed to cyber-security threats. In addition to the ENISA Big Data Threat Landscape,[4] a major source of information for this study is the work undertaken by the NIST Big Data Public Working Group (NBD-PWG), which resulted in two draft volumes (Volume 1 on Definitions and Volume 2 on Taxonomy). Another source of information is the report “Big Data Taxonomy”, issued by the Cloud Security Alliance (CSA) Big Data Working Group in September 2014, which introduces a six-dimensional taxonomy for Big Data built around the nature of the data.
Assets can be categorized into five different classes: data, infrastructure, analytics, security and privacy techniques, and roles (a sketch modelling this taxonomy follows the class summaries below):
Data assets can be summarized as follows:
Infrastructure assets can be summarized as follows:
Analytics assets can be summarized as follows:
Security and privacy techniques assets can be summarized as follows:
Roles assets can be summarized as follows:
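To make the structure of the taxonomy concrete, the following minimal Python sketch models the five asset classes as an enumeration. The class names come from the list above; the example assets attached to each class are hypothetical placeholders, since the detailed tables are not reproduced here.

```python
# A minimal sketch of the five-class asset taxonomy described above.
# The class names follow the report; the example assets listed for each
# class are illustrative assumptions, not the report's actual tables.
from enum import Enum


class AssetClass(Enum):
    DATA = "data"
    INFRASTRUCTURE = "infrastructure"
    ANALYTICS = "analytics"
    SECURITY_AND_PRIVACY = "security and privacy techniques"
    ROLES = "roles"


# Hypothetical example assets per class, for illustration only.
EXAMPLE_ASSETS = {
    AssetClass.DATA: ["structured data", "streaming data"],
    AssetClass.INFRASTRUCTURE: ["storage cluster", "network link"],
    AssetClass.ANALYTICS: ["machine-learning model", "query engine"],
    AssetClass.SECURITY_AND_PRIVACY: ["encryption", "access control"],
    AssetClass.ROLES: ["data owner", "data analyst"],
}

for asset_class, assets in EXAMPLE_ASSETS.items():
    print(f"{asset_class.value}: {', '.join(assets)}")
```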
We note that most of the categories and sub-categories in the above tables remain relevant for generic data, of which Big Data is just a specialization. For example, relational databases are a very typical and common resource in every enterprise infrastructure and do not necessarily store large data volumes. Even when relational databases do hold large data volumes, they are often manageable with traditional hardware clusters, appliances, and software tools.
[1] IDC, Worldwide Semiannual Big Data and Analytics Spending Guide, 2018, https://www.idc.com/getdoc.jsp?containerId=prUS44215218
[2] Bernard Marr, How Much Data Do We Create Every Day? The Mind-Blowing Stats Everyone Should Read, May 2018, https://www.forbes.com/sites/bernardmarr/2018/05/21/how-much-data-do-we-create-every-day-the-mind-blowing-stats-everyone-should-read/#363f1fcf60ba
[3] Y. Demchenko, P. Membrey, P. Grosso and C. de Laat, “Addressing Big Data Issues in Scientific Data Infrastructure,” in Proc. of CTS 2013, San Diego, CA, USA, May 2013.
[4] ENISA, Big Data Threat Landscape and Good Practice Guide, January 2016.