

A Big Data system supporting Bosch Braga Industry 4.0 strategy


Maribel Yasmina Santos, Jorge Oliveira e Sá, Carina Andrade, Francisca Vale Lima,
Eduarda Costa, Carlos Costa, Bruno Martinho, João Galvão⁎
ALGORITMI Research Center, University of Minho, Guimarães, Portugal

A R T I C L E  I N F O

Keywords: Big Data, Industry 4.0, Big Data analytics, Big Data architecture, Bosch

A B S T R A C T

People, devices, infrastructures and sensors can constantly communicate, exchanging data and generating new data that trace many of these exchanges. This leads to vast volumes of data, collected at ever increasing velocities and of different variety, a phenomenon currently known as Big Data. In particular, recent developments in Information and Communications Technologies are pushing the fourth industrial revolution, Industry 4.0, in which data is generated by several sources like machine controllers, sensors and manufacturing systems, among others. Joining the volume, variety and velocity of data with Industry 4.0 creates the opportunity to enhance sustainable innovation in the Factories of the Future. Here, the collection, integration, storage, processing and analysis of data is a key challenge, with Big Data systems needed to link all the entities and data needs of the factory. Thereby, this paper addresses this key challenge, proposing and implementing a Big Data Analytics architecture, using a multinational organisation (Bosch Car Multimedia – Braga) as a case study. In this work, the whole data lifecycle, from collection to analysis, is handled, taking into consideration the different data processing speeds that can exist in the real environment of a factory (batch or stream).

1. Introduction

Nowadays, data is generated at unprecedented rates, mainly due to the advancements in cloud computing, the internet, mobile devices and embedded sensors (Dumbill, 2013; Villars, Olofson, & Eastwood, 2011). The way people interact with organisations, the data produced by organisations' day-by-day activities and the rate at which transactions occur may create unprecedented challenges in data collection, storage, processing and analysis. If organisations find a way to extract business value from this data, they will most likely gain significant competitive advantages (Villars et al., 2011).

Big Data is often seen as a catchword for smarter and more insightful data analysis, but it is more than that: it is about new, challenging data sources helping to understand business at a more granular level, creating new products or services, and responding to business changes as they occur (Davenport, Barth, & Bean, 2012). As we live in a world that constantly produces and consumes data, it is a priority to understand the value that can be extracted from it.

Big Data will have a significant impact on value creation and competitive advantage for organisations, such as new ways of interacting with customers or developing new products, services and strategies, raising profitability. Another area where the concept of Big Data is of major relevance is the Internet of Things (IoT), seen as a network of sensors embedded into several devices (e.g., applications, smartphones, cars), which is a significant source of data that can bring many organisations, like factories, into the era of Big Data (Chen, Mao, & Liu, 2014). In this context of the factories of the future, the fourth industrial revolution (Industry 4.0) uses technological innovations to enhance productive processes through the integration of more automation, controlling and information technologies.

To support the data needs in these Factories of the Future, a Big Data Analytics architecture was envisaged and proposed (Santos, Oliveira e Sá et al., 2017), integrating several layers and components for the collection, storage, processing, analysis and distribution of data, making available an integrated environment that supports decision-making at the several levels of the managerial process.

This paper intends to present a Big Data system that implements and validates a specific set of components of this architecture, using the ongoing work at a multinational organisation (Bosch Car Multimedia – Braga) as a case study. In this, several layers of the architecture, and some of their specific components, are tested, handling the data lifecycle that goes from collection to analysis and visualisation, also proposing and modelling the data structures needed to support storage and processing in a Big Data Warehouse.


Corresponding author.
E-mail addresses: maribel@dsi.uminho.pt (M.Y. Santos), jos@dsi.uminho.pt (J. Oliveira e Sá), carina.andrade@dsi.uminho.pt (C. Andrade),
franciscavlima@dsi.uminho.pt (F. Vale Lima), eduardacosta@dsi.uminho.pt (E. Costa), carlos.costa@dsi.uminho.pt (C. Costa), bruno_martinho@dsi.uminho.pt (B. Martinho),
joao.galvao@dsi.uminho.pt (J. Galvão).

http://dx.doi.org/10.1016/j.ijinfomgt.2017.07.012

0268-4012/ © 2017 Elsevier Ltd. All rights reserved.


In methodological terms, all this work has been undertaken in order to propose a technological artefact built upon design science (Hevner, March, Park, & Ram, 2004), here made available as a system prototype. The research process uses the Design Science Research Methodology for Information Systems from Peffers, Tuunanen, Rothenberger, and Chatterjee (2007), providing a rigorous way of carrying out design science research.

The rest of the paper is organised as follows. Section 2 summarises concepts of Industry 4.0 and the Factories of the Future, pointing out the role of Big Data in this fourth industrial revolution. Section 3 describes the evolution of the Business Intelligence and Big Data Analytics area, giving the context for the emergence of the Big Data concept. Section 4 presents the proposed Big Data Analytics architecture, describing its several layers and highlighting the selected layers and components for implementing the proof-of-concept. Section 5 reports the work done to undertake the implementation, addressing the Big Data Warehouse modelling process and all the related tasks, from data integration, cleaning, transforming and loading to data analysis and visualisation. Finally, Section 6 concludes with some remarks and guidelines for future work.

2. Industry 4.0 and the factories of the future

Fig. 1. Enabling technologies for Industry 4.0.

Industry 4.0 is a recent concept that was mentioned for the first time in 2011 at the Hannover Fair in Germany. It involves the main technological innovations applied to production processes in the fields of automation, control and information technologies (Hermann, Pentek, & Otto, 2016). The basic foundation of Industry 4.0 implies that, through the connection of machines, systems and assets, organisations can create smart grids all along the value chain, controlling the production processes autonomously. Within the Industry 4.0 framework, organisations will have the capacity and autonomy to schedule maintenance, predict failures and adapt themselves to new requirements and unplanned changes in the production processes (Jazdi, 2014).

In the context of major industrial revolutions, Industry 4.0 is seen as the fourth industrial revolution. The first industrial revolution, around 1780, essentially consisted in the appearance of the steam engine and the mechanical loom. The second industrial revolution, around 1870, included the use of electric motors and petroleum fuel. The third industrial revolution, around 1970, is recognised in the context of the use of computerised systems and robots in industrial production. Finally, the fourth industrial revolution, occurring now, is where computers and automation come together in an integrated way, i.e., robotics connecting computerised systems equipped with machine learning algorithms, in which the production systems are able to learn from data, enabling the increase of efficiency and autonomy of the production processes and, also, making them more customisable (Drath & Horch, 2014; Hermann et al., 2016; Jazdi, 2014).

For the development and deployment of Industry 4.0, six principles are identified that guide the evolution of intelligent production systems for the coming years (Hermann et al., 2016; Kagermann, 2015), namely:

1. Interoperability – systems, people and information transparently intercommunicate in the cyber-physical systems (a fusion of the physical and virtual worlds). This allows exchanging information between machines and processes, interfaces and people;
2. Real-time operation capability – instantaneous data acquisition and processing, enabling real-time decision making;
3. Virtualization – creating smart factories, allowing the remote traceability and monitoring of all processes through the several sensors spread throughout the shop floor;
4. Decentralisation – the cyber-physical systems are spread according to the needs of the production, providing real-time decision-making capabilities. In addition, machines will not only receive commands, but will be able to provide information about their work cycle. Therefore, the smart manufacturing modules will work in a decentralised way to improve production processes;
5. Service Orientation – use of service-oriented software architectures coupled with the IoT concept;
6. Modularity – production processes according to the demand, with the coupling and decoupling of modules in production, giving the flexibility to change machine tasks easily.

Based on the principles described above, Industry 4.0 became possible due to the technological advances of the last decade in the areas of information and engineering. Fig. 1 shows the key technologies enabling Industry 4.0, namely:

• IoT – consists of networking physical objects, environments, vehicles and machines by means of embedded electronic devices, allowing the collection and exchange of data. Systems that operate on the IoT are endowed with sensors and actuators, the cyber-physical systems, and are the basis of Industry 4.0 (Almada-Lobo, 2016; Hermann et al., 2016; Jazdi, 2014; Kagermann, 2015);
• Big Data – in Industry 4.0 contexts, data is generated by several sources like machine controllers, sensors, manufacturing systems and people, among many others. All this voluminous data, arriving at high velocity and in different formats, is called "Big Data". The processing of Big Data in order to identify useful insights, patterns or models is the key to sustainable innovation within an Industry 4.0 factory (Lee, Kao, & Yang, 2014);
• Mobile and Augmented Reality – mobile devices with reliable and inexpensive positioning systems allow the representation of real-time positioning in 3D maps, enabling the use of augmented reality scenarios. These are expected to bring tangible gains in areas such as the identification and localisation of materials or containers, or in maintenance-related activities (Almada-Lobo, 2016);
• Additive Manufacturing – technologies like 3D printing will enable more localised, distributed and reconfigurable production, which will completely change the supply chains. Also, additive manufacturing is a key enabler of mass customisation, reducing the production time and costs for the creation of unique products (Biller & Annunziata, 2014);
• Cloud – cloud-based manufacturing can be described as a networked manufacturing model with reconfigurable cyber-physical production lines, enhancing efficiency, reducing production costs, and allowing optimal resource allocation in response to customers' variable demand (Almada-Lobo, 2016; Jazdi, 2014; Thames & Schaefer, 2016);
• Cybersecurity – one of the major challenges to the success of Industry 4.0 lies in the security and robustness of Information Systems. Problems such as transmission failures in machine-to-machine communication, or even eventual "gagging" of the system, can cause production disruption. With all this connectivity, systems will also need to protect the organisation's know-how embedded in the processing control files (Sommer, 2015; Thames & Schaefer, 2016).

In this context of Industry 4.0, people need to adapt their skills to the needs of the Factories of the Future. Manual labour will be replaced by specialised labour, raising new opportunities for very well trained professionals, in an environment of huge technological variety and challenges (Hermann et al., 2016).

Summarising, when implementing an Industry 4.0 scenario, the focus is not on new technologies, but on how to combine them in a new way, considering three levels of integration: the cyber-physical objects level; the (Big) data infrastructure and models of the mentioned cyber-physical objects; and the services based on the available (Big) data (Drath & Horch, 2014).


3. Business intelligence and Big Data analytics evolution

Over the last years, the interest in Big Data has increased considerably (Trends, 2016), particularly after 2012, as can be seen in Fig. 2. It is important now to look back and see the evolution of data analytics in Business Intelligence (BI) systems and, after that, how we arrived at the Big Data era.

Fig. 2. Increased interest in Big Data. Retrieved from Trends (2016).

Looking back to 1958, Hans Peter Luhn, a researcher at IBM, proposed an automatic system for the dissemination of information to the several players of any industrial, scientific or governmental organisation. The system was based on the use of data-processing machines for providing useful information to those who need it. The processing capabilities were based on statistical procedures and complemented with proper communication facilities and input-output equipment, providing a comprehensive system that accommodates all information needs of an organisation (Luhn, 1958).

The key point in Luhn's proposal was to optimise business using data, a concern that is maintained in more recent definitions of the BI area. Looking at the Gartner IT glossary, BI is nowadays defined as "an umbrella term that includes the applications, infrastructure and tools, and best practices that enable access to and analysis of information to improve and optimise decisions and performance" (Gartner, 2017). Although a broader definition, the focus is maintained on the data processing capabilities that provide useful information and insights for improving business.

Looking into the same glossary for the definition of Big Data, it is defined as "high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation" (Gartner, 2017). Putting aside Big Data characteristics like volume, velocity and variety, the key asset is still information and data processing for supporting the decision-making process.

Given this context, an evolution can be seen from BI to Big Data in terms of the supporting technologies and development frameworks, although the organisational role is the same: processing capabilities that give useful insights on business to decision-makers.

The evolution from BI, or from Business Intelligence and Analytics (BI & A), to Big Data is addressed in the work of Chen, Chiang, and Storey (2012), making a retrospective characterisation of BI & A itself and showing what changes with Big Data. For these authors, the term business analytics was introduced to mention the key analytical components of BI, whereas Big Data is used to describe datasets that are so large and complex that they require advanced and unique data storage, management, analysis and visualisation technologies. In this context, Big Data analytics offers new research directions for BI & A.

Retrospectively, Chen et al. (2012) propose a framework that characterises BI & A in three eras, BI & A 1.0, BI & A 2.0 and BI & A 3.0, verifying the evolution over the years, applications and emerging research areas with different data sources, as can be seen in Fig. 3. From its inception to date, the concept of BI is used in specific applications such as health, commerce and sales, government, security and even science and technology.

In BI & A 1.0, data is mostly structured, distributed by several data sources that include legacy systems, and often stored in Relational Database Management Systems (RDBMS). Data Warehouses (DW) are a foundation of this era and DW schemas are essential for integrating and consolidating enterprise data, supported by Extraction, Transformation and Loading (ETL) mechanisms. Online Analytical Processing (OLAP) and reporting tools based on intuitive graphics are used to explore data, providing interactive environments for ad-hoc query processing, complemented by statistical methods and data mining algorithms for advanced data analytics.

BI & A 2.0 started to emerge when the Internet and the Web offered new ways of data collection and analytics. In these contexts, detailed and IP-specific user search and interaction logs are collected through cookies and server logs, allowing the exploration of customers' needs and potentiating the identification of new business opportunities. This era is centred on text and web analytics from unstructured data, using data analytics techniques such as web intelligence, web analytics, text mining, web mining, social network analysis or spatial-temporal data analysis (Chen et al., 2012).

BI & A 3.0 emerges with the new role of mobile devices and their increasing use in our modern society. Mobile phones, tablets, sensor-based Internet-enabled devices, barcodes, and radio tags, communicating together in the IoT, support mobile, location-aware, person-centred and context-relevant operations (Chen et al., 2012). In the context of a vast amount of web-based, mobile and sensor-generated data arriving at ever increasing rates, such Big Data will drive the identification of new insights that can be obtained from highly detailed data.


Fig. 3. BI & A evolution, applications and research. Adapted from Chen et al. (2012).

4. Big Data analytics architecture for Industry 4.0

Moving towards Industry 4.0 requires the adoption of proper Big Data technologies that can be integrated in order to fulfil the data collection, storage, processing and analysis needs. Fig. 4 shows the proposed architecture and its main layers and components, following our previous work in Santos, Oliveira e Sá et al. (2017), for Big Data Analytics in Industry 4.0. This proposal benefits from state-of-the-art work, both in the identification of its main components and in the identification of the Big Data technologies to be adopted (Costa & Santos, 2016a). Besides, as can be seen in Fig. 4, some components of this architecture were already tested and included in the proof-of-concept described in this paper, used to validate the proposed architecture. These components are highlighted in the figure and create a data workflow that goes from data collection to data visualisation, a process that will be explained in more detail in Section 5.

The architecture is divided into seven layers, each layer including components, some of which are already associated with technological tools. While in our previous version several components were instantiated to technologies, now only those implemented in the proof-of-concept are associated with technologies. In Fig. 4, each layer is represented by a rectangle, while dashed rectangles are used to specify the components and associated technologies, when applicable. Data flows between layers are also represented in this figure.

The Entities/Applications layer represents all Big Data producers and consumers, as for instance customers, suppliers and managers (at several managerial levels), among others. These entities are usually consumers of raw data, indicators or metrics, like Key Performance Indicators (KPI), available from the Data Storage (through the Raw Data Publisher layer) and Big Data Analytics layers.

The Data Sources layer includes components such as Databases (operational/transactional databases), Dynamic-link libraries (DLL), Files, ERPs, E-Mail, Web Services or Custom Code, among others. These components can generate data with low velocity and concurrency (like, for instance, data from periodical readings from databases), or data with a high degree of velocity and concurrency (like, for instance, data streams).

The Data Preparation layer (ETL/ELT) corresponds to the process of extracting data from the data sources to the Data Storage layer. From the several technologies that can be used to implement the Data Preparation process, Talend is here used for integrating data from multiple data sources. Talend is a data integration platform that has several elements used to do data extraction, transformation and loading, making available connectors for the file system in Hadoop, NoSQL databases, among others (Talend, 2016).

The Data Storage layer has different components that will be used in different contexts:

• For real-time, data streams will be stored in a real-time fashion into a NoSQL database. There are several NoSQL technologies available, such as column-based, document-based, graph-based, key-value or multi-model, like HBase, Cassandra, MongoDB, CouchDB, DynamoDB, Riak, Redis, Neo4J, among many others. Based on the work of Costa and Santos (2016b), the most adequate NoSQL databases for real-time environments are Cassandra and HBase, with Cassandra selected in this work due to its compatibility with Presto.
• The Staging Area and Big Data Warehouse (BDW) components will save data in a more historical perspective. In the Staging Area component, data is stored in the Hadoop Distributed File System (HDFS) and is available for further use during a delimited period of time. For the BDW, the data previously loaded into the staging area is extracted, transformed and loaded into the BDW, being available for data analytics through the SQL Query Engine component (of the Big Data Analytics layer). Hive is the technology used for the BDW, an infrastructure similar to a traditional DW, which is built on HDFS, enabling distributed storage and processing for storing and aggregating large volumes of data. A minimal sketch of this staging-to-BDW flow is given below.
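Although the paper implements this flow with Talend jobs, the following minimal Python sketch illustrates the staging-to-BDW movement just described; the host name, file path and table name are illustrative assumptions, not values from the paper.

```python
# Minimal sketch of the Staging Area -> BDW flow: a file previously landed
# in the HDFS staging area is moved into a partitioned Hive table, making it
# queryable through the SQL Query Engine component. Hosts, paths and names
# are hypothetical; the authors use Talend for this step.
from pyhive import hive  # pip install pyhive[hive]

conn = hive.Connection(host="hive-server.example.local", port=10000,
                       username="etl_user", database="bdw")
cursor = conn.cursor()
cursor.execute("""
    LOAD DATA INPATH '/staging/idcs/2017-07-01.csv'
    INTO TABLE idcs PARTITION (creation_year=2017)
""")
conn.close()
```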


Fig. 4. Big Data Architecture for Industry 4.0.

The Raw Data Publisher layer enables downloading data available in the Data Storage layer by using Web Services. This interface is included to avoid direct accesses to the Hadoop cluster (where the data is stored) by other Entities/Applications, while making available a way to access, share and distribute data to the several users in the factory (a minimal sketch of such a service is given at the end of this section).

The Big Data Analytics layer includes components that facilitate the analysis of vast amounts of data, making available different data analysis techniques, namely:

• Data Visualisation – a component used for the exploration/analysis of data through intuitive and simple graphs;
• Data Mining (or Knowledge Discovery) – the component responsible for identifying new patterns and insights in data;
• Ad-hoc Querying – a component that allows the interactive definition of queries on data, attending to the users' analytical needs. Queries are defined on-the-fly, mostly depending on the results of previous analyses on data. This component must ensure an easy-to-use and intuitive querying environment;
• Reporting – the component that organises data into informational summaries in order to monitor how the different areas of a business are performing; and,
• SQL Query Engine – the component that provides an interface between the other components in this layer and the Data Storage layer.

In this layer, different technologies can be used, like for example R, Weka, Spark and commercial tools like Tableau, SAS, PowerPivot, QlikView, SPSS, among others. In the Big Data system here presented, Presto was chosen as the SQL Query Engine due both to its connector to the NoSQL database Cassandra and to its good results in a recently performed benchmark (Santos, Costa et al., 2017). Nevertheless, it is worth mentioning that many other technologies can be used apart from Presto, such as Impala, HAWQ, IBM Big SQL and Drill, among others. For Data Visualisation, Tableau was selected due to its successive good evaluations: for the fourth year, Gartner named Tableau a leader in the Magic Quadrant for BI & A platforms (Tableau, 2016).

Finally, the Security, Administration and Monitoring layer includes components that provide base functionalities needed in the other layers, and that ensure the proper functioning of the whole infrastructure. In this layer, the components needed are:

• Cluster Tuning and Monitoring – detects bottlenecks and improves performance by adjusting some parameters of the adopted technologies;
• Metadata Management – the needed metadata can be divided into three categories:
  ○ Business – describes the data ownership information and business definition;
  ○ Technical – includes database systems' names, tables' definitions, and data characterisation like columns' names, sizes, data types and allowed values;
  ○ Operational – description of the data status (active, archived, or purged), history of the migrated data and transformations applied on it.
• Authorization and Auditing – user authorizations, data access policy management and the tracking of users' operations are represented in this component;
• Data Protection – associated with policies for data storage, allowing data to be encrypted or not, attending to how critical or sensitive the data is;
• Authentication – representing the authentication of the users in the Big Data infrastructure, here shortly named as the Big Data cluster.
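Closing the description of the architecture, the following is a minimal sketch of the kind of web service the Raw Data Publisher layer describes: clients obtain data over HTTP instead of accessing the Hadoop cluster directly. Flask and the Presto Python client are implementation choices assumed here (the paper only mandates Web Services), and all host, endpoint and table names are hypothetical.

```python
# Hypothetical Raw Data Publisher endpoint: serves IDC rows for one year as
# JSON, delegating the actual read to the SQL Query Engine (Presto).
from flask import Flask, jsonify
import prestodb  # pip install presto-python-client

app = Flask(__name__)

@app.route("/idcs/<int:year>")
def idcs_by_year(year):
    conn = prestodb.dbapi.connect(host="presto.example.local", port=8080,
                                  user="publisher", catalog="hive",
                                  schema="bdw")
    cur = conn.cursor()
    # The int route converter guarantees "year" is numeric; filtering on the
    # creation_year partition key keeps the scan cheap.
    cur.execute(f"SELECT * FROM idcs WHERE creation_year = {year}")
    cols = [c[0] for c in cur.description]
    return jsonify([dict(zip(cols, row)) for row in cur.fetchall()])

if __name__ == "__main__":
    app.run(port=5000)
```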


5. Bosch's Big Data warehouse: a case study

After the proposal of a Big Data Analytics architecture aligned with Industry 4.0 needs, its implementation is now of major relevance, a task that will validate all the work undertaken so far. This validation is being made in an organisation which is aligned with Industry 4.0 concepts, Bosch Car Multimedia in Braga, Portugal.

For this purpose, and due to its complexity, the validation of this architecture needs to be made in phases. This first validation phase is focused on the data workflow previously shown in Fig. 4, which highlights the components that are here used and tested, from data collection to data visualisation.

In this data workflow, data selected mostly from the organisational SAP ERP is regularly extracted and stored as Excel files in specific folders, being available for the needed ETL processes, implemented with the use of Talend. As the available data contains historical and up-to-date transactional data, which can be collected in real-time, all components of the Data Storage layer are used, namely Cassandra, HDFS and Hive. For data processing, supporting analytical tasks, Presto runs the required queries while Tableau allows the visualisation of the obtained results.

The data used to implement the system and validate the architecture is from Bosch customers' Quality Complaints (QCs) and Internal Defect Costs (IDCs). Due to privacy concerns, all the presented values and descriptions were manipulated to mask or hide the real values and names.

As one of the central components of the Big Data architecture is its DW, which is responsible for the integration and consolidation of data from different business processes, this paper describes how this repository is modelled using a methodological approach for building BDWs (Costa & Santos, 2017; Santos, Martinho, & Costa, 2017). This approach includes prescriptive models and methods that guide the design and implementation of complex analytical systems and will be the basis for setting the BDW model, together with a Big Data Application Provider method that is divided into three main phases:

1. Collection: refers to data acquisition and metadata creation;
2. Preparation: refers to data validation, cleansing, outlier removal, standardisation and reformatting;
3. Access, Analytics & Visualisation: implements the techniques to extract knowledge from the data, represented in an optimised communication format, involving the production of reports or graphs for analysis.

5.1. Big Data model

To start the architecture validation, it was necessary to analyse the data that would be used. For that, Bosch provided a list of all the used attributes in the two selected areas (QCs and IDCs) and a data sample of them. All the attributes that seemed relevant for analysis were used and included in an Entity–Relationship Diagram (ERD), enhancing the understanding of the entities, their attributes and relationships. Taking as an example the IDCs, Fig. 5 presents the obtained ERD (the QCs ERD is not depicted here due to its complexity, as it includes more than 200 attributes distributed over 50 tables).

Based on the two ERDs, QCs and IDCs, and following two different variants of the methodological approach, one was first transformed into a multidimensional model and then into a Hive data model, for QCs, and the other was directly transformed into a Hive data model, for IDCs. Both Hive data models are here named the Big Data Model. To define it, the first step consists in identifying the analytical objects (in this case the QCs and IDCs), which have descriptive and analytical attributes, and an associated granularity. The descriptive attributes are those allowing the interpretation of analytical attributes, considering the possibility of different perspectives, using aggregation or filter operations, for example. The analytical attributes are the ones with numeric values that can be analysed using the different descriptive attributes, as happens in a traditional DW (Kimball & Ross, 2013). In addition, to optimise query performance, the big data models can be materialised as tables.

In the case of IDCs, the analytical object identified is the IDC and the granularity of the analytical object is the IDC row, i.e., each IDC identified by an "IDC_Code". The attributes identified as analytical are the factual ones, namely the quantity of affected units ("Quantity"), the value associated with the IDC ("Currency_Value") and the informative flags ("Flag_Valuated", "Flag_Cancellation"). The others were identified as descriptive attributes.

There is one fundamental difference between the IDCs and the QCs datasets, which is of major relevance for explaining how the proposed Big Data Analytics architecture deals with data arriving at different speeds. In the Bosch case study, IDCs are directly stored in the BDW component (with the intervention of the Staging Area component), while QCs are stored in the real-time component or in the BDW component, depending on their status (for instance, whether the QC is already finished or not). This design decision is justified because the Hadoop BDW (HDFS and Hive) can lack efficient support for fast and constant random-access inserts/updates. During a certain time frame, QCs are constantly being updated (e.g., status changes), which leads to the need of using a NoSQL database to support analytical tasks, namely Cassandra in this case study.

However, as Costa and Santos (2017) demonstrate, NoSQL databases are OLTP-oriented and can lack support for fast sequential access over large amounts of data, typically required in analytical environments. In contrast, the Hadoop BDW is the main choice for fast sequential access and, therefore, after QCs are marked as "closed", they are transferred from Cassandra to Hive. This represents a mix between real-time and historical analytics that can be significantly useful for organisations, combining different perspectives into one single picture. Hive tables and Cassandra column families are modelled by organising data into analytical objects, descriptive attributes and analytical attributes. Fig. 6 presents an extract of the data model used in this case study.

As can be seen in Fig. 6, this modelling approach considers historical analytical objects stored as Hive tables and real-time analytical objects stored as Cassandra column families. Among the descriptive attributes of an analytical object (top half of the "idcs" and "qcs" in Fig. 6), one can find two other concepts: primary keys, whose main goal is identical to the one present in the primary keys of traditional databases, i.e., to uniquely identify a record; and partition keys, which allow data distribution according to the values of the partition key. Regarding the Bosch case study, the "creation_year" attribute in the "idcs" Hive table is used to fragment it into different folders according to the year in which the IDC was created. This considerably improves query execution times, since users tend to search IDCs for specific years, typically the current year. There is no need to define a primary key in Hive. In contrast, Cassandra requires a primary key, which in this case is the "qc_notification_code" attribute in the "qcs" column family. This attribute is also the partition key, as, in Cassandra, the first part of the primary key is always the partition key, evenly distributing data throughout the nodes in the cluster according to the range of values of the partition key.

This data modelling approach provides significant flexibility, because analytical objects are denormalised structures, unlike traditional fact tables in relational DWs, which rely on constant join operations to answer analytical queries. In Big Data environments, using denormalised structures allows faster execution times and simple collection, preparation and enrichment processes (Jukic, Jukic, Sharma, Nestorov, & Korallus Arnold, 2017), reducing the time between data collection and data analysis.
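The following sketch shows DDL consistent with this modelling approach, issued from Python. Only the "creation_year" partition key, the "qc_notification_code" primary key and the analytical attributes named in the text come from the paper; the remaining columns, hosts and the "bdw" database/keyspace are illustrative assumptions.

```python
# Sketch of the two analytical objects of Fig. 6 (abbreviated column lists).
from pyhive import hive
from cassandra.cluster import Cluster  # pip install cassandra-driver

# Historical analytical object: a denormalised Hive table, fragmented into
# one HDFS folder per year through the "creation_year" partition key.
hive_cursor = hive.Connection(host="hive-server.example.local", port=10000,
                              database="bdw").cursor()
hive_cursor.execute("""
    CREATE TABLE IF NOT EXISTS idcs (
        idc_code STRING,
        quantity INT,
        currency_value DOUBLE,
        flag_valuated TINYINT,
        flag_cancellation TINYINT
    )
    PARTITIONED BY (creation_year INT)
""")

# Real-time analytical object: a Cassandra column family; the first part of
# the primary key is automatically the partition key that spreads rows
# evenly across the nodes of the cluster.
session = Cluster(["cassandra.example.local"]).connect("bdw")
session.execute("""
    CREATE TABLE IF NOT EXISTS qcs (
        qc_notification_code text PRIMARY KEY,
        status text,
        quantity int
    )
""")
```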


Fig. 5. IDCs ERD.

As depicted in Fig. 6, in cases where different subjects are related (e.g., sales and complaints), analytical objects can be joined or united, even if they are stored in different systems, since Presto can retrieve data from both Hive and Cassandra simultaneously, using a single query. Since join operations can be costly in Big Data environments (Chang, 2015; Floratou, Minhas, & Özcan, 2014; Marz & Warren, 2015; Wang, Qin, Zhang, Wang, & Wang, 2011), one must highlight that these join and union operations are entirely optional, and are only needed if certain queries combine different subjects to answer specific business questions. Moreover, as previously mentioned, the results of complex and long-running queries can be materialised into Hive tables, in order to achieve interactive data visualisation mechanisms.
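As an illustration of such a federated query, the sketch below joins the historical Hive table with the real-time Cassandra column family through a single Presto statement. The catalog names ("hive", "cassandra") follow Presto's conventional connector naming, and the join condition is an assumed relationship, not one documented in the paper.

```python
# One Presto query touching both stores at once, as described above.
import prestodb

conn = prestodb.dbapi.connect(host="presto.example.local", port=8080,
                              user="analyst", catalog="hive", schema="bdw")
cur = conn.cursor()
cur.execute("""
    SELECT i.creation_year,
           count(*)              AS complaints,
           sum(i.currency_value) AS defect_cost
    FROM hive.bdw.idcs i
    JOIN cassandra.bdw.qcs q
      ON q.qc_notification_code = i.idc_code
    GROUP BY i.creation_year
""")
for row in cur.fetchall():
    print(row)
```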


Fig. 6. Bosch data model extract.

Fig. 7. Phases of the followed approach.


Fig. 8. Validation tests on files (Talend job).

5.2. Big Data application provider method

As already mentioned, the methodological approach followed in this implementation has three phases, as can be seen in Fig. 7: the Collection, Preparation and Access, Analytics & Visualisation phases.

The Collection phase refers to the data acquisition and it is where the metadata are created. The Preparation phase includes the data validation, cleansing, outlier removal, standardisation and reformatting tasks, and the storage of the resulting data. The Access, Analytics & Visualisation phase includes the access to the data for organising it in reports and graphs that help the decision-making process.

5.2.1. Collection phase

As already mentioned, in the proof-of-concept there is not (yet) any direct access to the SAP ERP system, which is why temporary Excel files are used. The data files are periodically extracted and made available for the refreshment of the Data Storage, including Cassandra, HDFS and Hive.

The Collection phase is focused on an automatic Talend job that runs periodically and starts with a set of validation tests on the available source files, ensuring that each source file really exists, has a valid structure and is not empty.

These validation tests are divided into five different steps, as can be seen in Fig. 8 (a sketch of the same checks closes this subsection), including:

(1) monitoring of error occurrences during all the validation tests, which catches any error occurrence in any component of the ETL process. If something fails, the process stops, the error is described in the log file and an email is sent to the Data Storage administrator or responsible. This email aims to keep the administrator or responsible updated about the ETL process status. This monitoring is transversal to all validation tests;
(2) verifying the existence of the temporary files in the corresponding folder;
(3) checking if the MD5 hash of the source file is the same as that of previously loaded files;
(4) verifying that the file is not empty; and,
(5) validating the file structure.

After the validation tests have concluded successfully, the source file is ready to be loaded into the file system in two different ways. The first one is a permanent raw data file that keeps all the original attributes with no treatment; this means that it will always remain available, as it is, in the system. The second one is a temporary file that will be used in the next step as a source for data preparation; in this file, the null rows, the attributes with less than 1% of fulfilment, and the other attributes in the list of those with no analytical value are excluded. The remaining attributes are loaded to HDFS.
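The paper implements these checks as a Talend job (Fig. 8); the sketch below mirrors the same five steps in Python, with the file path, the expected header and the mail addresses as hypothetical placeholders.

```python
# Python mirror of the five Talend validation steps (1)-(5) described above.
import hashlib
import os
import smtplib
from email.message import EmailMessage

SOURCE = "/data/incoming/idcs.csv"      # hypothetical drop folder
EXPECTED_HEADER = b"IDC_Code;Quantity"  # assumed file structure
KNOWN_HASHES = set()                    # MD5 digests of already-loaded files

def alert(error):
    # Step (1): on any failure the process stops, the error is reported and
    # the Data Storage administrator is notified by e-mail.
    msg = EmailMessage()
    msg["From"], msg["To"] = "etl@example.local", "admin@example.local"
    msg["Subject"] = "ETL validation failed"
    msg.set_content(error)
    smtplib.SMTP("mail.example.local").send_message(msg)
    raise SystemExit(error)

if not os.path.exists(SOURCE):                     # step (2): file exists
    alert(f"{SOURCE} not found")
data = open(SOURCE, "rb").read()
if hashlib.md5(data).hexdigest() in KNOWN_HASHES:  # step (3): new file?
    alert("a file with the same MD5 hash was already loaded")
if not data:                                       # step (4): not empty
    alert("file is empty")
if not data.startswith(EXPECTED_HEADER):           # step (5): structure
    alert("unexpected file structure")
```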

5.2.2. Preparation phase

To guarantee that the available data can be used for decision support, it is important to certify its quality, which is why data cleaning, data conversion or other operations on data are needed.

To ensure data quality, a preliminary analysis of the available data was done, identifying those attributes that present anomalies, like erroneous values or missing data fields. In this last case, several attributes with a high percentage of missing values were identified, representing data that do not bring any analytical value.

For erroneous data values, all the identified problems were catalogued and proper data transformation tasks were defined to improve data quality. Table 1 presents the main groups of problems identified in the IDCs (and QCs) data and the appropriate transformations defined to mitigate them.

Table 1
Examples of Problems and Transformations.

Problem: Null fields
Transformation: For the fields that correspond to codes in a hierarchy, the nulls are replaced with a code that exists (father or child code). For nulls in numeric fields, the null values remain, to avoid possibly misleading results. For any other fields, null fields are replaced with the expression "Not Applicable" or "Unknown" (depending on the subject).

Problem: Numeric values with dots
Transformation: Since not all numeric values have the dot separating thousands, the dots are deleted.

Problem: Dates separated by dots
Transformation: The dots are deleted and the field is converted to the date format. In some cases, it is necessary to create three more attributes: "day", "month" and "year".

Problem: Flag treatment
Transformation: Replace the "X" value with "1" and the null value with "0".

Problem: Split of values
Transformation: Some attributes have a specific codification that is only known within the organisation. In these cases, Bosch provided the necessary information to split the data. As an example, the attribute "Batch_Manufacturing" has characters with different meanings. Thus, based on this attribute, four new ones are created: Department, Production Line, Production Shift and Workstation.
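A pandas sketch of these transformation rules follows. Column names are taken from the text where given; the slice positions used to split "Batch_Manufacturing" and the date format are assumptions, since Bosch's real codification is not published.

```python
# Illustrative pandas version of the Table 1 transformations.
import pandas as pd

df = pd.read_excel("/data/incoming/idcs.xlsx")  # hypothetical extract

# Null fields: textual nulls become "Unknown"; numeric nulls are kept.
df["Defect_Description"] = df["Defect_Description"].fillna("Unknown")

# Numeric values with dots: drop the inconsistent thousands separators.
df["Currency_Value"] = (df["Currency_Value"].astype(str)
                        .str.replace(".", "", regex=False).astype(float))

# Dates separated by dots: convert "31.12.2016" and derive the date parts.
df["Creation_Date"] = pd.to_datetime(df["Creation_Date"], format="%d.%m.%Y")
df["year"] = df["Creation_Date"].dt.year
df["month"] = df["Creation_Date"].dt.month
df["day"] = df["Creation_Date"].dt.day

# Flag treatment: "X" -> 1, null (or anything else) -> 0.
df["Flag_Valuated"] = (df["Flag_Valuated"] == "X").astype(int)

# Split of values: derive four attributes from "Batch_Manufacturing"
# (slice boundaries are illustrative, not Bosch's real codification).
code = df["Batch_Manufacturing"].astype(str)
df["Department"] = code.str[:2]
df["Production_Line"] = code.str[2:4]
df["Production_Shift"] = code.str[4:5]
df["Workstation"] = code.str[5:]
```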


Once the file with the transformed data is obtained, the next step is divided into two different approaches:

• For the IDCs data, the transformed file is saved into HDFS and then a connection with Hive is needed to create the Hive table and move the data from HDFS to Hive. This Hive table contains the data and the corresponding metadata, and it is partitioned by the year of the IDC "Creation_Date", optimising query processing;
• For the QCs data, and due to specific business rules at Bosch (the QC information is handled for months, changing its status several times over that period), data goes to a Cassandra column family or to a Hive table, depending on the QC status, as already explained in subsection 5.1.

5.3. Access, analytics & visualisation phase

With the data available in Hive and Cassandra, Tableau is used to visualise and analyse the information in an interactive way. Presto is the technology used in the SQL Query Engine component to connect Tableau to Hive and/or Cassandra, in order to provide the data used for producing dynamic dashboards about IDCs and QCs.

Fig. 9 presents an example of a visualisation in Tableau, showing a storytelling view and the respective selected dashboard. On top, the storytelling view is used to navigate between the several dashboards. Below, a specific dashboard shows, for the years 2014 and 2015, the customer QCs at the production date, categorised by Business Unit and coloured by part number (corresponding to specific pieces of produced components).

Fig. 9. Dashboard Example (for QCs).

In this dashboard, it is possible to see different patterns and outliers that could show a production problem on specific days. For example, in the PS Business Unit, several dots with the same colour are identified in a delimited area, indicating that the same product type had an abnormal number of complaints during that period, ranging from March to November of 2016. Also, for the other Business Units, several peaks are visible, pointing to possible production problems.

Although the data shown here does not reflect real values, it is possible to see how the proposed system allows data analysis, supporting decision making and improving organisational analytical capabilities.

6. Conclusions

This paper presented the implementation of a Big Data system aimed at validating a Big Data Analytics architecture for Industry 4.0. In this implementation, specific layers of the proposed architecture, and specific components for those layers, were integrated into a data workflow from data collection to data analysis and visualisation.

The presented proof-of-concept showed how these technologies complement each other, pursuing the overall goal of supporting the decision-making process. For this, two specific business processes were selected, QCs and IDCs, and data was modelled, cleaned, transformed and delivered to the data storage components, able to deal with historical data and with data that can arrive in streams, enabling a just-in-time response from Bosch Braga to possible problems.

At the Big Data Analytics layer, where specific dashboards are made available for data analysis and visualisation, it was possible to see that all the selected technologies worked together and that no integration or interoperability problems were detected.


In the future, it is expected to test other components of the architecture, until all the layers are fully characterised in terms of the technologies that must be used. Some technologies must be chosen and others can be replaced over time, if more promising ones emerge. In terms of business processes, others must be integrated to complement the analysis of the organisation and support the decision-making process.

This architecture is prepared to use several data sources to feed the Data Storage layer, and one of our priorities is to test streaming data collection. Finally, the Data Mining component should be implemented and tested to push this solution to the next level, i.e., an Adaptive Big Data System combining prediction and optimisation techniques to assist decision makers in Industry 4.0.

Acknowledgments

This work has been supported by COMPETE: POCI-01-0145-FEDER-007043 and FCT (Fundação para a Ciência e Tecnologia) within the Project Scope: UID/CEC/00319/2013, and by the Portugal Incentive System for Research and Technological Development, Project in co-promotion no. 002814/2015 (iFACTORY 2015–2018). Some of the figures in this paper use icons made by Freepik, from www.flaticon.com.

References

Almada-Lobo, F. (2016). The Industry 4.0 revolution and the future of Manufacturing Execution Systems (MES). Journal of Innovation Management, 3(4), 16–21.
Biller, S., & Annunziata, M. (2014). The future of work starts now – GE reports. Retrieved 21 March 2017, from http://www.gereports.com/post/93343692948/the-future-of-work-starts-now/.
Chang, W. L. (2015). NIST big data interoperability framework: Volume 6, reference architecture. Gaithersburg, MD. http://dx.doi.org/10.6028/NIST.SP.1500-6.
Chen, H., Chiang, R. H., & Storey, V. C. (2012). Business intelligence and analytics: From big data to big impact. MIS Quarterly, 36(4), 1165–1188. http://dx.doi.org/10.1145/2463676.2463712.
Chen, M., Mao, S., & Liu, Y. (2014). Big data: A survey. Mobile Networks and Applications, 19, 171–209. http://dx.doi.org/10.1007/s11036-013-0489-0.
Costa, C., & Santos, M. Y. (2016a). BASIS: A big data architecture for smart cities. SAI Computing Conference (SAI) (pp. 1247–1256). http://dx.doi.org/10.1109/SAI.2016.7556139.
Costa, C., & Santos, M. Y. (2016b). Reinventing the energy bill in smart cities with NoSQL technologies. Transactions on Engineering Technologies (pp. 383–396). Singapore: Springer. http://dx.doi.org/10.1007/978-981-10-1088-0_29.
Costa, C., & Santos, M. Y. (2017). The SusCity big data warehousing approach for smart cities. In B. C. Desai, J. Hong, & R. McClatchey (Eds.), Proceedings of the 21st International Database Engineering & Applications Symposium (IDEAS 2017) (pp. 264–273). http://dx.doi.org/10.1145/3105831.3105841.
Davenport, T. H., Barth, P., & Bean, R. (2012). How big data is different. MIT Sloan Management Review, 54(1), 43–46.
Drath, R., & Horch, A. (2014). Industrie 4.0 – Hit or hype? IEEE Industrial Electronics Magazine, 8(2), 56–58. http://dx.doi.org/10.1109/mie.2014.2312079.
Dumbill, E. (2013). Making sense of big data. Big Data, 1(1), 1–2. http://dx.doi.org/10.1089/big.2012.1503.
Floratou, A., Minhas, U. F., & Özcan, F. (2014). SQL-on-Hadoop: Full circle back to shared-nothing database architectures. Proceedings of the VLDB Endowment, 7, 1295–1306. http://dx.doi.org/10.14778/2732977.2733002.
Gartner (2017). Gartner IT definitions and glossary. Retrieved 23 January 2017, from http://www.gartner.com/technology/it-glossary/.
Hermann, M., Pentek, T., & Otto, B. (2016). Design principles for Industrie 4.0 scenarios. 49th Hawaii International Conference on System Sciences (HICSS) (pp. 3928–3937). http://dx.doi.org/10.1109/HICSS.2016.488.
Hevner, A. R., March, S. T., Park, J., & Ram, S. (2004). Design science in information systems research. MIS Quarterly, 28(1), 75–105. http://dx.doi.org/10.2307/25148625.
Jazdi, N. (2014). Cyber physical systems in the context of Industry 4.0. 2014 IEEE Automation, Quality and Testing, Robotics, 2–4. http://dx.doi.org/10.1109/AQTR.2014.6857843.
Jukic, N., Jukic, B., Sharma, A., Nestorov, S., & Korallus Arnold, B. (2017). Expediting analytical databases with columnar approach. Decision Support Systems, 95, 61–81. http://dx.doi.org/10.1016/j.dss.2016.12.002.
Kagermann, H. (2015). Change through digitization – Value creation in the age of Industry 4.0. Management of Permanent Change, 23–45. http://dx.doi.org/10.1007/978-3-658-05014-6_2.
Kimball, R., & Ross, M. (2013). The data warehouse toolkit: The definitive guide to dimensional modeling (3rd ed.). John Wiley & Sons, Inc.
Lee, J., Kao, H. A., & Yang, S. (2014). Service innovation and smart analytics for Industry 4.0 and big data environment. Procedia CIRP, 16, 3–8. http://dx.doi.org/10.1016/j.procir.2014.02.001.
Luhn, H. P. (1958). A business intelligence system. IBM Journal of Research and Development, 2, 314–319. http://dx.doi.org/10.1147/rd.24.0314.
Marz, N., & Warren, J. (2015). Big data: Principles and best practices of scalable real-time data systems. Shelter Island, NY: Manning Publications Co.
Peffers, K., Tuunanen, T., Rothenberger, M. A., & Chatterjee, S. (2007). A design science research methodology for information systems research. Journal of Management Information Systems, 24(3), 45–77. http://dx.doi.org/10.2753/mis0742-1222240302.
Santos, M. Y., Costa, C., Galvão, J., Andrade, C., Martinho, B., Lima, F. V., & Costa, E. (2017). Evaluating SQL-on-Hadoop for big data warehousing on not-so-good hardware. Manuscript submitted for publication.
Santos, M. Y., Martinho, B., & Costa, C. (2017). Modelling and implementing big data warehouses for decision support. Journal of Management Analytics, 4(2), 111–129. http://dx.doi.org/10.1080/23270012.2017.130429.
Santos, M. Y., Oliveira e Sá, J., Costa, C., Galvão, J., Andrade, C., Martinho, B., ... Costa, E. (2017). A big data analytics architecture for Industry 4.0. In WorldCIST 2017, Advances in Intelligent Systems and Computing, Vol. 570. Cham: Springer. http://dx.doi.org/10.1007/978-3-319-56538-5_19.
Sommer, L. (2015). Industrial revolution – Industry 4.0: Are German manufacturing SMEs the first victims of this revolution? Journal of Industrial Engineering and Management, 8(5), 1512–1532. http://dx.doi.org/10.3926/jiem.1470.
Tableau (2016). For fourth year, Gartner names Tableau a "leader" in Magic Quadrant. Retrieved 20 February 2017, from https://www.tableau.com/about/blog/2016/2/fourth-year-gartner-names-tableau-leader-magic-quadrant-49719.
Talend (2016). Big data integration for Spark & Hadoop: Big data system. Retrieved 22 February 2017, from https://www.talend.com/products/big-data/.
Thames, L., & Schaefer, D. (2016). Software-defined cloud manufacturing for Industry 4.0. Procedia CIRP, 52, 12–17. http://dx.doi.org/10.1016/j.procir.2016.07.041.
Trends (2016). Interest in big data over time. Retrieved 15 November 2016, from https://trends.google.pt/trends/explore?date=all&q=big data.
Villars, R. L., Olofson, C. W., & Eastwood, M. (2011). Big data: What it is and why you should care. White Paper, IDC. http://dx.doi.org/10.1080/00049670.2014.974004.
Wang, H., Qin, X., Zhang, Y., Wang, S., & Wang, Z. (2011). LinearDB: A relational approach to make data warehouse scale like MapReduce. International Conference on Database Systems for Advanced Applications, DASFAA 2011 (pp. 306–320). http://dx.doi.org/10.1007/978-3-642-20152-3_23.
