
BIG DATA ANALYTICS IN BANKING SECTOR

Report submitted in partial fulfilment of the requirements for the Degree


of

MASTER OF TECHNOLOGY

in

COMPUTER SCIENCE AND ENGINEERING

By

Y. THIRUPATHI REDDY

11803634

Supervisor

DR. GURBAKASH PHONSA

School of Computer Science and Engineering


Lovely Professional University
Phagwara, Punjab (India)
November, 2019
TABLE OF CONTENTS

CONTENTS                                                                                               PAGE NO.

TITLE PAGE ..................................................................................................................... i

TABLE OF CONTENTS .................................................................................................. ii

LIST OF FIGURES ......................................................................................................... iv

ABSTRACT ...................................................................................................................... v

CHAPTER 1 INTRODUCTION ...................................................................................... 1

1.1. CATEGORIES OF BIG DATA ................................................................................. 3

1.1.1. SOURCES OF BIG DATA .................................................................................... 4

1.2. BIG DATA ANALYTICS TO BANK ON YOUR BIGGEST ASSET - INFORMATION ... 5

1.3. BENEFITS OF STREAM PROCESSING PLATFORM .......................................... 6

1.4. PROPEL STREAM SOLVING CHALLENGES IN BANKING SECTOR ............. 7

1.4.1. IMPACT OF BIG DATA ON BANKING INSTITUTIONS AND MAJOR AREAS OF WORK ... 7

1.4.2. RISK MANAGEMENT ......................................................................................... 8

1.4.3. TRANSACTIONS .................................................................................................. 8

1.4.4. ANALYSIS AND INFERENCES .......................................................................... 8

FEEDBACK ANALYSIS ................................................................................................. 8

DATA COLLECTION AND SAMPLE SIZE ................................................................. 9

1.5. BIG DATA IN BANKING APPLICATIONS ........................................................... 9

BENEFITS OF HADOOP .............................................................................................. 16

HIGH SCALE COMPUTING PLATFORM FOR BIG DATA ANALYTICS ............. 17

CHAPTER 2 LITERATURE REVIEW ......................................................................... 18

CHAPTER 3 METHODOLOGY ................................................................................... 21

CHAPTER 4 SUMMARY AND CONCLUSION ......................................................... 28

REFERENCES ............................................................................................................... 29
LIST OF FIGURES

FIGURE NO. FIGURE DESCRIPTION PAGE NO.

Figure 1.1. General architecture of the data aggregation algorithm 6

Figure 5.1. Proposed Methodology 16

ABSTRACT

Big Data Analytics is the process of examining very large volumes of data. Data has been
streaming into the banking sector continuously for many decades, and nowadays banks maintain
huge amounts of data every day. This large amount of data can be used to uncover patterns in the
movement of money. A common problem in the banking industry is that such data cannot be
handled effectively in traditional relational database management systems; it is out of problems of
this kind that big data evolved in the data world. Big data is characterised by the three V's: volume,
variety and velocity. These characteristics strengthen risk management and also support fraud
detection. Big data addresses the new technological trends in the banking sector and solves
technical problems and issues in an effective way. The main objectives of Big Data Analytics in
the banking sector are customer analysis, fraud detection and risk management.

CHAPTER 1
INTRODUCTION

Big Data Analytics is the process of examining very large volumes of data. Data has been
streaming into the banking sector continuously for many decades, and nowadays banks maintain
huge amounts of data every day. This large amount of data can be used to uncover patterns in the
movement of money. A common problem in the banking industry is that such data cannot be
handled effectively in traditional relational database management systems; it is out of problems of
this kind that big data evolved in the data world. Big data is characterised by the three V's: volume,
variety and velocity. These characteristics strengthen risk management and also support fraud
detection. Big data addresses the new technological trends in the banking sector and solves
technical problems and issues in an effective way. The main objectives of Big Data Analytics in
the banking sector are customer analysis, fraud detection and risk management.

Banks hold a vast variety and amount of customer data due to an increasing number of
transactions through various devices, but they use only a very tiny proportion of it to generate
insights and enhance the customer experience. Data science goes beyond traditional statistics to
extract actionable insights from information. It extracts not just the sort of information you might
find in a spreadsheet, but everything from emails, phone calls, text, images, video, streaming social
media data, internet searches, GPS locations and computer logs. Historically banks collected huge
amounts of data but were unable to derive meaningful insights in a timely manner, which prevented
them from predicting and responding to changing consumer needs and led to missed opportunities.

Today the banking industry truly believes that big data analytics offers a significant
competitive advantage, and even then, only 37% of banks actually have any hands-on experience
with live big data processes or policies. Banks are no longer questioning the benefits of big data,
but some are still holding back. Apparently, 63% of banks and financial institutions are merely
exploring and experimenting with it. As per another study by Dell, only 1 in 5 companies use
advanced analytics or utilize the high-volume or high-velocity data commonly associated with big
data, and the majority of firms seem to have their hands full with their own internal, "small" data
[1]. One of the major reasons behind the slower adoption of big data in the banking industry is
organizational structure and silos.

According to "Deutsche Bank: Big data plans held back by legacy systems" (February 2013),
big data plans at Deutsche Bank were held back by legacy infrastructure, which resulted in a 90%
overlap of data: petabytes of data were stored across 46 data warehouses while the bank constantly
kept collecting data from the front end (trading data), the middle (operations data) and the back end
(finance data). In spite of having a huge amount of data, the bank was unable to utilize it and derive
valuable customer insights, since there was no efficient way to extract the data, streamline it, and
build traceability and linkages from the already existing traditional systems [2]. Traditional
relational databases have limits on field lengths, leading to data loss, and they cope poorly with
complex images, numbers, designs and multimedia. Since they are isolated in nature, information
cannot be shared easily from one large system to another; for example, the database of a hospital's
billing department is unable to "talk" to the database of the hospital's HR department.

Another challenge in big data adoption is the skills and talent gap. It is estimated that by
2018 the United States alone would face a talent gap of about 200,000 professionals with deep
analytical skills, and 1.5 million more able to interpret and use findings effectively for decision
making. According to Ryerson University's paper "Closing Canada's Big Data Talent Gap,"
Canada's big data talent gap is estimated at about 20,000 professionals with deep data and analytical
skills (roles such as Chief Data Officer, Data Scientist and Data Solutions Architect). The gap for
professionals with solid data and analytical literacy to make better decisions is estimated at a further
150,000 professionals (Business Manager and Business Analyst roles). By improving labour-market
clarity, building the right type of talent and having the government act as a key enabler, the industry
can successfully overcome the talent gap. As per another study by Oracle, 60% of companies report
that a lack of data scientists hinders the success of their projects, 65% of organizations are hampered
by too little business intelligence and too few developers of analytic applications, and only 10% of
employees are satisfied with the big data technology resources available to them to support analysis
and decision making.

Data privacy, governance and compliance form the third biggest challenge when it comes
to adopting big data.

Big Data is the storing and analysing of data in order to derive value for organizations such
as banks and educational institutions. For an application that contains a limited amount of data we
normally use SQL or PostgreSQL. [1]These are adequate for small applications, but for large
applications like Facebook, Google and YouTube the data is so large and complex that no traditional
database management system is able to store and process it. Big data is used for analysing data and
improving the business of institutions.

Data is a collection of information, and it is divided into two categories.

1.1. CATEGORIES OF BIG DATA

Structured Data:

Data which can be stored and processed in tables, that is, in the form of rows and columns,
is called structured data. This data is relatively simple to store, enter and analyse.

Unstructured Data:

Data whose form or structure is unknown is called unstructured data. Examples of
unstructured data are images, videos, customer service interactions, web pages, PDF files,
presentations, social media data, etc. Facebook, for instance, generates more than 500 terabytes of
data per day as people upload various images, videos, posts, advertisements and so on.

3 V's of Big Data:

1. Volume: The amount of data, ranging across kilobytes (KB), megabytes (MB), terabytes
   (TB), petabytes (PB), exabytes (EB), zettabytes and yottabytes. Volume refers to
   maintaining very large datasets.
2. Variety: Data comes in all types of formats: structured, unstructured and semi-structured.
   [2]Structured data follows a structural format in the form of rows and columns.
   Unstructured data (text, audio, images, video) has no known form or structure.
   Semi-structured data includes XML files and log files.
3. Velocity: Data is generated at a very fast rate. Velocity is the measure of how fast data is
   arriving. [3]For time-critical applications, faster processing is very important; share
   trading and video sharing are examples of the velocity of big data.

1.1.1. SOURCES OF BIG DATA:

Banking sector: Data coming from banks such as RBI, ICICI, State Bank of India and other
banks.

Social media: Data coming from social media services such as Facebook likes, photos, video
uploads, comments, YouTube views and tweets.

Share market: [2]Stock exchanges generate huge amounts of data through daily transactions.
Nowadays stock markets play an important role in big data.

E-commerce sites: E-commerce sites such as Flipkart, Amazon, Snapdeal and Myntra generate
huge amounts of data these days.

1.2. BIG DATA ANALYTICS TO BANK ON YOUR BIGGEST ASSET -
INFORMATION

[4]When a customer walks into a bank for the first time, he or she brings in a lot of potential:
the potential of becoming a loyal customer, of making good investments, of having only a short-term
relationship, or even the potential of fraud. Banks have to handle millions of such potential customers
every day. They have to retain long-standing customers, bring in new ones, and apprehend and
prevent fraud. For all this, they need data, lots of it. There is no scarcity of data in the banking sector;
in fact, it is the industry which deals most heavily with the 3 V's of big data: volume, velocity and,
of course, variety. The challenge lies in applying the right analytics to the data received in order to
dig out useful information and meet business challenges. Various departments work in silos,
gathering data throughout the day; how can this diverse data be homogenized and used for taking
important business decisions? This is where banks need to bank on their biggest asset, Big Data, and
exploit it by applying the right analytics.

Big Data Analytics in Banking: Introduction

According to Microsoft and Celent, "How Big is Big Data: [5]Big Data Usage and Attitudes
among North American Financial Services Firms", 90% of banks thought that successful big data
initiatives will define the winners in the future. In this scenario one might assume that banks are
making the most of big data to deliver the best to their customers. Instead, according to a survey by
Capgemini, only 37% of customers think that their banks understand their needs. Why? Because of
the lack of application of the right analytical tools in the banking sector.

The banking industry has been embracing the digitization trend for quite some time now. With
bi-modal architectures, such as legacy modules working alongside mobile apps, banks are making
the most of digitization. But the industry is yet to explore the full potential of big data analytics. A
handful of banks like Bank of America, Deutsche Bank and Citibank have delved into big data for
fraud detection and customized service offerings. The rest are yet to catch on to the trend.

1.3. Benefits of Stream Processing Platform

The right stream processing platform allows a harmonious blending of data integration with
stream processing technologies for better interoperability. The high-quality data output provides an
efficient data flow, enabling real-time analytics with predictive and prescriptive capabilities that lend
businesses agility and flexibility. Banks are heavily dependent on real-time streaming data on a daily
basis. They need to keep track of the transactions made by a person, with date, time and geo-location,
so that they can easily detect fraud if the customer's location and the transaction place do not match.
Banks need real-time location information when a customer searches for a nearby branch or ATM.
Real-time streaming data also helps them materialize personal offers when a customer is in or around
a specific store or venue.
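
As a rough illustration of the location check described above, the sketch below (in Python)
compares each incoming transaction against the customer's previous transaction and flags it if
reaching the new location would require an implausible travel speed. The record layout, the
flag_suspicious helper and the 500 km/h threshold are illustrative assumptions, not part of any
particular streaming product.

    from dataclasses import dataclass
    from math import radians, sin, cos, asin, sqrt

    @dataclass
    class Transaction:
        customer_id: str
        timestamp: float   # seconds since epoch
        lat: float
        lon: float
        amount: float

    def km_between(lat1, lon1, lat2, lon2):
        # Great-circle (haversine) distance in kilometres.
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * 6371 * asin(sqrt(a))

    last_seen = {}  # customer_id -> most recent Transaction

    def flag_suspicious(txn, max_speed_kmh=500):
        # Flag the transaction if the implied travel speed from the previous
        # transaction exceeds max_speed_kmh (an arbitrary illustrative threshold).
        prev = last_seen.get(txn.customer_id)
        last_seen[txn.customer_id] = txn
        if prev is None:
            return False
        hours = max((txn.timestamp - prev.timestamp) / 3600, 1e-6)
        return km_between(prev.lat, prev.lon, txn.lat, txn.lon) / hours > max_speed_kmh

In a real deployment this check would run inside the streaming engine itself, with the last-seen
state kept in a distributed store rather than an in-process dictionary.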

Introducing Propel Stream

[6]Propel Stream is a real-time streaming analytics solution built to create and capture value
from disparate sources of data. Propel Stream collects real-time data from all the available sources,
such as routers and switches, banks, internet apps and a variety of social channels like Facebook and
Twitter. Predictive messages are then sent to receivers via channels such as mobile, file systems and
fraud detection pages.

1.4. Propel Stream Solving Challenges in the Banking Sector

1. Homogenizing high volume, velocity and variety of data from departments working in
silos
2. Data security
3. Customer data analytics
4. Fraud detection
5. Risk management
6. Personalized offers
7. Customer sentiment analysis
8. Customer experience analysis
9. Keeping track of regulatory compliances

1.4.1. Impact of Big Data on Banking Institutions and major areas of work

Banking industry experts describe big data as the set of tools which enables an organization
to create, manipulate and manage very large data sets in a given time frame, together with the storage
capacity required to support that volume of data, which is characterized by variety, volume and
velocity.

Below we look at the major areas where big data is being used by financial institutions[7],
which are strengthening their enterprise risk management frameworks to help improve enterprise
transparency, auditability and executive oversight of risk.

1.4.2. Risk management:

[8]The following are ways in which data analytics is being used to discover and evaluate
financial crime management (FCM) policy rules, through early detection of the relationship between
financial crime and the characteristics of a transaction or series of transactions:

 MIS / regular reporting
 Disclosure reporting
 Real-time keyword and discussion tracking
 Anti-tax-evasion monitoring

1.4.3. Transactions:

Transactions, when tracked over a period of time, tend to reveal a great deal of information
about the nature of the transaction, log analysis, trading decisions and other aspects. Banks and other
financial institutions use big data under this heading in the following ways.

1.4.4. Analysis and Inferences

Feedback Analysis

It is an important process for any organization to understand which areas need development
on a consistent basis. ABC Bank started to collect feedback from its customers, both those who have
visited bank branches and those who use its online services.

Data Collection and Sample Size

[9]The data was collected over a span of 2 years and 11 months. Customers who came to the
bank were asked, anonymously, to rate ABC Bank on a scale of 0 to 5 across a set of parameters.

The analysis below is performed on the relevant subset of the total data, comprising feedback
collected from 23,000 clients.

When we plot the data, there are some interesting findings.

Feedback Analysis and Inference

Ratings in the years 2011 and 2012 fluctuate between high, stable and low. Quality of service,
speed of service and the handling of queries carry the same weightage.

Inference: customers rate the bank's services as average; the bank did not take any corrective
measures during this time to improve its ratings.

1.5. BIG DATA IN BANKING APPLICATIONS

Fraud detection: It helps banks detect and prevent internal and external fraud, as well as
reduce the associated costs. Big data is mainly used to find mistakes and limitations in the banking
sector; by using big data we can easily find frauds in transactions.

Risk management: Banks analyse transaction data to determine risks and exposures by
simulating customer behaviour and customer movements.

Contact centre efficiency optimization: It helps banks resolve customers' problems speedily
by allowing banks to anticipate customers' needs ahead of time.

Customer segmentation for optimized offers: It provides a way to understand customer needs
at a granular level, so that banks can deliver targeted offers more successfully.

Customer analysis: It is very helpful to banks; it helps banks retain their clients by analysing
their customers' behaviour and identifying outliers.

Examining customer feedback: Customer sentiments can be gathered in text form from
various social media sites. Once these sentiments are gathered, they can be classified as positive or
negative, and by applying various filters they can be used to improve the services given to customers.

Detecting when a customer is about to leave: As we know, the cost of acquiring new customers
is greater than that of retaining existing ones. When the bank addresses a customer's needs by
understanding the issue, attention must be given to finding a solution.

However, as explained in the article on Big Data visualization principles, the 3 V's are useless
if they do not lead to the fourth one: value. For banks, this means they can apply the results of big
data analysis in real time and make business decisions accordingly. This can be applied to the
following activities:

 Discovering the spending patterns of the customers

 Identifying the main channels of transactions (ATM withdrawal, credit/debit card payments)

 Splitting the customers into segments according to their profiles

 Product cross-selling based on the customers’ segmentation

 Fraud management & prevention

 Risk assessment, compliance & reporting

 Customer feedback analysis and application

Below we elaborate on the examples of using Big Data in these fields of the banking industry.
Customer spending patterns

The banks have direct access to a wealth of historical data regarding customer spending
patterns. They know how much money you were paid as a salary in any given month, how much
went to your savings account, how much went to your utility providers, and so on. This provides a
rich basis for further analysis. Applying filters such as festive seasons and macroeconomic conditions,
bank employees can understand whether the customer's salary is growing steadily and whether the
spending remains adequate. This is one of the cornerstone factors for risk assessment, loan screening,
mortgage evaluation and cross-selling of multiple financial products such as insurance.
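
A minimal sketch of this kind of spending-pattern summary using pandas; the column names
and figures are illustrative assumptions rather than a real bank schema.

    import pandas as pd

    # Assumed schema: one row per account transaction with a signed amount
    # (positive = credit such as salary, negative = debit) and a category label.
    txns = pd.DataFrame({
        "customer_id": ["C1", "C1", "C1", "C2"],
        "date": pd.to_datetime(["2019-01-31", "2019-01-05", "2019-02-28", "2019-01-10"]),
        "category": ["salary", "utilities", "salary", "salary"],
        "amount": [50000.0, -2300.0, 52000.0, 41000.0],
    })

    monthly = (
        txns.assign(month=txns["date"].dt.to_period("M"))
            .groupby(["customer_id", "month", "category"])["amount"]
            .sum()
            .unstack(fill_value=0.0)
    )
    print(monthly)  # per-customer, per-month totals by category (salary growth, spending, ...)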

Transaction channel identification

The banks benefit greatly from understanding whether their customers withdraw in cash the
whole sum available on payday, or whether they prefer to keep their money on the credit/debit card.
Obviously, the latter customers can be approached with offers to invest in short-term loans with high
payout rates, etc.

Customer segmentation and profiling

[10]Once the initial analysis of customer spending patterns and preferred transaction
channels is complete, the customer base can be segmented according to several appropriate profiles.
Easy spenders, cautious investors, rapid loan repayers, deadline rush returners… Knowing the
financial profiles of all customers helps the bank evaluate the expected spending and income next
month and make detailed plans to secure the bottom line and maximize income.
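
A minimal sketch of such segmentation using k-means clustering from scikit-learn; the three
features and their values are illustrative assumptions.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    # Assumed per-customer features: [average monthly spend, share of income saved,
    # number of loans repaid early]. Purely illustrative values.
    features = np.array([
        [45000, 0.05, 0],
        [12000, 0.40, 3],
        [30000, 0.10, 1],
        [9000, 0.55, 4],
    ])

    X = StandardScaler().fit_transform(features)          # put features on a common scale
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)   # cluster index per customer, e.g. "easy spenders" vs "cautious investors"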

Product cross-selling

Why not offer a better rate of return to cautious investors to stimulate them to spend more
actively? Is it worth providing a short-term loan to an easy spender who already struggles to repay a
debt? Precise analysis of the customers' financial backgrounds ensures the bank is able to cross-sell
auxiliary products more efficiently and better engage the customers with personalized offers.

Fraud management & prevention

Knowing the usual spending patterns of an individual helps raise a red flag if something
unusual happens. If a cautious investor who prefers to pay by card attempts to withdraw all the money
from his account via an ATM, this might mean the card was stolen and is being used by fraudsters.
A call from the bank requesting clearance for such an operation helps quickly establish whether it is
a legitimate action or fraudulent behavior the cardholder does not know of. Analyzing other types of
transactions in the same way helps greatly cut down the risk of fraudulent actions.
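
A toy sketch of this kind of red-flag rule using only the Python standard library; the history
values and the 3-standard-deviation threshold are illustrative assumptions.

    import statistics

    def is_outlier_withdrawal(history, amount, k=3.0):
        # Flag a withdrawal more than k standard deviations above the customer's
        # historical mean withdrawal (k = 3 is an arbitrary illustrative threshold).
        if len(history) < 5:              # too little history to judge
            return False
        mean = statistics.mean(history)
        stdev = statistics.pstdev(history) or 1.0
        return amount > mean + k * stdev

    # Example: a customer who usually withdraws small sums suddenly empties the account.
    past = [1200, 900, 1500, 1000, 1100, 1300]
    print(is_outlier_withdrawal(past, 95000))   # True -> trigger a confirmation call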

Risk assessment, compliance & reporting

A similar procedure can be used for risk assessment while trading stocks or screening a
candidate for a loan. Understanding the spending patterns and previous credit history of a customer
can help rapidly assess the risks of issuing a loan. Big Data algorithms can also help deal with
compliance, audit and reporting issues in order to streamline the operations and remove the
managerial overhead.

Customer feedback analysis and application

[10]Customers can leave feedback after dealing with the customer support center or through
a feedback form, but they are much more likely to share their opinion through social media. Big Data
tools can sift through this public data and gather all mentions of the bank's brand so that the bank can
respond rapidly and adequately. When customers see that the bank hears and values their opinion and
makes the improvements they demand, their loyalty and brand advocacy grow greatly.

Final thoughts on using Big Data in the banking sector

Doing things the old way is too risky nowadays. Companies must evolve and grasp new
technologies if they want to succeed. Adopting big data analytics and embedding it into existing
banking workflows is one of the key elements of surviving and prevailing in the rapidly evolving
business environment of the digital millennium.

We are all used to perceiving banks as huge buildings with cool marble halls where clerks
work with the customers. In the last 10 years, the banks have invested heavily in modernizing their
offers and providing mobile access to their services. In the next 5 years, they will have to learn to
empower their operations with Big Data analytics, AI/ML algorithms, and other high-tech tools.

HADOOP

[11]Hadoop is an open-source framework with which we can analyze data more cheaply and
faster on a cluster of commodity hardware. [7]It provides massive storage for any kind of data along
with enormous processing power. Hadoop is built around a distributed file system; it is not a database,
and it simply uses the file system provided by Linux to store data. Hadoop has five daemons:
NameNode, Secondary NameNode, DataNode, JobTracker and TaskTracker. Through these daemons
it operates on both structured and unstructured data. By using Hadoop in the banking sector we can
handle risk management and fraud detection.

HDFS (Hadoop Distributed File System): The Java-based, scalable file system that stores data
across multiple machines without prior organization.

MapReduce: A software programming model for processing large sets of data in parallel.
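
As a concrete illustration of the MapReduce model, below is a minimal Hadoop Streaming
style mapper and reducer written in Python, counting transactions per account; the tab-separated
record layout with the account id in the first field is an assumption. They would typically be run with
the hadoop-streaming jar shipped with the distribution.

    # mapper.py -- reads tab-separated transaction records from stdin and emits
    # "account_id <TAB> 1" for each record (the Hadoop Streaming convention).
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if fields and fields[0]:
            print(f"{fields[0]}\t1")

    # reducer.py -- input arrives sorted by key, so counts can be accumulated per account.
    import sys

    current_key, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key:
            if current_key is not None:
                print(f"{current_key}\t{count}")
            current_key, count = key, 0
        count += int(value)
    if current_key is not None:
        print(f"{current_key}\t{count}")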

Flexibility: We can store as much data as we want and decide how to use it later. That includes
unstructured data such as text, images and videos.

HDFS Architecture:

YARN child: a logical component under the resource manager which chops the data into
blocks of the standard size of 128 MB.

YARN scheduler: a logical component which schedules each block onto a particular data node
based on the resource manager's information.

The five daemons (NameNode, Secondary NameNode, DataNode, JobTracker and
TaskTracker) are described below.

Name Node: The NameNode is the central piece of HDFS. It keeps the directory tree of all
files in the file system. It does not persist the metadata of the blocks which exist on the data nodes.
The NameNode maintains two important files, the edits log (Edits.log) and the file-system image
(fs.image); Edits.log records metadata changes, and fs.image captures a checkpoint of the Edits.log
information.

Secondary Name Node: The Secondary NameNode is also called the checkpoint node and is
a passive node. At frequent, regular intervals the metadata is transferred from the NameNode to the
checkpoint node.

Data Node: The DataNode is responsible for storing the actual data in the form of blocks in
HDFS. The DataNode communicates with the NameNode constantly, every 3 seconds; this message
is called a heartbeat and it reports the health of the DataNode.

Cluster Management Calculation:

To calculate the number of DataNodes for a given configuration, assume each DataNode has
64 GB of RAM and 6 disks of 12 TB each, i.e. 72 TB of storage per node. If the 2 PB of raw data is
expected to occupy only 30% of the total required capacity, the total capacity is 2 * 100/30 ≈ 6.7 PB;
adding about 0.2 PB of headroom gives roughly 7 PB. Converting, 7 PB * 1024 = 7168 TB, and
7168 / 72 ≈ 100 DataNodes.
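
A small sketch of the sizing arithmetic above in Python. The 30% figure is read as "the 2 PB
of raw data is expected to be 30% of the total required capacity"; that reading, and the 0.2 PB of
headroom, simply follow the numbers in the text.

    def datanodes_needed(raw_pb=2.0, raw_share=0.30, headroom_pb=0.2, disks=6, disk_tb=12):
        total_pb = raw_pb / raw_share + headroom_pb   # ~6.7 + 0.2, rounded to ~7 PB
        total_tb = round(total_pb) * 1024             # 7 PB = 7168 TB
        per_node_tb = disks * disk_tb                 # 6 disks x 12 TB = 72 TB per DataNode
        return round(total_tb / per_node_tb)          # ~100 DataNodes

    print(datanodes_needed())   # -> 100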

Process of a Hadoop Cluster:

1. The node manager of each DataNode sends its information to the resource manager, and
this is in turn reflected in the Edits.log.

2. [8]The Hadoop cluster (health) information includes the total size, total directories and total
files, block information such as total, over-replicated, under-replicated and mis-replicated blocks, the
number of DataNodes, and the number of racks used in the cluster, i.e. the HDFS architecture
information.

3. Let us consider a file named bbc.txt with a size of 1 GB; it is split into 8 blocks of 128 MB,
and with replication there are 24 blocks overall.

4. To write the file into HDFS we run hadoop fs -put "bbc.txt" "user/cloudera" from an edge
node. Once the file request reaches the NameNode, the resource manager hands it over to the YARN
child.

5. The YARN child chops the data into blocks of 128 MB and sends them to the data queue in
a FIFO process. From the data queue the blocks are sent to the acknowledgement queue by the data
streamer (mapper).

6. In the acknowledgement queue the blocks are held. The acknowledgement queue is entirely
controlled by the YARN scheduler. The first (original) block is written onto a particular DataNode
under the guidance of the YARN scheduler. Once the original block is written successfully, its
metadata is updated in the Edits.log.

7. The first replica block is then written onto an adjacent DataNode with the help of native
C++ libraries; once the block is written successfully, an acknowledgement is required. The third
(second replica) block is likewise written onto another adjacent DataNode. Once all replicated blocks
have been written successfully, the node manager contacts the resource manager and the block
metadata is updated.
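
The same "put" step can also be scripted. Below is a minimal sketch using the third-party
hdfs (WebHDFS) Python client; the NameNode URL, the port (9870 is the Hadoop 3 default, older
clusters use 50070), the user name and the paths are all illustrative assumptions.

    from hdfs import InsecureClient

    # Connect to the NameNode's WebHDFS endpoint (URL and user are assumptions).
    client = InsecureClient("http://namenode-host:9870", user="cloudera")

    # Upload the local file; HDFS itself splits it into 128 MB blocks and replicates them.
    client.upload("/user/cloudera/bbc.txt", "bbc.txt")

    # Inspect the stored file: length, replication factor and block size.
    print(client.status("/user/cloudera/bbc.txt"))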

Failover Chances of the Blocks:

Over-replicated block:

If a replica block has been written successfully but the DataNode then fails, there is a chance
that the replica will be written again, leaving the block over-replicated.

Under-replicated block:

If the DataNode fails while a block is being written (for example at 99% completion), the node
manager sends this information in an acknowledgement and the block is removed from the
acknowledgement queue, leaving the block under-replicated.

Benefits of Hadoop:

Computing power:

Its distributed computing model processes big data quickly. The more computing nodes we
use, the more processing power we have.

Low cost:

It is an open-source framework, it is free, and it uses commodity hardware to store large
quantities of data.

Scalability:

We can easily grow our system simply by adding more nodes.

High Scale Computing Platform for Big Data Analytics:

Advantages of big data for banks: using big data and technology, banks may be able to reap
some of the following benefits:

1. [12]Find out the root cause of issues and failures

2. Determine the most efficient channel for particular customers

3. Identify the most important and valuable customers

4. Prevent fraudulent behaviour

5. Identify risks and perform risk analysis

6. Customised products and customised marketing communication

7. Optimise human resources

8. Customer retention

CHAPTER 2
LITERATURE REVIEW

[3]Banks are institutions within the financial industry and regional development, concerned
with activities such as the management of deposits and investments in capital markets, among others.
The banking system is vital for the economy, and it is a subject of great interest for researchers in a
wide range of different areas, such as management science, marketing, finance and information
technology. Berger (2003) found evidence of an association between technological progress and
efficiency in banking.

The same author also points out that banks use applied statistical models built on their
financial data for different purposes, such as credit scoring and risk analysis. Financial sector reforms
allowed a rise in competition, turning bank lending into an essential source of financing. Credit risk
analysis is in itself a huge area, encompassing a large variety of research publications within banking
spread out over the last twelve years. [9]Another banking-related subject where research has been
active is fraud prevention and detection, both in traditional banking services and in the new
communication channels that support e-banking services, of which email spamming used to illegally
obtain personal financial data is one particular case of interest. As a result of advances in information
technology, essentially all banking activities and processes are automated, generating huge amounts
of information. Thus, all of the subjects mentioned above are likely to benefit from data mining and
machine learning solutions.

In the review "Penetrating the Fog: Analytics in Learning and Education," EDUCAUSE
Review, 46(5), 30–32 (2011), Siemens, G. and Long, P. define big data as datasets whose size is
beyond the ability of typical database software tools to capture, store, manage and analyze.

In the report "TDWI Best Practices Report: [15]Big Data Analytics (Best Practices)" (pp.
1–35, The Data Warehousing Institute (TDWI), 2011), Russom, P. writes that for data to be classified
as big data it must possess the three V's: Volume, Variety and Velocity. Many people assume that
big data simply has volume, but Russom clarifies that the other two V's are just as essential. Big data
is not just large; it is varied. It comes in many formats and can be organized in a structured or
non-structured way. Velocity refers to the speed at which it is generated. One of the reasons we build
larger and larger stores of data is that we can generate it much more quickly. Russom also explains
that volume does not have to refer to terabytes or petabytes. He suggests that other ways to measure
the volume of data could be the number of files, records, transactions, etc.

[16]In the glossary of Gartner, Inc., big data is defined as high-volume, high-velocity and
high-variety information assets that demand cost-effective, innovative forms of information
processing for enhanced insight and decision making. In the conclusion of "Demystifying Big Data:
A Practical Guide to Transforming the Business of Government" (2012), the TechAmerica
Foundation's Federal Big Data Commission states that big data is a term that describes large volumes
of high-velocity, complex and variable data that require advanced techniques and technologies to
enable the capture, storage, distribution, management and analysis of the information.

Usage of big data analytics in banking:

 12% of banks are in the process of using big data.
 25% of banks are expanding by implementing it.
 38% of banks are exploring options.
 25% of banks are experimenting with further uses of it.

Sources for banks:

 Credit card history: to track the most used retailers.
 Transactions: to identify loyal customers.
 Branch visits: to compare e-banking and traditional banking.
 Web and social media interactions: for efficient marketing of plans and schemes.

Key areas for usage of big data:

 To increase personalization and convenience.
 Expedite credit card risk checks.
 Faster credit and loan applications.
 Understand preferences on how to interact.
 Other individualized services.

Technology Behind Big Data

Clustering: It is the automation of finding correlated and meaningful data patterns within a
set of data. It is useful for identifying important data amongst the noise of possibly hundreds of
patterns; it breaks the data into simple parts (segmentation).

Text Analytics: Text analytics relies on probability [2]theory and on the rarity and occurrence
of certain words, which are used to predict the meaning and overall idea of a text. It thus assists in
automatic reading and compilation to provide a summary from possibly thousands of documents. It
is a classification approach with a clearly defined target field.

Neural Networks: In this algorithm nodes are activated by a signal and in turn activate other
nodes; a transfer function then outputs a signal based on the total signal received. Neural networks
assign the data to a predefined target field and are useful for answering questions about whether
event A leads to action B or action C.
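
To make the text analytics idea concrete, here is a minimal sketch of a word-frequency based
sentiment classifier using scikit-learn's naive Bayes; the four training sentences and their labels are
made-up illustrative data.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Tiny illustrative training set of customer feedback labelled by sentiment.
    texts = ["great service, quick loan approval",
             "app keeps crashing, very slow support",
             "helpful staff at the branch",
             "hidden charges, terrible experience"]
    labels = ["positive", "negative", "positive", "negative"]

    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(texts, labels)
    print(model.predict(["slow support and charges"]))   # -> ['negative']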

Link Analysis: It is based on a branch of mathematics called graph theory and represents
relationships between objects. Link analysis covers both directed and undirected data mining. It is
useful for identifying key sources of information on the web by analysing links, for finding influential
customers from call patterns, for recruiting new subscribers, and so on.

Survival Analysis: It is also called time-to-event analysis. It is a technique used to evaluate
when you should start worrying about an event. Survival analysis answers questions such as: when
is the customer likely to leave; which factors are likely to increase or decrease customer tenure; what
are the effects of various factors; and in what time period is the customer likely to move to a new
customer segment. Survival analysis is calculated using survival curves and hazard probabilities.
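
A minimal sketch of such a time-to-event analysis, assuming the third-party lifelines package
is available; the tenure figures and churn flags are made-up illustrative data.

    from lifelines import KaplanMeierFitter

    # Assumed data: months each customer has stayed with the bank, and whether they
    # have already left (1) or are still customers (0, i.e. censored observations).
    tenure_months = [3, 14, 27, 6, 40, 22, 9, 31]
    left_bank     = [1, 1, 0, 1, 0, 1, 1, 0]

    kmf = KaplanMeierFitter()
    kmf.fit(tenure_months, event_observed=left_bank)
    print(kmf.median_survival_time_)        # typical tenure before a customer leaves
    print(kmf.survival_function_.head())    # probability of still being a customer over time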

Decision Trees: They are among the most powerful data mining techniques, capable of
handling a diverse array of problems and any data type. Decision trees split the data into small data
cells, aiming at decreasing the overall entropy of the data.

Random Trees: Random trees (random forests) combine many decision trees in order to
average out the possible errors and noise of an individual decision tree.

CHAPTER 3
METHODOLOGY

[13]In terms of methodology, big data analytics differs significantly from the traditional
statistical approach of experimental design. Analytics starts with data. Normally we model the data
in a way that explains a response. The objective of this approach is to predict the response behavior
or to understand how the input variables relate to a response. In statistical experimental design, an
experiment is developed and data is retrieved as a result. This allows data to be generated in a way
that can be used by a statistical model, where certain assumptions hold, such as independence,
normality and randomization.

In big data analytics, we are presented with the data. We cannot design an experiment that
fulfills our favorite statistical model. In large-scale applications of analytics, a large amount of
work (normally 80% of the effort) is needed just for cleaning the data, so it can be used by a
machine learning model.

We do not have a unique methodology to follow in real large-scale applications. Normally,
once the business problem is defined, a research stage is needed to design the methodology to be
used. However, some general guidelines are relevant and apply to almost all problems.

One of the most important tasks in big data analytics is statistical modeling, meaning
supervised and unsupervised classification or regression problems. Once the data is cleaned,
pre-processed and available for modeling, care should be taken to evaluate different models with
reasonable loss metrics; then, once a model is implemented, further evaluation and results should be
reported. A common pitfall in predictive modeling is to just implement the model and never measure
its performance.
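
A minimal sketch of the evaluation step this paragraph warns about, using scikit-learn; the
synthetic dataset stands in for cleaned customer features and a default label.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    # Stand-in data; in practice X would be cleaned customer/transaction features
    # and y a label such as "defaulted on loan".
    X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"held-out AUC: {auc:.3f}")   # report this, not just training performance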

Why You Need A Methodology For Your Big Data Research

An effective big data research methodology will help address some common planning
challenges for business, notably aligning investment priorities with strategy. A research methodology
can help big data managers collect better and more intelligent information.

Businesses that utilize big data and analytics well, particularly with the aid of research
methodology, find their profitability and productivity rates are five to six percent higher than their
competition.

Businesses may view that substantial increase in effectiveness and immediately seek to
expand big data management, but without a proper research methodology, [8]the time and
monetary investment required for successful big data management may not pay off. Many
companies that fail to get the most out of their big data falter because they lack a plan for how big
data, analytics and any relevant tools interact.

A prudent business should involve data scientists, tech professionals, managers and senior
executives when establishing a big data methodology, with these roles combining their expertise to
create an all-encompassing plan. Project initiation and team selection are critical parts of a successful
research methodology because they highlight the decisions a business must make and how those
impact end goals for faster growth or greater profit margins.

Assembling and Integrating Data

Big data can be the lifeblood of strategic decisions that can influence whether a company
will profit or experience losses. Especially in today’s digital age, many businesses are drowning
in a large quantity of data, struggling to identify relevancy. The amount of data is especially
overwhelming today due to the influx of social media platforms, which provide insight into
customer data and behavior that is technically outside of the company.

Assembling data and knowing which data to prioritize is a big aspect of establishing a
methodology and may point to a need for further investments in new data capabilities. Short-term
options include outsourcing issues to data specialists, though this can be costly and can feel too
hands-off for some businesses. Internally, a company can strive to consolidate analytical reports
by separating transactions from other data. They can also attempt to implement data-governance
standards to avoid mishaps regarding accuracy and general compliance.

Utilizing Analytic Models and Tools

While the integration of data is vital when establishing a methodology, that integration
won’t have much value if advanced analytic models are not in place to help optimize results and
predict outcomes based on that data. A methodology needs to identify how models create business
value, such as how data regarding customer buying histories can influence what types of discounts
they receive via email.

Additionally, the methodology should utilize analytic models to help solve optimization
issues regarding data storage in general. Models that can separate superfluous information from
meaningful data that can impact a business’ bottom line can have a massive impact on productivity
and results. Tools that help integrate data into daily processes and business actions can provide an
easily comprehensible interface for many functions, from employee schedules to decision making
on which types of discounts to offer.

Industries will vary regarding their core areas of focus. For example, a transportation
company will rely more on GPS and weather data than a static storefront, while a hospital will
require data on drug efficacy. Regardless, aspects of importance should be at the forefront when
analyzing big data, particularly how they interact with productivity on a daily and long-term basis.

Methodology Planning Challenges

An effective big data research methodology will help address some common planning
challenges for business, notably aligning investment priorities with strategy, focusing on frontline
engagement and balancing speed with cost.

Frontline engagement and general efficiency can increase if a methodology is capable of
detecting unusual data segments, helping alert researchers to areas that require manual analysis
alongside pre-existing machine learning and automated transactional data. A big data research
methodology should be ready to identify abnormalities, with a plan in place for how to address them.

Additionally, a methodology that disregards responsible big data research can run into
legal and ethical issues down the line regarding data sharing and the use of data about people,
especially in social network maps. As a result, the methodology should regard efficiency and
productivity, while also being mindful of ethics. An ethical methodology for big data research that

assembles and integrates data into an organized system with relevant analytics tools can provide
businesses with an increase in productivity and profitability.

Areas a big data methodology should address include selecting ideal analytic tools and
models, identifying which internal and external data to integrate and developing an organizational
structure to accommodate this data flow with goals in mind.

Research Objective

A Study to analyze the effectiveness of using Big Data Analytics in Banking Industry.

Research Design

The research design for the study is based on a descriptive research model, in which the
analysis has been done on the basis of data collected through primary research as well as research of
relevant published secondary data.

Data Capturing Instrument

The data was captured using questionnaires. Each questionnaire comprised 10 questions.

The data that was sought pertains to the challenges faced by the banks in processing big
data and adopting strategies to use the same in a meaningful manner.

The questionnaires were personally administered as well as sent and collected through e-
mail.

Sampling

The primary data has been collected through questionnaires administered to the banking
professionals who are aware of the technologies being used in the banking process and using the
same in day-to-day operations and transactions.

The sample size for primary data collection was 100. The respondents were selected through
a simple random sampling process.
[2]The secondary data has been collected from a diversified pool of resources such as
newspapers, journals and research articles published by consulting companies like McKinsey, E&Y,
PwC, KPMG and CII. The secondary data helps to validate the inferences drawn from the primary
research.

Data Analysis

The data has been analyzed using statistical tools like graphical analysis and percentage
analysis.
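
A minimal sketch of the percentage analysis mentioned above, using pandas; the column
names and answers are made-up stand-ins for the actual questionnaire responses.

    import pandas as pd

    # Assumed shape of the primary data: one row per respondent, one column per
    # question, with answers on a 5-point agreement scale. Values are illustrative.
    responses = pd.DataFrame({
        "Q1_big_data_useful": ["Agree", "Strongly agree", "Neutral", "Agree"],
        "Q2_faces_skill_gap": ["Strongly agree", "Agree", "Agree", "Disagree"],
    })

    for question in responses.columns:
        pct = responses[question].value_counts(normalize=True).mul(100).round(1)
        print(question)
        print(pct.to_string(), "\n")    # percentage of respondents per answer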

HADOOP FOR FINANCIAL SERVICES

The key challenges with current architectures in ingesting and processing the above kinds of
data are as follows.

[14]A high degree of data is duplicated from system to system, leading to multiple
inconsistencies at the summary as well as the transaction level. Because different groups perform
different risk reporting functions (e.g. credit and market risk), the feeds, the ingestion and the
calculators end up being duplicated as well.

Traditional banking algorithms cannot scale with this explosion of data, nor with the
heterogeneity inherent in reporting across areas such as risk management. For example, certain kinds
of credit risk calculations need access to around 200 days of historical data to look at the probability
of the counterparty defaulting and to obtain a statistical measure of the same; such calculations are
highly computationally intensive. Open-source software offerings have matured immensely, with
compelling functionality in terms of data processing, deployment scalability, much lower cost and
support for enterprise data governance. With Hadoop, the catalyst for this disruption is predictive
analytics, which provides both real-time and deeper insight across a myriad of scenarios:

1. Predicting customer behaviour in real time,

2. Creating models of customer personas (micro and macro) to track their journey across a
Bank’s financial product offerings,

3. Defining 360-degree views of a customer so as to market to them as one entity,

4. Fraud detection

5. Risk Data Aggregation (e.g Volcker Rule)

6. Compliance etc.

CHAPTER 4
SUMMARY AND CONCLUSION

Thanks to Big Data, the industry landscape is transforming. But like we said earlier, data
is of no use unless there’s someone to decipher it and unravel the hidden patterns within.
Businesses need insights from Big Data, and this is precisely why they’re always on the lookout
for skilled professionals in the field – individuals who can unlock the secrets that Big Data holds.

Big Data technologies like Hadoop and Spark are the buzzwords now. So, make sure you
learn how to work with related tools such as Hive, HBase, MapReduce, Spark RDDs, Spark
Streaming, Spark SQL, SparkR, MLlib, Flume, Sqoop, Oozie, Kafka, DataFrames, and GraphX, to
name a few.

Rest assured, if you train yourself to acquire the right skills, you will grow to become a
vital asset to any organization invested in Big Data. You will grow as the company grows.

REFERENCES

[1] U. Srivastava and S. Gopalkrishnan, “Impact of Big Data Analytics on Banking Sector:
Learning for Indian Banks,” Procedia Comput. Sci., vol. 50, pp. 643–652, 2016.

[2] P. M. Manish, S. Kasale, and D. Simon, Banking & Big Data Analytics.

[3] B. Nk, “Big data in banking,” 2020.

[4] E. Lai, “Big Data Analytics to Bank on your Biggest Asset-Information,” no. 08, pp. 1–39,
2011.

[5] “Big Data in Financial Services and Banking: Architect’s Guide and Reference
Architecture,” February 2015.

[6] Datameer, “3 Top Big Data Use Cases in Financial Services How Financial Services
Companies are Gaining Momentum in Big Data Analytics and Getting Results.”

[7] U. Srivastava and S. Gopalkrishnan, “Impact of big data analytics on banking sector:
Learning for Indian Banks,” Procedia Comput. Sci., vol. 50, no. December 2015, pp. 643–652,
2015.

[8] N. Shah and S. Way, “BIG DATA FOR BANKING.”

[9] H. Hassani, X. Huang, and E. Silva, “Digitalisation and Big Data Mining in Banking,” Big
Data Cogn. Comput., vol. 2, no. 3, p. 18, 2018.

[10] A. Munar, E. Chiner, and I. Sales, “A Big Data Financial Information Management
Architecture for Global Banking,” 2014.

[11] M. Amakobe, “The Impact of Big Data Analytics on the Banking Industry,” pp. 1–12,
July 2015.

[12] A. Chandani, M. Mehta, B. Neeraja, and O. Prakash, “Banking on Big Data: a Case Study,”
vol. 10, no. 5, pp. 2066–2069, 2015.

[13] P. S. Kumar and S. P. Jayanna, “A Survey of Big Data Analytics in Banking and Health
Care today,” vol. 3, no. 12, pp. 632–638, 2016.

[14] S. Hafiz Oluwasola, “How to Improve Efficiency of Banking System with Big Data (A
Case Study of Nigeria Banks),” Int. J. Sci. Res., vol. 6, no. 6, pp. 2015–2018, 2017.

[15] S. HK, “Big Data & Analytics: Tackling Business Challenges in Banking Industry,” Bus.
Econ. J., vol. 08, no. 02, 2017.

[16] Paromita, “Big Data Analytics In Banking Industry.” 2018.
