
Data Science

Simple and Effective Tips and Tricks to Learn Data Science
© Copyright 2020 - All rights reserved.
The contents of this book may not be reproduced, duplicated or transmitted without direct written
permission from the author.
Under no circumstances will any legal responsibility or blame be held against the publisher for any
reparation, damages, or monetary loss due to the information herein, either directly or indirectly.
Legal Notice:
This book is copyright protected and is intended for personal use only. You cannot amend, distribute, sell, use, quote, or paraphrase any part of the content within this book without the consent of the author.
Disclaimer Notice:
Please note the information contained within this document is for educational and entertainment
purposes only. Every attempt has been made to provide accurate, up-to-date, reliable, and complete information. No warranties of any kind are expressed or implied. Readers acknowledge that the
author is not engaging in the rendering of legal, financial, medical or professional advice. The
content of this book has been derived from various sources. Please consult a licensed professional
before attempting any techniques outlined in this book.
By reading this document, the reader agrees that under no circumstances is the author responsible for
any losses, direct or indirect, which are incurred as a result of the use of information contained within
this document, including, but not limited to, errors, omissions, or inaccuracies.
Table of Contents
Introduction
Chapter One: History of Data Science
Chapter Two: Data Dominance and Data Revolution – The Value of
Data
Chapter Three: Why Data Science? Advantages and Disadvantages of
Data Science
Advantages of Data Science
Data Science and the Disadvantages
Categories of Data
How to Approach Data for Analysis
The Ecosystem of Big Data
Frameworks for Data Science
NoSQL Databases
Scheduling Tools
Chapter Four: Predictive Analytics
Descriptive Analysis
Diagnostic Analysis
Predictive Analysis
Prescriptive Analysis
Advantages of Predictive Analysis
Predictive Analysis and its Disadvantages
Chapter Five: Prerequisites for Data Science
Statistics
Descriptive Statistics
Inferential Statistics
Central Limit Theorem
Hypothesis Testing
ANOVA
Skills a Data Scientist Must Develop
Chapter Six: Tools Required for Data Science
R Programming Language
Python Programming Language
Structured Query Language
Hadoop
SAS
Apache Spark
D3.js
Chapter Seven: Data Science and Artificial Intelligence
What is Artificial Intelligence?
Chapter Eight: Data Science and Machine Learning
Knowing Machine Learning
Artificial Intelligence and Machine Learning Go Hand in Hand
Chapter Nine: Hadoop
Hadoop in Data Science
Parts of Hadoop
MapReduce in Hadoop – What Does It Mean?
MapReduce – Working
Is it all Good? What are the Challenges of Using Hadoop?
Common examples of commands and examples that you must know before
coding in Hadoop
Creating the First Program in Hadoop
Chapter Ten: R Programming Language
The Best Ways to Learn R
Never Stop Learning R
Chapter Eleven: Building Simple Programs in R
Example 1
Example 2
Example 3
Example 4
Example 6
Chapter Twelve: Python for Data Science
Introduction to Python
The Importance of learning Python
Functioning of Python
Why Use Python?
Python in Web Development
How to Learn Python Efficiently
Advanced Data Science Techniques
Chapter Thirteen: Building Simple Python Programs
Chapter Fourteen: SQL for Data Science
Introduction
Master Data Science in 5 Steps
Examples of a Few SQL Queries in a Program
Analyzing Every Part of Data Science Puzzle
Chapter Fifteen: Math for Data Science
Naïve Bayes
Linear Regression
Logistic Regression
Neural Networks
Applying Math to Data Science Models
Chapter Sixteen: Data Science in an Organization
Communicating the Insights Captured through Data
Leveraging Cloud-Based Solutions
Chapter Seventeen: Creating Data Visualization to Communicate the
Insight
Chapter Nineteen: Applications of Data Science
Fraud And Risk Detection
Healthcare
Internet Search
Targeted Advertising
Website Recommendations
Advanced Image Recognition
Speech Recognition
Airline Route Planning
Gaming
Augmented Reality
Conclusion
References
Introduction
Today’s world is all about artificial intelligence and big data. As astonishing as it sounds, around 2.5 exabytes of data are created each day, which means the need for data has risen significantly, especially in the last decade. Many companies have even changed their business models and centered them on data. Data has also created a new department in the IT industry. Before data science, statisticians used to analyze the data; companies employed these statisticians to check and analyze the overall sales and performance of the company. With the emergence of powerful computing, cloud technology, and analytical tools, the field of computer science, combined with statistics, gave birth to Data Science.
Let us begin with understanding what data science is. In simple terms, data
science is the branch of study that involves obtaining useful and meaningful
insights and trends from a raw set of data or information. The large data sets obtained are then processed using programming, analytical, and business skills. Sounds tough? Do not worry. Most people
do not know how to start with data science or understand how to learn it
effectively.
This book deals with tips and tricks to learn data science effectively. With
the world turning to data for more and more decisions every day, it is important to know and learn about Data Science. The aim is to understand what data science is and to get an exact framework for how to approach the learning process, with tips and tricks to make your study effective.
The field of Data Science has its roots in statistics, but it is a combination of statistics, programming, and business acumen. Learning
about each of the topics is important and will give a complete idea of how
to approach the learning process, making it easy. The art of finding trends
and insights from the obtained large set of data goes a long way back.
The ancient Egyptians used census data to collect taxes efficiently, and they accurately forecasted the flooding of the river Nile each year. Learning from past data to derive insights that make sense for the business has been practiced ever since, because it eventually leads to smarter business decisions.
It is not a secret anymore that data scientists are in high demand. If you love
working with data, then this field is for you. If you learn data science, you
could grab an opportunity to work in this well-compensated field. Also, employees who are skilled in data science can make their work, and their company, more data-driven, and are therefore in demand across industries. Moreover, data science is driving revolutionary changes in technology, to the extent that we now have self-driving autonomous vehicles and image recognition tools, which have had a huge impact on the advancement of industry and academia.
Every data scientist is expected to have some skills, both technical and non-
technical so that he can excel in the field. These skills also make it easier
for the data scientist to identify patterns in the data set and help the
management gather more information from the data. The management can
use this information to assess how they should proceed further. It also helps
them understand the various changes they need to make to the existing
products and services to increase revenue.
With the expert tips and tricks in this book, you will learn how to derive insights efficiently from a data set and, more importantly, get a definite framework for becoming the kind of data scientist that companies are searching for.
So, if you are ready to learn more, let’s read on and get started.
Chapter One: History of Data Science
The term “Data Science” might have emerged only recently to mark a new
profession in the IT industry, but making insights from a large repository of
data goes a long way back. It has been mainly discussed by statisticians,
mathematicians, and computer engineers for years. In 1962, John W. Tukey described a field he called “data analysis,” which closely resembles modern-day data science. Later, in 1992, attendees at a statistics symposium acknowledged the emergence of a new discipline focused on data of numerous origins and forms, combining the established principles and concepts of statistics and data analysis with computing.
The term “Data Science” itself traces back to 1974, when the renowned Danish computer scientist Peter Naur suggested it as a possible alternative name for computer science. In 1996, the conference of the International Federation of Classification Societies became the first to feature data science as a distinct topic. In 1997, C. F. Jeff Wu proposed that statistics be renamed data science, as he thought a new name would help statistics differentiate itself from accounting, with which it was otherwise considered synonymous.
In 1998, Chikio Hayashi described data science as a prominent, interdisciplinary concept with three main aspects: data design, data collection, and data analysis.
The modern concept of data science as a separate discipline is attributed to William S. Cleveland. In 2002 and 2003, data science journals were launched by the Committee on Data for Science and Technology (CODATA) and Columbia University, respectively.
Chapter Two: Data Dominance and Data
Revolution – The Value of Data
Oil is surely one of the most valuable and important resources of human
society. Oil has dominated the human world for centuries as one of the most
important and valuable resources. The more you look back at history, the more you will see how controlling the sources and trade of oil equaled controlling the whole economy. It is no wonder that data is now considered the new oil. The phrase fits because we currently live in a ‘data economy,’ where data or information is the most important asset. Raw data is as valuable as oil because it holds the potential to be refined and used as a commodity.
With the help of refined data, companies can make crucial changes in the
initial stages before the costs are incurred instead of looking at the problems
retrospectively. This should serve as a guiding principle for everyone as it
will help you turn data into insight. If the input is transparent, the output
will be more credible and trustworthy. It is necessary to clean, review, and
verify data because you need to check that nothing is incorrect or missing
before the refinement process begins. It is necessary for the business leads
to check the opportunities provided by the refined data. With their
knowledge of organization, culture, and priorities, they can make the best
use of data by comparing it against the current environment to make the
company's best possible decision.
With around 97% of businesses using data to make business decisions and
power their opportunities, it is necessary to understand that data and data
science should keep up with our ever-changing world and needs. Being careful with your data is a crucial step in turning a business profitable these days. But have you considered what makes data so valuable? Data these days is largely consumer information and records, so, for any business, data can give the best insight into your current and potential customers.
You might have noticed yourself these days that when you casually browse
Facebook or Google, ads are shown based on your habits and buying
patterns. Every piece of customer data, from the website you clicked on to the state you reside in to how frequently you check in at your favorite restaurant, reveals what would appeal to a certain audience, and hence what you can sell to them. Personalizing campaigns for your most loyal customers is also possible with the assistance of data.
Data is the fastest way to identify the target of your business strategy. If a sales strategy ever fails, the chances are that you did not have enough data to research before launching the campaign. Much of the money marketers spend these days goes into figuring out what their customers really like.
With all this in mind, it is worth noting that 70% of business decision-makers forecast increasing the resources used to store data. Thus, many
businesses have started keeping their data clean and organized like never
before.
Chapter Three: Why Data Science? Advantages
and Disadvantages of Data Science
The past few years have seen the amount of data grow exponentially. With
growing data, it becomes important to learn how to find useful insights in the available data. Doing so has become an essential part of understanding the functioning of various industries. With the assistance of Data Science, companies gain the ability to reach their respective clients in a refined and enhanced manner.
Advantages of Data Science
Data science seems to be a buzzword these days, and it's essentially the
study of data. There are various advantages of data science, and some of
them are as follows.
Perhaps the most obvious benefit of data science is the rapidly growing set of professional opportunities it is creating. As the amount of available data increases, so does the need for trained professionals capable of dealing with this data and obtaining useful insights from it. It is rightly said
that data science is in high demand and for all the right reasons. It is
believed to be one of the highly employable sectors these days. It's not just
the IT sector that uses data science, but various other sectors are also using
it. Irrespective of whether it's healthcare or law enforcement, there seems to
be a need for data scientists everywhere. It's not just the abundance of
available jobs, but they are also high-paying jobs.
Data science is truly versatile and has various applications. From the e-commerce industry to organizations offering consultancy services or even banking, there are ample opportunities available in this field. As mentioned
earlier, there is a need for trained professionals capable of processing and
analyzing all the data available. All the insights derived from such data are
often used for enhancing the overall performance and profitability of
various companies and organizations. Another major advantage of using
data science is that it helps automate repetitive, redundant, and boring tasks.
Machines can be easily trained using artificial intelligence, and various
other programming languages to perform repetitive tasks based on historical
data. Once these tasks are automated, it frees up human resources, which
could be directed towards better purposes. Instead of spending all our valuable time, energy, and effort on redundant activities, we can let the machines take care of them, and the results they produce can be analyzed for a better understanding.
Data science is a great way to create specifically tailored products to
enhance overall customer experience and satisfaction. For instance, most of
the e-commerce platforms these days use recommendation systems that
help provide personalized insights to all their online shoppers and users.
Whenever you shop on an e-commerce platform or browse through certain
categories, you will notice a list of recommended products or related
services. These insights are based on your browsing history. All this
couldn't have been possible without data science.
Today, Data Science is turning out to be an important and necessary factor in the growth of a company. It helps companies make better decisions to enhance
their business. Data Science enables companies to connect with their
customers in an improved manner, thus providing a better quality service
and a more powerful product.
Data science can be applied to any industry such as travel, healthcare, and
education, which makes it very important as the industries can get insights
into the challenges and efficiently address them.
Big Data is also continuously evolving and improving. Various tools are being developed with which big data helps firms resolve complex IT and human resource management issues successfully.
As most of the fields mentioned so far are growing rapidly, data science's
significance also increases continuously. Its effect can be seen in various
sectors, including the healthcare industry, where new medicines and techniques are being discovered regularly and there is a continuous need to improve patient care.
With the data science technique, the healthcare sector can develop a
solution that can help take care of the patients better. As mentioned, the
scope of data science in the healthcare industry is noteworthy. Detecting early-stage tumors has become possible thanks to machine learning, and machine learning wouldn't have been possible without data science.
Therefore, as the importance of data science grows by the day, the demand for data scientists also grows. As more companies implement Data Science in their business strategies, several job opportunities have opened up in the sector. Data scientists are considered to be the future, and rightly so; they are expected to find solutions that meet the challenges of entire sectors.
Data Science and the Disadvantages
Although we have already seen why data science can be a very desirable career, there are also various disadvantages to it. To get a complete picture of data science, we must understand the problems behind the subject.
As a domain with a large number of high-paying jobs, it has changed many career trajectories, but the field has its drawbacks as well, given its vastness and its cross-disciplinary nature. Data Science is an ever-evolving domain that takes a long time to master. You should consider all the pros and cons before you decide whether to make this your ultimate career. To make that decision easier, here are some of the disadvantages of learning data science:
Data Science Is A Very Blurry Term
Data Science is a broad term, and it lacks a proper definition. Even though it is the talk of the town, it is very difficult to pin down exactly what Data Science means. Even the specific role of a data scientist depends on the sector, firm, or industry he is working in. While some people have called Data Science the fourth paradigm of science, some critics, especially statisticians, call it a simple rebranding of statistics.
Becoming A Master Of Data Science Is Almost Impossible
Data Science is a blend of many areas drawn from computer science, statistics, and math, and if you want to master data science, you need to be an expert in all of these fields. Many books and online sources have been trying to fill the skill gap that the data science industry is facing, but given the field's vastness, it is impossible to master all of it. A person who is an expert in computer science might not be able to learn everything in statistics within a short time. It is a very dynamic field that keeps a person learning throughout their career and across all areas of data science.
A Large Amount Of Domain Knowledge Is Needed
Another downside of Data Science is its reliance on domain knowledge, or subject-matter expertise. A data scientist with
some knowledge in Computer Science and statistics will find it difficult to
answer a question about genetics if he doesn't have any information about
the subject. For instance, a data scientist working in the healthcare sector is
trying to understand and use genomic sequences to predict an event's
outcome. To do this, he would require someone with a little knowledge of
genetics and molecular biology. This is the only way he can predict the
outcome of various decisions that will help the firm. It will be difficult for a
Data Scientist to do this by himself.
Arbitrary Data May Lead To Unexpected Results 
Data scientists' job is to analyze the data set and make careful predictions and draw insights that help or assist the company's decision-making. Often, the data given is erratic, and the expected results are not obtained. This can also happen due to a poor data management system and poor utilization of resources.
The Problem Of Data Privacy
Data is the fuel for almost all the sectors now. Data Scientists assist
companies in making data-driven decisions. But the data used in the same
may violate the privacy of clients or customers. The personal information of clients is visible to the parent company and may sometimes lead to a major data leak if data security fails. The ethical problem of safeguarding data privacy and controlling how data is used is a major issue for various sectors and industries.
Categories of Data
When you start to learn about data science, you will realize there is not just
one definite type of data. Instead, there are various categories under which
Data is classified, and every category would need an entirely new set of
tools for the process.
There are seven major categories of data:
●       Structured Data
●       Unstructured Data
●       Natural Language data
●       Machine Generated data
●       Graph-based Data
●       Audio, video, and image data
●       Streaming Data
Now let me explain a little about each type of data.
Structured Data
As the name suggests, a data set organized according to a defined model, with each record restricted to its corresponding data fields, is called structured data. Such data can be organized and managed using Structured
Query Language. However, not all structured data sets are easily
manageable. For example, a family tree data set is also considered
structured, but it is not easy to process and find insights into such structured
data sets. 
What makes it difficult to process is the presence of a few exceptions in the
data categories, which may demand another data processing model or rule.
Raw data obtained from the Internet or from a company is rarely structured; it is brought into a structure by coders. Hence, if the data is given in a structured manner, that is good, but if not, you can give it a structure before exposing it to Structured Query Language (SQL) techniques. Data found in emails is an example of unstructured data, so to analyze and process such a data set, we first need to bring it into a structured form.
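To make the idea concrete, here is a minimal Python sketch (assuming the pandas library is installed); the customer records and column names are invented for illustration. It shows a small set of records being given a tabular structure and then queried with plain SQL through Python's built-in sqlite3 module.

```python
import sqlite3

import pandas as pd

# Invented structured records: every row follows the same model and fields.
df = pd.DataFrame({
    "name": ["Asha", "Ben", "Carla"],
    "state": ["Texas", "Ohio", "Texas"],
    "purchases": [3, 1, 5],
})

conn = sqlite3.connect(":memory:")           # throwaway in-memory database
df.to_sql("customers", conn, index=False)    # store the structured table

# Because the data is structured, standard SQL works on it directly.
totals = pd.read_sql(
    "SELECT state, SUM(purchases) AS total FROM customers GROUP BY state", conn
)
print(totals)
```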
Unstructured Data
Many people argue that since an email has fields such as the sender, the
receiver, and the subject, it is structured. But traditional structured data analysis is not possible with emails, as their contents are highly varied and context-specific. The choice of words, the language used, and the intonation also vary from one mail to another, making the process even more complicated.
Natural Language
Even this is a type of unstructured data and is considered to be difficult to
process, as we would need to factor in linguistics. So, for such data sets, the
user must be proficient with various data science and linguistics techniques.
The main concern for programmers working with natural language processing is the lack of generalization in the models. Every model is trained for one specific aspect, such as entity recognition, topic recognition, or summarization, but these models cannot be generalized across tasks such as text completion and sentiment analysis. The main reason is that language is ambiguous, and it is almost impossible to program machines to overcome this ambiguity when human brains themselves sometimes fail to do so.
Machine Generated Data
The data generated by a computer or any similar machine or application, without a human being intervening in the process, is known as machine-generated data. As things become more automated, such data types are
considered an important source of information. To extract and analyze the
information that these kinds of data are trying to portray, we need tools that
are scalable as there is a large chunk of data to be analyzed to get a proper
view of the insight. Such type of machine-generated data is very high in
volume and also generated very fast. Some data, like logs of site crashes,
server crash logs, call record logs, and even network logs, are examples of
machine-generated data. 
Graph-Based Data
We should not confuse the terms “graph” and “graph theory.” The first refers to the geometrical representation of information in a chart; remember that almost any data can be turned into a chart, which does not change the type of data. Graph theory, by contrast, refers to a mathematical structure that connects objects in pairs based on their relationships with one another. Hence, this category of data is also termed network data. This type of data gives particular importance to the adjacency and relationships of objects. The common structural components found in graphs are:
●       Nodes
●       Edges
●       Properties
Graph-based data is most visible on social media sites. To query graph-based data, we use specialized query languages such as SPARQL.
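As a small illustration, here is a hedged sketch using the networkx Python library; the people and relationships are made up. It shows the three structural components listed above: nodes, edges, and their properties.

```python
import networkx as nx

G = nx.Graph()
G.add_node("alice", city="Austin")               # a node with a property
G.add_node("bob", city="Boston")
G.add_edge("alice", "bob", relation="friend")    # an edge with a property

# Graph data emphasizes adjacency and relationships between objects.
print(list(G.neighbors("alice")))                # ['bob']
print(G.edges["alice", "bob"]["relation"])       # 'friend'
```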
Audio, Video and Image Data
All of you must be familiar with audio, video, and image data to some
extent. But for a data scientist, audio, video, and image data are very difficult to deal with. One of the main reasons is that the computer must recognize the elements in the data, for example the difference between one object and another in an image. Identifying these things is tough for a computer, even though it seems easy for a human. To deal with such data, we use deep learning models to analyze it and provide insights.
Streaming Data
Streaming data can take the form of any of the kinds of data mentioned above. What sets it apart is that it enters the system in real time, as events occur, unlike the previous kinds of data, which are loaded into the system in batches and analyzed afterwards. We need an entirely different process to analyze and extract information from streaming data, which is why it is considered an entirely separate category.
How to Approach Data for Analysis
Here, I will walk you through the steps you will come across in the data science process. Usually, there are six steps that you need to follow:
●       Setting a goal for the research
●       Retrieving Data
●       Data preparation
●       Exploration of data
●       Data modeling
●       Presentation, visualization, and automation
Now let me explain each step briefly such that you get an idea of how you
need to approach a data science analysis:
Setting a goal for the research
The very first step in the data science process is setting a goal within the contextual boundaries of the company. Think of a data science project assigned to you by your firm: the first things to pin down are the end goal and the project's main features, so that you follow a definite framework and do not wander away from the given topic.
Retrieving Data
In the first step, we identified the data we require to attain the goal and where we can get it. The next step is actually getting that data. This step focuses on confirming the existence and credibility of the data and making sure that it is accessible. Once these aspects are confirmed, we can collect the data.
Data Preparation
As of now, we have been successful in retrieving the required data.
However, we should note that the way we are getting the needed data is not
error-free. Collecting the required data can be subjected to a large number
of mistakes and errors, and if we do not sort them out, then, of course, even
the results will turn out to be faulty. Hence, after collecting the Data, we
have to make the data ready by refining it to make it useful in the upcoming
steps of the data science process. Data preparation can be summed up in three steps (a short code sketch follows them):
Data Cleaning: To clean the data by removing the erroneous values and
inconsistencies within the already retrieved data source
Data Integration: To combine multiple sources of data with the retrieved data source to enrich the quality of the data
Data Transformation: In this step, we make sure that we have data in a very
desirable format. If not, we work towards changing it to a format that can
take part in the model that we have built.
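As a rough sketch of these three steps with pandas (the column names and values below are invented, not taken from any real project):

```python
import numpy as np
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "amount": [250.0, np.nan, 99.0, -1.0],   # one missing and one impossible value
    "order_date": ["2020-01-03", "2020-01-04", "2020-01-04", "2020-01-05"],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["east", "west", "east"],
})

# 1. Data cleaning: remove erroneous values and inconsistencies.
orders = orders.dropna(subset=["amount"])
orders = orders[orders["amount"] > 0]

# 2. Data integration: enrich the retrieved data with a second source.
prepared = orders.merge(customers, on="customer_id", how="left")

# 3. Data transformation: convert columns into the format the model expects.
prepared["order_date"] = pd.to_datetime(prepared["order_date"])
prepared["region"] = prepared["region"].astype("category")
print(prepared.dtypes)
```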
Data Exploration
This step is about getting to know the prepared data source; by doing this, you develop a comprehensive understanding of the intricacies and elements of the data. In other words, this process is all about familiarizing ourselves with the data using various techniques, including statistical summaries and visual representations. You can use simple modeling for this as well.
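A minimal, self-contained exploration sketch in Python with pandas might look like this; the small data frame is invented only to show the kind of quick summaries you would run on real prepared data.

```python
import pandas as pd

df = pd.DataFrame({
    "amount": [250.0, 99.0, 120.0, 80.0],
    "region": ["east", "west", "east", "east"],
})

print(df.head())                     # peek at a few records
print(df.describe())                 # summary statistics for numeric columns
print(df["region"].value_counts())   # frequency of each category
```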
Data Modeling
This step is about building a model on the data we have collected, prepared, and explored, so that we can reach the end goal in the best way possible. There are a number of different approaches you can opt for to build a proper data model, drawn from machine learning, search models, or statistics. The steps involved in building a data model are iterative, which means they are executed repeatedly within the model-building sequence.
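Here is a hedged modeling sketch with scikit-learn on a toy, invented data set; in a real project the features would come out of the prepared and explored data, and the choice of model would depend on the goal set in the first step.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Invented features (e.g. age, number of past purchases) and labels.
X = [[25, 1], [40, 3], [35, 0], [50, 4], [23, 0], [44, 5]]
y = [0, 1, 0, 1, 0, 1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
model = LogisticRegression()
model.fit(X_train, y_train)
print("accuracy on held-out data:", model.score(X_test, y_test))
```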
Presentation and Automation
Once the above steps are complete, the final step is to present the obtained results to your firm in a proper format. This process can also be automated, so that the results of the model are delivered as insights on an ongoing basis.
The Ecosystem of Big Data
This section will discuss the tools, technologies, and frameworks that have
come up to form an ecosystem around big data. Many big data tools overlap in functionality, which is exactly why the technologies adopted by one data scientist can differ from those adopted by another.
Frameworks for Data Science
Now let us focus on some of the frameworks of data science.
Distributed Programming Framework
The main idea of the distributed programming framework is that we can save data in a distributed file system and then exploit it where it sits. A very important advantage of using distributed storage is that instead of moving the data to the program, we can do the opposite and move the program to the data. When using a distributed programming framework with general programming languages such as Python or Java, a number of problems initially come up. Although there are complications, the community of data science coders has created many frameworks that help deal with them, giving a better experience overall when working with a distributed programming framework.
Data Integration Framework
After establishing a distributed file system successfully, we would add in
the data, as the file system is currently empty. To perform this, we will have
to shift our data from one source to another, which is exactly where the data
integration framework comes into the picture. The working model of the data integration framework is very similar to how data warehouses function: extract, transform, and finally load (ETL).
Machine Learning Frameworks
Once the data is set up in the right place, the next process is to extract
information. To do this, we take advantage of the functionality that is
offered by the areas of machine learning, statistics, and applied
mathematics. Before the popularity of computers, every piece of data had to
be manually processed by humans through various calculations, but this
changed as soon as scientific computing was introduced. 
Since a single computer could perform all the calculations easily and with incredible speed, opportunities that were previously lost to manual labor could be seized by letting the machine do the heavy lifting.
Machine learning frameworks cover deriving algorithms and designing network models for the machine's core functionality. These days, to analyze and process huge amounts of data, the computer has to perform the corresponding calculations, and for this we largely rely on machine learning models. Let us look at some of the most important machine learning libraries for Python:
●       Scikit-learn
●       Natural Language Toolkit
●       Pylearn2
●       PyBrain for neural networks
●       TensorFlow
It is necessary to understand that machine learning frameworks are not limited to the libraries available in Python. There are also useful machine learning engines, such as Spark, that focus on real-time machine learning.
NoSQL Databases
SQL databases are relational database software that is specialized for
querying a large collection of stored data, making it easier to manage it. A
bunch of SQL databases have been around for quite some time and are even
used today, such as:
●       Oracle SQL
●       MySQL
●       Sybase IQ, etc.
However, there is a newer class of databases grouped under the name NoSQL, which stands for ‘Not Only SQL.’ Although standalone SQL databases are lacking in terms of functionality, that is not the sole reason for the introduction of NoSQL databases. The real reason is that traditional SQL databases had major flaws that didn’t allow them to scale well. NoSQL databases eliminated these flaws in their architecture, and they are now becoming the building blocks for exponential growth in machine learning functionality.
You might be wondering to what extent the flaws of SQL relate to big data. The answer is that the flaws touch every aspect of big data. The major effect was that relational database software hit a dead end while trying to fulfill big data's storage and processing needs beyond a single node. SQL is also not capable of handling various formats of data such as streaming, unstructured, and graph data.
Even though there are many databases available nowadays, everything can
be easily categorized into the types detailed below:
Column Databases
As the name suggests, the data in such databases is stored in columns. This makes querying much easier for algorithms and therefore boosts querying speed. Keep in mind that newer, emerging databases use a similar but more efficient technology known as cell-wise storage, though the core idea remains the same.
Document Stores
This kind of database shifts the focus from tables to documents. The observational data is stored in documents, which makes it easy to use varying data schemas.
Streaming Data
Instead of storing data after collecting, transforming, and aggregating it in batches, the data is stored in real time while following the same three steps.
Key-Value Stores
Each value in the data is assigned a key rather than being stored in a table. Although the scalability is very good, the disadvantage is that the implementation becomes wholly the responsibility of the developer.
SQL On Hadoop
In Hadoop, batch queries over the data are available in a language similar to SQL, and the queries use a MapReduce framework that runs in the background.
New SQL
This class can be considered a hybrid in the sense that NewSQL leverages the good scalability of NoSQL databases and also takes advantage of the good features of relational databases, bringing the best of both worlds to the user. The NewSQL class offers the traditional SQL interface with the data model of a relational database.
Graph Databases
Not all data is suitable to be stored and processed in a table; some forms of data are best handled with graph theory and stored in a graph database, for example social networking data.
Scheduling Tools
Scheduling tools are used to automatically perform tasks that are
continuous and repetitive in nature and initiate tasks that are triggered by a
particular event. For instance, a scheduling tool can automatically start a MapReduce job as soon as a new data set appears in its directory.
Benchmarking Tools
The next class is benchmarking tools, which provide the user with a set of pre-defined, standardized profiling suites; the main reason for their development was to help optimize a big data installation. We will not go into the details of benchmarking tools, as benchmarking and optimizing the infrastructure and configuration are not tasks performed by data scientists; instead, IT infrastructure professionals handle this kind of job. However, compared to an un-optimized big data infrastructure, an optimized one can offer exponential savings in monetary resources. For example, consider that you are using a cluster of 100 servers. If optimization lets you do the same work with 10% fewer servers, you save the cost of those ten servers.
System Deployment Tools
You should never underestimate the deployment of applications onto big data clusters, as it is a very challenging task. This is where system deployment tools come into the picture and make the job relatively easier. These tools automate deployment by handling the installation and configuration of the big data components. With the tools taking care of most of the heavy lifting, users only need to learn to use the software properly.
Service Programming
Service programming tools basically take your big data applications and expose them as services that can be consumed by other applications. This lets users on other system architectures and networks use the functionality of your big data applications through a client-server relationship. A very popular example is a REST service, whose main purpose is to fetch data and serve it to web pages on the Internet.
Security
Big data security tools give the user fine-grained control over who is authorized to access the data and the data applications. However, this is mainly the concern of data security professionals rather than data scientists, so we will not go into the deep details of security in this book.
Chapter Four: Predictive Analytics
Predictive modeling is widely used; it helps identify accurate insights in a very large set of data and lets users forecast. To gain a major advantage over competitors, it is essential to have a view of the outcomes and future situations that your company might confront.
Data from the following sources is usually used to build a predictive model:
●       Transaction data
●       CRM data
●       Data related to customer service
●       Survey or polling data
●       Data on traffic on the web
●       Economic data
●       Data on geographic representation
●       Demographic related data
●       Data generated through machines
●       Digital marketing and advertising data
There are four main types of analytics:
Descriptive Analysis
It describes what the data shows. For instance, suppose a SaaS firm sold 4,000 licenses in Quarter 2 and 3,200 licenses in Quarter 1. Descriptive analytics provides insights into questions such as the overall number of licenses sold across these two quarters.
Diagnostic Analysis
Diagnostic analytics goes a step further with the data. It helps determine whether the rise in sales was due to the way the salespeople performed or due to a change in the interest of a certain section of society.
Predictive Analysis
This method uses techniques like machine learning and data mining to predict the future; it is about looking into past data to find out what the future has in store. Predictive analysis and data mining are nonetheless distinct from each other. A SaaS firm could, for example, use data on past sales and previous marketing expenses to create a forecast model that increases income and optimizes the money spent on marketing through targeted campaigns.
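To make the idea concrete, here is a toy sketch of such a forecast in Python; every number is invented, and an ordinary linear regression stands in for whatever model a real firm would choose.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

spend = np.array([[10], [20], [30], [40]])   # past marketing spend (in thousands)
sales = np.array([120, 190, 260, 330])       # licenses sold in those periods

model = LinearRegression().fit(spend, sales)
print(model.predict(np.array([[50]])))       # forecast for a planned spend of 50
```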
Prescriptive Analysis
Prescriptive analytics gives a proposal for the forecasted outcome.
Depending on the historical data, an action plan can also be recommended.
Advantages of Predictive Analysis
Improved prediction efficiency - It allows firms to efficiently create a predictive modeling process that makes use of statistics and a large number of past data sets to produce the model's results. These models allow forecasting of everything from TV ratings to technological advances and even sports outcomes.
Predictive Analysis and its Disadvantages
●       There is a definite gap between the models' predictions and the understanding of human behavior
●       Indirect power in decision models
●       Failure in polling predictions
Even though the predictive modeling problems we have seen are mathematical in nature, the user still needs to plan around company and technical barriers, which might prevent you from getting the past data that you need.
Chapter Five: Prerequisites for Data Science
You will learn about the prerequisites for data science in this section.
Statistics
Contrary to popular belief, data science is not a new term or concept; it has been around for many years, except that previously it was known as statistics and data scientists were known as statisticians. Data science is surely one of the best tools, if not the best, to help you process data and draw inferences and predictions from it. Under the fancy name lies hardcore statistics. This base gives businesses the power and capacity to turn data into improved productivity. It is impossible to master data science without making oneself accustomed to statistics.
There are multiple topics under statistics that a data scientist needs to be
aware of. Statistics can be broadly divided into two groups, which are:
●       Descriptive Statistics
●       Inferential Statistics
Descriptive Statistics
Descriptive statistics means presenting and expressing the data. This
concept allows you to read the data and focuses on the quantitative
summarization of the data using numerical representation and/or graphs. To
understand Descriptive Statistics, you need to understand the following
topics:
●       Normal Distribution
●       Central Tendency
●       Variability
●       Kurtosis
Normal Distribution
It is also known as the Gaussian distribution. Here, a large amount of data is shown using a graphical plot, with a probability function used to show the data points. A Gaussian distribution generally produces a symmetrical, bell-shaped curve, with the most frequent values in the center, representing the average. As the data points move farther from the center, they spread out evenly on both sides. The data should be distributed approximately normally if you plan to do inferential statistics.
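A small sketch in Python shows the idea: we draw samples from a normal distribution with numpy (the mean of 100 and standard deviation of 15 are arbitrary choices) and check that they center on the average.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=100, scale=15, size=10_000)

print(samples.mean())   # close to 100, the centre of the bell-shaped curve
print(samples.std())    # close to 15
# Plotting a histogram of `samples` (for example with matplotlib) would show
# the symmetrical bell shape described above.
```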
Central Tendency
Central tendency describes the center point around which the data gathers. There are three measures of this tendency: the mean, the median, and the mode. In a normal distribution, the mean sits at the center of the data. The mean is also known as the arithmetic average: the sum of the values divided by the number of values.
The second measure of central tendency is the median. It is the middle value when the data is arranged in ascending order. With an odd number of values it is easy to find the middle value; with an even number of data points there is no single middle value, so the mean of the two middle points is calculated and the result is the median.
Mode is the third part of the central tendency. The value that appears the
most number of times in a data set is the mode.
Variability
It is the concept that shows how far the data points are from the mean value. It also shows the amount of difference between the chosen data points. The indicators of variability are the range, the standard deviation, and the variance.
The difference between the largest and the smallest value is known as the
range of the dataset.
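The measures from the last two sections can be computed in a few lines of Python; the small data set below is invented purely for illustration.

```python
import statistics

import numpy as np

data = [2, 4, 4, 5, 7, 9, 11]

print(statistics.mean(data))     # arithmetic average
print(statistics.median(data))   # middle value of the sorted data
print(statistics.mode(data))     # most frequent value (4)
print(max(data) - min(data))     # range: largest minus smallest value
print(np.var(data))              # variance
print(np.std(data))              # standard deviation
```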
Kurtosis and Skewness
In a dataset, the lack of symmetry is measured using skewness. If the data is symmetrically distributed, it takes the shape of the normal distribution, with the curve evenly shaped on both sides of the average, and the skewness is zero. When the bulk of the data sits on the right side and the tail stretches to the left, the skew is negative. When the bulk sits on the left side and the tail stretches to the right, the skew is positive.
Kurtosis measures the tails of the probability distribution: it indicates whether the data is light-tailed or heavy-tailed. Data sets with large kurtosis have heavy tails, while data sets with low kurtosis have light tails.
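As a minimal sketch with scipy.stats, the invented data set below has its bulk at the low end and a long tail to the right, so the skewness comes out positive; the kurtosis function reports excess kurtosis relative to a normal distribution.

```python
from scipy.stats import kurtosis, skew

data = [1, 2, 2, 3, 3, 3, 4, 4, 10, 15]

print(skew(data))       # positive: a longer tail on the right of the bulk
print(kurtosis(data))   # excess kurtosis; above 0 means heavier tails than normal
```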
Inferential Statistics
Descriptive Statistics is about describing the data, but Inferential Statistics
is about getting insights from the data. Generally, inferential statistics is a method of drawing conclusions about a whole, larger population from a smaller data set, or sample.
For instance, let us say that you have to count the number of Indians who have been vaccinated against polio. This could be done in two possible ways:
●      You can either go around asking every Indian if they are vaccinated,
or
●     You can sample a small number of people in the city and then
extrapolate it to a large audience across India.
The first method seems impossible, as it is very difficult to go around and ask everyone in the country. The second way is to apply statistical methods that help draw conclusions or insights from a sample and use them to infer the insights and patterns of the larger population. Here are a few inferential statistical tools:
Central Limit Theorem
According to the Central Limit Theorem, the average of a sufficiently large sample approximates the average of the entire population, and properties of the sample, for instance the standard deviation, approximate those of the population. As the sample size increases, the errors shrink, and the distribution of sample averages forms a bell-shaped curve.
One of the most important concepts related to the Central Limit Theorem is the confidence interval, which gives an approximate range for the average of the population. To create the interval, a margin of error is added to and subtracted from the sample mean. This margin is calculated by multiplying the standard error of the mean by the z-score corresponding to the chosen confidence level.
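A short simulation sketch in Python illustrates both ideas; the underlying data is deliberately non-normal (exponential), yet the sample averages cluster around the true mean, and a 95% confidence interval can be built from the standard error. All numbers here are simulated, not real measurements.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
population = rng.exponential(scale=2.0, size=100_000)   # true mean is about 2.0

# Averages of many small samples form a bell-shaped curve around the true mean.
sample_means = [rng.choice(population, size=50).mean() for _ in range(1_000)]
print(np.mean(sample_means))

# 95% confidence interval from a single sample of 50 observations.
sample = rng.choice(population, size=50)
sem = sample.std(ddof=1) / np.sqrt(len(sample))          # standard error of the mean
print(sample.mean() - 1.96 * sem, sample.mean() + 1.96 * sem)
```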
Hypothesis Testing
Hypothesis testing is the process of checking how far an assumption about a population holds up. The assumption is tested on a small group drawn from the selected population, and the results are collected and examined.
The default assumption to be tested is known as the ‘null hypothesis.’ The competing claim tested against it is known as the ‘alternative hypothesis.’
For instance, there are two study groups where one group consists of people
who smoke and the other one consists of people who do not smoke. Here
the study begins by assuming that the average number of patients who have
cancer in both the smoking and the non-smoking group will be the same.
This is our "null hypothesis," which we have to check and decide whether
to reject it.
Our “alternative hypothesis” is that the number of patients who have cancer in the group that smokes is much higher than in the group that avoids smoking; that is, the average number of cancer patients in the smoking group is very high compared to the non-smoking group.
Based on the data provided and the actual evidence, we can test the hypotheses and conclude whether to accept or reject the null hypothesis.
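A hedged sketch of such a test with scipy is shown below; the two lists are invented counts of a health indicator in two small study groups, not real medical data, and a two-sample t-test stands in for whichever test a real study would justify.

```python
from scipy import stats

smokers = [12, 15, 14, 16, 13, 17, 15, 14]
non_smokers = [9, 8, 10, 11, 9, 10, 8, 9]

t_stat, p_value = stats.ttest_ind(smokers, non_smokers)
print(p_value)   # a small p-value is evidence against the null hypothesis
                 # (that the averages of the two groups are the same)
```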
ANOVA
ANOVA (analysis of variance) is a hypothesis-testing methodology used to compare multiple groups. It checks whether the study groups have the same or similar variances and averages, and it can perform this check with a low chance of error. ANOVA is calculated using the F-ratio: the ratio of the mean square between the groups to the mean square within the groups. There are many different methods of calculating ANOVA.
In terms of the two hypotheses, the null hypothesis assumes that the averages of all the groups are the same, while under the alternative hypothesis not all the averages are the same.
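A minimal one-way ANOVA sketch with scipy.stats.f_oneway on three invented groups looks like this; the F-ratio and p-value test whether all the group averages are equal.

```python
from scipy.stats import f_oneway

group_a = [23, 25, 21, 22, 24]
group_b = [30, 29, 31, 32, 28]
group_c = [24, 26, 23, 25, 27]

f_ratio, p_value = f_oneway(group_a, group_b, group_c)
print(f_ratio, p_value)   # a small p-value suggests at least one average differs
```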
Skills a Data Scientist Must Develop
If you want to become a data scientist or work in the field of data science,
you must know about the topics mentioned in this chapter. There are some
technical and non-technical skills you need to develop if you want to
succeed in this field. You can use the information in the following sections
to ensure you succeed at being a data scientist. If you work as a data analyst
at an organization, you can use the information you already have to train a
team of data scientists. Make sure they have the skills mentioned below to
make them more efficient and productive.
Technical Skills
The most important technical skills a data scientist must have are statistical
analysis and the skills to leverage various frameworks and software to
collect, mine, collate, process, analyze, interpret and present value from
large volumes of data. This means a data scientist must be skilled in
programming so that he can perform these activities. One of the easiest
ways to do this is to ensure you have a good academic background. A data
scientist often has a Master's Degree or a Ph.D. in computer science, engineering, or statistics. This ensures they have a strong foundation to
help them connect with various technical points that are the foundation of
the practice. Numerous schools offer programs for those people who want
to pursue data science.
If you do not want to take up the focused approach, you can look at
different options. These options include boot camps and MOOCs (Massive
Open Online Courses). These programs improve your understanding of the
core subjects that support a data scientist. They also provide a practical
learning approach that is not present in most textbooks.
Some other technical skills you need to develop are:
Understanding Data
Data science is all about working with different types of data. Thus it is
important for a data scientist to love data. Here are some questions you can
answer to help you determine your love for data:
●     Do you understand what data is, and what information you need to
look at?
●       Do you work with data frequently?
●      Do you know how to work with unstructured and structured data?
●       Do you love working with data?
If your answer is yes, then you should definitely take up certifications to
help you develop the necessary skills to become a data scientist.
Algorithms
An algorithm is a set of instructions given to any machine to perform a
specific task. For instance, when you tell a computer to calculate the sum of
two numbers, you give it specific instructions (a short code sketch of these steps follows the list):
1. Declare two variables.
2. Initialize the variables.
3. Assign values to the variables.
4. Initialize another variable to hold the sum.
5. Calculate the sum of the first two variables.
6. Assign the sum to the third variable.
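A tiny Python sketch of those six steps (the variable names are arbitrary):

```python
a = 0           # 1. and 2. declare and initialize two variables
b = 0
a, b = 4, 7     # 3. assign values to the variables
total = 0       # 4. initialize another variable to hold the sum
total = a + b   # 5. and 6. calculate the sum and assign it to the third variable
print(total)
```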

You also use an algorithm when you solve a puzzle with a pen and paper. A
data scientist needs to understand what an algorithm is since most systems
used in data analytics are based on algorithms. It is also important for a data
scientist to know how to design various algorithms to perform different
functions. Let us assume that you key in 10 random numbers. For example,
you enter the numbers 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. The objective is to find the
third-largest number in the set of numbers entered. To do this, you need to
find the right logic to use.
The data scientist needs to write the logic and develop an algorithm to find the third-largest number. We will look at different techniques to use throughout the book.
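One possible piece of logic, as a hedged sketch: sort the distinct values in descending order and pick the third one (this collapses duplicates; a single pass that tracks the top three values would avoid sorting altogether).

```python
numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

third_largest = sorted(set(numbers), reverse=True)[2]   # descending order, index 2
print(third_largest)                                    # prints 8
```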
Programming
You should learn different languages, such as Perl, Python, Java, SQL, C
and C++. Python is the most common language used to develop models.
These languages help you collect, clean, organize, and analyze data. You can also use them to work with unstructured data sets.
Analytical Tools
Every data scientist needs to learn different analytical tools and understand
the use of SaaS (Software as a Service). This understanding will help you
obtain valuable information from the cleaned and processed data sets used
for analysis. R, SAS, Hadoop, Pig, Hive and Spark are some of the most
popular tools to use for data analysis. You can assert your expertise in the
use of these tools through certifications.
Using Unstructured Data
Data scientists need to collect, manage, clean, understand and analyze data
collected from different sources. These data are not always structured, and
they can be in different forms, such as emails, texts, videos, images and
other unstructured data. For example, if the data scientist is working on a
marketing project or assisting the marketing team, he needs to know how to
handle social media. He also must know how to collect the data and use it
for his analysis.
Non-Technical Skills
Let us now look at some non-technical skills that a data scientist must have.
These skills are personal skills that a user must develop, and these are
different from certifications and qualifications.
Strong Business Acumen
A data scientist needs to have business acumen. He must understand the
different elements that are a part of a successful business model. Otherwise,
he cannot use the right technical tools and skills for the analysis. He also
cannot identify the different problems that need to be solved for the business to grow and sustain itself. It also becomes hard for him to manage or help the organization explore different business opportunities.
Communication Skills
A data scientist understands data better than everybody else in the
organization. If a data scientist wants to be successful in his role and the
organization to do better, the data scientist needs to communicate the
interpretation and analysis of the data to the management. This is of utmost
importance since the data scientist must communicate the findings to
someone non-technical, and thus a data scientist must have the necessary
communication skills.
Intuition
This is one of the most important non-technical skills that every data
scientist must have. He needs to have the ability to perceive hidden patterns
in the data set. He should also know where he should look in the data set so
that he can add more value to the analysis. These skills make the data
scientist efficient in their work. This skill only comes as the data scientist
gains more experience.
Chapter Six: Tools Required for Data Science
Now that you have understood the importance of data and why we need
data science, let us now understand the tools involved in Data Science. Data science involves various tools, and the most important ones are listed here.
R Programming Language
R is a programming language widely used by data scientists and by many organizations for data analysis. It is widely used because it possesses the features of an object-oriented programming language. It is mainly used for statistical analysis, reporting, and graphical visualization.
Python Programming Language
Python is another programming language that possesses the features of object-oriented programming and is thus widely used for software development and data science. Python works on multiple platforms such as
Windows, Mac, Linux, etc. It can also be used to create numerous
applications like web applications, user interface based applications,
numeric applications, etc.
Structured Query Language
Structured Query Language (SQL) is mainly used to create new databases and tables, to update records in a database, and to retrieve or delete records from it. Data scientists use SQL to manage the data captured in a database.
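As a hedged sketch of these operations, here they are run from Python through the built-in sqlite3 module; the table and values are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
cur.execute("INSERT INTO customers (name, city) VALUES ('Asha', 'Austin')")   # create a record
cur.execute("UPDATE customers SET city = 'Dallas' WHERE name = 'Asha'")       # update it
print(cur.execute("SELECT * FROM customers").fetchall())                      # retrieve it
cur.execute("DELETE FROM customers WHERE name = 'Asha'")                      # delete it
conn.commit()
```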
Hadoop
Hadoop is an open-source framework that can store and process big data. It provides big data analysis through a distributed computing framework. It also provides flexibility, as the user can store data without pre-processing it; you can therefore store as much data as you need and decide later which parts to use.
SAS
SAS (Statistical Analysis System) is one of the most crucial data science
tools, designed especially for statistical operations. SAS is used by large
companies to draw patterns from their data. It relies on the base SAS
language, which is used for statistical modeling. It is largely used by
professionals and firms that depend on reliable commercial software. SAS
offers many statistical libraries and tools that help data scientists model
and organize their data. Although SAS is reliable and has strong corporate
support, it is very costly, so it tends to be used only by very large
organizations. It also compares unfavorably with some of the better
open-source tools, and many libraries and packages are not included in the
base package, requiring a costly upgrade to access them.
Apache Spark
Spark is a very powerful analytics tool and one of the most popular tools
among data scientists. It is designed to handle both batch processing and
stream processing. It offers numerous APIs that give data scientists
repeated access to data for machine learning, and it can run up to a
hundred times faster than MapReduce. Spark's machine learning APIs help
data scientists draw insights and build powerful analyses.
Spark's ability to handle streaming data sets it apart from other big data
platforms: it can work with real-time data, whereas many other tools only
process historical data in batches. Spark provides APIs that can be
programmed in Python, Java, and R, and it also pairs very powerfully with
Scala, the language in which Spark itself is written.
Spark is very structured in its cluster management, which gives it an edge
over Hadoop, which is used primarily for storage. This cluster management
system is what allows Spark to process data at great speed.
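As a rough illustration of how a data scientist might use Spark from Python, here is a minimal sketch with the PySpark DataFrame API. It assumes a working Spark installation and a hypothetical CSV file named sales.csv with region and amount columns.

from pyspark.sql import SparkSession

# Start (or reuse) a Spark session
spark = SparkSession.builder.appName("SalesSummary").getOrCreate()

# Read a hypothetical CSV file; Spark infers the column types
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Aggregate total sales per region; the work is distributed across the cluster
totals = sales.groupBy("region").sum("amount")
totals.show()

spark.stop()

The same code runs unchanged whether Spark is running on a laptop or on a large cluster, which is a big part of its appeal.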
D3.js
D3.js is a JavaScript library for creating interactive visualizations in a
web browser. Its many APIs and functions can be used to build dynamic
visualizations and analyze datasets directly in the browser. Animated
transitions are an important feature of D3.js: documents can be made
dynamic, with the data updated continuously to refresh the visualization
on the web.
By combining D3.js with CSS, it is possible to create striking, transient
visuals that display specific, customized graphs on web pages. It is one of
the most important tools for data scientists, especially those who use
IoT-based devices that require client-side visualization and data
processing.
Chapter Seven: Data Science and Artificial
Intelligence
The next thing we need to focus on is artificial intelligence and whether it
is the same concept as machine learning. As mentioned earlier, data
science is a broad topic that brings many concepts together, including data
mining, big data, and artificial intelligence. Since data science is a fairly
new field that keeps growing, understanding the different concepts around
it makes a big difference.
Statistics is an important topic in data science and also in machine
learning, and it underpins both machine learning and artificial
intelligence. These two categories can be very similar in some cases and
entirely different in others. Now let us take a look at each to understand
exactly how they work in data science.
What is Artificial Intelligence?
The very first thing that we have to delve into is the basic idea of artificial
intelligence and also how exactly it is different from machine learning.
Initially, you may think that these two concepts are the same, but as you
learn more, major differences emerge.
Artificial intelligence refers to machines and programs built to copy
human capabilities on various mental tasks. The meaning of the term has
shifted a little over the years, but the core idea is the same: when you
implement artificial intelligence, you enable the program or machine you
are working on to work and "think" somewhat like a human brain, so that
it can efficiently complete tasks similar to those the human brain can
handle.
For people who are not from the world of technology, it would seem like
machine learning is the same as artificial intelligence, but there are
definitely differences between them.
The term artificial intelligence can conjure up images reminiscent of a
scientific fiction movie. However, we regularly use devices powered by
artificial intelligence that are steadily making our lives easier. Here are the
different benefits of artificial intelligence.
Most of us not only use smartphones but also depend a lot on them. They
are no longer just devices for communication but have become a source of
entertainment too. From making payments and checking financial
transactions to reading the news or staying in touch with others, we use
our smartphones for many purposes. These days it almost seems like these
devices have become an extension of our bodies. Believe it or not, voice
assistants such as Siri and Cortana on iOS and Windows-powered devices
are real-life applications of AI. Whenever you use social media platforms
such as Facebook or Instagram to tag other users in posts, it is AI-driven
facial recognition algorithms at work. Even financial institutions and
organizations in the banking sector regularly use artificial intelligence to
organize and manage vast amounts of data, and it is a great way to detect
fraud in the system.
Unlike human beings who cannot work 24/7 without rest, a machine can do
this without any breaks. Once these machines are programmed to use
artificial intelligence for functioning, it reduces human involvement in
getting things done. Human beings are bound to make errors, but once a
machine is programmed with the right code and executes it correctly, the
likelihood of errors drops. Machines are also good at multitasking, so the
efficiency of the results they produce is higher. Artificial intelligence
plays a significant role in the medical sector: assessing health-related data
and determining the probability of certain diseases or their contributing
risk factors can be done successfully using artificial intelligence. These
days, software can even help monitor neurological disorders by simulating
how the human brain functions.
Once the machines are equipped with artificial intelligence, they can be
used for hazardous exploration. These machines could easily overcome any
human limitations. They are not only equipped to perform difficult tasks but
can do so efficiently and effectively. Another great benefit of artificial
intelligence is that machines cannot get bored of mundane and repetitive
tasks. Irrespective of how boring the task might seem, the machines will not
get bored.
Chapter Eight: Data Science and Machine
Learning

Knowing Machine Learning


Since we have already taken a look at artificial intelligence, now it is time
to know about machine learning and the difference between the two. Firstly,
machine learning is a newer concept when compared to artificial
intelligence. But it has had a great impact when it comes to the world of
technology.
Machine learning is an application of artificial intelligence that gives
systems the ability to learn and improve over time without being
explicitly programmed each time. Machine learning focuses on programs
that can access data and learn from it for themselves. Learning starts with
observing data, or with instructions, to find patterns in the data and make
better decisions in the future based on the examples we provide. The main
aim is to let the computer learn automatically, without human
intervention, and adjust its outputs or actions accordingly. A classic
machine learning algorithm treats text as a sequence of keywords, in
contrast to an approach based on semantic analysis, which mimics the way
humans understand the meaning of text.
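As a small illustration of what "learning from data" looks like in practice, here is a minimal sketch in Python, assuming the scikit-learn library is installed. It trains a decision tree on a built-in sample data set and then measures how well it predicts examples it has never seen.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a small built-in data set of labeled examples
X, y = load_iris(return_X_y=True)

# Hold back part of the data to test the model later
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The model learns patterns from the training examples
model = DecisionTreeClassifier().fit(X_train, y_train)

# ...and is then evaluated on data it was never shown
print("accuracy on unseen data:", model.score(X_test, y_test))

No rule for classifying the examples is ever written by hand; the program improves its behavior purely from the data it is given, which is the essence of machine learning.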
Is artificial intelligence the same as machine learning?
Now that we have seen what each term means, let us look at how exactly
they differ. To start with, the aim of artificial intelligence is not to
improve accuracy but to raise the chance of success, whereas machine
learning aims to increase accuracy rather than the chance of success.
Artificial intelligence develops systems that mimic how a human would
respond in a given situation, whereas machine learning involves creating
self-learning algorithms.
The end goal of AI is to mimic natural intelligence in order to solve
complex problems, while the objective of machine learning is to learn
from a particular set of data for a specific task and optimize the machine's
performance on that task.
Artificial Intelligence and Machine Learning Go Hand in Hand
In the world of technology, machine learning and artificial intelligence are
really big and important words. They often go hand in hand and are used
interchangeably by industry experts because they look very similar. We
talked about machine learning and artificial intelligence before and
identified a few differences, but the perception that they are the same can
lead to a lot of confusion.
When you talk about technological changes, analytics, or even big data,
both terms are going to show up a lot.
They have also been able to make changes in the technological and other
industries either together or individually.
The simplest way to put it is that artificial intelligence is the broader
concept of machines being able to carry out tasks in a way that people
would regard as "smart." Machine learning, on the other hand, is the
current application of artificial intelligence built around the idea that
humans really just need to feed machines the data they require, and the
machines will then be able to learn from that data on their own.
Two major things led to the growth of machine learning and converted it
into a vehicle that helps to take the artificial intelligence developments
forward at the speed we see now. One breakthrough came in 1959, with
the realization that rather than writing code to teach a computer
everything it needs to know about the world and the task at hand, we
could write code that allows it to learn for itself.
The second breakthrough is more recent: the emergence and spread of the
Internet and the huge interest that came with it. As the Internet took off,
there was a big rise in the amount of digital information being generated
and stored, and it became available for analysis and insight to those who
needed it most.
Once these innovations were in place, it did not take long for engineers to
realize that they no longer had to spend vast amounts of time teaching a
computer what to do at each step and how to do it. Previously that was
needed at every step of the process, and it consumed so much time that it
was not considered efficient.
Programmers found it much easier and more efficient to write code that
allowed computers and machines to behave and learn more like humans,
making increasingly accurate predictions over time. Once such code was
written, the machines could be connected to the Internet, effectively
giving them access to all of the world's information and data from which
to learn new things.
Neural networks, which we will discuss in a little more detail later, were
an essential step in teaching a computer or another machine to understand
the world in a way similar to human beings. The idea of a neural network
is to have a computer system classify information in much the same way a
human brain does. Trained correctly, with the right rewards for correct
answers, it can take inputs such as images and classify them based on the
elements found inside. Its big advantages included needing only a single
person to run the job, along with an accuracy and speed no human being
could match. This is one way in which machine learning, the vehicle of
artificial intelligence in most cases, can learn and perform some of the
amazing things we have come to expect.
Both artificial intelligence and machine learning offer many use cases
today. They promise to automate some of the most mundane tasks and
provide a wealth of insights. It is no surprise that industries across sectors
such as manufacturing, healthcare, and banking are taking advantage of
what these technologies can achieve. They are so successful in the market
that machine learning and artificial intelligence are now two of the most
consistently in-demand capabilities being sold.
Chapter Nine: Hadoop

Hadoop in Data Science


The Internet and the World Wide Web steadily gained popularity and
importance in the latter part of the 20th and early 21st centuries. During this
period search engines and indexes were created to help users find the
relevant topic using the text-based content. As the network grew to include
millions of webpages, automation became essential to produce search
results. Web crawlers were created, many as university-led research
projects, and many more search engine start-ups were formed. The main
examples from that period were Yahoo, AltaVista, etc.
Doug Cutting and Mike Cafarella came up with a new open-source project
named Nutch. Their primary goal was to increase the rate at which search
results are displayed by spreading the data and calculation work across
multiple computers. The simple distribution of work across several systems
reduces the processing time and helps perform multiple functions at once.
Around the same time, another search engine project, Google, was in
progress. It was based on the same concept: storing and processing data
in a distributed, automated way so that results come back faster.
In 2006, Cutting started working for Yahoo and took the project Nutch with
him. The idea relies on Google’s early findings based on automating
‘distributed data storage and processing.’ The Nutch Project was broken
down into two; the WebCrawler part of the project stayed as Nutch whereas
the distributed computing and processing division became Hadoop (named
after Doug Cutting's son's toy elephant). In 2008, Yahoo released Hadoop
as an open-source project. Today, the non-profit Apache Software
Foundation (ASF), a global community of software developers and
contributors, maintains the framework and the ecosystem of technologies
related to Hadoop.
Apache Hadoop is open-source software that allows a network of
machines to solve problems involving large amounts of data and
computation. Hadoop, as mentioned, is scalable: it was built to run on
anything from a single server to clusters of thousands of machines.
Hadoop jobs can be written in multiple languages such as Python, C++,
Ruby, etc.
Parts of Hadoop
The three main parts of Hadoop are as follows.
Hadoop Distributed File System
The component of Hadoop that stores data is the Hadoop Distributed File
System, or HDFS as it is commonly called. HDFS is essentially a
primary/secondary arrangement of machines running on commodity
hardware: there are two main types of nodes, the name node and the data
nodes, that execute on that commodity hardware.
MapReduce
This section of Hadoop is responsible for high-level data processing. It
takes care of the refinement of a large number of data values over the
cluster of nodes.
YARN
This is used for the management of resources and scheduling tasks. It is
tough for the resources to be managed, allocated, and released in a multi-
node cluster. Hadoop Yarn lets us manage and control the resources very
effectively and efficiently.
Are Data Scientists Required to Know Hadoop?
The answer is a big YES! Hadoop is considered a must-know for data
scientists.
Data Science is a vast field. It has roots in multiple interdisciplinary fields
like math, statistics, and programming. It is all about finding patterns in the
data set. Data Scientists are trained to analyze and create insights and
predict various outcomes from the uploaded data. It is a comprehensive
term that takes in almost every technology that uses data.
The main function of Hadoop is to store big data. It lets users save all
types of data, structured and unstructured, and it includes modules like
Pig and Hive for analyzing large-scale datasets.
The major difference between big data and data science is that the latter is
a discipline involving the full range of operations on data; as a result, big
data is considered a component of data science. Since data science
involves such large amounts of information, it is imperative to understand
big data. Knowledge of Hadoop will certainly add to your knowledge base
and make you more versatile when it comes to handling massive amounts
of data. It will also raise your value substantially in the market and
provide a much needed competitive edge over others in the same field.
Moreover, as a data scientist, knowledge of machine learning is also
considered very important, and it is one aspect of learning that cannot be
overlooked. Machine learning models train and perform better with bigger
datasets, which is why big data is the preferred foundation for building
machine learning algorithms. To learn the intricacies of data science,
obtaining knowledge about big data is a must.
Is Hadoop the First Step toward Data Science?
According to the experts, Hadoop is the basic step towards being a data
scientist.
Hadoop is the most popular platform for big data and is widely used for
data operations involving large-scale data. The first step towards
becoming a full-fledged data scientist is understanding big data, both
structured and unstructured, and Hadoop is considered an ideal tool for
solving problems involving such large data.
Moreover, Hadoop is not only an ideal platform for handling big data but
also for analyzing it through extensions such as Hive and Mahout.
Knowing the full breadth of Hadoop therefore allows you to handle the
numerous data operations that make up the major tasks of data scientists.
As it is a major chunk of data science, learning Hadoop as the very first
platform will provide you all the required information to proceed further.
On the Hadoop platform, machine learning code is typically written in
Java. This may not be easy, but learning it proves to be a real benefit,
since you will also need to express machine learning techniques such as
regression, classification, and clustering in the MapReduce framework. To
make analyzing data sets easier, Apache has released two chief
components on top of Hadoop, Pig and Hive. Moreover, to run machine
learning on data records, the Apache Software Foundation has released
Apache Mahout, which runs on top of Hadoop with MapReduce as its
prime paradigm.
So, having expertise in big data and Hadoop will help create a
comprehensive architecture to analyze a large amount of data.
Why is Hadoop Important?
Hadoop can store and process large amounts of data and do it swiftly too.
As the data volume and variety are continually increasing, mainly from
social media and the Internet of Things (IoT), it is imperative to have this
feature.
Computing power: Hadoop's distributed computing model processes big
data quickly. The more computing nodes you use, the more processing
power you have.
Flexibility: Unlike old-school relational databases, you do not have to
pre-process data before storing it. You can store a substantial amount of
data and decide later how to use it. That includes unstructured data like
images, text, and videos.
Low cost: The open-source framework is of no cost and uses commodity
hardware to save a huge amount of data.
Scalability: You can easily grow your system to handle more data simply
by adding nodes. Very little administration is required.
Hadoop and its Impact on Data Scientists
In the last few years, Hadoop has been extensively utilized to implement
tools for data science in companies. With people taking up big data and data
science, firms across all sectors have been able to put data science to use.
Here are four critical areas where Hadoop has shown its impact on data
scientists:
Exploring data using large scale data sets
Data scientists have to handle massive datasets. Previously, they were
restricted by what could be stored on a local machine. With the rise in
data volumes and the extensive need for big data analysis, Hadoop
provides an easy environment for large-scale exploratory data analysis.
You can submit a MapReduce job, a Hive query, or a Pig script and run it
directly on Hadoop against the full data set to get insights and results.
Pre-processing large scale data
All data science roles require data pre-processing: acquiring data,
transforming it, and cleaning it up. Pre-processing at scale converts raw
data into a standard form, such as feature vectors.
Hadoop also executes large-scale data pre-processing and simplifies the
job of data scientists. It provides tools such as MapReduce, Pig, and Hive
to effectively handle large volumes of data.
Enforcing data agility
Almost all traditional database systems required a strict schema structure.
But Hadoop offers its users a flexible schema. This flexible schema, or
"schema on read," removes the need to redesign the schema whenever a
new field is required.
Facilitates large scale data mining
Machine learning algorithms train better and produce better results with
huge data sets. Techniques such as clustering, outlier detection, and
product recommendation draw on an extensive range of statistical
methods.
Initially, machine learning engineers could only work with a limited
amount of data, which commonly resulted in low-performing models or
algorithms. With a Hadoop platform that provides linearly scalable
storage, you can keep all of your raw data.
MapReduce in Hadoop – What Does It Mean?
MapReduce is Hadoop's framework for reliably processing large amounts
of data on big clusters of commodity hardware. It lets applications store
data in a highly distributed manner and process massive data sets across
clusters of machines using a simple programming model.
The Apache Hadoop project has various sub-projects:
●       Hadoop Common: common utilities that support the other sub-projects
●       Hadoop Distributed File System (HDFS): provides distributed storage and access to application data
●       Hadoop MapReduce: a framework for processing large distributed datasets on clusters
●       Hadoop YARN: a framework for job scheduling and resource management
The main aim of MapReduce is to help scale up data processing spread over
a cluster of machines. The MapReduce function takes place in 2 steps
known as Map and Reduce. The former process is known as the mapper and
the latter as the reducer. Once you set up MapReduce for an application, it
can scale to run over hundreds or even thousands of machines in a cluster
with a mere change in configuration. This simplicity has encouraged many
programmers to use MapReduce.
MapReduce – Working
The MapReduce function takes place in four main steps:
●       Input Splits
●       Map
●       Shuffle
●       Reduce
Now let us look at how each process or the step works:
Map Step
This step covers both the input split and the map phase. The input file is
processed line by line. Before being passed to the map function, the
incoming data is divided into small units known as input splits; an input
split is a chunk of the input that is consumed by exactly one map task.
Each split is handed to the mapper function, which takes in the data,
processes it, and generates intermediate output. Usually, the input data is
a file or directory stored in the Hadoop Distributed File System (HDFS).
Reduce Step
This phase is a combination of two steps: the shuffle step and the reduce
step. The reduce function receives the intermediate data produced by the
Map step, and after the reducer processes it, a new set of outputs is
produced and stored back into the Hadoop Distributed File System
(HDFS).
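To make the two phases concrete, here is a minimal word-count sketch written as two small Python scripts that could be run with Hadoop Streaming, Hadoop's facility for using any executable as the mapper and reducer. The script names are illustrative only.

# mapper.py - reads lines from standard input and emits (word, 1) pairs
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(word + "\t1")

# reducer.py - receives the pairs sorted by word and sums the counts
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    if not line.strip():
        continue  # skip blank lines
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(current_word + "\t" + str(current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print(current_word + "\t" + str(current_count))

The shuffle step between the two scripts, which groups and sorts the pairs by word, is handled by the framework itself, so the programmer only writes the map and reduce logic. A typical (installation-dependent) invocation would pass these scripts to the hadoop-streaming jar using its -mapper and -reducer options.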
Advantages of MapReduce
Now, let’s look at the advantages of the MapReduce function:
Scalability
The MapReduce model makes Hadoop scalable, because it allows large
amounts of data to be stored in a distributed form across multiple servers.
Since the data is spread across systems, the servers can also work on it in
parallel.
Cost-effective solution
MapReduce offers businesses and customers a cost-effective solution for
storing and processing the ever-increasing amounts of data they generate.
Flexibility
MapReduce makes Hadoop flexible enough to handle various types of
data from different sources, whether structured or unstructured. It
therefore becomes effortless to store and process both structured and
unstructured data.
Fast
Hadoop saves data in the Hadoop Distributed File System (HDFS), which
means storing the data on the local disks of the cluster nodes. MapReduce
tasks run on the same servers where the files reside, which allows
high-speed processing because the data does not have to be transferred
from one server to another.
Parallel processing
Since Hadoop stores data in the Hadoop Distributed File System (HDFS),
a MapReduce job divides the work into map and reduce phases, as
explained earlier, and these can be executed in parallel. Because of this
parallel execution, the runtime is reduced by a considerable margin.
Is it all Good? What are the Challenges of Using Hadoop?
MapReduce is not the perfect solution for every problem. It is not good
for iterative and interactive analytic tasks, though it works well for simple
information requests that can be divided into independent units.
MapReduce is file-intensive: the nodes do not intercommunicate except
through sorts and shuffles, so iterative algorithms require multiple map
and shuffle/sort and reduce phases to finish. This creates many files
between MapReduce phases and is inefficient for advanced analytic
computing.
There is also a widely acknowledged talent gap that needs to be
addressed. It can be difficult to find entry-level programmers with enough
Java skills to be productive with MapReduce; this is one reason
distribution providers are racing to put relational (SQL) technology on top
of Hadoop, since it is much easier to find programmers with SQL skills
than with MapReduce skills. In addition, Hadoop administration is part art
and part science, requiring low-level knowledge of operating systems,
hardware, and Hadoop settings.
Data Security
Another challenge is security, since the data is stored fragmented across
many servers. The Kerberos authentication protocol is a good step toward
making the Hadoop environment safe and secure.
Full-Fledged Governance and Management of Data
Hadoop does not have easy-to-use, full-featured tools for managing or
cleaning data and metadata, and it lacks tools for standardizing data.
However, we can safely conclude that Hadoop is a must-know for anyone
learning data science. It is used widely by data scientists to store large
amounts of data because it is scalable and fault-tolerant. Hadoop has
grown into a very important data science platform, helped by the fact that
companies such as Marks & Spencer use it to analyze purchasing patterns,
customer insights, and inventory management.
Real-World Application of Hadoop
Security and Law Enforcement
Some security and law enforcement agencies in the USA use Hadoop to
reduce the risk of terrorist attacks. It also helps to detect and prevent cyber-
attacks. Police even use big data to predict crimes, and the credit card
companies use it to find suspicious transactions. Hadoop is widely used
because it can easily go through massive datasets to find meaningful
patterns and interactions between the data.
Understanding customer requirements
Hadoop plays a major role in customer service too. Many sectors, such as
telecom companies, use this to figure out the needs of the customer by
analyzing a large chunk of data and getting insights from it. Social media
also uses this technology. It shows advertisements on various social media
websites to the target audience when the user opens the social media
platform on his or her computer. Even credit card firms use this
technology to identify the right audience for their products and the best
ways to sell them. Many online ticket booking websites use customers'
browsing and purchase history to figure out their interests and, using this
information, suggest suitable flights. E-commerce companies track which
products are purchased together and, based on this data, recommend
complementary items whenever a customer buys a product from that
segment.
Improving cities and countries
Hadoop is used by many countries and even cities to improve their public
services by using the available data. For instance, things like traffic jams
can be taken care of using Hadoop. It is also utilized in projects such as
making a smart city. It can also help make the transportation system better
in the city, for example by offering guidance on whether to choose buses,
trains, or some other mode of transport.
Financial trading and forecasting
Hadoop is used widely in the field of finance and trading. It uses complex
mechanisms to scan important market events against predefined
conditions and criteria to identify trading opportunities. Many trading
decisions are now carried out by algorithms alone.
Understanding and optimizing the business processes
One of the main reasons why learning and using Hadoop is important is that
it is commonly used in business processes. It has improved the performance
and optimized business processes in different ways. Retailers use Hadoop
to customize their stock using predictions based on existing data gathered
from various sources such as social media, Google searches, and other
platforms.
make the best decision possible to improve its overall business and
maximize profits.
Radio-frequency identification and GPS (Global Positioning System) data
are also used to track products and take the required actions to minimize
loss and reduce costs for the organization. They help plan the most
optimized route for product transfers between two points, whether
between countries or between states, letting the organization cut its costs
and maximize overall profitability. Many
organizations also utilize this technology in their work to monitor their
employees. A sensor is placed on the employee badges to check their
interaction with each other and even with clients. This can help in HR
decisions if something goes wrong between the employees or the employee
and the client. In a way, it is a handy tool for obtaining information in case
of disputes.
Performance optimization and personal improvement
Hadoop can play a crucial role in our personal life too. It can monitor sleep
patterns, the number of steps you have taken in a day, and provide tips on
how to improve your lifestyle. By drawing conclusions from all these
patterns, it can effectively help us improve our overall quality of life. Even
dating sites use Hadoop to match people with a common interest and, in
turn, help users find their true love or soulmate. This is an example of the
real-life application of Hadoop.
Use of Hadoop in the healthcare sector
Hadoop is used to improve public health in the medical sector. Many
health-related mobile applications run on Hadoop. These apps not only
monitor day-to-day activity but also help improve it. Hadoop stores and
analyzes massive amounts of data to draw insights about certain facts that
can help enhance the medical sector, and these insights can also be used to
make changes that improve human health.
Optimizing the performance of the machine
Hadoop is also applied in the mechanical sector. It is often used in self-
driving cars by automation where GPS, powerful cameras, and various
sensors are placed all around the vehicle. All this helps to run a car without
a driver. Hadoop plays a crucial part in the same, and it is going to
transform our lives in the coming years significantly.
Improving the field of sports
Hadoop is also used extensively in the sports sector. Various tools such as
IBM SlamTracker, an analytics application used in tennis, help improve
sporting activities. Video analytics is used in various sports to improve
players' performance, especially in games such as football or baseball. As of
now, numerous sensors are utilized to improve the performance of the
players using data analysis to improve the quality of the sport.
Improving research
Hadoop plays a vital role in the field of science, medicine, and research.
Many decisions are taken using the extraction of a large amount of data
relevant to the decisions, which makes it easy to draw conclusions. It can
also be used to predict the output before conducting research. This step
helps make the most of the available resources and reduces the overall
effort when compared to the earlier practices.
Common examples of commands and examples that you must
know before coding in Hadoop
Find below the most commonly used HDFS shell commands in Hadoop:
●       version
●       mkdir
●       ls
●       put
●       copyFromLocal
●       get
●       copyToLocal
●       cat
●       mv
●       cp
●       moveFromLocal
●       tail
●       rm
●       expunge
●       chown
●       chgrp
●       setrep
●       du
●       df
●       touchz
●       appendToFile
●       count
●       checksum
Let us understand these in further detail:
1. version: This checks the version of Hadoop.
Command: hdfs version
2. mkdir: This command is used to create new directories and takes a path URI as its parameter.
Command: hdfs dfs -mkdir /usr/local/firstdir
3. ls: This displays the contents of the directory given by the user as input.
Command: hdfs dfs -ls /usr/local/firstdir
4. put: This command copies content from the local file system to a location inside HDFS.
Command: hdfs dfs -put source_dir destination_dir
5. copyFromLocal: This works like the put command, with one difference: the source is always restricted to a local file reference.
Command: hdfs dfs -copyFromLocal local_src destination_dir
6. get: This command copies the files that match the source path in HDFS and creates a duplicate of them in the local file system.
Command: hdfs dfs -get source_dir local_dir
7. copyToLocal: This command is used in the same way as get; the difference is that the destination is always restricted to a local path.
Command: hdfs dfs -copyToLocal src_dir local_dir
8. cat: This displays the content of the named file on the console.
Command: hdfs dfs -cat dir_path
9. mv: This moves files and directories from one place to another within HDFS.
Command: hdfs dfs -mv source_dir_filename destination_dir
10. cp: This copies files and directories from one place to another within HDFS.
Command: hdfs dfs -cp source_dir_filename destination_dir
11. moveFromLocal: This copies content from the local file system into HDFS and, once the copy succeeds, removes the content from the local file system.
Command: hdfs dfs -moveFromLocal local_src destination_dir
12. tail: This shows the last kilobyte of a file on the console.
Command: hdfs dfs -tail file_path
13. rm: This removes files and directories from the path specified by the user.
Command: hdfs dfs -rm dir_name
14. expunge: This command is used to empty the trash.
Command: hdfs dfs -expunge
15. chown: This command changes the owner of a file or directory.
Command: hdfs dfs -chown owner_name dir_name
16. chgrp: This command changes the group of a file or directory.
Command: hdfs dfs -chgrp group_name dir_name
17. du: This displays the disk usage of all the files present in the directory at the path specified by the user. The information is displayed in bytes.
Command: hdfs dfs -du dir_name
18. df: This command displays the available free space.
Command: hdfs dfs -df -h
19. touchz: This creates an empty file at the given path, with a timestamp.
Command: hdfs dfs -touchz dir_name
20. appendToFile: This appends one or more sources from the local system to the destination file.
Command: hdfs dfs -appendToFile local_src… destination_dir_name
21. count: This counts the total number of files and directories.
Command: hdfs dfs -count dir_name
22. checksum: This returns the checksum information of a specific file.
Command: hdfs dfs -checksum file_name

Creating the First Program in Hadoop
Let us find the total count of each word by writing a MapReduce job. We
will start with the map function, which is expressed in Java as a subclass
of org.apache.hadoop.mapreduce.Mapper, declared as:
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
Listing One below shows ProjectionMapper.java:
package com.tom_e_white.drdobbs.mapreduce;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import java.io.IOException;

public class ProjectionMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  private Text word = new Text();
  private LongWritable count = new LongWritable();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // value is tab separated values: word, year, occurrences, #books, #pages
    // we project out (word, occurrences) so we can sum over all years
    String[] split = value.toString().split("\t+");
    word.set(split[0]);
    if (split.length > 2) {
      try {
        count.set(Long.parseLong(split[2]));
        context.write(word, count);
      } catch (NumberFormatException e) {
        // cannot parse - ignore
      }
    }
  }
}
The map() method is called once per input record, so the word and count
fields are created as instance variables to avoid the needless creation of
new objects on every call. The body of map() splits the tab-separated
input line into fields, setting the first field as the word and the third as its
count.
For simplicity, the code ignores lines whose count field is not a number,
but other actions could be taken, such as incrementing a MapReduce
counter to track how many lines are affected (see the getCounter() method
on Context for details).
Chapter Ten: R Programming Language
R programming language is a free software environment for computing in
statistics. It is used extensively among data miners and statisticians for data
analysis and to create statistical software.
Robert Gentleman and Ross Ihaka from New Zealand are considered to be
the creators of the R programming language which was developed by the R
Core Team. The R programming language is named after the initials of the
first names of both creators. The first version was released in 1995, and
the first stable version (R 1.0.0) came out on 29 February 2000.
Why is learning the R programming language important for data skills?
Why do some consider it better than Python? Both languages are good
and have an affinity for data science. However, R has a few unique
strengths that are worth mentioning.
R programming language has been created especially for statistics. It was
originally designed for statistical analysis by statisticians and is considered
to be one of the best programming languages for statisticians even today.
The syntax of this language is simple and it allows us to generate complex
statistical models with just a few lines of code. Many different statisticians
use R packages. It is easy to find support related to this language.
It is one of the most commonly used coding languages at the top tech firms
around the world. Almost all companies tend to prefer data scientists who
are proficient in R. Facebook uses the R language to conduct behavioral
analysis using users' post data. Google uses the R
language to make efficient and effective ads and to make economic
predictions. Twitter uses the R language for semantic clustering and data
visualization. Other companies such as HP, Microsoft, and IBM use R
language in their financial institutions and consulting firms too. The R
language is used by the New York Times too.
It is quite easy to learn the basics of data science with the help of the R
language. Python is often considered the best language for beginners, but
beyond its easy-to-use syntax, R has a major advantage of its own: it was
designed with data manipulation and analysis in mind. Once you
understand the basic fundamentals of R properly, it becomes quite easy to
learn the important data science skills of data visualization, data
manipulation, and machine learning.
The R language has multiple packages that make the whole workflow
easier. It is a simple, creative ecosystem designed specifically with
analysts in mind. For instance, the ggplot2 package is a great way to plot
graphs, while the dplyr package provides excellent data manipulation.
These are just some of the many benefits of this language.
R is an important language in data science and a brilliant choice for both
statisticians and data scientists. Its popularity increased with the explosion
of data science, making it one of the fastest evolving programming
languages globally. In practice, this means that finding answers to queries
and community guidance through various forums is easy while working
with R. And because there are so many enthusiastic users, R's packages
are integrated with several applications.
Even if you are well versed in Python, it might not be the right tool for
every job. So, learning another language is a wonderful idea. Adding R to
your skillset will certainly elevate your resume and increase the scope of
jobs available.
The Best Ways to Learn R
There is usually a mismatch between what prompts you to learn a
language and how you actually end up learning it. This is a major obstacle
when it comes to learning a programming language and is known as the
cliff of boring.
What exactly is the cliff of boring? It is the stretch of tedious syntax drills
and dry practice problems you are usually asked to work through before
getting to the good stuff, which is the work you actually want to do.
Beginners often lose their motivation because learning syntax is not an
exciting aspect of learning a new programming language. This is one step
you cannot skip or ignore while learning any programming language and
not just R. However, you can overcome the cliff of boring.
R is worth learning and the advantages it offers merely make it seem more
enticing. Be it increasing your skillset or job opportunities, it is a wonderful
language. Now, are you wondering how you can get through the boring
bits? Well, here are some tips that will come in handy.
Know exactly why you have started learning R
The first step is to understand the reasons that prompted you to learn R and
what you wish to do with it after learning. Also, try to look at the different
types of data you can explore with R and what you wish to do with them.
The primary objective at this stage is to find your motivation for learning
this language. Unless you have the internal motivation to do something,
chances are you will give up on it sooner or later. To avoid this, it is
important to understand your motivation. It also gives you a better
understanding of the goal you wish to achieve by learning R.  
Try to think beyond the common answer of “becoming a data scientist.”
There are different projects and problems data scientists can work on. Think
about your interests- do you want to make forecasts about the stock market?
Or maybe use this knowledge to understand sports statistics? What are the
various things you can do by learning this language? Make a note of all
such points that seem to interest you and keep them in mind. Once you start
learning R, remind yourself of these reasons and how R can help you attain
those objectives. Financial benefits might seem tempting, but don’t equate
everything in monetary terms. In the end, all that matters is whether you are
interested in the field you have chosen or not.
Learn the syntax first
You cannot avoid learning the basic syntax of a programming language.
Syntax in a coding language is more important than in human languages. It
is easy to understand what someone else is saying, especially when you
notice their body language and other aspects of nonverbal communication.
However, computers aren’t this forgiving. If the syntax used for the code is
wrong, the machine will not understand it.
Learning syntax might not be exciting but it is quintessential. Instead of
spending a lot of your time on rote learning, it makes more sense to learn
the syntax while tackling real-world problems and situations that appeal to
you. This gives you a defined goal that keeps you motivated, even if the
syntax is not exciting.
The faster you get to work on the projects, the quicker you will learn. You
must refer to a variety of resources to learn and check the syntax if you get
stuck. Ideally, don’t spend more than a couple of weeks dedicated to
learning the syntax. As with any other language, the more you use it for
communicating, the more fluent you get.
Try to work on structured projects initially
Once you have the hang of R's syntax, move on to working through
structured projects. By doing projects, you can apply
whatever you have already learned while seeing new things, which can be
challenging. This also gives you the perfect opportunity to learn and helps
you solve problems on the go. Working on structured projects will also help
spruce up your existing portfolio and make it seem more exciting to any
prospective employers. However, the one thing you need to do is keep
exploring new projects. It is okay even if you get stuck; R’s robust
community will help answer any of your queries. Keep at it, even if it gets a
little frustrating at times. The rewards will truly be worth it.
Stick with structured projects until you build up a little more experience
and feel comfortable. Even when choosing a course on data science, pick
one with a guided project that lets you apply the concepts you have
learned.
Build your projects
After completing the structured projects, you can move on to the next step
of learning the R programming language- build your own data science
projects. This is the best way to check and test the concepts you’ve learned
until now. It gives you a chance to analyze the areas where there’s scope for
improvement too.
Even while building a project on your own, you wouldn’t truly be alone. R
gives you access to a variety of resources and techniques you can use while
approaching a project. There are several dedicated packages associated with
R programming that will also come in handy. Whenever you work on a new
project using R, you are essentially learning a new package too. If you ever
get stuck or need some help, here are some resources you can use.
Stack overflow
Whatever your doubt is, it has probably been asked here before. If it has
not, you can ask it yourself. And don't worry if you don't find an existing
answer to your question, because the community of users is very helpful.
Google
Google, one of the most popular search engines, is also commonly used by
experienced programmers. Whenever you get stuck or get an error message,
a quick Google search of the error message will give you the answer.
Now the question arises on what kind of projects you should build. As
mentioned earlier, it would be better if you start working on projects that
interest you. Always start small instead of taking on large projects. Even
while working on large projects, try simplifying it into smaller tasks and
tackle them one after the other.
Below are some of the ideas you can consider for your own project:
●       Add a new feature to a structured project you already built, or do a deeper analysis of the same data
●       Find interesting projects built by others with R on GitHub and try to extend them, or take an existing project coded in another language and port it to R
●       Meet other R coders and see if you can join a project they are currently working on
●       Look for interesting stories in the news that might have data available, which you could use for a project
Level up the difficulty
Never stop learning if you really want to master this language. To broaden
your horizons with R, you must try different projects based on a variety of
concepts. In R, you can accomplish a lot with a simple for loop, but that
does not mean you should build many projects using only your looping
skills. Make each new project more challenging and complex than the one
you previously completed. All of this enhances your overall learning.
If you are not sure how to apply more complexity to your projects, here are
some of the questions you can ask yourself before ramping up:
●       Can you write a tutorial on how to build this project for an R novice? Trying to teach someone will show you how much you actually know about the subject, and it can be challenging!
●       Is it possible to scale the project to handle more data than you have added?
●       Can you make the program run faster?
●       Is it possible to make the program interactive and improve the visualization?
●       Can you add a predictive analysis component to your program?
Never Stop Learning R
A programming language and a spoken language are quite similar to a
certain extent. Your fluency and comfort levels increase when you practice
them regularly and consistently. However, when it comes to a programming
language, learning never really stops. Even the most experienced R
programmers keep learning new concepts as the language evolves, and
new packages are added to R all the time, expanding its overall scope.
Hence, it is important to stay motivated and curious and keep learning.
Whenever you feel skeptical or low on motivation, look at the progress you
have made and all the learning you went through. By looking at the
progress you made, you can replenish your motivation levels. If you keep at
it, you will certainly get the hang of it within no time.
Chapter Eleven: Building Simple Programs in R

Example 1
This first example involves nothing more than printing output – a program
to display Hello World.
> # We can use the print() function
> print("Hello World!")
[1] "Hello World!"
> # Quotes can be suppressed in the output
> print("Hello World!", quote = FALSE)
[1] Hello World!
> # If there are more than 1 item, we can concatenate using paste()
> print(paste("How","are","you?"))
[1] "How are you?"
In the program above, we have used print(), which is a built-in function to
print the required string Hello World! The quotes that you see are printed by
default. Hence to avoid that, we can add an argument called quote =
FALSE. Also, if there are more than a single item, you can use paste() or
cat() functions to concatenate the strings together.

Example 2
We can add the elements of the vector by the function sum()
> sum(2,7,5)
[1] 14
> x <- c(2, NA, 3, 1, 4)   # define a vector that contains an NA value
> x
[1]  2 NA  3  1  4
> sum(x)  # if any element is NA or NaN, result is NA or NaN
[1] NA
> sum(x, na.rm=TRUE)  # this way we can ignore NA and NaN values
[1] 10
> mean(x, na.rm=TRUE)
[1] 2.5
> prod(x, na.rm=TRUE)
[1] 24
When a vector contains NA (not available) or NaN (not a number) values,
functions such as sum(), mean(), and prod() return NA or NaN
respectively, unless na.rm = TRUE is used to ignore those values.

Example 3
This example will deal with an interactive screen, i.e. take inputs from the
user.
my.name <- readline (prompt="Enter name: ")
my.age <- readline (prompt="Enter age: ")
# convert character into integer
my.age <- as.integer(my.age)
print (paste ("Hi,", my.name, "next year you will be", my.age+1, "years
old."))
Output:
Enter name: Mary
Enter age: 17
[1] "Hi, Mary next year you will be 18 years old."
As you can see, we have used the function readline() to get input from the
user.
Here, you can see that with the  prompt  argument you can display an
appropriate message for the user.
In the above example, you convert the input age, which is a character vector
into an integer by the function as.integer().
This is necessary for doing further calculations.

Example 4
In this example, we will find if a year is a leap year or not by taking inputs
from the user.
# Program to check if the input year is a leap year or not
year = as.integer(readline(prompt="Enter a year: "))
if((year %% 4) == 0) {
  if((year %% 100) == 0) {
    if((year %% 400) == 0) {
      print(paste(year, "is a leap year"))
    } else {
      print(paste(year, "is not a leap year"))
    }
  } else {
    print(paste(year, "is a leap year"))
  }
} else {
  print(paste(year, "is not a leap year"))
}
Output 1:
Enter a year: 1900
[1] "1900 is not a leap year"
Output 2:
Enter a year: 2000
[1] "2000 is a leap year"
Here we have used the logic that a leap year is exactly divisible by 4 except
for the years ending with 00. The century year is a leap year only if it is
perfectly divisible by 400.
A nested if else is used to implement the logic in the above program.

Example 5
In this example, we will find the HCF of two numbers
# Program to find the H.C.F of two input number
# define a function
hcf <- function(x, y) {
  # choose the smaller number
  if(x > y) {
    smaller = y
  } else {
    smaller = x
  }
  for(i in 1:smaller) {
    if((x %% i == 0) && (y %% i == 0)) {
      hcf = i
    }
  }
  return(hcf)
}
# take input from the user
num1 = as.integer (readline(prompt = "Enter first number: "))
num2 = as.integer (readline(prompt = "Enter second number: "))
print (paste ("The H.C.F. of", num1,"and", num2,"is", hcf(num1, num2)))
Output:
Enter first number: 72
Enter second number: 120
[1] "The H.C.F. of 72 and 120 is 24"
The user is asked to enter two integers, which are then passed to a function that returns their HCF.
Inside the function, the smaller of the two input numbers is determined first, because the HCF can only be less than or equal to the smaller of the two.
Then a ‘for’ loop is used to move from 1 to the smallest number.
It is checked in each loop whether the number divides both the input
numbers perfectly or not.
If the number does divide perfectly, it is stored as the HCF. Once the loop is
complete the outcome is the HCF or the largest number that divides both
the numbers perfectly.

Example 6
In this example, we will show you how to develop a calculator of your own
# This program can help you make a simple calculator that can perform
simple functions such as add, subtract, multiply and divide.
add <- function(x, y) {
return(x + y)
}
subtract <- function(x, y) {
return(x - y)
}
multiply <- function(x, y) {
return(x * y)
}
divide <- function(x, y) {
return(x / y)
}
# take input from the user
print ("Select operation.")
print ("1.Add")
print ("2.Subtract")
print ("3.Multiply")
print ("4.Divide")
choice = as.integer (readline (prompt="Enter choice[1/2/3/4]: "))
num1 = as.integer (readline (prompt="Enter first number: "))
num2 = as.integer (readline (prompt="Enter second number: "))
operator <- switch(choice,"+","-","*","/")
result <- switch (choice, add(num1, num2), subtract(num1, num2),
multiply(num1, num2), divide(num1, num2))
print (paste (num1, operator, num2, "=", result))
Output:

[1] "Select operation."


[1] "1.Add"
[1] "2.Subtract"
[1] "3.Multiply"
[1] "4.Divide"
Enter choice[1/2/3/4]: 4
Enter first number: 20
Enter second number: 4
[1] "20 / 4 = 5"
In this code, the user is first asked to choose the operation they want to perform. Then the user is asked to input two numbers, after which the 'switch' branch is used to call the corresponding function.
The functions add(), subtract(), multiply() and divide() are all user-defined functions.
Chapter Twelve: Python for Data Science

Introduction to Python
Before learning about Python, it is important to understand why it is
important. Understanding python is one of the valuable skills required for a
career in data science. Python's use is increasing, and this trend is expected
to continue with increasing developments in the python ecosystem by data
science experts.
There are many programming languages you can learn when it comes to coding, but few provide the advantages of working with Python. Python is so popular and widely used that many operating systems already come with some version of Python pre-installed. This makes it easy to start learning how coding is done and ensures you will reap the benefits it offers quickly.
There are numerous things you would love while working with the python
programming language. You might be a little overwhelmed while you start
learning a new language because focusing on multiple languages isn’t easy
and it might seem like a time-consuming process. You might not even think
that you will have enough time or resources needed to get it done. You
might even worry that each time you want to work on the code, you have to
begin from scratch.
Python addresses all of these concerns and will make you feel more confident in no time, so you won't feel like a beginner every time you need to code. The Python programming language has many benefits: it is an object-oriented language, it is open source, and it can handle the many different types of programming required along the way.
Now, what exactly is meant by an open-source in coding? It essentially
means that Python will let you get the code for free and work with
programs and much more without having to worry about copyright issues.
This encourages advanced programmers to join Python and make some
adjustments, development, updates, and more, which are required to keep
this coding language modern and updated.
Python can handle all the coding you would want to do. It can handle
anything from basic coding that you would do while learning a language
and can also handle data analysis and machine learning. These advantages
make it amply clear why anyone would want to use python for coding.
The career opportunities opening up for data scientists are steadily
increasing by the day. It is believed the employment rate in this field can
increase by up to 28%. An efficient and effective way to capitalize on the
ever-growing opportunities is by learning various programming languages.
The Importance of learning Python
Python works seamlessly on different platforms, such as Mac, Linux, Windows, etc. It runs on an interpreter system, which means code can be executed the instant you write it. This makes prototyping quite simple. The syntax of Python is easy to understand and reads much like the English language. Compared to other programming languages, Python's syntax allows developers to write programs quickly and in fewer lines.
Functioning of Python
Python can be connected to database systems to read and modify files and records. It can manage big data and carry out complicated mathematics. It is used on servers to build web applications. As mentioned earlier, Python facilitates rapid and swift prototyping.
Python is a versatile language when it comes to website development. It is
known to be instrumental in AI and data science and is used extensively in
the Internet of things. Industry leaders such as YouTube, Amazon, Google,
and Facebook use Python as a programming language. Even other
companies are adopting Python as a coding language that makes the
demand for python developers very high.
Python is simple to learn and code because of its straightforward syntax and easy-to-read code. Programs written in Python are typically shorter than equivalent programs in languages that are also considered simple, such as C or C++. Developers can also build a working familiarity with the language quickly, which is why Python developers tend to find coding fun and comfortable.
Python is considered one of the more portable languages, which means users can execute code written in Python on other platforms without any amendments; the code runs on any platform flawlessly and without interruption. Python is also an extensible programming language: users can extend their Python code with modules written in C or C++ and even compile that code together with C and C++ programs.
Python is a high-level programming language that is fast, open, simple to learn, and user-friendly. As mentioned earlier, it performs flawlessly across platforms and can be executed anywhere seamlessly. Earlier, Python was considered suitable mainly for people who were well-versed in statistics and comfortable with scientific programming. Now, however, tools for almost every aspect of scientific computing are easily available in Python.
Why Use Python?
There are different programming languages available these days. All these
choices can make it difficult to select a specific language to learn. While
evaluating all the different languages, you would realize that the processing
power and coding efficiency of python is unlike any other. It is easy to read,
and it will also work with almost all operating systems. Here are some of
the benefits of this programming language that will help you make a better
decision:
Robust And Simple To Use
Python was designed from the start to be simple to learn and code. It can be picked up quickly and used effectively by beginners and experts alike. As you learn the language, you will notice that the code is easy to understand, and this feature comes in handy while working on complex code.
It has minimal constraints, and debugging is straightforward. Compared with other languages like C, C++ or Java, the code implementation time in Python is shorter, which in turn leaves coders more time to work on their algorithms.
The Large Community That Python Has!
Python is one of the most preferred languages for coding. It is coder-friendly and easy to learn, which is why people all over the world use it. The large community opens up several opportunities to learn about the language and what exactly it can do for you: there are public forums where any question you post is quickly answered, and there are several tutorials you can go through too. This comes in handy when you are just getting started with Python.
The Program Is Open Source
Open source means the original Python can be downloaded and is available to the public for free. Of course, a few people have taken parts of Python, developed them to meet certain specifications and requirements, and tried to sell them. Since it is open-source code, however, you can always download the original, unmodified version. You can work with the basics of Python code without having to buy anything else to get the program and its components to work.
Scalable
Python is a highly scalable programming language. It offers solutions to scaling problems that are harder to address in other programming languages such as Java. Many businesses have used it to build apps and tools very quickly.
Options Of Support Libraries
The vast database of libraries offered by python can be utilized in machine
learning and artificial intelligence. These libraries provide the assistance
required for a beginner to get started with the language. You can find many
functions, objects, classes, and more inside this language to help you with
your code. You can also add third-party libraries to get some extra power, or
if there is something in particular that you want to do with your code that
the current library cannot help with. The most popular libraries used are
Seaborn, Scikit Learn, Pytorch, and Matplotlib.
Extensive Visualization And Graphics Options
Python developers can explore a variety of options for data visualization
and graphics. Charts, graphical layouts, web-ready plots, etc. can be used
by developers to visualize the data.
Python’s Automation
Being an open-source language, Python makes automation straightforward; even a beginner can automate tasks after learning the basics. Python code is written as scripts, which the interpreter reads and executes directly. Once the developer has checked a script, it can be run repeatedly without interruption, so users can automate many routine tasks within a program.
Python is considered one of the best data science tools for working with big data. When location intelligence and data analysis need to be integrated with a production database, Python and big data tools are an excellent combination. Python is considered one of the best programming languages for this work because it is equipped with powerful library packages designed to meet data science and other analytical requirements.
The popular libraries used by big data and Python are:
•       Pandas: This library provides data structures and operations for manipulating numerical tables and time series data.
•       SciPy: It is used for technical and scientific computing.
•       NumPy: It makes scientific computing possible. It offers support for number crunching, linear algebra, multi-dimensional arrays and matrix calculations through an extensive library.
•       Mlpy: It provides a wide range of machine learning methods for solving problems and aims for a reasonable compromise between usability, efficiency, reproducibility, modularity, and maintainability.
•       NetworkX: This library is used for working with graphs; it helps in creating, manipulating, and studying the structure and functions of complicated networks.
Besides all this, Python has many other useful libraries such as Dask, Dmelt, TensorFlow, Scikit-learn, Regression, Matplotlib, etc. A short sketch of Pandas and NumPy in action follows below.
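As a minimal illustrative sketch (the table of daily sales below is invented purely for demonstration), here is how Pandas and NumPy might be used together for some light number crunching:
import numpy as np
import pandas as pd

# Build a small time-indexed table of daily sales figures (made-up data)
dates = pd.date_range("2020-01-01", periods=5, freq="D")
sales = pd.DataFrame({"units": [12, 15, 9, 20, 17]}, index=dates)

# Pandas gives convenient summaries; NumPy handles the raw number crunching
print(sales["units"].mean())               # average units sold per day
print(np.cumsum(sales["units"].values))    # running total as a NumPy array
Running this prints the mean of the column and its cumulative sum; the same pattern scales up to much larger tables.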
Python also supports testing and is a strong choice for test automation because of its libraries. Pytest is one of the most recommended test frameworks and can be used for unit testing as well as functional and integration testing. Because Python supports both functional and object-oriented styles, with straightforward syntax and stateless functions, it lends itself well to test automation. A minimal pytest-style example follows.
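Below is a minimal sketch of what a pytest test file could look like; the add() function and the file name test_math_utils.py are made up for illustration:
# test_math_utils.py
def add(x, y):
    """Toy function standing in for real application code."""
    return x + y

def test_add_positive_numbers():
    assert add(2, 3) == 5

def test_add_handles_negatives():
    assert add(-1, 1) == 0
Running the pytest command in the same directory discovers functions whose names start with test_ and reports which assertions pass or fail.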
Python also offers basic functionality, which is required to code the tools to
process three-dimensional data. For instance, the cgtypes module in python
can be used to learn more about various computer graphics, matrices and
other information used to create RIB files and more. With these modules
that exist, it becomes effortless to keep a 3D scene in memory, which can be
easily manipulated in Python.
Python is used in AI more than almost any other programming language because it is platform-independent; it is the most preferred and popular choice across different platforms, with only a few tweaks needed. It has many prebuilt libraries that can be utilized for coding artificial intelligence applications. Python is highly dynamic, given that both a scripting approach and an object-oriented (OOP) approach are possible, and users can rely on an IDE to help them produce the required code.
Python in Web Development
Developing a website using Python means writing code that runs on the server rather than in the browser. Building the server side of a website is crucial for storing user-related information. Before building a website with Python, you must learn and grasp the principles of the language.
Various aspects of Python make it one of the best coding languages, and the quality of the results it produces makes the language quite attractive.
Here are the five simple steps you can follow to learn python.
But remember that although the steps shown below are simple, that does not mean you needn't put in any effort. Without the required time, effort, practice and consistency, learning a new language isn't possible.
How to Learn Python Efficiently
Python Fundamentals
●       Find a course: There are many courses available; join one after reading reviews from the Python community.
●       Join a community: Join a Python community where you can learn from your peers. This also makes it easier to land a job, since referrals account for around 30% of all hires.
Mini Python Projects
●       Read guidebooks, blog posts, and other people's code, and try the command line interface: working from the CLI lets you execute Python code faster, so you can learn and experiment more.
●       Build small applications and try web scraping with Python.
●       Work with APIs.
Python’s Data Science Libraries
●       Online mentors from Dataquest, Stack Overflow and Reddit make great resources.
●       Work with databases using SQL: around 40% of data scientists use SQL.
Data Science Portfolio
●       Communicate, collaborate and focus on technical competence: focus on understanding the statistical processes behind the Python functions you are learning.
●       Use Git for version control: Git allows you and others to see past versions of your projects, and it helps with debugging and testing new approaches.
●       Display your portfolio for employers to see.
Advanced-Data Science Techniques
1. Keep learning. Data science is growing quickly, and it is too much to learn all of it. Dive into the parts that excite you, whether that is data visualization, machine learning or another developing field, and keep learning.
2. Learn beginning and intermediate statistics. A firm grasp of statistics will give you the perspective to focus on the right insights and help you communicate those insights to others.
3. Create models using live data feeds.
Chapter Thirteen: Building Simple Python
Programs

Example 1
Let us start with our favorite example, a program to print Hello World.
# This program prints Hello, world!
print ('Hello, world!')
Output:
Hello, world!
In this program, we make use of the built-in function print() function to
print the string.

Example 2
In this example we will show you a Python code to add two numbers by
taking two numbers from the user.
# Store input numbers
num1 = input ('Enter first number: ')
num2 = input ('Enter second number: ')
# Add two numbers
sum = float (num1) + float (num2)
# Display the sum
print ('The sum of {0} and {1} is {2}'.format (num1, num2, sum))
Output:
Enter first number: 1.5
Enter second number: 6.3
The sum of 1.5 and 6.3 is 7.8
Here the built-in function input() has been used to receive inputs from the user. input() returns a string, which is why it is necessary to convert the string into a number; this conversion is performed using the float() function.

Example 3
In this example, we will check if a number that is provided by the user is a
prime number or not
# Program to check if a number is prime or not
num = 407
# To take input from the user
#num = int (input ("Enter a number: "))
# prime numbers are greater than 1
if num > 1:
    # check for factors
    for i in range(2, num):
        if (num % i) == 0:
            print (num,"is not a prime number")
            print (i,"times",num//i,"is",num)
            break
    else:
        print (num,"is a prime number")
# if input number is < 1
# or = to 1, it is not prime
else:
    print (num,"is not a prime number")
Output:
407 is not a prime number
11 times 37 is 407
In this code, the variable ‘num’ is checked if it is prime or not. Numbers
that are < or = 1 are not prime numbers. Therefore, we only move forward
if the num is larger than 1.
We checked if the variable ‘num’ is exactly divisible by any number from 2
to ‘num – 1’. If we get a factor in that range, the number is not prime. Else
the number is prime.

Example 4
In this example, we will be checking on how to find the HCF of two
numbers using Python
# Python program to find H.C.F of two numbers
# define a function
def compute_hcf (x, y):
    # choose the smaller number
    if x > y:
        smaller = y
    else:
        smaller = x
    for i in range (1, smaller+1):
        if((x % i == 0) and (y % i == 0)):
            hcf = i
    return hcf
num1 = 54
num2 = 24
print ("The H.C.F. is", compute_hcf (num1, num2))
Output:
The H.C.F. is 6
The function used here is compute_hcf() function. The function will
calculate the HCF of the two numbers and return the same.

Example 5
In this example, we will be sorting the words alphabetically using a Python
program.
# Program to sort alphabetically the words form a string provided by the
user
my_str = "Hello this Is an Example With cased letters"
# To take input from the user
#my_str = input ("Enter a string: ")
# breakdown the string into a list of words
words = my_str.split ()
# sort the list
words.sort()
# display the sorted words
print ("The sorted words are:")
for word in words:
    print (word)
Output:
The sorted words are:
Example
Hello
Is
With
an
cased
letters
this
In this program, we make use of the function split() once the string is stored in the variable my_str. The split() function splits the string at white spaces.

Example 6
In this example, we will perform various set operations, as in mathematics.
# Program to perform various sets of operations like in mathematics
# define two sets
E = {0, 2, 4, 6, 8};
N = {1, 2, 3, 4, 5};
# set union
print ("Union of E and N is", E | N)
# set intersection
print ("Intersection of E and N is", E & N)
# set difference
print ("Difference of E and N is", E - N)
# set symmetric difference
print ("Symmetric difference of E and N is", E ^ N)
Output:
Union of E and N is {0, 1, 2, 3, 4, 5, 6, 8}
Intersection of E and N is {2, 4}
Difference of E and N is {8, 0, 6}
Symmetric difference of E and N is {0, 1, 3, 5, 6, 8}
In this program, two distinct sets are defined and various set operations are performed on them using the set operators |, &, - and ^.
Chapter Fourteen: SQL for Data Science

Introduction
SQL (structured query language) is a programming language used to
manage data in a relational database. It is also used for querying. A
collection of two-dimensional tables gives us a relational database. Each of
the tables is made by a fixed number of columns and can have multiple
rows. For example, consider different car manufacturers. Every car
manufacturer would have a database composed of tables. Each table will
store different parameters or metrics for each car model in different
countries.
The Beginning of the NoSQL Movement
The term NoSQL has become highly popular, and you have probably seen it thrown around quite often. Even if you haven't, don't worry, because you will learn about it in this section. The NoSQL movement describes databases that store and manipulate data in formats other than relational tables, i.e. in non-relational formats.
NoSQL databases differ from traditional relational databases for various reasons. Non-relational database systems started to become popular in the 2000s. In this period, emerging companies such as Amazon, Facebook and Google needed enormous amounts of data to be processed in very short periods, because huge numbers of people were searching and browsing these websites all the time.
It is a common belief in the open-source community that the systems developed to meet these requirements had exceedingly broad and diverse characteristics. It was agreed that these systems could tackle the problems associated with RDBMSs to a certain extent. They targeted near-linear horizontal scalability by distributing data over a cluster of database nodes.
These systems use data models that are less rigid than the relational model and are therefore better equipped to handle growing and evolving data structures, while aiming to keep their query mechanisms simple and accessible compared to other interfaces.
The NoSQL data storage layers have become extremely popular, but they are not without their limitations. Many of the newer implementations are still at a preliminary stage and have not yet proved their mettle in the field. They also lack relational support, which makes writing aggregations and complex queries difficult.
Compared to this, the RDBMS is durable, easily manageable, and provides
better support for transactions. While the choice between NoSQL and
RDBMSs is difficult to make as it incorporates many criteria, one day or
another people will be forced to look into it carefully.
Most NoSQL vendors tend to focus on improving the durability and robustness of their products, while traditional RDBMS vendors tend to add features that allow users to build scalable, schema-free data stores. Such stores can hold semi-structured and nested documents, which is why NoSQL is so popular in the document store category.
Some vendors have started using the term 'NewSQL' to describe modern relational database management systems. The primary aim of these systems is to combine the flexibility and scalable performance of NoSQL systems with the robustness of the traditional DBMS and present a new and improved product. The future trend seems to be toward such blended systems, except in cases where specialized, niche database management systems are needed. If anything, the 'NoSQL' movement shows that in modern times there is no one-size-fits-all tool that can help every firm.
Along with Python and R, SQL is now considered one of the most required skills in data science. Are you still wondering why it is important to learn SQL? Here are the reasons you should consider it.
As already mentioned, around 2.5 quintillion bytes of data are created daily, and databases are needed to store such massive amounts of data. These days, many organizations place a lot of value on data and have realized its importance. The available data can be used to solve organizational problems by analyzing and predicting incoming credit, to predict market trends, and to get to know the wants of the client.
One of the primary reasons for using structured query language is that when an operation is performed on a data set, the database can be accessed directly without having to copy the data beforehand. This speeds up the process and the execution of workflows.
There are several different SQL databases, such as SQLite, Oracle, Microsoft SQL Server, MySQL, and Postgres.
You can also learn and practice SQL using the Google BigQuery integration on Kaggle. SQL can be used in data science for preprocessing and machine learning too.
Master Data Science in 5 Steps
Basics of Relational Database
As already mentioned, SQL is a language for querying and managing data in a relational database management system (RDBMS). Usually, the terms SQL and RDBMS are used together for the sake of convenience. The term SQL is often used to draw a distinction between relational database systems and non-relational database systems, the latter being grouped under the umbrella term NoSQL. Non-RDBMS systems generally do not require any SQL skills.
An Overview of SQL
Now that we are familiar with relational databases, let us shift our focus to SQL.
The syntax of SQL is similar to the English language, so it is easy to understand the code once you go through a few examples. As you read SQL code, don't worry about understanding every detail; the initial steps of most programs are intuitive, which helps your understanding. To get started with the following steps, you need an SQL environment set up. You need not enter every SQL statement into it, but doing so does enhance your overall understanding. You could install SQLite locally and start working with it.
Data Management Functions
SQL can be used to perform several types of data management functions, such as querying the database with the SELECT command, inserting records into the database with the INSERT command, and updating records with the UPDATE command. These three commands are among the most popular and heavily used, and they are a good start to your SQL learning journey because they are very simple and readable.
Firstly, select a database you want to work on. Ensure that the records you
opt for are interesting and will not bore you during the early stages of the
learning process. Now, you need to choose data. Simple tutorials to do this
can be found on YouTube and similar platforms. The next step is to add
another record into the database present in the existing records. After this,
don’t forget to update the database. Once you start following this flow, it
becomes easier to understand how the different commands work.
Creating, Dropping and Deleting
The next set of commands you can use are CREATE and DROP tables, and
also to DELETE records. With an understanding of a growing set of
commands, it becomes easier to use SQL.
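The sketch below is illustrative only; the movies table and its columns are invented, and an in-memory SQLite database is used so nothing is written to disk:
import sqlite3

conn = sqlite3.connect(":memory:")      # throwaway in-memory database
cur = conn.cursor()

cur.execute("CREATE TABLE movies (title TEXT, year INTEGER)")        # CREATE a table
cur.execute("INSERT INTO movies VALUES ('Alien', 1979)")             # INSERT a record
cur.execute("UPDATE movies SET year = 1986 WHERE title = 'Alien'")   # UPDATE the record
print(cur.execute("SELECT title, year FROM movies").fetchall())      # SELECT it back
cur.execute("DELETE FROM movies WHERE title = 'Alien'")              # DELETE the record
cur.execute("DROP TABLE movies")                                     # DROP the table

conn.commit()
conn.close()
Very similar statements work on the other SQL databases mentioned above, although data type names and connection details differ from system to system.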
Views and Joins
Now let us focus on slightly advanced SQL topics. Firstly, let’s look at
views, or virtual tables populated by the results of queries, which are useful for application development, data security, easy data sharing and so on.
These virtual tables make it easy to search your database and avoid any
long queries. Does it look too good to be true? Well, it is not. Let us assume
there is a table created to store information about all the individuals who
live on this planet. Some sensitive data points, such as salary or address, shouldn't become public knowledge. So, access to this data table needs to be granted without giving away access to the private information.
Now let us learn how to create a view.
Write a query that returns the data that should appear in the view, using the SELECT command, and execute it. To create a view from this data, write CREATE VIEW followed by your view name, then AS followed by the SELECT statement you have already written, and that is it. The view is created, and you can now treat it as a table.
One of the important benefits of a VIEW is security. You can give access only to the views instead of all the underlying tables, so you control the data others see and can restrict access to the sensitive data. Another benefit of a VIEW is simplicity. Instead of everyone learning how to combine two tables with a UNION, they can simply select from a view that has already been created using those commands.
Joins come in different flavors, and they are probably among the more complex topics you will learn while covering SQL. That is more a testament to how easy the rest of SQL is than to any real difficulty in learning about joins. The tutorial MySQL Tutorial 2: Views and Joins from Arachnoid covers views and joins. A small sketch of a view and a join, run from Python against SQLite, is shown below.
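As a minimal sketch of both ideas, assuming an invented people table with a sensitive salary column and a second cities table (neither comes from this book), SQLite can be driven from Python like this:
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE people (id INTEGER, name TEXT, salary INTEGER);
    CREATE TABLE cities (person_id INTEGER, city TEXT);
    INSERT INTO people VALUES (1, 'Mary', 50000), (2, 'John', 62000);
    INSERT INTO cities VALUES (1, 'Atlanta'), (2, 'Seattle');

    -- The view hides the sensitive salary column
    CREATE VIEW public_people AS SELECT id, name FROM people;
""")

# Query the view exactly as if it were a table, joined to the second table
rows = cur.execute("""
    SELECT p.name, c.city
    FROM public_people AS p
    JOIN cities AS c ON c.person_id = p.id
""").fetchall()
print(rows)   # e.g. [('Mary', 'Atlanta'), ('John', 'Seattle')]
conn.close()
Anyone given access only to public_people can join and query it freely without ever seeing the salary column.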
This chapter is meant to give you a brief insight into what SQL is and how it works. You need to keep working on it and learn more about it to get a
deeper understanding.
Examples of few SQL queries in a program
Example 1
CREATE TABLE emp (
empno INT PRIMARY KEY,
ename VARCHAR(10),
job VARCHAR(9),
mgr INT NULL,
hiredate DATETIME,
sal NUMERIC(7,2),
comm NUMERIC(7,2) NULL,
dept INT)
begin
insert into emp values
(1,'JOHNSON','ADMIN',6,'12-17-1990',18000,NULL,4)
insert into emp values
(2,'HARDING','MANAGER',9,'02-02-1998',52000,300,3)
insert into emp values
(3,'TAFT','SALES I',2,'01-02-1996',25000,500,3)
insert into emp values
(4,'HOOVER','SALES I',2,'04-02-1990',27000,NULL,3)
insert into emp values
(5,'LINCOLN','TECH',6,'06-23-1994',22500,1400,4)
insert into emp values
(6,'GARFIELD','MANAGER',9,'05-01-1993',54000,NULL,4)
insert into emp values
(7,'POLK','TECH',6,'09-22-1997',25000,NULL,4)
insert into emp values
(8,'GRANT','ENGINEER',10,'03-30-1997',32000,NULL,2)
insert into emp values
(9,'JACKSON','CEO',NULL,'01-01-1990',75000,NULL,4)
insert into emp values
(10,'FILLMORE','MANAGER',9,'08-09-1994',56000,NULL,2)
insert into emp values
(11,'ADAMS','ENGINEER',10,'03-15-1996',34000,NULL,2)
insert into emp values
(12,'WASHINGTON','ADMIN',6,'04-16-1998',18000,NULL,4)
insert into emp values
(13,'MONROE','ENGINEER',10,'12-03-2000',30000,NULL,2)
insert into emp values
(14,'ROOSEVELT','CPA',9,'10-12-1995',35000,NULL,1)
end
CREATE TABLE dept (
deptno INT NOT NULL,
dname VARCHAR(14),
loc VARCHAR(13))
begin
insert into dept values (1,'ACCOUNTING','ST LOUIS')
insert into dept values (2,'RESEARCH','NEW YORK')
insert into dept values (3,'SALES','ATLANTA')
insert into dept values (4, 'OPERATIONS','SEATTLE')
end
The preceding data script can be run in ISQL. It uses statements such as INSERT, CREATE TABLE, etc.

Example 2
This example shows how to create the sample tables using the data provider.
SQLServerConnection Conn;
Conn = new SQLServerConnection("host=nc-star;port=1433;
User ID=test01;Password=test01; Database Name=Test");
try
{
Conn.Open();
}
catch (SQLServerException ex)
{
// Connection failed
Console.WriteLine(ex.Message);
return;
}
string[] DropTableSQL = {"drop table emp", "drop table dept"};
for (int x=0; x<=1; x++)
{
try
{
// Drop the tables, don't care if they don't exist
SQLServerCommand DBCmd = new
SQLServerCommand(DropTableSQL[x], Conn);
DBCmd.ExecuteNonQuery();
}
catch (SQLServerException ex)
{
}
}
// Create the tables
string CreateEmpTableSQL = "CREATE TABLE emp
(empno INT PRIMARY KEY NOT NULL,"
+"ename VARCHAR(10) NOT NULL,"
+"job VARCHAR(9) NOT NULL,"
+"mgr INT,"
+"hiredate DATETIME NOT NULL,"
+"sal NUMERIC(7,2) NOT NULL,"
+"comm NUMERIC(7,2),"
+"dept INT NOT NULL)";
string CreateDeptTableSQL = "CREATE TABLE dept ("
+"deptno INT NOT NULL,"
+"dname VARCHAR(14),"
+"loc VARCHAR(13))";
try
{
SQLServerCommand DBCmd = new
SQLServerCommand(CreateEmpTableSQL, Conn);
DBCmd.ExecuteNonQuery();
DBCmd.CommandText = CreateDeptTableSQL;
DBCmd.ExecuteNonQuery();
}
catch (Exception ex)
{
//Create tables failed
Console.WriteLine (ex.Message);
return;
}
// Now insert the records
string[] InsertEmpRecordsSQL = {
"insert into emp values
(1,'JOHNSON','ADMIN',6,'12-17-1990',18000,NULL,4)",
"insert into emp values
(2,'HARDING','MANAGER',9,'02-02-1998',52000,300,3)",
"insert into emp values
(3,'TAFT','SALES I',2,'01-02-1996',25000,500,3)",
"insert into emp values
(4,'HOOVER','SALES I',2,'04-02-1990',27000, NULL,3)",
"insert into emp values
(5,'LINCOLN','TECH',6,'06-23-1994',22500,1400,4)",
"insert into emp values
(6,'GARFIELD','MANAGER',9,'05-01-1993',54000, NULL,4)",
"insert into emp values
(7,'POLK','TECH',6,'09-22-1997',25000, NULL,4)",
"insert into emp values
(8,'GRANT','ENGINEER',10,'03-30-1997',32000, NULL, 2)",
"insert into emp values
(9,'JACKSON','CEO',NULL,'01-01-1990',75000,NULL, 4)",
"insert into emp values
(10,'FILLMORE','MANAGER',9,'08-09-1994',56000, NULL, 2)",
"insert into emp values
(11,'ADAMS','ENGINEER',10,'03-15-1996',34000, NULL,2)",
"insert into emp values
(12,'WASHINGTON','ADMIN',6,'04-16-1998',18000, NULL,4)",
"insert into emp values
(13,'MONROE','ENGINEER',10,'12-03-2000',30000, NULL, 2)",
"insert into emp values
(14,'ROOSEVELT','CPA',9,'10-12-1995',35000, NULL,1)"};
string[] InsertDeptRecordsSQL = {
"insert into dept values (1,'ACCOUNTING','ST LOUIS')",
"insert into dept values (2,'RESEARCH','NEW YORK')",
"insert into dept values (3,'SALES','ATLANTA')",
"insert into dept values (4, 'OPERATIONS','SEATTLE')"};
// Insert dept table records first
for (int x = 0; x<InsertDeptRecordsSQL.Length; x++)
{
try
{
SQLServerCommand DBCmd =
new SQLServerCommand(InsertDeptRecordsSQL[x], Conn);
DBCmd.ExecuteNonQuery();
}
catch (Exception ex)
{
Console.WriteLine (ex.Message);
return;
}
}
// Now the emp table records
for (int x = 0; x<InsertEmpRecordsSQL.Length; x++)
{
try
{
SQLServerCommand DBCmd =
new SQLServerCommand(InsertEmpRecordsSQL[x], Conn);
DBCmd.ExecuteNonQuery();
}
catch (Exception ex)
{
Console.WriteLine (ex.Message);
return;
}
}
Console.WriteLine ("Tables created Successfully!");
// Close the connection
Conn.Close();

Example 3
The example uses the Parameters.Add method to create the parameters for
the Update statement, fills a DataSet, programmatically makes changes to
the DataSet, then synchronizes the changes back to the database.
SQLServerConnection Conn =
new SQLServerConnection("host=nc-star;port=4100;User ID=test01;
Password=test01;Database Name=Test");
try
{
string selectText = "select sal, job, empno from emp";
string updateText = "update emp set sal = ?, job = ? where empno = ?";
SQLServerDataAdapter adapter = new SQLServerDataAdapter(selectText,
Conn);
SQLServerCommand updateCommand = new
SQLServerCommand(updateText, Conn);
adapter.UpdateCommand = updateCommand;
updateCommand.Parameters.Add("@sal", SQLServerDbType.Int, 15,
"SAL");
updateCommand.Parameters.Add("@job", SQLServerDbType.VarChar, 9,
"JOB");
updateCommand.Parameters.Add("@empno", SQLServerDbType.Int, 15,
"empno");
DataSet myDataSet = new DataSet("emp");
adapter.Fill(myDataSet, "emp");
// Give employee number 11 a promotion and a raise
DataRow changeRow = myDataSet.Tables["emp"].Rows[11];
changeRow["sal"] = "35000";
changeRow["job"] = "MANAGER";
// Send back to database
adapter.Update (myDataSet, "emp");
myDataSet.Dispose();
}
catch (Exception ex)
{
// Display any exceptions
Console.WriteLine (ex.Message);
}
Console.WriteLine ("DataSet Updated Successfully!");
// Close the connection
Conn.Close();
Analyzing Every Part of the Data Science Puzzle
It is necessary to understand statistics and math to practice data science, and it is also necessary to have basic programming skills that allow you to work with data. Having only one of these skills is not enough: someone with deep statistical knowledge but no programming skills or subject matter expertise is a statistician or a mathematician, and a software engineer who does not possess the analytical skills or subject matter expertise is just a software developer and not a data scientist.
The requirement for data predictions and insights is increasing day by day.
This has happened because more and more sectors and fields have now
realized the importance of data science and how it can change their lives.
Here are some of the many data science roles that now exist across different sectors.
●      Ad tech data scientist,
●      Director of banking digital analyst,
●      Clinical data scientist,
●      Geoengineer data scientist,
●      Geospatial analytics data scientist,
●      Political analyst,
●      Retail personalization data scientist.
In the next section, let's look at the important roles performed by a data scientist.
Collecting, Querying and Consuming Data
Large amounts of unstructured, semi-structured, and structured big data are collected and collated by data engineers. Any data that goes beyond the processing power of a regular or traditional data system, because of its sheer size, the speed at which it arrives, or some other significant characteristic, is known as big data. Data engineering isn't the same as data science, which is all about data analysis, prediction, and visualization. While data scientists and data engineers do quite different work, when a data scientist queries, collects or consumes data in the course of an analysis, they are doing a job quite similar to that of a data engineer.
It is possible to gain a multitude of insights from just one source of data.
However, it is the combination of different sources that provide contextual
information required for better decision making based on data. A data
scientist can work with several databases in different data warehouses or he
can also work on data that is based in one single database as well. Or the
data they need to work with could also be stored using cloud-based
software. It does not matter where the data is stored or in what way a data
scientist obtains it; there is one important step that should not be skipped: the data must be queried. Querying data means writing commands that extract the relevant data set from the data storage system. Generally, SQL is used to query the data.
Whether you use an application or perform custom analysis with the help of a programming language like Python, here is a list of universal file formats you are likely to encounter.
CSV or Comma Separated Value File: It can be read on almost all operating systems and in many web-based data analysis applications, and it is easily handled in scripting languages like R and Python (a short loading sketch follows this list).
Scripts: Most data scientists use programming languages such as R or Python to visualize and analyze data.
Application Files: One of the best tools for a quick and easy spot-check analysis on small (or medium) datasets is Excel. Excel files use the .xlsx or the .xls extension.
Web programming files: If you want to create a custom, web-based data visualization, you can work with D3.js, i.e., data-driven documents, a JavaScript library made for detailed visualization.
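As a minimal sketch of the CSV case (the file name sales_2020.csv is hypothetical), a data scientist might load and spot-check such a file with pandas:
import pandas as pd

df = pd.read_csv("sales_2020.csv")   # load the CSV file into a DataFrame
print(df.head())                     # quick look at the first few rows
print(df.describe())                 # summary statistics for the numeric columns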
Chapter Fifteen: Math for Data Science

Is it mandatory to have a Ph.D. in mathematics to become a data scientist?


Not really. However, the level of knowledge and the additional skills you need essentially depend on what you are looking for. In this section, let's have a glance at certain concepts that are necessary to become a good data scientist.
All the data scientists must have some knowledge of probability theory and
statistics. It is necessary to practice daily if you want to learn about various
algorithms and math used in data science and machine learning.
A top-down approach is recommended in the case of data science. You can start by learning to code, stacking up data, getting acquainted with real-life projects, using libraries and their documentation, and going through tutorials for better understanding. If you realize you need a deeper theoretical understanding, learn how the algorithms work and look into their mathematical basis while dealing with any issues you encounter. At times, hands-on experience is the best learning tool; all the theory you learn wouldn't be of much help if you cannot apply it practically.
Here are some tips you can use to learn about the fundamental math you
need to understand popular algorithms.
Naïve Bayes
Naïve Bayes classifiers are algorithms based on the assumption that "the value of a single feature is independent of the value of any other feature." Bayes' theorem, on which these classifiers are built, says that the probability of a specific outcome can be predicted using the information available about a related, already observed event.
It is necessary to study the basics of probability and conditional probability to understand the working of Bayes' classifiers.
Tip: The best way to study this theorem is by studying conditional
probability and probability in detail.
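For reference, Bayes' theorem (in standard notation, not specific to this book) can be written as

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

where $P(A \mid B)$ is the probability of event A given that event B has been observed. Naïve Bayes classifiers apply this rule, together with the independence assumption above, to estimate the probability of each class given the observed features.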
Linear Regression
This is the most basic kind of regression, used to show the connection between two continuous variables. A set of data points is taken and the trend is extrapolated to make predictions. It is considered an example of parametric machine learning: the training process allows the machine learning algorithm to build a mathematical function that approximates the pattern observed in the training set. This mathematical function is then used to make predictions about the expected result. Such functions are known as models in machine learning lingo.
Tip: The best way to learn this is by taking a course on elementary statistics. For a better, deeper conceptual understanding, it is recommended to learn the derivation of the formula for the residual sum of squares, which is generally covered in more advanced statistics.
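As a worked reference in standard notation (not taken from this book), for a simple linear model $\hat{y}_i = \beta_0 + \beta_1 x_i$ fitted to $n$ data points, the residual sum of squares mentioned in the tip is

$$\mathrm{RSS} = \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right)^2$$

and the training process chooses the coefficients $\beta_0$ and $\beta_1$ that make this quantity as small as possible.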
Logistic Regression
This type of regression is used in situations where the dependent variable is binary. It focuses on estimating the probability of whether an event will take place or not. Logistic regression, like linear regression, is an example of parametric machine learning: the result of the training process is a mathematical function that approximates the patterns present in the training set. The main difference between logistic and linear regression is that linear regression outputs a real number, while logistic regression outputs a probability value. A linear regression model produces a linear function, while a logistic regression model produces a logistic function, also known as a sigmoid function, which converts any input value into a probability between zero and one.
The sigmoid function always results in a probability between zero and one because of the algebra involved: raising a number to a negative exponent has the same effect as raising the number's reciprocal to the corresponding positive exponent.
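For reference, the sigmoid (logistic) function described here is usually written as

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

For large positive $z$ the value approaches 1, for large negative $z$ it approaches 0, and at $z = 0$ it equals 0.5, which is what keeps the output in the range of a probability.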
Tip: As it needs probability and exponents, it is necessary to have an in-
depth understanding of probability and algebra if you want to learn this
algorithm. To understand this better, it is necessary to learn about real
analysis, discrete mathematics, and probability theory.
Neural Networks
These are machine learning models loosely modeled on the structure of the neurons present in the human brain. They use many activation units, known as neurons, to make predictions about various outcomes. Each neuron receives inputs, applies a transformation function, and generates an output. Because they can capture nonlinear relationships in data, neural networks are well suited to tasks such as image and audio processing. The basic concept of a neural network is receiving, transforming, and generating output.
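In standard notation (a reference sketch, not taken from this book), a single neuron computes

$$y = f\left(\sum_{i} w_i x_i + b\right)$$

where the $x_i$ are the inputs, the $w_i$ are learned weights, $b$ is a bias term, and $f$ is the activation (transformation) function, such as the sigmoid shown earlier. A network stacks many of these units in layers, feeding the outputs of one layer as inputs to the next.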
TIP: To understand the math behind neural networks, a course in mathematics and another in linear algebra are a good way to start the preparation. For a deeper conceptual understanding, courses in matrix theory, graph theory and real analysis are considered good.
K-Means Clustering
This is an unsupervised form of machine learning. It is used to categorize unlabeled data, i.e. data that has no defined categories or groups. The variable 'k' denotes the number of clusters the algorithm searches for in the data. The algorithm goes through all the data points and, based on their features, assigns each point to one of the k groups. It uses the idea of distance to assign data points to a cluster. Distance is the space between any two items, and any function that measures the distance between two elements of a set is known as a metric or distance function.
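For reference, the most commonly used distance function in k-means is the Euclidean distance; for two points $p$ and $q$ described by $n$ features it is

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$

and each data point is assigned to the cluster whose center is closest under this measure.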
Tip: It is necessary to check and understand the basics of algebra including
subtraction and addition if you want to learn the distance formula with ease.
If you plan to go any further in this, it is necessary to take a geometry class
that should ideally cover non-Euclidean and Euclidean geometry.
Decision Tree
The decision tree method is used to display every possible outcome of a decision with the help of a tree-like structure, in the form of a flowchart that uses a branching method. Each node in this flowchart shows a test on a specific variable, while each branch presents the outcome of that test. Information theory is needed because it explains how decision trees are constructed. One of the crucial measures in this theory is entropy, which indicates the level of uncertainty in a variable. Using entropy, you can begin constructing the decision tree: higher entropy means more uncertainty, which means the tree must be split in a way that decreases the uncertainty. Information gain measures how many bits of information a split provides. The information gain for every column in the dataset can be calculated to find out which column produces the maximum information gain, and the tree is then split on that column.
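In standard information theory notation (a reference sketch, not from this book), the entropy of a set $S$ whose items fall into classes with proportions $p_i$ is

$$H(S) = -\sum_{i} p_i \log_2 p_i$$

and the information gain from splitting $S$ on an attribute $A$ into subsets $S_v$ is

$$\mathrm{IG}(S, A) = H(S) - \sum_{v} \frac{|S_v|}{|S|}\, H(S_v)$$

The decision tree algorithm picks the split with the largest information gain, i.e. the one that reduces uncertainty the most.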
Tip: It is necessary to learn the basics of probability and algebra to
understand decision trees. If you plan to go deeper into decision trees, it is
necessary to learn more about logarithm and probability. It is recommended
to take a course on Algebra and probability theory.
Some final thoughts…
Even if the mathematics courses seem dry and overwhelming at times, don’t
worry and go through them. Once you are armed with this knowledge, you
will be better equipped than your peers when it comes to algorithms and
their application. Spend some time and go through the different topics
highlighted in the above-mentioned tips to deepen your knowledge and
understanding of these topics. A machine learning library might do the trick
when it comes to probability and statistics. However, a data scientist would
benefit from understanding the math and statistics that power these
algorithms. All this information will surely enable you to make the best
choice possible.
Applying Math to Data Science Models
Math is necessary and crucial to understanding the data and its importance
which is why data science is closely related to the mathematical skills of a
programmer. It is necessary to learn these skills because they are of great
importance in data science. They can be used for decision modeling,
forecasting, and testing of hypotheses. Math is used by data scientists to
form approximations, make decision models, and to make predictions. Here
is a short summary of how math is used in data science models.
How to derive insights from Statistical Methods
Statistical methods are of utmost importance in data science because they help the scientist understand the significance of the data, simulate certain scenarios, validate hypotheses, and make predictions about future events. Advanced statistical skills are hard to find even among statisticians, quantitative analysts, and engineers. If you want to become a data scientist, it is necessary to spend some time learning the basics of statistical methods such as logistic and linear regression, time series analysis, and Bayes classification.
Coding, Programming, and Coding again – An essential part of
the game
It is impossible to avoid coding while working in data science. It is
necessary to learn how to code because you need to create code to order the
computer to analyze, manipulate, and visualize any data. R language and
Python are two of the many programming languages which are useful for
writing scripts for data manipulation, visualization, and analysis. SQL or
structured query language is used for data querying too. Another option is the D3.js JavaScript library, which is a great choice for making interactive, custom, web-based data visualizations. A data scientist should know how to
write codes in one or more languages. While it is necessary to learn coding
to be good at data science, it is not a scary experience and can be learned
with ease. The difficulty level of coding is subjective to the user.
Chapter Sixteen: Data Science in an Organization
A common myth among statisticians is that data science is nothing more than what they have been doing for many decades. As stated earlier, this is just a myth; the main similarity statisticians and data scientists share is that they both utilize data. Data scientists take a new and different approach to looking at data compared with the more commonplace approach employed by statisticians. They employ various computer languages to make sense of data instead of relying solely on statistical formulae and the conventional approaches of mathematics. The biggest difference between statisticians and data scientists is that data scientists need a lot of expertise regarding the subject matter.
Statisticians do not have a lot of expertise in fields and sectors beyond
statistics (although exceptions do exist). They are generally forced to look
for external experts to check, analyze, and verify their findings and
sometimes to understand the meaning of their findings and to find out
which way to move in next.
This section shows how professionals in different fields use data science to improve performance in their work.
●       Architects utilize machine learning to analyze and optimize the energy required in modern building designs.
●       Marketing data scientists use logistic regression to check, analyze, and make predictions regarding customer churn, i.e. how many customers are lost to a product or service provided by a competitor.
●       Clinical data scientists use data science to create personalized, custom treatment plans and to predict and monitor the potential health problems of at-risk patients.
●       Data journalists obtain data from websites in bulk and then examine it to report breaking news or stories.
●       Data scientists in the field of crime analysis prevent and predict criminal activities by looking at and analyzing patterns.
●       Data scientists use machine learning to sort and report important data related to disaster-affected communities.
Communicating the Insights Captured through Data
A good data scientist must have strong written and oral communication skills. This is necessary because they need to communicate their findings and what they have inferred from a given sample of data; if they cannot communicate these things, the data and findings become useless. Data scientists in an organization are responsible for understanding the insights from data and explaining them to other departments so that everyone can make the most of them. Communicating insights and observations clearly and in meaningful ways, coupled with written narratives, is important.
Leveraging Cloud-Based Solutions
Cloud-based solutions offer the power of data analytics to professionals who aren't well versed in data skills such as coding. New and private cloud applications, such as trusted analytics platforms, are dedicated to making it easier for companies to deploy their big data initiatives. Cloud services like Tableau offer automated data services, from basic clean-up and statistical modeling to analysis and data visualization, with hardly any coding. With fundamental knowledge of statistics and mathematics and an understanding of data insights, tools such as Tableau can deliver significant results without writing code or scripts.
Cloud-based solutions could be used by organizations to meet their data
science objectives. A well-trained in-house staff capable of designing,
running, visualizing, and understanding quantitative results generated by
such platforms is needed. Cloud-based solutions aren’t a replacement for
those with data science expertise. Instead, they are tools that can be used to
attain the organization’s objectives.
Chapter Seventeen: Creating Data Visualization
to Communicate the Insight

Data visualization is an integral aspect of data science and analysis. The pictorial and graphical representation of data is known as data visualization. It is essentially a way of representing data so that its patterns, and the insights it provides that cannot be presented any other way, become visible. Data consists of only bits and bytes stored on a computer's hard drive, and to make sense of the available data, it is important to visualize it. In this section, let's look at the term data visualization and all that it encompasses.
For example, something as simple as loading data into an excel sheet or
spreadsheet software can be termed data visualization. The invisible data is
converted into a relatively easy-to-understand table. However, tables by
themselves don’t offer any significant insights into data and when dealing
with massive amounts of data, tables can become cumbersome. For
instance, geographical patterns cannot be studied unless the dataset is
visualized in the form of a map.
Remember, whenever you start visualizing the available data using tools
and software, it helps discover some hidden meanings and patterns that
weren’t obvious previously. Each new visualization is likely to provide
some extra insights into our data. Some of the inferences might already be
known to the user, whereas others might be complete. Some insights might
mark a new beginning to the story, while others might be the result of an
error in the dataset itself that was discovered during data visualization.
Visualization provides a new perspective and direction on the data set. Data
can be visualized in different ways.
Tables can be used when the data points are relatively small in number.
Tables assign labels to different values and help with data organization and
classification. Tables essentially lend structure and organization to the raw
data. If you use a one-dimensional data set, use tables to compare the
values. You can arrange the data in a tabular format that makes data
analysis easier. While comparing multiple dimensions at once, tables aren’t
effective.
On the other hand, charts help map different dimensions in the data and
help visualize its individual properties, features, or characteristics in
geometric shapes. While using charts for data visualization, it is important
to remember that positions matter the most. For instance, while using a
scatterplot, a two-dimensional data point can be easily mapped using its x and y coordinates. A third dimension can then be added to this 2-D representation using the color or size of the plotted markers.
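As a minimal sketch of this idea (the data points below are invented), a scatterplot with a third dimension encoded as marker size can be drawn with matplotlib:
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 5]              # first dimension -> x coordinate
y = [2.1, 4.0, 5.9, 8.2, 9.8]    # second dimension -> y coordinate
sizes = [20, 40, 60, 80, 100]    # a third dimension encoded as marker size

plt.scatter(x, y, s=sizes)
plt.xlabel("feature one")
plt.ylabel("feature two")
plt.title("Two dimensions as position, a third as marker size")
plt.show()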
Using maps for data visualization is a wonderful idea. Maps are often used
for graphically displaying geo-related data for analyzing and displaying.
This form of data representation makes the data easy to understand. It helps
the users visually see the regional distribution of data instead of going
through several numbers and points. For instance, crime mapping is
commonly used in law enforcement. Analysts using crime mapping use
maps for a visual representation of datasets to identify and analyze patterns
of criminal incidents. It shows the areas where crimes take place, making it
easier to take the required corrective action to reduce such instances. Maps
can also be used to understand any geographical relationship with the data.
For instance, an analyst can compare crime rates in urban areas with those in rural regions.
This brings us to another important tool for data visualization- graphs.
Graphs are meant to show interconnections in the data points. The data
point position is calculated by graph layout algorithms, which allows us to
see the structure within the network. All that’s left to do is find the right
way for modeling the network for data visualization. While doing this, it
would do you good to remember that not all datasets are related, and even if they are, obtaining the insights might not always be a straightforward
process.
Once the data is visualized using any of the above techniques, the next step
is to understand and gain insights or learn from the visualizations. To make
sense of the data, here are some questions you can use.
●     What is visible in this image? Is it close to what I imagined or
expected it to be?
●       Can I see any interesting patterns in the visualization?
●      What does all this mean in relation to the data points entered into the
system?
Don’t be disappointed if the data visualization doesn’t immediately offer
eye-opening insights. Whenever you use data visualization, you will surely
discover something more about the data that you previously missed.
Chapter Eighteen: Documenting Your Insights
from the Data
In most cases, you would start working with a previously unseen dataset
and start with a couple of assumptions and expectations about it in mind.
Documenting the initial thoughts you have about the data is a good idea.
Once you make a note of this, it helps you identify any preexisting bias
about a data set or trend, which reduces the risk of misinterpreting the data points.
Whenever you need to analyze any data, don’t forget to use this tip.
Documentation is a vital step in data processing and, ironically, it is the step most people skip. Data processing involves a lot of plotting and wrangling to obtain the desired results, and looking at numerous charts, numbers, and figures can become quite overwhelming and confusing. So, take a few minutes and document the following things:
●       Why did I start creating this chart?
●      What exactly have I done to the data to create it?
●       What does the chart show me?
So, how can you visualize the data set you need to analyze? There are various tools available for this process, and each of them is good at a specific function. Before you opt for a tool, check whether it handles data visualization and data wrangling. You can use two separate tools, one dedicated to each purpose, but if your tools don't cover both tasks between them, you will essentially spend your time exporting and importing large chunks of data. Here are a couple of data visualization and wrangling tools, with a short sketch of the programming-framework approach after the list:
●       Spreadsheets like Excel or Google Sheets
●       A programming framework like R or Pandas
●       Visualization libraries, for example, d3.js, Flare
●       Data Wrangling software such as Google Refine or Datawrangler
●       Non-programming visualization tools such as Tableau
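As a minimal sketch of the programming-framework approach mentioned above, the snippet below assumes Python with pandas and a hypothetical sales.csv file containing region and revenue columns; it wrangles and visualizes the data without leaving a single tool:

# A brief sketch of doing both wrangling and visualization in one tool
# (Python with pandas). The file name and column names are hypothetical.
import pandas as pd

df = pd.read_csv("sales.csv")                    # hypothetical file

# Wrangling: drop incomplete rows and aggregate revenue by region.
summary = (df.dropna()
             .groupby("region")["revenue"]
             .sum()
             .sort_values(ascending=False))

# Visualization: a quick bar chart, without exporting to another tool.
summary.plot(kind="bar", title="Revenue by region")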
Chapter Nineteen: Applications of Data Science
Now that you are aware of the different benefits of data science, let's look at some of its real-life applications. Once you start reading through the points discussed in this chapter, you will understand how all the concepts you have learned about are applied in the real world.
Fraud and Risk Detection
One of the first applications of data science was in the finance sector. All
the bad debts and resultant losses created a tremendous burden on the
companies in this sector. Whenever a loan had to be approved or processed,
these companies usually gathered a lot of data. They started putting this data to use by hiring data scientists to reduce the risk of losses caused by bad debts. Banks started using customer profiling and other crucial data points to analyze the probability of default on a loan before it is sanctioned. It also helped this sector capitalize on the power of targeted marketing, because they could now use the data they had to market their products to those with purchasing power.
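To give a flavor of how such risk scoring might look, here is a heavily simplified sketch assuming scikit-learn and an entirely made-up set of applicant features; real credit-risk models are, of course, far more elaborate and carefully validated:

# A heavily simplified sketch of scoring loan-default risk.
# Features and labels below are made up; real credit models use far
# richer customer profiles and much more careful validation.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns: [income (thousands), existing_debt (thousands), years_employed]
X_train = np.array([
    [45, 20, 1], [80, 5, 10], [30, 25, 0.5],
    [120, 10, 15], [50, 30, 2], [95, 8, 7],
])
y_train = np.array([1, 0, 1, 0, 1, 0])    # 1 = defaulted, 0 = repaid

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

applicant = np.array([[60, 18, 3]])       # a hypothetical new applicant
prob_default = model.predict_proba(applicant)[0, 1]
print(f"Estimated probability of default: {prob_default:.2f}")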
Healthcare
Medical Image Analysis: Various processes such as content-based medical
image indexing, machine learning methods, and wavelet analysis are
applied to procedures including organ delineation, tumor detection, etc.
Drug Development: On average, it takes around twelve years to develop
and create an official drug. The process of discovering a drug is
complicated and involves different areas such as research, testing,
development, and so on. Billions of dollars can be saved from being
squandered on impractical or unsuccessful drugs. Using data science and
machine learning, the entire process of research and development can be
simplified. At every step, data forecasting and analysis can be used for initial screening of the different drug compounds to determine their likely success based on biological factors. Algorithms and data science can also be used to
forecast the probable ways in which a drug could potentially react upon
human use. All this is based on advanced mathematical modeling and
simulations instead of the cumbersome and time-consuming lab
experiments.
Virtual Assistance for Customer Support: These days, mobile applications
can provide a more effective solution by bringing the doctor to the patient
instead of the other way around. The AI-powered mobile applications with
a chatbot provide healthcare support at a basic level. There are applications
that can be used to set reminders about taking medicines or keeping track of
doctor’s appointments. This approach reduces time wastage for patients and
doctors alike. Patients needn’t wait in lengthy queues for a consultation and
doctors can now concentrate on cases that demand their immediate
attention.
Internet Search
Internet search might be the first thing that comes to mind when you think
of data science applications. All the popular search engines, including
Google, Bing, and Ask, use data science to provide the best results for the
search query within seconds. Without data science, Google wouldn’t be the
all-powerful search engine we know it as today.
Targeted Advertising
One of the most common applications of data science is marketing research, which is an important aspect of digital marketing.
Most of the advertisements on digital billboards at the airport and display
banners on various websites are decided using data science algorithms. Data
science is the reason why digital adverts boast a much higher click-through
rate than traditional advertisements that aren’t based on a specific user’s
past behavior.
Website Recommendations
Have you ever shopped on eCommerce platforms such as Amazon or surfed
through other online platforms such as Netflix, or even social media sites?
If yes, you might have noticed the tab showing “recommendations.” Did
you ever consider how a seemingly unknown online platform could offer
suggestions or recommendations to different users? All this is possible
through data science. All online platforms gather user data and information
and monitor their usage pattern. Based on all this, they offer suggestions
and recommendations to enhance your overall experience.
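As a toy illustration of the idea, the sketch below uses numpy and a tiny invented user-item ratings matrix to find the user most similar to a target user (by cosine similarity) and recommend an item that similar user rated highly; production recommender systems are far more sophisticated:

# A toy sketch of one recommendation idea: find the most similar user
# (by cosine similarity over ratings) and suggest something they liked.
# The ratings matrix is invented; 0 means "not rated yet".
import numpy as np

items = ["Item A", "Item B", "Item C", "Item D"]
ratings = np.array([
    [5, 3, 0, 1],   # user 0
    [4, 0, 0, 1],   # user 1
    [5, 1, 5, 4],   # user 2
    [0, 1, 5, 4],   # user 3 <- the user we recommend for
], dtype=float)

target = 3
similarities = []
for user in range(len(ratings)):
    if user == target:
        similarities.append(-1.0)   # skip comparing the user with themselves
        continue
    a, b = ratings[user], ratings[target]
    similarities.append(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

most_similar = int(np.argmax(similarities))

# Among items the target user has not rated, pick the one the most
# similar user rated highest.
unrated = ratings[target] == 0
scores = np.where(unrated, ratings[most_similar], -1)
print("Recommend:", items[int(np.argmax(scores))])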
Advanced Image Recognition
You must have noticed that when you upload a picture with your friends on
Facebook, you start getting suggestions to tag your friends. This automatic
tagging suggestion feature uses a face recognition algorithm. Google also offers the option to search for images by uploading a particular image; it provides related search results by using image recognition.
Speech Recognition
Google Voice, Siri, Alexa, and other similar voice assistant products rely on speech recognition. A virtual voice assistant certainly makes life easier. These days, you can send a message, order something online, or even make calls without looking at your device.
Airline Route Planning
Companies in the aviation sector use data science to identify areas for
strategic improvement. The growing competition in the industry reduces
profitability due to differing margins and competitive pricing. This, coupled with the ever-increasing costs of fuel and of strategic marketing campaigns (discounts and perks), can eat into their existing profit margins. These companies can use data science to predict flight delays and to decide which type and class of aircraft to buy based on historical data. They can also use data science to decide whether to offer nonstop flights or flights with layovers to increase profitability, and to guide decisions about customer loyalty programs.
Gaming
Machine learning algorithms are used in many recently developed and
released games that allow players to upgrade their virtual avatars. Even in motion gaming, the computer analyzes the player's previous moves and adapts the game accordingly to provide the best experience. EA Sports, Zynga,
Sony, and Nintendo have taken the experience of gaming to the next level
using various data science algorithms. 
Augmented Reality
Augmented reality is an exciting application of data science. Virtual reality headsets combine computing power, algorithms, and data to create the best possible viewing experience, so data science and virtual reality are closely related. You might have seen or even played
Pokémon Go, a hugely popular game based on the enticing concept of
augmented reality. This game gives users the ability to explore their
surroundings and search for Pokémons in the virtual world while taking
action in the physical realm. For instance, a user could walk around his
neighborhood while the game screen shows he is exploring a virtual path.
The developers of this game used the data from Ingress, another application
from the same company, to choose the exact locations of gyms and
Pokémons. Data science in the virtual reality sphere will make even more sense once the technology becomes accessible and cheap enough for consumers to use it as often as other applications.
Conclusion
Data science education is still in its formative phase of development; it is evolving into a self-supporting discipline that produces professionals with abilities distinct from, and complementary to, those of experts in the computing, information, and statistical sciences. Nevertheless, whatever disciplinary home it eventually settles into, everything points to the strong growth of data science training, which will indelibly shape the undergraduate students of the future.
In fact, fueled by rising student interest and business demand, data science education will probably become a staple of the undergraduate experience. There will be growth in the number of students majoring, minoring, earning certificates, or simply taking courses in data science as the importance of data skills becomes more widely recognized.
Yes, learning data science can get boring and tough at times, but as
mentioned in the book, do not lose hope! Choose a data set that interests
you and learn various concepts. With respect to programming, start with
simple programs like the ones I have given in the book and then gradually increase the difficulty level.
The way to become the best in data science is to work on as much data as
you can and build a strong community presence and sector expertise, i.e., be
a subject matter expert in an area that will help you explain the data trends
better. It will also place you above the rest in the job market. The book also
sheds some light on the skills you need to develop to be a data scientist.
These skills will help you convey to management what they need to do to improve their business. Thus, you must learn how to work with data and identify the relationships between different variables in a data set, so you can advise management on what it needs to do to improve revenue.
Adopting a broad data science education requirement for undergraduates will endow succeeding generations of students with the basic understanding of data science they need to be responsible citizens. In addition, continuing education programs, data science boot camps, summer schools, career accelerators, and incubators will deliver a further stream of talent. Together, these make up the emerging landscape of data science education, one that feeds many channels of specialists and generalists into society: citizens empowered by fundamental skills to examine, interpret, and draw insights from data.
Now that you are armed with all the information you need, all that's left for
you to do is start implementing the practical tips and suggestions given in
this book. Remember, it takes consistent effort, time, practice, and lots of
patience to learn a new skill. If you are willing to put in that work, you can certainly elevate your knowledge of data science in no time.
Thank you and all the best!