You also use an algorithm when you solve a puzzle with a pen and paper. A
data scientist needs to understand what an algorithm is since most systems
used in data analytics are based on algorithms. It is also important for a data
scientist to know how to design various algorithms to perform different
functions. Let us assume that you key in 10 random numbers. For example,
you enter the numbers 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. The objective is to find the
third-largest number in the set of numbers entered. To do this, you need to
find the right logic to use.
The data scientist needs to write the logic and develop an algorithm to find
the third-largest number. We will look at different techniques throughout
the book.
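As an illustration, here is a minimal Python sketch of one possible logic for the third-largest-number problem. The function name third_largest is made up for this example, and treating repeated values as one value is an assumption, since the text does not say how duplicates should be handled.

```python
import heapq


def third_largest(numbers):
    """Return the third-largest distinct value in numbers."""
    distinct = set(numbers)  # assumption: repeated values count once
    if len(distinct) < 3:
        raise ValueError("need at least three distinct numbers")
    # nlargest returns the three biggest values in descending order,
    # so the last of them is the third-largest.
    return heapq.nlargest(3, distinct)[-1]


print(third_largest([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))  # prints 8
```

Sorting the whole list and picking the third item from the end would work too; keeping only the three largest values simply avoids sorting numbers we do not need.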
Programming
You should learn several languages, such as Perl, Python, Java, SQL, C
and C++. Python is the most common language used to develop models.
These languages help you collect, clean, organize and analyze data. You can
also use them to work with unstructured data sets.
Analytical Tools
Every data scientist needs to learn different analytical tools and understand
the use of SaaS (Software as a Service). This understanding will help you
obtain valuable information from the cleaned and processed data sets used
for analysis. R, SAS, Hadoop, Pig, Hive and Spark are some of the most
popular tools to use for data analysis. You can demonstrate your expertise
in the use of these tools through certifications.
Using Unstructured Data
Data scientists need to collect, manage, clean, understand and analyze data
collected from different sources. These data are not always structured, and
they can be in different forms, such as emails, texts, videos, images and
other unstructured data. For example, if the data scientist is working on a
marketing project or assisting the marketing team, he needs to know how to
handle social media. He also must know how to collect the data and use it
for his analysis.
Non-Technical Skills
Let us now look at some non-technical skills that a data scientist must have.
These skills are personal skills that a user must develop, and these are
different from certifications and qualifications.
Strong Business Acumen
A data scientist needs to have business acumen. He must understand the
different elements that are a part of a successful business model. Otherwise,
he cannot choose the right technical tools and skills for the analysis. He
also cannot identify the problems that need to be solved for the business to
grow and sustain itself, and it becomes hard for him to help the organization
explore different business opportunities.
Communication Skills
A data scientist understands data better than everybody else in the
organization. If a data scientist wants to be successful in his role and the
organization to do better, the data scientist needs to communicate the
interpretation and analysis of the data to the management. This is of utmost
importance since the data scientist must communicate the findings to
someone non-technical, and thus a data scientist must have the necessary
communication skills.
Intuition
This is one of the most important non-technical skills that every data
scientist must have. He needs the ability to perceive hidden patterns in
the data set. He should also know where to look in the data set so that he
can add more value to the analysis. This skill makes the data scientist
efficient in his work, and it only comes as the data scientist gains more
experience.
Chapter Six: Tools Required for Data Science
Now that you have understood the importance of data and why we need
data science, let us now understand the tools involved in Data Science. Data
science has various tools and the important tools are listed here.
R Programming Language
R is a programming language widely used by data scientists and by many
organizations for data analysis. One reason for its popularity is that it
supports object-oriented programming. It is mainly used for statistical
analysis, reporting, and graphical visualization.
Python Programming Language
Python is another programming language that supports object-oriented
programming, and it is widely used for both software development and data
science. Python runs on multiple platforms such as Windows, Mac and Linux.
It can also be used to create many kinds of applications, such as web
applications, applications with a graphical user interface, and numeric
applications.
Structured Query Language
Structured Query Language (SQL) is used to create new databases and
tables, to update and retrieve the records in a database, and to delete
records from it. Data scientists use SQL to manage the data stored in a
database.
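The operations described above can be tried out from Python with the sqlite3 module of the standard library. This is only a sketch: the customers table, its columns and the sample rows are made up for the illustration, and the database lives in memory so nothing is written to disk.

```python
import sqlite3

# An in-memory database, so the example leaves no files behind.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create a table (hypothetical table and column names).
cur.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)

# Insert a few records.
cur.executemany(
    "INSERT INTO customers (name, city) VALUES (?, ?)",
    [("Mary", "Boston"), ("Ravi", "Pune"), ("Chen", "Austin")],
)

# Update one record and delete another.
cur.execute("UPDATE customers SET city = ? WHERE name = ?", ("Chicago", "Mary"))
cur.execute("DELETE FROM customers WHERE name = ?", ("Chen",))
conn.commit()

# Retrieve the remaining records.
rows = cur.execute("SELECT name, city FROM customers ORDER BY id").fetchall()
print(rows)  # [('Mary', 'Chicago'), ('Ravi', 'Pune')]
conn.close()
```

The question marks are placeholders: the values are passed separately from the SQL text, which is the usual way to avoid building queries by string concatenation.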
Hadoop
Hadoop is an open-source framework that can store and process big data. It
provides big data analysis using a distributed computing framework. It also
provides flexibility, as the user can store data without pre-processing it
first. You can therefore store as much data as you need and decide later
which of it to use.
SAS
SAS (Statistical Analysis System) is one of the most important data science
tools and is designed especially for statistical operations. SAS is used by
big companies to find patterns in their data. It uses the Base SAS
programming language to carry out statistical modeling. It is used mostly
by experts, professionals and firms that depend on reliable commercial
software. SAS offers many statistical libraries and tools that data
scientists can use for modeling and for keeping their data organized. Even
though SAS is reliable and has very strong support from companies, it is
very costly, and hence it is used mainly in very large organizations. SAS
also compares poorly with some of the better open-source tools. Moreover,
many SAS libraries and packages are not included in the base pack, and
accessing them requires a costly upgrade.
Apache Spark
Spark is a very powerful analytics engine and one of the most popular tools
among data scientists. Spark is specially designed to handle both batch
processing and stream processing. It has numerous APIs that help data
scientists access data repeatedly for machine learning. It improves on
Hadoop MapReduce and can run some workloads up to a hundred times faster.
Spark also has numerous machine learning APIs, which help data scientists
gain insights and carry out powerful data analysis.
Spark’s ability to handle streaming data gives it an edge over other big
data platforms: it can process real-time data, whereas many other tools
process only historical data, and only in batches. Spark offers numerous
APIs that can be used from Python, Java and R. It works especially well
with Scala, the programming language in which Spark itself is written.
Spark is also efficient at managing clusters, which is another advantage
over Hadoop, which is often used mainly for storage. Its cluster management
system allows Spark to process data at great speed.
D3.js
D3.js is a JavaScript library for creating interactive visualizations in a
web browser. Its many APIs and functions can be used to build dynamic
visualizations and to analyze a data set within the browser. Animated
transitions are an important feature of D3.js. It can make a document
dynamic by updating it as the data changes, so that the visualizations on
the web page stay current.
By combining D3.js with CSS, you can create polished, animated visuals and
display specific, customized graphs on web pages. It is one of the most
important tools for data scientists, especially those who work with
IoT-based devices that require client-side visualization and data
processing.
Chapter Seven: Data Science and Artificial Intelligence
The next thing we need to focus on is artificial intelligence, and whether
it and machine learning are the same concept. As mentioned earlier, data
science is a broad field that brings together many concepts, including data
mining, big data and artificial intelligence. Since data science is a
fairly new field, it will keep growing, and understanding the concepts
around it will make a big difference.
Statistics is an important topic in data science and also in machine
learning, and it underpins both machine learning and artificial
intelligence. The two can be very similar in some cases and entirely
different in others. Let us take a look at each to understand how they work
in data science.
What is Artificial Intelligence?
The very first thing that we have to delve into is the basic idea of
artificial intelligence and how exactly it differs from machine learning.
Initially, you may think that the two concepts are the same, or you may not
understand how they work. As you learn more, the major differences between
them become clear.
Artificial intelligence is a technique, already visible in many devices and
in manufacturing, for copying human capabilities in tasks that require
mental effort. In the past few years, the meaning of the term has changed a
little, but the idea is the same. When you implement artificial
intelligence, you enable the program or the machine that you are working on
to work and think like a human brain. This means that a machine enabled
with artificial intelligence can complete tasks efficiently, much as a
human brain can.
For people who are not from the world of technology, it would seem like
machine learning is the same as artificial intelligence. But definitely, there
are differences between them.
The term artificial intelligence can conjure up images reminiscent of a
science fiction movie. However, we regularly use devices powered by
artificial intelligence that are steadily making our lives easier. Here are
some of the benefits of artificial intelligence.
Most of us not only use smartphones but also depend a lot on them. They
are no longer just devices for communication but have become a source of
entertainment too. From making payments to checking financial
transactions, reading news and staying in touch with others, we use our
smartphones for several purposes. These days it almost seems like these
devices have become an extension of our physical bodies. Believe it or not,
voice assistants such as Siri and Cortana, used on iOS and Windows-powered
devices, are examples of real-life applications of AI. Whenever you tag
other users in posts on social media platforms such as Facebook or
Instagram, the suggestions come from AI algorithms that perform facial
recognition. Financial institutions and other organizations in the banking
sector also regularly use artificial intelligence to organize and manage
vast amounts of data. It is a great way to detect fraud in the system.
Unlike human beings who cannot work 24/7 without rest, a machine can do
this without any breaks. Once these machines are programmed to use
artificial intelligence for functioning, it reduces human involvement in
getting things done. Human beings are bound to make errors, but once a
machine is programmed with the right codes and executes it efficiently, the
likelihood of errors reduces. Machines are good at multitasking. Therefore,
the efficiency of results produced by them is also higher. Artificial
intelligence plays a significant role in the medical sector. Assessing health-
related data and determining the probability of risk of certain diseases or
contributing risk factors can be conducted successfully using artificial
intelligence. These days, software can help monitor neurological
disorders by simulating how the human brain functions.
Once the machines are equipped with artificial intelligence, they can be
used for hazardous exploration. These machines could easily overcome any
human limitations. They are not only equipped to perform difficult tasks but
can do so efficiently and effectively. Another great benefit of artificial
intelligence is that machines cannot get bored of mundane and repetitive
tasks. Irrespective of how boring the task might seem, the machines will not
get bored.
Chapter Eight: Data Science and Machine Learning
6. get: This command copies all files that match the source path
provided by the user from HDFS to the local file system.
9. mv: It is used to move a file or directory from one place to
another within HDFS. Command: hdfs dfs -mv
source_dir_filename destination_dir
10. cp: It is used to copy a file or directory from one place to
another within HDFS. Command: hdfs dfs -cp
source_dir_filename destination_dir
11. moveFromLocal: This command copies content from the local file
system to its destination within HDFS and, when the copying
operation is successful, removes the content from the local file
system.
12. tail: It is used to show the last kilobyte of a file on the
console.
13. rm: It is used to remove files and directories from a path that
is specified by the user.
14. expunge: This command is used to empty the trash.
15. chown: This command is used to change the owner of a file or
directory.
16. chgrp: This command is used to change the group of the file.
17. du: This command is used to display the disk usage of all the
files in a directory, for the path or URL specified by the user.
The sizes are displayed in bytes.
18. df: This command displays the available free space.
19. touchz: This command is used to create an empty file at the
given path, with the current time as its timestamp.
20. appendToFile: This command is used to append one or more sources
from the local file system to the destination file.
21. count: This command is used to count the total number of files
and directories.
22. checksum: This command returns checksum information for a
specific file.
Example 1
This example uses only variables and constants: a program to print Hello
World
> # We can use the print() function
> print("Hello World!")
[1] "Hello World!"
> # Quotes can be suppressed in the output
> print("Hello World!", quote = FALSE)
[1] Hello World!
> # If there are more than 1 item, we can concatenate using paste()
> print(paste("How","are","you?"))
[1] "How are you?"
In the program above, we have used print(), which is a built-in function, to
print the required string Hello World! The quotes that you see are printed
by default. To avoid that, we can add the argument quote = FALSE. Also, if
there is more than one item, you can use the paste() or cat() functions to
concatenate the strings together.
Example 2
We can add the elements of the vector by the function sum()
> sum(2,7,5)
[1] 14
> x <- c(2, NA, 3, 1, 4)
> x
[1] 2 NA 3 1 4
> sum(x) # if any element is NA or NaN, result is NA or NaN
[1] NA
> sum(x, na.rm=TRUE) # this way we can ignore NA and NaN values
[1] 10
> mean(x, na.rm=TRUE)
[1] 2.5
> prod(x, na.rm=TRUE)
[1] 24
When a vector contains NA (not available) or NaN (not a number), functions
such as sum(), mean() and prod() return NA or NaN respectively, unless
na.rm = TRUE is supplied.
Example 3
This example will deal with an interactive screen, i.e. take inputs from the
user.
my.name <- readline (prompt="Enter name: ")
my.age <- readline (prompt="Enter age: ")
# convert character into integer
my.age <- as.integer(my.age)
print(paste("Hi,", my.name, "next year you will be", my.age+1, "years old."))
Output:
Enter name: Mary
Enter age: 17
[1] "Hi, Mary next year you will be 18 years old."
As you can see, we have used the function readline() to get input from the
user.
Here, you can see that with the prompt argument you can display an
appropriate message for the user.
In the above example, you convert the input age, which is a character vector
into an integer by the function as.integer().
This is necessary for doing further calculations.
Example 4
In this example, we will find if a year is a leap year or not by taking inputs
from the user.
# Program to check if the input year is a leap year or not
year = as.integer(readline(prompt="Enter a year: "))
if((year %% 4) == 0) {
  if((year %% 100) == 0) {
    if((year %% 400) == 0) {
      print(paste(year,"is a leap year"))
    } else {
      print(paste(year,"is not a leap year"))
    }
  } else {
    print(paste(year,"is a leap year"))
  }
} else {
  print(paste(year,"is not a leap year"))
}
Output 1:
Enter a year: 1900
[1] "1900 is not a leap year"
Output 2:
Enter a year: 2000
[1] "2000 is a leap year"
Here we have used the logic that a leap year is exactly divisible by 4 except
for the years ending with 00. The century year is a leap year only if it is
perfectly divisible by 400.
A nested if else is used to implement the logic in the above program.
Example 5
In this example, we will find the HCF of two numbers
# Program to find the H.C.F. of two input numbers
# define a function
hcf <- function(x, y) {
  # choose the smaller number
  if(x > y) {
    smaller = y
  } else {
    smaller = x
  }
  for(i in 1:smaller) {
    if((x %% i == 0) && (y %% i == 0)) {
      hcf = i
    }
  }
  return(hcf)
}
# take input from the user
num1 = as.integer(readline(prompt = "Enter first number: "))
num2 = as.integer(readline(prompt = "Enter second number: "))
print(paste("The H.C.F. of", num1,"and", num2,"is", hcf(num1, num2)))
Output:
Enter first number: 72
Enter second number: 120
[1] "The H.C.F. of 72 and 120 is 24"
The user is asked to enter two integers, which are passed to a function
that returns the HCF.
In the function, we first determine the smaller of the two numbers, since
the HCF can only be less than or equal to the smaller number.
Then a ‘for’ loop is used to go from 1 to the smaller number.
In each iteration, we check whether the current number divides both input
numbers exactly.
If it does, it is stored as the HCF. Once the loop is complete, the result
is the HCF, that is, the largest number that divides both numbers exactly.
Example 6
In this example, we will show you how to develop a calculator of your own
# This program implements a simple calculator that can perform
# simple operations such as add, subtract, multiply and divide.
add <- function(x, y) {
  return(x + y)
}
subtract <- function(x, y) {
  return(x - y)
}
multiply <- function(x, y) {
  return(x * y)
}
divide <- function(x, y) {
  return(x / y)
}
# take input from the user
print("Select operation.")
print("1.Add")
print("2.Subtract")
print("3.Multiply")
print("4.Divide")
choice = as.integer(readline(prompt="Enter choice[1/2/3/4]: "))
num1 = as.integer(readline(prompt="Enter first number: "))
num2 = as.integer(readline(prompt="Enter second number: "))
# pick the symbol and the result that match the chosen operation
operator <- switch(choice,"+","-","*","/")
result <- switch(choice, add(num1, num2), subtract(num1, num2),
                 multiply(num1, num2), divide(num1, num2))
print(paste(num1, operator, num2, "=", result))
Output:
Introduction to Python
Before learning Python, it is worth understanding why it matters.
Knowing Python is one of the valuable skills required for a career in data
science. Python's use is increasing, and this trend is expected to continue
as data science experts keep developing the Python ecosystem.
There are many programming languages you can learn to work with when it
comes to coding, but few provide the advantages of working with Python.
Python is so popular and so widely used that some operating systems come
with a version of Python pre-installed. Learning it will help you
understand how coding is done, and you will reap its benefits quickly.
There are numerous things you will love about working with the Python
programming language. You might be a little overwhelmed when you start
learning a new language, because learning languages isn’t easy and it can
seem like a time-consuming process. You might think that you will not have
enough time or resources to get it done. You might even worry that each
time you want to work on the code, you have to begin from scratch.
Python helps with all of these concerns, and it will make you feel more
confident in no time, so you won't feel like a beginner every time you need
to code. The Python programming language has many benefits. It is an
object-oriented language, it is open source, and it supports many
different styles of programming.
Now, what exactly is meant by open source in coding? It essentially means
that you can get Python's code for free and build programs with it without
having to worry about licensing fees. This encourages advanced programmers
to contribute to Python and make the adjustments, developments and updates
that are required to keep this coding language modern and up to date.
Python can handle all the coding you would want to do. It can handle
anything from basic coding that you would do while learning a language
and can also handle data analysis and machine learning. These advantages
make it amply clear why anyone would want to use python for coding.
The career opportunities opening up for data scientists are steadily
increasing by the day. It is believed the employment rate in this field can
increase by up to 28%. An efficient and effective way to capitalize on the
ever-growing opportunities is by learning various programming languages.
The Importance of learning Python
Python works seamlessly on different platforms, such as Mac, Linux and
Windows. It runs on an interpreter, which means the code can be executed
as soon as you write it. This makes the entire process of prototyping quite
simple. The syntax of Python is quite easy to understand and resembles the
English language. Compared to other programming languages, Python has a
syntax that allows developers to write programs quickly and in fewer lines.
Functioning of Python
Python can connect to database systems, and it can read and modify files.
It can manage big data and carry out complicated mathematics. It is used on
servers to build web applications. As mentioned earlier, Python also
facilitates rapid prototyping.
Python is a versatile language when it comes to website development. It is
known to be instrumental in AI and data science and is used extensively in
the Internet of Things. Industry leaders such as YouTube, Amazon, Google
and Facebook use Python as a programming language. Other companies are
adopting Python as well, which makes the demand for Python developers very
high.
Python is simple to learn and code because of its simple syntax and
easy-to-read code. Python programs are usually shorter than equivalent
programs in other languages such as C or C++. Developers also tend to
become comfortable with it more quickly than with other languages, so
Python developers consider their coding fun and comfortable.
Python is considered one of the more portable languages, which means that
users can run code written in Python on other platforms, usually without
any changes to the code. Python is also an extensible programming language:
parts of a program can be written in C or C++, compiled, and then called
from Python code.
Python is a high-level programming language that is fast, open, simple to
learn, and user-friendly. As mentioned earlier, it works well across
platforms. Earlier, Python was considered tough for data analysis, and only
people who were well-versed in statistics and comfortable with programming
would use it. Now, however, tools for almost all aspects of scientific
computing are easily available in Python.
Why Use Python?
There are different programming languages available these days. All these
choices can make it difficult to select a specific language to learn. While
evaluating all the different languages, you would realize that the processing
power and coding efficiency of python is unlike any other. It is easy to read,
and it will also work with almost all operating systems. Here are some of
the benefits of this programming language that will help you make a better
decision:
Robust And Simple To Use
Python was created to be simple to learn and to code in. It can be learned
quickly and used effectively by beginners and experts alike. While learning
this language, you will realize that the code is easy to understand, and
this feature comes in handy while working on complex programs.
It imposes few constraints and is easy to debug. Compared with other
languages like C, C++ or Java, implementing code in Python takes less time,
which in turn leaves coders more time to work on their algorithms.
The Large Community That Python Has!
Python is considered to be one of the most preferred languages for coding.
It is coder-friendly and easy to learn, which is why people all over the
world use it. This opens up several opportunities to learn about the
language and what exactly it can do for you. There are public forums where
you can post questions and get answers, and there are several tutorials you
can go through too. This comes in handy when you are just getting started
with Python.
The Program Is Open Source
Open source means the original Python can be downloaded, and it is
available to the public for free. Of course, a few people have taken a part
of Python, developed it to meet certain specifications and requirements,
and tried to sell it. Since the code is open source, you can still download
the original and unmodified version. You can work with the basics of Python
without having to buy anything else to get the program and its components
to work.
Scalable
Python is a highly scalable programming language. It offers practical
solutions to problems that are harder to solve in other programming
languages like Java. Many businesses have used it to build apps and tools
very quickly.
Options Of Support Libraries
The vast collection of libraries offered by Python can be used in machine
learning and artificial intelligence. These libraries provide the
assistance a beginner needs to get started with the language. You can find
many functions, objects, classes, and more in the standard library to help
you with your code. You can also add third-party libraries to get some
extra power, or to do something in particular that the standard library
cannot help with. Some of the most popular libraries are Seaborn,
Scikit-learn, PyTorch, and Matplotlib.
Extensive Visualization And Graphics Options
Python developers can explore a variety of options for data visualization
and graphics. Charts, graphical layouts, web-ready plots, etc. can be used
by developers to visualize the data.
Python’s Automation
Because Python is an open-source scripting language, it is straightforward
to automate tasks with it. Even a beginner can automate many tasks after
learning the basics. The code is written as scripts, which the Python
interpreter reads and executes. Once the developer has checked a script, it
can be run as many times as needed without intervention. Thus, users can
automate many routine tasks.
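As a small illustration of this kind of automation, the sketch below tidies a folder by moving each file into a subfolder named after its file extension. The function name sort_by_extension and the sample file names are made up for this example, and the demonstration runs inside a temporary directory so that no real files are touched.

```python
import tempfile
from pathlib import Path


def sort_by_extension(folder):
    """Move each file in folder into a subfolder named after its extension."""
    folder = Path(folder)
    moved = {}
    # sorted() takes a snapshot of the folder before anything is moved.
    for item in sorted(folder.iterdir()):
        if not item.is_file():
            continue  # leave existing subfolders alone
        # "report.csv" goes to "csv"; files without an extension go to "other"
        target_dir = folder / (item.suffix.lstrip(".").lower() or "other")
        target_dir.mkdir(exist_ok=True)
        item.rename(target_dir / item.name)
        moved[item.name] = target_dir.name
    return moved


# Demonstration in a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["sales.csv", "notes.txt", "prices.CSV", "README"]:
        (Path(tmp) / name).write_text("sample")
    result = sort_by_extension(tmp)
    print(result)
```

Once a script like this has been checked, it can be run again whenever the folder fills up, which is the point made in the paragraph above.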
Python is considered one of the best data science tools for working with
big data. When data analysis has to be integrated with the production
database, big data and Python make a good combination. It is considered one
of the best programming languages because it is equipped with powerful
library packages designed to meet data science and other analytical
requirements.
The popular libraries used for big data with Python are:
• Pandas: This library provides data structures and operations for
manipulating numerical tables and time series
• SciPy: It is used for technical and scientific computing
• NumPy: It makes scientific computing possible. It offers support for
all types of number crunching, linear algebra, etc. It is especially
useful for calculations on matrices and multi-dimensional arrays, for
which it has an extensive library of functions.
• Mlpy: It provides many machine learning techniques for solving
problems. It aims for a reasonable compromise between usability,
efficiency, reproducibility, modularity, and maintainability.
• NetworkX: This library is used for working with graphs. It helps in
creating, manipulating, and studying the structure and functions of
complex networks
Besides all this, it has many other useful libraries such as - Dask, Dmelt,
TensorFlow, Scikit-learn, Regression, Matplotlib, etc.
Python also supports testing. Python is a good choice for test automation
because of its libraries. Pytest is a widely recommended test framework
that can be used for unit, functional and integration testing. Python
supports both functional and object-oriented styles, and its
straightforward syntax and support for stateless functions make it well
suited to test automation.
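To give a feel for what such tests look like, here is a minimal sketch in the style that Pytest expects: plain functions whose names start with test_ and that use ordinary assert statements. The average() function and the file name test_average.py mentioned below are made up for this illustration.

```python
def average(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    if not values:
        raise ValueError("average() needs at least one value")
    return sum(values) / len(values)


# Pytest collects functions whose names start with "test_".
def test_average_of_integers():
    assert average([2, 3, 1, 4]) == 2.5


def test_average_single_value():
    assert average([7]) == 7


def test_average_rejects_empty_input():
    try:
        average([])
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for empty input")


if __name__ == "__main__":
    # Running the file directly also works, without Pytest installed.
    test_average_of_integers()
    test_average_single_value()
    test_average_rejects_empty_input()
    print("all tests passed")
```

If this is saved as test_average.py (a hypothetical file name), running pytest in the same folder should find and run the three test functions. Each test is a small stateless function, which is what keeps such tests easy to read and to repeat.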
Python also offers basic functionality, which is required to code the tools to
process three-dimensional data. For instance, the cgtypes module in python
can be used to learn more about various computer graphics, matrices and
other information used to create RIB files and more. With these modules
that exist, it becomes effortless to keep a 3D scene in memory, which can be
easily manipulated in Python.
Python is used in AI more than any other programming language, in part
because it is platform independent: the same code runs on different
platforms with few tweaks. It has many prebuilt libraries that can be used
for coding in artificial intelligence. Python is also highly flexible,
since it supports two approaches, scripting and object-oriented
programming (OOP). Users can also work in an IDE, which helps them write
the required code themselves.
Python in Web Development
Developing a website with Python means writing code that runs on the
server instead of in the browser. Building the server side of a website is
crucial for storing user-related information. Before building a website
with Python, you should learn and grasp the principles of the language.
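To make the idea of server-side code concrete, here is a minimal sketch that uses only the wsgiref module from Python's standard library. The function names app, call_app and serve, the messages and the port number are all choices made for this illustration. Real websites are normally built with a web framework, which this sketch does not cover.

```python
from wsgiref.simple_server import make_server
from wsgiref.util import setup_testing_defaults


def app(environ, start_response):
    """A tiny WSGI application: the server calls this once per request."""
    path = environ.get("PATH_INFO", "/")
    if path == "/":
        status, body = "200 OK", "Hello from the server side!"
    else:
        status, body = "404 Not Found", "No such page: " + path
    payload = body.encode("utf-8")
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(payload))),
    ]
    start_response(status, headers)
    return [payload]


def call_app(path):
    """Call the app directly, the way a server would, and collect the reply."""
    environ = {"PATH_INFO": path}
    setup_testing_defaults(environ)  # fills in the remaining WSGI keys
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app(environ, start_response)).decode("utf-8")
    return captured["status"], body


def serve(port=8000):
    """Serve the app locally; call serve() yourself to try it in a browser."""
    with make_server("127.0.0.1", port, app) as server:
        server.serve_forever()


print(call_app("/"))
print(call_app("/missing"))
```

The browser only ever sees the text that app() returns. The decision about what to send back is made by Python code running on the server.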
Various aspects of Python make it one of the best coding languages. This,
coupled with the quality of the output it offers, makes the language quite
attractive.
Here are five simple steps you can follow to learn Python.
But remember that although the steps shown below are simple, it does not
mean that you needn’t put in any effort. Without the required time, effort,
practice and consistency, learning a new language isn’t possible.
How to Learn Python Efficiently
Python Fundamentals
● Find a course: There are many courses available; join one after
reading reviews from the Python community
● Join a community: Join a Python community where you can learn from
your peers and hear about jobs. This also makes it easier to land a job,
as referrals account for 30% of all hires
Mini Python Projects
● Read guidebooks, blog posts, and other people’s code
● Try the command line interface: Working from the CLI lets you execute
Python code faster so you can learn and experiment more
● Build small applications, try web scraping with Python
● Work with APIs
Python’s Data Science Libraries
● Online mentors from Dataquest, Stack Overflow and Reddit make
great resources
● Work with databases using SQL: 40% of data scientists use SQL
Data Science Portfolio
● Communicate, collaborate and focus on technical competence: Focus
on understanding the statistical processes behind the python functions
you are learning
● Use GIT for version control: GIT allows you and others to see past
versions of your projects. It helps debug and test new approaches
● Display your portfolios for employers to see
Advanced Data Science Techniques
1. Keep learning. Data science is growing quickly, and it is too large
to learn all of it. Dive into the parts that excite you, whether that
is data visualization, machine learning or another developing
field, and keep learning.
2. Learn beginning and intermediate statistics. A firm grasp of
statistics will give you the perspective to focus on the right
insights and help you communicate those insights to others
3. Create models using live data feeds
Chapter Thirteen: Building Simple Python
Programs
Example 1
Let us start with our favorite example, a program to print Hello World.
# This program prints Hello, world!
print ('Hello, world!')
Output:
Hello, world!
In this program, we use the built-in print() function to print the string.
Example 2
In this example we will show you a Python code to add two numbers by
taking two numbers from the user.
# Store input numbers
num1 = input('Enter first number: ')
num2 = input('Enter second number: ')
# Add two numbers, converting the input strings to floats
# (named total so the built-in sum() is not shadowed)
total = float(num1) + float(num2)
# Display the sum
print('The sum of {0} and {1} is {2}'.format(num1, num2, total))
Output:
Enter first number: 1.5
Enter second number: 6.3
The sum of 1.5 and 6.3 is 7.8
Here the built-in function input() is used to receive input from the user.
input() returns a string, so each string must be converted into a number,
which is done with the float() function.
Example 3
In this example, we will check if a number that is provided by the user is a
prime number or not
# Program to check if a number is prime or not
num = 407
# To take input from the user
#num = int(input("Enter a number: "))
# prime numbers are greater than 1
if num > 1:
    # check for factors
    for i in range(2, num):
        if (num % i) == 0:
            print(num, "is not a prime number")
            print(i, "times", num//i, "is", num)
            break
    else:
        print(num, "is a prime number")
# if the input number is less than
# or equal to 1, it is not prime
else:
    print(num, "is not a prime number")
Output:
407 is not a prime number
11 times 37 is 407
In this code, the variable ‘num’ is checked to see whether it is prime.
Numbers less than or equal to 1 are not prime numbers. Therefore, we only
move forward if num is larger than 1.
We check whether ‘num’ is exactly divisible by any number from 2 to
‘num – 1’. If we find a factor in that range, the number is not prime.
Otherwise the number is prime. The inner else belongs to the for loop, so
it runs only when the loop finishes without reaching break.
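The check above tries every candidate from 2 to num – 1. As a minimal sketch of a common refinement, the version below stops at the square root of the number, since any factor larger than the square root must pair with one smaller than it. The function name is_prime is a hypothetical name chosen for this illustration, and math.isqrt assumes Python 3.8 or later.

```python
import math

def is_prime(num):
    """Return True if num is prime, testing divisors only up to sqrt(num)."""
    if num <= 1:
        return False  # numbers less than or equal to 1 are not prime
    for i in range(2, math.isqrt(num) + 1):
        if num % i == 0:
            return False  # found a factor, so num is not prime
    return True

print(is_prime(407))  # False, because 11 * 37 == 407
print(is_prime(13))   # True
```

For 407 this tests at most 19 candidates instead of 405, which matters more as the numbers grow.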
Example 4
In this example, we will find the HCF (highest common factor) of two
numbers using Python.
# Python program to find H.C.F of two numbers
# define a function
def compute_hcf(x, y):
    # choose the smaller number
    if x > y:
        smaller = y
    else:
        smaller = x
    for i in range(1, smaller+1):
        if((x % i == 0) and (y % i == 0)):
            hcf = i
    return hcf

num1 = 54
num2 = 24
print("The H.C.F. is", compute_hcf(num1, num2))
Output:
The H.C.F. is 6
The function used here is compute_hcf(). It tries every number from 1 up
to the smaller of the two inputs and returns the largest one that divides
both.
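The loop in compute_hcf() tests every candidate divisor. As an alternative sketch, not the method used in the example above, Euclid's algorithm reaches the same result by repeatedly replacing the pair (x, y) with (y, x mod y). The name compute_hcf_euclid is hypothetical; math.gcd is the standard library function for the same calculation.

```python
import math

def compute_hcf_euclid(x, y):
    """Return the HCF of x and y using Euclid's algorithm."""
    while y:
        # Replace (x, y) with (y, x mod y) until the remainder is zero
        x, y = y, x % y
    return x

print(compute_hcf_euclid(54, 24))  # 6
print(math.gcd(54, 24))            # 6, using the standard library
```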
Example 5
In this example, we will be sorting the words alphabetically using a Python
program.
# Program to sort alphabetically the words from a string provided by the user
my_str = "Hello this Is an Example With cased letters"
# To take input from the user
#my_str = input("Enter a string: ")
# break the string down into a list of words
words = my_str.split()
# sort the list
words.sort()
# display the sorted words
print("The sorted words are:")
for word in words:
    print(word)
Output:
The sorted words are:
Example
Hello
Is
With
an
cased
letters
this
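In this program, the string is stored in the variable my_str and then split
into words with split(). The split() function splits the string at
whitespace. Because sort() compares characters by their code points, the
capitalized words come before the lowercase ones in the output.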
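The output above places every capitalized word before the lowercase ones, because uppercase letters have lower code points than lowercase letters. If a dictionary-style order is wanted instead, one possible sketch is to sort with a key that ignores case. The variable name sorted_words is an assumption made for this illustration.

```python
my_str = "Hello this Is an Example With cased letters"

# split() with no arguments splits on any run of whitespace
words = my_str.split()

# sorted() returns a new list; key=str.lower compares words ignoring case
sorted_words = sorted(words, key=str.lower)

print(sorted_words)
# ['an', 'cased', 'Example', 'Hello', 'Is', 'letters', 'this', 'With']
```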
Example 6
In this example, we will perform various set operations like those in
mathematics.
# Program to perform various set operations like in mathematics
# define two sets
E = {0, 2, 4, 6, 8}
N = {1, 2, 3, 4, 5}
# set union
print ("Union of E and N is", E | N)
# set intersection
print ("Intersection of E and N is", E & N)
# set difference
print ("Difference of E and N is", E - N)
# set symmetric difference
print ("Symmetric difference of E and N is", E ^ N)
Output:
Union of E and N is {0, 1, 2, 3, 4, 5, 6, 8}
Intersection of E and N is {2, 4}
Difference of E and N is {8, 0, 6}
Symmetric difference of E and N is {0, 1, 3, 5, 6, 8}
In this program two distinct sets are considered and various set operations
are performed on them. This is done with the set operators |, &, - and ^.
Sets are unordered, so the elements may be displayed in any order.
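Each of the operators used above also has a named method on Python's set type. The short sketch below checks that the two spellings agree for the same sets E and N, and uses sorted() to get a predictable display order.

```python
E = {0, 2, 4, 6, 8}
N = {1, 2, 3, 4, 5}

# Each operator has an equivalent method on the set type
assert E.union(N) == E | N
assert E.intersection(N) == E & N
assert E.difference(N) == E - N
assert E.symmetric_difference(N) == E ^ N

# sorted() gives a predictable display order, since sets are unordered
print(sorted(E.difference(N)))  # [0, 6, 8]
```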
Chapter Fourteen: SQL for Data Science
Introduction
SQL (Structured Query Language) is a programming language used to
manage and query data in a relational database. A relational database is
a collection of two-dimensional tables. Each table has a fixed number of
columns and can have any number of rows. For example, consider different
car manufacturers. Every car manufacturer would have a database composed
of tables. Each table would store different parameters or metrics for
each car model in different countries.
The Beginning of the NoSQL Movement
NoSQL has become a highly popular term. You must have seen it being
thrown around quite often. Even if you haven’t, don’t worry, because you
will learn about it in this section. The NoSQL movement covers databases
that store and manipulate data in formats other than tabular relations,
that is, non-relational databases.
NoSQL databases differ from relational databases for various reasons.
Non-relational database systems started to become popular in the 2000s.
In this period, many emerging companies such as Amazon, Facebook and
Google needed to process large amounts of data in short periods. These
situations arose because many people were using these websites all the
time.
It is a common belief in the open-source community that the systems
developed to meet these requirements have exceedingly broad and diverse
characteristics. It was agreed that these systems could tackle the
problems associated with RDBMS up to a certain extent. They targeted
horizontal scalability by distributing data over a cluster of database
nodes.
Because it is less rigid, the non-relational data model differs from other
modeling paradigms. It is also better equipped to handle growing data
structures. Its query mechanism is quite simple and accessible compared
to other interfaces.
NoSQL data storage layers have become extremely popular, but they are
not without their limitations. Almost all NoSQL implementations are still
in a preliminary stage and have not yet proved their mettle in the field.
They also lack relational support, which makes writing aggregations and
queries difficult.
Compared to this, an RDBMS is durable, easily manageable, and provides
better support for transactions. The choice between NoSQL and an RDBMS
is difficult to make because it involves many criteria, but sooner or
later people will be forced to look into it carefully.
Most NoSQL vendors now tend to focus on the durability and robustness of
their products, while traditional RDBMS vendors tend to add features that
allow users to build scalable and schema-free data stores. Such stores can
hold semi-structured and nested documents, which is why NoSQL is so
popular in the document store category.
Some vendors have started using the term ‘NewSQL’ to describe modern
relational database management systems. The primary aim of these systems
is to combine the flexibility of NoSQL systems, scalable performance, and
the robustness of the DBMS in a new and improved product. The future
trend seems to be toward such blended systems, except in cases where
specialized and niche database management systems are needed.
The NoSQL movement shows that in modern times, there is no
one-size-fits-all tool that can help your firm.
Along with Python and R, SQL is now considered one of the most required
skills in data science. Are you still wondering why it is important to
learn SQL? Well, here are the reasons you should consider it.
As mentioned earlier, around 2.5 quintillion bytes of data are created
daily. Databases are needed to store such massive amounts of data.
These days, many organizations place a lot of value on data and have
realized its importance. The data available can be used to solve problems
related to the organization by analyzing and predicting the incoming
credit. It can also be used to predict market trends and to learn what
clients want.
One of the primary reasons for using Structured Query Language is that
when an operation is performed on a data set, the data can be accessed
directly in the database without having to copy it beforehand. This
speeds up processing and also speeds up workflow execution.
There are different SQL databases, such as SQLite, Oracle, Microsoft SQL
Server, MySQL, and Postgres.
You can also practice SQL using the Google BigQuery integration on
Kaggle. SQL can be used in data science for preprocessing and machine
learning too.
Master Data Science in 5 Steps
Basics of Relational Database
As already mentioned, SQL is a language for querying and managing data
in a relational database management system (RDBMS). The terms SQL and
RDBMS are often used together for the sake of convenience. The term SQL
is also used to distinguish relational database systems from
non-relational database systems, which are grouped under the umbrella
term NoSQL. Non-RDBMS systems do not require any SQL skills.
An Overview of SQL
Now that we are familiar with relational databases, let us shift our focus
to SQL.
The syntax of SQL is similar to the English language. Hence, it is easy to
understand the code if you go through a few examples. While you go
through SQL code, do not worry about understanding all of it at once.
The initial steps are intuitive, which will help your understanding. To
follow the steps below, it helps to have an SQL environment set up. You
need not enter every SQL statement in it, but doing so enhances your
overall understanding. You could install SQLite locally and start working
with it.
Data Management Functions
SQL can be used to perform several types of data management functions,
such as querying the database with the SELECT command, inserting records
into the database with the INSERT command, and updating records with
the UPDATE command. These three commands are some of the most heavily
used and are a good start to your SQL learning journey, as they are
simple and readable.
First, select a database you want to work on. Ensure that the records you
opt for are interesting and will not bore you during the early stages of the
learning process. Next, query some data from it with SELECT. Simple
tutorials for this can be found on YouTube and similar platforms. The next
step is to add another record to the existing records with INSERT. After
this, change one of the records with UPDATE. Once you start following
this flow, it becomes easier to understand how the different commands
work.
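The SELECT, INSERT and UPDATE flow described above can be tried from Python with the standard library sqlite3 module. This is only a minimal sketch: the cars table, its columns and its rows are hypothetical data invented for the illustration, and the database lives in memory so nothing is saved.

```python
import sqlite3

# In-memory database, so nothing is written to disk
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A small hypothetical table to practice on
cur.execute("CREATE TABLE cars (model TEXT, country TEXT, units_sold INTEGER)")

# INSERT adds records; the ? placeholders keep values separate from the SQL text
cur.executemany(
    "INSERT INTO cars VALUES (?, ?, ?)",
    [("Alpha", "France", 120), ("Beta", "Japan", 300)],
)

# UPDATE changes existing records that match the WHERE clause
cur.execute("UPDATE cars SET units_sold = ? WHERE model = ?", (150, "Alpha"))
conn.commit()

# SELECT queries the records back
rows = cur.execute("SELECT model, units_sold FROM cars ORDER BY model").fetchall()
print(rows)  # [('Alpha', 150), ('Beta', 300)]
conn.close()
```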
Creating, Dropping and Deleting
The next set of commands to learn is CREATE TABLE and DROP TABLE, which
add and remove tables, and DELETE, which removes records. As the set of
commands you understand grows, it becomes easier to use SQL.
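The difference between DELETE and DROP is easy to see in a small experiment. The sketch below again uses Python's sqlite3 module with an in-memory database; the notes table is a hypothetical example. DELETE removes rows and leaves the table in place, while DROP TABLE removes the table itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# CREATE TABLE defines a new, empty table
cur.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT)")
cur.executemany("INSERT INTO notes (body) VALUES (?)", [("first",), ("second",)])

# DELETE removes matching rows but keeps the table itself
cur.execute("DELETE FROM notes WHERE body = ?", ("first",))
remaining = cur.execute("SELECT body FROM notes").fetchall()
print(remaining)  # [('second',)]

# DROP TABLE removes the table and all of its rows
cur.execute("DROP TABLE notes")
tables = cur.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' AND name = 'notes'"
).fetchall()
print(tables)  # []
conn.close()
```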
Views and Joins
Now let us focus on slightly more advanced SQL topics. First, let’s look at
views: virtual tables populated by the results of queries, which are
useful for application development, data security, easy data sharing and
so on.
These virtual tables make it easy to search your database and avoid
long queries. Does it look too good to be true? Well, it is not. Let us assume
there is a table created to store information about all the individuals who
live on this planet. Some important data points, such as salary or address,
shouldn’t become public knowledge. So access to this table needs to be
given without giving away access to the private information.
Now let us learn how to create a view.
Write a query that returns the data that should appear in the view, using
the SELECT command, and execute it. To create a view from this query,
write CREATE VIEW followed by the name of your view, then AS followed by
the SELECT command that you have already written. That is it: the view is
created. You can now treat the created view as a table.
One of the important benefits of a view is security. You can give access
only to the views instead of all the underlying tables, so you control the
data others see and restrict the sensitive data. Another benefit of a view
is simplicity. Instead of everyone learning how to write the union or join
that combines two tables, they can simply select from a view that was
created with those commands.
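The CREATE VIEW ... AS SELECT pattern described above can be sketched with Python's sqlite3 module. The people table, the public_people view and the row of data are hypothetical. Note that SQLite has no user permissions, so this sketch only shows how a view hides columns; granting access to the view alone would be done in a database server that supports permissions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table holding both public and private columns
cur.execute("CREATE TABLE people (name TEXT, city TEXT, salary REAL, address TEXT)")
cur.execute("INSERT INTO people VALUES ('Ada', 'London', 52000, '1 Hypothetical Road')")

# The view exposes only the columns that are safe to share
cur.execute("CREATE VIEW public_people AS SELECT name, city FROM people")

# The view can be queried like a table
view_rows = cur.execute("SELECT * FROM public_people").fetchall()
print(view_rows)  # [('Ada', 'London')]
conn.close()
```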
Joins come in different flavors, and they are probably among the most
complex topics you will learn while covering SQL. That is more a
testament to the ease of SQL than to the actual difficulty of learning
about joins. The tutorial MySQL Tutorial 2: Views and Joins from
Arachnoid covers views and joins.
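Two of the most common flavors are the inner join and the left join. The sketch below, using Python's sqlite3 module, shows how they differ on two small hypothetical tables named emp and dept. The employee NOBODY, who has no department, is invented here only to make the difference visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE dept (deptno INTEGER, dname TEXT)")
cur.execute("CREATE TABLE emp (empno INTEGER, ename TEXT, dept INTEGER)")
cur.executemany("INSERT INTO dept VALUES (?, ?)", [(2, "RESEARCH"), (3, "SALES")])
cur.executemany(
    "INSERT INTO emp VALUES (?, ?, ?)",
    [(3, "TAFT", 3), (8, "GRANT", 2), (99, "NOBODY", None)],
)

# INNER JOIN keeps only rows that have a match in both tables
inner = cur.execute(
    "SELECT e.ename, d.dname FROM emp e "
    "INNER JOIN dept d ON e.dept = d.deptno ORDER BY e.empno"
).fetchall()
print(inner)  # [('TAFT', 'SALES'), ('GRANT', 'RESEARCH')]

# LEFT JOIN keeps every emp row, filling missing dept columns with NULL (None)
left = cur.execute(
    "SELECT e.ename, d.dname FROM emp e "
    "LEFT JOIN dept d ON e.dept = d.deptno ORDER BY e.empno"
).fetchall()
print(left)  # [('TAFT', 'SALES'), ('GRANT', 'RESEARCH'), ('NOBODY', None)]
conn.close()
```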
This chapter is meant to give you a brief insight into what SQL is and how
it works. You need to keep working on it and learn more about it to get a
deeper understanding.
Examples of a Few SQL Queries in a Program
Example 1
CREATE TABLE emp (
empno INT PRIMARY KEY,
ename VARCHAR(10),
job VARCHAR(9),
mgr INT NULL,
hiredate DATETIME,
sal NUMERIC(7,2),
comm NUMERIC(7,2) NULL,
dept INT)
begin
insert into emp values
(1,'JOHNSON','ADMIN',6,'12-17-1990',18000,NULL,4)
insert into emp values
(2,'HARDING','MANAGER',9,'02-02-1998',52000,300,3)
insert into emp values
(3,'TAFT','SALES I',2,'01-02-1996',25000,500,3)
insert into emp values
(4,'HOOVER','SALES I',2,'04-02-1990',27000,NULL,3)
insert into emp values
(5,'LINCOLN','TECH',6,'06-23-1994',22500,1400,4)
insert into emp values
(6,'GARFIELD','MANAGER',9,'05-01-1993',54000,NULL,4)
insert into emp values
(7,'POLK','TECH',6,'09-22-1997',25000,NULL,4)
insert into emp values
(8,'GRANT','ENGINEER',10,'03-30-1997',32000,NULL,2)
insert into emp values
(9,'JACKSON','CEO',NULL,'01-01-1990',75000,NULL,4)
insert into emp values
(10,'FILLMORE','MANAGER',9,'08-09-1994',56000,NULL,2)
insert into emp values
(11,'ADAMS','ENGINEER',10,'03-15-1996',34000,NULL,2)
insert into emp values
(12,'WASHINGTON','ADMIN',6,'04-16-1998',18000,NULL,4)
insert into emp values
(13,'MONROE','ENGINEER',10,'12-03-2000',30000,NULL,2)
insert into emp values
(14,'ROOSEVELT','CPA',9,'10-12-1995',35000,NULL,1)
end
CREATE TABLE dept (
deptno INT NOT NULL,
dname VARCHAR(14),
loc VARCHAR(13))
begin
insert into dept values (1,'ACCOUNTING','ST LOUIS')
insert into dept values (2,'RESEARCH','NEW YORK')
insert into dept values (3,'SALES','ATLANTA')
insert into dept values (4, 'OPERATIONS','SEATTLE')
end
The preceding data script can be run in ISQL. It uses statements such as
CREATE TABLE and INSERT.
Example 2
This example, written in C#, shows how to create the sample tables using
the data provider.
SQLServerConnection Conn;
Conn = new SQLServerConnection("host=nc-star;port=1433;"
+"User ID=test01;Password=test01; Database Name=Test");
try
{
Conn.Open();
}
catch (SQLServerException ex)
{
// Connection failed
Console.WriteLine(ex.Message);
return;
}
string[] DropTableSQL = {"drop table emp", "drop table dept"};
for (int x=0; x<=1; x++)
{
try
{
// Drop the tables, don't care if they don't exist
SQLServerCommand DBCmd = new
SQLServerCommand(DropTableSQL[x], Conn);
DBCmd.ExecuteNonQuery();
}
catch (SQLServerException ex)
{
}
}
// Create the tables
string CreateEmpTableSQL = "CREATE TABLE emp ("
+"empno INT PRIMARY KEY NOT NULL,"
+"ename VARCHAR(10) NOT NULL,"
+"job VARCHAR(9) NOT NULL,"
+"mgr INT,"
+"hiredate DATETIME NOT NULL,"
+"sal NUMERIC(7,2) NOT NULL,"
+"comm NUMERIC(7,2),"
+"dept INT NOT NULL)";
string CreateDeptTableSQL = "CREATE TABLE dept ("
+"deptno INT NOT NULL,"
+"dname VARCHAR(14),"
+"loc VARCHAR(13))";
try
{
SQLServerCommand DBCmd = new
SQLServerCommand(CreateEmpTableSQL, Conn);
DBCmd.ExecuteNonQuery();
DBCmd.CommandText = CreateDeptTableSQL;
DBCmd.ExecuteNonQuery();
}
catch (Exception ex)
{
//Create tables failed
Console.WriteLine (ex.Message);
return;
}
// Now insert the records
string[] InsertEmpRecordsSQL = {
"insert into emp values (1,'JOHNSON','ADMIN',6,'12-17-1990',18000,NULL,4)",
"insert into emp values (2,'HARDING','MANAGER',9,'02-02-1998',52000,300,3)",
"insert into emp values (3,'TAFT','SALES I',2,'01-02-1996',25000,500,3)",
"insert into emp values (4,'HOOVER','SALES I',2,'04-02-1990',27000, NULL,3)",
"insert into emp values (5,'LINCOLN','TECH',6,'06-23-1994',22500,1400,4)",
"insert into emp values (6,'GARFIELD','MANAGER',9,'05-01-1993',54000, NULL,4)",
"insert into emp values (7,'POLK','TECH',6,'09-22-1997',25000, NULL,4)",
"insert into emp values (8,'GRANT','ENGINEER',10,'03-30-1997',32000, NULL, 2)",
"insert into emp values (9,'JACKSON','CEO',NULL,'01-01-1990',75000,NULL, 4)",
"insert into emp values (10,'FILLMORE','MANAGER',9,'08-09-1994',56000, NULL, 2)",
"insert into emp values (11,'ADAMS','ENGINEER',10,'03-15-1996',34000, NULL,2)",
"insert into emp values (12,'WASHINGTON','ADMIN',6,'04-16-1998',18000, NULL,4)",
"insert into emp values (13,'MONROE','ENGINEER',10,'12-03-2000',30000, NULL, 2)",
"insert into emp values (14,'ROOSEVELT','CPA',9,'10-12-1995',35000, NULL,1)"};
string[] InsertDeptRecordsSQL = {
"insert into dept values (1,'ACCOUNTING','ST LOUIS')",
"insert into dept values (2,'RESEARCH','NEW YORK')",
"insert into dept values (3,'SALES','ATLANTA')",
"insert into dept values (4, 'OPERATIONS','SEATTLE')"};
// Insert dept table records first
for (int x = 0; x<InsertDeptRecordsSQL.Length; x++)
{
try
{
SQLServerCommand DBCmd =
new SQLServerCommand(InsertDeptRecordsSQL[x], Conn);
DBCmd.ExecuteNonQuery();
}
catch (Exception ex)
{
Console.WriteLine (ex.Message);
return;
}
}
// Now the emp table records
for (int x = 0; x<InsertEmpRecordsSQL.Length; x++)
{
try
{
SQLServerCommand DBCmd =
new SQLServerCommand(InsertEmpRecordsSQL[x], Conn);
DBCmd.ExecuteNonQuery();
}
catch (Exception ex)
{
Console.WriteLine (ex.Message);
return;
}
}
Console.WriteLine ("Tables created Successfully!");
// Close the connection
Conn.Close();
Example 3
The example uses the Parameters.Add method to create the parameters for
the Update statement, fills a DataSet, programmatically makes changes to
the DataSet, then synchronizes the changes back to the database.
SQLServerConnection Conn =
new SQLServerConnection("host=nc-star;port=4100;User ID=test01;"
+"Password=test01;Database Name=Test");
try
{
string selectText = "select sal, job, empno from emp";
string updateText = "update emp set sal = ?, job = ? where empno = ?";
SQLServerDataAdapter adapter = new SQLServerDataAdapter(selectText,
Conn);
SQLServerCommand updateCommand = new
SQLServerCommand(updateText, Conn);
adapter.UpdateCommand = updateCommand;
updateCommand.Parameters.Add("@sal", SQLServerDbType.Int, 15,
"SAL");
updateCommand.Parameters.Add("@job", SQLServerDbType.VarChar, 9,
"JOB");
updateCommand.Parameters.Add("@empno", SQLServerDbType.Int, 15,
"empno");
DataSet myDataSet = new DataSet("emp");
adapter.Fill(myDataSet, "emp");
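// Give employee number 11 a promotion and a raise.
// Rows are zero-indexed, so index 10 is the 11th row returned
// (empno 11 if the rows come back in empno order).
DataRow changeRow = myDataSet.Tables["emp"].Rows[10];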
changeRow["sal"] = "35000";
changeRow["job"] = "MANAGER";
// Send back to database
adapter.Update (myDataSet, "emp");
myDataSet.Dispose();
}
catch (Exception ex)
{
// Display any exceptions
Console.WriteLine (ex.Message);
}
Console.WriteLine ("DataSet Updated Successfully!");
// Close the connection
Conn.Close();
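The update statement in Example 3 uses ? placeholders so that the values are supplied separately from the SQL text. The same pattern can be sketched in Python with the standard library sqlite3 module. This is an illustration under stated assumptions, not a translation of the data provider code: it uses an in-memory SQLite database and a reduced emp table with only three columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE emp (empno INTEGER PRIMARY KEY, job TEXT, sal INTEGER)")
cur.execute("INSERT INTO emp VALUES (11, 'ENGINEER', 34000)")

# Same statement shape as in Example 3: values are bound to the ? placeholders
update_text = "UPDATE emp SET sal = ?, job = ? WHERE empno = ?"
cur.execute(update_text, (35000, "MANAGER", 11))
conn.commit()

updated = cur.execute("SELECT empno, job, sal FROM emp WHERE empno = 11").fetchone()
print(updated)  # (11, 'MANAGER', 35000)
conn.close()
```

Binding values through placeholders, rather than building the SQL string by hand, also avoids quoting mistakes.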
Analyzing Every Part of Data Science Puzzle
It is necessary to understand statistics and math carefully to practice data
science. It is also necessary to have some basic programming skills, which
allow you to work with data. You also need subject matter expertise;
without it, you might as well call yourself a statistician or a
mathematician. Similarly, a software engineer who does not possess the
analytical skills or subject matter expertise is just a software developer
and not a data scientist.
The demand for data predictions and insights is increasing day by day.
This has happened because more and more sectors and fields have now
realized the importance of data science and how it can change the way
they work. Here are some of the many roles that use data science now:
● Ad tech data scientist
● Director of banking digital analyst
● Clinical data scientist
● Geoengineer data scientist
● Geospatial analytics data scientist
● Political analyst
● Retail personalization data scientist
In the next section, let’s look at the important roles performed by a data
scientist.
Collecting, Querying and Consuming Data
Large amounts of unstructured, semi-structured, and structured big data are
collected and collated by data engineers. Any data that goes beyond the
processing power of a regular or traditional data system thanks to its sheer
size, or agility, or any other significant reason is known as big data. Data
engineering isn’t the same as data science, which is all about data analysis,
prediction, and visualization. While data scientists and data engineers are
quite different, when a data scientist queries, collects or consumes data
while analyzing data, they do a job that is quite similar to the job of a data
engineer.
It is possible to gain a multitude of insights from just one source of data.
However, it is the combination of different sources that provides the
contextual information required for better data-based decision making. A
data scientist can work with several databases in different data
warehouses, or with data that is held in one single database. The data
could also be stored using cloud-based software. It does not matter where
the data is stored or in what way a data scientist obtains it; there is one
important step that should not be ignored. The data should be queried.
Querying means writing commands that extract the relevant data set from
the data storage system. Generally, SQL is used to query data.
Whether you use an application or perform custom analysis with the help
of a programming language like Python, here is a list of common file
formats you are likely to work with.
CSV or Comma Separated Value File: It can be read on almost all operating
systems and by many web-based data analysis applications too. It is
commonly used from scripting languages like R and Python.
Scripts: Most data scientists use programming languages such as R or
Python to visualize and analyze data.
Application Files: One of the best tools for a quick and easy spot-check
analysis on small (or medium) datasets is Excel. Excel files use the .xlsx or
the .xls extension.
Web programming files: If you want to create a custom, web-based data
visualization, you can work with D3.js (Data-Driven Documents). It is a
JavaScript library made for detailed visualization.
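As a small illustration of the CSV format mentioned above, the sketch below reads comma-separated text with Python's standard library csv module. The column names and values are hypothetical, and io.StringIO stands in for a real file so that the example runs on its own.

```python
import csv
import io

# Hypothetical CSV content; io.StringIO stands in for a file on disk
csv_text = "model,country,units_sold\nAlpha,France,120\nBeta,Japan,300\n"

# DictReader uses the first row as column names
reader = csv.DictReader(io.StringIO(csv_text))
records = list(reader)

# Every CSV value is read as a string, so convert numbers explicitly
total_units = sum(int(row["units_sold"]) for row in records)
print(total_units)  # 420
```

With a real file, the io.StringIO object would be replaced by a file opened with open(path, newline=""), where path is the location of the CSV file.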
Chapter Fifteen: Math for Data Science