
Q1.

Use a Python script as a calculator for the tasks below:

a. Create the variables n, r, p and assign them the values 10, 5, and 100 respectively. Then evaluate the following
expression:

A = p * (1 + r/100)^n

b. What are the data types of the variables p, r, n, and A in problem (1(a))?

c. If the data type of the variable ‘n’ is not integer, then redefine ‘n’ as integer with its value being 10 and
again compute the value of A.

In [1]:

n,r,p=10,5,100

A=p*((1+(r/100))**n)

print('a] Evaluating the equation would give, A =', A,'\n ')


print('b] DataType of n is,',type(n),'\t ','DataType of r is,',type(r),'\n ',
' DataType of p is,',type(p),'\t ','DataType of A is,',type(A))

a] Evaluating the equation would give, A = 162.8894626777442

b] DataType of n is, <class 'int'> DataType of r is, <class 'int'>


DataType of p is, <class 'int'> DataType of A is, <class 'float'>

c] Since n is already of type int (see part b), no redefinition is needed; recomputing A gives the same value, 162.8894626777442.

-----------------------------------------------------------------------------------------------------------------------

Q2. What will be the output of the following code?

In [2]:

print('The output of the above code is:')

import numpy as np

my_list1 = [2,7,3,5,4,6]
print(my_list1)

arr_1 = np.array(my_list1, dtype = int)


print(arr_1)

The output of the above code is:


[2, 7, 3, 5, 4, 6]
[2 7 3 5 4 6]

-----------------------------------------------------------------------------------------------------------------------
Q3. Create a string called ‘string’ with the value as “Machine Learning”. Which code(s) is/are appropriate to
slice the sub-string “Learn”?

In [3]:

string = 'Machine Learning'


sub_string = string[8:13]

print('The required sliced substring is:', sub_string)

The required sliced substring is: Learn
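Negative indices address the same positions from the right, so this slice (an equivalent alternative, shown as a small sketch) also extracts the substring:

print('The required sliced substring is:', string[-8:-3])  # Learn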

-----------------------------------------------------------------------------------------------------------------------

Q4. Create a sequence of numbers from 15 to 25 and increment by 4. What is the index of the value 19?

In [4]:

a = []

# range(15, 25, 4) produces 15, 19, 23 (the stop value is exclusive)
for i in range(15, 25, 4):
    a.append(i)

print('The index of 19 is:', a.index(19))

The index of 19 is: 1

-----------------------------------------------------------------------------------------------------------------------

Q5. Which of the following is true with respect to the below codes?

A. num1 = num2
B. num1 ≠ num2
C. num1 < num2
D. num1 > num2

In [5]:

num1 = 5**4
num2 = pow(5,4)

print('num1 =', num1, '\t ', ' num2 =', num2, '\n ')

if num1 == num2:
    print('As num1 = num2, Option A is correct.')
elif num1 != num2:
    print('As num1 ≠ num2, Option B is correct.')
elif num1 < num2:
    print('As num1 < num2, Option C is correct.')
elif num1 > num2:
    print('As num1 > num2, Option D is correct.')

num1 = 625 num2 = 625

As num1 = num2, Option A is correct.

-----------------------------------------------------------------------------------------------------------------------

Q6. We have two lists: L1 = ['Coffee', 'Tea'] and L2 = ['Hot', 'Ice']. Using a nested while loop, produce the
following output:

Coffee
* Hot
* Ice
Tea
* Hot
* Ice

In [6]:

L1 = ['Coffee', 'Tea']
L2 = ['Hot', 'Ice']

print('The required output is:','\n ')

i = 0
while i < len(L1):
    print(L1[i])
    a = 0
    while a < len(L2):
        print('*', L2[a])
        a = a + 1
    i = i + 1

The required output is:

Coffee
* Hot
* Ice
Tea
* Hot
* Ice

-----------------------------------------------------------------------------------------------------------------------

Q7. Write a function which finds all Pythagorean triplets of triangles whose sides are no greater than a
natural number N.

In [7]:
def py_trip(N):
    c, x = 0, 2
    while c < N:
        for n in range(1, x):
            # Euclid's formula: a = x^2 - n^2, b = 2xn, c = x^2 + n^2
            a = x*x - n*n
            b = 2*x*n
            c = x*x + n*n
            if c > N:
                break
            print(a, b, c)
        x = x + 1

print('The required Pythagorean triplets are:')


N = 20
py_trip(N)

The required Pythagorean triplets are:


3 4 5
8 6 10
5 12 13
15 8 17
12 16 20
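Note that the generator above enumerates triples of the form (x²−n², 2xn, x²+n²) and can miss scaled triples such as (9, 12, 15). A brute-force sketch (py_trip_all is a hypothetical helper, not part of the original solution) that catches every triplet with sides no greater than N:

def py_trip_all(N):
    # check every ordered combination a <= b <= c up to N
    for a in range(1, N + 1):
        for b in range(a, N + 1):
            for c in range(b, N + 1):
                if a * a + b * b == c * c:
                    print(a, b, c)

py_trip_all(20)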

-----------------------------------------------------------------------------------------------------------------------

Q8. Given a list of integers, write a function that finds the smallest and the largest value of this list.

Given a list of numbers, write a function which finds their standard deviation.

In [8]:
b = [2,3,4,-5]

def smallest_num(a):
    smallest = a[0]          # avoid shadowing the built-in min
    for i in a:
        if i < smallest:
            smallest = i
    return smallest

def largest_num(a):
    largest = a[0]           # avoid shadowing the built-in max
    for i in a:
        if i > largest:
            largest = i
    return largest

x=smallest_num(b)
y=largest_num(b)

print('The smallest value of the list is:', x)


print('The largest value of the list is:', y)

The smallest value of the list is: -5


The largest value of the list is: 4

In [9]:
b = [6.5, 25.8, 16.9, 25.12, 23.2]

mean = sum(b) / len(b)

total = 0
for i in range(len(b)):
    total = total + (b[i] - mean) ** 2

# population variance (divide by n)
variance = total / len(b)

std_dev = variance ** 0.5

print('Standard Deviation of the list is:', std_dev)

Standard Deviation of the list is: 7.221140076192956
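The result can be cross-checked against the standard library, whose pstdev also uses the population (divide-by-n) formula:

import statistics
print(statistics.pstdev(b))  # population standard deviation; matches the value above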

-----------------------------------------------------------------------------------------------------------------------
Q9. There are 3 arrangements of the word DAD, namely DAD, ADD, and DDA. How many arrangements are
there of the word ENDURINGLY?

In [10]:

import math

s = 'ENDURINGLY'
dup_num = {}

# Tally duplicate letters; letters already counted are masked with '0'
for i in range(len(s)):
    count = 1
    for j in range(i + 1, len(s)):
        if s[i] == s[j] and s[i] != ' ' and s[i] != '0':
            count = count + 1
            s = s[:j] + '0' + s[j+1:]
    if count > 1:
        dup_num.update({s[i]: count})

# Divide n! by k! for each letter that occurs k > 1 times
denom = 1
for i in dup_num.values():
    denom = denom * math.factorial(i)

tot_arr = int(math.factorial(len(s)) / denom)

print("The number of different arrangements are:", tot_arr)

The number of different arrangements are: 1814400
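An equivalent, more compact sketch using collections.Counter to tally the repeated letters:

from collections import Counter
from math import factorial

s = 'ENDURINGLY'
denom = 1
for count in Counter(s).values():   # letter frequencies; only N occurs twice
    denom *= factorial(count)
print(factorial(len(s)) // denom)   # 10! / 2! = 1814400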

-----------------------------------------------------------------------------------------------------------------------

Q10. There are 13 men and 12 women in a ballroom dancing class. If 6 men and 6 women are chosen and
paired off, how many pairings are possible?

In [11]:

import math

# Total number of men, M


M = 13

# Total number of women, W


W = 12

# Number of men to be selected, M1


M1 = 6

# Number of women to be selected, W1


W1 = 6

# Number of ways to select 6 men,


M_ways = math.factorial(M)/(math.factorial(M1) * math.factorial(M-M1))

# Number of ways to select 6 women,


W_ways = math.factorial(W)/(math.factorial(W1) * math.factorial(W-W1))

# Number of ways of pairing the chosen 6 men with the 6 women:
# the first man can partner any of the 6 women, the next any of the
# remaining 5, and so on, i.e. 6! orderings (not 6 * 6).

MW_ways = math.factorial(M1)

# Total number of pairings possible is,

tot_pairs = int(M_ways * W_ways * MW_ways)
print("The combinations possible are:", tot_pairs)

The combinations possible are: 1141620480

-----------------------------------------------------------------------------------------------------------------------

Q11. Suppose you are taking a multiple-choice test with 4 choices for each question. In answering a question
on this test, the probability that you know the answer is 0.33. If you don’t know the answer, you choose one
at random. What is the probability that you knew the answer to a question, given that you answered it
correctly?

ANSWER:

P(K) : probability of knowing the answer = 0.33 ≈ 1/3

P(A) : probability of answering correctly

P(A|K) : probability of answering correctly, given that you know the answer = 1

P(K|A) : probability of knowing the answer, given that it is correct = ?

By Bayes' formula, P(K|A) = {P(A|K) * P(K)} / P(A)

There are 2 ways for an answer to be correct:

1. You know the answer, which contributes P(A|K) * P(K) = 1 * 1/3

2. You do not know it, which happens with probability P(K') = 1 - P(K) = 2/3, and your random guess
among the 4 choices is correct, which happens with probability P(A|K') = 1/4

By the law of total probability,

P(A) = P(A|K) * P(K) + P(A|K') * P(K') = 1/3 + (2/3 * 1/4) = 1/3 + 1/6 = 1/2

Substituting in the equation,

P(K|A) = {1 * 1/3} / {1/2}

P(K|A) = 2/3

Hence, the probability of knowing the answer to a question, given that you answered it correctly, is 2/3.
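A quick check of the arithmetic with exact fractions (a small verification sketch):

from fractions import Fraction

P_K = Fraction(1, 3)             # know the answer
P_A_given_K = Fraction(1)        # always correct when known
P_A_given_notK = Fraction(1, 4)  # random guess among 4 choices

P_A = P_A_given_K * P_K + P_A_given_notK * (1 - P_K)
print((P_A_given_K * P_K) / P_A)  # 2/3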

-----------------------------------------------------------------------------------------------------------------------
Q12. Read the given data ‘TIPS.csv’ as a dataframe named Tips and answer the following question:

a) In the Tips dataframe, what are the 3rd quartile and the maximum value of the variable "TotalBill"?

b) What is the range of the variable "TotalBill"?

In [12]:

import pandas as pd
import numpy as np

df=pd.read_csv('Tips.csv')
df=df['TotalBill']

third_quartile = np.quantile(df, 0.75)


print('a] 3rd Quartile of TotalBill is:', third_quartile)

max_bill = df.max()   # pandas skips NaN by default; avoid shadowing the built-in max
print(' Max value of TotalBill is:', max_bill,'\n ')

a] 3rd Quartile of TotalBill is: nan


Max value of TotalBill is: 48.27
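The nan above comes from missing values in TotalBill, which np.quantile propagates. A NaN-aware call avoids this (a sketch; the exact value depends on the data and is not reproduced here):

third_quartile = np.nanquantile(df, 0.75)  # or df.quantile(0.75), which skips NaN
print('a] 3rd Quartile of TotalBill is:', third_quartile)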

In [13]:

min_bill = df.min()

# avoid shadowing the built-ins min and range
bill_range = max_bill - min_bill

print('b] Range of TotalBill is:', bill_range)

b] Range of TotalBill is: 45.2

-----------------------------------------------------------------------------------------------------------------------

Q13. Consider a normal distribution with a mean of 50 and a standard deviation of 7.
68% of the distribution can be found between which two numbers?

a. 40 and 60
b. 0 and 43
c. 0 and 68
d. 43 and 57

In [14]:
mean = 50
std_dev = 7

# 68% of the distribution is within one standard deviation of the mean.


# Hence, the two numbers are,

a = mean - std_dev
b = mean + std_dev

print('The two numbers are:', a, 'and', b)

The two numbers are: 43 and 57

ANSWER: D. 43 and 57
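The 68% figure is the empirical rule for one standard deviation; it can be verified numerically (a sketch, assuming scipy is available):

from scipy.stats import norm
print(norm.cdf(57, 50, 7) - norm.cdf(43, 50, 7))  # ~0.6827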

-----------------------------------------------------------------------------------------------------------------------
Q14. Consider the data X = (58,59,63,60,60,63,60,57,58,59). An unbiased estimation for population variance
would be __

In [1]:
X = (58,59,63,60,60,63,60,57,58,59)
mean = sum(X) / len(X)

total = 0
for i in range(len(X)):
    total = total + (X[i] - mean) ** 2

# Dividing by (len(X) - 1) instead of len(X) corrects the bias in the
# estimate of the population variance (Bessel's correction).

variance = total / (len(X) - 1)

print(' An unbiased estimation for population variance would be:',variance)

An unbiased estimation for population variance would be: 4.01111111111111
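The standard library's statistics.variance uses the same (n - 1) divisor, so it can serve as a cross-check:

import statistics
print(statistics.variance(X))  # sample variance, ~4.0111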

-----------------------------------------------------------------------------------------------------------------------

Q15. From the given below boxplot identify the median value and the outlier.

ANSWER:

Median: 27

Outliers: below 5 and above 75

(estimation from the image)

-----------------------------------------------------------------------------------------------------------------------

Q16. Create a dictionary ‘Country’ that maps the following countries to their capitals respectively:

Find 2 commands to replace “Marseilles” with “Paris”.


In [1]:
Country = {'Country':'State', 'India':'Delhi', 'China':'Beijing', 'Japan':'Tokyo',
           'Qatar':'Doha', 'France':'Marseilles'}
print('The created dictionary is:', '\n ', Country, '\n ')

# First method to replace “Marseilles” with “Paris”


Country['France']= 'Paris_1'
print('Dict with 1st replacement is:', '\n ', Country, '\n ')

# Second method to replace “Marseilles” with “Paris”


Country.update({'France':'Paris_2'})
print('Dict with 2nd replacement is:', '\n ', Country)

The created dictionary is:


{'Country': 'State', 'India': 'Delhi', 'China': 'Beijing', 'Japan': 'Tokyo', 'Qatar':
'Doha', 'France': 'Marseilles'}

Dict with 1st replacement is:


{'Country': 'State', 'India': 'Delhi', 'China': 'Beijing', 'Japan': 'Tokyo', 'Qatar':
'Doha', 'France': 'Paris_1'}

Dict with 2nd replacement is:


{'Country': 'State', 'India': 'Delhi', 'China': 'Beijing', 'Japan': 'Tokyo', 'Qatar':
'Doha', 'France': 'Paris_2'}

-----------------------------------------------------------------------------------------------------------------------

Q17. Create the tuples given below:

tuple_1 = (1,5,6,7,8)
tuple_2 = (8,9,4)

Identify which of the following code does not work on a tuple.

a) sum(tuple_1)
b) len(tuple_2)
c) tuple_2 + tuple_1
d) tuple_1[3] = 45

In [2]:
tuple_1 = (1,5,6,7,8)
tuple_2 = (8,9,4)

a=sum(tuple_1)
print('Output of Option a is:', a, '\n ')

b=len(tuple_2)
print('Output of Option b is:', b, '\n ')

c=tuple_2 + tuple_1
print('Output of Option c is:', c, '\n ')

d=tuple_1[3] = 45
print('Output of Option d is:', d, '\n ')

Output of Option a is: 27

Output of Option b is: 3


Output of Option c is: (8, 9, 4, 1, 5, 6, 7, 8)

---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-2-449643dbbafe> in <module>
11 print('Output of Option c is:', c, '\n')
12
---> 13 d=tuple_1[3] = 45
14 print('Output of Option d is:', d, '\n')

TypeError: 'tuple' object does not support item assignment

ANSWER:

D. tuple_1[3] = 45 - does not work on tuple.
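Tuples are immutable, which is why option (d) fails. If an element really must change, a common workaround (shown here as a sketch) is to go through a list:

t = list(tuple_1)   # copy into a mutable list
t[3] = 45
tuple_1 = tuple(t)  # (1, 5, 6, 45, 8)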

-----------------------------------------------------------------------------------------------------------------------

Q18. How many elements are there in the following data structure?

In [3]:
S = {1,2,3,4,4,4,5,6}

from collections import Counter

a = Counter(S)
print('The number of elements in above data structure is:', a)

The number of elements in above data structure is: Counter({1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1})

-----------------------------------------------------------------------------------------------------------------------

Q19. Create an array of whole numbers from 0 to 10 and find the command to extract the
elements in the following sequence -

In [4]:
import numpy as np

a = np.arange(0,10)   # yields 0-9; use np.arange(0,11) if 10 must be included
b = a[-3:0:-3]

print('The required output is:', b)

The required output is: [7 4 1]

-----------------------------------------------------------------------------------------------------------------------

Q20. Create a 2-dimensional array with 3 rows and 3 columns containing random numbers from 1 to 9.

Find the difference between the maximum element across the columns and the minimum element across the
rows.
In [5]:
import numpy as np

# A fixed 1-9 grid keeps the output reproducible; np.random.randint(1, 10, (3, 3))
# would draw genuinely random values as the question states.

a = np.arange(1,10,1)
b = np.reshape(a, (3, 3))

print('The 2D array is', '\n ', b)

c = np.amax(b, axis=0)
d = np.amin(b, axis=1)

print('\n ', 'The difference is:', c-d)

The 2D array is
[[1 2 3]
[4 5 6]
[7 8 9]]

The difference is: [6 4 2]

-----------------------------------------------------------------------------------------------------------------------

Q21. Consider the following data to answer the question below:

What is the command to convert the above dictionary into a dataframe named ‘df_state’?

In [6]:
import pandas as pd

state_data={'state':['Goa','Goa','Goa','Gujarat','Gujarat','Gujarat',],
'year':[2010, 2011, 2012, 2013, 2014, 2015],
'pop':[2.5, 2.7, 4.6, 3.4, 3.9, 4.2]}

df_state = pd.DataFrame(state_data)
print('The dataframe generated is:','\n\n', df_state)

The dataframe generated is:

state year pop


0 Goa 2010 2.5
1 Goa 2011 2.7
2 Goa 2012 4.6
3 Gujarat 2013 3.4
4 Gujarat 2014 3.9
5 Gujarat 2015 4.2

-----------------------------------------------------------------------------------------------------------------------

Q22. List 3 commands to display the columns of the above dataframe df_state.

In [7]:
b = df_state['state']
print('By using 1st command:', '\n ', b)

c = df_state.year
print('\n ','By using 2nd command:', '\n ', c)

d = df_state.iloc[:,2]
print('\n ','By using 3rd command:', '\n ', d)

By using 1st command:


0 Goa
1 Goa
2 Goa
3 Gujarat
4 Gujarat
5 Gujarat
Name: state, dtype: object

By using 2nd command:


0 2010
1 2011
2 2012
3 2013
4 2014
5 2015
Name: year, dtype: int64

By using 3rd command:


0 2.5
1 2.7
2 4.6
3 3.4
4 3.9
5 4.2
Name: pop, dtype: float64

-----------------------------------------------------------------------------------------------------------------------

Q23. Correlation between two variables X&Y is 0.85. Now, after adding the value 2 to all the values of X, the
correlation co-efficient will be__

ANSWER:

The correlation between the two variables X & Y is 0.85.

After adding the value 2 to all the values of X, the correlation coefficient will remain the same (0.85):
correlation is invariant to adding a constant (a location shift) to either variable.
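A quick numerical check with made-up data (a sketch):

import numpy as np

x = np.array([1.0, 2.0, 4.0, 7.0, 11.0])
y = np.array([2.0, 3.0, 5.0, 9.0, 12.0])

print(np.corrcoef(x, y)[0, 1])      # correlation before the shift
print(np.corrcoef(x + 2, y)[0, 1])  # identical after adding 2 to every x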

-----------------------------------------------------------------------------------------------------------------------

Q24. A) Read the given dataset “Tips.csv” as a dataframe “Data”. Give 3 commands to extract the columns in
the following sequence - Time, TotalBill, Tips?

B) Read the given excel sheet ‘Tips1.xlsx’ as a dataframe ‘Data1’. What command should be given to merge
the two data frames ‘Data’ and ‘Data1’ by columns?

C) Copy the 'Data2' dataframe as 'Data3' (Data3 = Data2.copy()) and identify the command to find the total
tips received across Days from the dataframe 'Data3'.

In [1]:
import pandas as pd
import numpy as np

data = pd.read_csv('Tips.csv')

a=data.loc[:,['Time', 'TotalBill', 'Tips']]


print(a.head(),'\n ')
b=pd.DataFrame(data, columns=['Time', 'TotalBill', 'Tips'])
print(b.head(),'\n ')

c=data[['Time', 'TotalBill', 'Tips']]


print(c.head())

Time TotalBill Tips


0 Dinner 16.99 1.01
1 Dinner 10.34 1.66
2 Dinner 21.01 3.50
3 Dinner 23.68 3.31
4 Dinner 24.59 3.61

Time TotalBill Tips


0 Dinner 16.99 1.01
1 Dinner 10.34 1.66
2 Dinner 21.01 3.50
3 Dinner 23.68 3.31
4 Dinner 24.59 3.61

Time TotalBill Tips


0 Dinner 16.99 1.01
1 Dinner 10.34 1.66
2 Dinner 21.01 3.50
3 Dinner 23.68 3.31
4 Dinner 24.59 3.61

In [2]:
import pandas as pd

data1 = pd.read_excel(r'Tips1.xlsx')
print(data1.head(), '\n ')

data2 = pd.merge(data, data1)

print(data2.head())

SINO Gender
0 1 Female
1 2 Male
2 3 Male
3 4 Male
4 5 Female

SINO TotalBill Tips Smoker Day Time Size Gender


0 1 16.99 1.01 No Sun Dinner 2.0 Female
1 2 10.34 1.66 No Sun Dinner 3.0 Male
2 3 21.01 3.50 No Sun Dinner 3.0 Male
3 4 23.68 3.31 No Sun Dinner 2.0 Male
4 5 24.59 3.61 No Sun Dinner 4.0 Female

In [3]:
data3 = data2.copy()
print(data3.head(), '\n ')

data4 = data3.groupby(by='Day').sum()['Tips']
print(data4)

SINO TotalBill Tips Smoker Day Time Size Gender


0 1 16.99 1.01 No Sun Dinner 2.0 Female
1 2 10.34 1.66 No Sun Dinner 3.0 Male
2 3 21.01 3.50 No Sun Dinner 3.0 Male
3 4 23.68 3.31 No Sun Dinner 2.0 Male
4 5 24.59 3.61 No Sun Dinner 4.0 Female

Day
Fri 28.28
Sat 102.79
Sun 102.28
Thur 41.99
Name: Tips, dtype: float64
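An equivalent form (a minor sketch) selects the Tips column before aggregating, which avoids summing the non-numeric columns only to discard them:

data4 = data3.groupby('Day')['Tips'].sum()
print(data4)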

-----------------------------------------------------------------------------------------------------------------------

Q25. Data Visualization Problem Statement:

'Stock_File_1', a stock trend forecasting company, has just employed you as a Data Scientist. As a first task in
your new job, your manager has provided you with a company's stock data and asked you to check the
quality of the data for the next step of analysis. Below are additional descriptions and information
about the data which your manager has shared with you.

a) The data set contains six variables namely


i. Date
ii. Open
iii. High
iv. Low
v. Close
vi. Volume

b) Typically, the stock market opens at 9:15 hours and closes at 15:30 hours. Each stock is defined by an
opening price and a closing price, which are the prices it opens and closes with. Within the operating hours,
the stock price touches a maximum and a minimum, which are the highest and lowest prices achieved by the
stock during the working hours of the stock market. You have access to ten years of monthly stock price data
with the Open, High, Low and Close prices and the number of stocks traded for each day given by the feature
Volume. On some days when there is no trading, the parameters Open, High, Low and Close remain constant
and Volume is zero.

Furthermore, your manager claims that the model's predictions are poor because the data is polluted. Try to
impress your new boss by preprocessing the data and giving a proper rationale for the steps you
would follow. The two datasets should be merged before preprocessing.

In [1]:
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

df_stock1 = pd.read_csv('Stock_File_1.csv', parse_dates=['Date'])

df_stock2 = pd.read_csv('Stock_File_2.txt', parse_dates=['Date'])

# Joining both the datasets (Stock_1 and Stock_2) and resetting the index.

df = pd.concat([df_stock1, df_stock2]).reset_index()
df.drop(['index'], axis='columns', inplace=True)

print(df)

Date Open High Low Close Volume


0 2006-06-01 471.60 474.00 442.00 444.42 21900
1 2006-06-12 454.00 464.00 440.00 446.17 8400
2 2006-06-22 451.16 464.20 447.60 460.26 19400
3 2006-07-03 495.10 509.68 493.00 498.97 9100
4 2006-07-13 518.00 526.40 517.00 521.66 6800
.. ... ... ... ... ... ...
364 2016-07-11 1265.50 1287.00 NaN 1276.40 11800
365 2016-07-21 1301.00 1322.10 1261.00 1270.55 36800
366 2016-08-01 1279.65 NaN 1247.40 1250.65 4600
367 2016-08-11 1180.05 683.00 1171.15 1191.35 4800
368 2016-08-22 1168.20 1168.20 1148.40 1155.55 2600

[369 rows x 6 columns]

In [2]:
# Counting the null values to check whether the dataset has flaws.

a = df.isnull().sum()
print(a)

Date 0
Open 7
High 16
Low 13
Close 14
Volume 0
dtype: int64

In [3]:
# Open = (High + Low) / 2
# Replacing the null values in the Open column with the mean of the High and Low
# values of the respective rows.

df['Open'] = df['Open'].fillna((df['High'] + df['Low'])/2)

# Close = (High + Low) / 2
# Replacing the null values in the High, Low and Close columns using the above
# relation (High = 2*Close - Low, Low = 2*Close - High).

df['High'] = df['High'].fillna((2 * df['Close'] - df['Low']))

df['Low'] = df['Low'].fillna((2 * df['Close'] - df['High']))

df['Close'] = df['Close'].fillna((df['High'] + df['Low'])/2)

print(df)

Date Open High Low Close Volume


0 2006-06-01 471.60 474.00 442.00 444.42 21900
1 2006-06-12 454.00 464.00 440.00 446.17 8400
2 2006-06-22 451.16 464.20 447.60 460.26 19400
3 2006-07-03 495.10 509.68 493.00 498.97 9100
4 2006-07-13 518.00 526.40 517.00 521.66 6800
.. ... ... ... ... ... ...
364 2016-07-11 1265.50 1287.00 1265.80 1276.40 11800
365 2016-07-21 1301.00 1322.10 1261.00 1270.55 36800
366 2016-08-01 1279.65 1253.90 1247.40 1250.65 4600
367 2016-08-11 1180.05 683.00 1171.15 1191.35 4800
368 2016-08-22 1168.20 1168.20 1148.40 1155.55 2600

[369 rows x 6 columns]

In [4]:
# Checking the dtypes to verify the columns hold the expected types.
# The Volume column has dtype object.

print(df.dtypes)

Date datetime64[ns]
Open float64
High float64
Low float64
Close float64
Volume object
dtype: object

In [5]:
# The dtype was 'object' because some rows store the word 'zero'; replace it with 0.

df["Volume"].replace({"zero": 0}, inplace=True)

df=df.astype({"Volume": float})

In [6]:
print(df.dtypes)

Date datetime64[ns]
Open float64
High float64
Low float64
Close float64
Volume float64
dtype: object

In [7]:
# Checking the dataset.

print(df)

Date Open High Low Close Volume


0 2006-06-01 471.60 474.00 442.00 444.42 21900.0
1 2006-06-12 454.00 464.00 440.00 446.17 8400.0
2 2006-06-22 451.16 464.20 447.60 460.26 19400.0
3 2006-07-03 495.10 509.68 493.00 498.97 9100.0
4 2006-07-13 518.00 526.40 517.00 521.66 6800.0
.. ... ... ... ... ... ...
364 2016-07-11 1265.50 1287.00 1265.80 1276.40 11800.0
365 2016-07-21 1301.00 1322.10 1261.00 1270.55 36800.0
366 2016-08-01 1279.65 1253.90 1247.40 1250.65 4600.0
367 2016-08-11 1180.05 683.00 1171.15 1191.35 4800.0
368 2016-08-22 1168.20 1168.20 1148.40 1155.55 2600.0

[369 rows x 6 columns]

In [8]:
# In row 367, the 'High' value is lower than the 'Low' value.
# Repairing such mistakes with the earlier relation:
# Close = (High + Low) / 2  =>  High = 2 * Close - Low

df.loc[df["High"] <= df["Low"], "High"] = 2 * df['Close'] - df['Low']

In [9]:
# The dataset is finally all cleaned and preprocessed.

print('Fully cleaned and preprocessed dataset:','\n\n\n', df)

Fully cleaned and preprocessed dataset:


Date Open High Low Close Volume
0 2006-06-01 471.60 474.00 442.00 444.42 21900.0
1 2006-06-12 454.00 464.00 440.00 446.17 8400.0
2 2006-06-22 451.16 464.20 447.60 460.26 19400.0
3 2006-07-03 495.10 509.68 493.00 498.97 9100.0
4 2006-07-13 518.00 526.40 517.00 521.66 6800.0
.. ... ... ... ... ... ...
364 2016-07-11 1265.50 1287.00 1265.80 1276.40 11800.0
365 2016-07-21 1301.00 1322.10 1261.00 1270.55 36800.0
366 2016-08-01 1279.65 1253.90 1247.40 1250.65 4600.0
367 2016-08-11 1180.05 1211.55 1171.15 1191.35 4800.0
368 2016-08-22 1168.20 1168.20 1148.40 1155.55 2600.0

[369 rows x 6 columns]
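As a final sanity check (a sketch, not part of the original solution), one could flag any remaining rows where Open or Close fall outside the [Low, High] band:

# rows violating Low <= Open, Close <= High indicate leftover pollution
bad = df[(df['Open'] > df['High']) | (df['Open'] < df['Low']) |
         (df['Close'] > df['High']) | (df['Close'] < df['Low'])]
print(bad)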

NOTE: The rationale behind each step is stated using '#' comments before each snippet of code.

-----------------------------------------------------------------------------------------------------------------------

Q26. What is the total number of missing values within the dataset?

In [1]:
import pandas as pd

df = pd.read_csv('Concrete Compressive Strength.csv')

a=df.isnull().sum().sum()
print('Total number of missing values is:', a)

Total number of missing values is: 0

-----------------------------------------------------------------------------------------------------------------------

Q27. Which column / feature requires correction in the type of value they hold?

In [2]:
import pandas as pd

df = pd.read_csv('Concrete Compressive Strength.csv')

b=df.dtypes
print('DataTypes of the respective columns are:','\n ')
print(b)

DataTypes of the respective columns are:

Cement (component 1)(kg in a m^3 mixture) float64


Blast Furnace Slag (component 2)(kg in a m^3 mixture) float64
Fly Ash (component 3)(kg in a m^3 mixture) float64
Water (component 4)(kg in a m^3 mixture) float64
Superplasticizer (component 5)(kg in a m^3 mixture) float64
Coarse Aggregate (component 6)(kg in a m^3 mixture) float64
Fine Aggregate (component 7)(kg in a m^3 mixture) float64
Age (day) int64
Concrete compressive strength(MPa, megapascals) float64
dtype: object

-----------------------------------------------------------------------------------------------------------------------
Q28. After imputation of nulls with mean what is the average value of the compressive strength in concrete?

In [3]:
import pandas as pd

df = pd.read_csv('Concrete Compressive Strength.csv')

df.iloc[:, -1] = df.iloc[:, -1].fillna(df.iloc[:, -1].mean())

d = df.iloc[:, -1].mean()
print('Average of compressive strength in concrete is:', d)

Average of compressive strength in concrete is: 35.81783582620971

-----------------------------------------------------------------------------------------------------------------------

Q29. The feature that has a moderately strong relationship with compressive strength in concrete is:

In [4]:
import pandas as pd

df = pd.read_csv('Concrete Compressive Strength.csv')


df.columns = ['C1_cem','C2_Blast','C3_Ash','C4_water','C5_plast','C6_Coarse','C7_Agg',
'Age','CCstrength']

corr = df.corr()
print(corr, '\n\n')

def get_max_correlated_column(a):
    # zero out the self-correlation so idxmax picks a different feature
    corr.loc[a, a] = 0
    return corr[a].idxmax()

print('The feature that has a moderately strong relationship with compressive strength in concrete is:')
print(get_max_correlated_column('CCstrength'))

C1_cem C2_Blast C3_Ash C4_water C5_plast C6_Coarse \


C1_cem 1.000000 -0.275193 -0.397475 -0.081544 0.092771 -0.109356
C2_Blast -0.275193 1.000000 -0.323569 0.107286 0.043376 -0.283998
C3_Ash -0.397475 -0.323569 1.000000 -0.257044 0.377340 -0.009977
C4_water -0.081544 0.107286 -0.257044 1.000000 -0.657464 -0.182312
C5_plast 0.092771 0.043376 0.377340 -0.657464 1.000000 -0.266303
C6_Coarse -0.109356 -0.283998 -0.009977 -0.182312 -0.266303 1.000000
C7_Agg -0.222720 -0.281593 0.079076 -0.450635 0.222501 -0.178506
Age 0.081947 -0.044246 -0.154370 0.277604 -0.192717 -0.003016
CCstrength 0.497833 0.134824 -0.105753 -0.289613 0.366102 -0.164928

C7_Agg Age CCstrength


C1_cem -0.222720 0.081947 0.497833
C2_Blast -0.281593 -0.044246 0.134824
C3_Ash 0.079076 -0.154370 -0.105753
C4_water -0.450635 0.277604 -0.289613
C5_plast 0.222501 -0.192717 0.366102
C6_Coarse -0.178506 -0.003016 -0.164928
C7_Agg 1.000000 -0.156094 -0.167249
Age -0.156094 1.000000 0.328877
CCstrength -0.167249 0.328877 1.000000
The feature that has a moderately strong relationship with compressive strength in concrete is:
C1_cem
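The helper above can be reduced to a one-liner (an equivalent sketch) by dropping the self-correlation before taking the maximum:

print(corr['CCstrength'].drop('CCstrength').idxmax())  # C1_cem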

-----------------------------------------------------------------------------------------------------------------------

Q30. Standardize the dataset using StandardScaler(), split it into train and test sets in a 70:30 proportion,
and set the random state to 1. Build a linear regression model on the data; the resulting r-squared value
falls in which range?

In [5]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('Concrete Compressive Strength.csv')

scale = StandardScaler()
scaled_data_x = scale.fit_transform(df.iloc[:,:-1])

x = pd.DataFrame(scaled_data_x)
y = df.iloc[:,-1]

from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=1)

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression().fit(x_train,y_train)
y_test_pred = lin_reg.predict(x_test)

from sklearn.metrics import r2_score

print('r2 score of the model deployed is:')


print(r2_score(y_test,y_test_pred))

r2 score of the model deployed is:


0.5414363184636142

-----------------------------------------------------------------------------------------------------------------------

Q31. Match the terms in Group A with the relevant terms in Group B:

Group A Group B
A. k-means 1) unsupervised learning algorithm
B. knn 2) k is no. of clusters
C. logistic regression 3) k is no. of neighbors
D. clustering 4) logit function

ANSWERS:

A - 2 (k-means - k is no. of clusters)


B - 3 (knn - k is no. of neighbors)
C - 4 (logistic regression - logit function)
D - 1 (clustering - unsupervised learning algorithm)

-----------------------------------------------------------------------------------------------------------------------
Q32. Each centroid in K-Means algorithm defines one:

A. cluster
B. data point
C. two clusters
D. None of the above

ANSWER:

A. cluster

(Each centroid in K-Means algorithm defines one cluster.)

-----------------------------------------------------------------------------------------------------------------------

Q33. The method / metric which is NOT useful to determine the optimal number of clusters in unsupervised
clustering algorithms is:

A. Dendrogram
B. Elbow method
C. Scatter plot
D. None of the above

ANSWER:

C. Scatter plot

(The method / metric which is NOT useful to determine the optimal number of clusters in unsupervised
clustering algorithms is Scatter plot.)

-----------------------------------------------------------------------------------------------------------------------

Q34. In the K-means algorithm, what is the most commonly used distance metric to calculate distance
between centroid of each cluster and data points?

A. Chebyshev distance
B. Manhattan
C. Euclidean
D. None of the above

ANSWER:

C. Euclidean

(In the K-means algorithm, Euclidean is the most commonly used distance metric to calculate distance
between centroid of each cluster and data points.)

-----------------------------------------------------------------------------------------------------------------------

Q35. Which of the following statements is not correct about k-means?

A. Accuracy of clusters is improved by scaling of attributes.


B. K-means clusters are affected by outliers.
C. K-Means clustering is NOT influenced by initial centroids, which are called cluster seeds.
D. Number of clusters to be built is typically a user input and it impacts the way clusters are created.

ANSWER:

C. K-Means clustering is NOT influenced by initial centroids which are called cluster seeds

-----------------------------------------------------------------------------------------------------------------------
