Q1. a. Create the variables n, r, p and assign them the values 10, 5, and 100 respectively. Then evaluate the following
expression.
A = p(1 + r/100)^n
b. What are the data types of the variables p, r, n, and A in problem (1(a))?
c. If the data type of the variable ‘n’ is not integer, then redefine ‘n’ as integer with its value being 10 and
again compute the value of A.
In [1]:
n, r, p = 10, 5, 100
A = p * ((1 + (r / 100)) ** n)
print('A =', A)
print('Data types:', type(p), type(r), type(n), type(A))
-----------------------------------------------------------------------------------------------------------------------
In [2]:
import os
import numpy as np
my_list1 = [2,7,3,5,4,6]
print(my_list1)
-----------------------------------------------------------------------------------------------------------------------
Q3. Create a string called ‘string’ with the value as “Machine Learning”. Which code(s) is/are appropriate to
slice the sub-string “Learn”?
In [3]:
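Two equivalent slices extract the substring “Learn” (in the 16-character string, 'L' sits at index 8 and 'n' at index 12):

```python
string = "Machine Learning"
# 'Learn' occupies indices 8..12, so slice up to (but excluding) index 13
print(string[8:13])    # Learn
# the same slice written with negative indices (16 - 8 = 8, 16 - 3 = 13)
print(string[-8:-3])   # Learn
```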
-----------------------------------------------------------------------------------------------------------------------
Q4. Create a sequence of numbers from 15 to 25 and increment by 4. What is the index of the value 19?
In [4]:
a = list(range(15, 26, 4))          # [15, 19, 23]
print('The sequence is:', a)
print('Index of the value 19 is:', a.index(19))
-----------------------------------------------------------------------------------------------------------------------
Q5. Which of the following is true with respect to the below codes?
A. num1 = num2
B. num1 ≠ num2
C. num1 < num2
D. num1 > num2
In [5]:
num1 = 5**4
num2 = pow(5,4)
print('num1 =', num1, '\t ', ' num2 =', num2, '\n ')
if num1 == num2:
    print('As num1 = num2, Option A is correct.')
-----------------------------------------------------------------------------------------------------------------------
Q6. We have two lists. L1 = ['Coffee', 'Tea'] L2 = ['Hot', 'Ice']. Using a nested while loop compute the
following output:
Coffee
* Hot
* Ice
Tea
* Hot
* Ice
In [6]:
L1 = ['Coffee', 'Tea']
L2 = ['Hot', 'Ice']
i = 0
while i < len(L1):
    print(L1[i])
    a = 0
    while a < len(L2):
        print('*', L2[a])
        a = a + 1
    i = i + 1
Coffee
* Hot
* Ice
Tea
* Hot
* Ice
-----------------------------------------------------------------------------------------------------------------------
Q7. Write a function which finds all Pythagorean triplets of triangles whose sides are no greater than a
natural number N.
In [7]:
def py_trip(N):
    # Try every pair (a, b) with a <= b <= N and keep those where
    # a^2 + b^2 is a perfect square c^2 with c <= N.
    for a in range(1, N + 1):
        for b in range(a, N + 1):
            c = int((a * a + b * b) ** 0.5)
            if c <= N and c * c == a * a + b * b:
                print(a, b, c)

py_trip(20)
-----------------------------------------------------------------------------------------------------------------------
Q8. Given a list of integers, write a function that finds the smallest and the largest value of this list.
Given a list of numbers, write a function which finds their standard deviation.
In [8]:
b = [2,3,4,-5]
def smallest_num(a):
    min_val = a[0]          # avoid shadowing the built-in min()
    for i in a:
        if i < min_val:
            min_val = i
    return min_val

def largest_num(a):
    max_val = a[0]          # avoid shadowing the built-in max()
    for i in a:
        if i > max_val:
            max_val = i
    return max_val

x = smallest_num(b)
y = largest_num(b)
print('Smallest:', x, '  Largest:', y)
In [9]:
b = [6.5, 25.8, 16.9, 25.12, 23.2]
mean = sum(b) / len(b)
total = 0
for i in range(len(b)):
    total = total + ((b[i] - mean) ** 2)
std_dev = (total / len(b)) ** 0.5   # population standard deviation
print('Standard deviation is:', std_dev)
-----------------------------------------------------------------------------------------------------------------------
Q9. There are 3 arrangements of the word DAD, namely DAD, ADD, and DDA. How many arrangements are
there of the word ENDURINGLY?
In [10]:
import math
s = 'ENDURINGLY'
dup_num = {}
for i in range(len(s)):
    count = 1
    for j in range(i + 1, len(s)):
        if s[i] == s[j] and s[i] != ' ' and s[i] != '0':
            count = count + 1
            s = s[:j] + '0' + s[j + 1:]
    if count > 1:
        dup_num.update({s[i]: count})
denom = 1
for i in dup_num.values():
    denom = denom * math.factorial(i)
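The count can also be obtained directly from the multiset-permutation formula 10!/2! (ENDURINGLY has 10 letters, with N repeated twice):

```python
import math

word = 'ENDURINGLY'
# Divide by the factorial of each letter's multiplicity
denom = 1
for c in set(word):
    denom *= math.factorial(word.count(c))
arrangements = math.factorial(len(word)) // denom
print(arrangements)   # 1814400
```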
-----------------------------------------------------------------------------------------------------------------------
Q10. There are 13 men and 12 women in a ballroom dancing class. If 6 men and 6 women are chosen and
paired off, how many pairings are possible?
In [11]:
import math
# Choose 6 of the 13 men, 6 of the 12 women, then pair them off (6! orderings)
pairings = math.comb(13, 6) * math.comb(12, 6) * math.factorial(6)
print('Number of possible pairings:', pairings)
-----------------------------------------------------------------------------------------------------------------------
Q11. Suppose you are taking a multiple-choice test with 4 choices for each question. In answering a question
on this test, the probability that you know the answer is 0.33. If you don’t know the answer, you choose one
at random. What is the probability that you knew the answer to a question, given that you answered it
correctly?
ANSWER:
Let K be the event that you know the answer and C the event that you answer correctly.
P(C|K) = 1 (if you know the answer, you answer correctly) and P(C|K') = 1/4 (a random guess among 4 choices).
With P(K) = 0.33 ≈ 1/3, Bayes' theorem gives
P(K|C) = P(C|K)P(K) / [P(C|K)P(K) + P(C|K')P(K')] = (1/3) / (1/3 + (2/3)(1/4)) = (1/3) / (1/2) = 2/3.
Hence, the probability that you knew the answer to a question, given that you answered it correctly, is 2/3 ≈ 0.66.
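As a numerical check of Bayes' rule with the values stated in the question (P(K) = 0.33, four answer choices):

```python
# P(K|C) = P(C|K) P(K) / [P(C|K) P(K) + P(C|K') P(K')]
p_know = 0.33                  # probability you know the answer
p_correct_given_know = 1.0     # knowing the answer means answering correctly
p_correct_given_guess = 1 / 4  # random guess among 4 choices

p_correct = (p_correct_given_know * p_know
             + p_correct_given_guess * (1 - p_know))
p_know_given_correct = p_correct_given_know * p_know / p_correct
print(round(p_know_given_correct, 4))
```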
-----------------------------------------------------------------------------------------------------------------------
Q12. Read the given data ‘TIPS.csv’ as a dataframe named Tips and answer the following question:
a) In the tips dataframe, for the variable “total bill” what is the 3rd quartile and maximum value?
In [12]:
import pandas as pd
import numpy as np
df=pd.read_csv('Tips.csv')
df=df['TotalBill']
max = df.max()
print(' Max value of TotalBill is:', max,'\n ')
In [13]:
q3 = df.quantile(0.75)
print('3rd quartile of TotalBill is:', q3)
-----------------------------------------------------------------------------------------------------------------------
Q13. A normal distribution which has a mean of 50 and standard deviation of 7 is taken into consideration.
68% of the distribution can be found between what two numbers?
a. 40 and 60
b. 0 and 43
c. 0 and 68
d. 43 and 57
In [14]:
mean = 50
std_dev = 7
a = mean - std_dev
b = mean + std_dev
print('68% of the distribution lies between', a, 'and', b)
ANSWER: D. 43 and 57
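The 68% figure is the one-sigma case of the empirical rule; it can be verified against the normal CDF using only the standard library:

```python
import math

mean, std_dev = 50, 7
a, b = mean - std_dev, mean + std_dev

# For a normal distribution, P(mean - sd < X < mean + sd) = erf(1 / sqrt(2))
p = math.erf(1 / math.sqrt(2))
print(a, b, round(p, 4))   # 43 57 0.6827
```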
-----------------------------------------------------------------------------------------------------------------------
Q14. Consider the data X = (58,59,63,60,60,63,60,57,58,59). An unbiased estimation for population variance
would be __
In [1]:
X = (58,59,63,60,60,63,60,57,58,59)
mean = sum(X) / len(X)
total = 0
for i in range(len(X)):
    total = total + ((X[i] - mean) ** 2)
# Dividing by (len(X) - 1) instead of len(X) corrects the bias in the
# estimation of the population variance (Bessel's correction).
variance = total / (len(X) - 1)
print('Unbiased estimate of the population variance is:', variance)
-----------------------------------------------------------------------------------------------------------------------
Q15. From the given below boxplot identify the median value and the outlier.
ANSWER:
Median: 27
-----------------------------------------------------------------------------------------------------------------------
Q16. Create a dictionary ‘Country’ that maps the following countries to their capitals respectively:
Country = {'India':'Delhi','China':'Beijing','Japan':'Tokyo','Qatar':'Doha','France':'Paris'}
print('The created dictionary is:', '\n ', Country, '\n ')
-----------------------------------------------------------------------------------------------------------------------
Q17. Consider the tuples below. What is the output of each of the following options?
tuple_1 = (1,5,6,7,8)
tuple_2 = (8,9,4)
a) sum(tuple_1)
b) len(tuple_2)
c) tuple_2 + tuple_1
d) tuple_1[3] = 45
In [2]:
tuple_1 = (1,5,6,7,8)
tuple_2 = (8,9,4)
a=sum(tuple_1)
print('Output of Option a is:', a, '\n ')
b=len(tuple_2)
print('Output of Option b is:', b, '\n ')
c=tuple_2 + tuple_1
print('Output of Option c is:', c, '\n ')
d=tuple_1[3] = 45
print('Output of Option d is:', d, '\n ')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-2-449643dbbafe> in <module>
11 print('Output of Option c is:', c, '\n')
12
---> 13 d=tuple_1[3] = 45
14 print('Output of Option d is:', d, '\n')
ANSWER:
Options a, b, and c run successfully; option d raises a TypeError because tuples are immutable and do not support item assignment.
-----------------------------------------------------------------------------------------------------------------------
Q18. Find the number of elements in the below data structure.
In [3]:
from collections import Counter
S = {1,2,3,4,4,4,5,6}          # a set stores only the unique values {1, 2, 3, 4, 5, 6}
a = Counter(S)
print('The number of elements in above data structure is:', len(S))
-----------------------------------------------------------------------------------------------------------------------
Q19. Create an array with whole numbers values from 0 to 10 and find what is the command to extract the
elements in the following sequence -
In [4]:
import numpy as np
a = np.arange(0, 11)           # whole numbers 0 to 10 inclusive
b = a[-3:0:-3]
print(b)
-----------------------------------------------------------------------------------------------------------------------
Q20. Create a 2-dimensional array with 3 rows and 3 columns containing random numbers from 1 to 9.
Find the difference between the maximum element across the columns and the minimum element across the
rows.
In [5]:
import numpy as np
a = np.arange(1, 10)
b = np.reshape(a, (3, 3))
print('The 2D array is', '\n', b)
c = np.amax(b, axis=0)   # maximum element of each column
d = np.amin(b, axis=1)   # minimum element of each row
print('Difference:', c - d)
The 2D array is
[[1 2 3]
[4 5 6]
[7 8 9]]
-----------------------------------------------------------------------------------------------------------------------
Q21. What is the command to convert the above dictionary into a dataframe named ‘df_state’?
In [6]:
import pandas as pd
state_data={'state':['Goa','Goa','Goa','Gujarat','Gujarat','Gujarat',],
'year':[2010, 2011, 2012, 2013, 2014, 2015],
'pop':[2.5, 2.7, 4.6, 3.4, 3.9, 4.2]}
df_state = pd.DataFrame(state_data)
a = df_state          # alias used in the next question
print('The dataframe generated is:','\n\n', df_state)
-----------------------------------------------------------------------------------------------------------------------
Q22. List 3 commands to display the columns of the above dataframe df_state.
In [7]:
b=a['state']
print('By using 1st command:', '\n ', b)
c=a.year
print('\n ','By using 2nd command:', '\n ', c)
d=a.iloc[:,2]
print('\n ','By using 3rd command:', '\n ', d)
-----------------------------------------------------------------------------------------------------------------------
Q23. Correlation between two variables X&Y is 0.85. Now, after adding the value 2 to all the values of X, the
correlation co-efficient will be__
ANSWER:
After adding the value 2 to all the values of X, the correlation coefficient will remain the same (0.85), since correlation is invariant to adding a constant to either variable.
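A quick numeric sketch (the arrays below are made up for illustration, not taken from the question) confirms this shift-invariance:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 10.0])

r_before = np.corrcoef(x, y)[0, 1]
r_after = np.corrcoef(x + 2, y)[0, 1]   # shift every x value by 2

# Correlation standardizes each variable (subtracting its mean),
# so a constant shift cancels out and r is unchanged.
print(np.isclose(r_before, r_after))    # True
```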
-----------------------------------------------------------------------------------------------------------------------
Q24. A) Read the given dataset “Tips.csv” as a dataframe “Data”. Give 3 commands to extract the columns in
the following sequence - Time, TotalBill, Tips?
B) Read the given excel sheet ‘Tips1.xlsx’ as a dataframe ‘Data1’. What command should be given to merge
the two data frames ‘Data’ and ‘Data1’ by columns?
C) Copy the 'Data2' dataframe as 'Data3’. (Data3 = Data2.copy()) and identify the command to find the total
tips received across Day’s from the dataframe ‘Data3’?
In [1]:
import pandas as pd
import numpy as np
data = pd.read_csv('Tips.csv')
# Three equivalent commands to extract the columns in the order Time, TotalBill, Tips:
print(data[['Time', 'TotalBill', 'Tips']].head())
print(data.loc[:, ['Time', 'TotalBill', 'Tips']].head())
print(data.filter(['Time', 'TotalBill', 'Tips']).head())
In [2]:
import pandas as pd
data1 = pd.read_excel(r'Tips1.xlsx')
print(data1.head(), '\n ')
# Merging the two dataframes 'Data' and 'Data1' by columns:
data2 = pd.concat([data, data1], axis=1)
print(data2.head())
SINO Gender
0 1 Female
1 2 Male
2 3 Male
3 4 Male
4 5 Female
In [3]:
data3 = data2.copy()
print(data3.head(), '\n ')
data4 = data3.groupby(by='Day').sum()['Tips']
print(data4)
Day
Fri 28.28
Sat 102.79
Sun 102.28
Thur 41.99
Name: Tips, dtype: float64
-----------------------------------------------------------------------------------------------------------------------
Q25. ‘Stock_File_1’, a stock trend forecasting company, has just employed you as a Data Scientist. As a first task in
your new job, your manager has provided you with a company’s stock data and asked you to check the
quality of the data for the next step of analysis. Following are the additional description and information
about the data which your manager has shared with you.
b) Typically, the stock market opens at 9:15 hours and closes at 15:30 hours. Each stock is defined by an opening
price and a closing price, which are the prices it opens and closes with. Within the operating hours, the stock price
touches a maximum and minimum, which are the highest and lowest prices achieved by the stock in the working hours of
the stock market. You have access to ten years of monthly stock price data with the Open, High, Low and Close price
and the number of stocks traded for each day given by the feature Volume. On some days when there is no trading, the
parameters Open, High, Low and Close remain constant and Volume is zero.
Furthermore, your manager also claims that the model's predictions are poor because the data is polluted. Try to
impress your new boss by preprocessing the data and giving a proper rationale for the steps you would follow. The
two datasets should be merged before preprocessing.
In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
# df_stock1 and df_stock2 are assumed to have been read in earlier cells
# (e.g. with pd.read_csv on the two provided stock files).
# Joining both the datasets (Stock_1 and Stock_2) and resetting the index.
df = pd.concat([df_stock1, df_stock2]).reset_index()
df.drop(['index'], axis='columns', inplace=True)
print(df)
In [2]:
# Calculated the null values to check if the dataset had flaws.
a = df.isnull().sum()
print(a)
Date 0
Open 7
High 16
Low 13
Close 14
Volume 0
dtype: int64
In [3]:
# Open = (High + Low) / 2
# Replacing the null values in the Open column with the mean of the High and
# Low values of the respective rows.
df['Open'] = df['Open'].fillna((df['High'] + df['Low']) / 2)
print(df)
In [4]:
# Checking the datatypes to make sure all columns hold appropriate ones.
# The Volume column has dtype 'object'.
print(df.dtypes)
Date datetime64[ns]
Open float64
High float64
Low float64
Close float64
Volume object
dtype: object
In [5]:
# Checked why the dtype was 'object': some rows contain the word 'zero'.
# Replacing the word 'zero' with 0 before casting to float.
df['Volume'] = df['Volume'].replace('zero', 0)
df = df.astype({"Volume": float})
In [6]:
print(df.dtypes)
Date datetime64[ns]
Open float64
High float64
Low float64
Close float64
Volume float64
dtype: object
In [7]:
# Checking the dataset.
print(df)
In [8]:
# In row #367 the 'High' value is lower than the 'Low' value; swap such rows
# so that High >= Low always holds.
mask = df['High'] < df['Low']
df.loc[mask, ['High', 'Low']] = df.loc[mask, ['Low', 'High']].values
In [9]:
# The dataset is finally all cleaned and preprocessed.
NOTE: Rationale behind the steps are stated using '#' before each snippet of code.
-----------------------------------------------------------------------------------------------------------------------
Q26. What is the total number of missing values in the dataset?
In [1]:
import pandas as pd
# df is the concrete-strength dataframe loaded in an earlier cell.
a = df.isnull().sum().sum()
print('Total number of missing values is:', a)
-----------------------------------------------------------------------------------------------------------------------
Q27. Which column / feature requires correction in the type of value they hold?
In [2]:
import pandas as pd
b=df.dtypes
print('DataTypes of the respective columns are:','\n ')
print(b)
-----------------------------------------------------------------------------------------------------------------------
Q28. After imputation of nulls with mean what is the average value of the compressive strength in concrete?
In [3]:
import pandas as pd
# Impute the nulls with the column means before computing the average.
df = df.fillna(df.mean(numeric_only=True))
d = df.iloc[:, -1].mean()
print('Average of compressive strength in concrete is:', d)
-----------------------------------------------------------------------------------------------------------------------
Q29. The feature that has a moderately strong relationship with compressive strength in concrete is:
In [4]:
import pandas as pd
df['CCstrength'] = df['CCstrength'].astype(float)
corr = df.corr()
print(corr, '\n\n')

def get_max_correlated_column(a):
    # Drop the self-correlation (always 1.0) and return the feature
    # most strongly correlated with column a.
    return corr[a].drop(a).idxmax()

print('The feature that has a moderately strong relationship with compressive strength in concrete is:')
print(get_max_correlated_column('CCstrength'))
-----------------------------------------------------------------------------------------------------------------------
Q30. Standardize the dataset using standardscaler(), split the dataset into train and test of proportions 70:30
and set the random state to 1. Build a Linear Regression Model on the data and the resulting r-squared value
is between which range?
In [5]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

scale = StandardScaler()
scaled_data_x = scale.fit_transform(df.iloc[:, :-1])
x = pd.DataFrame(scaled_data_x)
y = df.iloc[:, -1]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)
lin_reg = LinearRegression().fit(x_train, y_train)
y_test_pred = lin_reg.predict(x_test)
print('R-squared on the test set:', r2_score(y_test, y_test_pred))
-----------------------------------------------------------------------------------------------------------------------
Q31. Match the terms in Group A with the relevant terms in Group B:
Group A Group B
A. k-means 1) unsupervised learning algorithm
B. knn 2) k is no. of clusters
C. logistic regression 3) k is no. of neighbors
D. clustering 4) logit function
ANSWERS:
A-2 (k-means: k is the number of clusters), B-3 (knn: k is the number of neighbors),
C-4 (logistic regression: logit function), D-1 (clustering: unsupervised learning algorithm).
-----------------------------------------------------------------------------------------------------------------------
Q32. Each centroid in K-Means algorithm defines one:
A. cluster
B. data point
C. two clusters
D. None of the above
ANSWER:
A. cluster
-----------------------------------------------------------------------------------------------------------------------
Q33. The method / metric which is NOT useful to determine the optimal number of clusters in unsupervised
clustering algorithms is:
A. Dendrogram
B. Elbow method
C. Scatter plot
D. None of the above
ANSWER:
C. Scatter plot
(The method / metric which is NOT useful to determine the optimal number of clusters in unsupervised
clustering algorithms is Scatter plot.)
-----------------------------------------------------------------------------------------------------------------------
Q34. In the K-means algorithm, what is the most commonly used distance metric to calculate distance
between centroid of each cluster and data points?
A. Chebyshev distance
B. Manhattan
C. Euclidean
D. None of the above
ANSWER:
C. Euclidean
(In the K-means algorithm, Euclidean is the most commonly used distance metric to calculate distance
between centroid of each cluster and data points.)
-----------------------------------------------------------------------------------------------------------------------
ANSWER:
C. K-Means clustering is NOT influenced by initial centroids which are called cluster seeds
(This statement is false: K-Means IS sensitive to the choice of initial centroids, which is why techniques such as k-means++ initialization and multiple restarts are commonly used.)
-----------------------------------------------------------------------------------------------------------------------