CS306 Data Analysis and Visualization

CS306
Data Analysis and Visualization
Project Report
Prepared by:
Tirth Shah - 201601009

Dhruvesh Asnani - 201601423
1 Data Cleaning and Geocoding
We first reduced the size of our data by taking the first 3 lakh entries of yes bank.csv. We inspected
this reduced file to find criteria for removing invalid records. We observed that many records had the
placeholder value addpin in their current pin code field. Such records had no value in any of the
permanent address fields and had placeholder values such as addcity and addstate in the current address
fields. We therefore removed every record with addpin in the current pin code field. This was the sole
criterion for our cleaning at this stage. After this step, 161065 records remained (roughly half of the
original 3 lakh).
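The cleaning rule above can be sketched as follows. This is a minimal illustration, not the code used for the report: the column names (customer_id, current_pincode, current_city) and the in-memory sample are assumptions, since the report does not list the header of the source file.

```python
import csv
import io
from itertools import islice

# Hypothetical column name; the real header in the source file may differ.
CURRENT_PIN_FIELD = "current_pincode"
PLACEHOLDER = "addpin"
MAX_ROWS = 300_000  # the first 3 lakh entries


def drop_placeholder_rows(rows, field=CURRENT_PIN_FIELD, placeholder=PLACEHOLDER):
    """Keep only rows whose current pin code is not the placeholder string."""
    return [row for row in rows if row[field].strip().lower() != placeholder]


# Tiny in-memory sample standing in for the real csv file.
sample = io.StringIO(
    "customer_id,current_pincode,current_city\n"
    "1,382007,Gandhinagar\n"
    "2,addpin,addcity\n"
    "3,400001,Mumbai\n"
)

# Read at most MAX_ROWS records, then apply the single cleaning criterion.
rows = list(islice(csv.DictReader(sample), MAX_ROWS))
cleaned = drop_placeholder_rows(rows)
print(len(rows), "->", len(cleaned))
```

For the real file, the StringIO sample would be replaced by the opened csv file.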
Next, we performed geocoding. We found a csv file on GitHub [1] containing the pincodes of Indian
regions along with other data, including the latitude and longitude of a point in the region that each
pincode represents. Using this file, we geocoded both the current and the permanent address of around
1.5 lakh records based on their pincodes, that is, around 3 lakh addresses in total. Around 10,000
records remained. We geocoded these using the geocoder API in Python, specifically the geocoding
service provided by Nominatim, which resolved around another 5000 records. The remaining records,
around 5000, could not be geocoded and were assumed to be invalid.
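The two-stage geocoding described above (pincode lookup first, an online service second) might look roughly like the sketch below. The field names pincode, latitude and longitude, the sample coordinates, and the fake_service function are all assumptions made for illustration. In the report the second stage was the Nominatim service; here it is replaced by a local stand-in so the example runs without network access.

```python
def build_pincode_index(rows):
    """Map pincode -> (lat, lon) from rows shaped like the pincode csv."""
    index = {}
    for row in rows:
        # Keep the first coordinate seen for each pincode.
        index.setdefault(
            row["pincode"], (float(row["latitude"]), float(row["longitude"]))
        )
    return index


def geocode(pincode, address, index, fallback=None):
    """Return (lat, lon), or None if neither stage can resolve the address."""
    if pincode in index:
        return index[pincode]
    if fallback is not None:
        return fallback(address)
    return None


# Illustrative rows; the values are approximate and only for demonstration.
pin_rows = [
    {"pincode": "382007", "latitude": "23.19", "longitude": "72.63"},
    {"pincode": "400001", "latitude": "18.94", "longitude": "72.84"},
]
index = build_pincode_index(pin_rows)


def fake_service(address):
    # Stand-in for a real geocoding call; returns None when nothing matches.
    return (28.61, 77.21) if "Delhi" in address else None


hit = geocode("382007", "Gandhinagar", index, fake_service)
via_fallback = geocode("999999", "New Delhi", index, fake_service)
miss = geocode("999999", "unknown place", index, fake_service)
print(hit, via_fallback, miss)
```

Records for which geocode returns None correspond to the ones the report treats as invalid.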
We did not perform data normalization or augmentation. Our final csv contains 158022 records in total.
Each record has 5 fields: customer id, current address latitude, current address longitude, permanent
address latitude and permanent address longitude.
2 Classification and Clustering

To classify customers into household and business customers, we use the following scheme. A customer
whose current address and permanent address are the same is a household customer; otherwise the
customer is a business customer. Under this classification, there were 136652 household customers and
21370 business customers.
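A minimal sketch of this rule is given below. It assumes that "same address" is tested by comparing the geocoded coordinates of the two addresses, which the report does not state explicitly; the field names and sample records are hypothetical.

```python
from collections import Counter


def classify(record):
    """Label a record 'household' if both addresses share coordinates."""
    current = (record["cur_lat"], record["cur_lon"])
    permanent = (record["perm_lat"], record["perm_lon"])
    return "household" if current == permanent else "business"


# Illustrative records with hypothetical field names and values.
records = [
    {"customer_id": 1, "cur_lat": 23.19, "cur_lon": 72.63,
     "perm_lat": 23.19, "perm_lon": 72.63},
    {"customer_id": 2, "cur_lat": 18.94, "cur_lon": 72.84,
     "perm_lat": 28.61, "perm_lon": 77.21},
    {"customer_id": 3, "cur_lat": 28.61, "cur_lon": 77.21,
     "perm_lat": 28.61, "perm_lon": 77.21},
]

counts = Counter(classify(r) for r in records)
print(counts["household"], counts["business"])
```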
For clustering, we used the K-means algorithm to cluster the addresses by geographical location, using
the KMeans implementation provided by the scikit-learn package [2].
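As an illustration of this step, the sketch below runs scikit-learn's KMeans on synthetic (latitude, longitude) points. The data, the choice of three clusters, and the n_init and random_state values are assumptions for the example and are not the settings used in the report. Note that KMeans applies Euclidean distance to the raw coordinates, which only approximates distance on the ground.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic points around three illustrative (lat, lon) centers.
rng = np.random.default_rng(0)
centers = np.array([[19.0, 72.8], [28.6, 77.2], [13.0, 80.2]])
points = np.vstack([c + rng.normal(scale=0.1, size=(50, 2)) for c in centers])

# Fit K-means; each row of `points` receives a cluster label.
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
labels = model.labels_

print(model.cluster_centers_)
```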
3 Scatter Plots, Heatmaps and other graphs
We used the basemap package [3] to plot our data on the world map. For each set of addresses, we first
show a scatter plot and then a heatmap of the same addresses, which makes dense regions easier to see.
Finally, we show the result of applying the clustering algorithm to the addresses.
3.1 Current Addresses
Figure 1: Scatter plot of current addresses
Figure 2: Scatter plot of current addresses in India
Figure 3: Heatmap of current addresses
Figure 4: Heatmap of current addresses in India
3.2 Permanent Addresses
Figure 5: Scatter plot of permanent addresses
Figure 6: Scatter plot of permanent addresses in India
Figure 7: Heatmap of permanent addresses
Figure 8: Heatmap of permanent addresses in India
3.3 Household Addresses
Figure 9: Scatter plot of household addresses
Figure 10: Scatter plot of household addresses in India
Figure 11: Heatmap of household addresses
Figure 12: Heatmap of household addresses in India
3.4 Business Addresses
Figure 13: Scatter plot of business addresses
Figure 14: Scatter plot of business addresses in India
Figure 15: Heatmap of business addresses
Figure 16: Heatmap of business addresses in India
3.5 Clustering algorithm results
For current addresses, here is the plot of error versus the number of clusters. The error is defined as
the sum of squared distances of samples to their closest cluster center.
Figure 17: The elbow occurs at k = 7
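The error curve can be reproduced in outline as follows. This sketch uses synthetic data with three well-separated groups, so its elbow falls at k = 3 rather than the k = 7 found for the real addresses. The helper sum_squared_error is a hypothetical name introduced here to spell out the definition; scikit-learn exposes the same quantity as the inertia_ attribute of a fitted KMeans model.

```python
import numpy as np
from sklearn.cluster import KMeans


def sum_squared_error(points, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    diffs = points - centers[labels]
    return float((diffs ** 2).sum())


# Synthetic (lat, lon) points around three illustrative centers.
rng = np.random.default_rng(0)
true_centers = np.array([[19.0, 72.8], [28.6, 77.2], [13.0, 80.2]])
points = np.vstack([c + rng.normal(scale=0.1, size=(50, 2)) for c in true_centers])

errors = {}
manual = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    errors[k] = model.inertia_
    # Recompute the error by hand to check it against the definition.
    manual[k] = sum_squared_error(points, model.cluster_centers_, model.labels_)

for k, err in errors.items():
    print(k, round(err, 2))
```

Plotting errors against k and looking for the point where the curve flattens gives the elbow.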
Here is the result with 7 clusters:
Figure 18: Clustering of current addresses with 7 clusters
Figure 19: Clustering of permanent addresses with 7 clusters
References
[1] https://github.com/arswright/data-geonames
[2] https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
[3] https://matplotlib.org/basemap/