"Gender prediction using ML Algorithms on Twitter data"

The algorithm that gives the highest accuracy is “Random Forest”, with
an accuracy of 55%.

Q1) What are the most common emotions/words used by Males and
Females?
A) Across male and female users combined, the most common word is
“get”, with a count of 900.
Taken individually:
The most common word used by female users is "get".
The most common word used by male users is "like".

Q2) Which gender makes more typos in their tweets?

A) Number of males who made typos in their Tweets: 1511

Number of females who made typos in their Tweets: 1381

“Males made more typos in their tweets than females.”

SUMMARY:
Our project is “Gender prediction on Twitter data using ML
Algorithms”.
It deals with predicting the gender of Twitter users from a dataset
that contains a number of features.
First, we read the complete data from Excel and examined its contents.
After analyzing the data, we saw that the dataset was unclean: it
contained a lot of null values, which would affect our accuracy. We
did our work in a Jupyter Notebook.
We imported the required libraries and calculated the percentage of
missing data in the dataset. Using dropna(), we dropped all the rows
containing null values.
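The missing-data check and the dropna() step can be sketched as follows. This is a minimal illustration, not the project's notebook code: the tiny DataFrame and its column names are hypothetical stand-ins for the Twitter dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Twitter dataset; column names are assumptions.
df = pd.DataFrame({
    "gender": ["male", "female", None, "brand"],
    "tweet_count": [120, np.nan, 45, 300],
    "text": ["hello world", "good morning", "nice day", None],
})

# Percentage of missing values in each column.
missing_pct = df.isnull().mean() * 100

# Drop every row that contains at least one null value.
clean_df = df.dropna()

print(missing_pct)
print(len(clean_df), "rows remain after dropna()")
```

Note that dropna() with default arguments removes a row if any single column is null, so it can discard a large share of the data when nulls are spread across many columns.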
As gender is our dependent variable, we checked which categories the
gender column contains and removed the unnecessary categories, such
as ‘brand’ and ‘unknown’. Then we checked the correlation between
the independent variables using a “Heat map” and removed the columns
that were highly correlated with one another.
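One common way to drop highly correlated columns is sketched below. The synthetic columns and the 0.9 threshold are assumptions made for illustration; the report does not state which threshold or columns were used, and the heat map itself is omitted here.

```python
import numpy as np
import pandas as pd

# Synthetic features: "b" is almost a copy of "a", "c" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

# Absolute pairwise correlations between the feature columns.
corr = df.corr().abs()

# Keep only the upper triangle so that each pair is examined once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Assumed threshold: drop a column if it correlates above 0.9 with an earlier one.
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
reduced = df.drop(columns=to_drop)

print("Dropped:", to_drop)
```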
After cleaning the data, we dealt with the gender column. The gender
column contains 4 unique values (male, female, brand, unknown), but
we are only interested in rows containing male and female, so we
filtered out the rows containing brand and unknown. We then performed
“Label encoding”, which assigns male as 1 and female as 0.
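The filtering and label-encoding step might look like the sketch below. The sample DataFrame is hypothetical; LabelEncoder from sklearn.preprocessing is one way to get the mapping described above, since it sorts class names alphabetically (female becomes 0, male becomes 1).

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical gender column with the four categories named in the report.
df = pd.DataFrame({"gender": ["male", "female", "brand", "unknown", "female"]})

# Keep only the rows labelled male or female.
df = df[df["gender"].isin(["male", "female"])].copy()

# Classes are sorted alphabetically, so female -> 0 and male -> 1.
encoder = LabelEncoder()
df["gender_encoded"] = encoder.fit_transform(df["gender"])

print(df)
```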
We then plotted each of the features using a “Histogram”, and the
dependent variable “Gender” using “Seaborn”.
After completing feature selection, we passed the features to the
train_test_split function, which splits the data into a training set
and a testing set. Then we started with our algorithms.
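A minimal sketch of the split, using synthetic data in place of the selected features. The 80/20 ratio, the random_state, and the stratify argument are assumptions; the report does not say which settings were used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 4 numeric features, binary gender label.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 2, size=100)

# Hold out 20% of the rows for testing; stratify keeps the class ratio similar
# in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```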
The first algorithm we tried was Logistic Regression, for which we
imported “LogisticRegression” from “sklearn.linear_model”. We
trained our logistic regression model using “.fit()” and computed its
accuracy using “.score()”. Finally, we plotted each independent
column against the dependent column individually.
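The fit-and-score pattern for logistic regression is sketched here on synthetic data generated with make_classification, which stands in for the Twitter features. The max_iter value is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for the real features.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# For classifiers, .score() returns the mean accuracy on the given data.
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```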
The second algorithm was K Nearest Neighbors (KNN), for which we
imported “KNeighborsClassifier” from “sklearn.neighbors”. We trained
our KNN model using “.fit()” and computed its accuracy using
“accuracy_score” from sklearn.metrics. Finally, we plotted
“n_neighbors” against “Accuracy” and read the highest KNN accuracy
off the graph.
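The sweep over n_neighbors can be sketched as below, again on synthetic data. The range of 1 to 10 neighbors is an assumption, and the plotting call is left out so that the block only computes the values that would go on the graph.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train one KNN model per candidate k and record its test accuracy.
accuracies = {}
for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    accuracies[k] = accuracy_score(y_test, knn.predict(X_test))

# The k with the highest recorded accuracy (first one wins on ties).
best_k = max(accuracies, key=accuracies.get)
print(best_k, accuracies[best_k])
```

Picking k directly on the test set tends to give an optimistic accuracy; a separate validation split or cross-validation is a common alternative.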
The third algorithm we worked on was Support Vector Machines (SVM),
for which we imported “SVC” from “sklearn.svm”. We trained our SVM
model using “.fit()” with the kernel set to “linear” and evaluated it
using “classification_report” from sklearn.metrics.
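A sketch of the linear-kernel SVM and its classification report, on synthetic data. The target_names follow the encoding described earlier (0 for female, 1 for male) and are an assumption about how the labels would be displayed.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

svm = SVC(kernel="linear")
svm.fit(X_train, y_train)

# The text report lists precision, recall, and F1 per class, plus accuracy.
report = classification_report(
    y_test, svm.predict(X_test), target_names=["female", "male"]
)
print(report)
```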
The final algorithm we worked on was Random Forest, for which we
imported “RandomForestClassifier” from “sklearn.ensemble”. We
trained our model using “.fit()” and evaluated it using
“classification_report” from sklearn.metrics.
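The Random Forest step follows the same pattern. In this sketch the report is requested as a dictionary so that the accuracy can be read out as a number; the synthetic data, n_estimators, and random_state are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# output_dict=True returns the report as nested dictionaries instead of text.
report = classification_report(y_test, rf.predict(X_test), output_dict=True)
accuracy = report["accuracy"]
print(f"Random Forest test accuracy: {accuracy:.2f}")
```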
With this, the algorithms part was finished, and we moved on to
answering the given questions.

We started with the “text” and “description” columns, performing text
analysis on them and replacing all special characters, links, and
unknown characters with white space. Then we identified the
“stopwords” using the “nltk” library and removed them from these
columns.
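The cleaning step can be sketched with the standard library alone. The small STOPWORDS set below is a hypothetical stand-in for the nltk stopword list used in the project, and the regular expressions are one possible choice, not necessarily the ones in the notebook.

```python
import re

# Stand-in stopword set; the project took its stopwords from nltk instead.
STOPWORDS = {"a", "an", "the", "is", "to", "and", "i", "my", "this", "at"}

def clean_tweet(text):
    """Lowercase a tweet, blank out links and non-letters, drop stopwords."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # links
    text = re.sub(r"[^a-z\s]", " ", text)      # special and unknown characters
    return [word for word in text.split() if word not in STOPWORDS]

tokens = clean_tweet("I like this!! Check https://t.co/abc123 @ my page :)")
print(tokens)  # ['like', 'check', 'page']
```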

After counting the number of times each word is used by females and
males and plotting the counts as a bar graph, we found that the most
common word used by males is “like” and by females is “get”. We also
found that the most common word across males and females combined is
“get”.
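The per-gender word counts can be computed with collections.Counter, as in this sketch. The four tokenized tweets are made-up toy data chosen only to show the mechanics; they are not taken from the dataset.

```python
from collections import Counter

# Toy (gender, tokens) pairs standing in for the cleaned tweets.
tweets = [
    ("male", ["like", "game", "get"]),
    ("male", ["like", "music"]),
    ("female", ["get", "love", "get"]),
    ("female", ["like", "get"]),
]

# One counter per gender, updated tweet by tweet.
counts = {"male": Counter(), "female": Counter()}
for gender, tokens in tweets:
    counts[gender].update(tokens)

# Adding two Counters sums their counts word by word.
overall = counts["male"] + counts["female"]

top_male = counts["male"].most_common(1)[0]
top_female = counts["female"].most_common(1)[0]
top_overall = overall.most_common(1)[0]
print(top_male, top_female, top_overall)
```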
The next question was about finding the gender that made more typos
(misspelled words) in their tweets; for that we used the
“spellchecker” library. With the help of spellchecker, we identified
the misspelled words. Then, for each misspelled word, we looked up
the gender in the corresponding row and counted the males and
females. Finally, we found that the typo count for males is higher
than for females. Hence, we can say that males made the greater
number of typos.
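The counting logic can be sketched as follows. To keep the example self-contained, a tiny hypothetical KNOWN_WORDS set replaces the spellchecker library's dictionary, and the tweets are toy data; only the idea of flagging rows with unknown words and tallying them by gender is illustrated.

```python
from collections import Counter

# Tiny stand-in vocabulary; the project relied on the spellchecker library.
KNOWN_WORDS = {"good", "morning", "love", "this", "game", "great", "day"}

def has_typo(tokens):
    """Return True if any token is missing from the vocabulary."""
    return any(token not in KNOWN_WORDS for token in tokens)

# Toy (gender, tokens) pairs standing in for the cleaned tweets.
tweets = [
    ("male", ["good", "mornng"]),
    ("male", ["great", "game"]),
    ("female", ["love", "this"]),
    ("female", ["graet", "day"]),
    ("male", ["lvoe", "this"]),
]

# Count, per gender, the rows that contain at least one unknown word.
typo_counts = Counter(gender for gender, tokens in tweets if has_typo(tokens))
print(typo_counts)
```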