Machine Learning Project Report
PROJECT TITLE
Solution: After cleaning, analysing and visualizing the data, it was clear that the
most common emotions/words used by males are:
Make Know
Go See
Day Time
Good Want
Amp People
Love Need
Back Think
New Best
One Got
Make One
Need Best
Amp Got
Time Go
Good People
Last Love
New Thing
Day Want
Know Back
All of these words stand out clearly in the word cloud that follows.
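
The word counts behind such a list can be sketched as follows. This is a minimal
illustration, not the code used for the report: the sample rows, the stop-word list
and the helper name top_words are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical sample rows of (gender, tweet text); the real data set is not shown here.
rows = [
    ("male", "Good day to go and see people"),
    ("male", "I know we make good time"),
    ("female", "Love the new thing, need one"),
    ("female", "Best day, love it"),
]

# Small illustrative stop-word list (an assumption, not the list used in the report).
STOP_WORDS = {"i", "we", "to", "and", "the", "it"}


def top_words(rows, gender, n=3):
    """Return the n most frequent non-stop-words in the tweets of one gender."""
    counts = Counter()
    for row_gender, text in rows:
        if row_gender != gender:
            continue
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)


male_top = top_words(rows, "male")
female_top = top_words(rows, "female")
print(male_top)
print(female_top)
```

A word cloud draws the same counts, scaling each word by its frequency.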
2. Which gender makes more typos in their tweets?
Solution: Using the spellchecker package, we counted the number of typos made by
each gender in this particular data set.
We presented the results as a bar graph, which is shown below:
As the graph shows, females make more typos in their tweets, though only by a slight
margin.
To be precise, the males in this data set made about 2702 typos, whereas the females
made about 2862.
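
The counting step can be sketched as below. To keep the example free of extra
installs, the dictionary lookup done by the spellchecker package is replaced here
with a tiny hand-written word set; KNOWN_WORDS, the sample rows and count_typos are
hypothetical names introduced only for illustration.

```python
import re
from collections import Counter

# Tiny stand-in vocabulary (hypothetical); a real spell checker ships a full dictionary.
KNOWN_WORDS = {"i", "love", "this", "new", "day", "good", "time", "people", "is", "a"}

# Hypothetical (gender, tweet) rows, for illustration only.
rows = [
    ("male", "This is a good dya"),
    ("female", "I lvoe this new tiem"),
    ("female", "Good people"),
]


def count_typos(text):
    """Count the tokens of a tweet that are not in the known-word vocabulary."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(1 for token in tokens if token not in KNOWN_WORDS)


# Accumulate the typo count per gender.
typos_by_gender = Counter()
for gender, text in rows:
    typos_by_gender[gender] += count_typos(text)

print(dict(typos_by_gender))
```

Note that any out-of-vocabulary token (names, hashtags, slang) counts as a typo
under this scheme, so the totals are best read as an upper estimate.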
We were asked to choose three classification algorithms, build a machine learning
model with each of them, compare the accuracy of all three and suggest which
algorithm suits the given problem best.
The first approach takes the ‘Description’ column as the independent variable and
the ‘Gender’ column as the dependent variable (as given).
We first converted the descriptions, which are originally strings, into arrays of
numbers before giving them to the ML model.
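
One common way to turn text into numbers is a bag-of-words count vector. The report
does not say which encoding was used, so the sketch below is only an assumption
about what such a step can look like; the function names and sample descriptions
are hypothetical.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(texts):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({token for text in texts for token in tokenize(text)})
    return {token: index for index, token in enumerate(vocab)}


def encode(text, vocab):
    """Return per-word counts for one text; words outside the vocabulary are ignored."""
    vector = [0] * len(vocab)
    for token in tokenize(text):
        if token in vocab:
            vector[vocab[token]] += 1
    return vector


descriptions = ["Love music and love food", "Food blogger"]  # hypothetical examples
vocab = build_vocabulary(descriptions)
encoded = [encode(d, vocab) for d in descriptions]
print(vocab)
print(encoded)
```

Each description becomes one row whose length equals the vocabulary size, which is
the form a classifier expects.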
Then we split the encoded data into training and test sets.
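
A split of this kind can be sketched as below. The 80/20 ratio and the fixed seed
are assumptions (the report does not state them), and split_train_test is a
hypothetical helper written for this example.

```python
import random


def split_train_test(features, labels, test_ratio=0.2, seed=42):
    """Shuffle the row indices and split features and labels into train and test parts."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_test = int(round(len(indices) * test_ratio))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [features[i] for i in train_idx]
    x_test = [features[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test


# Hypothetical encoded rows: the label is "even" or "odd" depending on the row value.
features = [[i] for i in range(10)]
labels = ["even" if i % 2 == 0 else "odd" for i in range(10)]
x_train, x_test, y_train, y_test = split_train_test(features, labels)
print(len(x_train), len(x_test))
```

Shuffling before the split keeps the test set from being just the last rows of the
file, and the model is scored only on rows it has not seen during training.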
Next comes the machine learning modelling itself, using three classification
algorithms: Random Forest (an ensemble method), Logistic Regression and Multinomial
Naïve Bayes.
After training and testing, the accuracy obtained with each of the three algorithms
is:
RandomForestClassifier - 57.2% (approx.)
Logistic Regression - 57.8% (approx.)
Multinomial Naïve Bayes - 60.1% (approx.)
So, after comparing the three models, Multinomial Naïve Bayes gives a better
accuracy than the other two when the description is the independent variable and
gender is the dependent variable.
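
To illustrate how a Multinomial Naïve Bayes classifier arrives at a prediction and
how accuracy is measured, here is a small from-scratch sketch with add-one
smoothing. It is not the implementation used for the report: the class name
TinyMultinomialNB, the toy token lists and their labels are all hypothetical, and
the resulting accuracy says nothing about the figures above.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Minimal Multinomial Naive Bayes over token lists, with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing constant; 1.0 is add-one (Laplace) smoothing

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {token for doc in docs for token in doc}
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc)
        self.total_docs = len(labels)
        return self

    def predict_one(self, doc):
        """Return the class with the highest log prior plus summed log likelihoods."""
        best_label, best_score = None, -math.inf
        vocab_size = len(self.vocab)
        for label in self.classes:
            score = math.log(self.class_counts[label] / self.total_docs)
            denom = sum(self.word_counts[label].values()) + self.alpha * vocab_size
            for token in doc:
                if token in self.vocab:  # tokens unseen in training are skipped
                    score += math.log((self.word_counts[label][token] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, docs, labels):
    """Fraction of documents whose predicted label matches the true label."""
    correct = sum(model.predict_one(doc) == label for doc, label in zip(docs, labels))
    return correct / len(labels)


# Hypothetical toy data: tokenized descriptions with made-up labels.
train_docs = [["coffee", "books"], ["coffee", "travel"], ["coding", "games"], ["games", "travel"]]
train_labels = ["female", "female", "male", "male"]
test_docs = [["coffee", "books"], ["games", "coding"]]
test_labels = ["female", "male"]

model = TinyMultinomialNB().fit(train_docs, train_labels)
test_accuracy = accuracy(model, test_docs, test_labels)
print(test_accuracy)
```

The same accuracy function can score any classifier that exposes a per-document
prediction, which is what makes a like-for-like comparison of the three models
possible.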
The second approach takes the ‘Tweets’ column as the independent variable and the
‘Gender’ column as the dependent variable (as given).
As before, we converted the tweets, which are originally strings, into arrays of
numbers before giving them to the ML model.
Then we split the encoded data into training and test sets.
After training and testing, we compared the accuracy of the three algorithms.
Multinomial Naïve Bayes again gives a better accuracy than the other two models
when the tweets are the independent variable and gender is the dependent variable.
CONCLUSION:
In both cases, i.e. with Descriptions in one case and Tweets in the other as the
independent variable, and Gender as the fixed dependent variable, the Multinomial
Naïve Bayes classification algorithm came out as the best suited in terms of
accuracy.