Machine Learning Project Report

This document summarizes a machine learning project that classified tweets by gender. It addressed two questions: 1) The most common emotions/words used by each gender and 2) Which gender makes more typos. For question 1, word clouds showed males commonly used words like "make" and "know" while females used words like "need" and "best". For question 2, a bar graph showed females made slightly more typos than males, with about 2,862 typos for females and 2,702 for males. The project tested three algorithms on tweet classifications and found Multinomial Naive Bayes had the best accuracy at 60.1%.

Uploaded by

Ashish

MACHINE LEARNING PROJECT REPORT

PROJECT TITLE

CLASSIFICATION OF TWEETS ACCORDING TO GENDER


The data set provided contained tweets labelled by the gender of the author (male or
female), and it came with a couple of questions to be answered.
Questions based on the data set:

1. What are the most common emotions/words used by Males and Females?

Solution: After the cleaning, analysis and visualization, it was clear that the most
common emotions/words used by males are:

• Make    • Know
• Go      • See
• Day     • Time
• Good    • Want
• Amp     • People
• Love    • Need
• Back    • Think
• New     • Best
• One     • Got

We displayed this in the form of a word cloud which is given below.


The most common words used by females in their tweets are the following:

• Make    • One
• Need    • Best
• Amp     • Got
• Time    • Go
• Good    • People
• Last    • Love
• New     • Thing
• Day     • Want
• Know    • Back

These words all stand out clearly in the word cloud that follows.
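
The word counts behind such lists can be sketched as follows. This is a minimal
illustration, not the code used in the project: the stop-word list, the `top_words`
helper and the sample tweets are all assumptions made up for the example.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the report does not say which one was used.
STOP_WORDS = {"the", "a", "to", "and", "is", "i", "my", "of", "in", "it", "you"}

def top_words(tweets, n=3):
    """Return the n most frequent non-stop-words across a list of tweets."""
    counts = Counter()
    for text in tweets:
        # Lowercase the tweet and keep alphabetic tokens only.
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)

# Made-up sample tweets, for illustration only.
sample = {
    "male": ["I know you make good time", "Make it a good day"],
    "female": ["I need the best day", "You need love and the best time"],
}
top_by_gender = {gender: top_words(tweets) for gender, tweets in sample.items()}
```

The resulting (word, count) pairs per gender are the kind of input a word cloud is
usually drawn from.
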
2. Which gender makes more typos in their tweets?

Solution: Using the spellchecker package, we counted the number of typos made by
each gender in this data set.
We presented the results in the form of a bar graph, which is shown below:

As the graph shows, females make more typos in their tweets, but only by a slight
margin.
To be precise, the males in this data set made about 2702 typos, whereas the
females made about 2862.
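
The counting step can be sketched as below. The report does not show how the
spellchecker package was called, so this sketch replaces it with a tiny hand-written
vocabulary; `KNOWN_WORDS`, `count_typos` and the sample tweets are assumptions for
illustration only.

```python
import re

# Tiny stand-in vocabulary (an assumption for this sketch); a real run would
# rely on the dictionary shipped with a spell-checking package.
KNOWN_WORDS = {"i", "love", "this", "new", "phone", "see", "you", "tomorrow", "good", "day"}

def count_typos(tweets, vocabulary=KNOWN_WORDS):
    """Count the tokens that are not found in the vocabulary."""
    typos = 0
    for text in tweets:
        tokens = re.findall(r"[a-z']+", text.lower())
        typos += sum(1 for t in tokens if t not in vocabulary)
    return typos

# Made-up tweets grouped by gender label, for illustration only.
tweets_by_gender = {
    "male": ["I love this new fone", "See you tomorow"],
    "female": ["Good dya", "I lvoe this", "See yuo"],
}
typo_counts = {gender: count_typos(tweets) for gender, tweets in tweets_by_gender.items()}
```

Note that raw counts depend on how much text each group wrote; if the two groups
differ in size, a typo rate per word may be a fairer basis for comparison.
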

Now coming to the detailed summary of the project:

We were asked to choose three classification algorithms, build a machine learning
model with each, compare the accuracy of all three, and suggest which ML algorithm
best suits the given problem.

To reach the final conclusion, we first did data encoding and exploration.

• In the first approach, we took the ‘Description’ column as the independent
variable and the ‘Gender’ column as the dependent variable (as given).
We then converted the descriptions, which are originally strings, into an array
of numbers before passing them to the ML model.
Finally, we split the encoded data into train and test sets.
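
The report does not name the encoding it used, so the following is only a sketch of
one common choice, a bag-of-words count vector, followed by a shuffled train/test
split. All function names, the 25% test fraction and the sample descriptions are
assumptions.

```python
import random
import re
from collections import Counter

def tokenize(text):
    """Lowercase a string and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(texts):
    """Map each distinct token to a column index."""
    vocab = {}
    for text in texts:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(text, vocab):
    """Turn one string into a vector of word counts."""
    vector = [0] * len(vocab)
    for token, count in Counter(tokenize(text)).items():
        if token in vocab:
            vector[vocab[token]] = count
    return vector

def split_train_test(rows, labels, test_fraction=0.25, seed=0):
    """Shuffle the row indices and hold out a fraction of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(rows) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [rows[i] for i in train_idx]
    X_test = [rows[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Made-up descriptions and labels, for illustration only.
descriptions = ["coffee lover and dad", "mom of two", "football fan", "love love books"]
genders = ["male", "female", "male", "female"]

vocab = build_vocabulary(descriptions)
X = [encode(d, vocab) for d in descriptions]
X_train, X_test, y_train, y_test = split_train_test(X, genders)
```

Each description becomes a fixed-length list of integers, one column per vocabulary
word, which is the numeric form a classifier can consume.
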

Next comes the machine learning modelling, which here means training the
classification algorithms.

The classification algorithms which we used are:

• RandomForestClassifier
• Logistic Regression
• Multinomial Naïve Bayes

After training and testing, the accuracy of the models built with these three
algorithms is:
• RandomForestClassifier - 57.2% (approx.)
• Logistic Regression - 57.8% (approx.)
• Multinomial Naïve Bayes - 60.1% (approx.)

So, after comparing the three models, Multinomial Naïve Bayes gives a better
accuracy than the other models when the description is the independent variable
and gender is the dependent variable.
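
To show what the best-performing algorithm does, here is a minimal from-scratch
sketch of Multinomial Naïve Bayes with add-one smoothing, plus the accuracy
measure. It is not the project's code (which presumably used a library
implementation); the class `TinyMultinomialNB` and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

class TinyMultinomialNB:
    """Multinomial Naive Bayes over token lists with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict_one(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for t in tokens:
                if t in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    def predict(self, docs):
        return [self.predict_one(d) for d in docs]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up tokenized descriptions, for illustration only.
train_docs = [["football", "fan"], ["dad", "football"], ["mom", "books"], ["books", "lover"]]
train_labels = ["male", "male", "female", "female"]
test_docs = [["football", "dad"], ["books", "mom"]]
test_labels = ["male", "female"]

model = TinyMultinomialNB().fit(train_docs, train_labels)
test_accuracy = accuracy(test_labels, model.predict(test_docs))
```

The accuracy figures in this report are this same ratio, correct predictions over
total test examples, expressed as a percentage.
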

• In the second approach, we took the ‘Tweets’ column as the independent variable
and the ‘Gender’ column as the dependent variable (as given).
We then converted the tweets, which are originally strings, into an array of
numbers before passing them to the ML model.
Finally, we split the encoded data into train and test sets.

After training and testing, the accuracy of the models built with these three
algorithms is:

• RandomForestClassifier - 50.2% (approx.)
• Logistic Regression - 50.6% (approx.)
• Multinomial Naïve Bayes - 52.0% (approx.)

So, after comparing the three models, Multinomial Naïve Bayes gives a better
accuracy than the other models when the tweets are the independent variable and
gender is the dependent variable.

CONCLUSION:
In both cases, i.e., with Descriptions in one and Tweets in the other as the
independent variable and Gender as the fixed dependent variable, the Multinomial
Naïve Bayes classification algorithm consistently came out as the best suited in
terms of accuracy.
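
As a quick check of this comparison, the reported figures can be ranked
programmatically. The numbers below are the approximate accuracies from this
report; the variable names are only illustrative.

```python
# Accuracies (in percent, approximate) as reported in this document.
reported = {
    "description": {
        "RandomForestClassifier": 57.2,
        "Logistic Regression": 57.8,
        "Multinomial Naive Bayes": 60.1,
    },
    "tweets": {
        "RandomForestClassifier": 50.2,
        "Logistic Regression": 50.6,
        "Multinomial Naive Bayes": 52.0,
    },
}

summary = {}
for feature, scores in reported.items():
    # Rank the models from highest to lowest accuracy.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best_name, best_acc), (_, runner_up_acc) = ranked[0], ranked[1]
    # Store the winner and its margin over the runner-up, in percentage points.
    summary[feature] = (best_name, round(best_acc - runner_up_acc, 1))
```

Multinomial Naïve Bayes leads by about 2.3 percentage points on descriptions and
about 1.4 on tweets, and every model scores higher on descriptions than on tweets.
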
