Python | Gender Identification by name using NLTK

Natural Language Toolkit (NLTK) is a platform for building Python programs that analyze text. Male and female names have some distinctive characteristics: names ending in a, e and i are likely to be female, while names ending in k, o, r, s and t are likely to be male. Let's build a classifier to model these differences more precisely. To run the Python programs below, you must first install NLTK:

pip install nltk

The first step in creating a classifier is deciding which features of the input are relevant and how to encode them. For this example, we'll start by looking only at the final letter of a given name. The following feature extractor function builds a dictionary containing that information for a given name. Example:

Input : gender_features('saurabh')
Output : {'last_letter': 'h'}

Python3

def gender_features(word):
    # Use the final letter of the name as the only feature.
    return {'last_letter': word[-1]}

print(gender_features('mahavir'))
# output : {'last_letter': 'r'}
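To see why the last letter is a promising feature, it can help to tally it per label before training anything. The sketch below does that on a small hand-picked list; the `sample` list is a hypothetical illustration, not the NLTK names corpus, so the counts say nothing about real frequencies.

```python
from collections import Counter, defaultdict

def gender_features(word):
    return {'last_letter': word[-1]}

# Hypothetical hand-picked sample, for illustration only (not the NLTK corpus).
sample = [('Anna', 'female'), ('Maria', 'female'), ('Sophie', 'female'),
          ('Mark', 'male'), ('Marco', 'male'), ('Peter', 'male'),
          ('Thomas', 'male'), ('Julie', 'female')]

# Count how often each last letter occurs with each label.
counts = defaultdict(Counter)
for name, label in sample:
    letter = gender_features(name)['last_letter']
    counts[letter][label] += 1

for letter in sorted(counts):
    print(letter, dict(counts[letter]))
```

The same loop can be run over the real corpus once it has been downloaded.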

The names corpus used below must be downloaded before it can be loaded. Calling nltk.download() opens a GUI; choose “all” to select every package, then click ‘download’. This fetches all of the tokenizers, chunkers, other algorithms and corpora, which is why the installation takes quite a while.

import nltk
nltk.download()
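If only the names corpus is needed, a lighter option is to request that one package by its identifier, `names`, instead of downloading everything. The helper below is a minimal sketch: `ensure_names_corpus` is a hypothetical name introduced here, and it assumes NLTK is installed and a network connection is available on first use.

```python
def ensure_names_corpus():
    """Download the 'names' corpus only if it is not already installed."""
    import nltk  # imported lazily so defining the helper does not need NLTK

    try:
        # Raises LookupError when the corpus is missing from every data path.
        nltk.data.find('corpora/names')
    except LookupError:
        nltk.download('names')
```

Calling `ensure_names_corpus()` once, before the first use of `nltk.corpus.names`, should be enough for the program in this article.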

Classification is the task of choosing the correct class label for a given input. In basic classification tasks, each input is considered in isolation from all other inputs, and the set of labels is defined in advance. Some examples of classification tasks are:

Deciding whether an email is spam or not.
Deciding what the topic of a news article is, from a fixed list of topic areas such as "sports", "technology" and "politics".
Deciding whether a given occurrence of the word bank is used to refer to a river bank, a financial institution, the act of tilting to the side, or the act of depositing something in a financial institution.

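The basic classification task has a number of interesting variants. For example, in multi-label classification, each instance may be assigned multiple labels; in open-class classification, the set of labels is not defined in advance; and in sequence classification, a list of inputs is jointly classified. A classifier is called supervised if it is built from training corpora containing the correct label for each input. The framework used by supervised classification has two phases. During training, a feature extractor converts each input into a feature set, and pairs of feature sets and labels are fed to the learning algorithm to produce a model. During prediction, the same feature extractor converts unseen inputs into feature sets, and the model assigns a label to each of them.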
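The supervised setup can be sketched in a few lines of plain Python. This is not NLTK's naive Bayes classifier; it is a deliberately simpler majority-vote lookup, and `extract_features`, `train`, `predict` and the tiny `training_corpus` are hypothetical names and data chosen for illustration.

```python
from collections import Counter, defaultdict

def extract_features(name):
    return {'last_letter': name[-1]}

def train(labeled_inputs):
    """Build a model: for each feature value, count the labels seen with it."""
    model = defaultdict(Counter)
    for text, label in labeled_inputs:
        model[extract_features(text)['last_letter']][label] += 1
    return model

def predict(model, text, default='unknown'):
    """Return the label seen most often with this input's feature value."""
    seen = model.get(extract_features(text)['last_letter'])
    return seen.most_common(1)[0][0] if seen else default

# Hypothetical labeled training corpus.
training_corpus = [('Anna', 'female'), ('Maria', 'female'),
                   ('Mark', 'male'), ('Erik', 'male')]

model = train(training_corpus)
print(predict(model, 'Olivia'))  # last letter 'a' was only seen as female
print(predict(model, 'Frank'))   # last letter 'k' was only seen as male
print(predict(model, 'Boris'))   # last letter 's' was never seen in training
```

The same feature extractor is used in both phases, which is the key property of the framework.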

The training set is used to train the model, and the dev-test set is used to perform error analysis. The test set serves in our final evaluation of the system. It is important to employ a separate dev-test set for error analysis rather than just using the test set: once features have been tuned against a set of examples, accuracy on that set no longer reflects performance on unseen data. The corpus data is therefore divided into a development set and a test set, and the development set is further divided into a training set and a dev-test set.
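A three-way split of this kind can be sketched as follows. The function name `split_corpus`, its parameters and the subset sizes are assumptions made for illustration; the right sizes depend on how much labeled data is available.

```python
import random

def split_corpus(items, test_size, devtest_size, seed=0):
    """Shuffle a copy of items and cut it into train, dev-test and test sets."""
    items = list(items)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split repeatable
    test = items[:test_size]
    devtest = items[test_size:test_size + devtest_size]
    train = items[test_size + devtest_size:]
    return train, devtest, test

data = list(range(100))  # stand-in for a list of labeled examples
train, devtest, test = split_corpus(data, test_size=10, devtest_size=10)
print(len(train), len(devtest), len(test))
```

Features are then tuned by inspecting mistakes on the dev-test subset only, and the test subset is touched once, at the very end.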

The example uses two text files from the NLTK names corpus, male.txt and female.txt. Both files are downloaded automatically once nltk.download() has completed successfully. On Windows they are stored at the following location:

Path of nltk: C:\Users\currentUserName\AppData\Roaming
Path of the files inside nltk: \nltk_data\corpora\names

Python3

# importing libraries
import random
from nltk.corpus import names
import nltk

def gender_features(word):
    return {'last_letter': word[-1]}

# preparing a list of examples and corresponding class labels.
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])

random.shuffle(labeled_names)

# we use the feature extractor to process the names data.
featuresets = [(gender_features(n), gender)
               for (n, gender) in labeled_names]

# Divide the resulting list of feature
# sets into a training set and a test set.
train_set, test_set = featuresets[500:], featuresets[:500]

# The training set is used to 
# train a new "naive Bayes" classifier.
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Classify a name using the trained model.
print(classifier.classify(gender_features('mahavir')))
# output should be 'male'

# Accuracy on the held-out test set. With the last letter as the only
# feature this is typically around three quarters, not close to 100 %.
print(nltk.classify.accuracy(classifier, test_set))
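Accuracy here is simply the fraction of labeled examples for which the predicted label matches the true one. The sketch below computes it by hand; `accuracy`, `toy_classify` and the `held_out` list are hypothetical stand-ins, not NLTK objects, so the resulting number is only an arithmetic illustration.

```python
def accuracy(classify, labeled_examples):
    """Fraction of (features, label) pairs that classify() gets right."""
    correct = sum(1 for features, label in labeled_examples
                  if classify(features) == label)
    return correct / len(labeled_examples)

# Hypothetical stand-in for a trained classifier: guesses from the last letter.
def toy_classify(features):
    return 'female' if features['last_letter'] in 'aei' else 'male'

# Hypothetical held-out examples, in the same shape as featuresets above.
held_out = [({'last_letter': 'a'}, 'female'),
            ({'last_letter': 'k'}, 'male'),
            ({'last_letter': 'e'}, 'male'),
            ({'last_letter': 'n'}, 'female')]

print(accuracy(toy_classify, held_out))  # 2 of 4 correct -> 0.5
```

Measuring this on examples the classifier was trained on tends to overstate how well it generalizes, which is why a held-out set is used.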

Getting informative features from Classifier:

Python3

classifier.show_most_informative_features(10)
# 10 indicates 10 rows
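Each row of that report compares how likely a feature value is under one label versus the other. A rough way to see what such a ratio means is to compute it from raw counts, as sketched below. This is a simplification: NLTK's classifier smooths its probability estimates, so its figures will differ, and `likelihood_ratio` and the `toy` data are hypothetical.

```python
from collections import Counter

def likelihood_ratio(featuresets, letter, label_a, label_b):
    """P(last_letter=letter | label_a) / P(last_letter=letter | label_b)."""
    totals = Counter(label for _, label in featuresets)
    hits = Counter(label for feats, label in featuresets
                   if feats['last_letter'] == letter)
    p_a = hits[label_a] / totals[label_a]
    p_b = hits[label_b] / totals[label_b]
    return p_a / p_b  # unsmoothed: fails if letter never occurs with label_b

# Hypothetical counts: 'a' ends 6 of 10 female names but only 1 of 10 male names.
toy = ([({'last_letter': 'a'}, 'female')] * 6 +
       [({'last_letter': 'n'}, 'female')] * 4 +
       [({'last_letter': 'a'}, 'male')] * 1 +
       [({'last_letter': 'n'}, 'male')] * 9)

print(round(likelihood_ratio(toy, 'a', 'female', 'male'), 2))
```

In this toy data a final 'a' is about six times more likely for a female name than for a male one, which is the kind of statement each row of the report makes.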

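Output: a table of the ten last-letter values whose likelihood ratio between the two labels is highest. The exact rows and ratios vary from run to run because the names are shuffled before the split.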
