Natural Language Toolkit (NLTK) is a platform used for building programs for text analysis. We can observe that male and female names have some distinctive characteristics. Names ending in a, e and i are likely to be female, while names ending in k, o, r, s and t are likely to be male. Let's build a classifier to model these differences more precisely.
To run the Python program below, you must first install NLTK. Install the package with pip:
pip install nltk
Then download the NLTK data from a Python session:
import nltk
nltk.download()
A GUI will pop up. Choose "all" to download every package, then click "Download". This gives you all of the tokenizers, chunkers, other algorithms, and corpora, which is why the installation takes quite a while.
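If downloading every package is more than you need, a lighter alternative is to fetch only the names corpus, which is the only dataset this example reads. The sketch below assumes NLTK is already installed with pip. The helper names `nltk_available` and `download_names_corpus` are made up for illustration; `nltk.download("names")` is the downloader call with a package identifier instead of the GUI.

```python
import importlib.util


def nltk_available():
    """Return True if the nltk package can be imported."""
    return importlib.util.find_spec("nltk") is not None


def download_names_corpus():
    """Download only the 'names' corpus; return False if NLTK is missing."""
    if not nltk_available():
        return False
    import nltk
    # Passing a package identifier skips the GUI and fetches just that corpus.
    return bool(nltk.download("names", quiet=True))
```

Calling `download_names_corpus()` once before running the classifier should be enough; it needs a network connection the first time.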
The first step in creating a classifier is deciding what features of the input are relevant, and how to encode those features. For this example, we'll start by just looking at the final letter of a given name. The following feature extractor function builds a dictionary containing relevant information about a given name. Example:
Input : gender_features('saurabh')
Output : {'last_letter': 'h'}
def gender_features(word):
    return {'last_letter': word[-1]}

gender_features('mahavir')
# output : {'last_letter': 'r'}
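To see why the last letter is a useful feature, it can help to tally how often each final letter occurs with each label. The following is a minimal sketch using only the standard library; the short `toy_names` list is invented for illustration and is not the NLTK names corpus.

```python
from collections import Counter


def gender_features(word):
    # Same feature extractor as above: keep only the final letter.
    return {'last_letter': word[-1]}


# Hypothetical toy data, invented for this illustration only.
toy_names = [
    ('Anna', 'female'), ('Maria', 'female'), ('Sophie', 'female'),
    ('Mark', 'male'), ('Pedro', 'male'), ('Robert', 'male'),
]

# Count (last_letter, label) pairs; counts like these are what a
# frequency-based classifier learns from.
letter_label_counts = Counter(
    (gender_features(name)['last_letter'], label)
    for name, label in toy_names
)

for (letter, label), count in sorted(letter_label_counts.items()):
    print(letter, label, count)
```

On real data the same tally would be run over the full labeled name list rather than a handful of examples.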
Classification is the task of choosing the correct class label for a given input. In basic classification tasks, each input is considered in isolation from all other inputs, and the set of labels is defined in advance. Some examples of classification tasks are:
- Deciding whether an email is spam or not.
- Deciding what the topic of a news article is, from a fixed list of topic areas such as "sports," "technology," and "politics."
- Deciding whether a given occurrence of the word bank is used to refer to a river bank, a financial institution, the act of tilting to the side, or the act of depositing something in a financial institution.
The training set is used to train the model, and the dev-test set is used to perform error analysis. The test set serves in our final evaluation of the system. It is important to use a separate dev-test set for error analysis rather than the test set itself: once we have inspected the test-set errors and adjusted the model in response, the test set no longer gives an unbiased estimate of how the model performs on new data.
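A three-way split and a simple error-analysis pass can be sketched as follows. This is an illustration under stated assumptions, not the split used in the program further down: the slice sizes, the toy data, and the stand-in `classify_by_last_letter` rule are all hypothetical, chosen so the example runs without NLTK.

```python
import random


def split_three_ways(items, n_test, n_devtest):
    """Split items into (train, devtest, test) by position."""
    test = items[:n_test]
    devtest = items[n_test:n_test + n_devtest]
    train = items[n_test + n_devtest:]
    return train, devtest, test


def classify_by_last_letter(name):
    # Stand-in for a trained classifier (hypothetical rule based on the
    # observation that names ending in a, e, i are often female).
    return 'female' if name[-1].lower() in 'aei' else 'male'


# Hypothetical toy data, invented for this illustration only.
labeled = [
    ('Anna', 'female'), ('Maria', 'female'), ('Sophie', 'female'),
    ('Carmen', 'female'), ('Mark', 'male'), ('Pedro', 'male'),
    ('Robert', 'male'), ('Luke', 'male'), ('Heidi', 'female'),
    ('Boris', 'male'),
]
random.Random(0).shuffle(labeled)  # fixed seed so the split is repeatable

train, devtest, test = split_three_ways(labeled, n_test=2, n_devtest=3)

# Error analysis: inspect mistakes on the dev-test set only, so the
# test set stays untouched for the final evaluation.
errors = [(label, classify_by_last_letter(name), name)
          for name, label in devtest
          if classify_by_last_letter(name) != label]
for label, guess, name in sorted(errors):
    print(f'correct={label:<8} guess={guess:<8} name={name}')
```

With a real classifier, the printed errors are what you would study to decide which extra features might help, before checking the final score once on the test set.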
The division of the corpus data into different subsets is shown in the following figure:
The text files used can be obtained in either of two ways:
- Directly from their text URLs: male.txt, female.txt
- Automatically: the male.txt and female.txt files are downloaded when nltk.download() runs successfully. On a local Windows system, the NLTK path is C:\Users\currentUserName\AppData\Roaming, and the path to the files inside it is \nltk_data\corpora\names
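If you fetch male.txt and female.txt yourself instead of going through nltk.corpus, the labeled list can be built with plain file reads. The sketch below assumes each file holds one name per line and treats lines starting with `#` as comments; the helper names `read_names` and `load_labeled_names` and the example paths are hypothetical, so adjust them to wherever the files were saved.

```python
from pathlib import Path


def read_names(path):
    """Return the names in a file, skipping blank and '#' comment lines."""
    text = Path(path).read_text(encoding='utf-8')
    names = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith('#'):
            names.append(line)
    return names


def load_labeled_names(male_path, female_path):
    """Build (name, label) pairs from the two files."""
    return ([(name, 'male') for name in read_names(male_path)] +
            [(name, 'female') for name in read_names(female_path)])


# Example call (paths are placeholders, not guaranteed locations):
# labeled_names = load_labeled_names('male.txt', 'female.txt')
```

The resulting list has the same shape as the one built from `names.words(...)` in the program below, so the rest of the pipeline can stay unchanged.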
# importing libraries
import random
from nltk.corpus import names
import nltk
def gender_features(word):
    return {'last_letter': word[-1]}

# preparing a list of examples and corresponding class labels.
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
random.shuffle(labeled_names)

# we use the feature extractor to process the names data.
featuresets = [(gender_features(n), gender)
               for (n, gender) in labeled_names]
# Divide the resulting list of feature
# sets into a training set and a test set.
train_set, test_set = featuresets[500:], featuresets[:500]
# The training set is used to
# train a new "naive Bayes" classifier.
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(classifier.classify(gender_features('mahavir')))
# expected output: 'male'

# Accuracy of the classifier on the held-out test set.
# With only the last-letter feature this is well below 99 %
# (roughly 0.75 to 0.8); the exact value varies with the shuffle.
print(nltk.classify.accuracy(classifier, test_set))

# Getting informative features from the classifier;
# 10 is the number of rows to show.
classifier.show_most_informative_features(10)
Output:

