
B. TECH VI Semester
COMPUTER SCIENCE AND ENGINEERING
VCE-R15 2017 – 2018
DATA MINING AND DATA WAREHOUSING
(A3522)
UNIT – IV
CLASSIFICATION

A. BHANU PRASAD
Associate Professor of CSE
9885990509
[email protected]

VARDHAMAN COLLEGE OF ENGINEERING


(AUTONOMOUS)
Shamshabad – 501218, Hyderabad, AP
Text Books / References / Websites
TEXT BOOKS:
1. Jiawei Han, Micheline Kamber, Jian Pei (2012), Data Mining: Concepts
and Techniques, 3rd edition, Elsevier, United States of America.

REFERENCE BOOKS:
1. Margaret H Dunham (2006), Data Mining Introductory and
Advanced Topics, 2nd edition, Pearson Education, New Delhi, India.
2. Amitesh Sinha (2007), Data Warehousing, Thomson Learning, India.
3. Xindong Wu, Vipin Kumar (2009), The Top Ten Algorithms in Data
Mining, CRC Press, UK.
4. Max Bramer (2007), Principles of Data Mining, Springer, USA.

2
UNIT – IV CONTENTS
4. CLASSIFICATION
4.1 Basic Concepts
4.2 Decision Tree Induction
4.3 Bayesian Classification Methods
4.4 Rule-Based Classification
4.5 Model Evaluation and Selection
4.6 Techniques to Improve Classification Accuracy
4.7 Classification by Neural Networks
4.8 Support Vector Machines
4.9 Classification Using Frequent Patterns: Pattern-Based
Classification
4.10 Lazy Learners

3
4. CLASSIFICATION
4.1. Basic Concepts
 Two forms of data analysis can be used to extract models describing
important data classes or to predict future data trends:
1) Classification
2) Prediction
1) Classification is a form of data analysis that extracts models
describing important data classes. Such models, called classifiers,
predict categorical (discrete, unordered) class labels.
 It classifies data (constructs a model) based on the training set
and the values (class labels) in a classifying attribute and uses it in
classifying new data.
2) Prediction models continuous-valued functions, i.e., predicts
unknown or missing values.

 Examples of classification and the categorical class labels involved:
 credit approval: “safe” or “risky”
 target marketing: “yes” or “no”
 medical diagnosis: “treatment A,” “treatment B,” or “treatment C”
 These categories can be represented by discrete values, where the
ordering among the values has no meaning.
4
Classification: A Two-Step Process
Data classification is a two-step process, consisting of
1) A learning step (where a classification model is constructed) and
2) A classification step (where the model is used to predict class labels
for given data).
1) Learning step (Model construction):
• This is the training phase, where a classification algorithm builds the
classifier by analyzing or “learning from” a training set made up of
database tuples and their associated class labels.
• A tuple, X, is represented by an n-dimensional attribute vector, X=(x1,
x2, … , xn), depicting n measurements made on the tuple from n
database attributes, respectively, A1, A2, …, An.
• Each tuple, X, is assumed to belong to a predefined class as
determined by another database attribute called the class label
attribute. The class label attribute is discrete-valued and unordered.
• It is categorical (or nominal) in that each value serves as a category or
class. The individual tuples making up the training set are referred to
as training tuples and are randomly sampled from the database under
analysis.
• Data tuples can be referred to as samples, examples, instances, data
points, or objects.
5
Classification Process - 1: Model Construction
Ex(1): A bank loan officer wants to analyze the data in order to know which
customer (loan applicant) are risky or which are safe. Model construction for
Loan Approval.
(a) Learning: Training data are analyzed by a classification algorithm. Here, the
class label attribute is loan decision, and the learned model or classifier is
represented in the form of classification rules.
Training Data → Classification Algorithm → Classifier (Model)

Training Data:
name       age      income   loan
Sandy      youth    low      risky
Lee        youth    low      risky
Caroline   middle   high     safe
Rick       middle   low      risky
Susan      senior   low      safe
Claire     senior   medium   safe
Joe        middle   high     safe
…          …        …        …

Classification Rules (the learned model):
IF age = youth THEN loan = ‘risky’
IF income = high THEN loan = ‘safe’
IF age = middle AND income = low THEN loan = ‘risky’
…

 The rules can be used to categorize future data tuples, as well as provide
deeper insight into the data contents. They also provide a compressed data
representation.
6
Contd..
 Ex(2): Model construction of faculty for tenured (permanent post).

Training Data → Classification Algorithm → Classifier (Model)

Training Data:
NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

Classifier (Model):
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
7
Classification Process - 2:
Use the Model in Prediction
2) Classification step (Model usage): for classifying future or unknown
objects
• Estimate accuracy of the model.
• The known label of test sample is compared with the classified
result from the model.
• Accuracy rate is the percentage of test set samples that are
correctly classified by the model.
• Test set is independent of training set, otherwise over-fitting will
occur.
 Ex(1): Classification for Loan Approval.
Classification Rules (the learned model)

Test Data:
name     age      income   loan
Bello    senior   low      safe
Sylvia   middle   low      risky
Anne     middle   high     safe
…        …        …        …

New Data: (John Henry, middle, low) → Loan decision? (Prediction)
8
Contd..
(b) Classification: Test data are used to estimate the accuracy of the
classification rules. If the accuracy is considered acceptable, the rules can
be applied to the classification of new data tuples.

 Ex(2): Classification for faculty to predict tenured (permanent post).

Classifier (the learned model)

Testing Data:
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen Data: (Jeff, Professor, 4) → Tenured? (Prediction)
9
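To make the accuracy estimate of the classification step concrete, here is a minimal Python sketch (not from the textbook) that applies the loan-approval rules learned earlier to the labeled test tuples above, reports the fraction classified correctly, and then classifies the new tuple (John Henry, middle, low). The helper names and the default class used when no rule fires are assumptions for illustration.

# Minimal sketch: estimating accuracy on an independent test set, then using
# the model on new data. Rules and tuples mirror the loan-approval example.

def classify(age, income):
    """Apply the learned IF-THEN rules to one tuple and return a loan decision."""
    if age == "youth":
        return "risky"
    if income == "high":
        return "safe"
    if age == "middle" and income == "low":
        return "risky"
    return "safe"  # default class when no rule fires (an illustrative assumption)

# Test set: (name, age, income, known loan decision); independent of the training set.
test_set = [
    ("Bello",  "senior", "low",  "safe"),
    ("Sylvia", "middle", "low",  "risky"),
    ("Anne",   "middle", "high", "safe"),
]

correct = sum(1 for _, age, income, label in test_set
              if classify(age, income) == label)
print(f"Accuracy on test set: {correct / len(test_set):.2%}")

# If the accuracy is acceptable, classify new, unlabeled data:
print("John Henry ->", classify("middle", "low"))   # predicted loan decision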
Supervised vs. Unsupervised Learning
 Supervised learning (classification)
• Supervision: The training data (observations, measurements, etc.)
are accompanied by labels indicating the class of the observations.
• New data is classified based on the training set.
 Unsupervised learning (clustering)
• The class labels of the training data are unknown.
• The number or set of classes to be learned may not be known in
advance.
• Given a set of measurements, observations, etc. with the aim of
establishing the existence of classes or clusters in the data.

10
Issues regarding Classification & Prediction
1) Data Preparation
 Data cleaning: Preprocess data in order to reduce noise and
handle missing values
 Relevance analysis (feature selection): Remove the irrelevant or
redundant attributes
 Data transformation: Generalize and/or normalize data
2) Evaluating Classification Methods
 Predictive accuracy
 Speed and scalability
• time to construct the model
• time to use the model
• efficiency in disk-resident databases
 Robustness: handling noise and missing values
 Interpretability: understanding and insight provided by the model
 Goodness of rules
• decision tree size
• compactness of classification rules

11
Classification Techniques
1) Decision Tree based Methods
2) Rule-based Methods
3) Memory based reasoning
4) Neural Networks
5) Naïve Bayes and Bayesian Belief Networks
6) Support Vector Machines

12
4.2. Decision Tree Induction
 Decision tree induction is the learning of decision trees from class-
labeled training tuples.
 Decision tree: a flowchart-like tree structure where
• each internal (non-leaf) node, denoted by a rectangle, represents a test on
an attribute,
• each branch represents an outcome of the test,
• each leaf (terminal) node, denoted by an oval, represents a class label or
class distribution, and
• the root node is the topmost node in the tree.
 Decision tree generation consists of two phases:
1) Tree construction
• At the start, all the training examples are at the root.
• Examples are partitioned recursively based on selected attributes.
2) Tree pruning
• Identify and remove branches that reflect noise or outliers.
13
Contd..
 Use of a decision tree: classifying an unknown sample.
• Given a tuple, X, whose class label is unknown, the attribute values of the
tuple are tested against the decision tree; a path is traced from the root to a
leaf node, which holds the class prediction for that tuple.
• Decision trees can easily be converted to classification rules.
• Decision tree induction algorithms have been used for
classification in many application areas such as medicine,
manufacturing and production, financial analysis, astronomy, and
molecular biology.
• Decision trees are the basis of several commercial rule induction
systems.

14
Training Dataset
Ex: A marketing manager at a company needs to predict whether a customer
with a given profile will buy a new computer.
RID age income student Credit_rating Class: buys_computer
1 youth high no fair no
2 youth high no excellent no
3 middle high no fair yes
4 senior medium no fair yes
5 senior low yes fair yes
6 senior low yes excellent no
7 middle low yes excellent yes
8 youth medium no fair no
9 youth low yes fair yes
10 senior medium yes fair yes
11 youth medium yes excellent yes
12 middle medium no excellent yes
13 middle high yes fair yes
14 senior medium no excellent no
15
Output: A Decision Tree for “buys_computer”
 A decision tree for the concept buys computer, indicating whether a
customer is likely to purchase a computer.
 Each internal (nonleaf) node represents a test on an attribute.
 Each leaf node represents a class (either buys_computer = yes or
buys_computer = no).

age?
├─ youth  → student?
│            ├─ no  → buys_computer = no
│            └─ yes → buys_computer = yes
├─ middle → buys_computer = yes
└─ senior → credit_rating?
             ├─ excellent → buys_computer = no
             └─ fair      → buys_computer = yes
16
Algorithm for Decision Tree Induction
 Basic algorithm (a greedy algorithm)
• Tree is constructed in a top-down recursive divide-and-conquer
manner
• At start, all the training examples are at the root
• Attributes are categorical (if continuous-valued, they are
discretized in advance)
• Examples are partitioned recursively based on selected attributes
• Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
 Conditions for stopping partitioning
• All samples for a given node belong to the same class
• There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
• There are no samples left

17
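As a rough illustration of the basic greedy algorithm described above (a sketch, not the textbook's pseudocode), the Python code below builds a tree top-down and recursively: it chooses the splitting attribute by information gain, partitions the tuples on the chosen attribute's values, and stops when a partition is pure or no attributes remain (falling back to majority voting). Function names and the small data sample are illustrative.

# Minimal sketch of top-down decision tree induction with information gain.
# Categorical attributes only; continuous attributes are assumed discretized.
from collections import Counter
from math import log2

def entropy(rows, target):
    """Expected information (in bits) needed to classify a tuple in rows."""
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def info_gain(rows, attr, target):
    """Reduction in entropy obtained by splitting rows on attr."""
    total = len(rows)
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(rows, target) - remainder

def build_tree(rows, attrs, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:              # stopping condition: pure partition
        return labels[0]
    if not attrs:                          # no attributes left: majority voting
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, a, target))
    remaining = [a for a in attrs if a != best]
    return {best: {value: build_tree([r for r in rows if r[best] == value],
                                     remaining, target)
                   for value in {r[best] for r in rows}}}

# Tiny illustration with a few tuples from the buys_computer training data.
data = [
    {"age": "youth",  "student": "no",  "buys_computer": "no"},
    {"age": "youth",  "student": "yes", "buys_computer": "yes"},
    {"age": "middle", "student": "no",  "buys_computer": "yes"},
    {"age": "senior", "student": "yes", "buys_computer": "yes"},
]
print(build_tree(data, ["age", "student"], "buys_computer"))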
Extracting Classification Rules from Trees
 Represent the knowledge in the form of IF-THEN rules
 One rule is created for each path from the root to a leaf
 Each attribute-value pair along a path forms a conjunction
 The leaf node holds the class prediction
 Rules are easier for humans to understand
 Example
• IF age = “youth” AND student = “no” THEN buys_computer = “no”
• IF age = “youth” AND student = “yes” THEN buys_computer = “yes”
• IF age = “middle” THEN buys_computer = “yes”
• IF age = “senior” AND credit_rating = “excellent” THEN buys_computer = “no”
• IF age = “senior” AND credit_rating = “fair” THEN buys_computer = “yes”

19
Attribute Selection Measures
 An attribute selection measure is a heuristic for selecting the splitting
criterion that “best” separates a given data partition, D, of class-
labeled training tuples into individual classes. In other words, we are
looking for the probability that tuple X belongs to class C, given that
we know the attribute description of X.
 Information Gain
 Gain Ratio
 Gini Index
 The Gini index is used in CART. Using the notation previously
described, the Gini index measures the impurity of D, a data partition
or set of training tuples, as

   Gini(D) = 1 − Σ_{i=1..m} (p_i)^2

 where p_i is the probability that a tuple in D belongs to class Ci and is
estimated by |Ci,D| / |D|. The sum is computed over m classes.
 The Gini index considers a binary split for each attribute. Let’s first
consider the case where A is a discrete-valued attribute having v
distinct values, {a1, a2, …, av} occurring in D.
20
Attribute Selection Measures
 To determine the best binary split on A, we examine all the possible
subsets that can be formed using known values of A. Each subset, SA,
can be considered as a binary test for attribute A of the form “A ∈ SA?”
 Given a tuple, this test is satisfied if the value of A for the tuple is
among the values listed in SA. If A has v possible values, then there
are 2^v possible subsets. For example, if income has three possible
values, namely {low, medium, high}, then the possible subsets are
{low, medium, high}, {low, medium}, {low, high}, {medium, high},
{low}, {medium}, {high}, and {}. We exclude the power set, {low,
medium, high}, and the empty set from consideration since,
conceptually, they do not represent a split. Therefore, there are
2^v − 2 possible ways to form two partitions of the data, D, based on a
binary split on A.

21
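The sketch below, assuming the buys_computer training data shown earlier, computes Gini(D) for the full partition (9 yes and 5 no tuples, giving 0.459) and the weighted Gini index of one candidate binary split on income, “income ∈ {low, medium}?”, which is one of the 2^3 − 2 = 6 non-trivial splits on that attribute. Helper names are illustrative.

# Minimal sketch: Gini index of a partition D and of a candidate binary split (CART).

def gini(labels):
    """Gini(D) = 1 - sum over classes of p_i^2."""
    total = len(labels)
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))

# Class labels of the 14 buys_computer training tuples (RID 1-14, in order).
buys = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]
print(round(gini(buys), 3))                       # 1 - (9/14)^2 - (5/14)^2 = 0.459

def gini_split(values, labels, subset):
    """Weighted Gini index of the binary test 'A in subset?' on attribute values."""
    in_part  = [lab for val, lab in zip(values, labels) if val in subset]
    out_part = [lab for val, lab in zip(values, labels) if val not in subset]
    n = len(labels)
    return len(in_part) / n * gini(in_part) + len(out_part) / n * gini(out_part)

# income values of the same 14 tuples, and the split "income in {low, medium}?"
income = ["high", "high", "high", "medium", "low", "low", "low",
          "medium", "low", "medium", "medium", "medium", "high", "medium"]
print(round(gini_split(income, buys, {"low", "medium"}), 3))   # 0.443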
4.3. Bayesian Classification Methods
Bayesian classification is based on Bayes’ theorem. Bayesian classifiers
are statistical classifiers. They can predict class membership
probabilities such as the probability that a given tuple belongs to a
particular class.
 Probabilistic learning: Calculate explicit probabilities for hypothesis,
among the most practical approaches to certain types of learning
problems
 Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct. Prior
knowledge can be combined with observed data.
 Probabilistic prediction: Predict multiple hypotheses, weighted by
their probabilities
 Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision making
against which other methods can be measured.

22
Bayesian Classification Contd..
Bayes’ Theorem is named after Thomas Bayes.
 Let X be a data tuple. In Bayesian terms, X is considered as
“evidence” whose class label is “unknown” and is described by
measurements made on a set of n attributes.
 Let H be some Hypothesis such as that the data tuple X belongs to a
specified class C.
 For classification problems, we want to determine P(H|X), (posterior
probability of H conditioned on X) the probability that the
hypothesis H holds given the “evidence” or observed data tuple X.
 P(H) is the prior probability, or a priori probability, of H, which is
independent of X (the class prior probability).
 Similarly, P(X|H) (the likelihood) is the posterior probability of X
conditioned on H.
 P(X) is the prior probability of X (the predictor prior probability).
 Bayes’ theorem is:

   P(H|X) = P(X|H) P(H) / P(X)

 where P(H|X) is the posterior probability of H conditioned on X.
23
Bayesian Classification Example
 X: a 35-year-old customer with an income of $40,000 and a fair credit rating.
 H: the hypothesis that the customer will buy a computer.

   P(H|X) = P(X|H) P(H) / P(X)

   (posterior probability = likelihood × class prior probability / predictor prior probability)
 P(H|X) (posterior probability of H conditioned on X): Probability (of
hypothesis H) that the customer will buy a computer given that we know
(“evidence”, X) his age, income and credit rating.
 P(H) (prior probability of H): Probability (of H) that the customer will buy a
computer regardless of age, income and credit rating (independent of X).
 P(X|H) (posterior probability of X conditioned on H): Probability that the
customer is 35 years old customer with an income of $40,000 and fair credit
rating, given that he has bought our computer.
 P(X) (prior probability of X): Probability that a person from our set of
customers is 35 years old, earns $40,000, and has a fair credit rating.
24
Naive Bayesian Classification
 Naive Bayesian classifiers assume that the effect of an attribute value
on a given class is independent of the values of the other attributes.
This assumption is called class conditional independence. It is made
to simplify the computations involved and, in this sense, is
considered “naive.”
Naive Bayesian Classification or simple Bayesian classifier:
 D be a training set of tuples
 n-dimensional attribute vector, X = (x1, x2, … , xn),
 m classes, C1, C2, … , Cm.
 Given a tuple, X, the classifier will predict that X belongs to the class
having the highest posterior probability, conditioned on X. That is,
the naive Bayesian classifier predicts that tuple X belongs to the class
Ci if and only if P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i.
 Thus, we maximize P(Ci|X). The class Ci for which P(Ci|X) is
maximized is called the maximum posteriori hypothesis.

25
Contd..
 By Bayes’ theorem:  P(Ci|X) = P(X|Ci) P(Ci) / P(X)
 As P(X) is constant for all classes, only P(X|Ci)P(Ci) needs to be
maximized.
 The class prior probabilities may be estimated by
P(Ci)= |Ci,D| / |D|, where |Ci,D| is the number of training tuples of
class Ci in D.
 To reduce computation in evaluating P(X|Ci), the naive assumption
of class-conditional independence is made. Thus,
 P(X|Ci) = ∏_{k=1..n} P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci).
 We can easily estimate the probabilities P(x1|Ci), P(x2|Ci), …,
P(xn|Ci) from the training tuples. Here xk refers to the value of
attribute Ak for tuple X.
 For each attribute, we look at whether the attribute is categorical or
continuous-valued.
26
Contd..
 For instance, to compute P(X|Ci), we consider the following:
 (a) If Ak is categorical, then P(xk|Ci) is the number of tuples of class
Ci in D having the value xk for Ak, divided by |Ci,D|, the number of
tuples of class Ci in D.
 (b) If Ak is continuous-valued, then we need to do a bit more work, but
the calculation is pretty straightforward. A continuous-valued
attribute is typically assumed to have a Gaussian distribution with a
mean μ and standard deviation σ, defined by

   g(x, μ, σ) = ( 1 / (√(2π) · σ) ) · e^( −(x − μ)² / (2σ²) )

 so that P(xk|Ci) = g(xk, μ_Ci, σ_Ci).


 To predict the class label of X, P(X|Ci) P(Ci) is evaluated for each
class Ci . The classifier predicts that the class label of tuple X is the
class Ci if and only if P(X|Ci) P(Ci) > P(X|Cj) P(Cj) for 1 ≤ j ≤ m, j ≠ i.
 In other words, the predicted class label is the class Ci for which
P(X|Ci) P(Ci) is the maximum.
27
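A minimal naive Bayesian classifier sketch for categorical attributes, following the estimates above: P(Ci) is estimated as |Ci,D| / |D| and each P(xk|Ci) by the relative frequency of value xk within class Ci. The helper names and the five-tuple sample are illustrative, and no Laplacian correction is applied, so an attribute value never seen with a class drives that class's score to zero.

# Minimal sketch of a naive Bayesian classifier for categorical attributes.
from collections import Counter, defaultdict

def train(rows, target):
    """Estimate class priors P(Ci) and per-class value counts for P(xk|Ci)."""
    class_counts = Counter(r[target] for r in rows)
    cond_counts = defaultdict(Counter)            # (class, attribute) -> value counts
    for r in rows:
        for attr, value in r.items():
            if attr != target:
                cond_counts[(r[target], attr)][value] += 1
    priors = {c: n / len(rows) for c, n in class_counts.items()}
    return priors, cond_counts, class_counts

def predict(x, priors, cond_counts, class_counts):
    """Return the class Ci maximizing P(Ci) * product of P(xk|Ci) over attributes."""
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for attr, value in x.items():
            score *= cond_counts[(c, attr)][value] / class_counts[c]
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# A few tuples from the buys_computer training data shown earlier.
data = [
    {"age": "youth",  "student": "no",  "buys_computer": "no"},
    {"age": "youth",  "student": "yes", "buys_computer": "yes"},
    {"age": "middle", "student": "no",  "buys_computer": "yes"},
    {"age": "senior", "student": "yes", "buys_computer": "yes"},
    {"age": "senior", "student": "no",  "buys_computer": "no"},
]
priors, cond, counts = train(data, "buys_computer")
print(predict({"age": "youth", "student": "yes"}, priors, cond, counts))   # yes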
4.4. Rule-Based Classification
 In rule-based classifiers, the learned model is represented as a set of
IF-THEN rules.
4.4.1 Using IF-THEN Rules for Classification:
 Rules are a good way of representing information or bits of
knowledge. A rule-based classifier uses a set of IF-THEN rules for
classification.
 An IF-THEN rule is an expression of the form:
IF condition THEN conclusion
 An example is rule R1: IF age = youth AND student = yes THEN
buys_computer = yes.
 The “IF” part (or left side) of a rule is known as the rule antecedent or
precondition.
 The “THEN” part (or right side) is the rule consequent. In the rule
antecedent, the condition consists of one or more attribute tests (e.g., age =
youth and student = yes) that are logically ANDed.
 The rule’s consequent contains a class prediction (in this case, we are
predicting whether a customer will buy a computer).
 R1 can also be written as: R1: (age = youth) ^ (student = yes) => buys_computer = yes
28
Rule Extraction from a Decision Tree
 We can build a rule based classifier by extracting IF-THEN rules
from a decision tree.
 To extract rules from a decision tree, one rule is created for each path
from the root to a leaf node.
 Each splitting criterion along a given path is logically ANDed to form
the rule antecedent (“IF” part). The leaf node holds the class
prediction, forming the rule consequent (“THEN” part).
 The decision tree of Figure can be converted to classification IF-
THEN rules by tracing the path from the root node to each leaf node
in the tree.
 The rules extracted from Figure are as follows:
• R1: IF age = youth AND student = no THEN buys_computer = no
• R2: IF age = youth AND student = yes THEN buys_computer =
yes
• R3: IF age = middle THEN buys_computer = yes
• R4: IF age = senior AND credit_rating = excellent THEN
buys_computer = no
• R5: IF age = senior AND credit_rating = fair THEN
buys_computer = yes
29
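A small sketch of how the extracted IF-THEN rules can be applied to a tuple: each rule is stored as an ANDed set of attribute tests plus a class consequent, and the tuple receives the class of the first rule whose antecedent it satisfies. The rule set is R1-R5 above; the dictionary representation and the default class are assumptions for illustration.

# Minimal sketch of a rule-based classifier using rules R1-R5 extracted above.

rules = [
    ({"age": "youth",  "student": "no"},              "no"),   # R1
    ({"age": "youth",  "student": "yes"},             "yes"),  # R2
    ({"age": "middle"},                               "yes"),  # R3
    ({"age": "senior", "credit_rating": "excellent"}, "no"),   # R4
    ({"age": "senior", "credit_rating": "fair"},      "yes"),  # R5
]

def classify(x, rules, default="no"):
    """Return the consequent of the first rule whose ANDed tests tuple x satisfies."""
    for antecedent, consequent in rules:
        if all(x.get(attr) == value for attr, value in antecedent.items()):
            return consequent
    return default   # fallback class if no rule fires (an illustrative assumption)

print(classify({"age": "senior", "student": "yes", "credit_rating": "fair"}, rules))  # yes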
4.5. Model Evaluation and Selection
 Metrics for Evaluating Classifier Performance
 Positive tuples (P): tuples of the main class of interest.
Ex: buys_computer=yes
 Negative tuples (N): all other tuples. Ex: buys_computer=no.
“Building blocks” used in computing many evaluation measures:
 True positives (TP): These refer to the positive tuples that were
correctly labeled by the classifier. (e.g., buys_computer = yes)
 True negatives (TN): These are the negative tuples that were
correctly labeled by the classifier. (e.g., buys_computer = no)
 False positives (FP): These are the negative tuples that were
incorrectly labeled as positive (e.g., tuples of class buys_computer =
no for which the classifier predicted buys_computer = yes).
 False negatives (FN): These are the positive tuples that were
mislabeled as negative (e.g., tuples of class buys_computer = yes for
which the classifier predicted buys_computer = no).
 These terms are summarized in the confusion matrix of Figure .
30
Contd..
 The confusion matrix is a useful tool for analyzing how well your
classifier can recognize tuples of different classes. TP and TN tell us
when the classifier is getting things right, while FP and FN tell us
when the classifier is getting things wrong.

                       Predicted class
                       yes     no      Total
Actual class   yes     TP      FN      P
               no      FP      TN      N
               Total   P’      N’      P + N

Table: Confusion matrix for the classes buys_computer = yes and
buys_computer = no, where an entry in row i and column j shows the
number of tuples of class i that were labeled by the classifier as class j.
Ideally, the off-diagonal entries (FP and FN) should be zero or close to zero.

Fig: Confusion matrix, shown with totals for positive and negative tuples
31
Metrics for Evaluating Classifier Performance

32
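The measures this slide refers to are the standard ones defined from the confusion-matrix counts TP, TN, FP, and FN, with P = TP + FN positive tuples and N = FP + TN negative tuples. A minimal sketch with made-up counts:

# Standard evaluation measures computed from confusion-matrix counts (illustrative values).
TP, TN, FP, FN = 90, 80, 20, 10
P, N = TP + FN, FP + TN

accuracy    = (TP + TN) / (P + N)    # recognition rate
error_rate  = (FP + FN) / (P + N)    # misclassification rate
sensitivity = TP / P                 # true positive rate (recall)
specificity = TN / N                 # true negative rate
precision   = TP / (TP + FP)
f_measure   = 2 * precision * sensitivity / (precision + sensitivity)

print(f"accuracy={accuracy:.2f}  error={error_rate:.2f}  recall={sensitivity:.2f}  "
      f"specificity={specificity:.2f}  precision={precision:.2f}  F1={f_measure:.2f}")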
4.6. Techniques to Improve Classification Accuracy
 An ensemble for classification is a composite model, made up of a
combination of classifiers.
 The individual classifiers vote, and a class label prediction is returned
by the ensemble based on the collection of votes.
 Ensembles tend to be more accurate than their component classifiers.
Popular Ensemble methods:
 Bagging
 Boosting and
 Random forests

Fig: Increasing classifier accuracy. Ensemble methods generate a set of
classification models, M1, M2, …, Mk. Given a new data tuple to classify, each
classifier “votes” for the class label of that tuple. The ensemble combines the
votes to return a class prediction.
33
Bagging
 Given a set, D, of d tuples, bagging works as follows. For iteration i
(i = 1, 2, …, k), a training set, Di, of d tuples is sampled with
replacement from the original set of tuples, D.
 Note that the term bagging stands for bootstrap aggregation. Each
training set is a bootstrap sample.
 A classifier model, Mi , is learned for each training set, Di .
 To classify an unknown tuple, X, each classifier, Mi , returns its class
prediction, which counts as one vote.
 The bagged classifier, M*, counts the votes and assigns the class with
the most votes to X.
 Bagging can be applied to the prediction of continuous values by
taking the average value of each prediction for a given test tuple.
 The bagged classifier often has significantly greater accuracy than a
single classifier derived from D, the original training data.
 The increased accuracy occurs because the composite model reduces
the variance of the individual classifiers.

34
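A minimal sketch of the bagging procedure just described: k bootstrap samples are drawn with replacement from D, one classifier is learned from each, and the bagged classifier M* returns the majority vote. A trivial majority-class learner stands in for a real base classifier such as a decision tree; all names are illustrative.

# Minimal bagging sketch: bootstrap samples, one model each, majority vote.
import random
from collections import Counter

def train_base_classifier(sample):
    """Placeholder learner: always predicts the sample's majority class."""
    majority = Counter(label for _, label in sample).most_common(1)[0][0]
    return lambda x: majority

def bagging(D, k, learn=train_base_classifier):
    models = []
    for _ in range(k):
        bootstrap = [random.choice(D) for _ in range(len(D))]   # sample with replacement
        models.append(learn(bootstrap))
    def bagged_classifier(x):
        votes = Counter(m(x) for m in models)                   # one vote per model
        return votes.most_common(1)[0][0]
    return bagged_classifier

D = [({"age": "youth"}, "no"), ({"age": "middle"}, "yes"), ({"age": "senior"}, "yes")]
M_star = bagging(D, k=5)
print(M_star({"age": "youth"}))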
Boosting
 In boosting, weights are also assigned to each training tuple. A series
of k classifiers is iteratively learned.
 After a classifier, Mi , is learned, the weights are updated to allow the
subsequent classifier,Mi+1, to “pay more attention” to the training
tuples that were misclassified by Mi .
 The final boosted classifier, M*, combines the votes of each individual
classifier, where the weight of each classifier’s vote is a function of its
accuracy.
 AdaBoost (short for Adaptive Boosting) is a popular boosting
algorithm.
 While both Bagging and Boosting can significantly improve accuracy
in comparison to a single model, boosting tends to achieve greater
accuracy.

35
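The sketch below illustrates one boosting round using a common AdaBoost-style update (one of several formulations, not necessarily the exact one in the textbook): the classifier's vote weight alpha grows as its error shrinks, misclassified tuples are up-weighted so the next classifier pays more attention to them, and the weights are renormalized.

# One round of AdaBoost-style bookkeeping; the base learner itself is assumed.
import math

def adaboost_round(weights, misclassified):
    """weights: current tuple weights; misclassified: one boolean per tuple."""
    error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    alpha = 0.5 * math.log((1 - error) / error)          # classifier's vote weight
    new_weights = [w * math.exp(alpha if wrong else -alpha)
                   for w, wrong in zip(weights, misclassified)]
    total = sum(new_weights)
    return [w / total for w in new_weights], alpha       # renormalize to sum to 1

weights = [0.25, 0.25, 0.25, 0.25]                       # initially uniform, 4 tuples
weights, alpha = adaboost_round(weights, [False, True, False, False])
print([round(w, 3) for w in weights], round(alpha, 3))
# the misclassified tuple's weight rises from 0.25 to 0.5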
Random Forests
 Imagine that each of the classifiers in the ensemble is a decision tree
classifier so that the collection of classifiers is a “forest.”
 The individual decision trees are generated using a random selection
of attributes at each node to determine the split.
 More formally, each tree depends on the values of a random vector
sampled independently and with the same distribution for all trees in
the forest.
 During classification, each tree votes and the most popular class is
returned.

 Because random forests consider many fewer attributes for each split,
they are efficient on very large databases. They can be faster than
either bagging or boosting. Random forests give internal estimates of
variable importance.

36
Improving Classification Accuracy of
Class-Imbalanced Data
 Given two-class data, the data are class-imbalanced if the main class
of interest (the positive class) is represented by only a few tuples,
while the majority of tuples represent the negative class.
Approaches include
1) Oversampling works by resampling the positive tuples so that the
resulting training set contains an equal number of positive and
negative tuples.
2) Undersampling works by decreasing the number of negative tuples.
It randomly eliminates tuples from the majority (negative) class
until there are an equal number of positive and negative tuples.
3) Threshold-moving moves the threshold, t, so that the rare class
tuples are easier to classify (and hence, there is less chance of costly
false negative errors). It applies to classifiers that return a continuous
output score, such as naive Bayesian classifiers and (backpropagation)
neural network classifiers.
4) Ensemble techniques - Ensemble multiple classifiers introduced
above.

37
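A small sketch of approaches (1) and (2): random oversampling duplicates tuples of the rare positive class (sampling with replacement), while random undersampling discards tuples of the majority negative class, until the two classes are balanced. The data and function names are illustrative.

# Minimal sketch of random oversampling and undersampling for two-class data.
import random

def rebalance(positives, negatives, method="oversample"):
    if method == "oversample":        # resample the rare (positive) class with replacement
        positives = positives + [random.choice(positives)
                                 for _ in range(len(negatives) - len(positives))]
    elif method == "undersample":     # randomly drop tuples from the majority class
        negatives = random.sample(negatives, len(positives))
    return positives + negatives

pos = [("fraud", 1), ("fraud", 2)]                       # rare class of interest
neg = [("ok", i) for i in range(10)]                     # majority class
print(len(rebalance(pos, neg, "oversample")))            # 20 tuples, 10 per class
print(len(rebalance(pos, neg, "undersample")))           # 4 tuples, 2 per class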
4.7 Classification by Neural Networks
 A neural network is a set of connected input/output units in which
each connection has a weight associated with it.
 During the learning phase, the network learns by adjusting the
weights so as to be able to predict the correct class label of the input
tuples.
 Neural network learning is also referred to as connectionist learning
due to the connections between units.
Advantages
• prediction accuracy is generally high
• robust, works when training examples contain errors
• output may be discrete, real-valued, or a vector of several discrete
or real-valued attributes
• fast evaluation of the learned target function
Criticism
• long training time
• difficult to understand the learned function (weights)
• not easy to incorporate domain knowledge
38
A Multilayer Feed-Forward Neural Network
 A multilayer feed-forward neural network consists of an input layer, one or
more hidden layers, and an output layer.
 Each layer is made up of units. The inputs to the network correspond to the
attributes measured for each training tuple.
 The inputs are fed simultaneously into the units making up the input layer.
 These inputs pass through the input layer and are then weighted and
fed simultaneously to a second layer of “neuronlike” units, known as a
hidden layer. The outputs of the hidden layer units can be input to
another hidden layer, and so on.

Fig: Multilayer feed-forward neural network (the input vector xi feeds the
input layer; the last hidden layer feeds the output layer, which emits the
output vector).
 The number of hidden layers is arbitrary, although in practice, usually only
one is used. The weighted outputs of the last hidden layer are input to units
making up the output layer, which emits the network’s prediction for given
tuples.
39
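A minimal forward-pass sketch for a network with one hidden layer: each unit computes a weighted sum of its inputs plus a bias and applies a sigmoid activation, and the hidden layer's outputs feed the output unit. The weights and biases are illustrative constants; in practice they would be learned from the training tuples (e.g., by backpropagation).

# One forward pass through a small multilayer feed-forward network.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_output(inputs, weights, biases):
    """weights[j] holds the incoming connection weights of unit j in this layer."""
    return [sigmoid(sum(w * x for w, x in zip(w_j, inputs)) + b)
            for w_j, b in zip(weights, biases)]

x = [1.0, 0.0, 1.0]                                   # input vector (one tuple)
hidden_w = [[0.2, 0.4, -0.5], [-0.3, 0.1, 0.2]]       # 2 hidden units, 3 inputs each
hidden_b = [-0.4, 0.2]
output_w = [[-0.3, -0.2]]                             # 1 output unit fed by 2 hidden units
output_b = [0.1]

h = layer_output(x, hidden_w, hidden_b)               # hidden-layer activations
y = layer_output(h, output_w, output_b)               # network's prediction
print([round(v, 3) for v in h], round(y[0], 3))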
Network Pruning and Rule Extraction
Network pruning
 A fully connected network is hard to articulate (interpret).
 N input nodes, h hidden nodes and m output nodes lead to h(m+N)
weights.
 Pruning: Remove some of the links without affecting classification
accuracy of the network.
Extracting rules from a trained network
 Discretize activation values; replace individual activation value by the
cluster average maintaining the network accuracy.
 Enumerate the output from the discretized activation values to find
rules between activation value and output.
 Find the relationship between the input and activation value.
 Combine the above two to have rules relating the output to input.

40
4.8 Support Vector Machines
 Support vector machines (SVMs) are a method for the classification of
both linear and nonlinear data.
 An SVM is an algorithm that uses a nonlinear mapping to transform
the original training data into a higher dimension.
The Case When the Data Are Linearly Separable:
 Let us consider a two-class problem, where the classes are linearly
separable.
 Let the data set D be given as (X1, y1), (X2, y2), …, (X|D|, y|D|), where
Xi is a training tuple with associated class label yi.
 Each yi can take one of two values, either +1 or -1 (i.e., yi ∈ {+1, -1}),
corresponding to the classes buys_computer=yes and
buys_computer=no, respectively.
 From the graph, we see that the 2-D data are linearly separable (or
“linear,” for short), because a straight line can be drawn to separate
all the tuples of class +1 from all the tuples of class -1.

41
Contd..
Fig: 2-D training data plotted on attributes A1 and A2. Class 1: y = +1
(buys_computer = yes); class 2: y = -1 (buys_computer = no).
 The 2-D training data are linearly separable.
 There are an infinite number of separating lines that could be drawn.
How can we find the best one?
 If our data were 3-D (i.e., with three attributes), we would want to find
the best separating plane.
 Generalizing to n dimensions, we want to find the best hyperplane or
decision boundary, regardless of the number of input attributes.
 An SVM approaches this problem by searching for the maximum
marginal hyperplane.

42
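Once training has produced a separating hyperplane w · x + b = 0, classifying a test tuple only requires the sign of the decision function, as in this sketch (the weight vector and bias are made-up stand-ins for trained SVM parameters):

# Classify by the sign of the linear decision function w . x + b.

def svm_predict(x, w, b):
    """Return +1 (e.g., buys_computer = yes) or -1 (buys_computer = no)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

w, b = [0.8, -0.5], -0.2                 # assumed trained parameters for 2-D data (A1, A2)
print(svm_predict([1.0, 0.4], w, b))     # +1: falls on the positive side
print(svm_predict([0.1, 0.9], w, b))     # -1: falls on the negative side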
Contd..
Fig: Two possible separating hyperplanes for the 2-D training data (class 1:
y = +1, buys_computer = yes; class 2: y = -1, buys_computer = no; attributes
A1, A2): one with a small margin and one with a larger margin.
43
4.9 Classification Using Frequent Patterns:
Pattern-Based Classification
 Frequent patterns show interesting relationships between attribute–
value pairs that occur frequently in a given data set.
 Frequent pattern mining or frequent itemset mining is the search for
these frequent patterns.
 Analysis is useful in many decision-making processes such as product
placement, catalog design, and cross-marketing.
1) Associative classification: where association rules are generated from
frequent patterns and used for classification.
2) Discriminative frequent pattern–based classification: where frequent
patterns serve as combined features, which are considered in addition to
single features when building a classification model.

44
4.9.1) Associative classification
Associative classification: consists of the following steps:
1. Mine data to find strong associations between frequent patterns
(conjunctions of attribute-value pairs) and class labels
Ex: age = youth
2. Analyze the frequent itemsets to generate association rules per class,
which satisfy confidence and support criteria.
Ex: age=youth ^ credit=OK => buys_computer=yes [support=20%,
confidence=93%], where ^ represents a logical “AND”.
3. Organize the rules to form a rule-based classifier.

Typical Associative Classification Methods:


i. CBA (Classification Based on Associations)
ii. CMAR (Classification based on Multiple Association Rules)
iii. CPAR (Classification based on Predictive Association Rules)

45
1) CBA (Classification Based on Associations)
 CBA (Classification Based on Associations): uses an iterative
approach to frequent itemset mining. It consists of two parts.
i. The rule generator (CBA-RG) mines the possible Classification
Association Rules (CARs), based on the Apriori algorithm, in the form:
cond-set (a set of attribute-value pairs) -> class label.
ii. The classifier builder (CBA-CB) organizes the rules in decreasing
precedence, based on confidence and then support. Rule R1 precedes rule R2 if:
• R1 has higher confidence than R2, or
• R1 and R2 have the same confidence but R1 has higher support, or
• R1 and R2 have the same confidence and support but R1 was generated first
(i.e., R1 has fewer items than R2).
 When classifying a new tuple, the first rule satisfying the tuple is
used to classify it.
 CBA is more accurate than C4.5 on large datasets.
 CBA limitations:
• Single rule coverage: classification relies on the single most effective
rule (the one with the highest confidence).
• Too many rules: storage overhead (pruning is needed) and
computational overhead.
46
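A sketch of the CBA-CB precedence ordering described above: rules are ranked by decreasing confidence, then decreasing support, then generation order. The rule tuples, including their support and confidence values, are illustrative.

# Ordering classification association rules by (confidence, support, generation order).

rules = [
    # (antecedent, predicted class, confidence, support, generation index)
    ("age=youth ^ credit=OK", "buys_computer=yes", 0.93, 0.20, 2),
    ("income=high",           "buys_computer=yes", 0.93, 0.25, 1),
    ("age=senior",            "buys_computer=no",  0.80, 0.30, 3),
]

# Higher confidence first, then higher support, then the earlier-generated rule.
ordered = sorted(rules, key=lambda r: (-r[2], -r[3], r[4]))
for antecedent, label, conf, sup, _ in ordered:
    print(f"IF {antecedent} THEN {label}  [support={sup:.0%}, confidence={conf:.0%}]")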
2) CMAR (Classification based on
Multiple Association Rules)
Instead of relying on a single rule for classification, CMAR determines
the class label by a set of rules.
i. Candidate generation
• CMAR adopts an enhanced FP-growth algorithm (faster than
Apriori used by CBA) to find the complete set of classification
association rules (CARs) satisfying the minimum confidence and
minimum support thresholds.
• To improve both accuracy and efficiency, CMAR employs a novel
data structure, CR-tree, to compactly store and efficiently retrieve a
large number of rules for classification and to prune rules based on
confidence, correlation, and database coverage.
• Pruning mechanisms:
- Precedence relationship
- Positive correlation to class label (Χ2 chi-square)
- Multiple database coverage

47
CMAR Contd..
ii. Classification
• It divides the rules into groups according to class labels. All rules
within a group share the same class label and each group has a
distinct class label.
• CMAR uses a weighted Χ2 (chi-square) measure to find the
“strongest” group of rules, based on the statistical correlation of
rules within a group. It then assigns X the class label of the
strongest group.
ADVANTAGES:
• Outperforms C4.5 and CBA on accuracy
• Less storage requirements compared to CBA
• Lower running time compared to CBA
• Accuracy does not depend too much on confidence and coverage
threshold
LIMITATIONS:
• Many rules generated
• Confidence-based rule evaluation, which may lead to overfitting
48
3) CPAR (Classification based on
Predictive Association Rules)
iii) CPAR (Classification based on Predictive Association Rules):
• Generation of predictive rules (FOIL-like analysis), but covered
tuples are allowed to remain, with reduced weight.
• FOIL (First Order Inductive Learner) builds rules to distinguish
positive tuples (e.g., buys_computer=yes) from negative tuples
(e.g., buys_computer=no). For multiclass problems, FOIL is
applied to each class.
• Prediction using best k rules.
• High efficiency, accuracy similar to CMAR.

49
Contd..
2) Discriminative Frequent Pattern–Based Classification:
1. Feature generation:
• The data, D, are partitioned according to class label.
• Use frequent itemset mining to discover frequent patterns in
each partition, satisfying minimum support.
• The collection of frequent patterns, F, makes up the feature
candidates.
2. Feature selection:
Apply feature selection to F, resulting in FS, the set of selected (more
discriminating) frequent patterns. Information gain, Fisher score,
or other evaluation measures can be used for this step. Relevancy
checking can also be incorporated into this step to weed out
redundant patterns. The data set D is transformed to D′, where
the feature space now includes the single features as well as the
selected frequent patterns, FS.
3. Learning of classification model: A classifier is built on the data
set D′. Any learning algorithm can be used as the classification model.
50
Contd..

Fig: A framework for frequent pattern–based classification:
(a) a two-step general approach versus (b) the direct approach of DDPMine.

51
