B. Tech VI Semester
COMPUTER SCIENCE AND ENGINEERING
VCE-R15 2017 – 2018
DATA MINING AND DATA WAREHOUSING
(A3522)
UNIT – IV
CLASSIFICATION
A. BHANU PRASAD
Associate Professor of CSE
9885990509
[email protected]
REFERENCE BOOKS:
1. Margaret H Dunham (2006), Data Mining Introductory and
Advanced Topics, 2nd edition, Pearson Education, New Delhi, India.
2. Amitesh Sinha (2007), Data Warehousing, Thomson Learning, India.
3. Xindong Wu, Vipin Kumar (2009), The Top Ten Algorithms in Data
Mining, CRC Press, UK.
4. Max Bramer (2007), Principles of Data Mining, Springer, USA.
UNIT – IV CONTENTS
4. CLASSIFICATION
4.1 Basic Concepts
4.2 Decision Tree Induction
4.3 Bayesian Classification Methods
4.4 Rule-Based Classification
4.5 Model Evaluation and Selection
4.6 Techniques to Improve Classification Accuracy
4.7 Classification by Neural Networks
4.8 Support Vector Machines
4.9 Classification Using Frequent Patterns: Pattern-Based
Classification
4.10 Lazy Learners
4. CLASSIFICATION
4.1. Basic Concepts
There are two forms of data analysis that can be used to extract
models describing important data classes or to predict future data trends:
1) Classification
2) Prediction
1) Classification is a form of data analysis that extracts models
describing important data classes. Such models, called classifiers,
predict categorical (discrete, unordered) class labels.
It classifies data (constructs a model) based on the training set
and the values (class labels) in a classifying attribute and uses it in
classifying new data.
2) Prediction models continuous-valued functions, i.e., predicts
unknown or missing values.
(a) Learning (model construction): Training data are analyzed by a
classification algorithm to build a model (classifier), which is then
applied to test data and to new data.
Fig: Training data (attributes name, age, income; class label loan_decision):
  Bello    senior   low    safe
  Sylvia   middle   low    risky
  Anne     middle   high   safe
  ...
New data: (John Henry, middle, low) -> Loan decision? (Prediction)
Contd..
(b) Classification: Test data are used to estimate the accuracy of the
classification rules. If the accuracy is considered acceptable, the rules can
be applied to the classification of new data tuples.
Fig: The classifier is applied to testing data to estimate its accuracy,
and then to unseen data.
Testing data (NAME, RANK, YEARS, TENURED):
  Tom      Assistant Prof   2   no
  Merlisa  Associate Prof   7   no
  George   Professor        5   yes
  Joseph   Assistant Prof   7   yes
Unseen data: (Jeff, Professor, 4) -> Tenured? (Prediction)
Supervised vs. Unsupervised Learning
Supervised learning (classification)
• Supervision: The training data (observations, measurements, etc.)
are accompanied by labels indicating the class of the observations.
• New data is classified based on the training set.
Unsupervised learning (clustering)
• The class labels of training data are unknown.
• The number or set of classes to be learned may not be known in
advance.
• Given a set of measurements, observations, etc. with the aim of
establishing the existence of classes or clusters in the data.
Issues regarding Classification & Prediction
1) Data Preparation
Data cleaning: Preprocess data in order to reduce noise and
handle missing values
Relevance analysis (feature selection): Remove the irrelevant or
redundant attributes
Data transformation: Generalize and/or normalize data
2) Evaluating Classification Methods
Predictive accuracy
Speed and scalability
• time to construct the model
• time to use the model
• efficiency in disk-resident databases
Robustness: handling noise and missing values
Interpretability: understanding and insight provided by the model
Goodness of rules
• decision tree size
• compactness of classification rules
Classification Techniques
1) Decision Tree based Methods
2) Rule-based Methods
3) Memory based reasoning
4) Neural Networks
5) Naïve Bayes and Bayesian Belief Networks
6) Support Vector Machines
4.2. Decision Tree Induction
Decision tree induction is the learning of decision trees from class-
labeled training tuples.
Decision tree: A flowchart-like tree structure where
• Internal (non-leaf) node denoted by rectangle represents a test on
an attribute
• Branch represents an outcome of the test
• Leaf (terminal) nodes denoted by ovals represent class labels or
class distribution
• Root node is the topmost node in a tree.
Decision tree generation consists of two phases:
1) Tree construction
• Partition examples recursively based on selected attributes.
2) Tree pruning
• Identify and remove branches that reflect noise or outliers.
Using the tree: Given a tuple, X, for which the associated class label is
unknown, the attribute values of the tuple are tested against the decision
tree. A path is traced from the root to a leaf node, which holds the class
prediction for that tuple. Decision trees can easily be converted to
classification rules.
Contd..
Use of decision tree:
• Classifying an unknown sample.
• Test the attribute values of the sample against the decision tree.
• Decision tree induction algorithms have been used for
classification in many application areas such as medicine,
manufacturing and production, financial analysis, astronomy, and
molecular biology.
• Decision trees are the basis of several commercial rule induction
systems.
Training Dataset
Ex: A marketing manager at a company needs to predict whether a customer
with a given profile will buy a new computer.
RID  age     income  student  credit_rating  Class: buys_computer
1    youth   high    no       fair           no
2    youth   high    no       excellent      no
3    middle  high    no       fair           yes
4    senior  medium  no       fair           yes
5    senior  low     yes      fair           yes
6    senior  low     yes      excellent      no
7    middle  low     yes      excellent      yes
8    youth   medium  no       fair           no
9    youth   low     yes      fair           yes
10   senior  medium  yes      fair           yes
11   youth   medium  yes      excellent      yes
12   middle  medium  no       excellent      yes
13   middle  high    yes      fair           yes
14   senior  medium  no       excellent      no
Output: A Decision Tree for “buys_computer”
A decision tree for the concept buys computer, indicating whether a
customer is likely to purchase a computer.
Each internal (nonleaf) node represents a test on an attribute.
Each leaf node represents a class (either buys_computer = yes or
buys_computer = no).
Fig: Decision tree for buys_computer. The root node tests age: youth leads
to a further test on student, middle leads to the leaf "yes", and senior
leads to a further test on credit_rating.
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
• Tree is constructed in a top-down recursive divide-and-conquer
manner
• At start, all the training examples are at the root
• Attributes are categorical (if continuous-valued, they are
discretized in advance)
• Examples are partitioned recursively based on selected attributes
• Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
Conditions for stopping partitioning
• All samples for a given node belong to the same class
• There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
• There are no samples left
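The basic algorithm above can be illustrated with a minimal Python sketch
(an assumption-laden illustration, not the textbook's pseudocode):
categorical attributes only, tuples stored as dictionaries, information
gain via entropy as the selection measure, and illustrative helper names.

import math
from collections import Counter

def entropy(rows, target):
    # Expected information (entropy) of the class distribution in rows.
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_attribute(rows, attrs, target):
    # Pick the attribute with the highest information gain.
    base = entropy(rows, target)
    def gain(a):
        value_counts = Counter(r[a] for r in rows)
        remainder = sum((n / len(rows)) * entropy([r for r in rows if r[a] == v], target)
                        for v, n in value_counts.items())
        return base - remainder
    return max(attrs, key=gain)

def build_tree(rows, attrs, target):
    classes = [r[target] for r in rows]
    if len(set(classes)) == 1:              # all tuples in one class -> leaf
        return classes[0]
    if not attrs:                           # no attributes left -> majority voting
        return Counter(classes).most_common(1)[0][0]
    a = best_attribute(rows, attrs, target)
    node = {a: {}}
    for v in set(r[a] for r in rows):       # partition on each value of the chosen attribute
        subset = [r for r in rows if r[a] == v]
        node[a][v] = build_tree(subset, [x for x in attrs if x != a], target)
    return node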
Extracting Classification Rules from Trees
Represent the knowledge in the form of IF-THEN rules
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction
The leaf node holds the class prediction
Rules are easier for humans to understand
Example
• IF age=“youth” AND student=“no” THEN buys_computer=“no”
• IF age=“youth” AND student=“yes” THEN buys_computer=“yes”
• IF age=“middle” THEN buys_computer=“yes”
• IF age=“senior” AND credit_rating=“excellent” THEN buys_computer=“no”
• IF age=“senior” AND credit_rating=“fair” THEN buys_computer=“yes”
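Rule extraction can be sketched in a few lines of Python, assuming the
nested-dictionary tree representation used in the induction sketch above
(one IF-THEN rule per root-to-leaf path; the function name is illustrative).

def extract_rules(tree, conditions=()):
    # Walk each root-to-leaf path; every attribute-value test along a path is a conjunct.
    if not isinstance(tree, dict):                   # leaf node -> emit one IF-THEN rule
        antecedent = " AND ".join('{}="{}"'.format(a, v) for a, v in conditions)
        return ["IF {} THEN class = \"{}\"".format(antecedent or "TRUE", tree)]
    rules = []
    (attr, branches), = tree.items()                 # each internal node tests one attribute
    for value, subtree in branches.items():
        rules.extend(extract_rules(subtree, conditions + ((attr, value),)))
    return rules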
Attribute Selection Measures
An attribute selection measure is a heuristic for selecting the splitting
criterion that “best” separates a given data partition, D, of class-
labeled training tuples into individual classes. In other words, we are
looking for the probability that tuple X belongs to class C, given that
we know the attribute description of X.
Information Gain
Gain Ratio
Gini Index
The Gini index is used in CART. Using the notation previously
described, the Gini index measures the impurity of D, a data partition
or set of training tuples, as
Gini(D) = 1 − Σ (i=1 to m) pi²
where pi is the probability that a tuple in D belongs to class Ci,
estimated by |Ci,D| / |D|.
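As a worked illustration, the following Python sketch computes Info(D), the
information gain and gain ratio of the age split, and Gini(D) for the
buys_computer training data above (helper names are illustrative).

import math

def info(counts):
    # Info(D) = -sum(p_i * log2(p_i)) over the class counts.
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)

def gini(counts):
    # Gini(D) = 1 - sum(p_i^2)
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# buys_computer data above: 9 "yes" and 5 "no" tuples overall.
D = [9, 5]
# Split on age: youth -> [2 yes, 3 no], middle -> [4 yes, 0 no], senior -> [3 yes, 2 no].
partitions = [[2, 3], [4, 0], [3, 2]]

info_D = info(D)                                               # ~0.940 bits
info_age = sum(sum(p) / sum(D) * info(p) for p in partitions)  # ~0.694 bits
gain_age = info_D - info_age                                   # information gain ~0.246
split_info = info([sum(p) for p in partitions])                # ~1.577
gain_ratio_age = gain_age / split_info                         # gain ratio ~0.156
print(gain_age, gain_ratio_age, gini(D))                       # Gini(D) ~0.459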
4.3. Bayesian Classification Methods
Bayesian classification is based on Bayes’ theorem. Bayesian classifiers
are statistical classifiers. They can predict class membership
probabilities such as the probability that a given tuple belongs to a
particular class.
Probabilistic learning: Calculate explicit probabilities for hypothesis,
among the most practical approaches to certain types of learning
problems
Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct. Prior
knowledge can be combined with observed data.
Probabilistic prediction: Predict multiple hypotheses, weighted by
their probabilities
Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision making
against which other methods can be measured.
Bayesian Classification Contd..
Bayes’ Theorem is named after Thomas Bayes.
Let X be a data tuple. In Bayesian terms, X is considered as
“evidence” whose class label is “unknown” and is described by
measurements made on a set of n attributes.
Let H be some Hypothesis such as that the data tuple X belongs to a
specified class C.
For classification problems, we want to determine P(H|X), (posterior
probability of H conditioned on X) the probability that the
hypothesis H holds given the “evidence” or observed data tuple X.
P(H) is the prior probability, or a priori probability, of H which is
independent of X.
Similarly, P(X|H) (likelihood) is the posterior probability of X
conditioned on H.
P(X) is the prior probability of X.
Bayes’ theorem:
P(H|X) = P(X|H) P(H) / P(X)
i.e., posterior probability = (likelihood × class prior probability) /
predictor prior probability.
Bayesian Classification Example
P(H|X) = P(X|H) P(H) / P(X)
(posterior probability = likelihood × class prior probability / predictor
prior probability)
P(H|X) (posterior probability of H conditioned on X): Probability (of
hypothesis H) that the customer will buy a computer given that we know
(“evidence”, X) his age, income and credit rating.
P(H) (prior probability of H): Probability (of H) that the customer will buy a
computer regardless of age, income and credit rating (independent of X).
P(X|H) (posterior probability of X conditioned on H): Probability that the
customer is a 35-year-old customer with an income of $40,000 and a fair
credit rating, given that he has bought our computer.
P(X) (prior probability of X): Probability that a person from our set of
customers is 35 years old, earns $40,000, and has a fair credit rating.
Naive Bayesian Classification
Naive Bayesian classifiers assume that the effect of an attribute value
on a given class is independent of the values of the other attributes.
This assumption is called class conditional independence. It is made
to simplify the computations involved and, in this sense, is
considered “naive.”
Naive Bayesian Classification or simple Bayesian classifier:
D be a training set of tuples
n-dimensional attribute vector, X = (x1, x2, … , xn),
m classes, C1, C2, … , Cm.
Given a tuple, X, the classifier will predict that X belongs to the class
having the highest posterior probability, conditioned on X. That is,
the naive Bayesian classifier predicts that tuple X belongs to the class
Ci if and only if P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i.
Thus, we maximize P(Ci|X). The class Ci for which P(Ci|X) is
maximized is called the maximum a posteriori hypothesis.
Contd..
By Bayes’ theorem: P(Ci|X) = P(X|Ci) P(Ci) / P(X)
As P(X) is constant for all classes, only P(X|Ci)P(Ci) needs to be
maximized.
The class prior probabilities may be estimated by
P(Ci)= |Ci,D| / |D|, where |Ci,D| is the number of training tuples of
class Ci in D.
To reduce computation in evaluating P(X|Ci), the naive assumption
of class-conditional independence is made. Thus,
P(X|Ci) = ∏ (k=1 to n) P(xk|Ci)
= P(x1|Ci) * P(x2|Ci) * … * P(xn|Ci).
We can easily estimate the probabilities P(x1|Ci), P(x2|Ci), …, P(xn|Ci)
from the training tuples. Here xk refers to the value of attribute Ak for
tuple X.
For each attribute, we look at whether the attribute is categorical or
continuous-valued.
Contd..
For instance, to compute P(X|Ci), we consider the following:
(a) If Ak is categorical, then P(xk|Ci) is the number of tuples of class
Ci in D having the value xk for Ak, divided by |Ci,D|, the number of
tuples of class Ci in D.
(b) If Ak is continuous-valued, then we need to do a bit more work, but
the calculation is pretty straightforward. A continuous-valued attribute
is typically assumed to have a Gaussian (normal) distribution with mean μ
and standard deviation σ, defined by
g(x, μ, σ) = (1 / (√(2π) σ)) e^(−(x−μ)² / (2σ²))
so that P(xk|Ci) = g(xk, μCi, σCi), where μCi and σCi are the mean and
standard deviation of attribute Ak for the training tuples of class Ci.
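A minimal Python sketch of naive Bayesian classification under these
assumptions (categorical attributes, maximum-likelihood estimates without
Laplace smoothing; function and variable names are illustrative).

from collections import Counter, defaultdict

def train_naive_bayes(rows, target):
    # rows: list of dicts (attribute -> value); target: the class attribute name.
    class_counts = Counter(r[target] for r in rows)
    priors = {c: n / len(rows) for c, n in class_counts.items()}  # P(Ci) = |Ci,D| / |D|
    cond = defaultdict(Counter)                                   # counts for P(xk | Ci)
    for r in rows:
        for a, v in r.items():
            if a != target:
                cond[(r[target], a)][v] += 1
    return priors, cond, class_counts

def classify(x, priors, cond, class_counts):
    # Predict the class Ci maximizing P(Ci) * prod_k P(xk | Ci).
    best, best_score = None, -1.0
    for c, n in class_counts.items():
        score = priors[c]
        for a, v in x.items():
            score *= cond[(c, a)][v] / n      # categorical estimate; 0 for unseen values
        if score > best_score:
            best, best_score = c, score
    return best

# For a continuous-valued attribute Ak, P(xk|Ci) would instead be estimated
# with the Gaussian density g(xk, mu_Ci, sigma_Ci) given above.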
4.5 Model Evaluation and Selection
Fig: Confusion matrix, shown with totals for positive and negative tuples.
Metrics for Evaluating Classifier Performance
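As an illustration of how such a confusion matrix is used, the following
Python sketch computes the standard evaluation measures from the four
counts TP, FP, TN, FN (a sketch of the usual definitions, not a specific
library API).

def evaluate(tp, fp, tn, fn):
    # Standard measures derived from the confusion matrix counts.
    p, n = tp + fn, fp + tn                   # actual positives / actual negatives
    accuracy = (tp + tn) / (p + n)
    error_rate = 1 - accuracy
    sensitivity = tp / p                      # recall, true positive rate
    specificity = tn / n                      # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"accuracy": accuracy, "error_rate": error_rate, "sensitivity": sensitivity,
            "specificity": specificity, "precision": precision, "F1": f1}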
4.6. Techniques to Improve Classification Accuracy
An ensemble for classification is a composite model, made up of a
combination of classifiers.
The individual classifiers vote, and a class label prediction is returned
by the ensemble based on the collection of votes.
Ensembles tend to be more accurate than their component classifiers.
Popular Ensemble methods:
Bagging
Boosting and
Random forests
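As a hedged illustration of the ensemble idea, here is a minimal Python
sketch of bagging (bootstrap aggregation) with majority voting; the
fit/predict base-learner interface is an assumption in the style of
scikit-learn, not something specified above.

import random
from collections import Counter

def bagging_train(rows, labels, make_learner, k=10):
    # Train k component classifiers, each on a bootstrap sample (sampling with replacement).
    models, n = [], len(rows)
    for _ in range(k):
        idx = [random.randrange(n) for _ in range(n)]
        model = make_learner()
        model.fit([rows[i] for i in idx], [labels[i] for i in idx])
        models.append(model)
    return models

def bagging_predict(models, x):
    # Each component classifier votes; the ensemble returns the majority class label.
    votes = Counter(m.predict([x])[0] for m in models)
    return votes.most_common(1)[0][0]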
Boosting
In boosting, weights are also assigned to each training tuple. A series
of k classifiers is iteratively learned.
After a classifier, Mi , is learned, the weights are updated to allow the
subsequent classifier,Mi+1, to “pay more attention” to the training
tuples that were misclassified by Mi .
The final boosted classifier, M*, combines the votes of each individual
classifier, where the weight of each classifier’s vote is a function of its
accuracy.
AdaBoost (short for Adaptive Boosting) is a popular boosting
algorithm.
While both Bagging and Boosting can significantly improve accuracy
in comparison to a single model, boosting tends to achieve greater
accuracy.
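If scikit-learn is available, AdaBoost can be tried directly; the data set
and parameter values below are illustrative assumptions, not part of the
original material.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# 50 boosting rounds; tuples misclassified in round i get larger weights in
# round i+1, and each learner's vote is weighted by its accuracy (AdaBoost).
clf = AdaBoostClassifier(n_estimators=50, random_state=1)
clf.fit(X_tr, y_tr)
print("AdaBoost accuracy:", clf.score(X_te, y_te))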
Random Forests
Imagine that each of the classifiers in the ensemble is a decision tree
classifier so that the collection of classifiers is a “forest.”
The individual decision trees are generated using a random selection
of attributes at each node to determine the split.
More formally, each tree depends on the values of a random vector
sampled independently and with the same distribution for all trees in
the forest.
During classification, each tree votes and the most popular class is
returned.
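A corresponding scikit-learn sketch for random forests (again, the data set
and parameter values are illustrative; max_features controls the random
selection of attributes considered at each split).

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# 100 trees, each split chosen from a random subset of sqrt(d) attributes;
# the prediction is the most popular class among the trees' votes.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=1)
forest.fit(X_tr, y_tr)
print("Random forest accuracy:", forest.score(X_te, y_te))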
Improving Classification Accuracy of
Class-Imbalanced Data
Given two-class data, the data are class-imbalanced if the main class
of interest (the positive class) is represented by only a few tuples,
while the majority of tuples represent the negative class.
Approaches include
1) Oversampling works by resampling the positive tuples so that the
resulting training set contains an equal number of positive and
negative tuples.
2) Undersampling works by decreasing the number of negative tuples.
It randomly eliminates tuples from the majority (negative) class
until there are an equal number of positive and negative tuples.
3) Threshold-moving moves the decision threshold, t, so that the rare-class
tuples are easier to classify (and hence, there is less chance of costly
false negative errors). It applies to classifiers that return a probability
or score, e.g., naive Bayesian classifiers and (backpropagation) neural
network classifiers.
4) Ensemble techniques combine multiple classifiers, as introduced above.
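A minimal Python sketch of random oversampling and undersampling (pure
Python, no imbalance-handling library assumed; function names are
illustrative).

import random

def oversample(pos, neg):
    # Resample the rare (positive) tuples with replacement until the classes are balanced.
    extra = [random.choice(pos) for _ in range(len(neg) - len(pos))]
    return pos + extra, neg

def undersample(pos, neg):
    # Randomly eliminate majority (negative) tuples until the classes are balanced.
    return pos, random.sample(neg, len(pos))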
4.7 Classification by Neural Networks
A neural network is a set of connected input/output units in which
each connection has a weight associated with it.
During the learning phase, the network learns by adjusting the
weights so as to be able to predict the correct class label of the input
tuples.
Neural network learning is also referred to as connectionist learning
due to the connections between units.
Advantages
• prediction accuracy is generally high
• robust, works when training examples contain errors
• output may be discrete, real-valued, or a vector of several discrete
or real-valued attributes
• fast evaluation of the learned target function
Criticism
• long training time
• difficult to understand the learned function (weights)
• not easy to incorporate domain knowledge
A Multilayer Feed-Forward Neural Network
A multilayer feed-forward neural network consists of an input layer, one or
more hidden layers, and an output layer.
Each layer is made up of units. The inputs to the network correspond to the
attributes measured for each training tuple.
The inputs are fed simultaneously into the units making up the input layer.
These inputs pass through the input layer and are then weighted and fed
simultaneously to a second layer of “neuronlike” units, known as a hidden
layer. The outputs of the hidden layer units can be input to another hidden
layer, and so on.
Fig: A multilayer feed-forward neural network mapping an input vector x to
an output vector.
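A minimal NumPy sketch of the forward pass through such a network (one
hidden layer, sigmoid units; the layer sizes and random weights are
illustrative, and the learning of the weights by backpropagation is not
shown).

import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 4, 3, 1                  # e.g., 4 input attributes, 1 output unit
W1, b1 = rng.normal(size=(n_hidden, n_in)), np.zeros(n_hidden)  # input -> hidden weights
W2, b2 = rng.normal(size=(n_out, n_hidden)), np.zeros(n_out)    # hidden -> output weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    # The inputs are weighted and fed simultaneously to the hidden layer,
    # whose outputs are in turn weighted and fed to the output layer.
    hidden = sigmoid(W1 @ x + b1)
    return sigmoid(W2 @ hidden + b2)

print(forward(np.array([0.2, 0.7, 0.1, 0.9])))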
4.8 Support Vector Machines
Support vector machines (SVMs) are a method for the classification of
both linear and nonlinear data.
An SVM is an algorithm that uses a nonlinear mapping to transform
the original training data into a higher dimension.
The Case When the Data Are Linearly Separable:
Let us consider a two-class problem, where the classes are linearly
separable.
Let the data set D be given as (X1, y1), (X2, y2), : : : , (X|D|, y|D|), where
Xi is the set of training tuples with associated class labels, yi .
Each yi can take one of two values, either +1 or -1 (i.e., yi Є {+1,-1}),
corresponding to the classes buys_computer=yes and
buys_computer=no, respectively.
From the graph, we see that the 2-D data are linearly separable (or
“linear,” for short), because a straight line can be drawn to separate
all the tuples of class +1 from all the tuples of class -1.
Contd..
Fig: The 2-D training data (attributes A1 and A2) are linearly separable.
Class 1: y = +1 (buys_computer = yes); class 2: y = −1 (buys_computer = no).
There are an infinite number of separating lines that could be drawn. How
can we find the best line? If our data were 3-D (i.e., with three
attributes), we would want to find the best separating plane.
Generalizing to n dimensions, we want to find the best hyperplane (decision
boundary), regardless of the number of input attributes. An SVM approaches
this problem by searching for the maximum marginal hyperplane.
Contd..
Fig: Two possible separating hyperplanes for the same 2-D data (attributes
A1 and A2), one with a small margin and one with a larger margin; the
hyperplane with the larger margin is expected to classify unseen tuples
more accurately.
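A scikit-learn sketch of training a linear SVM, which searches for the
maximum marginal hyperplane; the toy 2-D data and parameter values are
illustrative assumptions.

import numpy as np
from sklearn.svm import SVC

# Toy 2-D, linearly separable data: class +1 vs class -1 (e.g., buys_computer = yes / no).
X = np.array([[2.0, 2.5], [1.5, 3.0], [3.0, 3.5],    # class +1
              [6.0, 1.0], [7.0, 2.0], [6.5, 0.5]])   # class -1
y = np.array([1, 1, 1, -1, -1, -1])

svm = SVC(kernel="linear", C=1e3)        # large C approximates a hard-margin separator
svm.fit(X, y)
print("support vectors:", svm.support_vectors_)
print("prediction for (3, 2):", svm.predict([[3.0, 2.0]]))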
4.9 Classification Using Frequent Patterns:
Pattern-Based Classification
Frequent patterns show interesting relationships between attribute–
value pairs that occur frequently in a given data set.
Frequent pattern mining or frequent itemset mining is the search for
these frequent patterns.
Analysis is useful in many decision-making processes such as product
placement, catalog design, and cross-marketing.
1) Associative classification: where association rules are generated from
frequent patterns and used for classification.
2) Discriminative frequent pattern–based classification: where frequent
patterns serve as combined features, which are considered in addition to
single features when building a classification model.
4.9.1) Associative classification
Associative classification: consists of the following steps:
1. Mine data to find strong associations between frequent patterns
(conjunctions of attribute-value pairs) and class labels
Ex: age = youth
2. Analyze the frequent itemsets to generate association rules per class,
which satisfy confidence and support criteria.
Ex: age=youth ^ credit=OK => buys_computer=yes [support=20%,
confidence=93%], where ^ represents a logical “AND”.
3. Organize the rules to form a rule-based classifier.
1) CBA (Classification Based on Associations)
CBA (Classification Based on Associations): uses an iterative
approach to frequent itemset mining. It consists of two parts.
i. Rule generator (CBA-RG), mines the possible Classification
Association Rules (CARs), based on Apriori algorithm in the form
of, Cond-set (a set of attribute-value pairs) -> class label
ii. Classifier Builder (CBA-CB) organizes rules according to
decreasing precedence based on confidence and then support.
• R1 has higher confidence than R2
• R1 and R2 have same confidence but R1 has higher support
• R1 and R2 have the same confidence and support, but R1 was generated
first (i.e., R1 has fewer items than R2)
When classifying a new tuple, the first rule satisfying the tuple is
used to classify it.
CBA is more accurate than C4.5 on large datasets.
CBA Limitations:
• Each tuple is classified by a single rule, the most effective matching
rule with the highest confidence.
• Too many rules are generated, causing storage overhead (pruning is
needed) and computational overhead.
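The CBA-CB classification step can be sketched in Python as follows (the
rule representation and the default class are illustrative assumptions):
rules are ordered by decreasing confidence, then support, and the first
rule whose condition set matches the new tuple supplies the class label.

def cba_classify(rules, x, default_class):
    # rules: list of (cond_set, class_label, confidence, support) tuples, where cond_set
    # is a dict of attribute-value pairs from a mined classification association rule.
    ordered = sorted(rules, key=lambda r: (r[2], r[3]), reverse=True)
    for cond_set, label, confidence, support in ordered:
        if all(x.get(a) == v for a, v in cond_set.items()):    # first matching rule wins
            return label
    return default_class

# Example CAR: ({"age": "youth", "credit": "OK"}, "buys_computer=yes", 0.93, 0.20)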
2) CMAR (Classification based on
Multiple Association Rules)
Instead of relying on a single rule for classification, CMAR determines
the class label by a set of rules.
i. Candidate generation
• CMAR adopts an enhanced FP-growth algorithm (faster than
Apriori used by CBA) to find the complete set of classification
association rules (CARs) satisfying the minimum confidence and
minimum support thresholds.
• To improve both accuracy and efficiency, CMAR employs a novel
data structure, CR-tree, to compactly store and efficiently retrieve a
large number of rules for classification and to prune rules based on
confidence, correlation, and database coverage.
• Pruning mechanisms:
- Precedence relationship
- Positive correlation to class label (Χ2 chi-square)
- Multiple database coverage
CMAR Contd..
ii. Classification
• It divides the rules into groups according to class labels. All rules
within a group share the same class label and each group has a
distinct class label.
• CMAR uses a weighted χ2 (chi-square) measure to find the
“strongest” group of rules, based on the statistical correlation of
rules within a group. It then assigns X the class label of the
strongest group.
ADVANTAGES:
• Outperforms C4.5 and CBA on accuracy
• Less storage requirements compared to CBA
• Lower running time compared to CBA
• Accuracy does not depend too much on confidence and coverage
threshold
LIMITATIONS:
• Many rules generated
• Confidence-based rule evaluation, which can lead to overfitting
3) CPAR (Classification based on
Predictive Association Rules)
• Generates predictive rules (FOIL-like analysis), but allows covered
tuples to remain under consideration with reduced weight.
• FOIL (First Order Inductive Learner) builds rules to distinguish
positive tuples (e.g., buys_computer=yes) from negative tuples
(e.g., buys_computer=no). For multiclass problems, FOIL is
applied to each class.
• Prediction using best k rules.
• High efficiency, accuracy similar to CMAR.
Contd..
2) Discriminative Frequent Pattern–Based Classification:
1. Feature generation:
• The data, D, are partitioned according to class label.
• Use frequent itemset mining to discover frequent patterns in
each partition, satisfying minimum support.
• The collection of frequent patterns, F, makes up the feature
candidates.
2. Feature selection:
Apply feature selection to F, resulting in FS, the set of selected (more
discriminating) frequent patterns. Information gain, Fisher score,
or other evaluation measures can be used for this step. Relevancy
checking can also be incorporated into this step to weed out
redundant patterns. The data set D is transformed to D′, where
the feature space now includes the single features as well as the
selected frequent patterns, FS.
3. Learning of classification model: A classifier is built on the data
set D′. Any learning algorithm can be used as the classification
model.
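A minimal Python sketch of steps 1–3 under simplifying assumptions: items
are attribute=value strings, frequent patterns up to length 2 are counted
by brute force per class partition (not an efficient miner), and the
selected patterns FS are turned into binary features on which any
classifier can then be trained.

from itertools import combinations
from collections import Counter

def frequent_patterns(transactions, min_support, max_len=2):
    # Step 1 (feature generation): count itemsets up to max_len within one class partition.
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, max_len + 1):
            for combo in combinations(items, k):
                counts[combo] += 1
    n = len(transactions)
    # Keep patterns satisfying minimum (relative) support, in a fixed order.
    return sorted(p for p, c in counts.items() if c / n >= min_support)

def to_features(transaction, selected_patterns):
    # Steps 2-3: represent a tuple as binary features over the selected patterns FS;
    # a classifier is then built on the transformed data set D'.
    items = set(transaction)
    return [1 if set(p) <= items else 0 for p in selected_patterns]

# Example item encoding for one tuple: {"age=youth", "student=no", "credit=fair"}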