Classification and Prediction
Typical applications: target marketing, medical diagnosis, fraud detection
Model construction: the learned model is represented as classification rules, decision trees, or mathematical formulae
Model usage: for classifying future or unknown objects
Estimate the accuracy of the model on a test set that is independent of the training set, otherwise over-fitting will occur
If the accuracy is acceptable, use the model to classify data whose class labels are not known
Figure: model construction and use — training data are fed to a classification algorithm to produce a classifier, which is then applied to testing data and to unseen data
Unseen data: (Jeff, Professor, 4) → Tenured?

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
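To make the two-step process concrete, here is a minimal sketch in Python that applies a hand-written rule classifier to the test tuples above and then to the unseen tuple (Jeff, Professor, 4); the rule IF rank = 'Professor' OR years > 6 THEN tenured = 'yes' is an assumed example model, not one induced from this data.

```python
# Minimal sketch: applying a learned rule-based classifier to test tuples.
# The rule below (IF rank = 'Professor' OR years > 6 THEN tenured = 'yes')
# is an assumed example model, not derived from this data by any algorithm.

test_data = [
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]

def predict_tenured(rank, years):
    """Assumed rule: professors, or anyone with more than 6 years, are tenured."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Estimate accuracy of the model on the labeled test set
correct = sum(predict_tenured(r, y) == label for _, r, y, label in test_data)
print(f"test accuracy: {correct}/{len(test_data)}")

# Use the model on unseen data (Jeff, Professor, 4) -> Tenured?
print("Jeff tenured?", predict_tenured("Professor", 4))
```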
Supervised vs. Unsupervised Learning
Supervised learning (classification): the training data are accompanied by labels indicating the class of each observation; new data are classified based on the training set
Unsupervised learning (clustering): the class labels of the training data are unknown; the aim is to establish the existence of classes or clusters in the data
Data cleaning
Preprocess data in order to reduce noise and handle
missing values
Relevance analysis (feature selection)
Remove the irrelevant or redundant attributes
Data transformation
Generalize and/or normalize data (e.g., normalize numerical attributes to a common scale)
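A minimal sketch of the two preparation steps just listed, mean-imputation for missing values and min-max normalization; the column values and the [0, 1] target range are illustrative assumptions.

```python
# Minimal sketch of the preparation steps above: fill missing values with the
# column mean (data cleaning) and min-max normalize to [0, 1] (transformation).
# The sample column and the [0, 1] target range are illustrative assumptions.

def clean_and_normalize(column):
    """column: list of numbers, with None marking missing values."""
    known = [v for v in column if v is not None]
    mean = sum(known) / len(known)
    filled = [v if v is not None else mean for v in column]   # handle missing values
    lo, hi = min(filled), max(filled)
    return [(v - lo) / (hi - lo) for v in filled]             # min-max normalization

print(clean_and_normalize([30, None, 45, 60]))   # e.g. ages with one missing value
```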
Speed
time to construct the model (training time) and time to use the model (classification/prediction time)
Figure: a decision tree whose root node tests age?, with branches <=30, 31..40, and >40 leading (possibly via further tests) to yes/no leaves
Gini_{income ∈ {low,medium}} = 0.443 is the lowest and is therefore the best binary split on income (Gini_{income ∈ {low,high}} = 0.458 and Gini_{income ∈ {medium,high}} = 0.450)
All attributes are assumed continuous-valued
May need other tools, e.g., clustering, to get the possible split values
Can be modified for categorical attributes
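A minimal sketch of the Gini computation behind the income split comparison above, using the 14-tuple buys_computer training data shown later in this section; it enumerates the three binary partitions of {low, medium, high} and prints their weighted Gini indexes.

```python
# Minimal sketch: weighted Gini index of binary splits on `income`,
# using the 14-tuple buys_computer training data shown later in this section.
from itertools import combinations

# (income, buys_computer) pairs taken from the training table
data = [("high","no"), ("high","no"), ("high","yes"), ("medium","yes"),
        ("low","yes"), ("low","no"), ("low","yes"), ("medium","no"),
        ("low","yes"), ("medium","yes"), ("medium","yes"), ("medium","yes"),
        ("high","yes"), ("medium","no")]

def gini(tuples):
    """Gini(D) = 1 - sum_j p_j^2 over the class distribution of D."""
    n = len(tuples)
    counts = {}
    for _, label in tuples:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def gini_split(tuples, subset):
    """Weighted Gini of the binary split: income in `subset` vs. not."""
    d1 = [t for t in tuples if t[0] in subset]
    d2 = [t for t in tuples if t[0] not in subset]
    n = len(tuples)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)

for subset in combinations(["low", "medium", "high"], 2):
    print(set(subset), round(gini_split(data, set(subset)), 3))
# -> {low, medium}: 0.443 (lowest, hence the best binary split on income)
```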
relatively faster learning speed than other classification methods
convertible to simple and easy-to-understand classification rules
can use SQL queries for accessing databases
Bayes' theorem: P(H|X) = P(X|H) P(H) / P(X)
Informally, this can be written as
posterior = likelihood × prior / evidence
Predicts that X belongs to Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for all the k classes
Practical difficulty: requires initial knowledge of many probabilities, at significant computational cost
Towards Naïve Bayesian Classifier
Let D be a training set of tuples and their associated class
labels, and each tuple is represented by an n-D attribute
vector X = (x1, x2, …, xn)
Suppose there are m classes C1, C2, …, Cm.
Classification is to derive the maximum a posteriori probability, i.e., the maximal P(Ci|X)
This can be derived from Bayes’ theorem
P(Ci|X) = P(X|Ci) P(Ci) / P(X)
Since P(X) is constant for all classes, only
P(Ci|X) ∝ P(X|Ci) P(Ci)
needs to be maximized
Derivation of Naïve Bayes Classifier
A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):
P(X|Ci) = ∏_{k=1}^{n} P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci)
This greatly reduces the computation cost: only the class distribution needs to be counted
If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having
value xk for Ak divided by |Ci, D| (# of tuples of Ci in D)
If Ak is continuous-valued, P(xk|Ci) is usually computed based on a Gaussian distribution with mean μ and standard deviation σ:
g(x, μ, σ) = (1 / (√(2π) σ)) e^(−(x−μ)² / (2σ²))
and P(xk|Ci) = g(xk, μ_Ci, σ_Ci)
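A minimal sketch of the Gaussian density g(x, μ, σ) used for a continuous-valued attribute; the mean and standard deviation in the usage line are illustrative assumptions.

```python
# Minimal sketch of the Gaussian density g(x, mu, sigma) used for a
# continuous-valued attribute A_k when estimating P(x_k | C_i).
import math

def gaussian(x, mu, sigma):
    """g(x, mu, sigma) = exp(-(x - mu)^2 / (2 sigma^2)) / (sqrt(2 pi) sigma)"""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# e.g. P(age = 35 | C_i) for a class with mean age 38 and std dev 12 (illustrative values)
print(gaussian(35, mu=38, sigma=12))
```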
Naïve Bayesian Classifier: Training Dataset
Class labels: C1: buys_computer = ‘yes’, C2: buys_computer = ‘no’
Data sample: X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income   student  credit_rating  buys_computer
<=30    high     no       fair           no
<=30    high     no       excellent      no
31…40   high     no       fair           yes
>40     medium   no       fair           yes
>40     low      yes      fair           yes
>40     low      yes      excellent      no
31…40   low      yes      excellent      yes
<=30    medium   no       fair           no
<=30    low      yes      fair           yes
>40     medium   yes      fair           yes
<=30    medium   yes      excellent      yes
31…40   medium   no       excellent      yes
31…40   high     yes      fair           yes
>40     medium   no       excellent      no
Naïve Bayesian Classifier: An Example
P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14 = 0.357
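A minimal sketch that completes the example: it counts the class-conditional probabilities P(xk|Ci) from the 14-tuple table above and classifies X = (age <= 30, income = medium, student = yes, credit_rating = fair).

```python
# Minimal sketch: completing the example above by counting class-conditional
# probabilities from the 14-tuple training table and classifying
# X = (age <= 30, income = medium, student = yes, credit_rating = fair).

# (age, income, student, credit_rating, buys_computer)
D = [("<=30","high","no","fair","no"),      ("<=30","high","no","excellent","no"),
     ("31...40","high","no","fair","yes"),  (">40","medium","no","fair","yes"),
     (">40","low","yes","fair","yes"),      (">40","low","yes","excellent","no"),
     ("31...40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
     ("<=30","low","yes","fair","yes"),     (">40","medium","yes","fair","yes"),
     ("<=30","medium","yes","excellent","yes"), ("31...40","medium","no","excellent","yes"),
     ("31...40","high","yes","fair","yes"), (">40","medium","no","excellent","no")]

X = ("<=30", "medium", "yes", "fair")

def classify(x):
    best_class, best_score = None, -1.0
    for c in ("yes", "no"):
        Dc = [t for t in D if t[-1] == c]
        score = len(Dc) / len(D)                                 # prior P(C_i)
        for k, value in enumerate(x):                            # naive independence
            score *= sum(t[k] == value for t in Dc) / len(Dc)    # P(x_k | C_i)
        print(f"P(X|{c}) * P({c}) = {score:.4f}")
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print("X is classified as buys_computer =", classify(X))   # -> yes
```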
Naïve Bayesian Classifier: Comments
Advantages
Easy to implement
Disadvantages
Assumption: class conditional independence, therefore loss
of accuracy
Practically, dependencies exist among variables; these dependencies cannot be modeled by a Naïve Bayes classifier
How to deal with these dependencies?
Bayesian Belief Networks
Classification:
predicts categorical class labels
x1: # of occurrences of the word “homepage”
x2: # of occurrences of the word “welcome”
Mathematically:
x ∈ X = ℝⁿ, y ∈ Y = {+1, –1}
We want a function f: X → Y
Perceptron: update W additively
Winnow: update W multiplicatively
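A minimal sketch contrasting the two update rules on a single misclassified example; the learning rate, the promotion factor α, and the binary feature encoding for Winnow are illustrative assumptions.

```python
# Minimal sketch contrasting the two update rules on one misclassified example x
# with true label y in {+1, -1}: the perceptron updates W additively,
# Winnow multiplicatively. Learning rate and alpha are illustrative values.

def perceptron_update(w, x, y, lr=1.0):
    # additive update: w_i <- w_i + lr * y * x_i
    return [wi + lr * y * xi for wi, xi in zip(w, x)]

def winnow_update(w, x, y, alpha=2.0):
    # multiplicative update (one common form): w_i <- w_i * alpha**(y * x_i),
    # assuming binary 0/1 features x_i
    return [wi * alpha ** (y * xi) for wi, xi in zip(w, x)]

w = [0.5, 1.0, 2.0]
x, y = [1, 0, 1], +1          # a misclassified positive example (illustrative)
print(perceptron_update(w, x, y))   # [1.5, 1.0, 3.0]
print(winnow_update(w, x, y))       # [1.0, 1.0, 4.0]  (promotes weights on active features)
```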
Figure: a single neuron — the input vector x = (x0, x1, …, xn) and weight vector w = (w0, w1, …, wn) are combined in a weighted sum, a bias −k is subtracted, and an activation function f produces the output y
For example: y = sign(Σ_{i=0}^{n} wi xi − k)
Figure: a multi-layer feed-forward network — the input vector X enters the input layer, feeds a hidden layer, and the output layer produces the output vector. Backpropagation computes, for each unit j:
Net input: Ij = Σ_i wij Oi + θj
Output: Oj = 1 / (1 + e^(−Ij))
Error at an output-layer unit: Errj = Oj (1 − Oj)(Tj − Oj)
Error at a hidden-layer unit: Errj = Oj (1 − Oj) Σ_k Errk wjk
Weight update: wij = wij + (l) Errj Oi
Bias update: θj = θj + (l) Errj
where (l) is the learning rate and Tj is the true output of unit j
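A minimal sketch of one backpropagation step on a tiny 2-2-1 network, following the equations above; the network size, initial weights, and training pair are illustrative assumptions.

```python
# Minimal sketch of one backpropagation step on a tiny 2-2-1 network,
# following the equations above (sigmoid units, learning rate l).
# Network size, initial weights, and the training pair are illustrative.
import math

def sigmoid(I):
    return 1.0 / (1.0 + math.exp(-I))

w = [[0.2, -0.3], [0.4, 0.1]]   # w[i][j]: input i -> hidden unit j
theta_h = [-0.4, 0.2]           # hidden biases
v = [-0.3, -0.2]                # hidden j -> output
theta_o = 0.1                   # output bias
x, target, l = [1.0, 0.0], 1.0, 0.9

# Forward pass: I_j = sum_i w_ij O_i + theta_j ; O_j = sigmoid(I_j)
O_h = [sigmoid(sum(w[i][j] * x[i] for i in range(2)) + theta_h[j]) for j in range(2)]
O_out = sigmoid(sum(v[j] * O_h[j] for j in range(2)) + theta_o)

# Backward pass
err_out = O_out * (1 - O_out) * (target - O_out)                    # output-layer error
err_h = [O_h[j] * (1 - O_h[j]) * err_out * v[j] for j in range(2)]  # hidden-layer errors

# Updates: w_ij += l * Err_j * O_i ; theta_j += l * Err_j
v = [v[j] + l * err_out * O_h[j] for j in range(2)]
theta_o += l * err_out
for i in range(2):
    for j in range(2):
        w[i][j] += l * err_h[j] * x[i]
theta_h = [theta_h[j] + l * err_h[j] for j in range(2)]

print("output before update:", round(O_out, 3))
```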
How Does a Multi-Layer Neural Network Work?
Let the data D be (X1, y1), …, (X|D|, y|D|), where Xi is a training tuple and yi its associated class label
There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e., maximum
marginal hyperplane (MMH)
SVM—Linearly Inseparable
Transform the original input data into a higher dimensional space
Search for a linear separating hyperplane in the new space
SVM can also be used for classifying multiple (> 2) classes and for
regression analysis (with additional user parameters)
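A minimal sketch, assuming scikit-learn is available, of training a linear SVM on a toy two-class problem; the SVC class also handles multi-class problems, and SVR covers regression, in line with the note above.

```python
# Minimal sketch, assuming scikit-learn is available: a linear SVM on a toy
# linearly separable two-class problem. Data points are illustrative.
from sklearn.svm import SVC

X = [[0, 0], [1, 1], [2, 2], [8, 8], [9, 9], [10, 10]]
y = [-1, -1, -1, +1, +1, +1]

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)
print(clf.support_vectors_)          # the tuples that define the maximum-margin hyperplane
print(clf.predict([[3, 3], [7, 7]]))
```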
SVM Website
http://www.kernel-machines.org/
Representative implementations
LIBSVM: an efficient implementation of SVM with multi-class classification, nu-SVM, and one-class SVM, including various interfaces to Java, Python, etc.
SVM-light: simpler, but its performance is not better than LIBSVM; supports only binary classification and only the C language
SVM-torch: another recent implementation also written in C.
CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM’01 )
Efficiency: Uses an enhanced FP-tree that maintains the distribution of
class labels among tuples satisfying each frequent itemset
Rule pruning whenever a rule is inserted into the tree
Given two rules, R1 and R2, if the antecedent of R1 is more general than that of R2 and conf(R1) ≥ conf(R2), then R2 is pruned
Prunes rules for which the rule antecedent and class are not
positively correlated, based on a χ2 test of statistical significance
Classification based on generated/pruned rules
If only one rule satisfies tuple X, assign the class label of the rule
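A minimal sketch of the generality/confidence pruning check described above; the rule representation (antecedent itemset plus confidence) is an illustrative assumption.

```python
# Minimal sketch of the pruning check above: R2 is pruned if R1's antecedent is
# more general (a subset of R2's) and conf(R1) >= conf(R2).
# The rule representation (antecedent itemset, confidence) is an assumption.

def prunes(r1, r2):
    """r = (antecedent_itemset, confidence)."""
    ant1, conf1 = r1
    ant2, conf2 = r2
    return ant1 <= ant2 and conf1 >= conf2   # ant1 more general: a subset of ant2

R1 = (frozenset({"income=high"}), 0.80)
R2 = (frozenset({"income=high", "student=no"}), 0.75)
print(prunes(R1, R2))   # True -> R2 is pruned
```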
Instance-based learning:
Store training examples and delay the processing until a new instance must be classified
k-nearest neighbor approach: instances represented as points in a Euclidean space
Locally weighted regression
Case-based reasoning: uses symbolic representations and knowledge-based inference
The k-Nearest Neighbor Algorithm
All instances correspond to points in the n-D space
The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
Target function could be discrete- or real- valued
For discrete-valued, k-NN returns the most common
value among the k training examples nearest to xq
Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples
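A minimal sketch of k-NN classification with Euclidean distance and a majority vote among the k nearest training examples; the training points and the query point are illustrative.

```python
# Minimal sketch of k-NN: Euclidean distance plus a majority vote among the
# k nearest training examples. Training points and the query are illustrative.
import math
from collections import Counter

train = [([1.0, 1.0], "+"), ([1.5, 2.0], "+"), ([5.0, 5.0], "-"),
         ([6.0, 5.5], "-"), ([5.5, 6.5], "-")]

def dist(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn(xq, k=3):
    nearest = sorted(train, key=lambda t: dist(t[0], xq))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

print(knn([2.0, 2.0]))   # '+' : two of the three nearest neighbors are positive
```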
Discussion on the k-NN Algorithm
Linear regression: y = w0 + w1 x, with the coefficients estimated by the method of least squares:
w1 = Σ_{i=1}^{|D|} (xi − x̄)(yi − ȳ) / Σ_{i=1}^{|D|} (xi − x̄)²
w0 = ȳ − w1 x̄
Non-linear regression (e.g., polynomial regression) can often be transformed into a linear model
Predictor error measures, over d test tuples with actual values yi and predicted values yi′:
Relative absolute error: Σ_{i=1}^{d} |yi − yi′| / Σ_{i=1}^{d} |yi − ȳ|
Relative squared error: Σ_{i=1}^{d} (yi − yi′)² / Σ_{i=1}^{d} (yi − ȳ)²
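A minimal sketch computing the least-squares coefficients and the two relative error measures above on a small illustrative dataset.

```python
# Minimal sketch of the method of least squares and the relative error
# measures above, on a small illustrative dataset of (x, y) pairs.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.2, 3.8, 5.1]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum((x - x_bar) ** 2 for x in xs)
w0 = y_bar - w1 * x_bar
preds = [w0 + w1 * x for x in xs]

rel_abs = sum(abs(y - p) for y, p in zip(ys, preds)) / sum(abs(y - y_bar) for y in ys)
rel_sq  = sum((y - p) ** 2 for y, p in zip(ys, preds)) / sum((y - y_bar) ** 2 for y in ys)
print(f"y = {w0:.3f} + {w1:.3f} x, rel. abs. error = {rel_abs:.3f}, rel. sq. error = {rel_sq:.3f}")
```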
Repeating the holdout k times, accuracy = average of the accuracies obtained
Cross-validation (k-fold, where k = 10 is most popular)
Randomly partition the data into k mutually exclusive subsets of approximately equal size; at the i-th iteration, use the i-th subset as the test set and the rest as the training set
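A minimal sketch of k-fold cross-validation; the train_and_score callback stands in for any classifier's train-and-evaluate step, and the dummy scorer in the usage line is only illustrative.

```python
# Minimal sketch of k-fold cross-validation: partition the data into k folds,
# train on k-1 folds, test on the held-out fold, and average the accuracies.
# `train_and_score` stands in for any classifier's train/evaluate step (assumption).
import random

def k_fold_accuracy(data, k, train_and_score):
    data = data[:]                      # shuffle a copy before partitioning
    random.shuffle(data)
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test = folds[i]
        train = [t for j, fold in enumerate(folds) if j != i for t in fold]
        scores.append(train_and_score(train, test))
    return sum(scores) / k

# illustrative usage with a dummy scorer (not a real classifier)
data = [(x, "yes" if x > 5 else "no") for x in range(20)]
dummy = lambda train, test: sum(lbl == "yes" for _, lbl in test) / len(test)
print(k_fold_accuracy(data, k=10, train_and_score=dummy))
```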
Ensemble methods
Use a combination of models to increase accuracy
Bagging: averaging the prediction over a collection of classifiers
Boosting: weighted vote with a collection of classifiers
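A minimal sketch of the two ways of combining classifier votes mentioned above, an unweighted majority vote (bagging-style) and a weighted vote (boosting-style); the component predictions and weights are illustrative.

```python
# Minimal sketch of combining model outputs by unweighted and weighted voting,
# as in bagging and boosting respectively. The component predictions and
# weights below are illustrative, not learned.
from collections import Counter

def majority_vote(predictions):
    """Bagging-style combination: each classifier gets one equal vote."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_vote(predictions, weights):
    """Boosting-style combination: each classifier's vote carries its weight."""
    totals = {}
    for p, w in zip(predictions, weights):
        totals[p] = totals.get(p, 0.0) + w
    return max(totals, key=totals.get)

print(majority_vote(["yes", "no", "yes"]))                   # yes
print(weighted_vote(["yes", "no", "no"], [0.9, 0.4, 0.4]))   # yes (0.9 vs 0.8)
```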