Utility-Driven Data Analytics On Uncertain Data
arXiv:1902.09586v1 [cs.DB] 25 Feb 2019

Wensheng Gan is with the Department of Computer Science, University of Illinois at Chicago, IL, USA. (E-mail: [email protected])
Jerry Chun-Wei Lin is with the Department of Computing, Mathematics, and Physics, Western Norway University of Applied Sciences, Bergen, Norway. (E-mail: [email protected])
Han-Chieh Chao is with the Department of Electrical Engineering, National Dong Hwa University, Hualien, Taiwan. (E-mail: [email protected])
Athanasios V. Vasilakos is with the Department of Computer Science, Electrical and Space Engineering, Lulea University of Technology, Sweden. (E-mail: [email protected])
Philip S. Yu is with the Department of Computer Science, University of Illinois at Chicago, IL, USA. (E-mail: [email protected])
Manuscript received XXXX; revised XXXX.

Abstract—Modern Internet of Things (IoT) applications generate massive amounts of data, much of it in the form of objects/items of readings, events, and log entries. Specifically, most of the objects in these IoT data contain rich embedded information (e.g., frequency and uncertainty) and different levels of importance (e.g., unit utility of items, interestingness, cost, risk, or weight). Many existing approaches in data mining and analytics have limitations: only a binary attribute is considered within a transaction, and all objects/items are assumed to have equal weight or importance. To solve these drawbacks, a novel utility-driven data analytics algorithm named HUPNU is presented to extract High-Utility patterns by considering both Positive and Negative unit utilities from Uncertain data. The qualified high-utility patterns can be effectively discovered for risk prediction, manufacturing management, and decision-making, among others. HUPNU relies on the developed vertical Probability-Utility list with Positive-and-Negative utilities (PU±-list) structure, as well as several effective pruning strategies. Experiments showed that the developed HUPNU approach performed well in mining the qualified patterns efficiently and effectively.

Index Terms—Internet of Things, manufacturing data, uncertainty, utility, data analytics.

I. INTRODUCTION

With the increasing prevalence of sensors, mobile phones, actuators, and RFID tags, Internet of Things (IoT) applications generate massive amounts of rich data per day [1], [2], [3]. Examples include the control signals issued to Internet-connected devices like lights or thermostats, measurements from industrial and medical equipment, manufacturing data, or log files from smart-phones that record the complex behavior of sensor-based applications. These applications need to find frequent patterns that represent typical system behavior (e.g., [2], [4], [5]) as well as high risk/utility patterns that represent very important knowledge from normal behavior [6], [7], [8]. For industrial areas [3], [9], various algorithms have been designed by applying data mining methods [3], [10]. These methods can be used to evaluate the complicated and complex data from the industry and solve existing problems. In some industrial areas, manufacturing schedules can be planned by the discovered information and knowledge, which can then be used to increase and gain utilities, maximally [3], [11]. Pattern mining and analysis, from an industrial research perspective, provides a unique and up-to-date insight into the manufacturing industry [3]. There are many industrial utility-driven manufacturing systems with different IoT applications. Consider an environment with all types of sensors to monitor abnormal conditions. Let each transaction denote the set of sensors showing an above-normal value at a particular instant, and let the value associated with each sensor be the abnormality or risk measure. Here the number of units can mean the number of abnormal sensors of the same type; discovering the high-risk patterns is then quite useful for manufacturing analytics in the Internet of Things. In the retail industry, pattern mining can identify the purchase behaviors of customers, which can be used to make specific decisions and improve the service quality for customers. However, existing Frequent-based Pattern Mining (FPM) algorithms [10] ignore that information, and the discovered patterns may contain useless knowledge, such as items with a low utility. Thus, traditional FPM cannot handle quantitative databases, and also fails to extract the utility-driven patterns which are insightful for data analytics. A new utility-driven data analytics framework named High-Utility Pattern Mining (HUPM) [6], [7], [8] was presented to provide a solution to the above limitations. It is important to notice that the utility concept can be used to reflect multiple participants' satisfaction [12], utility [11], [13], revenue [14], risk [13], or weight [15].

A. Motivation

To reveal high utility/risk patterns for decision-making, HUPM considers the quantitative database as well as the unit utility of the items. Using HUPM instead of FPM for analytics is quite helpful for system monitoring, planning, manufacturing management, and decision-making [2], [3], [11]. Most algorithms in traditional HUPM consider all items to have positive unit utilities/risks. When items can take negative values, for example under a sales discount (e.g., buy two get one free) or when a supermarket/retail store sells products at a loss to stimulate the sales of other related items (e.g., selling printers to promote notebooks/PCs), traditional HUPM cannot mine meaningful information. This situation happens in real-life scenarios, especially when promoting some profitable items for gaining more money in cross-selling conditions. Some prior works have mentioned that traditional HUPM algorithms [6], [8] may generate incomplete and missing information [16] when the dataset contains objects/items with negative unit utilities.

Analyzing complex industrial data and manufacturing data [2], [11], [17] may encounter various challenges,
IEEE INTERNET OF THINGS JOURNAL, 2019 2
i.e., the embedded uncertainty, positive risk and negative risk, and other factors. The uncertainty factor exists in many realistic situations, such as the collection of noisy data sources (e.g., GPS, wireless sensor network, RFID, or WiFi systems) [18], [19]. Traditional pattern mining algorithms cannot be applied straightforwardly to inaccurate or uncertain environments for mining the required knowledge or information. The reason for this is that utility can be considered a semantic measure of how useful a pattern is, based on users' prior experience, goals, and knowledge, whereas uncertainty can be considered an objective measure of the probability that a pattern exists; they are two totally different factors. Most existing utility-based algorithms have been studied to handle precise data, while they are unable to deal with data uncertainty [20]. If the uncertainty factor is not considered in the mining process, the process may find useless or misleading information with low existential probability.

B. Contribution

To solve the above limitations and problems, we design a novel algorithm named HUPNU to extract High-Utility patterns with both Positive and Negative unit utilities from Uncertain data in intelligent environments. Moreover, to deal with realistic situations in real-life applications, both positive and negative unit utilities are considered. The significant contributions of this work are listed below:
• To the best of our knowledge on pattern mining, this is the first study to discuss the problem by considering both the positive and negative unit utilities, and the uncertainty factor, to discover the qualified high-utility patterns. This approach is more suitable for realistic situations such as risk prediction, manufacturing management, e-commerce, and decision making.
• A vertical and compact Probability-Utility list with both positive and negative utilities (PU±-list) structure is developed. This list keeps all the essential knowledge from the database for the later mining process.
• A one-phase method called HUPNU is designed to discover the qualified high-utility patterns using the PU±-list structure. Multiple database scans and the generate-and-test mechanism can be largely avoided.
• We also developed several pruning strategies to easily remove the unpromising candidates and reduce the size of the search space of the qualified high-utility patterns. These pruning strategies can also efficiently reduce the size of the PU±-list through the designed downward closure property.
• Several experiments were conducted on both synthetic and realistic datasets. Results showed that the developed HUPNU performed well in mining the qualified patterns efficiently and effectively.

This paper is organized as follows: A literature review is given in Section II. Preliminaries and the problem statement are shown in Section III. The designed HUPNU with the several pruning strategies is studied in Section IV. Extensive experiments are conducted in Section V. The conclusion is finally provided in Section VI.

II. RELATED WORK

A. Support-based Pattern Mining

Data mining and analytic technologies are used in many different domains [2], [11], [17], [21], and they provide powerful ways of discovering useful, meaningful, and implicit information from very large datasets. Frequent Pattern Mining (FPM) is the most fundamental concept in retrieving the qualified information using a support-based constraint, and many works have been developed based on the support criterion to mine frequent itemsets or association rules [4], [5], [10]. Other factors, such as interestingness [22] or weights, have also been used as mining criteria to find the interesting or important patterns. Many algorithms have been designed to find the meaningful patterns from a binary database [10], [22], [23]. However, they only assess whether an item appears in a transaction, and this approach ignores other useful factors; for example, an event may occur in multiple quantities in a record. Quantitative Association-Rule Mining (QARM) [24], [25] was presented to replace the binary value (0 or 1) and discover more meaningful and useful information.

Different from processing precise data, some pattern mining approaches have been developed to deal with uncertain data, discovering frequent expected patterns [26] or probabilistic frequent patterns [19] by taking the uncertainty of the data into account. The reason is that the uncertainty factor exists in many realistic data sources (e.g., GPS, wireless sensor network, RFID, or WiFi systems). Details of algorithms and applications for uncertain data can be found in [18].

B. Utility-based Pattern Mining with Efficiency Issues

Although QARM solves the past limitation of traditional association-rule mining, it still does not consider more important and interesting factors such as the unit utilities of the objects/items, which can bring profitable objects to the user in service-oriented manufacturing systems [12], [13]. In addition, the support-based constraint of pattern mining is inappropriate for measuring the importance of the items in realistic situations. To tackle these problems, High-Utility Pattern Mining (HUPM) [6], [7], [8] was presented to reveal high-utility patterns. HUPM considers both the unit utility and the quantity of the objects to find the high-utility patterns in the quantitative database, which can provide more meaningful results than the support-based algorithms. Yao et al. [27] first defined the utility-mining problem by considering the occurred quantity (treated as the internal utility) and the unit utility of the objects/items (treated as the external utility) to reveal the itemsets with high utilities. In the past decade, HUPM has been considered an emerging topic in many data analytics tasks, and some well-known algorithms have been developed, such as the Transaction-Weighted Utility (TWU) model [28], IHUP [6], UP-growth and UP-growth+ [8], HUI-Miner [7], and so on. Many variants of HUPM have also been discussed, focusing on mining different forms of utility-oriented patterns. The importance of HUPM is increasing, especially in the current era of big data [29], and more opportunities and challenges are required for discussion and
analysis since HUPM can provide realistic benefits to the retailers and managers in many different applications and domains [11].

C. Utility-based Pattern Mining with Effectiveness Issues

In addition to the efficiency issue of utility mining, a number of models have been proposed to address the effectiveness issue of discovering different kinds of high-utility patterns (HUPs). Current HUPM algorithms can successfully handle temporal data [30], [31] and dynamic data [32], [33], [34]. Other interesting effectiveness issues, such as HUPM with discount strategies [35], the concise representation [36], the discriminative issue [37], and the top-k problem [38] for HUPM, have been extensively studied. Lin et al. first proposed a utility mining model to extract the high-utility patterns from uncertain databases [20]. Different from the above utility measures of HUPs, another utility measure for utility-driven pattern mining, namely utility occupancy, was introduced recently [39]. A comprehensive survey of utility mining has been provided by Gan et al. [13].

All the above HUPM algorithms only consider positive utilities/risks and quantities of items. However, in some real-world scenarios, the utility/risk/weight values of the objects/items in databases can be either positive or negative. Therefore, the traditional HUPM algorithms cannot successfully be applied to databases containing negative values. In the past, the two-phase HUINIV-Mine [16] and TS-HOUN [40] algorithms and the one-phase FHN [41] algorithm were proposed to deal with precise data containing both positive and negative utility values. However, none of the existing approaches for negative utilities can be used to process uncertain data and extract the utility-driven insightful patterns.

III. PRELIMINARIES AND PROBLEM STATEMENT

We first introduce the uncertain database of the defined problem for utility-driven mining in this section. Let I = {i1, i2, ..., im} be a set of objects/items (symbols), and let the uncertain database be a set of transactions D = {T1, T2, ..., Tn}, where each object/item in a transaction has an uncertain probability of existence p(ik, Tc) [18], [19]. For each Tc, it has the relationship such that ik ∈ Tc. A positive quantity value is defined as the internal utility and denoted as q(ik, Tc). This value gives the quantity of ik in Tc. Let each ik ∈ I be related to a positive or negative value, which is defined as the external utility and denoted as pr(ik). The set of external utilities of all items in the database is denoted as ptable = {pr(i1), pr(i2), ..., pr(im)}. Table I shows a simple example to illustrate the proposed approach.

TABLE I
A RUNNING EXAMPLE FOR THE UNCERTAIN DATABASE.

tid   Transaction (item: quantity, probability)              Total utility
T1    (a:5, 0.6); (b:3, 0.50); (d:2, 0.9); (e:4, 0.8)         $107
T2    (c:1, 0.75); (d:1, 0.9); (e:2, 1.0)                     $24
T3    (a:4, 1.0); (b:3, 1.0); (c:2, 0.7); (e:1, 0.75)         $50
T4    (a:3, 0.9); (c:1, 0.9)                                  $22
T5    (b:2, 1.0); (c:4, 0.95); (d:5, 0.6); (e:4, 1.0)         $90

Example 1: Table I is used as the running example. It has five transactions (T1, T2, ..., T5). Transaction T2 shows that items {c}, {d}, and {e} are purchased together in T2, and their quantities are 1, 1, and 2, respectively. We also assume that the unit utilities of the items in Table I are defined in ptable as: ptable = {pr(a):$8, pr(b):$5, pr(c):-$2, pr(d):$12, pr(e):$7}. Thus, item (c) is sold at a loss.

Definition 1 (utility measure): u(i, Tc) indicates the utility of an item i in the transaction Tc, which is calculated as \( u(i, T_c) = pr(i) \times q(i, T_c) \). u(X, Tc) is the utility of an itemset X in a transaction Tc, which is calculated as \( u(X, T_c) = \sum_{i \in X} u(i, T_c) \). Note that X ⊆ I. Thus, the total utility of X in a database D is denoted as u(X) and calculated as

\[ u(X) = \sum_{X \subseteq T_c \wedge T_c \in D} u(X, T_c). \]

Example 2: In Table I, the utility of {a} in T1 is calculated as: u(a, T1) = 5 × $8 = $40. The utility of {a, e} in T1 is calculated as: u({a, e}, T1) = u(a, T1) + u(e, T1) = 5 × $8 + 4 × $7 = $68. Therefore, the utility of {a, e} in Table I can be summed up as: u({a, e}) = (u(a, T1) + u(e, T1)) + (u(a, T3) + u(e, T3)) = ($40 + $28) + ($32 + $7) = $107. For {a, b, e}, the total utility can be calculated as: u({a, b, e}) = u(a, T1) + u(b, T1) + u(e, T1) + u(a, T3) + u(b, T3) + u(e, T3) = ($40 + $15 + $28) + ($32 + $15 + $7) = $137.

Since the discovered patterns are usually rare in realistic applications, the probabilistic frequent model [19] cannot be directly applied to utility-oriented applications [20]. The common method of mining uncertain data uses the expected support-based model to mine the interesting patterns. For example, the expected support of X sums up the support value of a pattern X in a possible world Wj as:

\[ expSup(X) = \sum_{c=1}^{|D|} \Big( \prod_{x_i \in X} p(x_i, T_c) \Big) \]

[18], [19]. The expected probability measure of the addressed problem is defined below.

Definition 2 (probability measure): Let X be a pattern (itemset) and Tc be a transaction in the database D. The probability of X in Tc is denoted as p(X, Tc) and calculated as \( p(X, T_c) = \prod_{i \in X} p(i, T_c) \). Note that X ⊆ I. The probability of X in D is then denoted as Pro(X) and defined as

\[ Pro(X) = \sum_{T_c \in D} \Big( \prod_{i \in X} p(i, T_c) \Big). \]

Example 3: In Table I, the probability of {a} in T1 is p(a, T1) = 0.60. The probability of {a, e} in T1 can then be calculated as: p({a, e}, T1) = p(a, T1) × p(e, T1) = 0.6 × 0.8 = 0.48. The probability of {a} in D can be calculated as: Pro(a) = 0.6 + 1.0 + 0.9 = 2.5, and the probability of {a, b, e} in D can be calculated as: Pro({a, b, e}) = p({a, b, e}, T1) + p({a, b, e}, T3) = 0.24 + 0.75 = 0.99.

Definition 3 (Potential High-Utility Itemset, PHUI): Let X be a pattern (itemset) in an uncertain database D. X is a PHUI if it satisfies two conditions: (1) u(X) ≥ minUtil, and (2) Pro(X) ≥ minPro × |D|, in which minUtil is a minimum utility threshold and minPro is a minimum probability threshold. In other words, a desired PHUI has both a high expected probability and a high utility value.

Problem Statement. With an uncertain database, a utility table (with a positive or negative utility value of each item),
a minimum utility threshold (minUtil), and a minimum probability threshold (minPro), the problem of utility-driven data analytics from uncertain data is to discover the complete set of PHUIs.

For example, if minPro and minUtil are set to minPro = 0.25 and minUtil = 20, respectively, the derived PHUIs from Table I are {{{a}:$96, 2.50}; {{b}:$40, 2.50}; {{d}:$96, 2.40}; {{e}:$77, 3.55}; {{a, b}:$102, 1.30}; {{a, c}:$50, 1.51}; {{b, e}:$103, 2.15}; {{c, e}:$35, 2.225}; {{d, e}:$166, 2.22}; {{b, c, e}:$48, 1.475}}. Here, {{b}:$40, 2.50} indicates that the utility value and expected probability of pattern {b} are $40 and 2.50, respectively.

IV. PROPOSED APPROACH FOR MINING PHUIS

In this section, a utility-driven data analytics framework named HUPNU is presented to discover the Potential High-Utility Itemsets (PHUIs) from an uncertain database. We further design the Probability-Utility list with positive and negative utilities (PU±-list). In addition, several pruning strategies are presented here to reduce the search space of the potential HUIs. More details are described below.

A. Positive and Negative Unit Utilities

The monotonic/anti-monotonic properties do not hold in utility mining [28]. In this situation, the utility of a pattern may be higher than, lower than, or equal to that of any of its subset patterns. The search space to discover the meaningful and useful patterns may become large if many items exist in the database. The TWU model [28] was presented to avoid the problem of "combinatorial explosion"; it aims at reducing the search space for mining the high-utility itemsets. Several extensions were then extensively studied [7], [8], [28] to improve the mining performance. However, those approaches, including the TWU model, do not consider both the positive and negative unit utilities of items, which are addressed in this paper. Moreover, the existing works do not consider the above situations in the uncertain database for discovering the PHUIs. We define the following properties for the addressed problem:

Property 1: Let pu(X) and nu(X) be, respectively, the sums of the positive and negative utilities of an itemset X in a database. The utility of X is then calculated as u(X) = pu(X) + nu(X), where nu(X) ≤ u(X) ≤ pu(X) holds.

From the above-stated property, we can conclude that u(X) [...] \( RTU(T_c) = \sum_{i \in T_c \wedge pr(i) > 0} u(i, T_c) \). The transaction-weighted utilization of an itemset X is then also redefined as RTWU:

\[ RTWU(X) = \sum_{X \subseteq T_c \wedge T_c \in D} RTU(T_c). \]

The above-stated definitions can be used to hold the downward closure property for mining the required PHUIs. Note that RTWU(X) ≥ u(X).

Example 4: RTU(T2) is $12 + $14 = $26. Consider the two patterns {a} and {a, b, e}: RTWU({a}) = $185 and RTWU({a, b, e}) = $161; both are over-estimates of u({a}) = $96 and u({a, b, e}) = $137, respectively.

B. Probability Utility (PU±)-List Structure

In the developed HUPNU algorithm, the processing order of the items in the database is defined as ≺, which has the following properties: (1) the items are sorted in RTWU-ascending order; and (2) negative items always succeed positive ones. The designed Probability Utility (PU±)-List structure used in the HUPNU algorithm is stated as follows:

Definition 5 (PU±-list): Let X be an itemset in the database. The PU±-list of X is denoted as X.PUL, and it consists of five elements: (1) tid represents the transaction ID in the database; (2) pro is the expected probability of X in T_tid, and pro(X, T_tid) ≥ 0; (3) pu is the positive utility of X in T_tid, and u(X, T_tid) ≥ 0; (4) nu represents the negative utility of X in T_tid, and u(X, T_tid) < 0; (5) rpu represents \( \sum_{i \in T_{tid} \wedge i \succ x\, \forall x \in X} u(i, T_{tid}) \geq 0 \), which keeps only the positive utility values of the remaining items.

(a)
tid   pro    pu   nu   rpu
T2    1.0    6    0    13
T3    0.55   6    0    33
T5    1.0    24   0    9

Fig. 1. Constructed PU±-list of pattern {a}.

[Figure: PU±-lists for the panels (a), (e), (b), (c), and (d); columns: tid, pro, pu, nu, rpu.]
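The definitions above can be checked against the running example with a short script. The following is a minimal sketch, not the HUPNU algorithm itself: it enumerates every itemset by brute force (no pruning), builds PU±-lists for single items only, and joins two of them as in the empty-prefix branch of Algorithm 1 (without the early-termination test). All function names (`utility`, `pro`, `rtwu`, `brute_force_phuis`, `processing_order`, `pu_list`, `join_2`) are hypothetical and chosen for illustration. It also assumes that a single item contributes its utility to pu when positive and to nu when negative, and that negative items are ordered among themselves by ascending RTWU.

```python
from itertools import combinations

# Table I: tid -> {item: (quantity, existence probability)}
DB = {
    "T1": {"a": (5, 0.60), "b": (3, 0.50), "d": (2, 0.90), "e": (4, 0.80)},
    "T2": {"c": (1, 0.75), "d": (1, 0.90), "e": (2, 1.00)},
    "T3": {"a": (4, 1.00), "b": (3, 1.00), "c": (2, 0.70), "e": (1, 0.75)},
    "T4": {"a": (3, 0.90), "c": (1, 0.90)},
    "T5": {"b": (2, 1.00), "c": (4, 0.95), "d": (5, 0.60), "e": (4, 1.00)},
}
PTABLE = {"a": 8, "b": 5, "c": -2, "d": 12, "e": 7}  # unit utilities pr(i)


def u(item, tid):
    """u(i, Tc) = pr(i) * q(i, Tc)."""
    return PTABLE[item] * DB[tid][item][0]


def supporting(itemset):
    """Ids of the transactions that contain every item of X."""
    return [tid for tid, tx in DB.items() if all(i in tx for i in itemset)]


def utility(itemset):
    """u(X): utility of X summed over the transactions containing X."""
    return sum(u(i, tid) for tid in supporting(itemset) for i in itemset)


def pro(itemset):
    """Pro(X): sum over supporting transactions of the product of p(i, Tc)."""
    total = 0.0
    for tid in supporting(itemset):
        p = 1.0
        for i in itemset:
            p *= DB[tid][i][1]
        total += p
    return total


def rtwu(itemset):
    """RTWU(X): sum of RTU(Tc), which counts positive item utilities only."""
    return sum(u(i, tid) for tid in supporting(itemset)
               for i in DB[tid] if PTABLE[i] > 0)


def brute_force_phuis(min_pro, min_util):
    """Test every itemset against both PHUI conditions (no pruning)."""
    items, found = sorted(PTABLE), {}
    for k in range(1, len(items) + 1):
        for x in combinations(items, k):
            if utility(x) >= min_util and pro(x) >= min_pro * len(DB):
                found["".join(x)] = (utility(x), round(pro(x), 3))
    return found


def processing_order():
    """Positive items by ascending RTWU, followed by negative items."""
    return sorted(PTABLE, key=lambda i: (PTABLE[i] < 0, rtwu([i])))


def pu_list(item, order):
    """Rows (tid, pro, pu, nu, rpu) of the PU-list of a single item."""
    later = order[order.index(item) + 1:]
    rows = []
    for tid in supporting([item]):
        val = u(item, tid)
        rpu = sum(u(j, tid) for j in later if j in DB[tid] and PTABLE[j] > 0)
        rows.append((tid, DB[tid][item][1], max(val, 0), min(val, 0), rpu))
    return rows


def join_2(py, pz):
    """Join two 1-item lists (empty prefix P), keeping the rpu of z."""
    z_rows = {row[0]: row for row in pz}
    joined = []
    for tid, pro_y, pu_y, nu_y, _ in py:
        if tid in z_rows:
            _, pro_z, pu_z, nu_z, rpu_z = z_rows[tid]
            joined.append((tid, pro_y * pro_z, pu_y + pu_z, nu_y + nu_z, rpu_z))
    return joined


if __name__ == "__main__":
    order = processing_order()
    print(order)
    print(brute_force_phuis(0.25, 20))
    print(join_2(pu_list("a", order), pu_list("c", order)))
```

With minPro = 0.25 and minUtil = 20, the brute-force pass returns the ten PHUIs listed in the example following the problem statement, and joining the lists of {a} and {c} reproduces the {a, c}.PUL rows quoted in the text, up to floating-point rounding of the probabilities. Exhaustive enumeration is exponential in the number of items, which is the cost the pruning strategies of HUPNU are meant to avoid.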
{(T1, 0.60, $40, $0, $67), (T3, 1.00, $32, -$4, $22), (T5, 0.90, $24, -$2, $0)}; {c}.PUL = {(T2, 0.75, $0, -$2, $0), (T3, 0.70, $0, -$4, $0), (T4, 0.90, $0, -$2, $0), (T5, 0.95, $0, -$8, $0)}; {a, c}.PUL = {(T3, 0.70, $32, -$4, $0), (T4, 0.81, $24, -$2, $0)}.

Definition 6: The sums of the total utility, pu values, nu values, and rpu values in the PU±-list of X are respectively denoted as SUM(X.iu), SUM(X.pu), SUM(X.nu), and SUM(X.rpu), which can be respectively defined as:

\[ SUM(X.pu) = \sum_{X \subseteq T_c \wedge T_c \in D} X.pu(T_c); \]
\[ SUM(X.nu) = \sum_{X \subseteq T_c \wedge T_c \in D} X.nu(T_c); \]
\[ SUM(X.rpu) = \sum_{X \subseteq T_c \wedge T_c \in D} X.rpu(T_c); \]
\[ SUM(X.iu) = SUM(X.pu) + SUM(X.nu). \]

Lemma 1: For two patterns Y and Z in the PU±-tree, if one of the following relationships holds: (1) \( SUM(Y.pu) + SUM(Y.rpu) - \sum_{\forall T_c \in D,\, Y \subseteq T_c \wedge Z \nsubseteq T_c} (Y.pu + Y.rpu) < minUtil \), or (2) \( SUM(Y.pro) - \sum_{\forall T_c \in D,\, Y \subseteq T_c \wedge Z \nsubseteq T_c} (Y.pro) < minPro \times |D| \), then (YZ), or any superset of it, will not be a PHUI.

Strategy 1 (PU-Prune strategy): If Lemma 1 holds, then the PU±-list construction of the pattern Y can be avoided, and none of its supersets will be a PHUI.

According to the designed PU-Prune strategy, a large number of the unpromising k-patterns (k ≥ 2, where k is the number of items within the itemset) can be filtered from the search space. Let Py denote an itemset, and y denote an item; Py is defined as P ∪ y, and y is before z. The construction procedure of the PU±-list is shown in Algorithm 1. It takes the PU±-lists of P, Py, and Pz as the inputs, and returns the PU±-list of Pyz as output. The PU±-lists of k-itemsets (k ≥ 2) can be easily built using a simple join operation of (k-1)-itemsets; multiple database scans can be avoided, and the computational cost of the run time can be reduced. If P.PUL is empty, the PU±-list of a 2-itemset is constructed (Line 9). Otherwise, the PU±-list of a k-itemset (k ≥ 3) is constructed (Lines 5 to 7). As an optimization, the join operation of the PU±-lists of P, Py, and Pz can be performed with a binary search.

Algorithm 1 Construction procedure
Input: P, Py, Pz.
Output: Pyz with its Pyz.PUL.
 1: Pyz.PUL ← ∅;
 2: set Probability = SUM(Y.pro), Utility = SUM(Y.pu) + SUM(Y.rpu);
 3: for each tuple ey ∈ Py.PUL do
 4:   if ∃ez ∈ Pz.PUL and ey.tid = ez.tid then
 5:     if P.PUL ≠ ∅ then
 6:       search each element e ∈ P.PUL such that e.tid = ey.tid;
 7:       eyz ← <ey.tid, ey.pro × ez.pro/e.pro, ey.pu + ez.pu − e.pu, ey.nu + ez.nu − e.nu, ez.rpu>;
 8:     else
 9:       eyz ← <ey.tid, ey.pro × ez.pro, ey.pu + ez.pu, ey.nu + ez.nu, ez.rpu>;
10:     end if
11:     Pyz.PUL ← Pyz.PUL ∪ {eyz};
12:   else
13:     Probability = Probability − ey.pro, Utility = Utility − ey.pu − ey.rpu;
14:     if Probability < minPro × |D| or Utility < minUtil then
15:       return null.
16:     end if
17:   end if
18: end for
19: return Pyz

C. Proposed Pruning Strategies

In this section, we discuss several pruning strategies based on the developed PU±-list for the later mining process. The developed strategies help to remove the unpromising candidates in the early stages, and thus the search space that reveals the actual PHUIs becomes smaller. The node of a (k-1)-itemset in the designed PU±-tree is denoted as \( X^{k-1} \) (k ≥ 2), and any superset of it is denoted as \( X^{k} \).

Theorem 1 (Downward closure property of RTWU and probability): In the PU±-tree, \( Pro(X^{k-1}) \geq Pro(X^{k}) \) and \( RTWU(X^{k-1}) \geq RTWU(X^{k}) \) hold.

Proof: Since \( p(X, T_c) = \prod_{i \in X} p(i, T_c) \) holds for any Tc in D, we have \( p(X^{k}, T_c) \leq p(X^{k-1}, T_c) \). Since \( X^{k-1} \) is a subset of \( X^{k} \), the tids of \( X^{k} \) are a subset of the tids of \( X^{k-1} \). We then have

\[ Pro(X^{k}) = \sum_{X^{k} \subseteq T_c \wedge T_c \in D} p(X^{k}, T_c) \leq \sum_{X^{k-1} \subseteq T_c \wedge T_c \in D} p(X^{k-1}, T_c) = Pro(X^{k-1}). \]

Thus, it can be concluded that \( Pro(X^{k-1}) \geq Pro(X^{k}) \). Moreover, since \( X^{k-1} \subseteq X^{k} \),

\[ RTWU(X^{k}) = \sum_{X^{k} \subseteq T_c \wedge T_c \in D} RTU(T_c) \leq \sum_{X^{k-1} \subseteq T_c \wedge T_c \in D} RTU(T_c) = RTWU(X^{k-1}) \]

holds.

Lemma 2 (Upper-bound probability of PHUI): The summed-up probability of any node in the PU±-tree is no less than the summed-up probability of any of its supersets.

Strategy 2: The RTWU and the probability value of each item can be easily obtained during the initial database scan. Thus, if the summed-up probability and RTWU of an itemset X do not satisfy the two conditions of a PHUI, X and any supersets of it can be pruned directly.

Strategy 3: While the depth-first search is performed to traverse the PU±-tree, if the summed-up probability Pro(X) of a tree node X is no larger than minPro × |D|, then none of the supersets of this node are considered to be a PHUI.

Lemma 3 (Upper-bound utility of the PHUI): For a node X in the PU±-tree, the sum of SUM(X.pu) and SUM(X.rpu) is always equal to or larger than the utility of any of its supersets.

We can conclude from the above lemmas that the summed-up utility \( u(X^{k}) \) of an itemset \( X^{k} \) is never greater than the sum of \( SUM(X^{k-1}.pu) \) and \( SUM(X^{k-1}.rpu) \). Therefore, it is ensured that the transitive extensions with items having positive or negative utilities hold the downward closure property. The following strategies can be designed based on the above upper-bound constraints:

Strategy 4: While the depth-first search is performed to traverse the PU±-tree, if the summed up values of
\( SUM(X^{k-1}.pu) \) and \( SUM(X^{k-1}.rpu) \) are less than minUtil, any supersets of X cannot be a PHUI. Those nodes in the tree can be directly ignored and pruned to reduce the search space for mining the PHUIs.

Strategy 5: For an itemset X in the PU±-list, if X.PUL is null or the Pro(X) value is less than minPro × |D|, X cannot be considered a PHUI, and none of its supersets (nodes) is a PHUI. Therefore, the construction procedure of the PU±-lists of X's supersets can be ignored.

The efficient Estimated Utility Co-occurrence Pruning (EUCP) strategy [42] is also utilized here for the designed HUPNU algorithm. Thus, the Estimated Utility Co-occurrence Structure (EUCS) is built to keep the RTWU values of the 2-itemsets. More information can be found in [42].

Strategy 6: Let X be a 2-itemset, which is also one of the nodes in the Set-enumeration PU±-tree. While the depth-first search is performed, if the RTWU of X is no greater than minUtil based on the built EUCS, X and any supersets of X are not considered to be PHUIs; the construction of the PU±-tree for X and its supersets can be ignored.

Based on the above proposed pruning strategies, the designed HUPNU is shown in detail below.

D. The Procedure of HUPNU

In this section, the main procedure of the developed HUPNU is shown in Algorithm 2. First, it examines the uncertain database to find the values of RTWU (with the redefined RTU) and Pro(i) of each item (Line 1). The expected support and RTWU of each item are then checked against minPro × |D| and minUtil, respectively, and the items satisfying both conditions form the set I* (Line 2). In this step, the other items can be ignored directly since they cannot be part of a potential HUI. The items in I* are then sorted according to the designed order ≺ (Line 3). The database is then scanned again (Line 4); during this scan, the items in the transactions are re-ordered based on the total order ≺. After that, the PU±-list of each 1-item i ∈ I* is constructed, and the depth-first search is recursively performed by the Search procedure with the empty itemset ∅, the set of single items I*, minPro, minUtil, and the EUCS (Line 5).

Algorithm 2 HUPNU main procedure
Input: D, minPro, minUtil, ptable.
Output: The set of PHUIs.
 1: scan uncertain database D to calculate the RTWU(i) and Pro(i) of each item i ∈ I;
 2: I* ← each item i such that Pro(i) ≥ minPro × |D| ∧ RTWU(i) ≥ minUtil;
 3: sort the items in the set of I* as order ≺;
 4: scan database D to build the PU±-list of each item i ∈ I* and construct EUCS;
 5: call Search(∅, I*, minPro, minUtil, EUCS).
 6: return PHUIs

The search procedure of the HUPNU is described in Algorithm 3. For each extension Py of P, if the probability of Py is no less than minPro × |D|, and the summed-up actual utility of Py in the PU±-list (denoted as SUM(Y.pu) + SUM(Y.nu)) is no less than minUtil, then Py can be considered to be a PHUI (Lines 2 to 4). The developed pruning Strategies 3 and 4 are then applied to check whether the extensions of Py should be explored (Line 5). This is done by integrating Py with all extensions Pz of P, such that z ≻ y and RTWU({y, z}) ≥ minUtil (Line 8, pruning Strategy 6), to generate the extensions (Pyz) containing |Py| + 1 items. After that, the PU±-list of Pyz can be built by the Construct procedure, which joins the PU±-lists of P, Py, and Pz (Lines 9 to 13). Note that only the promising PU±-lists are explored later (Line 12, pruning Strategy 5). A recursive Search procedure of Pyz is then performed to obtain its utility and explore the extension(s) (Line 16).

Algorithm 3 Search procedure
Input: P, ExtensionsOfP, minPro, minUtil, EUCS.
Output: The set of PHUIs.
 1: for each pattern Py ∈ ExtensionsOfP do
 2:   if SUM(Py.pro) ≥ minPro × |D| ∧ (SUM(Py.pu) + SUM(Py.nu)) ≥ minUtil then
 3:     output Py as a PHUI;
 4:   end if
 5:   if SUM(Py.pro) ≥ minPro × |D| ∧ (SUM(Py.pu) + SUM(Py.rpu)) ≥ minUtil then
 6:     ExtensionsOfPy ← ∅;
 7:     for Pz ∈ ExtensionsOfP such that z ≻ y do
 8:       if RTWU({y, z}) ≥ minUtil then
 9:         Pyz ← Py ∪ Pz;
10:         Pyz.PUL ← Construct(P, Py, Pz);
11:         if Pyz.PUL ≠ ∅ and SUM(Pyz.pro) ≥ minPro × |D| then
12:           ExtensionsOfPy ← ExtensionsOfPy ∪ Pyz;
13:         end if
14:       end if
15:     end for
16:     call Search(Pyz, ExtensionsOfPy, minPro, minUtil, EUCS).
17:   end if
18: end for
19: return PHUIs

V. EXPERIMENTAL STUDY

In this section, we evaluate the performance of the developed HUPNU algorithm. All the algorithms were implemented in Java and executed on an Intel Core-i5 processor running the 64-bit Microsoft Windows 7 operating system with 4GB of main memory. Memory usage was measured using the Java API. To the best of our knowledge, this is the first paper that considers both the positive and negative unit utilities of items in an uncertain database; none of the previous works handle this topic. Thus, variants of the developed HUPNU with different combinations of the designed pruning strategies were compared in the experiments. HUPNU_P12 denotes that pruning Strategies 1 and 2 were involved in the HUPNU; HUPNU_P123 considers pruning Strategies 1, 2, and 3; HUPNU_P1234 takes the pruning
Strategies 1, 2, 3, and 4; and HUPNU_All is concerned with all the pruning Strategies (1 to 5) for evaluation. Experiments were carried out on five realistic datasets1 (i.e., kosarak, accidents, pumsb, retail, and mushroom) and one synthetic dataset [43]. The individual characteristics of each of the six datasets are given below.

1 http://fimi.ua.ac.be/data/

• kosarak: it has 990,002 transactions, and the number of distinct items is 41,270. The average length and the maximum length of transactions are 8.09 and 2,498, respectively.
• accidents: it has 340,183 transactions and 468 distinct items. Over all transactions, the average length is 33.8 and the maximum length is 51.
• pumsb: it has a total of 49,046 transactions and 2,113 distinct items. It is a very dense dataset since the average length of each transaction is 74.
• retail: this dataset contains 88,162 transactions and a total of 16,470 distinct items. The average length of transactions is 10.3, and the maximum length is 76.
• mushroom: it has 8,124 transactions with 119 distinct items. It is a dense dataset since both the average length and the maximum length are 23.
• T10I4D100K [43]: this dataset has 100,000 transactions with 870 items. These transactions have 10.1 items on average, and the maximum length is 29.

The external utilities of different items for the six datasets were generated in the range of [-1,000, 1,000] using a log-normal distribution. In addition, the quantities of the items were randomly generated in the range of [1, 5]. These settings are similar to those of the previous well-known algorithms [7], [42], [8] for HUPM. Moreover, the unique probabilities of the items were randomly assigned in the range of (0.0, 1.0). In the experiments, we evaluated the implemented algorithms in terms of runtime, number of visited nodes (or patterns), and scalability. The results are given below.

A. Evaluation of Runtime

The runtimes of the four implemented algorithms were compared under varied minUtil and minPro thresholds. Results in terms of the two thresholds are shown in Fig. 3 and Fig. 4, respectively.

As shown in Fig. 3 and Fig. 4, the runtimes of all implemented algorithms decreased as the minUtil or minPro threshold increased. Specifically, the implemented algorithms with more pruning strategies greatly improved the performance, up to nearly one or two orders of magnitude faster than the baseline approach. For example, HUPNU_P1234, which adopts all the efficient pruning strategies, outperformed the other variants of the designed approach. The reason is that the HUPNU_P1234 algorithm applies all the pruning strategies, so the number of unpromising candidates is greatly reduced. Therefore, the traversal procedure to explore the unpromising patterns in the enumeration tree can be avoided, as well as the costly join operation to generate the huge number of unpromising candidate PU±-lists. When the minUtil and minPro values were set lower, more and longer patterns were mined, and a greater computational cost in terms of runtime was required. This situation can be easily observed in a very dense dataset, for example in accidents and pumsb.

The developed PU±-list can also easily help the variants of the algorithms to directly mine the required patterns without candidate generation. The list structure can effectively avoid multiple database scans. We can also observe that the designed pruning strategies greatly help with reducing the number of unpromising patterns. The required memory usage of the variants of algorithms is much less, but due to the page limit, we omit some details of the results. However, in the general pattern-mining approach, it can be easily concluded that more memory usage was required when the threshold was set lower. The designed PU±-list could actually address this limitation by using a more compressed search space, and achieve effectiveness and efficiency for mining the PHUIs.

B. Evaluation of Number of Patterns (Visited Nodes)

In this section, the numbers of visited nodes (also known as the candidate patterns) in the designed PU±-tree are compared to evaluate the effects of pruning strategies. Note that the numbers of visited nodes of the four variants of the HUPNU algorithm are denoted as N1, N2, N3, and N4, respectively. According to the same parameter settings in Fig. 3 and Fig. 4, the final results of the patterns are respectively shown in Table II and Table III in terms of minUtil and minPro thresholds. From Table II and Table III, it can be seen that N1 > N2 > N3 > N4 > PHUIs in terms of varied minUtil and minPro. We can then draw the following conclusions: (1) The number of discovered PHUIs was smaller in the uncertain dataset than the number of candidate patterns. (2) The set of the discovered PHUIs was more meaningful; the designed HUPNU discovers more concrete and useful patterns with both positive and negative unit utilities of item constraints than the traditional algorithms in HUPM. (3) The search space of the HUPNU was very large without any pruning strategies, but the developed strategies could reduce its size. (4) The results were reasonable, and even though the size of the search space could be reduced by different pruning strategies, the completeness and the correctness of the final PHUIs could still be held.

For the designed pruning strategies 2, 3, and 4 of the implemented HUPNU_P1 algorithm, they hold the upper-bound values for the utility and probability of the patterns, and thus combinational exploration could be avoided. However, some unpromising patterns could not be effectively filtered, which can be clearly seen from the gap between the number of final PHUIs and N1. It can also be seen that PU-Prune performed well in removing unpromising patterns, which could be observed in N1 and N2. This strategy avoids the construction process for the numerous unpromising patterns. We can also see that the EUCP strategy contributed substantially to reducing the search space. The relationships of N1 > N2 > N3 > N4 were correctly held. When the threshold value was set lower, for example, minUtil or minPro, the gap between different implemented algorithms of the visited patterns could
[Fig. 3: Runtime (sec.) versus minimum utility threshold under a fixed minPro, comparing HUPNU_P1, HUPNU_P12, HUPNU_P123, and HUPNU_P1234. Panels: (a) kosarak (minPro: 0.0001), (b) accidents (minPro: 0.002), (c) pumsb (minPro: 0.002), (d) retail (minPro: 0.0001), (e) mushroom (minPro: 0.004), (f) T10I4D100K (minPro: 0.0001).]

[Fig. 4: Runtime (sec.) versus minimum probability threshold (%) under a fixed minUtil, comparing HUPNU_P1, HUPNU_P12, HUPNU_P123, and HUPNU_P1234. Panels: (a) kosarak (minUtil: 30k), (b) accidents (minUtil: 7000k), (c) pumsb (minUtil: 3500k), (d) retail (minUtil: 100), (e) mushroom (minUtil: 10k), (f) T10I4D100K (minUtil: 300).]
become huge, and the effectiveness of the designed pruning strategies is held.

C. Evaluation of Scalability

As shown in Fig. 5, the scalability test was carried out on a realistic BMS-POS dataset under varied dataset sizes. The threshold values were set as minPro = 0.0001 and minUtil = 10k, and the dataset size was varied from 100k to 500k. In Fig. 5(a), we can find that the runtimes of the four implemented algorithms increased linearly with the size of the dataset. We can see that the runtime of HUPNU_P123 was close to that of HUPNU_P12, and both of them were significantly faster than HUPNU_P1. We can also see that HUPNU_P1234 outperformed the other implemented algorithms. When the size of the dataset increases, the gap between the implemented algorithms becomes larger, but they still remain stable with linear growth. The memory usage of the four implemented algorithms is shown in Fig. 5(b). HUPNU_P1 requires the most memory; HUPNU_P123 and HUPNU_P1234 have similar results and require the least memory. Fig. 5(c) shows the number of visited patterns of the four implemented algorithms and the final PHUIs. The results show the effectiveness and efficiency of the designed pruning strategies, and when the size of the dataset becomes larger, the gap between those algorithms becomes huge.

VI. CONCLUSION

In this paper, we present the HUPNU algorithm, which jointly considers the uncertainty and utility (both positive and negative) factors to reveal the qualified high-utility patterns. This is the first work concerning these realistic factors in some real-life situations, such as Internet of Things data and manufacturing data. A vertical structure named PU±-list was designed to keep necessary information, such as probability, and the positive and negative utilities of the items for the later mining process. Based on the above properties, the HUPNU algorithm could directly produce the qualified high-utility patterns in one phase. Moreover, several efficient pruning strategies were also developed to greatly reduce the search space for mining the promising patterns, and thus the computation could be sped up efficiently. Extensive experiments were carried out on several synthetic/realistic datasets to show the efficiency and effectiveness of the designed algorithm in terms of runtime, number of discovered qualified patterns, and scalability.

REFERENCES

[1] L. Atzori, A. Iera, and G. Morabito, “The internet of things: A survey,” Computer Networks, vol. 54, no. 15, pp. 2787–2805, 2010.
[2] C. W. Tsai, C. F. Lai, M. C. Chiang, L. T. Yang et al., “Data mining for internet of things: A survey,” IEEE Communications Surveys and Tutorials, vol. 16, no. 1, pp. 77–97, 2014.
[3] F. Chen, P. Deng, J. Wan, D. Zhang, A. V. Vasilakos, and X. Rong, “Data mining for the internet of things: literature review and challenges,” International Journal of Distributed Sensor Networks, vol. 11, no. 8, p. 431047, 2015.
[4] R. Agrawal, R. Srikant et al., “Fast algorithms for mining association rules,” in Proceedings of the 20th International Conference on Very Large Databases, vol. 1215, 1994, pp. 487–499.
[5] J. Han, J. Pei, Y. Yin, and R. Mao, “Mining frequent patterns without candidate generation: A frequent-pattern tree approach,” Data Mining and Knowledge Discovery, vol. 8, no. 1, pp. 53–87, 2004.
[6] C. F. Ahmed, S. K. Tanbeer, B. S. Jeong, and Y. K. Lee, “Efficient tree structures for high utility pattern mining in incremental databases,” IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 12, pp. 1708–1721, 2009.
[7] M. Liu and J. Qu, “Mining high utility itemsets without candidate generation,” in Proceedings of the 21st ACM International Conference on Information and Knowledge Management. ACM, 2012, pp. 55–64.
[8] V. S. Tseng, B. E. Shie, C. W. Wu, and P. S. Yu, “Efficient algorithms for mining high utility itemsets from transactional databases,” IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 8, pp. 1772–1786, 2013.
[9] Z. Sheng, S. Yang, Y. Yu, A. Vasilakos, J. Mccann, and K. Leung, “A survey on the ietf protocol suite for the internet of things: Standards, challenges, and opportunities,” IEEE Wireless Communications, vol. 20, no. 6, pp. 91–98, 2013.
[10] P. Fournier-Viger, J. C. W. Lin, B. Vo, T. T. Chi, J. Zhang, and H. B. Le, “A survey of itemset mining,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 7, no. 4, 2017.
[11] U. Yun, G. Lee, and E. Yoon, “Efficient high utility pattern mining for establishing manufacturing plans with sliding window control,” IEEE Transactions on Industrial Electronics, vol. 64, no. 9, pp. 7239–7249, 2017.
[12] F. Tao, Y. Cheng, L. Zhang, and D. Zhao, “Utility modelling, equilibrium, and coordination of resource service transaction in service-oriented manufacturing system,” Proceedings of the Institution of Mechanical Engineers, Part B: Journal of Engineering Manufacture, vol. 226, no. 6, pp. 1099–1117, 2012.
[13] W. Gan, J. C. W. Lin, P. Fournier-Viger, H. C. Chao, V. S. Tseng, and P. S. Yu, “A survey of utility-oriented pattern mining,” arXiv preprint arXiv:1805.10511, 2018.
[14] Y. W. Teng, C. H. Tai, P. S. Yu, and M. S. Chen, “Revenue maximization on the multi-grade product,” in Proceedings of the SIAM International Conference on Data Mining. SIAM, 2018, pp. 576–584.
[15] C. H. Cai, A. W. C. Fu, C. Cheng, and W. Kwong, “Mining association rules with weighted items,” in International Database Engineering and Applications Symposium. IEEE, 1998, pp. 68–77.
[16] C. J. Chu, V. S. Tseng, and T. Liang, “An efficient algorithm for mining high utility itemsets with negative item values in large databases,” Applied Mathematics and Computation, vol. 215, no. 2, pp. 767–778, 2009.
[17] J. Wan, S. Tang, D. Li, S. Wang, C. Liu, H. Abbas, and A. V. Vasilakos, “A manufacturing big data solution for active preventive maintenance,” IEEE Transactions on Industrial Informatics, vol. 13, no. 4, pp. 2039–2047, 2017.
[18] C. C. Aggarwal and P. S. Yu, “A survey of uncertain data algorithms and applications,” IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 5, pp. 609–623, 2009.
[19] T. Bernecker, H. P. Kriegel, M. Renz, F. Verhein, and A. Zuefle, “Probabilistic frequent itemset mining in uncertain databases,” in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2009, pp. 119–128.
[20] J. C. W. Lin, W. Gan, P. Fournier-Viger, T. P. Hong, and V. S. Tseng, “Efficient algorithms for mining high-utility itemsets in uncertain databases,” Knowledge-Based Systems, vol. 96, pp. 171–187, 2016.
[21] Y. Liu, X. Weng, J. Wan, X. Yue, H. Song, and A. V. Vasilakos, “Exploring data validity in transportation systems for smart cities,” IEEE Communications Magazine, vol. 55, no. 5, pp. 26–33, 2017.
[22] L. Geng and H. J. Hamilton, “Interestingness measures for data mining: A survey,” ACM Computing Surveys, vol. 38, no. 3, p. 9, 2006.
[23] J. M. Luna, A. Cano, M. Pechenizkiy, and S. Ventura, “Speeding-up association rule mining with inverted index compression,” IEEE Transactions on Cybernetics, vol. 46, no. 12, pp. 3059–3072, 2016.
[24] R. Srikant and R. Agrawal, “Mining sequential patterns: Generalizations and performance improvements,” in International Conference on Extending Database Technology. Springer, 1996, pp. 1–17.
[25] T. P. Hong, C. S. Kuo, and S. C. Chi, “Mining association rules from quantitative data,” Intelligent Data Analysis, vol. 3, no. 5, pp. 363–376, 1999.
[26] C. K. Chui, B. Kao, and E. Hung, “Mining frequent itemsets from uncertain data,” in Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2007, pp. 47–58.
[27] H. Yao, H. J. Hamilton, and C. J. Butz, “A foundational approach to mining itemset utilities from databases,” in Proceedings of the SIAM International Conference on Data Mining. SIAM, 2004, pp. 482–486.
TABLE II
DERIVED PATTERNS UNDER VARIED minUtil AND FIXED minPro

Pattern minUtil1 minUtil2 minUtil3 minUtil4 minUtil5 minUtil6
(a) kosarak
N1 109,475,579 79,746,008 61,798,333 50,499,186 41,197,757 33,144,921
N2 14,399,980 12,180,884 10,571,745 9,254,740 8,123,221 7,125,357
N3 12,322,701 10,611,814 9,369,910 8,286,105 7,345,445 6,505,338
N4 8,067,669 6,811,140 5,945,495 5,217,922 4,603,638 4,081,767
PHUIs 45,613 35,662 28,077 24,317 21,420 18,843
(b) accidents
N1 194,512 163,169 138,399 120,560 107,620 96,807
N2 147,052 119,802 98,237 83,198 72,963 64,390
N3 146,426 119,140 97,507 82,366 72,087 63,486
N4 144,957 117,671 96,038 80,897 70,618 62,017
PHUIs 6,331 5,462 4,603 3,940 3,493 3,120
(c) pumsb
N1 2,522,059 1,814,071 1,365,950 1,184,769 1,059,297 849,484
N2 1,265,985 801,074 590,923 490,411 434,835 342,120
N3 1,238,118 795,362 583,778 484,847 423,414 337,667
N4 1,236,179 793,423 581,839 482,908 421,475 335,732
PHUIs 8,648 4,123 3,114 1,773 1,388 1,238
(d) retail
N1 96,139,509 48,350,235 35,111,699 27,307,914 21,761,430 17,948,378
N2 21,876,139 17,735,518 15,582,757 13,731,190 11,813,940 10,462,617
N3 21,078,117 17,429,475 15,360,149 13,568,770 11,696,475 10,374,338
N4 20,968,980 17,332,324 15,270,502 13,484,318 11,615,439 10,296,705
PHUIs 19,278 13,584 10,655 8,747 7,423 6,378
(e) mushroom
N1 1,558,452 974,145 721,233 507,260 380,254 322,867
N2 828,898 527,865 394,426 278,595 215,407 181,517
N3 700,121 440,243 327,442 229,328 175,038 146,840
N4 698,969 439,522 326,867 228,871 174,672 146,516
PHUIs 147,088 85,133 62,773 42,466 30,510 25,590
(f) T10I4D100K
N1 14,059,513 7,194,329 4,743,728 3,475,577 2,739,103 2,257,562
N2 6,199,289 2,365,166 1,264,020 820,482 607,506 491,439
N3 4,619,226 1,655,006 872,646 584,661 453,250 383,900
N4 4,454,919 1,570,317 802,457 519,180 390,542 323,262
PHUIs 37,242 20,710 13,916 10,140 7,854 6,292
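
As a rough illustration of how much the pruning strategies shrink the search space, the following Python sketch computes the relative reduction in visited nodes between the first and the fourth variant, 1 - N4/N1, from the minUtil1 column of Table II. The names TABLE_II_MINUTIL1 and reduction are illustrative only and are not part of HUPNU; the counts are copied from the table.

```python
# Visited-node counts copied from the minUtil1 column of Table II.
# The names used here are illustrative assumptions, not part of HUPNU.
TABLE_II_MINUTIL1 = {
    "kosarak": {"N1": 109_475_579, "N4": 8_067_669},
    "accidents": {"N1": 194_512, "N4": 144_957},
    "pumsb": {"N1": 2_522_059, "N4": 1_236_179},
    "retail": {"N1": 96_139_509, "N4": 20_968_980},
    "mushroom": {"N1": 1_558_452, "N4": 698_969},
    "T10I4D100K": {"N1": 14_059_513, "N4": 4_454_919},
}


def reduction(n_before: int, n_after: int) -> float:
    """Fraction of visited nodes removed: 1 - n_after / n_before."""
    return 1.0 - n_after / n_before


# Relative reduction from the first variant (N1) to the fourth (N4).
reductions = {
    name: reduction(counts["N1"], counts["N4"])
    for name, counts in TABLE_II_MINUTIL1.items()
}

for name, value in reductions.items():
    print(f"{name}: {value:.1%} fewer visited nodes (N4 versus N1)")
```

With these counts, the reduction is largest on kosarak (about 93%) and smallest on accidents (about 25%), which suggests that the benefit of the additional strategies depends strongly on the dataset.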
TABLE III
DERIVED PATTERNS UNDER VARIED minPro AND FIXED minUtil

Pattern minPro1 minPro2 minPro3 minPro4 minPro5 minPro6
(a) kosarak
N1 79,746,008 73,064,534 66,950,091 61,745,111 57,114,432 53,014,125
N2 12,180,884 11,778,729 11,445,291 11,154,865 10,808,309 10,460,194
N3 10,611,814 10,375,295 10,205,956 10,077,832 9,882,828 9,664,443
N4 6,811,140 6,028,964 5,459,093 5,019,367 4,612,615 4,243,295
PHUIs 35,662 24,589 17,304 12,595 9,362 7,154
(b) accidents
N1 194,512 150,179 112,717 84,525 64,389 48,824
N2 147,052 109,522 81,229 60,283 44,655 32,550
N3 146,426 109,127 80,933 60,098 44,554 32,490
N4 144,957 107,740 79,657 59,029 43,610 31,651
PHUIs 6,331 4,552 3,321 2,411 1,735 1,237
(c) pumsb
N1 2,522,059 1,892,203 1,433,997 1,054,904 780,328 600,511
N2 1,265,985 904,469 654,212 473,337 344,866 256,833
N3 1,238,118 885,941 642,025 466,156 340,510 254,184
N4 1,236,179 883,884 640,104 464,472 338,917 252,599
PHUIs 8,648 5,582 3,630 2,277 1,396 817
(d) retail
N1 37,793,976 37,017,817 36,106,601 35,111,699 34,139,328 33,094,290
N2 16,931,811 16,454,728 15,986,124 15,582,757 15,288,602 15,031,344
N3 16,526,029 16,131,265 15,715,903 15,360,149 15,101,236 14,876,214
N4 16,507,082 16,090,920 15,652,668 15,270,502 14,984,007 14,728,697
PHUIs 14,406 13,048 11,802 10,655 9,660 8,724
(e) mushroom
N1 974,145 923,770 860,295 796,159 727,630 665,124
N2 527,865 485,700 444,208 404,089 363,959 328,273
N3 440,243 406,109 373,114 341,243 309,047 281,344
N4 439,522 405,048 372,035 339,920 307,708 279,939
PHUIs 85,133 75,769 67,538 60,065 52,944 47,398
(f) T10I4D100K
N1 7,194,329 6,992,960 6,804,076 6,632,844 6,457,696 6,259,347
N2 2,365,166 2,195,551 2,034,389 1,879,105 1,718,748 1,549,820
N3 1,655,006 1,519,652 1,398,078 1,296,955 1,211,962 1,133,041
N4 1,570,317 1,384,445 1,225,810 1,097,950 997,283 905,622
PHUIs 20,710 19,278 17,869 16,454 15,132 13,900
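
The relationship N1 > N2 > N3 > N4 > PHUIs stated in the text can be checked mechanically. The following Python sketch does so for the kosarak block of Table III and also reports which share of the nodes visited by the fourth variant end up as PHUIs. The helper names ordering_holds and phui_share are hypothetical and serve only this illustration; the counts are copied from the table.

```python
# Rows copied from the kosarak block of Table III (columns minPro1..minPro6).
# Helper names are illustrative assumptions, not part of HUPNU.
KOSARAK_TABLE_III = {
    "N1": [79_746_008, 73_064_534, 66_950_091, 61_745_111, 57_114_432, 53_014_125],
    "N2": [12_180_884, 11_778_729, 11_445_291, 11_154_865, 10_808_309, 10_460_194],
    "N3": [10_611_814, 10_375_295, 10_205_956, 10_077_832, 9_882_828, 9_664_443],
    "N4": [6_811_140, 6_028_964, 5_459_093, 5_019_367, 4_612_615, 4_243_295],
    "PHUIs": [35_662, 24_589, 17_304, 12_595, 9_362, 7_154],
}
ORDER = ["N1", "N2", "N3", "N4", "PHUIs"]


def ordering_holds(block: dict) -> bool:
    """True if N1 > N2 > N3 > N4 > PHUIs in every threshold column."""
    columns = zip(*(block[name] for name in ORDER))
    return all(
        all(upper > lower for upper, lower in zip(col, col[1:]))
        for col in columns
    )


def phui_share(block: dict) -> list:
    """Per column, the share of N4 visited nodes that are final PHUIs."""
    return [p / n for p, n in zip(block["PHUIs"], block["N4"])]


print("ordering holds:", ordering_holds(KOSARAK_TABLE_III))
print("PHUI share of N4:", [f"{s:.2%}" for s in phui_share(KOSARAK_TABLE_III)])
```

For kosarak, the ordering holds in all six columns, and fewer than 1% of the nodes visited by the fourth variant are final PHUIs, which is in line with the observation that some unpromising patterns still pass the filters.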
[Fig. 5: Scalability on BMS-POS (minPro: 0.0001, minUtil: 10k) with the dataset size varied from 100k to 500k, comparing the four HUPNU variants. Panels: (a) runtime, (b) memory usage, (c) number of visited patterns (N1 to N4) and final PHUIs.]