Data Mining Worksheet One
1. Suppose that the data for analysis includes the attribute age. The age values for the data
tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33,
33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
a. What is the mean of the data? What is the median?
b. What is the mode of the data? Comment on the data’s modality (i.e., bimodal,
trimodal, etc.).
c. What is the midrange of the data?
d. Can you find (roughly) the first quartile (Q1) and the third quartile (Q3) of the data?
e. Give the five-number summary of the data.
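One way to check answers to question 1 is with Python's standard statistics module. This is a minimal sketch, not a model answer: the variable names are illustrative, and the quartiles depend on the convention chosen (here the module's default "exclusive" method), so hand-computed values for Q1 and Q3 may differ slightly.

```python
import statistics

# Age values from question 1, already in increasing order.
ages = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33,
        33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]

mean = statistics.mean(ages)
median = statistics.median(ages)
modes = statistics.multimode(ages)        # every value tied for most frequent
midrange = (min(ages) + max(ages)) / 2    # average of smallest and largest value

# Quartile cut points; the default method is "exclusive" (an assumption here,
# other textbooks use slightly different conventions).
q1, q2, q3 = statistics.quantiles(ages, n=4)

# Five-number summary: minimum, Q1, median, Q3, maximum.
five_number = (min(ages), q1, median, q3, max(ages))

print(f"mean={mean:.2f} median={median} modes={modes} midrange={midrange}")
print("five-number summary:", five_number)
```

The length of the list returned by multimode indicates the modality: two values means the data are bimodal, three means trimodal, and so on.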
2. Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
a. Compute the Euclidean distance between the two objects.
b. Compute the Manhattan distance between the two objects.
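The two distances in question 2 can be sketched as follows. The function names euclidean and manhattan are illustrative helpers, not part of any library.

```python
import math

def euclidean(p, q):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

# The two objects from question 2.
x = (22, 1, 42, 10)
y = (20, 0, 36, 8)

d_euclidean = euclidean(x, y)
d_manhattan = manhattan(x, y)

print(f"Euclidean: {d_euclidean:.3f}")
print(f"Manhattan: {d_manhattan}")
```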
3. In real-world data, tuples with missing values for some attributes are a common occurrence.
Describe various methods for handling this problem.
4. Suppose that the data for analysis includes the attribute age. The age values for the data
tuples are 25, 13, 33, 15, 16, 19, 20, 20, 21, 22, 25, 30, 33, 25, 35, 35, 25, 36, 40, 35, 45, 35,
46, 52, 70, 22, 16.
a. Use smoothing by bin means to smooth the above data, using a bin depth of 3.
Illustrate your steps.
b. Repeat the smoothing using equal-width bins of width 3.
c. Use min-max normalization to transform the value 35 for age onto the range [0.0,
1.0].
d. Use z-score normalization to transform the value 35 for age, where the standard
deviation of age is 12.94 years.
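Smoothing by bin means with equal-depth bins, as in part (a) of question 4, can be sketched as below. The helper smooth_by_bin_means is a hypothetical name, and the sketch assumes the usual procedure: sort the values, split them into consecutive bins of the given depth, and replace every value in a bin with that bin's mean.

```python
def smooth_by_bin_means(values, depth):
    """Equal-depth binning: sort, cut into bins of `depth` values,
    and replace each value with the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), depth):
        current_bin = ordered[start:start + depth]
        bin_mean = sum(current_bin) / len(current_bin)
        smoothed.extend([bin_mean] * len(current_bin))
    return smoothed

# Age values from question 4, in the order given.
ages = [25, 13, 33, 15, 16, 19, 20, 20, 21, 22, 25, 30, 33, 25, 35, 35,
        25, 36, 40, 35, 45, 35, 46, 52, 70, 22, 16]

smoothed = smooth_by_bin_means(ages, depth=3)

# Show each bin next to its smoothed value.
ordered = sorted(ages)
for i in range(0, len(ordered), 3):
    print(ordered[i:i + 3], "->", round(smoothed[i], 2))
```

With 27 values and a depth of 3 this produces nine bins. An equal-width variant would instead assign values to bins by fixed value ranges, so bins could hold different numbers of values.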
5. Use the methods below to normalize the following group of data: 200, 300, 400, 600, 1000
a. min-max normalization, setting the new minimum to 0 and the new maximum to 1
b. z-score normalization
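Both normalizations in question 5 can be sketched with a few lines of Python. The helper names min_max and z_scores are illustrative. The z-score version assumes the population standard deviation (dividing by n); if the course uses the sample standard deviation (dividing by n - 1), swap statistics.pstdev for statistics.stdev.

```python
import statistics

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so min(values) -> new_min and max(values) -> new_max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_scores(values):
    """Subtract the mean and divide by the standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # assumption: population standard deviation
    return [(v - mu) / sigma for v in values]

# Data from question 5.
data = [200, 300, 400, 600, 1000]

scaled = min_max(data)
standardized = z_scores(data)

print("min-max:", scaled)
print("z-score:", [round(z, 3) for z in standardized])
```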
6. How do you differentiate outliers from noise?
7. Describe the differences between operational database systems and data warehouses.
8. Briefly compare the following concepts. You may use an example to explain your point(s).
a. Snowflake schema, fact constellation, star schema models
b. Data cleaning, data transformation, refresh
9. Suppose that a data warehouse consists of the three dimensions time, doctor, and patient,
and the two measures count and charge, where charge is the fee that a doctor charges a
patient for a visit.
a. Enumerate three classes of schemas that are popularly used for modeling data
warehouses.
b. Draw a schema diagram for the above data warehouse using one of the schema
classes listed in (a).
c. Starting with the base cuboid [day, doctor, patient], what specific OLAP operations
should be performed in order to list the total fee collected by each doctor in 2010?
d. To obtain the same list, write an SQL query assuming the data is stored in a
relational database with the schema fee (day, month, year, doctor, hospital, patient,
count, charge).
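One possible query for part (d) of question 9 can be tried against an in-memory SQLite database from Python. This is a sketch under stated assumptions: the column types and the four sample rows are made up purely for illustration, and only the table and column names come from the schema given in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema from part (d); the column types are an assumption.
conn.execute(
    'CREATE TABLE fee (day INTEGER, month INTEGER, year INTEGER, '
    'doctor TEXT, hospital TEXT, patient TEXT, "count" INTEGER, charge REAL)'
)

# Hypothetical rows, invented only to exercise the query.
rows = [
    (5, 1, 2010, "Dr. A", "H1", "P1", 1, 100.0),
    (9, 3, 2010, "Dr. A", "H1", "P2", 1, 150.0),
    (2, 7, 2010, "Dr. B", "H2", "P3", 1, 80.0),
    (4, 2, 2011, "Dr. B", "H2", "P1", 1, 999.0),  # outside 2010, must be excluded
]
conn.executemany("INSERT INTO fee VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)

# Restrict to the year 2010, then total the charge for each doctor.
query = """
    SELECT doctor, SUM(charge) AS total_fee
    FROM fee
    WHERE year = 2010
    GROUP BY doctor
    ORDER BY doctor
"""
totals = conn.execute(query).fetchall()

for doctor, total_fee in totals:
    print(doctor, total_fee)
```

The WHERE clause plays the role of selecting the year 2010 on the time dimension, and the GROUP BY on doctor corresponds to aggregating away the patient dimension.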