Entity Relationship Diagram and Basic Database Modeling
Matilda Wilson
Enters a DBMS
[Figure: applications connect (ODBC, JDBC) to the database server (someone else’s C program), which manages the data files]
Functionality of a DBMS
The programmer sees SQL, which has two
components:
• Data Definition Language - DDL
• Data Manipulation Language - DML
– query language
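The split between the two SQL components can be seen in a minimal sketch using Python's built-in sqlite3 module. The column names of Students (name, ssn, category) are assumptions made for illustration; the slides only show the values being inserted.

```python
import sqlite3

# In-memory database; the Students schema is an assumption for illustration.
conn = sqlite3.connect(":memory:")

# DDL: defines the structure that will hold the data.
conn.execute("CREATE TABLE Students (name TEXT, ssn TEXT PRIMARY KEY, category TEXT)")

# DML: inserts data and then queries it.
conn.execute("INSERT INTO Students VALUES ('Charles', '123456789', 'undergraduate')")
rows = conn.execute("SELECT name, category FROM Students").fetchall()
print(rows)
```

The CREATE TABLE statement is DDL, while INSERT and SELECT are DML; SELECT is the query-language part.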
Functionality of a DBMS
Two things to remember:
• Client-server architecture
– Slow, cumbersome connection
– But good for the data
• It is just someone else’s C program
– In the beginning we may be impressed by its speed
– But later we discover that it can be frustratingly slow
– We can do any particular task faster outside the
DBMS
– But the DBMS is general and convenient
INSERT INTO Students
VALUES ('Charles', '123456789', 'undergraduate')
. . . .
Transactions
• Enroll “Mary Johnson” in “CSE444”:
BEGIN TRANSACTION;
INSERT INTO Takes
SELECT Students.SSN, Courses.CID
FROM Students, Courses
WHERE Students.name = 'Mary Johnson' and
Courses.name = 'CSE444'
-- More updates here....
IF everything-went-OK
THEN COMMIT;
ELSE ROLLBACK
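The commit-or-rollback pattern can be sketched with Python's sqlite3 module, where using the connection as a context manager commits on success and rolls back when an exception is raised. The table layouts, the sample SSN and course id, and the helper name enroll are assumptions for illustration, not part of the slides.

```python
import sqlite3

def enroll(conn, student_name, course_name):
    """Enroll a student in a course atomically; return True if committed."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            cur = conn.execute(
                "INSERT INTO Takes "
                "SELECT Students.ssn, Courses.cid FROM Students, Courses "
                "WHERE Students.name = ? AND Courses.name = ?",
                (student_name, course_name),
            )
            if cur.rowcount != 1:
                # "everything-went-OK" is false: force a rollback
                raise ValueError("student or course not found")
        return True
    except (ValueError, sqlite3.IntegrityError):
        return False

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Students (name TEXT, ssn TEXT PRIMARY KEY);
    CREATE TABLE Courses (cid TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE Takes (ssn TEXT, cid TEXT, PRIMARY KEY (ssn, cid));
    INSERT INTO Students VALUES ('Mary Johnson', '111223333');
    INSERT INTO Courses VALUES ('c1', 'CSE444');
""")
```

Calling enroll(conn, 'Mary Johnson', 'CSE444') once commits a row in Takes. A second identical call violates the primary key and is rolled back, leaving the table unchanged.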
Queries
• Find all courses that “Mary” takes
SELECT C.name
FROM Students S, Takes T, Courses C
WHERE S.name = 'Mary' and
S.ssn = T.ssn and T.cid = C.cid
• What happens behind the scenes?
– Query processor figures out how to answer the
query efficiently.
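As a rough illustration, the query can be run against a small in-memory SQLite database, and the engine can be asked how it intends to evaluate it. The sample rows and the course name CSE142 are invented for this sketch; the text of the plan depends on the SQLite version, so it is not reproduced here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Students (ssn TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE Courses (cid TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE Takes (ssn TEXT, cid TEXT);
    INSERT INTO Students VALUES ('1', 'Mary'), ('2', 'Charles');
    INSERT INTO Courses VALUES ('c1', 'CSE444'), ('c2', 'CSE142');
    INSERT INTO Takes VALUES ('1', 'c1'), ('2', 'c2');
""")

query = """
    SELECT C.name
    FROM Students S, Takes T, Courses C
    WHERE S.name = 'Mary' AND S.ssn = T.ssn AND T.cid = C.cid
"""
courses = [row[0] for row in conn.execute(query)]

# EXPLAIN QUERY PLAN reports the access strategy chosen by SQLite's query processor.
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
for step in plan:
    print(step)
```

The query states only what is wanted. The join order and access paths are chosen by the query processor.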
[Figure: query plan for the query above, with operators labeled name = “Mary”, sid = sid, and cid = cid]
Database Systems
• The big commercial database vendors:
– Oracle
– IBM (with DB2) bought Informix recently
– Microsoft (SQL Server)
– Sybase
• Some free database systems (Unix):
– Postgres
– MySQL
– Predator
Database Principles &
Fundamentals of Design,
DATA MODELS
OUTLINE
• What a database is, what it does, and why
database design is important
• How modern databases evolved from files and
file systems
• About flaws in file system data management
• What a DBMS is, what it does, and how it fits
into the database system
• About types of database systems and database
models
OUTLINE (CONTINUED)
Figure 16.1
Database Management
• A database is a shared, integrated computer
structure containing:
– Application (or end user) data
– Metadata (data about data, e.g., data type,
length, required/not required, validation, …)
• Database Management System (DBMS)
– Manages Database structure
– Controls access to data
– Provides query language
Advantages of DBMS
• Makes data management more efficient and
effective
• Query language allows quick answers to ad hoc
(one time) queries
• Provides easier access to better-managed data
• Promotes integrated view of organization’s
operations
• Reduces the probability of inconsistent data (same
data stored in different places with possibility of
different values)
DBMS Manages Interaction
Figure 1.2
Database Design
• Importance of Good Design
– Poor design results in unwanted data redundancy
(unnecessary duplication of data)
– Poor design generates errors leading to decisions
based on incorrect data
• Practical Approach
– Focus on principles and concepts of database
design
– Importance of logical design
Historical Roots of Database
• First computer applications focused on
clerical tasks (e.g., preparing bills)
• Requests for information (e.g., how many bills
were not paid this month) quickly followed
• File systems developed to address needs
– Data organized according to expected use
– Data Processing (DP) specialists
computerized manual file systems
File Terminology
• Data
– Raw Facts
• Field
– Group of characters with specific meaning
• Record
– Logically connected fields that describe a
person, place, or thing
• File
– Collection of related records
Simple File System
Figure 1.5
File System Disadvantages
• File System Data Management
– Requires extensive programming in third-
generation language (3GL)
– Time consuming
– Makes ad hoc queries impossible
– Leads to islands of information
File System Critique (cont’d.)
• Data Dependence
– Change in file’s data characteristics requires
modification of data access programs
– Must tell program what to do and how
– Makes file systems cumbersome from
programming and data management views
• Structural Dependence
– Change in file structure requires modification
of related programs
File System Critique (cont’d.)
• Field Definitions and Naming Conventions
– Flexible record definition anticipates reporting
requirements
– Selection of proper field names important
– Attention to length of field names
– Use of unique record identifiers
File System Critique (cont’d.)
• Data Redundancy
– Different and conflicting versions of same data
– Results of uncontrolled data redundancy
• Data anomalies
– Modification
– Insertion
– Deletion
• Data inconsistency
– Lack of data integrity
Database Systems
• Database consists of logically related data
stored in a single repository
• Provides advantages over file system
management approach
– Eliminates inconsistency, data anomalies, data
dependency, and structural dependency
problems
– Stores data structures, relationships, and
access paths in addition to application data
Database vs. File Systems
Figure 1.6
Database System Environment
Figure 1.7
Database System Types
Implementation Database Models
Hierarchical Database Model
• Logically represented by an upside down
tree
– Each parent can have many children
– Each child has only one parent
Figure 1.8
Hierarchical Database Model
• Advantages
– Conceptual simplicity
– Database security and integrity
– Data independence
– Efficiency
• Disadvantages
– Complex implementation
– Difficult to manage and lack of standards
– Lacks structural independence
– Applications programming and use complexity
– Implementation limitations
Network Database Model
• Each record can have multiple parents
– Composed of sets
– Each set has owner record and member
record
– Member may have several owners
Figure 1.10
Network Database Model
• Advantages
– Conceptual simplicity
– Handles more relationship types
– Data access flexibility
– Promotes database integrity
– Data independence
– Conformance to standards
• Disadvantages
– System complexity
– Lack of structural independence
Relational Database Model
Relational Database Model (cont’d.)
Figure 1.11
Relational Database Model
• Advantages
– Structural independence
– Improved conceptual simplicity
– Easier database design, implementation,
management, and use
– Ad hoc query capability with SQL
– Powerful database management system
Relational Database Model
• Disadvantages
– Substantial hardware and system software
overhead
– Poor design and implementation is made
easy
– May promote “islands of information”
problems
Entity Relationship Database Model
• Complements the relational data model
concepts
• Represented in an entity relationship
diagram (ERD)
• Based on entities, attributes, and
relationships
Figure 1.13
Entity Relationship Database Model
• Advantages
– Exceptional conceptual simplicity
– Visual representation
– Effective communication tool
– Integrated with the relational database model
• Disadvantages
– Limited constraint representation
– Limited relationship representation
– No data manipulation language
– Loss of information content
Design Principle Introduction
Data Modeling and Data Models
• Data models
– Relatively simple representations of complex
real-world data structures
• Often graphical
• Model: an abstraction of a real-world object or
event
– Useful in understanding complexities of the real-
world environment
• Data modeling is iterative and progressive
The Importance of Data Models
Data Model Basic Building Blocks
Hierarchical and Network Models
Hierarchical and Network Models
(cont’d.)
• Network model
– Created to represent complex data relationships
more effectively than the hierarchical model
– Improves database performance
– Imposes a database standard
– Resembles hierarchical model
• Record may have more than one parent
Hierarchical and Network Models
(cont’d.)
– Collection of records in 1:M relationships
– Set composed of two record types:
• Owner
• Member
• Network model concepts still used today:
– Schema
• Conceptual organization of entire database as
viewed by the database administrator
– Subschema
• Database portion “seen” by the application programs
Hierarchical and Network Models
(cont’d.)
– Data management language (DML)
• Defines the environment in which data can be
managed
– Data definition language (DDL)
• Enables the administrator to define the schema
components
The Relational Model
The Relational Model (cont’d.)
The Entity Relationship Model
The Entity Relationship Model (cont’d.)
Object/Relational and XML (cont’d.)
Data Models: A Summary
• Common characteristics:
– Conceptual simplicity with semantic
completeness
– Represent the real world as closely as possible
– Real-world transformations must comply with
consistency and integrity characteristics
• Each new data model capitalized on the
shortcomings of previous models
• Some models better suited for some tasks
Degrees of Data Abstraction
The External Model
The External Model (cont’d.)
The Conceptual Model
The Conceptual Model (cont’d.)
The Internal Model
Summary
• Hierarchical model
– Set of one-to-many (1:M) relationships between
a parent and its children segments
• Network data model
– Uses sets to represent 1:M relationships
between record types
• Relational model
– Current database implementation standard
– ER model is a tool for data modeling
• Complements relational model
Summary (cont’d.)
• Object-oriented data model: object is basic
modeling structure
• Relational model adopted object-oriented
extensions: extended relational data model
(ERDM)
• OO data models depicted using UML
• Data-modeling requirements are a function of
different data views and abstraction levels
– Three abstraction levels: external, conceptual,
and internal
Object-Oriented Database Model
• Objects or abstractions of real-world entities
are stored
– Attributes describe properties
– Collection of similar objects is a class
• Methods represent real world actions of classes
• Classes are organized in a class hierarchy
– Inheritance is ability of object to inherit
attributes and methods of classes above it
OO Data Model
• Advantages
– Adds semantic content
– Visual presentation includes semantic content
– Database integrity
– Both structural and data independence
• Disadvantages
– Lack of OODM
– Complex navigational data access
– Steep learning curve
– High system overhead slows transactions
Database Models and the Internet
• Characteristics of “Internet age” databases
– Flexible, efficient, and secure Internet access
– Easily used, developed, and supported
– Supports complex data types and
relationships
– Seamless interfaces with multiple data
sources and structures
– Simplicity of conceptual database model
– Many database design, implementation, and
application development tools
– Powerful DBMS GUIs make the DBA’s job easier
What is a model?
Importance of Data Models
Obsolete models:
Hierarchical and network models
The Relational Model
• Uses key concepts from mathematical relations (tables)
– “Relational” in “relational model” means “tables” (mathematical relations),
not “relationships”
• Table (relations)
– Intersections of
• rows (various data types) and
• columns (same data type)
• Relations have well defined methods (queries) for combining their
data members
– Selecting (reading) and joining (combining) data is defined based on
mathematical principles
• Relational data management system (RDBMS)
– Relations were originally too advanced for 1970s computing power
– As computing power increased, simplicity of the model prevailed
The Entity Relationship Model
• Enhancement of the relational model
– Relations (tables) become entities
– Very detailed specification of relationships and their properties
• Entity relationship diagram (ERD)
– Uses graphic representations to model database components
• Many variations for notation exist
– In this class, we use the Crow’s Foot notation
Summary of
Data models
• A data model is an abstract way of thinking
about how data is organized
• Although the relational model has become the
dominant data model, it cannot solve all
database challenges
• The Object-Oriented Data Model is useful for
complex data coupled with object-oriented
programming
Objectives
Entities
Attributes
• Characteristics of entities
• Chen notation: attributes represented by ovals
connected to entity rectangle with a line
– Each oval contains the name of attribute it
represents
• Crow’s Foot notation: attributes written in
attribute box below entity rectangle
Attributes (cont’d.)
Attributes (cont’d.)
Attributes (cont’d.)
Relationships
Connectivity and Cardinality
• Connectivity
– Describes the relationship classification
• Cardinality
– Expresses minimum and maximum number of
entity occurrences associated with one
occurrence of related entity
• Established by very concise statements known
as business rules
Existence Dependence
• Existence dependence
– Entity exists in database only when it is
associated with another related entity
occurrence
• Existence independence
– Entity can exist apart from one or more related
entities
– Sometimes such an entity is referred to as a
strong or regular entity
Relationship Strength
Weak Entities
Relationship Participation
• Optional participation
– One entity occurrence does not require
corresponding entity occurrence in particular
relationship
• Mandatory participation
– One entity occurrence requires corresponding
entity occurrence in particular relationship
Relationship Degree
Recursive Relationships
Associative (Composite) Entities
Developing an ER Diagram
• Database design is an iterative process
– Create detailed narrative of organization’s
description of operations
– Identify business rules based on description of
operations
– Identify main entities and relationships from
business rules
– Develop initial ERD
– Identify attributes and primary keys that adequately
describe entities
– Revise and review ERD
Summary (cont’d.)
Summary
Chapter Outline
• Overview of Database Design Process
• Example Database Application (COMPANY)
• ER Model Concepts
– Entities and Attributes
– Entity Types, Value Sets, and Key Attributes
– Relationships and Relationship Types
– Weak Entity Types
– Roles and Attributes in Relationship Types
• ER Diagrams - Notation
• ER Diagram for COMPANY Schema
• Alternative Notations – UML class diagrams, others
Overview of Database Design Process
Example COMPANY Database
• We need to create a database schema design
based on the following (simplified) requirements
of the COMPANY Database:
– The company is organized into DEPARTMENTs.
Each department has a name, number and an
employee who manages the department. We keep
track of the start date of the department manager.
A department may have several locations.
– Each department controls a number of PROJECTs.
Each project has a unique name, unique number
and is located at a single location.
Example COMPANY Database (cont’d.)
– We store each EMPLOYEE’s social security
number, address, salary, sex, and birthdate.
• Each employee works for one department but
may work on several projects.
• We keep track of the number of hours per week
that an employee currently works on each project.
• We also keep track of the direct supervisor of
each employee.
– Each employee may have a number of
DEPENDENTs.
• For each dependent, we keep track of their name,
sex, birthdate, and relationship to the employee.
ER Model Concepts
• Entities and Attributes
– Entities are specific objects or things in the mini-world that are represented in
the database.
• For example the EMPLOYEE John Smith, the
Research DEPARTMENT, the ProductX
PROJECT
– Attributes are properties used to describe an entity.
• For example an EMPLOYEE entity may have the
attributes Name, SSN, Address, Sex, BirthDate
– A specific entity will have a value for each of its attributes.
• For example a specific employee entity may have
Name='John Smith', SSN='123456789',
Address='731, Fondren, Houston, TX', Sex='M',
BirthDate='09-JAN-55'
– Each attribute has a value set (or data type) associated with it – e.g. integer,
string, subrange, enumerated type, …
Types of Attributes (1)
• Simple
– Each entity has a single atomic value for the attribute. For example, SSN or Sex.
• Composite
– The attribute may be composed of several components. For example:
• Address(Apt#, House#, Street, City, State, ZipCode, Country), or
• Name(FirstName, MiddleName, LastName).
• Composition may form a hierarchy where some
components are themselves composite.
• Multi-valued
– An entity may have multiple values for that attribute. For example, Color of a CAR or
PreviousDegrees of a STUDENT.
• Denoted as {Color} or {PreviousDegrees}.
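One way to picture the three attribute kinds is as fields of a record type. The sketch below uses Python dataclasses; the class and field names, the middle initial, and the degree values are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Name:
    """Composite attribute: made up of several component attributes."""
    first_name: str
    middle_name: str
    last_name: str

@dataclass
class Student:
    ssn: str                    # simple attribute: one atomic value
    name: Name                  # composite attribute
    previous_degrees: List[str] = field(default_factory=list)  # multi-valued: {PreviousDegrees}

# Sample values are illustrative only.
student = Student("123456789", Name("John", "B", "Smith"), ["BS", "MS"])
print(student.name.last_name, student.previous_degrees)
```

A relational table cannot hold a list in a single column, so a multi-valued attribute is usually moved to a table of its own during logical design.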
Types of Attributes (2)
Entity Types and Key Attributes (1)
• Entities with the same basic attributes are grouped or
typed into an entity type.
– For example, the entity types EMPLOYEE and PROJECT.
• An attribute of an entity type for which each entity must
have a unique value is called a key attribute of the
entity type.
– For example, SSN of EMPLOYEE.
Entity Types and Key Attributes (2)
Entity Set
Refining the initial design by introducing relationships
Relationship instances of the WORKS_FOR N:1
relationship between EMPLOYEE and DEPARTMENT
Relationship instances of the M:N WORKS_ON
relationship between EMPLOYEE and PROJECT
Relationship type vs. relationship set (1)
• Relationship Type:
– Is the schema description of a relationship
– Identifies the relationship name and the
participating entity types
– Also identifies certain relationship constraints
• Relationship Set:
– The current set of relationship instances
represented in the database
– The current state of a relationship type
Relationship type vs. relationship set (2)
ER DIAGRAM – Relationship Types are:
WORKS_FOR, MANAGES, WORKS_ON, CONTROLS, SUPERVISION, DEPENDENTS_OF
Discussion on Relationship Types
• In the refined design, some attributes from the initial entity types are
refined into relationships:
– Manager of DEPARTMENT -> MANAGES
– Works_on of EMPLOYEE -> WORKS_ON
– Department of EMPLOYEE -> WORKS_FOR
– etc
• In general, more than one relationship type can exist between the same
participating entity types
– MANAGES and WORKS_FOR are distinct relationship types between
EMPLOYEE and DEPARTMENT
– Different meanings and different relationship instances.
Recursive Relationship Type
• A relationship type with the same participating entity type in
distinct roles
• Example: the SUPERVISION relationship
• EMPLOYEE participates twice in two distinct roles:
– supervisor (or boss) role
– supervisee (or subordinate) role
• Each relationship instance relates two distinct EMPLOYEE entities:
– One employee in supervisor role
– One employee in supervisee role
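In a relational implementation, a recursive relationship is commonly realized as a foreign key that points back to the same table. The sketch below assumes an Employee table with a supervisor_ssn column; the column name and sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Employee (
        ssn TEXT PRIMARY KEY,
        name TEXT NOT NULL,
        supervisor_ssn TEXT REFERENCES Employee(ssn)  -- NULL when there is no supervisor
    );
    INSERT INTO Employee VALUES ('1', 'Boss', NULL);
    INSERT INTO Employee VALUES ('2', 'Worker', '1');
""")

# Join the table with itself: E plays the supervisee role, S the supervisor role.
pairs = conn.execute("""
    SELECT E.name, S.name
    FROM Employee E JOIN Employee S ON E.supervisor_ssn = S.ssn
""").fetchall()
print(pairs)
```

The two aliases E and S correspond to the two role names on the SUPERVISION relationship.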
Weak Entity Types
• An entity that does not have a key attribute
• A weak entity must participate in an identifying relationship type with an owner or
identifying entity type
• Entities are identified by the combination of:
– A partial key of the weak entity type
– The particular entity they are related to in the identifying entity type
• Example:
– A DEPENDENT entity is identified by the dependent’s first name, and the specific
EMPLOYEE with whom the dependent is related
– Name of DEPENDENT is the partial key
– DEPENDENT is a weak entity type
– EMPLOYEE is its identifying entity type via the identifying relationship type
DEPENDENTS_OF
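A weak entity type is often mapped to a table whose primary key combines the owner's key with the partial key. The following sketch assumes column names (essn, name, relationship) and sample rows that are not in the slides, and uses ON DELETE CASCADE as one possible way to express existence dependence.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE Employee (ssn TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE Dependent (
        essn TEXT NOT NULL REFERENCES Employee(ssn) ON DELETE CASCADE,
        name TEXT NOT NULL,        -- partial key
        relationship TEXT,
        PRIMARY KEY (essn, name)   -- owner's key + partial key
    );
    INSERT INTO Employee VALUES ('1', 'John Smith'), ('2', 'Jane Doe');
    INSERT INTO Dependent VALUES ('1', 'Alice', 'daughter'), ('2', 'Alice', 'spouse');
""")

# Removing the owner removes its dependents: a weak entity cannot exist alone.
with conn:
    conn.execute("DELETE FROM Employee WHERE ssn = '1'")
remaining = conn.execute("SELECT essn, name FROM Dependent").fetchall()
print(remaining)
```

Two dependents may share the name 'Alice' because the partial key is unique only within one owning employee.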
Constraints on Relationships
• Constraints on Relationship Types
– (Also known as ratio constraints)
– Cardinality Ratio (specifies maximum participation)
• One-to-one (1:1)
• One-to-many (1:N) or Many-to-one (N:1)
• Many-to-many (M:N)
– Existence Dependency Constraint (specifies minimum participation) (also
called participation constraint)
• zero (optional participation, not existence-
dependent)
• one or more (mandatory participation, existence-
dependent)
Many-to-one (N:1) Relationship
Many-to-many (M:N) Relationship
Displaying a recursive relationship
A Recursive Relationship
Supervision
Recursive Relationship Type is: SUPERVISION
(participation role names are shown)
Attributes of Relationship types
• A relationship type can have attributes:
– For example, HoursPerWeek of WORKS_ON
– Its value for each relationship instance describes
the number of hours per week that an
EMPLOYEE works on a PROJECT.
• A value of HoursPerWeek depends on a
particular (employee, project) combination
– Most relationship attributes are used with M:N
relationships
• In 1:N relationships, they can be transferred to
the entity type on the N-side of the relationship
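For an M:N relationship, the relationship attribute typically lives in the table that represents the relationship itself. This sketch assumes a Works_On table keyed by (essn, pnumber); the column names and sample rows are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Works_On (
        essn TEXT,
        pnumber INTEGER,
        hours_per_week REAL,          -- attribute of the relationship, not of either entity
        PRIMARY KEY (essn, pnumber)   -- one row per (employee, project) combination
    );
    INSERT INTO Works_On VALUES
        ('123456789', 1, 32.5), ('123456789', 2, 7.5), ('999', 1, 20.0);
""")

# Total weekly hours per employee, summed over all projects they work on.
total = dict(conn.execute(
    "SELECT essn, SUM(hours_per_week) FROM Works_On GROUP BY essn"
))
print(total)
```

Storing hours_per_week on EMPLOYEE or PROJECT alone would lose the information, since the value depends on the pair.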
Example Attribute of a Relationship
Type:
Hours of WORKS_ON
Notation for Constraints on
Relationships
• Cardinality ratio (of a binary relationship): 1:1,
1:N, N:1, or M:N
– Shown by placing appropriate numbers on the
relationship edges.
• Participation constraint (on each participating
entity type): total (called existence dependency)
or partial.
– Total shown by double line, partial by single line.
• NOTE: These are easy to specify for Binary
Relationship Types.
Alternative (min, max) notation for
relationship structural constraints:
• Specified on each participation of an entity type E in a relationship type R
• Specifies that each entity e in E participates in at least min and at most max relationship
instances in R
• Default (no constraint): min=0, max=n (signifying no limit)
• Must have min ≤ max, min ≥ 0, max ≥ 1
• Derived from the knowledge of mini-world constraints
• Examples:
– A department has exactly one manager and an employee can manage at most one
department.
• Specify (0,1) for participation of EMPLOYEE in MANAGES
• Specify (1,1) for participation of DEPARTMENT in MANAGES
– An employee can work for exactly one department but a department can have any
number of employees.
• Specify (1,1) for participation of EMPLOYEE in WORKS_FOR
• Specify (0,n) for participation of DEPARTMENT in WORKS_FOR
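The meaning of a (min, max) pair can be made concrete with a small checker that counts how often each entity appears in the relationship instances. The function name violations and the sample data are hypothetical; this is a sketch of the definition, not a feature of any DBMS.

```python
from collections import Counter

def violations(entities, instances, position, min_count, max_count=None):
    """Return the entities whose participation count falls outside (min, max).

    instances is a list of tuples; position selects the entity's slot in each
    tuple. max_count=None stands for n (no upper limit).
    """
    counts = Counter(inst[position] for inst in instances)
    return [
        e for e in entities
        if counts[e] < min_count
        or (max_count is not None and counts[e] > max_count)
    ]

# Hypothetical state of the MANAGES relationship: (EMPLOYEE, DEPARTMENT) pairs.
employees = ["e1", "e2", "e3"]
departments = ["d1", "d2"]
manages = [("e1", "d1")]

# EMPLOYEE participates in MANAGES with (0,1); DEPARTMENT with (1,1).
print(violations(employees, manages, 0, 0, 1))
print(violations(departments, manages, 1, 1, 1))
```

In this sample state no employee breaks (0,1), while department d2 breaks (1,1) because it has no manager.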
The (min,max) notation for relationship
constraints
COMPANY ER Schema Diagram using (min,
max) notation
Alternative diagrammatic notation
Summary of notation for ER diagrams
UML class diagrams
UML class diagram for COMPANY database schema
Other alternative diagrammatic notations
Relationships of Higher Degree
Discussion of n-ary relationships (n > 2)
Example of a ternary relationship
Discussion of n-ary relationships (n > 2)
Another example of a ternary relationship
Displaying constraints on higher-degree relationships
• The (min, max) constraints can be displayed on the edges – however, they
do not fully describe the constraints
• Displaying a 1, M, or N indicates additional constraints
– An M or N indicates no constraint
– A 1 indicates that an entity can participate in at most one relationship instance
that has a particular combination of the other participating entities
• In general, both (min, max) and 1, M, or N are needed to describe fully the
constraints
Data Modeling Tools
• A number of popular tools cover conceptual modeling and mapping
into relational schema design.
– Examples: ERWin, S-Designer (Enterprise Application Suite), ER-Studio, etc.
• POSITIVES:
– Serves as documentation of application requirements, easy user interface -
mostly graphics editor support
• NEGATIVES:
– Most tools lack a proper distinct notation for relationships with relationship
attributes
– Mostly represent a relational design in a diagrammatic form rather than a
conceptual ER-based design
(See Chapter 12 for details)
Some of the Currently Available Automated Database
Design Tools
COMPANY | TOOL | FUNCTIONALITY
Embarcadero Technologies | ER Studio | Database Modeling in ER and IDEF1X
Embarcadero Technologies | DB Artisan | Database administration, space and security management
Rational (IBM) | Rational Rose | UML Modeling & application generation in C++/JAVA
Resolution Ltd. | Xcase | Conceptual modeling up to code maintenance
Sybase | Enterprise Application Suite | Data modeling, business logic modeling
Visio | Visio Enterprise | Data modeling, design/reengineering Visual Basic/C++
Extended Entity-Relationship (EER)
Model (in next chapter)
Summary
Entity Integrity:
Selecting Primary Keys
Natural Keys and Primary Keys
• Natural key is a real-world identifier used to uniquely
identify real-world objects
– Familiar to end users and forms part of their day-to-day
business vocabulary
• Generally, data modeler uses natural identifier as
primary key of entity being modeled
• May instead use composite primary key or surrogate
key
– Surrogate key - a PK created to simplify the
identification of entity instances
• Has no meaning, exists only to distinguish one entity
from another (e.g., Autonumber)
Primary Key Guidelines
When to Use Composite
Primary Keys
• Composite primary keys useful in two
cases:
– As identifiers of composite entities
• In which each primary key combination is
allowed once in M:N relationship
– As identifiers of weak entities
• In which weak entity has a strong identifying
relationship with the parent entity
• Automatically provides benefit of ensuring
that there cannot be duplicate values
Composite PK of ENROLL ensures a student cannot register for the same class twice
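The effect of the composite key can be checked with a minimal sketch. The column names stu_num and class_code and the helper register are assumptions for illustration; only the ENROLL entity itself comes from the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Enroll (
        stu_num INTEGER,
        class_code TEXT,
        grade TEXT,
        PRIMARY KEY (stu_num, class_code)  -- each combination is allowed once
    )
""")

def register(stu_num, class_code):
    """Return True if the enrollment was stored, False if it is a duplicate."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute(
                "INSERT INTO Enroll (stu_num, class_code) VALUES (?, ?)",
                (stu_num, class_code),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

The same student can still register for a different class, and a different student can register for the same class; only the repeated pair is rejected.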
When to Use Composite
Primary Keys
• When used as identifiers of weak entities,
composite primary keys normally represent:
– Real-world object that is existence-dependent on
another real-world object
– Real-world object that is represented in data
model as two separate entities in strong
identifying relationship
• Dependent entity exists only when it is related to
parent entity
– EMPLOYEE and DEPENDENT – latter uses a
composite PK containing employee id
– LINE exists only as part of INVOICE
When To Use Surrogate Primary Keys
When To Use Surrogate
Primary Keys
• If you use surrogate key:
– Ensure that candidate key of entity
in question performs properly
– Use “unique index” and “not null”
constraints
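A minimal sketch of this advice, assuming an Employee table in SQLite: emp_id is the surrogate key, and the natural candidate key ssn keeps its NOT NULL and UNIQUE constraints. The table layout and the helper add_employee are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Employee (
        emp_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key, no business meaning
        ssn TEXT NOT NULL UNIQUE,                  -- natural candidate key stays enforced
        name TEXT
    )
""")

def add_employee(ssn, name):
    """Return the generated surrogate key, or None if the candidate key is violated."""
    try:
        with conn:
            cur = conn.execute(
                "INSERT INTO Employee (ssn, name) VALUES (?, ?)", (ssn, name)
            )
        return cur.lastrowid
    except sqlite3.IntegrityError:
        return None
```

Without the UNIQUE and NOT NULL constraints, the surrogate key alone would happily accept the same real-world employee twice.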
When To Use Surrogate
Primary Keys
Design Cases:
Learning Flexible Database Design
• Data modeling and design requires skills
acquired through experience
• Experience acquired through practice
• Four special design cases that highlight:
– Importance of flexible design
– Proper identification of primary keys
– Placement of foreign keys
Design Case 1: Implementing 1:1
Relationships
• Foreign keys work with primary keys to properly
implement relationships in relational model
• Put primary key of the “one” side on the “many”
side as foreign key
– Primary key: parent entity
– Foreign key: dependent entity
Design Case 1: Implementing
1:1 Relationships
• In 1:1 relationship, there are two
options:
– Place a foreign key in both entities
(not recommended)
– Place a foreign key in one of the
entities
• Primary key of one of the two
entities appears as foreign key of
other
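The second option can be sketched as a foreign key that is also declared UNIQUE, so each key value from the other entity appears at most once. The Employee and Department tables, the mgr_ssn column, and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE Employee (ssn TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE Department (
        dnumber INTEGER PRIMARY KEY,
        -- The foreign key lives in one entity only; UNIQUE makes the relationship 1:1.
        mgr_ssn TEXT NOT NULL UNIQUE REFERENCES Employee(ssn)
    );
    INSERT INTO Employee VALUES ('1', 'Ann'), ('2', 'Bob');
    INSERT INTO Department VALUES (10, '1');
""")

def assign_manager(dnumber, mgr_ssn):
    """Insert a department; return False if a key constraint is violated."""
    try:
        with conn:
            conn.execute("INSERT INTO Department VALUES (?, ?)", (dnumber, mgr_ssn))
        return True
    except sqlite3.IntegrityError:
        return False
```

Placing the foreign key in Department rather than Employee avoids nulls, since every department has a manager but most employees manage nothing.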
Design Case 2: Maintaining History of
Time-Variant Data
• Normally, existing attribute values are replaced
with new value without regard to previous value
• Time-variant data:
– Values change over time
– Must keep a history of data changes
• Keeping history of time-variant data equivalent to
having a multivalued attribute in your entity
• Must create new entity in a 1:M relationship with
original entity
• New entity contains new value, date of change
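As an illustration, salary history can be kept in a separate table in a 1:M relationship with the employee. The table names Emp and Salary_Hist, their columns, and the sample figures are assumptions made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Emp (emp_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Salary_Hist (
        emp_id INTEGER NOT NULL REFERENCES Emp(emp_id),
        start_date TEXT NOT NULL,   -- date of change, ISO format so text order is date order
        salary REAL NOT NULL,       -- the new value
        PRIMARY KEY (emp_id, start_date)
    );
    INSERT INTO Emp VALUES (1, 'Ann');
    INSERT INTO Salary_Hist VALUES (1, '2020-01-01', 50000), (1, '2021-06-01', 55000);
""")

# The full history is preserved; the current value is simply the latest row.
history = conn.execute(
    "SELECT start_date, salary FROM Salary_Hist WHERE emp_id = 1 ORDER BY start_date"
).fetchall()
current = history[-1][1]
print(history, current)
```

A salary change becomes an INSERT into Salary_Hist instead of an UPDATE that overwrites the previous value.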
Design Case 3: Fan Traps
Design Case 4:
Redundant Relationships
• Redundancy is seldom a good thing in database
environment
• Occurs when there are multiple relationship paths
between related entities
• Main concern is that redundant relationships
remain consistent across model
• Some designs use redundant relationships to
simplify the design
• In the following example, the relationship between
DIVISION and PLAYER is not needed as all
information can be obtained through TEAM
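The point can be checked with a small sketch in which PLAYER has no direct link to DIVISION, yet each player's division is still reachable through TEAM. The column names and sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Division (div_id INTEGER PRIMARY KEY, div_name TEXT);
    CREATE TABLE Team (
        team_id INTEGER PRIMARY KEY, team_name TEXT,
        div_id INTEGER REFERENCES Division(div_id)
    );
    CREATE TABLE Player (
        player_id INTEGER PRIMARY KEY, player_name TEXT,
        team_id INTEGER REFERENCES Team(team_id)   -- no div_id column needed here
    );
    INSERT INTO Division VALUES (1, 'U-10');
    INSERT INTO Team VALUES (1, 'Tigers', 1);
    INSERT INTO Player VALUES (1, 'Sam', 1);
""")

# The division of a player is derived by following PLAYER -> TEAM -> DIVISION.
rows = conn.execute("""
    SELECT P.player_name, D.div_name
    FROM Player P
    JOIN Team T ON P.team_id = T.team_id
    JOIN Division D ON T.div_id = D.div_id
""").fetchall()
print(rows)
```

A div_id column in Player would duplicate what the join already provides and could drift out of step with the team's division.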
Portion of Tiny College ERD
Tiny College - New Requirement
Tiny College ERD
Matilda Wilson
Database Fundamentals
Objectives
• In this chapter, students will learn:
– What normalization is and what role it plays in
the database design process
– About the normal forms 1NF, 2NF, 3NF, BCNF,
and 4NF
– How normal forms can be transformed from
lower normal forms to higher normal forms
– That normalization and ER modeling are used
concurrently to produce a good database design
– That some situations require denormalization to
generate information efficiently
Database Tables and Normalization
• Normalization
– Process for evaluating and correcting table
structures to minimize data redundancies
• Reduces data anomalies
– Series of stages called normal forms:
• First normal form (1NF)
• Second normal form (2NF)
• Third normal form (3NF)
Database Tables and Normalization
• Normalization (continued)
– 2NF is better than 1NF; 3NF is better than 2NF
– For most business database design purposes,
3NF is as high as needed in normalization
– Highest level of normalization is not always most
desirable
• Denormalization produces a lower normal form
– Increased performance but greater data
redundancy
The Need for Normalization
• Example: company that manages building projects
(Figure 6.1)
– Each project has its own project number, name,
assigned employees, etc.
– Each employee has an employee number, name, job
class
– Charges its clients by billing hours spent on each
contract
– Hourly billing rate is dependent on employee’s position
– Total charge is a derived attribute and not stored in the
table
– Periodically, report is generated that contains
information such as displayed in Table 6.1
The Need for Normalization
• Structure of data set in Figure 6.1 does not handle
data very well
• Table structure appears to work; report is generated
with ease
• Report may yield different results depending on
what data anomaly has occurred
– Employee can be assigned to more than one project
but each project includes only a single occurrence of
any one employee
• Relational database environment is suited to help
designer avoid data integrity problems
The Need for Normalization
• PROJECT_NUM, either a PK or part of a PK, contains
NULLS
• JOB_CLASS values could be abbreviated differently
• Each time an employee is assigned to a project, all
employee information is duplicated
• Update anomalies – Modifying JOB_CLASS for employee
105 requires alterations in two records
• Insertion anomalies – to insert a new employee who has not
been assigned to a project requires a phantom project
• Deletion anomalies – If a project has only one employee
associated with it and that employee leaves, a phantom
employee must be created
The Normalization Process
• Each table represents a single subject
• No data item will be unnecessarily stored in more
than one table
• All nonprime attributes in a table are dependent
on the primary key
• Each table is void of insertion, update, and
deletion anomalies
The Normalization Process (cont’d.)
• The objective of normalization is to ensure that all
tables are in at least 3NF
• Higher forms are not likely to be encountered in
a business environment
• Normalization works one relation at a time
• It progressively breaks a table into a new set of
relations based on identified dependencies
The Normalization Process (cont’d.)
• Partial dependency
– Exists when there is a functional dependence in
which the determinant is only part of the primary key
– If (A,B) → (C,D), B → C, and (A,B) is the PK
• B → C is a partial dependency because only part of the
PK, B, is needed to determine the value of C
• Transitive dependency
– Exists when there are functional dependencies such
that X → Y, Y → Z, and X is the primary key
• X → Z is a transitive dependency because X determines
the value of Z via Y
• The existence of a functional dependence among
nonprime attributes is a sign of transitive dependency
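A functional dependency can be checked against sample data with a small helper. This is a minimal sketch: the function name `holds` and the sample rows are hypothetical, and data can only refute a dependency, never prove that it follows from the business rules.

```python
def holds(rows, determinant, dependent):
    """Return True if determinant -> dependent is not contradicted by rows."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in determinant)
        value = tuple(row[a] for a in dependent)
        # setdefault stores the first value seen for this key;
        # a different later value contradicts the dependency.
        if seen.setdefault(key, value) != value:
            return False
    return True


columns = ("PROJ_NUM", "EMP_NUM", "PROJ_NAME", "JOB_CLASS", "CHG_HOUR", "HOURS")
# Hypothetical sample rows; the PK is (PROJ_NUM, EMP_NUM).
rows = [
    dict(zip(columns, values))
    for values in [
        (15, 103, "Alpha", "Engineer", 80.0, 20.0),
        (15, 105, "Alpha", "Designer", 100.0, 12.5),
        (18, 105, "Beta", "Designer", 100.0, 30.0),
        (18, 114, "Beta", "Engineer", 80.0, 8.0),
    ]
]

# Partial dependencies: the determinant is only part of the PK.
partial_proj = holds(rows, ["PROJ_NUM"], ["PROJ_NAME"])
partial_emp = holds(rows, ["EMP_NUM"], ["JOB_CLASS"])
# Dependency among nonprime attributes: a sign of a transitive dependency.
transitive = holds(rows, ["JOB_CLASS"], ["CHG_HOUR"])
# HOURS needs the whole key, so part of the key alone does not determine it.
hours_by_proj = holds(rows, ["PROJ_NUM"], ["HOURS"])
```

In this sample the first three checks succeed and the last fails, which mirrors the distinction between attributes that depend on part of the key and attributes that depend on the whole key.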
Conversion to First Normal Form
• Repeating group
– Group of multiple entries of same type can exist
for any single key attribute occurrence
• Relational table must not contain repeating
groups
• Normalizing table structure will reduce data
redundancies
• Conversion to 1NF is a three-step procedure
Conversion to First Normal Form (cont’d.)
• Step 1: Eliminate the Repeating Groups
– Eliminate nulls: each repeating group attribute
contains an appropriate data value
• Step 2: Identify the Primary Key
– The PK must uniquely identify each attribute value
– Here a new composite key must be composed
(PROJ_NUM plus EMP_NUM)
• Step 3: Identify All Dependencies
– Dependencies are depicted with a diagram
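Steps 1 and 2 can be sketched in a few lines. The row layout and values are assumptions for illustration: project columns are left empty (None) on the continuation rows of each repeating group, as they would be in a report-style table.

```python
# Hypothetical report rows: (PROJ_NUM, PROJ_NAME, EMP_NUM, HOURS).
# None marks the nulls left by the repeating group.
raw_rows = [
    (15, "Alpha", 103, 20.0),
    (None, None, 105, 12.5),
    (18, "Beta", 105, 30.0),
    (None, None, 114, 8.0),
]


def fill_repeating_groups(rows):
    """Step 1: copy the project values down so no row contains nulls."""
    filled = []
    last_project = (None, None)
    for proj_num, proj_name, emp_num, hours in rows:
        if proj_num is None:
            proj_num, proj_name = last_project
        else:
            last_project = (proj_num, proj_name)
        filled.append((proj_num, proj_name, emp_num, hours))
    return filled


rows_1nf = fill_repeating_groups(raw_rows)

# Step 2: check that the composite key (PROJ_NUM, EMP_NUM) is unique.
keys = [(row[0], row[2]) for row in rows_1nf]
key_is_unique = len(set(keys)) == len(keys)
```

Neither PROJ_NUM nor EMP_NUM is unique on its own in the filled rows, which is why the key has to be composite.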
Conversion to First Normal Form (cont’d.)
• Dependency diagram:
– Depicts all dependencies found within given
table structure
– Helpful in getting bird’s-eye view of all
relationships among table’s attributes
Conversion to First Normal Form
• First normal form describes tabular format:
– All key attributes are defined
– No repeating groups in the table
– All attributes are dependent on primary key
• All relational tables satisfy 1NF requirements
• Some tables contain partial dependencies
– Dependencies are based on only part of the primary
key
– Such tables should be used with caution
Conversion to Second Normal Form
• Conversion to 2NF occurs only when the 1NF has a composite key
– If the 1NF key is a single attribute, then the table is automatically in
2NF
• Step 1: Make New Tables to Eliminate Partial Dependencies
– For each component of the PK that acts as a determinant in a partial
dependency, create a new table with a copy of that component as the
PK
– These components also remain in the original table in order to serve
as FKs to the new tables
– Write each key component on a separate line; then write the original
composite key on the last line. Each component will become the key
in a new table
PROJ_NUM
EMP_NUM
PROJ_NUM EMP_NUM
Conversion to Second Normal Form (cont’d.)
• Step 2: Reassign Corresponding Dependent Attributes
– The dependencies for the original key components are found
by examining the arrows below the dependency diagram in
Fig 6.3
– The attributes in a partial dependency are removed from the
original table and placed in the new table with the
dependency’s determinant
– Any attributes that are not dependent in a partial
dependency remain in the original table
– At this point, most anomalies have been eliminated
PROJECT(PROJ_NUM, PROJ_NAME)
EMPLOYEE(EMP_NUM, EMP_NAME, JOB_CLASS, CHG_HOUR)
ASSIGNMENT(PROJ_NUM, EMP_NUM, ASSIGN_HOURS)
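The three 2NF relations can be written as table definitions. This is a sketch under assumptions: the column types, the NOT NULL constraints, and the sample values are not given in the slides. SQLite enforces foreign keys only when the `foreign_keys` pragma is switched on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript("""
CREATE TABLE project (
    proj_num  INTEGER PRIMARY KEY,
    proj_name TEXT NOT NULL
);
CREATE TABLE employee (
    emp_num   INTEGER PRIMARY KEY,
    emp_name  TEXT NOT NULL,
    job_class TEXT NOT NULL,
    chg_hour  REAL NOT NULL   -- still here in 2NF; removed in 3NF
);
CREATE TABLE assignment (
    proj_num     INTEGER NOT NULL REFERENCES project(proj_num),
    emp_num      INTEGER NOT NULL REFERENCES employee(emp_num),
    assign_hours REAL NOT NULL,
    PRIMARY KEY (proj_num, emp_num)
);
""")

# The insertion anomaly is gone: an employee can exist without a project.
conn.execute("INSERT INTO employee VALUES (120, 'New Hire', 'Engineer', 80.0)")
employee_count = conn.execute("SELECT COUNT(*) FROM employee").fetchone()[0]

# An assignment to a project that does not exist is rejected by the FK.
try:
    conn.execute("INSERT INTO assignment VALUES (99, 120, 5.0)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
```

The key components PROJ_NUM and EMP_NUM stay in ASSIGNMENT as foreign keys, while each also serves as the primary key of its own table.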
Conversion to Third Normal Form
• Step 2: Reassign Corresponding Dependent
Attributes
– Identify attributes dependent on each determinant
identified in Step 1
• Identify dependency
– Name table to reflect its contents and function
PROJECT(PROJ_NUM, PROJ_NAME)
ASSIGNMENT(PROJ_NUM, EMP_NUM, ASSIGN_HOURS)
EMPLOYEE(EMP_NUM, EMP_NAME, JOB_CLASS)
JOB(JOB_CLASS, CHG_HOUR)
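Moving CHG_HOUR into JOB means a rate is stored once per job class. The sketch below uses assumed column types and made-up rows to show that a single-row update then reaches every employee in that class.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE job (
    job_class TEXT PRIMARY KEY,
    chg_hour  REAL NOT NULL
);
CREATE TABLE employee (
    emp_num   INTEGER PRIMARY KEY,
    emp_name  TEXT NOT NULL,
    job_class TEXT NOT NULL REFERENCES job(job_class)
);
-- Hypothetical sample data
INSERT INTO job VALUES ('Engineer', 80.0), ('Designer', 100.0);
INSERT INTO employee VALUES (103, 'Employee A', 'Engineer'),
                            (114, 'Employee B', 'Engineer'),
                            (105, 'Employee C', 'Designer');
""")

# One row changes; no employee rows are touched.
conn.execute("UPDATE job SET chg_hour = 85.0 WHERE job_class = 'Engineer'")

# The rate for each employee is recovered with a join.
rates = dict(
    conn.execute(
        "SELECT e.emp_num, j.chg_hour "
        "FROM employee e JOIN job j ON j.job_class = e.job_class"
    )
)
```

Both engineers pick up the new rate through the join, so the update anomaly tied to the transitive dependency cannot occur.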
Conversion to Third Normal Form
• A table is in third normal form (3NF)
when both of the following are true:
– It is in 2NF
– It contains no transitive dependencies
Conversion to Third Normal Form
1NF → 2NF – remove partial dependencies
2NF → 3NF – remove transitive dependencies
• In both cases, the answer is to create a new
table
– The determinant of the problem dependency
remains in the original table and is placed as
the PK of the new table
– The dependents of the problem dependency
are removed from the original table and
placed as nonprime attributes in the new table
Improving the Design
• Table structures should be cleaned up to eliminate
initial partial and transitive dependencies
• Normalization cannot, by itself, be relied on to
make good designs
• It is valuable because it helps eliminate data
redundancies
• If a table has multiple candidate keys and one is a
composite key, there can be partial dependencies
even when the PK is a single attribute
– Resolve in 3NF as a transitive dependency
Improving the Design (cont’d.)
• Issues to address, in order, to produce a good
normalized set of tables:
– Evaluate PK Assignments
• Use JOB_CODE as the PK for the JOB table rather than
JOB_CLASS to avoid data-entry errors when it is
used as a FK in EMPLOYEE (e.g., “DB Designer”
vs. “Database Designer”)
• JOB(JOB_CODE, JOB_CLASS, CHG_HOUR)
• Why is JOB_CLASS → CHG_HOUR not a
transitive dependency? (Because JOB_CLASS is
a candidate key)
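One way to express that JOB_CLASS remains a candidate key after JOB_CODE becomes the PK is a UNIQUE constraint. This is a sketch, not the slides' own definition: the types, the constraint choice, and the sample codes are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE job (
    job_code  INTEGER PRIMARY KEY,
    job_class TEXT NOT NULL UNIQUE,   -- still a candidate key
    chg_hour  REAL NOT NULL
);
CREATE TABLE employee (
    emp_num  INTEGER PRIMARY KEY,
    emp_name TEXT NOT NULL,
    job_code INTEGER NOT NULL REFERENCES job(job_code)
);
-- Hypothetical sample data
INSERT INTO job VALUES (500, 'Database Designer', 105.0);
INSERT INTO employee VALUES (105, 'Employee C', 500);
""")

# A second JOB row with the same JOB_CLASS is rejected,
# so JOB_CLASS determines CHG_HOUR just as JOB_CODE does.
try:
    conn.execute("INSERT INTO job VALUES (501, 'Database Designer', 110.0)")
    duplicate_class_rejected = False
except sqlite3.IntegrityError:
    duplicate_class_rejected = True
```

EMPLOYEE now stores only the short code, so the job description is typed once, in JOB, rather than once per employee.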
Improving the Design (cont’d.)
– Evaluate Naming Conventions
• CHG_HOUR should be JOB_CHG_HOUR
• JOB_DESCRIPTION is a better name than
JOB_CLASS
– Refine Attribute Atomicity
• Atomic attribute – one that cannot be further
subdivided
– EMP_NAME is not atomic; it can be split into
last name, first name, and initial
– Identify New Attributes
• YTD gross salary, social security payments, hire
date
Improving the Design (cont’d.)
– Identify New Relationships
• To track the manager of each project, put
EMP_NUM as a FK in PROJECT
– Refine Primary Keys as Required for Data
Granularity
• What does ASSIGN_HOURS represent? Yearly
total hours, weekly, daily?
• If multiple daily entries are needed for the same project
and employee number, use a surrogate key ASSIGN_NUM to
avoid duplicate values of the PK (EMP_NUM,
PROJ_NUM, ASSIGN_DATE)
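A surrogate key lets two entries share the same employee, project, and date. In this sketch the column types and sample entries are assumptions; `INTEGER PRIMARY KEY` is the SQLite idiom for a system-assigned key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE assignment (
    assign_num   INTEGER PRIMARY KEY,   -- surrogate key, assigned by SQLite
    assign_date  TEXT NOT NULL,
    proj_num     INTEGER NOT NULL,
    emp_num      INTEGER NOT NULL,
    assign_hours REAL NOT NULL
)
""")

# Hypothetical entries: same day, same project, same employee.
entries = [
    ("2024-03-01", 15, 105, 2.5),
    ("2024-03-01", 15, 105, 1.0),
]
conn.executemany(
    "INSERT INTO assignment (assign_date, proj_num, emp_num, assign_hours) "
    "VALUES (?, ?, ?, ?)",
    entries,
)

assign_nums = [
    row[0]
    for row in conn.execute("SELECT assign_num FROM assignment ORDER BY assign_num")
]
daily_hours = conn.execute(
    "SELECT SUM(assign_hours) FROM assignment "
    "WHERE assign_date = '2024-03-01' AND proj_num = 15 AND emp_num = 105"
).fetchone()[0]
```

With (EMP_NUM, PROJ_NUM, ASSIGN_DATE) as the PK, the second insert would have been rejected as a duplicate; with ASSIGN_NUM both rows are kept and can be summed per day.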
Improving the Design (cont’d.)
– Maintain Historical Accuracy
• An employee’s job charge could change over the
lifetime of a project. In order to reconstruct the
charges to a project, another field with the job
charge and date active is required
– Evaluate Using Derived Attributes
• Store rather than derive the charge if it will speed up
reporting
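Both points can be illustrated by copying the rate into the assignment row when the entry is recorded. The column names ASSIGN_CHG_HOUR and ASSIGN_CHARGE, the helper `record_assignment`, and the values are assumptions made for this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE job (
    job_code     INTEGER PRIMARY KEY,
    job_chg_hour REAL NOT NULL
);
CREATE TABLE assignment (
    assign_num      INTEGER PRIMARY KEY,
    assign_date     TEXT NOT NULL,
    job_code        INTEGER NOT NULL REFERENCES job(job_code),
    assign_hours    REAL NOT NULL,
    assign_chg_hour REAL NOT NULL,   -- rate in effect on assign_date
    assign_charge   REAL NOT NULL    -- stored derived attribute
);
INSERT INTO job VALUES (501, 80.0);
""")


def record_assignment(conn, assign_date, job_code, hours):
    """Store the hours together with the rate that applies right now."""
    (rate,) = conn.execute(
        "SELECT job_chg_hour FROM job WHERE job_code = ?", (job_code,)
    ).fetchone()
    conn.execute(
        "INSERT INTO assignment "
        "(assign_date, job_code, assign_hours, assign_chg_hour, assign_charge) "
        "VALUES (?, ?, ?, ?, ?)",
        (assign_date, job_code, hours, rate, hours * rate),
    )


record_assignment(conn, "2024-03-01", 501, 2.0)
conn.execute("UPDATE job SET job_chg_hour = 90.0 WHERE job_code = 501")
record_assignment(conn, "2024-03-02", 501, 2.0)

charges = [
    row[0]
    for row in conn.execute(
        "SELECT assign_charge FROM assignment ORDER BY assign_num"
    )
]
```

The first entry keeps the charge computed at the old rate even after JOB is updated. The trade-off is deliberate redundancy: the stored charge must be kept consistent with the hours and rate it was derived from.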
Surrogate Key Considerations
Higher-Level Normal Forms
The Boyce-Codd Normal Form
Fourth Normal Form (4NF)
Normalization and Database Design
• ER diagram
– Identify relevant entities, their attributes, and
their relationships
– Identify additional entities and attributes
• Normalization procedures
– Focus on characteristics of specific entities
– Micro view of entities within ER diagram
• Difficult to separate normalization process from
ER modeling process
Normalization and Database Design
• Given the following business rules:
– The company manages many projects
– Each project requires the services of many employees
– An employee may be assigned to several projects
– Some employees are not assigned to a project and perform
non-project related duties. Some employees are part of a
labor pool and shared by all project teams
– Each employee has a single primary job classification which
determines the hourly billing rate
– Many employees can have the same job classification.
Normalization and Database Design
• We initially define the following entities
PROJECT(PROJ_NUM, PROJ_NAME)
EMPLOYEE(EMP_NUM, EMP_LNAME, EMP_FNAME, EMP_INITIAL, JOB_DESCRIPTION, JOB_CHG_HOUR)
• PROJECT is in 3NF and needs no modification
• EMPLOYEE contains a transitive dependency (EMP_NUM → JOB_DESCRIPTION → JOB_CHG_HOUR), so we remove it and now have
PROJECT(PROJ_NUM, PROJ_NAME)
EMPLOYEE(EMP_NUM, EMP_LNAME, EMP_FNAME, EMP_INITIAL, JOB_CODE)
JOB(JOB_CODE, JOB_DESCRIPTION, JOB_CHG_HOUR)
Normalization and Database Design
• To represent the M:N relationship between EMPLOYEE and
PROJECT, we could try two 1:M relationships
• An employee can be assigned to many projects
• Each project can have many employees assigned to it
Normalization and Database Design
• Because this M:N relationship cannot be implemented directly, we include the ASSIGNMENT
entity to track the assignment of employees to projects
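With ASSIGNMENT as the bridge, both directions of the M:N relationship become ordinary queries. The schema below is a trimmed sketch with assumed types and made-up rows; employee 120 stands in for an employee who is not assigned to any project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE project (
    proj_num  INTEGER PRIMARY KEY,
    proj_name TEXT NOT NULL
);
CREATE TABLE employee (
    emp_num   INTEGER PRIMARY KEY,
    emp_lname TEXT NOT NULL
);
CREATE TABLE assignment (
    assign_num   INTEGER PRIMARY KEY,
    proj_num     INTEGER NOT NULL REFERENCES project(proj_num),
    emp_num      INTEGER NOT NULL REFERENCES employee(emp_num),
    assign_hours REAL NOT NULL
);
-- Hypothetical sample data
INSERT INTO project VALUES (15, 'Project A'), (18, 'Project B');
INSERT INTO employee VALUES (103, 'Employee A'), (105, 'Employee B'),
                            (120, 'Employee C');
INSERT INTO assignment (proj_num, emp_num, assign_hours)
VALUES (15, 103, 20.0), (15, 105, 12.5), (18, 105, 30.0);
""")

# Employee side: the LEFT JOIN keeps employees with no assignment.
projects_per_employee = dict(
    conn.execute("""
        SELECT e.emp_num, COUNT(DISTINCT a.proj_num)
        FROM employee e
        LEFT JOIN assignment a ON a.emp_num = e.emp_num
        GROUP BY e.emp_num
    """)
)

# Project side: all employees assigned to project 15.
employees_on_15 = [
    row[0]
    for row in conn.execute(
        "SELECT emp_num FROM assignment WHERE proj_num = 15 ORDER BY emp_num"
    )
]
```

Each ASSIGNMENT row sits on the "many" side of two 1:M relationships, one to PROJECT and one to EMPLOYEE.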
Normalization and Database Design
• ASSIGN_HOURS is assigned to ASSIGNMENT
• A “manages” relationship is added in order to keep detailed
information about each project’s manager
• Some additional attributes are added to maintain additional
information
PROJECT(PROJ_NUM, PROJ_NAME, EMP_NUM)
Denormalization
Denormalization (cont’d.)
• Joining a larger number of tables reduces system
speed
• Conflicts are often resolved through compromises
that may include denormalization
• Defects of unnormalized tables:
– Data updates are less efficient because tables are
larger
– Indexing is more cumbersome as there are more
fields per table
– No simple strategies for creating virtual tables known
as views
Denormalization
• In order to generate the report below, a temporary
denormalized table is used since the last four
semesters of each faculty member could be
different due to sabbatical, leave, start date, etc
Denormalization
• EVALDATA is the master data table, which is normalized
• FACHIST is created via a series of queries in order to
produce the desired report
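A temporary table of this kind might be built as follows. This is a sketch only: the slides do not give the columns of EVALDATA or FACHIST, so `fac_id`, `semester`, `eval_mean`, the sortable semester codes, and all rows are assumptions. The query keeps, for each faculty member, the rows with fewer than four newer semesters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE evaldata (
    fac_id    INTEGER NOT NULL,
    semester  TEXT NOT NULL,      -- sortable code such as '2023-2'
    eval_mean REAL NOT NULL,
    PRIMARY KEY (fac_id, semester)
);
-- Faculty 1 taught five consecutive semesters;
-- faculty 2 has gaps (for example, leave or sabbatical).
INSERT INTO evaldata VALUES
    (1, '2021-2', 4.1), (1, '2022-1', 4.3), (1, '2022-2', 4.0),
    (1, '2023-1', 4.4), (1, '2023-2', 4.2),
    (2, '2020-2', 3.9), (2, '2021-1', 4.0), (2, '2022-2', 4.5),
    (2, '2023-2', 4.1);

-- Temporary denormalized table: the last four semesters per faculty member.
CREATE TEMP TABLE fachist AS
SELECT e.fac_id, e.semester, e.eval_mean
FROM evaldata e
WHERE (SELECT COUNT(*)
       FROM evaldata newer
       WHERE newer.fac_id = e.fac_id
         AND newer.semester > e.semester) < 4;
""")

fachist_rows = conn.execute(
    "SELECT fac_id, semester FROM fachist ORDER BY fac_id, semester"
).fetchall()
```

The four semesters kept for each faculty member differ, which is why a fixed date range on the normalized table would not produce the report directly. EVALDATA stays the master copy; FACHIST can be dropped and rebuilt whenever the report is needed.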
Data-Modeling Checklist