|
Presented by Dr. Gordon C. Everest
Time: Registration/Refreshments - 8:30 - 9:00 a.m. Presentation - 9:00 - 11:00 a.m. Announcements - 11:00 - 11:30 a.m.
Location: Standard Insurance Auditorium
Entity
Relationship (ER) data modeling goes back to a paper by Peter Chen in 1976. He
proposed a data model diagraming scheme which would transcend thinking about
physical records, by focusing on the entities and their interrelationships in
the users' real world domain of interest being modeled in a database. When
relational DBMSs came on the scene, ER Diagraming was a good and natural scheme
for designing relational databases. Today, ER Diagrams and Relational DBMSs are
widely used throughout the world in the development of computer-based
information systems.
Data modeling is the foundation of information systems development - if you
don't get the database design "right" then the systems you build will be like a
house of cards, collapsing under the weight of inevitable future forces for
revision, enhancement, integration, and quality improvement. Therefore, we need
a scheme to guide our data modeling efforts which helps produce data models that
are a clear and accurate representation of the users' domain of discourse, and
which facilitate human communication, understanding, and validation.
Many popular data modeling schemes and supporting design (CASE) tools in use
today are essentially variations on ER Diagrams, including Teorey's Extended ER,
Finkelstein's Information Engineering (IE, and adapted by Martin), Appleton's
IDEF1X used by the U. S. Government, Barker in Oracle Designer*2000,
PowerDesigner from Sybase, ERwin now from Computer Associates, Kroenke's
Semantic Object Model (SOM), and even UML Class Diagrams. This talk could
equally be entitled, "What's Wrong with them?"
To answer the "What's Wrong..." question, we need to begin by asking:
- WHAT are we trying to do in Data Modeling?
- WHY do we do data modeling? - the purpose and objectives.
- HOW do we do data modeling? - what drives or guides the process?
- By what criteria do we judge or evaluate a data modeling scheme?
- What is the essential, distinguishing characteristic which underlies all ER
modeling schemes?
- How well does ER modeling satisfy the criteria for a good data modeling
scheme?
Think about these questions. Then we will try to come to some consensus before
launching into a discussion of what's wrong with ER modeling, backed up with
several examples. Having convinced you that there is a problem, we will point
you in the direction of a solution -- a better way of doing data modeling.
Also in preparation for this discussion, please do this simple, little
exercise. Given the following ER Diagram:
ENTITY
---------------------------
| X | A | B | ...
---------------------------
* What does a person understand about a data model when presented with an
ER diagram?
First of all, seeing an entity box labeled with the name of an entity type, one
generally presumes there are attributes contained within the box. So it is
reasonable to presume that one is really thinking about an entity type (as shown
above) having an identifier, say X, and described by some attributes, A, B, etc.
- What semantics are conveyed or presumed by this single-entity schema
diagram? by this record structure? Assume you have ALL the information about
X, A, and B, and it is ALL shown in this ER diagram.
- It is helpful to frame the response as follows:
What does the diagram say (or presume)...
* about X alone?
* about A alone(and similarly about B, etc.)?
* about the relationship between X and A (and similarly X-B)?
* about the relationship between A and B?
The response should be in terms of: existence, attributes, relationships, and
the characteristics of dependency/optionality (mandatory or NULLS ALLOWED), and
multiplicity/?exclusivity (1:1, 1:many, M:N). The response needs to go beyond
the obvious and state all the semantics implied by this little diagram. No
counting observations which are true by construction, e.g., column names are
unique, column ordering is immaterial, or attribute values may change over
time. Use only positive statements, that is, do NOT express what semantics are
NOT implied by the diagram (there are hundreds of things you could say that the
diagram does not say!). Also, state only what is, not what semantics could be
implied by the diagram. You should be able to come up with at least 15 distinct
semantic statements implied by the above single-entity schema diagram -- number
your statements.
As a start, (1) there exists a domain of
values called X .
Furthermore, what does the diagram say about normalization, the Achilles heel of
data modelers? Is it normalized? How do you know?
This talk argues that the distinction between entities and attributes is
artificial, hence troublesome for the data modeler, and ultimately unnecessary
for conceptual data modeling. Prematurely clustering attributes into entity
records gets the data modeler into
trouble. Too much clustering can produce a data model with storage redundancies
and processing anomalies. Normalization is the test which helps to identify the
problems stemming from incorrect clustering, and record decomposition is always
the remedy to correct a violation of one of the normal forms. Unfortunately,
normalization is based upon semantics which are not explicitly represented in ER
diagrams.
Data modeling schemes which use two basic constructs do not require the data
modeler to a priori determine whether a piece of data represents an entity or an
attribute. They do not require the data modeler to cluster attributes into
records. Hence, normalization becomes unnecessary and irrelevant. Clustering
attributes into records should not be done in developing the conceptual data
model. Conceptual data modeling is done at a stage before "logical" data
modeling.
This talk describes some problems that arise with a data modeling scheme, such
as ER Diagrams, which uses three basic constructs (entity, attribute,
relationship) and implicitly or explicitly clusters attributes into entity
records. Hopefully such a discussion will motivate the reader to explore and
seriously consider using a data modeling scheme with two basic constructs, such
as Object-Role Modeling (ORM).
Dr. Gordon C Everest
is Professor Emeritus of MIS and Data Management in
the Carlson School of Management at the University of Minnesota (but continues
to teach as an adjunct at Univ. of Minnesota, Metropolitan State University, and
others). His Ph.D. dissertation from the University of Pennsylvania Wharton
School entitled "Managing Corporate Data Resources" became the textbook
entitled: "Database Management: Objectives, System Functions, and
Administration" (McGraw-Hill, 1986, and remained in print until 2002!). He is
also a contributing author of the CODASYL Systems Committee technical report
entitled: "A Framework for Distributed Database Systems: Distribution
Alternatives and Generic Architectures", and of the final technical report of
the ANSI ASC X3 SPARC DBSSG Object-Oriented DBMS Task Group, released in 1991.
He participated in the ANSI standards community on developing a Common Unified
Data Modeling Scheme and investigating Object-Oriented Database Management
Systems.
Gordon has been teaching all about databases, database management systems,
database administration, and data warehousing since he joined the University in
1970. Students learn the theory of databases, gain practical experience with
real data modeling projects, and with hands-on use of data modeling tools and
DBMSs. Besides teaching about databases, he has helped many organizations and
government agencies design their databases. His approach transfers expertise to
professional data architects within those organizations by having them
participate in and observe the conduct of database design project meetings with
the subject matter experts.
Gordon's lecturing and research interests include advanced conceptual data
modeling, logical database design methods and diagraming conventions, high-level
data languages, selection and use of database management systems,
object-oriented databases, data warehousing, metadata acquisition and
management, CASE tools and the repository, data-centered systems development,
client-server architectures, organization and functions of database
administration, data privacy and security, and the legal aspects of computing. |