Importance of Entity-Relationship Modeling
in a Data Warehouse

Published in The Navigator, Summer 1998
Copyright © 1998, Larissa T. Moss, Method Focus Inc.

The most compelling reason for creating an Entity-Relationship model is to understand the data content and the data relationships of a business. Without a thorough understanding of the business data one cannot construct a fully integrated and flexible database which can support access to all the defined data through all the defined relationships.

Traditionally we used process modeling to design operational systems. Why? Because an operational system is a stand-alone system which needs to perform a specific business function. That business function needs to manipulate a certain subset of business data. Therefore, we collect and store that data in the most efficient way for the processes (programs) to perform their functions. This is different in a data warehouse.

In a data warehouse we want to provide integrated data across all business functions. Data warehouses do not have a specialized set of processes (programs) to perform specific business functions; instead a data warehouse must be designed in such a way that processes not yet conceived can be performed against it in the future. This means that all the data relationships pertinent to the business must be known and understood, because they become the access paths to the data. To achieve this we use the data modeling technique instead of the process modeling technique.

The major difference is the collection and storage of data. In process modeling, data from different business objects is combined and stored in data stores around the processes, whereas in data modeling, data is collected and stored separately for each business object (or entity). Storing the data separated by business objects and connecting these business objects according to the business rules (or policies), allows for maximum flexibility to access all the data in any way.

The process of data modeling involves the following steps:

  • Identify major objects (called entities) that are pertinent to the business.
  • Identify the business policy behind the relationships between these objects.
  • Identify the data elements you want to collect about these objects.
  • Identify the relationships between the data elements within these objects.
  • Assign each data element to one and only one object (entity).

The last step in an Entity-Relationship data model is to ensure that each data element is either a unique identifier or a descriptive attribute of the business object it belongs to. To do this we use a technique called Normalization. The result is a data model in 3rd Normal Form.

This technique is a check (or audit) of the attributes. We compare each non-key attribute to it's key and determine if:

  1. the attribute is a repeating group
  2. the attribute is fully dependent on all components of a composite key
  3. the attribute is dependent on the key and not another non-key attribute.

If any of these conditions exist, as illustrated by the examples below, the entity should be normalized; i.e. split into two entities.

Repeating group
The best analogy for this is to think of an OCCURS clause in Cobol.

Example:

Crs-Nr.	 Course Title	  Year	Semester(1)	#Students(1)	Sem.(2)		#Stud.(2)...
 A333	 DW modeling	1997	 Spring		     	22		   Summer		18
If each semester was stored on a separate row it would produce this redundancy:
     Crs Nr.	 Course Title	 Year	Semester	#Students	
     A333	  DW modeling	1997	Spring		    22
     A333	  DW modeling	1997	Summer		18
     A333	  DW modeling	1997	Fall		        26
     A333	  DW modeling	1997	Winter		    33
Normalize into:
     Entity COURSE:		Course Nr.	Course Title
     Entity ATTENDANCE:	    Course Nr.	   Year		 Semester	# Students

Fully dependent on all components of a composite key
This only applies if the unique identifier (primary key) is made up of more than one attribute.

In that case all non-key attributes must depend on the entire primary key (all of its attributes). Otherwise, like in the previous case, we may be repeating the values of those attributes that do depend on the entire key for each attribute that doesn't.

Example:

     Crs Nr.  Student ID	Grade	   # Units	(#Units is not dependent on
     A333		12345		A		3	Student ID; creates redundancy)
     A333		34567		B		3		
     A333		56789		C		3		
Normalize into:
     Entity COURSE:		Course Nr.	# Units
     Entity GRADE:		 Course Nr.	 Student ID	Grade

Fully dependent on the key and not on a non-key attribute
In this case a relationship would exists between two attributes where neither attribute is part of the primary key.

Example:

     Crs Nr.      # Units	   Tuition 	 (Tuition$ is dependent on #Units,
     A333		   3		$995.00		  regardless of Course Nr.)
     A111		   3		$995.00
     A444		   3		$995.00
     A567		   2		$795.00
Normalize into:
     Entity COURSE:		Course Nr.	# Units
     Entity TUITION:		# Units		      Tuition $

The most flexible relational design is a data model in 3rd normal form. However, because of performance considerations and its complexity of access, it is rarely possible or desirable to implement such a model without some denormalization. Regrettably, and much too frequently, developers and managers use this fact as an excuse to skip the process of Entity-Relationship modeling. This often results in major database redesigns, or worse, in stand-alone systems, because some important business relationships were misunderstood or missed entirely.