MCS-274 Lab 2: Database design (Spring 2009)

Due March 31, 2009

Background

In this lab, you will continue working with the election data from lab 1. This time, your focus will be on developing a better design for the database than simply mirroring the file format in which the Secretary of State's office (SOS) distributed the data. The main portion of the lab uses the E-R design approach, rather than normalization. However, one of the extra-credit options does draw the connection to normalization.

Although the data may have quirks (like an MCD with two names), the scope of work for the main assignment does not include any resolution to such oddities. Instead, you are to represent the data we are given. An extra-credit option addresses this topic.

Minimum tasks

I strongly urge you to check your work with me at each stage, rather than proceeding ahead from what may be an ill-chosen starting point. However, your goal at each stage is to have a reasonable starting point for further design, not necessarily a perfect one. You may wind up discovering something in a later stage that causes you to go back and make minor revisions to an earlier stage.

  1. Identify the entity sets underlying the SOS tables. In many cases, each SOS table corresponds to a single entity set, but be careful -- there is at least one exception. For this task, just prepare a textual list of the entity sets.

  2. Identify the relationships amongst the entity sets. For each, give the name of the relationship and the names of the participating entity sets. If the meaning of the relationship or the role played by any of the entity sets is not clear, include a brief explanatory note.

    You should strive to provide only a minimal, non-redundant collection of relationships that suffices to capture all the information. Do not create a dense web of redundant relationships. To illustrate this point, consider a college course database. A relationship exists between students and courses, and another exists between professors and courses. There would be no reason to include a third relationship linking students and professors, unless it served some unrelated purpose, such as the advisor relationship.

    There are acceptable database designs that do not include any subclasses, and I expect that most of you will choose such a design, for simplicity. However, if you do choose to include subclasses, then your list of relationships should explicitly indicate the "isa" relationships.

  3. For each entity set or relationship, identify its attributes. Each entity set should have at least one attribute, but relationships can have as few as zero attributes. In fact, in some of the reasonable designs, none of the relationships have attributes. Be careful when you decide which entity set or relationship each attribute is associated with. Some of the attributes in the SOS tables are not placed where they logically belong in an E-R diagram.

    Try to avoid redundant attributes. In avoiding redundancy, I would even go so far as to only have one record of each county's name, although the SOS tables actually contain two variants, for example 'SAINT LOUIS' and 'St. Louis'. However, don't go so far as to eliminate the storage of recalculatable information, such as totals, counts, and percentages. Some of those numbers may be wrong in the SOS data (or at least inexplicable). As a computing professional, your responsibility would be to alert the competent authority of any anomalies you notice, but pending their correction, still arrange to store the data as given.

    For this task, you should prepare a textual list in the form of a two-level outline, with entity sets and relationships at the top level and attributes at the second level. That is, you should have a line for each entity set, each relationship, and each attribute, with each attribute name indented in under the corresponding entity set name or relationship name.

  4. Draw an E-R diagram, including such details as keys, referential integrity and degree constraints, and weak entity sets.

  5. Create a new collection of tables based on your E-R diagram, as explained in the textbook's Section 4.5. (If your design includes subclasses, you will also need to consult Section 4.6.)

    Although you could create the tables using the ordinary form of CREATE TABLE, and then use INSERT to import the data from the SOS tables, Oracle provides a more convenient alternative, which combines table creation with data import. This more convenient alternative frees you from having to specify all the column names and types that are the same as in the source table.

    As an example, suppose you wanted to make your own identical copy of the sos.precincts table. You could use the following SQL command in Oracle:

    CREATE TABLE precincts AS
    (SELECT * FROM sos.precincts);
    

    In this example, Oracle copies the column names, types, and any NOT NULL constraints in order to create the table; then it copies all the rows of data as well. The SELECT subquery can be replaced with any other subquery, providing much more control. Here are some specific options that may be useful to you:

    1. If you only want to copy some of the columns, you can list explicit column names in place of *. In this case, you will likely want to include DISTINCT so as to avoid duplicated rows.

    2. If you want to rename some column, you can link the old and new names with the AS keyword.

    3. If you only want to copy some of the rows, you can use a WHERE clause.

    4. If you want to insert data that is a constant, or is computed using an expression, rather than coming from a source table, you can include the constant or other expression in the SELECT clause. In this case, you will surely want to use AS to provide a column name, and you may also need to provide an explicit data type if Oracle cannot infer a suitable type. In particular, if you use the constant NULL, you will want to specify a type. For example, to have a column of type CHAR(1) that is NULL, you could specify CAST(NULL AS CHAR(1)) followed by AS and a name for the column

    5. One specific expression you may want to use would be for extracting each 4-digit candidate ID number from the 8-digit version that includes the office ID. For this purpose, Oracle's SUBSTR function would be useful. For example, SUBSTR(candidate_id, 5) would extract the substring starting at character position 5.

    Most of the data in your tables will come from the SOS tables. However, depending on your design, you may want some extra rows for county 88 and/or 89, and possibly for precinct 0005 of county 89. If so, you can explicitly INSERT those extra rows. If you take this approach, you might consider including some feature in your design that makes it possible to distinguish the genuine counties and precincts from these special ones.

    You should submit the SQL commands you used to create the tables and insert data into them.

  6. Add a primary key to each table, using ALTER TABLE as shown at the bottom of page 326 of your textbook. You should submit these SQL commands.

Extra credit opportunities

You may independently choose any number of these, in any order.

  1. Determine whether each SOS table and each of your new tables is in BCNF.

  2. Verify that the SOS tables could be reconstructed from your tables (i.e., that your decomposition is lossless) by making new tables equivalent to the SOS ones, constructed from yours. Once again, you will find it convenient to use CREATE TABLE ... AS ....

    Then use MINUS queries to verify that the reconstructed tables contain no tuples not in the SOS tables, and vice versa. Recall that Oracle's MINUS is the same as what is called EXCEPT in the SQL standard and your textbook.

    You need not verify reconstruction of all of the SOS tables; the more you verify, the more extra credit you will receive.

    When reconstructing a table that contains county names, you may need to convert county names from whichever form you stored to the other, in order to match the SOS table. If you store the all-caps form, you can reconstruct the other using REPLACE(INITCAP(county_name),'Saint ','St. '). To go in the opposite direction, you would use UPPER instead of INITCAP and you would use the REPLACE function to convert 'ST.' into 'SAINT'.

    When reconstructing a table that contains candidate IDs, you may need to convert the 4-digit form into the 8-digit form. For this purpose, you can use SQL's string-concatenation operator, written ||. For example, the expression 'foo' || 'bar' would evalate to the string 'foobar'.

  3. Add as many constraints to your tables as you reasonably can. You should already have added primary keys. Also, most of your columns will already have NOT NULL constraints imported from the SOS tables. The most important kind of constraint to add would be foreign keys. The syntax for adding these with ALTER TABLE is analogous to what you used for primary keys. Some referential integrity constraints may be difficult or impossible to incorporate into your design as foreign keys, due to the quirks of the SOS data. For these, you should state the constraint you would have liked to add. There might also be a few opportunities to add UNIQUE or CHECK constraints. If you find a reason to add any further NOT NULL constraint, the syntax is ALTER TABLE tableName MODIFY (columnName NOT NULL);

  4. Identify any broader ways in which the data ought to be consistent, which go beyond what you know how to enforce with SQL constraints. Prepare a list of consistency conditions.

  5. Prepare a list of data oddities that you would like to call to the SOS's attention. In some cases, if they were to resolve the oddity, you might be able to simplify the database design or enforce an additional constraint, which the current data doesn't obey. In other cases, the resolution might not have any impact on your design. Still, professional ethics demand that you point out any potential problem you notice.