Normalizing Tables – You will develop a program that generates SQL statements to check normalization in one table and iteratively normalize it if needed.
Introduction – Normalization
You will develop a program that generates SQL statements to check normalization in one table and iteratively normalize it if needed.
The input will be only one table in a database. Each table will have one potential primary key (PK), and non-key columns. Other CKs and FKs will not be considered. The given primary key may simple or composite. Your program will process one table at a time, whose name and columns will be specified when your program is called. The actual table with data will be already stored in the database. Even though your program will generate a short output text file, most processing will be done with SQL queries.
The output is a set of normalized tables, as many as needed, created in your own space. The goal is to reach BCNF on all tables.
Program requirements – Normalization
Your program will generate SQL code based on the input table and columns.
- Input: a plain SQL table, with a potential PK given as parameter. That is, assume you imported a CSV or text file without assurance the table has a valid PK. In this specific HW edition the table will not have nulls.
- PK: you must verify if the table PK is indeed valid. The PK can be composite and can have up 2 to attributes. You are not expected to find a PK if the given PK is invalid. In this specific HW edition the PK will not have nulls.
- FDs: your program must compute all potential functional dependencies considering the max number of columns in the PK and nonkey columns. For FDs not involving the PK there must exist at least two repetitions. In other words, you need to eliminate potential candidate keys from consideration since they do not cause update anomalies.
- Normalization: Your program must check 1NF, 2NF, 3NF, BCNF. If the table is not in BCNF then you have to iteratively normalize it up to BCNF. For 3NF/BCNF assume the key on the LHS has only one attribute.
- Goal: validate PK (remember it can be composite with 2 columns), certify if the input table, with potential PK (input parameter), check if there are repeated rows (1NF, all values are assumed atomic in SQL) and nonkey columns obeys each higher normal form: 2NF, 3NF, BCNF. A table may be in 1NF with an invalid PK given as parameter.
- Optional: the program must iteratively decompose the input table produce a final set of normal- ized tables, satisfying 3NF or BCNF if possible. Each table must satisfy the properties explained in the textbook. Extra credit= 20 points, which can be applied towards HW1.
- Excluded: referential integrity. In this HW you do not have to worry about inconsistent or missing information across tables.
- Program specification
- The input will be specified in the command line, with parameters separated by semicolon ”;” (parameter=value).
- Parameters: input table, PK (potential) and columns (nonkey).
If a table has missing parameters: no PK or nonkey columns specified your program should halt and display an error message” invalid input”.
System Software – Normalization
OS: Linux (any distro is fine if you install the DBMS locally). However, it is feasible to develop programs in Windows and MacOs but later you must test them in our Linux server.
- Processing: find FDs, decompose and leave all temporary tables in each higher normal form. Each normalized table must have a PK in the DDL; in case there is a choice (e.g. symmetrical FDs) use alphabetical order to choose its PK. Example: B C, C B: choose B. At the end of processing, you must write a list of normalized tables in a text file, showing normalized table names with a suffix indicating NF and table number as needed. Example for table T: T 2NF 1,T 2NF 2,T 3NF 1,T 3NF 2 ,T BCNF 1,T BCNF 2.
- Forbidden: You cannot export or read tables row by row, doing processing outside the DBMS. The bulk of processing must be done with SQL queries and temporary tables.
- Allowed: you can store all table PK and column information inside the host language with any data structure including lists and arrays. You can use any libraries to store, search, parse table/column names. You can retrieve row counts, sums or boolean values computed by SQL queries.
- SQL code generation: There is only one (give) primary key without other (secondary) candidate keys. You can use any combination of SELECT statements, temporary tables and views. You should preserve column names as you normalize tables. You are not allowed to modify the input table in any way.
- SQL: Your program must generate a plain SQL script file for all the queries used in your program (correctly formatted as shown in the textbook and seen in class, following SQL for- mat/indentation standards). This file will be checked by the TAs for originality and format. The instructor may also check it in case of grade complaints.
Write to text file ”nf.sql” all the SQL statements generated during the checking of normalization. Append after each run. This SQL should have comments indicating table name being checked and be well formatted to be understood by a person, not as a long line.
SQL is not case sensitive. But it is preferred SQL keywords are in uppercase and table/column names in lower case.
- 10% extra credit: create a web page calling JS where you can connect to the database, list all tables (retrieving them from the internal schema information) and let the user customize the normalization process. The user can choose a table and your program can ”propose” PKs with up to k attributes. The user can tune a ”redundancy” parameter that acts as a threshold to discard weak FDs (e.g. one column that is almost unique). Then the program can proceed with
nornalization. If you decide to do it you need to email the instructor and TAs; this extra credit cannot be obtained if you submit your HW without alerting us.
- Your program should not crash and stop if there is some error. The program should drop any derived or temp table in advance to avoid exceptions. Test your program well with invalid input.
- Omissions or mistakes in test cases, specification or TAs instructions: they must be reported as soon as possible. Important requirements will fixed in this PDF. It is impossible to foresee every complication when you develop your program. At the same time, this specification would have to be 10-page long to be ultra-specific about every requirement. It is understood that a mistake or omission pointed out to the TAs or instructor in 3 days or less before the deadline will be very difficult to fix.
Program call and output specification – Normalization
Here is a specification of input/output. There is only one main input parameter: the input table. Ther are two additional parameters: a potential PK and its non-key columns.
- syntax example for main program call from the Linux prompt (simplified):
node normalize.js “table=T;pk=i,j;columns=a,b,c,d”.
- If the connection to the DBMS fails, the program must display an error message. If the connection to the DBMS is successful, you should show a ”success” message and generate the SQL script file regardless the selected option, and dump clean SQL statements generated by your program.
- Input checking: Your program will be tested with several tables having different names and columns. But you must make elementary error checking for non-existent tables and invalid column names. That is, you should not waste time catching every potential error like a missing parenthesis, pkk instead of pk, repeated column names, inconsistent column names, and so on.
- Output: a text file ’nf.txt’ specifying for each table if the PK is valid and if each NF (1NF, 2NF, 3NF, BCNF) is satisfied Y/N (one normal form per line and Y/N). Append this file after every run.
Optional: full decomposition as a set of normalized tables, iteratively written for each NF. Normalized tables will have a name with the corresponding normal form as a suffix. The list of normalized tables will be listed in file ”nf.txt”, one per line (do not include any other info). The most important ones are those in BCNF, but intermediate normal forms will be checked for correctness.
- potential PK: up to 2 columns. Since SQL only supports simople data types it is assumed values are atomic. Therefore, you only need to validate the PK.
- 1NF: Remember that 1NF requires the existence of a PK (i.e. unique rows). SQL assumes values are atomic (even if they are strings).
- Nonkey columns: up to 10 columns.
- Finding FDs: only one column of the left side. The only exception for FDs is the PK which may be composite, but still you will find FDs with one column on the left side.
- Data types: The input table will have only integer numbers and strings (no dates, no floating point, no boolean).
- The SQL of your generated script file should be properly formatted and indented (one SELECT in one line, JOIN/WHERE/GROUP each in a different line). Indent nested queries.
Example of output file nf.txt for a table with an invalid PK (nulls or duplicates) and duplicate rows (ignoring columns not given as parameters): Normalization
Example of output file nf.txt for a table with an invalid PK (has duplicates), but unique rows (there exists some hidden PK):
hw2_example PK N
Example of output file nf.txt for a denormalized table with valid PK:
hw2_not2NF PK Y
Example of output file nf.txt for a normalized table up to 3NF but not BCNF:
hw2_3nf PK Y
Grading – Normalization
The TAs will provide several input tables, each with potential PK and non-key columns. These test cases will give the students the opportunity of evaluate their own projects, and make corrections before the official submission. The TA will use scripts for automatic grading. You can expect tables to have no more than 256 rows, to make SQL development faster and easier to check.
Score: A program not uploaded to the UH Linux server will get 0 (zero, the TAs have no obligation to run testing outside this server). A program with syntax or data type errors will get 10. A working program, producing correct output for some cases, will get a score 30 or better. A program passing checks with most tables can get 60 or better. It is expected you fix your JS program to avoid server errors or system crashes; please test it well and make sure all open files or processed are closed.
Resubmission: The instructor will decide if scores are averaged or a point subtraction is applied. In general any resubmission incurs in at least 10% penalty. A minor fix should be resubmitted within 1 day when grades are posted with 10 points off. A major fix should be resubmitted within 2 days and 20 points will be deducted. There is presubmission since the TAs and instructor have made many test files and test tables available.
Code originality: The source code and SQL statements will be checked for plagiarism with source code analysis tools (we cannot disclose them): any significant similarity with some other student code will be reported to the instructor. Notice reformatting code, changing variable names, changing order of functions, changing for loops for while loops, repackaging into different files, adding spurious variables; all of them can be easily detected since the ”logic” solving the problem is the same. Warning: copying/pasting code from the Internet (stack overflow, github, tutorials) may trigger a red flag: the instructor will decide if there is cheating or not. If you get SQL or JS code from the Internet disclose the sources in your README file. End of Normalization Assignment