DBMS Normalization
Normalization is a process of arranging relations or set of relations in a manner so that insertion, deletion or updation anomalies do not persist with in the database. It’s being considered foremost while creating any table or relation to maintain consistent data and reduce redundancy. A database table or relation is said to be normalized if it satisfies the following Normal Forms:
1. First Normal Form (1NF)
- A relation (Set of tuples. Here each Student data like “John who is of class 10th and holds some subjects represent a relation.
- In RDBMS tables represents relation with other table by establishing Foreign key.) is said to be in 1NF if each attribute in relation contains an atomic value (i.e. ,single valued attribute).
Consider the below Students table. In which each student is pursuing more than one subject. But it is not the convenient way to represent it in the table. Because our rule state that single value attribute should be maintain by attribute.
Students Table
Id | Name | Subject | Class |
1 | John | Maths Physics | 10th |
2 | Amit | English Chemistry | 10th |
So, we have simplified it by making it single valued attribute as in the below table.
Id | Name | Subject | Class |
1 | John | Maths | 10th |
1 | John | Physics | 10th |
2 | Amit | English | 12th |
2 | Amit | Chemistry | 12th |
2. Second Normal Form (2NF)
- 2NF states that relation should be in 1NF firstly.
- Then all non-prime-key attributes are fully functional dependent on the primary key or we can say no partial dependency.
Consider the below Employee_Project table
EmpId | EmpName | ProjectId | ProjectName | Working Hours |
E1 | John | P1 | Johnsons | 4 |
E2 | Rohn | P2 | Almora | 6 |
E3 | Sumit | P1 | Johnsons | 3 |
E1 | John | P2 | Almora | 5 |
Given table is in 1NF.
Now examine whether the table fulfill our next condition. Here EmpId and ProjectId are primary key attributes and in current table they together are working as composite key. while EmpName and ProjectName are non-primary key attributes.
In the above table ProjectName is non primary key attribute and is functionally dependent on ProjectId and not on primary key EmpId which is making partial dependency. Also, if we have to fetch the working hours for Project Johnsons. Then we will have to use ProjectId in addition with EmpId to uniquely identify a row. Thus ProjectId+EmpId (composite key) act as primary key. We will remove partial dependency by making separate table for Project.
Let’s shatter it into two tables to obey the 2NF rule.
Employee_Project
EmpId | EmpName | ProjectId | Working Hours |
E1 | John | P1 | 4 |
E2 | Rohan | P2 | 6 |
E3 | Sumit | P1 | 3 |
E1 | John | P2 | 5 |
Projects
ProjectId | ProjectName |
P1 | Johnsons |
P2 | Almora |
Now both tables are in 2NF.
3. Third Normal Form(3NF)
- 3NF states that relation in table should be in 2NF.
- Has no transitive functional dependency for non-prime attributes.
- Reduce data replication and maintain data integrity.
Student_Semester table
Id | StudentId | SubjectId | Marks | Semester | TotalSubjects |
1 | 1 | 1 | 60 | Sem1 | 4 |
2 | 2 | 2 | 70 | Sem2 | 5 |
3 | 1 | 2 | 80 | Sem1 | 4 |
Here in the given table, Suppose some student studies in XYZ college. Each semester has its own list of subjects. So, SubjectId functionally depends on Semester, Marks(non-prime attribute) depends on SubjectId(prime attribute) and Semester depends on StudentID. TotalSubjects depends on Semester. Thus TotalSubjects (non-prime attribute)column transitively depends upon StudentId(prime attribute).
Semester depends upon StudentId
TotalSubjects depends upon Semester
Then TotalSubjects indirectly depends upon StudentId
So, we will create three separate tables Student_Subject, Subjects and Semester
StudentId | SubjectId | Marks | SemId |
1 | 1 | 60 | 1 |
2 | 2 | 70 | 2 |
3 | 1 | 80 | 1 |
Subjects Table
SubjectId | Name |
1 | Maths |
2 | English |
Semester Table
SemesterId | Name |
1 | Semester1 |
2 | Semester2 |
So, now transitive dependency is removed, obey 2NF and our tables are in 3NF. One thing need to remember while normalizing is that each tuple should maintain unique record.
Boyce Codd normal form (BCNF)
- Advance version of 3NF. Also known as 3.5 Normal Form.
- States that relation should be in 3NF.
- For every functional dependency X->Y, X should be the super key (i.e., If Y is prime attribute depends on X then X should not be non-prime attribute).
Students Table
StudentId | Subject | University | Teacher |
S1 | Maths | DFG | Ray |
S2 | English | DFG | Eva |
S3 | Science | DFG | Jack |
S1 | Civics | DFG | John |
S1 | English | DFG | Mack |
If we examine then student StudentId + SubjectName together form primary key(i.e. ,composite key). Because with the help of these two we would able to uniquely identify each row.
Subject(prime attribute, say Y) depends on Teacher(non-prime, say X). In the given table BCNF rule is violated:
- Because prime attribute depends on non-prime.
- Neither StudentId nor SubjectName independently act as primary key.
Also, this way of storing data in table would generate insert, update and delete anomalies. Consider a scenario, suppose if we have to update the University name then we have to update all rows with value “DFG”. Here we have taken a simple example but think about large database(with 50,000 Or + records) it would be time consuming and increase computational cost. If in any case any of the record left unchanged then will create inconsistency with Updation Anomaly.
As we can see in the given table, same piece of information exist multiple timesl(Like Student with Name S1 and University Name exist multiple times i.e. hold same piece of information).So, we need to normalize it by fragmenting into tables.
Student_University Table
StudentId | UniversityId |
S1 | 1 |
S2 | 1 |
S3 | 1 |
S4 | 1 |
Student_Teacher Table
StudentId | TeacherId |
1 | T1 |
2 | T2 |
3 | T3 |
1 | T4 |
1 | T5 |
Teacher_Subject
TeacherId | Subject |
T1 | Maths |
T2 | English |
T3 | Science |
T4 | Civics |
T5 | English |
University table
Id | Name |
1 | DFG |
2 | …… continue |
We simply need to change one row to update the University name.
4. Fourth Normal Form(4NF)
- 4NF states that relation should be in BCNF.
- For any single value within the table, there should not be any multivalued dependency.
- Say for a functional dependency X → Y, if for a single value of X, multiple values of Y exists, then the relation will be considered to have multi-valued dependency.
Consider the below Employee table:
EmpId | Skills | PhoneNo. |
12 | PHP | 9999999999 |
12 | JAVA | 8888888888 |
13 | C++ | 7777777777 |
Company column has no relation with employee phone number. So, we will disintegrate into two columns.
Emp_Skill
EmpId | Skills |
12 | PHP |
12 | JAVA |
13 | C++ |
Emp_PhoneNo
EmpId | PhoneNo |
12 | 9999999999 |
12 | 8888888888 |
13 | 7777777777 |
Now both tables are in 4NF.
5. Fifth Normal Form(5NF)
- 5NF states that relation should be in 4NF.
- It is required that according to the functional dependency, table cannot be further break down into more tables without losing data.
- Generally disintegration of table is done to avoid redundancy but it should be done without data loss.
- Should not contain join dependency.
- Also known as project join normal form.
Let’s firstly try to understand join dependency.
Consider the below table. Here the Student studies subject Electronics from teacher Mario. Mario can teach subject Electronics, C++. One more student Miley studies subject C++ from Mario.
But this is not mandatory in the current relation that any student who study from any teacher is required to take the all subjects taught by any particular teacher.
Students
Student | Subject | Teacher |
John | Electronics | Mario |
John | C++ | Mario |
Miley | C++ | Mario |
Jack | Electronics | Albert |
Table contain redundancy as John and Mario appears twice. Now we will decompose it into three tables or create three projections and find out whether the table contain join dependency or not.
Student_Subject
Student | Subject |
John | Electronics |
John | C++ |
Miley | C++ |
Jack | Electronics |
Subject_Teacher
Electronics | Mario |
C++ | Mario |
Electronics | Albert |
Student_Teacher
Student | Teacher |
John | Mario |
Miley | Mario |
Jack | Albert |
Student | Subject | Teacher |
John | Electronics | Mario |
John | Electronics | Albert |
John | C++ | Mario |
Miley | C++ | Mario |
Jack | Electronics | Mario |
Jack | Electronics | Albert |
When we have joined Student_Subject and Subject_Teacher. Then 2 more spurious row gets added into the result as shown above. Thus lossy decomposition exist here.
Now we further perform natural join on this result with Student_Teacher and got the original Students table below. Thus the original Students table contains join dependency and is not in 5NF. However table Student_Subject,Subject_Teacher,Student_Teacher are in 5NF. Because these tables can’t be break down further and when we perform natural join on these, original table is generated without loss of information. Also, no any spurious row is added.
If decomposed tables or projections are joined. A table (say Students) is said to have JOIN dependency when there is:
- No loss of information in the generated table (say Students’).
- No any spurious row is added in the generated table(say Students’) .
Students’
Student | Subject | Teacher |
John | Electronics | Mario |
John | C++ | Mario |
Miley | C++ | Mario |
Jack | Electronics | Albert |
However in some scenarios, it is not always possible to eliminate the redundancy from the table. Because if we try to do so then it may result in information loss. Consider the Student table below:
Students table
Student | Teacher | Subject |
John | Mack | Electronics |
John | Mack | C++ |
John | Abrahim | DBMS |
Miley | Mack | DBMS |
Student_Teacher
Student | Teacher |
John | Mack |
John | Abrahim |
Miley | Mack |
Teacher_Subject
Teacher | Subject |
Mack | Electronics |
Mack | C++ |
Abrahim | DBMS |
Mack | DBMS |
Student_Subject
Student | Subject |
John | Electronics |
John | C++ |
John | DBMS |
Miley | DBMS |
So, when we perform join on Student_Teacher and Teacher_Subject, we found two spurious rows within the table.
Student | Teacher | Subject | |
John | Mack | Electronics | |
John | Mack | C++ | |
John | Mack | DBMS( | X(INCORRECT) |
John | Abrahim | DBMS | |
Miley | Mack | Electronics | X(INCORRECT) |
Miley | Mack | C++ | X(INCORRECT) |
Miley | Mack | DBMS |
Finally when we natural join this resulted table with Student_Subject, we got the below table. Thus it is lossy decomposition and original table including fragmented tables are not in 5NF.
Student | Teacher | Subject | |
John | Mack | Electronics | |
John | Mack | C++ | |
John | Mack | DBMS | X(INCORRECT) |
John | Abrahim | DBMS | |
Miley | Mack | DBMS |