CS7079NI Data Warehousing and Big Data – INDIVIDUAL COURSEWORK

 Level 7 – Big Data and Data Warehousing

Abstract

MediCare Hospital wants to use big data technology to solve problems with finding trends in patient medical records and understanding differences in treatment results. The hospital aims to detect patients at high risk for conditions such as diabetes and heart disease by gathering and evaluating enormous amounts of patient data, including medical histories, treatment plans, and patient feedback. Implementing a big data system enables earlier identification of disease outbreaks, personalised treatment planning, and real-time patient health monitoring. This coursework examines the design, implementation, and advantages of a big data framework for MediCare Hospital, with an emphasis on how predictive analytics and sophisticated data-driven insights could transform patient care and decision-making.

Table of Contents

Abstract

Table of Contents

Introduction

ETL Process

Extraction

Transformation

Load

OLTP and OLAP

Online Transactional Processing (OLTP)

Online Analytical Processing (OLAP)

Notable differences between OLTP and OLAP:

Creation of data warehousing database and demonstration of data loading process using ETL

Creating Data mart

Dimension and Fact Tables

Creating Dimension tables

Creating Fact table

A subquery to identify patients whose treatment costs are significantly higher than the average treatment cost for similar diagnoses within a given period.

A scenario where multiple hospital staff try to update a patient’s record simultaneously, showcasing how locking and concurrency control can prevent data inconsistencies.

ETL Process

Big Data Processing

Loading patient health records into Spark

Clean the patient dataset

Aggregating patient data

Analysis to identify high-risk patients

NoSQL

Adding new connection string

Creating database

Inserting a sample dataset of patient feedback

Queries to retrieve, analyze, and filter the feedback for patterns in patient satisfaction and treatment outcomes

Conclusion

References

Introduction

MediCare Hospital, a leading healthcare provider, aims to address critical challenges such as identifying high-risk patients, improving treatment outcomes, and optimising resource utilisation. The hospital manages large volumes of data, including patient records, treatment histories, diagnostic data, and patient feedback, which contain important insights that guide decision-making.

MediCare Hospital intends to implement data warehousing and big data solutions tailored to its needs. By employing advanced database concepts, ETL processes, Big Data frameworks like Apache Spark, and NoSQL databases like MongoDB, the hospital can uncover patterns in health records, monitor patient health in real time, and deliver personalised care. The case study also provides a practical scenario for applying technical skills in real-world contexts, demonstrating the transformative potential of Big Data technologies in the healthcare sector.

ETL Process

ETL stands for Extract, Transform, and Load, and it is a core data warehousing technique. Data is extracted from several source systems using an ETL tool, transformed in a staging area, and then loaded into the data warehouse.

ETL process

Extraction

The first step in the ETL process is extraction. Data must be moved into the staging area from a variety of source systems, including relational databases, NoSQL stores, XML, and flat files. Because the extracted data comes from many source systems in different formats and may be corrupted, it is held in the staging area before being transferred to the data warehouse; loading it directly into the warehouse could be harmful and would make reversal much more difficult. This makes extraction one of the most important phases of the ETL process.

Transformation

The second stage of the ETL process is transformation. In this step, rules or algorithms are applied to convert the collected data into a single standard format. This can involve tasks such as cleaning, filtering, sorting, joining, and splitting.

Load

The third and final stage of the ETL process is loading, in which the transformed data is written into the data warehouse. Depending on the circumstances, data is loaded either at frequent, regular intervals or in larger, less frequent batches. The volume and timing of loading are dictated solely by the requirements, which vary from system to system.

OLTP and OLAP

Relational databases are categorised as OLTP or OLAP. Although both are widely used in numerous commercial applications, their main use cases differ.

Online Transactional Processing (OLTP)

An OLTP system gathers transaction data and stores it in a database. Each transaction involves individual database records made up of multiple fields or columns. Because OLTP databases are frequently accessed, written to, and updated, rapid processing is the main goal. Built-in system logic guarantees data integrity even if a transaction fails.

Online Analytical Processing (OLAP)

OLAP applies complex queries to large volumes of historical data compiled from OLTP databases and other sources, for use in data mining, analytics, and business intelligence projects. The emphasis is on how quickly these queries are answered. Each query draws on one or more columns of data aggregated from many rows.

Notable differences between OLTP and OLAP:

OLTP | OLAP
Manages a large volume of small transactions | Manages large volumes of data with complex queries
Based on INSERT, UPDATE, and DELETE commands | Makes use of SELECT commands to gather data for reports
Response times are measured in milliseconds | Response times range from a second to minutes or hours, depending on how much data must be processed
Used for real-time supervision and management of crucial business operations | Used for planning, problem-solving, decision support, and uncovering hidden insights
Data updates are quick, brief changes initiated by the user | Data is refreshed by regularly scheduled, long-running batch jobs
Regular backups are performed to guarantee business continuity and to meet governance and legal obligations | Instead of frequent backups, lost data can be reloaded from the OLTP database on demand
Provides a record of daily business transactions | Provides a multifaceted view of enterprise data
Normalised databases to increase productivity | Denormalised databases optimised for analysis

How OLAP and OLTP work in conjunction

Creation of data warehousing database and demonstration of data loading process using ETL

To illustrate the application of an ETL procedure to the creation of a data warehouse and the development of a dimensional data model for this project, the following screenshots and instructions are provided.

Creation of the database 

Newly created database MediCare

ERD of the star schema using Kimball's method

The above entity relationship diagram follows Kimball's dimensional modelling methodology and the principles of star schema design for a data warehouse. In line with Kimball's star schema, there is a single FactPatientAdmission table containing measures and foreign keys that link to the dimension tables; it records measures for each patient admission, while the dimension tables provide the descriptive attributes for those facts.

Creating Data mart

A data mart is a subset of a data warehouse focused on a single line of business. Collections of condensed data gathered for analysis of a particular division or unit inside an organisation, such as the sales department, are called data marts.
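To make this concrete, a data mart for MediCare could be exposed as a simple view over the warehouse tables. The sketch below is only illustrative: the view name BillingDataMart and the choice of billing-related columns are assumptions, while the table and column names follow the schema used elsewhere in this coursework.

-- Illustrative sketch of a billing-focused data mart built as a view.
-- BillingDataMart and the selected columns are assumptions; table and column
-- names follow the warehouse schema used in this coursework.
CREATE VIEW BillingDataMart AS
SELECT
    FPA.PatientID,
    P.Name AS PatientName,
    FPA.DiagnosisID,
    T.TreatmentDescription,
    FPA.DateOfAdmission,
    FPA.BillingAmount
FROM FactPatientAdmission FPA
JOIN DimPatient P ON FPA.PatientID = P.PatientID
JOIN DimTreatment T ON FPA.TreatmentID = T.TreatmentID;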

Dimension and Fact Tables

Data warehouses use terms such as dimensions and facts. A fact is a numerical item of information, such as a sale or a download. Facts are stored in fact tables, which are linked to multiple dimension tables via foreign key relationships. Dimensions are the descriptive companions of the facts: they describe the objects recorded in the fact table. Each fact record carries keys that associate it with a set of descriptive attributes in the dimension tables.

Fact Table | Dimension Table
The fact table has more records and fewer attributes | The dimension table has fewer records and more attributes
The fact table grows vertically | The dimension table grows horizontally
The fact table's primary key is formed by concatenating the primary keys of the dimension tables | Each dimension table contains its own primary key
A fact table can only be built once the dimension tables are complete | Dimension tables must be created first
A schema contains fewer fact tables | A schema contains more dimension tables
A fact table can contain both textual and numeric data | Dimension table attributes are always in text format

Dimension and fact tables are the foundation of data warehousing. The first step in developing a data warehouse is to identify the distinct entities and attributes that support the stated goals. The hospital is primarily concerned with:

  • Personalised treatment plans
  • Operational efficiency
  • Early detection of disease outbreaks

The following fact and dimension tables were created to provide the expected result:

Tables in warehouse design

Creating Dimension tables
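As a rough sketch of what such a dimension table looks like (the exact columns and data types are assumptions based on the attributes used later in this coursework), DimPatient could be defined as follows:

-- Sketch of one dimension table; the column list and data types are assumptions
-- based on the attributes referenced elsewhere in this coursework.
CREATE TABLE DimPatient (
    PatientID INT PRIMARY KEY,
    Name      VARCHAR(100),
    Gender    VARCHAR(10),
    BloodType VARCHAR(5)
);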

Creating Fact table
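Similarly, a sketch of the fact table is given below; the surrogate key, foreign key targets, and data types are assumptions, while the key and measure columns follow those referenced in the queries later in this report.

-- Sketch of the fact table; the surrogate key and data types are assumptions,
-- while the key and measure columns follow the queries used later in this report.
CREATE TABLE FactPatientAdmission (
    AdmissionID     INT IDENTITY(1,1) PRIMARY KEY,
    PatientID       INT NOT NULL REFERENCES DimPatient(PatientID),
    DoctorID        INT NOT NULL REFERENCES DimDoctor(DoctorID),
    DiagnosisID     INT NOT NULL REFERENCES DimDiagnosis(DiagnosisID),
    TreatmentID     INT NOT NULL REFERENCES DimTreatment(TreatmentID),
    DateOfAdmission DATE,
    BillingAmount   DECIMAL(10, 2)
);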

A subquery to identify patients whose treatment costs are significantly higher than the average treatment cost for similar diagnoses within a given period.

SELECT
    FPA.PatientID,
    P.Name AS PatientName,
    FPA.DiagnosisID,
    FPA.TreatmentID,
    T.TreatmentDescription,
    FPA.BillingAmount,
    AVG_DiagnosisCost.AverageCost
FROM
    FactPatientAdmission FPA
JOIN
    DimPatient P ON FPA.PatientID = P.PatientID
JOIN
    DimDiagnosis D ON FPA.DiagnosisID = D.DiagnosisID
JOIN
    DimTreatment T ON FPA.TreatmentID = T.TreatmentID
-- Subquery to calculate the average cost for each diagnosis within the given period
JOIN (
    SELECT
        DiagnosisID,
        AVG(BillingAmount) AS AverageCost
    FROM
        FactPatientAdmission
    WHERE
        DateOfAdmission BETWEEN '2024-01-01' AND '2024-12-31'
    GROUP BY
        DiagnosisID
) AVG_DiagnosisCost
    ON FPA.DiagnosisID = AVG_DiagnosisCost.DiagnosisID
WHERE
    FPA.DateOfAdmission BETWEEN '2024-01-01' AND '2024-12-31'
    -- "Significantly higher" is taken here as more than 1.5 times the average cost
    AND FPA.BillingAmount > 1.5 * AVG_DiagnosisCost.AverageCost;

A scenario where multiple hospital staff try to update a patient’s record simultaneously, showcasing how locking and concurrency control can prevent data inconsistencies.

Without concurrency control, the later transaction simply overwrites the earlier transaction's update. With concurrency control in place, Transaction 1 locks the record (PatientID = 1) while it updates BloodType, and Transaction 2 must wait until Transaction 1 completes and releases the lock before proceeding.
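A minimal T-SQL sketch of this scenario is given below, assuming the patient record lives in the DimPatient table and that pessimistic locking is applied via SQL Server lock hints; the specific blood-type values are illustrative.

-- Session 1 (staff member A): lock the row, then update it.
BEGIN TRANSACTION;
    SELECT BloodType
    FROM DimPatient WITH (UPDLOCK, ROWLOCK)  -- take an update lock on the row
    WHERE PatientID = 1;

    UPDATE DimPatient
    SET BloodType = 'O+'                     -- illustrative value
    WHERE PatientID = 1;
COMMIT TRANSACTION;

-- Session 2 (staff member B): this UPDATE blocks until Session 1 commits and
-- releases its lock, so the two updates are applied one after the other rather
-- than silently overwriting each other.
BEGIN TRANSACTION;
    UPDATE DimPatient
    SET BloodType = 'A-'                     -- illustrative value
    WHERE PatientID = 1;
COMMIT TRANSACTION;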

ETL Process

The ETL flow for this assignment was designed as follows:

  • Extract data from the MediCare database in Microsoft SQL Server Management Studio.
  • Transform the data:
    • In the DimPatient table, values in the Gender column are converted: 'Male' becomes 'M' and 'Female' becomes 'F'.
    • In the DimDoctor table, the 'Dr.' prefix is removed from the FullName column.
  • Load the transformed data back into the same database.

Initially, the DimPatient and DimDoctor tables are as shown below:

DimPatient before ETL completion

DimDoctor before ETL completion

NiFi Flow

The extraction flow for NiFi is:

First, a database connection pool is set up to connect to the local server, and the GenerateTableFetch processor generates queries to select all data from the given tables (here DimPatient and DimDoctor).

The ExecuteSQL processor then executes the query generated by GenerateTableFetch and returns the data in Avro format.

Avro output returned from ExecuteSQL

ConvertAvroToJSON converts the Avro records to JSON, which makes the transformations easier to handle. Finally, the ConvertJSONToSQL processor converts the JSON records into SQL statements so that the transformations can be applied to the data.

Transformation is handled by the PutSQL processor.

Two cases of transformation are handled in the ETL flow (an equivalent SQL sketch follows the list below).

  • DimPatient’s ‘Male’ is transformed to ‘M’, and ‘Female’ is transformed to ‘F’ in our database
  • ‘Dr. ’ is removed from DimDoctor’s fullname column
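Expressed as plain SQL, the two transformations are equivalent to the statements sketched below; in the actual flow, the corresponding statements are generated per record by ConvertJSONToSQL and executed by PutSQL.

-- Equivalent SQL for the two transformations (a sketch; the NiFi flow generates
-- per-record statements rather than bulk updates).
UPDATE DimPatient
SET Gender = CASE Gender
                 WHEN 'Male' THEN 'M'
                 WHEN 'Female' THEN 'F'
                 ELSE Gender
             END;

UPDATE DimDoctor
SET FullName = LTRIM(REPLACE(FullName, 'Dr.', ''));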

The final results after the ETL run are:

DimPatient after ETL completion

DimDoctor after ETL completion

Big Data Processing

git clone https://github.com/silwalprabin/big-data-tools
cd big-data-tools

The above commands clone the repository used to set up a big data environment with tools such as Spark, Hadoop (HDFS), and Kibana.

docker-compose up -d

The docker-compose up -d command starts the multiple services (shown in the figure below) in containers. It also makes the Spark UI available on localhost.

Spark UI on localhost

Loading patient health records into Spark

First, a sample dataset (CSV) is loaded into HDFS using the following commands:

docker cp healthcare_data.csv namenode:/
docker exec -it namenode /bin/bash
hdfs dfs -mkdir -p /user/data
hdfs dfs -put /healthcare_data.csv /user/data/healthcare_data.csv

Reading the dataset from HDFS and converting it into a Spark DataFrame is achieved through a Python script that is then executed. The snippet below shows the relevant commands; the second image below shows the script file being copied into the container.

from pyspark.sql import SparkSession

# Initialize Spark session
sparks = SparkSession.builder.appName("PatientDataAnalysis").getOrCreate()

# Load data from HDFS (path matches the hdfs dfs -put command above)
df = sparks.read.csv("hdfs://namenode:9000/user/data/healthcare_data.csv", header=True, inferSchema=True)

Clean the patient dataset

This is achieved through the following code in the Python script:

from pyspark.sql.functions import col, initcap, when, lit

cleaned_df = (
    df.dropDuplicates()                # remove duplicate rows
      .na.drop()                       # drop rows containing null values
      .withColumn("Name", initcap(col("Name")))
      .withColumn("Date of Admission", col("Date of Admission").cast("date"))
      .withColumn("Discharge Date", col("Discharge Date").cast("date"))
      .withColumn("Gender", when(col("Gender").isin("M", "F"), col("Gender")).otherwise(lit("Unknown")))
)

The data is cleaned through the commands shown above.

dropDuplicates() removes duplicate rows, and na.drop() drops all rows containing N/A or null values.

withColumn("Name", initcap(col("Name"))) – the Name column contained data in mixed case, which is changed so that the first letter of each word is capitalised (e.g. DEviN ToWNSenD becomes Devin Townsend).

Date of Admission and Discharge Date are cast to the date data type to ensure consistent date formatting throughout the data.

Finally, Gender is checked to be either M or F; any other value is replaced with 'Unknown' so that only valid gender values remain in the column.

Aggregating patient data   

from pyspark.sql.functions import sum, count  # note: these shadow Python's built-ins

aggregated_df = (
    cleaned_df.groupBy("Name")
              .agg(
                  sum("Billing Amount").alias("Total Healthcare Costs"),
                  count("Past Treatment, surgeries, and procedures").alias("Treatment Count")
              )
)

  • .groupBy("Name") groups the rows in cleaned_df by the Name column.
  • sum("Billing Amount") totals the billing amounts for each patient after grouping.
  • count("Past Treatment, surgeries, and procedures") counts the treatment entries for each patient Name.

Analysis to identify high-risk patients

from pyspark.sql.functions import col, split

high_risk_df = (
    cleaned_df.filter(
        (col("Body Mass Index (BMI)") > 30) |
        (split(col("Blood Pressure"), "/").getItem(1).cast("int") > 140) |
        (col("Blood Glucose") > 125) |
        (col("Haemoglobin A1C") > 6.5) |
        (col("C-Reactive Protein (CRP)") > 10)
    ).select("Name", "Current Condition (Diagnosis)", "Body Mass Index (BMI)", "Blood Pressure", "Blood Glucose", "Haemoglobin A1C", "C-Reactive Protein (CRP)")
)

cleaned_df was filtered to identify high-risk patients using the following conditions:

  • Body Mass Index (BMI) > 30
  • Diastolic blood pressure > 140
  • Blood Glucose > 125
  • Haemoglobin A1C > 6.5
  • C-Reactive Protein (CRP) > 10

If any of these conditions is met for a patient, that patient's record is included in the output.

NoSQL

I installed and used MongoDB Compass to create and populate the collections, reusing the previously created database schema with the addition of a PatientFeedback collection.

Adding new connection string

Creating database

Database and Collections in MongoDB

Inserting a sample dataset of patient feedback

In the same way, I populated the other collections.

Queries to retrieve, analyze, and filter the feedback for patterns in patient satisfaction and treatment outcomes

The highlighted box is where queries are written in MongoDB Compass.

Retrieving feedback by patient ID

Retrieving feedback by multiple patient IDs

Feedback can be retrieved in a similar way by the contents of the comments.

Combining two queries
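As an indication of what such a combined filter looks like in the Compass query bar (the field names PatientID and Comments, and the example values, are assumptions based on the feedback collection described above):

{ "$and": [
    { "PatientID": { "$in": [1, 2] } },
    { "Comments": { "$regex": "excellent", "$options": "i" } }
] }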

Average ratings by doctor ID, sorted by doctor ID

Number of feedback entries per doctor

Average rating for doctors with more than one feedback entry

The final query calculates the average rating and the percentage of positive feedback (ratings of 4 and above) for each doctor-admission combination and sorts the results by the percentage of positive feedback in descending order.
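A sketch of such an aggregation pipeline, written for the mongo shell, is shown below; the field names DoctorID, AdmissionID, and Rating are assumptions based on the PatientFeedback collection described above.

// Sketch only: field names are assumed from the PatientFeedback collection.
db.PatientFeedback.aggregate([
  { $group: {
      _id: { doctor: "$DoctorID", admission: "$AdmissionID" },
      averageRating: { $avg: "$Rating" },
      positiveFeedbackPct: { $avg: { $cond: [{ $gte: ["$Rating", 4] }, 100, 0] } }
  } },
  { $sort: { positiveFeedbackPct: -1 } }
])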

Conclusion

Big Data technology exploration has been enlightening and fulfilling, providing both theoretical and practical information. I got the chance to work with strong tools like NoSQL databases for handling unstructured data, Apache Spark for distributed data processing, and ETL (Extract, Transform, Load) tools for effective data integration and transformation. This coursework has strengthened my technical skills and reinforced my understanding of how data-driven solutions can reshape healthcare, making it more efficient, predictive, and patient-centered.

References
