Data and Information Retrieval

Task 1: Database design

Introduction
The International Space Station (ISS) is a habitable artificial satellite in low Earth
orbit. It is the ninth space station to be inhabited by crews following previous orbital
stations that were launched by the US, the former Soviet Union and later Russia. The
ISS is intended to be a laboratory, observatory and factory in space as well as to
provide transportation, maintenance, and act as a staging base for possible future
missions to the Moon, Mars and beyond. In order to support the crew and overall
operation of ISS the space agencies in charge of running the station conduct regular
missions to launch spacecraft carrying payloads of essential or replacement
equipment up to ISS. A payload inventory, see table below, is recorded of each
mission, consisting of the space agency leading the mission and the equipment
payload to be sent up to ISS. The overall weight of the payload is also determined in
order to calculate the fuel needed for orbital insertion of the spacecraft to successfully
rendezvous with ISS.

Currently there is no database being used for managing the payload inventory
information in the table above.
This task is split up into two parts:

  1. In its current form, it’s a traditional DB. Keep it that way? Your call. Explain your
    decision.
  2. Design the database for the information above. Implement the DB using any
    method of your choice (SQL, MongoDB, Cassandra DB or Graph DB).
    Note:
    If it is a relational database it should be normalized first.

Task 2: Poster of ethics associated with a medical database

A hospital is considering producing a database from patient data it has collected over
the past 20 years to analyse itself and to sell to other interested parties. You should
create an A3 sized poster to describe ethics issues that the hospital should consider
before creating, analysing and making available this database. There are various
documents on the web on how to create a poster using PowerPoint. Please explore
these before you start.
Evidence:
You should create your A3 sized poster in PowerPoint and save it as a pdf
document for submission with your report. Your poster should identify ethical factors
that need to be considered when developing and analysing such a medical database,
offer recommendations to the hospital and draw conclusions.

Task 3: A data mining system for a Hospital

A hospital has been collecting a great deal of data on their patients and have heard
that use of data mining could improve their service. They would like you to create a
brief report that includes the following:
i. What data mining is and an appropriate data mining application for the
Hospital.
ii. How you would go about creating the system using the data mining life
cycle below.
iii. If the small amount of data (diabetes.arff) collected so far by the hospital
is appropriate for assessing if a person has diabetes.
iv. The use of a data mining model such as a multilayer perceptron or
decision tree to determine whether a person has diabetes. Note, you will
need to use a data mining tool like WEKA to create your model and use
the diabetes.arff data to train and test this model.

Deliverable:
Include a report section that addresses the four sections above and fulfils the marking
criteria.

Task 4: Your Big Data Big Idea

Identify and implement an idea that you have about how you would use Big Data for
something intriguing.

  1. Propose an idea and outline it clearly (what is the purpose of your data collection
    and analysis?). In the lecture notes above, you can see that each idea is specific and
    has a specific purpose.
  2. Acquire the Data. You can do that in many ways including using available public
    large data sets.
  3. Analyze the data in order to achieve the objective you set out for yourself in step 1.
  4. Produce a report that includes your results, data visualization and thoughts.
    Evidence:
    Write a short report about what you did and how it worked out. No more than 1000
    words. You should carry out research into these areas and reference your work using
    the CU Harvard Style.

=>

Task 1: Database design

Question 1

I have proposed MongoDB for the ISS database design and created the collection according to the payload inventory data.

MongoDB is a document-based database built on a scale-out architecture: multiple machines work together to form fast systems that can handle large amounts of data. Document-based databases are also flexible, meaning they can handle variations in the structure of documents and data (Why Use MongoDB & When to Use It?, 2020).

I chose MongoDB in particular because:

  • The document data model is a powerful way to store and retrieve data quickly.
  • MongoDB can easily support larger volumes of traffic and data.
  • MongoDB can enable collaboration among larger teams.
  • MongoDB can store, manage, and search data of a variety of types, such as text, geospatial, and even time-series data.

Question 2

Collection create query 1
Collection create query 2
Collection create query 3
Final result in tabular format
Final result in JSON format 1
Final result in JSON format 2
Final result in JSON format 3
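
As a rough illustration of the document model, a single payload inventory record could be shaped as in the sketch below. This is an assumption-laden example only: the field names (mission, agency, equipment, quantity, weight_kg) and all values are hypothetical placeholders rather than the actual inventory, and plain Python dictionaries stand in for MongoDB documents.

```python
# Hypothetical payload inventory document; field names and values are
# illustrative placeholders, not taken from the real inventory table.
payload_doc = {
    "mission": "EXAMPLE-01",
    "agency": "Example Space Agency",
    "equipment": [
        {"name": "water filter", "quantity": 2, "weight_kg": 12.5},
        {"name": "oxygen tank", "quantity": 4, "weight_kg": 30.0},
    ],
}


def total_payload_weight(doc):
    """Sum quantity * unit weight over the embedded equipment list."""
    return sum(item["quantity"] * item["weight_kg"] for item in doc["equipment"])


print(total_payload_weight(payload_doc))
```

Embedding the equipment list inside the mission document keeps each mission's payload in one place, so the overall weight can be computed from a single record. With a driver such as pymongo, a dictionary like this could be passed to a collection's insert_one method.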

Task 2: Poster of ethics associated with a medical database

Ethical poster for hospital

Task 3: A data mining system for a Hospital

Question 1

Data mining is the process of exploring and analyzing large sets of data to uncover meaningful patterns. The aim of data mining is to predict future trends and outcomes from historical data (Data Mining Explained, 2020).

Applications of data mining include:

  1. Database marketing
  2. Credit risk management
  3. Healthcare and bioinformatics
  4. Fraud detection
  5. Spam filtering
  6. Sentiment analysis

Data mining in healthcare

Healthcare professionals can use statistical models to predict a patient's future health condition from risk factors. Other data, such as demographic, family, and genetic data, can also be modeled to help patients change their lifestyle and so prevent or delay the onset of negative health conditions.

Benefits of data mining

Automated Decision-Making: Data mining can allow healthcare services to continually analyze data and automate critical decisions about a patient's future without the delay of human judgment.

Accurate Prediction and Forecasting: Data mining can facilitate efficient planning and can provide healthcare services with reliable predictions of a patient's health based on past trends and current conditions.

Cost Reduction: Data mining can allow more efficient use and allocation of healthcare resources. Services can plan and make better decisions with accurate predictions, which in turn reduces costs.

Question 2

Stages of Data Mining

Problem definition

In the context of hospital and healthcare services, the main problem to define in data mining can be the forecast or prediction of a patient's future health risks. It can also help healthcare employees make critical decisions on how to proceed, either financially or medically.

Data gathering

The data for the hospital can be gathered from the hospital's existing records. This is one of the early and important stages of the data mining process. The data gathered should be consistent, accurate and meaningful.

Model Building

After the data is collected, the next step is to build a data model for predicting future trends. The model can be trained using different approaches and algorithms according to the needs of the hospital.

Use Knowledge

Finally, once the model is built, it can be used to evaluate and predict future outcomes for patients based on the behavior and illnesses of past patients. The predictions will only be as relevant as the data used in training. If the results are inconsistent or inaccurate, the data can be evaluated and filtered further and the model retrained for better outcomes.
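
The four stages can be sketched end to end in a few lines. Everything here is an assumption for illustration: the records are made-up placeholders, the single attribute risk_score and the threshold rule are hypothetical, and a real system would use the hospital's own data and a proper learning algorithm.

```python
# 1. Problem definition: predict whether a patient is "high" or "low" risk.
# 2. Data gathering: made-up (risk_score, label) records standing in for
#    historical hospital data.
records = [
    (1, "low"), (2, "low"), (3, "low"), (6, "high"),
    (7, "high"), (8, "high"), (2, "low"), (9, "high"),
]

# Hold back the last quarter of the records for testing.
split = int(len(records) * 0.75)
train, test = records[:split], records[split:]


def build_model(rows):
    """3. Model building: threshold halfway between the two class means."""
    low = [x for x, label in rows if label == "low"]
    high = [x for x, label in rows if label == "high"]
    threshold = (sum(low) / len(low) + sum(high) / len(high)) / 2
    return lambda x: "high" if x > threshold else "low"


def accuracy(model, rows):
    """4. Use knowledge: share of held-out records predicted correctly."""
    return sum(model(x) == label for x, label in rows) / len(rows)


model = build_model(train)
print(accuracy(model, test))
```

If the accuracy on the held-out records were poor, the loop would return to the data gathering and model building stages, as described above.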

Question 3

The small amount of data provided is not sufficient for data mining. The main reason is that predictions from a model built on so little data are very limited and inconsistent. For example, I created two decision trees: one from a small sample of 7 instances, and another from around 700 instances.
The tree built from the small sample has only 2 leaves, each extending from preg.

Upon evaluation, the result shows that if preg is less than or equal to 4 the instance is classified as negative, and if it is greater than 4 it is classified as positive.
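
The rule learned from the small sample amounts to a single threshold test, which could be written as the sketch below. The function name and the class labels tested_negative and tested_positive are assumptions for illustration.

```python
def small_tree_predict(preg):
    """One-split tree from the small sample: only the preg attribute is tested."""
    return "tested_negative" if preg <= 4 else "tested_positive"


# Every other attribute (plas, mass, age, pedi, ...) is ignored by this tree.
for preg in (0, 4, 5, 10):
    print(preg, small_tree_predict(preg))
```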

The decision tree built from the larger amount of data, by contrast, has multiple leaves extending from plas, and preg appears as a sub-node of it.

Comparing the two decision trees shows a huge difference between them. The critical differences are the variables and nodes each tree uses and the resulting prediction of positive or negative.

If values for variables other than preg, such as mass, age or pedi, are provided to the first model, they are ignored, so its result is likely to be inconsistent and inaccurate. The second model, by contrast, can use those variables to make a stronger and more accurate prediction.

It is not that a data model cannot be built from a small amount of data, but its results, predictions and forecasts are at high risk of being inaccurate. Hence, the more data provided to the model, the better the predictions it can generally offer.

Question 4

Data mining Weka 1
Data mining Weka 2
Data mining Weka 3
Data mining Weka 4
Data mining Weka 5
Data mining Weka 6
Data mining Weka 7
Data mining Weka 8
Data mining Weka 9
Data mining Weka 10

Task 4: Your Big Data Big Idea

Question 1

I propose to analyze and visualize a dataset of Netflix TV shows and movies. The main idea behind this analysis is to gain in-depth knowledge of the Netflix data: its size, and how it is divided into data types. Additionally, I think it would be efficient and effective to break down the large Netflix dataset to gain insights such as how many titles are movies and how many are TV shows. Likewise, it would also be interesting to discover the differences within the dataset.

I will visualize the whole network of data, and also generate other charts and graphs to provide some valuable insights into the Netflix dataset.

Question 2

I acquired this Netflix dataset from the Kaggle website. URL for this particular dataset: https://www.kaggle.com/shivamb/netflix-shows. I have not used all the data provided by the link, but I think the subset I have used is enough to show my findings.

Dataset example of Netflix for visualization

Question 3

There are a total of 6236 unique records in this dataset. The dataset has 6 columns, with the headers id, type, title, rating, duration and description.

The id is the unique id of the show, and the type states whether the Netflix title is a TV show or a movie. The title is the main title of the show. The rating categorizes the show by rating, such as “TV-14”, “R”, “TV-Y”, etc. The duration contains either the number of seasons of a show or the total minutes of a movie. The description is simply the description of the show.
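
A minimal sketch of how these six columns could be summarized is shown below. The rows are made-up placeholders, and the exact values assumed for type ("Movie", "TV Show") and the duration format ("90 min", "2 Seasons") are assumptions about the file rather than details stated above.

```python
import csv
import io
from collections import Counter

# Made-up rows using the six column headers described above; the real file
# from Kaggle would be opened with open(...) instead of io.StringIO.
SAMPLE_CSV = """id,type,title,rating,duration,description
1,Movie,Example Movie A,R,90 min,Placeholder description
2,TV Show,Example Show B,TV-14,2 Seasons,Placeholder description
3,Movie,Example Movie C,TV-14,120 min,Placeholder description
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

# How many titles are movies and how many are TV shows.
type_counts = Counter(row["type"] for row in rows)

# How many titles fall under each rating.
rating_counts = Counter(row["rating"] for row in rows)


def duration_value(row):
    """Leading number of the duration field (minutes or seasons)."""
    return int(row["duration"].split()[0])


# Movies ordered from longest to shortest running time.
movies = sorted(
    (row for row in rows if row["type"] == "Movie"),
    key=duration_value,
    reverse=True,
)

print(type_counts)
print(rating_counts)
print([m["title"] for m in movies[:10]])
```

Counts like these are the kind of input a bar chart of ratings or a movie versus TV show comparison would need.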

Question 4

Data visualization using Gephi
Edges of R rated shows of Netflix
Data visualization – Top 10 shows of Netflix according to duration
Data visualization – Total number of Netflix shows with ratings
Data visualization – Total number of TV shows and movies

References

MongoDB. 2020. Why Use MongoDB & When to Use It? [online] Available at: <https://www.mongodb.com/why-use-mongodb> [Accessed 17 August 2020].

MicroStrategy. 2020. Data Mining Explained. [online] Available at: <https://www.microstrategy.com/us/resources/introductory-guides/data-mining-explained#healthcare> [Accessed 19 August 2020].
