
Data and Information Retrieval
Task 1: Database design
Introduction
The International Space Station (ISS) is a habitable artificial satellite in low Earth
orbit. It is the ninth space station to be inhabited by crews, following previous orbital
stations that were launched by the US, the former Soviet Union and later Russia. The
ISS is intended to be a laboratory, observatory and factory in space, as well as to
provide transportation and maintenance and to act as a staging base for possible future
missions to the Moon, Mars and beyond. In order to support the crew and the overall
operation of the ISS, the space agencies in charge of running the station conduct regular
missions to launch spacecraft carrying payloads of essential or replacement
equipment up to the ISS. A payload inventory (see the table below) is recorded for each
mission, consisting of the space agency leading the mission and the equipment
payload to be sent up to the ISS. The overall weight of the payload is also determined in
order to calculate the fuel needed for orbital insertion of the spacecraft to successfully
rendezvous with the ISS.

Currently there is no database being used for managing the payload inventory
information in the table above.
This task is split up into two parts:
- In its current form, it’s a traditional DB. Keep it that way? Your call. Explain your
decision.
- Design the database for the information above. Implement the DB using any
method of your choice (SQL, MongoDB, Cassandra DB or Graph DB).
Note:
If it is a relational database it should be normalized first.
Task 2: Poster of ethics associated with a medical database
A hospital is considering producing a database from patient data it has collected over
the past 20 years to analyze itself and to sell to other interested parties. You should
create an A3 sized poster to describe ethics issues that the hospital should consider
before creating, analyzing and making available this database. There are various
documents on the web on how to create a poster using PowerPoint. Please explore
these before you start.
Evidence:
You should create your A3 sized poster in PowerPoint and save it as a pdf
document for submission with your report. Your poster should identify ethical factors
that need to be considered when developing and analyzing such a medical database,
offer recommendations to the hospital and draw conclusions.
Task 3: A data mining system for a Hospital
A hospital has been collecting a great deal of data on its patients and has heard
that the use of data mining could improve its service. It would like you to create a
brief report that includes the following:
i. What data mining is and an appropriate data mining application for the
Hospital.
ii. How you would go about creating the system using the data mining life
cycle below.
iii. If the small amount of data (diabetes.arff) collected so far by the hospital
is appropriate for assessing if a person has diabetes.
iv. The use of a data mining model such as a multilayer perceptron or
decision tree to determine whether a person has diabetes. Note, you will
need to use a data mining tool like WEKA to create your model and use
the diabetes.arff data to train and test this model.

Deliverable:
Include a report section that addresses the four sections above and fulfils the marking
criteria.
Task 4: Your Big Data Big Idea
Identify and implement an idea that you have about how you would use Big Data for
something intriguing.
- Propose an idea and clearly outline it (what is the purpose of your data collection
and analysis?). In the lecture notes above, you can see that each idea is specific and
has a specific purpose.
- Acquire the data. You can do that in many ways, including using available public
large data sets.
- Analyze the data in order to achieve the objective you set out for yourself in step 1.
- Produce a report that includes your results, data visualization and thoughts.
Evidence:
Write a short report about what you did and how it worked out. No more than 1000
words. You should carry out research into these areas and reference your work using
the CU Harvard Style.
=>
Task 1: Database design
Question 1
I have proposed MongoDB for the ISS database design and created the collection according to the data.
MongoDB is a document-based database built on a scale-out architecture, meaning that multiple machines work together to form a fast system that can handle large amounts of data. Document-based databases are flexible: they can handle variations in the structure of documents and data (Why Use MongoDB & When to Use It?, 2020).
I chose MongoDB in particular because:
- The document data model is a powerful way to store and retrieve data quickly.
- MongoDB can easily support large volumes of traffic and data.
- MongoDB can enable collaboration between large numbers of team members.
- MongoDB can store, manage, and search data of a variety of types, such as text, geospatial, and even time-series data.
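As a minimal sketch of the flexibility described above, the following Python snippet models two payload inventory entries as MongoDB-style documents (plain dictionaries). All field names (mission, agency, payload, weight_kg, notes) and all values are hypothetical, since the actual payload inventory table is not reproduced here. The second document carries an extra field without any schema change, and the overall payload weight is derived from the embedded items.

```python
# Hypothetical payload inventory documents; field names and values are
# illustrative assumptions, not taken from the assignment's table.
missions = [
    {
        "mission": "M-001",
        "agency": "Agency A",
        "payload": [
            {"item": "Water filter", "quantity": 2, "weight_kg": 12.5},
            {"item": "Solar panel", "quantity": 1, "weight_kg": 40.0},
        ],
    },
    {
        "mission": "M-002",
        "agency": "Agency B",
        "notes": "replacement equipment",  # extra field, no schema change needed
        "payload": [
            {"item": "Oxygen tank", "quantity": 3, "weight_kg": 20.0},
        ],
    },
]


def total_payload_weight(mission):
    """Sum quantity * unit weight over every item in one mission document."""
    return sum(p["quantity"] * p["weight_kg"] for p in mission["payload"])


for m in missions:
    print(m["mission"], m["agency"], total_payload_weight(m), "kg")
```

With a running MongoDB server, a list like this could be stored by passing it to pymongo's insert_many on a collection; that step is left out here so the sketch runs without a database.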
Question 2







Task 2: Poster of ethics associated with a medical database

Task 3: A data mining system for a Hospital
Question 1
Data mining is the process of exploring and analyzing sets of data to uncover useful and meaningful patterns. The aim of data mining is to predict future trends and outcomes from historical data (Data Mining Explained, 2020).
Applications of data mining include:
- Database marketing
- Credit risk management
- Healthcare and bioinformatics
- Fraud detection
- Spam filtering
- Sentiment analysis
Data mining in healthcare
Healthcare professionals can use statistical models to predict a patient’s future health condition from their risk factors. Other data, such as demographic, family, and genetic data, can be used to build a model that helps patients change their lifestyle early enough to prevent or delay the onset of negative health conditions.
Benefits of data mining
Automated Decision-Making: Data mining can allow healthcare services to continually analyze data and automate critical decisions about a patient’s future without the delay of human judgment.
Accurate Prediction and Forecasting: Data mining can facilitate efficient planning and provide healthcare services with reliable predictions of a patient’s health based on past trends and current conditions.
Cost Reduction: Data mining can allow more efficient use and allocation of healthcare resources. With accurate predictions, services can plan and make the best decisions, which reduces costs.
Question 2

Stages of Data Mining
Problem definition
In the context of hospital and healthcare services, the main problem for data mining to address can be forecasting a patient’s future health risks. It can also help healthcare staff make critical decisions on how to proceed, either financially or medically.
Data gathering
The data for the hospital can be gathered from the hospital’s existing records. This is one of the earliest and most important stages of the data mining process. The data gathered should be consistent, accurate and meaningful.
Model Building
After the data is collected, the next step is to build a model for predicting future trends. The model can be trained using different approaches and algorithms, depending on the needs of the hospital.
Use Knowledge
Finally, once the model is built, it can be used to evaluate and predict patients’ future outcomes based on the behavior and illnesses of previous patients. The predictions are only as relevant as the data used in training. If the results are inconsistent or inaccurate, the data can be re-evaluated and filtered, and the model retrained for better outcomes.
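The check for inconsistent or inaccurate results can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names train_test_split and accuracy, the 25% test fraction and the toy records are all hypothetical, and a tool such as WEKA performs the equivalent split and evaluation itself.

```python
import random


def train_test_split(records, test_fraction=0.25, seed=0):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predict, labelled_records):
    """Fraction of (features, label) pairs where predict(features) == label."""
    if not labelled_records:
        raise ValueError("need at least one labelled record")
    correct = sum(
        1 for features, label in labelled_records if predict(features) == label
    )
    return correct / len(labelled_records)


# Toy labelled records: (features, label). Values are made up for illustration.
records = [({"plas": 90 + 10 * i}, "positive" if i >= 4 else "negative")
           for i in range(8)]
train, test = train_test_split(records)

# A stand-in "model": a fixed threshold on plas, chosen only for the demo.
def predict(features):
    return "positive" if features["plas"] >= 130 else "negative"

print(len(train), len(test), accuracy(predict, test))
```

If the accuracy on the held-out test records is poor, that is the signal to revisit the data gathering and model building stages before using the model.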
Question 3
The small amount of data provided is not really adequate for data mining. The main reason is that predictions from a model built on a small dataset are very limited and inconsistent. For example, I created two decision trees: one from a small sample of 7 records and another from around 700 records.
The tree built from the small sample has only 2 leaves, each extending from preg.

On evaluation, the result shows that if preg is less than or equal to 4 the case is classified as tested negative, and if it is greater than 4 as tested positive.
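Written out as code, the rule learned from the small sample is a single threshold test. The sketch below is a hand transcription of that rule, not WEKA output; the label strings and the two example records are assumptions made for illustration.

```python
def classify_small_tree(record):
    """Rule read off the small tree: preg <= 4 -> negative, otherwise positive.

    Only the 'preg' attribute is consulted; every other attribute is ignored.
    """
    return "tested_negative" if record["preg"] <= 4 else "tested_positive"


# Two hypothetical patients who differ in every attribute except preg.
patient_a = {"preg": 2, "plas": 190, "mass": 45.0, "age": 60}
patient_b = {"preg": 2, "plas": 90, "mass": 22.0, "age": 25}

print(classify_small_tree(patient_a))
print(classify_small_tree(patient_b))
```

Both records receive the same classification because the rule never looks at plas, mass or age.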
By contrast, the decision tree built from the large amount of data has multiple leaves extending from plas, with preg appearing beneath it as a sub-node.

Comparing the two decision trees shows a huge difference between them. The critical differences are the variables and nodes each one uses and the resulting prediction of whether a case is positive or negative.
If data with variables other than preg, such as mass, age or pedi, are given to the first model, the result will almost certainly be inconsistent and inaccurate, whereas the second model can make a stronger and more accurate prediction from those variables.
It is not that a model cannot be built from a small amount of data, but its results, predictions and forecasts are at high risk of being inaccurate. Hence, the more data provided to the model, the better the predictions it can offer.
Question 4










Task 4: Your Big Data Big Idea
Question 1
I propose to analyze and visualize a dataset of Netflix TV shows and movies. The main idea behind this analysis is to gain in-depth knowledge of Netflix’s data: its size and how it is divided into data types. Additionally, I think it would be efficient and effective to break down the large Netflix dataset to gain insights such as how many titles are movies and how many are TV shows. Likewise, it would be interesting to discover the differences in data within the dataset.
I will visualize the whole network of data, and also generate other charts and graphs to provide valuable insights into the Netflix dataset.
Question 2
I acquired this Netflix dataset from the Kaggle website. URL for this dataset: https://www.kaggle.com/shivamb/netflix-shows. I have not used all the data provided at the link, but I think the portion I have used is enough to show my findings.

Question 3
There are 6236 unique records in total in this dataset. The dataset has 6 columns, with the headers id, type, title, rating, duration and description.
The id is the unique id of the show; the type states whether the title is a TV show or a movie. The title is the main title of the show. The rating categorizes the show by TV rating, such as “TV-14”, “R”, “TV-Y”, etc. The duration holds either the number of seasons of a show or the total minutes of a movie. The description is simply the description of the show.
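A first pass over these columns can be sketched with the Python standard library. The three sample rows below are made up for illustration, and the duration formats ("95 min", "2 Seasons") are an assumption about how the field is written; the real file would be read from disk instead of from an in-memory string.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows using the six column headers described above.
SAMPLE_CSV = """id,type,title,rating,duration,description
1,Movie,Example Film,R,95 min,A sample movie.
2,TV Show,Example Series,TV-14,2 Seasons,A sample series.
3,Movie,Another Film,TV-Y,80 min,Another sample movie.
"""


def count_by_type(rows):
    """Count how many rows are movies and how many are TV shows."""
    return Counter(row["type"] for row in rows)


def parse_duration(value):
    """Split a duration such as '95 min' or '2 Seasons' into (number, unit)."""
    number, unit = value.split(maxsplit=1)
    return int(number), unit


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
counts = count_by_type(rows)
print(counts["Movie"], counts["TV Show"])
print([parse_duration(row["duration"]) for row in rows])
```

Counts like these are the kind of input the charts and graphs would be built from.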




