SECI2143 - Project 2 Report - MyePortfolio@UTM

Introduction

On April 15, 1912, history was made. The infamous ‘unsinkable’ Titanic sank after colliding with an iceberg, and the shipwreck was reported to have caused the deaths of 1502 out of 2224 passengers and crew. Although surviving a shipwreck is largely a matter of luck, it is believed that certain groups of people had a better chance of surviving than others. Our group is interested in finding out whether a person’s attributes play a role in the chances of surviving a shipwreck, focusing specifically on the wealth, gender and age attributes in the data. In this report, we discuss this claim by describing the sample dataset, explaining how we conducted four different types of data analysis tests, and presenting the conclusions of our findings.

The Dataset

2.1 Data Variables and Description

The Titanic dataset retrieved from Kaggle (train.csv) contains a sample of 891 passengers out of a population of 2224. The data has a total of 12 variables. The table below shows the variables available in the CSV file.

From the provided data, we decided to analyze the attributes that are most likely to have a significant effect on the chances of surviving. The chosen variables are:
1. Survived
2. Pclass
3. Sex
4. Age
5. Fare

2.2 Data Pre-processing

We first checked our dataset to make sure the data was clean, so that missing values would not introduce bias or miscalculation into the tests. Among the chosen variables, we found that only ‘Age’ has missing values: 177 out of 891. To counter this issue, we filled in the empty values by imputation with the median, as the median is not affected by extreme values.
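
As a minimal sketch of median imputation (written in Python for illustration, not the R code used for this project), the snippet below fills missing entries with the median of the observed values. The function name `impute_median` and the sample ages are hypothetical and are not taken from train.csv.

```python
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Hypothetical ages for illustration only; None marks a missing value.
ages = [22.0, None, 38.0, 26.0, None, 80.0]
filled = impute_median(ages)
print(filled)
```

In this toy example the median of the observed ages is 32, while the mean would be 41.5 because of the single large value, which illustrates why the median is the more robust fill value.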

Data Analysis

3.1 Hypothesis Testing One Sample Test

The null hypothesis is that wealthier passengers had a 50% chance of surviving, the same as other passengers. The alternative hypothesis is that wealthier passengers had a chance of surviving higher than 50%.

H₀: p = 0.5
H₁: p > 0.5

Based on the data, 223 of the 400 wealthy passengers survived the shipwreck, giving a sample proportion of 223/400 = 0.5575 and a test statistic of z = (0.5575 - 0.5) / √(0.5 × 0.5 / 400) = 2.3. The corresponding p-value is P(z > 2.3) = 0.01072. Using a significance level of 0.05 (95% confidence level), we compare the p-value with the alpha value. Since the p-value is less than the alpha value, H₀ is rejected. There is sufficient evidence to support the claim that wealthier people have a higher chance of surviving a shipwreck.

0.01072 < 0.05
P value < alpha value
H₀ is rejected
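
The test above can be sketched in Python as follows (an illustration only, not the R script used in the project). It assumes the usual one-sample z test for a proportion, z = (p̂ - p₀) / √(p₀(1 - p₀)/n), and takes the upper-tail probability from the standard normal distribution. The function name `one_sample_proportion_z` is hypothetical.

```python
import math


def one_sample_proportion_z(successes, n, p0):
    """Right-tailed one-sample z test for a proportion.

    Returns the z statistic and the upper-tail p-value P(Z > z).
    """
    p_hat = successes / n
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    z = (p_hat - p0) / standard_error
    # Upper tail of the standard normal via the complementary error function.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Figures from the report: 223 survivors among 400 wealthy passengers, p0 = 0.5.
z, p_value = one_sample_proportion_z(223, 400, 0.5)
alpha = 0.05
reject_h0 = p_value < alpha
print(round(z, 2), round(p_value, 5), reject_h0)
```

With the reported counts, this sketch should agree with the z value of 2.3 and the p-value of about 0.0107 given above.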

We therefore rejected our null hypothesis at the 5% significance level: the data suggest that wealthier people had a better chance of surviving. This may be because higher-class cabins had better facilities, better equipment to protect passengers in an emergency, and a safer location than lower-class cabins. Additionally, first-class passengers were usually given priority, followed by second class and then third class.

3.2 Chi Square Test of Independence

For this test, the null hypothesis is that gender does not play a role in surviving a shipwreck, which means that the survival rate is independent of gender. The alternative hypothesis is that the survival rate is related to the gender of the passengers.

H₀: The chance of survival is independent of gender
H₁: The chance of survival is dependent on gender

The test has 1 degree of freedom and a significance level of 0.05. Based on the Chi Square distribution table, the critical value is 3.841. After doing the analysis, we obtained a test statistic of 263.05.

Test statistic: χ² = 263.05
Critical value: χ²(1, 0.05) = 3.841

263.05 > 3.841
H₀ is rejected
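
The following Python sketch shows how a Chi Square statistic for a 2 × 2 table can be computed from observed and expected counts (an illustration only, not the R code used in the project). The cell counts are an assumption: they are the gender-by-survival counts commonly reported for the Kaggle train.csv file and do not appear in this report. They are used here because, without a continuity correction, they give a statistic that matches the reported 263.05. The function name is hypothetical.

```python
def chi_square_independence(table):
    """Pearson Chi Square statistic (no continuity correction) and degrees of freedom."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# ASSUMED counts (not stated in the report): rows = female, male;
# columns = survived, did not survive.
observed = [[233, 81],
            [109, 468]]
chi2, dof = chi_square_independence(observed)
critical_value = 3.841  # from the Chi Square table, df = 1, alpha = 0.05
print(round(chi2, 2), dof, chi2 > critical_value)
```

Because the statistic is far above the critical value, the decision to reject the null hypothesis does not depend on small differences in how the table is tabulated.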

From here, we conclude that the null hypothesis is rejected: the chances of survival are dependent on gender. In this case, we observed that females had a higher chance of surviving than males. This may be because females were given priority for the lifeboats while males stayed behind to protect them. Although there were more male than female passengers, the total number of male survivors is lower than the number of female survivors.

3.3 Correlation Test

For this test, we use the variables Age and Fare since those are the only numerical variables available. We would like to determine whether age has a linear relationship with the fare price. The null hypothesis is that there is no correlation between age and fare, which means younger and older people paid around the same fare. The alternative hypothesis is that there is a correlation between age and fare.

H₀: ρ = 0
H₁: ρ ≠ 0

Pearson’s product-moment correlation coefficient is used for this analysis since both variables are on a ratio scale. After analysing the data in R, we obtained r = 0.09606669 and t = 2.5753, with a p-value of 0.01022. Diagram 1 shows the scatter plot obtained from our data.

Using a significance level of 0.05, we compare the p-value with the alpha value. Since the p-value is less than the alpha value, H₀ is rejected. There is sufficient evidence of a linear relationship between age and fare at the 5% level of significance.

0.01022 < 0.05
P value < alpha value
H₀ is rejected
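
The t statistic for a Pearson correlation is related to r by t = r√(n - 2) / √(1 - r²), with n - 2 degrees of freedom. The Python sketch below applies this formula to the reported r (an illustration only, not the R code used in the project). The sample size n = 714 is an assumption: the report does not state how many (Age, Fare) pairs entered the test, and this value is used because it reproduces the reported t of 2.5753. The function name is hypothetical.

```python
import math


def t_from_r(r, n):
    """t statistic for testing rho = 0, given sample correlation r and n pairs."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


r = 0.09606669  # correlation coefficient reported above
n = 714         # ASSUMPTION: number of (Age, Fare) pairs, not stated in the report
t = t_from_r(r, n)
print(round(t, 4), "df =", n - 2)
```

The formula also shows why a very small r can still be statistically significant: with several hundred observations, the √(n - 2) factor makes t large enough to pass the 5% threshold.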

In conclusion, there is a linear correlation between fare and age, but it is very weak because r is very close to 0. The direction of the correlation is positive since r = 0.09606669. Although our result provides evidence that a linear correlation exists between the two variables, we believe that it is not very accurate because of the 2 outliers in the dataset.

3.4 Regression Test

For this test, we continue the analysis from the correlation test, using the Age and Fare variables since those are the only numerical variables available.

The null hypothesis is that the slope of the regression line is zero, which means age has no linear effect on fare and younger or older people would pay around the same fare price. The alternative hypothesis is that the slope is not zero, so fare changes linearly with age. Here β₁ denotes the slope coefficient of Age.

H₀: β₁ = 0
H₁: β₁ ≠ 0

Based on our dataset, the fitted intercept is 24.3009 and the slope (Age) is 0.3500. Together, the intercept and slope define the line of best fit through the data points, which gives the formula Fare = 24.3009 + 0.35(Age). For example, if a passenger is 20 years old, the model predicts that the fare is, on average, around 24.3009 + 0.35(20) = 31.30.
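
As a small illustration of how the fitted line is used for prediction, the Python sketch below evaluates Fare = 24.3009 + 0.35(Age) for a given age. It only reuses the coefficients reported above and does not refit the model; the names `INTERCEPT`, `SLOPE` and `predict_fare` are hypothetical.

```python
INTERCEPT = 24.3009  # fitted intercept reported above
SLOPE = 0.3500       # fitted slope for Age reported above


def predict_fare(age):
    """Average fare predicted by the fitted line for a passenger of the given age."""
    return INTERCEPT + SLOPE * age


fare_at_20 = predict_fare(20)
print(round(fare_at_20, 2))
```

Each additional year of age raises the predicted fare by the slope, 0.35, which is a small change compared with the spread of fares and is consistent with the weak correlation found earlier.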

In conclusion, the fitted regression line indicates a positive linear relationship between age and fare. The fit may be slightly inaccurate because 2 outliers were present in our R output, and a least-squares regression line is sensitive to outliers. In addition, for a well-behaved fit the residual summary reported by R (Min, 1Q, Median, 3Q and Max) should be roughly balanced around 0; in our output it is not, which also suggests that the fit is imperfect.

Conclusion

In conclusion, the chances of surviving a shipwreck depend on a person's attributes. Our initial stance, that gender, wealth and age play no role, is shown to be false. It is observed that females, rich people and children are the groups more likely to survive, as they were given priority. The most interesting finding from this analysis is that wealthier people have higher survival rates. After researching on the Internet, we found that cabins of different classes were located in different parts of the ship: the first-class and second-class cabins were located in the top and central parts of the ship.

Appendix

The dataset source is available on Kaggle: Titanic - Machine Learning from Disaster.

Reflection

By doing this analysis project, I gained knowledge of RStudio and learned to use the R language to analyse statistical data. Data pre-processing before the analysis is required to minimize errors during calculation. In our dataset, for instance, the age data has some missing values that we needed to replace so that the data representation is more accurate. In my opinion, choosing the right dataset is important because it leads to a better analysis and better results in the tests, especially the correlation and regression tests. Furthermore, this project taught me to understand the application of each test and which data variables are most suitable for each one. In this way, I can organize and categorize my data better so that the results are clean and accurate.

File(s) to download

PSDA_Project2_sec09.zip - Saturday, 03 July 2021 [1.3MB]
For more details, please refer to this ZIP file.

train.csv - Saturday, 03 July 2021 [59.8KB]
We used this dataset for the calculation.

Video About Report

psda-video_cargRvZa.mp4 [26.42MB]
