SECI2143 - Probability and Statistical Data Analysis

Project 2: Hypothesis Testing on Attributes of Heart Disease Patients

Introduction

     This dataset contains data on patients suffering from heart disease. It was obtained from https://www.kaggle.com/ronitf/heart-disease-uci, where it was uploaded by the user ronit. The variables in this dataset are age, sex, chest pain type, resting blood pressure, serum cholesterol, fasting blood sugar level, resting electrocardiographic results, maximum heart rate achieved, exercise-induced angina, ST depression induced by exercise relative to rest, slope of the peak exercise ST segment, number of major vessels, Thallium injection, and presence of heart disease. In total, there are 14 variables, of which six are ratio data and the other eight are nominal data, as shown in Data Description.

Objective

     For this project, the variables used are age, sex, chest pain type, resting blood pressure, serum cholesterol, and maximum heart rate achieved. Four of these variables (age, resting blood pressure, serum cholesterol, and maximum heart rate) are ratio data, while the other two (sex and chest pain type) are nominal data. The purpose of this study is to find out whether the mean resting blood pressure differs between heart disease patients aged above 55 and those aged 55 and below, to see if there is any correlation between serum cholesterol level and maximum heart rate, to find out if there is a linear relationship between resting blood pressure and maximum heart rate, and to see whether chest pain type is related to the gender of the patients.

Hypothesis Testing for 2 Samples

This hypothesis test checks whether there is any difference between the mean blood pressure of patients aged above 55 and the mean blood pressure of patients aged 55 and below. The null hypothesis is that the means of the two groups are the same, and the alternative hypothesis is that the means are different. The significance level for this test is 0.05 and the variances are assumed to be unequal.

 

H0 : Average blood pressure of patients aged above 55 = Average blood pressure of patients aged 55 and below.

H1 : Average blood pressure of patients aged above 55 != Average blood pressure of patients aged 55 and below.

α = 0.05

hypo.jpg

 

The screenshot above shows the line used in RStudio to calculate the critical and test t values. The value of tval is the test statistic, while t.alpha is the critical t value. Therefore, the test statistic is 4.793 and the critical value is ±1.968, since the test is two-tailed.

 

Since the test statistic (4.793) is greater than the critical value (1.968), the null hypothesis is rejected. There is sufficient evidence to conclude that the average blood pressure of heart disease patients aged above 55 is different from that of heart disease patients aged 55 and below.
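The calculation in the screenshot was done in RStudio and is not reproduced here. As a rough illustration only, the sketch below computes a two-sample t statistic with unequal variances (Welch's form) in plain Python and applies the two-tailed decision rule. The function names and the toy readings are hypothetical and are not taken from the dataset; only the values 4.793 and 1.968 come from the report.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for unequal variances."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each group mean: sample variance / group size
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t_stat, df


def reject_two_tailed(t_stat, t_crit):
    """Two-tailed decision rule: reject H0 when |t| exceeds the critical value."""
    return abs(t_stat) > abs(t_crit)


# Hypothetical toy readings, NOT taken from the dataset
above_55 = [140, 150, 135, 145, 160]
at_most_55 = [120, 130, 125, 118, 132]
toy_t, toy_df = welch_t(above_55, at_most_55)

# Decision rule applied to the values quoted in the report
report_decision = reject_two_tailed(4.793, 1.968)  # True, so H0 is rejected
```

The critical value itself depends on the degrees of freedom, so the 1.968 quoted in the report applies to the report's data and not to the toy samples.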

Correlation Between Serum Cholesterol and Maximum Heart Rate

korelasi.jpeg

      Based on the scatter plot of serum cholesterol against maximum heart rate shown above, the points appear to be scattered randomly and the two variables do not seem to depend on each other. There is also no apparent linear relationship between the two variables. The correlation coefficient is calculated in RStudio using the cor() function.

 

corelationcoeff.jpg

    The value of the correlation coefficient is very close to zero, so the linear correlation between the two variables is very weak. The value is consistent with the pattern seen in the scatter plot, which suggests that a higher level of serum cholesterol is not associated with a higher maximum heart rate in heart disease patients.

 

H0 : ρ = 0 (no linear correlation)

H1 : ρ != 0 (linear correlation exists)

α = 0.05

 

     For the significance test for correlation, the function cor.test() is used in RStudio. The significance level of the test is 0.05.

 

korelasi2.jpg

     The t test value obtained from running the function is -0.17246 and the critical t value is -1.69932. Since the absolute value of the test statistic (0.17246) is smaller than the absolute value of the critical value (1.69932), the null hypothesis is not rejected. There is insufficient evidence of a linear correlation between the serum cholesterol level and maximum heart rate of the patients.
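The report obtains these values from cor() and cor.test() in RStudio. The sketch below is only a plain-Python illustration of the underlying arithmetic: the Pearson correlation coefficient r, and the t statistic t = r * sqrt((n - 2) / (1 - r^2)) used to test H0 : ρ = 0. The function names and the toy values are hypothetical and are not taken from the dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def correlation_t(r, n):
    """t statistic for testing rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Hypothetical toy values, NOT taken from the dataset
chol = [200, 240, 180, 260, 220]
max_hr = [150, 140, 160, 155, 145]

r = pearson_r(chol, max_hr)
t_stat = correlation_t(r, len(chol))
```

A value of r near zero gives a t statistic near zero, which is why a very weak correlation fails to reject the null hypothesis.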

Regression Between Resting Blood Pressure and Maximum Heart Rate

regresi.jpeg

     The graph shows the scatter plot and regression line for blood pressure and maximum heart rate of the patients. Based on the scatter plot, there is no apparent relationship between the two variables, and the regression line is almost flat, which points the same way. The summary() function is used to find the intercept and slope for the regression equation.

 

regresi2.jpg

The regression equation is: y = 157.674 - 0.061x, where x is the resting blood pressure and y is the maximum heart rate.

 

H0 : β1 = 0 (no linear relationship)

H1 : β1 != 0 (linear relationship exists)

α = 0.05

 

The p-value for the regression test is also obtained from the summary() function. The obtained p-value is 0.418. Since the p-value (0.418) is larger than α (0.05), the null hypothesis is not rejected. There is not enough evidence to suggest that there is a linear relationship between blood pressure and maximum heart rate of heart disease patients.
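The intercept, slope, and p-value in the report come from summary() in RStudio. As a hedged illustration of what lies behind those numbers, the plain-Python sketch below fits a least-squares line and computes the t statistic for the slope (slope divided by its standard error). The function name and the toy values are hypothetical and are not taken from the dataset.

```python
import math


def fit_line(xs, ys):
    """Least-squares fit y = intercept + slope * x; also returns the slope's t statistic."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares, then the standard error of the slope (n - 2 df)
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    se_slope = math.sqrt(sse / (n - 2) / sxx)
    return intercept, slope, slope / se_slope


# Hypothetical toy values, NOT taken from the dataset
resting_bp = [120, 130, 140, 150, 160]
max_hr = [155, 150, 152, 148, 145]

intercept, slope, t_slope = fit_line(resting_bp, max_hr)
```

Once an equation is fitted, a prediction is a direct substitution. With the report's equation, a resting blood pressure of 130 gives 157.674 - 0.061 * 130, or about 149.7.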

Chi-square Test of Independence

     For the chi-square test of independence, the two variables tested are chest pain type and gender. The null hypothesis is that chest pain type does not have any relationship with gender, and the alternative hypothesis is that chest pain type has a relationship with gender. The significance level, α, is 0.05. The chisq.test() function is used in RStudio to find the chi-square value and the p-value. A table of gender and chest pain type is created and stored in the variable gndrtype. Gender type 0 is female while gender type 1 is male. The variable is then passed as an argument to the chisq.test() function.

 

H0 : Chest pain type is independent of gender.

H1 : Chest pain type is not independent of gender.

α = 0.05

 

chisquare.png

 

     Since the p-value (0.07779) is greater than the significance level (0.05), the null hypothesis is not rejected. There is not enough evidence to suggest that chest pain type has any relationship with the gender of the patient.
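The report runs this test with chisq.test() on the gndrtype table in RStudio. The plain-Python sketch below only illustrates how the chi-square statistic is built from a contingency table: each expected count is (row total * column total) / grand total, and the statistic sums (observed - expected)^2 / expected over all cells. The function name and the counts are hypothetical and are not the report's table.

```python
def chi_square_independence(table):
    """Return (chi-square statistic, degrees of freedom) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of the row and column variables
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# Hypothetical counts, NOT the report's table:
# rows are two gender groups, columns are three illustrative categories
toy_table = [
    [10, 20, 30],
    [20, 20, 20],
]

chi2, df = chi_square_independence(toy_table)
```

This sketch stops at the statistic and its degrees of freedom. The p-value compared with α in the report comes from the chi-square distribution with that many degrees of freedom.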

Discussion and Conclusion

     The dataset contained data on 303 different patients. Based on the two-sample hypothesis test, there is evidence that patients aged above 55 have a different average blood pressure from patients aged 55 and below. In addition, the t value of the test is larger than the positive critical t value, which suggests that the average blood pressure of the former group is greater than that of the latter.

 

     For the correlation between serum cholesterol and maximum heart rate, the scatter plot shows no apparent correlation between the two variables, and the significance test found insufficient evidence of a linear correlation. In this dataset, the serum cholesterol level does not appear to be linearly related to the maximum heart rate of the patients.

 

     From the linear regression model, there is not enough evidence of a linear relationship between the resting blood pressure and maximum heart rate of patients suffering from heart disease. The slope of the regression line is close to zero, and the test on the slope does not reject the null hypothesis of no linear relationship.

 

     Lastly, the chi-square test of independence between chest pain type and gender at a significance level of 0.05 does not reject the null hypothesis. There is not enough evidence that the type of chest pain suffered by a heart disease patient depends on the patient's gender.

Reflection

     From doing this project, I have learnt that this subject has applications in the real world. This project has taught me that it is important to analyze data before drawing a conclusion about a hypothesis. It has also helped me practise doing calculations carefully to get precise values for the analysis. The calculation is very important because it is the determining factor in rejecting or not rejecting a hypothesis. I also learnt how important it is to keep data so that it can be used for further research and can help the advancement of science and technology.

 

(Presentation video for this project has been uploaded to e-learning)