Assignment
Assignment 1.pdf
- Sunday, 05 April 2020 [1.5MB]
Assignment 2.pdf
- Sunday, 05 April 2020 [736.7KB]
Assignment 3.pdf
- Wednesday, 29 April 2020 [4.5MB]
Assignment 4.pdf
- Sunday, 24 May 2020 [2.9MB]
Assignment 5.pdf
- Friday, 26 June 2020 [1.2MB]
Assignment 6.pdf
- Friday, 26 June 2020 [1MB]
Project 1
PSDA Project 1.zip
- Wednesday, 29 April 2020 [6.2MB]
Reflection for Project 1
Probability and Statistical Data Analysis (PSDA) is a course about statistics and probability. There are two types of statistics: descriptive and inferential. In our first project, we used descriptive statistics to present, organize and summarize our data. Data can be divided into two types: primary data and secondary data. The data we obtained for Project 1 (A Study on Shopping Preference of UTM Students (Online or Offline)) is primary data, as we collected all of it ourselves.
To collect the data from respondents, we used Google Forms. By setting up and organizing the questions in Google Forms, I learned how to relate the types of questions to the types of data required by our task, for instance nominal, ordinal, interval and ratio. After finalizing the questions, the next thing we needed to consider was the type of graph used to represent our data, such as a pie chart, bar chart or histogram. This process took a lot of time because we had to think about whether each graph was suitable for presenting our data. We greatly appreciate our lecturer, Dr. Chan Weng Howe, who spent his time explaining and giving us his opinion on our project.
Through this project, I learned how to analyse data using RStudio. RStudio is a very useful application, especially for a Data Engineering student, for summarizing the data and running code to produce plots. The problem I faced was that the R tutorial alone was not enough for me, so I needed to explore other resources, especially on YouTube. Exploring more on YouTube helped me to write and run the code more quickly and smoothly when doing my project.
Furthermore, because of the Movement Control Order (MCO), our presentation could not be done in class. We could only do it online or as a recorded video. What I found out is that we can record a video presentation in PowerPoint! When I heard about this from my teammate, it surprised me, as I have used PowerPoint for so many years, from secondary school until university, without noticing this feature. Thus, we chose this method for our video presentation.
In conclusion, through Project 1, I learned a lot that I did not know before. I will practise these skills more, especially RStudio, because statistics and data analysis are at the core of what I do as a Data Engineering student. I hope that this subject, PSDA, will help me to improve myself and become a successful Data Engineer in the future.
Project 2
KhorYongXin_PSDA_Project2.zip
- Friday, 26 June 2020 [1.3MB]
Reflection for Project 2
The title of my Project 2 is A Study on Death in Malaysia 2018. The secondary data used was collected from the Department of Statistics Malaysia and is entitled “Statistic on causes of death” for the year 2018. The sample is drawn from the number of deaths in 8 states in Malaysia, while the population is the citizens of Malaysia. This topic is interesting because it allows a few tests to be carried out on certain claims. For instance: whether the mean number of deaths differs between males and females; whether the classification of death (classified by cause of death) and the states are independent; whether all types of death are equally probable or at least one probability differs from the others; whether age and the number of deaths have a positive linear relationship; and whether the relationship between age and the number of deaths can be described by a linear regression.
In this project, I learned many methods for testing a claim. They are very useful when I am dealing with a very large population: I can simply perform the tests on the sample data collected to estimate the population parameters. The estimates are more accurate when the sample size is greater than 30, because the standard error decreases as the sample size grows.
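The effect of sample size on the standard error can be illustrated with a minimal sketch in Python (this is not the RStudio code from the project). It uses the usual formula for the standard error of a sample mean, s / sqrt(n); the standard deviation of 12.0 is a made-up value chosen only for illustration.

```python
import math


def standard_error(sample_sd: float, n: int) -> float:
    """Standard error of the sample mean: s / sqrt(n)."""
    if n <= 0:
        raise ValueError("sample size n must be positive")
    return sample_sd / math.sqrt(n)


# Hypothetical standard deviation, not taken from the project data.
sd = 12.0
for n in (8, 30, 100, 1000):
    print(f"n = {n:>4}: standard error = {standard_error(sd, n):.3f}")
```

With the spread held fixed, each larger sample size gives a smaller standard error, which is why estimates based on larger samples are more precise.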
From the results, some claims are rejected, or there is insufficient evidence to support them. Whether a claim is supported or not is decided by comparing the test statistic with the critical value. When the calculated test statistic lies in the critical region, the null hypothesis is rejected; when it does not lie in the critical region, we fail to reject the null hypothesis. Some of the tests need the degrees of freedom in order to look up the chi-square value or the t-value. The degrees of freedom play a vital role, because a wrong value changes the critical value and makes the results inaccurate. Thus, in my project, when calculating the degrees of freedom, I needed to be very sure that I used the correct parameters in the formula.
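The decision rule and the degrees-of-freedom formulas for the two chi-square tests can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the 8 x 3 table shape and the test statistic of 30.2 are made up, and the critical value 23.685 is the standard chi-square table value for 14 degrees of freedom at a 0.05 significance level.

```python
def df_independence(n_rows: int, n_cols: int) -> int:
    """Degrees of freedom for a chi-square test of independence."""
    return (n_rows - 1) * (n_cols - 1)


def df_goodness_of_fit(n_categories: int) -> int:
    """Degrees of freedom for a chi-square goodness-of-fit test."""
    return n_categories - 1


def in_critical_region(test_statistic: float, critical_value: float,
                       tail: str = "right") -> bool:
    """Return True when the test statistic falls in the critical region."""
    if tail == "right":
        return test_statistic > critical_value
    if tail == "left":
        return test_statistic < critical_value
    if tail == "two":
        return abs(test_statistic) > abs(critical_value)
    raise ValueError("tail must be 'right', 'left' or 'two'")


# Hypothetical example: 8 states (rows) by 3 classifications (columns).
df = df_independence(8, 3)     # (8 - 1) * (3 - 1) = 14
critical = 23.685              # chi-square table value, df = 14, alpha = 0.05
statistic = 30.2               # made-up test statistic for illustration

if in_critical_region(statistic, critical, tail="right"):
    print(f"df = {df}: reject the null hypothesis")
else:
    print(f"df = {df}: fail to reject the null hypothesis")
```

Using the wrong table shape here would change the degrees of freedom, and with them the critical value against which the test statistic is compared.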
Furthermore, as you can see, the sample sizes used in the tests differ, because I approached and carried out each test differently. For instance, in the hypothesis testing, the sample size is 8 because the data was collected from the 8 states mentioned in the report. However, for the rest of the tests, the sample size is 76704, which is the sample size for Malaysia.
For the correlation and regression analysis, I found that when my conclusion is a positive relationship between the two variables, I can provide more evidence for it, or validate it, by calculating the correlation coefficient as well as the coefficient of determination. Both of them support the conclusion and hence the results can be trusted. Below are the graphs for correlation and regression respectively.
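Alongside the graphs, the two supporting measures can be computed as in the following minimal Python sketch (not the RStudio code from the project). The ages and death counts are made-up numbers for illustration only, and the coefficient of determination is taken as the square of the Pearson correlation coefficient, which holds for simple linear regression with one predictor.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("both variables must vary")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: age group midpoints and number of deaths.
ages = [10, 20, 30, 40, 50, 60, 70]
deaths = [120, 180, 260, 410, 650, 900, 1300]

r = pearson_r(ages, deaths)
r_squared = r ** 2  # coefficient of determination for simple linear regression

print(f"correlation coefficient r = {r:.3f}")
print(f"coefficient of determination r^2 = {r_squared:.3f}")
```

A value of r close to 1 indicates a strong positive linear relationship, and r^2 gives the share of the variation in the number of deaths that the fitted line accounts for.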
When working with RStudio, I mainly referred to the tutorial slides given by our lecturer, Dr. Chan Weng Howe. The tutorial slides are very useful, as they helped me and saved me time when doing the coding. However, for the regression part, the tutorial slides do not provide enough information about the names of the parameters shown in the R console. The parameters shown in the R console were complicated for me, and the slides do not explain what they represent. So, I worked them out by referring to tutorials on YouTube. Lastly, for my presentation, I used PowerPoint to record the slides and added some animation to make my video more interesting and to guide the viewer to what I was presenting. After finishing my presentation, I exported it to mp4 so that our lecturer could view it easily.
In conclusion, from the results of the tests on these claims, we can conclude that the mean number of deaths is the same for males and females, that the states in Malaysia and the classification of death are dependent, that the types of death do not all occur in the same proportion, and that age and the number of deaths have a strong positive relationship.