Statistical Inference: Correlation & Simple Linear Regression - PowerPoint Presentation

delilah . @delilah

Uploaded On 2023-07-28



Presentation Transcript

1. Statistical Inference
Correlation & Simple Linear Regression
April 17, 2018

2. Correlation analysis*
Measuring the degree of association between two continuous variables, x and y.
We have a linear relationship between x and y if a straight line drawn through the midst of the points provides the most appropriate approximation to the observed relationship.
We measure how close the observations are to the straight line that best describes their linear relationship by calculating the Pearson product moment correlation coefficient, usually simply called the correlation coefficient.
*The following slides were adapted from Prof. Trinquart’s R Course (BS730).

3. Example

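4. Pearson product moment correlation coefficient
$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \, \sum_i (y_i - \bar{y})^2}}$$
r is the sample estimate of the population correlation coefficient, ρ.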
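The definition of r can be checked against R's built-in `cor()`. The sketch below uses small hypothetical `age` and `sbp` vectors invented for illustration; they are not the data set shown on the slides.

```r
# Hypothetical illustrative data (NOT the data set from the slides)
age <- c(22, 24, 25, 27, 29, 31, 33, 35, 36, 38, 40)
sbp <- c(110, 116, 112, 121, 115, 124, 118, 130, 122, 127, 133)

# Deviations of each observation from its sample mean
dx <- age - mean(age)
dy <- sbp - mean(sbp)

# Sum of cross-products divided by the square root of the
# product of the two sums of squared deviations
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

# Built-in Pearson correlation for comparison
r_builtin <- cor(age, sbp)

print(c(manual = r_manual, builtin = r_builtin))
```

The two values should agree up to floating-point error.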

5. Pearson product moment correlation coefficient
r lies between -1 and +1.
It indicates the direction and strength of the linear relationship.
r > 0: one variable increases as the other variable increases.
r < 0: one variable decreases as the other increases.
r = 0: no linear correlation (although there may be a non-linear relationship).
r = +1 or -1: perfect linear correlation, with all the points lying on the line.
The closer r is to the extremes, the greater the degree of linear association.

6. Rule of thumb
|r| ≥ 0.9: very highly correlated variables (r² ≥ 81%)
0.7 ≤ |r| < 0.9: highly correlated (49% ≤ r² < 81%)
0.5 ≤ |r| < 0.7: moderately correlated (25% ≤ r² < 49%)
0.3 ≤ |r| < 0.5: low correlation (9% ≤ r² < 25%)
|r| < 0.3: little if any (linear) correlation (r² < 9%)

7. Pearson product moment correlation coefficient
r is valid only within the range of values of x and y in the sample.
x and y can be interchanged without affecting the value of r.
(Linear) correlation does not imply a cause/effect relationship.
r² is the proportion of variability in y that can be explained by its linear relationship with x (the coefficient of determination).

8. When not to calculate Pearson’s r
When there is a non-linear relationship (e.g. U-shaped and J-shaped relationships).
When there are outliers.
When there are subgroups for which the mean of at least one of the variables differs.
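Two of these cautions can be illustrated with tiny made-up vectors. All values below are hypothetical and chosen only to make the effect obvious: a perfectly U-shaped relationship that Pearson's r misses, and a single outlier that manufactures a strong correlation.

```r
# U-shaped relationship: y is completely determined by x,
# yet there is no linear trend for r to detect
x_u <- -3:3
y_u <- x_u^2
r_u <- cor(x_u, y_u)

# Five points with no linear association ...
x_base <- c(1, 2, 3, 4, 5)
y_base <- c(2, 1, 3, 1, 2)
r_base <- cor(x_base, y_base)

# ... plus one extreme point far from the rest
x_out <- c(x_base, 20)
y_out <- c(y_base, 20)
r_out <- cor(x_out, y_out)

print(c(u_shaped = r_u, no_outlier = r_base, with_outlier = r_out))
```

The first two correlations are essentially zero, while adding the single extreme point pushes r close to 1, which is why the scatter plot should be inspected first.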

9. Pearson Correlation Assumptions
Observations are independent.
The association is linear.
Variables are approximately normally distributed.

10. Statistical significance ≠ Clinical relevance
The significance of a given correlation coefficient is a function of sample size; i.e., a low correlation can be significant if the sample size is large enough.
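To make the dependence on sample size concrete, the sketch below computes the two-sided p-value for the same weak correlation at two sample sizes. The helper `p_from_r` and the values r = 0.1, n = 20 and n = 2000 are illustrative assumptions, not taken from the slides; the formula is the usual t test for ρ = 0.

```r
# Two-sided p-value for H0: rho = 0, given a sample correlation r
# and sample size n (hypothetical helper for illustration)
p_from_r <- function(r, n) {
  t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
  2 * pt(-abs(t_stat), df = n - 2)
}

# Same weak correlation, two very different sample sizes
p_small <- p_from_r(0.1, 20)
p_large <- p_from_r(0.1, 2000)

print(c(n20 = p_small, n2000 = p_large))
```

With n = 20 the correlation is far from significant at α = 0.05; with n = 2000 it is significant, even though r² is only 1% in both cases.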

11. Hypothesis Testing
H0: There is no correlation between the two variables (ρ = 0)
H1: There is a correlation between the two variables (ρ ≠ 0)

12. Pearson correlation test statistic
The Student’s t distribution is used:
$$t = \frac{r}{se(r)}$$
where se(r) is defined as
$$se(r) = \sqrt{\frac{1 - r^2}{n - 2}}$$
Note that the standard error is inversely related to n: a larger sample size corresponds to a smaller se(r).
If the population correlation is zero, and assuming x and y follow normal distributions, then the test statistic has a Student’s t distribution with n-2 degrees of freedom.
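As a check on the test statistic, the following sketch computes t, its degrees of freedom and the two-sided p-value by hand and compares them with `cor.test()`. The `age` and `sbp` vectors are hypothetical and are not the data from the slides.

```r
# Hypothetical illustrative data (NOT the data set from the slides)
age <- c(22, 24, 25, 27, 29, 31, 33, 35, 36, 38, 40)
sbp <- c(110, 116, 112, 121, 115, 124, 118, 130, 122, 127, 133)

n <- length(age)
r <- cor(age, sbp)

# Standard error of r and the resulting t statistic
se_r   <- sqrt((1 - r^2) / (n - 2))
t_stat <- r / se_r

# Two-sided p-value from the t distribution with n - 2 df
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

# Built-in test for comparison
ct <- cor.test(age, sbp)
print(c(t_manual = t_stat, t_builtin = unname(ct$statistic)))
print(c(p_manual = p_value, p_builtin = ct$p.value))
```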

13. Confidence Interval for ρ
Since r is an estimate of a parameter, we can calculate a confidence interval for the population correlation coefficient, ρ.
The interval is based on the transformation z = 0.5 ln[(1+r)/(1-r)].
Because the transformation is a non-linear function of r, the confidence interval is not symmetric around r.
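A minimal sketch of this interval follows. The slide gives only the transformation; the standard error 1/sqrt(n - 3) and the back-transformation with `tanh()` are the usual Fisher z steps and are assumed here. The inputs r = 0.65 and n = 11 are the rounded values from the age and SBP example reported later in the deck.

```r
# Rounded values from the age/SBP example (n - 2 = 9 df, so n = 11)
r <- 0.65
n <- 11

# Fisher z transformation, identical to atanh(r)
z <- 0.5 * log((1 + r) / (1 - r))

# Assumed standard error of z on the transformed scale
se_z <- 1 / sqrt(n - 3)

# 95% limits on the z scale, then back-transformed to the r scale
z_limits <- z + c(-1, 1) * qnorm(0.975) * se_z
ci <- tanh(z_limits)

print(round(ci, 2))
```

The result should be close to the interval quoted in the reporting example; because r was rounded to two decimals before the calculation, the limits can differ slightly in the second decimal. The lower limit lies much further from r than the upper limit, which is the asymmetry the slide describes.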

14. In R
Begin with a scatter plot!
    plot(x, y, xlab = "x-label", ylab = "y-label")
    abline(lm(y ~ x))
    cor(x, y, method = "pearson")
    cor.test(x, y, method = "pearson")
The default method is "pearson", so you may omit the method option.

15. Example

16. Example: Reporting Results
H0: There is no linear association between age and SBP (ρ = 0)
H1: There is a linear association between age and SBP (ρ ≠ 0)
Pearson’s correlation was used to determine if there was a linear association between age and systolic blood pressure (SBP).
There is significant evidence at α = 0.05 of a moderate, positive, linear association between age and SBP, r = 0.65 with a 95% C.I. of 0.09 to 0.90, p-value = 0.029.

17. Simple Linear Regression

18. Background
Even as more and more sophisticated statistical procedures are developed, linear regression remains a fundamental element of the quantitative health sciences.
The concepts of regression and correlation were invented by Francis Galton (b. 1822; d. 1911), who studied the relationship between parental and child height.
A contemporary paper in the European Journal of Human Genetics showed that modern methods improved little on the original predictive model.
Source: Predicting Height: The Victorian Approach Beats Modern Genomics http://www.wired.com/2009/03/predicting-height-the-victorian-approach-beats-modern-genomics/

19. Modern Example from Simply Statistics
“Data supports claim that if Kobe stops ball hogging the Lakers will win more”
“Linear regression suggests that an increase of 1% in % of shots taken by Kobe results in a drop of 1.16 points (+/- 0.22) in score differential.”
Source: http://simplystatistics.org/2013/01/28/data-supports-claim-that-if-kobe-stops-ball-hogging-the-lakers-will-win-more/

20. Questions we might ask with linear regression
To use the parents' heights to predict child heights.
To try to find a parsimonious, easily described mean relationship between parent and children's heights.
To investigate the variation in child heights that appears unrelated to parents' heights (residual variation).
To quantify what impact genotype information has beyond parental height in explaining child height.
To figure out how/whether and what assumptions are needed to generalize findings beyond the data in question.
Why do children of very tall parents tend to be tall, but a little shorter than their parents, and why do children of very short parents tend to be short, but a little taller than their parents? (This is a famous question called 'Regression to the mean'.)

21. Linear Regression
If we believe y is dependent on x, with a change in y being attributed to a change in x, rather than the other way round, we can determine the linear regression line (the regression of y on x) that best describes the straight line relationship between the two variables.

22. Model
Find the equation of the line that best fits the data.
The generic equation of the line relating y to x is of the form:
$$y = \beta_0 + \beta_1 x + \varepsilon$$
In the sample, the equation of the line of “best fit” is written as:
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$
NOTE: The actual data will NEVER fall exactly on this "best fit" line. That’s why the model contains an error term. The error term contains the influences of other factors not captured by the model.

23. For a given value of x, $\hat{y}$ is the value of y which lies on the estimated line. It is the value we expect for y (i.e. its average) if we know the value of x, and is called the fitted value of y.
$\hat{\beta}_0$: intercept; it is the value of y when x = 0.
$\hat{\beta}_1$: slope; it represents the amount by which y increases on average if we increase x by one unit.

24. [Figure with labels: Residual, Observed, Predicted]

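25. Method of least squares
The "best fit" line is the line that minimizes the sum of the squared residuals.
Want to minimize:
$$\sum_i (y_i - \hat{y}_i)^2 = \sum_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$$
For linear regressions with only one independent variable, X, this yields the following equations:
$$\hat{\beta}_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$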
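The closed-form least squares estimates for a single predictor can be computed directly and compared with `lm()`. The sketch below uses hypothetical `age` and `sbp` vectors invented for illustration, not the data from the slides.

```r
# Hypothetical illustrative data (NOT the data set from the slides)
age <- c(22, 24, 25, 27, 29, 31, 33, 35, 36, 38, 40)
sbp <- c(110, 116, 112, 121, 115, 124, 118, 130, 122, 127, 133)

# Slope: sum of cross-products over the sum of squares of x
b1 <- sum((age - mean(age)) * (sbp - mean(sbp))) / sum((age - mean(age))^2)

# Intercept: forces the line through the point (mean(x), mean(y))
b0 <- mean(sbp) - b1 * mean(age)

# Built-in least squares fit for comparison
fit <- lm(sbp ~ age)

print(c(intercept = b0, slope = b1))
print(coef(fit))
```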

26. Example: Model for SBP and Age
For the previous example using SBP and age:
$$\text{SBP} = \beta_0 + \beta_1 \cdot \text{age} + \varepsilon$$

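27. Example: best fit and predicted value
The "best fit" line for these data is: (equation not captured in the transcript)
Predicted SBP for a person that is 24 years old: (value not captured in the transcript)
There was an observed SBP value of 116 at age = 24. The residual at age = 24 is the observed value minus the predicted value (value not captured in the transcript).
The range of age in the given data was 22 to 40. Therefore, we should not use the above model to predict the SBP for a 65-year-old person.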
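The steps on this slide (predict, take the residual, avoid extrapolation) can be sketched in R. The data frame `ds` below is hypothetical, so the printed numbers will not match the slide's example, and the helper `in_range` is an illustrative name rather than anything from the course material.

```r
# Hypothetical illustrative data (NOT the data set from the slides)
ds <- data.frame(
  age = c(22, 24, 25, 27, 29, 31, 33, 35, 36, 38, 40),
  sbp = c(110, 116, 112, 121, 115, 124, 118, 130, 122, 127, 133)
)
fit <- lm(sbp ~ age, data = ds)

# Predicted (fitted) SBP at age 24
pred_24 <- predict(fit, newdata = data.frame(age = 24))

# Residual = observed - predicted
obs_24   <- ds$sbp[ds$age == 24]
resid_24 <- obs_24 - pred_24

# Simple guard against extrapolating beyond the observed ages
in_range <- function(a) a >= min(ds$age) & a <= max(ds$age)

print(c(predicted = unname(pred_24), residual = unname(resid_24)))
print(c(age24 = in_range(24), age65 = in_range(65)))
```

Note that `predict()` will return a value for age 65 without complaint; checking the range of the predictor is left to the analyst.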

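28. Correlation and coefficient of simple linear regression
In a simple linear regression model, the sample correlation coefficient, r, and the estimated slope are related:
$$\hat{\beta}_1 = r \, \frac{s_y}{s_x}$$
where $s_x$ and $s_y$ are the sample standard deviations of x and y.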
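The standard relationship between the slope and r (slope equals r times the ratio of the sample standard deviations of y and x) can be verified numerically. The vectors below are hypothetical and are not the slides' data.

```r
# Hypothetical illustrative data (NOT the data set from the slides)
age <- c(22, 24, 25, 27, 29, 31, 33, 35, 36, 38, 40)
sbp <- c(110, 116, 112, 121, 115, 124, 118, 130, 122, 127, 133)

# Slope reconstructed from the correlation and the two sample SDs
slope_from_r <- cor(age, sbp) * sd(sbp) / sd(age)

# Slope estimated by least squares
slope_lm <- unname(coef(lm(sbp ~ age))["age"])

print(c(from_r = slope_from_r, from_lm = slope_lm))
```

One consequence is that the slope and r always share the same sign.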

29. In R
lm() function
The first argument of the lm() function is a formula object, with the outcome specified, followed by the ~ operator, then the predictors.
    mod1 <- lm(y ~ x, data = ds)
    summary(mod1)
    summary.aov(mod1)

30. Example: SBP and age

31. Example: ANOVA table
Sum of Squares

32. Example: Sum of Squares
Model SS = SS of the differences between y predicted by the model and the overall average.
Error SS = SS of the differences between observed y and y predicted by the model.
Total SS = SS of the differences between observed y and the overall average.
The better the model fits, the larger the model SS is and the smaller the error SS.
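The three sums of squares can be computed from their definitions and compared with the ANOVA table that R produces. This is a sketch on hypothetical `age` and `sbp` vectors, so the values will not match the 158.93 and 213.26 of the slides' example.

```r
# Hypothetical illustrative data (NOT the data set from the slides)
age <- c(22, 24, 25, 27, 29, 31, 33, 35, 36, 38, 40)
sbp <- c(110, 116, 112, 121, 115, 124, 118, 130, 122, 127, 133)

fit   <- lm(sbp ~ age)
y_hat <- fitted(fit)   # values predicted by the model
y_bar <- mean(sbp)     # overall average

model_ss <- sum((y_hat - y_bar)^2)  # predicted vs overall average
error_ss <- sum((sbp - y_hat)^2)    # observed vs predicted
total_ss <- sum((sbp - y_bar)^2)    # observed vs overall average

# ANOVA table from R: rows are the predictor and the residuals
aov_tab <- anova(fit)

print(c(model = model_ss, error = error_ss, total = total_ss))
print(aov_tab)
```

Model SS and Error SS add up to Total SS, which is what allows R² to be read as a proportion of the total variability.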

33. F test
F value: test statistic for the overall model.
In a situation with only a single X variable, the F-test is equivalent to the t test for the null hypothesis that β1 = 0.

34. p-value of the goodness of the fit
Pr(>F): p-value for the above F test.

35. R²
R² is the proportion of variability in y explained by the independent variable(s).
R² = (Model SS / Total SS) = 158.93/(158.93+213.26) = 0.427
R² takes values between 0 and 1 and measures the goodness of fit of the model. Values close to 1 are indicative of a good fit.

36. R²
When there is only one independent variable, x, R² is the proportion of the variability in y that can be explained by x.
R² is equal to the square of the Pearson correlation between x and y.
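The equality between R² and the squared Pearson correlation is easy to confirm in R for a one-predictor model. The vectors below are hypothetical, not the slides' data.

```r
# Hypothetical illustrative data (NOT the data set from the slides)
age <- c(22, 24, 25, 27, 29, 31, 33, 35, 36, 38, 40)
sbp <- c(110, 116, 112, 121, 115, 124, 118, 130, 122, 127, 133)

fit <- lm(sbp ~ age)

# R-squared reported by the regression summary
r_squared_model <- summary(fit)$r.squared

# Square of the Pearson correlation between x and y
r_squared_cor <- cor(age, sbp)^2

print(c(model = r_squared_model, cor_squared = r_squared_cor))
```

This matches the slides' own example up to rounding: R² = 0.427 there, and the reported r = 0.65 squares to about 0.42.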

37. Parameter estimates
Estimated coefficients: $\hat{\beta}_0$ and $\hat{\beta}_1$.
Of particular interest is $\hat{\beta}_1$, the slope of the linear regression.
Interpretation: On average, every one-unit change in X corresponds to a $\hat{\beta}_1$-unit change in Y.

38. Confidence interval
    model1 <- lm(sbp ~ age, data = ds)
    confint(model1)

39. Example: Confidence Intervals for β0 and β1
95% Confidence
In this example there are n-2 = 9 degrees of freedom and, for α = 0.05, the critical t is 2.262.
The 95% confidence interval for β1 is
0.73 ± (2.262)(0.28)
0.73 ± 0.63
[0.09, 1.37]
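The same "estimate ± critical t × standard error" calculation can be reproduced in R and compared with `confint()`. The sketch uses hypothetical data with n = 11, so that the degrees of freedom and critical t match the slide; the estimate and standard error will differ from the slide's 0.73 and 0.28.

```r
# Hypothetical illustrative data (NOT the data set from the slides)
age <- c(22, 24, 25, 27, 29, 31, 33, 35, 36, 38, 40)
sbp <- c(110, 116, 112, 121, 115, 124, 118, 130, 122, 127, 133)

fit   <- lm(sbp ~ age)
coefs <- summary(fit)$coefficients

b1     <- coefs["age", "Estimate"]     # estimated slope
se_b1  <- coefs["age", "Std. Error"]   # its standard error
df     <- df.residual(fit)             # n - 2 for one predictor
t_crit <- qt(0.975, df)                # two-sided 95% critical value

# Estimate plus or minus critical t times standard error
ci_manual  <- b1 + c(-1, 1) * t_crit * se_b1
ci_builtin <- confint(fit)["age", ]

print(c(df = df, t_crit = round(t_crit, 3)))
print(rbind(manual = ci_manual, builtin = unname(ci_builtin)))
```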

40. Example: Testing of parameters
Test for H0: β1 = 0.
Prob > |t| are the p-values for the t-tests.

41. Example: testing the parameter
Another way to view the test of β1 = 0 is that it tells us whether the model provides a significant improvement over the information about y provided only by the sample mean.
Notice that, in the case of one predictor, F is the square of t.
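As a rough check with the numbers quoted elsewhere in these slides (Model SS = 158.93, and 213.26 taken to be the Error SS on 9 degrees of freedom, with t = 2.59), the relationship works out as:

$$F = \frac{\text{Model SS}/1}{\text{Error SS}/(n-2)} = \frac{158.93}{213.26/9} \approx 6.71, \qquad t^2 = 2.59^2 \approx 6.71$$

The two values agree to the precision of the rounded inputs.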

42. Example: Reporting Results
The null hypothesis is H0: β1 = 0, or "the slope of the model is zero", or "there is no linear association between X and Y".
Be sure to interpret the estimated regression coefficient of interest ($\hat{\beta}_1$).

43. Example: Reporting Results – Technical Summary
H0: There is no linear association between SBP and age among women ages 22-40.
H1: There is a linear association between SBP and age among women ages 22-40.
OR
H0: βage = 0 vs H1: βage ≠ 0

44. Example: Reporting Results – Technical Summary
Conclusion of hypothesis test: Reject H0. There is significant evidence at α = 0.05 of a linear association between SBP and age among women ages 22-40 (t = 2.59, df = 9).

45. Sample Write-up
Methods
To examine the association between SBP (mmHg) and age (years), we performed a linear regression analysis. We used a 0.05 level of significance.
Results
In the linear regression analysis of the association of SBP and age, we found that there was a significant, positive linear association, with an estimated slope of 0.73 (t = 2.59, df = 9; p = 0.0292). On average, a one-year increase in age corresponds to an increase of 0.73 mmHg in SBP. The R² for this model was 0.43 (age accounts for 43% of the variability of SBP in these data).

46. Express results for a continuous independent variable
On average, a one-year increase in age corresponds to a 0.73 mmHg increase in SBP.
It’s better to report the effect using:
A clinically meaningful increment Δ (e.g. a 10-year increase in age), if known.
An increase by one standard deviation (SD).
On average, a Δ-year increase in age corresponds to a (Δ × 0.73) mmHg increase in SBP.
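For instance, taking the 10-year increment mentioned above and the estimated slope of 0.73 mmHg per year, the rescaled effect would be:

$$\Delta \times \hat{\beta}_1 = 10 \times 0.73 = 7.3 \text{ mmHg}$$

That is, on average, a 10-year increase in age corresponds to a 7.3 mmHg increase in SBP. The same multiplication applies to the confidence limits of the slope.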

47. Deliverables
Due 4/22: Complete draft
Due 4/29: Final Research Paper
Due 5/8: Final presentation