
REGRESSION analysis SECTION 1 - PowerPoint Presentation




Presentation Transcript

1. REGRESSION analysis

2. SECTION 1: INTRODUCTION

3. HISTORY
1822-1911: Sir Francis Galton, who coined the term "regression"
1805: Legendre, method of least squares
1809: Gauss, method of least squares

4. HISTORY
1857-1936: Karl Pearson, in whose work the joint distribution was assumed to be Gaussian
1871-1951: George Udny Yule, in whose work the joint distribution was assumed to be Gaussian
1890-1962: Sir Ronald Fisher, who weakened the assumption of Yule and Pearson

5. HISTORY
1805-1809: The earliest form of regression was the method of least squares, published by Legendre in 1805 and by Gauss in 1809. They both applied the method to the problem of determining, from astronomical observations, the orbits of bodies about the Sun (mostly comets, but also later the then newly discovered minor planets).
1821: Gauss published a further development of the theory of least squares, including a version of the Gauss-Markov theorem.

6. HISTORY
1890: The term "regression" was coined by Francis Galton to describe a biological phenomenon: the heights of descendants of tall ancestors tend to regress down towards a normal average (a phenomenon also known as regression toward the mean). For Galton, regression had only this biological meaning.
1897-1903: Galton's work was later extended by Udny Yule and Karl Pearson to a more general statistical context. In the work of Yule and Pearson, the joint distribution of the response and explanatory variables is assumed to be Gaussian.

7. HISTORY
1922-1925: This assumption was weakened by R.A. Fisher. He assumed that the conditional distribution of the response variable is Gaussian, but the joint distribution need not be. In this respect, Fisher's assumption is closer to Gauss's formulation of 1821.
1950s-1960s: Economists used electromechanical desk calculators to calculate regressions.

8. HISTORY
Before 1970: It sometimes took up to 24 hours to receive the result from one regression.
Regression methods continue to be an area of active research. In recent decades, new methods have been developed for robust regression; regression involving correlated responses such as time series and growth curves; regression in which the predictor (independent variable) or response variables are curves, images, graphs, or other complex data objects; regression methods accommodating various types of missing data; nonparametric regression; Bayesian methods for regression; regression in which the predictor variables are measured with error; regression with more predictor variables than observations; and causal inference with regression.

9. STATISTICAL MODELING
A statistical model describes a relationship between variables.

Deterministic models:
- Hypothesize exact relationships
- Suitable when prediction error is negligible
- Example: body mass index (BMI) is a measure of body fat based on height and weight. Metric formula: BMI = weight (kg) / height (m)^2. Non-metric formula: BMI = 703 x weight (lb) / height (in)^2.

Probabilistic models:
- Hypothesize 2 components: deterministic + random error
- Example: the systolic blood pressure of newborns is 6 times the age in days + random error
- The random error may be due to factors other than age in days (e.g. birthweight)
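To make the contrast concrete, here is a minimal sketch in Python (not from the slides; the sample values and the error spread of 2 are illustrative assumptions) of the two model types: a deterministic BMI formula with no error term, and a probabilistic model that adds random error to a deterministic component.

```python
import numpy as np

# Deterministic model: an exact relationship, no error term.
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight (kg) divided by height (m) squared."""
    return weight_kg / height_m ** 2

# Probabilistic model: deterministic component + random error.
# Systolic BP of a newborn = 6 * (age in days) + random error,
# where the error stands in for factors other than age (e.g. birthweight).
rng = np.random.default_rng(0)
age_days = np.arange(1, 11)                           # ages 1..10 days
bp = 6 * age_days + rng.normal(0, 2, age_days.size)  # sigma = 2 is an assumption

print(f"BMI for 70 kg, 1.75 m: {bmi(70, 1.75):.1f}")  # exact, reproducible
print("Simulated systolic BP:", np.round(bp, 1))      # varies with the error draw
```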

10. STATISTICAL MODELING
Types of probabilistic models:
- Regression models
- Correlation models
- Other models

11. TYPES OF REGRESSION MODELS
- Simple (1 explanatory variable): linear or non-linear
- Multiple (2+ explanatory variables): linear or non-linear

12. REGRESSION ANALYSIS
In analyzing data for the health sciences disciplines, we find that it is frequently desirable to learn something about the relationship between two numeric variables. We may, for example, be interested in studying the relationship between blood pressure and age, height and weight, the concentration of an injected drug and heart rate, the consumption level of some nutrient and weight gain, the intensity of a stimulus and reaction time, or total family income and medical care expenditures. The nature and strength of the relationships between variables such as these may be examined using linear models such as regression and correlation analysis, two statistical techniques that, although related, serve different purposes.

13. REGRESSION ANALYSIS
Regression analysis is helpful in assessing specific forms of the relationship between variables, and the ultimate objective when this method of analysis is employed usually is to predict or estimate the value of one variable corresponding to a given value of another variable.

14. STEPS OF REGRESSION MODELING
1. Define problem or question
2. Specify model
3. Collect data
4. Do descriptive data analysis
5. Estimate unknown parameters
6. Evaluate model
7. Use model for prediction

15. SIMPLE VS. MULTIPLE
Simple regression:
1. $\beta$ represents the unit change in Y per unit change in X
2. It does not take into account any other variable besides the single independent variable

Multiple regression:
1. $\beta_i$ represents the unit change in Y per unit change in $X_i$
2. It takes into account the effect of the other $\beta_i$'s
3. It is a net regression coefficient

16. ASSUMPTIONS
1. LINEARITY: the Y variable is linearly related to the value of the X variable
2. INDEPENDENCE OF ERROR: the error (residual) is independent for each value of X
3. HOMOSCEDASTICITY: the variation around the line of regression is constant for all values of X
4. NORMALITY: the values of Y are normally distributed at each value of X
5. CONTINUOUS VARIABLES: the two variables should be either interval or ratio variables
6. NO SIGNIFICANT OUTLIERS: outliers can have a negative effect on the regression analysis
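In practice these assumptions are usually checked on the residuals of a fitted model. Here is a minimal sketch in Python (simulated data; the checks shown are common heuristics, not the only ones, and the thresholds are illustrative assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 3 + 1.2 * x + rng.normal(0, 1, x.size)   # illustrative linear data

b1, b0 = np.polyfit(x, y, 1)                 # fitted slope and intercept
resid = y - (b0 + b1 * x)                    # residuals

# Normality of residuals (assumption 4): Shapiro-Wilk test.
w, p_norm = stats.shapiro(resid)
# Homoscedasticity (assumption 3): compare residual spread in low vs. high x.
sd_low, sd_high = resid[x < 5].std(), resid[x >= 5].std()
# Outliers (assumption 6): flag standardized residuals beyond +/- 3.
n_outliers = (np.abs(resid / resid.std()) > 3).sum()

print(f"Shapiro-Wilk p = {p_norm:.2f} (large p: no evidence against normality)")
print(f"residual SD, low x: {sd_low:.2f}, high x: {sd_high:.2f}")
print(f"potential outliers: {n_outliers}")
```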

17. GOAL
Develop a statistical model that can predict the values of a dependent (response) variable based upon the values of the independent (explanatory) variables.

18. SECTION 2: LINEAR REGRESSION ANALYSIS

19. TYPES OF CORRELATION
- Positive correlation
- Negative correlation
- No correlation

20. SIMPLE LINEAR REGRESSION
Describes the linear relationship between a predictor variable, plotted on the x-axis, and a response variable, plotted on the y-axis.
Independent variable: X. Dependent variable: Y.

21. LINEAR EQUATION
A straight line is the simplest model of the relationship between two interval-scaled attributes, and its slope gives us an indication of the existence of an association between them. Therefore, an objective way to investigate an association between interval attributes is to draw a straight line through the center of the cloud of points and measure its slope. If the slope is zero, the line is horizontal and we conclude that there is no association. If it is non-zero, then we can conclude that there is an association. So we have two problems to solve: how to draw the straight line that best models the relationship between the attributes, and how to determine whether its slope is different from zero.

22. LINEAR REGRESSION MODEL
The relationship between the variables is a linear function:
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$
where $Y_i$ is the dependent (response) variable, $X_i$ is the independent (explanatory) variable, $\beta_0$ is the population Y-intercept, $\beta_1$ is the population slope, and $\varepsilon_i$ is the random error.

23. LINEAR REGRESSION MODEL
The relationship between the variables is a linear function, but the true relationship $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ is unknown and must be estimated from data.

24. REGRESSION ANALYSIS: POPULATION VS. SAMPLE
Population linear regression model: $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$, where $\varepsilon_i$ is the random error for an observed or unsampled value.
Sample linear regression model: $Y_i = b_0 + b_1 X_i + e_i$, where $e_i$ is the random error (residual) for an observed value.

25. SECTION 3: THE ORDINARY LEAST SQUARES METHOD (OLS)

26. THE ORDINARY LEAST SQUARES METHOD (OLS)
How to fit data to a linear model?

27. THE ORDINARY LEAST SQUARES METHOD: OVERVIEW
'Best fit' means that the differences between the actual Y values and the predicted Y values are at a minimum. But positive differences offset negative ones, so we square the errors. Least squares minimizes the Sum of Squared Errors (SSE):
$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$

28. LEAST SQUARES REGRESSION, GRAPHICALLY
Model line: $\hat{y} = b_0 + b_1 x$
Residual: $\varepsilon_i = y_i - \hat{y}_i$
Sum of squares of residuals: $\sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$
We must find the values of $b_0$ and $b_1$ that minimise this sum.

29. THE REGRESSION COEFFICIENTS
$b_1 = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$
$b_0 = \bar{y} - b_1 \bar{x}$

30. COEFFICIENT EQUATIONS
Prediction equation: $\hat{y}_i = b_0 + b_1 x_i$
Sample slope: $b_1 = \dfrac{SS_{xy}}{SS_{xx}} = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$
Sample Y-intercept: $b_0 = \bar{y} - b_1 \bar{x}$
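These equations translate directly into code. Below is a minimal sketch in Python with NumPy (the data are simulated; the "true" intercept 4 and slope 2.5 are illustrative assumptions) that computes $b_1 = SS_{xy}/SS_{xx}$ and $b_0 = \bar{y} - b_1 \bar{x}$ by hand and checks the result against `np.polyfit`:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 30)
y = 4.0 + 2.5 * x + rng.normal(0, 1.5, x.size)  # assumed b0=4, b1=2.5

x_bar, y_bar = x.mean(), y.mean()
ss_xy = ((x - x_bar) * (y - y_bar)).sum()  # sum of cross-deviations
ss_xx = ((x - x_bar) ** 2).sum()           # sum of squared deviations of x

b1 = ss_xy / ss_xx         # sample slope
b0 = y_bar - b1 * x_bar    # sample Y-intercept

slope_np, intercept_np = np.polyfit(x, y, deg=1)  # reference fit
print(f"by hand:    b0={b0:.3f}, b1={b1:.3f}")
print(f"np.polyfit: b0={intercept_np:.3f}, b1={slope_np:.3f}")
```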

31. INTERPRETATION
1. The intercept $b_0$ is the average value of Y when X is 0. If $b_0 = 4$, then the average Y is expected to be 4 when X is 0.
2. The slope $b_1$ is the expected change in the average of Y per unit change in X.

32. REQUIRED STATISTICS

33. DESCRIPTIVE STATISTICS

34. REGRESSION STATISTICS
The Sum of Squares Regression (SSR) is the sum of the squared differences between the prediction for each observation and the mean of the dependent variable (a measure of explained variation):
$SSR = \sum (\hat{y}_i - \bar{y})^2$
The Sum of Squared Errors (SSE) is a measure of unexplained variation:
$SSE = \sum (y_i - \hat{y}_i)^2$
The Total Sum of Squares (SST) is a measure of the total variation in y, and is equal to SSR + SSE:
$SST = \sum (y_i - \bar{y})^2 = SSR + SSE$

35. [Figure] Y: the variance to be explained by the predictors (SST).

36. [Figure] Y and X1: the variance explained by X1 (SSR) and the variance NOT explained by X1 (SSE).

37. THE COEFFICIENT OF DETERMINATION
The proportion of the total variation (SST) that is explained by the regression (SSR) is known as the coefficient of determination, often referred to as $r^2$:
$r^2 = \dfrac{SSR}{SST} = 1 - \dfrac{SSE}{SST}$
The value of $r^2$ can range between 0 and 1, and the higher its value the more accurate the regression model is. It is often expressed as a percentage.
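A minimal sketch in Python (simulated data, with the same illustrative coefficients as the earlier sketch) that computes SSE, SSR, SST, and $r^2$, and verifies that SSR + SSE = SST:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 30)
y = 4.0 + 2.5 * x + rng.normal(0, 1.5, x.size)

b1, b0 = np.polyfit(x, y, 1)           # fitted slope and intercept
y_hat = b0 + b1 * x                    # predictions

sse = ((y - y_hat) ** 2).sum()         # unexplained variation
ssr = ((y_hat - y.mean()) ** 2).sum()  # explained variation
sst = ((y - y.mean()) ** 2).sum()      # total variation

r2 = ssr / sst
print(f"SSE={sse:.2f}  SSR={ssr:.2f}  SST={sst:.2f}")
print(f"r^2 = SSR/SST = {r2:.3f}; SSR+SSE = {ssr + sse:.2f} (should equal SST)")
```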

38. THE COEFFICIENT OF DETERMINATION
An important measure of association between variables. It is represented as $r^2$ because its value is the square of another frequently used measure of association, the correlation coefficient, which is represented by $r$. Although we can obtain $r^2$ from $r$, the two measures are not completely equivalent:
- $r^2$ has values between 0 and 1; $r$ ranges from -1 to +1.
- $r$, in addition to providing a measure of the strength of an association, also informs us of the type of association.
- In both cases, the greater the absolute value of the coefficient, the greater the strength of the association.
- Unlike the coefficient of determination, the correlation coefficient is an abstract value that has no direct and precise interpretation, somewhat like a score.

39. THE COEFFICIENT OF DETERMINATION
These two measures are related to the degree of dispersion of the observations about the regression line. In a scatterplot, when the two variables are independent, the points will be distributed over the entire area of the plot; the regression line is horizontal and the coefficient of determination is zero. When an association exists, the regression line is oblique and the points are more or less spread along the line. The higher the strength of the association, the less the dispersion of the points around the line and the greater will be $r^2$ and the absolute value of $r$. If all the points are on the line, $r^2$ has value 1 and $r$ has value +1 or -1.

40. THE COEFFICIENT OF DETERMINATION
The importance of these measures of association comes from the fact that it is very common to find evidence of association between two variables, and it is the strength of the association that tells us whether it has some important meaning. In clinical research, associations explaining less than 50% of the variance of the dependent variable, that is, associations with $r^2$ less than 0.50 or, equivalently, with $r$ between -0.70 and +0.70, are usually not regarded as important.

41. STANDARD ERROR OF REGRESSION
The standard error of a regression is a measure of its variability. It can be used in a similar manner to standard deviation, allowing for prediction intervals. Standard error for the regression model:
$s = \sqrt{\dfrac{SSE}{n-2}} = \sqrt{\dfrac{\sum (y_i - \hat{y}_i)^2}{n-2}}$
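A minimal sketch of this computation in Python (simulated data; the true error spread of 1.5 is an assumption, so the estimate should land near it):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 30)
y = 4.0 + 2.5 * x + rng.normal(0, 1.5, x.size)

b1, b0 = np.polyfit(x, y, 1)
residuals = y - (b0 + b1 * x)
n = x.size

sse = (residuals ** 2).sum()
s = np.sqrt(sse / (n - 2))   # n-2 degrees of freedom: two estimated parameters
print(f"standard error of the regression: s = {s:.3f} (true sigma was 1.5)")
```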

42. MEAN SQUARED ERROR
1. From the regression equation, compute the predicted values $y^*$ of the dependent variable.
2. Compute the variance of the residuals from $y$ and $y^*$.
3. Obtain the sum of squares of $x$ from the variance of $x$.

43. 4. The standard error of the regression coefficient is
$SE(b_1) = \dfrac{s}{\sqrt{\sum (x_i - \bar{x})^2}}$
This estimate of the true standard error of $b_1$ is unbiased on the condition that the dispersion of the points about the regression line is approximately the same along the length of the line. This will happen if the variance of Y is the same for every value of X, that is, if Y is homoscedastic. If this condition is not met, then the estimate of the standard error of $b_1$ may be larger or smaller than the true standard error, and there is no way of telling which.
In summary, we can estimate the standard error of the regression coefficient from our sample and construct confidence intervals, under the following assumptions:
- The dependent variable has a normal distribution for all values of the independent variable.
- The variance of the dependent variable is equal for all values of the independent variable.
- If the independent variable is interval, its distribution is normal.
- The relationship between the two variables is linear.
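Putting the pieces together, here is a minimal Python sketch (simulated data; SciPy's `stats.t` supplies the critical value) that computes the standard error of the slope and a 95% confidence interval:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 30)
y = 4.0 + 2.5 * x + rng.normal(0, 1.5, x.size)

b1, b0 = np.polyfit(x, y, 1)
n = x.size
residuals = y - (b0 + b1 * x)
s = np.sqrt((residuals ** 2).sum() / (n - 2))  # residual standard error
ss_xx = ((x - x.mean()) ** 2).sum()

se_b1 = s / np.sqrt(ss_xx)                     # standard error of the slope
t_crit = stats.t.ppf(0.975, df=n - 2)          # 95% two-sided critical value
lo, hi = b1 - t_crit * se_b1, b1 + t_crit * se_b1
print(f"b1 = {b1:.3f}, SE(b1) = {se_b1:.4f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```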

44. RESIDUAL MEAN SQUARE TEST IN LINEAR REGRESSION
An estimate of the variance of Y for fixed values of X can be obtained from the variance of the residuals, that is, the variance of the departures of each y from the value predicted by the regression.
We can test the null hypothesis that $\beta_1 = 0$ with a different test, based on analysis of variance. The figure compares a situation where the null hypothesis is true, on the left, with a situation where the null hypothesis is false, on the right. When the two variables are independent, $\beta_1 = 0$ and the slope of the sample regression line will be very nearly zero (not exactly zero because of sampling variation). If the null hypothesis is false, the regression line will be steep and the departures of the values y from the regression line will be less than the departures from $\bar{y}$. Therefore, the residual mean square $s_{res}^2$ will be smaller than the total variance of Y, $s_Y^2$. We could compare the two estimates by taking the ratio $s_Y^2 / s_{res}^2$. The resulting variance ratio would follow an F distribution if the two estimates of $\sigma^2$ were independent, and if the null hypothesis were false the variance ratio would have a value much larger than expected under $H_0$.
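A minimal sketch of this test in Python (simulated data; shown in the standard ANOVA form, where the variance ratio is the regression mean square over the residual mean square, with 1 and n-2 degrees of freedom):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 30)
y = 4.0 + 2.5 * x + rng.normal(0, 1.5, x.size)

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x
n = x.size

ssr = ((y_hat - y.mean()) ** 2).sum()  # explained sum of squares, 1 df
sse = ((y - y_hat) ** 2).sum()         # residual sum of squares, n-2 df

msr = ssr / 1                          # regression mean square
mse = sse / (n - 2)                    # residual mean square
F = msr / mse
p = stats.f.sf(F, 1, n - 2)            # upper-tail probability under H0: beta1 = 0
print(f"F = {F:.2f}, p-value = {p:.3g}")
```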

45. SECTION 4: QUIZ

46. QUIZ
The regression line is drawn so that:
A. The line goes through more points than any other possible line, straight or curved.
B. The line goes through more points than any other possible straight line.
C. The same number of points are below and above the regression line.
D. The sum of the absolute errors is as small as possible.
E. The sum of the squared errors is as small as possible.
Answer: E. The sum of the squared errors is as small as possible.

47. QUIZ
In order for the regression technique to give the best and minimum variance prediction, all the following conditions must be met, EXCEPT for:
A. The relation is linear.
B. We have not omitted any significant variable.
C. Both the X and Y variables (the predictors and the response) are normally distributed.
D. The residuals (errors) are normally distributed.
E. The variance around the regression line is about the same for all values of the predictor.
Answer: C. Both the X and Y variables (the predictors and the response) are normally distributed.

48. QUIZ
If a regression has the problem of heteroscedasticity,
A. The predictions it makes will be wrong on average.
B. The predictions it makes will be correct on average, but we will not be certain of the RMSE (root-mean-square error).
C. It will also have the problem of an omitted variable or variables.
D. It will also be based on a non-linear equation.
Answer: B. Heteroscedasticity implies that the variance will differ for different values of the regressor.
The assumption of homoscedasticity (meaning "same variance") is central to linear regression models. Homoscedasticity describes a situation in which the error term (that is, the "noise" or random disturbance in the relationship between the independent variables and the dependent variable) is the same across all values of the independent variables. Heteroscedasticity (the violation of homoscedasticity) is present when the size of the error term differs across values of an independent variable. The impact of violating the assumption of homoscedasticity is a matter of degree, increasing as heteroscedasticity increases.
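A minimal Python sketch (entirely simulated; the coefficients and the error model are illustrative assumptions) supporting answer B: even with heteroscedastic errors, the fitted slope is right on average; it is the variability of the errors that changes with the regressor.

```python
import numpy as np

# Under heteroscedasticity, OLS predictions are still correct on average:
# the slope estimate stays unbiased even though the error variance varies with x.
rng = np.random.default_rng(7)
x = np.linspace(1, 10, 100)
true_b0, true_b1 = 2.0, 3.0                             # illustrative assumptions

slopes = []
for _ in range(2000):
    y = true_b0 + true_b1 * x + rng.normal(0, 0.5 * x)  # error SD grows with x
    b1, b0 = np.polyfit(x, y, 1)
    slopes.append(b1)

print(f"true slope: {true_b1}, mean estimated slope: {np.mean(slopes):.3f}")
print(f"spread of slope estimates: {np.std(slopes):.3f}")
```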

49. QUIZ
In regression, the equation that describes how the response variable (y) is related to the explanatory variable (x) is:
a. the correlation model
b. the regression model
c. used to compute the correlation coefficient
d. None of these alternatives is correct.
Answer: b. the regression model

50. QUIZ
The regression line is drawn so that:
A. The line goes through more points than any other possible line, straight or curved.
B. The line goes through more points than any other possible straight line.
C. The same number of points are below and above the regression line.
D. The sum of the absolute errors is as small as possible.
E. The sum of the squared errors is as small as possible.
Answer: E. The sum of the squared errors is as small as possible.

51. QUIZ
The relationship between the number of beers consumed (x) and blood alcohol content (y) was studied in 16 male college students by using least squares regression. The following regression equation was obtained from this study:
$\hat{y} = -0.0127 + 0.0180x$
The above equation implies that:
a. each beer consumed increases blood alcohol by 1.27%
b. on average it takes 1.8 beers to increase blood alcohol content by 1%
c. each beer consumed increases blood alcohol by an average amount of 1.8%
d. each beer consumed increases blood alcohol by exactly 0.018
Answer: c. each beer consumed increases blood alcohol by an average amount of 1.8%
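To verify the arithmetic behind the answer: the slope, not the intercept, gives the change in the prediction per unit change in x,
$$\hat{y} = -0.0127 + 0.0180x \;\Rightarrow\; \Delta\hat{y} = 0.0180\,\Delta x,$$
so one additional beer raises the predicted blood alcohol content by 0.0180, i.e. 1.8% on average (not exactly, because of the random error term).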

52. QUIZ
SSE can never be
a. larger than SST
b. smaller than SST
c. equal to 1
d. equal to zero
Answer: a. larger than SST

53. QUIZ
Regression modeling is a statistical framework for developing a mathematical equation that describes how
a. one explanatory and one or more response variables are related
b. several explanatory and several response variables are related
c. one response and one or more explanatory variables are related
d. All of these are correct.
Answer: c. one response and one or more explanatory variables are related

54. QUIZ
In regression analysis, the variable that is being predicted is the
a. response, or dependent, variable
b. independent variable
c. intervening variable
d. is usually x
Answer: a. response, or dependent, variable

55. QUIZ
In least squares regression, which of the following is not a required assumption about the error term ε?
a. The expected value of the error term is one.
b. The variance of the error term is the same for all values of x.
c. The values of the error term are independent.
d. The error term is normally distributed.
Answer: a. The expected value of the error term is one.

56. QUIZ
Larger values of $r^2$ imply that the observations are more closely grouped about the
a. average value of the independent variables
b. average value of the dependent variable
c. least squares line
d. origin
Answer: c. least squares line

57. QUIZ
In a regression analysis, if $r^2 = 1$, then
a. SSE must also be equal to one
b. SSE must be equal to zero
c. SSE can be any positive value
d. SSE must be negative
Answer: b. SSE must be equal to zero

58. QUIZ
In regression analysis, the variable that is used to explain the change in the outcome of an experiment, or some natural process, is called
a. the x-variable
b. the independent variable
c. the predictor variable
d. the explanatory variable
e. all of the above (a-d) are correct
f. none are correct
Answer: e. all of the above (a-d) are correct

59. QUIZ
In the case of an algebraic model for a straight line, if a value for the x variable is specified, then
a. the exact value of the response variable can be computed
b. the computed response to the independent value will always give a minimal residual
c. the computed value of y will always be the best estimate of the mean response
d. none of these alternatives is correct.
Answer: a. the exact value of the response variable can be computed

60. QUIZ
In a regression and correlation analysis, if $r^2 = 1$, then
a. SSE = SST
b. SSE = 1
c. SSR = SSE
d. SSR = SST
Answer: d. SSR = SST

61. QUIZ
If the coefficient of determination is a positive value, then the regression equation
a. must have a positive slope
b. must have a negative slope
c. could have either a positive or a negative slope
d. must have a positive y intercept
Answer: c. could have either a positive or a negative slope

62. QUIZ
If two variables, x and y, have a very strong linear relationship, then
a. there is evidence that x causes a change in y
b. there is evidence that y causes a change in x
c. there might not be any causal relationship between x and y
d. None of these alternatives is correct.
Answer: c. there might not be any causal relationship between x and y

63. QUIZ
In regression analysis, if the independent variable is measured in kilograms, the dependent variable
a. must also be in kilograms
b. must be in some unit of weight
c. cannot be in kilograms
d. can be any units
Answer: d. can be any units

64. QUIZ
In a regression analysis, if SSE = 200 and SSR = 300, then the coefficient of determination is
a. 0.6667
b. 0.6000
c. 0.4000
d. 1.5000
Answer: b. 0.6000
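Worked out from the definitions on slide 37:
$$r^2 = \frac{SSR}{SST} = \frac{SSR}{SSR + SSE} = \frac{300}{300 + 200} = 0.6$$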

65. QUIZ
A fitted least squares regression line
a. may be used to predict a value of y if the corresponding x value is given
b. is evidence for a cause-effect relationship between x and y
c. can only be computed if a strong linear relationship exists between x and y
d. None of these alternatives is correct.
Answer: a. may be used to predict a value of y if the corresponding x value is given

66. QUIZ
You have carried out a regression analysis, but, after thinking about the relationship between the variables, you have decided you must swap the explanatory and the response variables. After refitting the regression model to the data, you expect that:
a. the value of the correlation coefficient will change
b. the value of SSE will change
c. the value of the coefficient of determination will change
d. the sign of the slope will change
e. nothing changes
Answer: b. the value of SSE will change

67. QUIZ
Suppose you use regression to predict the height of a woman's current boyfriend by using her own height as the explanatory variable. Height was measured in feet from a sample of 100 women undergraduates, and their boyfriends, at Dalhousie University. Now, suppose that the heights of both the women and the men are converted to centimeters. The impact of this conversion on the slope is:
a. the sign of the slope will change
b. the magnitude of the slope will change
c. both a and b are correct
d. neither a nor b are correct
Answer: d. neither a nor b are correct

68. QUIZ
A residual plot:
a. displays residuals of the explanatory variable versus residuals of the response variable.
b. displays residuals of the explanatory variable versus the response variable.
c. displays the explanatory variable versus residuals of the response variable.
d. displays the explanatory variable versus the response variable.
e. displays the explanatory variable on the x axis versus the response variable on the y axis.
Answer: c. displays the explanatory variable versus residuals of the response variable.

69. QUIZ
When the error terms have a constant variance, a plot of the residuals versus the independent variable x has a pattern that
a. fans out
b. funnels in
c. fans out, but then funnels in
d. forms a horizontal band pattern
e. forms a linear pattern that can be positive or negative
Answer: d. forms a horizontal band pattern

70. THANK YOU
SERVER.RAREDIS.ORG/EDU