Copyright © Cengage Learning. All rights reserved.

Copyright © Cengage Learning. All rights reserved. - PowerPoint Presentation

trish-goza
Uploaded On 2019-11-22

Presentation Transcript

14 Goodness-of-Fit Tests and Categorical Data Analysis

14.3 Two-Way Contingency Tables

In the scenarios of Sections 14.1 and 14.2, the observed frequencies were displayed in a single row within a rectangular table. We now study problems in which the data also consists of counts or frequencies, but the data table will now have I rows (I ≥ 2) and J columns, so IJ cells. There are two commonly encountered situations in which such data arises:

1. There are I populations of interest, each corresponding to a different row of the table, and each population is divided into the same J categories. A sample is taken from the ith population (i = 1,…, I) and the counts are entered in the cells in the ith row of the table.

For example, customers of each of I = 3 department-store chains might have available the same J = 5 payment categories: cash, check, store credit card, Visa, and MasterCard.

2. There is a single population of interest, with each individual in the population categorized with respect to two different factors. There are I categories associated with the first factor and J categories associated with the second factor.

A single sample is taken, and the number of individuals belonging in both category i of factor 1 and category j of factor 2 is entered in the cell in row i, column j (i = 1,…, I; j = 1,…, J). As an example, customers making a purchase might be classified according to both the department in which the purchase was made, with I = 6 departments, and the method of payment, with J = 5 as in (1) above.

Let $n_{ij}$ denote the number of individuals in the sample(s) falling in the (i, j)th cell (row i, column j) of the table, that is, the (i, j)th cell count. The table displaying the $n_{ij}$'s is called a two-way contingency table; a prototype is shown in Table 14.9.

[Table 14.9 A Two-Way Contingency Table]

In situations of type 1, we want to investigate whether the proportions in the different categories are the same for all populations. The null hypothesis states that the populations are homogeneous with respect to these categories. In type 2 situations, we investigate whether the categories of the two factors occur independently of one another in the population.

Testing for Homogeneity

Suppose each individual in every one of the I populations belongs in exactly one of the same J categories. A sample of $n_i$ individuals is taken from the ith population; let $n = \sum_{i=1}^{I} n_i$ and

$n_{ij}$ = the number of individuals in the ith sample who fall into category j

$n_{\cdot j} = \sum_{i=1}^{I} n_{ij}$ = the total number of individuals among the n sampled who fall into category j

The $n_{ij}$'s are recorded in a two-way contingency table with I rows and J columns. The sum of the $n_{ij}$'s in the ith row is $n_i$, and the sum of entries in the jth column will be denoted by $n_{\cdot j}$. Let

$p_{ij}$ = the proportion of the individuals in population i who fall into category j

Thus, for population 1, the J proportions are $p_{11}, p_{12}, \ldots, p_{1J}$ (which sum to 1), and similarly for the other populations.

The null hypothesis of homogeneity states that the proportion of individuals in category j is the same for each population and that this is true for every category; that is, for every j, $p_{1j} = p_{2j} = \cdots = p_{Ij}$. When $H_0$ is true, we can use $p_1, p_2, \ldots, p_J$ to denote the population proportions in the J different categories; these proportions are common to all I populations.

The expected number of individuals in the ith sample who fall in the jth category when $H_0$ is true is then $E(N_{ij}) = n_i \cdot p_j$. To estimate $E(N_{ij})$, we must first estimate $p_j$, the proportion in category j. Among the total sample of n individuals, $N_{\cdot j}$ fall into category j, so we use $\hat{p}_j = N_{\cdot j}/n$ as the estimator (this can be shown to be the maximum likelihood estimator of $p_j$).

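Substitution of the estimate $\hat{p}_j = n_{\cdot j}/n$ for $p_j$ in $n_i p_j$ yields a simple formula for estimated expected counts under $H_0$:

$\hat{e}_{ij} = n_i \cdot \dfrac{n_{\cdot j}}{n} = \dfrac{(i\text{th row total})(j\text{th column total})}{n}$

The test statistic also has the same form as in Sections 14.1 and 14.2:

$\chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \dfrac{(n_{ij} - \hat{e}_{ij})^2}{\hat{e}_{ij}}$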
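The estimated expected counts and the chi-squared statistic can be sketched in a few lines of Python. This is only an illustration and is not part of the original slides: the function names `expected_counts` and `chi_squared` and the 2 × 3 table `counts` are hypothetical, chosen to make the arithmetic easy to follow.

```python
def expected_counts(table):
    """Estimated expected counts: (row i total)(column j total) / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_squared(table):
    """Sum over all cells of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )


# Hypothetical counts (NOT data from the slides): I = 2 samples, J = 3 categories.
counts = [[20, 30, 50],
          [30, 30, 40]]

print(expected_counts(counts))  # each row: 25.0, 30.0, 45.0
print(chi_squared(counts))      # 28/9, about 3.11
```

Because both hypothetical samples have size 100, the two rows of estimated expected counts coincide; with unequal sample sizes they would be proportional to the row totals instead.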

The number of degrees of freedom comes from the general rule of thumb. In each row of Table 14.9 there are J − 1 freely determined cell counts (each sample size $n_i$ is fixed), so there are a total of I(J − 1) freely determined cells. Parameters $p_1, \ldots, p_J$ are estimated, but because $\sum_j p_j = 1$, only J − 1 of these are independent. Thus df = I(J − 1) − (J − 1) = (J − 1)(I − 1).


Example 14.13 A company packages a particular product in cans of three different sizes, each one using a different production line. Most cans conform to specifications, but a quality control engineer has identified the following reasons for nonconformance:

1. Blemish on can
2. Crack in can
3. Improper pull tab location
4. Pull tab missing
5. Other

A sample of nonconforming units is selected from each of the three lines, and each unit is categorized according to reason for nonconformity, resulting in the following contingency table data:

[Contingency table not reproduced in this transcript]

Does the data suggest that the proportions falling in the various nonconformance categories are not the same for the three lines? The parameters of interest are the various proportions, and the relevant hypotheses are

$H_0$: the production lines are homogeneous with respect to the five nonconformance categories; that is, $p_{1j} = p_{2j} = p_{3j}$ for j = 1,…, 5

$H_a$: the production lines are not homogeneous with respect to the categories

The estimated expected frequencies (assuming homogeneity) must now be calculated. Consider the first nonconformance category for the first production line. When the lines are homogeneous, the estimated expected number among the 150 selected units that are blemished is (first row total)(first column total)/(total of sample sizes), where the first row total is 150.

The contribution of the cell in the upper-left corner to $\chi^2$ is then (observed − estimated expected)²/(estimated expected) for that cell.

The other contributions are calculated in a similar manner. Figure 14.5 shows Minitab output for the chi-squared test.

[Figure 14.5 Minitab output for the chi-squared test of Example 14.13]

The observed count is the top number in each cell, and directly below it is the estimated expected count. The contribution of each cell to $\chi^2$ appears below the counts, and the test statistic value is $\chi^2$ = 21.403. All estimated expected counts are at least 5, so combining categories is unnecessary. The test is based on (3 − 1)(5 − 1) = 8 df. Appendix Table A.11 shows that the area under the 8 df chi-squared curve to the right of 20.09 is .010 and the area to the right of 21.95 is .005.

Therefore we can say that .005 < P-value < .01; Minitab gives P-value = .006. Using a significance level of .01, the null hypothesis of homogeneity can be rejected in favor of the alternative that the distribution of reason for nonconformity is somehow different for the three production lines. At this point it is desirable to seek an explanation for why the hypothesis of homogeneity is implausible.
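The bracketing of the P-value can be checked numerically. The sketch below is not from the slides; it relies on the standard closed form for the upper-tail area of a chi-squared distribution with an even number of degrees of freedom, $P(X > x) = e^{-x/2} \sum_{k=0}^{df/2 - 1} (x/2)^k / k!$. The function name `chi2_upper_tail_even_df` is hypothetical.

```python
import math


def chi2_upper_tail_even_df(x, df):
    """Upper-tail area P(X > x) for a chi-squared rv with EVEN df (closed form)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term = math.exp(-half)  # k = 0 term of the sum
    total = term
    for k in range(1, df // 2):
        term *= half / k    # builds (x/2)^k / k! incrementally
        total += term
    return total


# Test statistic and df reported in Example 14.13.
p_value = chi2_upper_tail_even_df(21.403, 8)
print(round(p_value, 3))  # 0.006

# Tail areas at the two tabled values quoted from Appendix Table A.11.
print(chi2_upper_tail_even_df(20.09, 8))
print(chi2_upper_tail_even_df(21.95, 8))
```

The result agrees with the Minitab P-value quoted above and falls between the two tabled tail areas.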

Figure 14.6 shows a stacked comparative bar chart of the data.

It appears that the three lines are relatively homogeneous with respect to the Other and Missing categories but not with respect to the Location, Crack, and Blemish categories. Line 1's incidence rate of crack nonconformities is much higher than for the other two lines, location nonconformities appear to be more of a problem for line 2 than for the other two lines, and blemish nonconformities occur much more frequently for line 3 than for the other two lines.

Testing for Independence

We focus now on the relationship between two different factors in a single population. Each individual in the population is assumed to belong in exactly one of the I categories associated with the first factor and exactly one of the J categories associated with the second factor.

For example, the population of interest might consist of all individuals who regularly watch the national news on television, with the first factor being preferred network (ABC, CBS, NBC, or PBS, so I = 4) and the second factor political philosophy (liberal, moderate, or conservative, giving J = 3).

For a sample of n individuals taken from the population, let $n_{ij}$ denote the number among the n who fall both in category i of the first factor and category j of the second factor. The $n_{ij}$'s can be displayed in a two-way contingency table with I rows and J columns. In the case of homogeneity for I populations, the row totals were fixed in advance, and only the J column totals were random.

Now only the total sample size is fixed, and both the $n_{i\cdot}$'s and $n_{\cdot j}$'s are observed values of random variables. To state the hypotheses of interest, let

$p_{ij}$ = the proportion of individuals in the population who belong in category i of factor 1 and category j of factor 2 = P(a randomly selected individual falls in both category i of factor 1 and category j of factor 2)

Then

$p_{i\cdot} = \sum_{j} p_{ij}$ = P(a randomly selected individual falls in category i of factor 1)

$p_{\cdot j} = \sum_{i} p_{ij}$ = P(a randomly selected individual falls in category j of factor 2)

Recall that two events, A and B, are independent if P(A ∩ B) = P(A) · P(B).

The null hypothesis here says that an individual's category with respect to factor 1 is independent of the category with respect to factor 2. In symbols, this becomes $p_{ij} = p_{i\cdot} \cdot p_{\cdot j}$ for every pair (i, j). The expected count in cell (i, j) is $n \cdot p_{ij}$, so when the null hypothesis is true, $E(N_{ij}) = n \cdot p_{i\cdot} \cdot p_{\cdot j}$. To obtain a chi-squared statistic, we must therefore estimate the $p_{i\cdot}$'s (i = 1,…, I) and $p_{\cdot j}$'s (j = 1,…, J).

The (maximum likelihood) estimates are

$\hat{p}_{i\cdot} = n_{i\cdot}/n$ = sample proportion for category i of factor 1

and

$\hat{p}_{\cdot j} = n_{\cdot j}/n$ = sample proportion for category j of factor 2

This gives estimated expected cell counts identical to those in the case of homogeneity: $\hat{e}_{ij} = n \cdot \hat{p}_{i\cdot} \cdot \hat{p}_{\cdot j} = n_{i\cdot} n_{\cdot j}/n$.
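A short Python sketch can confirm that estimating the marginal proportions and multiplying gives the same numbers as the (row total)(column total)/n shortcut. The 2 × 2 table below is hypothetical and is not data from the slides; the variable names are illustrative only.

```python
# Hypothetical counts (NOT from the slides): rows = factor 1, columns = factor 2.
counts = [[30, 20],
          [30, 40]]

n = sum(sum(row) for row in counts)
row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]

# Sample proportions estimating the marginal probabilities.
p_row = [r / n for r in row_totals]   # estimates of p_i.
p_col = [c / n for c in col_totals]   # estimates of p_.j

# Estimated expected counts under independence: n * p_i. * p_.j
expected = [[n * pi * pj for pj in p_col] for pi in p_row]

# The same quantities via (row total)(column total) / n.
shortcut = [[r * c / n for c in col_totals] for r in row_totals]

print(expected)
print(shortcut)
```

With these hypothetical counts the row totals are 50 and 70 and both column totals are 60, so the estimated expected counts are 25 in the first row and 35 in the second, up to floating-point rounding.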

The test statistic is also identical to that used in testing for homogeneity, as is the number of degrees of freedom. This is because the number of freely determined cell counts is IJ − 1, since only the total n is fixed in advance.

There are I estimated $p_{i\cdot}$'s, but only I − 1 are independently estimated since $\sum_i p_{i\cdot} = 1$; similarly, J − 1 of the $p_{\cdot j}$'s are independently estimated, so I + J − 2 parameters are independently estimated. The rule of thumb now yields df = IJ − 1 − (I + J − 2) = IJ − I − J + 1 = (I − 1)(J − 1).
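The two counting arguments (fixed row totals for homogeneity, fixed grand total for independence) can be compared directly. The sketch below simply encodes the rule of thumb, df = freely determined cells minus independently estimated parameters; the function names are hypothetical.

```python
def df_homogeneity(I, J):
    """Row totals fixed: I(J - 1) free cells, J - 1 estimated parameters."""
    free_cells = I * (J - 1)
    estimated_params = J - 1
    return free_cells - estimated_params


def df_independence(I, J):
    """Only n fixed: IJ - 1 free cells, (I - 1) + (J - 1) estimated parameters."""
    free_cells = I * J - 1
    estimated_params = (I - 1) + (J - 1)
    return free_cells - estimated_params


# Both arguments reduce to (I - 1)(J - 1) for any table size.
for I in range(2, 7):
    for J in range(2, 7):
        assert df_homogeneity(I, J) == df_independence(I, J) == (I - 1) * (J - 1)

print(df_homogeneity(3, 5))   # 8, the df used in Example 14.13
print(df_independence(4, 4))  # 9, the df used in Example 14.14
```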


Example 14.14 The accompanying two-way table from Minitab (Table 14.10) gives a cross-classification in which the row factor is level of paternal education (completed university, partial university, secondary, partial secondary) and the column factor represents the quartile of neonatal (i.e., newborn) weight gain (Q1 = lowest 25%, Q2 = next lowest 25%, Q3, Q4); the data appeared in the article "Impact of Neonatal Growth on IQ and Behavior at Early School Age" (Pediatrics, July 2013, e53–60). Does it appear that educational level is independent of NWG in the sampled population?


The contribution to $\chi^2$ from the cell in the upper-left corner is $(n_{11} - 411.63)^2/411.63 = .261$, where $n_{11}$ is the observed count in that cell and 411.63 is its estimated expected count. The 15 other contributions are calculated in the same way. Then $\chi^2$ = .261 + … + .253 = 19.016. When $H_0$ is true, the test statistic has approximately a chi-squared distribution with (4 − 1)(4 − 1) = 9 df. The expected value of a chi-squared rv is just its number of degrees of freedom, so $E(\chi^2) = 9$ under the assumption of independence.

Clearly the test statistic value exceeds what would be expected if the two factors were independent, but is it by enough to suggest implausibility of this null hypothesis? Table A.11 shows that .025 is the area to the right of 19.02 under the chi-squared curve with 9 df. Thus the P-value for the test is roughly .025 (which is the value calculated by Minitab; the cited article reported .03). At significance level .05, the null hypothesis of independence would be rejected since P-value = .025 ≤ .05 = α. However, this conclusion would not be justified at a significance level of .01. The P-value is such that people might argue over what conclusion is appropriate.
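The tail area for 9 df can also be computed directly rather than read from a table. The sketch below is not from the slides; it uses the standard closed form for an odd number of degrees of freedom, which combines the complementary error function with a finite sum. The function name `chi2_upper_tail_odd_df` is hypothetical.

```python
import math


def chi2_upper_tail_odd_df(x, df):
    """Upper-tail area P(X > x) for a chi-squared rv with ODD df (closed form)."""
    if df <= 0 or df % 2 != 1:
        raise ValueError("this closed form requires a positive odd df")
    if x <= 0:
        return 1.0
    # df = 1 case: P(Z^2 > x) = erfc(sqrt(x / 2)).
    total = math.erfc(math.sqrt(x / 2.0))
    # Remaining terms: sqrt(2/pi) * exp(-x/2) * x^(k - 1/2) / (1 * 3 * ... * (2k - 1)).
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)  # k = 1
    for k in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * k + 1)  # advance from term k to term k + 1
    return total


# Test statistic and df reported in Example 14.14.
p_value = chi2_upper_tail_odd_df(19.016, 9)
print(round(p_value, 3))  # 0.025
```

The computed value is below .05 and above .01, matching the two conclusions drawn in the example at those significance levels.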

Someone persuaded by our analysis to reject the assertion of independence would want to look more closely at the data to seek an explanation for that conclusion. Perhaps, for example, those in a higher quartile tend to have higher educational levels. Figure 14.7 shows histograms (bar graphs) of the percentages in the various educational level categories for each of the four different quartiles.

The four histograms appear to be very similar; the visual impression is that the distribution over the four educational levels does not depend much on the NWG quartile. This seemingly contradicts the finding of statistical significance. Now note that the sample size here is extremely large, and this inflates the value of the chi-squared statistic. With the same percentages as in Figure 14.7 but a much more moderate sample size, the value of $\chi^2$ would be much smaller and the P-value much larger. Our test result achieved statistical significance, but there does not seem to be any practical significance.
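The effect of sample size on the statistic can be seen with a small experiment: multiplying every cell count by the same factor leaves all percentages unchanged but multiplies $\chi^2$ by that factor. The sketch below uses a hypothetical 2 × 3 table, not the data of Table 14.10, and a hypothetical helper `chi_squared`.

```python
def chi_squared(table):
    """Chi-squared statistic with expected counts (row total)(column total) / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (obs - r * c / n) ** 2 / (r * c / n)
        for row, r in zip(table, row_totals)
        for obs, c in zip(row, col_totals)
    )


# Hypothetical counts (NOT from the slides).
small = [[20, 30, 50],
         [30, 30, 40]]

# Same cell percentages, ten times the sample size.
large = [[10 * count for count in row] for row in small]

print(chi_squared(small))
print(chi_squared(large))  # ten times the value for `small`
```

By the same scaling, a tenth of the sample with the same percentages as in Example 14.14 would give a statistic near 19.016/10 ≈ 1.9, well below the value of 9 expected under independence.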


Models and methods for analyzing data in which each individual is categorized with respect to three or more factors (multidimensional contingency tables) are discussed in several of the chapter references.