Chi-Square Goodness-of-Fit Test: Practice Problems and Solutions

The Actuary's Free Study Guide for Exam 3L - Section 55

G. Stolyarov II
This section of sample problems and solutions is a part of The Actuary's Free Study Guide for Exam 3L, authored by Mr. Stolyarov. This is Section 55 of the Study Guide.

The Chi-square distribution ($\chi^2$ distribution) can be defined as follows:

The p.d.f. of $U = \sum_{j=1}^{m} Z_j^2$, where each $Z_j$ is a standard normal random variable independent of all the other $Z_j$'s, is called the Chi-square distribution with $m$ degrees of freedom. (Larsen and Marx 2006, p. 474).

It is easy to remember the Chi-square distribution as the distribution that results from adding the squares of m independent standard normal random variables, where m is the number of degrees of freedom for the Chi-square distribution.
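The definition above can be checked informally by simulation. The following is a minimal sketch, not part of the original study guide: the function name `simulate_chi_square`, the choice m = 4, the number of draws, and the seed are all illustrative assumptions. It relies on the standard fact that a Chi-square random variable with m degrees of freedom has mean m.

```python
import random

def simulate_chi_square(m, n_draws, seed=0):
    """Draw n_draws values of U = Z_1^2 + ... + Z_m^2 with independent standard normal Z_j."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(m)) for _ in range(n_draws)]

# Illustrative choice: m = 4 degrees of freedom, 20,000 simulated values of U.
draws = simulate_chi_square(m=4, n_draws=20000)
sample_mean = sum(draws) / len(draws)

# The sample mean should land close to m, the mean of the Chi-square distribution.
print("sample mean of U:", round(sample_mean, 3))
```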

The chi-square distribution has critical values associated with a given significance level and a number of degrees of freedom. You can look up these critical values in a Table of Critical Values for the Chi-Square Distribution.
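If a printed table is not at hand, a critical value can also be approximated numerically. The sketch below is one possible approach under stated assumptions, not the method used in the study guide: it evaluates the Chi-square c.d.f. through the standard series for the regularized lower incomplete gamma function and then finds the critical value by bisection. The function names `chi2_cdf` and `chi2_critical_value` are hypothetical.

```python
import math

def chi2_cdf(x, df):
    """P(U <= x) for a Chi-square random variable U with df degrees of freedom."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    # Series: P(a, z) = z^a e^(-z) / Gamma(a + 1) * sum_{n >= 0} z^n / ((a+1)(a+2)...(a+n))
    term, total, n = 1.0, 1.0, 0
    while term > 1e-15 * total:
        n += 1
        term *= z / (a + n)
        total += term
    return total * math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))

def chi2_critical_value(alpha, df):
    """Value c with P(U >= c) = alpha, located by bisection on the c.d.f."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, df) < 1.0 - alpha:  # widen the bracket until it contains c
        hi *= 2.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if chi2_cdf(mid, df) < 1.0 - alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Two entries that can be compared against a printed table.
print(round(chi2_critical_value(0.10, 6), 3))  # table value: 10.645
print(round(chi2_critical_value(0.05, 3), 3))  # table value: 7.815
```

Bisection is slow but hard to get wrong, which suits a check against a table better than a faster root-finder would.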

Larsen and Marx define a goodness of fit test as "any procedure that seeks to determine whether a set of data could reasonably have originated from some given probability distribution, or class of probability distributions" (599).

The Chi-square goodness of fit test can be phrased as follows.

"Let $r_1, r_2, \ldots, r_t$ be the set of possible outcomes (or ranges of outcomes) associated with each of $n$ independent trials, where $P(r_i) = p_i$, for $i = 1, 2, \ldots, t$. Let $X_i$ = the number of times $r_i$ occurs. Then the following holds.

a. The random variable $D = \sum_{i=1}^{t} (X_i - np_i)^2/(np_i)$ has approximately a $\chi^2$ distribution with $t-1$ degrees of freedom. For the approximation to be adequate, the $t$ classes should be defined so that $np_i \geq 5$ for all $i$.

b. Let $k_1, k_2, \ldots, k_t$ be the observed frequencies for the outcomes $r_1, r_2, \ldots, r_t$, respectively, and let $np_{1o}, np_{2o}, \ldots, np_{to}$ be the corresponding expected frequencies based on the null hypothesis. At the $\alpha$ level of significance, $H_0: f_Y(y) = f_o(y)$ (or $H_0: p_X(k) = p_o(k)$) is rejected if

$d = \sum_{i=1}^{t} (k_i - np_{io})^2/(np_{io}) \geq \chi^2_{1-\alpha, t-1}$." (Larsen and Marx 2006, p. 616)

For instance, if we have a distribution of observations, and we know both the expected and the actual numbers of observations of each kind, then we can let $k_i$ be the actual number of observations of type $i$ and $np_{io}$ be the expected number of observations of type $i$. Then we can find $d = \sum_{i=1}^{t} (k_i - np_{io})^2/(np_{io})$ and compare it to $\chi^2_{1-\alpha, t-1}$, the critical value of the chi-square distribution with $t-1$ degrees of freedom at significance level $\alpha$ (that is, the value that such a chi-square random variable falls below with probability $1-\alpha$).

If $d \geq \chi^2_{1-\alpha, t-1}$, then we reject the null hypothesis $H_0$.
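The procedure can be sketched in a few lines of code. Everything in this example is an illustrative assumption rather than part of the study guide: the function names, and the made-up data of 60 rolls of a die tested against the null hypothesis that the die is fair. The critical value 11.070 is the table entry for 5 degrees of freedom at the 5% significance level.

```python
def chi_square_statistic(observed, expected):
    """d = sum over all classes of (k_i - n*p_io)^2 / (n*p_io)."""
    return sum((k - e) ** 2 / e for k, e in zip(observed, expected))

def reject_null(observed, expected, critical_value):
    """Reject H0 when the test statistic d is at least the critical value."""
    return chi_square_statistic(observed, expected) >= critical_value

# Hypothetical data: 60 rolls of a die; under H0 each face is expected 10 times.
observed = [8, 12, 9, 11, 10, 10]
expected = [10.0] * 6

# The approximation is considered adequate when every expected count is at least 5.
adequate = min(expected) >= 5

d = chi_square_statistic(observed, expected)  # (4 + 4 + 1 + 1 + 0 + 0) / 10
critical_value = 11.070  # table entry: t - 1 = 5 degrees of freedom, alpha = 0.05

print("d =", d)
print("approximation adequate:", adequate)
print("reject H0:", reject_null(observed, expected, critical_value))
```

Here d is far below the critical value, so the hypothetical data give no reason to reject the null hypothesis of a fair die.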

Sources: Broverman, Sam. Actuarial Exam Solutions - CAS Exam 3 - Fall 2006.

Larsen, Richard J. and Morris L. Marx. An Introduction to Mathematical Statistics and Its Applications. Fourth Edition. Pearson Prentice Hall: 2006. pp. 474, 599, 616.

Original Problems and Solutions from The Actuary's Free Study Guide

Problem S3L55-1. Consider the following table of observed sightings of leopard-mice, by territory and color.

Territory      White    Blue   Yellow   Magenta   Total
Territory 1        3      36       63        62     164
Territory 2       15     531      351        11     908
Territory 3      315      60       14         2     391
Total            333     627      428        75    1463

If you have a hypothesis regarding the distribution of mice in each territory and you use a Chi-square goodness-of-fit test to evaluate your hypothesis, how many degrees of freedom would the associated Chi-square distribution need to have?

Solution S3L55-1. It is useful to think of degrees of freedom as the number of values in the table that you can pick without being constrained by your previous choices to make some particular choice. For instance, in Territory 1, you can pick the number of white, blue, and yellow leopard-mice, but these three choices will determine the number of magenta leopard-mice, provided that the total number of mice in Territory 1 must remain fixed at 164. So you have 3 degrees of freedom from Territory 1. Likewise, in Territory 2, you can make 3 selections for the numbers of three kinds of mice, which will entirely determine the number of mice of the fourth kind. You have no degrees of freedom from Territory 3, because the totals of each kind of mice by color are fixed and cannot be changed, so your selections from Territories 1 and 2 will determine the values in Territory 3. Thus, the Chi-square distribution associated with this goodness-of-fit test will need to have 6 degrees of freedom.
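The counting argument can be made concrete: once the row and column totals are fixed, six freely chosen cells determine the other six. The sketch below is an illustration under that assumption, and the helper name `complete_table` is hypothetical. It rebuilds the whole table from the first three colors of Territories 1 and 2.

```python
row_totals = [164, 908, 391]      # Territories 1, 2, 3
col_totals = [333, 627, 428, 75]  # White, Blue, Yellow, Magenta

# The six freely chosen cells: White, Blue, Yellow counts for Territories 1 and 2.
free_cells = [[3, 36, 63],
              [15, 531, 351]]

def complete_table(free_cells, row_totals, col_totals):
    """Fill in every remaining cell from the fixed row and column totals."""
    # The last color in each freely chosen row is forced by the row total.
    table = [row + [row_totals[i] - sum(row)] for i, row in enumerate(free_cells)]
    # The whole last row is forced by the column totals.
    last_row = [col_totals[j] - sum(r[j] for r in table) for j in range(len(col_totals))]
    table.append(last_row)
    return table

table = complete_table(free_cells, row_totals, col_totals)
degrees_of_freedom = sum(len(row) for row in free_cells)

for row in table:
    print(row)
print("degrees of freedom:", degrees_of_freedom)
```

The same count follows from the general rule for a table with r rows and c columns and fixed totals, (r - 1)(c - 1) = (3 - 1)(4 - 1) = 6.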

Problem S3L55-2. Consider the following table of observed sightings of leopard-mice, by territory and color.

Territory      White    Blue   Yellow   Magenta   Total
Territory 1        3      36       63        62     164
Territory 2       15     531      351        11     908
Territory 3      315      60       14         2     391
Total            333     627      428        75    1463

You have a hypothesis regarding the distribution of mice in each territory and you use a Chi-square goodness-of-fit test to evaluate your hypothesis. Your hypothesis is that in each territory, the distribution of mice by color is the same. The significance level of the Chi-square statistic you are using is 10%. Find the critical value associated with the significance level for this test. Use a Table of Critical Values for the Chi-Square Distribution.

Solution S3L55-2. We want the value corresponding to a significance level of 0.1 and 6 degrees of freedom (as we found in Solution S3L55-1). We look up this value in the table and find it to be 10.645.

Problem S3L55-3. Consider the following table of observed sightings of leopard-mice, by territory and color.

Territory      White    Blue   Yellow   Magenta   Total
Territory 1        3      36       63        62     164
Territory 2       15     531      351        11     908
Territory 3      315      60       14         2     391
Total            333     627      428        75    1463

You have a hypothesis regarding the distribution of leopard-mice in each territory and you use a Chi-square goodness-of-fit test to evaluate your hypothesis. Your hypothesis is that in each territory, the distribution of mice by color is the same. The significance level of the Chi-square statistic you are using is 10%. Create a table of expected observations of leopard-mice by territory if you believe that the total distribution of mice among territories is what will be reflected in the distribution of mice by color. Round your answers to the nearest whole leopard-mouse in each case.

Solution S3L55-3. For each color of leopard-mouse, we would expect 164/1463 = 0.1120984279 of the mice to be in Territory 1, 908/1463 = 0.6206425154 of the mice to be in Territory 2, and 391/1463 = 0.2672590567 of the mice to be in Territory 3, if our null hypothesis is indeed correct. To fit these probabilities in our table while doing computations, we can let A = 0.1120984279, B = 0.6206425154, and C = 0.2672590567. (If you have a TI-83 Plus calculator, you can actually program it to associate these letters with the probabilities given here.)

Thus, we get the following expected distribution of observations:

Territory      White    Blue   Yellow   Magenta   Total
Territory 1     333A    627A     428A       75A     164
Territory 2     333B    627B     428B       75B     908
Territory 3     333C    627C     428C       75C     391
Total            333     627      428        75    1463

This is the same table as the following, with all values rounded to the nearest whole number.

Territory      White    Blue   Yellow   Magenta   Total
Territory 1       37      70       48         8     164
Territory 2      207     389      266        47     908
Territory 3       89     168      114        20     391
Total            333     627      428        75    1463
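The expected counts can also be produced programmatically. The following minimal sketch assumes the same approach as the solution, namely that each expected count equals (territory total) × (color total) / (grand total); the function name `expected_counts` is hypothetical.

```python
# Observed sightings: rows are Territories 1-3, columns are White, Blue, Yellow, Magenta.
observed = [[3, 36, 63, 62],
            [15, 531, 351, 11],
            [315, 60, 14, 2]]

def expected_counts(observed):
    """Expected count for each cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

expected = expected_counts(observed)

# Round to the nearest whole leopard-mouse, as the problem asks.
rounded = [[round(x) for x in row] for row in expected]
for row in rounded:
    print(row)
```

Because of rounding, the rounded cells in a row need not add up exactly to the fixed territory total shown in the Total column.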

Problem S3L55-4. Similar to Question 5 from the Casualty Actuarial Society's Fall 2006 Exam 3.

Consider the following table of observed sightings of leopard-mice, by territory and color.

Territory      White    Blue   Yellow   Magenta   Total
Territory 1        3      36       63        62     164
Territory 2       15     531      351        11     908
Territory 3      315      60       14         2     391
Total            333     627      428        75    1463

You have a hypothesis regarding the distribution of leopard-mice in each territory and you use a Chi-square goodness-of-fit test to evaluate your hypothesis. Your hypothesis is that in each territory, the distribution of mice by color is the same. The significance level of the Chi-square statistic you are using is 10%. Find the test statistic d for this Chi-square goodness of fit test. For your table of expected values, use the answer to Problem S3L55-3 (i.e., use the table with rounded values. The work will proceed much faster if you do.). Also decide whether this test would require you to reject or fail to reject the null hypothesis.

Solution S3L55-4. We have the following two tables to consider.

Table of actual observations (Table for values of $k_i$):

Territory      White    Blue   Yellow   Magenta   Total
Territory 1        3      36       63        62     164
Territory 2       15     531      351        11     908
Territory 3      315      60       14         2     391
Total            333     627      428        75    1463

Table of expected observations (Table for values of $np_{io}$):

Territory      White    Blue   Yellow   Magenta   Total
Territory 1       37      70       48         8     164
Territory 2      207     389      266        47     908
Territory 3       89     168      114        20     391
Total            333     627      428        75    1463

We use the formula $d = \sum_{i=1}^{t} (k_i - np_{io})^2/(np_{io})$, summing over all twelve cells of the table:

$d = (3 - 37)^2/37 + (36 - 70)^2/70 + (63 - 48)^2/48 + (62 - 8)^2/8 + (15 - 207)^2/207 + (531 - 389)^2/389 + (351 - 266)^2/266 + (11 - 47)^2/47 + (315 - 89)^2/89 + (60 - 168)^2/168 + (14 - 114)^2/114 + (2 - 20)^2/20$.

Thus, d is approximately 1448.839. We recall from Solution S3L55-2 that our critical value for this test is 10.645. Since d is substantially greater than 10.645, we reject the null hypothesis.
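A sum with this many terms is easy to mistype, so it can help to let a short program carry it out. The sketch below is an illustration only: it applies the formula for d to every cell of the observed table and the rounded expected table, and the function name `chi_square_statistic` is an assumption.

```python
# Rows are Territories 1-3; columns are White, Blue, Yellow, Magenta.
observed = [[3, 36, 63, 62],
            [15, 531, 351, 11],
            [315, 60, 14, 2]]
expected = [[37, 70, 48, 8],
            [207, 389, 266, 47],
            [89, 168, 114, 20]]

def chi_square_statistic(observed, expected):
    """Sum (k - e)^2 / e over every cell of two tables with the same shape."""
    return sum((k - e) ** 2 / e
               for obs_row, exp_row in zip(observed, expected)
               for k, e in zip(obs_row, exp_row))

d = chi_square_statistic(observed, expected)
critical_value = 10.645  # from Solution S3L55-2: 6 degrees of freedom, 10% level

print("d =", round(d, 3))
print("reject H0:", d >= critical_value)
```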

Problem S3L55-5. Similar to Question 5 from the Casualty Actuarial Society's Fall 2006 Exam 3. You have the following table of expected observations for the number of managers in a given large firm who are completely idle, distributed by their positions in the management hierarchy:

Upper    Upper-Mid-Level    Mid-Mid-Level    Lower-Mid-Level    Total
20       70                 150              560                800

Your null hypothesis is that the distribution of idle managers corresponds to the above table.

Here is what you actually observe with regard to managerial idleness:

Upper    Upper-Mid-Level    Mid-Mid-Level    Lower-Mid-Level    Total
16       96                 175              513                800

You are performing a Chi-square goodness-of-fit test on your hypothesis. Your significance level for this test is 5%. Find the difference between the test statistic d and the critical value c for this test. Use a Table of Critical Values for the Chi-Square Distribution as necessary. Would you need to reject the null hypothesis?

Solution S3L55-5. The Chi-square distribution associated with this test will have 3 degrees of freedom, because once we have picked any 3 values in the table, the fourth value will be determined by our prior choices. We look up the critical value c, the value associated with 3 degrees of freedom and a significance level of 5%. This value is 7.815.

Now we find $d = \sum_{i=1}^{t} (k_i - np_{io})^2/(np_{io}) = (16 - 20)^2/20 + (96 - 70)^2/70 + (175 - 150)^2/150 + (513 - 560)^2/560 = 18.56845238$.

Now we find d - c = 18.56845238 - 7.815 = 10.75345238. Since d - c > 0, the null hypothesis needs to be rejected.
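The arithmetic in this solution can be reproduced with a short script. This is a sketch for checking purposes, with variable names chosen for illustration. It also lists each category's contribution to d, which shows where the observed counts depart most from the hypothesis.

```python
levels = ["Upper", "Upper-Mid-Level", "Mid-Mid-Level", "Lower-Mid-Level"]
expected = [20, 70, 150, 560]
observed = [16, 96, 175, 513]

# Contribution of each management level to the test statistic.
contributions = [(k - e) ** 2 / e for k, e in zip(observed, expected)]
for level, part in zip(levels, contributions):
    print(level, round(part, 6))

d = sum(contributions)
c = 7.815  # table entry: 3 degrees of freedom, 5% significance level
difference = d - c

print("d =", round(d, 8))
print("d - c =", round(difference, 8))
print("reject H0:", difference > 0)
```

The upper-mid-level managers contribute the most to d, accounting for a little over half of the total.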

See other sections of The Actuary's Free Study Guide for Exam 3L.

Published by G. Stolyarov II

G. Stolyarov II is a science fiction novelist, independent essayist, poet, amateur mathematician, composer, author, and actuary.