Discuss the difference between correlation and causation

Question:

Discuss the difference between correlation and causation
Discuss the purpose of multiple regression
Discuss the underlying assumptions of multiple regression and what can be done if the assumptions are not met.

Note:

1. Define the terms in your own words. Do not quote directly from the textbook.

2. Write at least two paragraphs.

3. Cite the textbook as a reference.

4. Cite at least one peer-reviewed article as a reference.

5. The textbook and the related PowerPoint slides are attached.

Correlation and Linear Regression

Chapter 13

13-1

In this chapter, we study the relationship between two interval- or ratio-level variables and develop numerical measures to express the relationship between two variables. We also develop an equation to express the relationship between variables. We examine both correlation analysis and regression analysis.

Learning Objectives

LO13-1 Explain the purpose of correlation analysis

LO13-2 Calculate a correlation coefficient to test and interpret the relationship between two variables

LO13-3 Apply regression analysis to estimate the linear relationship between two variables

LO13-4 Evaluate the significance of the slope of the regression equation

LO13-5 Evaluate a regression equation’s ability to predict using the standard error of estimate and the coefficient of determination

LO13-6 Calculate and interpret confidence and prediction intervals

LO13-7 Use a log function to transform a nonlinear relationship

13-2

What is Correlation Analysis?

Used to report the relationship between two variables

In addition to graphing techniques, we’ll develop numerical measures to describe the relationships

Examples

Does the amount Healthtex spends per month on training its sales force affect its monthly sales?

Does the number of hours students study for an exam influence the exam score?

CORRELATION ANALYSIS A group of techniques to measure the relationship between two variables.

13-3

In all business fields, identifying and studying relationships between variables can provide information on ways to increase profits, methods to decrease costs, or variables to predict demand.

Scatter Diagram

A scatter diagram is a graphic tool used to portray the relationship between two variables

The independent variable is scaled on the X-axis and is the variable used as the predictor

The dependent variable is scaled on the Y-axis and is the variable being estimated

Graphing the data in a scatter diagram will make the relationship between sales calls and copier sales easier to see.

13-4

We often begin our study of the relationship between two variables with a scatter diagram. It gives us a visual representation of the relationship between the variables. For instance, a sales manager wants to know if there is a relationship between the number of sales calls made in a month and the number of copiers sold that month and begins the analysis with a random sample of 15 sales representatives. With this data, the number of sales calls is the independent variable and number of copiers sold is the dependent variable.

Scatter Diagram Example

North American Copier Sales sells copiers to businesses of all sizes throughout the United States and Canada. The new national sales manager is preparing for an upcoming sales meeting and would like to impress upon the sales representatives the importance of making an extra sales call each day. She takes a random sample of 15 sales representatives and gathers information on the number of sales calls made last month and the number of copiers sold. Develop a scatter diagram of the data.

Sales reps who make more calls tend to sell more copiers!

13-5

We develop a scatter diagram of the data. The first salesperson, Brian Virost, made 96 sales calls and sold 41 copiers; to plot this point, move along the horizontal axis to x = 96, then go vertically to y = 41 and place a dot at that intersection. Do this for all the sales data. It is perfectly reasonable for the manager to tell the salespeople that the more sales calls they make, the more copiers they can expect to sell. Note that while there does seem to be a positive relationship between the two variables, not all the points fall on a line.

Correlation Coefficient

Characteristics of the correlation coefficient are:

The sample correlation coefficient is identified as r

It shows the direction and strength of the linear relationship between two interval- or ratio-scale variables

It ranges from −1.00 to 1.00

If it’s 0, there is no linear association

A value near 1.00 indicates a direct or positive correlation

A value near −1.00 indicates a negative correlation

CORRELATION COEFFICIENT A measure of the strength of the linear relationship between two variables.

13-6

Both variables must be at least the interval scale of measurement to find the correlation coefficient. A value of −1 indicates perfect negative correlation and a value of +1 indicates perfect positive correlation.

Correlation Coefficient (2 of 2)

The following graphs summarize the strength and direction of the correlation coefficient

13-7

In the set of charts at the bottom of the slide, the first one indicates no correlation between the number of children (the independent variable) and income (the dependent variable). The middle chart shows a slightly negative correlation between price and quantity. The chart on the right shows a strong positive relationship between hours studied (the independent variable) and exam score (the dependent variable).

Correlation Coefficient, r

How is the correlation coefficient determined? We’ll use North American Copier Sales as an example. We begin with a scatter diagram, but this time we’ll draw a vertical line at the mean of the x-values (96 sales calls) and a horizontal line at the mean of the y-values (45 copiers).

13-8

Drawing lines through the center of the data establishes quadrants. The two variables are positively related: when the number of sales calls is above the mean, the number of copiers sold is also above the mean, and the points appear in quadrant I. When the number of sales calls is less than the mean, so is the number of copiers sold, and the points appear in quadrant III.

Correlation Coefficient, r, Continued

How is the correlation coefficient determined? Now we find the deviations from the mean number of sales calls and the mean number of copiers sold, then multiply them. The sum of their products is 6,672 and will be used in formula 13-1 to find r. We also need the standard deviations. The result, r = .865, indicates a strong, positive relationship.

13-9

The correlation coefficient is designated by the letter r and found with equation 13-1. We will use Excel to find the standard deviations of the two variables, x (sales calls) and y (copier sales) to use in the formula.
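As a minimal sketch of the calculation described above, the Python function below computes r as the sum of the products of the deviations from the two means, divided by (n − 1) times the two sample standard deviations. The five data pairs are hypothetical and chosen only for illustration; they are not the North American Copier Sales sample.

```python
import math

def correlation(xs, ys):
    """Sample correlation coefficient r for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of the products of the deviations from the two means
    sum_products = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Sample standard deviations (divisor n - 1)
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return sum_products / ((n - 1) * s_x * s_y)

# Hypothetical data, not taken from the textbook example
calls = [20, 40, 60, 80, 100]
copiers = [30, 40, 45, 60, 70]
r = correlation(calls, copiers)
print(round(r, 3))
```

A value this close to 1 would indicate a strong positive linear relationship in the hypothetical data.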

Correlation Coefficient Example

The Applewood Auto Group’s marketing department believes younger buyers purchase vehicles on which lower profits are earned and older buyers purchase vehicles on which higher profits are earned. They would like to use this information as part of an upcoming advertising campaign to try to attract older buyers. Develop a scatter diagram and then determine the correlation coefficient. Would this be a useful advertising feature?

The scatter diagram suggests that a positive relationship does exist between age and profit, but it does not appear to be a strong relationship.

Next, calculate r, which is 0.262. The relationship is positive but weak. The data does not support a business decision to create an advertising campaign to attract older buyers!

13-10

We use Excel to calculate r; r is .262 and is much closer to zero than one. We would observe the relationship between the age of the buyer and the profit of their purchase is not strong.

Testing the Significance of r

13-11

Testing the Significance of r Example

13-12

The population in this example is all of the salespeople employed by the firm. This is a two-tailed test. We use Appendix B.5 for degrees of freedom n-2=15-2=13 and a level of significance of .05. Use formula 13-2; the result is 6.216. We reject the null hypothesis; there is correlation with respect to the number of sales calls made and the number of copiers sold in the population of salespeople.

Testing the Significance of r Example Continued

13-13

Step 5: Make decision; reject H0, t=6.216

Step 6: Interpret; there is correlation with respect to the number of sales calls made and the number of copiers sold in the population of salespeople.

Testing the Significance of the Correlation Coefficient

In the Applewood Auto Group example, we found an r=0.262 which is positive, but rather weak. We test our conclusion by conducting a hypothesis test that the correlation is greater than 0.

13-14

This is a one-tailed (right-tailed) test. The degrees of freedom are n − 2 = 180 − 2 = 178. Appendix B.5 does not list 178, so we use the nearest value, 180, which gives a critical value of 1.653. We use formula 13-2 and conclude the sample correlation is too large to have come from a population with no correlation. Even so, because the relationship is weak, the outcome of a marketing campaign directed to older buyers is uncertain.
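The two significance tests above can be checked with a short sketch. It assumes formula 13-2 is the usual t statistic for a correlation coefficient, t = r√(n − 2) / √(1 − r²), which reproduces the reported value of 6.216 for the copier data. The function name is hypothetical; the critical value is the one quoted in the notes.

```python
import math

def t_for_correlation(r, n):
    """t statistic for H0: the population correlation is zero (n - 2 df)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# North American Copier Sales: r = .865, n = 15
t_copiers = t_for_correlation(0.865, 15)

# Applewood Auto Group: r = .262, n = 180, right-tailed critical value 1.653
t_applewood = t_for_correlation(0.262, 180)
critical_applewood = 1.653

print(round(t_copiers, 3))
print(round(t_applewood, 3), t_applewood > critical_applewood)
```

Even a weak correlation such as .262 can be statistically significant when the sample is large, which is why the Applewood test rejects the null hypothesis.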

Regression Analysis

In regression analysis, we estimate one variable based on another variable

The variable being estimated is the dependent variable

The variable used to make the estimate or predict the value is the independent variable

The relationship between the variables is linear

Both the independent and the dependent variables must be interval or ratio scale

REGRESSION EQUATION An equation that expresses the linear relationship between two variables.

13-15

The least squares criterion is used to determine the regression equation.

Least Squares Principle

In regression analysis, our objective is to use the data to position a line that best represents the relationship between two variables

The first approach is to use a scatter diagram to visually position the line

But this depends on judgement; we would prefer a method that results in a single, best regression line

13-16

The lines drawn in the chart on the right represent the judgement of four people. The method that results in a single, best regression line is called the least squares principle.

Least Squares Regression Line

To illustrate, the same data are plotted in the three charts below

LEAST SQUARES PRINCIPLE A mathematical procedure that uses the data to position a line with the objective of minimizing the sum of the squares of the vertical distances between the actual y values and the predicted values of y.

13-17

The line drawn in chart 13-9 is the best-fitting line and is drawn using the least squares method. It is the best fit because the sum of the squares of the vertical deviations about it is at a minimum; the sum of the squares is 24. Charts 13-10 and 13-11 were drawn differently, and their sums of squares are 44 and 132, respectively.

Least Squares Regression Line (2 of 2)

13-18

Least Squares Regression Line Example

Recall the example of North American Copier Sales. The sales manager gathered information on the number of sales calls made and the number of copiers sold. Use the least squares method to determine a linear equation to express the relationship between the two variables.

13-19

The first step is to find the slope of the least squares regression line, b

Next, find a

Then determine the regression line

So if a salesperson makes 100 calls, he or she can expect to sell 46.0432 copiers

The b value of .2608 indicates that for each additional sales call, the sales representative can expect to increase the number of copiers sold by about .2608. So 20 additional sales calls in a month will result in about five more copiers being sold.
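The intercept and the prediction above can be reproduced from numbers already given in these notes: the means (96 sales calls, 45 copiers) and the slope b = .2608. The sketch relies on the fact that a least squares line passes through the point of means, so a = ȳ − b·x̄. The function name is hypothetical.

```python
# Values reported in the notes
x_bar = 96      # mean number of sales calls
y_bar = 45      # mean number of copiers sold
b = 0.2608      # slope of the least squares line

# The least squares line passes through (x_bar, y_bar), so a = y_bar - b * x_bar
a = y_bar - b * x_bar

def predict_copiers(calls):
    """Estimated number of copiers sold for a given number of sales calls."""
    return a + b * calls

print(round(a, 4))                     # intercept
print(round(predict_copiers(100), 4))  # estimate for 100 sales calls
print(round(b * 20, 1))                # extra copiers for 20 extra calls
```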

Drawing the Regression Line

13-20

The line of regression is drawn on the scatter diagram. Estimated sales for all sales representatives are calculated using the formula we determined earlier and placed in the table. The regression line will always pass through the mean of variables x and y. Moreover, no other line through the data has a smaller sum of squared deviations.

Regression Equation Slope Test

13-21

Regression Equation Slope Test Example

13-22

This is a one-tailed test. If we do not reject the null hypothesis, we conclude that the slope of the regression line could be zero. We use Excel to determine the needed regression statistics. We find the critical value in Appendix B.5 with n − 2 = 15 − 2 = 13 degrees of freedom and a level of significance of .05; it is 1.771. We reject the null hypothesis and conclude the slope of the line is greater than 0.

Regression Equation Slope Test Example (2 of 2)

13-23

Highlighted, b is .2606 (the software’s value; the hand calculation with rounded inputs gave .2608); the standard error is .0420
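A quick sketch of the slope test, assuming the test statistic is the usual t = (b − 0) / s_b with n − 2 degrees of freedom. The slope, its standard error, and the critical value are the figures quoted in the notes.

```python
b = 0.2606          # slope from the regression output
s_b = 0.0420        # standard error of the slope
critical_t = 1.771  # one-tailed, .05 level, 13 degrees of freedom

# Test statistic for H0: the population slope is not greater than zero
t = (b - 0) / s_b
reject_h0 = t > critical_t

print(round(t, 3), reject_h0)
```

Because t is well above 1.771, the sketch agrees with the conclusion that the slope is greater than 0.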

Evaluating a Regression Equation’s Ability to Predict

Perfect prediction is practically impossible in almost all disciplines, including economics and business

The North American Copier Sales example showed a significant relationship between sales calls and copier sales; the equation is

Number of copiers sold = 19.9632 + .2608(Number of sales calls)

For 84 sales calls, the equation estimates 41.8704 copiers sold. We did have two employees with 84 sales calls, but they sold just 30 and 24.

So, is the regression equation a good predictor?

We need a measure that will tell how inaccurate the estimate might be

13-24

The measure we’ll use is the standard error of the estimate, s_y·x. We find more information on the next slide.

The Standard Error of Estimate

The standard error of estimate measures the variation around the regression line

It is in the same units as the dependent variable

It is based on squared deviations from the regression line

Small values indicate that the points cluster closely about the regression line

It is computed using the following formula

STANDARD ERROR OF ESTIMATE A measure of the dispersion, or scatter, of the observed values around the line of regression for a given value of x.

13-25

The standard error of estimate is the same concept as the standard deviation in chapter 3. The standard deviation measures dispersion around the mean. The standard error of estimate measures dispersion around the regression line for a given value of x.

The Standard Error of Estimate Example

The standard error of estimate is 6.720

If the standard error of estimate is small, this indicates that the data are relatively close to the regression line and the regression equation can be used. If it is large, the data are widely scattered around the regression line and the regression equation will not provide a precise estimate of y.

13-26

The standard error of estimate can be calculated using statistical software like Excel.
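For readers without Excel, here is a minimal sketch of the same calculation in Python. It assumes the usual definition: the square root of the sum of squared deviations from the regression line divided by n − 2. The data are hypothetical and are not the copier sample, so the result is not 6.720.

```python
import math

def least_squares(xs, ys):
    """Return intercept a and slope b of the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return a, b

def standard_error_of_estimate(xs, ys):
    """Dispersion of the observed y values around the regression line."""
    a, b = least_squares(xs, ys)
    # Squared vertical deviations between actual and predicted y
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (len(xs) - 2))

# Hypothetical data, not taken from the textbook example
calls = [20, 40, 60, 80, 100]
copiers = [30, 40, 45, 60, 70]
s_yx = standard_error_of_estimate(calls, copiers)
print(round(s_yx, 3))
```

The result is in the same units as the dependent variable, here copiers sold.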

Coefficient of Determination

It ranges from 0 to 1.0

It is the square of the correlation coefficient

It is found from the following formula

In the North American Copier Sales example, the correlation coefficient was .865; just square that: (.865)² = .748; this is the coefficient of determination

This means 74.8% of the variation in the number of copiers sold is explained by the variation in sales calls

COEFFICIENT OF DETERMINATION The proportion of the total variation in the dependent variable Y that is explained, or accounted for, by the variation in the independent variable X.

13-27

The coefficient of determination provides a more interpretable measure of a regression equation’s ability to predict. It’s easy to compute too; just square the correlation coefficient.

Relationships among r, r², and s_y·x

Recall the standard error of estimate measures how close the actual values are to the regression line

When it is small, the two variables are closely related

The correlation coefficient measures the strength of the linear association between two variables

When points on the scatter diagram are close to the line, the correlation coefficient tends to be large

Therefore, the correlation coefficient and the standard error of estimate are inversely related

13-28

As the strength of a linear relationship between two variables increases, the correlation coefficient increases and the standard error of the estimate decreases.
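The inverse relationship can be illustrated with two hypothetical data sets that share the same x values, one tightly clustered around a line and one widely scattered. The sketch assumes the standard identities r² = 1 − SSE / SS total and s_y·x = √(SSE / (n − 2)); all names and data are illustrative only.

```python
import math

def fit_summary(xs, ys):
    """Return (r, r_squared, s_yx) for a least squares line fitted to the data."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    b = sxy / sxx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1 - sse / ss_total          # proportion of variation explained
    r = math.copysign(math.sqrt(r_squared), b)  # r takes the sign of the slope
    s_yx = math.sqrt(sse / (n - 2))
    return r, r_squared, s_yx

# Hypothetical data: same x values, different amounts of scatter
x = [1, 2, 3, 4, 5, 6]
tight = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
loose = [4, 1, 8, 5, 12, 9]

r_tight, r2_tight, s_tight = fit_summary(x, tight)
r_loose, r2_loose, s_loose = fit_summary(x, loose)
print(round(r_tight, 3), round(s_tight, 3))
print(round(r_loose, 3), round(s_loose, 3))
```

The tightly clustered set has the larger r and the smaller standard error of estimate, which is the pattern the slide describes.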

Inference about Linear Regression

We can predict the number of copiers sold (y) for a selected value of number of sales calls made (x)

But first, let’s review the regression assumptions of each of the distributions in the graph below

13-29

We’ll now relate these assumptions to North American Copier Sales.

Constructing Confidence and Prediction Intervals

Use a confidence interval when the regression equation is used to predict the mean value of y for a given value of x

For instance, we would use a confidence interval to estimate the mean salary of all executives in the retail industry based on their years of experience

Use a prediction interval when the regression equation is used to predict an individual y for a given value of x

For instance, we would estimate the salary of a particular retail executive who has 20 years of experience

13-30

Two different intervals can be constructed for a selected value of the independent variable: a confidence interval and a prediction interval. The width of each interval is affected by the level of confidence, the size of the standard error of the estimate, the size of the sample, and the value of the independent variable. The only difference between formulas 13-11 and 13-12 is the additional 1 under the radical in the prediction interval formula, so the prediction interval will be wider than the confidence interval.

Confidence Interval and Prediction Interval Example

We return to the North American Copier Sales example. Determine a 95% confidence interval for all sales representatives who make 50 calls, and determine a prediction interval for Sheila Baker, a west coast sales representative who made 50 sales calls.

The 95% confidence interval for all sales representatives is 27.3942 up to 38.6122.

The 95% prediction interval for Sheila Baker is 17.442 up to 48.5644 copiers.
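Both intervals can be reproduced with a short sketch. The regression equation, the standard error of estimate (6.720), the sample size, and the mean of x come from these notes. Two inputs are assumptions that the notes do not state: the t value of 2.160 (95% confidence, 13 degrees of freedom) and a sum of squared deviations of x equal to 25,600, which is consistent with the intervals reported above.

```python
import math

# Values reported in the notes
a, b = 19.9632, 0.2608   # regression equation
s_yx = 6.720             # standard error of estimate
n = 15                   # sample size
x_bar = 96               # mean number of sales calls
x = 50                   # selected number of sales calls

# Assumed values, not stated in the notes
t = 2.160                # t for 95% confidence, n - 2 = 13 degrees of freedom
sum_sq_dev_x = 25_600    # assumed sum of squared deviations of x from its mean

y_hat = a + b * x
core = 1 / n + (x - x_bar) ** 2 / sum_sq_dev_x

# Confidence interval for the mean of y; prediction interval adds 1 under the radical
ci_margin = t * s_yx * math.sqrt(core)
pi_margin = t * s_yx * math.sqrt(1 + core)

confidence_interval = (y_hat - ci_margin, y_hat + ci_margin)
prediction_interval = (y_hat - pi_margin, y_hat + pi_margin)
print([round(v, 4) for v in confidence_interval])
print([round(v, 4) for v in prediction_interval])
```

The extra 1 under the radical is what makes the interval for one individual wider than the interval for the mean.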

13-31

Transforming Data

Regression analysis and the correlation coefficient require the relationship between the variables to be linear

But what if data is not linear?

If data is not linear, we can rescale one or both of the variables so the new relationship is linear

Common transformations include

Computing the log to the base 10 of y, Log(y)

Taking the square root

Taking the reciprocal

Squaring one or both variables

Caution: when you are interpreting a correlation coefficient or regression equation, remember that the underlying relationship could be nonlinear

13-32

For example, instead of using the actual values of the dependent variable y, we would create a new dependent variable by transforming it.

Transforming Data Example

GroceryLand Supermarkets is a regional grocery chain located in the midwestern United States. The director of marketing wishes to study the effect of price on weekly sales of their two-liter private brand diet cola. The objectives of the study are

To determine whether there is a relationship between selling price and weekly sales. Is this relationship direct or indirect? Is it strong or weak?

To determine the effect of price increases or decreases on sales. Can we effectively forecast sales based on the price?

To begin, the company decides to price the two-liter diet cola from $0.50 to $2.00. To collect the data, a random sample of 20 stores is taken and then each store is randomly assigned a selling price.

13-33

There is a strong relationship between the two variables. The coefficient of determination is 88.9%. So 88.9% of the variation in Sales is accounted for by the variation in Price. But, a careful analysis of the scatter diagram reveals that the relationship may not be linear. That means we need to transform the data.

Transforming Data Example (2 of 3)

13-34

A strong, inverse relationship!

Transforming Data Example (3 of 3)

The director of marketing decides to transform the dependent variable, Sales, by taking the logarithm to the base 10 of each sales value. Note the new variable, Log-Sales, in the following analysis as it is used as the dependent variable with Price as the independent variable.

13-35

Clearly, as price increases, sales decrease. This relationship will be very helpful to GroceryLand when making pricing decisions for this product.
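The effect of the log transformation can be sketched with hypothetical numbers. The prices and sales below are invented so that sales fall off exponentially with price; they are not the GroceryLand data. The sketch compares the correlation of price with raw sales against the correlation of price with the base-10 log of sales.

```python
import math

def correlation(xs, ys):
    """Sample correlation coefficient r for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical prices and weekly sales that decline exponentially with price
prices = [0.50, 1.00, 1.50, 2.00]
sales = [10 ** (3 - p) for p in prices]

# Transform the dependent variable: log to the base 10 of each sales value
log_sales = [math.log10(s) for s in sales]

r_raw = correlation(prices, sales)
r_log = correlation(prices, log_sales)
print(round(r_raw, 3), round(r_log, 3))
```

In this constructed case the transformed relationship is exactly linear, so the correlation with Log-Sales is stronger than the correlation with raw sales. Real data would show a less dramatic improvement.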

Chapter 13 Practice Problems

13-36

Question 3

13-37

Bi-lo Appliance Super-Store has outlets in several large metropolitan areas in New England. The general sales manager aired a commercial for a digital camera on selected local TV stations prior to a sale starting on Saturday and ending Sunday. She obtained the information for Saturday–Sunday digital camera sales at the various outlets and paired it with the number of times the advertisement was shown on the local TV stations. The purpose is to find whether there is any relationship between the number of times the advertisement was aired and digital camera sales. The pairings are:

What is the dependent variable?

Draw a scatter diagram.

Determine the correlation coefficient.

Interpret these statistical measures.

LO13-2

Question 11

13-38

The Airline Passenger Association studied the relationship between the number of passengers on a particular flight and the cost of the flight. It seems logical that more passengers on the flight will result in more weight and more luggage, which in turn will result in higher fuel costs. For a sample of 15 flights, the correlation between the number of passengers and total fuel cost was .667. Is it reasonable to conclude that there is positive association in the population between the two variables? Use the .01 significance level.

LO13-2

Question 17

13-39

Bloomberg Intelligence listed 50 companies to watch in 2018 (www.bloomberg.com/features/companies-to-watch-2018). Twelve of the companies are listed here with their total assets and 12-month sales.

Let sales be the dependent variable and total assets the independent variable.

Draw a scatter diagram.

Compute the correlation coefficient.

Determine the regression equation.

For a company with $100 billion in assets, predict the 12-month sales.

LO13-3

Question 23

13-40

Refer to Exercise 17. The regression equation is ŷ = 1.85 + .08x, the sample size is 12, and the standard error of the slope is 0.03. Use the .05 significance level. Can we conclude that the slope of the regression line is different from zero?

LO13-4

Question 27

13-41

Bradford Electric Illuminating Company is studying the relationship between kilowatt-hours (thousands) used and the number of rooms in a private single-family residence. A random sample of 10 homes yielded the following:

Determine the standard error of estimate and the coefficient of determination. Interpret the coefficient of determination.

LO13-5

Question 33

13-42

Determine the .95 confidence interval, in thousands of kilowatt-hours, for the mean of all six-room homes.

Determine the .95 prediction interval, in thousands of kilowatt-hours, for a particular six-room home.

LO13-6

Question 35

13-43

Using the following data with x as the independent variable and y as the dependent variable, answer the items.

Create a scatter diagram and describe the relationship between x and y.

Compute the correlation coefficient.

Transform the x variable by squaring each value, x².

Create a scatter diagram and describe the relationship between x² and y.

Compute the correlation coefficient between x² and y.

Compare the relationships between x and y, and x² and y.

Interpret your results.

LO13-7

Nonparametric Methods: Nominal Level Hypothesis Tests

Chapter 15

15-1

This chapter considers tests of hypothesis for nominal level data. Nonparametric hypothesis tests do not require the assumption that the population be normal. First we consider two mutually exclusive groups, then several mutually exclusive groups. We will use the chi-square distribution as a test statistic in this chapter too.

Learning Objectives

LO15-1 Test a hypothesis about a population proportion

LO15-2 Test a hypothesis about two population proportions

LO15-3 Test a hypothesis comparing an observed set of frequencies to an expected frequency distribution

LO15-4 Explain the limitations of using the chi-square statistic in goodness-of-fit tests

LO15-5 Test a hypothesis that an observed frequency distribution is normally distributed

LO15-6 Perform a chi-square test for independence on a contingency table

15-2

Test a Hypothesis of a Population Proportion

Recall that a proportion is the ratio of the number of successes to the number of observations

Examples

Historically, GM reports that 70% of leased vehicles are returned with less than 36,000 miles; a recent sample of 200 found that 158 had less than 36,000 miles. Has the proportion increased?

Able Moving and Storage advises its clients that their household goods will be delivered in 3 to 5 days for a long-distance move. Records show this is true 90% of the time. A recent sample of 200 moves found that they were successful 190 times. Has the success rate increased?

15-3

Here are examples of potential hypothesis testing situations. To test, first take a random sample from the population. We will assume the binomial assumptions discussed in chapter 6 are met.

Hypothesis Test of a Population Proportion

15-4

Population Proportion Test Example

A Republican governor of a western state is thinking about running for reelection. Historically, to be reelected, a Republican needed at least 80% of the vote in the northern part of the state. The governor hires a polling organization to survey the voters there. The polling organization will poll 2,000 voters. Use a statistical hypothesis-testing procedure to assess the governor’s chances of winning reelection.

15-5

This situation regarding the governor’s reelection meets the binomial conditions. This is a one-tailed (left-tailed) test, so the region of rejection is in the left tail. The level of significance is the likelihood of rejecting the null hypothesis when it is true. The critical value is found by referring to Appendix B.3. Go to the column indicating a .05 significance level and then down that column to the row with infinite degrees of freedom; the value is 1.645. Therefore, the critical value is −1.645. The survey revealed that 1,550 planned to vote for the incumbent governor; the sample proportion is .775 (found by 1550 ÷ 2000). The computed value of z is −2.80, which is less than the critical value, so we reject the null hypothesis. The evidence does not support the claim that the incumbent governor will return to the governor’s mansion for another 4 years.

Population Proportion Test Example Continued

15-6

Step 5: Take sample, make a decision, reject the null hypothesis

Step 6: Interpret; the governor does not have the votes to win
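The governor example can be checked with a brief sketch. It assumes the usual z statistic for a single proportion, z = (p − π) / √(π(1 − π) / n), where π is the hypothesized proportion (.80). The function name is hypothetical; the counts and the critical value are those given in the notes.

```python
import math

def z_one_proportion(successes, n, pi_0):
    """z statistic for testing a sample proportion against a hypothesized value."""
    p = successes / n
    return (p - pi_0) / math.sqrt(pi_0 * (1 - pi_0) / n)

# 1,550 of 2,000 polled voters favor the governor; hypothesized proportion is .80
z = z_one_proportion(1550, 2000, 0.80)
critical_z = -1.645           # left-tailed test at the .05 level
reject_h0 = z < critical_z

print(round(z, 2), reject_h0)
```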

Two-Sample Tests about Proportions

We can also test whether two samples came from populations with an equal proportion of successes

Examples

The vice president of human resources wishes to know whether there is a difference in the proportion of hourly employees who miss more than 5 days of work per year at the Atlanta and the Houston plants

A consultant to the airline industry is investigating the fear of flying among adults. The company wishes to know if there is a difference between the proportion of men versus women who are fearful of flying

15-7

In the above cases, each sampled item or individual can be classified as a “success” or “failure.” In the next example, we will assume that each sample is large enough that the normal distribution will serve as a good approximation of the binomial distribution, and we will use z as our test statistic.

The Two-Sample Test of Proportions

To test whether two samples came from populations with an equal proportion of successes

First pool the two sample proportions using the following formula

Then we compute the value of the test statistic from the following formula

15-8

In the formulas, x1 is the number possessing the trait in the first sample, x2 is the number possessing the trait in the second sample, n1 is the number of observations in the first sample, and n2 is the number of observations in the second sample; pc is the pooled proportion possessing the trait in the combined samples and is called the pooled estimate of the population proportion, p1 is the proportion in the first sample possessing the trait and p2 is the proportion in the second sample possessing the trait.

Two-Sample Tests about Proportions Example

Manelli Perfume Company recently developed a new fragrance that it plans to market under the name Heavenly. Market studies indicate that Heavenly has very good market potential. The sales department at Manelli is interested in whether there is a difference in the proportion of working women and stay-at-home women who would purchase Heavenly.

15-9

The null hypothesis in this example is that there is no difference in the proportion of working women and stay-at-home women who prefer Heavenly. The alternate is that the two population proportions are not equal. The two samples are sufficiently large so we use the standard normal distribution, z, as the test statistic.

To find the critical value, go to Appendix B.5. In the table headings find the row labeled “Level of Significance for Two-Tailed Test” and select the column with an alpha of .05. Go to the bottom row with infinite degrees of freedom. The z value is 1.96, so the critical values for this test are −1.96 and 1.96. This test is continued on the next slide.

Two-Sample Tests about Proportions Example Continued

Manelli Perfume Company samples 100 working women and 200 stay-at-home women to find out if the population proportions are equal. Each of the sampled women will be asked to smell Heavenly and indicate whether she likes it well enough to purchase a bottle.

Step 5: Take sample, make decision, reject H0

Step 6: Interpret; working women and stay-at-home women will purchase Heavenly at different rates or proportions.

15-10

The random sample of 100 working women revealed that 81 liked the fragrance well enough to purchase it; the random sample of 200 stay-at-home women revealed that 138 liked the fragrance well enough to purchase it. The research question now is whether the difference of .12 in the two sample proportions is due to chance or reflects a difference in the two populations. Pool the two sample proportions and then calculate the test statistic; z = 2.207, which falls in the area of rejection, to the right of 1.960. So reject the null hypothesis that the proportion of working women who will purchase Heavenly is equal to the proportion of stay-at-home women who will purchase it.
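The Heavenly calculation can be sketched as follows, assuming the standard pooled two-proportion formulas: p_c = (x1 + x2) / (n1 + n2) and z = (p1 − p2) / √(p_c(1 − p_c)/n1 + p_c(1 − p_c)/n2). The function name is hypothetical; the counts are those reported in the notes.

```python
import math

def z_two_proportions(x1, n1, x2, n2):
    """Return (z, pooled proportion) for a two-sample test of proportions."""
    p1, p2 = x1 / n1, x2 / n2
    # Pooled estimate of the population proportion
    pc = (x1 + x2) / (n1 + n2)
    std_error = math.sqrt(pc * (1 - pc) / n1 + pc * (1 - pc) / n2)
    return (p1 - p2) / std_error, pc

# 81 of 100 working women and 138 of 200 stay-at-home women would purchase
z, pc = z_two_proportions(81, 100, 138, 200)
reject_h0 = abs(z) > 1.96     # two-tailed test at the .05 level

print(round(pc, 2), round(z, 3), reject_h0)
```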

Goodness-of-Fit Test

We can compare an observed frequency distribution to an expected frequency distribution

Example

An insurance company wishes to compare the historical distribution of policy types with a sample of 2,000 current policies

Does the current distribution of policies “fit” this historical distribution, or has it changed?

15-11

A goodness-of-fit test is one of the most commonly used statistical tests. The table shows the historical relative frequency distribution of policy types; these are the expected frequencies.

Goodness-of-Fit Test Example

Bubba’s Fish and Pasta is a chain of restaurants along the Gulf Coast of Florida. Bubba is considering adding steak to the menu. Before doing so, he hires a research firm to conduct a survey to find out what patrons’ favorite meal is when eating out. Here are the results of the survey of 120 adults.

Is the difference in the number of times each entrée is selected due to chance, or should we conclude that the entrées are not equally preferred?

Is it reasonable to conclude there is no preference among the four entrées?

15-12

The purpose of this test is to compare an observed frequency distribution to an expected frequency distribution. The scale of measurement is nominal; each of the categories (chicken, fish, meat, and pasta) is also referred to as a cell. If the entrées are equally popular, we would expect 30 adults to like each entrée (120 sampled ÷ 4 categories = 30); this is the expected frequency. To investigate, we use the six-step hypothesis-testing procedure.

Goodness-of-Fit Test Example Continued

15-13

Chi-Square Characteristics

The characteristics of the chi-square distribution are

The value of chi-square is never negative

There is a family of chi-square distributions

The chi-square distribution is positively skewed

As the degrees of freedom increase, the distribution approaches a normal distribution

15-14

The chi-square distribution has many applications in statistics. Notice how each time the degrees of freedom change, a new distribution is formed; see the chi-square distributions for selected degrees of freedom in the chart. We can use MegaStat to compute the goodness-of-fit test; see the Software Commands in Appendix C.

Goodness-of-Fit Test Example Concluded

Step 5: Take sample, make decision, do not reject H0, 2.200 is not greater than 7.815

Step 6: Interpret; the data do not suggest the preferences among the four entrées are different.

15-15

In the formula, fo is the observed frequency and fe is the expected frequency. List the observed and expected frequencies in columns; then, for each category, subtract the expected frequency from the observed frequency, square the difference, and divide the squared difference by the expected frequency. The sum of these quotients is the chi-square statistic. Because 2.200 is less than the critical value of 7.815, we do not reject the null hypothesis. We conclude the differences between the observed values and the expected values are due to chance.
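The formula itself did not survive in the slide text. Based on the verbal description in the note, it can be written as follows, where the symbols f_o, f_e, and k (the number of categories) follow the note's wording and the typesetting is a reconstruction:

```latex
\chi^2 = \sum_{i=1}^{k} \frac{(f_{o,i} - f_{e,i})^2}{f_{e,i}},
\qquad \text{with } k - 1 \text{ degrees of freedom}
```

For the entrée example, k = 4, so the statistic is compared with the critical value for 3 degrees of freedom, 7.815.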

Hypothesis Test of Unequal Expected Frequencies Example

The American Hospital Administration Association reports the number of times senior citizens are admitted to a hospital during a one-year period; 40% are not admitted, 30% are admitted once, 20% are admitted twice, and 10% are admitted 3 or more times.

Then, a survey of 150 residents of Bartow Estates, a community devoted to active seniors located in central Florida, revealed 55 residents were not admitted, 50 were admitted once, 32 were admitted twice, and the rest in the survey were admitted three or more times. Can we conclude the survey at Bartow Estates is consistent with the information reported by the AHAA?

15-16

The chi-square test can also be used if the expected frequencies are not equal. This example gives a practical use of the chi-square goodness-of-fit test: finding whether a local experience differs from the national experience. We can use the AHAA information to compute expected frequencies for the Bartow Estates residents. If there is no difference between the national experience and the Bartow study, then the expectation is that 40% of the Bartow residents would not have been admitted: (.40)(150) = 60, and so on. The observed and expected frequencies for Bartow residents are given in the table. The six-step hypothesis test follows on the next slide.

Hypothesis Test of Unequal Expected Frequencies Example Continued

15-17

Use Appendix B.7 and the .05 significance level to find the critical value for the decision rule. The number of degrees of freedom is 3, found by k − 1 and k = 4. The critical value is 7.815. The decision rule is to reject the null hypothesis if the chi-square statistic > 7.815. The chi-square statistic is 1.3723, so we fail to reject the null hypothesis. We conclude there is no difference between the local and the national experience for hospital admissions.
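The Bartow Estates test can be reproduced with a short Python sketch. The helper name `chi_square_statistic` is hypothetical; the percentages and counts come from the example, with the last observed count (13) inferred as 150 minus the other three. The critical value 7.815 is taken from the note rather than computed.

```python
def chi_square_statistic(observed, expected):
    """Sum of (fo - fe)^2 / fe over all categories (hypothetical helper)."""
    return sum((fo - fe) ** 2 / fe for fo, fe in zip(observed, expected))


# National shares reported by the AHAA: 0, 1, 2, and 3+ admissions.
national_shares = [0.40, 0.30, 0.20, 0.10]

# Bartow Estates survey; the last count is 150 - (55 + 50 + 32) = 13.
observed = [55, 50, 32, 13]
n = sum(observed)

expected = [share * n for share in national_shares]   # about 60, 45, 30, 15
chi_sq = chi_square_statistic(observed, expected)

critical_value = 7.815    # .05 level, 3 degrees of freedom (from Appendix B.7)
reject_h0 = chi_sq > critical_value
print(round(chi_sq, 3), reject_h0)
```

The result is approximately 1.372, which agrees with the 1.3723 in the note up to rounding, so the null hypothesis is not rejected.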

Hypothesis Test of Unequal Expected Frequencies Example Concluded

15-18

Step 5: Calculate the test statistic, make decision, do not reject H0; 1.3723 < 7.815

Step 6: Interpret; there is no evidence of a difference between the local and the national experience for hospital admissions

Limitations of Chi-Square

If there is an unusually small frequency in a cell, chi-square might result in an erroneous conclusion

A very small number in the denominator can make the quotient quite large

For only two cells, the fe in each cell should be at least 5

For more than two cells, chi-square should not be used if more than 20% of the cells have an expected frequency (fe) that is less than 5
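The two rules of thumb above can be checked mechanically before running a test. The following is a minimal sketch; the function name `chi_square_is_appropriate` is hypothetical, and the thresholds are the ones stated on the slide.

```python
def chi_square_is_appropriate(expected):
    """Apply the slide's rules of thumb to a list of expected frequencies.

    Two cells: every fe must be at least 5.
    More than two cells: no more than 20% of cells may have fe below 5.
    """
    small_cells = sum(1 for fe in expected if fe < 5)
    if len(expected) == 2:
        return small_cells == 0
    return small_cells / len(expected) <= 0.20


print(chi_square_is_appropriate([60, 45, 30, 15]))       # all cells large enough
print(chi_square_is_appropriate([66, 20, 3, 2, 1, 8]))   # half the cells are below 5
```

When the check fails, the slides suggest combining adjacent categories, where that is logical, and checking again.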

15-19

When there is a small frequency in a cell (category), it results in too much weight being given to those categories. When possible, combine categories to resolve the problem. In this example of levels of management, the three vice president cells were combined to create an expected frequency of 7.

Limitations of Chi-Square Continued

15-20

The issue can be resolved by combining categories if it is logical to do so. In this example, we combine the three vice president categories, which satisfies the 20% policy.

Goodness-of-Fit Test Continued

A goodness-of-fit test can be used to determine whether a sample of observations is from a normal population

Calculate the mean and standard deviation of the sample data

Group the data into a frequency distribution

Convert the class limits to z values and find the standard normal probability distribution for each class

For each class, find the expected normally distributed frequency by multiplying the class's standard normal probability by the total number of observations

Calculate the chi-square goodness-of-fit statistic based on the observed and expected class frequencies

Find the expected frequency in each cell by determining the product of the probability of finding a value in each cell by the total number of observations

If we use the information on the sample mean and the sample standard deviation from the sample data, the degrees of freedom are k − 3

15-21

Here are the steps to perform a goodness-of-fit test to determine whether a sample of observations is from a normal population.

Hypothesis Test that a Distribution is Normal Example

We investigate whether the profit data of Applewood Auto Group follows the normal distribution. In chapter 3, we found the mean profit was $1,843.17 and the standard deviation was $643.63.

15-22

Once the probability for each class has been calculated, put it in the table in the column for area. The areas should add up to 1.0000. Multiplying each area by the number of observations gives the expected frequency for that class.
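The area column can be sketched with Python's standard library. The mean and standard deviation below are the Applewood values quoted in the example; the class boundaries and the sample size are assumptions made for illustration, since the frequency table is not reproduced in the text. The function name `normal_class_probabilities` is hypothetical.

```python
from statistics import NormalDist


def normal_class_probabilities(boundaries, mean, std_dev):
    """Area under the normal curve for each class (hypothetical helper).

    `boundaries` are the interior cut points; the first class is open
    below and the last class is open above, so the areas sum to 1.
    """
    dist = NormalDist(mean, std_dev)
    cdf_values = [0.0] + [dist.cdf(b) for b in boundaries] + [1.0]
    return [upper - lower for lower, upper in zip(cdf_values, cdf_values[1:])]


# Assumed class boundaries in dollars (not taken from the text).
boundaries = [200, 600, 1000, 1400, 1800, 2200, 2600, 3000, 3400]
areas = normal_class_probabilities(boundaries, mean=1843.17, std_dev=643.63)

sample_size = 180  # hypothetical number of observations
expected = [area * sample_size for area in areas]

print(round(sum(areas), 4))   # the areas add up to 1.0
```

Classes whose expected frequency falls below 5 would then be combined with a neighbor before computing chi-square.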

Hypothesis Test that a Distribution is Normal Example Continued

Now, combine the classes that have fe < 5.

Once that is done, we can calculate the chi-square statistic.

15-23

Next, we’ll conduct a hypothesis test.

Hypothesis Test that a Distribution is Normal Example Concluded

15-24

Now, we’ll conduct a hypothesis test. When we estimate population parameters with sample data, we lose one degree of freedom for each estimate. So the number of degrees of freedom is k − 2 − 1 = 8 − 2 − 1 = 5, since we used the sample mean and the sample standard deviation. From Appendix B.7 using the .05 significance level, the critical value is 11.070. Using formula 15-4, we calculate the chi-square test statistic to be 5.220. Do not reject H0; it appears that the distribution of profits is normal.

Contingency Table

We can use a contingency table to test whether two traits or characteristics are related

The expected frequency for each cell is (row total)(column total) ÷ (grand total)

The degrees of freedom = (Rows – 1)(Columns – 1)

Example

Ford Motor Company operates the Dearborn plant with 3 shifts per day, 5 days a week. Vehicles are classified as to quality level (acceptable, unacceptable) and shift (day, afternoon, night). Is there a difference in the quality level on the three shifts?

15-25

Here is an example where we are interested in testing whether two nominal-scaled variables are related. Each observation is classified according to two traits.

Contingency Table Example

Rainbow Chemical, Inc. employs hourly and salaried employees. The vice president of human resources surveyed 380 employees about their satisfaction levels with the current health care benefits program. The employees were then classified according to pay type, salary or hourly. Is it reasonable to conclude that pay type and level of satisfaction with the health care benefits are related?

15-26

The usual hypothesis testing procedure is used. The HR manager requested the .05 significance level. The level of measurement for pay type is nominal scale. The satisfaction level with health benefits is actually ordinal scale, but we use it as a nominal scale variable. Each sampled employee is classified by two criteria, level of satisfaction with benefits and pay type, and the information is tabulated in a contingency table. The hypothesis test is continued on the next slide.

Contingency Table Example Continued

Step 4: Formulate the decision rule, reject H0 if chi-square > 5.991

Step 5: Make decision, chi-square is 2.506, do not reject H0

Step 6: Interpret; the sample data do not provide evidence that pay type and satisfaction level with health care benefits are related.

15-27

Since there are two rows and three columns in the contingency table, the degrees of freedom are 2; df = (number of rows − 1)(number of columns − 1) = (2 − 1)(3 − 1) = 2. Refer to Appendix B.7, move down the df column in the left margin to the row with 2 df. Move across this row to the column headed .05. The value is 5.991. Use formula 15-5 to calculate the expected frequencies for the table. Then use formula 15-4 to calculate chi-square. It is 2.506; therefore we do not reject H0. It appears that pay type and satisfaction with health care benefits are not related.
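The contingency-table procedure can be sketched as follows. The cell counts are not given in the text above; the ones used here are assumed values, chosen to be consistent with the 380 employees surveyed and with the reported statistic of 2.506. The function names are hypothetical, and the critical value 5.991 is taken from the note.

```python
def expected_frequencies(table):
    """Expected count per cell: (row total)(column total) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


def contingency_chi_square(table):
    """Chi-square statistic summed over every cell of the table."""
    expected = expected_frequencies(table)
    return sum(
        (fo - fe) ** 2 / fe
        for obs_row, exp_row in zip(table, expected)
        for fo, fe in zip(obs_row, exp_row)
    )


# Assumed counts (satisfied, neutral, dissatisfied); not shown in the text.
observed = [
    [30, 17, 8],       # salaried employees
    [140, 127, 58],    # hourly employees
]

chi_sq = contingency_chi_square(observed)
df = (len(observed) - 1) * (len(observed[0]) - 1)
reject_h0 = chi_sq > 5.991     # .05 level, 2 degrees of freedom
print(round(chi_sq, 3), df, reject_h0)
```

With these counts the statistic is about 2.506 on 2 degrees of freedom, so the null hypothesis of no relationship is not rejected.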

Chapter 15 Practice Problems

15-28

Question 3

15-29

The U.S. Department of Transportation estimates that 10% of Americans carpool. Does that imply that 10% of cars will have two or more occupants? A sample of 300 cars traveling southbound on the New Jersey Turnpike yesterday revealed that 63 had two or more occupants. At the .01 significance level, can we conclude that 10% of cars traveling on the New Jersey Turnpike have two or more occupants?

LO15-1

Question 7

15-30

The null and alternate hypotheses are:

H0: π1 ≤ π2

H1: π1 > π2

A sample of 100 observations from the first population indicated that x1 is 70. A sample of 150 observations from the second population revealed x2 to be 90. Use the .05 significance level to test the hypothesis.

State the decision rule.

Compute the pooled proportion.

Compute the value of the test statistic.

What is your decision regarding the null hypothesis?

LO15-2

Question 19

15-31

A group of department store buyers viewed a new line of dresses and gave their opinions of them. The results were:

Because the largest number (47) indicated the new line is outstanding, the head designer thinks that this is a mandate to go into mass production of the dresses. The head sweeper (who somehow became involved in this) believes that there is not a clear mandate and claims that the opinions are evenly distributed among the six categories. He further states that the slight differences among the various counts are probably due to chance. Test the null hypothesis that there is no significant difference among the opinions of the buyers at the .01 level of significance.

LO15-3

Question 23

15-32

From experience, the bank credit card department of Carolina Bank knows that 5% of its card holders have had some high school, 15% have completed high school, 25% have had some college, and 55% have completed college. Of the 500 card holders whose cards have been called in for failure to pay their charges this month, 50 had some high school, 100 had completed high school, 190 had some college, and 160 had completed college. Can we conclude that the distribution of card holders who do not pay their charges is different from all others? Use the .01 significance level.

LO15-3

Question 25

15-33

The IRS is interested in the number of individual tax forms prepared by small accounting firms. The IRS randomly sampled 50 public accounting firms with 10 or fewer employees in the Dallas–Fort Worth area. The following frequency table reports the results of the study. Assume the sample mean is 44.8 clients and the sample standard deviation is 9.37 clients. Is it reasonable to conclude that the sample data are from a population that follows a normal probability distribution? Use the .05 significance level.

LO15-5

Question 29

15-34

The quality control department at Food Town Inc., a grocery chain in upstate New York, conducts a monthly check on the comparison of scanned prices to posted prices. The following chart summarizes the results of a sample of 500 items last month. Company management would like to know whether there is any relationship between error rates on regularly priced items and specially priced items. Use the .01 significance level.

LO15-6
