University of Applied Sciences, Frankfurt
Department of Computer Science & Engineering
Md. Kabir Hosen
Idea: Form a model equation using multiple regression analysis on the observed data and the predicted values, and compute residual errors, scatter plots, descriptive statistics (mean, median, ratio) and correlation in R.
Questionnaire:
1. Respondent Name: ………………………………………………….
2. What is your age group?
   i.   Under 18
   ii.  18-26
   iii. 27-35
   iv.  36 and above
3. Living place: ………………………………………………………….
4. Gender
   i.   Male
   ii.  Female
   iii. Others ………………………………………………………
5. Occupation
   i.   Student
   ii.  Employee
   iii. Others
6. Your favorite fast food?
   i.   KFC
   ii.  McDonald's
   iii. Pizza Hut
   iv.  Burger King
   v.   Others ………………………………………………………
7. Price is
   i.   Cheap
   ii.  Average
   iii. Good
   iv.  Outstanding
8. Quality of service
   i.   Good
   ii.  Very good
   iii. Excellent
   iv.  Others ……………………………………………………..
9. Taste of food
   i.   Good
   ii.  Very good
   iii. Excellent
   iv.  Others ………………………………………………………
10. How many times do you go to your fast food restaurant per month?
   i.   1-2
   ii.  3-5
   iii. 6-10
   iv.  More than 10
11. For which of these reasons do you go to your restaurant?
   i.   Special occasion (birthday, holiday)
   ii.  Regular meal
   iii. Business lunch
   iv.  Just for the food
12. Overall satisfaction
   i.   Good
   ii.  Very good
   iii. Excellent
   iv.  Others ………………………………………………….
Response Variable:
The response variable is "Favorite Fast Food".

Prediction:
We are going to predict "Favorite Fast Food" from the customer feedback data, e.g. "Age", "Gender", "Occupation", "Price", "Quality of Service", "Taste of Food", "Monthly Restaurant Visit", "Reasons for Restaurant Visit" and "Satisfaction".
Aims of a Successful Guest Survey:
The survey will undertake to:
1. Measure overall customer satisfaction.
2. Learn about the customer.
3. Identify buying habits and dining patterns.
4. Find out why customers visit the restaurant.
5. Learn what influences guest purchase decisions.
6. Learn what guests believe you do well and not so well.
7. Discover what we can do to improve operations.
8. Identify processes for change that will improve customer satisfaction.
9. Learn how to increase customer loyalty.
10. Finally, determine which fast food we are going to launch.
“Favorite Fast Food Prediction with Live Data”
Introduction:
A regression with two or more explanatory variables is called a multiple regression. Rather than modeling the mean response as a straight line, as in simple regression, it is now modeled as a function of several explanatory variables. The function lm can be used to perform multiple linear regression in R, and much of the syntax is the same as that used for fitting simple linear regression models. To perform multiple linear regression with p explanatory variables, use the command:

> lm(response ~ explanatory_1 + explanatory_2 + … + explanatory_p)
Here the terms response and explanatory_i
in the function should be replaced by the names of the response and
explanatory variables, respectively, used in the analysis.
Ex. Data was collected from 50 guests recently surveyed in Frankfurt. It consisted of the variables "Age", "Gender", "Occupation", "Fav_FastFood", "Price", "Q_Service", "Taste_Food", "Monthly_Visit", "Reasons_Visit" and "Satisfaction".
The following program reads in the data.

> data1 <- read.csv(file.choose(), header = T)    # Read data from the guest feedback CSV file
> data1
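As a quick check before modelling (a small addition, not in the original write-up), the imported data frame can be inspected with the usual base-R functions:

> head(data1)       # first few rows of the guest feedback data
> str(data1)        # variable types (the answers are coded as numbers)
> summary(data1)    # descriptive statistics: min, quartiles, mean and median per column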
Suppose we are only interested in working with a subset of the variables (e.g. "Fav_FastFood", "Price", and "Age"). It is possible (but not necessary) to construct a new data frame consisting solely of these values using the commands:

> myvars <- c('Fav_FastFood', 'Age', 'Price')
> Guestdata <- data1[myvars]
> names(Guestdata)
[1] "Fav_FastFood" "Age"          "Price"
> Guestdata
  Fav_FastFood Age Price
1            3   3     2
2            2   2     2
3            2   2     2
4            2   3     2
5            1   2     2
6            3   2     2
7            4   2     2
…                          (rows 8 to 50 omitted)
Before fitting our regression model we want to investigate how the variables are related to one another. We can do this graphically by constructing scatter plots of all pair-wise combinations of variables in the data frame. Since Guestdata was created above as a data frame, this can be done by typing:

> plot(Guestdata)    # pairwise scatter plots of Fav_FastFood, Age and Price
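Equivalently (a small sketch added here, not part of the original post), pairs() produces the same scatter-plot matrix, and cor() gives the pairwise correlations mentioned in the idea section:

> pairs(Guestdata)   # scatter plots of all pair-wise combinations of the three variables
> cor(Guestdata)     # pairwise correlation matrix for Fav_FastFood, Age and Price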
To fit a multiple linear regression model with "Fav_FastFood" as the response / dependent variable and "Age" and "Price" as the explanatory / independent variables, use the commands:

> attach(data1)      # make the columns of data1 accessible by name in the formulas below
> Guestdata <- lm(Fav_FastFood ~ Age + Price)
> Guestdata
Call:
lm(formula = Fav_FastFood ~ Age + Price)

Coefficients:
(Intercept)          Age        Price
     4.5163      -0.9469       0.2334
This output indicates that the fitted value is given by ŷ = 4.5163 − 0.9469·x₁ + 0.2334·x₂, where x₁ is Age and x₂ is Price.
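As a hedged illustration (not part of the original analysis), the same fitted value can be computed by hand from the stored coefficients, here for a guest with Age level 2 and Price level 2:

> b <- coef(Guestdata)                              # named vector: (Intercept), Age, Price
> b["(Intercept)"] + b["Age"] * 2 + b["Price"] * 2
# 4.5163 - 0.9469*2 + 0.2334*2 ≈ 3.0893, matching the predict() result shown later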
Inference in the multiple regression setting is typically performed in a number of steps. We begin by testing whether the explanatory variables collectively have an effect on the response variable, i.e.
H0: β1 = β2 = … = βp = 0
If we can reject this hypothesis, we continue by testing whether the individual regression coefficients are significant while controlling for the other variables in the model.
We can access the results of each test by typing:

> Guestdata <- lm(Fav_FastFood ~ Age + Price)    # reduced model
> summary(Guestdata)
Call:
lm(formula = Fav_FastFood ~ Age + Price)

Residuals:
    Min      1Q  Median      3Q     Max
-2.3226 -1.0892 -0.1158  1.4459  2.0910

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.5163     0.9586   4.711 2.22e-05 ***
Age          -0.9469     0.3353  -2.824  0.00694 **
Price         0.2334     0.2822   0.827  0.41237
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.373 on 47 degrees of freedom
Multiple R-squared:  0.1478,    Adjusted R-squared:  0.1115
F-statistic: 4.075 on 2 and 47 DF,  p-value: 0.02333
The output shows that F = 4.075 (p-value = 0.02333), so we reject the null hypothesis that all coefficients are zero: the explanatory variables collectively have an effect on Fav_FastFood. Looking at the individual coefficients, Age is significant (p = 0.00694) while Price is not (p = 0.41237), so Price appears to have no significant effect on the response variable. In addition, the output shows that R² = 0.1478 and adjusted R² = 0.1115.
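If desired, the overall F statistic and its p-value can also be extracted programmatically from the summary object (a small sketch, not in the original post):

> fs <- summary(Guestdata)$fstatistic            # F value, numerator df, denominator df
> pf(fs[1], fs[2], fs[3], lower.tail = FALSE)    # p-value of the overall F-test, ≈ 0.02333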
Testing a subset of variables using a partial F-test
Sometimes we are interested in simultaneously testing whether a certain subset of the coefficients are equal to 0 (e.g. β3 = β4 = 0). We can do this using a partial F-test. This test involves comparing the SSE from a reduced model (excluding the parameters we hypothesize are equal to zero) with the SSE from the full model (including all of the parameters).
In R we can perform partial F-tests by
fitting both the reduced and full models separately and thereafter comparing
them using the anova function. 
Ex. Suppose we include the variables "Age", "Price", "Gender", "Occupation", "Q_Service", "Taste_Food", "Monthly_Visit", "Reasons_Visit" & "Satisfaction" in our model and are interested in testing whether "Gender", "Occupation", "Q_Service", "Taste_Food", "Monthly_Visit", "Reasons_Visit" & "Satisfaction" are significant after taking "Age" and "Price" into consideration.
# Reduced model
> reduced <- lm(Fav_FastFood ~ Price + Age)
> reduced
Call:
lm(formula = Fav_FastFood ~ Price + Age)

Coefficients:
(Intercept)        Price          Age
     4.5163       0.2334      -0.9469
> summary(reduced)

Call:
lm(formula = Fav_FastFood ~ Age + Price)

Residuals:
    Min      1Q  Median      3Q     Max
-2.3226 -1.0892 -0.1158  1.4459  2.0910

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.5163     0.9586   4.711 2.22e-05 ***
Age          -0.9469     0.3353  -2.824  0.00694 **
Price         0.2334     0.2822   0.827  0.41237
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.373 on 47 degrees of freedom
Multiple R-squared:  0.1478,    Adjusted R-squared:  0.1115
F-statistic: 4.075 on 2 and 47 DF,  p-value: 0.02333
# Full model (data1 is already attached, so all columns are accessible by name)
> full <- lm(Fav_FastFood ~ Price + Age + Gender + Occupation + Q_Service + Taste_Food + Monthly_Visit + Reasons_Visit + Satisfaction)
> full
Call:
lm(formula = Fav_FastFood ~ Price + Age + Gender + Occupation +
    Q_Service + Taste_Food + Monthly_Visit + Reasons_Visit + Satisfaction)

Coefficients:
  (Intercept)          Price            Age         Gender     Occupation      Q_Service
      5.14578        0.17980       -1.10841       -0.24159       -0.57401       -0.13015
   Taste_Food  Monthly_Visit  Reasons_Visit   Satisfaction
      0.09593       -0.25246        0.22270        0.29820
> summary(full)

Call:
lm(formula = Fav_FastFood ~ Price + Age + Gender + Occupation +
    Q_Service + Taste_Food + Monthly_Visit + Reasons_Visit + Satisfaction)

Residuals:
     Min       1Q   Median       3Q      Max
-2.31351 -1.12243 -0.06685  0.87608  2.15450

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    5.14578    2.97431   1.730  0.09133 .
Price          0.17980    0.30538   0.589  0.55933
Age           -1.10841    0.38296  -2.894  0.00613 **
Gender        -0.24159    0.41801  -0.578  0.56654
Occupation    -0.57401    1.71982  -0.334  0.74030
Q_Service     -0.13015    0.24642  -0.528  0.60029
Taste_Food     0.09593    0.29604   0.324  0.74760
Monthly_Visit -0.25246    0.36233  -0.697  0.48997
Reasons_Visit  0.22270    0.27519   0.809  0.42314
Satisfaction   0.29820    0.29773   1.002  0.32257
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.437 on 40 degrees of freedom
Multiple R-squared:  0.2054,    Adjusted R-squared:  0.02659
F-statistic: 1.149 on 9 and 40 DF,  p-value: 0.3531
# Compare the models
> anova(reduced, full)

Analysis of Variance Table

Model 1: Fav_FastFood ~ Price + Age
Model 2: Fav_FastFood ~ Price + Age + Gender + Occupation + Q_Service +
    Taste_Food + Monthly_Visit + Reasons_Visit + Satisfaction
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     47 88.562
2     40 82.577  7    5.9849 0.4142 0.8878
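As a cross-check (an addition, not in the original post), the partial F statistic in the table can be reproduced by hand from the two residual sums of squares:

> sse_reduced <- sum(residuals(reduced)^2)            # 88.562
> sse_full    <- sum(residuals(full)^2)               # 82.577
> ((sse_reduced - sse_full) / 7) / (sse_full / 40)    # (5.9849 / 7) / 2.0644 ≈ 0.4142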
The output shows the results of the partial F-test. Since F = 0.4142 (p-value = 0.8878), we cannot reject the null hypothesis (β3 = β4 = … = β9 = 0) at the 5% level of significance. It appears that the variables "Gender", "Occupation", "Q_Service", "Taste_Food", "Monthly_Visit", "Reasons_Visit" & "Satisfaction" do not contribute significant additional information about the "Favorite Fast Food" once the variables "Age" and "Price" have been taken into consideration.
Confidence and Prediction Intervals
We often use our regression models to estimate the mean response or predict future values of the response variable for certain values of the explanatory variables. The function predict() can be used to make both confidence intervals for the mean response and prediction intervals. To make confidence intervals for the mean response use the option interval="confidence". To make a prediction interval use the option interval="prediction". By default this makes 95% confidence and prediction intervals. If you instead want to make a 99% confidence or prediction interval, use the option level=0.99.
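For example (a small sketch, not shown in the original post), a 99% prediction interval for the Age = 2, Price = 2 guest used in the examples below would be obtained with:

> predict(reduced, data.frame(Age = 2, Price = 2), interval = "prediction", level = 0.99)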
Ex. Obtain a 95% confidence interval for the mean Fav_FastFood of guests whose Age level is 2 and Price level is 2.

> reduced <- lm(Fav_FastFood ~ Price + Age)
> predict(reduced, data.frame(Age = 2, Price = 2), interval = "confidence")
      fit      lwr     upr
1 3.08924 2.615599 3.56288

A 95% confidence interval is given by (2.615599, 3.56288).
Ex. Obtain a 95% prediction interval for the Fav_FastFood of a guest whose Age level is 2 and Price level is 2.

> predict(reduced, data.frame(Age = 2, Price = 2), interval = "prediction")
      fit      lwr      upr
1 3.08924 0.287413 5.891067

A 95% prediction interval is given by (0.287413, 5.891067).
Note that this is quite a bit wider than the confidence interval, indicating that the variation about the mean is fairly large.
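The idea section also mentions residual errors and descriptive statistics; a minimal sketch of how these could be obtained for the reduced model (added for completeness, using only the objects defined above):

> res <- residuals(reduced)           # residual errors of the fitted model
> mean(res); median(res)              # descriptive statistics of the residuals
> plot(fitted(reduced), res)          # residuals vs fitted values scatter plot
> abline(h = 0, lty = 2)              # dashed reference line at zero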
Conclusion:
After considering all scenarios, we formulated our multiple regression model equation and observed that only "Age" (independent variable) has a significant impact on the choice of Favorite Fast Food (response variable).
More: Contact: kabircse115@gmail.com
 