Friday, August 22, 2014

Multiple Regression Analysis in R “Favorite Fast-Food Prediction with Live Data”

 University of Applied Sciences, Frankfurt
Department of Computer Science & Engineering 
Md. Kabir Hosen
Idea: Form a model equation by multiple regression analysis of the collected survey data and compare the observed values with the predicted values. Calculate residual errors, scatter plots, descriptive statistics (mean, median, ratio) and correlations in R.

Questionnaire:
1.  Respondent name: ………………………………………………….
2.  What is your age group?
      i.    Under 18
      ii.   18-26
      iii.  27-35
      iv.   36 and over
3.  Place of residence: ………………………………………………………….
4.  Gender
      i.    Male
      ii.   Female
      iii.  Others………………………………………………………
5.  Occupation
      i.    Student
      ii.   Employee
      iii.  Others
6.  Your favorite fast food?
      i.    KFC
      ii.   McDonald's
      iii.  Pizza Hut
      iv.   Burger King
      v.    Others………………………………………………………
7.  Price is
      i.    Cheap
      ii.   Average
      iii.  Good
      iv.   Outstanding
8.  Quality of service
      i.    Good
      ii.   Very good
      iii.  Excellent
      iv.   Others……………………………………………………..
9.  Taste of food
      i.    Good
      ii.   Very good
      iii.  Excellent
      iv.   Others………………………………………………………
10. How many times per month do you go to your fast-food restaurant?
      i.    1-2
      ii.   3-5
      iii.  6-10
      iv.   More than 10
11. What is the main reason you go to your restaurant?
      i.    Special occasion (birthday, holiday)
      ii.   Regular meal
      iii.  Business lunch
      iv.   Just for the food
12. Overall satisfaction
      i.    Good
      ii.   Very good
      iii.  Excellent
      iv.   Others………………………………………………….


Response Variable:
The response variable is "Favorite Fast Food".
Prediction:
We are going to predict "Favorite Fast Food" from the customer feedback data, i.e. "Age", "Gender", "Occupation", "Price", "Quality of Service", "Taste of Food", "Monthly Restaurant Visit", "Reasons for Restaurant Visit" & "Satisfaction".
Aims of a Successful Guest Survey:
The survey will undertake to:
1. Measure overall customer satisfaction.
2. Learn about the customer.
3. Identify buying habits and dining patterns.
4. Find out why customers visit the restaurant.
5. Learn what influences guest purchase decisions.
6. Learn what guests believe we do well and not so well.
7. Discover what we can do to improve operations.
8. Identify processes for change that will improve customer satisfaction.
9. Learn how to increase customer loyalty.
10. Finally, determine which fast food we are going to launch.

  
Introduction:

A regression with two or more explanatory variables is called a multiple regression. Rather than modeling the mean response as a straight line, as in simple regression, it is now modeled as a function of several explanatory variables. The function lm can be used to perform multiple linear regression in R and much of the syntax is the same as that used for fitting simple linear regression models. To perform multiple linear regression with p explanatory variables use the command:
>lm(response ~ explanatory_1 + explanatory_2 + … + explanatory_p)

Here the terms response and explanatory_i in the function should be replaced by the names of the response and explanatory variables, respectively, used in the analysis.
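
As a minimal sketch of this syntax, the following fits a model with two explanatory variables to synthetic data. The data frame toy, its columns y, x1 and x2, and the true coefficients are made up for illustration only; they are not the survey data.

```r
# Synthetic data for illustration only (not the guest survey)
set.seed(1)
n  <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 0.3)
toy <- data.frame(y, x1, x2)

# Same pattern as lm(response ~ explanatory_1 + explanatory_2)
toy_fit <- lm(y ~ x1 + x2, data = toy)
print(coef(toy_fit))   # estimates should be close to 1, 2 and -0.5
```

Passing the data frame through the data argument avoids having to attach it first.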

Ex. Data were collected from 50 guests in the city of Frankfurt. Each record consists of the variables "Age", "Gender", "Occupation", "Fav_FastFood", "Price", "Q_Service", "Taste_Food", "Monthly_Visit", "Reasons_Visit" & "Satisfaction", with the questionnaire answers coded as numbers.

The following program reads in the data.

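> data1<-read.csv(file.choose(),header=TRUE)  # Read the guest feedback data from a CSV file (exported from Excel)
> attach(data1)                               # Make the columns available by name in the commands below
> data1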
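
Because file.choose() opens an interactive dialog, a scripted run may prefer an explicit path. The sketch below is only an illustration: it writes a tiny made-up CSV (a hypothetical subset of the columns, with invented rows) to a temporary file and reads it back the same way.

```r
# Hypothetical stand-in for the guest feedback file: three made-up rows
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Age,Gender,Fav_FastFood,Price",
             "3,1,3,2",
             "2,2,2,2",
             "2,1,2,2"), csv_path)

# Read it with an explicit path instead of file.choose()
demo <- read.csv(csv_path, header = TRUE)
str(demo)   # shows the column names and their types
```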

Suppose we are only interested in working with a subset of the variables (e.g. "Fav_FastFood", "Price" and "Age"). It is possible (but not necessary) to construct a new data frame consisting solely of these variables using the commands:

> myvars=c('Fav_FastFood','Age', 'Price')

> Guestdata=data1[myvars]
> names(Guestdata)
[1] "Fav_FastFood" "Age"          "Price"
> Guestdata
  Fav_FastFood Age Price
1            3   3     2
2            2   2     2
3            2   2     2
4            2   3     2
5            1   2     2
6            3   2     2
7            4   2     2
………………………….
………………………..up to 50 readings

Before fitting our regression model we want to investigate how the variables are related to one another. We can do this graphically by constructing scatter plots of all pair-wise combinations of variables in the data frame. This can be done by typing:

> plot(Guestdata)   # Guestdata is the three-column data frame built above
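
The scatter plots can be complemented with numbers. As a rough sketch, cor() gives the pairwise correlations and summary() the descriptive statistics (including mean and median) of each column. The data frame toy below holds randomly generated codes, not the survey answers.

```r
# Randomly generated codes 1-4, for illustration only
set.seed(42)
toy <- data.frame(
  Fav_FastFood = sample(1:4, 50, replace = TRUE),
  Age          = sample(1:4, 50, replace = TRUE),
  Price        = sample(1:4, 50, replace = TRUE)
)

cor_mat <- cor(toy)          # pairwise Pearson correlations
print(round(cor_mat, 2))
print(summary(toy))          # min, quartiles, median and mean per column
```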



To fit a multiple linear regression model with “Fav_FastFood” as the response / dependent variable and “Age” and “Price” as the explanatory / independent variables, use the command:


> fit=(lm(Fav_FastFood~Age+Price))     # store the model under its own name so the data frame is not overwritten
> fit
Call:
lm(formula = Fav_FastFood ~ Age + Price)
Coefficients:
(Intercept)          Age        Price 
     4.5163      -0.9469       0.2334 
This output indicates that the fitted value is given by ŷ = 4.5163 − 0.9469 x1 + 0.2334 x2, where x1 is Age and x2 is Price.
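
To see the fitted equation at work, the sketch below evaluates it by hand with the coefficients printed above. The helper name predict_fav is made up for this illustration.

```r
# Coefficients copied from the lm output above
b <- c(intercept = 4.5163, Age = -0.9469, Price = 0.2334)

# Hypothetical helper: fitted value for given Age and Price codes
predict_fav <- function(age, price) {
  unname(b["intercept"] + b["Age"] * age + b["Price"] * price)
}

y_hat <- predict_fav(age = 2, price = 2)
print(y_hat)   # 4.5163 - 2 * 0.9469 + 2 * 0.2334
```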

Inference in the multiple regression setting is typically performed in a number of steps. We begin by testing whether the explanatory variables collectively have an effect on the response variable, i.e.
H0: β1 = β2 = … = βp = 0

If we can reject this hypothesis, we continue by testing whether the individual regression coefficients are significant while controlling for the other variables in the model.
We can access the results of each test by typing:
> fit=(lm(Fav_FastFood~Age+Price))                # Reduced Model
> summary(fit)
Call:
lm(formula = Fav_FastFood ~ Age + Price)
Residuals:
    Min      1Q  Median      3Q     Max
-2.3226 -1.0892 -0.1158  1.4459  2.0910
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.5163     0.9586   4.711 2.22e-05 ***
Age          -0.9469     0.3353  -2.824  0.00694 **
Price         0.2334     0.2822   0.827  0.41237
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.373 on 47 degrees of freedom
Multiple R-squared:  0.1478,    Adjusted R-squared:  0.1115
F-statistic: 4.075 on 2 and 47 DF,  p-value: 0.02333

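The output shows that F = 4.075 (p-value = 0.02333), so at the 5% level we reject the null hypothesis that both coefficients are zero: Age and Price collectively have an effect on Fav_FastFood. Looking at the individual t-tests, Age is significant (p = 0.00694) while Price is not (p = 0.41237), i.e. Price has no significant effect on Fav_FastFood (the response variable) once Age is in the model. In addition, the output shows that R² = 0.1478 and adjusted R² = 0.1115, so the model explains only about 15% of the variation in the response.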
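
The p-value of the overall F-test and the adjusted R² can be reproduced from the other numbers in the summary output. This is a small check using only values printed above (n = 50 guests, p = 2 explanatory variables).

```r
# Values copied from the summary output above
f_stat <- 4.075
n      <- 50      # number of guests
p      <- 2       # explanatory variables: Age and Price
r2     <- 0.1478

# Upper-tail probability of the F distribution on (p, n - p - 1) df
p_value <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# The F statistic itself follows from R-squared
f_from_r2 <- (r2 / p) / ((1 - r2) / (n - p - 1))

print(c(p_value = p_value, adj_r2 = adj_r2, f_from_r2 = f_from_r2))
```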

Testing a subset of variables using a partial F-test
Sometimes we are interested in simultaneously testing whether a certain subset of the coefficients is equal to 0 (e.g. β3 = β4 = 0). We can do this using a partial F-test. This test compares the SSE (residual sum of squares) from a reduced model (excluding the parameters we hypothesize are equal to zero) with the SSE from the full model (including all of the parameters).
In R we can perform partial F-tests by fitting both the reduced and full models separately and thereafter comparing them using the anova function.

Ex. Suppose we include the variables "Age", "Price", "Gender", "Occupation", "Q_Service", "Taste_Food", "Monthly_Visit", "Reasons_Visit" & "Satisfaction" in our model and are interested in testing whether the seven variables "Gender", "Occupation", "Q_Service", "Taste_Food", "Monthly_Visit", "Reasons_Visit" & "Satisfaction" contribute anything after taking "Age" and "Price" into consideration.

# Reduced Model
> reduced=(lm(Fav_FastFood~Price+Age))
> reduced
Call:
lm(formula = Fav_FastFood ~ Price + Age)
Coefficients:
(Intercept)        Price          Age
     4.5163       0.2334      -0.9469
> summary(reduced)
Call:
lm(formula = Fav_FastFood ~ Price + Age)
Residuals:
    Min      1Q  Median      3Q     Max
-2.3226 -1.0892 -0.1158  1.4459  2.0910
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.5163     0.9586   4.711 2.22e-05 ***
Price         0.2334     0.2822   0.827  0.41237
Age          -0.9469     0.3353  -2.824  0.00694 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.373 on 47 degrees of freedom
Multiple R-squared:  0.1478,    Adjusted R-squared:  0.1115
F-statistic: 4.075 on 2 and 47 DF,  p-value: 0.02333
# Full Model
> attach(data1)   # attach the data frame, not the model (skip if data1 is already attached)
> full=(lm(Fav_FastFood~Price+Age+Gender+Occupation+Q_Service+Taste_Food+Monthly_Visit+Reasons_Visit+Satisfaction))
> full
Call:
lm(formula = Fav_FastFood ~ Price + Age + Gender + Occupation +
    Q_Service + Taste_Food + Monthly_Visit + Reasons_Visit +
    Satisfaction)
Coefficients:
  (Intercept)          Price            Age         Gender     Occupation
      5.14578        0.17980       -1.10841       -0.24159       -0.57401
    Q_Service     Taste_Food  Monthly_Visit  Reasons_Visit   Satisfaction
     -0.13015        0.09593       -0.25246        0.22270        0.29820

> summary(full)
Call:
lm(formula = Fav_FastFood ~ Price + Age + Gender + Occupation +
    Q_Service + Taste_Food + Monthly_Visit + Reasons_Visit +
    Satisfaction)

Residuals:
     Min       1Q   Median       3Q      Max
-2.31351 -1.12243 -0.06685  0.87608  2.15450
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    5.14578    2.97431   1.730  0.09133 .
Price          0.17980    0.30538   0.589  0.55933
Age           -1.10841    0.38296  -2.894  0.00613 **
Gender        -0.24159    0.41801  -0.578  0.56654
Occupation    -0.57401    1.71982  -0.334  0.74030
Q_Service     -0.13015    0.24642  -0.528  0.60029
Taste_Food     0.09593    0.29604   0.324  0.74760
Monthly_Visit -0.25246    0.36233  -0.697  0.48997
Reasons_Visit  0.22270    0.27519   0.809  0.42314
Satisfaction   0.29820    0.29773   1.002  0.32257
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.437 on 40 degrees of freedom
Multiple R-squared:  0.2054,    Adjusted R-squared:  0.02659
F-statistic: 1.149 on 9 and 40 DF,  p-value: 0.3531


# Compare the Models
> anova(reduced, full)
Analysis of Variance Table
Model 1: Fav_FastFood ~ Price + Age
Model 2: Fav_FastFood ~ Price + Age + Gender + Occupation + Q_Service +
    Taste_Food + Monthly_Visit + Reasons_Visit + Satisfaction
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     47 88.562
2     40 82.577  7    5.9849 0.4142 0.8878
The output shows the results of the partial F-test. Since F = 0.4142 (p-value = 0.8878), we cannot reject the null hypothesis (that the coefficients of the seven added variables are all 0) at the 5% level of significance. It appears that the variables "Gender", "Occupation", "Q_Service", "Taste_Food", "Monthly_Visit", "Reasons_Visit" & "Satisfaction" do not contribute significant information about the "Favorite Fast Food" once the variables "Age" and "Price" have been taken into consideration.
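
The F statistic and p-value in the anova table can be recomputed by hand from the two residual sums of squares. The sketch below uses only the numbers printed in the table.

```r
# Residual sums of squares and degrees of freedom from the anova table
rss_reduced <- 88.562; df_reduced <- 47
rss_full    <- 82.577; df_full    <- 40

# Number of coefficients being tested (here 7)
q <- df_reduced - df_full

# Partial F: drop in RSS per tested coefficient, relative to the
# residual mean square of the full model
f_partial <- ((rss_reduced - rss_full) / q) / (rss_full / df_full)
p_partial <- pf(f_partial, df1 = q, df2 = df_full, lower.tail = FALSE)

print(c(F = f_partial, p = p_partial))
```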

Confidence and Prediction Intervals
We often use our regression models to estimate the mean response or predict future values of the response variable for certain values of the explanatory variables. The function predict() can be used to make both confidence intervals for the mean response and prediction intervals. To make confidence intervals for the mean response use the option interval="confidence". To make a prediction interval use the option interval="prediction". By default this makes 95% confidence and prediction intervals. If you instead want to make a 99% confidence or prediction interval use the option level=0.99.

Ex. Obtain a 95% confidence interval for the mean Fav_FastFood of guests whose Age level is 2 and whose Price level is 2.

> reduced=(lm(Fav_FastFood~Price+Age))

> predict(reduced,data.frame(Age=2,Price=2),interval="confidence")
      fit      lwr     upr
1 3.08924 2.615599 3.56288
A 95% confidence interval is given by (2.615599, 3.56288).

Ex. Obtain a 95% prediction interval for the Fav_FastFood of an individual guest whose Age level is 2 and whose Price level is 2.
> predict(reduced,data.frame(Age=2,Price=2),interval="prediction")
      fit      lwr      upr
1 3.08924 0.287413 5.891067

A 95% prediction interval is given by (0.287413, 5.891067).
Note that this is quite a bit wider than the confidence interval, indicating that the variation of individual responses about the mean is fairly large.
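
The extra width comes from the residual variance: the prediction interval adds the residual standard error to the uncertainty of the estimated mean. As a rough check, the sketch below recovers the prediction interval from the confidence interval and the residual standard error (1.373 on 47 degrees of freedom) reported earlier. Small differences are due to rounding in the printed output.

```r
# Numbers copied from the outputs above
fit_value <- 3.08924
ci_upper  <- 3.56288
sigma     <- 1.373     # residual standard error of the reduced model
df_resid  <- 47

t_crit  <- qt(0.975, df = df_resid)           # two-sided 95% critical value
se_mean <- (ci_upper - fit_value) / t_crit    # standard error of the mean response
se_pred <- sqrt(se_mean^2 + sigma^2)          # standard error for a new observation

pred_int <- fit_value + c(-1, 1) * t_crit * se_pred
print(pred_int)   # compare with (0.287413, 5.891067)
```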
Conclusion:
After considering all scenarios, we formulated our multiple regression model equation and observed that only "Age" (independent variable) has a significant impact on the choice of Favorite Fast Food (response variable).

More: Contact: kabircse115@gmail.com
