University of Applied Sciences, Frankfurt
Department of Computer Science & Engineering
Md. Kabir Hosen
Idea: Form a model equation by multiple regression analysis on the observed (collected) data and the predicted values. Compute residual errors, scatter plots, descriptive statistics (mean, median, ratio) and correlation in R.
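As a minimal sketch of these descriptive statistics in R (using small hypothetical vectors, not the real survey data):

```r
# Hypothetical coded responses (not the real survey data),
# used only to illustrate the descriptive statistics in R.
age   <- c(2, 3, 2, 2, 1)   # coded age groups
price <- c(2, 2, 2, 3, 2)   # coded price ratings

mean(age)               # arithmetic mean
median(age)             # middle value
mean(age) / mean(price) # simple ratio of the two means
cor(age, price)         # Pearson correlation of the codes
summary(age)            # min, quartiles, mean, max in one call
```

The same calls work on the columns of a data frame once the real survey data has been read in.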
Questionnaire:
1. Respondent Name: ………………………………………………….
2. What is your age group?
   i. Under 18
   ii. 18-26
   iii. 27-35
   iv. 36 and above
3. Living place ………………………………………………………….
4. Gender
   i. Male
   ii. Female
   iii. Others ………………………………………………………
5. Occupation
   i. Student
   ii. Employee
   iii. Others
6. Your favorite fast food?
   i. KFC
   ii. McDonald's
   iii. Pizza Hut
   iv. Burger King
   v. Others ………………………………………………………
7. Price is
   i. Cheap
   ii. Average
   iii. Good
   iv. Outstanding
8. Quality of Service
   i. Good
   ii. Very good
   iii. Excellent
   iv. Others ……………………………………………………..
9. Taste of food
   i. Good
   ii. Very good
   iii. Excellent
   iv. Others ………………………………………………………
10. How many times do you go to your fast food restaurant per month?
   i. 1-2
   ii. 3-5
   iii. 6-10
   iv. More than 10
11. For which of these reasons do you go to your restaurant?
   i. Special occasion (birthday, holiday)
   ii. Regular meal
   iii. Business lunch
   iv. Just for the food
12. Overall Satisfaction
   i. Good
   ii. Very good
   iii. Excellent
   iv. Others ………………………………………………….
Response Variable:
The response variable is "Favorite Fast Food".

Prediction:
We are going to predict "Favorite Fast Food" from the customer feedback data, e.g. "Age", "Gender", "Occupation", "Price", "Quality of Service", "Taste of Food", "Monthly Restaurant Visit", "Reasons for Restaurant Visit" and "Satisfaction".
Aims of a Successful Guest Survey:
The survey will undertake to:
1. Measure overall customer satisfaction.
2. Learn about the customer.
3. Identify buying habits and dining patterns.
4. Find out why customers visit the restaurant.
5. Learn what influences guest purchase decisions.
6. Learn what guests believe you do well and not so well.
7. Discover what we can do to improve operations.
8. Identify processes for change that will improve customer satisfaction.
9. Learn how to increase customer loyalty.
10. Finally, determine which fast food we are going to launch.
“Favorite Fast Food Prediction with Live Data”
Introduction:
A regression with two or more explanatory variables is called a multiple regression. Rather than modeling the mean response as a straight line, as in simple regression, it is now modeled as a function of several explanatory variables. The function lm() can be used to perform multiple linear regression in R, and much of the syntax is the same as that used for fitting simple linear regression models. To perform multiple linear regression with p explanatory variables, use the command:

> lm(response ~ explanatory_1 + explanatory_2 + … + explanatory_p)

Here the terms response and explanatory_i should be replaced by the names of the response and explanatory variables, respectively, used in the analysis.
Ex. Data was collected from 50 guests recently surveyed in Frankfurt. It consisted of the variables "Age", "Gender", "Occupation", "Fav_FastFood", "Price", "Q_Service", "Taste_Food", "Monthly_Visit", "Reasons_Visit" and "Satisfaction". The following commands read in the data:

> data1 <- read.csv(file.choose(), header=T)  # Read data from the guest feedback CSV file
> data1
Suppose we are only interested in working with a subset of the variables (e.g. "Fav_FastFood", "Price", and "Age"). It is possible (but not necessary) to construct a new data frame consisting solely of these variables using the commands:

> myvars <- c("Fav_FastFood", "Age", "Price")
> Guestdata <- data1[myvars]
> names(Guestdata)
[1] "Fav_FastFood" "Age" "Price"
> Guestdata
  Fav_FastFood Age Price
1            3   3     2
2            2   2     2
3            2   2     2
4            2   3     2
5            1   2     2
6            3   2     2
7            4   2     2
………………………… (up to 50 rows)
Before fitting our regression model we want to investigate how the variables are related to one another. We can do this graphically by constructing scatter plots of all pair-wise combinations of variables in the data frame. This can be done by typing:

> plot(Guestdata)  # pairwise scatter plots of the variables in the data frame
To fit a multiple linear regression model with "Fav_FastFood" as the response (dependent) variable and "Age" and "Price" as the explanatory (independent) variables, use the command:

> Guestmodel <- lm(Fav_FastFood ~ Age + Price, data=data1)
> Guestmodel

Call:
lm(formula = Fav_FastFood ~ Age + Price, data = data1)

Coefficients:
(Intercept)          Age        Price
     4.5163      -0.9469       0.2334
This output indicates that the fitted model is ŷ = 4.5163 − 0.9469·Age + 0.2334·Price.
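As a quick check, a fitted value can be computed by hand from the reported coefficients, e.g. for a guest with Age code 2 and Price code 2:

```r
# Hand-computing a fitted value from the reported coefficients
# for the combination Age = 2, Price = 2.
b0 <- 4.5163; b_age <- -0.9469; b_price <- 0.2334
yhat <- b0 + b_age * 2 + b_price * 2
round(yhat, 4)  # 3.0893, close to the fit that predict() returns for this point
```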
Inference in the multiple regression setting is typically performed in a number of steps. We begin by testing whether the explanatory variables collectively have an effect on the response variable, i.e.

H0: β1 = β2 = … = βp = 0

If we can reject this hypothesis, we continue by testing whether the individual regression coefficients are significant while controlling for the other variables in the model.
We can access the results of each test by typing:

> Guestmodel <- lm(Fav_FastFood ~ Age + Price, data=data1)  # Reduced Model
> summary(Guestmodel)

Call:
lm(formula = Fav_FastFood ~ Age + Price, data = data1)

Residuals:
    Min      1Q  Median      3Q     Max
-2.3226 -1.0892 -0.1158  1.4459  2.0910

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.5163     0.9586   4.711 2.22e-05 ***
Age          -0.9469     0.3353  -2.824  0.00694 **
Price         0.2334     0.2822   0.827  0.41237
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.373 on 47 degrees of freedom
Multiple R-squared: 0.1478,	Adjusted R-squared: 0.1115
F-statistic: 4.075 on 2 and 47 DF, p-value: 0.02333
The output shows that F = 4.075 (p = 0.02333 < 0.05), indicating that we should reject the null hypothesis that all the coefficients are zero, i.e. the explanatory variables collectively have an effect on Fav_FastFood. The individual t-tests show that Age is significant (p = 0.00694) but Price is not (p = 0.41237), so Price has no detectable effect on Fav_FastFood (the response variable) once Age is in the model. In addition, the output shows that R² = 0.1478 and adjusted R² = 0.1115.
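The adjusted R² can be reproduced from the multiple R² and the degrees of freedom (here n = 50 observations and p = 2 predictors):

```r
# Reproducing the adjusted R-squared from the reported R-squared:
# adj R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
r2 <- 0.1478; n <- 50; p <- 2
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)  # 0.1115, as reported by summary()
```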
Testing a subset of variables using a partial F-test

Sometimes we are interested in simultaneously testing whether a certain subset of the coefficients are equal to 0 (e.g. β3 = β4 = 0). We can do this using a partial F-test. This test involves comparing the SSE from a reduced model (excluding the parameters we hypothesize are equal to zero) with the SSE from the full model (including all of the parameters).
In R we can perform partial F-tests by fitting both the reduced and full models separately and thereafter comparing them using the anova function.

Ex. Suppose we include the variables "Age", "Price", "Gender", "Occupation", "Q_Service", "Taste_Food", "Monthly_Visit", "Reasons_Visit" and "Satisfaction" in our model and are interested in testing whether "Gender", "Occupation", "Q_Service", "Taste_Food", "Monthly_Visit", "Reasons_Visit" and "Satisfaction" are not significant after taking "Age" and "Price" into consideration.
# Reduced Model
> reduced <- lm(Fav_FastFood ~ Price + Age, data=data1)
> reduced

Call:
lm(formula = Fav_FastFood ~ Price + Age, data = data1)

Coefficients:
(Intercept)        Price          Age
     4.5163       0.2334      -0.9469

> summary(reduced)

Call:
lm(formula = Fav_FastFood ~ Price + Age, data = data1)

Residuals:
    Min      1Q  Median      3Q     Max
-2.3226 -1.0892 -0.1158  1.4459  2.0910

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   4.5163     0.9586   4.711 2.22e-05 ***
Age          -0.9469     0.3353  -2.824  0.00694 **
Price         0.2334     0.2822   0.827  0.41237
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.373 on 47 degrees of freedom
Multiple R-squared: 0.1478,	Adjusted R-squared: 0.1115
F-statistic: 4.075 on 2 and 47 DF, p-value: 0.02333
# Full Model
> full <- lm(Fav_FastFood ~ Price + Age + Gender + Occupation + Q_Service + Taste_Food + Monthly_Visit + Reasons_Visit + Satisfaction, data=data1)
> full

Call:
lm(formula = Fav_FastFood ~ Price + Age + Gender + Occupation +
    Q_Service + Taste_Food + Monthly_Visit + Reasons_Visit +
    Satisfaction, data = data1)

Coefficients:
  (Intercept)         Price           Age        Gender    Occupation     Q_Service
      5.14578       0.17980      -1.10841      -0.24159      -0.57401      -0.13015
   Taste_Food Monthly_Visit Reasons_Visit  Satisfaction
      0.09593      -0.25246       0.22270       0.29820

> summary(full)

Call:
lm(formula = Fav_FastFood ~ Price + Age + Gender + Occupation +
    Q_Service + Taste_Food + Monthly_Visit + Reasons_Visit +
    Satisfaction, data = data1)

Residuals:
     Min       1Q   Median       3Q      Max
-2.31351 -1.12243 -0.06685  0.87608  2.15450

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)    5.14578    2.97431   1.730  0.09133 .
Price          0.17980    0.30538   0.589  0.55933
Age           -1.10841    0.38296  -2.894  0.00613 **
Gender        -0.24159    0.41801  -0.578  0.56654
Occupation    -0.57401    1.71982  -0.334  0.74030
Q_Service     -0.13015    0.24642  -0.528  0.60029
Taste_Food     0.09593    0.29604   0.324  0.74760
Monthly_Visit -0.25246    0.36233  -0.697  0.48997
Reasons_Visit  0.22270    0.27519   0.809  0.42314
Satisfaction   0.29820    0.29773   1.002  0.32257
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.437 on 40 degrees of freedom
Multiple R-squared: 0.2054,	Adjusted R-squared: 0.02659
F-statistic: 1.149 on 9 and 40 DF, p-value: 0.3531
# Compare the Models
> anova(reduced, full)

Analysis of Variance Table

Model 1: Fav_FastFood ~ Price + Age
Model 2: Fav_FastFood ~ Price + Age + Gender + Occupation + Q_Service +
    Taste_Food + Monthly_Visit + Reasons_Visit + Satisfaction
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     47 88.562
2     40 82.577  7    5.9849 0.4142 0.8878
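The partial F statistic can be verified by hand from the RSS values in the anova table: F = ((RSS_reduced − RSS_full)/q) / (RSS_full/df_full), where q is the number of extra coefficients being tested.

```r
# Verifying the partial F statistic from the anova() output above.
rss_reduced <- 88.562   # RSS of the reduced model (47 residual df)
rss_full    <- 82.577   # RSS of the full model (40 residual df)
q <- 7                  # number of extra coefficients tested
F_stat <- ((rss_reduced - rss_full) / q) / (rss_full / 40)
round(F_stat, 4)  # 0.4142, matching the table
```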
The output shows the results of the partial F-test. Since F = 0.4142 (p-value = 0.8878), we cannot reject the null hypothesis that the extra coefficients are all zero at the 5% level of significance. It appears that the variables "Gender", "Occupation", "Q_Service", "Taste_Food", "Monthly_Visit", "Reasons_Visit" and "Satisfaction" do not contribute significant information about the "Favorite Fast Food" once the variables "Age" and "Price" have been taken into consideration.
Confidence and Prediction Intervals

We often use our regression models to estimate the mean response or predict future values of the response variable for certain values of the explanatory variables. The function predict() can be used to make both confidence intervals for the mean response and prediction intervals. To make confidence intervals for the mean response, use the option interval="confidence". To make a prediction interval, use the option interval="prediction". By default these are 95% confidence and prediction intervals. If you instead want a 99% confidence or prediction interval, use the option level=0.99.
Ex. Obtain a 95% confidence interval for the mean Fav_FastFood of guests whose Age level is 2 and Price level is 2.

> reduced <- lm(Fav_FastFood ~ Price + Age, data=data1)
> predict(reduced, data.frame(Age=2, Price=2), interval="confidence")
      fit      lwr     upr
1 3.08924 2.615599 3.56288

A 95% confidence interval is given by (2.615599, 3.56288).
Ex. Obtain a 95% prediction interval for the Fav_FastFood of an individual guest whose Age level is 2 and Price level is 2.

> predict(reduced, data.frame(Age=2, Price=2), interval="prediction")
      fit      lwr      upr
1 3.08924 0.287413 5.891067

A 95% prediction interval is given by (0.287413, 5.891067).
Note that this is quite a bit wider than the confidence interval, indicating that the variation about the mean is fairly large.
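The difference in width can be quantified directly from the two intervals above:

```r
# Comparing the widths of the two 95% intervals reported above.
conf_width <- 3.56288 - 2.615599    # width of the confidence interval
pred_width <- 5.891067 - 0.287413   # width of the prediction interval
pred_width / conf_width             # the prediction interval is roughly 6x wider
```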
Conclusion:
After consideration of all scenarios we formulated our multiple regression model equation and observed that only "Age" (independent variable) has a significant impact on choosing the Favorite Fast Food (response variable).
Contact: kabircse115@gmail.com