Data Collection and Analysis

General Questions that the Data Analysis should Answer

Do users give similar ratings to a business on Google and Yelp? Are user reviews consistent with user ratings on Google and Yelp? Develop a combined rating using Google and Yelp ratings. Is the combined rating more informative than either individual rating?
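As a starting point, here is a minimal sketch of one way to compare the two platforms and build a combined rating. The data frame, its column names, and all values are invented for illustration, and the review-count-weighted mean is only one of many reasonable definitions of a combined rating.

```r
# Hypothetical ratings for the same five businesses on each platform.
# All values below are made up for illustration only.
ratings <- data.frame(
  business       = c("A", "B", "C", "D", "E"),
  google_rating  = c(3.8, 3.0, 3.9, 3.6, 4.2),
  google_reviews = c(120, 45, 300, 80, 15),
  yelp_rating    = c(3.0, 2.5, 4.0, 3.5, 4.5),
  yelp_reviews   = c(30, 10, 60, 20, 5)
)

# Do the two platforms agree? A rank correlation is one option,
# since star ratings are ordinal.
agreement <- cor(ratings$google_rating, ratings$yelp_rating, method = "spearman")

# One possible combined rating: the mean of the two ratings, weighted by
# the number of reviews behind each one.
ratings$combined <- with(
  ratings,
  (google_rating * google_reviews + yelp_rating * yelp_reviews) /
    (google_reviews + yelp_reviews)
)

print(round(agreement, 2))
print(ratings[, c("business", "google_rating", "yelp_rating", "combined")])
```

Whether such a combined rating is "more informative" is for your analysis to argue, for example by checking how well it tracks the content of the reviews.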

Data Collection

Collect data from Google and Yelp to compare reviews of the same business.

Data can often be collected from websites that don’t have APIs by scraping the sites. If you would prefer to do this, discuss the option with me as soon as possible.
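For orientation, a scraping workflow in R might look like the sketch below. It assumes the rvest package and parses a small inline HTML string, so the class names (review, stars, text) and the review contents are invented. A real site will have its own structure and terms of use.

```r
library(rvest)

# Stand-in for a downloaded page. On a real site you would pass the page
# URL to read_html() after checking the site's terms of use.
page <- read_html('
  <html><body>
    <div class="review"><span class="stars">4</span><p class="text">Friendly staff.</p></div>
    <div class="review"><span class="stars">2</span><p class="text">Slow service.</p></div>
  </body></html>')

# One node per review, then one rating and one text per node.
reviews <- html_elements(page, "div.review")

scraped <- data.frame(
  rating = as.numeric(html_text2(html_element(reviews, "span.stars"))),
  text   = html_text2(html_element(reviews, "p.text")),
  stringsAsFactors = FALSE
)
print(scraped)
```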

Students may not use an existing data set for this assignment.

Using APIs in R to Collect Data

Google and Yelp both have APIs that can be used to access their data.

Google API

Google Maps APIs can be accessed using the R library googleway. You will need a valid API key from Google to use the library. Follow the instructions here to get a key.

The following code uses the Google Places API Web Service to collect ratings for Tim Hortons locations in Toronto.


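library(googleway)
library(dplyr)

# Text search for Tim Hortons locations; mygooglekey holds your API key
res_goog <- google_places(search_string = "Tim Hortons in Toronto Ontario", key = mygooglekey)
knitr::kable(head(res_goog$results %>% select(formatted_address, rating)))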

|formatted_address                           | rating|
|:-------------------------------------------|------:|
|101 College St, Toronto, ON M5G 1L7, Canada |    3.8|
|145 King St W, Toronto, ON M5H 1J8, Canada  |    3.0|
|322 King St W, Toronto, ON M5V 1J2, Canada  |    3.9|
|207 Queens Quay W, Toronto, ON M5J 1A7, Canada |  3.6|
|33 Yonge St, Toronto, ON M5E 1G4, Canada    |    3.8|
|261 Yonge St, Toronto, ON M5B 1N8, Canada   |    3.8|

# First five Google Reviews on first location
res_goog_details <- google_place_details(place_id = res_goog$results$place_id[1],key = mygooglekey)
knitr::kable(res_goog_details$result$reviews %>% select(rating, text))

| rating|text |
|------:|:----|
|      1|I wouldnt even give them 1 star except i had to in order to get to this part of writing a review.. I workerd 2 shifts there, and they never paid me! When i call them the supervisor and manager both hang the phone up on me. When i asked where my pay for the hours i worked were? Discusting ! Makes me sick to my stomach to think they are taking advantage of people like me ! And they dont return phone calls either ! Nice bunch to work for eh ! 😡😡😡 |
|      4|Standard Timmy’s. The staff are really friendly however |
|      3|Good coffee. Probably the slowest service amongst Tim Hortons in the area. |
|      5|Long line but moves very quickly. |
|      5|best timmies on college |
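One rough way to check whether review text is consistent with the star rating is to score each text and compare the score with the rating. The sketch below uses four of the reviews shown above. The word lists are a toy assumption made up for this example, not a validated sentiment lexicon, and `score_text` is a hypothetical helper.

```r
# Ratings and texts taken from four of the reviews shown above.
reviews <- data.frame(
  rating = c(4, 3, 5, 5),
  text = c(
    "Standard Timmy's. The staff are really friendly however",
    "Good coffee. Probably the slowest service amongst Tim Hortons in the area.",
    "Long line but moves very quickly.",
    "best timmies on college"
  ),
  stringsAsFactors = FALSE
)

# Toy word lists (an assumption for illustration only).
positive <- c("good", "friendly", "best", "quickly")
negative <- c("slowest", "long", "sick")

# Score = (number of positive words) - (number of negative words).
score_text <- function(txt) {
  words <- strsplit(tolower(gsub("[^[:alpha:]' ]", " ", txt)), "\\s+")[[1]]
  sum(words %in% positive) - sum(words %in% negative)
}

reviews$sentiment <- vapply(reviews$text, score_text, integer(1), USE.NAMES = FALSE)

print(reviews[, c("rating", "sentiment")])
```

With a larger sample, a rank correlation between `rating` and `sentiment` would give one summary of consistency. A word-count score like this ignores negation and sarcasm, so treat it as a baseline only.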

Yelp API

The Yelp Fusion API can be accessed using the httr library. You will need valid client credentials to access the API. Follow the instructions here. The following code was modified from here.

library(httr)

# Request an access token using your client credentials
res <- POST("",
            body = list(grant_type = "client_credentials",
                        client_id = yelp_clientid,
                        client_secret = yelp_clientsecret))
token <- content(res)$access_token

A full list of the parameters that can be passed to the Yelp search API is available here.

yelp <- ""
term <- "tim hortons"
location <- "toronto"
limit <- 5
radius <- 1000

url <- modify_url(yelp, path = c("v3", "businesses", "search"), 
                  query = list(term = term, location=location, limit = limit,radius=radius))

res_yelp <- GET(url, add_headers('Authorization' = paste("bearer", token)))
ct <- content(res_yelp)

# tibble() replaces the deprecated data_frame(); named columns give readable headers
knitr::kable(tibble::tibble(name = ct$businesses[[1]]$name,
                            address = ct$businesses[[1]]$location$address1,
                            rating = ct$businesses[[1]]$rating))

|name        |address            | rating|
|:-----------|:------------------|------:|
|Tim Hortons |334 Bloor Street W |      3|

# Get reviews for the first location returned

url_reviews <- modify_url(yelp, path = c("v3", "businesses", ct$businesses[[1]]$id,"reviews"))
res_yelp_reviews <- GET(url_reviews, add_headers('Authorization' = paste("bearer", token)))
ct_rev <- content(res_yelp_reviews)

# Text of the first review returned
ct_rev$reviews[[1]]$text
## [1] "This is a small TIm Horton right off the corner of Spadina and Bloor, so I give it a point for being close to subway, I actually did grab a coffee on my way..."
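The parsed Yelp response is a nested list, which is awkward to analyse directly. Below is a minimal sketch of flattening it into a data frame, using a mock list shaped like the fields accessed above (`name`, `location$address1`, `rating`, `id`). The `id` values, the second business, and the helper `or_na` are invented for illustration.

```r
# Mock of the parsed response. Only the first business's name, address
# and rating come from the output above; everything else is invented.
ct_mock <- list(businesses = list(
  list(id = "example-id-1", name = "Tim Hortons",
       location = list(address1 = "334 Bloor Street W"), rating = 3),
  list(id = "example-id-2", name = "Tim Hortons",
       location = list(address1 = NULL), rating = 3.5)
))

# Replace a missing (NULL) field with NA so every row has the same columns.
or_na <- function(x) if (is.null(x)) NA else x

# One single-row data frame per business, stacked with rbind().
yelp_df <- do.call(rbind, lapply(ct_mock$businesses, function(b) {
  data.frame(
    id      = or_na(b$id),
    name    = or_na(b$name),
    address = or_na(b$location$address1),
    rating  = or_na(b$rating),
    stringsAsFactors = FALSE
  )
}))
print(yelp_df)
```

Handling missing fields explicitly matters because API responses do not always include every field for every business.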

The Assignment

The assignment is to collect data and wrangle it into a format that can be analysed using statistical methods.
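A large part of the wrangling is matching the same business across the two sources. The sketch below joins two small tables on a normalized street address. The tables are illustrative rather than real data, `make_key` is a hypothetical helper, and its rules are deliberately crude, so real addresses will need more care.

```r
# Illustrative per-platform tables. Address formats follow the Google and
# Yelp output shown earlier; the rows and ratings are not real data.
google_df <- data.frame(
  formatted_address = c("101 College St, Toronto, ON M5G 1L7, Canada",
                        "334 Bloor St W, Toronto, ON M5S 1W9, Canada"),
  google_rating = c(3.8, 3.5),
  stringsAsFactors = FALSE
)
yelp_df <- data.frame(
  address1 = c("334 Bloor Street W", "246 Bloor Street W"),
  yelp_rating = c(3, 4),
  stringsAsFactors = FALSE
)

# Build a crude matching key: street part only, lower case, with a couple
# of common abbreviations unified.
make_key <- function(address) {
  street <- tolower(trimws(sub(",.*$", "", address)))
  street <- gsub("\\bstreet\\b", "st", street, perl = TRUE)
  street <- gsub("\\bavenue\\b", "ave", street, perl = TRUE)
  gsub("\\s+", " ", street)
}

google_df$key <- make_key(google_df$formatted_address)
yelp_df$key   <- make_key(yelp_df$address1)

# Inner join: keep only businesses found on both platforms.
both <- merge(google_df, yelp_df, by = "key")
print(both[, c("key", "google_rating", "yelp_rating")])
```

Report how many businesses failed to match, since unmatched businesses can bias the comparison.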

What Should I Submit?

The data analysis will be based on data from a particular time period. Conduct the analysis and write the report in an R Notebook file. Save the data and R Notebook files as part of an R project.
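Because API results change over time, it helps to save the collected data once, with a timestamp, and have the notebook read the saved file. The following is a minimal sketch; the data frame, the `data` folder, and the file name are assumptions, and `tempdir()` is used only so the example runs anywhere.

```r
# Hypothetical collected data; in the assignment this comes from the APIs.
collected <- data.frame(
  business = c("A", "B"),
  google_rating = c(3.8, 3.0),
  stringsAsFactors = FALSE
)

# Record when the data were collected so the time period can be reported.
collected$collected_at <- Sys.time()

# In your project, replace tempdir() with the project directory.
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE, recursive = TRUE)
data_file <- file.path(data_dir, "ratings_raw.rds")
saveRDS(collected, data_file)

# The notebook reads the saved file instead of calling the APIs again,
# so knitting the report never changes the data.
ratings <- readRDS(data_file)
print(range(ratings$collected_at))
```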

The R Project directory should contain:

The R Notebook with your data analysis must knit without errors on a machine running RStudio.

Answers to Some Common Questions

  1. It’s not necessary for R code chunks to appear in the report (use the chunk option echo=FALSE). Include code in the report only if it helps describe what you have done in the data analysis. You will also be submitting your R Notebook file, so I can see all the gory details. This leads to …

  2. What should be reported in the report? A high level description of what you have done. This leads to …

  3. Who is the intended audience for the report and what do you mean by a “high level description”? The intended audience is an educated person who has taken at least one basic statistics course, but might be a bit rusty on the details. For example, your supervisor at work completed an MBA ten years ago and took a few statistics courses, but the details are a bit hazy.
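For the first point above, one way to hide code throughout the report is to set the option once in the notebook's first chunk rather than on every chunk. This sketch assumes the knitr package; suppressing messages and warnings as well is an optional extra, not a requirement of the assignment.

```r
# Place in the first (setup) chunk of the notebook.
# echo = FALSE hides code in the knitted report; results and figures
# still appear. message and warning control package start-up noise.
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
```

An individual chunk can still show its code by setting echo=TRUE in that chunk's header.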

How will my writing be evaluated?

Your writing will be evaluated for clarity and conciseness.

  1. Title [1-5] There should be an appropriate title, adequate summary, and complete information including names and dates.

  2. Introduction [1-5] The purpose of the research should be clearly stated and the scope of what is considered in the report should be clear.

  3. Methods [1-5] The role of each method should be clearly stated. The description of the analyses should be clear and unambiguous so that another statistician or data scientist could easily re-construct it. The methods should be described accurately.

  4. Results [1-5] There should be appropriate tables and graphs. The results should be clearly stated in the context of the problem. The size and direction of significant results should be given. The results must be accurately stated. The research question should be adequately answered.

  5. Conclusion / Discussion [1-5] The results should be clearly and completely summarized. This section should also include discussion of limitations and/or concerns and/or suggestions for future consideration as appropriate.

  6. General Considerations [1-5] The ideas should be presented in a logical order, in well-organized sections, at an appropriate level of technical detail, and without grammatical, spelling, or punctuation errors. The report should be clear and easy to follow.
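To illustrate what "size and direction" means in the Results criterion, here is a minimal sketch that reports a paired comparison in a sentence. The ratings are made up, and a paired t-test is only one possible method; whether it suits your data is for you to justify.

```r
# Hypothetical paired ratings for eight businesses (made-up values).
google <- c(3.8, 3.0, 3.9, 3.6, 3.8, 4.1, 3.2, 4.4)
yelp   <- c(3.0, 2.5, 4.0, 3.5, 3.0, 4.0, 2.5, 4.0)

# Paired t-test: each business contributes one Google and one Yelp rating.
tt <- t.test(google, yelp, paired = TRUE)

mean_diff <- unname(tt$estimate)  # size and direction (Google minus Yelp)
ci        <- tt$conf.int          # 95% confidence interval for the difference

# State the result in the context of the problem, not just as a p-value.
cat(sprintf(
  "Google ratings were on average %.2f stars %s than Yelp ratings (95%% CI %.2f to %.2f, p = %.3f).\n",
  abs(mean_diff), if (mean_diff > 0) "higher" else "lower",
  ci[1], ci[2], tt$p.value
))
```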