I moved to Seattle three months ago and found many Asian restaurants here. Some of the Chinese food is really authentic. However, while eating in several Chinese restaurants I noticed that most of the customers were Asian, which suggests Chinese food may not be very popular with Americans. To analyze how popular Chinese food is in Seattle, I scraped information from Yelp about Seattle restaurants serving different styles of Asian food.

1. Scraping Data From Yelp

I scraped four types of food: Chinese, Japanese, Thai, and Vietnamese. As the image below shows, the scraped information for each restaurant includes Name, Price (dollar signs), Review_count, Star (rating from 0 to 5 stars), Category (the type of food), and Address. Yelp lists ten restaurants per page, so you need to specify the number of restaurants you want to scrape (set the parameter num_restaurant in the code).
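Yelp addresses each page of results through a `start` offset in the URL, in steps of ten. As a minimal sketch (plain Python; `page_offsets` is an illustrative helper, not part of the scraper), the offsets for a given num_restaurant are:

```python
def page_offsets(num_restaurant):
    # Yelp shows ten results per page, addressed by a `start` offset,
    # so the offsets step through 0, 10, 20, ... below num_restaurant
    return list(range(0, num_restaurant, 10))

print(page_offsets(35))  # [0, 10, 20, 30]
```

The scraper below builds the same sequence with `range(0, num_restaurant, 10)` and appends each offset to the base URL.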

I wrote Python code to scrape the data from Yelp. You can choose to store the data either in a MySQL database or in a CSV file. The code can also be found on my GitHub.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Thu Dec 15 15:38:11 2016

@author: Vincent 
"""
import requests
from bs4 import BeautifulSoup
import re
import pymysql
import csv
import os

class YelpSeattleRestaurant:
    """
    The code scrapes each restaurant's Name, Price, Category, Star, and
    Review count in the Seattle area. You can use it to scrape other areas,
    but the URL needs to be modified accordingly.

    Params: url: the web page link of the restaurant list.
            num_restaurant: the number of restaurants to scrape. Since Yelp
            lists 10 restaurants per page, this is effectively rounded to a
            multiple of 10.

    Note: the MySQL database and table must be created before executing.
    """
    def __init__(self, url, num_restaurant, style):
        self.UA = 'Chrome/54 (Macintosh; Intel Mac OS X 10.11; rv:32.0) Gecko/20100101 Firefox/32.0'
        self.url = url
        self.page_num = [str(i) for i in range(0, num_restaurant, 10)]
        self.style = style

    def scrape_name(self, soup):
        # search the anchor tag that holds the business name
        name = ''
        name_a = soup.find_all('a', {'class': 'biz-name js-analytics-click'})
        for d in name_a:
            name = name + d.string
        return name

    def scrape_price(self, soup):
        price = ''
        price_ = soup.find_all('span', {'class': 'business-attribute price-range'})
        for p in price_:
            price = price + p.string
        return price

    def scrape_category(self, soup):
        category = ''
        category_ = soup.find_all('span', {'class': 'category-str-list'})
        if category_:
            for c in category_:
                # .get_text() returns the same as .text; strip newlines and spaces
                category = category + c.get_text().strip().replace('\n', '').replace(' ', '')
        else:
            category = None
        return category

    def scrape_star(self, soup):
        star = ''
        star_ = soup.find_all('div', {'class': re.compile('i-stars i-stars*')})
        if star_:
            for s in star_:
                star_str = s['title'].split()[0]
                star = star + star_str
        else:
            star = None
        return star

    def scrape_review(self, soup):
        review_count = ''
        reviews = soup.find_all('span', {'class': 'review-count rating-qualifier'})
        if reviews:
            for r in reviews:
                review_count = review_count + r.text.strip().split()[0]
        else:
            review_count = None
        return review_count

    def scrape_address(self, soup):
        address = ''
        address_ = soup.find_all('address')
        if address_:
            for a in address_:
                address = address + a.text.strip()
        else:
            address = None
        return address

    def store(self, name, price, review_count, star, category, address):
        conn = pymysql.connect(host='localhost',
                               user='root',
                               passwd='root',
                               db='Yelp_scrape',
                               use_unicode=True,
                               charset="utf8")

        # create a cursor object in order to execute queries
        cur = conn.cursor()
        # create a new record
        sql = 'Insert into Restaurant (Name, Price, Review_count, ' \
              'Star, Category, Address) Values (%s, %s, %s, %s, %s, %s)'
        cur.execute(sql, (name, price, review_count, star, category, address))
        # commit changes
        cur.connection.commit()

    def write_mysql(self):
        for i, page in enumerate(self.page_num):
            print('Begin to scrape: Page %s' % (i))
            url_curr = self.url + page + '&cflt=' + self.style
            html = requests.get(url_curr, headers={'User-Agent': self.UA})
            soup = BeautifulSoup(html.content, 'lxml')

            restaurants_div = soup.find_all('div', {'class': 'biz-listing-large'})
            for restaurant in restaurants_div:
                name = self.scrape_name(restaurant)
                price = self.scrape_price(restaurant)
                category = self.scrape_category(restaurant)
                star = self.scrape_star(restaurant)
                review_count = self.scrape_review(restaurant)
                address = self.scrape_address(restaurant)
                # for each restaurant, store its information in MySQL database
                try:
                    self.store(name, price, review_count, star, category, address)
                except pymysql.MySQLError:
                    # skip rows that fail to insert (e.g. encoding issues)
                    pass
            print('Finished scraping: Page %s' % (i))

    def write_csv(self):
        # if the file doesn't exist, create it and write the header row
        file = 'yelp_' + self.style + '.csv'
        if not os.path.exists(file):
            with open(file, 'w', newline='') as f:
                writer = csv.writer(f)
                writer.writerow(['Name', 'Price', 'Review_count', 'Star', 'Category', 'Address'])
        # check each restaurant, append its info into csv file.
        for i, page in enumerate(self.page_num):
            print('Begin to scrape: Page %s' % (i))
            url_curr = self.url + page + '&cflt=' + self.style
            html = requests.get(url_curr, headers={'User-Agent': self.UA})
            soup = BeautifulSoup(html.content, 'lxml')

            restaurants_div = soup.find_all('div', {'class': 'biz-listing-large'})
            for restaurant in restaurants_div:
                name = self.scrape_name(restaurant)
                price = self.scrape_price(restaurant)
                category = self.scrape_category(restaurant)
                star = self.scrape_star(restaurant)
                review_count = self.scrape_review(restaurant)
                address = self.scrape_address(restaurant)
                with open(file, 'a', newline='') as f:  # mode 'a' appends
                    writer = csv.writer(f)
                    writer.writerow([name, price, review_count, star, category, address])
            print('Finished scraping: Page %s' % (i))

url = 'https://www.yelp.com/search?find_desc=restaurants&find_loc=Seattle,+WA&start='
num_restaurant = 300
styles = ['chinese', 'japanese', 'thai', 'vietnamese', 'india',
          'American (New)', 'American (Traditional)']
scraper = YelpSeattleRestaurant(url, num_restaurant, 'vietnamese')
scraper.write_csv()

2. Visualization in R

We have scraped the restaurant information with Python and stored it in separate .csv files; now we use R (ggplot2) to visualize the relationships between the variables.

  • Read and combine the data from the separate files, and create a new column to indicate the style (country) of each restaurant. Note that one restaurant may actually have more than one style, so the categories overlap: a restaurant may be labeled as both Chinese and Thai, for example. We keep these overlaps and allow a restaurant to appear in multiple categories.
styles <- c('chinese', 'japanese', 'thai', 'vietnamese')
read_data <- function(styles){
  data <- data.frame()
  for(s in styles){
    file <- paste0('yelp_', s, '.csv')
    temp <- read.csv(file, header = TRUE)
    temp$Style <- s
    data <- rbind(data, temp)
  }
  data$Style <- factor(data$Style)
  data <- data[!is.na(data$Star), ]  # drop restaurants with no rating
  return(data)
}
data <- read_data(styles)
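The `read_data` function reads each per-style CSV, tags its rows with a `Style` column, and drops unrated restaurants. The same combine-and-label step can be sketched in plain Python with the standard library; the sample rows below are hypothetical stand-ins for the scraped files:

```python
import csv
import io

# hypothetical stand-ins for the scraped yelp_<style>.csv files
sample_csvs = {
    'chinese': "Name,Price,Review_count,Star\nA,$$,120,3.5\nB,$,40,\n",
    'japanese': "Name,Price,Review_count,Star\nC,$$$,300,4.0\n",
}

data = []
for style, text in sample_csvs.items():
    for row in csv.DictReader(io.StringIO(text)):
        row['Style'] = style      # label each row with its cuisine
        if row['Star']:           # keep only rated restaurants
            data.append(row)

print([r['Name'] for r in data])  # ['A', 'C']
```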

Question 1: Among Asian cuisines, which are expensive and which are cheap?

In the low-end market, Vietnamese restaurants hold a larger share than their overall ratio would suggest. Chinese food performs neither better nor worse than average: Chinese restaurants are numerous, but most are in the $-$$ range. What surprised me is that the high-end market is totally dominated by Japanese food. Except for one Vietnamese restaurant, every restaurant at three dollar signs or above is Japanese.

library(ggplot2)
library(dplyr)

data %>%
  group_by(Style) %>%
  summarize(Num_restaurant = n(), Ratio_restaurant = n()/nrow(data))
## # A tibble: 4 × 3
##        Style Num_restaurant Ratio_restaurant
##       <fctr>          <int>            <dbl>
## 1    chinese            427        0.2883187
## 2   japanese            498        0.3362593
## 3       thai            265        0.1789332
## 4 vietnamese            291        0.1964889
p0 <- ggplot(data, aes(x=Price, fill=Style)) + geom_bar(position='fill') + 
  labs(title = 'Percentage of Restaurants for Each Price')
p0

[Figure: percentage of restaurants at each price level, by style]

Question 2: Which Asian food is more popular (higher average rating and more reviews)?

  • It's astounding that Thai and Vietnamese food have the highest average review counts and rating stars.
  • Chinese food has the lowest average rating. Although there are many Chinese restaurants, the standard deviation of their ratings is normal and consistent with the other cuisines.
  • We could infer that Chinese food indeed has lower acceptance in Seattle. I guess the cause is the wide gap in food culture between East and West. In future work I will analyze the reviews with NLP techniques to reveal why Chinese food scores so low and whether its evaluations are biased or underestimated.
data %>% 
  group_by(Style) %>%
  summarize(avg_num_Review = mean(Review_count), avg_num_Star = mean(Star), std_Star= sd(Star))
## # A tibble: 4 × 4
##        Style avg_num_Review avg_num_Star  std_Star
##       <fctr>          <dbl>        <dbl>     <dbl>
## 1    chinese       124.1967     3.447307 0.5744244
## 2   japanese       132.0301     3.653614 0.5753111
## 3       thai       156.8755     3.675472 0.5403792
## 4 vietnamese       119.3333     3.704467 0.5830484
p1 <- ggplot(data, aes(x=Star, fill=Style)) + geom_bar(position='fill') + 
  labs(title = 'Percentage of Restaurants for Each Star')
p2 <- ggplot(data, aes(x=log(Review_count+1), fill=Style)) + geom_density(alpha=0.5) + 
  labs(title = 'Density of log(Review_count + 1)')
p1; p2

[Figures: percentage of restaurants at each star rating, by style; density of log(Review_count + 1)]

Question 3: Overall visualization and conclusions.

The size of each point indicates the number of reviews (I took the square root so the sizes are easier to compare). Each color indicates one type of food. From the plot we can see:

  • Most restaurants sit in the $-$$ and 3.5-4 star region. Thai and Vietnamese food look more popular there, consistent with the earlier analysis.
  • Japanese food dominates the high-end market. A handful of high-end restaurants, with expensive menus and the highest ratings, give customers the impression that Japanese food is excellent.
  • Chinese food is cheap, rated low, and receives little attention.

Intuitive Conclusion:

  • Japanese food is expensive and highly rated.
  • Thai and Vietnamese food is cheap and highly rated.
  • Chinese food is cheap and poorly rated.
# Price vs. Star, use Review_count as point size.
set.seed(2016)
p3 <- ggplot(data, aes(x=Star, y=Price, col=Style)) + 
  geom_point(position = 'jitter', aes(size=sqrt(Review_count)), alpha=0.4) + 
  facet_wrap(~Style, nrow=2) + 
  labs(title = 'Price vs. Star')
set.seed(2016)
p4 <- ggplot(data, aes(x=Star, y=Price, col=Style)) + 
  geom_point(position = 'jitter', aes(size=sqrt(Review_count)), alpha=0.3) + 
  labs(title = 'Price vs. Star in One Figure')
p3; p4

[Figures: Price vs. Star, faceted by style; Price vs. Star in one figure]

