Assignment 5: Exploring Yelp Reviews in Philadelphia¶
In this assignment, we'll explore restaurant review data available through the Yelp Dataset Challenge. The dataset includes Yelp data for user reviews and business information for many metropolitan areas. I've already downloaded this dataset (8 GB total!) and extracted out the data files for reviews and restaurants in Philadelphia. I've placed these data files into the data directory in this repository.
This assignment is broken into two parts:
Part 1: Analyzing correlations between restaurant reviews and census data
We'll explore the relationship between restaurant reviews and the income levels of the restaurant's surrounding area.
Part 2: Exploring the impact of fast food restaurants
We'll run a sentiment analysis on reviews of fast food restaurants and estimate income levels in neighborhoods with fast food restaurants. We'll test how well our sentiment analysis works by comparing the number of stars to the sentiment of reviews.
Background readings
1. Correlating restaurant ratings and income levels¶
In this part, we'll use the census API to download household income data and explore how it correlates with restaurant review data.
import geopandas as gpd
import holoviews as hv
import hvplot.pandas
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import cenpy
import json
!pip install transformers
!pip install torch transformers
C:\Users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages\libpysal\cg\alpha_shapes.py:39: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details. def nb_dist(x, y): C:\Users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages\libpysal\cg\alpha_shapes.py:165: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details. def get_faces(triangle): C:\Users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages\libpysal\cg\alpha_shapes.py:199: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details. def build_faces(faces, triangles_is, num_triangles, num_faces_single): C:\Users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages\libpysal\cg\alpha_shapes.py:261: NumbaDeprecationWarning: The 'nopython' keyword argument was not supplied to the 'numba.jit' decorator. The implicit default value for this argument is currently False, but it will be changed to True in Numba 0.59.0. See https://numba.readthedocs.io/en/stable/reference/deprecation.html#deprecation-of-object-mode-fall-back-behaviour-when-using-jit for details. def nb_mask_faces(mask, faces):
Requirement already satisfied: transformers in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (4.36.0) Requirement already satisfied: filelock in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from transformers) (3.13.1) Requirement already satisfied: huggingface-hub<1.0,>=0.19.3 in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from transformers) (0.19.4) Requirement already satisfied: numpy>=1.17 in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from transformers) (1.24.4) Requirement already satisfied: packaging>=20.0 in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from transformers) (23.1) Requirement already satisfied: pyyaml>=5.1 in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from transformers) (6.0.1) Requirement already satisfied: regex!=2019.12.17 in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from transformers) (2023.8.8) Requirement already satisfied: requests in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from transformers) (2.31.0) Requirement already satisfied: tokenizers<0.19,>=0.14 in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from transformers) (0.15.0) Requirement already satisfied: safetensors>=0.3.1 in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from transformers) (0.4.1) Requirement already satisfied: tqdm>=4.27 in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from transformers) (4.65.0) Requirement already satisfied: fsspec>=2023.5.0 in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from huggingface-hub<1.0,>=0.19.3->transformers) (2023.6.0) Requirement already satisfied: typing-extensions>=3.7.4.3 in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from huggingface-hub<1.0,>=0.19.3->transformers) (4.8.0) Requirement already satisfied: colorama in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from tqdm>=4.27->transformers) (0.4.6) Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from requests->transformers) (3.2.0) Requirement already satisfied: idna<4,>=2.5 in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from requests->transformers) (3.4) Requirement already satisfied: urllib3<3,>=1.21.1 in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from requests->transformers) (1.26.15) Requirement already satisfied: certifi>=2017.4.17 in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from requests->transformers) (2023.7.22) Requirement already satisfied: torch in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (2.1.1) Requirement already satisfied: transformers in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (4.36.0) Requirement already satisfied: filelock in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from torch) (3.13.1) Requirement already satisfied: typing-extensions in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from torch) (4.8.0) Requirement already satisfied: sympy in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from torch) (1.12) Requirement already satisfied: networkx in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from torch) (3.1) Requirement already satisfied: jinja2 in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from torch) (3.1.2) Requirement already satisfied: fsspec in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from torch) (2023.6.0) Requirement already satisfied: huggingface-hub<1.0,>=0.19.3 in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from transformers) (0.19.4) Requirement already satisfied: numpy>=1.17 in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from transformers) (1.24.4) Requirement already satisfied: packaging>=20.0 in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from transformers) (23.1) Requirement already satisfied: pyyaml>=5.1 in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from transformers) (6.0.1) Requirement already satisfied: regex!=2019.12.17 in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from transformers) (2023.8.8) Requirement already satisfied: requests in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from transformers) (2.31.0) Requirement already satisfied: tokenizers<0.19,>=0.14 in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from transformers) (0.15.0) Requirement already satisfied: safetensors>=0.3.1 in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from transformers) (0.4.1) Requirement already satisfied: tqdm>=4.27 in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from transformers) (4.65.0) Requirement already satisfied: colorama in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from tqdm>=4.27->transformers) (0.4.6) Requirement already satisfied: MarkupSafe>=2.0 in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from jinja2->torch) (2.1.3) Requirement already satisfied: charset-normalizer<4,>=2 in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from requests->transformers) (3.2.0) Requirement already satisfied: idna<4,>=2.5 in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from requests->transformers) (3.4) Requirement already satisfied: urllib3<3,>=1.21.1 in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from requests->transformers) (1.26.15) Requirement already satisfied: certifi>=2017.4.17 in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from requests->transformers) (2023.7.22) Requirement already satisfied: mpmath>=0.19 in c:\users\cruse\mambaforge1\envs\musa-550-fall-2023\lib\site-packages (from sympy->torch) (1.3.0)
1.1 Query the Census API¶
Use the cenpy package to download median household income in the past 12 months by census tract from the 2021 ACS 5-year data set for your county of interest.
You have two options to find the correct variable names:
- Search through: https://api.census.gov/data/2021/acs/acs5/variables.html
- Initialize an API connection and use the
.varslike()function to search for the proper keywords
At the end of this step, you should have a pandas DataFrame holding the income data for all census tracts within the county being analyzed. Feel free to rename your variable from the ACS so it has a more meaningful name!
::: {.callout-caution} Some census tracts won't have any value because there are not enough households in that tract. The census will use a negative number as a default value for those tracts. You can safely remove those tracts from the analysis! :::
variables = [
"NAME",
"B19013_001E"
]
acs = cenpy.remote.APIConnection("ACSDT5Y2021")
counties = cenpy.explorer.fips_table("COUNTY")
philly_county_code = "101"
PA_state_code = "42"
philly_data = acs.query(
cols=variables,
geo_unit="block group:*",
geo_filter={"state": PA_state_code, "county": philly_county_code, "tract": "*"},
)
1.2 Download census tracts from the Census and merge the data from part 1.1¶
- Download census tracts for the desired geography using the
pygrispackage - Merge the downloaded census tracts with the household income DataFrame
censustracts = gpd.read_file('Census_Tracts_2010.geojson')
censustracts.rename(columns={'TRACTCE10':'tract'}, inplace=True)
MEDHHINC = gpd.GeoDataFrame(pd.merge(censustracts, philly_data, on='tract')).reset_index(drop=True)
MEDHHINC['B19013_001E'] = pd.to_numeric(MEDHHINC['B19013_001E'], errors='coerce')
MEDHHINC = MEDHHINC[MEDHHINC['B19013_001E'] >= 0]
MEDHHINC.head()
| OBJECTID | STATEFP10 | COUNTYFP10 | tract | GEOID10 | NAME10 | NAMELSAD10 | MTFCC10 | FUNCSTAT10 | ALAND10 | AWATER10 | INTPTLAT10 | INTPTLON10 | LOGRECNO | geometry | NAME | B19013_001E | state | county | block group | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 42 | 101 | 009400 | 42101009400 | 94 | Census Tract 94 | G5020 | S | 366717 | 0 | +39.9632709 | -075.2322437 | 10429 | POLYGON ((-75.22927 39.96054, -75.22865 39.960... | Block Group 1, Census Tract 94, Philadelphia C... | 11012 | 42 | 101 | 1 |
| 1 | 1 | 42 | 101 | 009400 | 42101009400 | 94 | Census Tract 94 | G5020 | S | 366717 | 0 | +39.9632709 | -075.2322437 | 10429 | POLYGON ((-75.22927 39.96054, -75.22865 39.960... | Block Group 2, Census Tract 94, Philadelphia C... | 27019 | 42 | 101 | 2 |
| 2 | 1 | 42 | 101 | 009400 | 42101009400 | 94 | Census Tract 94 | G5020 | S | 366717 | 0 | +39.9632709 | -075.2322437 | 10429 | POLYGON ((-75.22927 39.96054, -75.22865 39.960... | Block Group 3, Census Tract 94, Philadelphia C... | 35457 | 42 | 101 | 3 |
| 6 | 2 | 42 | 101 | 009500 | 42101009500 | 95 | Census Tract 95 | G5020 | S | 319070 | 0 | +39.9658709 | -075.2379140 | 10430 | POLYGON ((-75.23536 39.96852, -75.23545 39.969... | Block Group 3, Census Tract 95, Philadelphia C... | 32839 | 42 | 101 | 3 |
| 7 | 3 | 42 | 101 | 009600 | 42101009600 | 96 | Census Tract 96 | G5020 | S | 405273 | 0 | +39.9655396 | -075.2435075 | 10431 | POLYGON ((-75.24343 39.96230, -75.24339 39.962... | Block Group 1, Census Tract 96, Philadelphia C... | 60271 | 42 | 101 | 1 |
1.3 Load the restaurants data¶
The Yelp dataset includes data for 7,350 restaurants across the city. Load the data from the data/ folder and use the latitude and longitude columns to create a GeoDataFrame after loading the JSON data. Be sure to set the right CRS on when initializing the GeoDataFrame!
Notes
The JSON data is in a "records" format. To load it, you'll need to pass the following keywords:
orient='records'lines=True
restaurants = pd.read_json('restaurants_philly.json', orient='records', lines=True)
restaurants = gpd.GeoDataFrame(restaurants,
geometry=gpd.points_from_xy(restaurants['longitude'], restaurants['latitude']),
crs='EPSG:4326')
restaurants.head()
| business_id | latitude | longitude | name | review_count | stars | categories | geometry | |
|---|---|---|---|---|---|---|---|---|
| 0 | MTSW4McQd7CbVtyjqoe9mw | 39.955505 | -75.155564 | St Honore Pastries | 80 | 4.0 | Restaurants, Food, Bubble Tea, Coffee & Tea, B... | POINT (-75.15556 39.95551) |
| 1 | MUTTqe8uqyMdBl186RmNeA | 39.953949 | -75.143226 | Tuna Bar | 245 | 4.0 | Sushi Bars, Restaurants, Japanese | POINT (-75.14323 39.95395) |
| 2 | ROeacJQwBeh05Rqg7F6TCg | 39.943223 | -75.162568 | BAP | 205 | 4.5 | Korean, Restaurants | POINT (-75.16257 39.94322) |
| 3 | QdN72BWoyFypdGJhhI5r7g | 39.939825 | -75.157447 | Bar One | 65 | 4.0 | Cocktail Bars, Bars, Italian, Nightlife, Resta... | POINT (-75.15745 39.93982) |
| 4 | Mjboz24M9NlBeiOJKLEd_Q | 40.022466 | -75.218314 | DeSandro on Main | 41 | 3.0 | Pizza, Restaurants, Salad, Soup | POINT (-75.21831 40.02247) |
1.4 Add tract info for each restaurant¶
Do a spatial join to identify which census tract each restaurant is within. Make sure each dataframe has the same CRS!
At the end of this step, you should have a new dataframe with a column identifying the tract number for each restaurant.
censustracts = censustracts.to_crs('EPSG:4326')
restaurants = restaurants.to_crs('EPSG:4326')
restaurants_gdf = gpd.sjoin(restaurants, censustracts, how = "left", predicate="within").reset_index(drop=True)
del restaurants_gdf['index_right']
restaurants_gdf.head()
| business_id | latitude | longitude | name | review_count | stars | categories | geometry | OBJECTID | STATEFP10 | ... | GEOID10 | NAME10 | NAMELSAD10 | MTFCC10 | FUNCSTAT10 | ALAND10 | AWATER10 | INTPTLAT10 | INTPTLON10 | LOGRECNO | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MTSW4McQd7CbVtyjqoe9mw | 39.955505 | -75.155564 | St Honore Pastries | 80 | 4.0 | Restaurants, Food, Bubble Tea, Coffee & Tea, B... | POINT (-75.15556 39.95551) | 191 | 42 | ... | 42101000200 | 2 | Census Tract 2 | G5020 | S | 382478 | 0 | +39.9553999 | -075.1569775 | 10336 |
| 1 | MUTTqe8uqyMdBl186RmNeA | 39.953949 | -75.143226 | Tuna Bar | 245 | 4.0 | Sushi Bars, Restaurants, Japanese | POINT (-75.14323 39.95395) | 190 | 42 | ... | 42101000100 | 1 | Census Tract 1 | G5020 | S | 704909 | 0 | +39.9523827 | -075.1466628 | 10335 |
| 2 | ROeacJQwBeh05Rqg7F6TCg | 39.943223 | -75.162568 | BAP | 205 | 4.5 | Korean, Restaurants | POINT (-75.16257 39.94322) | 33 | 42 | ... | 42101001500 | 15 | Census Tract 15 | G5020 | S | 239383 | 0 | +39.9419037 | -075.1591158 | 10356 |
| 3 | QdN72BWoyFypdGJhhI5r7g | 39.939825 | -75.157447 | Bar One | 65 | 4.0 | Cocktail Bars, Bars, Italian, Nightlife, Resta... | POINT (-75.15745 39.93982) | 36 | 42 | ... | 42101001800 | 18 | Census Tract 18 | G5020 | S | 242483 | 0 | +39.9399998 | -075.1593091 | 10359 |
| 4 | Mjboz24M9NlBeiOJKLEd_Q | 40.022466 | -75.218314 | DeSandro on Main | 41 | 3.0 | Pizza, Restaurants, Salad, Soup | POINT (-75.21831 40.02247) | 85 | 42 | ... | 42101021000 | 210 | Census Tract 210 | G5020 | S | 858223 | 55719 | +40.0244730 | -075.2151045 | 10536 |
5 rows × 22 columns
1.5 Add income data to your restaurant data¶
Add the income data to your dataframe from the previous step, merging the census data based on the tract that each restaurant is within.
merged_gdf = gpd.sjoin(restaurants_gdf, MEDHHINC, how = "left", predicate="within")
merged_gdf.head()
| business_id | latitude | longitude | name | review_count | stars | categories | geometry | OBJECTID_left | STATEFP10_left | ... | ALAND10_right | AWATER10_right | INTPTLAT10_right | INTPTLON10_right | LOGRECNO_right | NAME | B19013_001E | state | county | block group | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MTSW4McQd7CbVtyjqoe9mw | 39.955505 | -75.155564 | St Honore Pastries | 80 | 4.0 | Restaurants, Food, Bubble Tea, Coffee & Tea, B... | POINT (-75.15556 39.95551) | 191 | 42 | ... | 382478.0 | 0.0 | +39.9553999 | -075.1569775 | 10336 | Block Group 1, Census Tract 2, Philadelphia Co... | 42308.0 | 42 | 101 | 1 |
| 0 | MTSW4McQd7CbVtyjqoe9mw | 39.955505 | -75.155564 | St Honore Pastries | 80 | 4.0 | Restaurants, Food, Bubble Tea, Coffee & Tea, B... | POINT (-75.15556 39.95551) | 191 | 42 | ... | 382478.0 | 0.0 | +39.9553999 | -075.1569775 | 10336 | Block Group 2, Census Tract 2, Philadelphia Co... | 250001.0 | 42 | 101 | 2 |
| 1 | MUTTqe8uqyMdBl186RmNeA | 39.953949 | -75.143226 | Tuna Bar | 245 | 4.0 | Sushi Bars, Restaurants, Japanese | POINT (-75.14323 39.95395) | 190 | 42 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | ROeacJQwBeh05Rqg7F6TCg | 39.943223 | -75.162568 | BAP | 205 | 4.5 | Korean, Restaurants | POINT (-75.16257 39.94322) | 33 | 42 | ... | 239383.0 | 0.0 | +39.9419037 | -075.1591158 | 10356 | Block Group 1, Census Tract 15, Philadelphia C... | 66290.0 | 42 | 101 | 1 |
| 2 | ROeacJQwBeh05Rqg7F6TCg | 39.943223 | -75.162568 | BAP | 205 | 4.5 | Korean, Restaurants | POINT (-75.16257 39.94322) | 33 | 42 | ... | 239383.0 | 0.0 | +39.9419037 | -075.1591158 | 10356 | Block Group 2, Census Tract 15, Philadelphia C... | 107292.0 | 42 | 101 | 2 |
5 rows × 42 columns
1.6 Make a plot of median household income vs. Yelp stars¶
Our dataset has the number of stars for each restaurant, rounded to the nearest 0.5 star. In this step, create a line plot that shows the average income value for each stars category (e.g., all restaurants with 1 star, 1.5 stars, 2 stars, etc.)
While their are multiple ways to do this, the seaborn.lineplot() is a great option. This can show the average value in each category as well as 95% uncertainty intervals. Use this function to plot the stars ("x") vs. average income ("y") for all of our restaurants, using the dataframe from last step. Be sure to format your figure to make it look nice!
Question: Is there a correlation between a restaurant's ratings and the income levels of its surrounding neighborhood?
grouped_df = merged_gdf.groupby('stars')['B19013_001E'].mean().reset_index()
grouped_df.head()
| stars | B19013_001E | |
|---|---|---|
| 0 | 1.0 | 55197.340659 |
| 1 | 1.5 | 59296.992105 |
| 2 | 2.0 | 65606.081140 |
| 3 | 2.5 | 71007.582278 |
| 4 | 3.0 | 75801.154528 |
ax = sns.lineplot(
data=grouped_df,
x='stars',
y='B19013_001E',
estimator= 'mean',
errorbar=('ci', 95))
ax.set(xlabel='Ratings', ylabel='Average Income')
ax.set_title('Ratings vs. Average Income in Census Block Group')
Text(0.5, 1.0, 'Ratings vs. Average Income in Census Block Group')
2. Fast food trends in Philadelphia¶
At the end of part 1, you should have seen a strong trend where higher income tracts generally had restaurants with better reviews. In this section, we'll explore the impact of fast food restaurants and how they might be impacting this trend.
Hypothesis
- Fast food restaurants are predominantly located in areas with lower median income levels.
- Fast food restaurants have worse reviews compared to typical restaurants.
If true, these two hypotheses could help to explain the trend we found in part 1. Let's dive in and test our hypotheses!
2.1 Identify fast food restaurants¶
The "categories" column in our dataset contains multiple classifications for each restaurant. One such category is "Fast Food". In this step, add a new column called "is_fast_food" that is True if the "categories" column contains the term "Fast Food" and False otherwise
merged_gdf['is_fast_food'] = merged_gdf['categories'].str.contains('Fast Food', case=False)
merged_gdf.head()
| business_id | latitude | longitude | name | review_count | stars | categories | geometry | OBJECTID_left | STATEFP10_left | ... | AWATER10_right | INTPTLAT10_right | INTPTLON10_right | LOGRECNO_right | NAME | B19013_001E | state | county | block group | is_fast_food | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MTSW4McQd7CbVtyjqoe9mw | 39.955505 | -75.155564 | St Honore Pastries | 80 | 4.0 | Restaurants, Food, Bubble Tea, Coffee & Tea, B... | POINT (-75.15556 39.95551) | 191 | 42 | ... | 0.0 | +39.9553999 | -075.1569775 | 10336 | Block Group 1, Census Tract 2, Philadelphia Co... | 42308.0 | 42 | 101 | 1 | False |
| 0 | MTSW4McQd7CbVtyjqoe9mw | 39.955505 | -75.155564 | St Honore Pastries | 80 | 4.0 | Restaurants, Food, Bubble Tea, Coffee & Tea, B... | POINT (-75.15556 39.95551) | 191 | 42 | ... | 0.0 | +39.9553999 | -075.1569775 | 10336 | Block Group 2, Census Tract 2, Philadelphia Co... | 250001.0 | 42 | 101 | 2 | False |
| 1 | MUTTqe8uqyMdBl186RmNeA | 39.953949 | -75.143226 | Tuna Bar | 245 | 4.0 | Sushi Bars, Restaurants, Japanese | POINT (-75.14323 39.95395) | 190 | 42 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False |
| 2 | ROeacJQwBeh05Rqg7F6TCg | 39.943223 | -75.162568 | BAP | 205 | 4.5 | Korean, Restaurants | POINT (-75.16257 39.94322) | 33 | 42 | ... | 0.0 | +39.9419037 | -075.1591158 | 10356 | Block Group 1, Census Tract 15, Philadelphia C... | 66290.0 | 42 | 101 | 1 | False |
| 2 | ROeacJQwBeh05Rqg7F6TCg | 39.943223 | -75.162568 | BAP | 205 | 4.5 | Korean, Restaurants | POINT (-75.16257 39.94322) | 33 | 42 | ... | 0.0 | +39.9419037 | -075.1591158 | 10356 | Block Group 2, Census Tract 15, Philadelphia C... | 107292.0 | 42 | 101 | 2 | False |
5 rows × 43 columns
2.2 Calculate the median income for fast food and otherwise¶
Group by the "is_fast_food" column and calculate the median income for restaurants that are and are not fast food. You should find that income levels are lower in tracts with fast food.
Note: this is just an estimate, since we are calculating a median of median income values.
grouped_gdf2 = merged_gdf.groupby('is_fast_food')['B19013_001E'].median().reset_index()
grouped_gdf2.head()
| is_fast_food | B19013_001E | |
|---|---|---|
| 0 | False | 69783.0 |
| 1 | True | 55773.5 |
2.3 Load fast food review data¶
In the rest of part 2, we're going to run a sentiment analysis on the reviews for fast food restaurants.
The review data for all fast food restaurants identified in part 2.1 is already stored in the data/ folder. The data is stored as a JSON file and you can use pandas.read_json to load it.
Notes
The JSON data is in a "records" format. To load it, you'll need to pass the following keywords:
orient='records'lines=True
reviews = pd.read_json('reviews_philly_fast_food.json', orient='records', lines=True)
merged_gdf
| business_id | latitude | longitude | name | review_count | stars | categories | geometry | OBJECTID_left | STATEFP10_left | ... | AWATER10_right | INTPTLAT10_right | INTPTLON10_right | LOGRECNO_right | NAME | B19013_001E | state | county | block group | is_fast_food | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MTSW4McQd7CbVtyjqoe9mw | 39.955505 | -75.155564 | St Honore Pastries | 80 | 4.0 | Restaurants, Food, Bubble Tea, Coffee & Tea, B... | POINT (-75.15556 39.95551) | 191 | 42 | ... | 0.0 | +39.9553999 | -075.1569775 | 10336 | Block Group 1, Census Tract 2, Philadelphia Co... | 42308.0 | 42 | 101 | 1 | False |
| 0 | MTSW4McQd7CbVtyjqoe9mw | 39.955505 | -75.155564 | St Honore Pastries | 80 | 4.0 | Restaurants, Food, Bubble Tea, Coffee & Tea, B... | POINT (-75.15556 39.95551) | 191 | 42 | ... | 0.0 | +39.9553999 | -075.1569775 | 10336 | Block Group 2, Census Tract 2, Philadelphia Co... | 250001.0 | 42 | 101 | 2 | False |
| 1 | MUTTqe8uqyMdBl186RmNeA | 39.953949 | -75.143226 | Tuna Bar | 245 | 4.0 | Sushi Bars, Restaurants, Japanese | POINT (-75.14323 39.95395) | 190 | 42 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False |
| 2 | ROeacJQwBeh05Rqg7F6TCg | 39.943223 | -75.162568 | BAP | 205 | 4.5 | Korean, Restaurants | POINT (-75.16257 39.94322) | 33 | 42 | ... | 0.0 | +39.9419037 | -075.1591158 | 10356 | Block Group 1, Census Tract 15, Philadelphia C... | 66290.0 | 42 | 101 | 1 | False |
| 2 | ROeacJQwBeh05Rqg7F6TCg | 39.943223 | -75.162568 | BAP | 205 | 4.5 | Korean, Restaurants | POINT (-75.16257 39.94322) | 33 | 42 | ... | 0.0 | +39.9419037 | -075.1591158 | 10356 | Block Group 2, Census Tract 15, Philadelphia C... | 107292.0 | 42 | 101 | 2 | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7348 | 8n93L-ilMAsvwUatarykSg | 39.951018 | -75.198240 | Kitchen Gia | 22 | 3.0 | Coffee & Tea, Food, Sandwiches, American (Trad... | POINT (-75.19824 39.95102) | 378 | 42 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | False |
| 7349 | WnT9NIzQgLlILjPT0kEcsQ | 39.935982 | -75.158665 | Adelita Taqueria & Restaurant | 35 | 4.5 | Restaurants, Mexican | POINT (-75.15866 39.93598) | 42 | 42 | ... | 0.0 | +39.9367634 | -075.1595100 | 10365 | Block Group 1, Census Tract 24, Philadelphia C... | 117500.0 | 42 | 101 | 1 | False |
| 7349 | WnT9NIzQgLlILjPT0kEcsQ | 39.935982 | -75.158665 | Adelita Taqueria & Restaurant | 35 | 4.5 | Restaurants, Mexican | POINT (-75.15866 39.93598) | 42 | 42 | ... | 0.0 | +39.9367634 | -075.1595100 | 10365 | Block Group 2, Census Tract 24, Philadelphia C... | 135139.0 | 42 | 101 | 2 | False |
| 7349 | WnT9NIzQgLlILjPT0kEcsQ | 39.935982 | -75.158665 | Adelita Taqueria & Restaurant | 35 | 4.5 | Restaurants, Mexican | POINT (-75.15866 39.93598) | 42 | 42 | ... | 0.0 | +39.9367634 | -075.1595100 | 10365 | Block Group 3, Census Tract 24, Philadelphia C... | 66313.0 | 42 | 101 | 3 | False |
| 7349 | WnT9NIzQgLlILjPT0kEcsQ | 39.935982 | -75.158665 | Adelita Taqueria & Restaurant | 35 | 4.5 | Restaurants, Mexican | POINT (-75.15866 39.93598) | 42 | 42 | ... | 0.0 | +39.9367634 | -075.1595100 | 10365 | Block Group 4, Census Tract 24, Philadelphia C... | 82606.0 | 42 | 101 | 4 | False |
17818 rows × 43 columns
2.4 Trim to the most popular fast food restaurants¶
There's too many reviews to run a sentiment analysis on all of them in a reasonable time. Let's trim our reviews dataset to the most popular fast food restaurants, using the list provided below.
You will need to get the "business_id" values for each of these restaurants from the restaurants data loaded in part 1.3. Then, trim the reviews data to include reviews only for those business IDs.
popular_fast_food = [
"McDonald's",
"Wendy's",
"Subway",
"Popeyes Louisiana Kitchen",
"Taco Bell",
"KFC",
"Burger King",
]
business_id = merged_gdf[['business_id','name']]
reviews = pd.merge(business_id, reviews, on = 'business_id')
reviews.head()
| business_id | name | review_id | stars | text | |
|---|---|---|---|---|---|
| 0 | O1oZpbZNDMH_gz8DhsZCdA | Wendy's | WtP3kl8xVOcoJOYyxkGfIg | 1 | This Wendy's is the worst Wendy's to go to . T... |
| 1 | O1oZpbZNDMH_gz8DhsZCdA | Wendy's | 7l__He_d8pHrEAAZZv2l0Q | 1 | This location has the slowest service ever eve... |
| 2 | O1oZpbZNDMH_gz8DhsZCdA | Wendy's | WSIe14YeSYUYqTdx6UDoFw | 1 | I got a burnt baked potato no friends ! Cook ... |
| 3 | O1oZpbZNDMH_gz8DhsZCdA | Wendy's | kc9ABqXrw1-91xP6G0KWmQ | 1 | The restaurant was extremely dirty. All the ta... |
| 4 | O1oZpbZNDMH_gz8DhsZCdA | Wendy's | ekdJfqlBxQf5oaZvaKQ8Zg | 1 | Eww I dont like Wendy's it looks nasty as hell... |
reviews_filtered = reviews[reviews['name'].isin(popular_fast_food)]
reviews_filtered.head()
| business_id | name | review_id | stars | text | |
|---|---|---|---|---|---|
| 0 | O1oZpbZNDMH_gz8DhsZCdA | Wendy's | WtP3kl8xVOcoJOYyxkGfIg | 1 | This Wendy's is the worst Wendy's to go to . T... |
| 1 | O1oZpbZNDMH_gz8DhsZCdA | Wendy's | 7l__He_d8pHrEAAZZv2l0Q | 1 | This location has the slowest service ever eve... |
| 2 | O1oZpbZNDMH_gz8DhsZCdA | Wendy's | WSIe14YeSYUYqTdx6UDoFw | 1 | I got a burnt baked potato no friends ! Cook ... |
| 3 | O1oZpbZNDMH_gz8DhsZCdA | Wendy's | kc9ABqXrw1-91xP6G0KWmQ | 1 | The restaurant was extremely dirty. All the ta... |
| 4 | O1oZpbZNDMH_gz8DhsZCdA | Wendy's | ekdJfqlBxQf5oaZvaKQ8Zg | 1 | Eww I dont like Wendy's it looks nasty as hell... |
2.5 Run the emotions classifier on fast food reviews¶
Run a sentiment analysis on the reviews data from the previous step. Use the DistilBERT model that can predict emotion labels (anger, fear, sadness, joy, love, and surprise). Transform the result from the classifier into a DataFrame so that you have a column for each of the emotion labels.
from transformers import pipeline
import torch
text_data = reviews_filtered['text'].tolist()
model = "bhadresh-savani/distilbert-base-uncased-emotion"
emotion_classifier = pipeline(
task="text-classification",
model=model,
tokenizer=model,
truncation=True,
)
emotion_scores = emotion_classifier(text_data)
emotion_df = pd.DataFrame(emotion_scores)
emotion_df['text'] = text_data
print(emotion_df.columns)
Index(['label', 'score', 'text'], dtype='object')
emotion_df_no_duplicates = emotion_df.drop_duplicates(subset='text')
emotion_pivot = emotion_df_no_duplicates.pivot(index='text', columns='label', values='score').reset_index()
emotion_pivot.columns.name = None # Remove the 'label' name for columns
emotion_pivot.head()
| text | anger | fear | joy | love | sadness | surprise | |
|---|---|---|---|---|---|---|---|
| 0 | "Meh. I've experienced better." That's what ... | NaN | NaN | 0.997238 | NaN | NaN | NaN |
| 1 | "No burgers in Burger King!" I went to this Bu... | NaN | NaN | NaN | 0.881435 | NaN | NaN |
| 2 | "Open 24 hours"... or when we feel like it. Th... | NaN | NaN | NaN | NaN | 0.994735 | NaN |
| 3 | "Why the fuck you add salad dressing to the ba... | 0.997625 | NaN | NaN | NaN | NaN | NaN |
| 4 | $.69 for any size coffee you say? Breakfast s... | NaN | NaN | NaN | NaN | 0.898883 | NaN |
FF_emotions = pd.merge(reviews_filtered, emotion_pivot, on='text', how='left')
FF_emotions['prediction'] = FF_emotions[['anger', 'fear', 'sadness', 'joy', 'surprise', 'love']].idxmax(axis=1)
FF_emotions.head()
| business_id | name | review_id | stars | text | anger | fear | joy | love | sadness | surprise | prediction | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | O1oZpbZNDMH_gz8DhsZCdA | Wendy's | WtP3kl8xVOcoJOYyxkGfIg | 1 | This Wendy's is the worst Wendy's to go to . T... | NaN | NaN | NaN | NaN | 0.995998 | NaN | sadness |
| 1 | O1oZpbZNDMH_gz8DhsZCdA | Wendy's | 7l__He_d8pHrEAAZZv2l0Q | 1 | This location has the slowest service ever eve... | 0.646855 | NaN | NaN | NaN | NaN | NaN | anger |
| 2 | O1oZpbZNDMH_gz8DhsZCdA | Wendy's | WSIe14YeSYUYqTdx6UDoFw | 1 | I got a burnt baked potato no friends ! Cook ... | 0.890414 | NaN | NaN | NaN | NaN | NaN | anger |
| 3 | O1oZpbZNDMH_gz8DhsZCdA | Wendy's | kc9ABqXrw1-91xP6G0KWmQ | 1 | The restaurant was extremely dirty. All the ta... | NaN | NaN | NaN | NaN | 0.642099 | NaN | sadness |
| 4 | O1oZpbZNDMH_gz8DhsZCdA | Wendy's | ekdJfqlBxQf5oaZvaKQ8Zg | 1 | Eww I dont like Wendy's it looks nasty as hell... | 0.986700 | NaN | NaN | NaN | NaN | NaN | anger |
FF_emotions.groupby("prediction").size().plot(kind='barh');
2.7 Combine the ratings and sentiment data¶
Combine the data from part 2.4 (reviews data) and part 2.6 (emotion data). Use the pd.concat() function and combine along the column axis.
Note: You'll need to reset the index of your reviews data frame so it matches the emotion data index (it should run from 0 to the length of the data - 1).
FF_reviews_reset = reviews_filtered.reset_index(drop=True)
emotion_data_subset = FF_emotions[['anger', 'fear', 'joy', 'love', 'sadness', 'surprise', 'prediction']]
rating_sentiment = pd.concat([FF_reviews_reset, emotion_data_subset], axis=1)
print(rating_sentiment.head())
business_id name review_id stars \
0 O1oZpbZNDMH_gz8DhsZCdA Wendy's WtP3kl8xVOcoJOYyxkGfIg 1
1 O1oZpbZNDMH_gz8DhsZCdA Wendy's 7l__He_d8pHrEAAZZv2l0Q 1
2 O1oZpbZNDMH_gz8DhsZCdA Wendy's WSIe14YeSYUYqTdx6UDoFw 1
3 O1oZpbZNDMH_gz8DhsZCdA Wendy's kc9ABqXrw1-91xP6G0KWmQ 1
4 O1oZpbZNDMH_gz8DhsZCdA Wendy's ekdJfqlBxQf5oaZvaKQ8Zg 1
text anger fear joy \
0 This Wendy's is the worst Wendy's to go to . T... NaN NaN NaN
1 This location has the slowest service ever eve... 0.646855 NaN NaN
2 I got a burnt baked potato no friends ! Cook ... 0.890414 NaN NaN
3 The restaurant was extremely dirty. All the ta... NaN NaN NaN
4 Eww I dont like Wendy's it looks nasty as hell... 0.986700 NaN NaN
love sadness surprise prediction
0 NaN 0.995998 NaN sadness
1 NaN NaN NaN anger
2 NaN NaN NaN anger
3 NaN 0.642099 NaN sadness
4 NaN NaN NaN anger
2.8 Plot sentiment vs. stars¶
We now have a dataframe with the predicted primary emotion for each review and the associated number of stars for each review. Let's explore two questions:
- Does sentiment analysis work? Do reviews with fewer stars have negative emotions?
- For our fast food restaurants, are reviews generally positive or negative?
Use seaborn's histplot() to make a stacked bar chart showing the breakdown of each emotion for each stars category (1 star, 2 stars, etc.). A few notes:
- To stack multiple emotion labels in one bar, use the
multiple="stack"keyword - The
discrete=Truecan be helpful to tell seaborn our stars values are discrete categories
sns.set(style="whitegrid")
plt.figure(figsize=(12, 8))
sns.histplot(data=rating_sentiment, x='stars', hue='prediction', multiple="stack", discrete=True, palette='plasma')
plt.xlabel('Stars')
plt.ylabel('Number of Reviews')
plt.title('Sentiment vs. Stars for Fast Food Reviews')
plt.show()