Assignment 5: Exploring Yelp Reviews in Philadelphia¶

In this assignment, we'll explore restaurant review data available through the Yelp Dataset Challenge. The dataset includes Yelp data for user reviews and business information for many metropolitan areas. I've already downloaded this dataset (8 GB total!) and extracted the data files for reviews and restaurants in Philadelphia. I've placed these data files into the data directory in this repository.

This assignment is broken into two parts:

Part 1: Analyzing correlations between restaurant reviews and census data

We'll explore the relationship between restaurant reviews and the income levels of the restaurant's surrounding area.

Part 2: Exploring the impact of fast food restaurants

We'll run a sentiment analysis on reviews of fast food restaurants and estimate income levels in neighborhoods with fast food restaurants. We'll test how well our sentiment analysis works by comparing the number of stars to the sentiment of reviews.

Background readings

  • Does sentiment analysis work?
  • The Geography of Taste: Using Yelp to Study Urban Culture

1. Correlating restaurant ratings and income levels¶

In this part, we'll use the census API to download household income data and explore how it correlates with restaurant review data.

In [1]:
import geopandas as gpd
import holoviews as hv
import hvplot.pandas
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import cenpy
import json
!pip install torch transformers

1.1 Query the Census API¶

Use the cenpy package to download median household income in the past 12 months by census tract from the 2021 ACS 5-year data set for your county of interest.

You have two options to find the correct variable names:

  • Search through: https://api.census.gov/data/2021/acs/acs5/variables.html
  • Initialize an API connection and use the .varslike() function to search for the proper keywords

At the end of this step, you should have a pandas DataFrame holding the income data for all census tracts within the county being analyzed. Feel free to rename your variable from the ACS so it has a more meaningful name!

::: {.callout-caution} Some census tracts won't have any value because there are not enough households in that tract. The census will use a negative number as a default value for those tracts. You can safely remove those tracts from the analysis! :::
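As a sketch of that cleanup (the column name matches the ACS variable used below, but the incomes and the -666666666 sentinel value are illustrative; check the values in your own download):

```python
import pandas as pd

# Toy income data: the ACS uses a large negative sentinel (e.g. -666666666)
# for tracts without enough households to report a median income.
income = pd.DataFrame({
    "tract": ["000100", "000200", "000300"],
    "B19013_001E": ["55000", "-666666666", "72000"],  # the API returns strings
})

# Convert to numeric, then drop the sentinel rows
income["B19013_001E"] = pd.to_numeric(income["B19013_001E"], errors="coerce")
income = income[income["B19013_001E"] >= 0]

print(len(income))  # 2 tracts remain
```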

In [2]:
variables = [
    "NAME",
    "B19013_001E"   
]
In [3]:
acs = cenpy.remote.APIConnection("ACSDT5Y2021")
In [4]:
counties = cenpy.explorer.fips_table("COUNTY")
In [5]:
philly_county_code = "101"
PA_state_code = "42"

# Query median household income for every block group in Philadelphia County, PA
philly_data = acs.query(
    cols=variables,
    geo_unit="block group:*",
    geo_filter={"state": PA_state_code, "county": philly_county_code, "tract": "*"},
)

1.2 Download census tracts from the Census and merge the data from part 1.1¶

  • Download census tracts for the desired geography using the pygris package
  • Merge the downloaded census tracts with the household income DataFrame
In [6]:
# Load the 2010 census tract boundaries from a local GeoJSON
censustracts = gpd.read_file('Census_Tracts_2010.geojson')
censustracts.rename(columns={'TRACTCE10': 'tract'}, inplace=True)

# Merge the boundaries with the income data on the tract code
MEDHHINC = gpd.GeoDataFrame(pd.merge(censustracts, philly_data, on='tract')).reset_index(drop=True)

# Convert income to numeric and drop tracts with negative sentinel values
MEDHHINC['B19013_001E'] = pd.to_numeric(MEDHHINC['B19013_001E'], errors='coerce')
MEDHHINC = MEDHHINC[MEDHHINC['B19013_001E'] >= 0]
MEDHHINC.head()
Out[6]:
OBJECTID STATEFP10 COUNTYFP10 tract GEOID10 NAME10 NAMELSAD10 MTFCC10 FUNCSTAT10 ALAND10 AWATER10 INTPTLAT10 INTPTLON10 LOGRECNO geometry NAME B19013_001E state county block group
0 1 42 101 009400 42101009400 94 Census Tract 94 G5020 S 366717 0 +39.9632709 -075.2322437 10429 POLYGON ((-75.22927 39.96054, -75.22865 39.960... Block Group 1, Census Tract 94, Philadelphia C... 11012 42 101 1
1 1 42 101 009400 42101009400 94 Census Tract 94 G5020 S 366717 0 +39.9632709 -075.2322437 10429 POLYGON ((-75.22927 39.96054, -75.22865 39.960... Block Group 2, Census Tract 94, Philadelphia C... 27019 42 101 2
2 1 42 101 009400 42101009400 94 Census Tract 94 G5020 S 366717 0 +39.9632709 -075.2322437 10429 POLYGON ((-75.22927 39.96054, -75.22865 39.960... Block Group 3, Census Tract 94, Philadelphia C... 35457 42 101 3
6 2 42 101 009500 42101009500 95 Census Tract 95 G5020 S 319070 0 +39.9658709 -075.2379140 10430 POLYGON ((-75.23536 39.96852, -75.23545 39.969... Block Group 3, Census Tract 95, Philadelphia C... 32839 42 101 3
7 3 42 101 009600 42101009600 96 Census Tract 96 G5020 S 405273 0 +39.9655396 -075.2435075 10431 POLYGON ((-75.24343 39.96230, -75.24339 39.962... Block Group 1, Census Tract 96, Philadelphia C... 60271 42 101 1

1.3 Load the restaurants data¶

The Yelp dataset includes data for 7,350 restaurants across the city. Load the data from the data/ folder and use the latitude and longitude columns to create a GeoDataFrame after loading the JSON data. Be sure to set the right CRS when initializing the GeoDataFrame!

Notes

The JSON data is in a "records" format. To load it, you'll need to pass the following keywords:

  • orient='records'
  • lines=True
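To see what line-delimited "records" JSON looks like, here is a minimal sketch with made-up fields (the real file has the restaurant columns shown below):

```python
import io
import pandas as pd

# Each line is one JSON object (a "record"); lines=True tells pandas
# to parse the input line by line rather than as a single JSON array.
raw = io.StringIO(
    '{"name": "Cafe A", "stars": 4.0}\n'
    '{"name": "Cafe B", "stars": 3.5}\n'
)

df = pd.read_json(raw, orient="records", lines=True)
print(df.shape)  # (2, 2)
```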
In [7]:
restaurants = pd.read_json('restaurants_philly.json', orient='records', lines=True)
restaurants = gpd.GeoDataFrame(restaurants, 
                               geometry=gpd.points_from_xy(restaurants['longitude'], restaurants['latitude']),
                               crs='EPSG:4326')
restaurants.head()
Out[7]:
business_id latitude longitude name review_count stars categories geometry
0 MTSW4McQd7CbVtyjqoe9mw 39.955505 -75.155564 St Honore Pastries 80 4.0 Restaurants, Food, Bubble Tea, Coffee & Tea, B... POINT (-75.15556 39.95551)
1 MUTTqe8uqyMdBl186RmNeA 39.953949 -75.143226 Tuna Bar 245 4.0 Sushi Bars, Restaurants, Japanese POINT (-75.14323 39.95395)
2 ROeacJQwBeh05Rqg7F6TCg 39.943223 -75.162568 BAP 205 4.5 Korean, Restaurants POINT (-75.16257 39.94322)
3 QdN72BWoyFypdGJhhI5r7g 39.939825 -75.157447 Bar One 65 4.0 Cocktail Bars, Bars, Italian, Nightlife, Resta... POINT (-75.15745 39.93982)
4 Mjboz24M9NlBeiOJKLEd_Q 40.022466 -75.218314 DeSandro on Main 41 3.0 Pizza, Restaurants, Salad, Soup POINT (-75.21831 40.02247)

1.4 Add tract info for each restaurant¶

Do a spatial join to identify which census tract each restaurant is within. Make sure each dataframe has the same CRS!

At the end of this step, you should have a new dataframe with a column identifying the tract number for each restaurant.

In [8]:
censustracts = censustracts.to_crs('EPSG:4326')
restaurants = restaurants.to_crs('EPSG:4326')
restaurants_gdf = gpd.sjoin(restaurants, censustracts, how = "left", predicate="within").reset_index(drop=True)
del restaurants_gdf['index_right']
restaurants_gdf.head()
Out[8]:
business_id latitude longitude name review_count stars categories geometry OBJECTID STATEFP10 ... GEOID10 NAME10 NAMELSAD10 MTFCC10 FUNCSTAT10 ALAND10 AWATER10 INTPTLAT10 INTPTLON10 LOGRECNO
0 MTSW4McQd7CbVtyjqoe9mw 39.955505 -75.155564 St Honore Pastries 80 4.0 Restaurants, Food, Bubble Tea, Coffee & Tea, B... POINT (-75.15556 39.95551) 191 42 ... 42101000200 2 Census Tract 2 G5020 S 382478 0 +39.9553999 -075.1569775 10336
1 MUTTqe8uqyMdBl186RmNeA 39.953949 -75.143226 Tuna Bar 245 4.0 Sushi Bars, Restaurants, Japanese POINT (-75.14323 39.95395) 190 42 ... 42101000100 1 Census Tract 1 G5020 S 704909 0 +39.9523827 -075.1466628 10335
2 ROeacJQwBeh05Rqg7F6TCg 39.943223 -75.162568 BAP 205 4.5 Korean, Restaurants POINT (-75.16257 39.94322) 33 42 ... 42101001500 15 Census Tract 15 G5020 S 239383 0 +39.9419037 -075.1591158 10356
3 QdN72BWoyFypdGJhhI5r7g 39.939825 -75.157447 Bar One 65 4.0 Cocktail Bars, Bars, Italian, Nightlife, Resta... POINT (-75.15745 39.93982) 36 42 ... 42101001800 18 Census Tract 18 G5020 S 242483 0 +39.9399998 -075.1593091 10359
4 Mjboz24M9NlBeiOJKLEd_Q 40.022466 -75.218314 DeSandro on Main 41 3.0 Pizza, Restaurants, Salad, Soup POINT (-75.21831 40.02247) 85 42 ... 42101021000 210 Census Tract 210 G5020 S 858223 55719 +40.0244730 -075.2151045 10536

5 rows × 22 columns

1.5 Add income data to your restaurant data¶

Add the income data to your dataframe from the previous step, merging the census data based on the tract that each restaurant is within.

In [9]:
merged_gdf = gpd.sjoin(restaurants_gdf, MEDHHINC, how = "left", predicate="within")
merged_gdf.head()
Out[9]:
business_id latitude longitude name review_count stars categories geometry OBJECTID_left STATEFP10_left ... ALAND10_right AWATER10_right INTPTLAT10_right INTPTLON10_right LOGRECNO_right NAME B19013_001E state county block group
0 MTSW4McQd7CbVtyjqoe9mw 39.955505 -75.155564 St Honore Pastries 80 4.0 Restaurants, Food, Bubble Tea, Coffee & Tea, B... POINT (-75.15556 39.95551) 191 42 ... 382478.0 0.0 +39.9553999 -075.1569775 10336 Block Group 1, Census Tract 2, Philadelphia Co... 42308.0 42 101 1
0 MTSW4McQd7CbVtyjqoe9mw 39.955505 -75.155564 St Honore Pastries 80 4.0 Restaurants, Food, Bubble Tea, Coffee & Tea, B... POINT (-75.15556 39.95551) 191 42 ... 382478.0 0.0 +39.9553999 -075.1569775 10336 Block Group 2, Census Tract 2, Philadelphia Co... 250001.0 42 101 2
1 MUTTqe8uqyMdBl186RmNeA 39.953949 -75.143226 Tuna Bar 245 4.0 Sushi Bars, Restaurants, Japanese POINT (-75.14323 39.95395) 190 42 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 ROeacJQwBeh05Rqg7F6TCg 39.943223 -75.162568 BAP 205 4.5 Korean, Restaurants POINT (-75.16257 39.94322) 33 42 ... 239383.0 0.0 +39.9419037 -075.1591158 10356 Block Group 1, Census Tract 15, Philadelphia C... 66290.0 42 101 1
2 ROeacJQwBeh05Rqg7F6TCg 39.943223 -75.162568 BAP 205 4.5 Korean, Restaurants POINT (-75.16257 39.94322) 33 42 ... 239383.0 0.0 +39.9419037 -075.1591158 10356 Block Group 2, Census Tract 15, Philadelphia C... 107292.0 42 101 2

5 rows × 42 columns

1.6 Make a plot of median household income vs. Yelp stars¶

Our dataset has the number of stars for each restaurant, rounded to the nearest 0.5 star. In this step, create a line plot that shows the average income value for each stars category (e.g., all restaurants with 1 star, 1.5 stars, 2 stars, etc.)

While there are multiple ways to do this, seaborn.lineplot() is a great option. It can show the average value in each category as well as 95% confidence intervals. Use this function to plot the stars ("x") vs. average income ("y") for all of our restaurants, using the dataframe from the last step. Be sure to format your figure to make it look nice!

Question: Is there a correlation between a restaurant's ratings and the income levels of its surrounding neighborhood?

In [10]:
grouped_df = merged_gdf.groupby('stars')['B19013_001E'].mean().reset_index()
grouped_df.head()
Out[10]:
stars B19013_001E
0 1.0 55197.340659
1 1.5 59296.992105
2 2.0 65606.081140
3 2.5 71007.582278
4 3.0 75801.154528
In [11]:
# Plot the raw (unaggregated) data so seaborn can compute both the mean
# and the 95% confidence interval within each stars category
ax = sns.lineplot(
    data=merged_gdf,
    x='stars',
    y='B19013_001E',
    estimator='mean',
    errorbar=('ci', 95))
ax.set(xlabel='Ratings', ylabel='Average Income')
ax.set_title('Ratings vs. Average Income in Census Block Group')
Out[11]:
Text(0.5, 1.0, 'Ratings vs. Average Income in Census Block Group')
[figure: line plot of star ratings vs. average block-group income]

2. Fast food trends in Philadelphia¶

At the end of part 1, you should have seen a strong trend where higher income tracts generally had restaurants with better reviews. In this section, we'll explore the impact of fast food restaurants and how they might be impacting this trend.

Hypothesis

  1. Fast food restaurants are predominantly located in areas with lower median income levels.
  2. Fast food restaurants have worse reviews compared to typical restaurants.

If true, these two hypotheses could help to explain the trend we found in part 1. Let's dive in and test our hypotheses!

2.1 Identify fast food restaurants¶

The "categories" column in our dataset contains multiple classifications for each restaurant. One such category is "Fast Food". In this step, add a new column called "is_fast_food" that is True if the "categories" column contains the term "Fast Food" and False otherwise.

In [12]:
# Flag fast food restaurants; na=False maps missing categories to False
merged_gdf['is_fast_food'] = merged_gdf['categories'].str.contains('Fast Food', case=False, na=False)
merged_gdf.head()
Out[12]:
business_id latitude longitude name review_count stars categories geometry OBJECTID_left STATEFP10_left ... AWATER10_right INTPTLAT10_right INTPTLON10_right LOGRECNO_right NAME B19013_001E state county block group is_fast_food
0 MTSW4McQd7CbVtyjqoe9mw 39.955505 -75.155564 St Honore Pastries 80 4.0 Restaurants, Food, Bubble Tea, Coffee & Tea, B... POINT (-75.15556 39.95551) 191 42 ... 0.0 +39.9553999 -075.1569775 10336 Block Group 1, Census Tract 2, Philadelphia Co... 42308.0 42 101 1 False
0 MTSW4McQd7CbVtyjqoe9mw 39.955505 -75.155564 St Honore Pastries 80 4.0 Restaurants, Food, Bubble Tea, Coffee & Tea, B... POINT (-75.15556 39.95551) 191 42 ... 0.0 +39.9553999 -075.1569775 10336 Block Group 2, Census Tract 2, Philadelphia Co... 250001.0 42 101 2 False
1 MUTTqe8uqyMdBl186RmNeA 39.953949 -75.143226 Tuna Bar 245 4.0 Sushi Bars, Restaurants, Japanese POINT (-75.14323 39.95395) 190 42 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN False
2 ROeacJQwBeh05Rqg7F6TCg 39.943223 -75.162568 BAP 205 4.5 Korean, Restaurants POINT (-75.16257 39.94322) 33 42 ... 0.0 +39.9419037 -075.1591158 10356 Block Group 1, Census Tract 15, Philadelphia C... 66290.0 42 101 1 False
2 ROeacJQwBeh05Rqg7F6TCg 39.943223 -75.162568 BAP 205 4.5 Korean, Restaurants POINT (-75.16257 39.94322) 33 42 ... 0.0 +39.9419037 -075.1591158 10356 Block Group 2, Census Tract 15, Philadelphia C... 107292.0 42 101 2 False

5 rows × 43 columns

2.2 Calculate the median income for fast food and otherwise¶

Group by the "is_fast_food" column and calculate the median income for restaurants that are and are not fast food. You should find that income levels are lower in tracts with fast food.

Note: this is just an estimate, since we are calculating a median of median income values.
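A toy example of why a median of tract-level medians is only an estimate (all numbers are made up; tracts with very different sizes pull the two statistics apart):

```python
import numpy as np

# Household incomes for three hypothetical tracts of unequal size
tract_a = [10_000, 20_000, 30_000]   # median 20,000
tract_b = [40_000, 50_000, 60_000]   # median 50,000
tract_c = [100_000]                  # median 100,000

# Median of the three tract medians vs. median of all households pooled
median_of_medians = np.median([np.median(t) for t in (tract_a, tract_b, tract_c)])
true_median = np.median(tract_a + tract_b + tract_c)

print(median_of_medians, true_median)  # 50000.0 40000.0
```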

In [13]:
grouped_gdf2 = merged_gdf.groupby('is_fast_food')['B19013_001E'].median().reset_index()
grouped_gdf2.head()
Out[13]:
is_fast_food B19013_001E
0 False 69783.0
1 True 55773.5

2.3 Load fast food review data¶

In the rest of part 2, we're going to run a sentiment analysis on the reviews for fast food restaurants. The review data for all fast food restaurants identified in part 2.1 is already stored in the data/ folder. The data is stored as a JSON file and you can use pandas.read_json to load it.

Notes

The JSON data is in a "records" format. To load it, you'll need to pass the following keywords:

  • orient='records'
  • lines=True
In [14]:
reviews = pd.read_json('reviews_philly_fast_food.json', orient='records', lines=True)
merged_gdf
Out[14]:
business_id latitude longitude name review_count stars categories geometry OBJECTID_left STATEFP10_left ... AWATER10_right INTPTLAT10_right INTPTLON10_right LOGRECNO_right NAME B19013_001E state county block group is_fast_food
0 MTSW4McQd7CbVtyjqoe9mw 39.955505 -75.155564 St Honore Pastries 80 4.0 Restaurants, Food, Bubble Tea, Coffee & Tea, B... POINT (-75.15556 39.95551) 191 42 ... 0.0 +39.9553999 -075.1569775 10336 Block Group 1, Census Tract 2, Philadelphia Co... 42308.0 42 101 1 False
0 MTSW4McQd7CbVtyjqoe9mw 39.955505 -75.155564 St Honore Pastries 80 4.0 Restaurants, Food, Bubble Tea, Coffee & Tea, B... POINT (-75.15556 39.95551) 191 42 ... 0.0 +39.9553999 -075.1569775 10336 Block Group 2, Census Tract 2, Philadelphia Co... 250001.0 42 101 2 False
1 MUTTqe8uqyMdBl186RmNeA 39.953949 -75.143226 Tuna Bar 245 4.0 Sushi Bars, Restaurants, Japanese POINT (-75.14323 39.95395) 190 42 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN False
2 ROeacJQwBeh05Rqg7F6TCg 39.943223 -75.162568 BAP 205 4.5 Korean, Restaurants POINT (-75.16257 39.94322) 33 42 ... 0.0 +39.9419037 -075.1591158 10356 Block Group 1, Census Tract 15, Philadelphia C... 66290.0 42 101 1 False
2 ROeacJQwBeh05Rqg7F6TCg 39.943223 -75.162568 BAP 205 4.5 Korean, Restaurants POINT (-75.16257 39.94322) 33 42 ... 0.0 +39.9419037 -075.1591158 10356 Block Group 2, Census Tract 15, Philadelphia C... 107292.0 42 101 2 False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7348 8n93L-ilMAsvwUatarykSg 39.951018 -75.198240 Kitchen Gia 22 3.0 Coffee & Tea, Food, Sandwiches, American (Trad... POINT (-75.19824 39.95102) 378 42 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN False
7349 WnT9NIzQgLlILjPT0kEcsQ 39.935982 -75.158665 Adelita Taqueria & Restaurant 35 4.5 Restaurants, Mexican POINT (-75.15866 39.93598) 42 42 ... 0.0 +39.9367634 -075.1595100 10365 Block Group 1, Census Tract 24, Philadelphia C... 117500.0 42 101 1 False
7349 WnT9NIzQgLlILjPT0kEcsQ 39.935982 -75.158665 Adelita Taqueria & Restaurant 35 4.5 Restaurants, Mexican POINT (-75.15866 39.93598) 42 42 ... 0.0 +39.9367634 -075.1595100 10365 Block Group 2, Census Tract 24, Philadelphia C... 135139.0 42 101 2 False
7349 WnT9NIzQgLlILjPT0kEcsQ 39.935982 -75.158665 Adelita Taqueria & Restaurant 35 4.5 Restaurants, Mexican POINT (-75.15866 39.93598) 42 42 ... 0.0 +39.9367634 -075.1595100 10365 Block Group 3, Census Tract 24, Philadelphia C... 66313.0 42 101 3 False
7349 WnT9NIzQgLlILjPT0kEcsQ 39.935982 -75.158665 Adelita Taqueria & Restaurant 35 4.5 Restaurants, Mexican POINT (-75.15866 39.93598) 42 42 ... 0.0 +39.9367634 -075.1595100 10365 Block Group 4, Census Tract 24, Philadelphia C... 82606.0 42 101 4 False

17818 rows × 43 columns

2.4 Trim to the most popular fast food restaurants¶

There are too many reviews to run a sentiment analysis on all of them in a reasonable amount of time. Let's trim our reviews dataset to the most popular fast food restaurants, using the list provided below.

You will need to get the "business_id" values for each of these restaurants from the restaurants data loaded in part 1.3. Then, trim the reviews data to include reviews only for those business IDs.

In [15]:
popular_fast_food = [
    "McDonald's",
    "Wendy's",
    "Subway",
    "Popeyes Louisiana Kitchen",
    "Taco Bell",
    "KFC",
    "Burger King",
]
In [16]:
business_id = merged_gdf[['business_id','name']]
reviews = pd.merge(business_id, reviews, on = 'business_id')
reviews.head()
Out[16]:
business_id name review_id stars text
0 O1oZpbZNDMH_gz8DhsZCdA Wendy's WtP3kl8xVOcoJOYyxkGfIg 1 This Wendy's is the worst Wendy's to go to . T...
1 O1oZpbZNDMH_gz8DhsZCdA Wendy's 7l__He_d8pHrEAAZZv2l0Q 1 This location has the slowest service ever eve...
2 O1oZpbZNDMH_gz8DhsZCdA Wendy's WSIe14YeSYUYqTdx6UDoFw 1 I got a burnt baked potato no friends ! Cook ...
3 O1oZpbZNDMH_gz8DhsZCdA Wendy's kc9ABqXrw1-91xP6G0KWmQ 1 The restaurant was extremely dirty. All the ta...
4 O1oZpbZNDMH_gz8DhsZCdA Wendy's ekdJfqlBxQf5oaZvaKQ8Zg 1 Eww I dont like Wendy's it looks nasty as hell...
In [17]:
reviews_filtered = reviews[reviews['name'].isin(popular_fast_food)]
reviews_filtered.head()
Out[17]:
business_id name review_id stars text
0 O1oZpbZNDMH_gz8DhsZCdA Wendy's WtP3kl8xVOcoJOYyxkGfIg 1 This Wendy's is the worst Wendy's to go to . T...
1 O1oZpbZNDMH_gz8DhsZCdA Wendy's 7l__He_d8pHrEAAZZv2l0Q 1 This location has the slowest service ever eve...
2 O1oZpbZNDMH_gz8DhsZCdA Wendy's WSIe14YeSYUYqTdx6UDoFw 1 I got a burnt baked potato no friends ! Cook ...
3 O1oZpbZNDMH_gz8DhsZCdA Wendy's kc9ABqXrw1-91xP6G0KWmQ 1 The restaurant was extremely dirty. All the ta...
4 O1oZpbZNDMH_gz8DhsZCdA Wendy's ekdJfqlBxQf5oaZvaKQ8Zg 1 Eww I dont like Wendy's it looks nasty as hell...

2.5 Run the emotions classifier on fast food reviews¶

Run a sentiment analysis on the reviews data from the previous step. Use the DistilBERT model that can predict emotion labels (anger, fear, sadness, joy, love, and surprise). Transform the result from the classifier into a DataFrame so that you have a column for each of the emotion labels.

In [18]:
from transformers import pipeline
import torch
In [19]:
text_data = reviews_filtered['text'].tolist()

model = "bhadresh-savani/distilbert-base-uncased-emotion"

emotion_classifier = pipeline(
    task="text-classification",
    model=model,
    tokenizer=model,
    truncation=True,
)

emotion_scores = emotion_classifier(text_data)
In [21]:
emotion_df = pd.DataFrame(emotion_scores)
emotion_df['text'] = text_data
In [22]:
print(emotion_df.columns)
Index(['label', 'score', 'text'], dtype='object')

2.6 Identify the predicted emotion for each text¶

Use the pandas idxmax() function to identify the predicted emotion for each review, and add this value to a new column called "prediction".

The predicted emotion is the label with the highest confidence score across all emotion labels for a particular review.
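A minimal sketch of how idxmax(axis=1) behaves on a frame shaped like ours, where each row has a score for only one label (toy values):

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({
    "anger":   [0.99, np.nan],
    "joy":     [np.nan, 0.95],
    "sadness": [np.nan, np.nan],
})

# idxmax(axis=1) returns the column name of each row's maximum value,
# skipping NaN entries along the way
scores["prediction"] = scores.idxmax(axis=1)
print(scores["prediction"].tolist())  # ['anger', 'joy']
```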

In [23]:
emotion_df_no_duplicates = emotion_df.drop_duplicates(subset='text')


emotion_pivot = emotion_df_no_duplicates.pivot(index='text', columns='label', values='score').reset_index()


emotion_pivot.columns.name = None  # Remove the 'label' name for columns


emotion_pivot.head()
Out[23]:
text anger fear joy love sadness surprise
0 "Meh. I've experienced better." That's what ... NaN NaN 0.997238 NaN NaN NaN
1 "No burgers in Burger King!" I went to this Bu... NaN NaN NaN 0.881435 NaN NaN
2 "Open 24 hours"... or when we feel like it. Th... NaN NaN NaN NaN 0.994735 NaN
3 "Why the fuck you add salad dressing to the ba... 0.997625 NaN NaN NaN NaN NaN
4 $.69 for any size coffee you say? Breakfast s... NaN NaN NaN NaN 0.898883 NaN
In [26]:
FF_emotions = pd.merge(reviews_filtered, emotion_pivot, on='text', how='left')

FF_emotions['prediction'] = FF_emotions[['anger', 'fear', 'sadness', 'joy', 'surprise', 'love']].idxmax(axis=1)

FF_emotions.head()
Out[26]:
business_id name review_id stars text anger fear joy love sadness surprise prediction
0 O1oZpbZNDMH_gz8DhsZCdA Wendy's WtP3kl8xVOcoJOYyxkGfIg 1 This Wendy's is the worst Wendy's to go to . T... NaN NaN NaN NaN 0.995998 NaN sadness
1 O1oZpbZNDMH_gz8DhsZCdA Wendy's 7l__He_d8pHrEAAZZv2l0Q 1 This location has the slowest service ever eve... 0.646855 NaN NaN NaN NaN NaN anger
2 O1oZpbZNDMH_gz8DhsZCdA Wendy's WSIe14YeSYUYqTdx6UDoFw 1 I got a burnt baked potato no friends ! Cook ... 0.890414 NaN NaN NaN NaN NaN anger
3 O1oZpbZNDMH_gz8DhsZCdA Wendy's kc9ABqXrw1-91xP6G0KWmQ 1 The restaurant was extremely dirty. All the ta... NaN NaN NaN NaN 0.642099 NaN sadness
4 O1oZpbZNDMH_gz8DhsZCdA Wendy's ekdJfqlBxQf5oaZvaKQ8Zg 1 Eww I dont like Wendy's it looks nasty as hell... 0.986700 NaN NaN NaN NaN NaN anger
In [27]:
FF_emotions.groupby("prediction").size().plot(kind='barh');
[figure: horizontal bar chart of review counts by predicted emotion]

2.7 Combine the ratings and sentiment data¶

Combine the data from part 2.4 (reviews data) and part 2.6 (emotion data). Use the pd.concat() function and combine along the column axis.

Note: You'll need to reset the index of your reviews data frame so it matches the emotion data index (it should run from 0 to the length of the data - 1).
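The index-alignment gotcha mentioned in the note, in miniature (toy frames; the names are illustrative): pd.concat(axis=1) matches rows by index label, not by position.

```python
import pandas as pd

toy_reviews = pd.DataFrame({"stars": [1, 5]}, index=[10, 42])    # filtered frame keeps its old labels
toy_emotions = pd.DataFrame({"prediction": ["anger", "joy"]})    # fresh 0..n-1 index

# Without reset_index, concat aligns on the index labels and fills NaN
misaligned = pd.concat([toy_reviews, toy_emotions], axis=1)
print(len(misaligned))  # 4 rows: labels 0, 1, 10, 42

# Resetting the index makes the rows line up one-to-one
aligned = pd.concat([toy_reviews.reset_index(drop=True), toy_emotions], axis=1)
print(len(aligned))  # 2 rows
```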

In [28]:
FF_reviews_reset = reviews_filtered.reset_index(drop=True)

emotion_data_subset = FF_emotions[['anger', 'fear', 'joy', 'love', 'sadness', 'surprise', 'prediction']]

rating_sentiment = pd.concat([FF_reviews_reset, emotion_data_subset], axis=1)

print(rating_sentiment.head())
              business_id     name               review_id  stars  \
0  O1oZpbZNDMH_gz8DhsZCdA  Wendy's  WtP3kl8xVOcoJOYyxkGfIg      1   
1  O1oZpbZNDMH_gz8DhsZCdA  Wendy's  7l__He_d8pHrEAAZZv2l0Q      1   
2  O1oZpbZNDMH_gz8DhsZCdA  Wendy's  WSIe14YeSYUYqTdx6UDoFw      1   
3  O1oZpbZNDMH_gz8DhsZCdA  Wendy's  kc9ABqXrw1-91xP6G0KWmQ      1   
4  O1oZpbZNDMH_gz8DhsZCdA  Wendy's  ekdJfqlBxQf5oaZvaKQ8Zg      1   

                                                text     anger  fear  joy  \
0  This Wendy's is the worst Wendy's to go to . T...       NaN   NaN  NaN   
1  This location has the slowest service ever eve...  0.646855   NaN  NaN   
2  I got a burnt baked potato  no friends ! Cook ...  0.890414   NaN  NaN   
3  The restaurant was extremely dirty. All the ta...       NaN   NaN  NaN   
4  Eww I dont like Wendy's it looks nasty as hell...  0.986700   NaN  NaN   

   love   sadness  surprise prediction  
0   NaN  0.995998       NaN    sadness  
1   NaN       NaN       NaN      anger  
2   NaN       NaN       NaN      anger  
3   NaN  0.642099       NaN    sadness  
4   NaN       NaN       NaN      anger  

2.8 Plot sentiment vs. stars¶

We now have a dataframe with the predicted primary emotion for each review and the associated number of stars for each review. Let's explore two questions:

  1. Does sentiment analysis work? Do reviews with fewer stars have negative emotions?
  2. For our fast food restaurants, are reviews generally positive or negative?

Use seaborn's histplot() to make a stacked bar chart showing the breakdown of each emotion for each stars category (1 star, 2 stars, etc.). A few notes:

  • To stack multiple emotion labels in one bar, use the multiple="stack" keyword
  • Passing discrete=True can be helpful to tell seaborn that our stars values are discrete categories
In [31]:
sns.set(style="whitegrid")


plt.figure(figsize=(12, 8))
sns.histplot(data=rating_sentiment, x='stars', hue='prediction', multiple="stack", discrete=True, palette='plasma')


plt.xlabel('Stars')
plt.ylabel('Number of Reviews')
plt.title('Sentiment vs. Stars for Fast Food Reviews')


plt.show()
[figure: stacked histogram of review counts by stars, colored by predicted emotion]