Nov 8, 2022
Using PostgreSQL® JSON functions to navigate reviews of restaurants in India
Parsing a semi-structured dataset in a relational database can seem scary. Read on to see how PostgreSQL® JSON functions let your SQL queries work with json and jsonb data.
The original idea behind relational databases was "structure, then data": you needed to define what the data looked like before being able to insert any content. This strict data structure definition helped keep datasets in order by verifying data types, referential integrity, and additional business conditions through dedicated constraints.
But sometimes life can't be predicted, and data can take different shapes. To allow some flexibility, modern databases like PostgreSQL® started adding semi-structured column options such as JSON, where only a formal check on the shape of the data is performed.
PostgreSQL actually offers two options in this space, json and jsonb. The first validates that the content is in JSON format and stores it as a string; the second is a binary representation optimised for faster processing and better indexing. You can read more on Stack Overflow.
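A quick way to see the difference is to cast the same string to both types. This is a minimal sketch that can be run in any psql session, no table needed:

```sql
-- json stores the input text verbatim: whitespace, key order,
-- and even duplicate keys are preserved as-is
select '{"b": 1, "a": 2, "a": 3}'::json;
-- {"b": 1, "a": 2, "a": 3}

-- jsonb stores a parsed binary representation: whitespace is
-- normalised, keys are sorted, and for duplicate keys only the
-- last value is kept
select '{"b": 1, "a": 2, "a": 3}'::jsonb;
-- {"a": 3, "b": 1}
```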
This blog post goes into detail about a few jsonb functions (the json versions are really similar, just without the b ending), using a dataset containing restaurant information. So if you, like me, are always willing to discover new cuisines, put on your chef-investigator hat (a mix of Gordon Ramsay's and Sherlock Holmes's) and join me in the search for a good restaurant on our imaginary trip to India!
Deploy a PostgreSQL® instance
Let's start with the basics: we need a database with some sort of parsing capabilities for JSON data. PostgreSQL is perfect for that, and you can either use your own or create an instance in Aiven via the CLI:
avn service create demo-pg \
    --service-type pg \
    --plan hobbyist \
    --cloud google-europe-west3
The above creates an Aiven for PostgreSQL (--service-type pg) service called demo-pg, with the bare minimum hobbyist plan in the google-europe-west3 cloud region, which is in Frankfurt, Germany. The service takes a couple of minutes to be ready; you can monitor it with the avn service wait command.
Get the restaurant data in PostgreSQL
Any research starts with a dataset, and this one is no different! There's a nice Zomato restaurants dataset available on Kaggle that serves our purposes, containing restaurant information for a lot of cities around the world.
All we need to download it is a valid Kaggle login. Once downloaded, we can unzip the archive file, and we'll get a folder containing 5 files (file1.json to file5.json). We can load them into our PostgreSQL service using the Aiven CLI, which will fetch the connection string and start the psql client:
avn service cli demo-pg
Create a rest_reviews table containing a single column called reviews_data of type jsonb:
create table rest_reviews (reviews_data JSONB);
And then load the files with:
\copy rest_reviews from program 'sed -e ''s/\\/\\\\/g'' file1.json';
\copy rest_reviews from program 'sed -e ''s/\\/\\\\/g'' file2.json';
\copy rest_reviews from program 'sed -e ''s/\\/\\\\/g'' file3.json';
\copy rest_reviews from program 'sed -e ''s/\\/\\\\/g'' file4.json';
\copy rest_reviews from program 'sed -e ''s/\\/\\\\/g'' file5.json';
We are using sed -e 's/\\/\\\\/g' fileX.json to properly escape any \ characters, which are a problem for the \copy command (kudos to Stack Overflow for the answer).
The files we just uploaded are nested JSON documents containing the Zomato API responses in an array like:
[
    {
        "results_found": 17151,
        "restaurants": [
            {"restaurant": {"name": "Hauz Khas Social", ...}},
            {"restaurant": {"name": "Qubitos - The Terrace Cafe", ...}},
            ...
        ]
    },
    {
        "results_found": 100,
        "restaurants": [
            {"restaurant": {"name": "Spezia Bistro", ...}},
            {"restaurant": {"name": "Manhattan Brewery & Bar Exchange", ...}},
            ...
        ]
    },
    ...
    {"message": "API limit exceeded", "code": 440, "status": ""},
    {"message": "API limit exceeded", "code": 440, "status": ""}
]
Extract the list of restaurants with jsonb_array_elements
To access the list of restaurants (in the restaurants field) we need to:
- Parse the outer array containing the list of API responses
- Parse the array in the restaurants JSON item (this also removes the API limit exceeded errors)
We can do that with the following SQL query:
select restaurant -> 'restaurant' ->> 'name' restaurant_name
from rest_reviews
    cross join lateral jsonb_array_elements(reviews_data) dt
    cross join lateral jsonb_array_elements(dt -> 'restaurants') restaurant
limit 10;
In the above:
- We use the jsonb_array_elements function to parse the jsonb arrays.
- jsonb_array_elements(reviews_data) dt gives us dt, which is each element of the outer array of API responses.
- -> retrieves a jsonb subitem. So jsonb_array_elements(dt -> 'restaurants') restaurant gives us restaurant, iterating over the array contained in the restaurants field, and restaurant -> 'restaurant' gives us the restaurant jsonb value from each element of that array.
- Like ->, ->> retrieves a jsonb subitem, but this time as text. So restaurant -> 'restaurant' ->> 'name' retrieves the field name from the restaurant jsonb, as text.
When executing the above query we can see the data being parsed correctly:
         restaurant_name
----------------------------------
 Hauz Khas Social
 Qubitos - The Terrace Cafe
 The Hudson Cafe
 Summer House Cafe
 38 Barracks
 Spezia Bistro
 Manhattan Brewery & Bar Exchange
 The Wine Company
 Farzi Cafe
 Indian Grill Room
(10 rows)
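The two operators can also be tried on a jsonb literal, without any table, which makes the difference easier to see:

```sql
-- -> returns a jsonb value, so it can be chained further
select '{"restaurant": {"name": "Hauz Khas Social"}}'::jsonb -> 'restaurant';
-- {"name": "Hauz Khas Social"}

-- ->> returns text, so it's the right choice for the final step
select '{"restaurant": {"name": "Hauz Khas Social"}}'::jsonb
       -> 'restaurant' ->> 'name';
-- Hauz Khas Social
```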
It would be nice to create a table with one row per restaurant; we can do that using a similar query.
Note: this step could do more and parse more columns; it's just an example of what's achievable.
create table restaurant_data as
select (restaurant -> 'restaurant' ->> 'id')::int id,
       restaurant -> 'restaurant' jsonb_data
from rest_reviews
    cross join lateral jsonb_array_elements(reviews_data) dt
    cross join lateral jsonb_array_elements(dt -> 'restaurants') restaurant;
We now have a table called restaurant_data with an integer field id and a jsonb field jsonb_data that we can use for further analysis.
Dive deep into the restaurants data
Now we can start with our research. First let's explore some fields. Apart from the id and name, there's a nice location JSON subitem where we can find the restaurant city amongst other information.
select jsonb_data ->> 'id' id,
       jsonb_data ->> 'name' name,
       jsonb_data -> 'location' ->> 'city' city
from restaurant_data
limit 5;
In the above, check again the usage of -> to extract a jsonb subitem versus ->> to extract the same as text. The results are the following:
    id    |            name            |   city
----------+----------------------------+-----------
   308322 | Hauz Khas Social           | New Delhi
 18037817 | Qubitos - The Terrace Cafe | New Delhi
   312345 | The Hudson Cafe            | New Delhi
   307490 | Summer House Cafe          | New Delhi
 18241537 | 38 Barracks                | New Delhi
(5 rows)
What are the top prices in India? Filter data using @>
Let's talk money! What are the most expensive restaurants based on average_cost_for_two in Rs. (Indian Rupees), the local currency of India?
select jsonb_data ->> 'id' id,
       jsonb_data ->> 'name' name,
       (jsonb_data ->> 'average_cost_for_two')::int average_cost_for_two
from restaurant_data
where jsonb_data @> '{"currency": "Rs."}'
order by 3 desc
limit 5;
We use ::int again to cast the price to an integer, and the @> operator to check that the jsonb_data JSON document contains {"currency": "Rs."}. An alternative would be to extract the currency subitem and filter with jsonb_data ->> 'currency' = 'Rs.'. The following are the results:
   id    |                         name                         | average_cost_for_two
---------+------------------------------------------------------+----------------------
 3400072 | Dawat-e-Nawab - Radisson Blu                         |                 3600
 2300187 | Waterside - The Landmark Hotel                       |                 3000
 3400059 | Peshawri - ITC Mughal                                |                 2500
  102216 | Chao Chinese Bistro - Holiday Inn Jaipur City Centre |                 2500
 3400060 | Taj Bano - ITC Mughal                                |                 2500
(5 rows)
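The containment check is not limited to queries against a table; as a minimal sketch, @> can also be tested directly on jsonb literals:

```sql
-- true: the left document contains all the key/value pairs
-- of the right one
select '{"currency": "Rs.", "city": "Agra"}'::jsonb @> '{"currency": "Rs."}';

-- false: the key exists but the value doesn't match
select '{"currency": "Rs."}'::jsonb @> '{"currency": "USD"}';
```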
Check the rating with width_bucket
Ok, the above query gave me an idea of the cost, but what about the quality? Let's explore the user_rating item and create a histogram with:
with agg_bucket as (
    select width_bucket((jsonb_data -> 'user_rating' ->> 'aggregate_rating')::numeric, 0, 5, 10) bucket,
           count(*) nr_restaurants
    from restaurant_data
    where jsonb_data @> '{"currency": "Rs."}'
    group by width_bucket((jsonb_data -> 'user_rating' ->> 'aggregate_rating')::numeric, 0, 5, 10)
)
select bucket,
       numrange(bucket*0.5 - 0.5, bucket*0.5) range,
       nr_restaurants
from agg_bucket
order by 1;
The above query uses the width_bucket function to assign the user_rating value to one of 10 buckets, each covering 0.5 rating points (e.g. 0-0.5, 0.5-1, and so on). The result below shows that we can safely filter for rating >= 4 and still retain a good choice of restaurants.
 bucket |   range   | nr_restaurants
--------+-----------+----------------
      5 | [2.0,2.5) |              2
      6 | [2.5,3.0) |              1
      7 | [3.0,3.5) |             88
      8 | [3.5,4.0) |            266
      9 | [4.0,4.5) |            287
     10 | [4.5,5.0) |             76
(6 rows)
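To double check how the bucketing works, width_bucket can be called on a single value; for example, a rating of 4.2 over the 0-5 range split into 10 buckets lands in bucket 9, the [4.0, 4.5) interval:

```sql
-- width_bucket(value, low, high, bucket_count) returns the
-- 1-based index of the equal-width bucket the value falls into
select width_bucket(4.2, 0, 5, 10);
-- 9
```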
What's the best affordable restaurant?
Should we try to minimise the spend? Let's check if there is any restaurant with a rating greater than or equal to 4 and an average_cost_for_two under 1000 Indian Rupees. Again, we are casting average_cost_for_two to an integer and aggregate_rating to a numeric before applying the filters.
select jsonb_data ->> 'id' id,
       jsonb_data ->> 'name' name,
       (jsonb_data ->> 'average_cost_for_two')::int average_cost_for_two,
       (jsonb_data -> 'user_rating' ->> 'aggregate_rating')::numeric aggregate_rating
from restaurant_data
where jsonb_data ->> 'currency' = 'Rs.'
    and (jsonb_data ->> 'average_cost_for_two')::int < 1000
    and (jsonb_data -> 'user_rating' ->> 'aggregate_rating')::numeric >= 4
order by 4 desc, 3 asc
limit 5;
This shows quite a good selection of not too expensive but still good restaurants!
   id    |           name            | average_cost_for_two | aggregate_rating
---------+---------------------------+----------------------+------------------
 3400346 | Sheroes Hangout           |                    0 |              4.9
 2600109 | Sagar Gaire Fast Food     |                  250 |              4.9
  800468 | Grandson of Tunday Kababi |                  300 |              4.9
 3001321 | CakeBee                   |                  350 |              4.9
   96776 | Conçu                     |                  600 |              4.8
(5 rows)
A deep dive into the events array, with jsonb_array_length and jsonb_array_elements
Let's refine our research: are any of the resulting restaurants hosting events? Checking events held in the past might give us more insight into what a restaurant can offer. Since the zomato_events item contains an array of events, we can check for restaurants with at least 2 entries in that array to get a sense of what's available.
select jsonb_data ->> 'id' id,
       jsonb_data ->> 'name' name,
       (jsonb_data ->> 'average_cost_for_two')::int average_cost_for_two,
       (jsonb_data -> 'user_rating' ->> 'aggregate_rating')::numeric aggregate_rating,
       jsonb_array_length(jsonb_data -> 'zomato_events') nr_events
from restaurant_data
where jsonb_data ->> 'currency' = 'Rs.'
    and (jsonb_data ->> 'average_cost_for_two')::int < 1000
    and (jsonb_data -> 'user_rating' ->> 'aggregate_rating')::numeric >= 4
    and jsonb_array_length(jsonb_data -> 'zomato_events') > 1
order by 5 desc, 4 desc, 3 asc
limit 5;
To filter on zomato_events, we're using the jsonb_array_length function, which returns the number of items in a jsonb array. Interestingly, we only get 3 rows.
    id    |      name      | average_cost_for_two | aggregate_rating | nr_events
----------+----------------+----------------------+------------------+-----------
   103019 | Mutual's       |                  650 |              4.2 |         3
   113537 | Puffizza       |                  700 |              4.3 |         2
 18413814 | Aangan Horizon |                  900 |              4.0 |         2
(3 rows)
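As a standalone sketch of the function, jsonb_array_length simply counts the top-level elements of a jsonb array (the event titles below are made up for illustration):

```sql
select jsonb_array_length('[{"event": {"title": "Live music"}},
                            {"event": {"title": "Quiz night"}}]'::jsonb);
-- 2
```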
Now that we have that list of three restaurants, let's have a look at what events they created to make our final decision:
select jsonb_data ->> 'id' id,
       jsonb_data ->> 'name' name,
       (jsonb_data ->> 'average_cost_for_two')::int average_cost_for_two,
       (jsonb_data -> 'user_rating' ->> 'aggregate_rating')::numeric aggregate_rating,
       events -> 'event' ->> 'title' event_title
from restaurant_data
    cross join lateral jsonb_array_elements(jsonb_data -> 'zomato_events') events
where jsonb_data ->> 'currency' = 'Rs.'
    and (jsonb_data ->> 'average_cost_for_two')::int < 1000
    and (jsonb_data -> 'user_rating' ->> 'aggregate_rating')::numeric >= 4
    and jsonb_array_length(jsonb_data -> 'zomato_events') > 1
order by 1, 2, 5
limit 5;
Wow, the #FlauntYourPizza Contest seems interesting (and possibly very dangerous)!
   id   |   name   | average_cost_for_two | aggregate_rating |            event_title
--------+----------+----------------------+------------------+-----------------------------------
 103019 | Mutual's |                  650 |              4.2 | ABHI SOLANKI LIVE
 103019 | Mutual's |                  650 |              4.2 | FREEBIRDS
 103019 | Mutual's |                  650 |              4.2 | IPL SPECIAL OFF
 113537 | Puffizza |                  700 |              4.3 | Dim Light Dinner (Every Thursday)
 113537 | Puffizza |                  700 |              4.3 | #FlauntYourPizza Contest
(5 rows)
Looks like we found our restaurant! Now it's time to go there, check the menu, and eat!
Conclusion
"Relational databases are too rigid": how many times have we heard that? It turns out they can also perform really well with semi-structured data like JSON, and PostgreSQL in particular has a deep set of functions for manipulating JSON objects. So, next time you have a JSON dataset, maybe give PostgreSQL a try?
Some more resources that you might find useful:
- PostgreSQL JSON functions
- Json vs jsonb, what are the differences?
- Get started with Aiven for PostgreSQL