Jan 25, 2023
Why you should offload your PostgreSQL analytical workloads to ClickHouse
In this post, ClickHouse and PostgreSQL go head-to-head at analyzing YouTube videos. The results speak for themselves - find out what they say.
Many companies are still using a general-purpose database for every project. Usually it's the easiest way to get up and running quickly. If you already have a lot of expertise in that database, why would you use something else?
The answer is performance.
Generalist RDBMs are built to be pretty good at everything -- that's what a generalist is. But today, we have more and more database technologies created for specific workloads and use cases. They are designed from the ground up to be extremely good at one thing. And when it comes to query performance, a general RDBM doesn't stand a chance.
I'm going to demonstrate this with our new Aiven for ClickHouse database. And not just with some standard benchmarking tools - I'll test drive it with some real world data, something that can answer real questions that businesses might be interested in.
The dataset
I went hunting for interesting (and free) datasets that I could play around with.
Quick disclaimer: This is not going to be a deep dive into optimization of either Postgres or ClickHouse. Yes, you could use partitioning, secondary indexes, or all sorts of table or database settings to improve performance. This is intended to provide a baseline, because BOTH services can be optimized.
What I found is a treasure trove of data scraped from YouTube by the Archive Team (archive.org), featuring metadata for 4.5 billion videos as they were in December 2021. The reason this dataset exists is because YouTube announced they were going to no longer display the number of dislikes on videos. So this is the last point where we can see this information that they stopped sharing.
You can download the dataset yourself at the Internet Archive
For my own sanity, and to save time and space, I removed the video description and the metadata fields while loading the data. Both can have a lot of special characters, or new lines, and would frankly not be efficient to store in a data warehouse. Something like OpenSearch would be a better home, maybe for future exploration.
The questions
The next step was to come up with some questions I wanted to explore. This is what I chose:
- How do people react to live content, as opposed to uploaded videos? Since live content is by necessity less polished, do people like (or dislike) it at a higher ratio compared to pre-recorded videos?
- Which channels have a high view count to subscriber ratio? Are these generally driven by a few videos that have been extremely popular, but have not converted viewers into subscribers?
- What videos have gone most viral relative to the usual view count of their uploader?
- How has the use of subtitles changed over time? Has automation led to them being available on more videos?
- How does turning comments off affect the like / dislike ratio?
- What is the most watched / liked Unlisted video?
With these questions, I started to dig in.
Exploring the Data
I chose two plans with the same price:
- Aiven for ClickHouse - Business-16
- Aiven for Postgres - Business-32
I gave both some extra disk space. The first thing I noticed once I got the data loaded is that ClickHouse is much better on disk compression. The ClickHouse table was about a third the size of the Postgres table.
| - | ClickHouse | Postgres | | --------------------- | ------------- | ------------- | | Plan | Business-16 | Business-32 | | CPUs | 2 | 8 | | RAM | 16 | 32 | | Nodes | 2 | 2 | | Disk Available (GB) | 1150 | 1100 | | Disk Used (GB) | 362 | 956 |
Here are the table definitions I used.
ClickHouse
CREATE TABLE default.videos ( id String, fetch_date String, upload_date String, title String, uploader_id String, uploader String, uploader_sub_count Int32, is_age_limit Bool, view_count UInt64, like_count UInt64, dislike_count UInt64, is_crawlable Bool, is_live_content Bool, has_subtitles Bool, is_ads_enabled Bool, is_comments_enabled Bool ) ENGINE = MergeTree() ORDER BY (uploader_id)
Postgres
CREATE TABLE videos ( id TEXT, fetch_date TEXT, upload_date TEXT, title TEXT, uploader_id TEXT, uploader TEXT, uploader_sub_count BIGINT, is_age_limit Boolean, view_count BIGINT, like_count BIGINT, dislike_count BIGINT, is_crawlable Boolean, is_live_content Boolean, has_subtitles Boolean, is_ads_enabled Boolean, is_comments_enabled Boolean, PRIMARY KEY (id) )
Both of these tables are pretty basic, and we could definitly improve the performance of queries by adding secondary indexes, or by changing some settings. We could also make some changes to compress the data to use even less space. However, we're still in the exploratory phase. We can focus on optimization once we know what kind of queries we'll be running.
Live vs uploaded content
By definition, live content will be a little more raw and unpolished than uploaded videos, which are often extensively edited and composed by selecting from numerous takes. How do viewers feel about that? Lots of experts in influencer and content creator culture put heavy emphasis on the value of authenticity. Does live content feel more authentic? What about the interactive nature of live performances, where viewers can feel involved in the video (and the community) by chatting with other viewers and the creator - does that help drive likes as well?
ClickHouse query
SELECT is_live_content, AVG( COALESCE(like_count,0) / COALESCE(view_count,1) ) as LIKE_AVG, STDDEV_POP( COALESCE(like_count,0) / COALESCE(view_count,1) ) as LIKE_STD, AVG( COALESCE(dislike_count,0) / COALESCE(view_count,1) ) as DISLIKE_AVG, STDDEV_POP( COALESCE(dislike_count,0) / COALESCE(view_count,1) ) as DISLIKE_STD FROM videos WHERE view_count > 0 GROUP BY is_live_content;
Postgres query
SELECT is_live_content, AVG( COALESCE(like_count,0) / COALESCE(view_count,1) ) as LIKE_AVG, STDDEV_POP( COALESCE(like_count,0) / COALESCE(view_count,1) ) as LIKE_STD, AVG( COALESCE(dislike_count,0) / COALESCE(view_count,1) ) as DISLIKE_AVG, STDDEV_POP( COALESCE(dislike_count,0) / COALESCE(view_count,1) ) as DISLIKE_STD FROM videos WHERE view_count > 0 GROUP BY is_live_content;
Runtime
| ClickHouse | Postgres | % Diff | | ----------- | ----------- | ------ | | 113 seconds | 918 seconds | 712% |
Results
| Is Live? | Avg Likes/View | Std Dev | Avg Dislikes/View | Std Dev | | -------- | -------------- | ------- | ----------------- | ------- | | False | 0.0530 | 0.3160 | 0.0025 | 0.0479 | | True | 0.0985 | 0.4036 | 0.0060 | 0.0396 |
Our databases agree that live content in general gets more likes and dislikes per viewer, almost double the rate in both cases. Oddly the standard deviation for likes is higher for live content, but lower for dislikes.
Many Views, Few Subscribers
I was curious to find out what kind of channels have a very high number of views, as opposed to how many people are actually subscribed to the channel. What sort of videos do those channels create? Is it a single video that has gone viral? Or is it that to make it to the top of the list, you need a few videos with many views?
ClickHouse query
SELECT uploader_id, SUM(view_count)/MAX(uploader_sub_count) FROM videos WHERE uploader_sub_count > 10 GROUP BY uploader_id ORDER BY SUM(view_count)/MAX(uploader_sub_count) desc LIMIT 5;
Postgres query
SELECT uploader_id, SUM(view_count)/MAX(uploader_sub_count) FROM videos WHERE uploader_sub_count > 10 GROUP BY uploader_id ORDER BY SUM(view_count)/MAX(uploader_sub_count) desc LIMIT 5;
Runtime
| ClickHouse | Postgres | % Diff | | ----------- | ------------ | ------- | | 193 seconds | 3756 seconds | 1846% |
Results
| uploader_id | Current Channel Name | Views:Subscriber ratio | | ------------------------ | -------------------- | -----------------------| | UCZcSkQahCZvXg742oP7hhsw | Ziyan Lee | 10,423,195 | | UC3OqJzvmr4dJkoWxLuXMR1Q | Day New | 8,427,802 | | UCzXfnq1jnSKsbCYwCgci-vA | Vive Games | 7,704,057 | | UCsCP_JmGPLM319iSVURXhPQ | game funny | 7,491,109 | | UCGwjz9YOQXoAXJW-2kNyD8g | Naga | 6,576,452 |
This is a little disappointing: ranks 1, 2, and 5 are what look like ads for low effort mobile games, and 3 and 4 only have unlisted videos.
Let's see if we can filter out the ads and get more interesting results by bumping up to a minimum of 1000 subscribers, and filter to only count listed videos:
| uploader_id | Current Channel Name | Views:Subscriber ratio | | ------------------------ | -------------------- | ---------------------- | | UCRuLp80UV9FO3EFbQzF-D5A | SHADIAO VIDEOS | 673887 | | UC2WmR5jLqZWivWKrs5j2ZQg | 石秀云 | 629814 | | UC21yzf1STldxuTtFV_y_qKg | (Terminated Account) | 622086 | | UCSSmw60pHH3kKOj2xUwYdOQ | UAC Google | 461971 | | UC5LCpCZy0tnDhzGGI-fHlkQ | Eshare kwai | 415367 |
Hmm, again ranks 1, 2, and 5 are similar, mostly reposts from Likee, and all accounts have a few very popular videos. The rank 3 account has been terminated, so nothing to report there. Rank 4 is once again mobile app ads, this time showing for an app that can age or de-age your selfies.
Since digital advertising tends to have low conversion rates, perhaps it's not surprising that this section is dominated by ads instead of viral videos like I expected.
Subtitles over the years
With advances in speech recognition, it’s easier than ever to create subtitles for videos. Are people taking advantage of this? They added auto-captioning in late 2009, was there a jump around that time?
ClickHouse query
SELECT Substring(upload_date,1,4) as upload_year, AVG(has_subtitles) FROM videos GROUP BY Substring(upload_date,1,4) ORDER BY Substring(upload_date,1,4) ASC;
Postgres query
SELECT Substring(upload_date,1,4) as upload_year, AVG(CASE WHEN has_subtitles THEN 1 ELSE 0 END) FROM videos GROUP BY Substring(upload_date,1,4) ORDER BY Substring(upload_date,1,4) ASC;
Runtime
| ClickHouse | Postgres | % Diff | | ----------- | ------------ | ------- | | 121 seconds | 1021 seconds | 743% |
Results
| Year | % of Videos w/ Subtitles| | ---- | -------------------- | | 2005 | 9.4 | | 2006 | 10.9 | | 2007 | 11.4 | | 2008 | 12.8 | | 2009 | 14.8 | | 2010 | 16.3 | | 2011 | 16.4 | | 2012 | 22.5 | | 2013 | 30.4 | | 2014 | 28.1 | | 2015 | 31.1 | | 2016 | 32.0 | | 2017 | 32.1 | | 2018 | 35.6 | | 2019 | 39.8 | | 2020 | 56.0 | | 2021 | 58.5 |
People certainly have been making increasing use of subtitles for their videos. The largest jump was in 2020, which I had to look into for curiosity's sake. Apparently at that, time YouTube was removing their community captions feature, which allowed you to upload captions for other people's video. This prompted a very successful campaign to have creators add captions to their videos for hard of hearing and deaf viewers.
Comments and controversial videos
Comments are a mixed blessing. They’re an important sign of engagement for the algorithms that rule the internet, and can be a valuable source of feedback for creators. But they can also be vector for harassment, much more impactful than a simple “Dislike”. Do creators turn comments off sometimes for their more controversial videos?
ClickHouse query
SELECT CEILING(log10(view_count)) as Views_OOM, AVG( CASE WHEN is_comments_enabled = 1 THEN NULL ELSE CASE WHEN COALESCE(like_count,0) = COALESCE(dislike_count,0) THEN 0.5 WHEN like_count IS NULL or like_count = 0 THEN 0 WHEN dislike_count IS NULL or dislike_count = 0 THEN 1 ELSE like_count / dislike_count END END) as comments_disabled_like_ratio, AVG( CASE WHEN is_comments_enabled = 1 THEN CASE WHEN COALESCE(like_count,0) = COALESCE(dislike_count,0) THEN 0.5 WHEN like_count IS NULL or like_count = 0 THEN 0 WHEN dislike_count IS NULL or dislike_count = 0 THEN 1 ELSE like_count / dislike_count END ELSE NULL END) as comments_enabled_like_ratio FROM videos GROUP BY Views_OOM ORDER BY Views_OOM;
Postgres Query
SELECT CEILING(log10(view_count + 1)) as Views_OOM, AVG( CASE WHEN is_comments_enabled THEN NULL ELSE CASE WHEN COALESCE(like_count,0) = COALESCE(dislike_count,0) THEN 0.5 WHEN like_count IS NULL or like_count = 0 THEN 0 WHEN dislike_count IS NULL or dislike_count = 0 THEN 1 ELSE like_count / dislike_count END END) as comments_disabled_like_ratio, AVG( CASE WHEN is_comments_enabled THEN CASE WHEN COALESCE(like_count,0) = COALESCE(dislike_count,0) THEN 0.5 WHEN like_count IS NULL or like_count = 0 THEN 0 WHEN dislike_count IS NULL or dislike_count = 0 THEN 1 ELSE like_count / dislike_count END ELSE NULL END) as comments_enabled_like_ratio FROM videos GROUP BY Views_OOM ORDER BY Views_OOM; -- A few small changes because PG doesn't like Log(0), and stores boolean as T/F instead of 1/0
Runtime
| ClickHouse | Postgres | % Diff | | ----------- | ------------ | ------- | | 216 seconds | 1106 seconds | 412% |
Results
| Views | Likes / Dislikes | | | Comments Disabled | Comments Enabled | | --------------------- | ----------------- | ---------------- | | 0 | 0.53 | 0.54 | | Less than 10 | 0.63 | 0.66 | | Tens | 1.10 | 1.14 | | Hundreds | 3.47 | 4.54 | | Thousands | 14.01 | 22.37 | | Tens of Thousands | 22.88 | 40.92 | | Hundreds of Thousands | 19.01 | 37.63 | | Millions | 12.05 | 30.58 | | Tens of Millions | 6.66 | 23.24 | | Hundreds of Millions | 3.08 | 20.65 | | Billions | 1.77 | 19.55 |
Videos with comments disabled on average have a much worse Like
The other interesting thing about this data is the way the ratio of likes to dislike spikes in the 100k - million views range, both for videos with comments and without. This suggests to me that in that range, the YouTube algorithm is doing a good job predicting who will like a video. But as views increase, the odds that someone who dislikes the content of a video will be exposed to it increase.
So I asked a follow-up question: when commenting is disabled, are people more likely to like or dislike to express their feelings about a video?
ClickHouse query
SELECT CEILING(log10(view_count)) as Views_OOM, AVG( CASE WHEN is_comments_enabled = 1 THEN NULL ELSE like_count + dislike_count END) as comments_disabled_interactions, AVG( CASE WHEN is_comments_enabled =1 THEN like_count + dislike_count ELSE NULL END) as comments_enabled_interactions FROM videos GROUP BY Views_OOM ORDER BY Views_OOM;
Postgres Query
SELECT CEILING(log10(view_count + 1)) as Views_OOM, AVG( CASE WHEN is_comments_enabled THEN NULL ELSE like_count + dislike_count END) as comments_disabled_interactions, AVG( CASE WHEN is_comments_enabled THEN like_count + dislike_count ELSE NULL END) as comments_enabled_interactions FROM videos GROUP BY Views_OOM ORDER BY Views_OOM; -- A few small changes because PG doesn't like Log(0), and stores boolean as T/F instead of 1/0
Runtime
| ClickHouse | Postgres | % Diff | | ----------- | ------------ | ------- | | 165 seconds | 992 seconds | 501% |
Results
| Views | Average Likes + Dislikes | | | Comments Disabled | Comments Enabled | | --------------------- | ----------------- | ---------------- | | 0 | 0.10 | 0.10 | | Less than 10 | 0.45 | 0.56 | | Tens | 2.59 | 3.20 | | Hundreds | 10.56 | 13.74 | | Thousands | 55.86 | 80.88 | | Tens of Thousands | 414.67 | 699.82 | | Hundreds of Thousands | 3,193.55 | 5,903.21 | | Millions | 22,368.78 | 42,363.47 | | Tens of Millions | 138,028.31 | 256,495.88 | | Hundreds of Millions | 967,767.47 | 1,520,044.48 | | Billions | 8,072,880.48 | 9,185,313.54 |
No, indeed not. Videos with comments enabled have on average more interactions (likes and dislikes) than videos with similar view counts and comments disabled.
Unlisted videos
YouTube officially claims to have over 800 million videos, but our dataset shows over 4 billion records. Are those all unlisted? Our analysis has focused mostly on listed videos. So what's the most viewed video that is unlisted?
ClickHouse query
SELECT title, uploader, view_count FROM videos WHERE is_crawlable = 0 ORDER BY view_count desc LIMIT 2;
Postgres Query
SELECT title, uploader, view_count FROM videos WHERE is_crawlable ORDER BY view_count desc LIMIT 2;
Runtime
| ClickHouse | Postgres | % Diff | | ----------- | ------------ | ------- | | 301 seconds | 1006 seconds | 234% |
Results
| Title | Uploader | Views | | ------------------------------------------------------- | --------- | ------------- | | 009 211 Baton Recut 1920x1080 15s | Instagram | 2,048,634,523 | | 009 214 HeartShapedWorld Acq EN UAC 1920x1080 15s V1 FB | Instagram | 1,554,823,463 |
Boring! It's an Instagram ad with 2 billion views. After analyzing videos with a high view count to sub ratio, maybe this is what I should've expected. There's got to be a way to separate viral videos from ads! We'll just have to keep digging.
Top 1% of views
ClickHouse also has some intertesting statistical functions built in. As I'm interested in views again, let's get a sense of how views are distributed among videos. Before you look, try to guess how many views you need to be in the top 1% of videos.
ClickHouse query
SELECT quantiles(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99)(view_count) FROM videos;
Yes, postgres can do similar functions, but for this specific example needs multiple steps, and would require an order of magnitude more resources. For the ClickHouse quantile function, it uses sampling to calculate approximations
Runtime
42 seconds for ClickHouse
Results
| Percentile | View Count | | ---------- | ---------- | | 10th | 7 | | 20th | 18 | | 30th | 36 | | 40th | 64 | | 50th | 118 | | 60th | 220 | | 70th | 433 | | 80th | 1,093 | | 90th | 4,256 | | 99th | 179,308 |
It's pretty cool to see the numbers laid out like that. I did not expect the top 1% to be so high, given how many videos there are on the platform.
Conclusion
On average, the Aiven for ClickHouse database performed 5-10 times better than Aiven for PostgreSQL. And in one specific case, question #2, the performance was 20x better! Postgres is a great general purpose database, but it struggles to keep up with ClickHouse on these large datasets, especially column aggregations. It makes perfect sense, ClickHouse has been designed for exactly that purpose. If we were doing a performance test with a lot of updates and deletes, it would be ClickHouse fighting for its life.
Every database has an intended use. From the very beginning of its development, the developers make certain trade-offs, and you want to go with the grain, not against it.
If you only take one thing away from our little experiment, it should be this: choosing the right tool for your use case saves you time, both in development and execution, so you can focus on getting answers to the big questions.
To get the latest news about Aiven and our services, plus a bit of extra around all things open source, subscribe to our monthly newsletter! Daily news about Aiven is available on our LinkedIn and Twitter feeds.
If you just want to find out about our service updates, follow our changelog.
Further reading
Stay updated with Aiven
Subscribe for the latest news and insights on open source, Aiven offerings, and more.