Develop and leverage AI models in ClickHouse®
Learn how to train AI models and perform live scoring with a set of SQL statements and Aiven for ClickHouse®
ClickHouse® is a well-known analytical database, but one of its lesser-known features is the ability to train AI models and perform live scoring with a set of SQL statements. In this tutorial we'll show how, starting from a taxi dataset, we can train a stochastic linear regression model and use it to predict the cost of a ride.
Prerequisites
- An Aiven account. Sign up for a free trial today
- Docker, to interact with the ClickHouse® service, upload the data, and query it
Start with Aiven for ClickHouse
We can create an Aiven for ClickHouse® service with the following process:
- Access the Aiven Console
- Create an Aiven for ClickHouse® service and specify:
  - The cloud provider and region
  - The service plan, defining the size of the service. We can use the free tier for the purpose of this tutorial
  - The additional storage disk size
  - The service name
- Click Create

Wait a couple of minutes until the service is in the `RUNNING` state.
Load the data
For this tutorial, we'll use a dataset containing New York taxi trip data from Kaggle. Navigate to the Kaggle website, download the file, and unzip it. The folder contains a set of files with the naming pattern `yellow_tripdata_YYYY-MM.csv`. Each one includes a month of taxi rides from the years 2019 and 2020.
To load it into Aiven for ClickHouse, connect to the database with the `clickhouse-client` Docker image:

```bash
docker run -it \
  --rm clickhouse/clickhouse-server clickhouse-client \
  --user USERNAME \
  --password PASSWORD \
  --host HOST \
  --port PORT \
  --secure
```
You can find the `USERNAME`, `PASSWORD`, `HOST`, and `PORT` information in the Aiven Console, on the Aiven for ClickHouse service overview page, in the ClickHouse Native tab.
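Optionally, once connected, you can run a trivial query to confirm the session works before moving on, for example:

```sql
SELECT currentUser(), version();
```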
Create a `taxi_trips_train` table for our training dataset:

```sql
CREATE TABLE taxi_trips_train (
    VendorID UInt32,
    tpep_pickup_datetime DateTime,
    tpep_dropoff_datetime DateTime,
    passenger_count UInt32,
    trip_distance Float32,
    RatecodeID UInt32,
    store_and_fwd_flag String,
    PULocationID UInt32,
    DOLocationID UInt32,
    payment_type Enum('CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4, 'UNK' = 5),
    fare_amount Float32,
    extra Float32,
    mta_tax Float32,
    tip_amount Float32,
    tolls_amount Float32,
    improvement_surcharge Float32,
    total_amount Float32,
    congestion_surcharge Float32
)
ENGINE = MergeTree
PRIMARY KEY (tpep_pickup_datetime, tpep_dropoff_datetime);
```
Create another table, named `taxi_trips_test`, for our testing dataset:

```sql
CREATE TABLE taxi_trips_test (
    VendorID UInt32,
    tpep_pickup_datetime DateTime,
    tpep_dropoff_datetime DateTime,
    passenger_count UInt32,
    trip_distance Float32,
    RatecodeID UInt32,
    store_and_fwd_flag String,
    PULocationID UInt32,
    DOLocationID UInt32,
    payment_type Enum('CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4, 'UNK' = 5),
    fare_amount Float32,
    extra Float32,
    mta_tax Float32,
    tip_amount Float32,
    tolls_amount Float32,
    improvement_surcharge Float32,
    total_amount Float32,
    congestion_surcharge Float32
)
ENGINE = MergeTree
PRIMARY KEY (tpep_pickup_datetime, tpep_dropoff_datetime);
```
Then exit from the Docker container:

```bash
exit
```
Next, we need to clean our data source a little. Remove the Windows end-of-line characters from `yellow_tripdata_2019-01.csv`, representing our training dataset, with:

```bash
tr -d '\015' <yellow_tripdata_2019-01.csv >yellow_tripdata_2019-01-polished.csv
```
Then load the data from the `yellow_tripdata_2019-01-polished.csv` file into the `taxi_trips_train` table with:

```bash
cat yellow_tripdata_2019-01-polished.csv | docker run -i \
  --rm clickhouse/clickhouse-server clickhouse-client \
  --user USERNAME \
  --password PASSWORD \
  --host HOST \
  --port PORT \
  --secure \
  --max_insert_block_size=100000 \
  --query="INSERT INTO taxi_trips_train FORMAT CSVWithNames"
```
Load the `yellow_tripdata_2019-02.csv` data into the `taxi_trips_test` table; this represents our testing dataset:

```bash
tr -d '\015' <yellow_tripdata_2019-02.csv >yellow_tripdata_2019-02-polished.csv

cat yellow_tripdata_2019-02-polished.csv | docker run -i \
  --rm clickhouse/clickhouse-server clickhouse-client \
  --user USERNAME \
  --password PASSWORD \
  --host HOST \
  --port PORT \
  --secure \
  --max_insert_block_size=100000 \
  --query="INSERT INTO taxi_trips_test FORMAT CSVWithNames"
```
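As an optional sanity check, you can verify that both tables received data; the exact counts depend on the files you downloaded:

```sql
SELECT 'taxi_trips_train' AS table_name, count(*) AS row_count FROM taxi_trips_train
UNION ALL
SELECT 'taxi_trips_test', count(*) FROM taxi_trips_test;
```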
Train a stochastic linear regression model
Once the data is uploaded, we can focus on the main task of training an AI model. Looking at the dataset, we notice a column called `fare_amount`, which represents the fare calculated by the meter based on time and distance. Next, we'll build a model to predict `fare_amount` based on other columns in the dataset, using the stochastic linear regression function built into ClickHouse.
To train the model, connect to ClickHouse again:

```bash
docker run -it \
  --rm clickhouse/clickhouse-server clickhouse-client \
  --user USERNAME \
  --password PASSWORD \
  --host HOST \
  --port PORT \
  --secure
```
Then execute the following SQL to train the stochastic linear regression model:

```sql
CREATE TABLE taxi_trips_model
ENGINE = Memory AS
SELECT stochasticLinearRegressionState(0.001, 0.1, 15, 'Adam')(
    fare_amount,
    passenger_count,
    trip_distance,
    RatecodeID,
    PULocationID,
    DOLocationID
) AS state
FROM taxi_trips_train;
```
The above statement trains a stochastic linear regression model with the following parameters:

- `taxi_trips_model` as the name
- `0.001` as the learning rate
- `0.1` as the L2 regularization coefficient
- `15` as the mini-batch size
- `Adam` as the method for updating weights
- `fare_amount` as the target column
- `passenger_count`, `trip_distance`, `RatecodeID`, `PULocationID`, and `DOLocationID` as training parameters
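For background, `stochasticLinearRegression` fits a plain linear model: the stored state is essentially a bias and one weight per feature, learned by mini-batch gradient descent (here with the Adam update rule) on the squared error. Conceptually, the prediction has the form:

$$
\widehat{\text{fare\_amount}} = b + w_1 \cdot \text{passenger\_count} + w_2 \cdot \text{trip\_distance} + w_3 \cdot \text{RatecodeID} + w_4 \cdot \text{PULocationID} + w_5 \cdot \text{DOLocationID}
$$

Note that this treats categorical IDs such as `RatecodeID` and the location IDs as plain numbers, which is a crude simplification worth keeping in mind when interpreting the results.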
Test the stochastic linear regression model
After building the model, let's test it against our `taxi_trips_test` dataset. We can compare the actual `fare_amount` in our test dataset with the prediction from `taxi_trips_model` with:

```sql
WITH (SELECT state FROM taxi_trips_model) AS model
SELECT
    rowNumberInAllBlocks(),
    fare_amount,
    evalMLMethod(
        model,
        passenger_count,
        trip_distance,
        RatecodeID,
        PULocationID,
        DOLocationID
    ) AS prediction
FROM taxi_trips_test
LIMIT 10;
```
The above SQL:

- Retrieves the `state` from the `taxi_trips_model` model
- Includes the row number with the `rowNumberInAllBlocks()` function
- Selects the `fare_amount` from the test dataset
- Includes the prediction from the model using the `evalMLMethod` function, passing the same columns used for training, but coming from the test dataset
The results show predictions that are not too far from the actual data:

```text
┌─rowNumberInAllBlocks()─┬─fare_amount─┬─────────prediction─┐
│                      0 │          52 │  57.29841872226803 │
│                      1 │        42.5 │   43.8984970367608 │
│                      2 │           7 │  7.425320721767511 │
│                      3 │          10 │ 11.119674442058924 │
│                      4 │        11.5 │ 13.151729718574753 │
│                      5 │           7 │  7.861355191000587 │
│                      6 │         9.5 │ 10.534773043722428 │
│                      7 │           9 │ 10.270891195885282 │
│                      8 │         5.5 │  6.191941850198845 │
│                      9 │          16 │ 17.262316387594158 │
└────────────────────────┴─────────────┴────────────────────┘
```
Measure the stochastic linear regression model performance
How does the model perform? In the AI/ML field there is a set of different KPIs for assessing the quality of a model; in this example we are going to use the mean absolute error (MAE). The MAE is the average of the absolute difference between the prediction and the actual value in the testing dataset. We can calculate the overall MAE with:
```sql
WITH
    (SELECT state FROM taxi_trips_model) AS model,
    pred AS (
        SELECT
            rowNumberInAllBlocks() AS rownum,
            fare_amount,
            evalMLMethod(
                model,
                passenger_count,
                trip_distance,
                RatecodeID,
                PULocationID,
                DOLocationID
            ) AS prediction
        FROM taxi_trips_test
    )
SELECT
    avg(abs(fare_amount - prediction)) AS MAE,
    avg(fare_amount) AS avg_fare_amount,
    avg(abs(fare_amount - prediction)) * 100.0 / avg(fare_amount) AS pct
FROM pred
WHERE fare_amount > 0 AND fare_amount <= 500;
```
The above query calculates:

- The `MAE` as the average of the absolute difference between the actual `fare_amount` and the `prediction`
- The average `fare_amount`
- The weight of the `MAE` on the average `fare_amount`, expressed as a percentage
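In formula terms, with $y_i$ the actual `fare_amount`, $\hat{y}_i$ the prediction, and $n$ the number of test rows, the query computes:

$$
\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|
$$

together with the same quantity expressed as a percentage of the average fare.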
The results say that our model is off by 2.03 dollars per trip on average; the error is around 15.97% of the average fare. Not bad for a single SQL query.
```text
┌────────────────MAE─┬────avg_fare_amount─┬────────────────pct─┐
│  2.035759828017413 │ 12.746705737557932 │ 15.970870199184755 │
└────────────────────┴────────────────────┴────────────────────┘
```
Improve the model with feature engineering in ClickHouse
Can we improve the performance? Yes, with some feature engineering! Feature engineering is the practice of creating a new set of parameters to make the model more accurate. You can derive these parameters from existing ones, enrich them with external sources, or aggregate them from various fields. All of these operations are native to ClickHouse, which allows us to explore the data and reshape it as needed.
The first step is to check the data quality; let's investigate the presence of outliers in the `fare_amount` column by bucketing the data into $10 buckets:

```sql
SELECT
    cast(fare_amount / 10 AS int) * 10,
    count(*)
FROM taxi_trips_train
GROUP BY cast(fare_amount / 10 AS int) * 10
ORDER BY 1;
```
Notice that there are a lot of trips with a negative `fare_amount`, as well as amounts greater than $500. These are probably mistakes in the data collection, and outliers we might want to take into account when creating the model.
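To quantify how many rows fall outside a plausible range (the 0 and 500 dollar thresholds are our own choice for this tutorial), a quick check could look like this:

```sql
SELECT
    countIf(fare_amount < 0)   AS negative_fares,
    countIf(fare_amount > 500) AS fares_above_500,
    count(*)                   AS total_rows
FROM taxi_trips_train;
```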
Let's create a new model that:

- Excludes the outliers identified above, with a cost greater than $500
- Predicts the absolute value of `fare_amount`, avoiding the negative numbers
- Includes, as additional parameters:
  - The pickup hour
  - The pickup day of the week
  - The time from pickup to dropoff, in buckets of 10 minutes
First, drop the `taxi_trips_model_enhanced` table if it already exists from a previous attempt:

```sql
DROP TABLE IF EXISTS taxi_trips_model_enhanced;
```
Then create it with our new model:

```sql
CREATE TABLE taxi_trips_model_enhanced
ENGINE = Memory AS
SELECT stochasticLinearRegressionState(0.001, 0.1, 15, 'Adam')(
    abs(fare_amount),
    toDayOfWeek(tpep_pickup_datetime),
    toHour(tpep_pickup_datetime),
    age('minute', tpep_pickup_datetime, tpep_dropoff_datetime) / 10,
    passenger_count,
    trip_distance,
    RatecodeID,
    PULocationID,
    DOLocationID
) AS state
FROM taxi_trips_train
WHERE fare_amount <= 500;
```
The pickup hour and day of the week are added to the new `taxi_trips_model_enhanced` model by the `toDayOfWeek(tpep_pickup_datetime)` and `toHour(tpep_pickup_datetime)` columns. The calculation `age('minute', tpep_pickup_datetime, tpep_dropoff_datetime)/10` provides the time between pickup and dropoff in buckets of 10 minutes.
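If you want a feel for these derived features before training, you can preview them on a few rows; this is purely an optional exploration step:

```sql
SELECT
    tpep_pickup_datetime,
    toDayOfWeek(tpep_pickup_datetime) AS pickup_day_of_week,
    toHour(tpep_pickup_datetime) AS pickup_hour,
    age('minute', tpep_pickup_datetime, tpep_dropoff_datetime) / 10 AS duration_10min_buckets
FROM taxi_trips_train
LIMIT 5;
```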
We can test the new model with:

```sql
WITH
    (SELECT state FROM taxi_trips_model_enhanced) AS model,
    pred AS (
        SELECT
            rowNumberInAllBlocks() AS rownum,
            fare_amount,
            evalMLMethod(
                model,
                toDayOfWeek(tpep_pickup_datetime),
                toHour(tpep_pickup_datetime),
                age('minute', tpep_pickup_datetime, tpep_dropoff_datetime) / 10,
                passenger_count,
                trip_distance,
                RatecodeID,
                PULocationID,
                DOLocationID
            ) AS prediction
        FROM taxi_trips_test
    )
SELECT
    avg(abs(abs(fare_amount) - prediction)) AS MAE,
    avg(abs(fare_amount)) AS avg_fare_amount,
    avg(abs(abs(fare_amount) - prediction)) * 100.0 / avg(fare_amount) AS pct
FROM pred
WHERE abs(fare_amount) <= 500;
```
The result shows an improvement of more than one percentage point compared to the previous model.
```text
┌────────────────MAE─┬────avg_fare_amount─┬────────────────pct─┐
│ 1.8768253472744798 │ 12.658898150895748 │ 14.854519179740457 │
└────────────────────┴────────────────────┴────────────────────┘
```
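Finally, the same `evalMLMethod` call also works for live scoring of a single new ride as it arrives. The values below are purely hypothetical, made up for illustration (pickup time, duration, distance, and location IDs are not taken from the dataset):

```sql
WITH (SELECT state FROM taxi_trips_model_enhanced) AS model
SELECT evalMLMethod(
    model,
    toDayOfWeek(toDateTime('2019-03-01 18:30:00')),  -- hypothetical pickup day of week
    toHour(toDateTime('2019-03-01 18:30:00')),       -- hypothetical pickup hour
    2.1,   -- ~21 minutes from pickup to dropoff, in 10-minute buckets
    1,     -- passenger_count
    3.5,   -- trip_distance
    1,     -- RatecodeID
    161,   -- PULocationID
    236    -- DOLocationID
) AS predicted_fare;
```

The same pattern can be pointed at a table of incoming rides instead of literal values, so predictions are computed where the data already lives.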
You can build in further improvements to the model parameters to raise its precision. This tutorial demonstrates how, with just SQL statements, you can train and test a stochastic linear regression model all within ClickHouse.
Conclusion
Adding artificial intelligence to your application doesn't always require external tools. Moving data to external tools removes the security barriers you set up in your database and exposes you to the risk of accidental data access or the use of stale data. Modern databases like ClickHouse expose interesting functionality that allows you to implement model training and prediction directly where your data resides, enabling you to keep your data secure and the process performant.