How to query a remote ClickHouse server
In this guide, we're going to learn how to query a remote ClickHouse server from chDB.
Setup
Let's first create a virtual environment:
And now we'll install chDB. Make sure you have version 2.0.2 or higher:
And now we're going to install pandas, and ipython:
We're going to use ipython
to run the commands in the rest of the guide, which you can launch by running:
You can also use the code in a Python script or in your favorite notebook.
An intro to ClickPy
The remote ClickHouse server that we're going to query is ClickPy.
ClickPy keeps track of all the downloads of PyPI packages and lets you explore the stats of packages via a UI.
The underlying database is available to query using the play
user.
You can learn more about ClickPy in its GitHub repository.
Querying the ClickPy ClickHouse service
Let's import chDB:
We're going to query ClickPy using the remoteSecure
function.
This function takes in a host name, table name, and username at a minimum.
We can write the following query to return the number of downloads per day of the openai
package as a Pandas DataFrame:
Now let's do the same to return the downloads for scikit-learn
:
Merging Pandas DataFrames
We now have two DataFrames, which we can merge together based on date (which is the x
column) like this:
We can then compute the ratio of Open AI downloads to scikit-learn
downloads like this:
Querying Pandas DataFrames
Next, let's say we want to find the dates with the best and worst ratios. We can go back to chDB and compute those values:
If you want to learn more about querying Pandas DataFrames, see the Pandas DataFrames developer guide.