Installing chDB for Python
Requirements
Python 3.8+ on macOS and Linux (x86_64 and ARM64)
Install
Install the package from PyPI with pip install chdb.
Usage
CLI example: python3 -m chdb "SELECT 1,'abc'" Pretty
Python file example:
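A minimal sketch of a script-style query. The `chdb.query(sql, output_format)` call follows the upstream README; the helper name `run_query` and the `try/except` import guard are additions for this sketch, so that the file still loads on a machine where chDB is not installed.

```python
try:
    import chdb  # third-party: pip install chdb
except ImportError:  # guard added for this sketch only
    chdb = None


def run_query(sql, output_format="CSV"):
    """Run one SQL statement in-process and return the result as text."""
    result = chdb.query(sql, output_format)
    return str(result)


if chdb is not None:
    # "Pretty" renders a boxed table; "CSV" is one of the other text formats.
    print(run_query("SELECT version()", "Pretty"))
    print(run_query("SELECT 1 AS n, 'abc' AS s", "CSV"))
```

The second argument selects the output format, so the same statement can be rendered in different ways without changing the SQL.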
Queries can return data using any supported format, as well as Dataframe and Debug.
GitHub repository
You can find the GitHub repository for the project at chdb-io/chdb.
Data Input
The following methods are available to access on-disk and in-memory data formats:
Query On File (Parquet, CSV, JSON, Arrow, ORC and 60+)
You can execute SQL against a file and return the data in the desired format.
Work with Parquet or CSV
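As an illustration, the sketch below writes a small CSV file and queries it through the ClickHouse `file()` table function. The file contents, the helper names `write_sample_csv` and `build_file_query`, and the import guard are assumptions made for this example, not part of chDB.

```python
import csv
import os
import tempfile

try:
    import chdb  # third-party: pip install chdb
except ImportError:  # guard added for this sketch only
    chdb = None


def write_sample_csv(path):
    """Write a tiny CSV file with a header row."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["city", "temp"])
        writer.writerows([["Oslo", 4], ["Cairo", 31], ["Lima", 19]])


def build_file_query(path, fmt="CSVWithNames"):
    """Build a SELECT over a file via the file() table function."""
    # CSVWithNames tells the engine that the first row holds column names.
    return f"SELECT city FROM file('{path}', {fmt}) WHERE temp > 10 ORDER BY city"


with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "sample.csv")
    write_sample_csv(csv_path)
    sql = build_file_query(csv_path)
    # The query runs only when chDB is available.
    output = str(chdb.query(sql, "CSV")) if chdb is not None else None
    print(output)
```

For a Parquet file the same pattern should apply with the format name Parquet, and passing Dataframe as the output format returns a pandas DataFrame instead of text.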
Pandas DataFrame output
Query On Table (Pandas DataFrame, Parquet file/bytes, Arrow bytes)
Query On Pandas DataFrame
Query with Stateful Session
A session keeps query state across calls. All DDL and DML state is stored in a directory, whose path can be passed as an argument.
If no path is passed, a temporary directory is created and is deleted when the Session object is deleted. If a path is passed, the directory is kept.
Note that the default database is _local and the default engine is Memory, which means all data will be stored in memory. If you want to store data on disk, you should create another database.
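A minimal sketch of a stateful session, modelled on the upstream README. It creates a separate database so that data is not held in the default Memory engine. The database and table names, the helper `run_session`, and the import guard are illustrative assumptions.

```python
try:
    from chdb import session as chs  # third-party: pip install chdb
except ImportError:  # guard added for this sketch only
    chs = None

# Database and table names are illustrative.
SETUP = [
    "CREATE DATABASE IF NOT EXISTS demo_db ENGINE = Atomic",
    "CREATE TABLE IF NOT EXISTS demo_db.events (name String, n Int32) ENGINE = Log",
    "INSERT INTO demo_db.events VALUES ('a', 1), ('b', 2)",
]


def run_session():
    """Run DDL, DML and a SELECT in one session and return the CSV result."""
    sess = chs.Session()  # no path: state lives in a temporary directory
    for statement in SETUP:
        sess.query(statement)
    # Earlier statements are still visible because the session keeps state.
    return str(sess.query("SELECT sum(n) FROM demo_db.events", "CSV"))


result = run_session() if chs is not None else None
print(result)
```

Passing a directory path to `Session` instead would keep the state on disk after the object is gone, as described above.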
See also: test_stateful.py.
Query with Python DB-API 2.0
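A hedged sketch of the DB-API 2.0 flow (connect, cursor, execute, fetch), using the `chdb.dbapi` module as shown in the upstream README. The helper name `fetch_one` and the import guard are assumptions for this example. The exact Python types of the returned row may vary between chDB versions, so the sketch only prints them.

```python
try:
    import chdb.dbapi as dbapi  # third-party: pip install chdb
except ImportError:  # guard added for this sketch only
    dbapi = None


def fetch_one(sql):
    """Execute sql and return (column names, first row)."""
    conn = dbapi.connect()
    try:
        cur = conn.cursor()
        cur.execute(sql)
        # DB-API 2.0: each description entry starts with the column name.
        columns = [entry[0] for entry in cur.description]
        row = cur.fetchone()
        cur.close()
        return columns, row
    finally:
        conn.close()


outcome = fetch_one("SELECT 1 AS n, 'abc' AS s") if dbapi is not None else None
print(outcome)
```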
Query with UDF (User Defined Functions)
Some notes on the chDB Python UDF (User Defined Function) decorator.
- The function should be stateless. Only UDFs are supported, not UDAFs (User Defined Aggregate Functions).
- The default return type is String. To change it, pass the return type as an argument. The return type should be one of the ClickHouse data types.
- The function should take in arguments of type String. As the input is TabSeparated, all arguments are strings.
- The function will be called for each line of input.
- The function should be a pure Python function. You should import, inside the function, all Python modules that it uses.
- The Python interpreter used is the same as the one used to run the script. You can get it from sys.executable.
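The notes above imply a simple calling convention: the function receives one TabSeparated line at a time, and every argument arrives as a string. The sketch below simulates that convention in plain Python, without chDB. The driver `call_per_line` is a hypothetical stand-in for what the engine does, not a chDB API.

```python
def sum_udf(lhs, rhs):
    # Arguments arrive as strings, so convert before doing arithmetic.
    return int(lhs) + int(rhs)


def call_per_line(func, tsv_text):
    """Hypothetical driver: split each TabSeparated line and call func once."""
    results = []
    for line in tsv_text.splitlines():
        args = line.split("\t")
        results.append(str(func(*args)))
    return results


outputs = call_per_line(sum_udf, "12\t22\n3\t4\n")
print(outputs)  # ['34', '7']
```

In real use, per the upstream README, the function would instead be decorated with `@chdb_udf()` from `chdb.udf` and called from SQL, for example `select sum_udf(12, 22)`.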
See also: test_udf.py.
Python Table Engine
Query on Pandas DataFrame
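A short sketch of the `Python(...)` table function applied to a DataFrame, following the upstream README. The sample data and the import guard are assumptions for this example, and the DataFrame is kept at module level so that the name `df` is visible to the query.

```python
try:
    import chdb  # third-party: pip install chdb
    import pandas as pd
except ImportError:  # guard added for this sketch only
    chdb = pd = None

if chdb is not None:
    df = pd.DataFrame(
        {
            "a": [1, 2, 3, 4, 5, 6],
            "b": ["tom", "jerry", "auxten", "tom", "jerry", "auxten"],
        }
    )
    # Python(df) refers to the variable named df defined above.
    res = chdb.query(
        "SELECT b, sum(a) AS total FROM Python(df) GROUP BY b ORDER BY b", "CSV"
    )
    print(res)
```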
Query on Arrow Table
Query on chdb.PyReader class instance
- You must inherit from the chdb.PyReader class and implement the read method.
- The read method should:
  - return a list of lists, where the first dimension is the column and the second dimension is the row; the column order should be the same as the first argument col_names of read.
  - return an empty list when there is no more data to read.
  - be stateful: the cursor should be updated in the read method.
- An optional get_schema method can be implemented to return the schema of the table. The prototype is def get_schema(self) -> List[Tuple[str, str]]: and the return value is a list of tuples, each containing the column name and the column type. The column type should be one of the ClickHouse data types.
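The read contract above can be sketched as a plain class that runs without chDB. In real use the class would inherit from chdb.PyReader and be queried as `Python(reader)`; here the base class is omitted and a hypothetical `drain` helper plays the role of the engine. The class name `ColumnReader`, the second parameter `count` (maximum rows per call, as in the upstream README), and the type names returned by `get_schema` are assumptions.

```python
from typing import Dict, List, Tuple


class ColumnReader:
    """Column-oriented reader following the read() contract described above."""

    def __init__(self, data: Dict[str, list]):
        self.data = data
        self.cursor = 0  # state kept between read() calls

    def read(self, col_names: List[str], count: int) -> List[list]:
        total = len(next(iter(self.data.values())))
        if self.cursor >= total:
            return []  # no more data to read
        end = min(self.cursor + count, total)
        # One inner list per column, in the order requested by col_names.
        block = [self.data[name][self.cursor:end] for name in col_names]
        self.cursor = end
        return block

    def get_schema(self) -> List[Tuple[str, str]]:
        # Optional: (column name, column type) pairs; type names are illustrative.
        return [("a", "Int64"), ("b", "String")]


def drain(reader, col_names, batch_size):
    """Hypothetical stand-in for the engine: call read() until it returns []."""
    blocks = []
    while True:
        block = reader.read(col_names, batch_size)
        if not block:
            return blocks
        blocks.append(block)


reader = ColumnReader({"a": [1, 2, 3, 4, 5], "b": ["x", "y", "x", "y", "x"]})
blocks = drain(reader, ["b", "a"], batch_size=2)
print(blocks)
```

Each block holds columns in the requested order (`b` first, then `a`), and the cursor advances on every call, so a further `read` after the data is exhausted returns an empty list.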
See also: test_query_py.py.
Limitations
- Column types supported: pandas.Series, pyarrow.array, chdb.PyReader
- Data types supported: Int, UInt, Float, String, Date, DateTime, Decimal
- Python Object type will be converted to String
- Pandas DataFrame has the best performance; Arrow Table performs better than PyReader