Bulk Ingest¶
SQLSpec exposes native bulk-ingest fast paths through a small, adapter-agnostic
storage-bridge API. High-volume writes use each driver's database primitive --
COPY, LOAD DATA LOCAL INFILE, direct path load, BulkCopy, Arrow
ingest, load jobs, or mutations -- instead of generic row-by-row execution.
The API¶
Three methods cover the common shapes. They share return type
StorageBridgeJob (its telemetry dict reports rows_processed):
load_from_arrow(table, source, *, overwrite=False)-- load an Arrow table (or anything coercible to one) using the adapter's native ingest path.load_from_storage(table, source, *, file_format, overwrite=False)-- load a staged artifact (a local path or cloud URI) into a table.load_from_records(table, records, *, columns=None, overwrite=False)-- load in-memory rows.recordsmay be mappings (columns derived from the keys) or positional sequences (columnsrequired). Records are normalized and routed through the adapter's nativeload_from_arrowpath, so every adapter that supports Arrow ingest supports records too.
# dict records -- columns inferred from keys
driver.load_from_records("orders", [{"id": 1, "total": 9.99}, {"id": 2, "total": 4.50}])
# positional records -- columns required
driver.load_from_records("orders", [(3, 1.0), (4, 2.0)], columns=["id", "total"])
Empty input, mismatched mapping keys, or a positional/column width mismatch
raise ImproperConfigurationError.
Capability matrix¶
Adapter |
Native ingest path |
Transactionality |
Gate / opt-in |
|---|---|---|---|
asyncpg |
|
Atomic; exact row counts |
Always on |
psycopg (sync/async) |
|
Atomic; exact row counts |
Always on |
psqlpy |
Binary |
Atomic |
Always on |
adbc |
|
Driver-dependent; FlightSQL falls back to per-row |
Always on |
duckdb |
|
Single connection transaction |
Always on |
sqlite / aiosqlite |
|
Atomic when the driver owns the transaction; rolls back on error |
Always on |
oracledb |
direct path load (Thin mode, default); |
Per |
|
MySQL family (pymysql, asyncmy, aiomysql, mysql-connector) |
|
Server-managed |
|
bigquery |
Parquet load job (default); Arrow Storage Write API (opt-in) |
All-or-nothing load job / PENDING write stream |
|
spanner |
|
In-transaction (default); independently committed groups (Batch Write) |
Always on; |
mssql-python |
|
Driver-managed |
Always on |
arrow_odbc |
|
Driver-managed |
Always on |
Security and opt-in paths¶
Some fast paths are opt-in because they read local files or change semantics:
MySQL ``LOAD DATA LOCAL INFILE`` requires both the adapter feature
enable_local_infile_bulk_loadand the connection's local-infile setting (local_infile=Truefor pymysql/aiomysql/asyncmy,allow_local_infile=Truefor mysql-connector). Enabling the feature without the connection gate raisesImproperConfigurationErrorat config construction. The MySQL server must also havelocal_infileenabled. mysql-connector additionally honorsallow_local_infile_in_path-- the staged temp file must live under that directory when it is set.Oracle direct path load is the default bulk-ingest transport in Thin mode. Set
enable_direct_path_load=Falseto forceexecutemany. Connections that do not expose the Direct Path Load API, including Thick-mode connections, silently fall back toexecutemany.BigQuery Storage Write API (
enable_storage_write_api) streams Arrow rows forload_from_arrowappends and falls back to the Parquet load job when the Storage client is unavailable;overwrite=Truealways uses a ParquetWRITE_TRUNCATEload job.Spanner Batch Write API (
enable_batch_write_api) routesload_from_arrowthroughDatabase.mutation_groups().batch_write()for high-throughput, independently committedinsert_or_updategroups instead of a single in-transaction flush. The upsert semantics keep each group idempotent on replay.
Examples¶
MySQL LOAD DATA LOCAL INFILE:
from sqlspec.adapters.pymysql import PyMysqlConfig
config = PyMysqlConfig(
connection_config={"host": "localhost", "local_infile": True},
driver_features={"enable_local_infile_bulk_load": True},
)
with config.provide_session() as driver:
driver.load_from_arrow("orders", arrow_table)
Oracle per-call batch error and array-DML row-count reporting:
statement_config = driver.statement_config.replace(
execution_args={"oracle_batch_errors": True, "oracle_array_dml_row_counts": True}
)
result = driver.execute_many(
"INSERT INTO orders (id, total) VALUES (:1, :2)", rows, statement_config=statement_config
)
failures = result.metadata["oracle_batch_errors"] # list of {offset, code, message}
row_counts = result.metadata["oracle_dml_row_counts"] # per-statement affected rows
Note
Spanner load_from_arrow uses insert_or_update mutations, so re-running
the same rows is an idempotent upsert rather than a primary-key collision.