Core Concepts
Introduction
Hotdata gives you on-demand OLAP databases that you can create, populate with parquet, query, and destroy, all through a single API. There is no infrastructure to provision and no schema to migrate.
The primary use case: an agent or application creates a database for a specific request, loads data into it, runs analytical queries (including vector search, full-text search, and geospatial queries), then discards the database when done. The whole lifecycle can happen in seconds.
You can also connect Hotdata to your existing databases and warehouses — Postgres, Snowflake, BigQuery, and others — and query them alongside your managed databases in a single SQL statement. No ETL, no replication.
Two ways to get data in
Managed databases (on demand)
The fastest path. Create a database via the API, declare tables, upload parquet files, and start querying. Everything is provisioned on demand — there is no server to manage.
# Create a database and load data in under 30 seconds
hotdata databases create \
  --name mydb \
  --table orders

hotdata databases load mydb.orders \
  --url https://example.com/orders.parquet

hotdata query \
  "SELECT COUNT(*) FROM default.public.orders" \
  --database mydb
Managed databases expire automatically (default 24 hours) — or you can delete them immediately when done. This makes them ideal for agent workflows, per-request analytics, and exploratory work where you need real compute on temporary data.
See CLI Reference — Databases and API Reference — Databases.
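The expiry behavior described above can be sketched in a few lines. This is only an illustration of the arithmetic: the 24-hour default comes from the text, while the helper names and the ISO 8601 encoding of the timestamp are assumptions, not the documented API.

```python
from datetime import datetime, timedelta, timezone

# Assumption: 24 hours is the default stated in the text; the ISO 8601
# string format is a guess made for illustration only.
DEFAULT_TTL = timedelta(hours=24)


def compute_expires_at(created_at, ttl=DEFAULT_TTL):
    """Return the expiry instant as an ISO 8601 string in UTC."""
    if created_at.tzinfo is None:
        raise ValueError("created_at must be timezone-aware")
    return (created_at + ttl).astimezone(timezone.utc).isoformat()


def is_expired(expires_at, now):
    """True once `now` has reached or passed the expiry instant."""
    return now >= datetime.fromisoformat(expires_at)


created = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
expires = compute_expires_at(created)
print(expires)
```

An agent that needs a shorter lifetime could pass a smaller `ttl`, for example `timedelta(minutes=15)`, and delete the database explicitly if it finishes early.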
Connections (existing sources)
Connect Hotdata to your existing databases and warehouses. Hotdata discovers the schema, caches it locally, and routes queries through the connection at execution time. Data is never moved or replicated unless you explicitly create a dataset.
Supported sources: Postgres, MySQL, Snowflake, BigQuery, DuckDB, and more. See Data Sources.
Once connected, tables are queryable as <connection>.<schema>.<table> in standard SQL. You can join across connections and managed databases in a single query.
How it fits together
╔═ hotdata ══════════════════════════════════╗
╔══════════╗ ║ ║░
║ ║░ API ║ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓ ║░
║ client ║░──────▶║ ┃ workspace ┃ ║░
║ ║░ ║ ┃ ┏━━━━━━━━━━━━━━━┓ ┏━━━━━━━━━━━━━━┓ ┃ ║░
╚══════════╝░ ║ ┃ ┃ managed db ┃ ┃ connection ┃ ┃ ║░
░░░░░░░░░░░░ ║ ┃ ┃ (on demand) ┃ ┃ (external) ┃ ┃ ║░
║ ┃ ┃ - parquet ┃ ┃ - postgres ┃ ┃ ║░
║ ┃ ┃ - ephemeral ┃ ┃ - snowflake ┃ ┃ ║░
║ ┃ ┃ - any SQL ┃ ┃ - bigquery ┃ ┃ ║░
║ ┃ ┗━━━━━━━━━━━━━━━┛ ┗━━━━━━━━━━━━━━┛ ┃ ║░
║ ┃ ↘ ↙ ┃ ║░
║ ┃ hybrid query engine ┃ ║░
║ ┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛ ║░
╚════════════════════════════════════════════╝░
░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
Organization
A boundary for users and workspaces. All activity is scoped within an organization, including access control, resource limits, and usage tracking. Organizations isolate teams and environments while sharing a common governance layer. See API Reference — Workspaces.
Workspace
An isolated execution environment provisioned on demand. Each workspace runs independently with its own compute, storage, and security boundary. Workspaces persist until explicitly deleted, allowing agents and applications to create and use them without affecting other workloads. See API Reference — Workspaces and CLI Reference — Workspaces.
Managed Databases
Hotdata-owned OLAP databases you create via the API, populate with parquet, and query immediately. Unlike connections, managed databases have no external dependency — you define the schema, load the data, and Hotdata handles the rest.
Key properties:
- Created on demand — a single API call or CLI command is all you need
- Loaded from parquet — upload a file or point to a URL; no schema migration required
- Any SQL — analytical queries, window functions, vector search, full-text, geospatial, all in one engine
- Ephemeral or persistent: set expires_at for automatic cleanup, or delete explicitly
Tables inside a managed database are addressed as default.<schema>.<table> in SQL. Pass the database ID via --database (CLI) or X-Database-Id (API) to scope a query.
See CLI Reference — Databases and API Reference — Databases.
Connection
A configuration for an external data source (Postgres, Snowflake, BigQuery, SaaS APIs, and more). Connections are created within a workspace and used to discover and query tables from existing systems. Each connection has a unique ID and operates through controlled, read-only access. Schema discovery and metadata are cached for faster access.
Tables from a connection are addressed as <connection>.<schema>.<table> in SQL — no data movement required.
See Data Sources, API Reference — Connections, and CLI Reference — Connections.
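Both managed databases and connections use three-part table names, so a small helper can keep generated SQL consistent. The sketch below is a hypothetical client-side utility, not part of Hotdata; the function names, the connection name `pg`, and the rule that each part must be a simple identifier are all assumptions.

```python
import re

# Assumption: each name part is a plain identifier. Real identifier rules
# (quoting, case folding) may differ.
_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def qualified_name(catalog, schema, table):
    """Build a <catalog>.<schema>.<table> reference, rejecting odd input."""
    for part in (catalog, schema, table):
        if not _IDENT.match(part):
            raise ValueError(f"not a simple identifier: {part!r}")
    return f"{catalog}.{schema}.{table}"


def managed_table(schema, table):
    """Managed-database tables use the fixed catalog name 'default'."""
    return qualified_name("default", schema, table)


# 'pg' is a hypothetical connection name used only for this example.
sql = (
    f"SELECT COUNT(*) FROM {managed_table('public', 'orders')} AS o "
    f"JOIN {qualified_name('pg', 'public', 'customers')} AS c "
    "ON c.id = o.customer_id"
)
print(sql)
```

Validating the parts before interpolation matters mostly for agent-generated SQL, where a table name may come from untrusted input.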
Datasets
Materialized results of queries or uploaded files. Datasets represent a snapshot of data at a point in time and are stored locally for reuse. They can be queried, joined, and transformed without re-accessing the original source. This reduces latency and avoids repeated scans of upstream systems. See API Reference — Datasets and CLI Reference — Datasets.
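The snapshot behavior can be illustrated with any SQL engine. The sketch below uses SQLite as a stand-in for both the upstream source and local dataset storage; the table names are invented, and nothing here reflects how Hotdata stores datasets internally.

```python
import sqlite3

# SQLite stands in for the upstream source and for local dataset storage.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_orders (id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO source_orders VALUES (?, ?)", [(1, 10.0), (2, 25.0)]
)

# Materialize a query result as a point-in-time snapshot (the "dataset").
conn.execute(
    "CREATE TABLE orders_snapshot AS "
    "SELECT id, amount FROM source_orders WHERE amount > 5"
)

# The source keeps changing after the snapshot is taken.
conn.execute("INSERT INTO source_orders VALUES (3, 40.0)")

source_count = conn.execute("SELECT COUNT(*) FROM source_orders").fetchone()[0]
snapshot_count = conn.execute("SELECT COUNT(*) FROM orders_snapshot").fetchone()[0]
print(source_count, snapshot_count)
```

Queries against `orders_snapshot` never touch `source_orders` again, which is the property that avoids repeated upstream scans.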
Sandboxes
A workspace-scoped context for exploratory CLI work. While a sandbox is active, your activity is tied to it; datasets created in a sandbox are removed when the sandbox ends, so keep anything you need long-term outside a sandbox. Sandboxes can include markdown notes for context. Use the CLI hotdata sandbox commands (for example sandbox new, sandbox set, sandbox run); see CLI Reference — Sandboxes.
Secrets
Credentials used by connections (passwords, tokens, API keys). Secrets are securely stored and scoped to a workspace. Values are never returned by read APIs and are injected only at execution time. This prevents leakage while allowing dynamic access to external systems. See API Reference — Secrets.
Saved Queries
Reusable query definitions that can be executed multiple times. They capture logic without storing results, making them useful for standard transformations, recurring analysis, and agent workflows. Saved queries can be versioned and combined with datasets to build repeatable patterns. See API Reference — Saved Queries.
Schedules
Controls how and when data is refreshed. Data sources are refreshed daily by default, with support for custom schedules (cron-based or usage-driven). Frequently accessed datasets can be updated more aggressively, while less active data can be refreshed lazily to optimize cost and performance. See API Reference — Refresh.
Persisted Results
Every query result is automatically stored in local storage. These results can be re-queried instantly, filtered, or joined without accessing the original source. This enables iterative workflows where each step builds on previous results. Persisted results also support time-based comparisons and replay. See API Reference — Results and CLI Reference — Results.
Vector Search
Uses usearch for approximate nearest neighbor search. Optimized for AVX-512 SIMD execution, enabling high-throughput similarity search on CPUs without GPU dependency. Designed for real-time retrieval of embeddings (text, images, etc.), supporting large-scale inference workloads with low latency. See SQL Reference — Vector search, CLI Reference — Search, and API Reference — Indexes.
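To make the idea of nearest neighbor search concrete, here is an exact, brute-force version of the problem that an approximate index answers faster. It does not use usearch and says nothing about Hotdata's SQL syntax; the function names and sample embeddings are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, vectors, k):
    """Exact search: score every vector, keep the k most similar keys."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [key for _, key in scored[:k]]


# Tiny hypothetical embeddings; real ones have hundreds of dimensions.
embeddings = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.9, 0.1],
    "doc_c": [0.0, 1.0],
}
nearest = top_k([1.0, 0.05], embeddings, k=2)
print(nearest)
```

The brute-force scan costs one comparison per stored vector. An approximate index trades a small amount of recall for far fewer comparisons, which is what makes low-latency search over large collections practical.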
Full-Text Search
Built-in ranked full-text retrieval. Indexed and SIMD-optimized for fast evaluation of term relevance. Supports phrase matching, token weighting, and ranking across large text corpora. Eliminates the need for external search systems while maintaining strong relevance and performance. See SQL Reference — Full-text search, CLI Reference — Search, and API Reference — Indexes.
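Ranked retrieval means each matching document gets a relevance score and results are ordered by it. The sketch below uses a simple TF-IDF score as a stand-in; Hotdata's actual ranking function is not described here, and the tokenizer, function names, and sample documents are assumptions.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer; real analyzers do much more."""
    return text.lower().split()


def rank(query, docs):
    """Return document ids ordered by a basic TF-IDF relevance score."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if tf[term]:
                idf = math.log(1 + n_docs / df[term])
                score += (tf[term] / len(tokens)) * idf
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


docs = {
    "a": "fast columnar analytics engine",
    "b": "columnar storage for analytics and more analytics",
    "c": "geospatial joins",
}
ranking = rank("analytics engine", docs)
print(ranking)
```

Document `a` ranks first because it matches both query terms, including the rarer term `engine`; document `c` matches nothing and is left out.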
Geospatial Queries
Native support for spatial data types and operations such as distance calculations, containment checks, intersections, and bounding boxes. Enables location-aware filtering and joins within the same execution engine. Works alongside other query types, allowing spatial constraints to be combined with analytical, vector, and text queries. See SQL Reference — Geospatial functions.
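Two of the operations named above, distance calculation and bounding-box containment, are easy to sketch. The code below uses the haversine formula on a spherical Earth, which is a common approximation and not necessarily what Hotdata's geospatial functions compute; the function names are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def in_bbox(lat, lon, min_lat, min_lon, max_lat, max_lon):
    """Containment check; does not handle boxes crossing the antimeridian."""
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


paris = (48.8566, 2.3522)
london = (51.5074, -0.1278)
distance = haversine_km(*paris, *london)
print(round(distance, 1))
```

A bounding-box check is cheap, so it is often applied first to discard distant rows before the more expensive distance calculation runs.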
OLAP (Analytical Queries)
Supports fast aggregations, filtering, and group-by operations over large datasets. Execution is vectorized and columnar, enabling efficient use of CPU and memory. Designed for analytical workloads where latency and throughput both matter. See SQL Reference — Aggregate functions, SQL Reference — Window functions, and API Reference — Query.
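The columnar layout mentioned above stores each column contiguously, so an aggregation reads only the columns it needs. The sketch below shows that layout with a filtered group-by sum. It is plain Python and not vectorized, so it illustrates the data shape, not the performance; all names and values are made up.

```python
from collections import defaultdict

# Columnar layout: one list per column instead of one record per row.
columns = {
    "region": ["eu", "us", "eu", "us", "apac"],
    "amount": [10.0, 20.0, 30.0, 5.0, 7.5],
}


def group_sum(cols, key, value, predicate=lambda v: True):
    """SUM(value) GROUP BY key, touching only the two columns involved."""
    totals = defaultdict(float)
    for k, v in zip(cols[key], cols[value]):
        if predicate(v):
            totals[k] += v
    return dict(totals)


all_totals = group_sum(columns, "region", "amount")
large_only = group_sum(columns, "region", "amount", lambda v: v >= 10)
print(all_totals, large_only)
```

The second call corresponds roughly to `SELECT region, SUM(amount) ... WHERE amount >= 10 GROUP BY region`.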
Hybrid Queries
Combines multiple query types in a single execution plan. For example: full-text search → vector similarity → relational filtering → geospatial constraints → final row retrieval. This avoids coordinating multiple systems and keeps execution within a single low-latency path. See API Reference — Query and SQL Reference — Overview.
Joining Across Results
Query results can be treated as datasets and queried again. This allows joining across:
- previous query outputs
- different data sources
- time-based snapshots
This model supports iterative computation, where each step refines the result without recomputing from the original data. It enables complex workflows to be expressed as a sequence of lightweight, composable queries. See API Reference — Query, API Reference — Datasets, and SQL Reference — SELECT syntax.
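The iterative pattern above can be sketched as a chain of small queries, each reading the previous step's stored output. SQLite stands in for the engine here, and the `result_*` table names are invented for the example; how Hotdata names and addresses stored results is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO orders VALUES (1, 50.0), (1, 70.0), (2, 15.0), (3, 200.0);
    CREATE TABLE customers (id INTEGER, name TEXT);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Edsger');
""")

# Step 1: aggregate once and keep the result.
conn.execute("""
    CREATE TABLE result_totals AS
    SELECT customer_id, SUM(amount) AS total
    FROM orders GROUP BY customer_id
""")

# Step 2: refine the previous result without rescanning `orders`.
conn.execute("""
    CREATE TABLE result_big AS
    SELECT customer_id, total FROM result_totals WHERE total >= 100
""")

# Step 3: join the refined result with a different source.
rows = conn.execute("""
    SELECT c.name, r.total
    FROM result_big AS r
    JOIN customers AS c ON c.id = r.customer_id
    ORDER BY r.total DESC
""").fetchall()
print(rows)
```

Each step is cheap because it reads a small intermediate table, and any step can be rerun or swapped without repeating the ones before it.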