SQL Reference

Hotdata executes SQL with PostgreSQL-compatible syntax—if you're familiar with Postgres, most queries and functions will work as expected. This reference covers the syntax, functions, and capabilities available when you run queries via the API or MCP.

Read-only: Hotdata operates in read-only mode. SELECT and read-only information schema commands are supported; INSERT, UPDATE, DELETE, CREATE, and DROP are not available.

Overview

Hotdata supports standard SQL and extensions for analytics, time-series, and complex types. Queries run against tables from your connected data sources (databases, warehouses, SaaS) and managed databases you've loaded.

SELECT syntax

The core query form is:

SELECT ... FROM ... [ WHERE ... ] [ GROUP BY ... ] [ HAVING ... ] [ ORDER BY ... ] [ LIMIT ... ]

WITH (CTEs)

Use WITH for common table expressions:

WITH top_customers AS (
  SELECT customer_id, sum(amount) AS total
  FROM orders
  GROUP BY customer_id
  ORDER BY total DESC
  LIMIT 10
)
SELECT * FROM top_customers;

FROM clause

Reference tables by schema and name. Tables come from your workspace connections and managed databases. Use fully qualified names when needed: catalog.schema.table or schema.table.

WHERE clause

Filter rows with conditions. Supports AND, OR, NOT, comparison operators (=, !=, <, >, <=, >=), IN, BETWEEN, LIKE, IS NULL, IS NOT NULL, and subqueries with EXISTS, IN, ANY, ALL.

JOIN clause

Join tables with INNER, LEFT, RIGHT, FULL, CROSS joins. Use ON for join conditions:

SELECT o.id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.created_at >= '2025-01-01';

GROUP BY and HAVING

Group rows and filter groups:

SELECT department, count(*) AS headcount, avg(salary) AS avg_salary
FROM employees
GROUP BY department
HAVING count(*) > 5
ORDER BY avg_salary DESC;

ORDER BY and LIMIT

Sort and limit results:

SELECT * FROM events
ORDER BY created_at DESC
LIMIT 100;

ORDER BY supports ASC, DESC, and multiple columns. LIMIT can optionally include OFFSET for pagination.

UNION

Combine result sets with UNION (distinct) or UNION ALL (keeps duplicates):

SELECT id, name FROM users
UNION ALL
SELECT id, name FROM archived_users;

Subqueries

Subqueries are supported in SELECT, FROM, WHERE, and HAVING:

SELECT *
FROM orders
WHERE customer_id IN (SELECT id FROM customers WHERE tier = 'premium');

Data types

Standard SQL types are supported:

Category	Types
Numeric	`TINYINT`, `SMALLINT`, `INT`, `INTEGER`, `BIGINT`, `DECIMAL`, `FLOAT`, `DOUBLE`, `REAL`
Character	`CHAR`, `VARCHAR`, `STRING`
Date/Time	`DATE`, `TIME`, `TIMESTAMP`, `INTERVAL`
Boolean	`BOOLEAN`
Binary	`BINARY`, `VARBINARY`
Complex	`ARRAY`, `STRUCT`, `MAP`

See the SQL data types reference for full details and type coercion rules.

Operators

Comparison

=, !=, <>, <, >, <=, >=, IS NULL, IS NOT NULL, BETWEEN ... AND ..., LIKE, IN (...)

Logical

AND, OR, NOT

Numerical

+, -, *, /, % (modulo)

Bitwise

&, |, ^ (bitwise XOR), ~ (bitwise NOT), <<, >>

String

|| for concatenation

Scalar functions

Scalar functions operate on single values. Categories and examples:

Math

abs, ceil, floor, round, trunc, sqrt, pow, power, exp, ln, log, log10, log2, sin, cos, tan, asin, acos, atan, atan2, degrees, radians, signum, random

SELECT round(price * 1.1, 2) AS price_with_tax FROM products;

String

length, lower, upper, trim, ltrim, rtrim, substr, substring, concat, concat_ws, split_part, replace, regexp_replace, position, strpos, left, right, repeat, reverse, lpad, rpad

SELECT lower(email), trim(name) FROM users;

Conditional

coalesce, nullif, case when ... then ... else ... end

SELECT coalesce(nickname, display_name, 'Anonymous') AS name FROM users;

Date and time

date_trunc, date_part, extract, current_date, current_time, current_timestamp, to_timestamp, to_date, make_date, make_timestamp, age, interval

SELECT date_trunc('month', created_at) AS month, count(*) FROM orders GROUP BY 1;

Array

array_length, array_contains, element_at, array_prepend, array_append, array_concat, array_slice, unnest

Struct and map

struct, struct_field, map, map_keys, map_values, element_at (for maps)

Hashing

md5, sha224, sha256, sha384, sha512, crc32

See the scalar functions reference for the complete list with signatures.

Aggregate functions

Aggregate functions compute a single value from a set of rows.

General

count, count(*), sum, avg, min, max, first_value, last_value, string_agg, array_agg, bool_and, bool_or, bit_and, bit_or, bit_xor

SELECT
  count(*) AS total,
  sum(revenue) AS total_revenue,
  avg(revenue) AS avg_order,
  max(created_at) AS last_order
FROM orders;

Statistical

stddev, stddev_pop, stddev_samp, variance, var_pop, var_samp, covar_pop, covar_samp, corr, median, percentile_cont, approx_percentile_cont

Approximate

approx_distinct (HyperLogLog), approx_median, approx_percentile_cont (t-digest)

FILTER clause

Restrict which rows contribute to an aggregate:

SELECT
  sum(amount) FILTER (WHERE status = 'completed') AS completed_revenue,
  count(*) FILTER (WHERE status = 'cancelled') AS cancelled_count
FROM orders;

Ordered-set aggregates

percentile_cont and approx_percentile_cont support WITHIN GROUP (ORDER BY ...):

SELECT percentile_cont(0.5) WITHIN GROUP (ORDER BY latency) AS median_latency FROM metrics;

See the aggregate functions reference.

Window functions

Window functions compute results over a set of rows related to the current row, defined by OVER (PARTITION BY ... ORDER BY ...).

Ranking

row_number(), rank(), dense_rank(), percent_rank(), ntile(n)

SELECT id, name, revenue,
  rank() OVER (ORDER BY revenue DESC) AS rev_rank
FROM sales_rep;

Analytical

first_value(...), last_value(...), lag(...), lead(...), nth_value(...)

SELECT date, symbol, price,
  lag(price) OVER (PARTITION BY symbol ORDER BY date) AS prev_price
FROM stock_prices;

Aggregate as window

Any aggregate (sum, avg, count, etc.) can be used as a window function:

SELECT id, amount,
  sum(amount) OVER (ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM transactions;

See the window functions reference.

Special functions

Expansion

unnest(array) — expand an array into rows. Useful for array columns:

SELECT id, unnest(tags) AS tag FROM articles;

Geospatial functions

Hotdata includes spatial extensions modeled after the PostGIS API. Use them to query geometry and geography data.

Constructing geometries

Create points and geometries from coordinates or text:

-- Create a 2D point
SELECT ST_MakePoint(-122.4194, 37.7749) AS sf_point;

-- Create from Well-Known Text (WKT)
SELECT ST_GeomFromText('POINT(-122.4194 37.7749)') AS point;
SELECT ST_GeomFromText('POLYGON((0 0, 1 0, 1 1, 0 1, 0 0))') AS polygon;

-- Create from GeoHash
SELECT ST_PointFromGeoHash('9q8yy') AS point;

Geometry accessors

Extract properties and coordinates:

SELECT
  ST_X(location) AS lon,
  ST_Y(location) AS lat,
  ST_Area(boundary) AS area_sq_units,
  ST_Length(path) AS path_length,
  ST_GeometryType(geom) AS geom_type
FROM locations;

Spatial relationships

Test how geometries relate to each other:

-- Find stores within 10 km of a point
SELECT id, name
FROM stores
WHERE ST_Distance(
  ST_GeomFromText('POINT(-122.4194 37.7749)'),
  location
) < 10000;

-- Points contained in a polygon (e.g. delivery zones)
SELECT order_id, ST_AsText(destination)
FROM orders
WHERE ST_Within(destination, delivery_zone);

-- Intersecting geometries (e.g. overlapping regions)
SELECT a.id, b.id
FROM regions a
JOIN regions b ON ST_Intersects(a.geom, b.geom) AND a.id < b.id;

Geometry processing

Compute derived geometries:

-- Convex hull of a set of points
SELECT ST_ConvexHull(ST_Collect(points)) AS hull
FROM clusters;

-- Simplify a polygon (Douglas-Peucker)
SELECT ST_Simplify(complex_geom, 0.001) AS simplified
FROM boundaries;

-- Centroid of a geometry
SELECT ST_Centroid(polygon) AS center FROM parcels;

See the PostGIS documentation for the full function reference; Hotdata's spatial support follows the PostGIS API.

Vector search

Hotdata provides vector similarity search with two approaches: semantic search on text columns using an embedding provider, and direct vector search on pre-computed embedding columns. When a vector index exists, queries are transparently rewritten into HNSW index calls—no special syntax needed.

Semantic search with vector_distance

vector_distance(column, 'search text') is the simplest way to perform semantic search. It takes a text column and a search string, automatically generates an embedding using the column's configured embedding provider, and computes the distance using the metric defined in the vector index.

SELECT id, title, vector_distance(description, 'machine learning frameworks') AS dist
FROM conn.public.articles
ORDER BY dist ASC
LIMIT 10;

Requirements:

A vector index on the column with an embedding provider configured
The column must be a text column (VARCHAR / TEXT)

Behind the scenes, vector_distance resolves the embedding provider from the vector index metadata, calls it to embed your search text, and rewrites the query to the appropriate distance function (e.g. cosine_distance) on the generated embedding column. You don't need to know which distance metric or embedding column is used—the index configuration handles it.

Distance functions

For direct control over vector search—when you already have embedding vectors or want to specify the exact distance metric—use the distance functions directly on vector columns. All are lower-is-closer; use ORDER BY ... ASC:

SQL function	Metric	Use case
`l2_distance(column, query)`	L2 (Euclidean)	General-purpose distance
`cosine_distance(column, query)`	Cosine	Angle between vectors (normalized embeddings)
`negative_dot_product(column, query)`	Inner product	Maximum inner product search

These functions accept a vector column and a literal query vector (ARRAY[...]):

SELECT id, title, cosine_distance(embedding, ARRAY[0.1, -0.2, 0.5, ...]) AS dist
FROM documents
ORDER BY dist ASC
LIMIT 10;

With a vector index: queries matching ORDER BY distance_fn(col, ARRAY[...]) ASC LIMIT k are transparently rewritten into an HNSW index call for fast approximate nearest-neighbor search.

Without a vector index: the distance functions still work via brute-force scan over all rows. This is useful for small tables or ad-hoc exploration, but much slower on large datasets.

A metric mismatch (e.g. cosine_distance on an L2-built index) falls back silently to brute-force. ORDER BY ... DESC is also never rewritten.

vector_search table function

vector_search is a table function for semantic search that returns full rows. Like vector_distance, it auto-embeds your search text using the configured embedding provider — but returns results as a table you can query directly.

vector_search('connection.schema.table', 'column', 'search text', k)

Parameter	Type	Description
table	string literal	Fully-qualified table name: `'connection.schema.table'`
column	string literal	Text column with a vector index and embedding provider
query	string literal	Search text to embed and search for
k	integer	Number of nearest neighbors to return

Returns all table columns plus a _distance column (Float32). Lower distance means more similar.

Requirements: A vector index on the column with an embedding provider configured.

-- Semantic search returning full rows
SELECT id, title, _distance
FROM vector_search('conn.public.articles', 'description', 'machine learning', 10)
ORDER BY _distance ASC;

-- With filtering and aggregation
SELECT category, COUNT(*) AS cnt, AVG(_distance) AS avg_dist
FROM vector_search('conn.public.articles', 'description', 'data science', 50)
WHERE category IN ('ml', 'statistics')
GROUP BY category
ORDER BY avg_dist;

vector_search_vector table function

vector_search_vector is the direct ANN search table function — you provide the query vector yourself. It requires a vector index but no embedding provider.

vector_search_vector('connection.schema.table', 'column', ARRAY[...], k)

Parameter	Type	Description
table	string literal	Fully-qualified table name: `'connection.schema.table'`
column	string literal	Vector column with a HNSW index
query	`ARRAY[...]` literal	Query vector
k	integer	Number of nearest neighbors to return

Returns all table columns plus a _distance column (Float32).

Requirements: A vector index on the column.

SELECT id, title, _distance
FROM vector_search_vector('conn.public.documents', 'embedding', ARRAY[0.1, -0.2, 0.5, ...], 10)
ORDER BY _distance ASC;

Choosing between vector search methods

Method	Input	Requires embedding provider	Requires index	Returns
`vector_distance(col, 'text')`	text	yes	yes	scalar (`Float32`) — use in SELECT/ORDER BY
`vector_search('table', 'col', 'text', k)`	text	yes	yes	table (all columns + `_distance`)
`vector_search_vector('table', 'col', ARRAY, k)`	vector	no	yes	table (all columns + `_distance`)
`l2_distance` / `cosine_distance` / `negative_dot_product`	vector	no	optional (HNSW if present, brute-force otherwise)	scalar (`Float32`) — use in SELECT/ORDER BY

Use vector_search or vector_distance when you have text and want automatic embedding. Use vector_search_vector or the distance functions when you already have vectors.

WHERE clause filtering

Scalar WHERE conditions are absorbed and applied adaptively at execution time when a vector index is present:

Selectivity	Strategy	Behavior
> 5%	In-graph filtering	HNSW filtered search skips non-passing nodes; returns exactly k results
≤ 5%	Brute-force subset	HNSW bypassed; exact distances over the valid subset; heap-select top-k

The 5% threshold is configurable per table.

SELECT id, title, cosine_distance(embedding, ARRAY[...]) AS dist
FROM documents
WHERE category = 'technical'
ORDER BY dist ASC
LIMIT 10;

Tuning and limitations

expansion_search (ef_search): Beam width during HNSW traversal. Default 64; range 32–200. Higher values improve recall at the cost of speed.
brute_force_selectivity_threshold: Fraction below which brute-force is used instead of HNSW. Default 0.05 (5%).
Literal query vectors: The optimizer only rewrites distance functions when the query vector is a compile-time literal (ARRAY[...]). Use vector_distance for text-based search.
SELECT * with distance functions: When the table has generated embedding columns, SELECT * expands to include them. Since the HNSW index doesn't serve embedding columns, the query falls back to brute-force. Select specific columns to use the index, or use vector_search / vector_search_vector which always use HNSW.
Stacked filters: Only one Filter → TableScan layer is absorbed.

Full-text search

Hotdata provides BM25 full-text search via the bm25_search table function. It performs ranked text retrieval against a column with a BM25 index.

bm25_search

bm25_search('connection.schema.table', 'column', 'query text' [, limit])

Parameter	Type	Description
table	string literal	Fully-qualified table name: `'connection.schema.table'`
column	string literal	Text column to search
query	string literal	Search query text
limit	integer (optional)	Maximum results to return. Default: `10`

Returns all columns from the base table plus a score column (Float32) containing the BM25 relevance score. Higher scores indicate stronger relevance.

Requirements: A BM25 index on the target column.

-- Search articles for "machine learning", return top 20 by relevance
SELECT id, title, body, score
FROM bm25_search('conn.public.articles', 'body', 'machine learning', 20)
ORDER BY score DESC;

-- Combine full-text search with additional filters
SELECT id, title, score
FROM bm25_search('conn.public.docs', 'content', 'kubernetes deployment', 100)
WHERE category = 'infrastructure'
ORDER BY score DESC
LIMIT 10;

Note: bm25_search returns its top-N results first, then WHERE filters are applied. When combining with filters, pass a higher limit to bm25_search than the final number of rows you need.

Indexes

Hotdata supports three index types to accelerate different query patterns. Indexes are created via the API or CLI—not SQL. For connection-backed tables the cache must be populated first (run a sync or query).

API (connection or managed-database tables): POST /v1/connections/{connection_id}/tables/{schema}/{table}/indexes

CLI (managed-database tables): hotdata indexes create --catalog <alias> --table <table> --column <cols> --type <type>

Indexes are stored in the catalog and used automatically when the query planner finds a beneficial match. Use GET .../indexes to list indexes and DELETE .../indexes/{index_name} to drop one.

Sorted indexes

Pre-sorted Parquet copies of cached data, ordered by the columns you specify. Accelerate range queries, ordering, and point lookups.

{
  "index_name": "idx_created_at",
  "columns": ["created_at"],
  "index_type": "sorted"
}

A sorted index is selected when:

Filter match — a WHERE clause filters on the index's leading sort column. Sorted Parquet enables efficient row-group pruning for range and equality predicates.
Sort elimination — an ORDER BY on the index columns is satisfied without an explicit sort step.

-- Index on created_at: row-group pruning for range filter
SELECT * FROM events
WHERE created_at >= '2025-01-01' AND created_at < '2025-02-01';

-- Index on sku: sort elimination
SELECT * FROM products ORDER BY sku LIMIT 100;

Vector indexes

HNSW indexes for approximate nearest-neighbor search on vector columns. Require exactly one column.

On a pre-computed embedding column (float array):

{
  "index_name": "docs_embedding_idx",
  "columns": ["embedding"],
  "index_type": "vector",
  "metric": "cosine"
}

On a text column with auto-embedding: when the source column is text, Hotdata automatically generates embeddings using an embedding provider. This enables vector_distance() for semantic search.

If you omit embedding_provider_id, the system embedding provider is used automatically. If no system provider is configured, you must pass one explicitly. The metric also defaults to the provider's configured metric when not specified.

{
  "index_name": "docs_description_idx",
  "columns": ["description"],
  "index_type": "vector"
}

Or with explicit provider and metric:

{
  "index_name": "docs_description_idx",
  "columns": ["description"],
  "index_type": "vector",
  "metric": "cosine",
  "embedding_provider_id": "embp_abc123",
  "dimensions": 1536
}

The embedding provider generates a new column (default: {column}_embedding) containing the vector embeddings. Configure embedding providers via the Embedding Providers API.

Field	Required	Description
`metric`	no	Distance metric: `l2`, `cosine`, or `dot`. Defaults to `l2` for float array columns, or the provider's metric for text columns with auto-embedding
`embedding_provider_id`	no	Embedding provider ID. Defaults to the system provider for text columns
`output_column`	no	Name for the generated embedding column. Default: `{column}_embedding`
`dimensions`	no	Embedding dimensions (model-dependent)

CLI example:

hotdata indexes create --catalog mydb --table documents \
  --column embedding --type vector --metric cosine

BM25 indexes

Full-text search indexes using BM25 scoring. Required for bm25_search().

{
  "index_name": "articles_body_idx",
  "columns": ["body"],
  "index_type": "bm25"
}

CLI example:

hotdata indexes create --catalog mydb --table articles \
  --column body --type bm25

Point lookups

Queries that fetch a single row by equality (WHERE col = value with LIMIT 1) are classified as point lookups. Parquet bloom filter reads skip row groups that cannot contain the value, reducing I/O.

SELECT * FROM users WHERE id = 12345;

Index lifecycle

Indexes are invalidated when the underlying table is purged or resynced. After a data refresh, indexes are automatically rebuilt from the new cached data. If a rebuild fails, drop and recreate the index via the API.

Information schema

Read-only catalog introspection:

SHOW TABLES — list tables
SHOW COLUMNS FROM table — list columns and types
SHOW FUNCTIONS — list available functions
DESCRIBE table — table schema

Hotdata surfaces tables from your connections and managed databases; use the API or MCP to list tables and schemas for your workspace.

EXPLAIN

Use EXPLAIN to inspect the query plan:

EXPLAIN SELECT * FROM orders WHERE created_at > '2025-01-01';

EXPLAIN ANALYZE runs the query and returns execution metrics (timing, rows, etc.).

Prepared statements

Prepared statements with positional parameters ($1, $2, ...) are supported. Use these for parameterized queries to improve reuse and security. See the prepared statements documentation for syntax.