Complex SQL to Filter Through 20 Million Rows of Option Chain Snapshot Data
- Author: Teddy Xinyuan Chen
In my DE for Market Research post I talked about how I fetch and store option chain snapshot data (last quotes and trades, plus calcs) on an intraday basis. That started on 3/14, and I've now collected 20 million rows of data.
I've been playing with the realtime data by plotting surfaces, or slices of surfaces, of different metrics for different tickers:
Full screen (5 plots in total): https://gg.teddysc.me/?g=12d79db1018fa41c8f57b290a3b0ccb3&a&c=2
The time of the data and the spot price at that time are shown in the subtitle.
The vol surfaces are bumpier than they actually are, because the calcs data provider does not normalize time based on whether it's market hours, an overnight session on a weekday, or a weekend.
Filtering based on Delta Targets
As you can see in the gallery above, I prefer to use delta as one of the axes instead of strike price; this way I get a view of the relationship between fixed-delta vols across different tenors. Selecting data by delta is also a great way to normalize data, and it works better than using strike price or the ratio of strike price to underlying price.
I love using delta as an axis, but it also makes the code more complicated, because we need to find the rows whose delta is closest to each target within each group:
DEFAULT_DELTA_TARGETS_225 = {
    "5DP": -0.05,
    "22DP": -0.22,  # <- 22d near vanna peaks
    "50D": 0.50,
    "22DC": 0.22,
    "5DC": 0.05,
}
# pl = polars, refactored from pandas and sped things up many times
# i only want OTM options
elif cols == "delta":  # my app supports pivoting by delta, strike, or ratio of strike to underlying price
    logger.info("Pivoting by Delta using OTM options (Polars group/agg loop)...")
    assert delta_targets_ is not None
    # Add abs_delta column
    otm_df = otm_df.with_columns(pl.col("delta").abs().alias("abs_delta"))
    results = []
    # Group by time axis in Polars
    grouped = otm_df.group_by(
        time_axis_col_name, maintain_order=True
    )  # Maintain order for consistency
    for time_point_tuple, group_df in grouped:
        time_point = time_point_tuple[0]  # group_by returns tuple key
        row_data = {time_axis_col_name: time_point}
        otm_puts = group_df.filter(pl.col("putCall") == "PUT")
        otm_calls = group_df.filter(pl.col("putCall") == "CALL")
        for label, target_delta in delta_targets_.items():
            found_value = np.nan  # Default
            try:
                is_put_target = "P" in label and target_delta < 0
                is_call_target = "C" in label and target_delta > 0
                is_50d_target = (
                    abs(target_delta - 0.50) < 1e-6
                    or abs(target_delta + 0.50) < 1e-6
                )
                target_options_lf = (
                    pl.DataFrame().lazy()
                )  # Initialize empty LazyFrame
                diff_expr = None
                if is_put_target and not otm_puts.is_empty():
                    target_options_lf = otm_puts.lazy()
                    diff_expr = (pl.col("delta") - target_delta).abs()
                elif is_call_target and not otm_calls.is_empty():
                    target_options_lf = otm_calls.lazy()
                    diff_expr = (pl.col("delta") - target_delta).abs()
                elif is_50d_target and not group_df.is_empty():
                    target_options_lf = group_df.lazy()
                    diff_expr = (pl.col("abs_delta") - 0.50).abs()
                if diff_expr is not None:
                    closest_row_lf = target_options_lf.sort(diff_expr).limit(1)
                    closest_row_df = closest_row_lf.collect()
                    if not closest_row_df.is_empty():
                        found_value = closest_row_df[value_col_name][0]
            except Exception as e:
                # Log specific errors if needed, e.g., column not found
                logger.debug(
                    f"Error finding delta {label} at {time_point} in group: {e}"
                )
            row_data[label] = found_value
        results.append(row_data)
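The per-group loop above could probably be replaced by a single cross join plus a sort, letting Polars do everything in one pass. Here's a minimal sketch of that idea (not the code I actually run), reusing otm_df, time_axis_col_name, and value_col_name from above:
# Hypothetical vectorized alternative: cross join against a small targets frame,
# then keep the closest contract per (time, label) group. A sketch, not production code.
targets_df = pl.DataFrame(
    {
        "label": list(DEFAULT_DELTA_TARGETS_225.keys()),
        "target_delta": list(DEFAULT_DELTA_TARGETS_225.values()),
    }
)
closest = (
    otm_df.join(targets_df, how="cross")
    .filter(
        # puts match negative targets, calls match positive ones,
        # and the 50D target matches either side by absolute delta
        ((pl.col("target_delta") < 0) & (pl.col("putCall") == "PUT"))
        | ((pl.col("target_delta") > 0) & (pl.col("target_delta") != 0.50) & (pl.col("putCall") == "CALL"))
        | (pl.col("target_delta") == 0.50)
    )
    .with_columns(
        pl.when(pl.col("target_delta") == 0.50)
        .then((pl.col("delta").abs() - 0.50).abs())
        .otherwise((pl.col("delta") - pl.col("target_delta")).abs())
        .alias("delta_diff")
    )
    .sort("delta_diff")
    .group_by([time_axis_col_name, "label"], maintain_order=True)
    .first()
    .select([time_axis_col_name, "label", value_col_name])
)
# `closest` can then be pivoted so each label becomes a column, mirroring row_data above.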
To do the same thing in QuestDB, this is what I use:
-- chaining 3 CTEs to find rows with closest delta to targets
-- timestamp is snapshot time
-- dte, settlementType (AM / PM) identifies a tenor within a snapshot
WITH TargetDeltas AS (
SELECT -0.05 AS target_delta, '-5DP' AS delta_label UNION ALL
SELECT -0.22 AS target_delta, '-22DP' AS delta_label UNION ALL
SELECT 0.50 AS target_delta, '50D' AS delta_label UNION ALL
SELECT 0.22 AS target_delta, '22DC' AS delta_label UNION ALL
SELECT 0.05 AS target_delta, '5DC' AS delta_label
),
OptionDeltaDiffs AS (
SELECT
o.timestamp,
o.putCall,
o.symbol,
o.description,
o.strikePrice,
o.delta,
o.volatility,
o.dte,
o.settlementType,
m.underlyingPrice AS spot_price,
td.target_delta,
td.delta_label,
abs(o.delta - td.target_delta) AS delta_diff
FROM
teddysc_me_chain_SPX AS o
JOIN
teddysc_me_chain_metadata AS m ON o.timestamp = m.timestamp AND m.symbol = '$SPX'
CROSS JOIN
TargetDeltas td
WHERE
o.delta IS NOT NULL
),
RankedDeltas AS (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY timestamp, dte, settlementType, target_delta
ORDER BY delta_diff ASC, symbol ASC
) AS rn
FROM OptionDeltaDiffs
)
SELECT
timestamp,
putCall,
symbol,
description,
strikePrice,
delta,
volatility,
dte,
settlementType,
spot_price,
target_delta,
delta_label,
delta_diff
FROM RankedDeltas
WHERE rn = 1
ORDER BY timestamp, dte, settlementType, target_delta;

The teddysc_me_chain_SPX table has ~2 million rows, and the first run took 11s, which was not bad considering that I'm using the performance-killing ROW_NUMBER() OVER syntax.
But for real-time analysis, I think it's a good idea to cache query results for historical rows in a separate table.
I don't think these queries are getting JIT-compiled; re-running them takes about the same amount of time.
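Here's a rough sketch of that caching idea, assuming QuestDB's default PG-wire settings (localhost:8812, user admin, password quest, database qdb), a hypothetical cache table name, and that CREATE TABLE ... AS ( query ) accepts the CTE query above directly; this is not my actual pipeline:
# Hypothetical caching sketch: materialize the ranked-delta query once, then only
# re-run it for timestamps the cache doesn't have yet. RANKED_DELTA_SQL is assumed
# to hold the big query above; connection defaults are QuestDB's documented PG-wire defaults.
import psycopg2

conn = psycopg2.connect(host="localhost", port=8812, user="admin", password="quest", dbname="qdb")
conn.autocommit = True
with conn.cursor() as cur:
    # one-off backfill of historical rows into the cache table
    cur.execute(f"CREATE TABLE delta_targets_cache AS ({RANKED_DELTA_SQL})")
    # on later runs, find what's already cached...
    cur.execute("SELECT max(timestamp) FROM delta_targets_cache")
    last_cached_ts = cur.fetchone()[0]
    # ...and append only newer snapshots, e.g. by adding a
    # "WHERE o.timestamp > '<last_cached_ts>'" filter to OptionDeltaDiffs and running
    # INSERT INTO delta_targets_cache <that query>.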
PyPika to programmatically generate QuestDB SQL for simple filtering
PyPika is a Python SQL query builder that exposes the full richness of the SQL language using a syntax that reflects the resulting query. PyPika excels at all sorts of SQL queries but is especially useful for data analysis.
At first I used pandas to filter data, but it was just too slow. Simple filtering should be pushed as far up the pipeline as possible, using the most efficient tool available.
Polars actually evaluates the whole expression plan before it even decides which columns and rows it needs to read, and I like that very much.
What's even faster is to filter using the DB's native language, like this:
def generate_option_filter_sql_pypika_questdb(
    ticker: str,
    # spot: float, # Spot price is used for calculation *after* fetch in the original, not filtering
    # ts: datetime, # Timestamp is handled by the MAX subquery
    min_date: Optional[date] = None,
    max_date: Optional[date] = None,
    min_abs_delta: Optional[float] = DEFAULT_MIN_ABS_DELTA,
    max_abs_delta: Optional[float] = DEFAULT_MAX_ABS_DELTA,
    strike_lower_bound: Optional[float] = None,
    strike_upper_bound: Optional[float] = None,
    specific_strikes_list: Optional[List[float]] = None,
    taq_exclude_never_traded: bool = True,
    # fridays_only: bool = False, # Note: Difficult to implement reliably in SQL without DB specifics
    # remove_zero_dte_after_cutoff: bool = True,
    # add_is_non_major_last_day_of_week: bool = True, # This is metadata added later
    param_style: str = "?",  # Still relevant for non-date parameters
) -> tuple[str, list]:  # Return SQL and parameter list
    """
    Generates a QuestDB-compatible SQL SELECT query string using PyPika.
    Timestamps for expirationDate filtering are formatted and embedded directly.

    Args:
        ticker (str): The stock ticker symbol. Used to construct the table name.
        min_date (Optional[date]): Minimum expiration date (inclusive). Defaults to today.
        max_date (Optional[date]): Maximum expiration date (inclusive).
        min_abs_delta (Optional[float]): Minimum absolute delta value (inclusive).
        max_abs_delta (Optional[float]): Maximum absolute delta value (inclusive).
        strike_lower_bound (Optional[float]): Minimum strike price (inclusive).
        strike_upper_bound (Optional[float]): Maximum strike price (inclusive).
        specific_strikes_list (Optional[List[float]]): List of specific strikes to include.
        taq_exclude_never_traded (bool): If True, filter out options where last trade price is 0.
        param_style (str): The parameter placeholder style used by DB driver for non-date params.

    Returns:
        tuple[str, list]: A tuple containing:
            - str: A SQL query string with QuestDB timestamp literals and placeholders for other params.
            - list: A list of parameter values (excluding dates) in the order they appear in the SQL.
    """
Polars is really neat
In the demo at the top, you can see that I marked the monthly opex data points; hover over 4/17 to see them. This is how I add the is_major_opex flag to the tabular data:
# lf is a lazyframe
lf = lf.with_columns(
    pl.when(
        pl.col("potential_major_mask")
        & (pl.col("is_am_option") | ~pl.col("has_am_on_potential_date"))
    )
    .then(True)
    .otherwise(False)
    .alias("is_major_opex")
)
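The helper columns referenced above (potential_major_mask, is_am_option, has_am_on_potential_date) aren't shown in this post. Here's a hypothetical sketch of how they could be built, using the is_in_third_week_of_month expression from the next section and assuming expirationDate and settlementType columns like those in the SQL above:
# Hypothetical derivation of the helper columns; a sketch, not the actual code.
lf = lf.with_columns(
    # monthly opex candidates: a Friday in the third week of the month
    (
        is_in_third_week_of_month("expirationDate")
        & (pl.col("expirationDate").dt.weekday() == 5)  # 5 = Friday
    ).alias("potential_major_mask"),
    (pl.col("settlementType") == "AM").alias("is_am_option"),
).with_columns(
    # does any contract expiring on this date settle AM?
    pl.col("is_am_option").any().over("expirationDate").alias("has_am_on_potential_date"),
)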
Determining whether a date is in the third week of the trading calendar isn't as easy as it sounds
Here's how to do it without using df.apply, which is a glorified for loop:
# vectorized!
import polars as pl

try:  # IntoExpr's import path differs across Polars versions
    from polars._typing import IntoExpr
except ImportError:
    from polars.type_aliases import IntoExpr


def is_in_third_week_of_month(date_input: IntoExpr) -> pl.Expr:
    """
    Polars expression: Returns True if the given date falls within the third relevant week
    of the month that has at least 1 business day.

    Weeks are considered to start on Monday (ISO standard).
    A business day is Monday to Friday.

    Args:
        date_input: An expression or column name that results in a Date or Datetime column.

    Returns:
        A Polars expression evaluating to Boolean.

    Example:
        >>> from datetime import date
        >>> df = pl.DataFrame({
        ...     "dates": pl.date_range(date(2024, 4, 1), date(2024, 4, 30), eager=True)
        ... })
        >>> df.with_columns(
        ...     is_third_week = is_in_third_week_of_month("dates")
        ... ).filter(pl.col("is_third_week"))
        shape: (7, 2)
        ┌────────────┬───────────────┐
        │ dates      ┆ is_third_week │
        │ ---        ┆ ---           │
        │ date       ┆ bool          │
        ╞════════════╪═══════════════╡
        │ 2024-04-15 ┆ true          │
        │ 2024-04-16 ┆ true          │
        │ 2024-04-17 ┆ true          │
        │ 2024-04-18 ┆ true          │
        │ 2024-04-19 ┆ true          │
        │ 2024-04-20 ┆ true          │
        │ 2024-04-21 ┆ true          │
        └────────────┴───────────────┘
    """
    # Handle different input types correctly:
    # 1. If it's a string, assume it's a column name
    # 2. If it's already an expression, use it directly
    if isinstance(date_input, str):
        date_expr = pl.col(date_input)
    else:
        # Already an expression
        date_expr = date_input

    # Ensure we're working with date type
    # date_expr = date_expr.cast(pl.Date)

    # 1. Find the first day of the month
    first_day_of_month = date_expr.dt.month_start()

    # 2. Find the first business day (Mon-Fri) of the month
    #    dt.weekday() returns 1 for Monday up to 7 for Sunday
    weekday_1st = first_day_of_month.dt.weekday()
    days_to_add_to_find_fbd = (
        pl.when(weekday_1st == 6)
        .then(2)  # If Saturday, add 2 days to get Monday
        .when(weekday_1st == 7)
        .then(1)  # If Sunday, add 1 day to get Monday
        .otherwise(0)  # Otherwise, the first day is a business day
        .cast(pl.Int32)  # Cast required for duration arithmetic
    )
    first_business_day = first_day_of_month + pl.duration(days=days_to_add_to_find_fbd)

    # 3. Determine the start date (Monday) of the ISO week containing the first business day.
    #    dt.weekday() returns 1 (Mon) .. 7 (Sun)
    weekday_fbd = first_business_day.dt.weekday()
    # To get Monday, subtract (weekday - 1) days
    days_to_subtract_to_find_week_start = (weekday_fbd - 1).cast(pl.Int32)
    start_of_first_relevant_week = first_business_day - pl.duration(
        days=days_to_subtract_to_find_week_start
    )

    # 4. Calculate the start date (Monday) of the third relevant week.
    #    This is exactly 2 weeks (14 days) after the start of the first relevant week.
    start_of_third_relevant_week = start_of_first_relevant_week + pl.duration(weeks=2)

    # 5. Calculate the end date (Sunday) of the third relevant week.
    #    This is 6 days after the start of the third relevant week.
    end_of_third_relevant_week = start_of_third_relevant_week + pl.duration(days=6)

    # 6. Check if the input date falls within the calculated third relevant week (inclusive).
    result = (date_expr >= start_of_third_relevant_week) & (
        date_expr <= end_of_third_relevant_week
    )
    return result
Filtering for ATM options is harder than it seems
The query below joins each chain snapshot to its spot price and then, within each (timestamp, dte, settlementType) group, keeps the row whose strike is closest to spot:
WITH OptionDiffs AS (
SELECT
o.timestamp,
o.putCall,
o.symbol,
o.description,
o.strikePrice,
o.dte,
o.settlementType,
m.underlyingPrice AS spot_price,
abs(o.strikePrice - m.underlyingPrice) AS price_diff
FROM
teddysc_me_chain_SPX AS o
JOIN
teddysc_me_chain_metadata AS m
ON o.timestamp = m.timestamp AND m.symbol = '$SPX'
),
RankedOptions AS (
SELECT
*,
row_number() OVER (
PARTITION BY timestamp, dte, settlementType
ORDER BY price_diff ASC, symbol ASC
) AS rn
FROM OptionDiffs
)
SELECT
timestamp,
putCall,
symbol,
description,
strikePrice,
dte,
settlementType,
spot_price,
price_diff
FROM RankedOptions
WHERE rn = 1
ORDER BY timestamp, dte, settlementType;
Here's the questdb-cli output for this query (dte is the days to expiration at snapshot time): you can see that different tenors have different strike densities, which is why I have the Connect Gaps option in my plotter.
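For comparison, here's roughly what the same ATM selection could look like client-side in Polars, assuming chain_df and metadata_df (hypothetical names) hold the chain and metadata snapshots with the same column names as the tables above; a sketch, not my production code:
# Hypothetical Polars equivalent of the ATM query above.
atm = (
    chain_df.join(
        metadata_df.filter(pl.col("symbol") == "$SPX").select(["timestamp", "underlyingPrice"]),
        on="timestamp",
    )
    .with_columns(
        (pl.col("strikePrice") - pl.col("underlyingPrice")).abs().alias("price_diff")
    )
    # within each snapshot/tenor group, keep the strike closest to spot
    .sort(["price_diff", "symbol"])
    .group_by(["timestamp", "dte", "settlementType"], maintain_order=True)
    .first()
    .sort(["timestamp", "dte", "settlementType"])
)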