Complex SQL to Filter Through 20 Million Rows of Option Chain Snapshot Data

By Teddy Xinyuan Chen

In my DE for Market Research post I talked about how I fetch and store option chain snapshot data (last quotes and trades, plus calcs) on an intraday basis. That started on 3/14, and I've now collected 20 million rows of data.

I've been playing with the realtime data by plotting surfaces, or slices of surfaces, of different metrics for different tickers:

Full screen (5 plots in total): https://gg.teddysc.me/?g=12d79db1018fa41c8f57b290a3b0ccb3&a&c=2

The time of the data and the spot price at that time are shown in the subtitle.

The vol surfaces look bumpier than they actually are, because the calcs data provider does not normalize time based on whether it's market hours, overnight on a weekday, or a weekend.

Filter based on Delta Targets

As you can see in the gallery above, I prefer to use delta as one of the axes instead of strike price; this way I get a view of the relationship between fixed-delta vols across different tenors. Selecting data by delta is also a great way to normalize the data: it works better than using strike price or the ratio of strike price to underlying price.

I love using delta as an axis, but it also complicates the code, because we need to find the rows with the closest delta to each target within each group:

DEFAULT_DELTA_TARGETS_225 = {
    "5DP": -0.05,
    "22DP": -0.22, # <- 22d near vanna peaks
    "50D": 0.50,
    "22DC": 0.22,
    "5DC": 0.05,
}
# pl = polars; refactored from pandas, which sped things up many times over
# I only want OTM options
    elif cols == "delta":  # my app supports pivoting by delta, strike, or ratio of strike to underlying price
        logger.info("Pivoting by Delta using OTM options (Polars group/agg loop)...")
        assert delta_targets_ is not None

        # Add abs_delta column
        otm_df = otm_df.with_columns(pl.col("delta").abs().alias("abs_delta"))

        results = []
        # Group by time axis in Polars
        grouped = otm_df.group_by(
            time_axis_col_name, maintain_order=True
        )  # Maintain order for consistency

        for time_point_tuple, group_df in grouped:
            time_point = time_point_tuple[0]  # group_by returns tuple key
            row_data = {time_axis_col_name: time_point}
            otm_puts = group_df.filter(pl.col("putCall") == "PUT")
            otm_calls = group_df.filter(pl.col("putCall") == "CALL")

            for label, target_delta in delta_targets_.items():
                found_value = np.nan  # Default
                try:
                    is_put_target = "P" in label and target_delta < 0
                    is_call_target = "C" in label and target_delta > 0
                    is_50d_target = (
                        abs(target_delta - 0.50) < 1e-6
                        or abs(target_delta + 0.50) < 1e-6
                    )

                    target_options_lf = (
                        pl.DataFrame().lazy()
                    )  # Initialize empty LazyFrame
                    diff_expr = None

                    if is_put_target and not otm_puts.is_empty():
                        target_options_lf = otm_puts.lazy()
                        diff_expr = (pl.col("delta") - target_delta).abs()
                    elif is_call_target and not otm_calls.is_empty():
                        target_options_lf = otm_calls.lazy()
                        diff_expr = (pl.col("delta") - target_delta).abs()
                    elif is_50d_target and not group_df.is_empty():
                        target_options_lf = group_df.lazy()
                        diff_expr = (pl.col("abs_delta") - 0.50).abs()

                    if diff_expr is not None:
                        closest_row_lf = target_options_lf.sort(diff_expr).limit(1)
                        closest_row_df = closest_row_lf.collect()
                        if not closest_row_df.is_empty():
                            found_value = closest_row_df[value_col_name][0]

                except Exception as e:
                    # Log specific errors if needed, e.g., column not found
                    logger.debug(
                        f"Error finding delta {label} at {time_point} in group: {e}"
                    )

                row_data[label] = found_value
            results.append(row_data)
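
For comparison, here's a hedged sketch of how the same closest-delta selection could be done without the Python-level loop, keeping everything inside the Polars engine. It reuses otm_df, time_axis_col_name, and value_col_name from above, with one label and target hard-coded for brevity:

import polars as pl

# Per time bucket, take the value of the OTM put whose delta is closest to -0.22.
closest_22dp = (
    otm_df.lazy()
    .filter(pl.col("putCall") == "PUT")
    .group_by(time_axis_col_name, maintain_order=True)
    .agg(
        pl.col(value_col_name)
        .sort_by((pl.col("delta") - (-0.22)).abs())  # sort the group by distance to target
        .first()                                     # keep the closest row's value
        .alias("22DP")
    )
    .collect()
)

One such frame per delta target can then be joined back on the time column to rebuild the pivoted table.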

To do the same thing in QuestDB, this is what I use:

-- chaining 3 CTEs to find rows with closest delta to targets
-- timestamp is snapshot time
-- dte, settlementType (AM / PM) identifies a tenor within a snapshot

WITH TargetDeltas AS (
  SELECT -0.05 AS target_delta, '-5DP' AS delta_label UNION ALL
  SELECT -0.22 AS target_delta, '-22DP' AS delta_label UNION ALL
  SELECT  0.50 AS target_delta, '50D' AS delta_label UNION ALL
  SELECT  0.22 AS target_delta, '22DC' AS delta_label UNION ALL
  SELECT  0.05 AS target_delta, '5DC' AS delta_label
),
OptionDeltaDiffs AS (
  SELECT
    o.timestamp,
    o.putCall,
    o.symbol,
    o.description,
    o.strikePrice,
    o.delta,
    o.volatility,
    o.dte,
    o.settlementType,
    m.underlyingPrice AS spot_price,
    td.target_delta,
    td.delta_label,
    abs(o.delta - td.target_delta) AS delta_diff
  FROM
    teddysc_me_chain_SPX AS o
  JOIN
    teddysc_me_chain_metadata AS m ON o.timestamp = m.timestamp AND m.symbol = '$SPX'
  CROSS JOIN
    TargetDeltas td
  WHERE
    o.delta IS NOT NULL
),
RankedDeltas AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY timestamp, dte, settlementType, target_delta
           ORDER BY delta_diff ASC, symbol ASC
         ) AS rn
  FROM OptionDeltaDiffs
)
SELECT
  timestamp,
  putCall,
  symbol,
  description,
  strikePrice,
  delta,
  volatility,
  dte,
  settlementType,
  spot_price,
  target_delta,
  delta_label,
  delta_diff
FROM RankedDeltas
WHERE rn = 1
ORDER BY timestamp, dte, settlementType, target_delta;
Query result screenshot: this data can be used to generate the vol cone in my demo above without much further processing.

The teddysc_me_chain_SPX table has ~2 million rows, and the first run took 11s, which is not bad considering that I'm using the performance-killing ROW_NUMBER() OVER syntax.

But for real-time analysis, I think it's a good idea to cache query results for historical rows in a separate table.
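
Here's a minimal sketch of what that cache could look like in QuestDB. The table name is hypothetical, and for brevity the inner query is just the snapshot-to-spot join; in practice the full 3-CTE query above would go inside the parentheses:

-- materialize once, partitioned by day (hypothetical cache table)
CREATE TABLE closest_delta_cache AS (
  SELECT o.timestamp, o.symbol, o.dte, o.settlementType, o.delta, o.volatility,
         m.underlyingPrice AS spot_price
  FROM teddysc_me_chain_SPX AS o
  JOIN teddysc_me_chain_metadata AS m
    ON o.timestamp = m.timestamp AND m.symbol = '$SPX'
) TIMESTAMP(timestamp) PARTITION BY DAY;

-- on each refresh, append only snapshots newer than the cached cutoff;
-- the cutoff ('2025-04-17' is just an example value, tracked by the app)
INSERT INTO closest_delta_cache
SELECT o.timestamp, o.symbol, o.dte, o.settlementType, o.delta, o.volatility,
       m.underlyingPrice AS spot_price
FROM teddysc_me_chain_SPX AS o
JOIN teddysc_me_chain_metadata AS m
  ON o.timestamp = m.timestamp AND m.symbol = '$SPX'
WHERE o.timestamp > '2025-04-17';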

I don't think these queries are getting JIT-compiled; re-running takes about the same amount of time.

PyPika to programmatically generate QuestDB SQL for simple filtering

PyPika is a Python SQL query builder that exposes the full richness of the SQL language using a syntax that reflects the resulting query. PyPika excels at all sorts of SQL queries but is especially useful for data analysis.

At first I used pandas to filter the data, but it was just too slow. Simple filtering should be pushed as far upstream in the pipeline as possible, using the most efficient tool available.

Polars actually analyzes the whole expression before it even decides which columns and rows it needs to read, and I like that very much.
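
For example, a small illustration of that pushdown, assuming a hypothetical Parquet dump of the chain data:

import polars as pl

# With a LazyFrame, the filter and column selection are pushed down into the
# scan, so only the needed columns and matching rows are read.
lf = (
    pl.scan_parquet("chain_snapshots.parquet")  # hypothetical file
    .filter(pl.col("delta").abs().is_between(0.05, 0.50))
    .select("timestamp", "symbol", "delta", "volatility")
)
print(lf.explain())  # the optimized plan shows the pushed-down predicate/projection
df = lf.collect()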

What's even faster is to filter using the DB's native language, like this:

def generate_option_filter_sql_pypika_questdb(
    ticker: str,
    # spot: float, # Spot price is used for calculation *after* fetch in the original, not filtering
    # ts: datetime, # Timestamp is handled by the MAX subquery
    min_date: Optional[date] = None,
    max_date: Optional[date] = None,
    min_abs_delta: Optional[float] = DEFAULT_MIN_ABS_DELTA,
    max_abs_delta: Optional[float] = DEFAULT_MAX_ABS_DELTA,
    strike_lower_bound: Optional[float] = None,
    strike_upper_bound: Optional[float] = None,
    specific_strikes_list: Optional[List[float]] = None,
    taq_exclude_never_traded: bool = True,
    # fridays_only: bool = False, # Note: Difficult to implement reliably in SQL without DB specifics
    # remove_zero_dte_after_cutoff: bool = True,
    # add_is_non_major_last_day_of_week: bool = True, # This is metadata added later
    param_style: str = "?",  # Still relevant for non-date parameters
) -> tuple[str, list]:  # Return SQL and parameter list
    """
    Generates a QuestDB-compatible SQL SELECT query string using PyPika.
    Timestamps for expirationDate filtering are formatted and embedded directly.

    Args:
        ticker (str): The stock ticker symbol. Used to construct the table name.
        min_date (Optional[date]): Minimum expiration date (inclusive). Defaults to today.
        max_date (Optional[date]): Maximum expiration date (inclusive).
        min_abs_delta (Optional[float]): Minimum absolute delta value (inclusive).
        max_abs_delta (Optional[float]): Maximum absolute delta value (inclusive).
        strike_lower_bound (Optional[float]): Minimum strike price (inclusive).
        strike_upper_bound (Optional[float]): Maximum strike price (inclusive).
        specific_strikes_list (Optional[List[float]]): List of specific strikes to include.
        taq_exclude_never_traded (bool): If True, filter out options where last trade price is 0.
        param_style (str): The parameter placeholder style used by DB driver for non-date params.

    Returns:
        tuple[str, list]: A tuple containing:
            - str: A SQL query string with QuestDB timestamp literals and placeholders for other params.
            - list: A list of parameter values (excluding dates) in the order they appear in the SQL.
    """

Polars is really neat

In the demo at the top, if you hover over 4/17 you can see that I've marked the monthly opex data points. This is how I add the is_major_opex flag to the tabular data:

# lf is a LazyFrame
lf = lf.with_columns(
    pl.when(
        pl.col("potential_major_mask")
        & (pl.col("is_am_option") | ~pl.col("has_am_on_potential_date"))
    )
    .then(True)
    .otherwise(False)
    .alias("is_major_opex")
)
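
Those helper columns come from earlier steps in the pipeline. A rough, hypothetical sketch of how they could be derived (my guess at the shape, not the exact logic), using the third-week check defined in the next section:

lf = lf.with_columns(
    # expirations in the third relevant week of their month are opex candidates
    is_in_third_week_of_month("expirationDate").alias("potential_major_mask"),
    (pl.col("settlementType") == "AM").alias("is_am_option"),
).with_columns(
    # is any AM-settled option listed on this candidate expiration date?
    pl.col("is_am_option").any().over("expirationDate").alias("has_am_on_potential_date"),
)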

Determining if a date is in the third week of the trading calendar isn't as easy as it sounds

Here's how to do it without using df.apply, which is a glorified for loop:

# vectorized!
import polars as pl
from polars._typing import IntoExpr  # "column name or expression"; the import path varies across polars versions
def is_in_third_week_of_month(date_input: IntoExpr) -> pl.Expr:
    """
    Polars expression: Returns True if the given date falls within the third relevant week
    of the month that has at least 1 business day.
    Weeks are considered to start on Monday (ISO standard).
    A business day is Monday to Friday.

    Args:
        date_input: An expression or column name that results in a Date or Datetime column.

    Returns:
        A Polars expression evaluating to Boolean.

    Example:
        >>> from datetime import date
        >>> df = pl.DataFrame({
        ...    "dates": pl.date_range(date(2024, 4, 1), date(2024, 4, 30), eager=True)
        ... })
        >>> df.with_columns(
        ...     is_third_week = is_in_third_week_of_month("dates")
        ... ).filter(pl.col("is_third_week"))
        shape: (7, 2)
        ┌────────────┬───────────────┐
        │ dates      ┆ is_third_week │
        │ ---        ┆ ---           │
        │ date       ┆ bool          │
        ╞════════════╪═══════════════╡
        │ 2024-04-15 ┆ true          │
        │ 2024-04-16 ┆ true          │
        │ 2024-04-17 ┆ true          │
        │ 2024-04-18 ┆ true          │
        │ 2024-04-19 ┆ true          │
        │ 2024-04-20 ┆ true          │
        │ 2024-04-21 ┆ true          │
        └────────────┴───────────────┘

    """
    # Handle different input types correctly:
    # 1. If it's a string, assume it's a column name
    # 2. If it's already an expression, use it directly
    if isinstance(date_input, str):
        date_expr = pl.col(date_input)
    else:
        # Already an expression
        date_expr = date_input

    # Ensure we're working with date type
    # date_expr = date_expr.cast(pl.Date)

    # 1. Find the first day of the month
    first_day_of_month = date_expr.dt.month_start()

    # 2. Find the first business day (Mon-Fri) of the month
    # dt.weekday() returns 1 for Monday up to 7 for Sunday
    weekday_1st = first_day_of_month.dt.weekday()
    days_to_add_to_find_fbd = (
        pl.when(weekday_1st == 6)
        .then(2)  # If Saturday, add 2 days to get Monday
        .when(weekday_1st == 7)
        .then(1)  # If Sunday, add 1 day to get Monday
        .otherwise(0)  # Otherwise, the first day is a business day
        .cast(pl.Int32)  # Cast required for duration arithmetic
    )
    first_business_day = first_day_of_month + pl.duration(days=days_to_add_to_find_fbd)

    # 3. Determine the start date (Monday) of the ISO week containing the first business day.
    # dt.weekday() returns 1 (Mon) .. 7 (Sun)
    weekday_fbd = first_business_day.dt.weekday()
    # To get Monday, subtract (weekday - 1) days
    days_to_subtract_to_find_week_start = (weekday_fbd - 1).cast(pl.Int32)
    start_of_first_relevant_week = first_business_day - pl.duration(
        days=days_to_subtract_to_find_week_start
    )

    # 4. Calculate the start date (Monday) of the third relevant week.
    # This is exactly 2 weeks (14 days) after the start of the first relevant week.
    start_of_third_relevant_week = start_of_first_relevant_week + pl.duration(weeks=2)

    # 5. Calculate the end date (Sunday) of the third relevant week.
    # This is 6 days after the start of the third relevant week.
    end_of_third_relevant_week = start_of_third_relevant_week + pl.duration(days=6)

    # 6. Check if the input date falls within the calculated third relevant week (inclusive).
    result = (date_expr >= start_of_third_relevant_week) & (
        date_expr <= end_of_third_relevant_week
    )

    return result

Filtering for ATM options is harder than it seems

The same closest-match trick works here: join each snapshot to its spot price and, within each tenor, rank strikes by their distance to spot:

WITH OptionDiffs AS (
  SELECT
    o.timestamp,
    o.putCall,
    o.symbol,
    o.description,
    o.strikePrice,
    o.dte,
    o.settlementType,
    m.underlyingPrice AS spot_price,
    abs(o.strikePrice - m.underlyingPrice) AS price_diff
  FROM
    teddysc_me_chain_SPX AS o
  JOIN
    teddysc_me_chain_metadata AS m 
    ON o.timestamp = m.timestamp AND m.symbol = '$SPX'
),
RankedOptions AS (
  SELECT
    *,
    row_number() OVER (
      PARTITION BY timestamp, dte, settlementType
      ORDER BY price_diff ASC, symbol ASC
    ) AS rn
  FROM OptionDiffs
)
SELECT
  timestamp,
  putCall,
  symbol,
  description,
  strikePrice,
  dte,
  settlementType,
  spot_price,
  price_diff
FROM RankedOptions
WHERE rn = 1
ORDER BY timestamp, dte, settlementType;
Using my questdb-cli for this query (dte is the DTE at snapshot time). You can see that different tenors have different densities of strikes, which is why I have the Connect Gaps option in my plotter.