Data Engineering for Market Research
Author: Teddy Xinyuan Chen
I wanted to set up my computer for more advanced market research that requires tons of data. In this post I'll share how I solved some of my problems. Data engineering is a large part of the setup, but not all of it.
Table of Contents
- Problem: Storing historical data across intervals and tickers, and snapshots of full options chains for some of the most optioned tickers, in a sane, efficient, and easy-to-access-later way.
- Problem: Real Time Alert?
- Problem: Handling real time streaming data?
- Problem: managing multiple local long-running processes?
- Solution: Convert them to launchd(8) jobs and let the OS handle it
Problem: Storing historical data across intervals and tickers, and snapshots of full options chains for some of the most optioned tickers, in a sane, efficient, and easy-to-access-later way.
Huge volume of data
Even if I only store sparse chain snapshots (<5 snapshots per day) and low-resolution (1-minute to daily) spot OHLCV data, the volume adds up quickly.
The chains I currently fetch frequently and store are: spx, vix, nvda, googl, aapl, meta, smh, qqq. Initially I wanted to cover all the Mag 7 names and some other tickers I'm interested in, but it's just too much. I'd love to get RTY chains, but the API I'm currently using won't let me.
I gave up on fetching NDX and RUT chains; each is about half the size of the SPX chain (still very heavy). The reason I wanted to use QQQ and RTY as proxies for those two indexes is that I've seen Cboe's research do exactly that.
The most optioned ticker is, obviously, SPX. Its tenors span from today all the way to 2030, and there are roughly 26,500 total strikes across all tenors on any given date as of the time of writing.
Now let's talk about 1-minute data. For 1 ticker we get 390 rows per day; for 1 year (252 trading days) that's 98,280 rows.
I only store 1-minute data for the SPX and DJI components, about 519 tickers in total, plus some indexes, so let's say 530 in total. For 1 year that comes to 52,088,400 rows. Yes, 52 million rows, and it takes around 4 GB of disk space using ArcticDB's columnar store (LMDB backend).
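Just to sanity-check those numbers (the 4 GB figure is what I observe on disk, not something derivable from the row count alone):

```python
rows_per_day = 390        # 09:30-16:00 regular session, 1-minute bars
trading_days = 252
tickers = 530             # SPX + DJI components plus a few indexes

rows_per_ticker_year = rows_per_day * trading_days    # 98,280
total_rows_per_year = rows_per_ticker_year * tickers  # 52,088,400
print(f"{rows_per_ticker_year:,} rows/ticker/year, {total_rows_per_year:,} rows total")
```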
How do you even organize this?
If you just have daily data for a few tickers, I think it's acceptable to just put them in CSVs, like this:
.
└── data
    └── 20250321        # manual time partition by DAY
        └── 1min        # interval
            └── spx.csv # just 390 rows :)
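A minimal sketch of writing into that layout with pandas (the helper name and paths are just for illustration):

```python
from pathlib import Path

import pandas as pd


def store_daily_csv(df: pd.DataFrame, ticker: str, date: str, interval: str = "1min") -> Path:
    """Write one ticker's bars for one day into the data/<date>/<interval>/ layout."""
    out_dir = Path("data") / date / interval
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{ticker.lower()}.csv"
    df.to_csv(out_path, index=False)
    return out_path


# e.g. store_daily_csv(spx_1min_df, "SPX", "20250321") -> data/20250321/1min/spx.csv
```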
But the chain snapshots are a LOT messier. The equities' and indexes' historical OHLC(V) data is just trades, not quotes, while the chain (just level one) contains trades and quotes (TAQ) and a timestamp for each. On top of that there are the "calcs" (greeks), OI, contract specs, and regulation-related flags (like penny pilot, not useful, but it's in the returned data). There are more than 60 columns for each strike, which is massive. One SPX snapshot takes >20 MB in ArcticDB, iirc.
What about just using one big Parquet file / DuckDB for everything, or one per ticker?
I still think it's a mess on disk.
DuckDB is recommended by so many people, but I'm currently happy with my stack, so I'm not investing time in learning it.
And don't even mention SQLite! I tried it and the resulting file was really, really huge; SQLite has limited data types and is slow and space-inefficient.
Current solution: ArcticDB and QuestDB
Because I use pandas for most of my analysis, ArcticDB is a pretty good choice. Their tagline is to 'let quants spin up analysis in seconds', and I gotta say it's very ergonomic, if you look past the non-existent support for MultiIndex pandas DataFrames.
ArcticDB provides a very nice interface to organize your data hierarchically (db -> library -> symbol (aka table)), with no explicit schema definition. The design is time-series focused: you can `append` only if the first datetime in the data you want to append is after the current last row's datetime, and you can `update` a table by replacing the datetime range implied by the new data.
Data is versioned nicely, and you can create snapshots of a db, or just part of it, as you like, which is very flexible. You never have to mess with nested directory paths again, which is a big plus.
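Here's a minimal sketch of that append/update behaviour against a local LMDB store (the path, library, and symbol names are made up):

```python
import arcticdb as adb
import pandas as pd

ac = adb.Arctic("lmdb:///tmp/market_data")                   # db
lib = ac.get_library("ohlcv_1min", create_if_missing=True)   # library

day1 = pd.DataFrame(
    {"close": [5610.0, 5612.5]},
    index=pd.to_datetime(["2025-03-20 09:30", "2025-03-20 09:31"]),
)
lib.write("SPX", day1)  # symbol (aka table), no schema declared anywhere

# append only works because day2 starts after the last stored timestamp
day2 = pd.DataFrame({"close": [5625.0]}, index=pd.to_datetime(["2025-03-21 09:30"]))
lib.append("SPX", day2)

# update replaces whatever datetime range the new data covers
fix = pd.DataFrame({"close": [5611.0]}, index=pd.to_datetime(["2025-03-20 09:31"]))
lib.update("SPX", fix)

print(lib.read("SPX").data)
```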
Storing a snapshot of a chain for a single ticker:
def adb_store_option_chain_snapshot(
lib: adb.library.Library, contracts_df: pd.DataFrame, metadata_df: pd.DataFrame
) -> dict[str, adb.VersionedItem | pd.Timestamp | None]:
"""Stores an options chain snapshot (contracts and metadata).
Args:
lib: The ArcticDB library object.
contracts_df: DataFrame with contract data.
metadata_df: DataFrame with metadata.
"""
# --- 1. Process and Store contracts_df ---
# Get snapshot timestamp (use max for safety) and convert to UTC
snapshot_timestamp = pd.to_datetime(
contracts_df["quoteTimeInLong"].max(), unit="ms", utc=True
)
# Add snapshot_timestamp as a *column* (for filtering)
contracts_df["snapshot_timestamp"] = snapshot_timestamp
# Set the index (for time-based queries)
contracts_df = contracts_df.set_index("snapshot_timestamp")
# Derive ticker symbol from metadata (handle potential errors)
try:
ticker = metadata_df["symbol"].iloc[0].replace("$", "") # Remove '$' prefix
if not isinstance(ticker, str) or not ticker:
raise ValueError("Invalid ticker symbol in metadata.")
except (KeyError, IndexError, ValueError) as e:
# print(f"Error extracting ticker from metadata: {e}. Using 'UNKNOWN'.")
# ticker = "UNKNOWN"
raise ValueError(
f"Error extracting ticker from metadata: {e}. Cannot store empty metadata."
)
# contracts_symbol = f"{ticker}_C" # Short symbol name (e.g., "SPX_C")
contracts_symbol = f"{ticker}" # Short symbol name (e.g., "SPX_C")
# fix col order mismatching
if contracts_symbol in lib.list_symbols():
existing_cols = lib[contracts_symbol].data.columns
contracts_df = contracts_df[existing_cols]
# update so that we can insert the records even if the quote time >= the latest quote time in the table
version_item_contracts = lib.update(
contracts_symbol, contracts_df, prune_previous_versions=False
)
else:
# append to create new symbol
version_item_contracts = lib.append(
contracts_symbol, contracts_df, prune_previous_versions=False
)
# --- 2. Process and Store metadata_df ---
# Handle edge cases for metadata_df
if metadata_df.empty:
# print("Warning: metadata_df is empty. Skipping write.")
# return
raise ValueError("metadata_df is empty. Cannot store empty metadata.")
if len(metadata_df) > 1:
print("Warning: metadata_df has more than one row. Using the first row.")
metadata_df = metadata_df.iloc[:1] # Use only the first row
# Use the *same* snapshot timestamp as contracts_df
metadata_df["snapshot_timestamp"] = snapshot_timestamp
metadata_df = metadata_df.set_index("snapshot_timestamp")
# metadata_symbol = f"{ticker}_M" # Short symbol name (e.g., "SPX_M")
metadata_symbol = "metadata" # Short symbol name (e.g., "SPX_M")
if metadata_symbol in lib.list_symbols():
versioned_item_meta = lib.update(
metadata_symbol, metadata_df, prune_previous_versions=False
)
else:
versioned_item_meta = lib.append(
metadata_symbol, metadata_df, prune_previous_versions=False
)
return {
"meta": versioned_item_meta,
"contracts": version_item_contracts,
"timestamp": snapshot_timestamp,
}
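Hypothetical usage, assuming a `fetch_chain()` helper (not shown here) that returns the contracts and metadata DataFrames from the chain API:

```python
import arcticdb as adb

ac = adb.Arctic("lmdb:///tmp/market_data")
chains_lib = ac.get_library("option_chains", create_if_missing=True)

# fetch_chain() is a placeholder for whatever returns (contracts_df, metadata_df)
contracts_df, metadata_df = fetch_chain("$SPX")
result = adb_store_option_chain_snapshot(chains_lib, contracts_df, metadata_df)
print(result["timestamp"], result["contracts"].version)
```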
Getting SPX options that expire within 1 week (I ran this on 2025-03-16, when I had only one snapshot), whose delta is not too extreme, and selecting only a few columns.
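A sketch of that kind of filter with ArcticDB's `QueryBuilder` (the column names, like `daysToExpiration` and `delta`, are assumptions about the chain payload, and this particular filter only keeps positive deltas):

```python
import arcticdb as adb
from arcticdb import QueryBuilder

lib = adb.Arctic("lmdb:///tmp/market_data").get_library("option_chains")

q = QueryBuilder()
# expiring within a week, delta not pinned near 0 or 1
q = q[(q["daysToExpiration"] <= 7) & (q["delta"] > 0.05) & (q["delta"] < 0.95)]

spx = lib.read(
    "SPX",
    query_builder=q,
    columns=["symbol", "putCall", "strikePrice", "delta", "bid", "ask"],
).data
print(spx.head())
```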

One of the contract symbols returned: SPXW 250321C05950000
Two things I don't like about ArcticDB:
- Naive deduplication, with no customization allowed. You can only have one datetime index that works "magically" with `append` or `update`. This is a huge problem when designing the optimal way to store/organize chain snapshots in the db: should the index be the timestamp when I fetched the snapshot, the last quote time, or the last trade time? I initially used the last quote time, which caused issues when `append`ing, because some strikes are stale and their last quote time didn't change between my fetches, so ArcticDB wouldn't let me append.
- No file-level partitioning on disk, so after a few days I'm left with huge binary files. I get that it's columnar and indeed loads very fast, but no partitioning means that backing up is a headache. I upload the huge file to my backup S3 bucket at EOD, and it takes longer and longer as the file grows.
QuestDB is a time-series DB / OLAP store, very focused on the time-series and financial domain (it has built-in functions like `l2price` and `vwap`; I imagine the `vwap` would be a lot faster than computing it in pandas / polars), unlike Yandex's ClickHouse. Setting it up is a breeze.
Before QuestDB I also read the first 4 chapters of Q for Mortals
and got a kdb+ trial license. It's really fun and fast, but I'd say the learning curve isn't the friendliest.

10 {x,sum -2#x}/ 1 1
The unnamed function is literally the definition of fib: https://code.kx.com/q4m3/1_Q_Shock_and_Awe/#112-example-fibonacci-numbers
Q is an array language in the APL family that initially looks like a human-friendly brainfuck. I like how terse and how loopless it is.
I imagine it's not quants at firms who operate it, but specialist engineers (the "Q Gods", as the book says) who spent years learning Kx's proprietary technologies and who really enjoy writing (and reading, which is allegedly harder) Q as a job. They're paid top dollar (for engineers), but they're not involved in the research.
I tried Kx's NYSE TAQ demo for kdb+ at https://github.com/kxsystems/kdb-taq, and my computer handled >10 GB of data with relative ease. Amazing.
People have polarized opinions about Kx's technologies:
- HN: https://hn.algolia.com/?dateRange=all&page=0&prefix=true&query=kdb&sort=byPopularity&type=story
- Reddit: https://www.google.com/search?q=site:reddit.com/r/algotrading+OR+site:reddit.com/r/quant+kdb
Some say it's legacy tech, some praise that it's compact enough to run entirely out of CPU cache, and some worry that the $100k/year license fee is too much (I'm surprised the firms are only charged that much).
QuestDB is also blazing fast, and it solves the two complaints I have about ArcticDB. At the current stage I'm storing data in both. For 1-minute data I partition by DAY, and for daily data by MONTH. For equities I DEDUP like this:
CREATE TABLE equities_1d (
timestamp TIMESTAMP,
open DOUBLE,
high DOUBLE,
low DOUBLE,
close DOUBLE,
volume LONG,
ticker SYMBOL CAPACITY 1024 -- Overestimate slightly; powers of 2 are efficient (I only need around 600 for now). SYMBOL means categorical
) timestamp(timestamp) PARTITION BY YEAR DEDUP UPSERT KEYS(timestamp, ticker);

Data ingestion via the InfluxDB Line Protocol is insanely fast, and I like the pandas / numpy integration, also via the protocol.
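A sketch of that pandas path using the official `questdb` Python client; the table name here is hypothetical and the connection string assumes a default local install:

```python
import pandas as pd
from questdb.ingress import Sender  # pip install questdb

df = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(["2025-03-21 09:30", "2025-03-21 09:31"], utc=True),
        "open": [5610.0, 5612.5],
        "high": [5612.0, 5613.0],
        "low": [5609.5, 5611.0],
        "close": [5611.5, 5612.0],
        "volume": [120_000, 98_000],
        "ticker": pd.Categorical(["SPX", "SPX"]),  # categorical maps to a SYMBOL column
    }
)

# rows go over ILP/HTTP; QuestDB creates the table if it doesn't exist yet
with Sender.from_conf("http::addr=localhost:9000;") as sender:
    sender.dataframe(df, table_name="equities_1min", at="timestamp")
```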
The REST API makes it easily `curl`able, as there are just 3 endpoints. For Postgres enjoyers (not me), QuestDB also exposes a Postgres wire interface.
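For example, hitting the `/exec` endpoint from Python (the query itself is just an illustration against the table above):

```python
import requests

resp = requests.get(
    "http://localhost:9000/exec",
    params={"query": "SELECT ticker, last(close) AS last_close FROM equities_1d"},
)
resp.raise_for_status()
payload = resp.json()
print(payload["columns"])  # column names and types
print(payload["dataset"])  # rows as nested lists
```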
Problem: Real Time Alert?
I have a dirty script that I want to run at exactly the first second of every minute during trading hours. First I tried to hack it together with `sleep`; it works, but it's not very elegant, as you would imagine.
Then I tried the `apscheduler` library:
#!/usr/bin/env python3
from datetime import datetime
import argparse
import asyncio
from apscheduler.schedulers.asyncio import AsyncIOScheduler
from apscheduler.triggers.cron import CronTrigger
async def run_task(script_cmd):
"""Your task logic goes here"""
try:
print(f"Task executing at {datetime.now().strftime('%H:%M:%S')}")
await asyncio.sleep(0.01) # Small delay to allow cancellation
# execute user specified script
proc = await asyncio.create_subprocess_shell(script_cmd)
await proc.communicate()
print(f"Task executed successfully at {datetime.now().strftime('%H:%M:%S')}")
except asyncio.CancelledError:
print(f"Task cancelled at {datetime.now().strftime('%H:%M:%S')}")
raise # Re-raise to properly handle the cancellation
async def main():
parser = argparse.ArgumentParser(description="Run a task every minute at hh:mm:01")
parser.add_argument(
"script",
help="User specified script command to execute",
)
args = parser.parse_args()
print("Scheduler started, running task every minute at second 1...")
scheduler = AsyncIOScheduler()
scheduler.add_job(run_task, CronTrigger(second=1), args=[args.script])
scheduler.start()
# Keep the scheduler running indefinitely
try:
while True:
await asyncio.sleep(3600) # Keep the main task alive
except asyncio.CancelledError:
print("Scheduler stopped.")
finally:
scheduler.shutdown()
if __name__ == "__main__":
asyncio.run(main())
Still kinda verbose, but I like it (it feels more robust than the `sleep` hack).
Now I can just run `./run_every_min.py ./your-alert-script`, with just a few issues:
- No easy way to track how each run went, or the alert status; everything is dumped to stdout/stderr.
- It's obviously not the ideal way to do this.
Solution: Prefect for task scheduling and tracking
Thanks r/algotrading for the suggestion :)
I feel like around 10% of the people on that sub are extremely good at data engineering and at training all kinds of models (that actually work).
Prefect has 3 schedule types (cron, interval, and RRule); cron is the easiest one.
# crontab for every minute during market hours
# you can combine multiple cron schedules with Prefect's `Schedules`; it's not possible to merge them into 1 crontab line
# command-to-run checks a market-hours API first, so it won't fire on holidays
30-59 9 * * 1-5 command-to-run
* 10-15 * * 1-5 command-to-run
30 16 * * 1-5 command-to-run
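A minimal sketch of what one of those deployments could look like (the flow and deployment names are made up; each cron line above would get its own schedule or deployment):

```python
import subprocess

from prefect import flow


@flow(log_prints=True)
def minute_alert(script_cmd: str = "./your-alert-script"):
    """Run the alert script once; Prefect tracks state, logs, and run history."""
    print(f"running {script_cmd!r}")
    subprocess.run(script_cmd, shell=True, check=True)


if __name__ == "__main__":
    # serve() registers the cron schedule and polls the Prefect server for runs
    minute_alert.serve(name="alert-core-hours", cron="* 10-15 * * 1-5")
```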
Problem: Handling real time streaming data?
Should I dump them into QuestDB as they come in, then query the DB for insights? That sounds like extra steps.
I think QuestDB can handle small chunks of updates if I subscribe to a second-level or tick-level data feed, but it doesn't feel optimal.
Solution: a Kafka-esque event streaming service
I didn't have a very positive onboarding experience with Kafka and I'm not a fan of Javaware, so I chose Redpanda (Self-Managed) this time, which is Kafka-compatible. So far I'm very happy with it.
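Since Redpanda speaks the Kafka protocol, any Kafka client works; here's a sketch with `kafka-python` (the topic name, payload, and the default localhost:9092 listener are assumptions):

```python
import json

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # Redpanda's Kafka listener
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# push each quote onto a topic as it arrives; downstream consumers decide what to do with it
producer.send("quotes", {"ticker": "SPX", "bid": 5650.2, "ask": 5650.8})
producer.flush()
```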
Problem: managing multiple local long-running processes?
I have a Makefile for starting my local API server, Prefect server and workers:
DEFAULT_PYTHON := ~/mamba/envs/fin/bin/python3
PREFECT_BIN := ~/mamba/envs/fin/bin/prefect
serve:
# ./app.py
$(DEFAULT_PYTHON) ./app.py
# other services
But it's not fun to start them manually, each in its own terminal tab (I'm not a huge fan of running tmux locally).
Solution: Convert them to launchd(8) jobs and let the OS handle it
So that they become daemons (or, in launchd terms, agents) and I don't have to see the mess in my terminal. stdout and stderr are sent to a log file that can be tailed or searched.
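As an illustration, here's a sketch that generates a user agent plist for the Prefect server (the label, binary path, and log paths are made up) and drops it in `~/Library/LaunchAgents/` for `launchctl` to load:

```python
import plistlib
from pathlib import Path

agent = {
    "Label": "com.example.prefect-server",
    "ProgramArguments": ["/Users/me/mamba/envs/fin/bin/prefect", "server", "start"],
    "RunAtLoad": True,
    "KeepAlive": True,                                        # restart if it dies
    "StandardOutPath": "/tmp/prefect-server.out.log",         # tail/grep these instead of a terminal tab
    "StandardErrorPath": "/tmp/prefect-server.err.log",
}

plist_path = Path.home() / "Library/LaunchAgents/com.example.prefect-server.plist"
plist_path.write_bytes(plistlib.dumps(agent))
print(f"wrote {plist_path}; load it with: launchctl load {plist_path}")
```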
Fun fact: you can replace crontab with launchd, if you don't mind writing an XML key for every possible minute a task should run.