Storing and analyzing equity tick data (trades and quotes) using HDF5 / h5py

This is a follow up to my previous two posts on building a django-based stock trade analysis tool (1, 2).

For this tool, I want to be able to analyze trades at extremely micro-level (on a tick-by-tick basis), so summary market data on a secondly or minutely basis wouldn’t be sufficient.  Now, equity trade data is pretty manageable (the most actively traded names usually have somewhere in the tens of thousands of trades per day), but for quote data, even just level-1 depth can amount to hundreds of thousands or even a million lines of data for a single stock in a single day.  So a traditional SQL database isn’t a practical solution for storing and querying the data.  There were three main solutions I considered:

  • Kx systems’ proprietary high performance tick database, kdb+.  kdb is the financial industry standard db for managing tick data.  Commercial licenses are extremely expensive, but it’s free for non-commercial use, so it was definitely an option.  Interacting with kdb requires heavy use of their proprietary querying language Q.  This is what our quants used at my prior firm.
  • HDF5 / h5py.  HDF5 is an open source binary data storage file format.  h5py is a python library/interface for HDF5.  HDF5 came recommended by @igorsyl.
  • Flat files.  The historical tick data is provided in zipped flat files, so it was also an option to just leave it in this format and unzip/process it as necessary.

Even though we will likely go with kdb long term, I decided to try out HDF5, mainly because I was already working in python (django) and figured it would be a more straightforward integration.  Here are the steps I took to get my raw tick data into HDF5 and perform some basic analyses/visualizations.

  1. Obtain tick data.  We went with TickData, and I’ve been pretty happy with that decision so far.  Here’s their file format for  historical U.S. equity market data.
  2. Download/install necessary Python modules (and Python itself).   The main libraries I’ve used so far are: h5py (storing/recalling HDF5 data), NumPy (h5py converts HDF5 data into NumPy arrays), Matplotlib (charting), PyTables/ViTables (ViTables is a UI for viewing/interacting with HDF5 files, based on PyTables/NumPy/h5py), PyQt4 (UI library).  Python(x,y) is a popular package that contains Python and all of these libraries, as well as a bunch more, so it’s probably a good starting point.
  3. Unzip tick data.  TickData provides data in a weird series of nested zip files/directories, so it’s kind of a pain to manipulate.  I wrote a little python script to traverse the tick data directory and recursively unzip all the files/directories and their contents (and another script to delete all the zip files afterward).  I tried combining these into a single method, but it had concurrency issues, so just ran one then the other.
import os, zipfile

def walkThroughAndUnzip(path):
    for subdir, dirs, files in os.walk(path):
        for file in files:
            fullname = os.sep.join([subdir, file])
            if zipfile.is_zipfile(fullname):
                zFile = zipfile.ZipFile(fullname)
                newpath = fullname[:-4]
                zFile.extractall(newpath)                
                walkThroughAndUnzip(newpath)

def deleteZipFiles(path):
    for subdir, dirs, files in os.walk(path):
        for file in files:
            fullname = os.sep.join([subdir, file])
            if zipfile.is_zipfile(fullname):
                os.remove(fullname)
  1. Define trade and quote schemas.  h5py stores data in HDF5 via NumPy array, so before we can populate the database, we have to decide on a format for storage for the raw data.  Since I don’t really care about all of the 15-20 fields that TickData provides, I narrowed the data down as following.  Note: I chose to store timestamps as an unsigned 32-bit int containing milliseconds since epoch.
    • Trades: Timestamp, Price, Quantity, Exchange, Condition Code(s)
    • Quotes: Timestamp, Exchange, Bid Size, Bid Price, Ask Price, Ask Size, Condition Code
import numpy as np
dataTypeT = np.dtype([('timestamp', 'u8'), ('price', 'f8'), 
    ('quantity', 'u4'), ('exchange', 'a1'), ('condition', 'a4')])
dataTypeQ = np.dtype([('timestamp', 'u8'), ('exchange', 'a1'), 
    ('bid_size', 'u4'), ('bid', 'f8'), ('ask', 'f8'), 
    ('ask_size', 'u4'), ('condition', 'a1')])
  1. Define HDF5 file hierarchy.  Next step is to figure out how you want to organize the data within HDF5.  I chose the following structure: T/Q (group) >> Symbol (group) >> Year (group) >> Month (group) >> Day (dataset).
  2. Parse raw tick data and insert into HDF5 file.  Finally, we have all the pieces to actually process the tick data file-by-file.  This was pretty straightforward– just traverse the directory of flat tick data files, identify whether it’s trade or quote data and where in the HDF5 hierarchy it belongs, create an empty NumPy table with the appropriate dtype and the same number of rows as the data file, and read/convert the lines one-by-one.
  3. Define book schema for aggregated quote data.  The raw quote data is provided as a series of individual exchange quotes.  In order to extract useful information from it, you need to consolidate the data into the NBBO.  To do this, I first defined a new NumPy datatype that could encompass all of the data on all exchanges at a given time in a single row.  Note that in addition to the NBBO, I decided to store the inside quote on all of the exchanges (TickData’s equity exchange codes are: A|B|C|D|E|I|J|K|M|N|O|P|S|Q|T|W|X|Y|Z, although some of these are now inactive).
book_dtype = np.dtype([('timestamp', 'u8'), ('best_bid_size', 'u4'), 
    ('best_bid', 'f8'), ('best_ask', 'f8'), ('best_ask_size', 'u4'),
    ('A_bid_size', 'u4'), ('A_bid', 'f8'), ('A_ask', 'f8'), ('A_ask_size', 'u4'),
    ('B_bid_size', 'u4'), ('B_bid', 'f8'), ('B_ask', 'f8'), ('B_ask_size', 'u4'),
    ('C_bid_size', 'u4'), ('C_bid', 'f8'), ('C_ask', 'f8'), ('C_ask_size', 'u4'),
    etc...
  1. Process raw quote data into consolidated book format.  I then added a third group (B for book) on top of T and Q in my HDF5 hierarchy and converted the raw quote data using the following helper class as I went line by line.  I aggregated bid/ask interest across all exchanges at the inside as opposed to only reflecting the volume at the largest individual exchange (as most market data providers do)– aggregated bid/ask size is far more useful as a data point as it provides a good baseline for the total amount of stock one could reasonably expect to trade at a price level, before factoring in hidden liquidity.
class Book:
    def __init__(self):
        self.exchangeQuotes = {}

    def getTOB(self):
        bbs = -1
        bb = -1
        ba = -1
        bas = -1
        for eq in self.exchangeQuotes.itervalues():
            if eq.bid > bb or bb == -1:
                bb = eq.bid
                bbs = eq.bid_size
            elif eq.bid == bb:
                bbs += eq.bid_size

            if eq.ask < ba or ba == -1:
                ba = eq.ask
                bas = eq.ask_size
            elif eq.ask == ba:
                bas += eq.ask_size
        return bbs, bb, ba, bas

    def updateQuote(self, exchange, quote):
        self.exchangeQuotes[exchange] = quote

    def getRowForDB(self, timestamp):
        new_row = np.zeros(shape=1, dtype=book_dtype)
        new_row['timestamp'] = timestamp
        new_row['best_bid_size'], new_row['best_bid'], new_row['best_ask'], 
            new_row['best_ask_size'] = self.getTOB()
        for ex, eq in self.exchangeQuotes.items():
            new_row[ex + '_bid_size'] = eq.bid_size
            new_row[ex + '_bid'] = eq.bid
            new_row[ex + '_ask'] = eq.ask
            new_row[ex + '_ask_size'] = eq.ask_size
        return new_row
  1. Analysis/charting.  Now that the data is in an easily consumable form, it’s pretty easy to run some basic metrics or create graphs.  NumPy has great functionality that makes it extremely easy to filter data.  I also used the matplotlib and PyQt4 graphics/UI libraries to make a little interactive app to query the HDF5 database and chart price and volume data.  I’ve found PyQt4 reasonably straightforward to use, and it feels pretty snappy compared to say Highcharts (in the web app), for example, but I feel like there are probably better UI options out there.  It worked fine for test purposes though.

Next step will be to integrate the historical tick data with the actual client trades and the rest of the django app and start exploring some performance metrics.  Sweet.

3 thoughts on “Storing and analyzing equity tick data (trades and quotes) using HDF5 / h5py

  1. I am eagerly awaiting the next entry in this series! I have been playing around with financial data in python, and want to turn an idea into a site. Since I’m not a software engineer I highly value your opinions/experiments with HDF5!!

    • Appreciate the kind words– unfortunately a couple days after this post, I got shifted off this project to work on trading strategy stuff (which I’m not really at liberty to discuss in depth). I’d be happy to answer any questions you’ve got as best I can. Feel free to reach out. Good luck!

  2. Pingback: Python을 이용한 마켓데이타 플랫폼 구축방법

Leave a comment