LobsteR - NASDAQ under a "tidy" Microscope
During my PhD studies, I have been working a lot with the high-frequency trading data provided by Lobster for some of my research projects.
In this short series of posts, I want to share some of my code and routines to efficiently handle the extremely large amounts of data that go through NASDAQ's servers on a daily basis. In fact, if you look at the figure below, there is plenty to explore: in less than 2 minutes on March 17th, 2020, thousands of trades were executed for SPY, a large ETF. The red line shows the traded prices during that period and the blue shaded areas show the dynamics of the orderbook. The darker the areas, the more liquidity (measured as the size of the orderbook levels).
First, I provide some snippets to read in Lobster files and to compute some potentially interesting statistics. In a second post, I illustrate long-run characteristics of the orderbook dynamics, and finally I'll focus on some very recent events: the days since the outbreak of COVID-19 have been extremely bumpy for SPY, the largest ETF in the world, and it is amazing to see how liquidity supply changed during these rough days.
Handling Lobster Data
Lobster is an online limit order book data tool that provides easy-to-use, high-quality limit order book data for the entire universe of NASDAQ-traded stocks. I requested some of the data through their online interface and stored it before running the code below. The actual dataset I will use in the next post is much larger: I downloaded all trading messages for the ticker SPY (order submissions, cancellations, trades, …) that went through NASDAQ from July 27th, 2007 until March 25th, 2020. The files contain the entire orderbook up to level 10.
First steps
I work in R with message-level data from Lobster in a tidy and (hopefully) efficient way.
library(tidyverse)
library(lubridate)
As an example, I illustrate the computations for a tiny glimpse of March 17th, 2020. Lobster files always come with the same naming convention ticker_date_34200000_57600000_filetype_level.csv, where filetype denotes either message or the corresponding orderbook snapshots.
asset <- "SPY"
date <- "2020-03-17"
level <- 10
messages_filename <- paste0(asset, "_", date, "_34200000_57600000_message_", level, ".csv")
orderbook_filename <- paste0(asset, "_", date, "_34200000_57600000_orderbook_", level, ".csv")
Let’s have a look at the raw message feed first.
messages_raw <- read_csv(messages_filename,
col_names = c("ts", "type", "order_id", "m_size", "m_price",
"direction", "null"),
col_types = cols(ts = col_double(),
type = col_integer(),
order_id = col_integer(),
m_size = col_double(),
m_price = col_double(),
direction = col_integer(),
null = col_skip())) %>%
mutate(ts = as.POSIXct(ts, origin = date, tz = "GMT"),
m_price = m_price / 10000)
messages_raw
## # A tibble: 20,000 x 6
## ts type order_id m_size m_price direction
## <dttm> <int> <int> <dbl> <dbl> <int>
## 1 2020-03-17 09:30:00 4 24654260 230 245 1
## 2 2020-03-17 09:30:00 3 24683304 500 245. -1
## 3 2020-03-17 09:30:00 3 24690848 500 245. 1
## 4 2020-03-17 09:30:00 1 24699256 500 245. -1
## 5 2020-03-17 09:30:00 3 24690812 500 245. -1
## 6 2020-03-17 09:30:00 3 24699256 500 245. -1
## 7 2020-03-17 09:30:00 1 24699992 500 245. 1
## 8 2020-03-17 09:30:00 1 24700384 500 245. 1
## 9 2020-03-17 09:30:00 3 24700384 500 245. 1
## 10 2020-03-17 09:30:00 1 24700516 500 245. 1
## # ... with 19,990 more rows
By default, ts denotes the time in seconds since midnight, with decimals precise to the nanosecond, and m_price is reported as the dollar price multiplied by 10,000 (hence the division by 10000 above). One caveat: converting ts to POSIXct is convenient for plotting, but POSIXct stores time as a double and therefore cannot retain the full nanosecond precision; for the second-level aggregation in this post that is no problem. type denotes the message type: 4, for instance, corresponds to the execution of a visible order. The remaining variables are explained in more detail here.
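To get a feel for the composition of the feed, it helps to tabulate the message types in this sample (per the Lobster documentation, types 1, 2, and 3 are order submissions, partial cancellations, and deletions, while 4 and 5 are executions of visible and hidden orders):
# tabulate message types: 1 submission, 2 partial cancellation,
# 3 deletion, 4 visible execution, 5 hidden execution
messages_raw %>% count(type, sort = TRUE)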
Next, the corresponding orderbook snapshots contain price and quoted size for each of the 10 levels.
orderbook_raw <- read_csv(orderbook_filename,
col_names = paste(rep(c("ask_price", "ask_size", "bid_price", "bid_size"), level),
rep(1:level, each = 4), sep = "_"),
col_types = cols(.default = col_double())) %>%
mutate_at(vars(contains("price")), ~ . / 10000)
Putting the files together
Each message is associated with the corresponding orderbook snapshot at that point in time. After merging the message and orderbook files, the entire data thus looks as follows:
orderbook <- bind_cols(messages_raw, orderbook_raw)
ts | type | order_id | m_size | m_price | ask_price_1 | ask_size_1 | bid_price_1 | bid_size_1 |
---|---|---|---|---|---|---|---|---|
2020-03-17 09:30:00 | 4 | 24654260 | 230 | 245.00 | 245.10 | 500 | 244.88 | 1000 |
2020-03-17 09:30:00 | 3 | 24683304 | 500 | 245.14 | 245.10 | 500 | 244.88 | 1000 |
2020-03-17 09:30:00 | 3 | 24690848 | 500 | 244.88 | 245.10 | 500 | 244.88 | 500 |
2020-03-17 09:30:00 | 1 | 24699256 | 500 | 245.03 | 245.03 | 500 | 244.88 | 500 |
2020-03-17 09:30:00 | 3 | 24690812 | 500 | 245.10 | 245.03 | 500 | 244.88 | 500 |
2020-03-17 09:30:00 | 3 | 24699256 | 500 | 245.03 | 245.11 | 500 | 244.88 | 500 |
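One caveat regarding the merge: bind_cols matches rows purely by position, so it relies on the message and orderbook files being perfectly aligned with one snapshot per message. A minimal sanity check along these lines:
# both files must contain exactly one orderbook row per message; abort otherwise
stopifnot(nrow(messages_raw) == nrow(orderbook_raw))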
Compute summary statistics
Next, I compute summary statistics at the 20-second level. In particular, I am interested in quoted prices, spreads, and depth (the number of tradeable units in the orderbook):
- Midquote \(q_t = (a_t + b_t)/2\) (where \(a_t\) and \(b_t\) denote the best ask and best bid)
- Spread \(S_t = a_t - b_t\) (values below are computed in basis points relative to the concurrent midquote)
- Volume is the aggregate sum of traded units of the stock. I differentiate between hidden (type == 5) and visible volume.
orderbook <- orderbook %>% mutate(midquote = ask_price_1/2 + bid_price_1/2,
spread = (ask_price_1 - bid_price_1) / midquote * 10000,
volume = if_else(type == 4 | type == 5, m_size, 0),
hidden_volume = if_else(type == 5, m_size, 0))
As a last step, depth of the orderbook denotes the number of units that can be traded without moving the quoted price by more than a given range (measured in basis points) away from the current best quote on the respective side. The function below takes care of the slightly involved computations.
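In formula terms, what the function evaluates row by row for the bid side is the depth within \(\delta\) basis points of the best bid, \(D_t^{bid}(\delta) = \sum_{l=1}^{10} s_t^{bid,l} \, \mathbb{1}\{b_t^l \geq (1 - \delta/10000)\, b_t^1\}\), where \(b_t^l\) and \(s_t^{bid,l}\) denote the price and size quoted at bid level \(l\); the ask side works analogously with the condition \(a_t^l \leq (1 + \delta/10000)\, a_t^1\).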
compute_depth <- function(df, side = "bid", bp = 0){
if(side == "bid"){
# price threshold: best bid shifted down by bp basis points
value_bid <- (1 - bp/10000) * df %>% select("bid_price_1")
# flag all bid levels whose price lies within the threshold
index_bid <- df %>% select(contains("bid_price")) %>%
mutate_all(function(x) {x >= value_bid})
# add up the quoted sizes of the qualifying levels, row by row
sum_vector <- (df %>% select(contains("bid_size")) * index_bid) %>% rowSums()
} else {
# price threshold: best ask shifted up by bp basis points
value_ask <- (1 + bp/10000) * df %>% select("ask_price_1")
# flag all ask levels whose price lies within the threshold
index_ask <- df %>% select(contains("ask_price")) %>%
mutate_all(function(x) {x <= value_ask})
# add up the quoted sizes of the qualifying levels, row by row
sum_vector <- (df %>% select(contains("ask_size")) * index_ask) %>% rowSums()
}
return(sum_vector)
}
orderbook <- orderbook %>% mutate(depth_bid = compute_depth(orderbook),
depth_ask = compute_depth(orderbook, side = "ask"),
depth_bid_5 = compute_depth(orderbook, bp = 5),
depth_ask_5 = compute_depth(orderbook, bp = 5, side = "ask"))
Almost there! The snippet below splits the data into 20-second intervals and computes the averages of the summary statistics within each interval.
orderbook_dense <- orderbook %>%
mutate(ts_minute = floor_date(ts, "20 seconds")) %>%
select(midquote:ts_minute) %>%
group_by(ts_minute) %>%
mutate(messages = n(),
volume = sum(volume),
hidden_volume = sum(hidden_volume)) %>%
summarise_all(mean)
Here we go: during the first 100 seconds of continuous trading on March 17th, 20,000 messages related to the orderbook of SPY were processed by NASDAQ. The quoted spread fluctuated between roughly 1.5 and 4 basis points. On average, more than 70,000 shares were traded during each 20-second slot; in other words, assets worth roughly 90 million USD changed hands within this very short period of time. Quoted liquidity at the best bid and best ask seems rather small relative to the tremendous amount of trading activity.
ts_minute | midquote | spread | volume | hidden_volume | depth_bid | depth_ask | depth_bid_5 | depth_ask_5 | messages |
---|---|---|---|---|---|---|---|---|---|
2020-03-17 09:30:00 | 245.0332 | 4.010257 | 89606 | 19923 | 353.7358 | 354.1362 | 1854.152 | 2516.916 | 5890 |
2020-03-17 09:30:20 | 245.2229 | 3.142070 | 54733 | 23716 | 190.3232 | 238.8164 | 2099.857 | 2041.646 | 3165 |
2020-03-17 09:30:40 | 245.5052 | 2.177630 | 53273 | 18188 | 121.9574 | 182.5553 | 2113.945 | 2282.149 | 4246 |
2020-03-17 09:31:00 | 245.2010 | 1.488751 | 146974 | 86780 | 297.4000 | 254.3316 | 1985.406 | 2416.603 | 4210 |
2020-03-17 09:31:20 | 244.6590 | 1.514445 | 26286 | 6655 | 122.6870 | 115.6107 | 2174.080 | 2325.517 | 2489 |
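As a quick back-of-the-envelope check of the 90 million USD figure, one can value the traded shares at the average midquote of each interval (an approximation, since actual transaction prices deviate slightly from the midquote):
# approximate dollar volume: traded shares valued at the interval midquote
orderbook_dense %>%
summarise(total_shares = sum(volume),
dollar_volume = sum(volume * midquote))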
Finally, some visualisation of the data at hand: the code below creates the figure at the beginning of the post and shows the dynamics of the traded prices (red line) and the quoted prices and sizes across the 10 levels of the orderbook.
orderbook_trades <- orderbook %>%
filter(type == 4 | type == 5) %>%
select(ts, m_price)
orderbook_quotes <- orderbook %>%
mutate(id = row_number()) %>%
# match only the level-wise quote columns, not the depth_* variables
select(ts, id, matches("(ask|bid)_(price|size)")) %>%
gather(level, price, -ts, -id) %>%
separate(level, into = c("side", "variable", "level"), sep = "_") %>%
mutate(level = as.numeric(level)) %>%
spread(variable, price)
p1 <- ggplot() +
theme_bw() +
geom_point(data = orderbook_quotes,
aes(x = ts, y = price, color = level, size = size / max(size)),
alpha = 0.1) +
geom_line(data = orderbook_trades, aes(x = ts, y = m_price), color = "red") +
labs(title = "SPY: Orderbook Dynamics",
y = "Price",
x = "") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
legend.position = "none")
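To render the figure, print the plot object; ggsave can additionally write it to disk (the file name below is just a placeholder):
p1
# optionally store the figure under a (hypothetical) file name
# ggsave("spy_orderbook_2020-03-17.png", p1, width = 9, height = 5)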