Fundamentals of Streaming Data
Michael Kane, PhD
I have an MA and PhD in Statistics but I started in computing
I'm interested in
I'm in a biostats department but I don't know much biology
I do know a bit about finance though
I also know a bit about signal and image processing
I'm a member of the R community
I'm the author of the Bigmemory Project
John Tukey is my mathematical great-grandfather (my advisor's advisor's advisor)
This talk is an attempt to formalize an ongoing conversation with Simon Urbanek
This talk is about concepts and designs for streaming data infrastructure and analysis
I want it to be as interactive as possible. Please ask questions as they come up!
Motivating example: The Gulf oil spill
The streaming data pipeline
Application of streaming data: Backtesting FINRA's circuit breakers
The pipeline as a general approach to scalable, big-data computing
Feedback
Application of feedback: A liquidity market for stock price stability
Conclusions/Questions
Formalize streaming data concepts in the general setting
Show how streaming data is intimately connected with distributed/parallel computing
Provide new directions for streaming data research
An ordered sequence of continually arriving points (Forest, 2011)
Some traditional characteristics of streaming challenges:
"Transocean Rig Drilling for BP in US Gulf Hit by Explosion" -- Dow Jones Newswire April 21, 2010
Get a cultural historian to create a lexicon of words that indicate disasters in the oil industry
Look at the intersection of the lexicon words and words in the news article
Use Bayesian change-point analysis to detect when there is a shift from "normal" news to "disaster" news
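A minimal sketch of this step in R, assuming per-bucket counts of lexicon words have already been computed from the news feed; the bcp package is used here as one implementation of Bayesian change-point analysis, purely for illustration.

    # Sketch: detect a shift in the rate of "disaster" lexicon words.
    # Assumes `word_counts` holds lexicon-word hits per time bucket.
    library(bcp)

    set.seed(1)
    # Simulated counts: "normal" news followed by a burst of disaster words.
    word_counts <- c(rpois(100, lambda = 2), rpois(20, lambda = 15))

    fit <- bcp(word_counts)

    # Posterior probability of a change point at each position.
    which(fit$posterior.prob > 0.5)
    plot(fit)   # posterior means and change-point probabilities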
Real-time news articles constitute a sequence of data related to the state of businesses
News loses its value over time
New information needs to be acted upon quickly
For certain types of event risk, a bag-of-words model and Bayesian change-point analysis can be used to quickly detect events that increase risk
Event detection can be used to hedge or take a position and make money in the face of calamities
Event detection works well for the oil sector, but not necessarily for other sectors
Two distinct challenges
Turning this from an offline to an online analysis poses challenges
Operationalizing the online analysis
Hadoop is good for embarrassingly parallel problems with data independence
It is not good at moving computations to data
It is bad at sequential algorithms (see the sketch below)
It only supports the MapReduce idiom
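To see why sequential algorithms fit MapReduce poorly, here is a minimal sketch (in R, for illustration only) of a one-pass running mean: every update depends on the state left by the previous point, so the points cannot be handed to independent map tasks.

    # One-pass (online) running mean: state is carried from point to point.
    update_mean <- function(state, x) {
      n <- state$n + 1
      list(n = n, mean = state$mean + (x - state$mean) / n)
    }

    state <- list(n = 0, mean = 0)
    for (x in rnorm(1000)) {
      state <- update_mean(state, x)
    }
    state$mean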
We'd really like to be able to define our own idioms supporting
A source is a pipeline element that only produces data
Provides the input to the pipeline
Examples:
A filter is a pipeline element that receives data either from sources or other filters and produces new data
Provides intermediate data processing and analysis in the pipeline
Examples:
A sink is a pipeline element that receives data either from sources or filters and does not produce output
Provides an endpoint for a pipeline
Examples:
A channel is a communication mechanism allowing messages to be sent within the pipeline
For some problems it is useful to allow a set of replicated filters to share a channel. This is sometimes called a farm.
For some problems it would be nice for a shared pipeline to create a stream that is available to client pipelines on a multicast connection mechanism.
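A minimal sketch of these pipeline elements in R, written as closures; the names below are illustrative only and are not an existing package API. A channel or farm would sit between the source and a set of replicated filters; that is omitted here for brevity.

    # Source: only produces data (here, a random-walk "price" feed).
    make_source <- function(n = 10) {
      i <- 0
      price <- 100
      function() {
        if (i >= n) return(NULL)     # NULL signals end of stream
        i <<- i + 1
        price <<- price + rnorm(1)
        price
      }
    }

    # Filter: receives data and produces new data (here, a running mean).
    make_filter <- function() {
      n <- 0; m <- 0
      function(x) {
        n <<- n + 1
        m <<- m + (x - m) / n
        m
      }
    }

    # Sink: receives data and produces no further output (here, it prints).
    sink_print <- function(x) cat("running mean:", x, "\n")

    # Drive the pipeline: source -> filter -> sink.
    src <- make_source(5)
    flt <- make_filter()
    repeat {
      x <- src()
      if (is.null(x)) break
      sink_print(flt(x))
    }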
On May 6, 2010, at about 2:45 PM, the stock market lost about one trillion dollars in value
Stocks rebounded to within a few percent of their pre-plunge prices
This event became known as the Flash Crash
FINRA put in place a set of rules that dictate how quickly stock prices can move
We back-tested how these rules would have performed: are they substantive or symbolic?
Apply FINRA's rules to TAQ data (consolidated trades) for three years, 2007-2010 (24 billion trades)
Decide if the circuit breaker rules are needlessly coercive, triggering too often
Determine if the rules are effective for controlling volatility during systemic events like the Flash Crash
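A sketch of the kind of filter such a backtest needs, under a simplified version of the single-stock rule (assumed here: flag a pause when the price moves 10% or more from any price seen in the preceding five minutes); the actual FINRA rules have additional thresholds and exemptions not modeled here.

    # Simplified circuit-breaker filter over a stream of trades for one stock.
    # `trades`: data frame with columns `time` (POSIXct) and `price`,
    # ordered by time, as consolidated (TAQ-like) trades would be.
    circuit_breaker <- function(trades, pct = 0.10, window = 300) {
      halts <- logical(nrow(trades))
      for (i in seq_len(nrow(trades))) {
        in_window <- trades$time >= trades$time[i] - window &
                     trades$time <  trades$time[i]
        if (any(in_window)) {
          ref <- trades$price[in_window]
          halts[i] <- any(abs(trades$price[i] - ref) / ref >= pct)
        }
      }
      trades$time[halts]   # times at which the breaker would have triggered
    }

    # Example: a synthetic minute-by-minute price path with a sudden drop.
    trades <- data.frame(
      time  = as.POSIXct("2010-05-06 14:00:00", tz = "UTC") + 60 * (0:59),
      price = c(rep(40, 40), rep(35, 20))   # a 12.5% drop at minute 41
    )
    circuit_breaker(trades)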
The circuit breaker rules do mitigate volatility for individual stocks
They would not have stopped a systemic event like the Flash Crash
The circuit breaker rules trigger daily for tens of stocks in normal market conditions
Our work was featured in Barron's (Alpert and Stryjewski, 2011)
We can describe a much richer set of parallel/distributed/cloud computing idioms
We have a graphical approach to describing these sophisticated idioms
Online vs. offline is a source distinction not a pipeline distinction
We wanted to propose a market solution that addresses the liquidity problem in systemic events
Aldinucci, M., Danelutto, M., Kilpatrick, P., and Torquati, M. FastFlow: High-Level and Efficient Streaming on Multi-core. In: Programming Multi-core and Many-core Computing Systems, Parallel and Distributed Computing, chapter 13. Wiley, 2013.
Alpert, W. and Stryjewski, L. Hitting the Switch on New Circuit Breaker. Barron's, August 31, 2011.
Forest, J. Stream: A Framework for Data Stream Modeling in R. Undergraduate Thesis, 2011.
The GStreamer Team. GStreamer: The Open Source Multimedia Framework. http://gstreamer.freedesktop.org, August 2012.