Fundamentals of Streaming Data

Michael Kane, PhD

A little background information about me

A little more background information

About this talk

Overview of the talk

Goals for the talk

What is streaming data?

An ordered sequence of continually arriving points (Forest, 2011)

Some traditional characteristics of streaming challenges:

  1. A finite window of data is available
  2. Data often lose value over time
  3. New data must be processed in a timely manner

A motivating example: The gulf oil disaster

"Transocean Rig Drilling for BP in US Gulf Hit by Explosion" -- Dow Jones Newswire April 21, 2010

What happened to the stock?

What did we do?

Get a cultural historian to create a lexicon of words that indicate disasters in the oil industry

Look at the intersection of the lexicon words and words in the news article

Use Bayesian change-point analysis to detect when there is a shift from "normal" news to "disaster" news
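The intersection step above can be sketched as a simple set operation. This is a minimal illustration, not the study's code: the lexicon below is hypothetical (the actual lexicon was built by a cultural historian), and the scoring function simply measures what fraction of an article's distinct words appear in the lexicon.

```python
import re

# A hypothetical disaster lexicon for illustration only; the study's
# actual lexicon was curated by a cultural historian.
LEXICON = {"explosion", "blowout", "spill", "rupture", "leak"}

def lexicon_score(article_text):
    """Fraction of the article's distinct words that appear in the lexicon."""
    words = set(re.findall(r"[a-z']+", article_text.lower()))
    if not words:
        return 0.0
    return len(words & LEXICON) / len(words)

headline = "Transocean rig drilling for BP in US Gulf hit by explosion"
score = lexicon_score(headline)  # 1 of 11 distinct words is in the lexicon
```

A stream of such scores, one per incoming article, is the input to the change-point step.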

How is this a streaming challenge?

Real-time news articles constitute a sequence of data related to the state of businesses

News loses its value over time

New information needs to be acted upon quickly

The lexicon-article intersections

The Bayesian change-point analysis
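As a sketch of the change-point idea, here is a single change-point model for a binary signal (1 = the day's news scored above a lexicon threshold) with Beta(1,1) priors on the before/after rates and a uniform prior over the change location. The study's actual analysis may have used a richer model; this only illustrates the detection mechanism.

```python
from math import lgamma, exp

def seg_logml(x):
    """Log marginal likelihood of a Bernoulli segment under a Beta(1,1) prior."""
    s, m = sum(x), len(x)
    return lgamma(s + 1) + lgamma(m - s + 1) - lgamma(m + 2)

def changepoint_posterior(x):
    """Posterior over the location of a single change-point (uniform prior)."""
    logp = [seg_logml(x[:k]) + seg_logml(x[k:]) for k in range(1, len(x))]
    mx = max(logp)
    p = [exp(v - mx) for v in logp]
    z = sum(p)
    return [v / z for v in p]  # entry k-1 is P(change between x[k-1] and x[k])

# Hypothetical daily indicators: quiet news, then a run of "disaster" days.
signal = [0] * 10 + [1, 1, 0, 1, 1, 1, 1, 0, 1, 1]
post = changepoint_posterior(signal)
best = post.index(max(post)) + 1  # most probable change location
```

The posterior mass concentrating at one location is the "shift from normal news to disaster news" signal.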

Conclusions from the study

  1. For certain types of event risk, a bag-of-words lexicon and Bayesian change-point analysis can be used to quickly detect increasing risk

  2. Event detection can be used to hedge or take a position and make money in the face of calamities

  3. The approach works well for the oil sector, not necessarily for others

Going from offline to online

Two distinct challenges

  1. Turning this from an offline to an online analysis poses challenges

  2. Operationalizing the online analysis

Why don't we just have a person do this?

That's a lot of data. Can we exploit existing distributed computing platforms like Hadoop to handle these types of challenges?

The streaming data pipeline

Element: Source

A source is a pipeline element that only produces data

Element: Filter

A filter is a pipeline element that receives data either from sources or other filters and produces new data

Element: Sink

A sink is a pipeline element that receives data either from sources or filters and does not produce output

Channel

A channel is a communication mechanism allowing messages to be sent within the pipeline
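The three element types and their channels can be sketched with threads and thread-safe queues. This is a minimal toy pipeline, not any particular framework's API: a source produces integers, a filter squares them, a sink collects them, and a `None` sentinel marks end-of-stream on each channel.

```python
import queue
import threading

SENTINEL = None  # marks end-of-stream on a channel

def source(out_ch):
    """Source: only produces data."""
    for x in range(5):
        out_ch.put(x)
    out_ch.put(SENTINEL)

def filt(in_ch, out_ch):
    """Filter: receives data from an upstream channel and produces new data."""
    while (x := in_ch.get()) is not SENTINEL:
        out_ch.put(x * x)
    out_ch.put(SENTINEL)

def sink(in_ch, results):
    """Sink: receives data and produces no output."""
    while (x := in_ch.get()) is not SENTINEL:
        results.append(x)

ch1, ch2, results = queue.Queue(), queue.Queue(), []
threads = [threading.Thread(target=source, args=(ch1,)),
           threading.Thread(target=filt, args=(ch1, ch2)),
           threading.Thread(target=sink, args=(ch2, results))]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each element runs concurrently and communicates only through its channels, which is what makes pipelines compose.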

A note on channels and filters

For some problems it is useful to allow a set of replicated filters to share a channel. This is sometimes called a farm.
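A farm can be sketched the same way: several replicated filter threads pull from one shared input channel. This is an illustrative toy, not a framework API; note that a farm does not preserve arrival order, so downstream consumers must sort or re-sequence if order matters.

```python
import queue
import threading

in_ch, out_ch = queue.Queue(), queue.Queue()
N_WORKERS = 3

def worker():
    """One of several replicated filters sharing the input channel."""
    while (x := in_ch.get()) is not None:
        out_ch.put(x * 10)
    in_ch.put(None)  # re-propagate the sentinel so sibling workers also stop

workers = [threading.Thread(target=worker) for _ in range(N_WORKERS)]
for t in workers:
    t.start()
for x in range(6):
    in_ch.put(x)
in_ch.put(None)
for t in workers:
    t.join()
results = sorted(out_ch.get() for _ in range(6))  # farm output is unordered
```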

Dynamic pipelines

For some problems it is useful for a shared pipeline to create a stream that client pipelines can subscribe to over a multicast connection mechanism.

Application: Backtesting FINRA's circuit breaker rules

On May 6, 2010, at about 2:45 PM, the stock market lost about one trillion dollars

Stocks rebounded to within a few percent of their pre-plunge prices

This event became known as the flash crash

FINRA put in place a set of rules that dictate how quickly stock prices can move

We back-tested how these rules would have performed: are they substantive or symbolic?

The challenge

Apply FINRA's rules to three years of TAQ data (consolidated trades), 2007-2010 (24 billion trades)

Decide if the circuit breaker rules are needlessly coercive, triggering too often

Determine if the rules are effective for controlling volatility during systemic events like the Flash Crash
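The single-stock circuit breakers adopted after the Flash Crash paused trading when a price moved roughly 10% within a five-minute window. The check below is a simplified sketch of that rule, not FINRA's exact specification: it assumes trades arrive as (timestamp-in-seconds, price) pairs and flags any trade that deviates at least `pct` from a price seen in the preceding window.

```python
from collections import deque

def circuit_breaker(trades, pct=0.10, window=300):
    """Yield timestamps of trades that deviate >= pct from any price
    observed in the preceding `window` seconds (simplified band check)."""
    recent = deque()  # (time, price) pairs still inside the window
    for t, p in trades:
        while recent and t - recent[0][0] > window:
            recent.popleft()
        if recent:
            lo = min(price for _, price in recent)
            hi = max(price for _, price in recent)
            if p <= hi * (1 - pct) or p >= lo * (1 + pct):
                yield t
        recent.append((t, p))

# A stock trading near 40 that suddenly drops to 35 inside the window.
trades = [(0, 40.0), (60, 40.2), (120, 39.9), (180, 35.0)]
halts = list(circuit_breaker(trades))
```

The linear min/max scan keeps the sketch short; a production back-test over 24 billion trades would use monotonic deques or precomputed windows.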

The pipeline

The results

The circuit breaker rules do mitigate volatility for individual stocks

They would not have stopped a systemic event like the Flash Crash

The circuit breaker rules trigger daily for dozens of stocks under normal market conditions

Our work was featured in Barron's (Alpert and Stryjewski, 2011)

Can we describe classic parallel computing idioms with pipelines?

MapReduce
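MapReduce fits the pipeline vocabulary directly: map filters emit key-value pairs, the shuffle acts as a channel that groups values by key, and reduce filters aggregate each group before a sink collects the results. A compact functional sketch (no Hadoop; the function names are illustrative):

```python
from collections import defaultdict
from itertools import chain

def map_reduce(inputs, mapper, reducer):
    """MapReduce as a pipeline: map filters -> shuffle channel -> reduce filters."""
    # Map stage: each input record becomes zero or more (key, value) pairs.
    mapped = chain.from_iterable(mapper(x) for x in inputs)
    # Shuffle: the channel groups values by key.
    groups = defaultdict(list)
    for k, v in mapped:
        groups[k].append(v)
    # Reduce stage: one reducer call per key.
    return {k: reducer(k, vs) for k, vs in groups.items()}

lines = ["bp rig explosion", "bp stock falls", "rig fire"]
counts = map_reduce(
    lines,
    mapper=lambda line: [(w, 1) for w in line.split()],
    reducer=lambda key, vals: sum(vals),
)
```

The same three-stage shape, with different filters on the channels, yields the other idioms discussed here.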

Multimedia

The advantages of this approach

We can describe a much richer set of parallel/distributed/cloud computing idioms

We have a graphical approach to describing these sophisticated idioms

Online vs. offline is a source distinction not a pipeline distinction

Adding feedback

Application: Addressing liquidity in markets

We wanted to propose a market solution that addresses the liquidity problem in systemic events

A stock/liquidity product

Liquidity market feedback

Wrapping up

References

Aldinucci M., Danelutto M., Kilpatrick P., and Torquati M. FastFlow: high-level and efficient streaming on multi-core, in: Programming Multi-core and Many-core Computing Systems, Parallel and Distributed Computing, chapter 13. Wiley, 2013.

Alpert, W. and Stryjewski, L. Hitting the Switch on New Circuit Breaker. Barron's. August 31, 2011.

Forest, J. Stream: A Framework for Data Stream Modeling in R, Undergraduate Thesis, 2011.

The GStreamer Team. GStreamer: The Open Source Multimedia Framework. http://gstreamer.freedesktop.org. August 2012.