Fundamentals of Streaming Data
Michael Kane, PhD
I have an MA and PhD in Statistics but I started in computing
I'm interested in
I'm in a biostats department but I don't know much biology
I do know a bit about finance though
I also know a bit about signal and image processing
I'm a member of the R community
I'm the author of the Bigmemory Project
John Tukey is my mathematical great-grandfather (my advisor's advisor's advisor)
This talk is an attempt to formalize an ongoing conversation with Simon Urbanek
This talk is about concepts and designs for streaming data infrastructure and analysis
I want it to be as interactive as possible. Please ask questions as they come up!
Motivating example: The Gulf oil spill
The streaming data pipeline
Application of streaming data: Backtesting FINRA's circuit breakers
The pipeline as a general approach to scalable, big-data computing
Feedback
Application of feedback: A liquidity market for stock price stability
Conclusions/Questions
Formalize streaming data concepts in the general setting
Show how streaming data is intimately connected with distributed/parallel computing
Provide new directions for streaming data research
An ordered sequence of continually arriving points (Forest, 2011)
Some traditional characteristics of streaming challenges:
"Transocean Rig Drilling for BP in US Gulf Hit by Explosion" -- Dow Jones Newswire April 21, 2010
Get a cultural historian to create a lexicon of words that indicate disasters in the oil industry
Look at the intersection of the lexicon words and words in the news article
Use Bayesian change-point analysis to detect when there is a shift from "normal" news to "disaster" news
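A minimal sketch of this step in R, assuming per-bucket counts of lexicon words have already been computed from the news feed; the bcp package is used here as one implementation of Bayesian change-point analysis, purely for illustration.

    # Sketch: detect a shift in the rate of "disaster" lexicon words.
    # Assumes `word_counts` holds lexicon-word hits per time bucket.
    library(bcp)

    set.seed(1)
    # Simulated counts: "normal" news followed by a burst of disaster words.
    word_counts <- c(rpois(100, lambda = 2), rpois(20, lambda = 15))

    fit <- bcp(word_counts)

    # Posterior probability of a change point at each position.
    which(fit$posterior.prob > 0.5)
    plot(fit)   # posterior means and change-point probabilities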
Real-time news articles constitute a sequence of data related to the state of businesses
News loses its value over time
New information needs to be acted upon quickly
For certain types of event risk, a bag-of-words model and Bayesian change-point analysis can be used to quickly detect events that increase risk
Event detection can be used to hedge or take a position and make money in the face of calamities
Event detection works well for the oil sector, but not necessarily for other sectors
Two distinct challenges
Turning this from an offline to an online analysis poses challenges
Operationalizing the online analysis
Hadoop is good for embarrassingly parallel problems with data independence
It is not good at moving computations to data
It is bad at sequential algorithms (see the sketch below)
It only supports the MapReduce idiom
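To see why sequential algorithms fit MapReduce poorly, here is a minimal sketch (in R, for illustration only) of a one-pass running mean: every update depends on the state left by the previous point, so the points cannot be handed to independent map tasks.

    # One-pass (online) running mean: state is carried from point to point.
    update_mean <- function(state, x) {
      n <- state$n + 1
      list(n = n, mean = state$mean + (x - state$mean) / n)
    }

    state <- list(n = 0, mean = 0)
    for (x in rnorm(1000)) {
      state <- update_mean(state, x)
    }
    state$mean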
We'd really like to be able to define our own idioms supporting
A source is a pipeline element that only produces data
Provides the input to the pipeline
Examples:
A filter is a pipeline element that receives data either from sources or other filters and produces new data
Provides intermediate data processing and analysis in the pipeline
Examples:
A sink is a pipeline element that receives data either from sources or filters and does not produce output
Provides an endpoint for a pipeline
Examples:
A channel is a communication mechanism allowing messages to be sent within the pipeline
For some problems it is useful to allow a set of replicated filters to share a channel. This is sometimes called a farm.
For some problems it would be nice for a shared pipeline to create a stream that is available to client pipelines on a multicast connection mechanism.
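A minimal sketch of these pipeline elements in R, written as closures; the names below are illustrative only and are not an existing package API. A channel or farm would sit between the source and a set of replicated filters; that is omitted here for brevity.

    # Source: only produces data (here, a random-walk "price" feed).
    make_source <- function(n = 10) {
      i <- 0
      price <- 100
      function() {
        if (i >= n) return(NULL)     # NULL signals end of stream
        i <<- i + 1
        price <<- price + rnorm(1)
        price
      }
    }

    # Filter: receives data and produces new data (here, a running mean).
    make_filter <- function() {
      n <- 0; m <- 0
      function(x) {
        n <<- n + 1
        m <<- m + (x - m) / n
        m
      }
    }

    # Sink: receives data and produces no further output (here, it prints).
    sink_print <- function(x) cat("running mean:", x, "\n")

    # Drive the pipeline: source -> filter -> sink.
    src <- make_source(5)
    flt <- make_filter()
    repeat {
      x <- src()
      if (is.null(x)) break
      sink_print(flt(x))
    }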
On May 6, 2010, at about 2:45 PM, the stock market lost about one trillion dollars in value
Stocks rebounded to within a few percent of their pre-plunge prices
This event became known as the Flash Crash
FINRA put in place a set of rules that dictate how quickly stock prices can move
We back-tested how these rules would have performed: are they substantive or symbolic?
Apply FINRA's rules to TAQ data (consolidated trades) for three years, 2007-2010 (24 billion trades)
Decide if the circuit breaker rules are needlessly coercive, triggering too often
Determine if the rules are effective for controlling volatility during systemic events like the Flash Crash
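A sketch of the kind of filter such a backtest needs, under a simplified version of the single-stock rule (assumed here: flag a pause when the price moves 10% or more from any price seen in the preceding five minutes); the actual FINRA rules have additional thresholds and exemptions not modeled here.

    # Simplified circuit-breaker filter over a stream of trades for one stock.
    # `trades`: data frame with columns `time` (POSIXct) and `price`,
    # ordered by time, as consolidated (TAQ-like) trades would be.
    circuit_breaker <- function(trades, pct = 0.10, window = 300) {
      halts <- logical(nrow(trades))
      for (i in seq_len(nrow(trades))) {
        in_window <- trades$time >= trades$time[i] - window &
                     trades$time <  trades$time[i]
        if (any(in_window)) {
          ref <- trades$price[in_window]
          halts[i] <- any(abs(trades$price[i] - ref) / ref >= pct)
        }
      }
      trades$time[halts]   # times at which the breaker would have triggered
    }

    # Example: a synthetic minute-by-minute price path with a sudden drop.
    trades <- data.frame(
      time  = as.POSIXct("2010-05-06 14:00:00", tz = "UTC") + 60 * (0:59),
      price = c(rep(40, 40), rep(35, 20))   # a 12.5% drop at minute 41
    )
    circuit_breaker(trades)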
The circuit breaker rules do mitigate volatility for individual stocks
They would not have stopped a systemic event like the Flash Crash
The circuit breaker rules trigger daily for tens of stocks in normal market conditions
Our work was featured in Barron's (Alpert and Stryjewski, 2011)
We can describe a much richer set of parallel/distributed/cloud computing idioms
We have a graphical approach to describing these sophisticated idioms
Online vs. offline is a source distinction not a pipeline distinction
We wanted to propose a market solution that addresses the liquidity problem in systemic events
Aldinucci, M., Danelutto, M., Kilpatrick, P., and Torquati, M. FastFlow: High-Level and Efficient Streaming on Multi-core. In: Programming Multi-core and Many-core Computing Systems, Parallel and Distributed Computing, chapter 13. Wiley, 2013.
Alpert, W. and Stryjewski, L. Hitting the Switch on New Circuit Breaker. Barron's, August 31, 2011.
Forest, J. Stream: A Framework for Data Stream Modeling in R. Undergraduate Thesis, 2011.
The GStreamer Team. GStreamer: The Open Source Multimedia Framework. http://gstreamer.freedesktop.org, August 2012.