Advances in out-of-core computing

Introduction to flexmem


Use of the term "Big Data" according to Google

[Plot trendChart1: Google Trends interest in ``Big Data'']

Are people still going ``to the cloud?''

[Plot trendChart2: Google Trends interest in ``the cloud'']

Data's value comes from its information

[Plot trendChart3]

Avoiding ``mine are bigger than yours''

Since size is relative to technology, define ``big'' in terms of technology

A data set is large if it exceeds 20% of the available RAM for a single machine

A data set is massive if it exceeds 50% of available RAM on a single machine

  • Can't afford even a single copy without swapping
  • Need an alternative approach


``The data may not contain the answer. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.'' -John W. Tukey

Why is analyzing/processing large data sets a challenge?

Computations on data are limited by how quickly we can get data to a processor (or computing resource)

The term ``Big Data'' is a technological concept

                  Capacity   Transfer Rate
L1 Cache          64 KB      950 GB/s
L2 Cache          256 KB     240 GB/s
RAM               16 GB      6.4 GB/s
SSD               500 GB     0.55 GB/s
Disk              3 TB       0.15 GB/s
Gigabit Ethernet  -          0.01 GB/s
Internet          -          0.006 GB/s
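To see why transfer rate, not storage capacity, is the binding constraint, consider a single sequential scan of the full 3 TB disk at its 0.15 GB/s transfer rate:

\[
\frac{3\ \text{TB}}{0.15\ \text{GB/s}} = \frac{3000\ \text{GB}}{0.15\ \text{GB/s}} = 20{,}000\ \text{s} \approx 5.6\ \text{hours}
\]

You can store far more data than you can move to a processor in a reasonable amount of time.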

``Computers are like cars. Everyone wants a faster one but no one knows why.'' -unknown

Challenges of out-of-core (OOC) computing

  1. Not easily retrofitted to existing applications
    • In-core data structures have to be swapped for out-of-core ones
    • OOC data structures sometimes behave differently
    • OOC data structures have different performance characteristics
  2. Development requires system-level expertise
    • A different practice from application development or algorithm development
    • Another component to manage
  3. Generally platform specific
    • Windows is not Linux is not OS X
    • Maintain parallel versions of functionally equivalent software

What if we could redefine how memory is allocated when an application is launched?

If the allocation were small, standard memory functions (malloc/new) could be called

If the allocation were large:

  1. Create and memory map (mmap) a file with specified size
  2. Return the mmap'ed pointer address

Applications could run OOC without modification

Existing data-processing applications could accommodate much larger data sets without modification

Introducing flexmem

A general, transparent tool for OOC computing

Launched as a command-line utility taking an application as an argument

All memory allocations larger than a specified threshold are file-backed

Process- and thread-safe

When data is not needed, it is stored on disk; when it is needed, it is cached in memory

Tested with R, Python, and the GIMP.

Using flexmem

Start an application in flexmem by specifying the application as an argument

mike@mike-VirtualBox:~/$ flexmem R

Memory-mapped files are stored in /tmp by default

Internals can be interrogated using the Rflexmem package

Works on Linux right now, with OS X support on the way

Feedback and contributions are welcome

Project can be found at: https://github.com/kaneplusplus/flexmem

``Big data'' development with flexmem

flexmem is a tool for managing data

  • Transparently create OOC data structures
  • Will not make your analyses faster

Complements ``Big data'' development

  • Allows you to focus on algorithm development
  • Allows you to easily apply new algorithms to larger-than-RAM data

Example: the singular value decomposition

For an \(n \times p\) matrix the computational complexity of the SVD is \(O(n^2p + n p^2 + p^3)\) for \(n \geq p\).

Difficult to calculate for even modestly sized matrices

Implicitly Restarted Lanczos Bidiagonalization Algorithm (IRLBA) (Baglama and Reichel, 2005)

  • Iterative approximation method
  • Truncated SVD
  • Generally converges in a constant number of steps
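The truncated SVD that IRLBA approximates keeps only the \(k\) largest singular triplets:

\[
A \approx U_k \Sigma_k V_k^{T}, \qquad
U_k \in \mathbb{R}^{n \times k},\quad
\Sigma_k \in \mathbb{R}^{k \times k},\quad
V_k \in \mathbb{R}^{p \times k},\quad
k \ll \min(n, p)
\]

Each Lanczos iteration is dominated by matrix-vector products with \(A\) and \(A^T\), costing \(O(np)\) for a dense matrix, so for small \(k\) the total work is far below the \(O(n^2p + np^2 + p^3)\) cost of the full decomposition.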

Evaluating the IRLBA with flexmem

[Plot SvdComparison: SVD timing comparison]

Where is flexmem going?

  1. OS X support

  2. Directly map to data structures on disk
    • No need to serialize
    • Planned for R

  3. Feedback/testing

If you'd like to try it out or contribute...

Thanks again... any questions?