Since size is relative to technology, define ``big'' in terms of technology
A data set is large if it exceeds 20% of the available RAM on a single machine
A data set is massive if it exceeds 50% of available RAM on a single machine
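The two thresholds above can be sketched as a small check. The function name `classify_size`, the label "small" for everything under 20%, and the 16 GB example machine are illustrative assumptions, not part of the original definitions.

```python
GB = 1024 ** 3  # bytes per gigabyte (binary convention, an assumption)

def classify_size(data_bytes, ram_bytes):
    """Classify a data set relative to the RAM of a single machine."""
    fraction = data_bytes / ram_bytes
    if fraction > 0.5:
        return "massive"
    if fraction > 0.2:
        return "large"
    return "small"  # hypothetical label for data below the 20% threshold

# Example on a hypothetical machine with 16 GB of available RAM
for size_gb in (2, 4, 10):
    print(size_gb, "GB:", classify_size(size_gb * GB, 16 * GB))
```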
"The data may not contain the answer. The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data." -John W. Tukey
Computations on data are limited by how quickly we can get data to a processor (or computing resource)
The term ``Big Data'' is a technological concept
Medium | Capacity | Transfer Rate |
---|---|---|
L1 Cache | 64 KB | 950 GB/s |
L2 Cache | 256 KB | 240 GB/s |
RAM | 16 GB | 6.4 GB/s |
SSD | 500 GB | 0.55 GB/s |
Disk | 3 TB | 0.15 GB/s |
Gigabit Ethernet | | 0.01 GB/s |
Internet | | 0.006 GB/s |
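To make the transfer rates in the table concrete, the sketch below estimates how long it would take to move 16 GB (the RAM capacity listed) through each level. It assumes sustained transfer at the listed rate and ignores latency; the names `TRANSFER_RATES_GB_PER_S` and `transfer_seconds` are illustrative.

```python
# Transfer rates copied from the table, in GB/s
TRANSFER_RATES_GB_PER_S = {
    "L1 Cache": 950,
    "L2 Cache": 240,
    "RAM": 6.4,
    "SSD": 0.55,
    "Disk": 0.15,
    "Gigabit Ethernet": 0.01,
    "Internet": 0.006,
}

def transfer_seconds(size_gb, rate_gb_per_s):
    """Time to move size_gb at a sustained rate, ignoring latency."""
    return size_gb / rate_gb_per_s

for medium, rate in TRANSFER_RATES_GB_PER_S.items():
    print(f"{medium:>16}: {transfer_seconds(16, rate):10.2f} s")
```

Under these assumptions, reading 16 GB from RAM takes 2.5 seconds, while the same data takes over 100 seconds from disk.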
``Computers are like cars. Everyone wants a faster one but no one knows why.'' -unknown
If the allocation were small, standard memory functions (malloc/new) could be called
If the allocation were large:
Applications could run out-of-core (OOC) without modification
Existing data-processing applications could accommodate much larger data sets without modification
A general, transparent tool for OOC computing
Launched as a command-line utility taking an application as an argument
All memory allocations larger than a specified threshold are file-backed
Process- and thread-safe
When data is not needed, it is stored on disk; when it is needed, it is cached
Tested with R, Python, and the GIMP.
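The threshold idea can be illustrated in a few lines of Python. This is not flexmem's implementation: it is a minimal sketch of the same policy using the standard `mmap` and `tempfile` modules, written for Unix-like systems. The function `allocate`, the 1 MiB default threshold, and the `_backing_files` list are all hypothetical.

```python
import mmap
import tempfile

THRESHOLD = 1 << 20   # hypothetical threshold: 1 MiB
_backing_files = []   # keep backing files referenced while mappings exist

def allocate(nbytes, threshold=THRESHOLD, directory=None):
    """Return a writable buffer; buffers above the threshold are file-backed."""
    if nbytes <= threshold:
        # Small request: ordinary in-memory allocation
        return bytearray(nbytes)
    # Large request: back the buffer with a temporary file
    backing = tempfile.TemporaryFile(dir=directory)
    backing.truncate(nbytes)              # size the file to the request
    _backing_files.append(backing)
    # The OS pages data in from the file on access and writes it back as needed
    return mmap.mmap(backing.fileno(), nbytes)

small = allocate(1024)        # stays in ordinary memory
big = allocate(8 << 20)       # 8 MiB, backed by a temporary file
big[:5] = b"hello"            # used like any other byte buffer
print(type(small).__name__, type(big).__name__, big[:5])
```

Both buffers support the same slicing interface, which is the sense in which the caller does not need to change.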
Start an application in flexmem by specifying the application as an argument
mike@mike-VirtualBox:~/$ flexmem R
Memory-mapped files are stored in /tmp by default
Internals can be interrogated using the Rflexmem package
Works right now for Linux with OS X support on the way
Feedback and contributions are welcome
Project can be found at: https://github.com/kaneplusplus/flexmem
flexmem is a tool for managing data
Complements ``Big Data'' development
For an \(n \times p\) matrix the computational complexity of the SVD is \(O(n^2p + n p^2 + p^3)\) for \(n \geq p\).
Difficult to calculate for even modestly sized matrices
Implicitly Restarted Lanczos Bidiagonalization Algorithm (IRLBA) (Baglama and Reichel, 2005)
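The building block behind Lanczos-based methods can be sketched with NumPy. The code below is a plain Golub-Kahan-Lanczos bidiagonalization with full reorthogonalization; it is not IRLBA itself, since it omits the implicit restarting that keeps memory bounded. The function name `leading_singular_values` and its parameters are illustrative assumptions. Each step needs only two matrix-vector products, so the full SVD of the \(n \times p\) matrix is never formed.

```python
import numpy as np

def leading_singular_values(A, k, n_steps, seed=0, tol=1e-12):
    """Approximate the k largest singular values of A from n_steps
    Lanczos bidiagonalization steps (no restarting).
    Assumes no breakdown on the u side (every alpha is positive)."""
    rng = np.random.default_rng(seed)
    n, p = A.shape
    U = np.zeros((n, n_steps))
    V = np.zeros((p, n_steps))
    alphas, betas = [], []
    v = rng.standard_normal(p)
    v /= np.linalg.norm(v)
    u_prev, beta = np.zeros(n), 0.0
    for j in range(n_steps):
        V[:, j] = v
        u = A @ v - beta * u_prev
        u -= U[:, :j] @ (U[:, :j].T @ u)            # reorthogonalize against earlier u
        alpha = np.linalg.norm(u)
        u /= alpha
        U[:, j] = u
        w = A.T @ u - alpha * v
        w -= V[:, :j + 1] @ (V[:, :j + 1].T @ w)    # reorthogonalize against earlier v
        beta = np.linalg.norm(w)
        alphas.append(alpha)
        if beta < tol or j == n_steps - 1:
            break                                   # subspace exhausted or done
        betas.append(beta)
        v, u_prev = w / beta, u
    # Small upper-bidiagonal matrix whose singular values approximate those of A
    B = np.diag(alphas) + np.diag(betas, 1)
    return np.linalg.svd(B, compute_uv=False)[:k]

# Example: a 500 x 40 matrix with a decaying spectrum (synthetic data)
rng = np.random.default_rng(1)
A = rng.standard_normal((500, 40)) * (0.8 ** np.arange(40))
print("Lanczos (20 steps):", leading_singular_values(A, k=3, n_steps=20))
print("Full SVD:          ", np.linalg.svd(A, compute_uv=False)[:3])
```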