The computing infrastructure refers to the software used to manage computing resources. It is designed to be scalable, meaning that the run time of large-scale computations can be kept to a minimum simply by applying more computing resources to the calculation; the infrastructure can therefore continue to process larger and larger amounts of data by adding hardware. The infrastructure is also designed to be elastic, meaning that computing resources can be added to a calculation while it is being performed, providing further speed-up, or removed from a calculation so that they can be applied to other calculations. Finally, the infrastructure is designed to be general, providing a solution to a large class of computational challenges.
The infrastructure is designed to be decentralized, and computing resources are deployed to computations as needed. Rather than being tied to a single “master” process, each worker process blocks on a resource queue, waiting for a new job. A coordinator initiates a new computation by sending a request for workers through the resource queue, as shown in Figure 1. The request includes the number of workers to deploy along with a handle to a task queue, where workers receive new tasks, and a task return queue, where workers return results. Workers receiving the request then wait on the coordinator's task queue, shown in Figure 2, where they receive calculations to perform. When the tasks are complete, the workers return to the resource queue, where they wait for new jobs.
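The following sketch shows one way a coordinator could carry out this protocol in R using the rredis client. The package choice, queue names, message fields, and task format are illustrative assumptions, not the infrastructure's actual interface.

```r
# Coordinator sketch (assumed queue names and message layout).
library(rredis)
redisConnect()                                   # connect to a local Redis server

resource_queue <- "resources"                    # global queue watched by idle workers
task_queue     <- "job42:tasks"                  # this coordinator's task queue
return_queue   <- "job42:results"                # where workers return results

# Request three workers by placing one message per worker on the resource queue (Figure 1).
n_workers <- 3
request <- list(task_queue = task_queue, return_queue = return_queue)
for (i in seq_len(n_workers)) redisRPush(resource_queue, request)

# Push tasks onto the task queue (Figure 2); rredis serializes the R objects.
tasks <- lapply(1:10, function(i) list(id = i, fun = "sum", args = list(1:i)))
for (task in tasks) redisRPush(task_queue, task)
for (i in seq_len(n_workers)) redisRPush(task_queue, list(done = TRUE))   # shutdown markers

# Collect results from the return queue.
results <- vector("list", length(tasks))
collected <- 0
while (collected < length(tasks)) {
  res <- redisLPop(return_queue)                 # NULL when the queue is empty
  if (is.null(res)) { Sys.sleep(0.1); next }     # a blocking pop could be used instead
  results[[res$id]] <- res$value
  collected <- collected + 1
}
redisClose()
```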
The proposed computing infrastructure is inspired by the [tuple spaces][TupleSpaces], associative-memory approach to distributed computing. Redis, a key-value database, handles interprocess communication and synchronization and provides primitives for fault tolerance. The design scales out easily: new workers are added to the system simply by waiting on the resource queue for the next job. Furthermore, since workers are not associated with a single coordinator, they can be deployed to any job, and multiple jobs can be serviced at the same time.
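A matching worker loop, again as a hedged sketch in R with rredis, might look as follows; the polling helper stands in for a blocking pop, and the message fields mirror the assumptions made in the coordinator sketch above.

```r
# Worker sketch: wait on the resource queue, service one coordinator's job,
# then return to the resource queue for the next job.
library(rredis)
redisConnect()

resource_queue <- "resources"

pop_or_wait <- function(key) {
  # Polling stand-in for a blocking pop on a Redis list.
  repeat {
    x <- redisLPop(key)
    if (!is.null(x)) return(x)
    Sys.sleep(0.1)
  }
}

repeat {
  # Wait for a coordinator's request for workers (Figure 1).
  request <- pop_or_wait(resource_queue)

  # Service the coordinator's task queue until a shutdown marker arrives (Figure 2).
  repeat {
    task <- pop_or_wait(request$task_queue)
    if (isTRUE(task$done)) break
    value <- do.call(task$fun, task$args)        # perform the calculation
    redisRPush(request$return_queue, list(id = task$id, value = value))
  }
}
```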
The infrastructure also provides two clear advantages over Hadoop. First, it is well suited to a more general class of parallel problems. Users can define pipeline-parallel computations suitable for tackling streaming data, along with other non-embarrassingly parallel problems such as numerical linear algebra. Users are not limited to computing challenges that fit the MapReduce framework; they are free to define their own parallel computing configuration to address individual challenges. Second, it is easy to administer and maintain. A basic Hadoop installation requires the setup and maintenance of the Hadoop Distributed File System, Task Trackers, a Job Tracker, a Name Node, Data Nodes, and so on. The computing infrastructure requires only that R, a few R packages, and Redis are installed. There are no services to start and there are no configuration files to modify. The infrastructure can be installed and used in a matter of seconds.
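As a sketch of how a pipeline-parallel configuration might be expressed on top of the same queue primitives, each stage below reads from one Redis list and writes to the next; the stage names and record-processing functions are hypothetical and stand in for whatever streaming logic a user defines.

```r
# Pipeline sketch: records stream through independent stages connected by queues.
library(rredis)
redisConnect()

run_stage <- function(in_queue, out_queue, f) {
  repeat {
    x <- redisLPop(in_queue)
    if (is.null(x)) { Sys.sleep(0.1); next }     # or use a blocking pop
    y <- f(x)
    if (!is.null(out_queue)) redisRPush(out_queue, y)
  }
}

# Each call below would run in its own worker process, forming a three-stage pipeline:
#   run_stage("stage1", "stage2", parse_record)      # parse_record is hypothetical
#   run_stage("stage2", "stage3", transform_record)  # transform_record is hypothetical
#   run_stage("stage3", NULL,     store_result)      # store_result is hypothetical
```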
The system also provides advantages when compared with MPI. First, an MPI cluster can generally perform only one computation at a time. As a result, users are typically forced to submit their jobs to a queue, and a job starts only after the jobs ahead of it are complete. When previously submitted jobs are long-running, the user is forced to wait, regardless of the running time of their own job. Second, elasticity is not explicitly built into the MPI framework. This means that either a parallel calculation has to be discretized to allow entry points for new workers to join the calculation, or a registration process has to be built on top of MPI. While there has been some work in this area by A. Raveendran (2011), the concept and its implementation are still considered preliminary.
Generally speaking, parallel problems fall into one of two categories. Computations in the first category are accomplished by grouping data, performing a single calculation on each group, and returning the results in a specified format. Since the per-group calculations are independent of each other and easily parallelized, these are often referred to as embarrassingly parallel computing challenges. While current distributed computing frameworks such as Hadoop are very effective for embarrassingly parallel operations, they become inefficient in the second challenge domain: non-embarrassingly parallel problems. This limitation arises because these frameworks cannot properly optimize communication overhead, which requires taking into account the redundancies in data transfer inherent to these types of problems.
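For a concrete, if minimal, illustration of the first category, the base-R snippet below groups a data set, performs one independent calculation per group, and combines the results; because the group calculations never communicate, each one could be dispatched to a different worker as its own task.

```r
# Embarrassingly parallel pattern: group, compute independently per group, combine.
groups      <- split(mtcars, mtcars$cyl)          # group the data by a key
group_means <- lapply(groups, colMeans)           # one independent calculation per group
do.call(rbind, group_means)                       # return the results in a specified format
```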
An illustrative example of a non-embarrassingly parallel calculation is matrix multiplication. Each element of the product is the inner product of a row of one matrix with a column of the other. Let \( A \) be a \( p \times n \) matrix and \( B \) be an \( n \times q \) matrix, partitioned into block matrices \( A_{11} \), \( A_{12} \), \( A_{21} \), \( A_{22} \) and \( B_{11} \), \( B_{12} \), \( B_{21} \), \( B_{22} \) respectively. The matrix multiplication can be written as: \[ AB = \left[ \begin{array}{cc} A_{11} \times B_{11} + A_{12} \times B_{21} & A_{11} \times B_{12} + A_{12} \times B_{22} \\ A_{21} \times B_{11} + A_{22} \times B_{21} & A_{21} \times B_{12} + A_{22} \times B_{22} \end{array} \right] \] From these equations it is clear that there is no data-independent strategy for grouping the block matrices, since each block appears in two of the resulting block matrices and no combination of blocks appears more than once in the result. It should also be noted that the challenge of minimizing communication overhead is not limited to matrix multiplication; it appears throughout the field of numerical linear algebra and in streaming-data challenges.
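The block identity can be checked numerically; in the short R sketch below (with arbitrarily chosen dimensions), the four result blocks are formed explicitly, which makes the redundancy visible: every block of \( A \) and \( B \) is needed by two different result blocks.

```r
# Blocked matrix multiplication: each block of A and B is used in two result blocks.
set.seed(1)
A <- matrix(rnorm(4 * 6), 4, 6)                   # p = 4, n = 6
B <- matrix(rnorm(6 * 4), 6, 4)                   # n = 6, q = 4

# Partition A and B into 2 x 2 grids of blocks.
A11 <- A[1:2, 1:3]; A12 <- A[1:2, 4:6]; A21 <- A[3:4, 1:3]; A22 <- A[3:4, 4:6]
B11 <- B[1:3, 1:2]; B12 <- B[1:3, 3:4]; B21 <- B[4:6, 1:2]; B22 <- B[4:6, 3:4]

AB <- rbind(
  cbind(A11 %*% B11 + A12 %*% B21, A11 %*% B12 + A12 %*% B22),
  cbind(A21 %*% B11 + A22 %*% B21, A21 %*% B12 + A22 %*% B22)
)
all.equal(AB, A %*% B)                            # TRUE: blocked result matches direct product
```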