On the other hand, the paper is a must-read if you are interested in distributed deep network platforms. This is the paper that applied the distributed parameter-server idea to deep learning. The parameter-server idea is still going strong because it suits the iterative-convergent nature of machine learning and deep learning tasks. The DistBelief architecture has been used by the Microsoft Adam project, Baidu Deep Image, Apache Hama, and Petuum's Bosen. Google, though, has since switched from the DistBelief parameter-server to TensorFlow's hybrid dataflow architecture, citing the difficulty of customizing/optimizing DistBelief for different machine learning tasks. And of course TensorFlow also brought support for distributed GPU execution for deep learning, which improves performance significantly.
I think the paper is also significant because it established connections between deep learning and distributed graph processing systems. After understanding the model-parallelism architecture in DistBelief, it is possible to transfer some distributed graph processing expertise (e.g., locality-optimized graph partitioning) to the performance optimization of deep NN platforms.
The DistBelief architecture
DistBelief supports both data and model parallelism. I will use the Stochastic Gradient Descent (SGD) application as the example to explain both cases. Let's talk about the simpler case, data parallelism, first.
Data parallelism in DistBelief
It is OK for the model replicas to work concurrently in an asynchronous fashion because the $\delta$ gradients are commutative and additive with respect to each other. It is even acceptable for the model replicas to slack a bit in fetching an updated copy of the model parameters $w$. It is possible to reduce the communication overhead of SGD by limiting each model replica to request updated parameters only every nfetch steps and send updated gradient values only every npush steps (where nfetch might not be equal to npush). This slacking may even be advantageous at the beginning of training, when the gradients are steep. However, towards convergence to an optimum, when the gradients become subtle, continuing like this may cause dithering. Fortunately, this is where the Adagrad adaptive learning rate procedure helps. Rather than using a single fixed learning rate on the parameter server, Adagrad uses a separate adaptive learning rate $\eta$ for each parameter. In Figure 2 the parameter-server update rule is $w' := w - \eta \delta w$. An adaptive scheme with a large learning rate $\eta$ early in training and a small learning rate $\eta$ closer to convergence is most suitable.
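The fetch/push slack and the per-parameter Adagrad rate described above can be sketched in a few lines of Python. This is a toy, single-process simulation under stated assumptions, not the DistBelief implementation: the `ParameterServer` class, `replica_step`, `train`, the linear least-squares model, and the values chosen for `base_lr`, `n_fetch`, and `n_push` are all hypothetical, and a round-robin loop stands in for truly concurrent replicas.

```python
import numpy as np


class ParameterServer:
    """Toy in-process stand-in for the parameter server (hypothetical)."""

    def __init__(self, dim, base_lr=0.2, eps=1e-8):
        self.w = np.zeros(dim)
        self.sq_grad_sum = np.zeros(dim)  # Adagrad state: one accumulator per parameter
        self.base_lr = base_lr
        self.eps = eps

    def fetch(self):
        return self.w.copy()

    def push(self, grad):
        # Adagrad: each parameter's rate shrinks as its squared gradients pile up.
        self.sq_grad_sum += grad ** 2
        eta = self.base_lr / (np.sqrt(self.sq_grad_sum) + self.eps)
        self.w -= eta * grad  # w' := w - eta * delta_w, applied elementwise


def replica_step(ps, state, x, y, step, n_fetch, n_push):
    """One mini-batch step of a model replica with slack in fetch and push."""
    if step % n_fetch == 0:
        state["w"] = ps.fetch()  # otherwise keep computing on a stale copy
    err = x @ state["w"] - y
    state["acc"] += 2.0 * x.T @ err / len(y)  # gradient of mean squared error
    if (step + 1) % n_push == 0:
        ps.push(state["acc"])  # gradients are additive, so push their sum
        state["acc"] = np.zeros_like(state["acc"])


def train(num_replicas=4, steps=100, n_fetch=2, n_push=2, seed=0):
    """Run the toy simulation and return (initial_loss, final_loss)."""
    rng = np.random.default_rng(seed)
    w_true = np.array([1.0, -2.0, 0.5])  # made-up target weights
    X = rng.normal(size=(256, 3))
    Y = X @ w_true
    ps = ParameterServer(dim=3)
    states = [{"w": ps.fetch(), "acc": np.zeros(3)} for _ in range(num_replicas)]

    def loss(w):
        return float(np.mean((X @ w - Y) ** 2))

    initial = loss(ps.w)
    for step in range(steps):
        for state in states:  # round-robin stands in for real concurrency
            idx = rng.integers(0, len(X), size=32)
            replica_step(ps, state, X[idx], Y[idx], step, n_fetch, n_push)
    return initial, loss(ps.w)


if __name__ == "__main__":
    before, after = train()
    print(f"loss before: {before:.4f}, after: {after:.4f}")
```

Because each pushed update is simply added into `w`, the order in which replicas push does not matter, which is what makes the asynchronous scheme tolerable. Raising `n_fetch` in this sketch makes each replica compute on staler parameters.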
Although the parameter-server is drawn as a single logical entity, it is itself implemented in a distributed fashion, akin to how distributed key-value stores are implemented. In fact, the parameter server may even be partitioned over the model replicas, so that each model replica becomes the primary server of one partition of the parameter-server.
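To make the key-value analogy concrete, here is a minimal sketch of a sharded parameter store. It assumes hash-based placement of named parameters, which is one common way to build a key-value store and not something the paper specifies; `shard_for`, `ShardedParameterServer`, and the parameter name `layer1/w` are hypothetical.

```python
import zlib


def shard_for(key, num_shards):
    """Map a parameter name to a shard index; crc32 is stable across processes."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


class ShardedParameterServer:
    """One logical parameter server backed by several shards (plain dicts here)."""

    def __init__(self, num_shards):
        self.shards = [{} for _ in range(num_shards)]

    def _shard(self, key):
        return self.shards[shard_for(key, len(self.shards))]

    def put(self, key, value):
        self._shard(key)[key] = value

    def get(self, key):
        return self._shard(key)[key]

    def apply_gradient(self, key, grad, lr=0.1):
        # Only the shard that owns this key is touched by the update.
        shard = self._shard(key)
        shard[key] = shard[key] - lr * grad


if __name__ == "__main__":
    ps = ShardedParameterServer(num_shards=4)
    ps.put("layer1/w", 1.0)
    ps.apply_gradient("layer1/w", grad=2.0)
    print(ps.get("layer1/w"), "lives on shard", shard_for("layer1/w", 4))
```

If the shards are hosted on the model replicas, `num_shards` would equal the number of replicas and each replica would own one of the dicts.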
Model parallelism in DistBelief
When the model replica is sharded over multiple machines as in the figure, this is called *model-parallelism*. Typically the model replica, i.e., the NN, is sharded over up to 8 model-worker machines. Scalability suffers when we try to partition the model replica among more than 8 model-workers. While we were able to tolerate slack between the model-replicas and the parameter-server, inside the model-replica the model-workers need to act consistently with respect to each other as they perform forward activation propagation and backward gradient propagation.
For this reason, proper partitioning of the model-replica over the model-workers is critical for performance. How is the model, i.e., the NN, partitioned over the model-workers? This is where the connection to distributed graph processing occurs. The performance benefit of distributing the model, i.e., the deep NN, across multiple model-worker machines depends on the connectivity structure and computational needs of the model. Models with local connectivity structures tend to be more amenable to extensive distribution than fully-connected structures, given their lower communication requirements.
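The communication gap between local and full connectivity can be illustrated by counting the edges that cross a partition boundary. The sketch below is a deliberately simplified 1-D model, assumed for illustration only: two layers of `n` nodes each, both split into `k` contiguous blocks, with `owner` and `cross_edges` as hypothetical helper names.

```python
def owner(node, n, k):
    """Index of the model-worker that owns `node` under contiguous blocks."""
    return node * k // n


def cross_edges(n, k, radius=None):
    """Count edges between two n-node layers whose endpoints sit on different workers.

    radius=None means fully connected; otherwise each upper node connects
    only to lower nodes within `radius` positions of it.
    """
    count = 0
    for upper in range(n):
        if radius is None:
            lowers = range(n)
        else:
            lowers = range(max(0, upper - radius), min(n, upper + radius + 1))
        for lower in lowers:
            if owner(upper, n, k) != owner(lower, n, k):
                count += 1
    return count


if __name__ == "__main__":
    print("fully connected:", cross_edges(64, 4))
    print("local, radius 2:", cross_edges(64, 4, radius=2))
```

With 64 nodes per layer split over 4 workers, the fully connected layer has 3072 crossing edges, while the locally connected one with radius 2 has only 18. Every crossing edge stands for an activation (and later a gradient) that must travel between machines.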
The final question that remains is the interaction of the model-workers with the parameter-server. How do the model-workers, which constitute a model-replica, update the parameter-server? Since the parameter-server itself is also implemented in a distributed fashion (often over the model replicas), each model-worker needs to communicate with just the subset of parameter-server shards that hold the model parameters relevant to its partition. For fetching the model from the parameter-server, I presume the model-workers need to coordinate with each other and do this in a somewhat synchronized manner before starting a new mini-batch.
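As a rough illustration of "just the subset of shards", suppose the parameters are laid out as one flat index range and the server shards hold equal contiguous slices of it. That layout is an assumption made for this sketch, as is the helper name `shards_for_range`; the paper does not describe the mapping at this level.

```python
def shards_for_range(start, stop, num_params, num_shards):
    """Return the server shards that overlap the parameter range [start, stop).

    Assumes shard i holds parameters [i * size, (i + 1) * size), where size
    is num_params divided by num_shards, rounded up.
    """
    if not 0 <= start < stop <= num_params:
        raise ValueError("expected 0 <= start < stop <= num_params")
    size = -(-num_params // num_shards)  # ceiling division
    first = start // size
    last = (stop - 1) // size
    return list(range(first, last + 1))


if __name__ == "__main__":
    # A model-worker that owns parameters 250..419 out of 1000, with 10 shards.
    print(shards_for_range(250, 420, num_params=1000, num_shards=10))
```

In this example the model-worker talks to shards 2, 3, and 4 and can ignore the other seven.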
[Remark: Unfortunately the presentation of the paper was unclear. For example, there wasn't a clean distinction made between the terms "model-replica" and "model-worker". Because of these ambiguities and the complicated design ideas involved, I spent a good portion of a day being confused and irritated with the paper. I initially thought that each model-replica holds the whole model (correct!), but that each model-replica is responsible for updating only part of the model in the parameter-server (incorrect!).]
Experiments
The paper evaluated DistBelief on a speech recognition application and on an ImageNet classification application.
For visual object recognition, DistBelief was used for training a larger neural network with locally-connected receptive fields on the ImageNet data set of 16 million images, each scaled to 100x100 pixels. The network had three stages, each composed of filtering, pooling and local contrast normalization, where each node in the filtering layer was connected to a 10x10 patch in the layer below. (I guess this is a similar setup to convolutional NNs, which have since become an established method for image recognition. Convolutional NNs have good locality, especially in the earlier convolutional layers.) Due to locality in the model, i.e., the deep NN, this task scales better, allowing partitioning over up to 128 model-workers inside a model replica. However, the speedup efficiency is pretty poor: 12x speedup using 81 model-workers.
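To put a number on "pretty poor", take parallel efficiency to be the speedup divided by the number of machines (a common definition, assumed here and not taken from the paper):

$$\text{efficiency} = \frac{\text{speedup}}{\text{model-workers}} = \frac{12}{81} \approx 0.148$$

That is, each of the 81 model-workers contributes only about 15% of what it would under ideal linear scaling.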
By adding data-parallelism, i.e., running multiple model-replicas concurrently, DistBelief was shown to scale to deployments over 1000s of machines in total.