Saturday, November 18, 2017

Paper summary. DyNet: The Dynamic Neural Network Toolkit

The programming model that underlies several popular toolkits, such as TensorFlow, uses a static declaration approach: declaration and execution of the network architecture are separated.

Static declaration has a number of advantages. After the computation graph is defined, it can be optimized in a number of ways so that the subsequent repeated executions of computation can be performed as quickly as possible. This also simplifies distribution of computation across multiple devices, as in TensorFlow. But static declaration is inconvenient for the following:

• variably sized inputs
• variably structured inputs
• nontrivial inference algorithms
• variably structured outputs

Of course, it is possible to process variably sized inputs if the computation graph can represent objects whose size is unspecified at declaration time. Flow-control operations such as conditional execution and iteration can be added to the inventory of operations supported by the computation graph. For example, to run an RNN over variable-length sequences, Theano offers the scan operation and TensorFlow offers the dynamic RNN operation.

While it is therefore possible in theory to deal with variable architectures under static declaration, doing so still poses difficulties in practice:

• Difficulty in expressing complex flow-control logic
• Complexity of the computation graph implementation
• Difficulty in debugging

These difficulties carry serious software engineering risks. As an alternative, DyNet proposes reviving an older programming model: dynamic declaration of computation graphs.

Dynamic declaration

The dynamic declaration model in DyNet takes a single-step approach: the user defines the computation graph programmatically as if they were calculating the outputs of their network on a particular training instance. There are no separate steps for definition and execution: the necessary computation graph is created, on the fly, as the loss calculation is executed, and a new graph is created for each training instance. (To keep the overhead of this acceptable, DyNet strives to make graph construction very lightweight.)

Dynamic declaration reduces the complexity of the computation graph implementation, since the graph does not need to contain flow-control operations or support dynamically sized data. DyNet is designed to allow users to implement their models in their preferred programming language (C++ or Python). A symbolic computation graph is still constructed, but flow control is expressed in the host language rather than through separate operations at the computation-graph level. Thus, dynamic declaration facilitates the implementation of more complicated network architectures.

What is the innovation in DyNet?

DyNet aims to minimize the computational cost of graph construction in order to allow efficient dynamic computation. This way DyNet aspires to remove barriers to rapid prototyping and implementation of more sophisticated applications of neural nets that are not easy to implement in the static computation paradigm.

DyNet's backend, written in C++, is optimized to remove overhead in computation graph construction and to support efficient execution on both CPU and GPU. This is feasible because flow control and facilities for dealing with variably sized inputs remain in the host language (rather than in the computation graph, as static declaration requires): the computation graph needs to support fewer operation types, and these tend to be more completely specified (e.g., tensor sizes are always known rather than inferred at execution time).

DyNet programs

DyNet programs follow this template:
1. Create a Model.
2. Add the necessary Parameters and LookupParameters to the model. Create a Trainer object and associate it with the Model.
3. For each input example:
(a) Create a new ComputationGraph, and populate it by building an Expression representing the desired computation for this example.
(b) Calculate the result of that computation forward through the graph by calling the value() or npvalue() functions of the final Expression.
(c) If training, calculate an Expression representing the loss function, and use its backward() function to perform back-propagation.
(d) Use the Trainer to update the parameters in the Model

In contrast to static declaration libraries such as TensorFlow, in DyNet the "create a graph" step falls within the loop. This has the advantage of allowing the user to flexibly create a new graph structure for each instance and to use flow control syntax (e.g., iteration) from their native programming language.
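To make the contrast concrete, here is a minimal pure-Python sketch of the same idea: a throwaway expression graph is built per example while the loss is computed, backward() runs over that graph, and only the parameters survive across examples. The Expr class and the tiny regression task below are invented for illustration; this is a stand-in for the idea, not DyNet's actual API.

```python
# A toy dynamic computation graph in pure Python (an illustrative
# stand-in for DyNet, not its actual API). Each arithmetic operation
# eagerly computes a value AND records how to backpropagate through it,
# so the graph is rebuilt from scratch for every training example.

class Expr:
    def __init__(self, value, parents=(), grad_fns=()):
        self.value = value        # forward value, computed eagerly
        self.parents = parents    # parent Expr nodes in this graph
        self.grad_fns = grad_fns  # d(self)/d(parent), one per parent
        self.grad = 0.0

    def __add__(self, other):
        return Expr(self.value + other.value, (self, other),
                    (lambda g: g, lambda g: g))

    def __mul__(self, other):
        return Expr(self.value * other.value, (self, other),
                    (lambda g: g * other.value, lambda g: g * self.value))

    def backward(self):
        order, seen = [], set()
        def visit(n):             # topological sort of this example's graph
            if id(n) not in seen:
                seen.add(id(n))
                for p in n.parents:
                    visit(p)
                order.append(n)
        visit(self)
        self.grad = 1.0
        for n in reversed(order):  # reverse-mode gradient accumulation
            for p, f in zip(n.parents, n.grad_fns):
                p.grad += f(n.grad)

# "Model" parameters persist across examples; graphs do not.
w, b = Expr(0.0), Expr(0.0)
data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]   # fit y = 2x + 1
lr = 0.02
for epoch in range(500):
    for x, y in data:
        w.grad = b.grad = 0.0
        diff = w * Expr(x) + b + Expr(-y)     # fresh graph, per example
        loss = diff * diff
        loss.backward()            # backprop through this example's graph
        w.value -= lr * w.grad     # the "Trainer" update step
        b.value -= lr * b.grad
print(round(w.value, 2), round(b.value, 2))   # approaches 2.0 and 1.0
```

Real DyNet keeps this per-example construction cheap in its C++ backend; the point of the sketch is only that the graph is a throwaway object rebuilt inside the training loop, with ordinary host-language control flow.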

Here is an example program.

This program shows the process of performing maximum-likelihood training for a simple classifier that calculates a vector of scores for each class it is expected to predict, then returns the ID of the class with the highest score. Notice that at line 14 the symbolic graph is defined dynamically, at line 15 the forward pass is executed, and at line 16 the backward pass (automatic differentiation) is executed. At line 19, after training, inference is performed. To account for dynamic inputs/graphs at inference time, the graph is reconstructed for each serving input.

DyNet easily allows dynamic flow control at inference time. This can let the classifier avoid wasting processing time when the answer is clear. It is also possible to perform dynamic flow control at training time, which supports more sophisticated training algorithms that use reinforcement learning. These algorithms require interleaving model evaluation with decision making on the basis of that evaluation.
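As a sketch of that early-exit pattern (the two scorers and the margin threshold below are invented for illustration), ordinary Python control flow decides whether the expensive part of the network runs at all:

```python
# Dynamic flow control at inference (toy stand-ins, not a real network):
# run a cheap scorer first, and only fall back to an expensive one
# when the cheap scorer's margin is too small to be trusted.

def cheap_scores(x):
    # stand-in for an early/cheap part of the network
    return [0.9 * x, 1.0 - 0.9 * x]

def expensive_scores(x):
    # stand-in for running the full network
    return [x, 1.0 - x]

def classify(x, margin=0.3):
    scores = cheap_scores(x)
    top, runner_up = sorted(scores, reverse=True)[:2]
    if top - runner_up >= margin:
        return scores.index(top)      # answer is clear: exit early
    scores = expensive_scores(x)      # answer unclear: do more work
    return scores.index(max(scores))

print(classify(0.9), classify(0.4))
```

Because the branch is just a Python `if`, none of this conditional logic has to be encoded as graph operations, which is the point of dynamic declaration.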

How do we make DyNet distributed?

DyNet is currently centralized. There is support for automatic mini-batching to improve computational efficiency, taking the burden off users who would otherwise implement mini-batching in their models. For more complicated models that do not support mini-batching, there is support for data-parallel multi-processing, in which asynchronous parameter updates are performed across multiple threads, making it simple to parallelize (on a single machine) any variety of model at training time.

Petuum Inc. is working on extending this parallelism from a single machine to data-parallel processing across multiple machines, using the Poseidon machine-learning communication framework.

Thursday, November 16, 2017

Spring 18 seminar, "blast from the past" edition

It is a bad sign if the references section of a paper fails to cite any recent papers. It means either that the authors are not aware of any recent work in the area, or that the area is not of interest to other researchers.

But it is also a bad sign if the references section of a paper fails to cite any old papers. It means the authors likely do not know enough about the fundamental/foundational work in the area.

There is a case to be made for working from the fundamentals, the first principles. Elon Musk made that case in his work, and showed that you can do transformative work, even in the commercial technology world, by working from first principles.

Working from first principles is also essential for research. It is not uncommon to get your best ideas when preparing for a class. Sometimes, reviewing fundamental work on a topic, you notice a gap, some weird under-explained assumption, and go "huh, why is it that way?" Or sometimes students (or outsiders to the field) ask a basic question from left field, and that may start an investigation. And sometimes you see an old idea/algorithm as a promising/useful fit for new emerging applications. The recent flexible-quorums extension to Paxos is a good example of working from first principles. Nobody expected that result after 30 years of Paxos.

Back to the seminar

Every Spring, I teach a distributed systems seminar where we cover recent interesting papers in the field. But this Spring, I think I will experiment with a special "blast from the past" edition.

Reading these golden oldies could be a chance to revisit the fundamentals of the field. When reading these papers, we can note how they aged (which parts aged well, which did not) and how the context changed from then to now. We can use Google Scholar to investigate how these papers got cited in the intervening years. We can also consider whether some of these algorithms can find new applications in modern systems.

We run our distributed systems seminars to include discussion and group work sessions, so I am sure we will be able to come up with new insights/perspectives about these papers with the advantage of hindsight.

But I am not sure what the selected papers will be yet. Here is my first brain dump on this. It may take several weeks to finalize the list. If you have suggestions, please let me know in the comments. What should be the cutoff date for these papers? 2000 seems to be a reasonable cutoff date. We may even stretch this up to 2007 to include a paper or two.
1. Lamport. Time, clocks, and the ordering of events in a distributed system, 1978.
2. Lampson & Sturgis, Crash Recovery in a Distributed Data Storage System, 1979
3. Dijkstra. Self-stabilization in spite of distributed control, 1982.
4. Ben-Or, "Another Advantage of Free Choice: Completely Asynchronous Agreement Protocols" 1983.
5. Rabin, "Randomized Byzantine Generals" 1983.
6. Oki & Liskov, Viewstamped replication: A new primary copy method to support highly-available distributed systems, 1988.
7. Consensus in the presence of partial synchrony, Dwork, Lynch, Stockmeyer, 1988.
8. Some UNITY papers from Chandy & Misra.
9. Birman, Virtual Synchrony paper and debate.
10. Andrew File System: Scale and performance in a distributed file system, 1988
11. Schneider, "Implementing fault-tolerant services using the state machine approach: a tutorial", 1990.
12. Awerbuch and Peleg, "Sparse Partitions", 1990.
13. Arora, Gouda. Distributed reset, 1990.  Closure and convergence: A foundation of fault-tolerant computing, 1993.
14. Herlihy & Moss, "Transactional Memory: Architectural Support for Lock-Free Data Structures", 1993.
15. Plan 9 from Bell Labs, 1995
16. Lampson, "How to Build a Highly Available System Using Consensus", 1996.
17. Chandra, Toueg "Unreliable Failure Detectors for Reliable Distributed Systems", 1996.
18. Flexible update propagation for weakly consistent replication, 1997.
19. Afek & Dolev, "Local stabilizer" 1997.
20. Cluster-Based Scalable Network Services, 1997.
21. Scalable, distributed data structures for internet service construction, 2000.
22. Rosenblum and Garfinkel. Virtual Machine Monitors: Current Technology and Future Trends, 2005.

Werner Vogels's blog has a "Back to the Basics Reading" label, which includes interesting papers.

Tuesday, November 14, 2017

Book review: The Growth Mindset

I read this book a couple of years ago. It was a fulfilling book, written by the eminent Stanford psychologist Prof. Carol Dweck. You can listen to her speak about the ideas in the book here.

The growth mindset is best understood with its antithesis: the fixed mindset. The fixed mindset says that your abilities and capacity are predetermined at birth. The growth mindset says they can be altered and improved.

So, why is this important? This sounds like just an issue of perspective. But a perspective change (a paradigm shift) is capable of changing a lot of things.

The book argues that fixed-mindset thinking leads people to play defense. Fixed-mindset people are scared of failures and embarrassment, as these would show the world their (limited and unchanging) capacity. So, in order not to fail and to save face, they don't attempt things.

In contrast, growth-mindset people (i.e., people who embrace the growth mindset/perspective) love challenges. They are not scared of failure and embarrassment. They chalk up failure and embarrassment as learning and improvement opportunities. Of course, this does not mean that they can absolve themselves of any responsibility or guilt about their failings/shortcomings. They need to accept the pain in order to grow: no pain, no gain. It just means that growth-mindset people do not let fixed capabilities and skills define their self-worth. They know they can grow. The growth mindset is akin to taking an antifragile approach to personality and life.

The fixed-versus-growth-mindset distinction has several implications in education, parenting, relationships, and sports coaching. One practical piece of advice (you have certainly heard this from media/blogs/newspapers) is about how to praise your kids. Don't praise your kids for how smart they are, as this reinforces the fixed mindset. Praise them for how hard they try, how persistent they are, how courageous they are to attempt new challenges. This way you reinforce the growth mindset in them.

One of the biggest harms we can do to our kids is to raise them with the fixed mindset. And one of the biggest gifts we can give them is to raise them with the growth mindset. Don't just teach them how to fish. Teach them that they can always become much better fishermen than you were and than they currently are.

Sunday, November 12, 2017

Book review. The Undoing Project: A friendship that changed our minds

I recently read this 2017 book about the collaboration between two psychologists, Amos Tversky and Daniel Kahneman. The book is by Michael Lewis, a truly great storyteller. From his pen, the academic collaboration story of two social scientists becomes a love story and a thriller. I would never have imagined social science and behavioral economics could be this exciting. So I plan to read more of Michael Lewis's books: Flash Boys, The Big Short, Moneyball, The Blind Side, and Liar's Poker.

Here are some short notes from the book.

Danny Kahneman had a very tough childhood. His family (except his father) survived Nazi persecution and World War 2, and was able to immigrate to Israel in 1948. He was a gifted child and starred in academia, although throughout his life he had doubts about his talents and was always unsure of himself.

Amos Tversky was born in Israel and served in the Israeli army for many years. He went to the US for graduate school in psychology. He was very bright, which led his colleagues to coin a mock IQ test: "The faster you realized Tversky was smarter than you, the smarter you were."

The book contains many amazing quotes from Amos. When he was asked why/how he had become a psychologist, Amos replied: "The big choices we make are practically random. The small choices probably tell us more about who we are. Which field we go into may depend on which high school teacher we happen to meet. Who we marry may  depend on who happens to be around at the right time of life. On the other hand, the small decisions are very systematic. That I became a psychologist is probably not very revealing. What kind of psychologist I am may reflect deep traits."

Amos's early research on the similarity of things was also very interesting, and the book explains it nicely. Similarity is not like distance, because it is not necessarily symmetric. People assess similarity essentially by making a list of features: the more noticeable features two objects share, the more similar they are. Note that objects have varying numbers of noticeable features: New York has more of them than Tel Aviv. As a result, New York is not as much like Tel Aviv as Tel Aviv is like New York. Hence, similarity can be asymmetric. (In other words, similarity is like a Bayesian subset comparison of features.) The book has a nice anecdote about how Amos insisted that sometimes the lack of a feature is a feature: e.g., a three-legged dog.
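Tversky later formalized this as the contrast model: similarity weighs common features against each object's distinctive features, with asymmetric weights on the two sides. A toy sketch (the feature sets and weight values are made up for illustration):

```python
def tversky_sim(a, b, theta=1.0, alpha=0.8, beta=0.2):
    # Tversky's contrast model: shared features count toward similarity,
    # distinctive features count against it -- and the subject's (a's)
    # distinctive features are weighted more than the referent's (b's).
    common = len(a & b)
    return theta * common - alpha * len(a - b) - beta * len(b - a)

# Made-up feature sets: the more prominent city has more noticeable features.
tel_aviv = {"coastal", "mediterranean", "nightlife"}
new_york = {"coastal", "nightlife", "skyscrapers", "subway", "broadway"}

# "Tel Aviv is like New York" scores higher than "New York is like Tel Aviv"
print(tversky_sim(tel_aviv, new_york), tversky_sim(new_york, tel_aviv))
```

Because the weights alpha and beta differ, swapping the subject and referent changes the score, reproducing the asymmetry the book describes.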

When Amos and Danny got together as colleagues, they started an epic collaboration streak investigating the irrational ways humans make decisions about risk. Their collaborative research started the behavioral economics field.

Their research is a prime example of academic courage and creativity. They were able to ask very original and daring questions, and they were brave enough to pursue those questions and break ground in uncharted territory. An example of the kind of thought experiments they performed is the Asian disease problem, which elegantly demonstrates the power of framing.

I will not divulge more about the collaboration of this duo and about the book. I enjoyed the book immensely and I strongly encourage you to read the book to learn about the collaboration story of these two great men. The book gives great insights about how to approach research as well.

Amos passed away in 1996 due to metastatic melanoma. Danny was awarded the 2002 Nobel Memorial Prize in Economic Sciences. You can listen to Danny talk here.

Friday, November 10, 2017

Paper summary. Towards distributed machine learning in shared clusters: a dynamically partitioned approach

This paper (by Peng Sun, Yonggang Wen, Ta Nguyen Binh Duong, and Shengen Yan) was posted on arXiv in April 2017.

This paper was a little confusing to read. I think it could have been presented better to make its contributions clearer. The paper aims to enable multiple distributed ML frameworks, such as TensorFlow, Petuum, and MXNet, to share the same cluster.

Enterprises have clusters managed by cluster management systems (CMSs). The paper starts with a review of existing CMSs and mentions shortcomings of each. It is unhappy with application-level scheduling, because there each application reserves and keeps all allocated resources until completion, and this leads to low utilization, as scheduling is done for the peak/maximum resource needs of the application.

In the task-level scheduling mode, applications use acquired resources to run a single task, release them as soon as the task completes, and petition for new resources to launch uncompleted tasks. The paper cites high scheduling overhead with this approach: each task must wait until receiving suitable resources. (But the SLAQ paper we reviewed has taken this task level scheduling route and didn't find significant scheduling overheads.)

In order to enable distributed ML workloads to share a single cluster, Dorm uses two techniques: a dynamically partitioned cluster management mechanism and a utilization-fairness optimizer.

The solution Dorm proposes is simple. It uses Docker containers to partition a cluster and runs one application per partition. Each application places its tasks on the assigned partition without petitioning for resources. Dorm can then adjust the existing resource allocations (i.e., number of containers in a partition) to keep a high resource utilization.

When adjusting an application's resources, Dorm first saves its state to a reliable storage system (e.g., Lustre filesystem). Then Dorm kills this application and creates/destroys containers on corresponding servers. Finally, Dorm resumes the killed application from the saved state with new resource allocations. This way, distributed ML applications can dynamically scale up/down without recomputing from the first iteration.
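The save-kill-resume cycle can be sketched as follows; the in-memory storage, the fake work loop, and the resize trigger are toy stand-ins (Dorm's real mechanism operates on Docker containers and a reliable filesystem such as Lustre):

```python
# Toy sketch of Dorm's resize-by-restart mechanism (all names made up).
# State is checkpointed to "reliable storage"; the job is killed for a
# resize and then resumed from the checkpoint, not from iteration 0.

storage = {}                   # stand-in for a reliable store (e.g., Lustre)
resize_at = {"job-1": 5}       # pretend the optimizer resizes at iteration 5

def run_job(job_id, workers, total_iters):
    state = storage.get(job_id, {"iter": 0, "work_done": 0})
    for i in range(state["iter"], total_iters):
        state["work_done"] += workers      # fake one iteration of work
        state["iter"] = i + 1
        storage[job_id] = dict(state)      # checkpoint after each iteration
        if resize_at.get(job_id) == i:
            del resize_at[job_id]
            return "killed-for-resize"     # Dorm kills the app here
    return "done"

assert run_job("job-1", workers=4, total_iters=10) == "killed-for-resize"
# Dorm grows the partition from 4 to 8 containers, then resumes the job;
# it continues from iteration 6 rather than recomputing from scratch.
assert run_job("job-1", workers=8, total_iters=10) == "done"
print(storage["job-1"]["iter"])
```

The questions in the next paragraph are about exactly the part this sketch glosses over: what "state" must contain for a real distributed ML job.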

But this leaves me with several questions. Is this checkpointing only for the parameter-server state and not the worker states? Would the checkpointing work with TensorFlow which has dataflow graphs and associated state at each worker? Would those worker states matter? Would the states of the channels (i.e., messages in transit) matter? Finally, how is the checkpoint done in a distributed manner? The checkpoints will naturally appear at different points in different workers/containers; does that cause a problem?

The paper reports that Dorm was implemented to work with Petuum, MxNet, TensorFlow, and MPI-Caffe. But it lacks details about how the implementations were done and how the application frameworks were modified to work with Dorm.

Wednesday, November 8, 2017

TLA+/PlusCal modeling of Synchronized Round Consensus Algorithm: Solution

The other day I posed the synchronized round consensus question.

Here is the solution on GitHub, and some brief explanation of the relevant part below.

Single round consensus algorithm

The code above is pretty straightforward. The while loop between lines 36-42 models how a node sends its vote to the other nodes one by one. The sender node can fail in this while loop after only some of the nodes have received its vote. So if we model check with FAILNUM=1, the Agreement invariant is violated in this single-round algorithm, as seen in the error trace below.

The blue highlighted line, state 15, is the last line in the error trace, and the values of the state variables are listed in the window below. If you inspect "up" you can see node 1 is down. Checking "mb" you can see node 2 received node 1's vote, but node 3 did not. As a result, the decision "d" for node 2 is "1", whereas node 3 decides "2", and both decisions are finalized. So the invariant "Agreement" is violated.

(Note that even if we had 10 nodes and FAILNUM=8, we could have extended this scenario by always failing, in each round, the smallest-id up node right after it delivers its message to the next node in the sequence, keeping the "1" vote alive but hidden.)
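The violating run can be replayed in a few lines of Python (a sketch of the scenario, not the PlusCal model itself):

```python
# Replay of the Agreement violation with 3 nodes and FAILNUM=1:
# node 1 crashes mid-broadcast, so node 2 hears vote 1 but node 3 does not.
nodes = [1, 2, 3]
vote = {1: 1, 2: 2, 3: 3}            # each node proposes its own id
mailbox = {n: set() for n in nodes}
crashed = set()

def broadcast(sender, deliver_to=None):
    # deliver_to=None means the sender stays up and reaches every node
    for receiver in nodes:
        if deliver_to is None or receiver in deliver_to:
            mailbox[receiver].add(vote[sender])

broadcast(1, deliver_to={2})         # node 1 crashes after reaching node 2
crashed.add(1)
broadcast(2)
broadcast(3)

decision = {n: min(mailbox[n]) for n in nodes if n not in crashed}
print(decision)                      # node 2 decides 1, node 3 decides 2
assert decision[2] != decision[3]    # Agreement violated
```

Of course, TLC finds this trace automatically by exploring all schedules; the sketch just replays the one schedule from the error trace.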

Another interesting part in the code occurs at lines 43 and 44.
After sending its vote to the other nodes, the node increments its round number, pt, by 1. But then it "awaits" the other nodes to catch up, and goes to the next round only after this synchronization await at line 44. Note that the node awaits only the nodes that are "up". If it waited for a down node to increment its pt, it would wait forever.

This await at line 44 cuts corners: it assumes shared memory instead of message passing. One way to implement this unrealistic "await" is to use physical time, but even that is brittle. In reality it is hard to implement perfectly synchronized rounds. Physical clock synchronization is hard, and since the OS is not a real-time OS, timing assumptions can be violated, say due to garbage collection kicking in, the VM/container getting slow, or network contention.

When the round synchronization assumption is broken, this algorithm fails. Trust me, you don't want your consensus algorithm, which the rest of your coordination infrastructure depends on, to fail. That is why consensus algorithms adopted in practice, such as Paxos, Raft, Zab, and Viewstamped Replication, do not rely on synchronized rounds and tolerate (in the sense that agreement is not violated) extreme asynchrony in the system. When the timing assumptions normalize a bit, those algorithms then achieve progress and solve consensus.

Crash tolerant synchronized round consensus algorithm

To tolerate crash faults, the algorithm clearly needs to be extended to delay the decision to future rounds, where each node can ensure that all the nodes have the same set of values from which to decide.

To this end, line 32 introduces a top-level while loop to the original model to iterate through multiple rounds.

The important question here is to determine in which round it is safe to decide.

The trivial way to do this is to always iterate FAILNUM+1 rounds before deciding, but that is wasteful. FAILNUM is the maximum number of faults that can occur, and the algorithm should be able to decide in fewer rounds in the common case when faults do not occur. But how do you tell, using only the node's own perception of the system, which is by definition partial and always slightly stale?

One way to do this is to look at the stability of the set of proposed votes by comparing mb, the mailbox contents for this round, with pmb, the mailbox contents from the previous round. Since there is always the potential for a fault, the algorithm should always go to round 2 to confirm with the others. The decision should be delayed until the proposed votes converge to a single value for two consecutive rounds. The cardinality of the mailbox is also important, because it witnesses that no faults occurred, so the pathological crash sequence of vote hiding above is avoided.

Observations

I had written a faulty version of the multi-round algorithm on my first try. I had not taken the cardinality of the set into account and went with a straightforward set union. It didn't give any violations of the invariant for N=5 and FAILNUM=3, but checking the progress part was taking more than an hour on my laptop, and I stopped the run. It turns out that version was susceptible to the pathological crash sequence and vote hiding described above. All the nodes decide on "2", but there is this one node that just received a "1" vote that was still alive but hidden. So this node goes to the next round, but since the others have decided, this node will await forever. This is a wacky flavor of consensus, which might still be acceptable if this minority-report node kills itself by setting up:=FALSE. This led me to improve the condition at line 44. Another bug was that a node in a higher round could send messages that get consumed by a node in a lower round, which leads to nodes getting stuck. To solve this, I had to improve the condition at line 37.

I found these mistakes when I started writing this blog post. So writing and explaining is an indispensable part of the design process. If TLA+ model checking were more performant, I wouldn't have given up prematurely, and I would have found the problem then. If model checking is taking long, it may be best to do it in the cloud, say on AWS, Azure, or GCE. But state space explosion is an inherent and dangerous nemesis for model checking, and it will bite. The best precaution is to keep things simple/minimal in the modeling. But that is not always easy.

Monday, November 6, 2017

Paper summary. SLAQ: Quality-driven scheduling for distributed machine learning

This paper (by Haoyu Zhang, Logan Stafman, Andrew Or, and Michael J. Freedman) appeared at SOCC'17.

When you assign a distributed machine learning (ML) application resources at the application level, those resources are allotted for many hours. However, loss improvements usually occur during the first part of the application's execution, so it is very likely that the application underutilizes the resources for the rest of the time. (Some ML jobs are retraining of an already trained DNN, or compacting of a DNN by removing unused parameters, etc., so blindly giving more resources at the beginning and pulling some back later may not work well.)

To avoid this, SLAQ allocates resources to ML applications at the task level, leveraging the iterative nature of ML training algorithms. Each iteration of the ML training algorithm submits tasks to the scheduler with running times of around 10ms-100ms. This is how Spark-based systems readily operate anyway. (The Dorm paper criticized this iteration-based task scheduling approach, saying that it causes high scheduling overhead and introduces delays from waiting to get scheduled, but offered no analysis for those claims.)

SLAQ collects "quality" (really measured by "loss") and resource usage information from jobs, and using these it generates quality-improvement predictions for future iterations and schedules future iterations' tasks based on these predictions.

The paper equates "quality" with "loss", and justifies this by saying:
1) "quality" cannot be defined except at the application level, so to keep things general, "loss" is used;
2) for exploratory training jobs, reaching 90% accuracy is sufficient quality, and SLAQ enables getting there in a shorter time frame.

On the other hand, there are drawbacks to that. While delta improvements in loss may correspond to improvements in quality, the long tail of the computation may still be critical for "quality", even when the loss is decreasing very slowly. This is especially true for non-convex applications.

The paper normalizes quality/loss metrics as follows: For a certain job, SLAQ normalizes the change of loss values in the current iteration with respect to the largest change it has seen for that job so far.

SLAQ predicts an iteration's runtime simply from how long it takes the N tasks/CPUs to process S, the size of the data processed in an iteration (i.e., the minibatch size).

For scheduling based on quality improvements, the paper considers a couple of metrics, such as maximizing the total quality and maximizing the minimum quality. The paper includes a good evaluation section.
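A toy sketch of this style of quality-driven allocation (the loss histories, the normalization, and the diminishing-returns factor below are simplifications made up for illustration, not SLAQ's actual estimator):

```python
# Greedy quality-driven allocation sketch: normalize each job's latest
# loss delta by the largest delta it has seen, then hand out cores one
# at a time to the job with the best predicted improvement per core.

jobs = {
    "jobA": [10.0, 5.0, 2.5, 1.25],   # still improving quickly
    "jobB": [3.0, 2.9, 2.85, 2.84],   # nearly converged
}

def predicted_improvement(losses):
    # normalize the latest loss delta by the largest delta seen so far
    deltas = [losses[i] - losses[i + 1] for i in range(len(losses) - 1)]
    return deltas[-1] / max(deltas)

def allocate(jobs, cores):
    alloc = {j: 0 for j in jobs}
    for _ in range(cores):
        # crude diminishing-returns assumption: each extra core helps less
        best = max(jobs,
                   key=lambda j: predicted_improvement(jobs[j]) / (1 + alloc[j]))
        alloc[best] += 1
    return alloc

alloc = allocate(jobs, cores=4)
print(alloc)
```

The nearly converged job still gets some resources, but most cores flow to the job whose loss is still dropping, which is the intuition behind scheduling on observed loss improvements.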

In conclusion, SLAQ improves the overall quality of executing ML jobs, particularly under resource contention, by scheduling at the finer granularity of the task level based on observed loss improvements.

Saturday, November 4, 2017

TLA+/PlusCal modeling of Synchronized Round Consensus Algorithm

In my distributed systems class for Fall 17, I assigned modeling of the synchronized round consensus algorithm as the first project. I have been assigning TLA+/PlusCal modeling projects in my class for the last 4 years and releasing the projects and their solutions. I believe this is useful for the distributed systems community, because at this point the barrier to wider adoption of TLA+ tools seems to be the lack of TLA+ modeling examples of algorithms/systems. My goal is to provide a TLA+/PlusCal example for everything I teach in the class. This way the students get hands-on experience in algorithm design and in dealing with the intrinsic complexities of distributed systems: concurrent execution, asymmetry of information, concurrency bugs, and a series of untimely failures.

Here is some previous discussion/context about why I started assigning TLA+/PlusCal modeling projects in distributed systems classes.

Timing of the project

I think I timed this project well. In the first month of the semester, I covered reasoning about distributed programs in terms of safety and progress properties and gave the students a sufficient introduction to TLA+/PlusCal as well. While the students worked on the project, I covered time/state in distributed systems (logical/vector clocks, distributed snapshots, asynchrony concerns), and distributed mutual exclusion and dining philosophers.

The submission deadline for the project was just before I covered consensus. Working on the project prepared the students to appreciate the challenges of consensus and primed them about the issues with some hands-on practice. So, in some sense, this is my poor man's implementation of a flipped classroom.

Project 1 makes unrealistic assumptions of synchronized rounds and reliable channels. Last week I told the students about why and how these need to be relaxed further, and that made a very nice introduction to Paxos and the failure detectors discussion that comes after.

Synchronized consensus

Every process broadcasts (to all other processes, including itself) its initial value v. In a synchronous network, this can be done in a single "round" of messages. After this round, each process decides on the minimum value it received.

If no faults occur, this algorithm is correct. In the presence of a crash fault, however, a problem can arise. In particular, if a process crashes during a round, some processes may have received its (low) initial value, but others may not have. (Note that the channels are always assumed to be fault-free; they deliver messages reliably once a message is put to the channel.)

Use the template below as your starting point, and fill in the redacted parts. Write and test an invariant property to capture the Agreement property of the consensus protocol. The agreement property should be satisfied when FAILNUM=0, i.e., when no node is allowed to fail. The property will fail to be satisfied when FAILNUM>0. In that case, write in the comments section, after the "================" line, your findings/observations about how the Agreement property is violated.

Extending the algorithm to address crash faults

To address crash faults, consider this simplifying assumption: say that at most 1 process can crash. How can we modify the algorithm to handle such a failure? (Note again that the channels are always fault-free; they deliver messages reliably once a message is put to the channel.)

Answer: by using 2 rounds. In the 1st round, processes broadcast their own initial value. In the 2nd round, processes broadcast the minimum value they heard. Each process then decides on the min value among all the sets of values it received in the 2nd round.

If the one crash occurs during the first round, the second round ensures that all processes have the same set of values from which to decide. Else, if the one crash occurs during the second round, the first round must have completed without a crash and hence all processes have the same set of values from which to decide.
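A toy simulation of the 2-round algorithm (a hypothetical Python sketch with my own names, not the PlusCal model you are asked to write) shows Agreement holding no matter which round the single crash lands in:

```python
# Toy simulation of 2-round min-value consensus tolerating at most 1 crash.
# Hypothetical sketch: function and parameter names are my own.

def broadcast(values, live, crashing=None, delivered_to=frozenset()):
    """One synchronous round: every process in `values` broadcasts its value
    to the `live` processes. If `crashing` is set, that process crashes
    mid-round and its message reaches only the processes in `delivered_to`."""
    inbox = {p: set() for p in live if p != crashing}
    for sender, value in values.items():
        for receiver in inbox:
            if sender == crashing and receiver not in delivered_to:
                continue  # message lost: the sender crashed before sending it
            inbox[receiver].add(value)
    return inbox

def two_round_consensus(initial, crash_round=None, crashing=None,
                        delivered_to=frozenset()):
    live = set(range(len(initial)))
    # Round 1: broadcast the initial values.
    inbox1 = broadcast(dict(enumerate(initial)), live,
                       crashing if crash_round == 1 else None, delivered_to)
    live = set(inbox1)  # a process that crashed in round 1 drops out
    # Round 2: broadcast the minimum value heard in round 1.
    mins = {p: min(vals) for p, vals in inbox1.items()}
    inbox2 = broadcast(mins, live,
                       crashing if crash_round == 2 else None, delivered_to)
    # Decide on the minimum over all values received in round 2.
    return {p: min(vals) for p, vals in inbox2.items()}

# Crash in round 1: process 1 (value 1) reaches only process 0, yet round 2
# relays the minimum to everyone, so the survivors still agree.
d = two_round_consensus([3, 1, 2], crash_round=1, crashing=1, delivered_to={0})
assert set(d.values()) == {1}

# Crash in round 2: round 1 completed cleanly, so everyone already holds the
# same set of values, and the survivors agree as well.
d = two_round_consensus([3, 1, 2], crash_round=2, crashing=1, delivered_to={0})
assert set(d.values()) == {1}
```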

Without knowing/referring to FAILNUM, modify your first PlusCal algorithm to achieve consensus in the presence of crash faults. The key observation is that if no crash occurs during a round, all processes have the same set of values from which to decide, and they correctly decide on the same minimum value.

Future projects

For the second project, I am assigning two-phase transaction commit modeling. There are already models of this available from Lamport's webpage, and I ask the students to model what happens when the initiator/transaction manager (TM) fails, how a backup TM would take over, and what types of problems would arise in an asynchronous system where failure-detection timeouts may fail.

I don't know what projects I will assign next year. Viewstamped replication modeling, maybe? Another interesting one would be modeling blockchains and consensus. Feel free to suggest ideas in the comments or via email.

While the projects may change every year, one thing is invariant: My class enables you to start a lifelong career in modeling, regardless of your looks.  And as I continue my career in modeling, I find that, yes, the night life is pretty exciting: TLA+ has a way of enticing people to burn midnight oil.

Friday, November 3, 2017

HPTS'17 day 2

On HPTS day 2, there were 4 presentation sessions (I mention 2 below) and an evening session on remembering Jim Gray and Ed Lassettre.

Verification of Systems session

The first talk was Jepsen VIII by Kyle Kingsbury, who breaks databases for a living. He gave a very fast-paced talk. A good rule of thumb for presentations is to go with 2 minutes per slide. Kyle flips this rule upside down and then goes even further, presenting 5 slides per minute. He presented 150+ slides in less than 30 minutes, and somehow he made this work. I can't believe how quickly/smoothly he was able to transition from one slide to the next, and how he managed to memorize all those transitions.

Kyle's Jepsen toolkit tests databases as blackboxes using a client to submit overlapping operations, where the start/end of operations define the operation intervals. To prepare these tests, Kyle first carefully reads through the documentation to see which guarantees are claimed and then he writes a bunch of targeted tests in Jepsen to check if the database indeed behaves as advertised under faults such as partitions, crashes, and clock skew.

Kyle first reported on the Jepsen test results for VoltDB, an in-memory SQL database. The documentation claimed that all transactions are strictly serializable, but this was violated in v6.3, because stale reads, dirty reads, and lost updates were possible in the presence of network partitions. VoltDB 6.4 passed all Jepsen tests for strict serializability, even though this meant incurring a 20% performance hit on reads.

He then talked about MongoDB. Using wall-clocks as optime timestamps led to lost updates due to clock skew. MongoDB fixed this by making the optime a tuple of a logical term and a wall-clock timestamp. Despite these improvements, Jepsen tests identified data loss issues even with the v1 replication protocol of MongoDB.

Kyle also talked about the Tendermint blockchain, which was found to lose documents and fail to tolerate Byzantine faults.

He concluded with this advice for distributed system design: be formal and specific, figure out the invariants your system needs, consider your failure modes (e.g., crash, clock skew, process pause, partition), and test the system end to end.

The next 2 talks in the session were from Peter Alvaro and his student, on the topic of Lineage-Driven Fault Injection for testing. Peter had 3 other students presenting at HPTS, which was an impressive accomplishment.

Antifragile Exercises

Casey Rosenthal of Netflix gave a lively presentation about Chaos Engineering. Kolton Andrus (Gremlin Inc) talked about breaking things on purpose, and also made a great case for chaos testing in production.

Evening session

The evening session had people stepping up telling their memories about Jim Gray and Ed Lassettre.

Pat Helland's tribute to Jim Gray is worth another read to remind us what a great human, collaborator, and researcher Jim Gray was.

Sunday, October 29, 2017

HPTS'17 day 1

This post is a continuation of yesterday's post. I cover only some of the talks; you may check out the other talks in the HPTS agenda.

HPTS day 1 started with a keynote from Pat Helland. Pat Helland is a database veteran. I had covered some of his papers on this blog before. He writes insightful position papers with a unique style. The keynote provided an overview of trends in storage and computing and hit the high notes from his earlier position papers as well, and mentioned this table:
linearizable    | no  | no  | yes
nonlinearizable | yes | yes | no
cached-data     | yes | no  | no

The point is you can't have everything. Pat argues "immutability is a solid rock to stand on" and is the closest you can get to yes, yes, yes in that table. However, Pat cautions that what matters is not hitting yes, yes, yes on that table. Don't fall in love with any of those consistency semantics; what matters is the application requirements. Different applications demand different behaviors from durable state: Do you want it right or right now?

Here are Pat's presentation slides.

Stop building infrastructure (Evan Jones, Bluecore)

Evan started the presentation with a disclosure: warning, zero facts ahead. This was a sign that this would be a deliberately provocative/controversial position. He argued that developers should stop building infrastructure (i.e., databases, load balancers, service orchestration software).

His main point was that companies already have their workflows/toolchains streamlined on a cloud vendor, and compared to the simplicity of integrating an existing service from the cloud vendor into those workflows/toolchains, integrating a 3rd-party product is an uphill battle. Ain't nobody got time for that.

His other points were:

• the cloud vendors are better at building infrastructure than you
• even if you succeed, your infrastructure is going to run in the cloud, and will make money for the cloud vendors
• cloud vendors can destroy you via copy and crush
• at a company it is easy to justify paying an existing cloud vendor, but it is cumbersome to convince management to buy 3rd party software

These are fair points, but I think his talk should instead have made the opposite point: things are dire for the infrastructure startups when facing the cloud vendors, so how can we help/encourage them? At one point in his talk, he mentioned that their company used only one outside vendor, Datadog, because Google Cloud hiked up the prices on its monitoring service. This shows that we should be rooting for the success of infrastructure services/companies, so that the cloud vendors don't grab all the power over developers.

Moreover, Evan's points don't take into account enterprise markets, or foreign markets like Europe and China. Infrastructure companies exist because there is a market for them. Several infrastructure/database companies have done really well financially, and I hope we will see more success stories from them in the future as well.

Blockchain and Databases - C Mohan (IBM)

Mohan gave an overview of blockchain and talked about the HyperLedger project. His talk(s) and more information can be found at http://bit.ly/CMbcDB.

CockroachDB: From OLTP to HTAP - Arjun Narayan (Cockroach Labs)

Here are the slides for the talk. OLTP stands for online transaction processing, and HTAP stands for hybrid transactional/analytical processing.

The talk gave a quick overview of CockroachDB's scale-out architecture for implementing a WAN-distributed SQL database service. When building CockroachDB, the team prioritized correctness and stability over performance, and attacked those first. They are now working on performance improvements and achieving scale-out OLTP. Arjun talked about how HTAP can be done over CockroachDB in the future. He would like to leverage the CockroachDB OLTP layer and adopt incrementally updated materialized views as the building primitive. Concretely, he argues that since CockroachDB timestamps give serializable transactions that span OLTP+OLAP, it is possible to build dataflow systems like Naiad or timely dataflow on top for achieving HTAP.

Gongshow

After the dinner, we gathered at the chapel for a potpourri of 5 minute talks. Andy Pavlo's student Joy Arulraj organized the session. (Joy will be on the academic job market and he will be a great catch for whoever manages to hire him.)

There were many interesting talks in the gong session. And with 5 minutes per talk, they were fast-paced as well.

Around noon, Joy also convinced me to present at the gong show, so I had to write some slides in the afternoon. Thanks to the emacs org-mode export to beamer, I was able to come up with slides in time for the talk. I presented our ongoing work on wPaxos, our WAN Paxos protocol. I will be writing a blog post about that sometime soon.

Saturday, October 28, 2017

HPTS'17 day 0

A couple weeks ago, I attended HPTS'17. As I wrote in my HPTS'15 posts (day 1 and day 2), HPTS is an unconventional workshop. "Every two years, HPTS brings together a lively and opinionated group of participants to discuss and debate the pressing topics that affect today's systems and their design and implementation, especially where scalability is concerned. The workshop includes position paper presentations, panels, moderated discussions, and significant time for casual interaction. The only publications are slide decks by presenters who choose to post them." HPTS is by invitation only and keeps attendance under 100 participants. The workshop brings together experts from both industry and academia so that they can mix and interact.

Although some people prefer what happens at HPTS to stay at HPTS, I find it hard not to talk about HPTS. I learn a lot at HPTS and I want to share at least some of that. And this year I don't think there was any confidential technology discussed at all. So I don't think Pat Helland will find this post and shout at me.

Travel to HPTS

I flew from Buffalo (BUF) to Chicago (ORD) to San Jose (SJC). The United flights were on time, and I was happy except for United asking for payment for any sort of in-flight entertainment. Almost nobody pays for the in-flight movies, but unfortunately almost nobody turns off their monitors in the seats either. So every monitor in the plane keeps playing catchy advertising clips over and over again, which drives me mad. I am the type of person who cannot concentrate enough to participate in a conversation if there is a screen in the same room. So 3 hours into the flight, I get into a half-crazed mode, as these monitors catch my eye again and again in a loop and torture me.

I landed at the SJC airport at noon and headed to the car rental. I was excited because I was going to see if I could actually get my \$9 daily rate from Hertz. Yes, you heard it right. When I bought my tickets from United a month earlier, I had reserved a car rental. The rates were around \$40 for SJC car rentals. I checked these a couple times while trying to decide which company to rent from. Then when I refreshed the screen, I saw that Hertz was now only \$9 daily. This surely must have been a glitch or an update error in their database, but I immediately jumped on it and made the reservation. See the screenshot below.

So, when I arrived at the Hertz counter, I was determined to stick to my guns and see how this would play out. HPTS dinner was not till 6pm, and I wasn't pressed for time. When it was my turn, I waited for the Hertz staff to pull up my reservation. She wasn't able to find it with my name, but when I gave her the reservation number she located it. She said that they didn't have any cars to offer at a \$9 rate, let alone a midsize car for \$9. I insisted that this was the reservation I had, and didn't comment further. Now I got really curious and wanted to see how they would resolve this issue, and I wasn't going to make it easy for them by budging early.

She called her manager, and the manager moved me to another counter. At this point I was already 30 minutes into my stay at the Hertz counter. The manager again couldn't locate the reservation with my name, but located it with the reservation number. He kept looking at it, and asked me how I was able to get a reservation for \$9, because it was the lowest rate he had ever seen. I refrained from commenting. He stared at the screen for many minutes, and called his manager. While we waited for the manager's manager to arrive, the manager helping me was also on the phone with their support center. At one point he lit up, and said "Oh! That's why!!!"

At this point, we are about 50 minutes into my stay at the Hertz counters. But since I have been taking this in stride, and was curious about how it would play out, I am still having fun. I am even thinking "Oh, I bet this will make a fun story for why distributed databases should implement strong consistency". (Like that time when Brad Fitzpatrick gave away an Apple gift card for free on his Twitter account, and caused a race condition. Scroll down and expand to read the continuation of the thread; it is hilarious.)

OK, back to the manager and his moment of enlightenment. The manager told me that my reservation was for SJO and not for SJC, and I told him I didn't understand what he meant. He then explained that my reservation was for SJO, CR, the San Jose airport in Costa Rica, and not for SJC, CA, the San Jose airport in California. The manager's manager arrived, and I explained to them that this is what the United web page gave me and I wasn't the one at fault. They told me not to worry, as they would give me their best rate.

Their best rate turned out to be \$79. I had thought that Hertz was the victim of a computer error and would have to go with the \$9 rate, but it turned out I was the victim of a computer error. The joke was on me now. So I moved to the Dollar counters, rented a compact car at a \$35 rate, and left the car rental around 1:30pm. Computers hate me.

I still had time to kill before the HPTS dinner, so I drove through Santa Cruz. I took a walk on the Santa Cruz Wharf. It was an absolutely gorgeous day. I then drove to the Asilomar conference grounds, checked in to my room, and took a nap before dinner.

Dinner and beyond

For dinner, I was at the same table with Shel Finkelstein and Ethan Miller, both from UC Santa Cruz. When they asked me what I am working on these days, I mentioned the WAN Paxos protocol I am working on. They were more interested in the semantics of its transaction model, and asked me whether it was MVCC and used snapshot isolation (the answer for our WAN Paxos is no). A long discussion on snapshot isolation ensued ---a proper way to start HPTS indeed. Shel gave a snapshot isolation example with two variables in the snapshot, a=0 and b=100, and the invariant a<b. Thread 1 starts with that snapshot and sets a:=60, still satisfying the invariant. Thread 2 starts with that snapshot and sets b:=20, still satisfying the invariant. While both threads are individually correct, their combined result violates the invariant.

At the dinner, I also met with Peter Bailis briefly. Always nice to meet him again. I wish I had more time to chat and learn from him. Later, on day 3, I met his graduate student Firas Abuzaid and chatted with him as well.

After dinner, we went to the chapel for snacks and beverages. I met Jonathan Ellis, cofounder and CTO of DataStax, and had a long talk. I then talked with a MongoDB engineer. And I followed that by talking to Ben Darnell, CTO and cofounder of CockroachDB. Vaibhav Arora from UC Santa Barbara was also in that conversation. (He will be on the job market soon, he is brilliant, hire him.) We talked about the CockroachDB transaction commit protocol. The protocol starts like snapshot isolation, with the reads performed at a given timestamp, but at commit time it serializes commits and prevents conflicts in a pessimistic manner. That is, unlike the snapshot isolation example above, the protocol checks the timestamps of the reads at Transaction 1 and Transaction 2, and would not allow both transactions to commit, since both transactions read at timestamp 0 and they overlap in accessing the same variables.

Then we moved to the directors cottage. Pat Helland had a special beverage tasting scheduled there till midnight. I met Evan Jones from Bluecore. I met several others in the cottage and talked about their current work and my current research. Then I met Tony Voellm from the Microsoft CosmosDB team, and talked with him at length. Then we both joined a group talking about SSD disks with Ethan Miller. Ethan argued that a big benefit of switching to SSDs is that it reduces the space footprint of servers/storage in the datacenter, and that is a big gain. I also talked to Ippokratis Pandis at Amazon, whom I knew from before. I was later involved in another conversation, where I learned that datacenters accumulate junk, much like how our laptops accumulate junk files over the years. And restructuring a datacenter is a good chance to get rid of all that junk, much like how we get rid of junk files in our old laptops --by buying a new laptop.

End of day 0

I am not telling these stories to name drop (well, maybe there is a bit of that). I am telling them to show that I was able to meet about 15 people just on day 0. And I am a shy guy. HPTS is great for meeting passionate people and geeking out about distributed systems. I looked at the attendance list, and from a rough estimate I must have chatted with at least 50 participants by the end of day 2 of HPTS. There is a lot more to tell, but I will save day 1 for the next post.
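As an aside, Shel's write-skew example from the dinner is easy to replay in a few lines (a hypothetical Python sketch with my own variable names):

```python
# Shel's snapshot isolation write-skew example: both transactions read the
# same snapshot, each preserves the invariant a < b against its snapshot,
# yet the combined result violates it. (Hypothetical sketch, my own names.)

db = {"a": 0, "b": 100}

def invariant(state):
    return state["a"] < state["b"]

# Both transactions take their snapshot before either one commits.
snapshot1 = dict(db)
snapshot2 = dict(db)

# Thread 1: against its snapshot, a := 60 keeps a < b (60 < 100). Commit.
assert invariant({**snapshot1, "a": 60})
db["a"] = 60

# Thread 2: against its (now stale) snapshot, b := 20 keeps a < b (0 < 20).
assert invariant({**snapshot2, "b": 20})
db["b"] = 20

print(invariant(db))  # False: a=60, b=20
```

A pessimistic serializable commit protocol, like the CockroachDB one Ben described, would notice that both transactions read at the same timestamp and touch overlapping variables, and would abort one of them instead of letting both commit.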
Here is the HPTS'17 agenda to whet your appetite.

Friday, October 6, 2017

What does authentic mean?

Seth Godin defines authentic in relation to consistency. Recently, he defined it as "consistent emotional labor." We call a brand or a person authentic when they're consistent, when they act the same way whether or not someone is looking. Someone is authentic when their actions are in alignment with what they promise. Showing up as a pro. Keeping promises. Even when you don't feel like it. Especially when you don't.

I agree with this definition. If I may refine it, I would define the authentic act/behavior as that which causes guilt/unrest if you don't do it. If you don't act as your authentic self, you feel as if you shortchanged yourself; you feel guilt and pain. This doesn't mean that doing the authentic act is not painful. (After all, if it is not painful, it is not worth doing. It is trivial.) Authentic means that if you don't do it, it is also painful, because it causes guilt and unrest. At least if you act as your authentic self, you find peace eventually through your pain of labor.

This is exactly how I feel about writing and research. This is probably how entrepreneurs feel about starting businesses. It is painful, but they cannot not do it. This is also how athletes feel about training; it hurts, but they cannot not do it. So, in War of Art author Steven Pressfield's terms, authentic means being motivated territorially rather than hierarchically, in spite of how hard the resistance pushes you back.

Geesh, I am now writing Seth Godin style posts? What is wrong with me?

Tuesday, October 3, 2017

UB CSE 50 celebrations, Alumni Symposium

Over the last couple of days, we celebrated the 50th anniversary of our department, CSE at University at Buffalo. We celebrated the anniversary with a technical conference and panels. Yesterday, I wrote about the Graduate Research Conference on Friday. Today, I am posting my notes on the Alumni Symposium that took place on Saturday.
Here is a link to the full program.

Keynote Speaker: Prof. Sung-Mo "Steve" Kang, "The 4th Industrial Revolution and Future of Nanoscience and Engineering"

Prof. Steve Kang got an MS from our department in 1972. He was Prof. Peter Scott's student. Steve talked about the era of the 4th industrial revolution: 1. steam engine (labor), 2. electricity (energy), 3. computing (knowledge), and 4. cyberphysical systems (smart production / soft power). As part of this 4th era, Steve credits machine learning as important. He gave examples of AlphaGo vs. Lee Sedol, a novel written by AI in Japan, the KAIST Hubo robot winning the 2015 DARPA Robotics Challenge, and Ross, an AI lawyer at Baker & Hostetler LLP.

Then he went back and took us through history, starting with the invention of the transistor in 1947 and the integrated circuit in 1958. Steve then talked about how he led the development of the world's first 32-bit microprocessor chips as a technical supervisor at AT&T Bell Laboratories in 1981. When I say Steve talked about this, I am not being truthful. Steve is so humble that he never mentioned that he had led the project. He just talked about how challenging it was to build the chip, and what a wonderful job the team had done.

Then he talked about how to achieve progress unlimited by Moore's law. The key is diversification into biochips. This harks back to the neuromorphic engineering mentioned in Prof. Bruce Shriver's talk the previous day. He then mentioned memristors. Steve's PhD advisor, Prof. Chua at UC Berkeley, had conjectured that, to complement the capacitor, the resistor, and the inductor, another component, the memristor, should exist. Steve was again very humble and mentioned this as a footnote, but his PhD dissertation laid out the theory behind memristors.
A memristor is a resistor with memory: "The memristor's electrical resistance is not constant but depends on the history of current that had previously flowed through the device, i.e., its present resistance depends on how much electric charge has flowed in what direction through it in the past; the device remembers its history." In 2008, a team at HP Labs claimed to have found Chua's missing memristor based on an analysis of a thin film of titanium dioxide, thus connecting the operation of RRAM devices to the memristor concept. The result was published in Nature. Although HP has announced plans for memristor production, it has not happened yet, which is normal: after the invention of the transistor, it took a long time before transistors became practical.

Memristors have many potential applications in neuromorphic computing/engineering. Steve talked about the brain's neocortex having a structure consisting of 6 layers, and how many people see some analogs to FPGA designs in that. It may be possible to implement a synapse using a unit similar to a memristor. He mentioned that the Hodgkin–Huxley model of neurons turns out to be equivalent to a memristive system.

Steve finished his keynote talking about the goals of higher education: creativity, soft skills, challenge, knowledge. He also talked about the seven social sins listed by Gandhi: 1. Wealth without work. 2. Pleasure without conscience. 3. Knowledge without character. 4. Commerce without morality. 5. Science without humanity. 6. Religion without sacrifice. 7. Politics without principle.

Panel 1: Hot Topics in Industry and Academia

Chair: D. Sivakumar (Google). Panelists: Victor Bahl (Microsoft), Anmol Bhasin (Salesforce), Jin-Yi Cai (Wisconsin), Justin Delvecchio (CURC), Faisal Farooq (IBM), Robert Myers (IBM Innovation Center Buffalo), Ashish Naik (Google), Jian Pei (Simon Fraser), Sridhar Seshadri (Synopsys)

The panel included people in core (algorithms and infrastructure), data/information management, and application layers of computing.
Some interesting tidbits from this panel were as follows.

• "Some interesting recent developments in theory/algorithms include Laszlo Babai's work on graph isomorphism."
• "The design automation tools and functional verification tools are vital for integrated circuits. Nowadays the ratio of design to verification employees is 50-50; previously it was 80-20. Verification is of utmost importance. Model checking tools do not suffice due to the sheer scale of the variables/state involved in today's integrated circuits."
• "With a background in computer science theory you can be a good impostor in any other field. I've been winging it for the last 20 years."
• "Workloads are changing: ML workloads are becoming most of our workload these days. Fortunately, energy is not growing exponentially, not following Moore's law. Accelerators like TPUs help with energy efficiency. Renewable energy sources are used as well."
• "Homomorphic encryption in cloud computing will get more important, so that the government cannot subpoena the cloud providers."
• "I am a skeptic on quantum computing, but a proponent of quantum communication."
• "Shtetl-Optimized, Scott Aaronson's blog, is a nice source to follow for developments in quantum computing."
• "A/B testing is dramatically adopted in commercial products for personalization/customization."
• "Deep learning is eating the world. DL/ML will have a very wide impact in fields not yet suspected: the long tail in industry, farming, etc."
• "What will DL do to the data science fields?"
• "Health is ripe for disruption!"
• "Developing/demonstrating the disruptive killer application is as hard as developing the disruptive technology. Don't stop at the technology; go end-to-end with the application as well, that is as critical and important."
• "The future trend estimation reports in 2006 predicted 3D printing and the cloud correctly, but flopped on the Virtual Reality and Second Life predictions."
• "The introduction of sensors/UAVs changed the DOD drastically."
Panel 2: Entrepreneurship - Opportunities and Lessons Learned

Chair: Kannan Govindarajan. Panelists: Russ Agrusa (Iconics), Bob Fritzinger (UB TechTransfer), Dan Magnuszewski (ACV), Ron Schreiber (to be confirmed), Rohini K. Srihari (UB)

Some select quotes from this panel include...

• "I was fortunate to be in the right place at the right time." (Repeated many times.)
• "I have a compulsion to not say no to opportunities."
• "Hodgetech in Buffalo was the first high school in the nation (world?) to have a computer and subsequently computer programming classes. (1962)"
• "The long tail of industries not yet benefiting from machine learning and big data technologies includes the peace-giving and conflict-resolution domains. These are multi-billion-dollar markets as well."
• "Always go to your mom and wife first for funding."
• "Job < Career < Calling"
• "Finish things, solve real problems, have real impact, and have control."
• "After my entrepreneurship experiences, the projects I assign in my classes are more holistic problem-solving projects."
• "When I design a product, I first talk to a salesperson: would you be able to sell this?"

Panel 4: Vision for CSE Department

Chair: Bharat Jayaraman (CSE UB). Panelists: Raj Acharya (Indiana U), Mark Crovella (Boston U), Deepak Kumar (Bryn Mawr College), Aidong Zhang (UB)

Well, you know the drill. Here are some quotes from the panel without context.

• "There is a boom in CSE enrollment. This replicates what we experienced in the 80s. At that time we made the mistake of weeding students out of our courses, and became selective to fight being overcrowded. We should not repeat that mistake. Instead we should grow to handle the capacity."
• "We should be mindful about increasing diversity. Digital humanities! How do you train the K12 teachers for teaching CSE?"
• "How do you prepare current students for industry? Recommendation: make the distinction between science and technology."
• "Renaissance engineer: entrenched in a domain, but also has a background in the humanities."
• "Why are students choosing CSE? It is not just for jobs; it is more than that. Students don't think of CSE people anymore as Dilberts in cubicles. The new perception is that CSE has impact!"
• "Convergence in CSE: interdisciplinary < multidisciplinary < transdisciplinary!"
• "How do we give the medicine that tastes bad? How do we teach fundamentals when students would be more interested in just the technology fads? Cultivate deep intellectual curiosity!"

Banquet dinner talks

After the alumni symposium, we headed to dinner, and after dinner, we listened to alumni reminisce about the department. This was the most entertaining session. The alumni told funny and emotional stories about their time in the department. The thing that came up again and again was the warmth in the department. The alumni that spoke at the event kept mentioning what a tight-knit community the department was and how they used to go to dinners at the faculty's houses. Those were the memories that had most impressed them.

As our department gets bigger, maintaining that warmth also gets challenging. I joined the department when it had about 20 faculty in 2005. Now it is close to 50 faculty. That is fast growth! We currently have around 150 PhD students, 450 Masters students, and 1400 undergraduate students. In spite of the growth, it is important to keep that warmth alive among the faculty and between the faculty and students. The entire 3-day event made me realize once again that we are not only in the business of science & engineering, but also as much in the business of raising scientists & engineers. It was great to see how our department has impact on the world via our alumni as well.

Monday, October 2, 2017

UB CSE 50 celebrations, The Graduate Research Conference

Over the last couple of days, we celebrated the 50th anniversary of our department, CSE at University at Buffalo.
We celebrated the anniversary with a technical conference and panels, so it was an opportunity for everyone to learn new things. With the attendance of many prominent UB-CSE alumni, it has been a really amazing 2.5 days. Here is a link to the CSE 50 conference program.

On Thursday evening, the event was kicked off with a reception and an undergraduate poster session. The thing that surprised me in this poster session was how quickly the PM2.5 sensors have miniaturized. PM2.5 refers to atmospheric particulate matter (PM) with a diameter of less than 2.5 micrometers, which is about 3% of the diameter of a human hair. PM2.5 is a traffic-related pollutant implicated in a wide variety of adverse health outcomes. I was involved in an NIH/NIEHS project on using smartphone-based time-activity data for air pollutant exposure estimation from 2010-12. At that time, PM2.5 sensors were mini-fridge sized and expensive to buy and deploy. At the poster demo, my jaw hit the floor when I saw the new generation of PM2.5 sensors, which are palm-sized and connected to Arduino boards.

The Friday conference consisted of 3 keynote presentations and 4 sessions. The sessions were a mix of invited alumni talks and our graduate students' presentations of unpublished original papers. I was the program chair for the Friday conference, and was responsible for selecting the graduate papers. I used EasyChair to collect the paper submissions and review them. We formed a technical program committee of 22 alumni/faculty. Out of 21 paper submissions, we selected 8 for the Friday program. While all the submissions were high quality, we had to be selective to keep to the time constraints. We also processed 50 poster submissions, and chose 29 among them for the graduate poster presentation on Saturday.

Here are my notes from some of the talks on Friday.

Keynote 1 - Dr.
Victor Bahl (Microsoft Research) "Democratization of Streaming Video Analytics & the Emergence of Edge Computing" Victor got a BS & MS degree from ECE at UB at the end of 80s. He is a Distinguished Scientist and Director of Mobile & Networking Research at Microsoft. His talk was about edge computing, looking beyond cloud computing. In a 2013 meeting in Microsoft, he had claimed that by 2020, cloud computing would be disaggregated and augmented by edge/fog computing. He defended edge computing putting forth latency/bandwidth, expense/service, and battery-saving reasons. Since then he was involved in proving the utility of edge computing with killer applications. He talked about "Glimpse: continuous realtime object recognition of mobile devices" from Sensys 2014 as one application. Another application is the connected car. In 2015, they came up with ParkMaster, edge-powered in vehicle analytics for detecting open parking spaces in urban environments. As you drive your smartphone detects (and then uploads to cloud) empty parking spaces for others to later park in that street. Your car provides service to others, and in return others provide the same service to you. Yet, as another application of edge computing, he pursued surveillance of public buildings. The idea is to do the filtering/analysis of video feeds right in the building machines, instead of uploading the videos to cloud for remote offline analysis. And finally, most recently, he has been applying the edge computing concept to live video analytics of traffic cameras at intersections. This project serves by collecting traffic video analytics and data for the Vision Zero project. Vision Zero is a multi-national road traffic safety project started in 1997 that aims to achieve a highway system with no fatalities or serious injuries involving road traffic. The project is deployed and in use in Bellevue and Seattle streets, and is in progress to be deployed in Cambridge UK. Invited talk: Prof. 
Bruce Shriver (Liddy Shriver Sarcoma Initiative), "Unconventional Computer Architectures" Bruce started his PhD at University at Buffalo CS department in 1968 and got his PhD in 1971. His talk was about rethinking/rebooting computation and computers and touched on many topics including neuromorphic engineering. (This topic was also revisited by Dr. Steve Kang, another of our alum, in his Saturday's keynote titled "The 4th Industrial Revolution and Future of Nanoscience and Engineering".) Bruce has been interested in how the human brain organizes, stores, accesses and understands sensory input and its accumulated knowledge, and yet run with such a small power requirement. The recent success and wide adoption of CRISPR has invigorated the area. Bruce's presentation referred to several recent articles in the area, including: • DNA Fountain enables a robust and efficient storage architecture, Elrich et al, Science, March 2017 • Model-based design of RNA hybridization networks implemented in living cells, Rodrigo et al, Nucleic Acids Research, September 2017 • Complex cellular logic computation using ribocomputing devices, Green at al, Nature, July 2017 • Silicon quantum processor with robust long-distance qubit couplings, Tosi et al, Nature Communications, September 2017 • A Brain Built From Atomic Switches Can Learn, Quanta Magazine, Andreas von Bubnoff, September 2017 • On-chip generation of high-dimensional entangled quantum states and their coherent control, Kues et al, Nature, June 2017 • A million spiking-neuron integrated circuit with a scalable communication network and interface, Merolla et al, Science, August 2014 Bruce pointed out that these unconventional new architectures would need new type of algorithms, a type we do not have yet. He urged the audience to think about what type of algorithms, software, programming language, OS, hardware, and programmers would be needed to address these challenges. 
Bruce conjectured that we should see breakthroughs via molecular computing I/O.

Student paper presentations

Among the student papers, some of the most interesting ones for me were the following.
• "Metadata-based Feature Aggregation Network for Face Recognition" by Nishant Sankaran, Sergey Tulyakov, Srirangaraj Setlur and Venugopal Govindaraju. The idea here is to use metadata (yaw, pitch, roll, gender) unperturbed by the feature-extraction process to gauge the quality of a facial image.
• "Differentially Private Empirical Risk Minimization with Non-convex Loss Function" by Di Wang and Jinhui Xu.
• "Emulating Quantum Circuits Via Boolean Formulas and #SAT Solving" by Chaowen Guan and Kenneth Regan.

Monday, September 25, 2017

Paper Summary. Proteus: agile ML elasticity through tiered reliability in dynamic resource markets

This paper proposes an elastic ML system, Proteus, that can add/remove transient workers on the fly to exploit the transient availability of cheap but revocable resources, in order to reduce the cost and latency of computation. The paper appeared in Eurosys'17 and is authored by Aaron Harlap, Alexey Tumanov, Andrew Chung, Gregory R. Ganger, and Phillip B. Gibbons.

Proteus has two components: AgileML and BidBrain. AgileML extends the parameter-server ML architecture to run on a dynamic mix of stable and transient machines, taking advantage of the opportunistic availability of cheap but preemptible AWS Spot instances. BidBrain is the resource-allocation component that decides when to acquire and drop transient resources by monitoring current market prices and bidding on new resources when their addition would increase work-per-dollar.

Before delving into AgileML and BidBrain, let's first review the AWS Spot model.

See Spot run

AWS provides always-available compute instances, called "on-demand" instances: you get them when you like and keep them as long as you like, provided that you pay their fixed hourly rate.
AWS also offers transient compute instances via the AWS Spot market. You specify a bid price, and if the current market price is under your bid, you get the instance. You only pay the market price, not your bid price. In other words, your bid price is an upper bound on how much you are comfortable paying hourly. And if the AWS Spot market price for the instance goes above your upper bound, AWS pulls the instance from you with only a 2-minute advance warning. Even in this case, the silver lining is that you are not charged for the last incomplete hour of computing, so you get some free computing. As seen in Figure 3, you can save a lot of money if your computing job can exploit AWS Spot instances. (It is peculiar how the peak prices are sometimes up to 10 times higher than the fixed on-demand rate. This is speculated to be a measure to prevent a high bid from securing long-running instances on AWS Spot at a rate lower than on-demand EC2.)

Jobs that are especially suitable for AWS Spot's transient/preemptible computing style are embarrassingly parallel data-processing tasks, where the pieces are not related and where there is no need to maintain long-lived state. For example, for "shallow computing", such as thumbnail generation, an instance eviction does no harm, as there is no need for continuity across the computation. The question the paper investigates is how to make the AWS Spot model work for "deeper computing", such as ML jobs.

While the paper considers this question for the AWS Spot market, the motivation also applies to enterprise computing in the datacenter. As disclosed in the Google Borg paper, Google distinguishes and prioritizes its production services over analytic/batch services. If the production services need more resources, they are given resources, to the extent of preempting resources from analytic/batch jobs if need be. On the other hand, when there is an excess of resources, analytic/batch jobs can enjoy them opportunistically.
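To make the Spot billing rules concrete, here is a minimal sketch of the cost accounting described above (the function name and inputs are my own, not from the paper or AWS): you pay the market price for each hour, and the last incomplete hour is free only when AWS evicts you.

```python
def spot_cost(hourly_market_prices, evicted_by_aws):
    """Cost of a Spot allocation under the billing rules described above.

    hourly_market_prices: the market price charged for each hour the
    instance ran, in order; the last entry is the (possibly partial)
    final hour. If AWS evicts the instance mid-hour, that final
    incomplete hour is free -- the "free computing" the paper exploits.
    """
    if evicted_by_aws and hourly_market_prices:
        return sum(hourly_market_prices[:-1])  # last partial hour is free
    return sum(hourly_market_prices)
```

So an instance that ran 2.5 hours at $0.10/hour and was then evicted costs only $0.20, and the half hour of work after the second hour is free computing.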
Stages of AgileML

AgileML has 3 modes/stages, as shown in Figure 4. To provide a shorter and more cost-effective computation, AgileML dynamically changes modes based on the availability of cheap transient instances. As the transient-to-on-demand ratio increases from 1:1 to beyond 15:1, AgileML shifts up from mode 1 to mode 3. As the ratio decreases, AgileML shifts back down from mode 3 to mode 1.
• Stage 1: Parameter servers only on reliable machines. Stage 1 spreads the parameter server across reliable machines only, using transient nodes only for stateless workers. This works for most ML applications, including K-means, DNNs, logistic regression, and sparse coding, as the workers are stateless while the parameter servers contain the current solution state.
• Stage 2: ActivePSs on transient machines and BackupPSs on reliable machines. For transient-to-reliable node ratios greater than 1:1, AgileML switches to stage 2. Stage 2 uses a primary-backup model for parameter servers, using transient nodes for the active servers (ActivePSs) and reliable nodes for the hot standbys (BackupPSs). This relieves the heavy network load at the few reliable resources by spreading it across the many transient resources. The model parameters are sharded across the set of ActivePS instances. Workers send all updates and reads to the ActivePSs, which push updates in bulk to the BackupPSs. The solution state affected by transient node failures or evictions is recovered from the BackupPSs. (For backing up the ActivePSs to the BackupPSs, it may be possible to explore a threshold-based update mechanism as outlined in the Gaia paper.)
• Stage 3: No workers on reliable machines. Workers colocated with BackupPSs on reliable machines were found to cause straggler effects at transient-to-reliable ratios beyond 15:1. Stage 3 removes these workers, and acts like a sub-case of stage 2.
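The stage transitions above reduce to a few lines. This is my own sketch, with the 1:1 and 15:1 thresholds taken from the description in the paper:

```python
def agileml_stage(num_transient, num_reliable):
    """Pick the AgileML stage from the transient-to-reliable machine ratio.

    Stage 1: parameter servers only on reliable machines.
    Stage 2: ActivePSs on transient machines, BackupPSs on reliable ones.
    Stage 3: as stage 2, but with no workers on reliable machines.
    """
    ratio = num_transient / num_reliable
    if ratio <= 1:
        return 1
    elif ratio <= 15:
        return 2
    else:
        return 3
```

As the Spot market adds or evicts transient instances, the elasticity controller re-evaluates this ratio and shifts the system up or down accordingly.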
Handling elasticity

The elasticity controller component is responsible for changing modes based on the transient-to-reliable ratio and the network bandwidth. It tracks which workers are participating in the computation, assigns a subset of the input data to each worker, and starts new ActivePSs. For stage 2 and stage 3, half of the transient instances are recruited as ActivePSs, as that performed best in the evaluations. This one-half ratio is likely specific to using transient instances; with reliable instances, the more PSs the merrier.

During start-up, AgileML divides the parameter state into N partitions, where N is the maximum number of ActivePSs that can exist at any one point. By using partitions in this way, AgileML avoids the need to re-shard the parameter state when adding or removing servers, re-assigning partitions as needed instead. As the number of ActivePS instances increases and decreases, the elasticity controller re-assigns the parameter-server shards across the ActivePS instances appropriately. If all the ActivePSs are evicted, AgileML transfers back to stage 1. It seems like using a level of indirection was sufficient to get this working.

BidBrain

BidBrain keeps track of historical market prices for transient instances and makes allocation decisions to minimize cost-per-work. An allocation is defined as a set of instances of the same type acquired at the same time and price. Before the end of an allocation's billing hour, BidBrain compares cost-per-work ratios to decide whether the allocation is renewed or terminated.

Evaluation

The experiments were performed with 3 ML applications.
• Matrix Factorization (MF) is a technique (a.k.a. collaborative filtering) commonly used in recommendation systems, such as recommending movies to users on Netflix. The goal is to discover latent interactions between the two entities (e.g., users and movies).
Given a partially filled matrix X (e.g., a matrix where entry (i, j) is user i's rating of movie j), MF factorizes X into factor matrices L and R such that their product approximates X.
• Multinomial Logistic Regression (MLR) is a popular model for multi-way classification, often used in the last layer of deep learning models for image classification or text classification. The MLR experiments use the ImageNet dataset with LLC features, containing 64k observations with a feature dimension of 21,504 and 1000 classes.
• Latent Dirichlet Allocation (LDA) is an unsupervised method for discovering hidden semantic structures (topics) in an unstructured collection of documents, each consisting of a bag (multi-set) of words. The evaluated LDA solver implements collapsed Gibbs sampling.

The baseline runs all instances on Spot market machines and uses checkpointing to recover progress if evicted. The experiments show about 17% overhead for MF due to checkpointing. Figure 1 illustrates the cost and time benefits of Proteus on the MLR application. Compared to all on-demand, the baseline improves on cost significantly, as expected, but increases the runtime by 25%. Proteus improves on cost and also manages to achieve a reduced runtime. On average, 32% of Proteus's computing is free computing. But aggressively chasing free computing by bidding very close to the market price results in high overhead: a 4x increase in runtime and higher costs due to frequent evictions.

Friday, September 22, 2017

Paper summary. Distributed Deep Neural Networks over the Cloud, the Edge, and End Devices

This paper is by Surat Teerapittayanon, Bradley McDanel, and H.T. Kung at Harvard University and appeared in ICDCS'17. The paper is about partitioning the DNN for inference between the edge and the cloud. There has been other work on edge-partitioning of DNNs, most recently the Neurosurgeon paper.
The goal there was to figure out the most energy-efficient / fastest-response-time partitioning of a DNN model for inference between the edge device and the cloud. This paper adds a very nice twist to the problem. It adds an exit/output layer at the edge, so that if there is high confidence in the classification output, the DNN replies early with the result, without going all the way to the cloud and processing the entire DNN to get a result. In other words, samples can be classified and exited locally at the edge when the system is confident, and offloaded to the cloud when additional processing is required.

This early exit at the edge is achieved by jointly training a single DNN with an edge exit point inserted. During training, the loss is calculated by taking into account both the edge exit point and the ultimate exit/output point at the cloud end. The joint training does not need to be done via device/cloud collaboration; it can be done centrally or at a datacenter.

DDNN architecture

The paper classifies the serving/inference hierarchy as local, edge, and cloud, but I think the local-versus-edge distinction is somewhat superfluous for now. So in my summary, I am only using two layers: edge versus cloud. Another thing this paper does differently than existing edge/cloud partitioning work is its support for horizontal model partitioning across devices. Horizontal model partitioning for inference is useful when each device might be a sensor capturing only one sensing modality of the same phenomenon. DDNN uses binary neural networks to reduce the memory cost of NN layers so that they can run on resource-constrained end devices. A shortcoming of the paper is that the NN layers that end up staying in the cloud are also binary neural network layers. That is unnecessary, and it may harm precision.

DDNN training

At training time, the loss from each exit is combined during backpropagation so that the entire network can be jointly trained, and each exit point achieves good accuracy relative to its depth.
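At inference time, the early-exit decision described above comes down to a confidence check on the edge exit's output distribution. Here is a minimal sketch using normalized entropy as the confidence measure (the function names and the example threshold are mine):

```python
import math

def normalized_entropy(probs):
    """Entropy of a predicted class distribution, scaled to [0, 1] by
    dividing by log(num_classes); values near 0 mean high confidence."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))

def exit_at_edge(probs, threshold):
    """Exit locally when the edge exit point is confident enough;
    otherwise the sample is offloaded to the cloud."""
    return normalized_entropy(probs) < threshold

# A peaked distribution exits at the edge; a uniform one goes to the cloud.
print(exit_at_edge([0.98, 0.01, 0.01], threshold=0.5))  # True
print(exit_at_edge([1/3, 1/3, 1/3], threshold=0.5))     # False
```

The threshold trades off local exit rate against accuracy: a looser threshold keeps more samples at the edge but risks exiting on uncertain predictions.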
The idea is to optimize the lower parts of the DNN to create sufficiently good feature representations to support both the samples exited locally and those processed further in the cloud. The training is done by forming a joint optimization problem that minimizes a weighted sum of the loss functions of the exits. Equal weights for the exits are used for the experimental results of this paper.

They use a normalized entropy threshold as the confidence criterion that determines whether to classify (exit) a sample at a particular exit point. A normalized entropy close to 0 means that the DDNN is confident about the prediction for the sample. At each exit point, the normalized entropy is computed and compared against a threshold T to determine whether the sample should exit at that point. The exit points are determined/inserted manually before training. Future research should look into automating and optimizing the insertion of exit points.

DDNN evaluation

The evaluation is done using a dataset of 6 cameras capturing the same object (person, car, bus) from different perspectives. The task is to classify into three categories: person, car, and bus. This task may be overly simple. Moreover, the evaluation of the paper is weak: the multi-camera dataset has only 680 training samples and 171 testing samples.

Tuesday, September 19, 2017

Paper summary. Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds

This paper appeared in NSDI'17 and is authored by Kevin Hsieh, Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R. Ganger, Phillip B. Gibbons, and Onur Mutlu.

Motivation

This paper proposes a framework to distribute an ML system across multiple datacenters, and to train models at the same datacenter where the data is generated.
This is useful because it avoids the need to move big data over wide-area networks (WANs), which can be slow (WAN bandwidth is about 15x less than LAN bandwidth), costly (AWS does not charge for intra-datacenter communication but charges for WAN communication), and also prone to privacy or ownership concerns.

Google's Federated Learning also had similar motivations, and set out to reduce WAN communication. It works as follows: 1) smartphones are sent the model by the master/datacenter parameter server, 2) smartphones compute an updated model based on their local data over some number of iterations, 3) the updated models are sent from the smartphones to the datacenter parameter server, 4) the datacenter parameter server aggregates these models (by averaging) to construct the new global model, and 5) repeat.

The Gaia paper does not cite the Federated Learning paper, likely because the two were submitted around the same time. There are many parallels between Gaia's approach and that of Federated Learning. Both are based on the parameter-server model, and both prescribe updating the model parameters in a relaxed/stale/approximate synchronous parallel fashion: several iterations are run in-situ before updating the "master" parameter server. The difference is that Federated Learning has a "master" parameter server in the datacenter, whereas Gaia takes a peer-to-peer approach where each datacenter has a parameter server, and updating the "master" datacenter means synchronizing the parameter servers across the datacenters.

Approximate Synchronous Parallelism (ASP) idea

Gaia's Approximate Synchronous Parallelism (ASP) idea tries to eliminate insignificant communication between data centers while still guaranteeing the correctness of ML algorithms.
ASP is motivated by the observation that the vast majority of updates to the global ML model parameters from each worker are insignificant; e.g., more than 95% of the updates may produce less than a 1% change to the parameter value. With ASP, these insignificant updates to the same parameter within a data center are aggregated (and thus not communicated to other data centers) until the aggregated updates become significant enough.

ASP builds heavily on the Stale Synchronous Parallelism (SSP) idea for parameter-server ML systems. While SSP bounds how stale (i.e., old) a parameter can be, ASP bounds how inaccurate a parameter can be, in comparison to the most up-to-date value. ASP allows the ML programmer to specify the function and the threshold that determine the significance of updates for each ML algorithm. The significance threshold has 2 parts: a hard threshold and a soft threshold. The purpose of the hard threshold is to guarantee ML algorithm convergence, while that of the soft threshold is to use underutilized WAN bandwidth to speed up convergence. In other words, the soft threshold provides an opportunistic synchronization threshold.

Architecture

The Gaia architecture is simple: it prescribes adding a layer of indirection to the parameter-server model to account for multiple-datacenter deployments. Figure 4 shows the overview of Gaia. In Gaia, each data center has some worker machines and parameter servers. Each worker machine works on a shard of the input data stored in its data center to achieve data parallelism. The parameter servers in each data center collectively maintain a version of the global model copy, and each parameter server handles a shard of this global model copy. A worker machine only READs and UPDATEs the global model copy in its data center. To reduce the communication overhead over WANs, ASP is used between parameter servers across different data centers. Below are the 3 components of ASP.

The significance filter.
ASP takes 2 inputs from the user: (1) a significance function and (2) an initial significance threshold. A parameter server aggregates updates from the local worker machines and shares the aggregated updates with other datacenters when the aggregate becomes significant. To facilitate convergence to the optimal point, ASP automatically reduces the significance threshold over time: if the original threshold is v, then the threshold at iteration t of the ML algorithm is v/√t.
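A sketch of how such a significance filter might work (my own illustration: the relative-change significance function and all names are assumptions, but the v/√t threshold decay follows the paper):

```python
import math

class SignificanceFilter:
    """Aggregate a parameter's local updates; release the aggregate to
    the other datacenters only once it becomes significant."""

    def __init__(self, v):
        self.v = v            # initial significance threshold
        self.pending = 0.0    # aggregated, not-yet-shared update

    def offer(self, update, param_value, t):
        """Return the aggregate to broadcast if it is significant at
        iteration t, else None. The threshold decays as v / sqrt(t),
        and significance is measured as relative change to the value."""
        self.pending += update
        threshold = self.v / math.sqrt(t)
        if abs(self.pending) / abs(param_value) >= threshold:
            aggregate, self.pending = self.pending, 0.0
            return aggregate
        return None
```

For example, with v = 1% and a parameter value of 100, individual updates of 0.4 and 0.7 are held back until their aggregate (1.1) crosses the significance bar, so only one WAN message is sent instead of two.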

ASP selective barrier. When a parameter server receives significant updates at a rate higher than the WAN bandwidth can support, instead of sending the updates (which would take a long time), it first sends a short control message to the other datacenters. The receiver of this ASP selective barrier message blocks its local workers from reading the specified parameters until it receives the significant updates from the sender of the barrier.

Mirror clock. This provides a final safety net by implementing SSP across datacenters. When a parameter server receives all the updates from its local worker machines at the end of a clock (e.g., an iteration), it reports its clock to the servers that are in charge of the same parameters in the other data centers. When a server detects that its clock is ahead of the slowest server, it blocks until the slowest mirror server catches up.
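The mirror clock check reduces to a one-line predicate. This sketch is my own, with `slack` standing in for the SSP staleness bound (the original description blocks as soon as a server is ahead at all, i.e., slack = 0):

```python
def may_advance(my_clock, mirror_clocks, slack):
    """Mirror clock safety net: a parameter server may advance its clock
    only if it is at most `slack` clocks ahead of the slowest mirror
    server in charge of the same parameters in the other data centers."""
    return my_clock - min(mirror_clocks) <= slack
```

A server that fails this check blocks, waiting for the slowest mirror server to report a newer clock.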

Evaluation

The paper evaluates Gaia with 3 popular ML applications. Matrix Factorization (MF) is a technique commonly used in recommender systems. Topic Modeling (TM) is an unsupervised method for discovering hidden semantic structures (topics) in an unstructured collection of documents, each consisting of a bag (multi-set) of words. Image Classification (IC) is a task to classify images into categories, and uses deep learning and convolutional neural networks (CNNs). All applications use SGD-based optimization.

The experiments, running across 11 Amazon EC2 global regions and on a cluster that emulates EC2 WAN bandwidth, compare Gaia against a Baseline that uses BSP (Bulk Synchronous Parallelism) across all datacenters, and against a deployment where all nodes run inside a single LAN.

Questions

Is this general enough? The introduction says this should apply for SGD based ML algorithms. But are there hidden/implicit assumptions?

What are some examples of advanced significance functions? ML users can define advanced significance functions to be used with Gaia, but this is not explored/explained much in the paper. This may be a hard thing to do even for advanced users.

Even though it is easier to improve bandwidth than latency, the paper focuses on the challenge imposed by limited WAN bandwidth rather than WAN latency. While the end metric for evaluation is the completion time of training, the paper does not investigate the effect of network latency. How would the evaluations look if the improvements were investigated in correlation with latency rather than throughput limitations? (I guess we could get a rough idea of this if we knew how often the barrier control message was used.)