Posts

Showing posts from 2010

Case for RAMClouds: Scalable High-Performance Storage Entirely in DRAM

I wrote about Ousterhout's "The Role of Distributed State" work before. This review is for his recent work on RAMClouds, which appeared in SIGOPS Operating Systems Review. This paper makes a case for keeping all the data in DRAM, distributed over the nodes of a datacenter. "A RAMCloud is not a cache like memcached and data is not stored on an I/O device, as with flash memory: DRAM is the permanent home for data." Obviously, storing everything in RAM would yield a very high-throughput (the paper mentions 100-1000x) and very low-latency (again, 100-1000x) system compared to disk-based systems. However, the primary reason the authors are excited about RAMCloud is the following: "RAMCloud will simplify the development of large-scale Web applications by eliminating many of the scalability issues that sap developer productivity today." The motivation for RAMCloud is to provide a general-purpose storage system that scales far beyond existing …
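
To put rough numbers behind the latency claim (ballpark figures of my own, not from the paper): a magnetic disk access costs on the order of 10 ms, while fetching a small value from a DRAM-based server over a datacenter network costs on the order of 10 µs.

```python
# Back-of-the-envelope latency comparison (my own ballpark figures,
# not numbers taken from the paper).
disk_access_s = 10e-3   # ~10 ms: seek + rotational delay on a magnetic disk
ramcloud_rpc_s = 10e-6  # ~10 us: DRAM read + datacenter network round trip

speedup = disk_access_s / ramcloud_rpc_s
print(f"latency improvement: ~{speedup:.0f}x")  # ~1000x, in line with the 100-1000x claim
```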

Finding a Needle in Haystack: Facebook's Photo Storage

This paper appeared in OSDI'10. The title "Finding a needle in Haystack" is a bit of an over-dramatization :-) Finding a needle in a haystack becomes straightforward if you can memorize the location of each needle in the haystack. And that is exactly what Facebook's Haystack system does. Haystack is an object store for sharing photos on Facebook, where data is written once, read often, never modified, and rarely deleted. The Haystack storage system was designed because traditional filesystems perform poorly under the Facebook workload. With a network attached storage (NAS) appliance mounted over NFS, several disk operations were necessary to read a single photo: one (or typically more) to translate the filename to an inode number, another to read the inode from disk, and a final one to read the file itself. While insignificant at a small scale, multiplied over billions of photos and petabytes of data, accessing metadata becomes the throughput bottleneck. Haystack aims to …
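
The fix is to keep all per-photo metadata in memory, so that reading a photo costs exactly one disk operation. A minimal sketch of that idea (the names and store layout here are my simplification, not Haystack's actual on-disk format):

```python
import os

# In-memory index: photo_id -> (offset, size) within one large append-only
# store file. This replaces the filename -> inode -> data chain of disk reads.
index: dict[int, tuple[int, int]] = {}

def write_photo(store, photo_id: int, data: bytes) -> None:
    store.seek(0, os.SEEK_END)
    index[photo_id] = (store.tell(), len(data))  # memorize the needle's location
    store.write(data)

def read_photo(store, photo_id: int) -> bytes:
    offset, size = index[photo_id]  # pure memory lookup: no metadata disk I/O
    store.seek(offset)
    return store.read(size)         # exactly one disk read per photo
```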

Boxwood: Abstractions as the foundation for storage infrastructure

This paper is by Microsoft Research, and appeared in OSDI'04. This review will mostly be a stream of consciousness, because I have not yet understood all of the paper and cannot put it in context as much as I would like to. While reading into the Boxwood paper, I started to notice how similar this is getting to the GFS problem and GFS approach. Boxwood appeared at OSDI'04, and GFS appeared at SOSP'03. Boxwood refers to GFS but does not compare or contrast itself with GFS. Maybe the reason is that in 2004 the Boxwood authors could not yet see the similarities. This could be because, as I mentioned in my GFS review, the GFS paper did not talk about the Paxos replication of the master chunk-manager; that came a couple of years later in the Chubby and Paxos-made-live papers. When citing GFS, the Boxwood authors only state that GFS "will be layered over the facilities of Boxwood". But that is impractical, as it would be duplicating a lot of the services; GFS …

Globecom, WSN forum, Urban-scale sensing talk by Ed Knightly (Rice U)

Last week, I attended Globecom'10. Ed Knightly from Rice talked about urban-scale sensing in three parts: vehicular sensing, health sensing, and smart grid. Ed spent most of his talk on the vehicular sensing part. A recent US Department of Transportation vehicle safety commission project asked this question: vehicles already have dozens of sensors; what could we achieve if this information were shared? Some low-hanging fruit: traffic signal warning, curve speed warning, left turn assistant, stop sign movement assistant, lane change warning, collision warning, and finally, internet access applications. The candidate technology proposed for making this networking feasible is wireless, of course, but not the WiFi technology that is probably many people's first guess; it is visible light communication (VLC) technology. A VLC transmitter is an LED, which may well be the LED headlights and taillights already present in most recent car models. The only thing needed is …

Globecom, wireless networking forum, talk on smartphones by Roy Want

Last week, I attended Globecom'10. Roy Want (Intel) gave a talk on smartphones at Globecom. He started by showing the market trends for cellphones, smartphones, and laptops. Cellphones and smartphones are growing so quickly that they dwarf the laptop market (which is itself growing at a healthy 20%). Roy then asked the following question: "Will we one day be computing on our smartphones?" He said that in order for that to happen, we need to overcome the poor UI experience of smartphones. As part of these introductory slides, Roy showed a picture of the Intel Atom processor for smartphones. It is smaller than a grain of rice, yet it is the brain of a smartphone and x86 compatible, so these chips can bring … After the presentation, over dinner, I asked Roy why not put a dozen of these Atom processors in one smartphone, given that they take virtually no space. Turns out this is currently not very feasible, because these processors are pretty battery-hungry, even though they are very tiny …

Onix: A Distributed Control Platform for Large-scale Production Networks

The Onix work (OSDI'10) builds on Nox. Essentially, Onix takes Nox and distributes it over multiple servers. Let me start with a brief refresher on Nox. (Or you can read my previous post on Nox.) The main idea in Nox (and OpenFlow) was to facilitate innovation by separating the control plane from the forwarding (data) plane. (In the current networking architecture, control and data planes are both implemented in the same place, the routers.) Nox introduced "software defined networking" (SDN): Nox uses a centralized controller to make the decisions (i.e., the control plane); the routers implement only the data plane, and just follow directions from the controller while forwarding data. A drawback of Nox was that its single controller is a single point of failure. Although the Nox work pointed out how this single controller could be distributed, it didn't pursue it further. Enter Onix. Onix investigates how to distribute that single controller. …
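
To make the control/data plane split concrete, here is a toy controller loop in the OpenFlow style (entirely my sketch, not Nox or Onix code): switches punt packets they have no rule for up to the controller, which computes a decision and installs a forwarding rule.

```python
# Toy "software defined networking" controller (my sketch, not Nox/Onix code).
# The control-plane logic lives here; switches only match rules and forward.

flow_tables: dict[str, dict[tuple, str]] = {}  # switch -> {(src, dst): out_port}

def topology_route(switch: str, dst: str) -> str:
    # Placeholder for a real path computation over the network map.
    return f"port-toward-{dst}"

def packet_in(switch: str, src: str, dst: str) -> str:
    """Called when a switch has no matching rule and asks the controller."""
    out_port = topology_route(switch, dst)                     # control-plane decision
    flow_tables.setdefault(switch, {})[(src, dst)] = out_port  # install data-plane rule
    return out_port  # later packets are forwarded without consulting the controller

print(packet_in("s1", "10.0.0.1", "10.0.0.2"))
```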

Dynamo: Amazon's highly available key-value store

This paper, which appeared in SOSP'07, describes Dynamo, the underlying storage technology for several core services in Amazon's e-commerce platform. Dynamo is a NoSQL system and provides a simple key-value store. The commonly accepted (yet still disputed) wisdom is that RDBMSs are overkill for simple key-value stores and are unsuitable for large-scale multi-datacenter systems. The goal in Dynamo is providing reliability at large scale. As the paper says in the introduction, "The reliability and scalability of a system is dependent on how its application state is managed". This was a key lesson from the Ousterhout'90 paper "The role of distributed state". Dynamo is an optimistic replication system. According to the optimistic replication survey taxonomy, Dynamo is a multi-master (multiple coordinators can update the data) system that employs state transfer and asynchronous propagation of updates. Hence, Dynamo allows conflicting updates into the system. …
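
Dynamo tracks update causality with vector clocks, so that conflicting (concurrent) versions can be detected and handed back to the application for reconciliation. A minimal sketch of the comparison logic (my simplification, not Dynamo's actual code):

```python
# Vector clocks: coordinator -> counter. One version "descends" from another
# if its clock dominates component-wise; otherwise the two versions conflict.
def descends(a: dict, b: dict) -> bool:
    """True if the version with clock `a` includes all updates in `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def reconcile_needed(a: dict, b: dict) -> bool:
    # Neither clock dominates: concurrent writes, the application must merge.
    return not descends(a, b) and not descends(b, a)

v1 = {"coordinatorA": 2, "coordinatorB": 1}
v2 = {"coordinatorA": 1, "coordinatorB": 2}
print(reconcile_needed(v1, v2))  # True: conflicting multi-master updates
```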

Crowdsourced sensing and collaboration using Twitter

First, a brief prelude about my research interests. I started my PhD on distributed algorithms and self-stabilizing systems in 1998. But then in 2000, after my advisor received a DARPA grant on networked embedded sensor technologies, I started working on wireless sensor networks. After 10 years of working on wireless sensor networks, I am now in the process of switching topics. I am doing a lot of reading on cloud computing. This is a topic I enjoy, and I think I can contribute here due to my background in distributed systems and theory. I am also doing a lot of reading on smartphones, as they provide a good alternative/complement to wireless sensor network systems. The main appeal of smartphones is that they have solved the market penetration problem, which wireless sensor networks have been perpetually struggling with. The killer applications for smartphones are communication and offering a lightweight ubiquitous PC replacement. Smartphones are, incidentally, better sensors than …

VL2: A Scalable and Flexible Data Center Network

This paper is by MS Research and appeared in Sigcomm 2009. The paper investigates data center networking, the same problem as the Portland paper (which also appeared in Sigcomm 2009!). Naturally there are similarities in the approaches recommended by the two papers. The motivation for this paper comes from a slightly different angle than the Portland paper's. This paper puts more emphasis on the network capacity problem in data centers. The paper argues that the network is the bottleneck of the computation, since the switches at the higher levels (i.e., aggregation and core switches) are heavily oversubscribed. Servers typically have 1:1 over-subscription to other servers in the same rack --that is, they can communicate at the full rate of their interfaces (e.g., 1 Gbps). However, up-links from top-of-the-rack (ToR) switches are typically 1:20 oversubscribed (i.e., 1 Gbps of up-link for 20 servers), and paths through the highest layer of the tree can be 1:240 oversubscribed. The paper …
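
The oversubscription ratios quoted above follow from dividing aggregate server bandwidth by uplink bandwidth; a quick check (my arithmetic on the paper's example numbers, and the layer split in the last line is my hypothetical decomposition, not the paper's):

```python
# Oversubscription = aggregate ingress bandwidth / available uplink bandwidth.
server_rate_gbps = 1.0
servers_per_tor = 20
tor_uplink_gbps = 1.0

tor_oversub = (servers_per_tor * server_rate_gbps) / tor_uplink_gbps
print(f"ToR uplink oversubscription: 1:{tor_oversub:.0f}")  # 1:20

# Higher layers compound: e.g., a further 1:12 across aggregation/core
# (a hypothetical split) multiplies out to the paper's worst case.
print(f"worst-case path: 1:{tor_oversub * 12:.0f}")  # 1:240
```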

PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric

Last week we covered the Portland paper by UC San Diego, which appeared at Sigcomm'09. This paper is well written and I really appreciate its clarity and simplicity. The motivation for the work is the need to scale data center networks (DCNs). Data centers today may have around 100K machines, with 32 virtual machines on each, which yields a total of about 3 million hosts. Assuming one switch is required for every 25 physical hosts and accounting for interior nodes, the topology would consist of 8,000 switches. Current network protocols impose significant management overhead at this scale, let alone supporting seamless VM migration across machines. For example, broadcasting ARPs and updates to every switch is out of the question for such a network. The desiderata for the DCN are given as: R1. Any VM may migrate to any physical machine. Migrating VMs should not have to change their IP addresses, as doing so would break pre-existing TCP connections and application-level state. R2. An administrator …
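
The scale numbers compound quickly; a quick back-of-the-envelope check (the interior-switch factor is my assumption to reach the paper's total):

```python
machines = 100_000
vms_per_machine = 32
hosts = machines * vms_per_machine
print(f"virtual hosts: {hosts:,}")          # 3,200,000 (~3 million)

edge_switches = machines // 25              # one switch per 25 physical hosts
print(f"edge switches: {edge_switches:,}")  # 4,000
# Interior (aggregation/core) switches roughly double the count in a
# multi-rooted tree (my assumption), giving the paper's ~8,000 switches.
print(f"total switches: ~{edge_switches * 2:,}")
```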

Availability in Globally Distributed Storage Systems

This paper by Google Research provides a report on the availability behavior of large cloud storage systems, based on a study of up to 7K nodes at Google over a year. The paper does not propose any new protocols, but provides sufficiently accurate models of system behavior/performance based on the study. These models are important since they enable us to correctly design and optimize these multi-layered systems for data availability. The work is divided into two parts. The first is the analysis of component availability (disks, machines, racks), and the second is the analysis of data availability, as inferred from the component availability results and the design decisions of the distributed storage system. Preliminaries: A storage node is defined as unavailable when it fails to respond positively to periodic health-checking pings sent by the monitoring system. The paper does not investigate network errors specifically; those are also swept under unavailability along with software & hardware …
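
A simple instance of the kind of model the paper builds: if node failures were independent (they are not; the paper shows failures are correlated, e.g., within racks), data unavailability for r replicas would just be p^r. A toy calculation (my illustration, not the paper's model):

```python
# Toy data-availability model under the (unrealistic) independence assumption.
p_node_unavailable = 0.01   # probability a given storage node is down
replicas = 3

p_data_unavailable = p_node_unavailable ** replicas
print(f"P(all {replicas} replicas down) = {p_data_unavailable:.0e}")  # 1e-06

# The paper's point: correlated failures (e.g., a rack power event taking down
# all replicas at once) make the real number far worse than this estimate,
# which is why replica placement across racks matters.
```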

My iPhone 4 has a transparent screen

I just took a picture of my palm, and set it as the wallpaper to give the transparency effect. The trick worked well on some unsuspecting friends.

Optimistic Replication

This 2005 paper by Saito and Shapiro provides a comprehensive survey of the optimistic replication area and is a must-read for distributed services designers. Below, I provide some snippets from the paper to give a brief overview. Data replication improves both availability and performance for distributed services. Availability is improved by allowing access to the data even when some of the replicas are unavailable. Performance improvements concern reduced latency (by enabling local access from a replica instead of remote access) and increased throughput (by letting multiple computers serve the data). Pessimistic techniques block access to a replica unless it is provably up to date. Such pessimistic techniques perform well in local-area networks, in which latencies are small and failures uncommon, but are unsuitable for wide-area networks, because the Internet remains slow and unreliable. Moreover, pessimistic algorithms scale poorly, because their throughput and availability suffer …
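
As a tiny illustration of the optimistic style (my sketch, not from the survey): each replica accepts writes locally without coordination, and replicas exchange updates in the background, reconciling with some policy such as last-writer-wins.

```python
# Minimal optimistic replication sketch (my illustration, not from the survey):
# replicas accept writes locally and reconcile asynchronously.

class Replica:
    def __init__(self, name: str):
        self.name = name
        self.value = None
        self.timestamp = 0.0  # time of the last accepted write

    def write(self, value, timestamp):
        # Local write succeeds immediately -- no coordination, high availability.
        self.value, self.timestamp = value, timestamp

    def sync_from(self, other: "Replica"):
        # Background anti-entropy: last-writer-wins reconciliation policy.
        if other.timestamp > self.timestamp:
            self.value, self.timestamp = other.value, other.timestamp

a, b = Replica("a"), Replica("b")
a.write("x=1", timestamp=1.0)    # concurrent writes at different replicas
b.write("x=2", timestamp=1.1)
a.sync_from(b); b.sync_from(a)   # propagate updates asynchronously
print(a.value, b.value)          # both converge to "x=2"
```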

A presentation tip (brought to you by a weary audience)

A common and serious flaw I see in student presentations is that the presenter rushes through the first 5-10 slides. This is the worst thing to do in a presentation: these introductory slides on the problem and the solution are the most important ones. If the presenter loses the audience in these beginning slides, the entire hour is wasted for the audience. In comparison, losing the audience in the middle of the presentation is less critical; less time is wasted, and the audience can even use that time to think about alternative solutions to the problem (which was presented clearly in the introduction). You would think this is common sense, but most students still commit this mistake. I guess the reason is that, since the presenter has read and understood the relatively easy introduction part of the paper, he thinks the audience can also understand and follow it with ease. And armed with the PowerPoint slides, the presenter can finish the first 10 slides in less than 10 minutes in bullet …

The role of distributed state

This is a 1990 technical report by John Ousterhout, who has been a very influential and pioneering figure in systems research. I had written a summary of his log-structured file system (LFS) paper earlier in this blog. It seems like his ideas on LFS and RAMCloud are becoming important in cloud computing these days. Ousterhout is also the creator of Tcl/Tk, so he has a lot to say in the programming-languages-for-web-applications domain as well. The miscellaneous articles on his non-academic homepage include a lot of wisdom from his experience in the software development industry, and are also worth reading carefully. This paper, titled "Role of distributed state", describes the advantages and disadvantages of distributed state and suggests that the act of building a distributed system consists of making tradeoffs among various alternatives for managing the distributed state. Illustrating this point, the two case studies included in the paper, the NFS and Sprite file systems, end up with …
