Cloud outages can result in poor user experience for cloud tenants and revenue loss for the provider, which makes cloud network diagnostic solutions invaluable. Although many network diagnostic solutions exist, few are designed specifically for cloud networks. Current state-of-the-art cloud network diagnosis falls short in three respects: (1) There is no clear way to tell whether an observed problem lies in the tenant's virtual network or in the provider's infrastructure. As a result, tenants and the provider must go back and forth, which lengthens problem-solving time and raises maintenance costs. (2) Cloud tenants can deploy only rudimentary troubleshooting tools (e.g., ping, VM monitoring). Diagnosing a distributed system with these tools depends heavily on skill and experience, which tenants do not always have. (3) For the cloud provider, new trends such as network function virtualization make the infrastructure more complex than a traditional network, which can give rise to new kinds of problems. The provider therefore needs new diagnostic tools to cover this range of problems.
In this talk, we present two systems for cloud network diagnosis.
(A) VND: a Virtual Network Diagnostic Service. VND is a service offered by the provider to its tenants. Using VND, a tenant can determine whether or not a problem is in its virtual network; VND's interfaces also simplify tenants' troubleshooting operations. A tenant can use VND to collect, parse and query its packet traces. Trace collection operates within the tenant's view of its own virtual network, without exposing the cloud infrastructure. The trace parse and query interfaces are designed to ease the tenant's troubleshooting operations, and VND provides a SQL interface for tenants to perform diagnosis. We show several typical network diagnostic use cases in which troubleshooting solutions can be easily implemented using VND. We also measure VND's overhead and show its feasibility.
(B) PerfSight: Performance Diagnosis for Software Data Planes. Increasingly, modern network data planes involve complex software in packet processing (e.g., virtual switches, VM hypervisors and software middleboxes). We refer to these software components as the software data plane. We argue that at least three new classes of performance problems arise in software data planes: bottlenecks, contention and bugs. We propose a system named PerfSight to target these three problems. PerfSight instruments the software data plane, gathers basic statistics (e.g., packet count, byte count, I/O time) and analyzes them comprehensively. We obtained two key insights by running PerfSight: (1) packet drops are the best indicator of bottlenecks, and the location of the packet drops can give information about the resources in contention (e.g., CPU, network bandwidth); (2) a software middlebox's state can be defined by basic statistics, and these states propagate through the network in certain patterns. These patterns can be used to infer which middlebox has performance bugs.
Our evaluation shows PerfSight introduces little overhead to the existing system, and thus it is feasible to deploy.
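The first insight, that the location of packet drops points toward the bottleneck, can be illustrated with a minimal Python sketch. It is not PerfSight's implementation: the element names, the `ElementStats` record and the counter values are hypothetical, and the sketch assumes that every element on the path is expected to forward all packets it receives (so any difference between packets in and packets out counts as a drop).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ElementStats:
    """Basic per-element counters (hypothetical record, not PerfSight's format)."""
    name: str          # illustrative label for a software data plane element
    packets_in: int    # packets received by the element
    packets_out: int   # packets forwarded by the element

    @property
    def dropped(self) -> int:
        # Assumption: the element should forward everything it receives,
        # so the shortfall is treated as dropped packets.
        return max(self.packets_in - self.packets_out, 0)


def locate_drops(path: List[ElementStats], min_drops: int = 1) -> List[ElementStats]:
    """Return the elements on the path that dropped at least min_drops packets."""
    return [e for e in path if e.dropped >= min_drops]


def worst_element(path: List[ElementStats]) -> Optional[ElementStats]:
    """Return the element with the most drops, or None if nothing was dropped."""
    lossy = locate_drops(path)
    return max(lossy, key=lambda e: e.dropped) if lossy else None


# Made-up counters for one packet path through a host.
path = [
    ElementStats("vm_vnic", 10000, 10000),
    ElementStats("virtual_switch", 10000, 9200),
    ElementStats("hypervisor_backend", 9200, 9200),
    ElementStats("physical_nic", 9200, 9150),
]

if __name__ == "__main__":
    suspect = worst_element(path)
    if suspect is not None:
        print(f"most drops at {suspect.name}: {suspect.dropped} packets")
```

The sketch only localizes drops; mapping a drop location to the contended resource (e.g., CPU or network bandwidth) would need knowledge of the element's internals and is not attempted here.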
We believe that, together, VND and PerfSight provide diagnostic solutions for both tenants and the provider, and that they form an integral basis for cloud network diagnosis.
Wenfei Wu has been a Ph.D. student in the Computer Sciences Department at the University of Wisconsin-Madison since 2010. His adviser is Prof. Aditya Akella. His research interests are in systems and networking. During his Ph.D. program, he has published research on datacenter network management, traffic engineering, Internet architecture, wireless networks and software-defined networks. Wenfei is expected to graduate at the end of 2015.