SDCBench: A Benchmark Suite for Workload Colocation and Evaluation in Datacenters

Workload colocation is commonly used in datacenters to improve server utilization. However, the unpredictable application performance degradation caused by contention for shared resources makes the problem difficult and limits the efficiency of this approach. This problem has sparked research into hardware and software techniques that focus on enhancing datacenters' isolation abilities. However, there is still no comprehensive benchmark suite for evaluating such techniques. To address this problem, we present SDCBench, a new benchmark suite that is specifically designed for workload colocation and characterization in datacenters. SDCBench includes 16 applications that span a wide range of cloud scenarios, carefully selected from existing benchmarks using clustering analysis. SDCBench implements a robust statistical methodology to support workload colocation and proposes the concept of latency entropy for measuring the isolation ability of cloud systems. It enables cloud tenants to understand the performance isolation ability of datacenters and choose their best-fitted cloud services. For cloud providers, it also helps them improve quality of service to increase their revenues. Experimental results show that SDCBench can simulate different workload colocation scenarios by generating pressure on multidimensional resources with simple configurations. We also use SDCBench to compare the latency entropies of public cloud platforms such as Huawei Cloud and AWS Cloud and a local prototype system, FlameCluster-II; the evaluation results show that FlameCluster-II has the best performance isolation ability of the three cloud systems, with an experience availability of 0.99 and a latency entropy of 0.29.


Introduction
Cloud computing has grown rapidly over the past few decades and is widely used in many application areas, such as web services, databases, big data processing, and machine learning [1,2]. Virtualization technologies allow end users to share cloud resources in the form of virtual machines (VMs) with an on-demand provisioning model [3]. By abstracting the underlying hardware resources, server utilization can be improved through workload consolidation [4]. However, contention for shared resources such as CPU, Last-Level Cache (LLC), and memory bandwidth between VMs causes performance interference, especially in multitenant cloud scenarios [5][6][7].
Interference may result in unpredictable performance degradation for cloud services, which not only reduces the user experience but also hurts resource efficiency in datacenters [8,9]. For example, pressure on multidimensional resources at the instruction cycle level becomes intractable, causing long tail latencies for interactive applications. Additionally, servers running latency-critical services can only operate at low utilization due to their unpredictable tail latency. Performance interference makes it difficult to utilize spare server capacity by colocating batch applications, since uncontrolled sharing of CPU cores, caches, and power causes high latency degradation. As a result, the average server utilization in most datacenters is only 10%-50% [10,11]. This leads to billions of dollars of waste in infrastructure and energy every year [12].
To mitigate the performance interference caused by shared resource contention, many researchers seek to enhance the isolation ability of cloud systems through hardware and software approaches. Hardware methods such as Intel RDT [13] and PARD [14] provide control interfaces for partitioning hardware resources such as LLC ways and memory bandwidth between colocated applications, thus reducing contention on these microarchitecture-level resources. Software methods commonly adopt mechanisms like resource overprovisioning, CPU core binding, and dynamic power management to protect latency-critical services from the interference of colocated workloads [4,15]. However, utilizing these techniques in datacenters requires new hardware support or system upgrades. Not all cloud providers are willing to implement such optimizations on their platforms, which also leads to service performance differences between providers. The need for predictable service performance in datacenters brings new challenges and opportunities for cloud system designs that seek to improve server-level resource utilization without hurting application-level performance.
Unfortunately, the lack of a comprehensive suite of workload colocation benchmarks makes studying this emerging problem challenging. First, it hampers research that seeks to analyze the causes of application interference. Latency-critical services (LCs) have a variety of latency requirements and microarchitectural characteristics. Their performance degradation may come from pressure on different shared resources or combinations thereof. However, most benchmarks in prior work are not designed for evaluating the performance interference of workloads in multitenant shared cloud scenarios [1,[16][17][18], and they are unable to exert a wide range of pressures on the underlying hardware resources. Additionally, they only observe service-level performance and do not support measuring system isolation ability. Second, most algorithm or architecture innovations in cloud systems focus on throughput-oriented designs that provide better resource pooling or provisioning abilities [10,19]. This leaves a blind spot in interference measurement and in the exploration of new isolation techniques. For example, many scheduling frameworks adopt optimized algorithms to improve the efficiency of allocating virtual machines (VMs) or jobs [20][21][22], while insufficient hardware isolation mechanisms may dramatically worsen the application performance of VMs (e.g., increasing job completion time). Nevertheless, none of the existing benchmark suites support measuring performance isolation ability in diverse cloud scenarios.
A workload colocation benchmark can help cloud providers understand and improve their infrastructures' isolation capabilities, thereby increasing their adoption by cloud users [8]. Designing such a benchmark is challenging for several reasons. First, the applications must be carefully selected to cover a comprehensive range of domains and multidimensional resource usage behaviors. Similar applications would make the benchmark redundant and harder to use. Second, the service performance degradation caused by interference may manifest in many system-level and microarchitectural metrics (e.g., tail latency and IPC). Characterizing the system uncertainty by consolidating these observed changes is a challenge. Third, it is not enough for these workloads to run individually on systems. Instead, the benchmark should support flexible mixing of workload types and intensities to adapt to different application colocation requirements in datacenters.
To solve these problems, we present SDCBench, a benchmark suite for workload colocation that addresses these challenges. SDCBench includes a diverse set of latency-critical (LC) services and best-effort (BE) applications, as well as a robust, validated experimental methodology that makes it easy to colocate these benchmarks on cloud systems and measure datacenters' resource isolation abilities. Our key contributions can be summarized as follows: (i) We present SDCBench, a new benchmark suite for isolation ability measurement in multitenant shared cloud systems that covers a wide spectrum of workload diversity and characterization. (ii) We propose the concept of latency entropy to describe the application performance degradation arising from resource contention in cloud systems. This enables one to quantify the system isolation ability and the efficiency of hardware and software partitioning technologies in workload colocation scenarios.

Background and Motivation

Large-scale online services such as e-commerce, search engines, online maps, social media, and advertising are widespread in today's datacenters. These interactive, latency-critical services are usually scaled across thousands of servers with fanout or multitiered architectures [33]. The intermediate state and accessible data are stored distributedly in memory or flash to ensure fast response times. A large number of microservices across multiple leaf nodes may collaborate to serve a user request. As the overall latency presented to the user is determined by the slowest nodes, even small interference, queueing delays, or other sources of performance variation in these nodes may cause dramatic service time increases (a.k.a. tail latency). For example, the tail latency of Google's web search service ranges from 0 to 500 ms, and the highest variation can exceed 600× [34]. The requirement for low and predictable tail latency of latency-critical applications limits server resource utilization in datacenters [4,35].
On the one hand, the load of interactive services varies significantly due to diurnal patterns and unpredictable bursts in user accesses. Cloud providers have to allocate resources to these services for their peak loads, which leads to much wastage through resource overprovisioning. On the other hand, it is difficult to utilize the spare capacity by colocating batch applications with them, as the interference from sharing CPU cores, cache, and power causes high tail latency degradation and may even violate the latency SLO. Achieving high performance isolation in cloud systems has thus been a key challenge in improving the resource efficiency of datacenters, and researchers have proposed many approaches to reduce the performance interference between colocated workloads in order to improve server utilization. The lack of a comprehensive benchmark suite hampers researchers in this area from evaluating their newly proposed methods, and it is also hard for cloud users to understand their application performance on different cloud platforms. In the following, we explain why the existing benchmarks do not address this problem.

Limitation of Existing Benchmark Suites.
Prior work has proposed a variety of benchmarks, including both latency-critical services and background applications, that can be deployed in cloud systems to help researchers in this area. These benchmarks fall short of our needs from the standpoints of workload diversity and performance metrics for interference measurement and studies. In the following, we compare SDCBench with some representative benchmarks from these aspects. Table 1 lists the existing benchmark suites for evaluating computing system performance. LINPACK [23], SPEC CPU [24], HPCC [25], and PARSEC [26] are benchmarks designed for evaluating high-performance computing (HPC) systems, which include several well-written programs running on computing hardware or simulators. These applications mainly focus on the peak speed of CPU processors (e.g., GFLOPS), which impacts the job completion time (JCT) of tested applications. Different from SPEC CPU, SPEC Cloud_IaaS [16] is specially designed for evaluating applications in cloud systems. It includes both latency-critical services and background applications and supports the measurement of service latency. However, SPEC Cloud_IaaS only provides two applications and is not representative of most of today's cloud scenarios.
Several benchmarks, such as YCSB [27], CloudSuite [1], TailBench [17], and μSuite [29], focus on performance measurement of cloud services. Among them, YCSB is a cloud benchmark specifically for data storage systems, which provides key-value queries against NoSQL databases. Similar to YCSB, μSuite includes four data-intensive interactive applications that are designed for measuring microarchitecture-level overheads such as system calls, context switches, and other OS overheads. CloudSuite and BigDataBench [18] provide both latency-critical and throughput-oriented applications to evaluate the microarchitectural traits that impact the performance of these services. However, their load testers adopt a "closed-loop" design and lack a rigorous latency measurement methodology. TailBench aggregates a set of interactive benchmarks and proposes a more accurate methodology to measure tail latency. However, all of these benchmarks are designed to measure monolithic application performance and cannot reflect differences in the performance isolation ability of cloud systems.
Other benchmarks, such as DeathStarBench [30], MLPerf [31], and ServerlessBench [32], target specific application domains in cloud datacenters. For example, DeathStarBench is a recent open-source benchmark suite for cloud and IoT microservices, which includes representative services such as social networks, video streaming, e-commerce, and swarm control. MLPerf is an industry-academic benchmark suite for machine learning that facilitates system-level performance measurement and comparison across diverse software platforms, such as TensorFlow [36] and PyTorch [37], as well as hardware architectures. ServerlessBench is a benchmark designed for serverless platforms. It contains a number of multifunction applications and focuses on function composition patterns.
2.1.3. Implications. The limitations of current benchmarks motivate us to design a new benchmark suite for widespread colocation scenarios in datacenters, to help researchers understand interference in cloud systems and evaluate newly proposed software and hardware techniques for improving system efficiency. The benchmark suite should be designed with the following principles. (i) Workload diversity: the benchmark should include both latency-critical services and background applications from a wide range of domains in cloud systems. The applications in the benchmark should be sensitive to pressure on multidimensional resources at the system and microarchitecture levels. (ii) Usability and robust evaluation methodology: the workloads in the benchmark suite should be easy to use for simulating different workload colocation scenarios in a local cluster or public cloud systems. The benchmark should also provide automatic evaluation and metric collection mechanisms that cover a wide range of service latencies and completion times, from microseconds to tens of minutes. (iii) Characterization and measurement of interference: the benchmark should be able to characterize the workload performance degradation caused by shared resource interference in cloud scenarios, and it should also provide metrics for measuring the performance isolation ability of cloud systems.

Overview of SDCBench.
In this section, we present the design of SDCBench. We first give an overview of the evaluation framework of the benchmark to explain how it works in cloud systems. Then, we describe the main components of the benchmark suite, including the application selection methodology, the definition of latency entropy, and the implementation of the key modules of SDCBench. Figure 1 shows an overview of SDCBench. SDCBench includes 16 latency-critical services and background applications with representative workload characterizations in multitenant shared cloud scenarios. It supports workload colocation in cloud systems based on these applications, and metric collection ranging from service performance to microarchitectural behaviors. Different from existing benchmarks, SDCBench implements a robust evaluation methodology and a latency entropy metric that enable it to measure the interference isolation ability between different tenants in a cloud system. SDCBench can be deployed in a local server cluster running on a container engine or on public cloud platforms built on top of VMs. We also develop an automatic software toolkit to help users easily build, deploy, and evaluate SDCBench in cloud systems. The toolkit consists of three key components: the colocation controller, the load generator, and the metric collector. For each evaluation case in a cloud scenario, the user can choose the needed benchmarks from SDCBench and configure their resources and runtime parameters, such as queries per second, request arrival pattern, and peak load. The colocation controller reads the user's configuration, prepares the sandbox environments, and runs the applications by calling the load generator to send requests. During this process, the metric collector monitors the service-level and system-level states and collects the valuable metrics, which are analyzed by the colocation controller before the evaluation results are presented to the user.
To design SDCBench, we first select candidate applications used in cloud scenarios. Existing benchmarks have proposed a large number of applications, from simple websites to resource-intensive background tasks. However, some of these benchmarks are similar in their workload behaviors and microarchitectural characterization. To select representative applications, we characterize these benchmarks using metrics describing CPU and memory behaviors and external resource requirements. This evaluation allows us to classify applications and choose a benchmark set that is representative of the required resource consumption. We collect more than 50 applications from existing benchmarks [1,17,18,31,32,38]. As shown in Figure 2, these applications come from diverse cloud scenarios such as web search, video processing, machine learning, and serverless computing. The performance measurement of these applications also covers a wide range of response times, from microsecond latencies in interactive services to tens of minutes of completion time in background tasks. However, many of these applications place similar pressures on the underlying hardware resources. For example, Image-classify, Scimark, and Alu are compute-intensive applications that require much computing capacity for task or data processing. A similar phenomenon can be found among I/O-intensive applications such as Redis, Memcached, and Media-streaming.
To reduce the number of redundant applications and select a representative set for SDCBench, we characterize the applications in these benchmarks by collecting their microarchitecture-level resource consumption. The measured metrics include CPU, memory, LLC, memory bandwidth, network, disk I/O, and IPC. Each application is deployed individually in an isolated sandbox, with no other workloads running on the server. Since latency-critical services serve varying loads, we measure these services under different load levels (10%, 50%, and 100%) and aggregate the collected data as their overall metrics. For background applications, we characterize them by collecting performance metrics under different input data sizes. The input datasets used for evaluating these applications are listed in Table 2. To limit measurement error from system noise, we prewarm every application for a period of time (e.g., 5 minutes) at each load level and collect statistically averaged metrics.
We use a vector P = {p_1, p_2, ⋯, p_n} to represent the profile of an application, where n is the number of metric dimensions in the resource consumption characterization. For each resource, we normalize the metric value to the maximum resource capacity of the server, mapping it into the range 0 to 1. In this paper, we construct a 7-dimension profiling vector for each application; the profile metadata includes the consumption of CPU cores, main memory, memory bandwidth, LLC, disk I/O, and network resources, plus the microarchitecture metric IPC. After that, we adopt the K-means algorithm [39] to cluster these applications based on their similarity and select a minimum set of representative benchmarks. We use the cosine distance, commonly used for classification, to define the similarity of these applications. Given the profile vectors of two applications i and j, their cosine distance can be calculated as follows:

d(i, j) = 1 − (∑_{k=1}^{n} p_ik p_jk) / (√(∑_{k=1}^{n} p_ik²) √(∑_{k=1}^{n} p_jk²)),

where p_ik (p_jk) represents the kth metric of application i (j). If the characterization of application i is close to that of application j, a small cosine distance is derived. Based on this characterization, the candidate benchmarks are clustered into 6 classes of latency-critical services and 10 classes of throughput-oriented applications (see Tables 3 and 4). As the benchmarks in the same class have similar workload behaviors, we select one from each class to build the application set of SDCBench. The selected 16 applications are listed in Table 5. For clarity, each number in the table is color-coded as follows: red is ≥0.8, yellow is between 0.2 and 0.8, and green is ≤0.2. We can see that these applications exhibit a variety of resource sensitivity characteristics, and many of them can generate considerable pressure on several dimensions of resources.
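The selection procedure above can be sketched as follows. This is a minimal illustration rather than the SDCBench implementation: profile vectors are assumed already normalized to [0, 1], and a plain k-means loop uses the cosine distance defined above; all function names are hypothetical.

```python
import math
import random

def cosine_distance(p, q):
    """d(i, j) = 1 - cosine similarity of two profile vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return 1.0 - dot / (norm_p * norm_q)

def kmeans(profiles, k, iters=100, seed=0):
    """Cluster profile vectors with k-means, assigning points by cosine distance."""
    rng = random.Random(seed)
    centers = rng.sample(profiles, k)
    for _ in range(iters):
        # Assign each profile to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in profiles:
            idx = min(range(k), key=lambda c: cosine_distance(p, centers[c]))
            clusters[idx].append(p)
        # Recompute centers as per-dimension means of cluster members.
        new_centers = []
        for c, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centers.append([sum(m[d] for m in members) / len(members)
                                    for d in range(dim)])
            else:
                new_centers.append(centers[c])
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers, clusters
```

One benchmark per resulting cluster would then be picked as the representative application, as described above.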

Application Descriptions. We now briefly describe the applications included in SDCBench.
Image-classify [40] is a deep learning serving application implemented in Python. The service takes images through HTTP requests and runs a ResNet model for image classification. Since ResNet is computationally intensive, the serving latency typically ranges from 10s to 1000s of milliseconds.
Redis [41] is an open-source key-value database widely used as a distributed in-memory cache and message broker. Redis is written in C and is highly efficient, providing sub-millisecond response latency.
Solr [42] is an open-source enterprise search engine written in Java, which supports full-text search, hit highlighting, and real-time indexing. Solr is highly scalable and fault-tolerant and is widely used for enterprise search and analytics use cases. The search latency of Solr is typically in the 10s of milliseconds.
Speech-recog [43] is a speech classification inference service consisting of a speech recognition model that takes the frequency spectrum of an input speech sequence and produces classified labels. This lightweight model is implemented in Python and takes only 10s of milliseconds for inference.
TPC-W [44] is a web server and database performance benchmark proposed by the Transaction Processing Performance Council. It defines a complete web-based shop for searching, browsing, and ordering books. The response time of such web interactions typically ranges from 10s to 100s of milliseconds.
Social-network [30] is a microservice application from the DeathStarBench benchmark: an end-to-end service that implements a broadcast-style social network with unidirectional follow relationships. Since requests may be forwarded to and processed by different components, the service latency typically ranges from 10s to 100s of milliseconds.
DecisionTree [45] is an application in the Spark benchmark suite, written in Scala. The Spark decision tree application is implemented with Spark MLlib APIs, which support decision trees for binary and multiclass classification and for regression. This application is highly IPC efficient.
Alu [32] is an arithmetic computation application in the serverless benchmark suite, ServerlessBench, which computes the arithmetic operation repeatedly with multiple threads. The Alu application is CPU intensive and requires much less memory and network resources.
PageRank [46] is a graph processing application implemented with Spark [47]. Since web pages could be numerous, the page rank computation also consumes relatively intensive resources and jobs would take at least minutes to complete.
DiskIO [48] is an application from the serverless benchmark suite FunctionBench, which runs the dd system command to create a file in the /tmp/ directory of the function runtime. The DiskIO application consumes little CPU, memory, and network resources while imposing high pressure on disk I/O bandwidth.
Dwarf-sort [18] is a big data sort application in the BigDataBench benchmark suite, implemented in Scala, which sorts Wikipedia entries by key. This application is typically memory and cache intensive.
AlexNet, LeNet, and ResNet20 [49] are deep learning training applications implemented with TensorFlow. These deep learning training processes impose huge and lasting pressure on CPUs, memory, cache, and network bandwidth. Typically, AlexNet has the highest resource demands, while LeNet consumes relatively fewer resources to train the network.
Matmul [50] is a matrix multiplication application in the HPCC benchmark. Typically, the matrix multiplication operation consumes a large amount of CPU, memory, and cache resources, while generating less network pressure.

Metric Collection.
SDCBench supports the measurement of both service-level and system-level metrics in workload colocation scenarios. These service-level metrics mainly focus on application performance and are presented to the user as intuitive results. System-level metrics are collected to analyze user application runtime behavior and the impact of system isolation ability on application performance. We now discuss the detailed metrics that are measured at the service level and system level.
Service-level metrics: these metrics provide an accurate profile of application performance for users.
(i) Response time: for interactive services, SDCBench records the response time of each request to analyze performance changes in cloud systems. The collected metrics include the latency of a single request, the average latency, and the tail latency (e.g., 90th, 95th, and 99th percentile latency). (ii) CPU utilization: SDCBench measures the CPU time spent by the application in user and kernel space. This metric helps determine which applications are sensitive to computational resources.

System-level metrics: these metrics describe the performance uncertainty of an application running on a cloud system and the impact of the resulting performance degradation on users.

(i) Experience availability (EA): experience availability augments system availability with the service's latency requirement (Figure 3). Prior work focuses on system failures that cause user experience degradation; that is, when the system fails, the service becomes unavailable. However, this ignores the low-latency requirement of cloud users. High tail latency also degrades user experience and may even violate the users' latency service level objectives (SLOs). SDCBench introduces a new EA metric [51,52] by combining system availability with service tail latency, defined as follows:

EA = (1/m) ∑_{i=1}^{m} t_i(τ, γ),

where the collected latency statistics are divided into m uniform time intervals (i.e., Δt). For the ith time interval, t_i(τ, γ) is set to 1 if the γth tail latency does not exceed the latency SLO τ; otherwise, t_i(τ, γ) is set to 0.
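Under this definition, EA is simply the fraction of time intervals whose γth percentile latency meets the SLO. The sketch below is illustrative (names are hypothetical), assuming a nearest-rank percentile:

```python
import math

def experience_availability(interval_latencies, slo, gamma=0.99):
    """EA over m uniform time intervals: the mean of the per-interval
    indicator t_i(tau, gamma), which is 1 when the gamma-th percentile
    latency of that interval meets the SLO tau, and 0 otherwise."""
    def pct(samples, q):
        # Nearest-rank percentile of a list of latency samples.
        s = sorted(samples)
        return s[max(0, math.ceil(q * len(s)) - 1)]
    m = len(interval_latencies)
    return sum(1 for iv in interval_latencies if pct(iv, gamma) <= slo) / m
```

For example, with four intervals of which two meet a 10 ms SLO at the 99th percentile, EA is 0.5.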
(ii) Latency entropy: the concept of entropy was first introduced by the German physicist Clausius in 1865 to describe the degree of disorder within a system [53]. Inspired by this, SDCBench proposes the latency entropy (LE) metric for measuring the uncertainty of cloud systems. Inside a computing system, the sequence of system calls and hardware access events (e.g., memory accesses, instruction fetches, and thread executions) occurring per unit time can be considered the microstate of the system. Colocating multiple applications makes the microstates of the system more complex, especially when shared resource contention occurs between applications. In these scenarios, system behaviors such as instruction fetching and execution become disorderly and unpredictable. Unfortunately, it is difficult to measure the internal microstates of computer systems under modern high-speed processor architectures. To help users understand their application performance changes with observable metrics, SDCBench defines the latency entropy over the variations of tail latency as a measure of system isolation ability. The latency entropy is calculated as follows:

LE = −∑_{i=1}^{n} p_i ln p_i,

where n is the number of latency distribution states and p_i represents the ith state's probability. In practice, we divide the collected latencies into multiple fixed-length intervals, each of which is treated as an individual state; the probability of a state is then approximated by the fraction of latency samples falling into the corresponding interval. For each cloud service, latency entropy describes its performance uncertainty in a cloud system, which implies the following.
(i) The smaller the number of latency distribution states, the smaller the latency entropy of the cloud system. (ii) The more uneven the probabilities of the latency distribution states, the smaller the latency entropy of the cloud system. For example, if service A has a latency state distribution of "[14,17,20], [24,25,29], [32,37]" and service B has a latency state distribution of "[19], [23,24,26,28,29], [31]", we can see that service B is more stable than service A; indeed, service B has a smaller LE score than service A (0.81 vs. 1.08), which is consistent with this observation.
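The example above can be checked numerically. Assuming the natural logarithm (which reproduces the reported 1.08 for service A), the entropy of the per-state sample counts is:

```python
import math

def latency_entropy(state_counts):
    """LE = -sum(p_i * ln p_i), where p_i is the fraction of latency
    samples falling into the i-th fixed-length interval (state)."""
    total = sum(state_counts)
    return -sum((c / total) * math.log(c / total) for c in state_counts if c)

# Service A: states [14,17,20], [24,25,29], [32,37] -> counts 3, 3, 2
le_a = latency_entropy([3, 3, 2])
# Service B: states [19], [23,24,26,28,29], [31] -> counts 1, 5, 1
le_b = latency_entropy([1, 5, 1])
```

This gives roughly 1.08 for service A and 0.80 for service B, consistent with the ordering in the text (the small gap to the reported 0.81 likely comes from binning or rounding differences).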

Results and Discussion
3.1. System Implementation. We implement an evaluation framework based on SDCBench, designed to help users easily understand the isolation ability of different cloud providers and thus evaluate their application performance on these platforms. As mentioned in Section 2.2, the colocation controller, load generator, and metric collector are the core modules in the framework design, which support automatic application configuration, deployment, and measurement of performance and cloud system isolation ability.

[Figure 3: timeline of system availability versus experience availability; a time interval is experience-available when its γth tail latency meets the latency SLO and experience-unavailable when it exceeds the SLO.]

Colocation controller: the controller automatically manages all necessary steps of workload evaluation in a cloud system. It provides application selection and parameter configuration interfaces with a visual front end for users. The latency-critical services and background applications in SDCBench are registered in a database, and the user selects the applications to evaluate by marking their flags as executable. For latency-critical services, SDCBench supports evaluation parameter settings such as request arrival pattern, peak load, warmup invocations, and total request invocations. For background applications, SDCBench supports settings for job execution times, task types, and input data sizes.
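As an illustration of these controller parameters, a colocation case might be configured as below. The schema and field names are hypothetical (SDCBench's actual configuration format may differ); the snippet only shows the shape of such a configuration:

```python
# Hypothetical evaluation-case schema; field names are illustrative only.
case = {
    "latency_critical": [
        {"name": "redis", "arrival": "poisson", "peak_qps": 5000,
         "warmup_invocations": 1000, "total_invocations": 100000},
    ],
    "background": [
        {"name": "pagerank", "executions": 3, "input_size": "large"},
    ],
}

def validate(case):
    """Sanity-check a colocation case before deployment."""
    for svc in case["latency_critical"]:
        assert svc["peak_qps"] > 0
        assert svc["total_invocations"] >= svc["warmup_invocations"]
    for job in case["background"]:
        assert job["executions"] > 0
    return True
```

The controller would read such a case, prepare sandboxes for each listed application, and hand the latency-critical entries to the load generator.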
SDCBench supports component-level (i.e., container-level) application colocation based on Docker APIs and Linux system tools. It uses docker update commands [54] and the numactl [55] tool to bind CPU cores and memory blocks to containers. Applications running on the same CPU socket may contend for the shared cache and memory bandwidth, which causes performance interference for the colocated workloads. SDCBench also provides a fine-grained resource partitioning mechanism for containers running within the same CPU socket; it adopts the Intel RDT tool [13] to set the cache ways and memory bandwidth for each container. Additionally, SDCBench uses the qdisc network tool [56] to allocate network bandwidth for the evaluated applications so as to measure their performance variations in both colocated and isolated workload scenarios.
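The core-binding step can be sketched by assembling the corresponding docker update invocation (--cpuset-cpus and --cpuset-mems are the actual Docker flags for pinning CPUs and NUMA memory nodes); the helper below is hypothetical and only builds the command rather than executing it:

```python
def bind_container(name, cpus, mem_node):
    """Build a `docker update` command pinning a container to the given
    CPU cores and NUMA memory node; the command is constructed, not run."""
    return ["docker", "update",
            "--cpuset-cpus", ",".join(str(c) for c in cpus),
            "--cpuset-mems", str(mem_node),
            name]

# Pin a hypothetical container to cores 0-3 on memory node 0.
cmd = bind_container("redis-lc", [0, 1, 2, 3], 0)
```

In practice, such a command would be passed to a subprocess runner on the target host; LLC way and memory bandwidth limits would then be applied separately through the Intel RDT tooling.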
Load generator: SDCBench implements a load generator to issue requests to the latency-critical services, which can be deployed on one or more client machines. The load generator integrates a traffic shaper, a client pool, and a recorder. It creates multiple clients from the thread pool to continuously generate requests, and the traffic shaper handles these requests and sends them to the backend service following the desired workload patterns (e.g., from production traces) by inserting delays between requests before sending them out over the network. The simulated clients operate in an "open-loop" mode, where requests are sent according to their desired timing characteristics without waiting for responses to previous requests. The open-loop setup [17,57] generates sufficient workload pressure on the evaluated services and can accurately capture queueing delays, an important factor impacting tail latency.
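A minimal open-loop client along these lines might look as follows, assuming Poisson arrivals (exponentially distributed inter-arrival times); the function name and signature are illustrative, not SDCBench's API:

```python
import queue
import random
import threading
import time

def open_loop_client(send, qps, duration_s, results, seed=0):
    """Open-loop load generation: each request is dispatched on its own
    thread after an exponentially distributed delay, so a slow response
    never delays subsequent sends (unlike a closed-loop client)."""
    rng = random.Random(seed)
    deadline = time.monotonic() + duration_s
    workers = []
    while time.monotonic() < deadline:
        t = threading.Thread(target=lambda: results.put(send()))
        t.start()
        workers.append(t)
        time.sleep(rng.expovariate(qps))  # mean inter-arrival = 1/qps
    for t in workers:
        t.join()
```

A recorder would drain the results queue and compute average and tail latency statistics from the collected response times.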
The recorder maintains a queue of processed requests that is shared among the simulated clients. It records the response time of every request sent by the clients, aggregates them, and calculates statistical metrics such as individual response times, average latency, and tail latency. The measured data can be stored in a database or exported as files.
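The recorder's aggregation step can be sketched as follows; the nearest-rank percentile method is an assumption, since the text does not specify how SDCBench computes tail latency:

```python
import math

def latency_stats(latencies_ms):
    """Aggregate recorded response times the way the recorder module does:
    report the mean latency and a tail percentile (nearest-rank 99th here)."""
    xs = sorted(latencies_ms)
    mean = sum(xs) / len(xs)
    # nearest-rank p99: smallest value at or above 99% of the samples
    idx = max(0, math.ceil(0.99 * len(xs)) - 1)
    return mean, xs[idx]
```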
Metric collector: the performance degradation caused by interference can manifest in multiple hardware resource activities. To accurately measure these system-layer and microarchitecture-level behaviors without introducing external overhead, SDCBench adopts a nonintrusive method to implement the metric collector. In the system layer, we collect the actual resource usage of the measured application, including the number of CPU cores, memory, network, and disk I/O. In the microarchitecture layer, we collect branch prediction errors, cache switching, context switching, memory-level parallelism, and misses per thousand instructions (MPKI). These metrics reflect the operating efficiency of the application code on the current physical hardware, such as locality and parallelism, and help us understand where the interference comes from and its impact on application performance.
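Two of the derived microarchitecture metrics can be computed directly from raw event counts such as those Perf exposes. This small sketch (function names are illustrative) shows MPKI and, as a companion efficiency indicator, IPC:

```python
def mpki(misses, instructions):
    """Misses per thousand instructions, from raw miss and instruction counts."""
    return misses * 1000.0 / instructions

def ipc(instructions, cycles):
    """Instructions per cycle, a basic measure of execution efficiency."""
    return instructions / cycles
```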
The collector runs alongside the measured applications in an isolated CPU socket and adopts a multithreading technique to invoke a series of monitoring tools, such as Intel RDT, Perf [58], and Docker Stats [54], for the metric measurement. When all of the monitors are complete, the collector formats the data and returns the results to the user. SDCBench is open source and available at https://github.com/TankLabTJU/sdcbench/tree/sdcbench-v2.0/.

3.2. Evaluation and Methodology.
SDCBench is designed to help cloud users understand the performance isolation ability of cloud systems by deploying colocated applications, whose workloads may share the underlying hardware resources, and measuring their performance changes. In the evaluation of SDCBench, we need to answer several key questions. Are the benchmarks in SDCBench representative of multitenant cloud scenarios by covering a wide range of latencies that can be measured? Is SDCBench able to observe service performance degradation due to interference from colocated workloads? Can hardware isolation mechanisms eliminate the performance variations caused by interference? How do the major cloud service providers perform in latency entropy measurement? To answer these questions, we use SDCBench to thoroughly evaluate cloud systems under different workloads. We begin with a local benchmark evaluation to verify that we cover various workload behaviors (Section 3.2.1). We then analyze the performance degradation of latency-critical services under different workload colocation scenarios (Section 3.2.2) and present a comparison of latency entropy in some of the existing cloud platforms (Section 3.2.3).
We evaluate SDCBench on both a local cluster and public cloud platforms. The benchmarks are implemented in C, Python 3.7, and Java. All real-system measurements reported in the evaluation were performed on servers with two Intel Xeon Silver-4215 CPUs. Table 6 shows the detailed server configurations. We run the load generator on an individual server to avoid interference from the deployed applications. In the testbed, we bind the applications to the second CPU socket and forbid the operating system from scheduling other tasks on the CPU cores in this socket, thus preventing interference from the system. We also disable the TurboBoost technique and use the cpufreq tool to fix the CPU frequency at 2.0 GHz, which helps to avoid unpredictable performance fluctuations [59]. The server and client nodes are connected via 10 Gbps, full-bisection bandwidth Ethernet.

3.2.1. Benchmark Characteristics. We now study the latency characteristics of each application, including the average request service time and tail latency. The service time of a request measures the time the application takes to process that request, which reflects the execution speed of the application code on dedicated hardware. The tail latency represents the few slowest requests (e.g., the slowest 1% of requests when measuring the 99th percentile latency); it is much more sensitive to small perturbations and can be used to observe service performance fluctuations. We also study how the request arrival rate affects tail latency in these applications. All measurements in this experiment were obtained by the recorder module in the load generator. To mitigate measurement error caused by system noise, we collect the evaluation metadata after the application is running stably, and each experiment is measured three times.

Q1: Are the benchmarks in SDCBench representative of multitenant cloud scenarios by covering a wide range of latencies that can be measured?
Figure 4 shows the cumulative distribution function (CDF) of request service times for each SDCBench application. The service times vary widely across applications. Almost all Redis requests finish in less than 1.1 ms, and the difference between the lowest and highest service times is only 0.3 ms. In contrast, a Social-network request can take more than 110 ms. Applications also vary widely in how tightly their request service times are distributed: for some applications, service times fall within a fairly narrow range, while others have a long tail. 90% of Social-network request times are distributed between 110 ms and 125 ms, and the remaining 10% are distributed between 125 ms and 175 ms, accounting for 77% of the total time distribution. For Solr, the slowest 1% of requests are spread over 100 ms to 150 ms, accounting for one-third of the total time distribution; Image-classify requests show a similar trend. Other applications, such as Speech-recog and TPC-W, have their request service times fairly evenly distributed across two specific ranges.

Figure 5 shows the mean and 99th percentile latencies for each application at various request loads. In these experiments, 100% of the request load represents the queries-per-second (QPS) limit of the application under the current configuration. At very low request load, the difference between mean and tail latencies mostly depends on the distribution of request service times. As the request load increases, both mean and tail latencies rise because of resource competition and queuing delays. However, the tail latencies of all applications except Image-classify increase much faster than the mean; for example, the tail latency of Solr requests is about 500 ms higher than the mean latency at 100% of the request load and 50 ms higher at 80%.
Redis shows a similar trend, but the gap between its tail latency and mean latency grows slowly compared to other applications; the tail latency of Redis is only about 1 ms higher than the mean at 100% of the request load. The difference between the tail and mean latency of Image-classify is also small, but its tail latency growth rate is gradually overtaken by its mean latency growth rate.

3.2.2. Interference Measurement. We next analyze the performance changes of these benchmarks in multitenant shared cloud scenarios. Users may deploy different workloads in their cloud virtual machines, which run on large numbers of physical servers. Recent studies have shown that interactive services such as websites make up a large part of these cloud applications. At the same time, many users also use cloud VMs to process data-intensive applications such as batch tasks. When deploying SDCBench to simulate these shared cloud scenarios, we expect to observe performance degradation in the colocated workloads, since the applications running inside the cloud system contend for the underlying hardware resources. This helps us to understand the sensitivity of the SDCBench applications to interference and the range of performance changes they can measure.
Based on the benchmark characterization of SDCBench in Section 2.3.1, we build four workload colocation suites with different levels of competition for shared resources (Figure 6). These application combinations are (i) CPU-intensive suite: Speech-recog, Alu, and PageRank, which are computation-intensive during their executions; (ii) memory-intensive suite: the latency-critical service Solr, the background application PageRank, and DNN model training for ResNet20, which rely heavily on memory resources during execution; (iii) hybrid contentions: Redis, DiskIO, and DNN model training for AlexNet, which generate pressures on multidimensional resources; and (iv) symbiotic workloads: Social-network, DecisionTree, and DNN training on ResNet20. Unlike the other combinations, these three applications can be colocated without significant performance interference on shared resources. We run these benchmark suites on the local cluster and evaluate their performance changes along with the measurement of system isolation ability.
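The four suites could be declared as a simple configuration object; the structure below is illustrative and is not SDCBench's actual configuration schema:

```python
# Hypothetical declaration of the four colocation suites from the text.
# Each suite lists the latency-critical service first, followed by the
# background applications it is colocated with.
COLOCATION_SUITES = {
    "cpu_intensive":    ["Speech-recog", "Alu", "PageRank"],
    "memory_intensive": ["Solr", "PageRank", "ResNet20-training"],
    "hybrid":           ["Redis", "DiskIO", "AlexNet-training"],
    "symbiotic":        ["Social-network", "DecisionTree", "ResNet20-training"],
}
```

Keeping the suites as data rather than code makes it straightforward to add new combinations or sweep over all suites in an evaluation driver.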
Q2: Is SDCBench able to observe service performance degradation due to interference from colocated workloads?

Figure 7 shows the latency distributions of the latency-critical services in these colocation workload suites. For the CPU-intensive, memory-intensive, and hybrid-contention suites, the latency under colocation is significantly higher than under solo-run, and the latency distributions of these three workload types become wider at all request loads, which means that the performance of these services is more unstable. Solr is the most sensitive to interference caused by colocation with other background workloads. At 50% of the request load, the mean latency of the colocated Solr service is about 70× that of the solo-run, and the latency distribution is also significantly wider. At 100% of the request load, the latency of Solr increases from about 600 ms to 9000 ms. Compared with the other two contention suites, Redis is less sensitive to colocation, but its latency still increases by 8× at 10% of the request load. The performance degradation caused by interference is especially obvious when the online service is under high load: as the load of an online service increases, its demand for memory and CPU resources grows, competition with background applications becomes more intense, and the service becomes more sensitive to interference. For example, at 10% and 50% of the request load, the latency of colocated Speech-recog increases by 2× and 10× compared with solo-run, respectively; at 100% of the request load, the latency increases by about 100×. In contrast, for the symbiotic suite, the latency distribution widens only slightly. The mean latency of the colocated Social-network increases by about 5 ms compared with solo-run at 10% of the request load.
However, the mean latency hardly changes at 100% of the request load and even decreases slightly at 50%. Therefore, when the symbiotic applications are combined with other background applications, performance changes little compared to solo-run, which is in line with our expectations. This result shows that the performance degradation in the first three suites is indeed caused by interference between applications, and it also shows that the application selection in SDCBench is well-founded.

Figures 8(a) and 8(b) show an example comparison of system-level and microarchitecture-level metrics for the hybrid-contention workload suite. As Redis generates high utilization of network and LLC resources, there is little difference in these two metrics between the solo-run and colocation groups. By colocating Redis with DiskIO and AlexNet, the CPU and memory utilizations in the system are improved by 2.7× and 13%, respectively. Meanwhile, disk I/O usage in the colocation group increases by 4.28×. However, the IPC and context-switch metrics decrease by 42% and 87% in the colocation group, since the workload interference significantly reduces the QPS of Redis. Additionally, the workload interference in the colocation group results in higher tail latency for Redis. Compared with the solo-run group, the cache miss rates of L1 and L2 increase by 1.7× and 1.3× in the colocation group, respectively. More cache misses also result in a higher memory access rate, and the memory bandwidth usage increases by about 30% in the colocation group. This explains why contention for shared resources leads to performance degradation for latency-critical services.

[Figure 7: The latency distributions of latency-critical services in the four workload suites under 10%, 50%, and 100% request loads. Each workload suite is evaluated in solo-run, colocation, and isolation groups.]
Q3: Can hardware isolation mechanisms eliminate the performance variations caused by interference? For the first three workload suites, application performance improves greatly after resource isolation. It can be seen from the figure that the performance of these three services under isolation is very close to that of solo-run. In the CPU-intensive suite, after isolation, the mean latency of Speech-recog at the three request loads decreases by about 20 ms, 90 ms, and 4000 ms, respectively, which is almost the same as that of solo-run, and its performance stability is also very close to solo-run. This indicates that workload colocation with resource isolation introduces limited interference. For the symbiotic application combination, there is no significant difference between resource isolation and plain colocation, although the latency distribution after resource isolation is more centralized, meaning that performance is more stable. However, under 100% request load, the mean latency of Social-network even increases by about 10 ms. This shows that resource isolation can improve application performance to a certain extent in multitenant cloud scenarios, but its effect should be analyzed according to the specific application characteristics.
We also measure the service experience availability and latency entropy metrics of these applications in each experiment group. Table 7 shows the comparison of service experience availability of the colocated latency-critical services, which is calculated with Equations (2) and (3). We define the latency-critical services as having their best performance in the solo-run group and set the latency SLO (τ) as the value that meets EA = 1.0 at 100% service load. The service experience availability decreases dramatically in the first three experiment groups because of the performance interference between colocated workloads. For example, the experience availability of Speech-recog drops by about 60% at the 10% load level in the colocation group. Moreover, its performance becomes even worse as we increase the request arrival rate, meeting the latency SLO in only 29% and 16% of time intervals at the 50% and 100% load levels, respectively. Additionally, isolation of hardware resources significantly improves the experience availability of these latency-critical services: for Speech-recog, it recovers to 0.97, 0.98, and 0.99 at the 10%, 50%, and 100% load levels, respectively. The availability changes of Solr and Redis are similar to those of Speech-recog. Social-network has the smallest performance fluctuations among the evaluated applications.
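The experience availability described above can be sketched as the fraction of measurement intervals whose tail latency meets the SLO τ. The exact form of Equations (2) and (3) is not reproduced in this section, so treat this as an approximation of the stated definition:

```python
def experience_availability(interval_tail_latencies_ms, slo_ms):
    """EA sketch: the share of measurement intervals whose tail latency
    meets the latency SLO (tau). An interval counts as experience-available
    when its tail latency does not exceed the SLO."""
    ok = sum(1 for lat in interval_tail_latencies_ms if lat <= slo_ms)
    return ok / len(interval_tail_latencies_ms)
```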
We further present the comparison of latency entropy in these experiments, listed in Table 8. For each workload colocation suite, we record the best and worst latencies in the solo-run group, divide this range into multiple latency intervals, and calculate the probabilities that latencies in the colocation and isolation groups fall into these intervals, thus deriving their latency entropy measurements. Social-network has the highest latency entropy in the solo-run group over the four workload colocation suites because its latency, which spans multiple microservice invocations, is more sensitive to performance fluctuations. The interference caused by shared resource contention increases the average LE by 13×, 4×, and 2.8× in the Speech-recog, Solr, and Redis groups, respectively, while Social-network is largely unaffected. By isolating the shared resources between these colocated applications, the average LE decreases by about 9.7×, 4.7×, and 2.8× in the Speech-recog, Solr, and Redis groups, respectively. This indicates that interference between colocated workloads can greatly magnify system uncertainty, leading to unpredictable performance degradation for applications running in the system. Introducing isolation mechanisms such as hardware partitions can effectively reduce the application performance fluctuation caused by system uncertainty, thus improving the user experience.
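The latency entropy procedure above can be sketched as a histogram over the solo-run latency range followed by Shannon entropy. The number of intervals and the logarithm base are assumptions, since this section does not specify them:

```python
import math

def latency_entropy(latencies, lo, hi, num_bins=10):
    """LE sketch: bin latencies into intervals spanning [lo, hi] (the best
    and worst solo-run latencies) and compute the Shannon entropy of the
    resulting probability distribution. Values outside the range are
    clamped into the edge bins."""
    counts = [0] * num_bins
    width = (hi - lo) / num_bins
    for x in latencies:
        idx = min(num_bins - 1, max(0, int((x - lo) / width)))
        counts[idx] += 1
    n = len(latencies)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)
```

A tightly clustered latency distribution yields low entropy (stable, predictable performance), while latencies spread across many intervals yield high entropy, matching the intuition that interference magnifies system uncertainty.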

3.2.3. Case Study in Public Cloud. Q4: How do the major cloud service providers perform in the latency entropy measurement? One of the key benefits of SDCBench is helping users understand their application performance in different public cloud systems. We seek answers in some of the existing public cloud platforms through a simple case study, in which we deploy SDCBench on these platforms to measure their latency entropy metrics. Specifically, our testbeds include the public cloud platforms Huawei Cloud and AWS Cloud and a local prototype system, FlameCluster-II [60], which is built on the Labeled von Neumann Architecture (LvNA), a new CPU architecture that supports better isolation of shared resources than traditional x86-based CPU architectures.
Since the current version of FlameCluster-II only supports the C and Java languages, we choose TPC-W and Redis as the evaluated latency-critical services to measure the EA and LE metrics of the three platforms. For the public cloud platforms, we deploy TPC-W and Redis in individual cloud VMs and collect their latencies at different times of the day. As the VMs of different cloud tenants may be scheduled onto the same server, a user's application performance can be impacted by interference from other tenants' workloads. For FlameCluster-II, we build an 8-node FPGA cluster and deploy the benchmarks across these nodes. The load generator in these experiments is deployed in an isolated server environment and communicates with the evaluated applications over the network. To reduce measurement error caused by system noise, we run each experiment three times and take the statistical metrics of the collected data as the evaluated results.
We collect more than 1,000,000 request latencies in each testbed and measure their EA and LE metrics. Figure 9 shows the comparison of the three platforms in average EA and LE. Huawei Cloud, AWS Cloud, and FlameCluster-II achieve 0.94, 0.86, and 0.99 of EA, respectively, indicating that users may obtain a better performance experience by deploying applications in Huawei Cloud rather than AWS Cloud. For the comparison of LE, the latency entropies of the three platforms are 1.34, 3.4, and 0.29, respectively. Applications in FlameCluster-II have minimal performance fluctuations owing to the strong hardware isolation of the LvNA design, which validates that hardware isolation is an effective way to eliminate performance uncertainty in cloud datacenters.
Additionally, the evaluation results show that applications in Huawei Cloud achieve better performance isolation than those in AWS Cloud. This may be because AWS has adopted more aggressive resource overselling policies across cloud tenants.

Conclusion
We have presented SDCBench, a benchmark suite and evaluation methodology for latency entropy measurement in datacenters. SDCBench seeks to help cloud tenants and providers understand performance isolation ability in datacenters by colocating workloads and observing their performance variations in cloud systems. SDCBench includes 16 representative applications selected from today's well-known benchmarks across a wide range of cloud scenarios. It proposes the concept of latency entropy and implements a robust methodology to measure the performance isolation ability of datacenters. Our validation results show that SDCBench can simulate different multitenant shared cloud systems with simple configurations. We also present a comparison of latency entropy among today's major cloud providers by deploying SDCBench in Huawei Cloud, AWS Cloud, and a local prototype system, FlameCluster-II. The evaluation results show that FlameCluster-II achieves the lowest latency entropy at 0.29, while the scores of Huawei Cloud and AWS Cloud are 1.34 and 3.4, respectively.