HIGH THROUGHPUT DISK SCHEDULING WITH FAIR BANDWIDTH DISTRIBUTION
#1

HIGH THROUGHPUT DISK SCHEDULING WITH FAIR BANDWIDTH DISTRIBUTION
SEMINAR REPORT
Submitted by
ANOOP THOMAS
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
COLLEGE OF ENGINEERING TRIVANDRUM


[attachment=7074]

1. ABSTRACT
Mainstream applications–such as file copy/transfer, Web, DBMS, or video streaming–typically issue synchronous disk requests. As shown in this paper, this fact may cause work conserving schedulers to fail both to enforce guarantees and to provide a high disk throughput. A high throughput can be however recovered by just idling the disk for a short time interval after the completion of each request. In contrast, guarantees may still be violated by existing timestamp-based schedulers, because of the rules they use to tag requests.

Budget Fair Queueing (BFQ), the new disk scheduler presented in this paper, is an example of how disk idling, combined with proper back-shifting of request timestamps, may allow a timestamp-based disk scheduler to preserve both guarantees and a high throughput. Under BFQ each application is always guaranteed–over any time interval and independently of whether it issues synchronous requests–a bounded lag with respect to its reserved fraction of the total number of bytes transferred by the disk device.

2. INTRODUCTION

Most mainstream applications, such as file transfer, Web, DBMS, Video on Demand or Internet TV, require moving data to/from disk devices. Meeting the disk bandwidth and pre request delay requirements of these applications and at the same time achieving a high throughput is not an easy task. The first problem is that the time needed to serve a request once dispatched to the disk device, hereafter called service time, is highly variable. Causes are seek and rotational latencies, variation of the transfer rate with the sector position, caching1. In addition, few if any of the existing controllers export these physical parameters, which make it difficult to predict service times based on request positions on the disk. Finally, most mainstream applications usually issue one request or batch of requests at a time. Especially, they block until the only outstanding request/batch has been completed. We denote this type of request or batch of requests as synchronous, for it can be issued only after the outstanding batch/request has been completed. The peculiar arrival pattern of synchronous requests may cause guarantee and disk throughput problems.

Guarantee violations may occur with timestamp-based schedulers, i.e., schedulers that timestamp requests as a function of their arrival time and in essence dispatch them to the disk in ascending timestamp order. The root of the problem is that the arrival of a synchronous request may be arbitrarily delayed by a scheduler by just delaying the dispatching of the preceding request. Then a delayed synchronous request may get a higher timestamp with respect to the one it would have got if not delayed. This higher timestamp may finally let the request wait for the service of more requests before being dispatched to the disk. Unfortunately, delaying the service and hence the completion of the request will
delay the arrival of the successive synchronous request of the same application, and so on. In the end, by just delaying the service of its requests, the scheduler may force the application to issue requests at a deceptively lower rate. If this anomaly occurs, the scheduler just fails to guarantee the reserved bandwidth or the assigned request completion times to the application (even to a greedy one).
In addition, a minimum amount of time is needed for an application to handle a completed request and to submit the next synchronous one. On one hand, this fact may further contribute to violating guarantees with timestamp-based schedulers. On the other hand, it may also prevent a work-conserving scheduler from achieving a high disk throughput. From the disk device standpoint, an application is deceptively idle until it issues the next synchronous request [1]. During one such idle time, the disk head may be moved away from the current position by a work conserving scheduler, thus losing the chance of a close access for applications performing mostly sequential IO. The problem is mitigated by the fact that operating systems typically perform read-ahead for request patterns deemed sequential.
Over-provisioning would be a way to easily guarantee a predictable and short request service time without dealing with the above problems. Unfortunately it entails high purchase, powering and cooling costs [2], and purposely wastes disk bandwidth. In fact, the only option to improve the disk utilization and meet the requirements of many competing applications is properly scheduling disk requests. Several schedulers–such as SCAN (Elevator), C-SCAN, LOOK or C-LOOK [3]–have been defined to achieve the first goal. These algorithms are often implemented also inside modern disk devices, which can internally queue requests and service them in the best order to boost the throughput. Finally, to achieve a high throughput also in presence of deceptive idleness, most Linux standard disk schedulers extend these policies with disk idling: they do not dispatch any other request to the disk for a short time interval (in the order of the seek and rotational latencies) after a synchronous request has been completed. By doing so they give a chance to the (possible) next request of the same application to arrive before the disk arm is moved away. The Anticipatory (AS) disk scheduler [1], extends C-LOOK in this sense.
However, to meet the requirements of most types of applications, guarantees must also be provided on bandwidth distribution or request completion times. In this respect, the problem of algorithms aimed only at maximising the disk throughput is that in the worst case they may delay the service of a request until the whole disk has been read or written (age-based policies, as in AS, can be added to mitigate the problem). In contrast, many schedulers, as e.g., SCAN-EDF [4], SATF-DAS [5], JIT [6], Hybrid [7], YFQ
[8], pClock [9], adaptive DRR [10], CFQ [11] and adaptations of SFQ [12] have been proposed to provide control on request completion time and/or on bandwidth distribution, while at the same time trying to keep the throughput high.
Apart from CFQ, all of these schedulers are work-conserving and none of them takes the delayed arrival problem into account. Hence, as previously discussed, they may provide a low disk throughput and, if timestamp-based, may violate guarantees. For similar reasons, guarantees for synchronous requests are violated with any scheduler if the disk device performs internal queueing, as discussed in more detail in Section IV. Our experimental results with a single disk thoroughly confirm the expected loss of guarantees for all the analyzed work-conserving timestampbased schedulers, as well as a loss of disk throughput in case of applications performing mostly sequential accesses. Of course, on the opposite end, adopting a na¨ıve disk idling approach in a multidisk system, i.e., idling the entire array of disks on the completion of each sequential synchronous request, may cause throughput loss [13]2. Some more effective strategies are mentioned together with our proposal in the next subsection. In contrast, a non timestamp-based scheduler should not suffer from either problem, provided that some mechanism is adopted to handle deceptive idleness. This is the case for round robin schedulers, as, e.g., CFQ, which does perform disk idling. Unfortunately, as discussed in Section IV and experimentally shown in Section V, they exhibit a higher delay/jitter in request completion times than the timestamp-based scheduler we propose.

2.1 PROPOSED SOLUTION

In this paper we propose a proportional share timestamp based disk scheduler, called Budget Fair Queueing (BFQ). As in any proportional share scheduler, in BFQ each application is guaranteed its reserved fraction (share) of the disk throughput, irrespective of its behaviour. Hence neither the request arrival pattern of the application needs to be known nor the application itself needs to be modified to provide such a guarantee.
BFQ serves applications as follows. First, when enquired, an application is assigned a budget, measured in number of sectors to transfer. Once selected, the application gets exclusive access to the disk. During the service of the application disk idling is performed to wait for the arrival of synchronous requests (only if these requests are deemed sequential, see Subsection II-B). Finally, if the application runs out of either its backlog or its budget, it is deselected and assigned a new budget. This service scheme allows BFQ to achieve a high throughput if applications issue mostly sequential requests. In contrast, for random workloads it is not as optimal as a global policy, like, e.g., C-LOOK. It is however worth noting that with random workloads even an optimal policy can achieve only a small fraction of the disk rate. In fact, where possible, caches are usually tuned so as to reduce disk access and achieve feasible response times.
Applications are scheduled by the internal Budget-WF2Q+ (BWF 2Q+) scheduler as a function of their budgets. The latter isa slightly extended version of WF2Q+ [14], a proportional share (fair-queuing) packet scheduler. B-WF2Q+ basically differs from the original packet scheduler in that, first, it handles the case where an application becomes idle before consuming all of its budget, and, second, it properly shifts backwards timestamps to conceal delayed arrivals. Thanks to this characteristics and to the fact that it works in the service and not in the time domain, BFQ provides the following guarantee to each application, over any time interval, with any workload and regardless of the disk physical parameters: each application is guaranteed the minimum possible lag, achievable by serving applications budget-by-budget, with respect to the minimum amount of service, measured in number of sectors transferred, that the application should receive according to its reserved share of the disk throughput and to the total amount of service provided by the system. A loose upper bound to this lag is 3Bmax, where Bmax is the maximum budget that can be assigned to any application. In general, BFQ can be fine-tuned to achieve the desired trade-off between throughput boosting and maximum per-application lag, by tuning Bmax and a few other configuration parameters. In addition, a simplified interface is provided for users not concerned with low-level details.
The above sector guarantees can be turned into time guarantees as follows. First, the aggregate throughput achieved with the desired values of the configuration parameters must be measured for the expected worst-case request pattern (see Section III-D). Then, worst-case time guarantees can be computed as a function of the aggregate throughput with a simple closed-form expression. Note that, on one side no other disk physical parameter needs to be known or measured, whereas, on the other side, there is no possibility with any scheduler to provide practical time guarantees without knowing at some degree at least the locality of the expected request pattern. And video streaming applications (Section V). In general, the main contribution of this paper is showing through BFQ a way to combine disk idling and timestamp back-shifting to preserve both guarantees and a high throughput in presence of synchronous requests. This technique may be applied also to other timestamp based schedulers. Finally, to preserve a high throughput in a multiple-disk system, the disk idling scheme described so far should be extended to allow, e.g., one outstanding request per disk. Or alternatively a hierarchical approach might be adopted, leaving the disk idling task only to the local per-disk schedulers. Carefully investigating these solutions is out of the scope of this paper.



2.2 ORGANIZATION OF THE PAPER

In Section II we describe BFQ and report its computational cost, whereas in Section III we show its service properties in the sector and time domains. In Section IV we provide a brief survey of related work. Finally, in Section V we compare the performance of BFQ against several research and production schedulers.



3. BFQ

After defining the system model in the next subsection, we introduce the logical scheme and the main algorithm in the successive two subsections, where B-WF2Q+ is used as a black box providing application enqueue/dequeue operations. We then describe B-WF2Q+ in detail in Subsections II-D and II-E.

3.1 SYSTEM MODEL

We consider a storage system composed of a disk device and the BFQ scheduler, as in Fig. 1. The former contains a disk, which we model as a sequence of contiguous fixed size sectors, each identified by its position in the sequence. The disk device services two types of disk requests, respectively the reading and the writing of a set of contiguous sectors. After receiving the start command or completing a request, the disk device asks for the next request to serve by invoking the function dispatch exported by the BFQ scheduler.
At the opposite end, requests are issued by the N applications served by the storage system (applications here stand for the possible entities that can compete for disk access in a real system, as, e.g., threads or processes). The meaning of the notations hereafter introduced is also summarized in Table I. The i-th application issues its j-th request Rj i by invoking the function add request exported by the scheduler, and passing the index i and the request Rj i . We define as size Lj i of Rj i the number of sectors to read or write, and as position/end of Rj i the position of the first/last of these sectors. We say that two requests are sequential if the position of the second request is just after the end of the first one. We define as arrival, start and completion time of a request the time instants aj i , sj i and cj i at which the request Rj i is issued by the i-th application, starts to be served and is completely served by the disk device, respectively. We say that a request is synchronous if it can be issued by an application only after the completion of its previous request. Otherwise the request is denoted as asynchronous.
We say that an application is receiving service from the storage system if one of its requests is currently being served. Both the amount of service Wi(t) received by an application and the total amount of service W(t) delivered by the storage system are measured in number of sectors transferred during [0, t]. For each application i, we let Bi,max denote the ynamically configurable maximum budget, in number of sectors, that BFQ can assign to it. We define Bmax _ maxi Bi,max. We say that an application is backlogged if it has pending requests. In addition, to deal with the delayed arrival and the deceptive idleness problems an application is denoted as quasibacklogged at time t if either it is backlogged, or it is not backlogged but its backlog emptied not before time t − Twait, where Twait is a system-wide dynamically configurable time interval. Otherwise the application is deemed as idle. Moreover we say that an application i enjoys the short-or-independent arrival property (SI property for short) if, for each request Rj i , either Rj i is asynchronous, or aj i − cj−1 i _ Twait. The actual adherence of real-world applications to the SI property is discussed in Section III, whereas hereafter we assume that applications do enjoy this property.
Each application has a fixed weight _i assigned to it. Without losing generality, we assume that PN q=1 _q _ 1. Given a generic function of time f(t), we define f(t− 1 ) _ limt→t− 1 f(t) and f(t1, t2) _ f(t2) − f(t1). Finally, given any time interval [t1, t2] during which the i-th application is continuously quasibacklogged, we define its reserved service during [t1, t2] as _i •W(t1, t2). Suppose that Rj+1 i is a synchronous request, and let cj i be the time instant at which the request Rj i would be completed if the i-th application received exactly its reserved service during [aji , cj i ]. We say that the arrival of Rj+1 i is (deceptively) delayed if cj i > cj i .








3.2 LOGICAL SCHEME

The logical scheme of BFQ is depicted in Fig. 2. Differently from Fig. 1, here solid arrows represent the paths followed by requests until they reach the disk device. On the contrary, dashed arrows represent flows of information or internal commands. Finally, circles represent algorithms or operations. There is a request queue for each application. Requests are inserted into the right queue by the add request function.
At any time each application has a budget assigned to it, measured in number of sectors. Let Bl i be the l-th budget assigned to the i-th application. At system start-up, all applications are assigned the same default budget B0 i . Disk access is granted to one application at a time, denoted as the active application.
When the new active application is selected, its current budget is assigned to a special remaining budget counter. Each time a request of the active application is dispatched, the remaining budget counter is decreased by the size of the request. Moreover, if the request queue gets empty at time t, a timer is set to t+Twait to wait for the possible arrival of the next request. This may leave the disk idle, but prevents BFQ from switching to a different application if the active application is deceptively idle and enjoys the SI property. In addition to not breaking a possible sequence of close or sequential accesses, this waiting is also instrumental in concealing delayed arrivals, as shown in Subsection II-E. However, waiting for the arrival of non-sequential requests provides no benefit. Hence, as CFQ, BFQ automatically reduces Twait to a very low (configurable) value for applications performing random IO (currently 2 ms, see the code for details [16]). If the active
application issues no new request before the timer expiration, it is deemed idle. The active application is exclusively served until either there is not enough remaining budget to serve the next request, or the application becomes idle. At this point the next budget Bl+1 i of the application is computed. Basically, Bl+1
i is obtained by increasing or decreasing Bl i, depending on whether the application consumed all of its budget or ran out of its backlog before consuming it. The next active application is then chosen by B-WF2Q+, which schedules enqueued applications as a function of their budgets (see Subsection II-E).
To guarantee a controllable per-application maximum queueing time, the order in which the requests are extracted from the queue of the active application depends on a user-configurable TFIFO parameter. When the next request of the i-th application is to be dispatched at time t, if aj i +TFIFO > t holds for all the queued requests Rj I, then the next request is chosen in C-LOOK order [3], otherwise the oldest request is picked.





3.3DISK WEIGHTED FAIR QUEUEING
In this subsection we provide a brief survey of the main concepts behind the original WF2Q+ algorithm (see [14] and [17] for details), (re)formulated in the disk scheduling domain. These concepts are the basis for B-WF2Q+, which is then described in detail in the next subsection.
Hereafter we use the term batch to denote the set of the requests served using a given budget. We define a batch as pending at time t if it has not yet been completely served at time t (each application has at most one pending batch at a time). We define two systems as corresponding if they serve the same applications and, at any time instant, they provide the same total amount of service per time unit.
WF2Q+ approximates on a packet-by-packet basis the ideal service provided by a work-conserving packet-based fluid system. In B-WF2Q+ we map the concept of packet into the concept of batch: B-WF2Q+ approximates on a batch-by-batch basis the service provided by a corresponding fluid system that may serve more than one application at a time. Hereafter this system is called just the ideal system, as opposed to the real system, i.e., to the storage system in Fig. 1. The ideal system is used as a reference because it distributes the total service among applications so as to guarantee to each application a bounded lag, over any time interval, with respect to its reserved service (see the appendix for details). Moreover, in the ideal system, each request is completed no later than the time instant at which it has to be completed according to the reserved service of the issuing application. On the contrary, as shown in the Section III, this does not hold in the real system. Hence a synchronous request in the real system may arrive later than in the ideal system. This fact may cause request arrivals to be deceptively delayed in the real system.

3.4WF2Q+ AND TIMESTAMP BACK-SHIFTING
The B-WF2Q+ algorithm is shown in Fig. 4 using pseudocode. An application is inserted into the scheduler by calling the function b-wf2q+ insert. If the application is not the active one, then, since Twait is waited for before eactivating an application and applications enjoy the SI property, b-wf2q+ insertis called as a consequence of the arrival of an asynchronous request. In this case the virtual start and finish times of the application are computed as a function of the actual time at which b-wf2q+ insert is called, i.e., the actual request arrival time, using the same formulas as in WF2Q+ [14].
On the other hand, if the application to insert is the active one, then according to Fig. 3 line 36, the application is necessarily being enqueued into B-WF2Q+ after a deactivation and not because of the arrival of a request. Let Rj i be the next request to serve of the application (there is certainly one). As explained below in detail, after a deactivation Fi is assigned the value assumed by the application virtual time upon the completion of the last served request, in this case Rj−1i, in the ideal system (Fig.3 lines 27-28). Let Fi be this value. According to [14], the
exact value of the application virtual time upon the arrival of Rj I should have been equal to max(Fi, V (aj i )). On the contrary, in BWF 2Q+ Fi is unconditionally assigned to Si, as if Rj i had arrived at a time instant between the arrival time aj−1 i and the completion time in the ideal system of the previous request Rj−1 i . This fictitious backward shift conceals the possibly delayed arrival of Rj i and of the successive requests served using the same budget as
Rj i . Moreover, the worst-case guarantees of the other applications are not endangered, as the guarantees provided to each application are independent of the arrival time of the requests of the other applications.
The function b-wf2q+ get next application returns the eligible application with the minimum virtual finish time and removes it from the internal data structure (lines 23-25). In case there are quasi-backlogged applications, but no one is eligible (lines 19-22), V (t) is pushed-up to the minimum virtual start time among all the quasi-backlogged applications (the latter coincide with the applications enqueued in B-WF2Q+ when b-wf2q+ get next application is invoked). Since an application is eligible if and only if its virtual start time is not higher than the system virtual time, this jump guarantees B-WF2Q+ to be work-conserving [14]. However, as shown in Subsection II-B the overall BFQ algorithm is not work-conserving, as it waits for the arrival of a new request before serving the next one.

4. SERVICE PROPERTIES

In this section we report the service properties of BFQ, in both sector (bandwidth distribution) and time domains. We also show how to perform admission control and to provide time guarantees. Finally we show how to achieve the desired trade-off between fairness granularity and throughput boosting. Before proceeding, it is important to identify the set of applications for which the service properties of BFQ actually hold also in presence of delayed arrivals. From Section II we know that BFQ conceals the delayed arrivals of the requests issued by the applications that meet the SI property. Hence BFQ guarantees to these applications the same amount of service and the same per-request completion time as if the arrival of each of their synchronous requests would not have been delayed. As a consequence, one would set Twait as high as possible to include as many applications as possible. However, the value of Twait
has an important impact on the disk throughput. It may provide significant boosting in presence of deceptive idleness, but only if its value is in the order of the seek and rotational latencies [1], namely a few milliseconds. In contrast, higher values may cause progressive performance degradation, as the disk may be left idle for too long.
Applications commonly alternate phases during which they make intense use of the disk device and phases during which they rarely access it. Fortunately, even for the above mentioned beneficial low value of Twait, during the former phases the SI property holds for the majority of mainstream applications. On
the other hand, should an application not meet the SI property in the other phases, the possible degradation of the guarantees on bandwidth distribution and request completion times would have a negligible impact on the overall application performance.

4.1 SECTOR-DOMAIN PROPERTIES

To show BFQ sector guarantees, we refer to a sector-variant of the Bit-Worst-case Fair Index (Bit-WFI), originally defined in packet systems [14]. This index, which we denote as Sector-WFI, allows us to predict the minimum amount of service guaranteed by a system to an application over any time interval during which the application is continuously quasi-backlogged (Section II). The following theorem holds.
Theorem 1: For any time interval [t1, t2] during which the i- th application is continuously quasi-backlogged, BFQ guarantees that:
_i •W(t1, t2) −Wi(t1, t2) _ Bmax + Bi,max + Lmax. (1)
The right hand side in (1) is the Sector-WFI of BFQ. Note that, if an application enjoys the SI property, then the time intervals during which it needs to access the disk safely coincide with the time intervals during which it is quasi-backlogged. Given the strong similarities between WF2Q+ and BFQ, the proof of
Theorem 1 is basically an extension of the proof of the Bit-WFI of WF2Q+ [14]. Whereas the full proof is reported in the appendix, an intuitive justification of each component of the bound follows. The component Bmax measures the deviation from the ideal service due to not respecting the batch completion order of the ideal system. More precisely, if the i-th applications has a lower virtual finish time than the active one, but becomes backlogged too late, it may unjustly wait for the service of at most Bmax sectors before accessing the disk.
The second component, Bi,max, stems from the fact that, if there is no constraint on the request arrival pattern, BFQ guarantees the real system to be in advance in serving the i- th application for at most Bi,max sectors with respect to the minimum guaranteed service at time t1. The application may pay back for this extra service during [t1, t2]. However, it is worth noting that this may happen only if a request Rj i may arrive before the maximum completion time guaranteed in the ideal system to the previous request Rj−1 i . Hence, on the opposite end, if this never occurs, i.e., the application never asks for more than its reserved service, the component Bi,max is not present at all. The last term follows from the stepwise approximation of V (t). Basically, due to wrong time-stamping, requests may be erroneously treated as if arrived, with respect to the actual arrival
time, after a time interval during which the ideal system may haveserved at most Lmax sectors.
It is important to note that the rightmost term in (1) does not grow with the time or the (total) amount of service, hence the long term bandwidth distribution is unconditionally guaranteed. Furthermore, (1) provides a simple relationship between the shortterm bandwidth distribution and the value of the parameters that
influence the aggregate disk throughput, as detailed in Subsection III-D. Finally, it is easy to prove that no scheduler that exclusively serves applications batch-by-batch may guarantee a lower Sector-WFI than BFQ. Hence BFQ provides optimal worst-case bandwidth distribution guarantees among this class of schedulers.

4.2 TIME-DOMAIN PROPERTIES

The following theorem is the starting point for computing the time guarantees of BFQ. Let t1 _ aj i be a generic time instant such that the i-th application is continuously quasi-backlogged during [t1, cj i ]. To compute a worst-case upper bound to cj i , we assume that all the applications, except for the i-th one, ideally start to issue asynchronous requests back-to-back from time t1(i.e., without waiting for the completion of their outstanding requests). Moreover, to prevent BFQ from increasing budgets and hence boosting the throughput more than it would happen in the actual scenario, we assume that the maximum value of the budget of each application, except for the i-th application, is set to its average value in the real scenario. Finally, thanks to the fact that BFQ conceals delayed arrivals for applications enjoying the SI property, if Rj i is a delayed synchronous request but the application does enjoy the SI property, in the following theorem aj i can be safely assumed to be equal to the time instant at which Rj i would have arrived if it had not been delayed.

Theorem 2: Given a request Rj i , let t1 _ aj i be a generic time instant such that the i-th application is continuously quasibacklogged during [t1, cj i ]. Let Tagg be the minimum aggregate disk throughput during an interval [t1, cj i ] under the above worstcase assumptions. Finally, let Lj i be the size of Rj i and Ai(t1, aj
i + TFIFO) be the sum of the sizes of the requests issued by the i- th application during [t1, aj i + TFIFO) plus Lj i . The following inequality holds:
cj i − t1 _ Qi(t− 1 )+Ai(t1,aj i+TFIFO)+(Bl i−Lj i )+Bi,max _iTagg + +Bmax+Bi,max+Lmax Tagg , (2) where Qi(t− 1 ) is the sum of the sizes of the requests of the i-th application not yet completed immediately before time t1, nd
Bl i is (the size of) the budget assigned to the i-th application to serve the batch that Rj i belongs to. As before, the proof of this theorem, which can be found in the appendix, is just an extension of the proof of the Time-WFI of WF2Q+. We discuss here the terms in (2), and in Subsec. III-C how to use (2) to perform admission control and to provide actual time guarantees.
The right hand side of (2) can be rewritten as 1 _iTagg • “Qi(t− 1 ) + i(t1, aj i + TFIFO)” + dj i . It is easy to see that the first component represents the orst-case completion time of Rj I in an ideal system guaranteeing no lagging behind the reserved service over any time interval. In contrast dj i represent the worstcase delay with respect to the ideal worst-case completion time. With regard to the first component, note that, if the i-th application issues only synchronous requests, the latter are always served in FIFO order. This is equivalent to assuming TFIFO = 0.The first component of the worst-case delay, Bl i − Lj i , stems from that the batch that Rj i belongs to is not timestamped (and
scheduled) as a function of Lj i , but as a function of Bl i. Hence, in the worst-case the service of Rj i may be delayed proportionally to the difference Bl
i−Lj i . Finally, the Bmax, Bi,max (which appears twice in dj i ) and Lmax terms can be explained using the same arguments as for the same terms in (1).


4.3 THROUGHPUT BOOSTING

Larger budgets increase the probability of serving larger bursts of close or even sequential requests, and hence of achieving a higher throughput. Besides, a large value of TFIFO may boost the throughput in presence of asynchronous requests. In this respect, also recall that most mainstream applications issue only
Synchronous requests. Hence, in a system serving this kind of applications FIFO has no impact either on the throughput or on the time guarantees (Subsection III-B). In contrast Twait may be just set to the most effective value for the target disk device, equal to the device-dependent average cost of seek and rotational
latencies, usually between 4 and 8 ms (and set by default to 4 ms in the current release of BFQ). BFQ exports a last low-level configuration parameter related to
disk throughput, namely the system-wide maximum time budget Tmax (possibly automatically computed, see Subsection V-A). Once got access to the disk, each active application must consume all of its time budget or backlog within no more than Tmax time units, otherwise it is unconditionally (over)charged for a Bi,max service and the next active application is selected. This additional mechanism prevents applications performing random IO from substantially decreasing the disk throughput. Hence it guarantees practical bandwidths and delays, which are inversely proportional to the aggregate throughput (Subsec. III-B), to applications
performing mostly sequential IO. In contrast, applications performing random IO virtually receive no service guarantees. According to (1) and (2), in addition to influencing the disk throughput, all these parameters directly or indirectly influence also guarantees. Hence, to set the desired trade-off between guarantee granularity and throughput boosting, the values of these parameters must be (iteratively) tuned by (iteratively) measuring the resulting throughput. BFQ also provides a simplified interface, which allows a user not interested in full control over all the parameters to avoid the resulting tuning complexity. This interface
has been used in the experiments reported in Section V, and is described in detail in Subsection V-A.


5. EXPERIMENTAL RESULTS

In this section we report the results of our single-disk experiments with BFQ, SCAN-EDF, YFQ, CFQ, C-LOOK, and AS, on a system running the Linux 2.6.21 kernel. We first provide implementation and configuration details in the next subsection. Then we describe the experimental setup and report the results of each set of experiments in the successive ones.

5.1 AGGREGATE THROUGHPUT

The first set of experiments was aimed at estimating the worst-case aggregate throughput guaranteed by each scheduler in case of simultaneous sequential reads. As can be seen in the next subsections, according to our experiments these results hold also for a Web server workload. Under BFQ and YFQ, all applications were assigned the same weight, whereas they were assigned the same priority under CFQ (which allows applications to be assigned different priorities). Under SCAN-EDF all the requests ere assigned the same deadline, equal to 20 ms. The whole set of different experiments was given by the combinations of the following 5 values: scheduler in {BFQ, SCAN-EDF, YFQ, CFQ,C-LOOK, AS}; cardinality of the set of distinct files to read in {2, 3, 4, 5} (for each set, the files were placed in slices at the maximum possible distance from each other, with each file in a distinct slice); value of the scheduler configuration parameter: maximum budget Bmax in {512, 1024, 2048, 4096, 8192, 16384} sectors for BFQ, batch size BTsize in {4, 8, 16} requests for YFQ, deadline granularity _ in {20, 40, 80, 160, 320} ms for SCANEDF, and time slice Tslice equal to 100 ms (the default value) for CFQ; Twait in {0, 4} ms for SCAN-EDF and YFQ, and Twait = 4 ms for BFQ (CFQ automatically sets/changes Twait); size of every file in {128, 256, 512, 1024} MB. Of course, Twait is implicitly 0 ms in C-LOOK, whereas it was set to 4 ms in AS with any additional fairness policy deactivated. The minimum file size was experimentally) chosen so as to let the results be due only to the disk schedulers, without significant distortions due to unrelated short-term factors such as CPU scheduling. With any scheduler, the lowest throughputs were achieved in case of two, 128 MB long, files, most certainly because in this case the disk head covers the longest distance and (spends more time moving) between the files (the influence of the length of the files was in the order of a few tenths of a MB/s). Or this scenario, Table II reports both the maximum value of the mean aggregate
throughput achieved by the six schedulers, and the value of the configuration parameter for which this value was achieved: Whereas the highest throughput is achieved by AS, the bad performance of C-LOOK is due to the fact that it frequently switches from one file to the other, for it does not perform disk idling. It is easy to see that, for Bmax = 16384 sectors, a maximum length budget is served in about 200 ms under BFQ, which confirms that a high throughput is achieved if a throughput boosting level of 1 is set with the simplified interface. Moreover, the higher throughput achieved by BFQ and SCAN-EDF with respect to CFQ results from the higher number of (sequential) sectors of a file that can be read before switching to the other file. More precisely, considering the disk peak rate, it is easy to see that less than 16384 sectors can be read in 100 ms (which is
the value of Tslice for CFQ), whereas more than 16384 sectors can be read in 640 ms (which is the value of _ for SCAN-EDF). Notably, Twait does not influence much the aggregate throughput with SCAN-EDF. In fact, as the system performs read-ahead for sequential accesses, it tends to asynchronously issue the next request before the completion of the current one (this also helps C-LOOK). Finally, the poor performance of YFQ and the fact that it is independent of both Twait and the batch size (not shown), confirm the arguments in Subsection IV-B.








5.2 SIMULTANEOUS SEQUENTIAL READS

The second feature we evaluated is the accuracy of the schedulers in providing the desired bandwidth fraction to applications performing sequential reads. For brevity, for BFQ we only report the results for Bmax = 4096 sectors. Similarly, for SCAN-EDF, we consider only _ = 80 ms because for this value SCAN-EDF achieves an aggregate throughput close to BFQ. Finally, in our experiments the batch size had no influence on the guarantees provided by YFQ. We report the results for BTsize = 4 requests. We first considered the case where all the applications are allocated the same fraction of the disk bandwidth. In particular, during the same experiments used to evaluate the aggregate throughput, we measured the throughput of each (file read) application. For each scheduler, the highest deviation from the ideal distribution occurred for 128 MB long files. Moreover, we observed that the 95% confidence interval for the mean throughput of the i-th application in case of 3 and 4 files was always smaller (or greater) than that obtained in case in case of 5 (or 2) files. This inclusion property held also in case of asymmetric allocation (see below). Hence, for brevity, we report only the results for two and five, 128 MB long, files. Finally, we consider only Twait = 4 ms for SCAN-EDF and YFQ in Table III.
BFQ, YFQ and C-LOOK exhibit the most accurate bandwidth distribution (C-LOOK basically performs a round robin among the batch of requests issued by the read-ahead mechanism). Unfortunately, as previously seen, YFQ has a low throughput. SCANEDF is less accurate for 2 files, but it provides a higher throughput than YFQ. Consistently with the arguments in Subsection IVB, CFQ fails to fairly distribute the throughput because of the varying sector transfer speed. Finally, as expected AS has the most asymmetric bandwidth distribution. Moreover the high width of the confidence interval for AS is a consequence of the fact that sometimes the waiting timer may expire before the next synchronous request arrives. In that case also AS switches to the service of another application.
It is important to observe that the accurate throughput distribution of YFQ and SCAN-EDF is mostly related to the symmetry of the bandwidth allocation. To measure the accuracy of the schedulers in distributing the disk bandwidth in case of asymmetric allocations, under BFQ and YFQ we assigned different weights to the applications. We run two sets of experiments, using, respectively, the weights 1 and 2, and the weights 1, 2 and 10 (there is no constraint on the values of the weights in our implementations of BFQ and YFQ). Moreover, assuming that the j-th application is the one with maximum weight, and denoted
as Sj the size of the file read by the j-th application, the file read by the i-th application had a length _i _j Sj . To try to allocate the same bandwidth as with BFQ and YFQ, under SCAN-EDF we assigned to each request of an application a relative deadline inversely proportional to the weight assigned to the application with the other two schedulers. Finally, this type of experiments was not run for CFQ, C-LOOK and AS, which do not provide differentiated bandwidth allocations. To show the worst-case, yet not distorted by external factors, performance of the schedulers, for the scenarios where the maximum weight was, respectively, 2 and 10 we report the results of only the experiments where the maximum file sizes were, respectively, 256 MB and 1 GB. Finally, since the above defined inclusion property holds also in case of asymmetric allocation, for brevity we report the results with BFQ only for the 2 and 5 files scenarios in Table IV.



5.3 WEB SERVER

In this set of experiments we estimated the per-request completion time guaranteed by the schedulers against the following Web server-like workload: 100 processes (all with the same weight/priority) continuously read files one after the other. Each of the files to read may have been, with probability 0.9, a small 16kB (html) file at a random position in the first half of the disk, or, with probability 0.1, a large file with random size in [1, 30] MB at a random position in the second half of the disk. Every 10 files, each process appended a random amount of bytes, from 1 to 16 kB, to a common log file. Such a scenario allows the performance of the schedulers to be measured for a mainstream application
and, in general, in presence of a mix of both sequential (large files) and a random (small files) requests. Especially, for each run, lasting for about one hour, we measured the completion time of each small file (latency), the (average) bndwidth at which large files were read and the average aggregate throughput (measured every second). In this respect, small and large file reads were performed in separated parts of the disk to generate an asymmetric workload, which is more prone to higher latencies and/or lower bandwidths.
As can be seen from Table V–excluding for a moment SCANEDF and YFQ–BFQ, CFQ and AS stand, respectively at the beginning, (about) the middle and the end of the low latency versus high aggregate throughput scale. Also C-LOOK achieves similar performance as AS, because a high number of competing
requests scattered over all the entire disk are present at all times. In contrast YFQ has very low latencies and throughput because, as the probability of a small file read is 9 times higher than the one of a large file read, each batch is likely to contain a high percentage of requests pertaining to small files. Finally, although
achieving a lower aggregate throughput than BFQ, SCAN-EDF guarantees higher bandwidths to the large files, by sacrificing the latency of the small ones. It is worth noting that the observed mean latency of BFQ is 2.22 times lower than CFQ and at least 4.41 times lower than all the other schedulers. On the contrary,
with any scheduler, the high width of the confidence interval for the mean bandwidth of the large files is a consequence of the quite random nature of the workload.


5.4 DBMS

This set of experiments was aimed at estimating the request completion time and the aggregate throughput achieved by the schedulers against a DBMS-like workload: 100 processes (all with the same weight/priority) concurrently issue direct (non cacheable) 4KB read requests at random positions in a common 5GB file. We have used this setup to evaluate also the performance of the schedulers with asynchronous requests, and, to this purposes, we have run 10 variants of the experiment, with each variant characterized by a different number of per-process outstanding requests, ranging from 1 (synchronous requests) to 10.
With random requests, AS does not perform disk idling at all, hence it achieves the same results as pure C-LOOK. Especially thanks to their global position-based request ordering, C-LOOK and AS achieve the lowest request completion time and hence the highest disk throughput. Being its service scheme closer to C-LOOK/AS than the one of BFQ, SCAN-EDF and CFQ, YFQ outperforms the latter in case of synchronous requests. It has instead the same performance as SCAN-EDF with 10 outstanding requests. Finally, because of their slice-by-slice/budget-by-budget service scheme both CFQ and BFQ exhibit a _ 1.4 times higher request completion time/lower throughput than C-LOOK/AS. It is however worth noting that only _ 2% of the disk rate is achieved by the latter, which is typically unbearable in a realistic DBMS. This result complies with the fact that, if the disk requests enjoy some locality, caches are usually tuned so as to reduce disk access and achieve feasible response times (of course multiple disks are typically used as well).



5.5 VIDEO STREAMING

last set of experiments was aimed at measuring the ability of the schedulers to support a very time-sensitive application. To this purpose, we set up a VLC streaming server [26], provided with 30 movies to stream to remote clients. Each movie was stored in a distinct disk slice, and the streaming thread of the server that read it was considered as a distinct application by the disk schedulers (all the threads were assigned the same weight/priority). To evaluate the performance of the schedulers, a packet sent by the streaming thread was considered lost if delayed by more than one second with respect to its due transmission time
(i.e., we assumed a 1 second playback buffer on the client side).
Every 15 seconds the streaming of a new movie was started. Each experiment ended either if 15 seconds elapsed from the start of the streaming of the last available movie, or if the packet loss rate reached the 1% threshold (this value, as well as the 1 second threshold for the delay, has been experimentally chosen to trade off between achieving a very high video quality, and allowing the system to stream a high enough number of simultaneous films to clearly show the different performance of the six schedulers). To mimic the mixed workload of a general purpose storage system and to increase to workload so that the disk was the only bottleneck, during each experiment we also run 5 ON/OFF file readers, each reading a portion of random length in [64, 512] MB of a file, and then sleeping for a random time interval in [1, 200] ms before starting to read a new portion of the same file (each file was stored in a different disk slice). _, BTsize and Tslice for SCAN-EDF, YFQ and CFQ were set to 20 ms, 4 requests and 20 ms, as these are the maximum values for which these schedulers achieve the highest number of simultaneous streams Table VII show the mean number of movies and the mean aggregate throughput achieved by the schedulers during the last five seconds before the end of each experiment. As can be seen, with Bmax = 4096 sectors BFQ guarantees a stable and higher number of simultaneous streams than all the other five schedulers. Interestingly, with BFQ also the aggregate throughput is comparable/higher than SCAN-EDF/YFQ. Most certainly this is a consequence of the low _ with SCAN-EDF. inally, to check whether the worst-case delay di,max guaranteed by BFQ to the periodic soft real-time application represented by any of the concurrent VLC streams complies with the observed maximum delay, from Table VII we can consider that, just before the end of most experiments, in case of Bmax = 4096 sectors, 24 + 5 applications with equal weights are competing for the disk, and the mean throughput is 7.56 MB/s. As in the Web server experiments we have that Lmax = 256 sectors, Li,min _ minj Lj i =16 sectors and a budget never higher than 256 sectors is assigned to the generic i-th streaming thread of the video server. Hence, according to (3).This value complies with the above mentioned 1 second threshold for considering a packet as late, assuming that a reasonable additional worst-case delay of _ 0.27 seconds is added by the rest of the system (probably mostly due to the execution of the 29 threads on the CPU).



6. CONCLUSIONS

In this paper we dealt with the problem of providing service guarantees while simultaneously achieving a high disk throughput in presence of synchronous requests. This type of requests may cause work-conserving scheduler to fail to provide a high throughput, and timestamp-based schedulers to fail to enforce guarantees. In this respect, we proposed BFQ, a new disk scheduler that combines disk idling and timestamp back-shifting to achieve a high throughput and preserve guarantees also in presence of synchronous requests.






BIBLIOGRAPHY



1. S. Iyer and P. Druschel, “Anticipatory scheduling: A disk scheduling framework to overcome deceptive idleness in synchronous I/O,” in 18th ACM Symposium on Operating Systems Principles, Oct. 2001.

2. G. Lawton, “Powering down the computing infrastructure,” Computer,
vol. 40, no. 2, pp. 16–19, 2007.

3. B. L. Worthington, G. R. Ganger, and Y. N. Patt, “Scheduling algorithms
for modern disk drives,” in SIGMETRICS ’94: Proceedings of the
1994 ACM SIGMETRICS conference on Measurement and modeling of
Computer systems. New York, NY, USA: ACM, 1994, pp. 241–251.


Reply

Important Note..!

If you are not satisfied with above reply ,..Please

ASK HERE

So that we will collect data for you and will made reply to the request....OR try below "QUICK REPLY" box to add a reply to this page
Popular Searches: imperial college engineering careers fair, fact about um66 called so, plasma science fair projects, science fair projects on physics, science fair projects gujarat, about ageing of bitumenance scheduling on an electric power distribution network ppt, science fair projects on solar ovens,

[-]
Quick Reply
Message
Type your reply to this message here.

Image Verification
Please enter the text contained within the image into the text box below it. This process is used to prevent automated spam bots.
Image Verification
(case insensitive)

Possibly Related Threads...
Thread Author Replies Views Last Post
  High Performance DSP Architectures computer science crazy 1 8,151 12-12-2012, 12:18 PM
Last Post: seminar details
Thumbs Down High Speed OFDM Packet Access (HSOPA) computer science crazy 2 10,417 08-12-2012, 02:44 PM
Last Post: seminar details
  High-Speed Downlink Packet Access (HSDPA) shibin.sree 1 9,157 08-12-2012, 02:44 PM
Last Post: seminar details
  High Speed Packet Access seminar surveyer 1 9,112 08-12-2012, 02:44 PM
Last Post: seminar details
  Hydra: A Block-Mapped Parallel Flash Memory Solid-State Disk Architecture summer project pal 3 2,922 01-12-2012, 12:40 PM
Last Post: seminar details
  A Hybrid Disk-Aware Spin-Down Algorithm with I/O Subsystem Support computer girl 0 1,065 07-06-2012, 04:01 PM
Last Post: computer girl
  Conflict-free scheduling and routing of automated guided vehicles in mesh topologies computer girl 0 829 05-06-2012, 12:52 PM
Last Post: computer girl
  high speed protocol processor to boost gateway performance electronics seminars 1 2,883 13-02-2012, 01:26 PM
Last Post: seminar paper
  HIGH-SPEED FACE RECOGNITION ppt seminar surveyer 1 2,733 20-01-2012, 10:03 AM
Last Post: seminar addict
  blue ray disk moonga 3 24,494 18-01-2012, 09:44 AM
Last Post: seminar addict

Forum Jump: