Using simple PID-inspired controllers for online resilient resource management of distributed scientific workflows

Ferreira da Silva, Rafael; Filgueira, Rosa; Deelman, Ewa; Pairo-Castineira, Erola; Overton, Ian M.; Atkinson, Malcolm P. 2019. Using simple PID-inspired controllers for online resilient resource management of distributed scientific workflows. Future Generation Computer Systems, 95, 615-628. https://doi.org/10.1016/j.future.2019.01.015

Full text not available from this repository.

Abstract/Summary

Scientific workflows have become mainstream for conducting large-scale scientific research. As a result, many workflow applications and Workflow Management Systems (WMSs) have been developed as part of the cyberinfrastructure to allow scientists to execute their applications seamlessly on a range of distributed platforms. Although the scientific community has addressed this challenge from both theoretical and practical approaches, failure prediction, detection, and recovery still raise many research questions. In this paper, we propose an approach inspired by the control theory developed as part of autonomic computing to predict failures before they happen and mitigate them when possible. The proposed approach is inspired by the control loop mechanism of the proportional–integral–derivative (PID) controller, which is widely used in industrial control systems and in which the controller reacts by adjusting its output to mitigate faults. PID controllers aim to detect the possibility of a non-steady state far enough in advance that an action can be performed to prevent it from happening. To demonstrate the feasibility of the approach, we tackle two common execution faults of large-scale data-intensive workflows: data storage overload and memory overflow. We developed a simulator that implements and evaluates simple standalone PID-inspired controllers to autonomously manage the data and memory usage of a data-intensive bioinformatics workflow that consumes/produces over 4.4 TB of data and requires over 24 TB of memory to run all tasks concurrently. Experimental results obtained via simulation indicate that workflow executions may benefit significantly from the controller-inspired approach, in particular under online and unknown conditions. Simulation results show that nearly optimal executions (slowdown of 1.01) can be attained when using our proposed method, and that faults are detected and mitigated far in advance of their occurrence.
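
To illustrate the control loop mentioned in the abstract, the following is a minimal sketch of a textbook discrete PID controller in Python. It is not the authors' simulator or their controller design: the class name `PIDController`, the gains, the setpoint, and the sample storage readings are all hypothetical and chosen only for illustration. The sketch assumes the controlled quantity is the fraction of storage in use, and that a negative output would be read as a signal to reduce load.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PIDController:
    """Textbook discrete PID controller (illustrative only, not the paper's code)."""

    kp: float        # proportional gain (hypothetical value supplied by the caller)
    ki: float        # integral gain (hypothetical)
    kd: float        # derivative gain (hypothetical)
    setpoint: float  # target value of the monitored quantity
    _integral: float = 0.0
    _prev_error: Optional[float] = None

    def update(self, measured: float, dt: float = 1.0) -> float:
        """Return the control output for one new measurement taken dt after the last."""
        error = self.setpoint - measured
        # Integral term: accumulated error over time.
        self._integral += error * dt
        # Derivative term: rate of change of the error (zero on the first sample).
        if self._prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self._prev_error) / dt
        self._prev_error = error
        return self.kp * error + self.ki * self._integral + self.kd * derivative


if __name__ == "__main__":
    # Assumed scenario: keep storage usage near 80% of capacity.
    controller = PIDController(kp=1.0, ki=0.1, kd=0.5, setpoint=0.8)
    for usage in [0.50, 0.65, 0.78, 0.85, 0.90]:  # made-up readings
        output = controller.update(usage)
        action = "reduce load" if output < 0 else "room to add work"
        print(f"usage={usage:.2f} output={output:+.3f} -> {action}")
```

In this sketch the derivative term responds to how quickly usage is approaching the setpoint, which is one way a controller can flag an overload before the limit is actually reached. How the output maps to concrete actions (for example, pausing tasks or triggering data cleanup) is left open here, since the abstract does not specify it.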

Item Type: Publication - Article
Digital Object Identifier (DOI): https://doi.org/10.1016/j.future.2019.01.015
ISSN: 0167739X
Date made live: 30 Jul 2019 08:54 +0 (UTC)
URI: https://nora.nerc.ac.uk/id/eprint/524563
