Abstract:
Orchestrating distributed data movement and computation for large-scale, data-intensive applications (Big Data, ML/AI) on modern heterogeneous architectures (CPU, GPU) presents significant challenges. Current systems often rely on passive, application-driven (pull-based) data access, which leads to inefficient resource utilization (notably GPU idle times of up to 70\%), complex manual memory management, and limited opportunities for global optimization and fault tolerance.
This paper introduces DCRuntime, a vision for a unified, runtime-orchestrated system designed to actively manage both data and compute streaming. DCRuntime employs a proactive, push-based "compute-follows-data" execution model. It exposes two core abstractions: 1) distributed data streams, representing potentially unbounded sequences of mutable or immutable buffers that span cluster resources, and 2) global compute streams, which enable asynchronous task execution across multiple devices, tightly coupled to data availability.
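To make the coupling between the two abstractions concrete, the following minimal Python sketch pairs a data stream whose buffers are pushed by a producer (standing in for the runtime) with a compute stream whose tasks fire as buffers become available. All names here (DataStream, ComputeStream, push, attach) are hypothetical illustrations of the push-based "compute-follows-data" idea, not DCRuntime's actual interface.

\begin{verbatim}
# Hypothetical sketch of runtime-pushed data streams coupled to compute
# streams; all names are illustrative, not DCRuntime's API.
import asyncio

class DataStream:
    """Unbounded sequence of buffers pushed by the runtime (producer side)."""
    def __init__(self) -> None:
        self._queue: asyncio.Queue = asyncio.Queue()

    async def push(self, buffer: bytes) -> None:
        # The runtime, not the application, decides when data arrives.
        await self._queue.put(buffer)

    async def next(self) -> bytes:
        return await self._queue.get()

class ComputeStream:
    """Tasks that execute as soon as their input buffers are available."""
    def __init__(self, source: DataStream) -> None:
        self._source = source

    async def attach(self, kernel, n_buffers: int) -> list:
        results = []
        for _ in range(n_buffers):
            buf = await self._source.next()  # compute follows data
            results.append(kernel(buf))
        return results

async def main() -> None:
    data = DataStream()
    compute = ComputeStream(data)

    async def produce():
        # Runtime-side producer: proactively pushes buffers.
        for i in range(4):
            await data.push(f"chunk-{i}".encode())

    consumer = asyncio.create_task(compute.attach(lambda b: len(b), n_buffers=4))
    await asyncio.gather(produce(), consumer)
    print(consumer.result())  # [7, 7, 7, 7]

if __name__ == "__main__":
    asyncio.run(main())
\end{verbatim}

The inversion of control is the point of the sketch: the application never polls for input; buffers arrive when the producer pushes them, and compute is triggered by their availability.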
By delegating data sourcing, sinking, shuffling, and compute scheduling to the runtime, DCRuntime gains the global perspective needed to optimize data placement, minimize I/O stalls, mitigate interference and power jitter, and enable faster, application-aware fault recovery. This approach aims to abstract away low-level complexities, significantly improve resource utilization (especially of GPUs), enhance scalability, and provide a resilient foundation for demanding workloads on shared, heterogeneous high-performance computing infrastructure.