


Feedback: XtreemOS summit at Euro-Par 2009

Delft, The Netherlands - August 25, 2009

The first XtreemOS summit was co-located with Euro-Par 2009 in Delft, The Netherlands.


The objective of this half-day summit was to present the XtreemOS technology through talks ranging from a general overview to selected topics including security, resource matching, and parallel I/O. After the talks, several demonstrations were run to show the benefits of the XtreemOS system.

The summit concluded with an interesting and fruitful discussion between the audience and the XtreemOS representatives.

XtreemOS summit webpage: here

Thilo Kielmann (VUA) also gave an invited talk at the CoreGRID 2009 workshop (held in conjunction with Euro-Par): "XtreemOS a sound foundation for IaaS and cloud federations"

In addition, John Mehnert-Spahn (UDUS) presented an XtreemOS paper:

- "The Architecture of the XtreemOS Grid Checkpointing Service" by John Mehnert-Spahn, Thomas Ropars, Michael Schoettner and Christine Morin



XtreemOS paper at COMPSAC 2009 Doctoral Symposium

Efficient Management of Consistent Backups in a Distributed File System


Authors: Jan Stender


Abstract: Setting up backup infrastructures for large-scale data management systems that can be operated cheaply and accessed with low latency has emerged as a practical problem. As a solution, we present a highly scalable and cost-efficient architecture for backup management in a distributed file system. We describe techniques for the creation of consistent backups at runtime, as well as approaches to resource management in connection with an integrated backup architecture.

COMPSAC 2009: website


Seattle, Washington, July 20-24, 2009

The Doctoral Symposium at COMPSAC will provide an international forum for doctoral students to interact with other students and faculty mentors. Since 2006, COMPSAC has been designated as the IEEE Computer Society Signature Conference on Software Technology and Applications.

The Doctoral Symposium seeks to bring together PhD Students working in computer software and applications and related fields. Selected students will have the opportunity to present and discuss their research goals, methodology, and preliminary results within a constructive and international atmosphere.

The Symposium organizers will strive to provide useful guidance for completion of the dissertation research and motivation for a research career. The Symposium is intended for students who have already settled on a specific research proposal and have produced limited preliminary results, but have enough time remaining before their final defense to benefit from the fruitful Symposium discussions. Due to the mentoring aspect of the event, the Symposium will be open only to the students and mentors participating directly in the event.

In coordination with the technical theme of COMPSAC 2009, topics pertaining to software engineering of critical infrastructure systems such as civil, telecommunications, and medical systems will be of particular interest. Related topics include, but are not limited to, requirements analysis, co-analysis and co-design, modeling, design, development, testing, measurement, verification and validation for performance, safety, security, and dependability constraints of such systems. As effective construction of critical infrastructure systems is not limited solely to the field of computer science and engineering and is truly a multidisciplinary effort, submissions addressing multidisciplinary research topics are particularly encouraged.


Two XtreemOS papers accepted at Euro-Par 2009



"The Architecture of the XtreemOS Grid Checkpointing Service"


Authors: John Mehnert-Spahn, Thomas Ropars, Michael Schoettner, Christine Morin

Abstract - The EU-funded XtreemOS project implements a grid operating system (OS) transparently exploiting distributed resources through the SAGA and POSIX interfaces. XtreemOS uses an integrated grid checkpointing service (XtreemGCP) for implementing migration and fault tolerance. Checkpointing and restarting applications in a grid requires saving and restoring applications in distributed heterogeneous environments. In this paper we present the architecture of the XtreemGCP service integrating existing system-specific checkpointer solutions. We propose to bridge the gap between grid semantics and system-specific checkpointers by introducing a common kernel checkpointer API that allows using different checkpointers in a uniform way. Our architecture is open to support different checkpointing strategies that can be adapted according to evolving failure situations or changing application requirements. We also present how to avoid resource conflicts during restart. Finally, we discuss measurements showing that the XtreemGCP architecture introduces only minimal overhead.
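The idea of a common checkpointer API can be illustrated with a minimal sketch. Everything below is hypothetical: the class and method names (`Checkpointer`, `checkpoint`, `restart`, `FakeCheckpointer`, `checkpoint_all`) are invented for illustration and are not the XtreemGCP interface, and the backends only record process ids instead of saving real process state.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class Checkpointer(ABC):
    """Hypothetical common interface (illustrative names, not the XtreemGCP API)."""

    @abstractmethod
    def checkpoint(self, pid: int) -> str:
        """Save the process state and return an image identifier."""

    @abstractmethod
    def restart(self, image: str) -> int:
        """Restore a process from an image and return its pid."""


class FakeCheckpointer(Checkpointer):
    """Stand-in for a system-specific checkpointer; it only records pids."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.images: Dict[str, int] = {}

    def checkpoint(self, pid: int) -> str:
        image = f"{self.name}-{pid}"
        self.images[image] = pid
        return image

    def restart(self, image: str) -> int:
        return self.images[image]


def checkpoint_all(jobs: Dict[Checkpointer, List[int]]) -> List[str]:
    """Grid-level code drives every backend through the same two calls."""
    return [cp.checkpoint(pid) for cp, pids in jobs.items() for pid in pids]


# Two different backends, used in a uniform way by the caller.
a, b = FakeCheckpointer("backend-a"), FakeCheckpointer("backend-b")
images = checkpoint_all({a: [101, 102], b: [201]})
print(images)
```

The caller never needs to know which backend it is talking to, which is the property the abstract attributes to the common API.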


"Active Optimistic Message Logging for Reliable Execution of MPI Applications"


Authors: Thomas Ropars, Christine Morin

Abstract - To execute MPI applications reliably, fault tolerance mechanisms are needed. Message logging is a well-known solution to provide fault tolerance for MPI applications. It has been proved that it can tolerate a higher failure rate than coordinated checkpointing. However, pessimistic and causal message logging can induce high overhead on failure-free execution. In this paper, we present O2P, a new optimistic message logging protocol, based on active optimistic message logging. Contrary to existing optimistic message logging protocols that save dependency information on reliable storage periodically, O2P logs dependency information as soon as possible to reduce the amount of data piggybacked on application messages. Thus, it reduces the overhead of the protocol on failure-free execution, makes it more scalable and simplifies recovery. O2P is implemented as a module of the Open MPI library. Experiments show that active message logging can effectively improve the scalability and performance of optimistic message logging.


Euro-Par conference website:


XtreemOS-related paper accepted at CCGrid 2009

Handling Persistent States in Process Checkpoint/Restart Mechanisms for HPC Systems


Authors: Pierre Riteau, Adrien Lebre and Christine Morin

CCGrid 2009 


Computer clusters are today the reference architecture for high-performance computing. The large number of nodes in these systems induces a high failure rate. This makes fault tolerance mechanisms, e.g. process checkpoint/restart, a required technology to effectively exploit clusters. Most process checkpoint/restart implementations only handle volatile states and do not take into account persistent states of applications, which can lead to incoherent application restarts. In this paper, we introduce an efficient persistent state checkpoint/restoration approach that can be interconnected with a large number of file systems. To avoid the performance issues of a stable support relying on synchronous replication mechanisms, we present a failure resilience scheme optimized for such persistent state checkpointing techniques in a distributed environment. First evaluations of our implementation in the kDFS distributed file system show the negligible performance impact of our proposal.




Paper about RSS accepted at the ICDCS 2009 conference


Autonomous Resource Selection for Decentralized Utility Computing


 ICDCS 2009


Many large-scale utility computing infrastructures comprise heterogeneous hardware and software resources. This raises the need for scalable resource selection services, which identify resources that match application requirements, and can potentially be assigned to these applications. We present a fully decentralized resource selection algorithm by which resources autonomously select themselves when their attributes match a query. An application specifies what it expects from a resource by means of a conjunction of (attribute, value-range) pairs, which are matched against the attribute values of resources. We show that our solution scales in the number of resources as well as in the number of attributes, while being relatively insensitive to churn and other membership changes such as node failures.
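The query model described in the abstract, a conjunction of (attribute, value-range) pairs, can be sketched as a simple predicate. This is only an illustration of the matching step under assumed conventions: the attribute names, the inclusive numeric ranges, and the helper `matches` are hypothetical, and the decentralized routing of queries between nodes described in the paper is not modelled here.

```python
from typing import Dict, Tuple

# Hypothetical types: a query maps an attribute name to an inclusive (low, high) range.
Query = Dict[str, Tuple[float, float]]
Resource = Dict[str, float]


def matches(resource: Resource, query: Query) -> bool:
    """Return True if every queried attribute is present and within its range."""
    return all(
        attr in resource and low <= resource[attr] <= high
        for attr, (low, high) in query.items()
    )


# Made-up example resources and their attribute values.
nodes = {
    "node-a": {"cpu_ghz": 2.4, "mem_gb": 8, "disk_gb": 250},
    "node-b": {"cpu_ghz": 3.0, "mem_gb": 2, "disk_gb": 500},
    "node-c": {"cpu_ghz": 1.8, "mem_gb": 16},
}
query = {"cpu_ghz": (2.0, 4.0), "mem_gb": (4, 64)}

# Each node evaluates the query against its own attributes only.
selected = [name for name, attrs in nodes.items() if matches(attrs, query)]
print(selected)
```

Because the predicate depends only on a node's own attributes, each resource can evaluate it locally, which is what lets resources select themselves without a central index.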





XtreemOS at PDCAT08

The Ninth International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT'08) was held in Dunedin, New Zealand, from December 1 to 4, 2008.

John Mehnert-Spahn (UDUS) presented XtreemOS in an invited talk, "XtreemOS: Beyond Grid Middleware" (slides), at the workshop "High Performance and Grid Computing" co-located with PDCAT08.



He also presented the paper "Checkpointing Process Groups in a Grid Environment" in the main track of PDCAT'08.


XtreemOS paper at PDCS 2008

Florian Mueller (UDUS) presented the paper "Transactional Data Sharing in Grids", related to the XtreemOS Object Sharing Service (OSS), at the IASTED International Conference on Parallel and Distributed Computing and Systems in Orlando, USA.



"Transactional Data Sharing in Grids" - M.-F. Mueller, K.-T. Moeller, M. Sonnenfroh, M. Schoettner (UDUS)



Paper abstract: "The EU-funded XtreemOS project implements a Linux-based grid operating system (OS), exploiting resources of virtual organizations through the standard POSIX interface. The Object Sharing Service (OSS) of XtreemOS addresses the challenges of transparent data sharing for distributed applications running in grids. We focus on the problem of handling consistency of replicated data in wide area networks in the presence of failures. The software architecture we propose interweaves concepts from transactional memory and peer-to-peer systems. Speculative transactions relieve programmers from complicated lock management. Super-peer-based overlay networks improve scalability and distributed hash tables speed up data search. OSS replicates objects to improve reliability and performance. In case of severe faults, the XtreemOS grid checkpointing service will support OSS. In this paper we describe the software architecture of OSS, design decisions, and evaluation results of preliminary experiments with a multi-user 3D virtual world."