Partial order reduction has traditionally been based on persistent sets, ample sets, stubborn sets, or variants thereof. Recently, we have presented a strengthening of this foundation, using source sets instead of persistent/ample/stubborn sets. Source sets subsume persistent sets and are often smaller than persistent sets. We introduced source sets as a basis for Dynamic Partial Order Reduction (DPOR), in a framework which assumes that processes are deterministic and that all program executions are finite. In this paper, we show how to use source sets for partial order reduction in a framework which does not impose these restrictions. We also compare source sets with persistent sets, providing some insights into conditions under which source sets and persistent sets do or do not differ.
Stateless model checking is a powerful technique for program verification, which however suffers from an exponential growth in the number of explored executions. A successful technique for reducing this number, while still maintaining complete coverage, is Dynamic Partial Order Reduction (DPOR). We present a new DPOR algorithm, which is the first to be provably optimal in that it always explores the minimal number of executions. It is based on a novel class of sets, called source sets, which replace the role of persistent sets in previous algorithms. First, we show how to modify an existing DPOR algorithm to work with source sets, resulting in an efficient and simple-to-implement algorithm. Second, we extend this algorithm with a novel mechanism, called wakeup trees, that makes it possible to achieve optimality. We have implemented both algorithms in a stateless model checking tool for Erlang programs. Experiments show that source sets significantly increase the performance and that wakeup trees incur only a small overhead in both time and space.
Stateless model checking is a powerful method for program verification that, however, suffers from an exponential growth in the number of explored executions. A successful technique for reducing this number, while still maintaining complete coverage, is Dynamic Partial Order Reduction (DPOR), an algorithm originally introduced by Flanagan and Godefroid in 2005 and since then not only used as a point of reference but also extended by various researchers. In this article, we present a new DPOR algorithm, which is the first to be provably optimal in that it always explores the minimal number of executions. It is based on a novel class of sets, called source sets, that replace the role of persistent sets in previous algorithms. We begin by showing how to modify the original DPOR algorithm to work with source sets, resulting in an efficient and simple-to-implement algorithm, called source-DPOR. Subsequently, we enhance this algorithm with a novel mechanism, called wakeup trees, that allows the resulting algorithm, called optimal-DPOR, to achieve optimality. Both algorithms are then extended to computational models where processes may disable each other, for example, via locks. Finally, we discuss trade-offs between the source- and optimal-DPOR algorithms and present programs that illustrate significant time and space performance differences between them. We have implemented both algorithms in a publicly available stateless model checking tool for Erlang programs; the source-DPOR algorithm is also at the core of a publicly available stateless model checking tool for C/pthread programs running on machines with relaxed memory models. Experiments show that source sets significantly increase the performance of stateless model checking compared to using the original DPOR algorithm and that wakeup trees incur only a small overhead in both time and space in practice.
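To make the reduction concrete, here is a minimal C/pthreads example of our own (not taken from the papers): three threads perform one shared write each, but only two of the writes conflict.

    #include <pthread.h>

    /* Three threads, one shared write each; only the two writes to x
     * conflict, the write to y commutes with both. */
    int x = 0, y = 0;

    void *p(void *arg) { x = 1; return NULL; }
    void *q(void *arg) { y = 1; return NULL; }
    void *r(void *arg) { x = 2; return NULL; }

    int main(void) {
        pthread_t tp, tq, tr;
        pthread_create(&tp, NULL, p, NULL);
        pthread_create(&tq, NULL, q, NULL);
        pthread_create(&tr, NULL, r, NULL);
        pthread_join(tp, NULL);
        pthread_join(tq, NULL);
        pthread_join(tr, NULL);
        return 0;
    }

There are 3! = 6 interleavings of the three writes, but only two Mazurkiewicz traces, determined by the order of the two writes to x; an optimal DPOR algorithm therefore explores exactly two executions.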
We present a new approach for stateless model checking (SMC) of multithreaded programs under Sequential Consistency (SC) semantics. To combat state-space explosion, SMC is often equipped with a partial-order reduction technique, which defines an equivalence on executions, and only needs to explore one execution in each equivalence class. Recently, it has been observed that the commonly used equivalence of Mazurkiewicz traces can be coarsened but still cover all program crashes and assertion violations. However, for this coarser equivalence, which preserves only the reads-from relation from writes to reads, there is no SMC algorithm which is (i) optimal in the sense that it explores precisely one execution in each reads-from equivalence class, and (ii) efficient in the sense that it spends polynomial effort per class. We present the first SMC algorithm for SC that is both optimal and efficient in practice, meaning that it spends polynomial time per equivalence class on all programs that we have tried. This is achieved by a novel test that checks whether a given reads-from relation can arise in some execution. Our experimental results show that Nidhugg/rfsc, although slower than the fastest SMC tools in programs where tools happen to examine the same number of executions, always scales similarly to or better than them, and outperforms them by an exponential factor in programs where the reads-from equivalence is coarser than the standard one. We also present two non-trivial use cases where the new equivalence is particularly effective, as well as the significant performance advantage that Nidhugg/rfsc offers compared to state-of-the-art SMC and systematic concurrency testing tools.
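The following small C/pthreads program (our illustration, under the paper's SC setting) shows why the reads-from equivalence can be much coarser: all four accesses to x below conflict pairwise, so there are 4! = 24 Mazurkiewicz traces, but the single read can observe only four different values, so there are just 4 reads-from classes.

    #include <assert.h>
    #include <pthread.h>

    int x = 0;    /* the only shared variable */

    void *w1(void *a) { x = 1; return NULL; }
    void *w2(void *a) { x = 2; return NULL; }
    void *w3(void *a) { x = 3; return NULL; }

    void *reader(void *a) {
        int r = x;                  /* the only read of x */
        assert(0 <= r && r <= 3);   /* observes init or one writer */
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        pthread_create(&t[0], NULL, w1, NULL);
        pthread_create(&t[1], NULL, w2, NULL);
        pthread_create(&t[2], NULL, w3, NULL);
        pthread_create(&t[3], NULL, reader, NULL);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

A reads-from-optimal SMC algorithm explores 4 executions here; with n writers the gap grows to (n+1)! versus n+1.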
Dynamic partial order reduction (DPOR) algorithms are used in stateless model checking (SMC) to combat the combinatorial explosion in the number of schedulings that need to be explored to guarantee soundness. The most effective of them, the Optimal DPOR algorithm, is optimal in the sense that it explores only one scheduling per Mazurkiewicz trace. In this paper, we enhance DPOR with the notion of observability, which makes dependencies between operations conditional on the existence of future operations, called observers. Observers naturally lead to a lazy construction of dependencies. This requires significant changes in the core of POR algorithms (and Optimal DPOR in particular), but also makes the resulting algorithm, Optimal DPOR with Observers, super-optimal in the sense that it explores exponentially fewer schedulings than there are Mazurkiewicz traces in some cases. We argue that observers come naturally in many concurrency models, and demonstrate the performance benefits that Optimal DPOR with Observers achieves in both an SMC tool for shared memory concurrency and a tool for concurrency via message passing, using both synthetic and actual programs as benchmarks.
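A minimal C example of our own shows the idea: the two writes below race on x, yet their order is unobservable because no read ever follows them.

    #include <pthread.h>

    int x = 0;    /* shared, but never read */

    void *w1(void *a) { x = 1; return NULL; }
    void *w2(void *a) { x = 2; return NULL; }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, w1, NULL);
        pthread_create(&t2, NULL, w2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;    /* no read observes x */
    }

Classic DPOR treats the two writes as unconditionally dependent and explores both orders; with observers the dependency is conditional on a future read that observes the writes, so a single scheduling suffices, and with n such writers the gap is n! versus 1.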
Programming language implementers rely heavily on benchmarking for measuring and understanding performance of algorithms, architectural designs, and trade-offs between alternative implementations of compilers, runtime systems, and virtual machine components. Given this fact, it seems a bit ironic that it is often more difficult to come up with a good benchmark suite than a good implementation of a programming language.
This paper presents the main aspects of the design and the current status of bencherl, a publicly available scalability benchmark suite for applications written in Erlang. In contrast to other benchmark suites, which are usually designed to report a particular performance point, our benchmark suite aims to assess scalability, i.e., help developers to study a set of performance points that show how an application's performance changes when additional resources (e.g., CPU cores, schedulers, etc.) are added. We describe the scalability dimensions that the suite aims to examine and present its infrastructure and current set of benchmarks. We also report a limited set of performance results in order to show the capabilities of our suite.
Erlang is a concurrent functional language based on the actor model of concurrency. In the purest form of this model, actors are realized by processes that do not share memory and communicate with each other exclusively via message passing. Erlang comes quite close to this model, as message passing is the primary form of interprocess communication and each process has its own memory area that is managed by the process itself. For this reason, Erlang is often referred to as implementing “shared nothing” concurrency. Although this is a convenient abstraction, in reality Erlang’s main implementation, the Erlang/OTP system, comes with a large number of built-in operations that access memory which is shared by processes. In this paper, we categorize these built-ins, and characterize the interferences between them that can result in observable differences of program behaviour when these built-ins are used in a concurrent setting. The paper is complemented by a publicly available suite of more than one hundred small Erlang programs that demonstrate the racing behaviour of these built-ins.
Corfu is a distributed shared log that is designed to be scalable and reliable in the presence of failures and asynchrony. Internally, Corfu is fully replicated for fault tolerance, without sharding data or sacrificing strong consistency. In this case study, we present the modeling approaches we followed to test and verify, using Concuerror, the correctness of repair methods for the Chain Replication protocol suitable for Corfu. In the first two methods we tried, Concuerror located bugs quite fast. In contrast, the tool did not manage to find bugs in the third method, but the time this took also motivated an improvement in the tool that reduces the number of traces explored. Besides more details about all the above, we present experiences and lessons learned from applying stateless model checking for verifying complex protocols suitable for distributed programming.
Implementations of network protocols must conform to their specifications in order to avoid security vulnerabilities and interoperability issues. We describe our experiences using symbolic execution to thoroughly test several implementations of a network security protocol against its specification. We employ a methodology in which we first extract requirements from the protocol's RFC and turn them into formulas. These formulas are then utilized by symbolically executing the protocol implementation to explore code paths that can be traversed on packet sequences that violate a requirement. When this exploration exposes a bug, corresponding input values are produced and turned into test cases that can validate the bug in the original implementation. Since we let symbolic execution be guided by requirements, it can naturally produce a wide variety of requirement-violating input sequences, which is difficult to achieve with existing techniques for protocol testing. We applied this methodology to test four different implementations of DTLS against the protocol's RFC. We were able to quickly expose a known CVE in an older version of OpenSSL, and to discover numerous previously unknown vulnerabilities and nonconformance issues in DTLS implementations, which have by now been confirmed and fixed by their implementors.
We describe an analysis-driven storage allocation scheme for concurrent systems that use message passing with copying semantics. The basic principle is that in such a system, data which is not part of any message does not need to be allocated in a shared data area. This allows for the deallocation of thread-specific data without requiring global synchronization and often without even triggering garbage collection. On the other hand, data that is part of a message should preferably be allocated in a shared area since this allows for fast (O(1)) interprocess communication that does not require actual copying. In the context of a dynamically typed, higher-order concurrent functional language, we present a static message analysis which guides the allocation. As shown by our performance evaluation, conducted using a production-quality language implementation, the analysis is effective enough to discover most data which is to be used as a message, and to allow the allocation scheme to combine the best performance characteristics of both a process-centric and a communal memory architecture.
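A schematic C sketch of the scheme (all names are ours; the actual implementation lives inside a functional language runtime): each process bump-allocates in a local heap that can be reclaimed without global synchronization, while data that the message analysis flags as a possible message is placed in the shared area, so that sending degenerates to passing a pointer.

    #include <stdlib.h>
    #include <string.h>

    typedef struct { char *base, *top, *end; } heap_t;

    /* Bump allocation in either a per-process or the shared heap;
     * growing the heap and garbage collection are elided. */
    void *heap_alloc(heap_t *h, size_t n) {
        if (h->top + n > h->end) return NULL;
        void *p = h->top;
        h->top += n;
        return p;
    }

    /* Message data already in the shared area: O(1) send, no copying. */
    void send_shared(void **mailbox_slot, void *msg) {
        *mailbox_slot = msg;
    }

    /* Data the analysis left on the sender's local heap must instead
     * be copied into the shared area at send time. */
    void send_copy(void **mailbox_slot, heap_t *shared,
                   const void *msg, size_t n) {
        void *copy = heap_alloc(shared, n);
        if (copy != NULL) {
            memcpy(copy, msg, n);
            *mailbox_slot = copy;
        }
    }

    int main(void) {
        static char area[1024];
        heap_t shared = { area, area, area + sizeof area };
        void *mailbox = NULL;
        int payload = 42;
        send_copy(&mailbox, &shared, &payload, sizeof payload);
        return mailbox == NULL;
    }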
We propose a novel approach for hardware-based strict TSO persistency, called TSOPER. We allow a TSO persistency model to freely coalesce values in the caches, by forming atomic groups of cachelines to be persisted. A group persist is initiated for an atomic group if any of its newly written values are exposed to the outside world. A key difference from prior work is that our architecture is based on the concept of a TSO persist buffer, which sits in parallel to the shared LLC, and persists atomic groups directly from private caches to NVM, bypassing the coherence serialization of the LLC.
To impose dependencies among atomic groups that are persisted from the private caches to the TSO persist buffer, we introduce a sharing-list coherence protocol that naturally captures the order of coherence operations in its sharing lists, and thus can reconstruct the dependencies among different atomic groups entirely at the private cache level without involving the shared LLC. The combination of the sharing-list coherence and the TSO persist buffer allows persist operations and writes to non-volatile memory to happen in the background and trail the coherence operations. Coherence runs ahead at full speed; persistency follows belatedly.
Our evaluation shows that TSOPER provides the same level of reordering as a program-driven relaxed model and, hence, approximately the same level of performance, albeit without needing the programmer or compiler to be concerned about false sharing, data-race-free semantics, etc., while guaranteeing that all software that can run on top of TSO automatically persists in TSO.
The Abstract Task Graph (ATaG) is a sensor network application development paradigm in which the application is visually described by a graph whose nodes correspond to application-level tasks and whose edges correspond to dataflows. We extend ATaG with the option to add nonfunctional requirements: constraints on end-to-end delay and packet delivery rate. Setting up these constraints at the design phase naturally leads to enabling run-time assurance at the deployment phase, when the conditions of the constraints are used as the network's performance goals. We provide both run-time middleware that checks the conditions of these constraints and a central management unit that dynamically adapts the system by reallocating tasks and by placing task copies on redundant nodes. Through extensive simulations we show that the system is efficient enough to enable adaptations within tens of seconds even in large networks.
Sensor network macroprogramming methodologies such as the Abstract Task Graph hold the promise of enabling high-level sensor network application development. However, progress in this area is hampered by the scarcity of tools and by insufficient focus on developing tool support for programming applications that are aware of performance requirements.
We present ProFuN TG (Task Graph), a tool for designing sensor network applications using task graphs. ProFuN TG provides automated task mapping, sensor node firmware macrocompilation, application simulation, deployment, and runtime maintenance capabilities. It allows users to incorporate performance requirements in the applications, expressed through constraints on task-to-task dataflows. The tool includes middleware that uses an efficient flooding-based protocol to set up tasks in the network, and also enables runtime assurance by keeping track of the constraint conditions.
We show that the adaptive task reallocation enabled by our approach can significantly increase application reliability while decreasing energy consumption: in a network with unreliable links, we achieve a task-to-task PDR above 99.89% while keeping the maximal radio duty cycle around 2.0%.
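A simplified C sketch of the constraint checking that the runtime assurance performs (all names, fields, and thresholds here are ours, purely for illustration): the middleware evaluates the condition of each dataflow constraint against the observed statistics, and a violation is what triggers task reallocation by the management unit.

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical constraint on a task-to-task dataflow: bounds on
     * packet delivery rate (PDR) and end-to-end delay. */
    typedef struct {
        double min_pdr;        /* e.g. 0.999 */
        double max_delay_ms;   /* end-to-end delay bound */
    } flow_constraint;

    typedef struct {
        unsigned sent, delivered;
        double   avg_delay_ms;
    } flow_stats;

    /* Middleware-side check; a violation would be reported to the
     * central management unit, which may remap tasks to other nodes. */
    bool constraint_holds(const flow_constraint *c, const flow_stats *s) {
        double pdr = s->sent ? (double)s->delivered / s->sent : 1.0;
        return pdr >= c->min_pdr && s->avg_delay_ms <= c->max_delay_ms;
    }

    int main(void) {
        flow_constraint c = { .min_pdr = 0.999, .max_delay_ms = 50.0 };
        flow_stats s = { .sent = 1000, .delivered = 997,
                         .avg_delay_ms = 12.5 };
        if (!constraint_holds(&c, &s))
            puts("constraint violated: trigger task reallocation");
        return 0;
    }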
In this demo abstract we present ProFuN TG (Task Graph), a tool for sensor network application development using the data-flow programming paradigm. The tool supports the whole lifecycle of a WSN application: from the initial design of its task graph, task placement on network nodes, execution in a simulated environment, and deployment on real hardware, to its automated maintenance through task remapping. ProFuN TG allows users to program applications that incorporate quality-of-service requirements, expressed through constraints on task-to-task data flows.
Sensor network macroprogramming methodologies such as the Abstract Task Graph hold the promise of enabling high-level sensor network application development. However, progress in this area is hampered by the scarcity of tools and by insufficient focus on developing tool support for programming applications that are aware of performance requirements.
In this demo we present ProFuN TG (Task Graph), a tool for designing sensor network applications using task graphs. ProFuN TG provides automated task mapping, sensor node firmware macrocompilation, application simulation, deployment, and runtime maintenance capabilities. It allows users to incorporate performance requirements in the applications, expressed through constraints on task-to-task dataflows. The tool includes middleware that uses an efficient flooding-based protocol to set up tasks in the network, and also enables runtime assurance by keeping track of the constraint conditions.
Through task allocation that optimizes an objective function in a model of the network, and adaptive task reallocation in case of link, node, or sensor failures, the tool helps make sensornet applications both more energy-efficient and more reliable.
Recent years have witnessed an increasing number of protocols relying on UDP. Compared to TCP, UDP offers performance advantages such as simplicity and lower latency. This has motivated its adoption in Voice over IP, tunneling technologies, IoT, and novel Web protocols. To protect sensitive data exchange in these scenarios, the DTLS protocol has been developed as a cryptographic variation of TLS. DTLS’s main challenge is to support the stateless and unreliable transport of UDP. This has forced protocol designers to make choices that affect the complexity of DTLS, and to incorporate features that need not be addressed in the numerous TLS analyses.
We present the first comprehensive analysis of DTLS implementations using protocol state fuzzing. To that end, we extend TLS-Attacker, an open source framework for analyzing TLS implementations, with support for DTLS tailored to the stateless and unreliable nature of the underlying UDP layer. We build a framework for applying protocol state fuzzing on DTLS servers, and use it to learn state machine models for thirteen DTLS implementations. Analysis of the learned state models reveals four serious security vulnerabilities, including a full client authentication bypass in the latest JSSE version, as well as several functional bugs and non-conformance issues. It also uncovers considerable differences between the models, confirming the complexity of DTLS state machines.
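To picture what a learned model looks like, the C fragment below hand-codes a hypothetical, heavily simplified server state machine (the states, inputs, and buggy transition are ours, purely for illustration, not one of the learned models): an authentication bypass of the kind mentioned above shows up in such a model as a path that reaches the authenticated state even though the certificate messages were skipped.

    #include <stdio.h>

    typedef enum { INIT, HELLO_DONE, AUTHENTICATED, ERR } state;
    typedef enum { CLIENT_HELLO, CERTIFICATE, CERT_VERIFY,
                   FINISHED } input;

    /* One transition of the (hypothetical) learned state machine. */
    state step(state s, input in) {
        switch (s) {
        case INIT:
            return in == CLIENT_HELLO ? HELLO_DONE : ERR;
        case HELLO_DONE:
            /* A correct server demands CERTIFICATE and CERT_VERIFY
             * before FINISHED; accepting FINISHED directly is the
             * kind of bypass that analyzing the model reveals. */
            return in == FINISHED ? AUTHENTICATED : ERR;
        default:
            return ERR;
        }
    }

    int main(void) {
        state s = step(step(INIT, CLIENT_HELLO), FINISHED);
        if (s == AUTHENTICATED)
            puts("bypass: authenticated without client certificate");
        return 0;
    }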
DTLS-Fuzzer is a protocol state fuzzer for implementations of DTLS clients and servers. DTLS-Fuzzer uses model learning to generate a state machine model of a DTLS implementation, capturing its input/output behavior. This model can be used for model-based testing or can be analyzed for security vulnerabilities and specification violations. This demo abstract overviews the architecture, API, and usage of the tool.
We present the design of a formal low-level multithreaded language with advanced region-based memory management and thread synchronization primitives, where well-typed programs are memory safe and race free. In our language, regions and locks are combined in a single hierarchy and are subject to uniform ownership constraints imposed by this hierarchical structure: deallocating a region causes its sub-regions to be deallocated. Similarly, when a region is read/write-protected, then its sub-regions inherit the same access rights. We discuss aspects of the integration and implementation of the formal language within Cyclone and evaluate the performance of code produced by the modified Cyclone compiler against highly optimized C programs using pthreads. Our results show that the performance overhead for guaranteed race freedom and memory safety is in most cases acceptable.
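The uniform ownership constraints can be sketched in plain C as a region tree (a simplification of the formal type-based language; all names are ours): freeing a region recursively frees its sub-regions, and a protection flag set on a region is inherited when access to any descendant is checked.

    #include <stdlib.h>

    typedef struct region {
        struct region *parent, *first_child, *next_sibling;
        int locked;                    /* read/write protection flag */
    } region;

    /* Deallocating a region deallocates its whole subtree. */
    void region_free(region *r) {
        for (region *c = r->first_child; c != NULL; ) {
            region *next = c->next_sibling;
            region_free(c);
            c = next;
        }
        free(r);
    }

    /* Access rights are inherited down the hierarchy: a region is
     * accessible only if neither it nor any ancestor is protected. */
    int region_accessible(const region *r) {
        for (; r != NULL; r = r->parent)
            if (r->locked) return 0;
        return 1;
    }

    int main(void) {
        region *root  = calloc(1, sizeof *root);
        region *child = calloc(1, sizeof *child);
        child->parent = root;
        root->first_child = child;
        root->locked = 1;                   /* protect the root...      */
        int ok = region_accessible(child);  /* ...the child inherits it */
        region_free(root);                  /* frees the child as well  */
        return ok;                          /* 0: access denied */
    }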
Concolic testing is a software testing technique that simultaneously combines concrete execution of a program (given specific input, along specific paths) with symbolic execution (generating new test inputs that explore other paths, which gives better path coverage than random test case generation). So far, concolic testing has been applied, mainly at the level of bytecode or assembly code, to programs written in imperative languages that manipulate primitive data types such as integers and arrays. In this article, we demonstrate its application to a functional programming language core, the functional subset of Core Erlang, which supports pattern matching, structured recursive data types such as lists, recursion, and higher-order functions. We present CutEr, a tool implementing this testing technique, and describe its architecture, the challenges that it needs to address, and its current limitations; we also report some experiences from its use.
Concolic testing is a software testing technique combining concrete execution of a program (given specific input, along specific paths) with symbolic execution (generating new test inputs that give better path coverage than random test case generation). Concolic testing has so far been applied, mainly at the level of bytecode or assembly code, to programs written in imperative languages that manipulate primitive data types such as integers and arrays. In this paper, we demonstrate its application to a functional programming language core, a subset of the core language of Erlang, which supports pattern matching, structured recursive data types such as lists, recursion, and higher-order functions. Moreover, we present CutEr, a tool implementing this testing technique. We describe CutEr's architecture, the challenges that such a tool needs to address, and its current limitations, and we report some experiences from its use.
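The interplay between the two execution modes is easiest to see on a classic imperative example (ours, not from the papers):

    #include <stdlib.h>

    /* Function under test: the abort() is hard to reach randomly. */
    void test_me(int x, int y) {
        if (x > 0) {
            if (2 * x == y)
                abort();        /* the "bug" */
        }
    }

    /* A concolic run on the concrete input (x=1, y=0) follows the path
     *   x > 0 : true,   2*x == y : false
     * while recording the symbolic path condition {x > 0, 2x != y}.
     * Negating the last conjunct gives {x > 0, 2x == y}; a constraint
     * solver produces, e.g., (x=1, y=2), and re-executing on that
     * input drives the program into the abort(). */
    int main(void) {
        test_me(1, 0);          /* initial, e.g. random, input */
        test_me(1, 2);          /* solver-derived input: hits the bug */
        return 0;
    }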
Writing code that manipulates bit streams is a painful and error-prone programming task, often performed via bit twiddling techniques such as explicit bit shifts and bit masks in programmer-allocated buffers. Still, this kind of programming is necessary in many application areas ranging from decoding streaming media files to implementing network protocols. In this paper we employ high-level constructs from declarative programming, such as pattern matching at the bit level and bit stream comprehensions, and show how a variety of bit stream programming applications can be written in a succinct, less error-prone, and totally memory-safe manner. We also describe how these constructs can be implemented efficiently. The resulting performance is superior to that of other (purely) functional languages and competitive with that of low-level languages such as C.
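For contrast, this is what the low-level style looks like in C when extracting the leading fields of an IPv4-style header from a raw buffer (our example); in the declarative style the paper advocates, the same decoding is a single bit-level pattern match.

    #include <stdint.h>

    typedef struct { unsigned version, ihl, total_length, ttl; } header;

    /* Explicit shifts and masks over a programmer-managed buffer: every
     * magic offset and mask is a chance for an off-by-one error. */
    void parse(const uint8_t *buf, header *h) {
        h->version      = (buf[0] >> 4) & 0x0F;             /* high nibble */
        h->ihl          =  buf[0]       & 0x0F;             /* low nibble  */
        h->total_length = ((unsigned)buf[2] << 8) | buf[3]; /* big-endian  */
        h->ttl          =  buf[8];
    }

    int main(void) {
        const uint8_t buf[20] = { 0x45, 0, 0x00, 0x54, 0, 0, 0, 0, 64 };
        header h;
        parse(buf, &h);
        return (h.version == 4 && h.ihl == 5) ? 0 : 1;
    }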
Stateless Model Checking (SMC) is a verification technique for concurrent programs that checks for safety violations by exploring all possible thread schedulings. It is highly effective when coupled with Dynamic Partial Order Reduction (DPOR), which introduces an equivalence on schedulings and needs to explore only one scheduling in each equivalence class. Even with DPOR, SMC often spends unnecessary effort in exploring loop iterations that are pure, i.e., have no effect on the program state. We present techniques for making SMC with DPOR more effective on programs with pure loop iterations. The first is a static program analysis to detect loop purity and an associated program transformation, called Partial Loop Purity Elimination, that inserts assume statements to block pure loop iterations. Subsequently, some of these assumes are turned into await statements that completely remove many assume-blocked executions. Finally, we present an extension of the standard DPOR equivalence, obtained by weakening the conflict relation between events. All these techniques are incorporated into a new DPOR algorithm, Optimal-DPOR-Await, which can handle both awaits and the weaker conflict relation, is optimal in the sense that it explores exactly one execution in each equivalence class, and can also diagnose livelocks. Our implementation in Nidhugg shows that these techniques can significantly speed up the analysis of concurrent programs that are currently challenging for SMC tools, both for exploring their complete set of interleavings and for detecting concurrency errors in them.
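The transformations are easiest to see on a spin loop (a minimal C illustration of ours; assume and await are the statements the paper introduces, shown here in comments):

    #include <pthread.h>

    volatile int flag = 0;    /* shared */

    void *waiter(void *a) {
        while (flag == 0)     /* pure loop: iterations change no state */
            ;
        /* Partial Loop Purity Elimination replaces the loop with
         *     assume(flag != 0);
         * blocking executions in which the condition does not hold, and
         * the await extension turns it into
         *     await(flag != 0);
         * which removes many assume-blocked executions altogether. */
        return NULL;
    }

    void *setter(void *a) { flag = 1; return NULL; }

    int main(void) {
        pthread_t w, s;
        pthread_create(&w, NULL, waiter, NULL);
        pthread_create(&s, NULL, setter, NULL);
        pthread_join(w, NULL);
        pthread_join(s, NULL);
        return 0;
    }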
We introduce HiPErJiT, a profile-driven Just-in-Time compiler for the BEAM ecosystem based on HiPE, the High Performance Erlang compiler. HiPErJiT uses runtime profiling to decide which modules to compile to native code and which of their functions to inline and type-specialize. HiPErJiT is integrated with the runtime system of Erlang/OTP and preserves aspects of Erlang's compilation which are crucial for its applications: most notably, tail-call optimization and hot code loading at the module level. We present HiPErJiT's architecture, describe the optimizations that it performs, and compare its performance with that of BEAM, HiPE, and Pyrlang. HiPErJiT offers performance that is about two times better than BEAM's and almost as good as HiPE's, despite the profiling and compilation overhead that it has to pay compared to an ahead-of-time native code compiler. But there also exist programs for which HiPErJiT's profile-driven optimizations allow it to surpass HiPE's performance.
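The profile-driven trigger can be sketched as a per-module call counter (a schematic in C with a made-up threshold; HiPErJiT itself profiles Erlang modules inside the Erlang/OTP runtime):

    #include <stdbool.h>

    enum { HOT_THRESHOLD = 100000 };    /* illustrative, not HiPErJiT's */

    typedef struct {
        const char   *name;
        unsigned long calls;
        bool          native;   /* already compiled to native code? */
    } module_profile;

    /* Called on entry to the module; once the module is hot, hand it
     * to the native compiler, which would also decide which of its
     * functions to inline and type-specialize. */
    void on_call(module_profile *m) {
        if (!m->native && ++m->calls >= HOT_THRESHOLD)
            m->native = true;   /* compile_to_native(m) would run here */
    }

    int main(void) {
        module_profile m = { "lists", 0, false };
        for (int i = 0; i < 200000; i++)
            on_call(&m);
        return m.native ? 0 : 1;
    }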