We present an integrated solution for storing and querying scientific data and metadata, using the MATLAB environment as the client front-end and our prototype DBMS on the server. We use RDF for the experiment metadata and numeric arrays for the data itself. Our extension of SPARQL supports array operations and extensibility with foreign functions.
Multidimensional numeric arrays are often serialized to binary formats for efficient storage and processing. These representations can be stored as binary objects in existing relational database management systems. To minimize data transfer overhead when arrays are large and only parts of them are accessed, it is favorable to split the arrays into separately stored chunks. We process queries expressed in an extension of the graph query language SPARQL that treats arrays as node values and provides syntax for specifying array projection, element selection, and range selection operations as part of a query. When a query selects parts of one or more arrays, only the relevant chunks of each array should be retrieved from the relational database. The retrieval is made by automatically generated SQL queries. We evaluate different strategies for partitioning the array content, and for generating the SQL queries that retrieve it on demand. For this purpose, we present a mini-benchmark featuring a number of typical array access patterns, and we draw actionable conclusions from the performance numbers.
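As an illustration of the chunked retrieval idea in this abstract, the following sketch assumes a hypothetical regular chunking, where an N-dimensional array is split into fixed-size chunks stored as binary objects keyed by a linear chunk id; all names and the table layout are our illustrative assumptions, not the paper's actual schema.

```python
# Sketch: mapping an array range selection to chunk ids and an SQL query,
# under an assumed regular (grid) chunking of an N-dimensional array.
import itertools


def chunks_for_range(lo, hi, chunk_shape):
    """Return the chunk coordinates covering the inclusive element
    range [lo, hi] in each dimension, for chunks of shape chunk_shape."""
    ranges = [range(l // c, h // c + 1)
              for l, h, c in zip(lo, hi, chunk_shape)]
    return list(itertools.product(*ranges))


def sql_for_chunks(array_id, coords, grid_shape):
    """Generate one SQL query retrieving all needed chunks by their
    row-major linear id (hypothetical 'chunks' table)."""
    ids = []
    for coord in coords:
        cid = 0
        for c, g in zip(coord, grid_shape):
            cid = cid * g + c
        ids.append(cid)
    id_list = ", ".join(str(i) for i in sorted(ids))
    return (f"SELECT chunk_id, data FROM chunks "
            f"WHERE array_id = {array_id} AND chunk_id IN ({id_list})")
```

For example, selecting rows 0..99 and columns 0..5 of a 100x20 array chunked into 50x10 blocks touches only the two left-column chunks, so a single generated query fetches exactly those.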
This article describes Syntel, a knowledge representation language used in building large-scale expert systems for financial risk assessment. Syntel is an outgrowth of rule-based systems such as MYCIN and network-based systems such as Prospector. Unlike typical rule- or frame-based expert system shells, however, Syntel is a data-driven, purely functional language providing probabilistic inference plus many kinds of functionality associated with spreadsheets and database systems.
Queries over scientific data often imply expensive analyses requiring the large computational resources available in Grids. We are developing a customizable query processor built on top of an established Grid infrastructure, the NorduGrid middleware, and have implemented a framework for managing long-running queries in a Grid environment. With the framework, the user does not specify the detailed job and parallelization descriptions required by NorduGrid. Instead, the user specifies queries in terms of an application-oriented schema describing the contents of files managed by the Grid and accessed through wrappers. When the system receives a query, it generates NorduGrid job descriptions that are submitted to NorduGrid for execution. The framework considers the limitations of NorduGrid. It includes a submission mechanism, a job babysitter, and a generic data exchange mechanism. The submission mechanism generates a number of jobs for parallel execution of a user query over wrapped data files. The task of the babysitter is to submit generated jobs to NorduGrid for execution, to monitor their execution status, and to download the results. The generic exchange mechanism provides a way to exchange objects through files between Grid execution nodes and user applications.
Scientific experiments produce large volumes of data represented as complex objects that describe independent events such as particle collisions. Scientific analyses can be expressed as queries selecting objects that satisfy complex local conditions over the properties of each object. The conditions include joins, aggregate functions, and numerical computations. Traditional query processing, where data is loaded into a database, does not perform well, since it takes time and space to load and index the data. Therefore, we developed SQISLE to efficiently process large queries selecting complex objects from sources in a single pass. Our contributions include runtime query optimization strategies, which during query execution collect runtime query statistics, reoptimize the query using the collected statistics, and dynamically switch optimization strategies. Furthermore, performance is improved by query rewrites, temporary view materializations, and compile-time evaluation of query fragments. We demonstrate that queries in SQISLE perform close to hard-coded C++ implementations of the same analyses.
Transportation-related problems, like road congestion, parking, and pollution, are increasing in most cities. In order to reduce traffic, recent work has proposed methods for vehicle sharing, for example sharing cabs by grouping "close by" cab requests, thus minimizing transportation cost and utilizing cab space. However, the methods proposed so far do not scale to large data volumes, which is necessary to facilitate large-scale collective transportation systems, e.g., ride-sharing systems for large cities. This paper presents highly scalable "trip grouping" algorithms that generalize previous techniques and support input rates that can be orders of magnitude larger. The following three contributions make the grouping algorithms scalable. First, the basic grouping algorithm is expressed as a continuous stream query in a data stream management system to allow for very large flows of requests. Second, following the divide-and-conquer paradigm, four space-partitioning policies for dividing the input data stream into sub-streams are developed and implemented using continuous stream queries. Third, using the partitioning policies, parallel implementations of the grouping algorithm in a parallel computing environment are described. Extensive experimental results show that the parallel implementation using simple adaptive partitioning methods can achieve speed-ups of several orders of magnitude without significantly affecting the quality of the grouping.
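The space-partitioning idea in the abstract above can be sketched as follows: each trip request is routed to a sub-stream by the grid cell of its pickup coordinate. The cell layout, field names, and routing function here are our illustrative assumptions, not the paper's actual policies.

```python
# Sketch: routing trip requests into sub-streams by a grid-style
# space partitioning of the pickup coordinates (illustrative only).

def partition_id(x, y, x_splits, y_splits):
    """Map a pickup point to a cell in a grid defined by sorted
    split lines along each axis; cells are numbered row-major."""
    col = sum(1 for s in x_splits if x >= s)
    row = sum(1 for s in y_splits if y >= s)
    return row * (len(x_splits) + 1) + col


def route(requests, x_splits, y_splits):
    """Divide a batch of requests into per-cell sub-streams, each of
    which could then be grouped independently and in parallel."""
    streams = {}
    for req in requests:
        pid = partition_id(req["x"], req["y"], x_splits, y_splits)
        streams.setdefault(pid, []).append(req)
    return streams
```

An adaptive variant would move the split lines as the observed request density changes, which is the kind of simple adaptive partitioning the experiments favor.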
Scientific applications require processing high-volume online streams of numerical data from instruments and simulations. We present an extensible stream database system that allows scalable and flexible continuous queries on such streams. Application-dependent streams and query functions are defined through an object-relational model. Distributed execution plans for continuous queries are described as high-level data flow distribution templates. Using a generic template, we define two partitioning strategies for scalable parallel execution of expensive stream queries: window split and window distribute. Window split provides operators for parallel execution of query functions by reducing the size of stream data units, using application-dependent functions as parameters. By contrast, window distribute provides operators for customized distribution of entire data units without reducing their size. We evaluate these strategies for a typical high-volume scientific stream application and show that window split is favorable when expensive queries are executed on limited resources, while window distribute is better otherwise.
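The contrast between the two strategies can be made concrete with a minimal sketch; the splitter function, node count, and list-based representation of windows are our illustrative assumptions, not the system's actual operators.

```python
# Sketch: window split vs. window distribute over a list of "windows"
# (stream data units), each routed to one of n parallel nodes.

def window_split(window, n, splitter):
    """Partition the *contents* of one window into n smaller units,
    each processed by a different node (reduces per-node data size).
    The application-dependent splitter picks part i of n."""
    return [splitter(window, i, n) for i in range(n)]


def window_distribute(windows, n):
    """Route *entire* windows round-robin to n nodes without
    reducing their size."""
    nodes = [[] for _ in range(n)]
    for i, w in enumerate(windows):
        nodes[i % n].append(w)
    return nodes
```

With a strided splitter such as `lambda w, i, n: w[i::n]`, window split lets every node work on a fraction of each window, whereas window distribute keeps windows whole and balances them across nodes.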
In this paper we describe how Controller Area Network (CAN) frames can be relayed over a wireless Internet connection, enabling remote access to the CAN buses of vehicles for applications in automotive testing. This opens up many new possibilities for automotive diagnostics, monitoring, testing, analysis and verification, which we believe can significantly reduce the time required for the testing and verification phases of automotive development. A CAN-over-IP tunneling protocol is described and the design and implementation of a generic system for remote access to the CAN bus of vehicles is presented. Examples of applications of the technology are given and implications in terms of new possibilities and challenges in automotive engineering are discussed.
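To illustrate the kind of framing a CAN-over-IP tunnel needs, the sketch below packs a CAN frame into a byte string suitable for a UDP payload. The field layout (32-bit identifier, 8-bit data length code, up to 8 data bytes) is our illustration of a plausible encoding, not the tunneling protocol defined in the paper.

```python
# Sketch: packing/unpacking a CAN frame for transport over IP,
# using a hypothetical network-byte-order layout: id (4 B), DLC (1 B), data.
import struct


def pack_can_frame(can_id, data):
    """Encode one CAN frame; classic CAN carries at most 8 data bytes."""
    if len(data) > 8:
        raise ValueError("CAN data field is at most 8 bytes")
    return struct.pack("!IB", can_id, len(data)) + data


def unpack_can_frame(buf):
    """Decode a frame produced by pack_can_frame."""
    can_id, dlc = struct.unpack("!IB", buf[:5])
    return can_id, buf[5:5 + dlc]
```

A relay would read frames from the vehicle's CAN interface, encode them this way, and forward the datagrams to a remote endpoint that decodes and replays them.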
Data integration on a large scale poses complexity and
The underlying hypothesis of the case study described in this article is that the development, implementation, operation, and maintenance of large, complex, data-intensive applications such as computer integrated manufacturing can be simplified through the use of an object-oriented DBMS. The objective of the case study is to verify this hypothesis.
The approach of the case study is to prototype, using an object-oriented DBMS, selected representative components of a computer integrated manufacturing system that had previously been developed on top of a relational DBMS.
The results of the study illustrate that the object-oriented prototype has a superior schema, is capable of providing convenient access to information, and is easier to extend and maintain.