Intelligent Data Management via Machine Learning: From Storage Hierarchy to Information Hierarchy
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Division of Scientific Computing. ORCID iD: 0000-0001-9983-3755
2025 (English) Doctoral thesis, comprehensive summary (Other academic)
Description
Abstract [en]

The rise of Big Data has catalyzed numerous advanced data-driven methods, while simultaneously posing significant challenges in data management. This thesis aims to address two fundamental aspects of data management, namely storage management and information extraction, by leveraging machine learning (ML) techniques. In particular, we focus on two research topics: Storage Hierarchy, which explores hierarchical storage management (HSM) in multi-tiered storage systems; and Information Hierarchy, which targets the extraction of intrinsic data hierarchies from raw data.

We begin by introducing the key stages of the data life cycle and their associated challenges in the Big Data era, alongside a review of machine learning foundations and their potential for addressing these challenges. Subsequently, we present the Storage Hierarchy project, which is detailed across Papers I, II, and III. In these works, we develop automated, adaptive, and efficient HSM approaches using reinforcement learning (RL). In Paper I, we introduce the HSM-RL framework for managing file-level data migration in a hierarchical storage system (HSS). It leverages RL to optimize file placement and temporal difference learning for real-time adaptability. Paper II extends this work to complex real-world scenarios using scientific datasets, exploring the framework’s flexibility, scalability, and effectiveness. Moving to finer granularity, Paper III presents ReStore, an RL-based page-level data migration approach that incorporates the unique characteristics of modern Solid-State Drives (SSDs), such as read/write asymmetry and parallelism.

The Information Hierarchy project focuses on autonomous extraction of implicit data hierarchies from raw, unlabeled data. In Paper IV, we propose InfoHier, a framework that integrates self-supervised learning (SSL) with hierarchical clustering (HC) to uncover latent data representations and hierarchical structures. By jointly training SSL and HC through a dynamic balancing loss, InfoHier ensures that the HC results align with the intrinsic data hierarchy. This method facilitates meaningful and structured information extraction and retrieval.

Collectively, the Storage Hierarchy and Information Hierarchy projects advance intelligent data management by enabling efficient storage solutions and autonomous information extraction. These contributions lay the foundation for next-generation data management systems, addressing the challenges of Big Data with adaptive and scalable solutions.
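
As a rough illustration of the temporal difference learning mentioned in the abstract, the sketch below applies a generic tabular TD(0) update to one value estimate per storage tier. The tier names, the reward signal (a negative response time), the learning rate `alpha`, and the discount factor `gamma` are illustrative assumptions and are not taken from the thesis.

```python
def td0_update(values, state, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one tabular TD(0) update in place and return the TD error."""
    td_error = reward + gamma * values[next_state] - values[state]
    values[state] += alpha * td_error
    return td_error


# Hypothetical value estimates for two tiers (not from the thesis).
values = {"fast": 0.0, "slow": 0.0}

# Assumed reward: negative response time observed after a transition.
error = td0_update(values, "slow", reward=-2.0, next_state="fast")
print(values, error)
```

Because each update uses only the latest observed transition, estimates of this kind can be refreshed online as access patterns change, which is the property the abstract refers to as real-time adaptability.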

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2025. p. 93
Series
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology, ISSN 1651-6214 ; 2483
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:uu:diva-544718; ISBN: 978-91-513-2332-9 (print); OAI: oai:DiVA.org:uu-544718; DiVA, id: diva2:1919253
Public defence
2025-02-07, Häggsalen, Ångströmlaboratoriet, Lägerhyddsvägen 1, Uppsala, 10:15 (English)
Opponent
Supervisors
Available from: 2025-01-16 Created: 2024-12-08 Last updated: 2025-01-16
List of papers
1. Efficient Hierarchical Storage Management Empowered by Reinforcement Learning
2023 (English) In: IEEE Transactions on Knowledge and Data Engineering, ISSN 1041-4347, E-ISSN 1558-2191, Vol. 35, p. 5780-5793. Article in journal (Refereed) Published
Abstract [en]

With the rapid development of big data and cloud computing, data management has become increasingly challenging. Over the years, a number of frameworks for data management have become available. Most of them are highly efficient, but ultimately create data silos. It becomes difficult to move and work coherently with data as new requirements emerge. A possible solution is to use an intelligent hierarchical (multi-tier) storage system (HSS). An HSS is a meta solution that consists of different storage frameworks organized as a jointly constructed storage pool. A built-in data migration policy that determines the optimal placement of the datasets in the hierarchy is essential. Making placement decisions is a non-trivial task, since they should be made according to the characteristics of the dataset, the tier status in a hierarchy, and access patterns. This paper presents an open-source hierarchical storage framework with a dynamic migration policy based on reinforcement learning (RL). We present a mathematical model, a software architecture, and implementations based on both simulations and a live cloud-based environment. We compare the proposed RL-based strategy to a baseline of three rule-based policies, showing that the RL-based policy achieves significantly higher efficiency and optimal data distribution in different scenarios.
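
To make the idea of a rule-based migration policy concrete, here is a minimal sketch of a threshold rule that promotes frequently accessed files and demotes rarely accessed ones. The abstract does not say which three rule-based policies were used as baselines, so this rule, the `FileStat` type, and the `hot`/`cold` thresholds are hypothetical and shown only for illustration.

```python
from dataclasses import dataclass


@dataclass
class FileStat:
    name: str
    tier: int          # 0 is the fastest tier (assumed convention)
    accesses: int      # accesses counted in the current window


def threshold_policy(files, hot=10, cold=2, n_tiers=3):
    """Return {file name: new tier} for the files that should move."""
    moves = {}
    for f in files:
        if f.accesses >= hot and f.tier > 0:
            moves[f.name] = f.tier - 1          # promote one tier up
        elif f.accesses <= cold and f.tier < n_tiers - 1:
            moves[f.name] = f.tier + 1          # demote one tier down
    return moves


# Hypothetical file statistics.
files = [
    FileStat("a.dat", tier=2, accesses=25),
    FileStat("b.dat", tier=0, accesses=1),
    FileStat("c.dat", tier=1, accesses=5),
]
moves = threshold_policy(files)
print(moves)
```

A fixed rule like this looks only at access counts. The abstract's point is that placement should also depend on dataset characteristics and tier status, which is what motivates a learned policy.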

Place, publisher, year, edition, pages
IEEE, 2023
Keywords
Data Management, Cloud Computing, Hierarchical Storage System, Data Migration, Reinforcement Learning
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:uu:diva-490399 (URN); 10.1109/tkde.2022.3176753 (DOI); 000981944600024 ()
Projects
eSSENCE - An eScience Collaboration
Funder
Swedish Foundation for Strategic Research, BD15-0008
Available from: 2022-12-09 Created: 2022-12-09 Last updated: 2024-12-16 Bibliographically approved
2. Data management of scientific applications in a reinforcement learning-based hierarchical storage system
2024 (English) In: Expert systems with applications, ISSN 0957-4174, E-ISSN 1873-6793, Vol. 237, article id 121443. Article in journal (Refereed) Published
Abstract [en]

In many areas of data-driven science, large datasets are generated where the individual data objects are images, matrices, or otherwise have a clear structure. However, these objects can be information-sparse, and a challenge is to efficiently find and work with the most interesting data as early as possible in an analysis pipeline. We have recently proposed a new model for big data management where the internal structure and information of the data are associated with each data object (as opposed to simple metadata). There is then an opportunity for comprehensive data management solutions to account for data-specific internal structure as well as access patterns. In this article, we explore this idea together with our recently proposed hierarchical storage management framework that uses reinforcement learning (RL) for autonomous and dynamic data placement in different tiers in a storage hierarchy. Our case study is based on four scientific datasets: Protein translocation microscopy images, Airfoil angle of attack meshes, 1000 Genomes sequences, and Phenotypic screening images. The presented results highlight that our framework is optimal and can quickly adapt to new data access requirements. Overall, it reduces the data processing time, and the proposed autonomous data placement is superior to any static or semi-static data placement policy.

Place, publisher, year, edition, pages
Elsevier, 2024
Keywords
Data management, Scientific application, Hierarchical storage system, Reinforcement learning, Large scientific datasets
National Category
Computer Sciences Computational Mathematics
Research subject
Computer Science with specialization in Database Technology; Computer Science
Identifiers
urn:nbn:se:uu:diva-513854 (URN); 10.1016/j.eswa.2023.121443 (DOI); 001081909200001 ()
Funder
Swedish Foundation for Strategic Research, BD15-0008
Swedish National Infrastructure for Computing (SNIC), SNIC 2022/22-835
eSSENCE - An eScience Collaboration
Available from: 2023-10-12 Created: 2023-10-12 Last updated: 2024-12-08 Bibliographically approved
3. Restore: A Reinforcement Learning Approach For Data Migration In Multi-Tiered Storage
(English)Manuscript (preprint) (Other academic)
Abstract [en]

With the development of storage technologies, a wide variety of storage devices with differing performance characteristics and cost profiles have emerged. As a result, data systems are increasingly adopting multi-tiered storage solutions. A primary challenge in multi-tiered storage systems is data placement, as data must be dynamically stored and migrated across different storage tiers to optimize overall performance. Effective data migration policies should be able to adapt to workload variations while also considering the unique characteristics of underlying devices (such as PCIe/SATA SSD, or HDD), notably their read/write asymmetry and parallelism. In this paper, we introduce ReStore, a reinforcement learning (RL) approach for data migration in multi-tiered storage systems. ReStore leverages RL to capture both workload patterns and device-specific characteristics, including access frequency and recency, as well as device read/write asymmetry and parallelism. Each storage tier uses a different device and is associated with an RL agent that dynamically updates its parameters using temporal difference learning, ensuring continuous adaptability to changing workloads and system states. We experimentally show that ReStore achieves up to 2.2× lower runtime and up to 10× fewer migrations using industry-grade benchmarks, such as TPC-C/E and YCSB, real-life traces, such as Google Thesios, and a wide variety of synthetic workloads.
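
The workload and device signals named in the abstract (access frequency, recency, and read/write asymmetry) can be made concrete with a small sketch. The trace format, the function names, and the cost values below are assumptions for illustration only; they do not reproduce ReStore's actual state representation or cost model.

```python
from collections import defaultdict


def page_features(trace, now):
    """Compute (frequency, recency) per page from (time, page, op) records."""
    count = defaultdict(int)
    last_seen = {}
    for t, page, _op in trace:
        count[page] += 1
        last_seen[page] = t
    return {p: (count[p], now - last_seen[p]) for p in count}


def access_cost(trace, page, read_cost, write_cost):
    """Total cost of a page's accesses on a device with asymmetric reads/writes."""
    return sum(
        read_cost if op == "r" else write_cost
        for _t, p, op in trace
        if p == page
    )


# Hypothetical trace: (timestamp, page id, operation).
trace = [(1, "p1", "r"), (2, "p2", "w"), (3, "p1", "w"), (5, "p1", "r")]
features = page_features(trace, now=6)
# Assumed device on which a write costs three times as much as a read.
cost_p1 = access_cost(trace, "p1", read_cost=1.0, write_cost=3.0)
print(features, cost_p1)
```

Under asymmetric costs, the same page can be cheap to serve on one device and expensive on another depending on its read/write mix, which is why a migration policy benefits from seeing device characteristics alongside frequency and recency.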

National Category
Computer Sciences
Research subject
Computer Science with specialization in Database Technology; Computer Science
Identifiers
urn:nbn:se:uu:diva-544716 (URN)
Available from: 2024-12-08 Created: 2024-12-08 Last updated: 2024-12-18 Bibliographically approved
4. InfoHier: Hierarchical Information Extraction via Encoding and Embedding
(English)Manuscript (preprint) (Other academic)
Abstract [en]

Analyzing large-scale datasets, especially involving complex and high-dimensional data like images, is particularly challenging. While self-supervised learning (SSL) has proven effective for learning representations from unlabeled data, it typically focuses on flat, non-hierarchical structures, missing the multi-level relationships present in many real-world datasets. Hierarchical clustering (HC) can uncover these relationships by organizing data into a tree-like structure, but it often relies on rigid similarity metrics that struggle to capture the complexity of diverse data types. To address these limitations, we envision InfoHier, a framework that combines SSL with HC to jointly learn robust latent representations and hierarchical structures. This approach leverages SSL to provide adaptive representations, enhancing HC's ability to capture complex patterns. Simultaneously, it integrates HC loss to refine SSL training, resulting in representations that are more attuned to the underlying information hierarchy. InfoHier has the potential to improve the expressiveness and performance of both clustering and representation learning, offering significant benefits for data analysis, management, and information retrieval.
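
For readers unfamiliar with hierarchical clustering, the sketch below builds a merge tree over a few embedding vectors using plain single-linkage agglomerative clustering with a fixed Euclidean distance. It is a textbook baseline, not InfoHier itself: the embeddings are hypothetical stand-ins for encoder outputs, and the fixed metric is exactly the kind of rigid similarity measure that the abstract proposes to replace with jointly learned representations.

```python
import math
from itertools import combinations


def single_linkage(points):
    """Agglomerative clustering; returns the list of merges in order."""
    clusters = {i: [i] for i in range(len(points))}
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        # Pick the pair of clusters whose closest members are nearest.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: min(
                math.dist(points[i], points[j])
                for i in clusters[pair[0]]
                for j in clusters[pair[1]]
            ),
        )
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, next_id))   # (left id, right id, new cluster id)
        next_id += 1
    return merges


# Hypothetical 2-D embeddings standing in for encoder outputs.
embeddings = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.2)]
merges = single_linkage(embeddings)
print(merges)
```

Each tuple records which two clusters were joined and the id of the resulting cluster, so the list as a whole encodes the tree-like structure the abstract describes.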

Keywords
Hierarchical Representation, Hierarchical Clustering, Self-Supervised Learning, Joint Learning, Information Retrieval
National Category
Computer Sciences Information Systems
Research subject
Computer Science
Identifiers
urn:nbn:se:uu:diva-544717 (URN)
Available from: 2024-12-08 Created: 2024-12-08 Last updated: 2024-12-18 Bibliographically approved

Open Access in DiVA

UUThesis_T-Zhang-2025 (1633 kB), 67 downloads
File information
File name: FULLTEXT01.pdf; File size: 1633 kB; Checksum (SHA-512):
45b60883d204560ff4d62bf1cbc105ba7d50d6e0e0ad20ac79e3c67c6c5390cae1fad5059d9b3072f93ecadbb9ae1383740685552f1ed6ca192e7fa253c2c548
Type: fulltext; Mimetype: application/pdf

Authority records

Zhang, Tianru
