uu.se: Publications from Uppsala University
InfoHier: Hierarchical Information Extraction via Encoding and Embedding
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Division of Scientific Computing. Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computational Science. ORCID iD: 0000-0001-9983-3755
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Division of Scientific Computing.
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Division of Scientific Computing. Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computational Science. Uppsala University, Science for Life Laboratory, SciLifeLab. ORCID iD: 0000-0002-3123-3478
Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Computational Science. Uppsala University, Disciplinary Domain of Science and Technology, Mathematics and Computer Science, Department of Information Technology, Division of Scientific Computing. ORCID iD: 0000-0003-0302-6276
(English) Manuscript (preprint) (Other academic)
Abstract [en]

Analyzing large-scale datasets, especially those involving complex, high-dimensional data such as images, is challenging. While self-supervised learning (SSL) has proven effective for learning representations from unlabeled data, it typically focuses on flat, non-hierarchical structures and misses the multi-level relationships present in many real-world datasets. Hierarchical clustering (HC) can uncover these relationships by organizing data into a tree-like structure, but it often relies on rigid similarity metrics that struggle to capture the complexity of diverse data types. To address these limitations, we envision InfoHier, a framework that combines SSL with HC to jointly learn robust latent representations and hierarchical structures. This approach leverages SSL to provide adaptive representations, enhancing HC's ability to capture complex patterns. Simultaneously, it integrates an HC loss to refine SSL training, resulting in representations that are more attuned to the underlying information hierarchy. InfoHier has the potential to improve the expressiveness and performance of both clustering and representation learning, offering significant benefits for data analysis, management, and information retrieval.
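As a rough illustration of the HC half of this idea, the sketch below builds a merge tree over a handful of embedding vectors. It is not the InfoHier implementation: the embeddings are hard-coded stand-ins for encoder outputs, and the choice of single linkage with Euclidean distance, as well as the function names `euclidean` and `single_linkage`, are assumptions made only for this example.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage(embeddings):
    """Return the merge history of single-linkage agglomerative clustering.

    Each entry is (cluster_a, cluster_b, distance), where clusters are
    tuples of point indices. Linkage and metric are illustrative choices.
    """
    clusters = [(i,) for i in range(len(embeddings))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a, b in combinations(clusters, 2):
            # Single linkage: distance between the closest pair of members.
            d = min(euclidean(embeddings[i], embeddings[j]) for i in a for j in b)
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        clusters = [c for c in clusters if c not in (a, b)]
        clusters.append(tuple(sorted(a + b)))
        merges.append((a, b, d))
    return merges


# Hypothetical 2-D embeddings standing in for the output of an SSL encoder.
embeddings = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.5)]
merges = single_linkage(embeddings)
for a, b, d in merges:
    print(a, b, round(d, 3))
```

The two nearby pairs merge first and the two resulting groups merge last, which is the tree-like structure the abstract refers to. In a joint setup the embeddings would change during training, so the tree would be recomputed as the representation improves.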

Keywords [en]
Hierarchical Representation, Hierarchical Clustering, Self-Supervised Learning, Joint Learning, Information Retrieval
National Category
Computer Sciences; Information Systems
Research subject
Computer Science
Identifiers
URN: urn:nbn:se:uu:diva-544717
OAI: oai:DiVA.org:uu-544717
DiVA, id: diva2:1919250
Available from: 2024-12-08 Created: 2024-12-08 Last updated: 2024-12-18 Bibliographically approved
In thesis
1. Intelligent Data Management via Machine Learning: From Storage Hierarchy to Information Hierarchy
2025 (English) Doctoral thesis, comprehensive summary (Other academic)
Abstract [en]

The rise of Big Data has catalyzed numerous advanced data-driven methods, while simultaneously posing significant challenges in data management. This thesis aims to address two fundamental aspects of data management, namely storage management and information extraction, by leveraging machine learning (ML) techniques. In particular, we focus on two research topics: Storage Hierarchy, which explores hierarchical storage management (HSM) in multi-tiered storage systems; and Information Hierarchy, which targets the extraction of intrinsic data hierarchies from raw data.

We begin by introducing the key stages of the data life cycle and their associated challenges in the Big Data era, alongside a review of machine learning foundations and their potential for addressing these challenges. Subsequently, we present the Storage Hierarchy project, which is detailed across Papers I, II, and III. In these works, we develop automated, adaptive, and efficient HSM approaches using reinforcement learning (RL). In Paper I, we introduce the HSM-RL framework for managing file-level data migration in hierarchical storage systems (HSS). It leverages RL to optimize file placement and temporal difference learning for real-time adaptability. Paper II extends this work to complex real-world scenarios using scientific datasets, exploring the framework’s flexibility, scalability, and effectiveness. Moving to finer granularity, Paper III presents ReStore, an RL-based page-level data migration approach that incorporates the unique characteristics of modern Solid-State Drives (SSDs), such as read/write asymmetry and parallelism.
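For readers unfamiliar with temporal difference learning, the following is a minimal sketch of the textbook TD(0) value update, V(s) ← V(s) + α (r + γ V(s') − V(s)). The abstract does not specify how HSM-RL defines states or rewards, so the tier names, the reward value, and the function name `td0_update` are hypothetical and serve only to make the update concrete.

```python
def td0_update(values, state, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one TD(0) update to the value table and return the TD error.

    values: dict mapping state -> estimated value (missing states count as 0.0)
    alpha:  learning rate; gamma: discount factor (both illustrative defaults)
    """
    target = reward + gamma * values.get(next_state, 0.0)
    td_error = target - values.get(state, 0.0)
    values[state] = values.get(state, 0.0) + alpha * td_error
    return td_error


# Hypothetical states: storage tiers. Hypothetical reward: a negative
# response time, so slower accesses are penalized more heavily.
values = {"fast_tier": 0.0, "slow_tier": 0.0}
err = td0_update(values, "slow_tier", reward=-2.0, next_state="fast_tier")
print(err, values["slow_tier"])
```

Because each update uses only the latest transition, the estimates can be adjusted online as access patterns shift, which is the kind of real-time adaptability the abstract attributes to temporal difference learning.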

The Information Hierarchy project focuses on autonomous extraction of implicit data hierarchies from raw, unlabeled data. In Paper IV, we propose InfoHier, a framework that integrates self-supervised learning (SSL) with hierarchical clustering (HC) to uncover latent data representations and hierarchical structures. By jointly training SSL and HC through a dynamic balancing loss, InfoHier ensures that the HC results align with the intrinsic data hierarchy. This method facilitates meaningful and structured information extraction and retrieval.
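The abstract does not state how the balancing between the SSL and HC terms is computed. One simple possibility, shown here purely as an assumption, is to ramp the weight of the HC term linearly over a warm-up period so that the clustering signal only becomes influential once the representation has started to form. The names `hc_weight`, `joint_loss`, and `warmup_steps` are hypothetical.

```python
def hc_weight(step, warmup_steps):
    """Weight of the HC term: rises linearly from 0 to 1 over warmup_steps.

    This schedule is an illustrative assumption, not the rule used by InfoHier.
    """
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, step / warmup_steps)


def joint_loss(ssl_loss, hc_loss, step, warmup_steps=100):
    """Combine the two loss values into a single training objective."""
    return ssl_loss + hc_weight(step, warmup_steps) * hc_loss


# Example loss values at three points in training (numbers are made up).
for step in (0, 50, 200):
    print(step, joint_loss(ssl_loss=1.0, hc_loss=2.0, step=step))
```

Other balancing rules, for example ones driven by the relative magnitudes of the two losses, would fit the same interface; the point of the sketch is only that the weighting changes during training rather than being fixed.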

Collectively, the Storage Hierarchy and Information Hierarchy projects advance intelligent data management by enabling efficient storage solutions and autonomous information extraction. These contributions lay the foundation for next-generation data management systems, addressing the challenges of Big Data with adaptive and scalable solutions.

Place, publisher, year, edition, pages
Uppsala: Acta Universitatis Upsaliensis, 2025. p. 93
Series
Digital Comprehensive Summaries of Uppsala Dissertations from the Faculty of Science and Technology, ISSN 1651-6214 ; 2483
National Category
Computer Sciences
Research subject
Computer Science
Identifiers
urn:nbn:se:uu:diva-544718 (URN)
978-91-513-2332-9 (ISBN)
Public defence
2025-02-07, Häggsalen, Ångströmlaboratoriet, Lägerhyddsvägen 1, Uppsala, 10:15 (English)
Available from: 2025-01-16 Created: 2024-12-08 Last updated: 2025-01-16

Open Access in DiVA

No full text in DiVA

Authority records

Zhang, Tianru; Ju, Li; Singh, Prashant; Toor, Salman
