DBML 2026 - 5th International Workshop on Databases and Machine Learning

in conjunction with ICDE 2026 — May 8th, 2026

About

The DBML 2026 Workshop, held in conjunction with ICDE 2026 in Montréal, Canada, explores the growing synergy between databases and machine learning.

Advances in data management techniques have become essential for building robust, scalable ML systems. Applications range from data preparation and cleaning to feature engineering and managing the ML lifecycle. The recent rise of LLMs and RAG has only intensified demand for high-performance data infrastructure. Modern AI systems increasingly rely on vector databases and scalable model serving. Multimodal AI adds further requirements for storing and querying images, audio, and video.

In the opposite direction, ML techniques are now incorporated as core components of database systems: query optimization, indexing, storage layout, and self-tuning. Long-standing challenges such as cardinality estimation, operator and plan selection, and resource management, traditionally handled via human knowledge or heuristics, increasingly benefit from learned models.

DBML 2026 brings together researchers and practitioners working at this intersection. We welcome work combining DB and ML strengths, ranging from foundational techniques and system design to practical applications and real-world deployments, including ML for scientific and data-intensive domains.

Information about previous editions can be found at DBML 2025, DBML 2024, DBML 2023, and DBML 2022.

For questions regarding the workshop, please contact: dbml26@googlegroups.com.

Topics of Interest

ML for Data Management and DBMS

  • Learned data discovery, cleaning, and transformation
  • ML-enabled data exploration and discovery in data lakes and lakehouses
  • Learned database design, configuration, and tuning
  • ML for query optimization, indexing, and storage/layout decisions
  • Natural language interfaces for data (querying, exploration, summarization, assistants)
  • Pretrained, foundation, and LLM-based models for data management
  • Representation learning for data cleaning, preprocessing, and integration
  • Benchmarking and evaluation of ML-enhanced data management and DBMS components

Data Management for ML and AI Systems

  • Data collection, preparation, and governance for ML/LLM/RAG applications
  • Data quality, robustness, provenance, and lineage for ML workflows
  • Systems and storage for efficient training, inference, and model serving
  • Vector databases, indexing, and hybrid query processing for embeddings
  • Management of multimodal data (text, images, audio, video, etc.) for AI applications
  • DB-inspired techniques for modeling, storage, and provenance of ML and AI artifacts

Keynote Speakers

Amir Shaikhha

Associate Professor (Reader), University of Edinburgh

Optimizing Data Science by Leveraging Structure

Modern data science pipelines employ a variety of workloads going beyond relational query processing, including graph processing algorithms and tensor processing. This results in the use of loosely coupled data processing frameworks that move the data across the analytics pipeline, leading to unnecessary resource and energy consumption. This talk presents a compilation-based approach to move the computation closer to the data. This is achieved by designing domain-specific languages that leverage the structure of data with algebraic optimizations. We show that our proposed approach significantly outperforms state-of-the-art frameworks for a wide range of applications, including database query processing, tensor processing, and quantum simulation.

Amir Shaikhha is an Associate Professor (Reader) in the School of Informatics at the University of Edinburgh. His research focuses on the design and implementation of data analytics systems using techniques from the databases, programming languages, compilers, and machine learning communities. He was a Departmental Lecturer at the University of Oxford (2019-2020) before starting as an Assistant Professor (Lecturer) at the University of Edinburgh (2020-2024). He earned his Ph.D. from EPFL in 2018, for which he was awarded a Google Ph.D. Fellowship in structured data analysis, as well as a Ph.D. thesis distinction award. He has won the Best Paper Award at GPCE 2017, the Most Reproducible Paper Award at SIGMOD 2017, the Most Influential Paper Award at GPCE 2024, a Google Research Scholar Award in 2025, and the Dahl-Nygaard Junior Prize in 2025. He has (co-)chaired the program committees of GPCE, DBPL, Scala, Sparse, and DRAGSTERS.

Oana Balmau

Assistant Professor, McGill University

Data pre-processing challenges in ML pipelines

Current machine learning frameworks rely on data loaders to preprocess data before feeding accelerators. Inefficient preprocessing pipelines can leave GPUs idle for long periods—up to 76% in some cases—significantly slowing training. A key issue is variability in preprocessing time across samples: existing data loaders treat all samples uniformly, so a single slow sample can stall an entire batch. In this talk, I present MinatoLoader, a general-purpose data loader implemented in PyTorch that improves GPU utilization by prioritizing fast-to-process samples while handling slower ones in parallel, enabling more efficient training on single-server, multi-GPU systems.
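To make the stalling effect described in the abstract concrete, here is a minimal toy model. It is not MinatoLoader's algorithm or API. It assumes that all samples start preprocessing at time 0 on unlimited workers, and the function names, per-sample costs, and step time are hypothetical values chosen for illustration.

```python
from typing import List


def in_order_ready_times(costs: List[float], batch_size: int) -> List[float]:
    """Batches follow dataset order, so each batch waits for its slowest sample."""
    return [
        max(costs[start:start + batch_size])
        for start in range(0, len(costs), batch_size)
    ]


def fast_first_ready_times(costs: List[float], batch_size: int) -> List[float]:
    """Batches are filled with whichever samples finish preprocessing first."""
    finished = sorted(costs)
    return [
        max(finished[start:start + batch_size])
        for start in range(0, len(finished), batch_size)
    ]


def gpu_idle_time(ready_times: List[float], step_time: float) -> float:
    """Total time the accelerator waits for data when consuming batches in order."""
    clock = 0.0
    idle = 0.0
    for ready in ready_times:
        if ready > clock:
            idle += ready - clock
            clock = ready
        clock += step_time
    return idle


# Hypothetical preprocessing costs: one slow sample among seven fast ones.
costs = [1.0, 1.0, 9.0, 1.0, 1.0, 1.0, 1.0, 1.0]

in_order = in_order_ready_times(costs, batch_size=4)
fast_first = fast_first_ready_times(costs, batch_size=4)

print("in-order ready times:  ", in_order)
print("fast-first ready times:", fast_first)
print("idle (in-order):  ", gpu_idle_time(in_order, step_time=4.0))
print("idle (fast-first):", gpu_idle_time(fast_first, step_time=4.0))
```

In this toy setting the slow sample delays the very first batch under in-order batching, whereas filling batches with already-finished samples lets training start early and overlaps part of the wait with computation.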

Oana Balmau is an Assistant Professor in the School of Computer Science at McGill University, where she leads the DISCS Lab. She is part of MLCommons, where she co-founded MLPerf Storage, an open-source benchmark for storage on ML workloads. Her research focuses on storage systems and data management, with an emphasis on ML, data science, and edge computing workloads. She completed her PhD in Computer Science at the University of Sydney, advised by Prof. Willy Zwaenepoel. Before her PhD, Oana earned her Bachelor's and Master's degrees in Computer Science from EPFL.

Amine Mhedhbi

Assistant Professor, École Polytechnique de Montréal

Semantic Query Processing over Relations

Language models are making it possible to ask richer questions over relational data, but doing so efficiently remains difficult. Join-heavy queries, often over networked data, can produce large intermediate results that must be serialized into prompts and then fed into language models. This talk presents FFX (Fast Factorized eXecution), a query engine that combines factorized and vectorized execution to address this bottleneck.

The talk focuses on how FFX changes semantic query processing by keeping join intermediates compact, enabling semantic operators to serialize factorized intermediates and predict over their implied Cartesian products. Operators then produce predictions as flat output tuples and bypass having to first flatten the input relation. Empirically, and somewhat surprisingly, our evaluation shows that even non-reasoning models can often perform this Cartesian expansion accurately while still carrying out the semantic task. In our evaluation, FFX achieves an order-of-magnitude reduction in input tokens while maintaining the same accuracy and degrades more gracefully as context size increases.
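The difference between a flat and a factorized join intermediate can be illustrated with a small sketch. This is not FFX's data structure or interface. The dictionary layout, the identifiers, and the use of value counts as a rough proxy for prompt tokens are all assumptions made for illustration.

```python
from itertools import product
from typing import Dict, List, Tuple

# Hypothetical factorized intermediate: each join key maps to two independent
# lists of matches, standing for their implied Cartesian product.
Factorized = Dict[str, Tuple[List[str], List[str]]]

factorized: Factorized = {
    "u1": (["p1", "p2", "p3"], ["g1", "g2", "g3", "g4"]),
    "u2": (["p4"], ["g5", "g6"]),
}


def flatten(rel: Factorized) -> List[Tuple[str, str, str]]:
    """Expand every key into the flat tuples of its Cartesian product."""
    return [
        (key, left, right)
        for key, (lefts, rights) in rel.items()
        for left, right in product(lefts, rights)
    ]


def flat_value_count(rel: Factorized) -> int:
    """Values serialized if every flat tuple (three fields) is written out."""
    return 3 * len(flatten(rel))


def factorized_value_count(rel: Factorized) -> int:
    """Values serialized if each key and each list entry is written once."""
    return sum(1 + len(lefts) + len(rights) for lefts, rights in rel.values())


print("flat tuples:      ", len(flatten(factorized)))
print("flat values:      ", flat_value_count(factorized))
print("factorized values:", factorized_value_count(factorized))
```

The gap grows with the product of the list lengths per key, which is why keeping the intermediate factorized can shrink what has to be serialized into a prompt.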

Amine Mhedhbi is an assistant professor at École Polytechnique de Montréal. His interests are in building and analyzing analytical and AI-driven data system architectures. His work includes tackling performance considerations, debuggability, interface design, and data applications. Amine received his Ph.D. in 2023 from the University of Waterloo. His research has been awarded a VLDB best paper award, a Microsoft Ph.D. fellowship award, and the University of Waterloo's Computer Science distinguished dissertation award.

Program

Time | Activity | Title | Presenter
10:00–10:10 | Greetings and Introduction | |
10:10–10:55 | Invited Talk | Optimizing Data Science by Leveraging Structure | Amir Shaikhha
10:55–11:15 | Accepted Paper Talk | What Makes a Clustering Set Interesting? | Georg Stefan Schlake
11:15–12:00 | Invited Talk | Data pre-processing challenges in ML pipelines | Oana Balmau
12:00–1:30 | Lunch Break | |
1:30–2:15 | Invited Talk | Semantic Query Processing over Relations | Amine Mhedhbi
2:15–2:35 | Accepted Paper Talk | MatSQL: Accelerating Text-to-SQL via Database Schema Materialization | Kyong-Shik Lee
2:35–2:55 | Accepted Paper Talk | MACE: A Mamba-based Database-Agnostic Cost Estimator | Pohsun An
2:55–3:00 | Wrap-up | |

Important Dates

All deadlines are 11:59 PM AoE.

Deadlines have been extended!

Submission deadline: Feb 14th, 2026
Author notification: April 5th, 2026
Camera-ready version: April 19th, 2026
Workshop day: May 8th, 2026

Submission and Author Guidelines

Submissions should be made electronically via the submission site. Papers must be prepared in accordance with the official IEEE conference templates. Submitted papers must not exceed 6 pages including references. No appendix is allowed. Only electronic submissions in PDF format will be accepted. Submissions will be reviewed in a single-blind manner.

Organisation

  • Fatemeh Nargesian - University of Rochester (Workshop Chair)
  • Guillaume Lachaud - École polytechnique (Workshop Chair)
  • Jiwon Chang - University of Rochester (Publicity Chair)

Program Committee

The program committee listed below is tentative.

  • Amine Mhedhbi - Polytechnique Montréal
  • Andra Ionescu - KTH Royal Institute of Technology
  • Christos Koutras - New York University
  • Gerardo Vitagliano - MIT CSAIL
  • Roee Shraga - WPI
  • Xue Li - CWI