The DBML 2026 Workshop, held in conjunction with ICDE 2026 in Montréal, Canada, explores the growing synergy between databases and machine learning.
Advances in data management techniques have become essential for building robust, scalable ML systems. Applications range from data preparation and cleaning to feature engineering and managing the ML lifecycle. The recent rise of LLMs and RAG has only intensified demand for high-performance data infrastructure. Modern AI systems increasingly rely on vector databases and scalable model serving. Multimodal AI adds further requirements for storing and querying images, audio, and video.
In the opposite direction, ML techniques are now incorporated into core components of database systems, including query optimization, indexing, storage layout, and self-tuning. Long-standing challenges like cardinality estimation, operator and plan selection, and resource management, traditionally handled via human knowledge or heuristics, increasingly benefit from learned models.
DBML 2026 brings together researchers and practitioners working at this intersection. We welcome work combining DB and ML strengths, ranging from foundational techniques and system design to practical applications and real-world deployments, including ML for scientific and data-intensive domains.
Information about previous editions can be found at DBML 2025, DBML 2024, DBML 2023, and DBML 2022.
For questions regarding the workshop, please contact: dbml26@googlegroups.com.
Amir Shaikhha, Associate Professor (Reader), University of Edinburgh
Optimizing Data Science by Leveraging Structure
Modern data science pipelines employ a variety of workloads going beyond relational query processing, including graph processing algorithms and tensor processing. This results in the use of loosely coupled data processing frameworks that move the data across the analytics pipeline, leading to unnecessary resource and energy consumption. This talk presents a compilation-based approach to move the computation closer to the data. This is achieved by designing domain-specific languages that leverage the structure of data with algebraic optimizations. We show that our proposed approach significantly outperforms state-of-the-art frameworks for a wide range of applications, including database query processing, tensor processing, and quantum simulation.
Amir Shaikhha is an Associate Professor (Reader) in the School of Informatics at the University of Edinburgh. His research focuses on the design and implementation of data analytics systems using techniques from the databases, programming languages, compilers, and machine learning communities. He was a Departmental Lecturer at the University of Oxford (2019-2020) before starting as an Assistant Professor (Lecturer) at the University of Edinburgh (2020-2024). He earned his Ph.D. from EPFL in 2018, for which he was awarded a Google Ph.D. Fellowship in structured data analysis, as well as a Ph.D. thesis distinction award. He has won the Best Paper Award at GPCE 2017, the Most Reproducible Paper Award at SIGMOD 2017, the Most Influential Paper Award at GPCE 2024, the Google Research Scholar Award 2025, and the Dahl-Nygaard Junior Prize 2025. He has (co-)chaired the program committees of GPCE, DBPL, Scala, Sparse, and DRAGSTERS.
Oana Balmau, Assistant Professor, McGill University
Data pre-processing challenges in ML pipelines
Current machine learning frameworks rely on data loaders to preprocess data before feeding accelerators. Inefficient preprocessing pipelines can leave GPUs idle for long periods—up to 76% in some cases—significantly slowing training. A key issue is variability in preprocessing time across samples: existing data loaders treat all samples uniformly, so a single slow sample can stall an entire batch. In this talk, I present MinatoLoader, a general-purpose data loader implemented in PyTorch that improves GPU utilization by prioritizing fast-to-process samples while handling slower ones in parallel, enabling more efficient training on single-server, multi-GPU systems.
Oana Balmau is an Assistant Professor in the School of Computer Science at McGill University, where she leads the DISCS Lab. She is part of MLCommons, where she co-founded MLPerf Storage, an open-source benchmark for storage on ML workloads. Her research focuses on storage systems and data management, with an emphasis on ML, data science, and edge computing workloads. She completed her PhD in Computer Science at the University of Sydney, advised by Prof. Willy Zwaenepoel. Before her PhD, Oana earned her Bachelor's and Master's degrees in Computer Science from EPFL.
Amine Mhedhbi, Assistant Professor, École Polytechnique de Montréal
Semantic Query Processing over Relations
Language models are making it possible to ask richer questions over relational data, but doing so efficiently remains difficult. Join-heavy queries, often over networked data, can produce large intermediate results that must be serialized into prompts and then fed into language models. This talk presents FFX (Fast Factorized eXecution), a query engine that combines factorized and vectorized execution to address this bottleneck.
The talk focuses on how FFX changes semantic query processing by keeping join intermediates compact, enabling semantic operators to serialize factorized intermediates and predict over their implied Cartesian products. Operators then produce predictions as flat output tuples and bypass having to first flatten the input relation. Empirically, and somewhat surprisingly, our evaluation shows that even non-reasoning models can often perform this Cartesian expansion accurately while still carrying out the semantic task. In our evaluation, FFX achieves an order-of-magnitude reduction in input tokens while maintaining the same accuracy and degrades more gracefully as context size increases.
Amine Mhedhbi is an Assistant Professor at École Polytechnique de Montréal. His interests are in building and analyzing analytical and AI-driven data system architectures. His work includes tackling performance considerations, debuggability, interface design, and data applications. Amine received his Ph.D. in 2023 from the University of Waterloo. His research has been awarded a VLDB best paper award, a Microsoft Ph.D. fellowship award, and the University of Waterloo's Computer Science distinguished dissertation award.
| Time | Activity | Title | Presenter |
|---|---|---|---|
| 10:00–10:10 | Greetings and Introduction | ||
| 10:10–10:55 | Invited Talk | Optimizing Data Science by Leveraging Structure | Amir Shaikhha |
| 10:55–11:15 | Accepted Paper Talk | What Makes a Clustering Set Interesting? | Georg Stefan Schlake |
| 11:15–12:00 | Invited Talk | Data pre-processing challenges in ML pipelines | Oana Balmau |
| 12:00–13:30 | Lunch Break | ||
| 13:30–14:15 | Invited Talk | Semantic Query Processing over Relations | Amine Mhedhbi |
| 14:15–14:35 | Accepted Paper Talk | MatSQL: Accelerating Text-to-SQL via Database Schema Materialization | Kyong-Shik Lee |
| 14:35–14:55 | Accepted Paper Talk | MACE: A Mamba-based Database-Agnostic Cost Estimator | Pohsun An |
| 14:55–15:00 | Wrap-up | ||
All deadlines are 11:59 PM AoE.
Deadlines have been extended!
| Submission deadline: | February 14th, 2026 |
| Author notification: | April 5th, 2026 |
| Camera-ready version: | April 19th, 2026 |
| Workshop day: | May 8th, 2026 |
Submissions should be made electronically via the submission site. Papers must be prepared in accordance with the official IEEE conference templates. Submitted papers must not exceed 6 pages including references. No appendix is allowed. Only electronic submissions in PDF format will be accepted. Submissions will be reviewed in a single-blind manner.
The current program committee members are tentative.