publications
* denotes equal contribution
An up-to-date list is available on Google Scholar.
PhD thesis
Thesis
conference papers
2026
- [arXiv] KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta. Gang Liao, Hongsen Qin, Ying Wang, Alicia Golden, Michael Kuchnik, Yavuz Yetim, Jia Jiunn Ang, Chunli Fu, Yihan He, Samuel Hsia, Zewei Jiang, Dianshi Li, Uladzimir Pashkevich, Varna Puvvada, Feng Shi, Matt Steiner, Ruichao Xiao, Nathan Yan, Xiayu Yu, Zhou Fang, Abdul Zainul-Abedin, Ketan Singh, Hongtao Yu, Wenyuan Chi, Barney Huang, Sean Zhang, Noah Weller, Zach Marine, Wyatt Cook, Carole-Jean Wu, and Gaoxiang Liu. arXiv:2512.23236, 2026
Making deep learning recommendation model (DLRM) training and inference fast and efficient is important. However, this presents three key system challenges: model architecture diversity, kernel primitive diversity, and hardware generation and architecture heterogeneity. This paper presents KernelEvolve, an agentic kernel coding framework, to tackle heterogeneity at scale for DLRM. KernelEvolve is designed to take kernel specifications as input and automate the process of kernel generation and optimization for recommendation models across heterogeneous hardware architectures. KernelEvolve does so by operating at multiple programming abstractions, from Triton and CuTe DSL to low-level hardware agnostic languages, spanning the full hardware-software optimization stack. The kernel optimization process is described as graph-based search with a selection policy, universal operator, fitness function, and termination rule, and it dynamically adapts to runtime execution context through retrieval-augmented prompt synthesis. We designed, implemented, and deployed KernelEvolve to optimize a wide variety of production recommendation models across generations of NVIDIA and AMD GPUs, as well as Meta’s AI accelerators. We validate KernelEvolve on the publicly available KernelBench suite, achieving 100% pass rate on all 250 problems across three difficulty levels, and 160 PyTorch ATen operators across three heterogeneous hardware platforms, demonstrating 100% correctness. KernelEvolve reduces development time from weeks to hours and achieves substantial performance improvements over PyTorch baselines across diverse production use cases and for heterogeneous AI systems at scale. Beyond performance efficiency improvements, KernelEvolve significantly mitigates the programmability barrier for new AI hardware by enabling automated kernel generation for in-house developed AI hardware.
2025
- [CIDR] Bullion: A Column Store for Machine Learning. Gang Liao, Ye Liu, Jianjun Chen, and Daniel J. Abadi. In Proceedings of 15th Conference on Innovative Data Systems Research (CIDR), 2025
The past two decades have witnessed columnar storage revolutionizing data warehousing and analytics. However, the rapid growth of machine learning poses new challenges to this domain. This paper presents Bullion, a columnar storage system tailored for machine learning workloads. Bullion addresses the complexities of data compliance, optimizes the encoding of long sequence sparse features, efficiently manages wide-table projections, and introduces feature quantization in storage. By aligning with the evolving requirements of ML applications, Bullion extends columnar storage to various scenarios, from advertising and recommendation systems to the expanding realm of Generative AI. Preliminary experimental results and theoretical analysis demonstrate Bullion’s superior performance in handling the unique demands of machine learning workloads compared to existing columnar storage solutions. Bullion significantly reduces I/O costs for deletion compliance, achieves substantial storage savings with its optimized encoding scheme for sparse features, and drastically improves metadata parsing speed for wide-table projections. These advancements position Bullion as a critical component in the future of machine learning infrastructure, enabling organizations to efficiently manage and process the massive volumes of data required for training and inference in modern AI applications.
2024
- [arXiv] Flock: A Low-Cost Streaming Query Engine on FaaS Platforms. Gang Liao, Amol Deshpande, and Daniel J. Abadi. arXiv:2312.16735, 2024
In this paper, we present Flock, a cloud-native streaming query engine that leverages the on-demand elasticity of Function-as-a-Service (FaaS) platforms to perform real-time data analytics. Traditional server-centric deployments often suffer from resource under- or over-provisioning, leading to resource wastage or performance degradation. Flock addresses these issues by providing more fine-grained elasticity that can dynamically match the per-query basis with continuous scaling, and its billing methods are more fine-grained with millisecond granularity, making it a low-cost solution for stream processing. Our approach, payload invocation, eliminates the need for external storage services and eliminates the requirement for a query coordinator in the data architecture. Our evaluation shows that Flock significantly outperforms state-of-the-art systems in terms of cost, especially on ARM processors, making it a promising solution for real-time data analytics on FaaS platforms.
- [DaMoN] SFVInt: Simple, Fast and Generic Variable-Length Integer Decoding using Bit Manipulation Instructions. In Proceedings of 20th International Workshop on Data Management on New Hardware (DaMoN), 2024
The ubiquity of variable-length integers in data storage and communication necessitates efficient decoding techniques. In this paper, we present SFVInt, a simple and fast approach to decode the prevalent Little Endian Base-128 (LEB128) varints. Our approach, distilled into a mere 500 lines of code, effectively utilizes the Bit Manipulation Instruction Set 2 (BMI2) in modern Intel and AMD processors, achieving significant performance improvement while maintaining simplicity and avoiding overengineering. SFVInt, with its generic design, effectively processes both 32-bit and 64-bit unsigned integers using a unified code template, marking a significant leap forward in varint decoding efficiency. We thoroughly evaluate SFVInt’s performance across various datasets and scenarios, demonstrating that it achieves up to a 2x increase in decoding speed when compared to varint decoding methods used in established frameworks like Facebook Folly and Google Protobuf.
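For readers unfamiliar with the format, the following is a minimal scalar sketch of unsigned LEB128 encoding and decoding. It is a plain byte-at-a-time baseline written for illustration, not the BMI2-based method the paper describes, and the function names `encode_leb128` and `decode_leb128` are hypothetical.

```python
def encode_leb128(value: int) -> bytes:
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    if value < 0:
        raise ValueError("unsigned LEB128 cannot encode negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F          # take the low 7 payload bits
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow: set continuation bit
        else:
            out.append(byte)         # last byte: continuation bit clear
            return bytes(out)


def decode_leb128(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one unsigned LEB128 varint starting at buf[pos].

    Returns (value, next_pos). This is the simple loop that
    instruction-level approaches aim to speed up.
    """
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift  # place 7 payload bits, little-endian
        if (byte & 0x80) == 0:           # continuation bit clear: done
            return value, pos
        shift += 7
```

Each encoded byte carries 7 payload bits plus one continuation bit, so 300 becomes the two bytes `0xAC 0x02`, and a 64-bit value needs at most 10 bytes.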
2023
- [SoCC] FileScale: Fast and Elastic Metadata Management for Distributed File Systems. Gang Liao and Daniel J. Abadi. In Proceedings of the 2023 ACM Symposium on Cloud Computing, 2023
Recent work has shown that distributed database systems are a promising solution for scaling metadata management in scalable file systems. This work has shown that systems that store metadata on a single machine, or over a shared-disk abstraction, struggle to scale performance to deployments including billions of files. In contrast, leveraging a scalable, shared-nothing, distributed system for metadata storage can achieve much higher levels of scalability, without giving up high availability guarantees. However, for low-scale deployments – where metadata can fit in memory on a single machine – these systems that store metadata in a distributed database typically perform an order of magnitude worse than systems that store metadata in memory on a single machine. This has limited the impact of these distributed database approaches, since they are only currently applicable to file systems of extreme scale.
This paper describes FileScale, a three-tier architecture that incorporates a distributed database system as part of a comprehensive approach to metadata management in distributed file systems. In contrast to previous approaches, the architecture described in the paper performs comparably to the single-machine architecture at a small scale, while enabling linear scalability as the file system metadata increases.
2021
- [SIGMOD] BullFrog: Online Schema Evolution via Lazy Evaluation. In Proceedings of the 2021 ACM SIGMOD International Conference on Management of Data, 2021
This paper presents BullFrog, a relational DBMS that supports single-step, non-backwards compatible schema migrations without downtime, and without advance warning. When a schema migration is presented, BullFrog initiates a logical switch to the new schema, but physically migrates affected data lazily, as it is demanded by incoming transactions. BullFrog’s internal concurrency control algorithms and data structures enable concurrent processing of schema migration operations with post-migration transactions, while ensuring exactly-once migration of all old data into the physical layout required by the new schema. BullFrog is implemented as an open source extension to PostgreSQL. Experiments using this prototype over a TPC-C based workload (supplemented to include schema migrations) show that BullFrog can achieve zero-downtime migration to non-trivial new schemas with near-invisible impact on transaction throughput and latency.
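The idea of a logical switch followed by lazy, exactly-once physical migration can be sketched with a toy in-memory table. This is only an illustration under simplifying assumptions: the class `LazyMigratingTable`, its methods, and the single coarse lock are hypothetical and stand in for, rather than reproduce, BullFrog's concurrency control inside PostgreSQL.

```python
import threading


class LazyMigratingTable:
    """Toy key-value table that migrates rows to a new layout on first access."""

    def __init__(self, old_rows, migrate_row):
        self._old = dict(old_rows)       # rows still in the old layout
        self._new = {}                   # rows already in the new layout
        self._migrate_row = migrate_row  # old-layout row -> new-layout row
        self._lock = threading.Lock()    # coarse lock (an assumption, for simplicity)
        self.migrations = 0              # how many rows were physically migrated

    def get(self, key):
        """Read a row in the new layout, migrating it first if needed."""
        with self._lock:
            if key not in self._new and key in self._old:
                # Removing the row from the old store while holding the lock
                # ensures it is migrated at most once.
                self._new[key] = self._migrate_row(self._old.pop(key))
                self.migrations += 1
            return self._new.get(key)

    def put(self, key, row):
        """Write a new-layout row; any stale old-layout copy is discarded."""
        with self._lock:
            self._old.pop(key, None)
            self._new[key] = row

    def pending(self):
        """Number of rows not yet migrated."""
        with self._lock:
            return len(self._old)


# Example: merge (first, last) tuples into a single full_name field.
table = LazyMigratingTable(
    {1: ("Ada", "Lovelace"), 2: ("Alan", "Turing")},
    lambda row: {"full_name": f"{row[0]} {row[1]}"},
)
first_read = table.get(1)
second_read = table.get(1)  # already migrated, so no further work
```

After the two reads, only row 1 has been migrated; row 2 stays in the old layout until a later access demands it.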