* denotes equal contribution
An up-to-date list is available on Google Scholar.
PhD thesis
The Evolution of Cloud Data Architectures: Storage, Compute, and Migration. Gang Liao (2022). University of Maryland, College Park.
conference papers
2025
Bullion: A Column Store for Machine Learning
In Proceedings of the 15th Conference on Innovative Data Systems Research (CIDR), 2025
The past two decades have witnessed columnar storage revolutionizing data warehousing and analytics. However, the rapid growth of machine learning poses new challenges to this domain. This paper presents Bullion, a columnar storage system tailored for machine learning workloads. Bullion addresses the complexities of data compliance, optimizes the encoding of long sequence sparse features, efficiently manages wide-table projections, and introduces feature quantization in storage. By aligning with the evolving requirements of ML applications, Bullion extends columnar storage to various scenarios, from advertising and recommendation systems to the expanding realm of Generative AI.
Preliminary experimental results and theoretical analysis demonstrate Bullion’s superior performance in handling the unique demands of machine learning workloads compared to existing columnar storage solutions. Bullion significantly reduces I/O costs for deletion compliance, achieves substantial storage savings with its optimized encoding scheme for sparse features, and drastically improves metadata parsing speed for wide-table projections. These advancements position Bullion as a critical component in the future of machine learning infrastructure, enabling organizations to efficiently manage and process the massive volumes of data required for training and inference in modern AI applications.
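As a rough illustration of what feature quantization in storage can look like, the sketch below applies generic symmetric int8 quantization to a float feature vector. This is an assumption chosen for illustration, not Bullion's actual scheme; the function names `quantize_int8` and `dequantize_int8` are hypothetical.

```python
import struct
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[bytes, float]:
    """Map floats to signed 8-bit codes plus one shared scale factor.

    Generic symmetric linear quantization (an illustrative assumption,
    not the encoding described in the Bullion paper).
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    # Guard against an all-zero (or empty) vector.
    scale = max_abs / 127.0 if max_abs > 0.0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    # One byte per value instead of four bytes for a float32.
    return struct.pack(f"{len(codes)}b", *codes), scale


def dequantize_int8(blob: bytes, scale: float) -> List[float]:
    """Recover approximate floats from the stored codes."""
    codes = struct.unpack(f"{len(blob)}b", blob)
    return [c * scale for c in codes]


features = [0.12, -0.98, 0.5, 0.0]
blob, scale = quantize_int8(features)
restored = dequantize_int8(blob, scale)
```

Each value costs one byte rather than the four bytes of a float32, at the price of a rounding error of at most half the scale factor per value.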
2024
Flock: A Low-Cost Streaming Query Engine on FaaS Platforms
arXiv:2312.16735, 2024
In this paper, we present Flock, a cloud-native streaming query engine that leverages the on-demand elasticity of Function-as-a-Service (FaaS) platforms to perform real-time data analytics. Traditional server-centric deployments often suffer from resource under- or over-provisioning, leading to resource wastage or performance degradation. Flock addresses these issues by providing more fine-grained elasticity that scales continuously to match the demands of each query, and its billing is also more fine-grained, with millisecond granularity, making it a low-cost solution for stream processing. Our approach, payload invocation, eliminates the need for both external storage services and a query coordinator in the data architecture. Our evaluation shows that Flock significantly outperforms state-of-the-art systems in terms of cost, especially on ARM processors, making it a promising solution for real-time data analytics on FaaS platforms.
SFVInt: Simple, Fast and Generic Variable-Length Integer Decoding using Bit Manipulation Instructions
In Proceedings of the 20th International Workshop on Data Management on New Hardware (DaMoN), 2024
The ubiquity of variable-length integers in data storage and communication necessitates efficient decoding techniques. In this paper, we present SFVInt, a simple and fast approach to decode the prevalent Little Endian Base-128 (LEB128) varints. Our approach, distilled into a mere 500 lines of code, effectively utilizes the Bit Manipulation Instruction Set 2 (BMI2) in modern Intel and AMD processors, achieving significant performance improvement while maintaining simplicity and avoiding overengineering. SFVInt, with its generic design, effectively processes both 32-bit and 64-bit unsigned integers using a unified code template, marking a significant leap forward in varint decoding efficiency. We thoroughly evaluate SFVInt’s performance across various datasets and scenarios, demonstrating that it achieves up to a 2x increase in decoding speed when compared to varint decoding methods used in established frameworks like Facebook Folly and Google Protobuf.
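For readers unfamiliar with the format, the sketch below shows unsigned LEB128 encoding, the usual byte-at-a-time decoding loop, and a word-at-a-time variant built on a software emulation of PEXT (parallel bit extract), one of the BMI2 instructions. It is an illustration only: the abstract does not say which BMI2 instructions SFVInt uses or how, so the PEXT-based decoder is an assumption, and all function names here are hypothetical. The word-at-a-time variant handles varints of up to 8 bytes (56 payload bits).

```python
from typing import Tuple


def encode_leb128(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    if value < 0:
        raise ValueError("unsigned LEB128 cannot encode negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F          # low 7 payload bits
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_leb128(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Byte-at-a-time baseline. Returns (value, next_position)."""
    result = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def pext(src: int, mask: int) -> int:
    """Software emulation of PEXT: gather the bits of src selected by mask."""
    result = 0
    out_bit = 0
    while mask:
        lowest = mask & -mask        # lowest set bit of the mask
        if src & lowest:
            result |= 1 << out_bit
        out_bit += 1
        mask &= mask - 1             # clear that bit
    return result


def decode_leb128_pext(buf: bytes, pos: int = 0) -> Tuple[int, int]:
    """Word-at-a-time sketch for varints of at most 8 bytes (assumption)."""
    word = int.from_bytes(buf[pos:pos + 8].ljust(8, b"\x00"), "little")
    # Bytes whose continuation bit is clear mark a possible end of the varint.
    stops = ~word & 0x8080808080808080
    if stops == 0:
        raise ValueError("varint longer than 8 bytes is outside this sketch")
    nbytes = (stops & -stops).bit_length() // 8
    if pos + nbytes > len(buf):
        raise ValueError("truncated varint")
    # Keep the 7 payload bits of each byte that belongs to this varint.
    payload_mask = 0x7F7F7F7F7F7F7F7F & ((1 << (8 * nbytes)) - 1)
    return pext(word, payload_mask), pos + nbytes
```

On hardware with BMI2, the `pext` loop corresponds to a single instruction, which is where a word-at-a-time decoder would get its speed; the Python emulation only demonstrates the bit-level logic.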
2023
FileScale: Fast and Elastic Metadata Management for Distributed File Systems
In Proceedings of the 2023 ACM Symposium on Cloud Computing, 2023
Recent work has shown that distributed database systems are a promising solution for scaling metadata management in scalable file systems. This work has shown that systems that store metadata on a single machine, or over a shared-disk abstraction, struggle to scale performance to deployments including billions of files. In contrast, leveraging a scalable, shared-nothing, distributed system for metadata storage can achieve much higher levels of scalability, without giving up high availability guarantees. However, for low-scale deployments – where metadata can fit in memory on a single machine – these systems that store metadata in a distributed database typically perform an order of magnitude worse than systems that store metadata in memory on a single machine. This has limited the impact of these distributed database approaches, since they are only currently applicable to file systems of extreme scale.
This paper describes FileScale, a three-tier architecture that incorporates a distributed database system as part of a comprehensive approach to metadata management in distributed file systems. In contrast to previous approaches, the architecture described in the paper performs comparably to the single-machine architecture at a small scale, while enabling linear scalability as the file system metadata increases.
2021
BullFrog: Online Schema Evolution via Lazy Evaluation
In Proceedings of the 2021 ACM SIGMOD International Conference on Management of Data, 2021
This paper presents BullFrog, a relational DBMS that supports single-step, non-backwards-compatible schema migrations without downtime and without advance warning.
When a schema migration is presented, BullFrog initiates a logical switch to the new schema, but physically migrates affected data lazily, as it is demanded by incoming transactions. BullFrog’s internal concurrency control algorithms and data structures enable concurrent processing of schema migration operations with post-migration transactions, while ensuring exactly-once migration of all old data into the physical layout required by the new schema.
BullFrog is implemented as an open source extension to PostgreSQL. Experiments using this prototype over a TPC-C based workload (supplemented to include schema migrations) show that BullFrog can achieve zero-downtime migration to non-trivial new schemas with near-invisible impact on transaction throughput and latency.
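The idea of switching logically at once while migrating physically on demand can be sketched with a toy in-memory table. This is a minimal illustration under stated assumptions, not BullFrog's implementation: the class `LazyMigratingTable` and its methods are hypothetical, and a single coarse lock stands in for the concurrency control algorithms and data structures the abstract refers to.

```python
import threading
from typing import Any, Callable, Dict


class LazyMigratingTable:
    """Toy table that serves the new schema immediately and migrates
    each old row the first time a request touches it."""

    def __init__(
        self,
        old_rows: Dict[int, Dict[str, Any]],
        migrate_row: Callable[[Dict[str, Any]], Dict[str, Any]],
    ) -> None:
        self._old = dict(old_rows)   # rows still in the old layout
        self._new: Dict[int, Dict[str, Any]] = {}
        self._migrate_row = migrate_row
        # Coarse lock: a simplifying assumption for this sketch.
        self._lock = threading.Lock()
        self.migration_count = 0

    def get(self, key: int) -> Dict[str, Any]:
        """Return the row in the new schema, migrating it if needed."""
        with self._lock:
            if key not in self._new and key in self._old:
                # pop() removes the old copy, so a row is migrated at most once.
                self._new[key] = self._migrate_row(self._old.pop(key))
                self.migration_count += 1
            return self._new[key]    # raises KeyError for unknown keys

    def pending(self) -> int:
        """Number of rows not yet physically migrated."""
        with self._lock:
            return len(self._old)


def split_name(row: Dict[str, Any]) -> Dict[str, Any]:
    """Example migration: replace 'name' with 'first' and 'last' columns."""
    first, _, last = row["name"].partition(" ")
    return {"first": first, "last": last}


table = LazyMigratingTable(
    {1: {"name": "Ada Lovelace"}, 2: {"name": "Alan Turing"}},
    split_name,
)
row = table.get(1)   # migrates row 1 only; row 2 stays in the old layout
```

Reads against the new schema work from the moment the table is created, and only the rows that are actually requested pay the migration cost.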