A C++ sparse matrix library optimized for compression of highly redundant sparse data.
IVSparse is a C++ sparse matrix library that introduces novel compression formats for sparse matrices where non-zero values are highly redundant — a common property in single-cell omics and other datasets from machine learning, data science, and scientific computing.
The library provides three storage formats:
CSC: standard Compressed Sparse Column format
VCSC (Value Compressed Sparse Column): stores each column's unique values together with the indices at which they occur, achieving ~2.25× compression over CSC
IVCSC (Indexed Value Compressed Sparse Column): further compresses index storage via positive-delta encoding and byte-packing, achieving ~7.5× compression over CSC
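The ideas behind the two compressed formats can be sketched in a few lines of standalone C++. This is a minimal illustration, not the IVSparse API or its on-disk layout: `ValueRun`, `groupByValue`, `deltaEncode`, and `byteWidth` are hypothetical names, the value type is fixed to `int`, and the choice of 1, 2, or 4 bytes per packed index is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical illustration type; not part of the IVSparse API.
struct ValueRun {
    int value;                        // one unique non-zero value in the column
    std::vector<std::uint32_t> rows;  // ascending row indices holding that value
};

// VCSC idea: store each unique value of a column once, followed by the
// row indices where it occurs. `rows` is assumed to be in ascending order.
std::vector<ValueRun> groupByValue(const std::vector<std::uint32_t>& rows,
                                   const std::vector<int>& vals) {
    assert(rows.size() == vals.size());
    std::map<int, std::vector<std::uint32_t>> groups;
    for (std::size_t i = 0; i < rows.size(); ++i) {
        groups[vals[i]].push_back(rows[i]);
    }
    std::vector<ValueRun> runs;
    for (auto& kv : groups) {
        runs.push_back(ValueRun{kv.first, std::move(kv.second)});
    }
    return runs;
}

// IVCSC idea, step 1: positive-delta encoding. Each index is replaced by
// its distance from the previous one (the first is measured from 0).
std::vector<std::uint32_t> deltaEncode(const std::vector<std::uint32_t>& sorted) {
    std::vector<std::uint32_t> deltas;
    deltas.reserve(sorted.size());
    std::uint32_t prev = 0;
    for (std::uint32_t r : sorted) {
        assert(r >= prev);  // input must be non-decreasing
        deltas.push_back(r - prev);
        prev = r;
    }
    return deltas;
}

// IVCSC idea, step 2: byte-packing. Pick the smallest byte width
// (assumed here to be 1, 2, or 4) that can hold the largest delta.
unsigned byteWidth(const std::vector<std::uint32_t>& deltas) {
    std::uint32_t largest = 0;
    for (std::uint32_t d : deltas) {
        if (d > largest) largest = d;
    }
    if (largest <= 0xFFu) return 1;
    if (largest <= 0xFFFFu) return 2;
    return 4;
}
```

For a column with row indices {1, 3, 4, 9, 12} and values {5, 5, 7, 5, 7}, the value 5 is stored once with rows {1, 3, 9}. Those rows delta-encode to {1, 2, 6}, so each index fits in a single byte. The more often a value repeats within a column, the more both steps save.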
The full paper was published at IEEE BigData ’24. The work was also accepted as a one-page summary and poster at the 2024 Data Compression Conference (DCC), where we presented alongside researchers from across the compression community.
Genomics datasets, such as single-cell transcriptomics, are often very large and highly sparse, posing significant challenges for both storage and computation. As the scale of data generation accelerates, efficiently compressing these datasets becomes crucial. Current compression methods, like the popular Compressed Sparse Column (CSC) format, capitalize only on sparsity but overlook other properties like redundancy, which can offer additional opportunities for compression. Genomics data, especially single-cell assays, often exhibit high redundancy within columns, making traditional sparse formats inefficient for in-core computation. In this paper, we present two extensions to CSC: (1) Value-Compressed Sparse Column (VCSC) and (2) Index- and Value-Compressed Sparse Column (IVCSC). VCSC takes advantage of high redundancy within a column to further compress data up to 1.9-fold over CSC on real data, without significant negative impact to performance characteristics. IVCSC extends VCSC by compressing index arrays through delta encoding and byte-packing, achieving up to a 4.4-fold decrease in memory usage over CSC on real data. Our benchmarks show that VCSC and IVCSC can be used in compressed form with little added computational cost. These formats represent a step forward in balancing the growing demands of data storage and processing in the era of large-scale genomics.
@inproceedings{wolfgang2024vcsc,
  title={Value-Compressed Sparse Column (VCSC): Sparse Matrix Storage for Single-cell Omics Data},
  author={Wolfgang, Seth and Ruiter, Skyler and Tunnell, Marc and Triche Jr., Timothy and Carrier, Erin and DeBruine, Zachary},
  booktitle={2024 IEEE International Conference on Big Data (BigData)},
  year={2024},
  month=dec,
  address={Washington, DC, USA},
  doi={10.1109/BigData62323.2024.10825091},
}