Publications
2025
- [ASPLOS 2025] PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System. Yintao He, Haiyu Mao, Christina Giannoula, Mohammad Sadrosadati, Juan Gomez-Luna, Huawei Li, Xiaowei Li, Ying Wang, and Onur Mutlu.
Large language models (LLMs) are widely used for natural language understanding and text generation. An LLM relies on a time-consuming step called LLM decoding to generate output tokens. Several prior works focus on improving the performance of LLM decoding using parallelism techniques, such as batching and speculative decoding. State-of-the-art LLM decoding has both compute-bound and memory-bound kernels. Some prior works statically identify and map these different kernels to a heterogeneous architecture consisting of both processing-in-memory (PIM) units and computation-centric accelerators. We observe that characteristics of LLM decoding kernels (e.g., whether or not a kernel is memory-bound) can change dynamically due to parameter changes to meet user and/or system demands, making (1) static kernel mapping to PIM units and computation-centric accelerators suboptimal, and (2) a one-size-fits-all approach of designing PIM units inefficient due to a large degree of heterogeneity even in memory-bound kernels. In this paper, we aim to accelerate LLM decoding while considering the dynamically changing characteristics of the kernels involved. We propose PAPI (PArallel Decoding with PIM), a PIM-enabled heterogeneous architecture that exploits dynamic scheduling of compute-bound or memory-bound kernels to suitable hardware units. PAPI has two key mechanisms: (1) online kernel characterization to dynamically schedule kernels to the most suitable hardware units at runtime and (2) a PIM-enabled heterogeneous computing system that harmoniously orchestrates both computation-centric processing units and hybrid PIM units with different computing capabilities. Our experimental results on three broadly-used LLMs show that PAPI achieves 1.8× and 11.1× speedups over a state-of-the-art heterogeneous LLM accelerator and a state-of-the-art PIM-only LLM accelerator, respectively.
@inproceedings{papi,
  title = {PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System},
  author = {He, Yintao and Mao, Haiyu and Giannoula, Christina and Sadrosadati, Mohammad and Gomez-Luna, Juan and Li, Huawei and Li, Xiaowei and Wang, Ying and Mutlu, Onur},
  booktitle = {The 30th International Conference on Architectural Support for Programming Languages and Operating Systems},
  year = {2025}
}
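PAPI's central decision, dispatching each kernel at runtime based on whether it is currently compute- or memory-bound, can be illustrated with a roofline-style check. The sketch below is our illustration, not the paper's implementation; the `Kernel` fields, the machine-balance threshold, and the example numbers are all assumptions.

```python
# A minimal sketch of PAPI-style online kernel characterization: estimate each
# kernel's arithmetic intensity at runtime and dispatch it to a compute-centric
# unit or a PIM unit. All constants below are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Kernel:
    name: str
    flops: float        # floating-point operations for the current call
    bytes_moved: float  # DRAM traffic for the current call (varies with batch size etc.)

# Machine balance (FLOP/byte) above which a kernel is considered compute-bound.
# Hypothetical value; a real system would derive it from the accelerator's roofline.
MACHINE_BALANCE = 10.0

def dispatch(kernel: Kernel) -> str:
    """Schedule a kernel by its arithmetic intensity, which can change between
    decoding steps as batch size or sequence length changes."""
    intensity = kernel.flops / max(kernel.bytes_moved, 1.0)
    return "compute-centric accelerator" if intensity >= MACHINE_BALANCE else "PIM unit"

# Example: attention kernels become increasingly memory-bound as context grows.
for k in [Kernel("ffn_gemm", flops=2e9, bytes_moved=5e7),
          Kernel("attention_gemv", flops=1e8, bytes_moved=6e7)]:
    print(k.name, "->", dispatch(k))
```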
- [ASPLOS 2025] CIPHERMATCH: Accelerating Homomorphic Encryption based String Matching via Memory-Efficient Data Packing and In-Flash Processing. Mayank Kabra, Rakesh Nadig, Harshita Gupta, Rahul Bera, Manos Frouzakis, Vamanan Arulchelvan, Yu Liang, Haiyu Mao, Mohammad Sadrosadati, and Onur Mutlu.
Homomorphic encryption (HE) allows secure computation on encrypted data without revealing the original data, providing significant benefits for privacy-sensitive applications. Many cloud computing applications (e.g., DNA read mapping, biometric matching, web search) use exact string matching as a key operation. However, prior string matching algorithms that use homomorphic encryption are limited by high computational latency caused by the use of complex operations and data movement bottlenecks due to the large encrypted data size. In this work, we provide an efficient algorithm-hardware co-design to accelerate HE-based secure exact string matching. We propose CIPHERMATCH, which (i) reduces the increase in memory footprint after encryption using an optimized software-based data packing scheme, (ii) eliminates the use of costly homomorphic operations (e.g., multiplication and rotation), and (iii) reduces data movement by designing a new in-flash processing (IFP) architecture. We demonstrate the benefits of CIPHERMATCH using two case studies: (1) exact DNA string matching and (2) encrypted database search. Our pure software-based CIPHERMATCH implementation that uses our memory-efficient data packing scheme improves performance and reduces energy consumption by 42.9X and 17.6X, respectively, compared to the state-of-the-art software baseline. Integrating CIPHERMATCH with IFP improves performance and reduces energy consumption by 136.9X and 256.4X, respectively, compared to the software-based CIPHERMATCH implementation.
@inproceedings{ciphermatch,
  title = {CIPHERMATCH: Accelerating Homomorphic Encryption based String Matching via Memory-Efficient Data Packing and In-Flash Processing},
  author = {Kabra, Mayank and Nadig, Rakesh and Gupta, Harshita and Bera, Rahul and Frouzakis, Manos and Arulchelvan, Vamanan and Liang, Yu and Mao, Haiyu and Sadrosadati, Mohammad and Mutlu, Onur},
  booktitle = {The 30th International Conference on Architectural Support for Programming Languages and Operating Systems},
  year = {2025}
}
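The packing idea can be made concrete with a toy, plaintext-level simulation. This is our sketch, not CIPHERMATCH's scheme: the slot arithmetic of a BFV/BGV-style HE scheme is simulated with plain Python lists, and the slot count is an illustrative assumption. It shows why packing lets exact matching proceed with only slot-wise subtractions, avoiding homomorphic multiplications and rotations.

```python
# Toy, plaintext-level sketch of slot packing for HE string matching. No real
# HE library is used; "ciphertext slots" are simulated with Python lists.
SLOTS = 16  # illustrative slot count; real HE parameters are much larger

def pack(text: str, slots: int = SLOTS):
    """Pack character codes into fixed-size 'ciphertexts' (simulated as lists)."""
    codes = [ord(c) for c in text]
    return [codes[i:i + slots] for i in range(0, len(codes), slots)]

def match_positions(packed_db, query: str):
    """Compare the query against every alignment using only slot-wise
    subtraction; a window of all-zero differences marks an exact match."""
    flat = [c for block in packed_db for c in block]
    q = [ord(c) for c in query]
    hits = []
    for i in range(len(flat) - len(q) + 1):
        diffs = [flat[i + j] - q[j] for j in range(len(q))]  # homomorphic sub in a real scheme
        if all(d == 0 for d in diffs):  # in real HE, this check happens after decryption
            hits.append(i)
    return hits

db = pack("the cat sat on the mat")
print(match_positions(db, "the"))  # -> [0, 15]
```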
- [HPCA 2025] Ariadne: A Hotness-Aware and Size-Adaptive Compressed Swap Technique for Fast Application Relaunch and Reduced CPU Usage on Mobile Devices. Yu Liang, Aofeng Shen, Chun Jason Xue, Riwei Pan, Haiyu Mao, Nika Mansouri Ghiasi, Qingcai Jiang, Rakesh Nadig, Lei Li, Rachata Ausavarungnirun, Mohammad Sadrosadati, and Onur Mutlu.
As the memory demands of individual mobile applications continue to grow and the number of concurrently running applications increases, available memory on mobile devices is becoming increasingly scarce. When memory pressure is high, current mobile systems use a RAM-based compressed swap scheme (called ZRAM) to compress unused execution-related data (called anonymous data in Linux) in main memory. This approach avoids swapping data to secondary storage (NAND flash memory) or terminating applications, thereby achieving shorter application relaunch latency. In this paper, we observe that the state-of-the-art ZRAM scheme prolongs relaunch latency and wastes CPU time because it does not differentiate between hot and cold data or leverage different compression chunk sizes and data locality. We make three new observations. First, anonymous data has different levels of hotness. Hot data, used during application relaunch, is usually similar between consecutive relaunches. Second, when compressing the same amount of anonymous data, small-size compression is very fast, while large-size compression achieves a better compression ratio. Third, there is locality in data access during application relaunch. Based on these observations, we propose a hotness-aware and size-adaptive compressed swap scheme, Ariadne, for mobile devices to mitigate relaunch latency and reduce CPU usage. Ariadne incorporates three key techniques. First, a hotness-aware data organization scheme quickly identifies the hotness of anonymous data without significant overhead. Second, a size-adaptive compression scheme uses different compression chunk sizes based on the data’s hotness level to ensure fast decompression of hot and warm data. Third, a proactive decompression scheme predicts the next set of data to be used and decompresses it in advance, reducing the impact of data swapping back into main memory during application relaunch. We implement and evaluate Ariadne on a commercial smartphone, the Google Pixel 7 running Android 14. Our experimental evaluation results show that, on average, Ariadne reduces application relaunch latency by 50% and decreases the CPU usage of compression and decompression procedures by 15% compared to the state-of-the-art compressed swap scheme for mobile devices.
@inproceedings{ariadne,
  title = {Ariadne: A Hotness-Aware and Size-Adaptive Compressed Swap Technique for Fast Application Relaunch and Reduced CPU Usage on Mobile Devices},
  author = {Liang, Yu and Shen, Aofeng and Xue, Chun Jason and Pan, Riwei and Mao, Haiyu and Ghiasi, Nika Mansouri and Jiang, Qingcai and Nadig, Rakesh and Li, Lei and Ausavarungnirun, Rachata and Sadrosadati, Mohammad and Mutlu, Onur},
  booktitle = {The 31st International Symposium on High-Performance Computer Architecture},
  year = {2025}
}
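The size-adaptive compression idea can be sketched in a few lines: compress hot data in small chunks (fast, individually decompressible) and cold data in large chunks (better ratio). This is a minimal illustration, not Ariadne's implementation; the chunk sizes per hotness level and the use of zlib are assumptions.

```python
# A minimal sketch of size-adaptive compressed swap: hotness picks the chunk
# size, trading compression ratio for decompression latency.
import zlib

# Hypothetical chunk sizes per hotness level (bytes); not the paper's values.
CHUNK_SIZE = {"hot": 4 << 10, "warm": 16 << 10, "cold": 64 << 10}

def compress_anon_data(data: bytes, hotness: str):
    """Split anonymous data into hotness-dependent chunks and compress each
    independently, so hot chunks can be decompressed individually and quickly."""
    size = CHUNK_SIZE[hotness]
    return [zlib.compress(data[i:i + size]) for i in range(0, len(data), size)]

page = bytes(range(256)) * 1024            # 256 KiB of sample data
for level in ("hot", "cold"):
    chunks = compress_anon_data(page, level)
    total = sum(len(c) for c in chunks)
    print(f"{level}: {len(chunks)} chunks, {total} compressed bytes")
```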
2024
- [ISCA 2024] MegIS: High-Performance, Energy-Efficient, and Low-Cost Metagenomic Analysis with In-Storage Processing. Nika Mansouri Ghiasi, Mohammad Sadrosadati, Harun Mustafa, Arvid Gollwitzer, Can Firtina, Julien Eudine, Haiyu Mao, Joel Lindegger, Meryem Banu Cavlak, Mohammed Alser, Jisung Park, and Onur Mutlu.
Metagenomics, the study of the genome sequences of diverse organisms in a common environment, has led to significant advances in many fields. Since the species present in a metagenomic sample are not known in advance, metagenomic analysis commonly involves the key tasks of determining the species present in a sample and their relative abundances. These tasks require searching large metagenomic databases containing information on different species’ genomes. Metagenomic analysis suffers from significant data movement overhead due to moving large amounts of low-reuse data from the storage system to the rest of the system. In-storage processing can be a fundamental solution for reducing this overhead. However, designing an in-storage processing system for metagenomics is challenging because existing approaches to metagenomic analysis cannot be directly implemented in storage effectively due to the hardware limitations of modern SSDs. We propose MegIS, the first in-storage processing system designed to significantly reduce the data movement overhead of the end-to-end metagenomic analysis pipeline. MegIS is enabled by our lightweight design that effectively leverages and orchestrates processing inside and outside the storage system. Through our detailed analysis of the end-to-end metagenomic analysis pipeline and careful hardware/software co-design, we address in-storage processing challenges for metagenomics via specialized and efficient 1) task partitioning, 2) data/computation flow coordination, 3) storage technology-aware algorithmic optimizations, 4) data mapping, and 5) lightweight in-storage accelerators. MegIS’s design is flexible, capable of supporting different types of metagenomic input datasets, and can be integrated into various metagenomic analysis pipelines. Our evaluation shows that MegIS outperforms the state-of-the-art performance- and accuracy-optimized software metagenomic tools by 2.7×–37.2× and 6.9×–100.2×, respectively, while matching the accuracy of the accuracy-optimized tool. MegIS achieves 1.5×–5.1× speedup compared to the state-of-the-art metagenomic hardware-accelerated (using processing-in-memory) tool, while achieving significantly higher accuracy.
@inproceedings{megis,
  title = {MegIS: High-Performance, Energy-Efficient, and Low-Cost Metagenomic Analysis with In-Storage Processing},
  author = {Ghiasi, Nika Mansouri and Sadrosadati, Mohammad and Mustafa, Harun and Gollwitzer, Arvid and Firtina, Can and Eudine, Julien and Mao, Haiyu and Lindegger, Joel and Cavlak, Meryem Banu and Alser, Mohammed and Park, Jisung and Mutlu, Onur},
  booktitle = {2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture},
  year = {2024}
}
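As a rough illustration of the host/in-storage task split (our abstraction, not the paper's exact flow), the sketch below keeps database lookups near the data so only per-species hit counts move to the host. The k-mer length and the toy database are assumptions.

```python
# A highly simplified sketch of a MegIS-style split: the host prepares k-mers
# from reads, while the "in-storage" step filters them against the on-SSD
# database so only small results cross the storage interface.
K = 4  # illustrative k-mer length

def host_extract_kmers(read: str):
    return {read[i:i + K] for i in range(len(read) - K + 1)}

def in_storage_filter(kmers, species_db):
    """Runs near the data: returns per-species hit counts instead of shipping
    the whole database to the host."""
    return {s: len(kmers & ref) for s, ref in species_db.items() if kmers & ref}

species_db = {  # toy stand-in for a metagenomic database of species k-mer sets
    "E. coli": {"ACGT", "CGTA", "GTAC"},
    "S. aureus": {"TTAA", "TAAC"},
}
read = "ACGTAC"
print(in_storage_filter(host_extract_kmers(read), species_db))  # -> {'E. coli': 3}
```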
2023
- [MICRO 2023] Swordfish: A Framework for Evaluating Deep Neural Network-based Basecalling using Computation-In-Memory with Non-Ideal Memristors. Taha Shahroodi, Gagandeep Singh, Mahdi Zahedi, Haiyu Mao, Joel Lindegger, Can Firtina, Stephan Wong, Onur Mutlu, and Said Hamdioui.
Basecalling, an essential step in many genome analysis studies, relies on large Deep Neural Networks (DNNs) to achieve high accuracy. Unfortunately, these DNNs are computationally slow and inefficient, leading to considerable delays and resource constraints in the sequence analysis process. A Computation-In-Memory (CIM) architecture using memristors can significantly accelerate the performance of DNNs. However, inherent device non-idealities and architectural limitations of such designs can greatly degrade the basecalling accuracy, which is critical for accurate genome analysis. To facilitate the adoption of memristor-based CIM designs for basecalling, it is important to (1) conduct a comprehensive analysis of potential CIM architectures and (2) develop effective strategies for mitigating the possible adverse effects of inherent device non-idealities and architectural limitations. This paper proposes Swordfish, a novel hardware/software co-design framework that can effectively address the two aforementioned issues. Swordfish incorporates seven circuit and device restrictions or non-idealities from characterized real memristor-based chips. Swordfish leverages various hardware/software co-design solutions to mitigate the basecalling accuracy loss due to such non-idealities. To demonstrate the effectiveness of Swordfish, we take Bonito, the state-of-the-art (i.e., accurate and fast) open-source basecaller, as a case study. Our experimental results using Swordfish show that a CIM architecture can realistically accelerate Bonito for a wide range of real datasets by an average of 25.7x, with an accuracy loss of 6.01%.
@inproceedings{swordfish,
  title = {Swordfish: A Framework for Evaluating Deep Neural Network-based Basecalling using Computation-In-Memory with Non-Ideal Memristors},
  author = {Shahroodi, Taha and Singh, Gagandeep and Zahedi, Mahdi and Mao, Haiyu and Lindegger, Joel and Firtina, Can and Wong, Stephan and Mutlu, Onur and Hamdioui, Said},
  booktitle = {2023 56th Annual IEEE/ACM International Symposium on Microarchitecture},
  year = {2023}
}
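The kind of evaluation Swordfish enables can be sketched by injecting one device non-ideality into a DNN layer and measuring output drift. This toy model is ours, not the framework itself; the multiplicative Gaussian conductance variation and the 5% noise level are illustrative assumptions.

```python
# A toy model of one memristor non-ideality: perturb DNN weights with
# device-to-device conductance variation and observe how outputs drift.
import numpy as np

rng = np.random.default_rng(0)

def ideal_layer(x, w):
    return np.tanh(x @ w)

def nonideal_layer(x, w, sigma=0.05):
    """Model conductance variation as multiplicative Gaussian noise on the
    weight matrix, as a memristor crossbar mapping would exhibit."""
    w_dev = w * (1.0 + rng.normal(0.0, sigma, size=w.shape))
    return np.tanh(x @ w_dev)

x = rng.normal(size=(8, 16))
w = rng.normal(size=(16, 4))
drift = np.abs(ideal_layer(x, w) - nonideal_layer(x, w)).mean()
print(f"mean output drift under 5% conductance variation: {drift:.4f}")
```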
- [ISCA 2023] Venice: Improving Solid-State Drive Parallelism at Low Cost via Conflict-Free Accesses. Rakesh Nadig, Mohammad Sadrosadati, Haiyu Mao, Nika Mansouri Ghiasi, Arash Tavakkol, Jisung Park, Hamid Sarbazi-Azad, Juan Gómez Luna, and Onur Mutlu.
The performance and capacity of solid-state drives (SSDs) are continuously improving to meet the increasing demands of modern data-intensive applications. Unfortunately, communication between the SSD controller and memory chips (e.g., 2D/3D NAND flash chips) is a critical performance bottleneck for many applications. SSDs use a multi-channel shared bus architecture where multiple memory chips connected to the same channel communicate with the SSD controller over a single path. As a result, path conflicts often occur during the servicing of multiple I/O requests, which significantly limits SSD parallelism. It is critical to handle path conflicts well to improve SSD parallelism and performance. Our goal is to fundamentally tackle the path conflict problem by increasing the number of paths between the SSD controller and memory chips at low cost. To this end, we build on the idea of using an interconnection network to increase the path diversity between the SSD controller and memory chips. We propose Venice, a new mechanism that introduces a low-cost interconnection network between the SSD controller and memory chips and utilizes the path diversity to intelligently resolve path conflicts. Venice employs three key techniques: 1) a simple router chip added next to each memory chip without modifying the memory chip design, 2) a path reservation technique that reserves a path from the SSD controller to the target memory chip before initiating a transfer, and 3) a fully-adaptive routing algorithm that effectively utilizes the path diversity to resolve path conflicts. Our experimental results show that Venice 1) improves performance by an average of 2.65×/1.67× over a baseline performance-optimized/cost-optimized SSD design across a wide range of workloads, and 2) reduces energy consumption by an average of 61% compared to a baseline performance-optimized SSD design. Venice’s benefits come at a relatively low area overhead.
@inproceedings{nadig2023venice,
  title = {Venice: Improving Solid-State Drive Parallelism at Low Cost via Conflict-Free Accesses},
  author = {Nadig, Rakesh and Sadrosadati, Mohammad and Mao, Haiyu and Ghiasi, Nika Mansouri and Tavakkol, Arash and Park, Jisung and Sarbazi-Azad, Hamid and Luna, Juan G{\'o}mez and Mutlu, Onur},
  booktitle = {Proceedings of the 50th Annual International Symposium on Computer Architecture},
  year = {2023}
}
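The path-reservation idea can be sketched on a toy topology: model the interconnect as a graph, search for a path whose links are currently free, and reserve it before the transfer starts. This is our simplification, not Venice's router design or routing algorithm; the four-node topology is an assumption.

```python
# A minimal sketch of path reservation with adaptive routing: BFS over links
# that are not yet reserved; a second transfer routes around the busy links.
from collections import deque

edges = {  # adjacency list: controller 'C' plus four router/chip nodes (toy topology)
    "C": ["R0", "R1"],
    "R0": ["C", "R1", "R2"],
    "R1": ["C", "R0", "R3"],
    "R2": ["R0", "R3"],
    "R3": ["R1", "R2"],
}
reserved = set()  # links currently carrying a transfer

def reserve_path(src, dst):
    """Find a path using only free links and reserve it if one exists."""
    q, seen = deque([(src, [src])]), {src}
    while q:
        node, path = q.popleft()
        if node == dst:
            for a, b in zip(path, path[1:]):
                reserved.add(frozenset((a, b)))
            return path
        for nxt in edges[node]:
            if nxt not in seen and frozenset((node, nxt)) not in reserved:
                seen.add(nxt)
                q.append((nxt, path + [nxt]))
    return None  # path conflict: every route is busy, request must wait

print(reserve_path("C", "R3"))  # first transfer reserves C-R1-R3
print(reserve_path("C", "R3"))  # second transfer adapts: C-R0-R2-R3
```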
- [Bioinformatics 2023] RawHash: enabling fast and accurate real-time analysis of raw nanopore signals for large genomes. Can Firtina, Nika Mansouri Ghiasi, Joel Lindegger, Gagandeep Singh, Meryem Banu Cavlak, Haiyu Mao, and Onur Mutlu.
Nanopore sequencers generate raw electrical signals in real-time while sequencing long genomic strands. These raw signals can be analyzed as they are generated, providing an opportunity for real-time genome analysis. An important feature of nanopore sequencing, Read Until, can eject strands from sequencers without fully sequencing them, which provides opportunities to computationally reduce the sequencing time and cost. However, existing works utilizing Read Until either (i) require powerful computational resources that may not be available for portable sequencers or (ii) lack scalability for large genomes, rendering them inaccurate or ineffective. We propose RawHash, the first mechanism that can accurately and efficiently perform real-time analysis of nanopore raw signals for large genomes using a hash-based similarity search. To enable this, RawHash ensures the signals corresponding to the same DNA content lead to the same hash value, regardless of the slight variations in these signals. RawHash achieves an accurate hash-based similarity search via an effective quantization of the raw signals such that signals corresponding to the same DNA content have the same quantized value and, subsequently, the same hash value. We evaluate RawHash on three applications: (i) read mapping, (ii) relative abundance estimation, and (iii) contamination analysis. Our evaluations show that RawHash is the only tool that can provide high accuracy and high throughput for analyzing large genomes in real-time. When compared to the state-of-the-art techniques, UNCALLED and Sigmap, RawHash provides (i) 25.8X and 3.4X better average throughput and (ii) significantly better accuracy for large genomes, respectively. Source code is available at https://github.com/CMU-SAFARI/RawHash.
@article{firtina2023rawhash,
  title = {RawHash: enabling fast and accurate real-time analysis of raw nanopore signals for large genomes},
  author = {Firtina, Can and Mansouri Ghiasi, Nika and Lindegger, Joel and Singh, Gagandeep and Cavlak, Meryem Banu and Mao, Haiyu and Mutlu, Onur},
  journal = {Bioinformatics},
  year = {2023},
  publisher = {Oxford University Press}
}
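The quantize-then-hash idea the abstract describes translates directly into a short sketch: signals for the same DNA content land in the same bucket, so their hashes collide by construction. This is a minimal illustration of the mechanism, not RawHash's actual code; the bucket width and window length are assumptions.

```python
# A minimal sketch of hash-based raw-signal similarity search: quantize noisy
# signal events so slightly different values map to the same bucket, then hash
# windows of consecutive quantized events.
BUCKET = 0.25   # quantization step: signals within a bucket hash identically
WINDOW = 4      # number of consecutive events combined into one hash

def quantize(event: float) -> int:
    return round(event / BUCKET)

def raw_signal_hashes(events):
    q = [quantize(e) for e in events]
    return [hash(tuple(q[i:i + WINDOW])) for i in range(len(q) - WINDOW + 1)]

ref  = [0.51, 1.02, 0.76, 0.26, 0.49]
read = [0.49, 0.98, 0.74, 0.27, 0.52]   # same DNA content, slightly noisy
print(raw_signal_hashes(ref) == raw_signal_hashes(read))  # -> True
```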
2022
- [MICRO 2022] GenPIP: In-Memory Acceleration of Genome Analysis via Tight Integration of Basecalling and Read Mapping. Haiyu Mao, Mohammed Alser, Mohammad Sadrosadati, Can Firtina, Akanksha Baranwal, Damla Senol Cali, Aditya Manglik, Nour Almadhoun Alserr, and Onur Mutlu.
Nanopore sequencing is a widely-used high-throughput genome sequencing technology that can sequence long fragments of a genome into raw electrical signals at low cost. Nanopore sequencing requires two computationally-costly processing steps for accurate downstream genome analysis. The first step, basecalling, translates the raw electrical signals into nucleotide bases (i.e., A, C, G, T). The second step, read mapping, finds the correct location of a read in a reference genome. In existing genome analysis pipelines, basecalling and read mapping are executed separately. We observe in this work that such separate execution of the two most time-consuming steps inherently leads to (1) significant data movement and (2) redundant computations on the data, slowing down the genome analysis pipeline. This paper proposes GenPIP, an in-memory genome analysis accelerator that tightly integrates basecalling and read mapping. GenPIP improves the performance of the genome analysis pipeline with two key mechanisms: (1) in-memory fine-grained collaborative execution of the major genome analysis steps in parallel; (2) a new technique for early-rejection of low-quality and unmapped reads to timely stop the execution of genome analysis for such reads, reducing inefficient computation. Our experiments show that, for the execution of the genome analysis pipeline, GenPIP provides 41.6× (8.4×) speedup and 32.8× (20.8×) energy savings with negligible accuracy loss compared to the state-of-the-art software genome analysis tools executed on a state-of-the-art CPU (GPU). Compared to a design that combines state-of-the-art in-memory basecalling and read mapping accelerators, GenPIP provides 1.39× speedup and 1.37× energy savings.
@inproceedings{mao2022genpip,
  title = {GenPIP: In-Memory Acceleration of Genome Analysis via Tight Integration of Basecalling and Read Mapping},
  author = {Mao, Haiyu and Alser, Mohammed and Sadrosadati, Mohammad and Firtina, Can and Baranwal, Akanksha and Cali, Damla Senol and Manglik, Aditya and Alserr, Nour Almadhoun and Mutlu, Onur},
  booktitle = {2022 55th Annual IEEE/ACM International Symposium on Microarchitecture},
  year = {2022}
}
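GenPIP's early-rejection mechanism can be abstracted as a per-chunk quality check that stops all further basecalling and mapping work on a read as soon as it falls below a threshold. The sketch below is our abstraction; the quality threshold and the chunk interface are assumptions.

```python
# A minimal sketch of early rejection: inspect basecalled chunks as they are
# produced and abandon low-quality reads before wasting further computation.
MIN_CHUNK_QUALITY = 0.85  # illustrative threshold

def analyze_read(chunks):
    """chunks: iterable of (bases, quality) produced incrementally by the
    basecaller; returns the full sequence, or None if rejected early."""
    bases = []
    for seq, quality in chunks:
        if quality < MIN_CHUNK_QUALITY:
            return None       # reject: skip remaining basecalling and mapping
        bases.append(seq)
    return "".join(bases)     # would proceed to read mapping

good = [("ACGT", 0.95), ("TTGA", 0.92)]
bad  = [("ACGT", 0.95), ("NNNN", 0.40), ("GGCC", 0.99)]
print(analyze_read(good), analyze_read(bad))  # -> ACGTTTGA None
```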
- [CSBJ 2022] From molecules to genomic variations: Accelerating genome analysis via intelligent algorithms and architectures. Mohammed Alser, Joel Lindegger, Can Firtina, Nour Almadhoun, Haiyu Mao, Gagandeep Singh, Juan Gomez-Luna, and Onur Mutlu. Computational and Structural Biotechnology Journal.
We now need more than ever to make genome analysis more intelligent. We need to read, analyze, and interpret our genomes not only quickly, but also accurately and efficiently enough to scale the analysis to population level. There currently exist major computational bottlenecks and inefficiencies throughout the entire genome analysis pipeline, because state-of-the-art genome sequencing technologies are still not able to read a genome in its entirety. We describe the ongoing journey in significantly improving the performance, accuracy, and efficiency of genome analysis using intelligent algorithms and hardware architectures. We explain state-of-the-art algorithmic methods and hardware-based acceleration approaches for each step of the genome analysis pipeline and provide experimental evaluations. Algorithmic approaches exploit the structure of the genome as well as the structure of the underlying hardware. Hardware-based acceleration approaches exploit specialized microarchitectures or various execution paradigms (e.g., processing inside or near memory) along with algorithmic changes, leading to new hardware/software co-designed systems. We conclude with a foreshadowing of future challenges, benefits, and research directions triggered by the development of both very low cost yet highly error prone new sequencing technologies and specialized hardware chips for genomics. We hope that these efforts and the challenges we discuss provide a foundation for future work in making genome analysis more intelligent.
@article{alser2022molecules,
  title = {From molecules to genomic variations: Accelerating genome analysis via intelligent algorithms and architectures},
  author = {Alser, Mohammed and Lindegger, Joel and Firtina, Can and Almadhoun, Nour and Mao, Haiyu and Singh, Gagandeep and Gomez-Luna, Juan and Mutlu, Onur},
  journal = {Computational and Structural Biotechnology Journal},
  year = {2022},
  publisher = {Elsevier}
}
- [ASPLOS 2022] GenStore: a high-performance in-storage processing system for genome sequence analysis. Nika Mansouri Ghiasi, Jisung Park, Harun Mustafa, Jeremie Kim, Ataberk Olgun, Arvid Gollwitzer, Damla Senol Cali, Can Firtina, Haiyu Mao, Nour Almadhoun Alserr, and others. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems.
Read mapping is a fundamental step in many genomics applications. It is used to identify potential matches and differences between fragments (called reads) of a sequenced genome and an already known genome (called a reference genome). Read mapping is costly because it needs to perform approximate string matching (ASM) on large amounts of data. To address the computational challenges in genome analysis, many prior works propose various approaches such as accurate filters that select the reads within a dataset of genomic reads (called a read set) that must undergo expensive computation, efficient heuristics, and hardware acceleration. While effective at reducing the amount of expensive computation, all such approaches still require the costly movement of a large amount of data from storage to the rest of the system, which can significantly lower the end-to-end performance of read mapping in conventional and emerging genomics systems. We propose GenStore, the first in-storage processing system designed for genome sequence analysis that greatly reduces both data movement and computational overheads of genome sequence analysis by exploiting low-cost and accurate in-storage filters. GenStore leverages hardware/software co-design to address the challenges of in-storage processing, supporting reads with 1) different properties such as read lengths and error rates, which highly depend on the sequencing technology, and 2) different degrees of genetic variation compared to the reference genome, which highly depends on the genomes that are being compared. Through rigorous analysis of read mapping processes of reads with different properties and degrees of genetic variation, we meticulously design low-cost hardware accelerators and data/computation flows inside a NAND flash-based solid-state drive (SSD). Our evaluation using a wide range of real genomic datasets shows that GenStore, when implemented in three modern NAND flash-based SSDs, significantly improves the read mapping performance of state-of-the-art software (hardware) baselines by 2.07-6.05× (1.52-3.32×) for read sets with high similarity to the reference genome and 1.45-33.63× (2.70-19.2×) for read sets with low similarity to the reference genome.
@inproceedings{mansouri2022genstore,
  title = {GenStore: a high-performance in-storage processing system for genome sequence analysis},
  author = {Mansouri Ghiasi, Nika and Park, Jisung and Mustafa, Harun and Kim, Jeremie and Olgun, Ataberk and Gollwitzer, Arvid and Senol Cali, Damla and Firtina, Can and Mao, Haiyu and Almadhoun Alserr, Nour and others},
  booktitle = {Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems},
  year = {2022}
}
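A GenStore-style in-storage filter can be abstracted as follows: reads that match the reference exactly are resolved inside the SSD, so only the remaining reads are shipped to the host for expensive approximate string matching. The exact-match set below is a toy stand-in for the paper's filters, and the read length is an assumption.

```python
# A minimal sketch of an in-storage exact-match filter that reduces both data
# movement and host-side approximate string matching (ASM) work.
def build_exact_index(reference: str, read_len: int):
    return {reference[i:i + read_len] for i in range(len(reference) - read_len + 1)}

def in_storage_filter(reads, index):
    to_host = []
    for r in reads:
        if r in index:
            continue           # exact match: resolved in-storage, no data movement
        to_host.append(r)      # needs host-side approximate matching
    return to_host

ref = "ACGTACGTTGCA"
idx = build_exact_index(ref, 4)
reads = ["ACGT", "CGTA", "AAAT"]
print(in_storage_filter(reads, idx))  # -> ['AAAT']
```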
2021
- [SSI 2021] Development of processing-in-memory. Haiyu Mao, Jiwu Shu, Fei Li, and Zhe Liu. SCIENTIA SINICA Informationis.
With the explosive increase of processed data, data transmission through the bus between the CPU and main memory has become a bottleneck in the traditional von Neumann architecture. On top of this, popular data-intensive workloads, such as neural networks and graph computing applications, have poor data locality, which results in a substantial increase of the cache miss rate. Processing such popular data-intensive workloads hinders the entire system since the data transmission causes long latency and high energy consumption. Processing-in-memory greatly reduces this data transmission by equipping the main memory with computation ability, alleviating the problems of poor performance and high energy consumption caused by a large amount of data and poor data locality. Processing-in-memory consists of two different approaches. One method involves integrating computation resources into the main memory with high-bandwidth interconnects (i.e., near data computing). The other method consists of employing memory arrays to compute directly (i.e., computing-in-memory). These two approaches have their own advantages and disadvantages, as well as suitable scenarios. In this survey, the birth and development of processing-in-memory are first introduced and discussed. Its techniques, ranging from hardware to microarchitecture, are then presented. Furthermore, the challenges faced by processing-in-memory are analyzed. Finally, the opportunities that processing-in-memory offers for popular applications are discussed.
@article{haiyu2021development,
  title = {Development of processing-in-memory},
  author = {Mao, Haiyu and Shu, Jiwu and Li, Fei and Liu, Zhe},
  journal = {SCIENTIA SINICA Informationis},
  publisher = {Science China Press},
  year = {2021}
}
2020
- [TC 2020] LrGAN: A Compact and Energy Efficient PIM-Based Architecture for GAN Training. Haiyu Mao, Jiwu Shu, Mingcong Song, and Tao Li. IEEE Transactions on Computers.
As a powerful unsupervised learning method, Generative Adversarial Network (GAN) plays an essential role in many domains. However, training a GAN imposes four additional challenges: (1) intensive communication caused by the complex training phases of GAN, (2) much more ineffectual computations caused by peculiar convolutions, (3) more frequent off-chip memory accesses for exchanging intermediate data between the generator and the discriminator, and (4) high energy consumption due to unnecessary fine-grained MLC programming. In this paper, we propose LrGAN, a PIM-based GAN accelerator, to address the challenges of training GAN. We first propose a zero-free data reshaping scheme for ReRAM-based PIM, which removes the zero-related computations. We then propose a 3D-connected PIM, which can reconfigure connections inside PIM dynamically according to dataflows of propagation and updating. After that, we propose an approximate weight update algorithm to avoid unnecessary fine-grained MLC programming. Finally, we propose LrGAN based on these three techniques, providing different levels of GAN acceleration for programmers. Experiments show that LrGAN achieves 47.2×, 21.42×, and 7.46× speedup over an FPGA-based GAN accelerator, a GPU platform, and a ReRAM-based neural network accelerator, respectively. Besides, LrGAN achieves 13.65×, 10.75×, and 1.34× energy savings on average over the GPU platform, PRIME, and the FPGA-based GAN accelerator, respectively.
@article{mao2020lrgan,
  title = {LrGAN: A Compact and Energy Efficient PIM-Based Architecture for GAN Training},
  author = {Mao, Haiyu and Shu, Jiwu and Song, Mingcong and Li, Tao},
  journal = {IEEE Transactions on Computers},
  year = {2020},
  publisher = {IEEE}
}
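Our reading of the approximate weight update can be sketched as write coalescing: accumulate small gradient steps in a buffer and program a ReRAM cell only when the accumulated change crosses one MLC conductance level. The level granularity, learning rate, and update rule below are illustrative assumptions, not the paper's parameters.

```python
# A toy sketch of coalescing weight updates to avoid fine-grained MLC writes:
# tiny per-step updates accumulate off-array; a cell is reprogrammed only when
# the pending change reaches one conductance level.
import numpy as np

LEVEL = 0.05                    # illustrative MLC programming granularity
w = np.zeros(4)                 # weights stored in ReRAM cells
pending = np.zeros_like(w)      # updates accumulated in a small buffer

def approx_update(grad, lr=0.01):
    """Apply one gradient step; return how many cells were actually written."""
    global w
    pending[:] += lr * grad
    mask = np.abs(pending) >= LEVEL          # only these cells get reprogrammed
    w = w - np.sign(pending) * LEVEL * mask
    pending[mask] -= np.sign(pending[mask]) * LEVEL
    return int(mask.sum())

total_writes = sum(approx_update(np.array([1.0, 0.2, -1.0, 0.01]))
                   for _ in range(20))
print("cell writes:", total_writes, "(vs. 80 for naive per-step programming)")
print("weights:", w)
```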
- [TOS 2020] ShieldNVM: An efficient and fast recoverable system for secure non-volatile memory. Fan Yang, Youmin Chen, Haiyu Mao, Youyou Lu, and Jiwu Shu. ACM Transactions on Storage.
Data encryption and authentication are essential for secure non-volatile memory (NVM). However, the introduced security metadata needs to be atomically written back to NVM along with data, so as to provide crash consistency, which unfortunately incurs high overhead. To support fine-grained data protection and fast recovery for a secure NVM system without compromising the performance, we propose ShieldNVM. It first proposes an epoch-based mechanism to aggressively cache the security metadata in the metadata cache while retaining their consistency in NVM. Deferred spreading is also introduced to reduce the calculating overhead for data authentication. Leveraging the ability of data hash message authentication codes, we can always recover the consistent but old security metadata to its newest version. By recording a limited number of dirty addresses of the security metadata, ShieldNVM achieves fast recovery of the secure NVM system after crashes. Compared to Osiris, a state-of-the-art secure NVM, ShieldNVM reduces system runtime by 39.1% and hash message authentication code computation overhead by 80.5% on average over NVM workloads. When system crashes happen, ShieldNVM’s recovery time is orders of magnitude faster than Osiris. In addition, ShieldNVM also recovers faster than AGIT, which is the Osiris-based state-of-the-art mechanism addressing the recovery time of the secure NVM system. Once the recovery process fails, instead of dropping all data due to malicious attacks, ShieldNVM is able to detect and locate the area of the tampered data with the help of the tracked addresses.
@article{yang2020shieldnvm,
  title = {ShieldNVM: An efficient and fast recoverable system for secure non-volatile memory},
  author = {Yang, Fan and Chen, Youmin and Mao, Haiyu and Lu, Youyou and Shu, Jiwu},
  journal = {ACM Transactions on Storage},
  year = {2020},
  publisher = {ACM New York, NY, USA}
}
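The HMAC-based recovery idea can be sketched as counter replay: if a crash loses the freshest cached counter, roll the persisted stale counter forward until the stored data HMAC verifies. This is our abstraction; the key handling, counter layout, and search bound are assumptions.

```python
# A minimal sketch of recovering stale security metadata via the data HMAC:
# the HMAC binds data to its write counter, so the lost fresh counter can be
# rediscovered by replaying candidate values.
import hmac, hashlib

KEY = b"secret-key"  # hypothetical; a real system keeps this in secure hardware

def data_hmac(data: bytes, counter: int) -> bytes:
    return hmac.new(KEY, data + counter.to_bytes(8, "little"), hashlib.sha256).digest()

# Persisted in NVM: data, its HMAC, and a *stale* counter (the fresh one was
# only in the volatile metadata cache when the crash happened).
data, fresh_counter = b"cache line", 42
stored_mac = data_hmac(data, fresh_counter)
stale_counter = 40

def recover(data, stored_mac, stale_counter, max_gap=64):
    """Roll the stale counter forward until the HMAC verifies."""
    for c in range(stale_counter, stale_counter + max_gap):
        if hmac.compare_digest(data_hmac(data, c), stored_mac):
            return c
    raise RuntimeError("tampering detected: no counter verifies")

print("recovered counter:", recover(data, stored_mac, stale_counter))  # -> 42
```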
2019
- [DAC 2019] No compromises: Secure NVM with crash consistency, write-efficiency and high-performance. Fan Yang, Youyou Lu, Youmin Chen, Haiyu Mao, and Jiwu Shu. In 2019 56th ACM/IEEE Design Automation Conference.
Data encryption and authentication are essential for secure NVM. However, the introduced security metadata needs to be atomically written back to NVM along with data, so as to provide crash consistency, which unfortunately incurs high overhead. To support fine-grained data protection without compromising the performance, we propose cc-NVM. It first proposes an epoch-based mechanism to aggressively cache the security metadata in the CPU cache while retaining their consistency in NVM. Deferred spreading is also introduced to reduce the calculating overhead for data authentication. Leveraging the hidden ability of data HMACs, we can always recover the consistent but old security metadata to its newest version. Compared to Osiris, a state-of-the-art secure NVM, cc-NVM improves performance by 20.4% on average. When the system crashes, instead of dropping all the data due to malicious attacks, cc-NVM is able to detect and locate the exact tampered data while incurring only 29.6% extra write traffic on average.
@inproceedings{yang2019no,
  title = {No compromises: Secure NVM with crash consistency, write-efficiency and high-performance},
  author = {Yang, Fan and Lu, Youyou and Chen, Youmin and Mao, Haiyu and Shu, Jiwu},
  booktitle = {2019 56th ACM/IEEE Design Automation Conference},
  year = {2019},
  organization = {IEEE}
}
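The tamper-localization capability can be sketched with per-block HMAC verification: instead of discarding all data when recovery fails, each block is checked independently so only the tampered region is reported. This toy is our abstraction of the idea; the block granularity and key handling are assumptions.

```python
# A minimal sketch of locating tampered data with per-block HMACs rather than
# rejecting the entire NVM contents after a failed recovery.
import hmac, hashlib

KEY = b"secret-key"  # hypothetical key

def mac(block: bytes) -> bytes:
    return hmac.new(KEY, block, hashlib.sha256).digest()

blocks = [b"block-0", b"block-1", b"block-2"]
macs = [mac(b) for b in blocks]          # persisted alongside the data
blocks[1] = b"tampered"                  # simulate a malicious modification

suspect = [i for i, (b, m) in enumerate(zip(blocks, macs))
           if not hmac.compare_digest(mac(b), m)]
print("tampered blocks:", suspect)       # -> [1]
```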
2018
- [MICRO 2018] LerGAN: A Zero-Free, Low Data Movement and PIM-Based GAN Architecture. Haiyu Mao, Mingcong Song, Tao Li, Yuting Dai, and Jiwu Shu. In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture.
As a powerful unsupervised learning method, Generative Adversarial Network (GAN) plays an important role in many domains such as video prediction and autonomous driving. It is one of MIT Technology Review’s 10 Breakthrough Technologies of 2018. However, training a GAN imposes three additional challenges: (1) intensive communication caused by the complex training phases of GAN, (2) much more ineffectual computations caused by special convolutions, and (3) more frequent off-chip memory accesses for exchanging intermediate data between the generator and the discriminator. In this paper, we propose LerGAN, a PIM-based GAN accelerator to address the challenges of training GAN. We first propose a zero-free data reshaping scheme for ReRAM-based PIM, which removes the zero-related computations. We then propose a 3D-connected PIM, which can reconfigure connections inside PIM dynamically according to dataflows of propagation and updating. Our proposed techniques reduce data movement to a great extent, preventing I/O from becoming a bottleneck of training GANs. Finally, we propose LerGAN based on these two techniques, providing different levels of GAN acceleration for programmers. Experiments show that LerGAN achieves 47.2×, 21.42×, and 7.46× speedup over an FPGA-based GAN accelerator, a GPU platform, and a ReRAM-based neural network accelerator, respectively. Moreover, LerGAN achieves 9.75× and 7.68× energy savings on average over the GPU platform and the ReRAM-based neural network accelerator, respectively, and consumes 1.04× the energy of the FPGA-based GAN accelerator.
@inproceedings{mao2018lergan,
  title = {LerGAN: A Zero-Free, Low Data Movement and PIM-Based GAN Architecture},
  author = {Mao, Haiyu and Song, Mingcong and Li, Tao and Dai, Yuting and Shu, Jiwu},
  booktitle = {2018 51st Annual IEEE/ACM International Symposium on Microarchitecture},
  lightningtalk = {https://www.youtube.com/watch?v=dmsGaoJKbAU},
  year = {2018},
  organization = {IEEE}
}
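The zero-free reshaping insight rests on a property of transposed (stride > 1) convolution: the input is first dilated with zeros, so a naive mapping multiplies by zero much of the time. The sketch below (our 1-D simplification, not LerGAN's ReRAM mapping) shows that gathering only the real inputs yields the same result with no zero-operand products.

```python
# A 1-D demonstration of removing zero-related computations in a stride-2
# transposed convolution, the kind of "special convolution" GAN generators use.
import numpy as np

x = np.arange(1, 5, dtype=float)      # input
k = np.array([1.0, 2.0, 3.0])         # kernel

# Naive: insert zeros, then convolve -- half the products have a zero operand.
dilated = np.zeros(2 * len(x) - 1)
dilated[::2] = x
naive = np.convolve(dilated, k)

# Zero-free: scatter each real input directly to the outputs it touches.
out = np.zeros(len(dilated) + len(k) - 1)
for i, xi in enumerate(x):            # each real input touches len(k) outputs
    out[2 * i:2 * i + len(k)] += xi * k

print(np.allclose(naive, out))        # -> True, with no zero-operand products
```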
2017
- [DATE 2017] Protect non-volatile memory from wear-out attack based on timing difference of row buffer hit/miss. Haiyu Mao, Xian Zhang, Guangyu Sun, and Jiwu Shu. In Design, Automation & Test in Europe Conference & Exhibition.
Non-volatile memories (NVMs), such as PCM and ReRAM, have been widely proposed for future main memory design because of their low standby power, high storage density, and fast access speed. However, these NVMs suffer from the write endurance problem. In order to prevent a malicious program from deliberately wearing out NVMs, researchers have proposed various wear-leveling methods, which remap logical addresses to physical addresses randomly and dynamically. However, we discover that side channel leakage based on NVM row buffer hit information can reveal details of address remappings. Consequently, it can be leveraged to side-step the wear-leveling. Our simulation shows that the attack method proposed in this paper can wear out an NVM within 137 seconds, even with the protection of state-of-the-art wear-leveling schemes. To counteract this attack, we further introduce an effective countermeasure named Intra-Row Swap (IRS) to hide the wear-leveling details. The basic idea is to enable an additional intra-row block swap when a new logical address is remapped to the memory row. Experiments demonstrate that IRS can secure NVMs with negligible timing/energy overhead, compared with previous works.
@inproceedings{mao2017protect,
  title = {Protect non-volatile memory from wear-out attack based on timing difference of row buffer hit/miss},
  author = {Mao, Haiyu and Zhang, Xian and Sun, Guangyu and Shu, Jiwu},
  booktitle = {Design, Automation \& Test in Europe Conference \& Exhibition},
  year = {2017},
  organization = {IEEE}
}
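The side channel can be modeled in a few lines: a row-buffer hit is measurably faster than a miss, so timing a pair of accesses tells the attacker whether two addresses share a physical row, even after wear-leveling remaps them. The latencies, threshold, and address-to-row mapping below are illustrative assumptions, not measured values from the paper.

```python
# A toy model of the row-buffer timing side channel (not the paper's attack
# code): latency classifies the second access as a hit or miss, leaking
# whether two addresses landed in the same physical row.
HIT_NS, MISS_NS, THRESHOLD_NS = 20, 50, 35  # illustrative latencies

def timed_access(addr, open_row, row_of):
    """Simulated NVM access: returns (latency, new open row)."""
    row = row_of(addr)
    return (HIT_NS if row == open_row else MISS_NS), row

def same_row(addr_a, addr_b, row_of):
    """Attacker primitive: access a, then b; a fast second access implies both
    map to the same row, revealing remapping details despite wear-leveling."""
    _, open_row = timed_access(addr_a, None, row_of)
    latency, _ = timed_access(addr_b, open_row, row_of)
    return latency < THRESHOLD_NS

row_of = lambda addr: addr // 8     # toy address-to-row mapping
print(same_row(0, 3, row_of), same_row(0, 9, row_of))  # -> True False
```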
2015
- [NVMSA 2015] Exploring data placement in racetrack memory based scratchpad memory. Haiyu Mao, Chao Zhang, Guangyu Sun, and Jiwu Shu. In IEEE Non-Volatile Memory System and Applications Symposium.
Scratchpad Memory (SPM) has been widely adopted in various computing systems to improve the performance of data access. Recently, non-volatile memory technologies (NVMs) have been employed for SPM design to improve its capacity and reduce its energy consumption. In this paper, we explore data allocation in SPM based on racetrack memory (RM), which is an emerging NVM with ultra-high storage density and fast access speed. Since a shift operation is needed to access data in RM, data allocation has an impact on the performance of RM-based SPM. Several allocation methods are discussed and compared in this work. In particular, we address how to leverage a genetic algorithm to achieve near-optimal data allocation.
@inproceedings{mao2015exploring,
  title = {Exploring data placement in racetrack memory based scratchpad memory},
  author = {Mao, Haiyu and Zhang, Chao and Sun, Guangyu and Shu, Jiwu},
  booktitle = {IEEE Non-Volatile Memory System and Applications Symposium},
  year = {2015},
  organization = {IEEE}
}
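The genetic-algorithm formulation can be sketched as permutation search: a placement is the genome, the total head-shift distance over the access trace is the fitness, and the population evolves toward placements that keep hot data near the access port. This mutation-only toy is our illustration, not the paper's algorithm; the access trace and GA parameters are assumptions.

```python
# A toy genetic algorithm for racetrack-memory data placement: minimize the
# total shift distance the access port travels over a given access trace.
import random

random.seed(0)
ITEMS = list(range(6))                  # data items to place at positions 0..5
ACCESSES = [0, 1, 0, 2, 0, 3, 0, 4]     # access trace; item 0 is hot

def shift_cost(placement):
    """Total shifts: the head moves from its current position to each item."""
    pos = {item: p for p, item in enumerate(placement)}
    head, cost = 0, 0
    for item in ACCESSES:
        cost += abs(pos[item] - head)
        head = pos[item]
    return cost

def mutate(placement):
    """Swap two positions -- the only variation operator in this toy GA."""
    a, b = random.sample(range(len(placement)), 2)
    child = placement[:]
    child[a], child[b] = child[b], child[a]
    return child

population = [random.sample(ITEMS, len(ITEMS)) for _ in range(20)]
for _ in range(200):                    # evolve: keep the best, mutate them
    population.sort(key=shift_cost)
    population = population[:10] + [mutate(p) for p in population[:10]]

best = min(population, key=shift_cost)
print("placement:", best, "shifts:", shift_cost(best))
```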