publications
2022
- [MICRO] GenPIP: In-Memory Acceleration of Genome Analysis via Tight Integration of Basecalling and Read Mapping. Haiyu Mao, Mohammed Alser, Mohammad Sadrosadati, Can Firtina, Akanksha Baranwal, Damla Senol Cali, Aditya Manglik, Nour Almadhoun Alserr, and Onur Mutlu. MICRO 2022.
Nanopore sequencing is a widely used high-throughput genome sequencing technology that can sequence long fragments of a genome into raw electrical signals at low cost. Nanopore sequencing requires two computationally costly processing steps for accurate downstream genome analysis. The first step, basecalling, translates the raw electrical signals into nucleotide bases (i.e., A, C, G, T). The second step, read mapping, finds the correct location of a read in a reference genome. In existing genome analysis pipelines, basecalling and read mapping are executed separately. We observe in this work that such separate execution of the two most time-consuming steps inherently leads to (1) significant data movement and (2) redundant computations on the data, slowing down the genome analysis pipeline. This paper proposes GenPIP, an in-memory genome analysis accelerator that tightly integrates basecalling and read mapping. GenPIP improves the performance of the genome analysis pipeline with two key mechanisms: (1) in-memory fine-grained collaborative execution of the major genome analysis steps in parallel; (2) a new early-rejection technique that stops the analysis of low-quality and unmapped reads as soon as they are identified, avoiding wasted computation. Our experiments show that, for the execution of the genome analysis pipeline, GenPIP provides 41.6× (8.4×) speedup and 32.8× (20.8×) energy savings with negligible accuracy loss compared to state-of-the-art software genome analysis tools executed on a state-of-the-art CPU (GPU). Compared to a design that combines state-of-the-art in-memory basecalling and read mapping accelerators, GenPIP provides 1.39× speedup and 1.37× energy savings.
@inproceedings{mao2022genpip, title = {GenPIP: In-Memory Acceleration of Genome Analysis via Tight Integration of Basecalling and Read Mapping}, author = {Mao, Haiyu and Alser, Mohammed and Sadrosadati, Mohammad and Firtina, Can and Baranwal, Akanksha and Cali, Damla Senol and Manglik, Aditya and Alserr, Nour Almadhoun and Mutlu, Onur}, booktitle = {MICRO}, year = {2022} }
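To make the early-rejection mechanism concrete, here is a minimal software sketch. It is an illustration of the idea only, not GenPIP's in-memory hardware: `basecall_chunk` and the quality metric are hypothetical stand-ins, and the chunk size and threshold are arbitrary.

```python
# Hypothetical sketch of chunk-based early rejection (not GenPIP's actual design).
# A read's signal is processed in chunks; if the running basecalling quality
# drops below a threshold, the rest of the pipeline is skipped for that read.

def basecall_chunk(signal_chunk):
    """Stand-in for a basecaller: returns (bases, mean_quality)."""
    # In GenPIP this runs inside memory; here we fake a quality score.
    quality = sum(signal_chunk) / len(signal_chunk)  # placeholder metric
    bases = "A" * len(signal_chunk)                  # placeholder bases
    return bases, quality

def analyze_read(signal, chunk_size=400, min_quality=0.5):
    """Basecall chunk by chunk; reject the read early if quality is too low."""
    bases = []
    for start in range(0, len(signal), chunk_size):
        chunk = signal[start:start + chunk_size]
        called, quality = basecall_chunk(chunk)
        if quality < min_quality:
            return None  # early rejection: skip read mapping for this read
        bases.append(called)
    return "".join(bases)  # high-quality read proceeds to read mapping
```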
- [CSBJ] From molecules to genomic variations: Accelerating genome analysis via intelligent algorithms and architectures. Mohammed Alser, Joel Lindegger, Can Firtina, Nour Almadhoun, Haiyu Mao, Gagandeep Singh, Juan Gomez-Luna, and Onur Mutlu. Computational and Structural Biotechnology Journal 2022.
We now need more than ever to make genome analysis more intelligent. We need to read, analyze, and interpret our genomes not only quickly, but also accurately and efficiently enough to scale the analysis to population level. There currently exist major computational bottlenecks and inefficiencies throughout the entire genome analysis pipeline, because state-of-the-art genome sequencing technologies are still not able to read a genome in its entirety. We describe the ongoing journey in significantly improving the performance, accuracy, and efficiency of genome analysis using intelligent algorithms and hardware architectures. We explain state-of-the-art algorithmic methods and hardware-based acceleration approaches for each step of the genome analysis pipeline and provide experimental evaluations. Algorithmic approaches exploit the structure of the genome as well as the structure of the underlying hardware. Hardware-based acceleration approaches exploit specialized microarchitectures or various execution paradigms (e.g., processing inside or near memory) along with algorithmic changes, leading to new hardware/software co-designed systems. We conclude with a foreshadowing of future challenges, benefits, and research directions triggered by the development of both very low-cost yet highly error-prone new sequencing technologies and specialized hardware chips for genomics. We hope that these efforts and the challenges we discuss provide a foundation for future work in making genome analysis more intelligent.
@article{alser2022molecules, title = {From molecules to genomic variations: Accelerating genome analysis via intelligent algorithms and architectures}, author = {Alser, Mohammed and Lindegger, Joel and Firtina, Can and Almadhoun, Nour and Mao, Haiyu and Singh, Gagandeep and Gomez-Luna, Juan and Mutlu, Onur}, journal = {Computational and Structural Biotechnology Journal}, year = {2022}, publisher = {Elsevier} }
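The costly kernel behind most of the bottlenecks this survey targets is approximate string matching. As a concrete reference point, here is the textbook O(mn) edit-distance dynamic program; the algorithmic filters and hardware accelerators discussed above exist to avoid or speed up exactly this kind of computation.

```python
# Textbook edit-distance dynamic program: the O(m*n) kernel that
# pre-alignment filters and hardware accelerators aim to avoid or speed up.

def edit_distance(read, ref):
    m, n = len(read), len(ref)
    # dp[i][j] = edits needed to convert read[:i] into ref[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if read[i - 1] == ref[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match/substitution
    return dp[m][n]

assert edit_distance("ACGT", "AGT") == 1
```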
- [ASPLOS] GenStore: A high-performance in-storage processing system for genome sequence analysis. Nika Mansouri Ghiasi, Jisung Park, Harun Mustafa, Jeremie Kim, Ataberk Olgun, Arvid Gollwitzer, Damla Senol Cali, Can Firtina, Haiyu Mao, Nour Almadhoun Alserr, and others. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2022.
Read mapping is a fundamental step in many genomics applications. It is used to identify potential matches and differences between fragments (called reads) of a sequenced genome and an already known genome (called a reference genome). Read mapping is costly because it needs to perform approximate string matching (ASM) on large amounts of data. To address the computational challenges in genome analysis, many prior works propose various approaches such as accurate filters that select the reads within a dataset of genomic reads (called a read set) that must undergo expensive computation, efficient heuristics, and hardware acceleration. While effective at reducing the amount of expensive computation, all such approaches still require the costly movement of a large amount of data from storage to the rest of the system, which can significantly lower the end-to-end performance of read mapping in conventional and emerging genomics systems. We propose GenStore, the first in-storage processing system designed for genome sequence analysis that greatly reduces both data movement and computational overheads of genome sequence analysis by exploiting low-cost and accurate in-storage filters. GenStore leverages hardware/software co-design to address the challenges of in-storage processing, supporting reads with 1) different properties such as read lengths and error rates, which highly depend on the sequencing technology, and 2) different degrees of genetic variation compared to the reference genome, which highly depends on the genomes that are being compared. Through rigorous analysis of read mapping processes of reads with different properties and degrees of genetic variation, we meticulously design low-cost hardware accelerators and data/computation flows inside a NAND flash-based solid-state drive (SSD). Our evaluation using a wide range of real genomic datasets shows that GenStore, when implemented in three modern NAND flash-based SSDs, significantly improves the read mapping performance of state-of-the-art software (hardware) baselines by 2.07-6.05× (1.52-3.32×) for read sets with high similarity to the reference genome and 1.45-33.63× (2.70-19.2×) for read sets with low similarity to the reference genome.
@inproceedings{mansouri2022genstore, title = {GenStore: a high-performance in-storage processing system for genome sequence analysis}, author = {Mansouri Ghiasi, Nika and Park, Jisung and Mustafa, Harun and Kim, Jeremie and Olgun, Ataberk and Gollwitzer, Arvid and Senol Cali, Damla and Firtina, Can and Mao, Haiyu and Almadhoun Alserr, Nour and others}, booktitle = {Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems}, year = {2022} }
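A minimal software analogue of the in-storage filtering idea for read sets highly similar to the reference: reads that match the reference exactly can be discarded inside the SSD, so only the remainder crosses the storage interface for expensive approximate matching. The substring index below is a hypothetical stand-in for GenStore's in-SSD data structures, not its actual design.

```python
# Hypothetical analogue of an in-storage exact-match filter (not GenStore's
# actual accelerator): reads that occur verbatim in the reference are filtered
# out inside storage; only the rest is sent to the host for costly alignment.

def build_exact_index(reference, read_len):
    """Index every substring of the reference of a fixed read length."""
    return {reference[i:i + read_len]
            for i in range(len(reference) - read_len + 1)}

def in_storage_filter(reads, index):
    """Split reads into exact matches (done) and candidates for the host."""
    exact, candidates = [], []
    for read in reads:
        (exact if read in index else candidates).append(read)
    return exact, candidates

reference = "ACGTACGTTTGACGT"
index = build_exact_index(reference, read_len=4)
done, to_host = in_storage_filter(["ACGT", "AAAA"], index)
assert done == ["ACGT"] and to_host == ["AAAA"]
```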
2021
- [SSI] Development of processing-in-memory. Haiyu Mao, Jiwu Shu, Fei Li, and Zhe Liu. SCIENTIA SINICA Informationis 2021.
With the explosive increase of processed data, data transmission through the bus between the CPU and the main memory has become a bottleneck in the traditional von Neumann architecture. On top of this, popular data-intensive workloads, such as neural networks and graph computing applications, have poor data locality, which results in a substantial increase of the cache miss rate. Processing such popular data-intensive workloads hinders the entire system, since the data transmission causes long latency and high energy consumption. Processing-in-memory greatly reduces this data transmission by equipping the main memory with computation ability, alleviating the problems of poor performance and high energy consumption caused by a large amount of data and poor data locality. Processing-in-memory consists of two different approaches. One method involves integrating computation resources into the main memory with high-bandwidth interconnects (i.e., near-data computing). The other method consists of employing memory arrays to compute directly (i.e., computing-in-memory). These two approaches have their own advantages and disadvantages, as well as suitable scenarios. In this survey, the birth and development of processing-in-memory are first introduced and discussed. Its techniques, ranging from hardware to microarchitecture, are then presented. Furthermore, the challenges faced by processing-in-memory are analyzed. Finally, the opportunities that processing-in-memory offers for popular applications are discussed.
@article{haiyu2021development, title = {Development of processing-in-memory}, author = {Mao, Haiyu and Shu, Jiwu and Li, Fei and Liu, Zhe}, journal = {SCIENTIA SINICA Informationis}, publisher = {Science China Press}, year = {2021} }
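To make the computing-in-memory approach concrete: a ReRAM crossbar evaluates a matrix-vector product in the analog domain, with cell conductances as matrix entries, applied wordline voltages as the input vector, and summed bitline currents as the output (Ohm's and Kirchhoff's laws). The NumPy sketch below models only the ideal math, ignoring device non-idealities and ADC quantization.

```python
# Idealized model of a ReRAM crossbar computing a matrix-vector product in
# one step: cell conductances G encode the matrix, wordline voltages v encode
# the input, and each bitline current is an analog dot product (Kirchhoff).
import numpy as np

conductances = np.array([[1.0, 0.5],   # G[i][j]: conductance of cell (i, j)
                         [0.2, 0.8]])
voltages = np.array([0.3, 0.7])        # input vector applied on wordlines

bitline_currents = conductances.T @ voltages  # I_j = sum_i G[i][j] * V_i
print(bitline_currents)  # analog outputs, digitized by ADCs in real designs
```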
2020
- [TC] LrGAN: A Compact and Energy Efficient PIM-Based Architecture for GAN Training. Haiyu Mao, Jiwu Shu, Mingcong Song, and Tao Li. IEEE Transactions on Computers 2020.
As a powerful unsupervised learning method, the Generative Adversarial Network (GAN) plays an essential role in many domains. However, training a GAN imposes four additional challenges: (1) intensive communication caused by the complex training phases of GANs, (2) many more ineffectual computations caused by peculiar convolutions, (3) more frequent off-chip memory accesses for exchanging intermediate data between the generator and the discriminator, and (4) high energy consumption of unnecessary fine-grained MLC programming. In this paper, we propose LrGAN, a PIM-based GAN accelerator, to address the challenges of training GANs. We first propose a zero-free data reshaping scheme for ReRAM-based PIM, which removes the zero-related computations. We then propose a 3D-connected PIM, which can reconfigure connections inside the PIM dynamically according to the dataflows of propagation and updating. After that, we propose an approximate weight update algorithm to avoid unnecessary fine-grained MLC programming. Finally, we propose LrGAN based on these three techniques, providing programmers with different levels of GAN acceleration. Experiments show that LrGAN achieves 47.2×, 21.42×, and 7.46× speedup over an FPGA-based GAN accelerator, a GPU platform, and a ReRAM-based neural network accelerator, respectively. In addition, LrGAN achieves 13.65×, 10.75×, and 1.34× energy savings on average over the GPU platform, PRIME, and the FPGA-based GAN accelerator, respectively.
@article{mao2020lrgan, title = {LrGAN: A Compact and Energy Efficient PIM-Based Architecture for GAN Training}, author = {Mao, Haiyu and Shu, Jiwu and Song, Mingcong and Li, Tao}, journal = {IEEE Transactions on Computers}, year = {2020}, publisher = {IEEE} }
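The approximate weight update can be sketched in a few lines: rather than reprogramming a multi-level cell for every small gradient, updates are accumulated and a cell is written only when the pending change crosses a level boundary. The level spacing and update rule below are illustrative assumptions, not LrGAN's exact scheme.

```python
# Illustrative sketch of coarse-grained MLC updates (assumed scheme, not
# LrGAN's exact algorithm): accumulate small gradients in a buffer and only
# program a ReRAM cell when the pending change crosses one MLC level.

LEVEL_STEP = 1.0 / 16  # assumed spacing between adjacent MLC levels

def apply_update(cell_value, pending, gradient):
    """Accumulate a gradient; program the cell only on a level crossing."""
    pending += gradient
    if abs(pending) >= LEVEL_STEP:
        levels = int(pending / LEVEL_STEP)  # whole levels to move
        cell_value += levels * LEVEL_STEP   # one coarse programming step
        pending -= levels * LEVEL_STEP      # keep the sub-level remainder
    return cell_value, pending

cell, pending = 0.5, 0.0
for g in [0.01, 0.02, 0.04, -0.005]:        # small per-batch gradients
    cell, pending = apply_update(cell, pending, g)
print(cell, pending)  # only one cell write happened for four gradients
```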
- [TOS] ShieldNVM: An efficient and fast recoverable system for secure non-volatile memory. Fan Yang, Youmin Chen, Haiyu Mao, Youyou Lu, and Jiwu Shu. ACM Transactions on Storage 2020.
Data encryption and authentication are essential for secure non-volatile memory (NVM). However, the introduced security metadata needs to be atomically written back to NVM along with the data to provide crash consistency, which unfortunately incurs high overhead. To support fine-grained data protection and fast recovery for a secure NVM system without compromising performance, we propose ShieldNVM. It first proposes an epoch-based mechanism to aggressively cache the security metadata in the metadata cache while retaining their consistency in NVM. Deferred spreading is also introduced to reduce the calculating overhead for data authentication. Leveraging the ability of data hash message authentication codes (HMACs), we can always recover consistent but old security metadata to its newest version. By recording a limited number of dirty addresses of the security metadata, ShieldNVM achieves fast recovery of the secure NVM system after crashes. Compared to Osiris, a state-of-the-art secure NVM, ShieldNVM reduces system runtime by 39.1% and HMAC computation overhead by 80.5% on average over NVM workloads. When system crashes happen, ShieldNVM's recovery time is orders of magnitude faster than Osiris's. In addition, ShieldNVM also recovers faster than AGIT, the Osiris-based state-of-the-art mechanism addressing the recovery time of secure NVM systems. If the recovery process fails, instead of dropping all data due to malicious attacks, ShieldNVM is able to detect and locate the area of the tampered data with the help of the tracked addresses.
@article{yang2020shieldnvm, title = {ShieldNVM: An efficient and fast recoverable system for secure non-volatile memory}, author = {Yang, Fan and Chen, Youmin and Mao, Haiyu and Lu, Youyou and Shu, Jiwu}, journal = {ACM Transactions on Storage}, year = {2020}, publisher = {ACM New York, NY, USA} }
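A software analogue of the epoch-based mechanism: security metadata (e.g., encryption counters) is updated in a volatile cache and persisted to (simulated) NVM only at epoch boundaries, so counter updates are batched instead of written back with every data write. The class and epoch policy below are hypothetical illustrations, not ShieldNVM's actual design.

```python
# Hypothetical sketch of epoch-based security-metadata caching: counters are
# updated in a volatile cache and flushed to (simulated) NVM only when an
# epoch ends, instead of persisting every counter update with its data.

class EpochMetadataCache:
    def __init__(self, epoch_len=4):
        self.cache = {}      # volatile: addr -> encryption counter
        self.nvm = {}        # persistent copy, updated at epoch boundaries
        self.dirty = set()   # addresses modified in the current epoch
        self.writes = 0
        self.epoch_len = epoch_len

    def on_data_write(self, addr):
        """Bump the counter in the cache; defer persisting it."""
        self.cache[addr] = self.cache.get(addr, 0) + 1
        self.dirty.add(addr)
        self.writes += 1
        if self.writes % self.epoch_len == 0:
            self.end_epoch()

    def end_epoch(self):
        """Flush only the dirty counters accumulated in this epoch."""
        for addr in self.dirty:
            self.nvm[addr] = self.cache[addr]
        self.dirty.clear()

cache = EpochMetadataCache()
for addr in [0x10, 0x10, 0x20, 0x10]:
    cache.on_data_write(addr)
assert cache.nvm == {0x10: 3, 0x20: 1}  # persisted once, not per write
```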
2019
- [DAC] No compromises: Secure NVM with crash consistency, write-efficiency and high-performance. Fan Yang, Youyou Lu, Youmin Chen, Haiyu Mao, and Jiwu Shu. In 2019 56th ACM/IEEE Design Automation Conference, 2019.
Data encryption and authentication are essential for secure NVM. However, the introduced security metadata needs to be atomically written back to NVM along with the data to provide crash consistency, which unfortunately incurs high overhead. To support fine-grained data protection without compromising performance, we propose cc-NVM. It first proposes an epoch-based mechanism to aggressively cache the security metadata in the CPU cache while retaining their consistency in NVM. Deferred spreading is also introduced to reduce the calculating overhead for data authentication. Leveraging the hidden ability of data HMACs, we can always recover consistent but old security metadata to its newest version. Compared to Osiris, a state-of-the-art secure NVM, cc-NVM improves performance by 20.4% on average. When the system crashes, instead of dropping all the data due to malicious attacks, cc-NVM is able to detect and locate the exact tampered data while incurring only 29.6% extra write traffic on average.
@inproceedings{yang2019no, title = {No compromises: Secure NVM with crash consistency, write-efficiency and high-performance}, author = {Yang, Fan and Lu, Youyou and Chen, Youmin and Mao, Haiyu and Shu, Jiwu}, booktitle = {2019 56th ACM/IEEE Design Automation Conference}, year = {2019}, organization = {IEEE} }
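The recovery ability mentioned above (recovering consistent but old security metadata to its newest version) can be illustrated in isolation: after a crash, a stale persisted counter is rolled forward until the HMAC computed over the data and the candidate counter matches the stored one. The key, message layout, and roll-forward bound below are assumptions for the sketch, not the paper's exact format.

```python
# Illustrative counter recovery via HMAC: roll a stale persisted counter
# forward until the recomputed HMAC matches the stored one (key, layout,
# and bounds are assumptions for this sketch).
import hashlib
import hmac

KEY = b"secret-key"

def data_hmac(data: bytes, counter: int) -> bytes:
    return hmac.new(KEY, data + counter.to_bytes(8, "little"),
                    hashlib.sha256).digest()

def recover_counter(data, stored_hmac, stale_counter, max_roll=1024):
    """Try successive counter values until one reproduces the stored HMAC."""
    for candidate in range(stale_counter, stale_counter + max_roll):
        if hmac.compare_digest(data_hmac(data, candidate), stored_hmac):
            return candidate
    raise ValueError("tampering detected: no counter reproduces the HMAC")

data = b"cache line contents"
latest = 42                          # counter value lost in the crash
stored = data_hmac(data, latest)     # HMAC persisted with the data
assert recover_counter(data, stored, stale_counter=40) == 42
```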
2018
- [MICRO] LerGAN: A zero-free, low data movement and PIM-based GAN architecture. Haiyu Mao, Mingcong Song, Tao Li, Yuting Dai, and Jiwu Shu. In 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture, 2018.
As a powerful unsupervised learning method, the Generative Adversarial Network (GAN) plays an important role in many domains such as video prediction and autonomous driving. It is one of the ten breakthrough technologies of 2018 reported in MIT Technology Review. However, training a GAN imposes three additional challenges: (1) intensive communication caused by the complex training phases of GANs, (2) many more ineffectual computations caused by special convolutions, and (3) more frequent off-chip memory accesses for exchanging intermediate data between the generator and the discriminator. In this paper, we propose LerGAN, a PIM-based GAN accelerator that addresses the challenges of training GANs. We first propose a zero-free data reshaping scheme for ReRAM-based PIM, which removes the zero-related computations. We then propose a 3D-connected PIM, which can reconfigure connections inside the PIM dynamically according to the dataflows of propagation and updating. Our proposed techniques greatly reduce data movement, preventing I/O from becoming a bottleneck of training GANs. Finally, we propose LerGAN based on these two techniques, providing programmers with different levels of GAN acceleration. Experiments show that LerGAN achieves 47.2×, 21.42×, and 7.46× speedup over an FPGA-based GAN accelerator, a GPU platform, and a ReRAM-based neural network accelerator, respectively. Moreover, LerGAN achieves 9.75× and 7.68× energy savings on average over the GPU platform and the ReRAM-based neural network accelerator, respectively, and consumes 1.04× the energy of the FPGA-based GAN accelerator.
@inproceedings{mao2018lergan, title = {LerGAN: A zero-free, low data movement and PIM-based GAN architecture}, author = {Mao, Haiyu and Song, Mingcong and Li, Tao and Dai, Yuting and Shu, Jiwu}, booktitle = {2018 51st Annual IEEE/ACM International Symposium on Microarchitecture}, lightningtalk = {https://www.youtube.com/watch?v=dmsGaoJKbAU}, year = {2018}, organization = {IEEE} }
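The "ineffectual computations" arise because the generator's transposed convolutions dilate their input with zeros before a plain convolution, so most multiply-accumulates touch a zero operand. The NumPy sketch below counts this waste for a toy case; it illustrates the problem that LerGAN's zero-free data reshaping removes, not the reshaping scheme itself.

```python
# Why GAN training wastes work: a transposed convolution first dilates its
# input with zeros, so most multiply-accumulates in the following convolution
# have a zero operand (illustration of the problem, not LerGAN's scheme).
import numpy as np

def dilate_with_zeros(x, stride=2):
    """Insert (stride - 1) zeros between neighboring input elements."""
    h, w = x.shape
    out = np.zeros((h * stride - 1, w * stride - 1))
    out[::stride, ::stride] = x
    return out

x = np.ones((4, 4))       # toy feature map
d = dilate_with_zeros(x)  # 7x7, only 16 of 49 entries are nonzero
k = 3                     # 3x3 kernel of the following convolution
total_macs = (d.shape[0] - k + 1) * (d.shape[1] - k + 1) * k * k
useful_macs = sum(np.count_nonzero(d[i:i + k, j:j + k])
                  for i in range(d.shape[0] - k + 1)
                  for j in range(d.shape[1] - k + 1))
print(f"useful MACs: {useful_macs}/{total_macs}")  # most multiply by zero
```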
2017
- [DATE] Protect non-volatile memory from wear-out attack based on timing difference of row buffer hit/miss. Haiyu Mao, Xian Zhang, Guangyu Sun, and Jiwu Shu. In Design, Automation & Test in Europe Conference & Exhibition, 2017.
Non-volatile memories (NVMs), such as PCM and ReRAM, have been widely proposed for future main memory design because of their low standby power, high storage density, and fast access speed. However, these NVMs suffer from a write endurance problem. To prevent a malicious program from deliberately wearing out NVMs, researchers have proposed various wear-leveling methods, which remap logical addresses to physical addresses randomly and dynamically. However, we discover that side-channel leakage based on NVM row-buffer hit information can reveal the details of address remappings. Consequently, it can be leveraged to side-step the wear-leveling. Our simulation shows that the attack method proposed in this paper can wear out an NVM within 137 seconds, even with the protection of state-of-the-art wear-leveling schemes. To counteract this attack, we further introduce an effective countermeasure named Intra-Row Swap (IRS) to hide the wear-leveling details. The basic idea is to enable an additional intra-row block swap when a new logical address is remapped to a memory row. Experiments demonstrate that IRS can secure NVMs with negligible timing/energy overhead compared with previous works.
@inproceedings{mao2017protect, title = {Protect non-volatile memory from wear-out attack based on timing difference of row buffer hit/miss}, author = {Mao, Haiyu and Zhang, Xian and Sun, Guangyu and Shu, Jiwu}, booktitle = {Design, Automation \& Test in Europe Conference \& Exhibition}, year = {2017}, organization = {IEEE} }
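The side channel is a simple timing difference: an access whose row is already open in the row buffer (a hit) completes faster than one that must open a new row (a miss), so timing two back-to-back accesses tells an attacker whether two addresses share a row, which is exactly the information that exposes the remapping. The toy bank model and latencies below are illustrative.

```python
# Toy model of the row-buffer timing side channel: a hit (row already open)
# is faster than a miss (row must be opened), so timing two back-to-back
# accesses reveals whether two addresses share a row. Latencies are made up.

ROW_SIZE = 1024
HIT_LATENCY, MISS_LATENCY = 15, 45  # illustrative cycles

class RowBufferBank:
    def __init__(self):
        self.open_row = None

    def access(self, addr):
        """Return the access latency; open the row on a miss."""
        row = addr // ROW_SIZE
        if row == self.open_row:
            return HIT_LATENCY
        self.open_row = row
        return MISS_LATENCY

def same_row(bank, a, b):
    """Attacker probe: access a, then b; a fast second access means a hit."""
    bank.access(a)
    return bank.access(b) == HIT_LATENCY

bank = RowBufferBank()
assert same_row(bank, 0, 512)       # same 1 KiB row -> hit
assert not same_row(bank, 0, 2048)  # different rows -> miss
```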
2015
- [NVMSA] Exploring data placement in racetrack memory based scratchpad memory. Haiyu Mao, Chao Zhang, Guangyu Sun, and Jiwu Shu. In IEEE Non-Volatile Memory System and Applications Symposium, 2015.
Scratchpad memory (SPM) has been widely adopted in various computing systems to improve the performance of data access. Recently, non-volatile memory technologies (NVMs) have been employed for SPM design to improve its capacity and reduce its energy consumption. In this paper, we explore data allocation in SPM based on racetrack memory (RM), an emerging NVM with ultra-high storage density and fast access speed. Since a shift operation is needed to access data in RM, data allocation has an impact on the performance of RM-based SPM. Several allocation methods are discussed and compared in this work. In particular, we address how to leverage a genetic algorithm to achieve near-optimal data allocation.
@inproceedings{mao2015exploring, title = {Exploring data placement in racetrack memory based scratchpad memory}, author = {Mao, Haiyu and Zhang, Chao and Sun, Guangyu and Shu, Jiwu}, booktitle = {IEEE Non-Volatile Memory System and Applications Symposium}, year = {2015}, organization = {IEEE} }
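A minimal version of the genetic-algorithm formulation: a chromosome is a permutation assigning data blocks to track positions, fitness is the total shift distance incurred by an access trace, and elitist selection plus swap mutation evolve the population. The trace, cost model, and parameters below are simplified assumptions, not the paper's exact setup.

```python
# Minimal genetic-algorithm sketch for racetrack data placement (simplified
# assumptions): a chromosome is a permutation of block positions, and fitness
# is the total shift distance the port travels while serving an access trace.
import random

TRACE = [0, 3, 0, 2, 3, 1, 0, 3]  # toy sequence of accessed block IDs
N = 4                             # number of blocks / track positions

def shift_cost(placement):
    """Total port movement: sum of |pos(curr) - pos(prev)| over the trace."""
    pos = {block: i for i, block in enumerate(placement)}
    return sum(abs(pos[a] - pos[b]) for a, b in zip(TRACE, TRACE[1:]))

def mutate(placement):
    """Swap two positions (a standard permutation mutation)."""
    p = placement[:]
    i, j = random.sample(range(N), 2)
    p[i], p[j] = p[j], p[i]
    return p

def evolve(pop_size=20, generations=100):
    population = [random.sample(range(N), N) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=shift_cost)
        survivors = population[:pop_size // 2]  # elitist selection
        population = survivors + [mutate(random.choice(survivors))
                                  for _ in range(pop_size - len(survivors))]
    return min(population, key=shift_cost)

best = evolve()
print(best, shift_cost(best))  # frequently co-accessed blocks end up adjacent
```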