ZKPU: The NVMe-Powered ZK Hardware Solution to Blockchain’s Trilemma
Written By Dingsen Shi and Chris Tsu
Overview
TL;DR:
- ASICs: The Missing Link for Scaling ZKP Applications.
  - The lack of specialized ASICs is a key reason why ZKP has not yet become widely adopted.
  - ASICs are more affordable due to economies of scale and far more energy-efficient than GPUs and FPGAs, enabling scalable and cost-effective ZKP solutions.
- With NVMe protocol support, our ZK chip for proof generation provides:
  - Plug-and-play functionality on any motherboard, ensuring low deployment costs, greater accessibility, and enhanced decentralization.
  - Seamless integration with existing ZK infrastructures, including GPUs and FPGAs.
  - Full compatibility with modern server infrastructure.
- Built with UC Berkeley’s Open-Source RISC-V Chipyard System, our ZK chip provides:
  - Transparency and trust through collaboration with a leading open-source hardware project.
  - A flexible, scalable architecture designed specifically for ZK workloads.
  - Demonstrated reliability backed by successful tape-outs.
- As a Modular Design SoC, our ZK chip:
  - Resolves the PCIe bottleneck with on-chip CPU integration.
  - Simplifies design complexity while enhancing scalability.
  - Leverages existing, production-tested resources to reduce cost and risk.
Whitepaper Walkthrough
1. Introduction
Zero-Knowledge Proofs (ZKPs) are a groundbreaking cryptographic technology that enables one party to prove the validity of a statement to another without revealing any underlying information. ZKPs have applications in privacy-preserving transactions, secure identity verification, scalable computation, and secure data exchange across various domains. However, their adoption has been limited due to the computational intensity of proof generation, which remains a significant bottleneck.
Existing ZKP hardware solutions face critical challenges, including high costs, limited compatibility with modern infrastructure, and reliance on closed-source, monolithic designs. These approaches often require proprietary systems that are expensive, hard to maintain, and incompatible with cloud-based environments or standard hardware protocols like NVMe. Additionally, many of these solutions reinvent proven methodologies, leading to increased development costs and inefficiencies.
To address these challenges, our ZK hardware introduces a modular, open-source design that prioritizes compatibility and efficiency. By leveraging industry-standard protocols like NVMe and integrating seamlessly with existing infrastructures, our chips provide a scalable and accessible solution for proof generation today. Looking ahead, our vision is to enable a cloud-native “Proof as a Service” (PaaS) model, where distributed ZK computations operate efficiently within modern data-center environments, paving the way for ZK adoption across a wide range of real-world applications.
2. Background
This section introduces the challenges and opportunities of Zero-Knowledge Proofs (ZKPs) through a real-world use case in data exchange (2.1). It then delves into the proof generation process (2.2), highlighting its computational intensity and the hardware requirements it demands. Finally, it explores the limitations of existing GPU and FPGA solutions (2.3) and of current ASIC efforts (2.4), which underscore the need for a more efficient and compatible approach to ZK proof generation.
2.1 An Application of ZK in Data Exchange
- Scenario: Company A (a financial institution) wants to sell aggregated credit risk data to Company B (a fintech startup) via a data exchange (e.g., the Shanghai Data Exchange).
- Challenge: Company A needs to protect its proprietary transaction data and comply with privacy regulations, while Company B wants assurance of the data’s quality and compliance before purchasing.
- ZKP Solution:
  - Company A uses Zero-Knowledge Proofs to prove:
    - The dataset is derived from at least 10,000 verified transactions.
    - The data is fully anonymized to meet regulatory requirements.
  - Company A does this without revealing any individual transactions or sensitive information.
- Outcome:
  - Company B confidently purchases the dataset, assured of its quality and compliance.
  - Company A secures its data’s privacy and proprietary value while complying with legal standards.
2.2 Proof Generation
As the example above shows, ZKP technology can address significant challenges and offer substantial benefits. Despite this potential, ZKPs are still not widely adopted in real-world applications. The primary reason is the proof generation process, which is notoriously resource-intensive: generating even a simple proof can take hours on modern CPUs (as current benchmarks suggest), making it impractical for many use cases. Offloading proof generation to specialized coprocessors offers a viable solution. To understand this bottleneck, the rest of this section examines the proof generation process and the challenges it entails.
Proof Generation Process Overview
At a high level, the proof generation process begins with a Zero-Knowledge Virtual Machine (ZKVM), such as those implemented by RISC0 and Scroll. The ZKVM executes programs deterministically, producing an execution trace that serves as the foundation for generating a proof. This trace encodes the computation steps and ensures that the proof aligns with the program’s logic.
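To make the idea of an execution trace concrete, here is a minimal Python sketch of a toy accumulator machine. It is not how RISC0 or Scroll implement their ZKVMs: the two-instruction set (`add`, `mul`), the `TraceRow` layout, and the `check_trace` helper are all assumptions made for illustration. The point is only that deterministic execution yields a table of states, and that correctness can be phrased as a constraint between every pair of consecutive rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceRow:
    step: int  # index of the row in the trace
    pc: int    # program counter of the instruction about to execute
    acc: int   # accumulator value before that instruction executes


def run(program, x):
    """Execute a toy accumulator program and record one trace row per step."""
    acc, trace = x, []
    for pc, (op, arg) in enumerate(program):
        trace.append(TraceRow(step=len(trace), pc=pc, acc=acc))
        if op == "add":
            acc += arg
        elif op == "mul":
            acc *= arg
        else:
            raise ValueError(f"unknown op {op!r}")
    # Final row: the state after the last instruction.
    trace.append(TraceRow(step=len(trace), pc=len(program), acc=acc))
    return acc, trace


def check_trace(program, trace):
    """Return True if every consecutive pair of rows obeys the instruction semantics."""
    for row, nxt in zip(trace, trace[1:]):
        if not 0 <= row.pc < len(program):
            return False
        op, arg = program[row.pc]
        expected = row.acc + arg if op == "add" else row.acc * arg
        if nxt.acc != expected or nxt.pc != row.pc + 1:
            return False
    return True


# Hypothetical program: ((x + 3) * 5) + 1
PROGRAM = [("add", 3), ("mul", 5), ("add", 1)]
result, trace = run(PROGRAM, 2)
```

With input 2 the program returns 26 and produces four trace rows. A real proof system would encode the row-to-row checks as polynomial constraints rather than re-running them in software.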
Once the execution trace is produced, the process transitions into the computationally intensive stages, including witness generation, multiscalar multiplication (MSM), and number-theoretic transform (NTT):
- Witness generation: Constructing auxiliary data to validate the correctness of the computation. 
- MSM (Multi-Scalar Multiplication): Computing the sum of scalar multiples of group elements, that is, the dot product between a vector of scalars and a vector of elliptic curve points.
- NTT (Number Theoretic Transform): The finite-field analogue of the fast Fourier transform, used for fast polynomial multiplication in modular arithmetic.
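The MSM definition above can be sketched in a few lines of Python. This is a naive reference version on a textbook-sized curve, not an optimized or production algorithm. The curve `y^2 = x^3 + 2x + 2` over the 17-element field and the point `G = (5, 1)` are toy parameters chosen for readability; real proof systems use curves with primes of roughly 256 bits or more.

```python
P_MOD = 17   # toy field modulus (assumption; real curves use far larger primes)
A = 2        # curve coefficient a in y^2 = x^3 + a*x + b (b = 2 for this toy curve)
G = (5, 1)   # a point on the toy curve
INF = None   # point at infinity, the identity of the group


def ec_add(p, q):
    """Add two points on the toy curve using the affine formulas."""
    if p is INF:
        return q
    if q is INF:
        return p
    (x1, y1), (x2, y2) = p, q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INF  # p and q are inverses of each other
    if p == q:
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)


def ec_mul(k, p):
    """Scalar multiplication k*p by double-and-add."""
    acc, addend = INF, p
    while k > 0:
        if k & 1:
            acc = ec_add(acc, addend)
        addend = ec_add(addend, addend)
        k >>= 1
    return acc


def msm(scalars, points):
    """Naive multi-scalar multiplication: sum of k_i * P_i."""
    if len(scalars) != len(points):
        raise ValueError("scalars and points must have the same length")
    acc = INF
    for k, p in zip(scalars, points):
        acc = ec_add(acc, ec_mul(k, p))
    return acc
```

For example, `ec_mul(2, G)` evaluates to `(6, 3)`, and `msm([4, 7, 9], [G, ec_mul(2, G), ec_mul(3, G)])` equals `ec_mul(45, G)` because 4 + 2·7 + 3·9 = 45. Each term is independent of the others, which is what makes the computation a natural target for parallel and pipelined hardware.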
Computational Complexity
While these processes may appear straightforward, they involve complex mathematical operations that are computationally intensive. Witness generation, multi-scalar multiplication (MSM), and number-theoretic transform (NTT) require extensive computations over large finite fields and, for MSM, elliptic curve groups, with each arithmetic operation reducing to modular arithmetic on large numbers. This complexity poses a significant bottleneck to the widespread adoption of Zero-Knowledge Proofs (ZKPs).
Moreover, each of these tasks has distinct hardware requirements:
- Witness Generation:
  - Features: This stage involves executing the program to generate intermediate and final values (witnesses) that satisfy the computational constraints. It is a sequential process and benefits from systems that can handle workloads in a step-by-step manner with ease of programmability.
  - Ideal hardware: General-purpose processors (e.g., CPUs) are well suited due to their flexibility in handling sequential logic.
- Multi-Scalar Multiplication (MSM):
  - Features: DSP-intensive. MSM is the most computationally intensive part of proof generation, involving a large number of elliptic curve point multiplications and additions. It is characterized by highly pipeline-friendly computations over group elements, making it a prime candidate for hardware acceleration.
  - Ideal hardware: Custom hardware like ASICs optimized for elliptic curve arithmetic can significantly improve performance. Parallel processing capabilities and efficient handling of scalar multiplications are essential.
- Number-Theoretic Transform (NTT):
  - Features: Memory-access-intensive. NTT is crucial for fast polynomial arithmetic, such as multiplication and evaluation. It is a memory-bound computation with irregular memory access patterns, requiring efficient memory management and high bandwidth.
  - Ideal hardware: Custom hardware with substantial memory bandwidth and optimized dataflow architectures, such as custom ASICs with scalable memory hierarchies (e.g., HBM), is best suited.
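The memory behaviour of the NTT is easier to see in code. The sketch below is a standard iterative radix-2 NTT in Python, used here to multiply two polynomials. The modulus 998244353 (equal to 119·2^23 + 1) with primitive root 3 is a common teaching choice and an assumption of this example; it is not the field of any particular proof system. The helper `butterfly_strides` is hypothetical and exists only to show how the distance between paired elements doubles at every stage.

```python
MOD = 998244353  # 119 * 2**23 + 1, supports power-of-two sizes up to 2**23
ROOT = 3         # a primitive root modulo MOD


def ntt(values, invert=False):
    """Iterative radix-2 NTT over Z_MOD; len(values) must be a power of two."""
    a = list(values)
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    # Bit-reversal permutation: a scattered, non-sequential access pattern.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j ^= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    length = 2
    while length <= n:
        w_len = pow(ROOT, (MOD - 1) // length, MOD)
        if invert:
            w_len = pow(w_len, MOD - 2, MOD)
        half = length // 2
        for start in range(0, n, length):
            w = 1
            for k in range(start, start + half):
                # Butterfly between elements that are `half` positions apart.
                u, v = a[k], a[k + half] * w % MOD
                a[k] = (u + v) % MOD
                a[k + half] = (u - v) % MOD
                w = w * w_len % MOD
        length <<= 1
    if invert:
        n_inv = pow(n, MOD - 2, MOD)
        a = [x * n_inv % MOD for x in a]
    return a


def poly_mul(f, g):
    """Multiply two coefficient lists modulo MOD using the NTT."""
    out_len = len(f) + len(g) - 1
    n = 1
    while n < out_len:
        n <<= 1
    fa = ntt(list(f) + [0] * (n - len(f)))
    ga = ntt(list(g) + [0] * (n - len(g)))
    prod = [x * y % MOD for x, y in zip(fa, ga)]
    return ntt(prod, invert=True)[:out_len]


def butterfly_strides(n):
    """Distance between paired elements at each stage of a size-n transform."""
    return [1 << s for s in range(n.bit_length() - 1)]
```

As a check, `poly_mul([1, 2], [3, 4])` returns `[3, 10, 8]`, matching (1 + 2x)(3 + 4x) = 3 + 10x + 8x². For a transform of size 8, `butterfly_strides(8)` gives `[1, 2, 4]`: at large sizes the later stages pair elements that sit far apart in memory, which is why bandwidth and data layout matter more than raw arithmetic throughput.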
 
2.3 Why ASIC?
Given the computational complexities outlined earlier, specialized coprocessors are becoming essential for the practical application of ZKP. While GPUs and FPGAs have been widely used to address these challenges, they come with significant limitations that underscore the need for a more tailored solution.
The Current Preference for FPGA and GPU
Many in the industry argue that ZK technology is still evolving and, therefore, prefer GPUs and FPGAs for their flexibility. However, this line of thinking overlooks a critical question:
Could the lack of purpose-built hardware be the reason ZK has not yet gained widespread adoption?
To understand this, we must examine the advantages and shortcomings of GPUs and FPGAs.
GPU Solutions: Advantages and Shortcomings
Advantages:
- Low Cost of Entry: GPUs are widely available and relatively affordable for small-scale deployments, making them a common choice for experimentation and initial development. 
- Software Ecosystem: GPUs benefit from a mature ecosystem of software tools, drivers, and frameworks, which simplifies deployment and integration. 
Shortcomings:
- High Power Consumption: GPUs consume significant power, which drives up operational costs and makes them impractical for large-scale ZKP workloads. 
- Scaling Challenges: GPUs are inefficient for modular arithmetic, a cornerstone of ZKP computations, with studies showing only about 20% of their circuitry is utilized. Scaling GPU systems to meet ZKP demands incurs excessive costs and energy usage. 
FPGA Solutions: Advantages and Shortcomings
Advantages:
- Configurability: FPGAs are reconfigurable, allowing them to adapt to new ZK protocols or algorithms.
  - However, this configurability is less relevant for ZKP workloads, as the primary challenge lies in efficiently executing modular arithmetic rather than signal processing or logic handling.
  - In fact, the limited effectiveness of FPGAs in modular arithmetic has led many to move away from FPGA-based solutions for ZK applications.
Shortcomings:
- High Power Consumption: Like GPUs, FPGAs are power-hungry, making them costly and inefficient for sustained operations. 
- Scaling Limitations: FPGAs are slower and less efficient compared to ASICs designed specifically for ZKP tasks. 
- High Cost of Entry and Maintenance: FPGAs are expensive to purchase, and their complexity results in high maintenance costs, which can deter widespread adoption, especially for smaller enterprises. 
The Need for ASICs
The reluctance to adopt ASICs often stems from the perception that ZK technology is “still a mess” and constantly evolving, which discourages investment in fixed-function hardware. However, this perspective may be short-sighted:
A Lesson from OpenAI’s Success
The transformative role of purpose-built hardware is not new. A key factor in OpenAI’s success was its early partnership with NVIDIA in 2016, which gave it access to a massive amount of GPU computing power. This collaboration enabled OpenAI to scale its AI models and achieve breakthroughs that were otherwise unattainable. Similarly, the development of ZK-specific ASICs will provide the necessary computational foundation for ZK to evolve from niche experimentation to widespread application, unlocking its true potential.
As ZK is poised for broader adoption, we see it in a position similar to OpenAI’s in 2016: promising but constrained by a lack of appropriate hardware. Now is the ideal time to invest in ZK ASICs, because:
- Acceleration Enables Adoption: The absence of purpose-built hardware like ASICs is likely one of the key reasons why ZK has not yet gained widespread adoption. Without hardware tailored for ZKP workloads, the technology remains inefficient and inaccessible, discouraging broader experimentation and integration into real-world applications. 
- Cost and Power Advantages: ASICs provide a massive reduction in both cost and power consumption. A single ZK-specific ASIC can outperform FPGA or GPU-based systems while consuming a fraction of the power. This not only makes ZK more affordable but also drastically reduces gas fees, making it viable for broader use cases. 
- Cost Efficiency Through SoC ASIC:
  - Higher DSP Utilization: ASICs maximize DSP usage by focusing on ZKP-specific tasks like modular arithmetic, eliminating unused features, and streamlining data flows. Unlike GPUs and FPGAs, ASICs achieve consistently high resource utilization.
  - Economies of Scale: While the upfront cost of ASIC development is high, the per-unit cost decreases significantly as production volumes increase. For example, at a scale of 100,000 units, ASICs become far more cost-effective than GPUs, making them a highly economical choice for widespread deployment.
- Driving Standardization: Purpose-built ASICs will play a pivotal role in stabilizing the ZK ecosystem by standardizing performance benchmarks and accelerating protocol optimization. As the hardware matures, the industry will gain clarity on how to best leverage ZK, encouraging widespread adoption. 
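The economies-of-scale argument in the list above reduces to simple amortization: the one-time development cost is spread over the number of units produced. The Python sketch below illustrates the arithmetic. The $6 million development figure is the estimate quoted later in this document; the per-chip marginal cost and the GPU price are placeholder values invented for this example, and the comparison assumes one ASIC replaces one GPU, which the source does not state.

```python
import math


def asic_cost_per_unit(nre_usd, marginal_usd, units):
    """Amortized cost per chip: one-time development cost spread over the volume."""
    return nre_usd / units + marginal_usd


def break_even_units(nre_usd, asic_marginal_usd, alternative_unit_usd):
    """Smallest volume at which the amortized ASIC cost is no higher than the alternative."""
    saving = alternative_unit_usd - asic_marginal_usd
    if saving <= 0:
        return None  # the ASIC never catches up on cost alone
    return math.ceil(nre_usd / saving)


NRE_USD = 6_000_000      # development estimate quoted later in this document
ASIC_MARGINAL_USD = 50   # hypothetical per-chip manufacturing cost
GPU_UNIT_USD = 1_500     # hypothetical price of a comparable GPU

cost_at_100k = asic_cost_per_unit(NRE_USD, ASIC_MARGINAL_USD, 100_000)
crossover = break_even_units(NRE_USD, ASIC_MARGINAL_USD, GPU_UNIT_USD)
```

Under these placeholder numbers the amortized cost at 100,000 units is $110 per chip, and the crossover against the assumed GPU price falls at 4,138 units. Different inputs move the crossover, but the shape of the curve is the same: the development cost dominates at low volume and becomes negligible at high volume.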
While GPUs and FPGAs are invaluable for early-stage development, their cost, power inefficiency, and scaling limitations make them unsustainable for the future of ZK technology. The development and adoption of ZK-specific ASICs are not just viable—they are essential to unlocking the full potential of Zero-Knowledge Proofs in real-world applications.
2.4 ASIC Challenges
However, fully established ASIC-level coprocessors designed specifically for ZKP remain absent. Companies like Cysic and Fabric Cryptography are pioneering efforts to develop ASIC solutions. Yet, based on a thorough analysis of their designs and on extensive ASIC industry expertise, we identify critical challenges inherent in their approaches, both in architectural design and in customer-centric adaptability.
Incompatibility with Existing Infrastructure
All current ASIC-based proof generation solutions (e.g., Cysic and Fabric Cryptography) are incompatible with existing infrastructure. These solutions typically consist of large server boxes featuring custom chips that rely on proprietary PCIe-based communication protocols. This approach introduces several major challenges:
- Incompatibility with Server Infrastructure: Current ASIC-based proof generation solutions face major compatibility issues with modern server infrastructure. These systems are costly, bulky, and difficult to maintain, lacking seamless integration with standard motherboards and cloud-based environments. Without support for widely adopted technologies—such as NVMe SSDs and datacenter-grade CPUs—their use is limited to large enterprises that can afford the high costs and dedicate teams for maintenance, further reinforcing centralization. This lack of compatibility renders them unsuitable for data-intensive scenarios, significantly impeding the broader adoption of ZKP technology in practical applications. 
- Recreating the ZKP Ecosystem: Existing ASIC solutions are not only incompatible with modern server ecosystems but also fail to integrate with established GPU and FPGA-based systems that currently dominate ZKP experimentation. Instead of building on these proven infrastructures, these solutions essentially attempt to recreate the entire ZKP hardware stack from scratch. This approach is not only risky but also creates a siloed ecosystem that cannot incorporate advancements or collaborate with existing systems, further delaying ZK technology adoption and increasing development costs. 
Not Open-Source
All existing solutions are closed-source, a practice that has also been common in previous mining machines and cold wallet designs. This lack of transparency raises significant concerns about potential backdoors in the hardware, which could allow manufacturers to exploit private information or misuse the hardware in unauthorized ways. Additionally, closed-source projects fail to address the diverse needs of customers effectively, as they cannot leverage the collective expertise and innovation of the broader community. This is particularly problematic in the blockchain space, where open collaboration and community contributions are foundational principles and critical for driving advancements and building trust.
Latency in CPU-Coprocessor Communication
Among the three main workloads outlined earlier, witness generation is less compute-intensive than MSM or NTT. However, in architectures like Cysic’s—where witness generation is performed on the host while MSM and NTT are offloaded to specialized chips—this design introduces a significant bottleneck. The core issue lies in the data dependency: MSM and NTT modules require large input tables generated during the witness generation phase. This necessitates transferring these tables from the host to the MSM or NTT chips via PCIe, incurring substantial latency.
Surprisingly, this data transfer accounts for approximately 50% of the total proving time. Even Cysic has acknowledged that PCIe bandwidth constraints remain one of their biggest challenges. To address this critical bottleneck, integrating an on-chip CPU becomes almost essential, enabling the system to bypass the inefficiencies of PCIe-based data movement entirely.
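The impact of the transfer bottleneck can be estimated with Amdahl's law. The Python sketch below takes the roughly 50% transfer share stated above as input and computes the overall speedup if only the transfer portion is accelerated. The table size and link bandwidth in the second helper are illustrative assumptions, not measurements from any specific system.

```python
import math


def overall_speedup(transfer_fraction, transfer_speedup):
    """Amdahl's law: speedup of total proving time when only the transfer part gets faster."""
    return 1.0 / ((1.0 - transfer_fraction) + transfer_fraction / transfer_speedup)


def transfer_seconds(num_bytes, bandwidth_bytes_per_s):
    """Idealized time to move a table across a link, ignoring protocol overhead."""
    return num_bytes / bandwidth_bytes_per_s


# Transfer share taken from the text above.
TRANSFER_FRACTION = 0.5

# If on-chip integration removed the transfer entirely (infinite speedup of that part):
best_case = overall_speedup(TRANSFER_FRACTION, math.inf)

# If the transfer merely became ten times faster:
ten_x = overall_speedup(TRANSFER_FRACTION, 10.0)

# Hypothetical table: 2**26 scalars of 32 bytes each over an assumed 16 GB/s link.
table_bytes = (1 << 26) * 32
seconds = transfer_seconds(table_bytes, 16e9)
```

With a 50% transfer share, eliminating the transfer can at most double end-to-end throughput (`best_case` is 2.0), and a tenfold faster link yields about 1.82x. This bounds what an on-chip CPU can gain from avoiding data movement alone; any further improvement has to come from the compute stages.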
High R&D Costs of GPU-Inspired Designs
Current in-development solutions often adopt architectures modeled after GPUs to handle diverse workloads. Designing ZK ASICs using this monolithic GPU-inspired approach significantly increases research and development costs. This strategy essentially reinvents the wheel, abandoning the valuable expertise gained from proven FPGA-based systems for accelerating MSM and NTT. Transitioning from these frameworks to a monolithic design requires starting from scratch, which is both time-intensive and expensive in an error-prone and risk-averse industry.
For instance, in our experience of developing a production-level SSD controller at a 28nm node from scratch, it took four Multi-Project Wafer (MPW) rounds, each costing half a million dollars, and over four years to reach the full-mask stage. Considering that SSD controllers are less complex than GPU-like architectures, designing ZK ASICs with a similar GPU-inspired approach would likely result in even higher costs and extended timelines.
Conclusion: A Top-Down Approach Is Essential
ASIC development for ZKP requires a top-down mindset due to its fixed-function nature. A successful ASIC solution must prioritize:
- I/O Performance: Efficient data movement and reduced system latency are paramount for scalability. 
- Cost Efficiency: Optimizing power consumption and achieving economies of scale are critical for adoption. 
- Usability & Compatibility: Seamless integration with existing ZKP ecosystems, including GPUs and FPGAs, is non-negotiable. 
- System Latency: Minimizing communication delays between components is essential for throughput and overall performance. 
While MSM and NTT speed and implementation are important, they are not the defining factors of a successful ZKP ASIC. The focus must shift to building a cost-effective, scalable, and highly compatible system that can integrate seamlessly into existing infrastructures while addressing the real-world needs of ZKP applications.
Existing solutions often fail to adopt this top-down approach. Instead, they rely on a mining farm mindset, focusing narrowly on isolated hardware performance without considering the broader ecosystem. This approach leads to siloed, inflexible designs that cannot adapt to the evolving needs of ZKP or integrate with modern datacenter and ZKP infrastructures.
To truly unlock the potential of ZKP technology, ASIC development must embrace an ecosystem-driven strategy. This means prioritizing modularity, scalability, and compatibility with the existing and future ZKP landscape—ensuring that ZKP ASICs are not just hardware solutions but foundational components of a larger, interoperable ecosystem.
3. ZKPU Overview
The goal of ZKPU is to provide a ZKP chip with a modular architecture that delivers high performance and low cost for Zero-Knowledge Proofs, making them accessible to everyone. By leveraging existing ZKP FPGA IPs (e.g., MSM, NTT) and integrating state-of-the-art ASIC technology—both in architecture and packaging—ZKPU aims to enable on-demand, high-throughput processing. With built-in support for the NVMe protocol, ZKPU ensures seamless compatibility with modern infrastructure, including cloud environments, while maintaining a cost-efficient and scalable approach.
3.1 ZKPU Architecture

To address the limitations of existing architectures, we have developed a modular design with integrated NVMe support, incorporating the following key components:
- Integrated Chipyard RISC-V CPU: Our design includes RISC-V CPUs built on UC Berkeley’s Chipyard framework, which offers a flexible and modular bus architecture. This on-chip CPU is critical for reducing data movement latency and efficiently managing witness generation. It handles intermediate results in the on-chip DDR and directly interfaces with the hardware APIs of the NTT and MSM modules for precise, fine-grained operations when needed. 
- NVMe Endpoint: This module enables seamless data transfer between the ZKPU chip and the host system using the standardized NVMe protocol, widely adopted in SSDs. By integrating NVMe, our architecture ensures compatibility with existing operating systems, facilitating effortless integration with current infrastructure. 
- Dedicated Algorithm Modules: These modules are custom-designed to optimize their respective algorithms. They interface with the CPU and DRAM through Direct Memory Access (DMA), enabling high-speed data transfers and seamless operation. 
- GDDR/HBM Module: High-bandwidth memory (e.g., GDDR or HBM) is integrated to handle memory-intensive operations like NTT, which involves significant data shuffling and high-frequency memory access. 
Algorithm modules expose hardware APIs for fine-grained modular arithmetic operations, such as Montgomery multiplication. These APIs allow the CPU to offload granular computations to the modules by issuing commands through register read-write operations, which function equivalently to a RISC-V instruction set extension. This modular design not only boosts performance but also provides the flexibility required to effectively manage diverse ZKP workloads.
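The register-based offload described above can be mocked in software. The Python sketch below models a modular-multiplication unit that performs Montgomery multiplication when a command register is written. The register offsets, the command code, and the class and function names are all hypothetical; the source does not specify the actual ZKPU register map. Only the Montgomery reduction itself is the standard algorithm.

```python
class MockModMulUnit:
    """Software stand-in for a memory-mapped Montgomery multiplier (hypothetical layout)."""

    REG_A, REG_B, REG_CMD, REG_RESULT = 0x00, 0x08, 0x10, 0x18  # assumed offsets
    CMD_MONT_MUL = 0x1                                          # assumed command code

    def __init__(self, modulus, bits):
        if modulus % 2 == 0 or modulus >= 1 << bits:
            raise ValueError("modulus must be odd and smaller than R = 2**bits")
        self.n, self.bits = modulus, bits
        self.r_mask = (1 << bits) - 1
        # n_prime = -n^{-1} mod R, the constant used by Montgomery reduction.
        self.n_prime = (-pow(modulus, -1, 1 << bits)) % (1 << bits)
        self.regs = {self.REG_A: 0, self.REG_B: 0, self.REG_RESULT: 0}

    def _redc(self, t):
        """Montgomery reduction: returns t * R^{-1} mod n for 0 <= t < R * n."""
        m = ((t & self.r_mask) * self.n_prime) & self.r_mask
        u = (t + m * self.n) >> self.bits
        return u - self.n if u >= self.n else u

    def write(self, addr, value):
        if addr == self.REG_CMD:
            if value != self.CMD_MONT_MUL:
                raise ValueError("unknown command")
            product = self.regs[self.REG_A] * self.regs[self.REG_B]
            self.regs[self.REG_RESULT] = self._redc(product)
        elif addr in (self.REG_A, self.REG_B):
            self.regs[addr] = value
        else:
            raise ValueError("write to unknown or read-only register")

    def read(self, addr):
        return self.regs[addr]


def mont_mul(dev, a, b):
    """Driver-side helper: one Montgomery multiplication via register accesses."""
    dev.write(dev.REG_A, a)
    dev.write(dev.REG_B, b)
    dev.write(dev.REG_CMD, dev.CMD_MONT_MUL)
    return dev.read(dev.REG_RESULT)


def mod_mul(dev, a, b):
    """Compute a * b mod n (for 0 <= a, b < n) using only Montgomery operations."""
    r2 = pow(1 << dev.bits, 2, dev.n)    # R^2 mod n, a precomputed constant
    a_bar = mont_mul(dev, a, r2)         # a * R mod n (Montgomery form)
    b_bar = mont_mul(dev, b, r2)         # b * R mod n
    c_bar = mont_mul(dev, a_bar, b_bar)  # a * b * R mod n
    return mont_mul(dev, c_bar, 1)       # convert back: a * b mod n
```

A caller would create a unit with, say, `MockModMulUnit(97, 8)` and call `mod_mul(dev, 42, 77)`, which agrees with `(42 * 77) % 97`. In a long computation the operands would stay in Montgomery form between operations, so the conversions in `mod_mul` would be paid only at the start and end.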
3.2 How ZKPU Overcomes Key Challenges
The ZKPU architecture is designed with a clear focus on overcoming the core challenges previously outlined. By combining compatibility and modularity, ZKPU redefines the approach to proof generation for Zero-Knowledge Proofs (ZKPs). Below, we detail how each challenge is addressed through the innovative features of the architecture:
3.2.1 Support of NVMe Protocol
As mentioned above, one of the greatest challenges of existing ZK hardware solutions is their incompatibility with modern infrastructure. Current designs rely on proprietary, closed systems that are expensive, bulky, and difficult to maintain. These solutions lack support for standard interfaces, making them incompatible with existing motherboards and datacenter-based environments. This severely limits their accessibility, restricts adoption to large enterprises, and exacerbates centralization in ZKP applications.
To address these issues, our approach integrates support for the NVMe protocol, a universally adopted standard in modern computing. This ensures seamless compatibility with existing infrastructure while providing scalability, modularity, and adaptability to evolving workloads. Below, we delve into what NVMe is and how it helps solve these challenges.
What is NVMe?
NVMe (Non-Volatile Memory Express) is a widely adopted storage protocol designed for high-performance and efficient access to non-volatile storage media, such as SSDs. It is supported across virtually all modern operating systems and computing platforms, making it a key component in ensuring compatibility with existing infrastructure.
How Does NVMe Help?
To address the shortcomings of existing ASIC-based proof generation solutions, our architecture fully integrates support for the NVMe protocol, offering significant advantages over current designs that rely on proprietary communication systems:
- Seamless Compatibility with Existing Infrastructure: Unlike closed and proprietary solutions, NVMe is a universally supported standard across modern computing platforms. Integrating NVMe allows our hardware to connect effortlessly with existing motherboards and systems, eliminating the need for custom drivers or complex configurations and significantly reducing deployment complexity and costs. For instance, in our demo, we showcase direct interaction with our MSM module on a VCU118 FPGA, leveraging NVMe’s native support on an Ubuntu 22.04 system using nvme-cli tools, running on a regular SuperMicro motherboard. 
- Integration with NVIDIA GPUDirect and ZKLink: One of NVMe’s most transformative features is its ability to facilitate advanced interconnection capabilities like NVIDIA’s GPUDirect. GPUDirect enables direct communication between NVMe devices and NVIDIA GPUs, bypassing the CPU and dramatically reducing latency. Leveraging this capability, we envision a protocol akin to NVIDIA’s NVLink, which we call ZKLink, based on NVMe. ZKLink facilitates high-speed, peer-to-peer communication between ZKPUs, GPUs, and other devices, creating a unified and interconnected ecosystem. This makes it possible for GPUs to handle general-purpose tasks while ZKPUs accelerate ZK workloads, ensuring optimal efficiency and flexibility. 
- Interconnectivity Across Diverse Devices: NVMe support ensures that our ZKPU can interconnect seamlessly with other ZKPUs, GPUs, and even FPGA systems, provided they also support NVMe. This universal compatibility enables peer-to-peer communication across a wide range of devices, fostering collaboration between heterogeneous systems. Whether in a distributed cloud environment or on-premises, our ZKPU leverages NVMe to integrate with existing infrastructure, enabling scalable and coordinated proof generation. 
- Modularity and Scalability: Supporting NVMe enables a modular design, where components such as CPUs, MSM/NTT accelerators, and storage devices can be easily upgraded or replaced. This adaptability allows our solution to scale with evolving workloads and advancements in hardware, ensuring long-term viability and flexibility. 
- Higher Server Capacity: A single server can support up to 24 NVMe cards or more, which is at least three times the typical number of PCIe slots available. This increased capacity enables each server to host a greater number of ZKPU cards, substantially boosting the overall proof generation throughput while ensuring efficient utilization of resources. 
- Integration with Data Center Infrastructure: NVMe’s compatibility with modern data center ecosystems ensures our architecture is optimized for scenarios requiring intensive data processing. Direct integration with cloud-based resources, such as NVMe SSDs and datacenter-grade CPUs, makes our solution well-suited for data-intensive ZKP applications, significantly broadening its applicability. 
- Community and Ecosystem Support: By adopting NVMe, our design benefits from a robust ecosystem of tools, drivers, and community support. This eliminates reliance on proprietary ecosystems, encouraging innovation and accelerating adoption. 
- Future-Proof Design: NVMe continues to evolve with advancements in speed, features, and compatibility, ensuring that our solution remains aligned with industry trends and ready to leverage future developments. 
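One reason an NVMe endpoint can carry accelerator traffic is that every NVMe command is a fixed 64-byte submission queue entry, and the specification sets aside opcode ranges for vendor-specific commands. The Python sketch below packs such an entry. The source does not describe how ZKPU encodes its commands, so the opcode value `0x80`, the use of command dword 10 to carry a point count, and the function names are assumptions for illustration only. In practice the host driver fills in the data pointers, and a tool such as nvme-cli can submit passthrough commands on the user's behalf.

```python
import struct

# 64-byte NVMe submission queue entry, little-endian:
#   opcode (1 byte), flags (1), command id (2), namespace id (4),
#   command dwords 2-3 (8), metadata pointer (8), PRP1 (8), PRP2 (8),
#   command dwords 10-15 (6 x 4).
SQE_FORMAT = "<BBHIQQQQ6I"
SQE_SIZE = struct.calcsize(SQE_FORMAT)

ZK_OPCODE_MSM = 0x80  # hypothetical opcode in the vendor-specific range


def build_sqe(opcode, command_id, nsid, prp1, prp2=0, cdw10_15=(0, 0, 0, 0, 0, 0)):
    """Pack one submission queue entry; the field meanings of cdw10_15 are command-specific."""
    if len(cdw10_15) != 6:
        raise ValueError("cdw10_15 must contain exactly six dwords")
    return struct.pack(
        SQE_FORMAT,
        opcode,
        0,           # flags: no fused operation, PRP-based data transfer
        command_id,
        nsid,
        0,           # command dwords 2-3, unused here
        0,           # metadata pointer, unused here
        prp1,
        prp2,
        *cdw10_15,
    )


# Hypothetical MSM request: input buffer at 0x1000, 1024 points encoded in dword 10.
sqe = build_sqe(ZK_OPCODE_MSM, command_id=7, nsid=1, prp1=0x1000,
                cdw10_15=(1024, 0, 0, 0, 0, 0))
```

Because the host-visible interface is the same command format that SSDs use, the operating system's stock NVMe driver can deliver such a command without a custom kernel module, which is the compatibility property the list above relies on.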

3.2.2 Adopting UC Berkeley’s Chipyard System
The UC Berkeley Chipyard SoC generator system serves as the foundation of our hardware design. By mastering this open-source, flexible, and robust framework, we have harnessed its full potential to address the unique challenges of ZK (Zero-Knowledge) hardware development. In this section, we highlight the key advantages of Chipyard that make it an ideal platform for designing advanced ZK architectures while overcoming the limitations of closed-source and inflexible hardware solutions:
- Open-Source Architecture: Chipyard is fully open-source, addressing critical concerns about transparency and trust in hardware design. Unlike closed-source solutions that may hide backdoors or limit collaboration, Chipyard’s openness ensures every design aspect is accessible and verifiable. This transparency builds trust within the ZK and blockchain communities, where openness is foundational to innovation. Moreover, it invites contributions from a global developer community, accelerating improvements and fostering collective advancement in ZK hardware solutions. 
- Object-Oriented Architecture: Chipyard’s object-oriented architecture provides unmatched flexibility for creating hardware optimized for ZK-specific workloads. Its dynamic and parameterized design system is particularly advantageous for ZK applications:
  - Customizable RTL Modules: The modular nature of Chipyard allows developers to seamlessly integrate and configure key ZK components like MSM and NTT accelerators. This ensures the architecture can be tailored to specific proof-generation protocols without the need for extensive redesign.
  - Scalable NoC Support: For the large-scale parallelism required in ZK operations, Chipyard’s native support for Network-on-Chip (NoC) architectures enables efficient communication between computational modules. At the same time, it allows switching to simpler bus architectures for lightweight tasks like witness generation, ensuring adaptability and optimal performance.
- Agile Development and Prototyping: ZK hardware development demands iterative testing and refinement due to the complexity and diversity of workloads. Chipyard’s development environment streamlines this process:
  - Fast Prototyping: Developers can rapidly prototype ZK-specific components, such as modular arithmetic engines, MSM, and NTT accelerators, ensuring quick validation and iteration of design choices.
  - Optimized Workflows: Its modular framework allows fine-grained adjustments to resources, such as tweaking DSP allocations for MSM or optimizing memory hierarchies for NTT, crucial for real-world ZK deployments.
  - Comprehensive Toolchain: Chipyard provides a complete suite of tools, including simulators, debuggers, and verification utilities, simplifying the validation of complex ZK workflows and reducing production risks.
- Tape-Out Proven for Reliability: One of Chipyard’s most compelling advantages is its proven track record in successful chip tape-outs. It has been used to design and produce multiple production-level chips, demonstrating its reliability and suitability for real-world applications. For ZK hardware developers, this provides assurance that Chipyard-based designs can transition smoothly from prototype to production with minimal risk of costly errors. 
3.2.3 Modular Design
Our modular design balances innovation with efficiency by combining our custom MSM module with proven, production-tested resources. Unlike monolithic architectures that reinvent the wheel, we leverage existing solutions to reduce costs and development risks. Here are the key advantages of this approach:
- Leveraging Proven Resources: Our system builds on a foundation of established and production-tested resources. While open-source contributions like ZPrize designs play a significant role in our development, we also see opportunities for potential collaboration with industry leaders such as Ingonyama for their MSM module and Irreducible for their NTT module. Both have demonstrated success in production environments and, if partnerships are established, could further enhance our system’s capabilities. By combining these proven resources, we significantly reduce development risks. Based on our tape-out experience, leveraging such designs allows us to achieve a time-to-market of approximately one year, with development costs estimated at around $6 million—highlighting the effectiveness of our modular approach. 
- Avoiding Reinvention: By relying on established frameworks, we eliminate the need for costly, error-prone development from scratch. This allows us to focus on advancing modularity and adaptability. Looking forward, our architecture is designed to incorporate technologies such as embedded FPGA (eFPGA) cores, which further enhance flexibility and scalability. 
  - Adaptability with eFPGA Integration: An eFPGA core is a powerful addition to our modular design, offering programmable flexibility within the SoC. Unlike standalone FPGAs, an eFPGA integrates directly into the ASIC, allowing us to specify the exact logic, DSP, and memory resources required for ZKP workloads. This tailored flexibility provides several advantages: 
    - Protocol Adaptability: eFPGA technology enables our system to adapt dynamically to new proof protocols and cryptographic primitives, ensuring relevance in a rapidly evolving ZKP landscape. 
    - Resource Abstraction: By wrapping DSP and memory resources into configurable units, eFPGA cores simplify resource allocation across modules like MSM and NTT accelerators, reducing complexity and improving efficiency. 
    - Cost and Power Efficiency: By eliminating the unnecessary features of standalone FPGAs, eFPGAs lower system costs, reduce power consumption, and save board space, making them ideal for high-volume production. 
- Compatibility and Adaptability: Our modular design ensures smooth integration with existing FPGA-server setups and facilitates a seamless transition to ASIC deployment. With the potential for eFPGA integration, we further future-proof our architecture to handle evolving workloads, meet new performance requirements, and support next-generation ZKP protocols with minimal disruption. 
4. Our Vision: PaaS (Proof as a Service)
Our vision for zero-knowledge (ZK) proof technology is to establish it as a datacenter-level Proof as a Service (PaaS) solution, enabling scalable, efficient, and accessible proof generation for modern cloud infrastructures. This approach has the potential to empower applications like the Shanghai Data Exchange, where ZK can ensure secure and privacy-preserving data sharing at scale.
To realize this vision, our architecture focuses on addressing key challenges and leveraging modern cloud technologies. In the following sections, we outline the foundational elements of our approach:
4.1 Data-Center-Level Integration
For ZK proof systems to seamlessly integrate into cloud environments, they must support data-center-level protocols, with NVMe playing a pivotal role. NVMe’s high performance and wide adoption ensure compatibility with existing infrastructure while enabling efficient interactions with NVMe SSDs. These interactions are crucial for storing and retrieving intermediate proof data during ZK computations, making NVMe a cornerstone of our architecture.
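As a rough illustration of storing and retrieving intermediate data, the sketch below writes fixed-size chunks of field elements to a scratch file and reads them back by offset. It is a minimal host-side sketch, not our device interface: the `ScratchStore` class, the 32-byte element width, and the file layout are all assumptions made for the example, and the temporary file merely stands in for a path on an NVMe-backed volume.

```python
import os
import tempfile

ELEMENT_BYTES = 32  # assumed width of one field element (256 bits)


class ScratchStore:
    """Hypothetical fixed-size chunk store backed by a single file.

    In a deployment the path would sit on an NVMe-backed volume; here it is
    an ordinary file so that the sketch runs anywhere.
    """

    def __init__(self, path, chunk_len):
        self.path = path
        self.chunk_len = chunk_len
        self.chunk_bytes = chunk_len * ELEMENT_BYTES
        # Create the file if it does not exist yet.
        open(path, "ab").close()

    def write_chunk(self, chunk_id, values):
        """Serialize one chunk and write it at offset chunk_id * chunk_bytes."""
        if len(values) != self.chunk_len:
            raise ValueError("chunk has the wrong length")
        payload = b"".join(v.to_bytes(ELEMENT_BYTES, "big") for v in values)
        with open(self.path, "r+b") as f:
            f.seek(chunk_id * self.chunk_bytes)
            f.write(payload)

    def read_chunk(self, chunk_id):
        """Read one chunk back; fails if the offset lies past the end of the file."""
        with open(self.path, "rb") as f:
            f.seek(chunk_id * self.chunk_bytes)
            payload = f.read(self.chunk_bytes)
        if len(payload) != self.chunk_bytes:
            raise KeyError(f"chunk {chunk_id} lies beyond the end of the file")
        return [
            int.from_bytes(payload[i:i + ELEMENT_BYTES], "big")
            for i in range(0, self.chunk_bytes, ELEMENT_BYTES)
        ]


if __name__ == "__main__":
    scratch_dir = tempfile.mkdtemp()
    store = ScratchStore(os.path.join(scratch_dir, "scratch.bin"), chunk_len=4)
    store.write_chunk(0, [1, 2, 3, 4])
    print(store.read_chunk(0))
```

Fixed-size chunks keep the offset arithmetic trivial, which is why the sketch uses them; a real system would more likely choose chunk sizes to match the access pattern of the MSM or NTT stage that consumes the data.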
4.2 Scalable Distributed Proof Generation with Peer-to-Peer Communication
As ZK technology evolves, the size and complexity of proving workloads are expected to grow rapidly, requiring new strategies to maintain scalability. A key solution is distributed proof generation, where a large proving task is divided into smaller sub-proofs that are processed in parallel across multiple ZKPUs. This approach accelerates proof generation and relieves the computational bottleneck of a single device.
Efficient peer-to-peer communication is critical for synchronizing these sub-proofs and aggregating the final result. For example, proving systems such as Halo 2 support recursive proof composition, in which smaller sub-proofs are later combined into a single comprehensive proof. This process relies on fast and reliable data exchange between computational units.
Inspired by NVIDIA’s NVLink technology in AI applications, which enables high-speed peer-to-peer communication between GPUs, we envision a similar system for ZK. By building peer-to-peer communication capabilities (which we call ZKLink) on the foundation of NVMe, our architecture facilitates direct data sharing between ZKPUs, bypassing traditional bottlenecks such as host-mediated transfers over PCIe. This ensures seamless coordination of sub-proofs, even in highly distributed environments.
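The split, process-in-parallel, and aggregate pattern described in this section can be sketched on an MSM-like sum. This is a toy under stated assumptions, not a proof system and not recursive aggregation: integers modulo a prime stand in for elliptic-curve points, worker threads stand in for ZKPUs, and the function names (`partial_msm`, `distributed_msm`) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Toy group (assumption): integers modulo a Mersenne prime under addition.
# A real MSM would use elliptic-curve points instead.
MODULUS = 2**61 - 1


def partial_msm(scalars, points):
    """Compute sum(s_i * P_i) over one slice of the input."""
    acc = 0
    for s, p in zip(scalars, points):
        acc = (acc + s * p) % MODULUS
    return acc


def split(items, parts):
    """Split a list into at most `parts` contiguous chunks."""
    size = max(1, -(-len(items) // parts))  # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]


def distributed_msm(scalars, points, num_workers):
    """Evaluate the sum in chunks, one chunk per worker, then combine."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    if len(scalars) != len(points):
        raise ValueError("scalars and points must have the same length")
    scalar_chunks = split(scalars, num_workers)
    point_chunks = split(points, num_workers)
    # Each worker thread stands in for one ZKPU handling one chunk.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(partial_msm, scalar_chunks, point_chunks))
    # Aggregation step: combine the partial results with the group operation.
    total = 0
    for part in partials:
        total = (total + part) % MODULUS
    return total


if __name__ == "__main__":
    scalars = list(range(1, 101))
    points = [(7 * i + 3) % MODULUS for i in range(100)]
    assert distributed_msm(scalars, points, 4) == partial_msm(scalars, points)
```

The split works here because the sum is associative, so partial results can be combined in any order. Aggregating real sub-proofs is considerably more involved, and the only data exchanged in this toy is one group element per worker.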
4.3 Elastic Zero-Knowledge Workload
In addition to distributed proof generation, we envision the future of ZK technology driven by elastic zero-knowledge workloads, where function-specific ZKPUs dynamically allocate resources to meet diverse and evolving computational demands. By leveraging ZKLink and a shared memory pool, this architecture introduces flexibility and efficiency in resource management.
- Function-Specific ZKPUs: These specialized ZKPUs are optimized for tasks like MSM or NTT, ensuring maximum throughput and performance for distinct workloads. 
- Shared Memory Pool: The shared memory pool serves as a unified and scalable resource hub accessible by all ZKPUs, eliminating memory fragmentation and bottlenecks. It can be implemented in two ways: 
  - Storage-Based Solutions: Using NVMe SSDs ensures compatibility with existing infrastructure, providing high-performance, low-latency storage for intermediate data during ZK computations. 
  - Memory-Based Solutions: Leveraging a memory pool with CXL (Compute Express Link) support allows scalable access to high-bandwidth, low-latency memory resources, ensuring optimal performance for diverse ZK workloads. 
- Dynamic Resource Allocation: By dynamically redistributing resources across workloads, this approach ensures high efficiency, minimizes underutilization, and adapts to changing proof-generation requirements. 
This elastic architecture aligns perfectly with cloud-native environments, allowing systems to scale fluidly based on workload intensity. It represents a paradigm shift from rigid hardware configurations to a flexible, demand-driven ecosystem that optimizes performance and cost-efficiency for Zero-Knowledge Proof applications.
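One way to picture the elastic workload model is a dispatcher that routes each task to the least-loaded unit of the matching function and occasionally retargets an idle unit toward the busiest function. The sketch below is illustrative only: the `Unit` and `Pool` classes, the abstract cost units, and the assumption that a unit can be retargeted at all (for example through a reconfigurable region) are hypothetical and are not specified by the architecture described here.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    name: str
    kind: str      # function the unit currently serves, e.g. "msm" or "ntt"
    load: int = 0  # outstanding work in abstract cost units


@dataclass
class Pool:
    units: list = field(default_factory=list)

    def dispatch(self, kind, cost):
        """Send a task to the least-loaded unit that serves `kind`."""
        candidates = [u for u in self.units if u.kind == kind]
        if not candidates:
            raise LookupError(f"no unit configured for {kind}")
        target = min(candidates, key=lambda u: u.load)
        target.load += cost
        return target.name

    def rebalance(self):
        """Retarget one idle unit to the busiest function, if possible.

        Assumption: units can change function. At least one unit of every
        function is always kept, so no task kind becomes unserviceable.
        """
        if not self.units:
            return None
        load_by_kind = {}
        count_by_kind = {}
        for u in self.units:
            load_by_kind[u.kind] = load_by_kind.get(u.kind, 0) + u.load
            count_by_kind[u.kind] = count_by_kind.get(u.kind, 0) + 1
        busiest = max(load_by_kind, key=load_by_kind.get)
        if load_by_kind[busiest] == 0:
            return None  # nothing is queued, so there is nothing to rebalance
        for u in self.units:
            if u.load == 0 and u.kind != busiest and count_by_kind[u.kind] > 1:
                u.kind = busiest
                return u.name
        return None


if __name__ == "__main__":
    pool = Pool([Unit("z0", "msm"), Unit("z1", "msm"),
                 Unit("z2", "ntt"), Unit("z3", "ntt")])
    pool.dispatch("msm", 5)
    pool.dispatch("msm", 3)
    print(pool.rebalance())
```

Least-loaded dispatch is chosen here only because it is simple to reason about; a production scheduler would also need to account for data placement in the shared memory pool and for the cost of retargeting a unit.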
4.4 PaaS: The Future of ZKP

Our Proof as a Service (PaaS) vision reimagines ZK technology as a cloud-native solution designed for large-scale, distributed applications. By integrating data-center protocols like NVMe, supporting scalable proof splitting, and enabling advanced peer-to-peer communication, we aim to deliver a robust, future-ready infrastructure. With the envisioned integration of more advanced node-to-node communication infrastructure (such as the Broadcom switch illustrated in this figure), our architecture ensures seamless coordination not only across ZKPUs but also with GPUs and FPGAs, leveraging NVIDIA’s GPUDirect for efficient communication. This interconnectivity maximizes throughput and efficiency, allowing diverse hardware to collaborate effectively.
This vision addresses the demands of current ZK applications while enabling transformative use cases in blockchain, AI, and beyond, ensuring scalability, flexibility, and long-term relevance.
5. Conclusion
In this whitepaper, we have outlined the challenges facing current ZK hardware solutions and presented our vision for a scalable, efficient, and modular approach to zero-knowledge proof generation. By leveraging open-source tools like UC Berkeley’s Chipyard, adopting industry-standard protocols such as NVMe, and emphasizing a modular architecture, we aim to create a solution that is transparent, cost-effective, and adaptable to evolving ZK workloads.
Our vision for Proof as a Service (PaaS) takes ZK technology to the next level, enabling seamless integration with modern cloud infrastructure while addressing the growing demands for scalability and cross-unit communication. With a foundation rooted in community collaboration and proven hardware designs, we are confident that our approach will accelerate the adoption of ZK technologies in real-world applications and establish a new standard for efficiency and trust in this space.
As ZK continues to evolve, our commitment to innovation, openness, and collaboration remains steadfast. We invite researchers, developers, and industry leaders to join us in shaping the future of ZK infrastructure and unlocking the full potential of this transformative technology.
