nexusstc/Efficient Execution of Irregular Dataflow Graphs. Hardware/Software Co-optimization for Probabilistic AI and Sparse Linear Algebra/311374d4a8fdb1bc1ebe1b56c7546003.pdf
Efficient Execution of Irregular Dataflow Graphs: Hardware/Software Co-optimization for Probabilistic AI and Sparse Linear Algebra
Nimish Shah, Wannes Meert, Marian Verhelst
Springer International Publishing, Springer Nature, Cham, 2023
English [en] · PDF · 12.0MB · 2023 · 📘 Book (non-fiction) · 🚀/lgli/lgrs/nexusstc/zlib
Description
This book focuses on the acceleration of emerging irregular sparse workloads posed by novel artificial intelligence (AI) models and sparse linear algebra. Specifically, the book outlines several co-optimized hardware-software solutions for a highly promising class of emerging sparse AI models called Probabilistic Circuits (PCs) and a similar sparse matrix workload for triangular linear systems (SpTRSV). The authors describe optimizations for the entire stack, targeting applications, compilation, hardware architecture, and silicon implementation, resulting in orders of magnitude higher performance and energy efficiency compared to existing state-of-the-art solutions. Thus, this book provides important building blocks for the upcoming generation of edge AI platforms.
Alternative filename
lgli/Efficient_Execution_of_Irregular_Dataflow_Graphs_2023.pdf
Alternative filename
lgrsnf/Efficient_Execution_of_Irregular_Dataflow_Graphs_2023.pdf
Alternative filename
zlib/no-category/Nimish Shah, Wannes Meert, Marian Verhelst/Efficient Execution of Irregular Dataflow Graphs. Hardware/Software Co-optimization for Probabilistic AI and Sparse Linear Algebra_25655465.pdf
Alternative publisher
Springer Nature Switzerland AG
Alternative edition
Switzerland, Switzerland
Alternative edition
155
Metadata comments
{"isbns":["3031331354","3031331362","9783031331350","9783031331367"],"last_page":2023,"publisher":"Springer"}
Alternative description
Preface
Contents
List of Abbreviations
List of Symbols
List of Figures
List of Tables
978-3-031-33136-7_1
1 Irregular Workloads at Risk of Losing the Hardware Lottery
1.1 Domain Specialization and the Hardware Lottery
1.2 Recent Trends and Irregular Workloads
1.3 Introduction to Graphs
1.4 Target Workloads
1.4.1 Probabilistic Circuit (PC)
1.4.2 Sparse Matrix Triangular Solves (SpTRSV)
1.4.3 Comparison of PC and SpTRSV
1.5 Open Research Questions for Efficient Execution of Irregular DFGs
1.5.1 Q1: What Type of Data Representation Is Suitable?
1.5.2 Q2: How Can We Parallelize Irregular DFGs Effectively?
1.5.3 Q3: How Can We Improve the Throughput and Energy Efficiency Through a Custom Processor Architecture?
1.5.4 Q4: How Can We Improve the Hardware Further Through a Dedicated Datapath Design?
1.6 Book Contributions
1.6.1 ProbLP and the Custom Posit Representation
1.6.2 GraphOpt: A Tool for Effective Parallelization of DFGs
1.6.3 DAG Processing Unit: Version 1 (DPU)
1.6.4 DAG Processing Unit-Version 2 (DPU-v2)
978-3-031-33136-7_2
2 Suitable Data Representation: A Study of Fixed-Point, Floating-Point, and Posit™ Formats for Probabilistic AI
2.1 Error-Bound Analysis
2.1.1 Fixed-Point Error Models
2.1.2 Floating-Point Error Models
2.1.3 Ensuring That All the Intermediate Values of a PC Are Within the Range
2.1.4 Error Propagation from PC Inputs to the Output
2.2 ProbLP
2.2.1 Bounds for Probabilistic Queries
Marginal Probability and MPE
Conditional Probability
2.2.2 Selecting Optimal Representation
2.2.3 Automatic Hardware Generation
2.2.4 Experimental Results
Validation of Bounds
Overall Performance
2.3 Beyond Fixed and Floating Point: Posit Representation
2.4 Conclusions
978-3-031-33136-7_3
3 GraphOpt: Constrained-Optimization-Based Parallelization of Irregular Workloads for Multicore Processors
3.1 Graph Partitioning for Parallelization
3.2 GraphOpt
3.2.1 Recursive Two-Way Partitioning (M1)
Optimization Model for the Two-Way Partitioning
Example
3.2.2 Workload Balancing (M2)
3.2.3 Scale to Large Graphs (S1, S2, S3)
Consider Limited Layers (S1)
Independent Connected Components (S2)
Heuristic Coarsening (S3)
3.3 Performance Evaluation
3.3.1 Experimental Setup
3.3.2 Analysis of Super Layers
How Large Are the Super Layers?
Workload Balancing
Throughput Scaling
Impact of the Scalability Techniques
3.3.3 Comparison with State-of-the-Art Libraries
Sparse Matrix Triangular Solves
Probabilistic Circuits
3.4 Discussion and Related Work
3.4.1 Sparse Triangular Solves
3.4.2 Probabilistic Circuits
3.4.3 Graph Partitioning
3.4.4 DAG Scheduling
3.5 Conclusion
978-3-031-33136-7_4
4 DAG Processing Unit Version 1 (DPU): Efficient Execution of Irregular Workloads on a Multicore Processor
4.1 Challenges Due to Irregularity
4.1.1 SIMD Unfriendly
4.1.2 Frequent Synchronizations
4.1.3 Inefficient Use of Caches
4.1.4 Data Prefetching
4.2 DPU Architecture
4.2.1 Compute Units (CUs)
4.2.2 Global Scratchpad and Asymmetric Crossbar
4.2.3 Global Sync Unit
4.3 Compute Unit (CU) Architecture
4.3.1 Local Scratchpad
4.3.2 Data Prefetching Using Decoupled Instruction Streams
4.4 Precision-Scalable Custom Posit Unit
4.5 Implementation and Experiments
4.5.1 Physical Implementation
4.5.2 Peak Performance and Voltage Scaling
4.5.3 Workloads
4.5.4 Throughput Scaling with Different Active CUs
4.5.5 Comparison with CPU and GPU
4.5.6 DPU's Performance for a Regular DAG
4.6 Related Work
4.7 Conclusion
978-3-031-33136-7_5
5 DAG Processing Unit Version 2 (DPU-v2): Efficient Execution of Irregular Workloads on a Spatial Datapath
5.1 Designing a Processor with a Spatial Datapath for Large Irregular DAGs
5.1.1 Which Spatial Datapath Topology Should Be Used?
5.1.2 How to Read/Write the Inputs/Outputs?
5.1.3 How to Handle Bank Access Conflicts?
5.2 DPU-v2 Architecture Template
5.2.1 Parallel Tree of PEs
5.2.2 Register File Architecture
5.2.3 Datapath-Register Banks Connections
5.2.4 Load, Store, and Copy of Data
5.2.5 Long, Variable-Length Instructions
5.3 Compiler for DAG
5.3.1 Block Decomposition (Step 1)
5.3.2 PE and Register Bank Mapping (Step 2)
5.3.3 Pipeline-Aware Reordering (Step 3)
5.3.4 Spilling from Register File (Step 4)
5.3.5 Reduction in Memory Footprint
5.4 Design Space Exploration
5.4.1 The Most-Efficient Design Configuration
5.5 State-of-the-Art Comparison
5.5.1 Comparison Using PC and SpTRSV
5.5.2 Comparison Using large PCs
5.5.3 Detailed Comparison of DPU-v2 and DPU
5.6 Additional Related Works
5.7 Conclusion
978-3-031-33136-7_6
6 Conclusions and Future Work
6.1 Contributions and Conclusions
6.2 Suggestions for Future Works
6.3 Closing Remarks
A The Two-Way Partitioning Model of GraphOpt
Bibliography
Index
Date open sourced
2023-08-07
We strongly recommend that you support the author by buying or donating on their personal website, or borrowing in your local library.