Advanced Backend Code Optimization (Iste)

معرفی کتاب «Advanced Backend Code Optimization (Iste)» نوشتهٔ Sid Touati, Benoit Dupont de Dinechin، منتشرشده توسط نشر Wiley-Interscience; Wiley-ISTE در سال 2014. این کتاب در فرمت pdf، زبان انگلیسی ارائه شده است. «Advanced Backend Code Optimization (Iste)» در دستهٔ بدون دسته‌بندی قرار دارد.

"This book is a summary of more than a decade of research in the area of backend optimization. It contains the latest fundamental research results in this field. While existing books are often more oriented toward Masters students, this book is aimed more towards professors and researchers as it contains more advanced subjects. It is unique in the sense that it contains information that has not previously been covered by other books in the field, with chapters on phase ordering in optimizing compilation; register saturation in instruction level parallelism; code size reduction for software pipelining; memory hierarchy effects and instruction level parallelism. Other chapters provide the latest research results in well-known topics such as register need, and software pipelining and periodic register allocation."-- Unedited summary from book Advanced Backend Code Optimization 4 Contents 6 Introduction 14 I.1. Inside this book 16 Part 1: introduction to optimizing compilation 16 Part 2: instruction scheduling 17 Part 3: register optimization 18 Part 4: Epilog 19 I.2. Other contributors 19 I.3. Basics on instruction-level parallelism processor architectures 20 I.3.1. Processors with dynamic instruction issue 22 I.3.2. Processors with static instruction issue 25 I.3.2.1. VLIW processors 26 I.3.2.2. Transport triggered architectures 28 I.3.2.3. EPIC/IA64 processors 28 Part 1: Prolog: Optimizing Compilation 30 1 On the Decidability of Phase Ordering in Optimizing Compilation 32 1.1. Introduction to the phase ordering problem 32 1.2. Background on phase ordering 34 1.2.1. Performance modeling and prediction 34 1.2.2. Some attempts in phase ordering 35 1.3. Toward a theoretical model for the phase ordering problem 36 1.3.1. Decidability results 38 1.3.2. Another formulation of the phase ordering problem 40 1.4. Examples of decidable simplified cases 41 1.4.1. Models with compilation costs 41 1.4.2. One-pass generative compilers 42 1.5. Compiler optimization parameter space exploration 45 1.5.1. Toward a theoretical model 46 1.5.2. Examples of simplified decidable cases 48 1.6. Conclusion on phase ordering in optimizing compilation 49 Part 2: Instruction Scheduling 52 2 Instruction Scheduling Problems and Overview 54 2.1. VLIW instruction scheduling problems 54 2.1.1. Instruction scheduling and register allocation in a code generator 54 2.1.2. The block and pipeline VLIW instruction scheduling problems 56 2.2. Software pipelining 58 2.2.1. Cyclic, periodic and pipeline scheduling problems 58 2.2.2. Modulo instruction scheduling problems and techniques 61 2.3. Instruction scheduling and register allocation 64 2.3.1. Register instruction scheduling problem solving approaches 64 3 Applications of Machine Scheduling to Instruction Scheduling 68 3.1. Advances in machine scheduling 68 3.1.1. Parallel machine scheduling problems 68 3.1.2. Parallel machine scheduling extensions and relaxations 70 3.2. List scheduling algorithms 72 3.2.1. List scheduling algorithms and list scheduling priorities 72 3.2.2. The scheduling algorithm of Leung, Palem and Pnueli 74 3.3. Time-indexed scheduling problem formulations 76 3.3.1. The non-preemptive time-indexed RCPSP formulation 76 3.3.2. Time-indexed formulation for the modulo RPISP 77 4 Instruction Scheduling Before Register Allocation 80 4.1. Instruction scheduling for an ILP processor: case of a VLIW architecture 80 4.1.1. Minimum cumulative register lifetime modulo scheduling 80 4.1.2. Resource modeling in instruction scheduling problems 83 4.1.3. The modulo insertion scheduling theorems 85 4.1.4. Insertion scheduling in a backend compiler 87 4.1.5. Example of an industrial production compiler from STMicroelectronics 89 4.1.6. Time-indexed formulation of the modulo RCISP 93 4.2. Large neighborhood search for the resource-constrained modulo scheduling problem 96 4.3. Resource-constrained modulo scheduling problem 97 4.3.1. Resource-constrained cyclic scheduling problems 97 4.3.2. Resource-constrained modulo scheduling problem statement 98 4.3.3. Solving resource-constrained modulo scheduling problems 99 4.4. Time-indexed integer programming formulations 100 4.4.1. The non-preemptive time-indexed RCPSP formulation 100 4.4.2. The classic modulo scheduling integer programming formulation 101 4.4.3. A new time-indexed formulation for modulo scheduling 102 4.5. Large neighborhood search heuristic 103 4.5.1. Variables and constraints in time-indexed formulations 103 4.5.2. A large neighborhood search heuristic for modulo scheduling 103 4.5.3. Experimental results with a production compiler 104 4.6. Summary and conclusions 105 5 Instruction Scheduling After Register Allocation 106 5.1. Introduction 106 5.2. Local instruction scheduling 108 5.2.1. Acyclic instruction scheduling 108 5.2.2. Scoreboard Scheduling principles 109 5.2.3. Scoreboard Scheduling implementation 111 5.3. Global instruction scheduling 113 5.3.1. Postpass inter-region scheduling 113 5.3.2. Inter-block Scoreboard Scheduling 115 5.3.3. Characterization of fixed points 116 5.4. Experimental results 116 5.5. Conclusions 118 6 Dealing in Practice with Memory Hierarchy Effects and Instruction Level Parallelism 120 6.1. The problem of hardware memory disambiguation at runtime 121 6.1.1. Introduction 121 6.1.2. Related work 122 6.1.3. Experimental environment 123 6.1.4. Experimentation methodology 124 6.1.5. Precise experimental study of memory hierarchy performance 124 6.1.5.1. Example with Alpha 21264 processor (superscalar) 125 6.1.5.2. Example with Power 4 processor (superscalar) 126 6.1.5.3. Example with Itanium 2 processor (EPIC/IA64) 127 6.1.5.4. Summary of the experiments on cache behavior 128 6.1.6. The effectiveness of load/store vectorization 129 6.1.6.1. Example with Alpha 21264 processor 130 6.1.6.2. Example with Power 4 processor 130 6.1.6.3. Example with Itanium 2 processor 130 6.1.7. Conclusion on hardware memory disambiguation mechanisms 132 6.2. How to deal in practice with memory latency by software data preloading and prefetching 133 6.2.1. Introduction 133 6.2.2. Related work 134 6.2.3. Problems of optimizing cache effects at the instruction level 136 6.2.4. Target processor description 138 6.2.5. Our method of instruction-level code optimization 139 6.2.5.1. Data prefetching method at instruction level 139 6.2.5.2. Data preloading method 143 6.2.6. Experimental results 145 6.2.7. Conclusion on prefetching and preloading at instruction level 146 Part 3: Register Optimization 148 7 The Register Need of a Fixed Instruction Schedule 150 7.1. Data dependence graph and processor model for register optimization 151 7.1.1. NUAL and UAL semantics 151 7.2. The acyclic register need 152 7.3. The periodic register need 154 7.3.1. Software pipelining, periodic scheduling and cyclic scheduling 154 7.3.2. The circular lifetime intervals 156 7.4. Computing the periodic register need 158 7.5. Some theoretical results on the periodic register need 161 7.5.1. Minimal periodic register need versus initiation interval 162 7.5.2. Computing the periodic register sufficiency 162 7.5.3. Stage scheduling under register constraints 163 7.6. Conclusion on the register requirement 168 8 The Register Saturation 170 8.1. Motivations on the register saturation concept 170 8.2. Computing the acyclic register saturation 173 8.2.1. Characterizing the register saturation 175 8.2.2. Efficient algorithmic heuristic for register saturation computation 178 8.2.3. Experimental efficiency of Greedy-k 180 8.3. Computing the periodic register saturation 182 8.3.1. Basic integer linear variables 183 8.3.2. Integer linear constraints 183 8.3.3. Linear objective function 185 8.4. Conclusion on the register saturation 186 9 Spill Code Reduction 188 9.1. Introduction to register constraints in software pipelining 188 9.2. Related work in periodic register allocation 189 9.3. SIRA: schedule independant register allocation 191 9.3.1. Reuse graphs 191 9.3.2. DDG associated with reuse graph 193 9.3.3. Exact SIRA with integer linear programming 195 9.3.3.1. Basic integer variables 195 9.3.3.2. Integer linear constraints 196 9.3.3.3. Objective function 196 9.3.4. SIRA with fixed reuse edges 197 9.4. SIRALINA: an efficient polynomial heuristic for SIRA 198 9.4.1. Integer variables for the linear problem 199 9.4.2. Step 1: the scheduling problem 199 9.4.3. Step 2: the linear assignment problem 201 9.5. Experimental results with SIRA 202 9.6. Conclusion on spill code reduction 204 10 Exploiting the Register Access Delays Before Instruction Scheduling 206 10.1. Introduction 206 10.2. Problem description of DDG circuits with non-positive distances 208 10.3. Necessary and sufficient condition to avoid non-positive circuits 209 10.4. Application to the SIRA framework 211 10.4.1. Recall on SIRALINA heuristic 212 10.4.2. Step 1: the scheduling problem for a fixed II 212 10.4.3. Step 2: the linear assignment problem 213 10.4.4. Eliminating non-positive circuits in SIRALINA 213 10.4.5. Updating reuse distances 215 10.5. Experimental results on eliminating non-positive circuits 216 10.6. Conclusion on non-positive circuit elimination 217 11 Loop Unrolling Degree Minimization for Periodic Register Allocation 220 11.1. Introduction 220 11.2. Background 224 11.2.1. Loop unrolling after SWP with modulo variable expansion 225 11.2.2. Meeting graphs (MG) 226 11.2.3. SIRA, reuse graphs and loop unrolling 229 11.3. Problem description of unroll factor minimization for unscheduled loops 233 11.4. Algorithmic solution for unroll factor minimization: single register type 234 11.4.1. Fixed loop unrolling problem 235 11.4.2. Solution for the fixed loop unrolling problem 236 11.4.2.1. Analysis of the complexity of algorithm 11.1 238 11.4.3. Solution for LCM-MIN problem 238 11.4.3.1. Algorithmic complexity analysis of algorithm 11.4 240 11.5. Unroll factor minimization in the presence of multiple register types 242 11.5.1. Search space for minimal kernel loop unrolling 246 11.5.2. Generalization of the fixed loop unrolling problem in the presence of multiple register types 247 11.5.3. Algorithmic solution for the loop unrolling minimization (LUM, problem 11.1) 248 11.6. Unroll factor reduction for already scheduled loops 250 11.6.1. Improving algorithm 11.4 (LCM-MIN) for the meeting graph framework 253 11.6.1.1. Algorithmic complexity analysis for solving problem 11.5 253 11.7. Experimental results 253 11.8. Related work 255 11.8.1. Rotating register files 255 11.8.2. Inserting move operations 256 11.8.3. Loop unrolling after software pipelining 257 11.8.4. Code generation for multidimensional loops 257 11.9. Conclusion on loop unroll degree minimization 257 Part 4: Epilog: Performance, Open Problems 260 12 Statistical Performance Analysis: The Speedup-Test Protocol 262 12.1. Code performance variation 262 12.2. Background and notations 265 12.3. Analyzing the statistical significance of the observed speedups 268 12.3.1. The speedup of the observed average execution time 268 12.3.1.1. Remark on the normality of the distributions of X and Y 268 12.3.1.2. Remark on the variances of the distributions of X and Y 269 12.3.1.3. The size needed for the samples X and Y 269 12.3.1.4. Using the Student’s t-test correctly 269 12.3.2. The speedup of the observed median execution time, as well as individual runs 270 12.3.2.1. Using the test of Kolmogorov–Smirnov at first step 272 12.3.2.2. Using the test of Wilcoxon–Mann–Whitney 272 12.4. The Speedup-Test software 273 12.5. Evaluating the proportion of accelerated benchmarks by a confidence interval 275 12.6. Experiments and applications 277 12.6.1. Comparing the performances of compiler optimization levels 278 12.6.2. Testing the performances of parallel executions of OpenMP applications 279 12.6.3. Comparing the efficiency of two compilers 280 12.6.4. The impact of the Speedup-Test protocol over some observed speedups 282 12.7. Related work 282 12.7.1. Observing execution times variability 282 12.7.2. Program performance evaluation in presence of variability 283 12.8. Discussion and conclusion on the Speedup-Test protocol 284 Conclusion 286 C.1. Some open problems in backend optimizing compilation 286 C.1.1. Problem of instruction selection 286 C.1.2. Code optimization for multi-core and many-core processors 287 C.2. Summary 288 Appendix 1: Presentation of the Benchmarks used in our Experiments 292 A1.1. Qualitative benchmarks presentation 292 A1.2. Quantitative benchmarks presentation 294 A1.3. Changing the architectural configuration of the processor 296 Appendix 2: Register Saturation Computation on Stand-alone DDG 300 A2.1. The cyclic register saturation 300 A2.1.1. On the optimal RS computation 300 A2.1.2. On the accuracy of Greedy-k heuristic versus optimal RS computation 301 A2.1.3. GREEDY-K execution times 302 A2.2. The periodic register saturation 303 A2.2.1. Optimal PRS computation 303 A2.2.2. Approximate PRS computation with heuristic 306 Appendix 3: Efficiency of SIRA on the Benchmarks 308 A3.1. Efficiency of SIRALINA on stand-alone DDG 308 A3.1.1. Naming conventions for register optimization orders 308 A3.1.2. Experimental efficiency of SIRALINA 309 A3.1.3. Measuring the increase of the MII 311 A3.1.4. SIRALINA execution times 312 A3.2. Efficiency of SIRALINA plugged with an industrial compiler 313 A3.2.1. Static performance results 316 A3.2.1.1. Spill code reduction and IIdecrease 316 A3.2.1.2. Decoupling Register Constrains from SWP 316 A3.2.1.3. Does SIRA definitely remove spill code? 317 A3.2.2. Execution time performance results 318 A3.2.2.1. Speedups 318 A3.2.2.2. Impact on instruction cache effects 319 A3.2.2.3. Impact on data cache effects 321 Appendix 4: Efficiency of Non-Positive Circuit Elimination in the SIRA Framework 322 A4.1. Experimental setup 322 A4.1.1. Heuristics nomenclature 322 A4.1.2. Empirical efficiency measures 322 A4.2. Comparison between the heuristics execution times 323 A4.2.1. Time to minimize register pressure for a fixed II 324 A4.3. Convergence of the proactive heuristic (iterative SIRALINA) 326 A4.4. Qualitative analysis of the heuristics 326 A4.4.1. Number of saved registers 327 A4.4.2. Proportion of success when looking for a solution that satisfies the register constraints 330 A4.4.3. Increase of the MII when looking for a solution that satisfies the register constraints 330 A4.5. Conclusion on non-positive circuit elimination strategy 330 Appendix 5: Loop Unroll Degree Minimization: Experimental Results 332 A5.1. Stand-alone experiments with single register types 332 A5.1.1. Experiments with unscheduled loops 332 A5.1.2. Results on randomly generated data dependence graphs 332 A5.1.3. Experiments on real DDG 334 A5.1.3.1. Machine with an unbounded number of registers 334 A5.1.3.2. Machine with a bounded number of registers 335 A5.1.3.3. Machine with bounded number of registers and option continue 337 A5.1.4. Experiments with scheduled loops 337 A5.1.4.1. Loop unrolling of scheduled versus unscheduled loops 340 A5.2. Experiments with multiple register types 341 A5.2.1. Statistics on minimal loop unrolling factors 342 Appendix 6: Experimental Efficiency of Software Data Preloading and Prefetching for Embedded VLIW 344 Appendix 7: Appendix of the Speedup-Test Protocol 348 A7.1. Why is the observed minimal execution time not necessarily a good statistical estimation of program performances? 348 A7.1.1. Case of exponential distribution function shifted by θ 349 A7.1.2. Case of normal populations 349 A7.2. Hypothesis testing in statistical and probability theory 350 A7.3. What is a reasonable large sample? Observing the central limit theorem in practice 351 Bibliography 356 Lists of Figures, Tables and Algorithms 374 Index 382 Introduction xiii Part 1 Prolog: Optimizing Compilation 1 Chapter 1 On the Decidability of Phase Ordering in Optimizing Compilation 3 Part 2 Instruction Scheduling 23 Chapter 2 Instruction Scheduling Problems and Overview 25 Chapter 3 Applications of Machine Scheduling to Instruction Scheduling 39 Chapter 4 Instruction Scheduling Before Register Allocation 51 Chapter 5 Instruction Scheduling After Register Allocation 77 Chapter 6 Dealing in Practice with Memory Hierarchy Effects and Instruction Level Parallelism 91 Part 3 Register Optimization 119 Chapter 7 The Register Need of a Fixed Instruction Schedule 121 Chapter 8 The Register Saturation 141 Chapter 9 Spill Code Reduction 159 Chapter 10 Exploiting the Register Access Delays Before Instruction Scheduling 177 Chapter 11 Loop Unrolling Degree Minimization for Periodic Register Allocation 191 Part 4 Epilog: Performance, Open Problems 231 Chapter 12 Statistical Performance Analysis: The Speedup-Test Protocol 233 Conclusion 257 Appendix 1 Presentation of the Benchmarks Used in Our Experiments 263 Appendix 2 Register Saturation Computation on Stand-Alone DDG 271 Appendix 3 Efficiency of SIRA on the Benchmarks 279 Appendix 4 Efficiency of Non-Positive Circuit Elimination in the SIRA Framework 293 Appendix 5 Loop Unroll Degree Minimization: Experimental Results 303 Appendix 6 Experimental Efficiency of Software Data Preloading and Prefetching for Embedded VLIW 313 Appendix 7 Appendix of the Speedup-Test Protocol 319 Bibliography 327 Lists of Figures, Tables and Algorithms 345 Index 353

دانلود کتاب Advanced Backend Code Optimization (Iste)