## 多核共享缓存下的编程优化和正确性

Program Behavior in Shared Cache: Performance and Correctness

### 丁晨 Chen Ding

美国纽约州私立罗切斯特大学

计算机科学系教授

Professor

University of Rochester

2014 DragonStar Course at University of Science and Technology of China

#### Lecture 1: Why Cache 缓存的存在

Necessity of memory hierarchy. Sec. 1.1, Ulrich Meyer, Peter Sanders, and Jop F. Sibeyn, editors. Algorithms for Memory Hierarchies, Advanced Lectures, volume 2625 of Lecture Notes in Computer Science. Springer, 2003. Memory bandwidth bottleneck. Sec. 1, Chen Ding and Kennedy, "Improving Effective Bandwidth through Compiler Enhancement of Global Cache Reuse," Journal of Parallel and Distributed Computing, Volume 64, Issue 1, January 2004, Elsevier Press, pages 108-134.

Performance metrics

footprint, reuse distance, fill time, miss ratio. Xiaoya Xiang, Chen Ding, Bin Bao and Hao Luo, "A Higher Order Theory of Cache Locality", in Proceedings of The Symposium on Architectural Support for Programming Languages and Operating Systems, March 2013.

AMAT and APC. Xian-He Sun and Dawei Wang. APC: a performance metric of memory systems. SIGMETRICS Performance Evaluation Review, 40(2):125–130, 2012.

Sec. 1--2, "Program Locality Analysis Using Reuse Distance ", Yutao Zhong, Xipeng Shen, and Chen Ding, ACM Transactions on Programming Languages and Systems, Volume 31, Number 6, August 2009, pages 1--39.

Cache hardware. Preface and Chap. 1, Rajeev Balasubramonian, Norman Jouppi, Naveen Muralimanohar, Multi-CoreCache Hierarchies, Synthesis Lecturess on Computer Architecture #17, Morgan Claypool Publishers, 2011. Software defined cache -- memcached. Atikoglu et al. Workload Analysis of a Large Key-Value Store. SIGMETRICS 2012.

http://www.cs.rochester.edu/drupal/program-behavior-shared-cache-performance-and-correctness

Lecture 2: Footprint Theory of Locality 程序局部性的足迹理论

"多核程序交互理论及应用", 丁晨, 袁良, *计算机工程与科学*, Volume 36, Number 1, January 2014, pages 1--5. "Performance Metrics and Models for Shared Cache (共享缓存性能的度量与分析)", Chen Ding (丁晨), Xiaoya Xiang (向晓姬), Bin Bao (包斌), Hao Luo (罗灵), Ying-Wei Luo (罗英伟), and Xiao-Lin Wang (汪小林), Journal of Computer Science and

Technology, 2014, V29(4): 692-712.

Lecture 3: Locality Optimization 程序局部性优化的概述和举例

Five dimensiions of locality.

Sec. 6, "Program Locality Analysis Using Reuse Distance ", Yutao Zhong, Xipeng Shen, and Chen Ding, ACM Transactions on Programming Languages and Systems, Volume 31, Number 6, August 2009, pages 1–39. "Performance Metrics and Models for Shared Cache (共享條件條約度量与分析)", Chen Ding (丁晨), Xiaoya Xiang (向終級), Bin Bao (匈狄), Hao Luo (罗亮), Ying-Wei Luo (罗条作), and Xiao-Lin Wang (江/从外), Journal of Computer Science and

Technology, 2014,V29(4): 692-712. (added 7/4) \*On-the-Fhy Elimination of Dynamic Irregularities for GPU Computing", Eddy Z. Zhang, Yunlian Jiang, Ziyu Guo, Kai Tan, Xipeng Shen, the Sixteenth International Conference on Architectural Support for Programming

Cade (an hair, open young the backet in international contention of internation of internation of internationa contention of international con

(added 7/8) "Defensive Loop Tilling for Shared Cache", Bin Bao and Chen Ding, in Proceedings of on Code Generation and Optimization, February 2013. (slides, video) Lecture 4: Fundamentals of Shared-Memory Synchronization 共享内存同步的基本原理

First 3 chapters and 8.2 in Michael L. Scott, Shared-Memory Synchronization, Synthesis Lecturess on Computer Architecture, Morgan Claypool Publishers, 2013.

(added 7/8) Safe (Hint based) Parallel Programming: "Software Behavior Oriented Parallelization", Chen Ding, Xipeng Shen, Kirk Kelsey, Chris Tice, Ruke Huang, and Chengliang Zhang, in Proceedings of ACM SIGPLAN Conference on Programming Language Design and Implementation, San Diego CA, June 2007.

#### Lecture 5: Collaborative Cache Management and Optimization 软硬件协同管理和优化缓存

Xlaoming Gu (BS at USTC 2003, MS at ICT 2006, PhD at Rochester 2013), 谷磜铭(中科大计算机系本科2003,中科院计算所 硕士2006, 罗切斯特大学博士2013) Optimal Collaborative Caching: Theory and Applications. Ph.D. Dissertation, 2013. (Guest lecture) Professor Song Jiang (韦恩州立大学江松教授,前中科大计算机系老师)

LIRS algorithm. Jiang et al. *IEEE Transactions on Computers*, 2005 and 2007. SIGMETRICS 2002. Introduction to 大數据在互联网数据中心的管理和计算(中科院深圳先进技术研究院龙星课程)



## Cache System

Three problems: latency/bandwidth and Matthew Hertz's beer capacity and Trishul Chilimbi's cliff sharing Chen's Platform



#### • Cache

- which most transistors are used for
- where most memory accesses happen
- managed by priority/usage

#### • Shared cache

- available cache is variable
- throughput/stability/fairness/QoS

8

• sequential/parallel code

Madison Itanium 2, 2002 slide 3, Anant Agrawal, MIT 6.975 Fall 2007

Chen Ding, University of Rochester, INRIA 2014



Chen Ding, University of Rochester

| GPU                                       |       | G80                  | GT200                  | Fermi                         |
|-------------------------------------------|-------|----------------------|------------------------|-------------------------------|
| Transistors                               |       | 681 million          | 1.4 billion            | 3.0 billion                   |
| CUDA Cores                                |       | 128                  | 240                    | 512                           |
| Double Precision Floa<br>Point Capability | ating | None                 | 30 FMA ops / clock     | 256 FMA ops /clock            |
| Single Precision Floa<br>Point Capability | ting  | 128 MAD<br>ops/clock | 240 MAD ops /<br>clock | 512 FMA ops /clock            |
| Special Function Unit<br>(SFUs) / SM      | ts    | 2                    | 2                      | 4                             |
| Warp schedulers (per                      | r SM) | 1                    | 1                      | 2                             |
| Shared Memory (per                        | SM)   | 16 KB                | 16 KB                  | Configurable 48 KB o<br>16 KB |
| L1 Cache (per SM)                         |       | None                 | None                   | Configurable 16 KB o<br>48 KB |
| L2 Cache                                  |       | None                 | None                   | 768 KB                        |
| ECC Memory Suppor                         | t     | No                   | No                     | Yes                           |
| Concurrent Kernels                        |       | No                   | No                     | Up to 16                      |
| Load/Store Address                        | Width | 32-bit               | 32-bit                 | 64-bit                        |

Chen Ding, University of Rochester





Introduction

# **RESEARCH DRIVE**

# Computing Become Data Intensive

• Simulation, visualization, data mining, information retrieval, etc.

#### Data requirements for selected INCITE applications at ALCF

| <u>PI</u>              | Project                                                                   | On-Line Data | <u>a Off-Line Data</u> |
|------------------------|---------------------------------------------------------------------------|--------------|------------------------|
| Lamb, Don              | FLASH: Buoyancy-Driven Turbulent Nuclear Burning                          | 75TB         | 300TB                  |
| Fischer, Paul          | Reactor Core Hydrodynamics                                                | 2TB          | 5TB                    |
| Dean, David            | Computational Nuclear Structure                                           | 4TB          | 40TB                   |
| Baker, David           | Computational Protein Structure                                           | 1TB          | 2TB                    |
| Worley, Patrick H.     | Performance Evaluation and Analysis                                       | 1TB          | 1TB                    |
| Wolverton, Christopher | Kinetics and Thermodynamics of Metal and<br>Complex Hydride Nanoparticles | 5TB          | 100TB                  |
| Washington, Warren     | Climate Science                                                           | 10TB         | 345TB                  |
| Tsigelny, Igor         | Parkinson's Disease                                                       | 2.5TB        | 50TB                   |
| Tang, William          | Plasma Microturbulence                                                    | 2TB          | 10TB                   |
| Sugar, Robert          | Lattice QCD                                                               | 1TB          | 44TB                   |
| Siegel, Andrew         | Thermal Striping in Sodium Cooled Reactors                                | 4TB          | 8TB                    |
| Roux, Benoit           | Gating Mechanisms of Membrane Proteins                                    | 10TB         | 10TB                   |
|                        | Source: R. Ross et. al., Argonne National Laborator                       | y            |                        |
|                        |                                                                           |              | 11                     |

#### The Memory-wall Problem Processor performance increases rapidly □ Uni-processor: ~52% until 2004, $\sim 25\%$ since then 100.000 □ New trend: multi-core/many-10.000 core architecture 1.000 Intel TeraFlops chip, 2007 100 □ Aggregate processor performance much higher ■ Memory: ~9% per year Processor-memory speed gap . keeps increasing 4/26/2014 Scalable Computing Software Lab, Illinois Institute of Technology 12





## Improve via Memory Hierarchy







# The Introduction of APC

- <u>A</u>ccess <u>Per Cycle</u> (APC) • APC = A/T
- APC is measured as the number of memory accesses per memory active cycle or <u>Access Per Memory Active Cycle</u> (APMAC)
- Benefits of APC (APMAC)
  - □ Separate memory evaluation from CPU evaluation
  - Each memory level has its own APC value
  - A better understanding of memory system as a whole, and at each layer
  - A better understanding of the match between computing capacity and memory system performance

X.-H. Sun and D. Wang, "APC: A Performance Metric of Memory Systems", ACM SIGMETRICS Performance Evaluation Review, Volume 40 , Issue 2, 2012.

## **APC** Measurement

- The difficulty is measuring the total cycle T
  - Hundreds of memory accesses co-exist the memory system
- Measure T based on the *overlapping mode*
  - □ When there are several memory accesses co-existing during the same clock cycle, *T* only increases by one
  - Measure the concurrence
  - Measure the concurrence at each level
- Hardware cost: one bit
- Concurrence and Data-Centric view

## **Exhausted Testing**

- With different benchmarks, and with different configurations
- With advanced cache technologies
  - Non-block cache
  - Pipelined cache
  - Multi-port cache
  - Hardware prefetcher
- With single core or multicore
- APC always has the highest CC values among all the memory metrics

D. Wang, X.-H. Sun "Memory Access Cycle and the Measurement of Memory Systems", IEEE Transactions on Computers, (May-June) 2014

## **APC and C-AMAT Applications**

- Provide a new way to measure and analyze the contribution of memory concurrence
- Provide new approaches to reduce memory access delay
- Reveal the importance of memory parallelism and its relation to data locality
- Provide a mean to study the matching between memory organization and microprocessor architecture,
- Provide a mean to study the matching between memory organization and a given application
- Design and Co-Design of Parallel Memory Systems

## **Data Access is Application Dependent**

- Conventional algorithm analysis
   Floating point operation
- Data-centric algorithm analysis
  - Floating point operation
  - Memory requirement
  - Data reuse rate
  - Data access/movement pattern

Understanding

## **APPLICATION BEHAVIOR**



31



□Read/write

S. Byna, X.-H. Sun, et. al, "Parallel I/O Prefetching Using MPI File Caching and I/O Signatures," SC'08

### I/O Trace Signature

- Description of a sequence of I/O accesses in a pattern
- Form: {I/O operation, init position, dimension, ([{offset pattern}, {request size pattern}, {pattern of number of repetitions}], [...]), # of repetitions?

#### Pattern Signature

- provides a simple description that explains the nature of a pattern
- Form: {I/O operation, <Spatial pattern, Dimension>, <Repetitive behavior>, <Request size>, <Temporal Intervals>}

### **I/O Pattern Detection**

- Developed a pattern detection tool
- Five pattern detectors for finding patterns among initial positions, offsets, request sizes, temporality, and repetitions
- Outputs I/O Signature that can be used for prefetching, data layout, and data reorganization



| More<br>Addi<br>Da<br>Lo<br>Appl | tional contra access proceed and glo<br>ical and glo | IOSIG<br>htributions<br>attern categories<br>bal I/O signatures                                                                  |
|----------------------------------|------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------|
| refetching                       | SC'08                                                | S. Byna, Y. Chen, XH. Sun, R. Thakur, and W. Gropp<br>Parallel I/O Prefetching Using MPI File Caching and I/O<br>Signatures      |
| ata Layout                       | HPDC11                                               | H. Song, Y. Yin, Y. Chen, XH. Sun,<br>A Cost-intelligent Application-specific Data layout Scheme for<br>Parallel File Systems    |
| ata<br>oordination               | SC11                                                 | H. Song, Y. Yin, XH. Sun, et. al,<br>Server-Side I/O Coordination for Parallel File Systems                                      |
| ata<br>Irganization              | PDSW11                                               | J.He, H. Song, XH. Sun, Y. Yin, and R. Thakur<br>Pattern-aware File Reorganization in MPI-IO                                     |
| ata<br>eplication                | IPDPS13                                              | Y. Yin, J. Li, J. He, XH. Sun, and R. Thakur, <i>Pattern-Direct and Layout-Aware Replication Scheme for Parallel I/O Systems</i> |
|                                  |                                                      | Data Compression                                                                                                                 |
| • Web                            | site: w                                              | ww.cs.iit.edu/~scs/iosig/                                                                                                        |

# **OPTIMIZATION** FROM DATA VIEW

## Developing

# **INTEGRATED SYSTEM DESIGN**





S. He, X.-H. Sun, et.al. "S4D-Cache: Smart Selective SSD Cache for Parallel I/O Systems", accepted to appear in ICDCS2014

# **Data Access is a Complex Matter**

- Dynamic, Application-aware
- System and algorithm re-design

## Big Data, Big Deal, Big Challenge



Operation of Memory Hierarchy



Layers of parallel I/O



# **Challenges of Exascale Computing**

(Cloud, data center)

- Energy and Power
- Memory and Storage
- Concurrency and Locality (programming model)
- Resiliency
- **Ours Solution:**

Concurrent memory and storage systems (with programming support)



## **Conclusion:** HPC Data-centric Rethinking

- Power & fault-tolerant depending on data handling
- Data access is a major concern of big data, cloud, HPC
- Data intensive computing requires the rethinking of program model, system, algorithm, and architecture
- C-AMAT and APC build the foundation of rethinking computing systems
- Integrated parallel data-access system design is the first step

TOO MANY THINGS NEED TO DO ROM HARDWARE TO SOFTWARE

Scalable Computing Software Lab, Illinois Institute of Technology



#### My Interests This Course • Is research for knowledge or utility? • Three basic problems in (computer) systems Pythagorean or Baconian? locality, parallelism, synchrony • The science of systems research The material what can we ultimately create • key studies that illuminate the limits or overcome some of New knowledge <-> more useful systems the limits • How to do research? · computational solutions and experimental verification • "Don't do what everybody else is doing" --- Jim Larus • what we have learned collectively in the last decade • supercomputing, ILP, pointer analysis, multi-core, ... ... Class dynamics Look forward to the next limit explanation of basic concepts and questions Basic research question • selection of specific material (from the reading list) · How much a program can understand other programs? based on common interests

#### Chen Ding, DragonStar lecture, ICT 2008

4/26/2014

47

Chen Ding, DragonStar lecture, ICT 2008

## 1994: Instruction-level Parallelism

- Studying under Philip Sweany and Steven Carr at MTU
- · Hiding the latency of operations and branches
- most operations have predictable latency
- except for ... ...
- Memory accesses
  - papers assumed L1 miss and L2 hit
- What about L2 misses?
  - the latency can be over a hundred cycles
  - ILP may not matter
  - no discussion in ILP papers
  - no one really knows the general answer until this decade

Chen Ding, DragonStar lecture, ICT 2008

#### Scalability and Data Placement on SGI Origin \*

Chen DING

Arun CHAUHAN

Barry SHERAW

52

Dept of Computer Science 6100 S Main, Rice University Houston, TX 77005 {achauhan,cding,sheraw}@rice.edu

April 28, 1997

Some of our results are quite different from the predictions of two recent simulation studies on directory-based ccNUMA machines ([HSH96] and [PRA97]), especially on FFT. These differences are partly due to the fact that the machine models used in previous simulation studies are different from the Origin machine in some important aspects. Our results also include data sizes that are larger than any of the previous simulation studies. To increase our confidence on the latency numbers and data placement tools, we also measured memory latencies using micro-benchmarks.

#### The Memory Problem 2002: Mark Wegman • The journey of an idea Compiler legend, co-invented many classic techniques 1997 summer, I told Ken the problem of memory bandwidth · compression, universal hashing, global value numbering, 1998(?) John Hennessy "single-node bandwidth" is fundamental problem 1998 (?) John McCalpin "It is the bandwidth, stupid" constant propagation, congruence, and static single 1999 summer, Burton Smith at LCPC at UCSD assignement 2000, my dissertation done, talk w/ Crawford of Intel and Carter of Utah · First ACM Workshop on Memory System Performance and 2001, my visits at Intel Itanium compiler group and Lawrence Livermore 2002, Intel used RAMBUS in Pentium 4 Correctness (MSPC) in 2002 2002, Earth Simulator became world's fastest computer recurring comment during the PC meeting 2002, Utah work won ICS best student paper award "The first load takes a long time, the next 10 do not 2003/4, US invested in high-end computing matter!" 2003 PACT, global loop fusion by Intel 2005 ICS, array regrouping by IBM · Performance depends on not instruction count and not 2005 ICS, data packing used by Lawrence Livermore instruction type but when and how often there is a miss 2003--2007, a new understanding of locality emerged 2007, DARPA MIT multi-core workshop listed off-chip bw as #1 problem 51 Chen Ding, DragonStar lecture, ICT 2008

49

Chen Ding, DragonStar lecture, ICT 2008



# **The Memory Problem**







| [Callahan, Cocke, and Kennedy, JPDC 1988]<br>Ding and Kennedy, JPDRS 2000 and JPDC 20041 |            |            |            |  |  |  |
|------------------------------------------------------------------------------------------|------------|------------|------------|--|--|--|
| ing and Ken                                                                              | neuy, IPDP | 5 2000 and | I JPDC 200 |  |  |  |
| program/                                                                                 | Progra     | m/machine  | balance    |  |  |  |
| machine                                                                                  | L1-Reg     | L2-L1      | Mem-L2     |  |  |  |
| Convolution                                                                              | 6.4        | 5.1        | 5.2        |  |  |  |
| Dmxpy                                                                                    | 8.3        | 8.3        | 8.4        |  |  |  |
| 1mjki (-02)                                                                              | 24.0       | 8.2        | 5.9        |  |  |  |
| Amjki (-03)                                                                              | 8.1        | 1.0        | 0.04       |  |  |  |
| FFT                                                                                      | 8.3        | 3.0        | 2.7        |  |  |  |
| SP                                                                                       | 10.8       | 6.4        | 4.9        |  |  |  |
| Sweep3D                                                                                  | 15.0       | 9.1        | 7.8        |  |  |  |
| Origin2000                                                                               | 4.0        | 4.0        | 0.8        |  |  |  |

| Applicat  | tions | Rati   | o: demand/s | upply  |
|-----------|-------|--------|-------------|--------|
| ripplieur | ions  | Reg BW | Cache BW    | Mem BW |
| Convolu   | tion  | 1.6    | 1.3         | 6.5    |
| Dm×p      | у     | 2.1    | 2.1         | 10.5   |
| Mmjki (-  | -02)  | 6.0    | 2.1         | 7.4    |
| FFT       |       | 2,1    | 0.8         | 3.4    |
| SP        |       | 2.7    | 1.6         | 6.1    |
| Sweep     | 3D    | 3.8    | 2.3         | 9.8    |













| Traditional Processors                                                                                                                                                                                                                                                                                                                                                                                                                                                          | Chip Multi-processors                                                                                                                                                                                                                                                                                                                                     |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <ul> <li>NEC SX-4</li> <li>vector processor, June 1997, 2 Gflops, 7.4GB/s<br/>bandwidth, 3.7 bytes per flop</li> <li>Alpha Server <ul> <li>May 1999, 600MHz CPU, 2MB cache, 932 Mflops, 50MB/s<br/>bandwidth, 0.48 byte per flop</li> </ul> </li> <li>Pentium 4 <ul> <li>Sept. 2003, 2.4GHz, 4.8 Gflops, 1.58GB/s bandwidth, 0.33 byte per flop</li> </ul> </li> <li>Opteron <ul> <li>Dec. 2003, 2.2GHz, 4 Gflops, 1.1GB/s bandwidth, 0.28 byte per flop</li> </ul> </li> </ul> | <ul> <li>4-way Power 4</li> <li>May 2003, 1.7GHz, 6MB partitioned L2, 27.2 Gflops, 6GB/<br/>s, 0.22 byte per flop</li> <li>Intel Core2 Quad</li> <li>April 2007, 2.4GHz, 154 Gflops, 5.3 GB/s bandwidth, 0.07<br/>byte per flop</li> <li>64-core Tilera</li> <li>Dec. 2007, 750MHz, 25GB/s peak memory bandwidth, 3.8<br/>TB on-chip bandwidth</li> </ul> |
| Chen Ding, DragonStar lecture, ICT 2008                                                                                                                                                                                                                                                                                                                                                                                                                                         | , Chen Ding, DragonStar lecture, ICT 2008                                                                                                                                                                                                                                                                                                                 |

# **Current Expert Opinion**





## Multi-Core Processors – Are they Here Yet ?

• My shopping basket at Fry's electronics on BlackFriday

| Item                                            | Cost |
|-------------------------------------------------|------|
| Motherboard+Intel <sup>®</sup> Quad core 2.4Ghz | 200  |
| 4GB Memory                                      | 70   |
| 0.5TB Disk                                      | 80   |
| Case                                            | 10   |
| Graphics                                        | 10   |
| CD/DVD                                          | 10   |
| Total                                           | 380  |

# Languages and Compilers for Multicore Computing Systems

FRAN ALLEN allen@watson.ibm.com

> Workshop Keynote IIT Kanpur, India December 13, 2007

(intel)



## **OPPORTUNITIES**

- New very high level languages
- New compiler techniques to manage data locality, integrity, ownership, ... in the presence of parallelism.
- Influence the architects before it is too late
- Rebuild the software stack
- Establish overall system goals:
  - Ser Productivity
  - Application Performance



# Impact of Locality

- □ Acceptable page fault rates can be achieved even when the memory allocated to a program is much less than that required to store all of its pages
- □ Internet routers can make high speed routing decisions with very modest forwarding caches
- □ Mobile users can work with remotely stored files even though they are located far from the file server

# Known Aliases

- □ The law of scattering
- The principle of least effort
- □ The 80-20 rule
- □ Concentration of productivity
- □ The law of diminishing returns



December 6, 2003

Temporal and Spatial Locality: A Time and a Place for Everything Slide 5 of 29

December 6, 2003

Temporal and Spatial Locality: A Time and a Place for Everything Slide 6 of 29

#### December 6, 2003 Import and Spatial Locality: Temporal and Spatial Locality: A more and a Place for Everything Support of Spatial Locality: A more and a Place for Everything Support of Spatial Locality: A memore and a Place for Everything Support of Spatial Locality: A memore and a Place for Everything Support of Spatial Locality: A memore and a Place for Everything Support of Spatial Locality: A memore and a Place for Everything Support of Spatial Locality: A memore and a Place for Everything Support of Spatial Locality: A memore and a Place for Everything Support of Spatial Locality: A memore and a Place for Everything Support of Spatial Locality: A memore and a Place for Everything Support of Spatial Locality: A memore and a Place for Everything Support of Spatial Locality: A memory and Spatial Locality: A memory and Place for Everything Support of Spatial Locality: A memory and Place for Everything Support of Spatial Locality: A memory and Place for Everything Support of Spatial Locality: A memory and Place for Everything Support of Spatial Locality: A memory and Place for Everything Support of Spatial Locality: A memory and Place for Everything Support of Spatial Locality: A memory and Place for Everything Support of Spatial Locality: A memory and Place for Everything Support of Spatial Locality: A memory and Place for Everything Support of Spatial Locality: A memory and Place for Everything Support of Spatial Locality: A memory and Place for Everything A memory and Place for Everything A memory and Place for Everything A memory and Place for Everything

# The Underlying Concept

- □ There is a very large population of items, many more than we can manage
- There is a small core of relevant items on which we can productively focus our attention
- □ This core will continue to be relevant long enough to justify our attention

Locality: Innate or Emergent?

#### Bell's theorem without inequalities<sup>a)</sup>

Daniel M. Greenberger

Department of Physics, City College of the City University of New York, New York, New York 10031 Michael A. Horne

Department of Physics, Stonehill College, North Easton, Massachusetts 02357

Abner Shimony Departments of Philosophy and Physics, Boston University, Boston, Massachusetts 02215

Anton Zeilinger Atominstitut der Österreichischen Universitäten, Schüttelstrasse 115, A-1020 Vienna, Austria

(Received 10 June 1990; accepted for publication 30 July 1990)

It is demonstrated that the premisses of the Einstein–Podolsky–Rosen paper are inconsistent when applied to quantum systems consisting of at least three particles. The demonstration reveals



Temporal and Spatial Locality: A Time and a Place for Everything

Slide 7 of 29

(i) *Perfect correlation*: If the spins of particles 1 and 2 are measured along the same direction, then with certainty the outcomes will be found to be opposite.

December 6 2003

(ii) Locality: "Since at the time of measurement the two systems no longer interact, no real change can take place in the second system in consequence of anything that may be done to the first system."

(iii) *Reality*: "If, without in any way disturbing a system, we can predict with certainty (i.e., with probability equal to unity) the value of a physical quantity, then there exists an element of physical reality corresponding to this physical quantity."

(iv) Completeness: "Every element of the physical reality must have a counterpart in the [complete] physical theory."











## **Measuring Reuse Distance**



ent-chinese-ba.ipa

Measuring Reuse Distance Predicting whole-program locality through reuse distance analysis 1 2 3 4 5 6 7 8 9 10 11 12 time: Full text Pdf (298 KB) dacbccgef a f b access: Conference on Programming Language Design and Implementation archive Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementati SESSION: Code optimization II table of contents Pages: 245 - 257 Year of Publication: 2003 ISBN1-5611-562-5 Source ← 5 distinct accesses – distance: ≁ (a) an example access sequence the reuse distance between two b's is 5 Chen Ding University of Rochester, Rochester, New York Yutao Zhong University of Rochester, Rochester, New York Authors Naive counting, O(N) time per access, O(N) space ACM: Association for Computing Machinery SIGPLAN: ACM Special Interest Group on Programming Languages Sponsors • N is the number of memory accesses ACM New York, NY, USA Publisher Bibliometrics Downloads (6 Weeks): 7, Downloads (12 Months): 101, Citation Count: 20 • M is the number of distinct data elements Too costly • N is up to 120 billion, M 25 million Chen Ding, DragonStar lecture, ICT 2008 94

















#### James R. Fienup



Robert E. Hopkins Professor of Optics also, Professor, Center for Visual Science Senior Scientist, Laboratory for Laser Energetics Professor of Electrical and Computer Engineerin University of Rochester Institute of Optics, Wilmot 410 275 Hutchison Rd Rochester, NY 14627-0186 (585) 275-8009 fienuu@ optics.rochester.edu



The James Webb Space Telescope, a segmented aperture system.





### **Whole-Program Locality**



























## Whole-Program Locality



- differ by programs
- consistent within the same program

#### Emergent behavior

- an observation
- a computational discovery
- implementation independent

#### Limitations

- not complete
- no structure yet



































| <b>Open Performance Modeling Issues</b>                                                                                                       | Modeling Related Work                                                                                                                                            |  |  |
|-----------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|
| <ul> <li>Short term</li> <li>Better modeling of memory subsystem</li> <li># outstanding loads to accurately predict memory latency</li> </ul> | <ul> <li>Reuse distance         <ul> <li>—Cache utilization [BeyIs &amp; D'Hollander]</li> <li>—Investigating optimizations [Ding et al.]</li> </ul> </li> </ul> |  |  |
| <ul><li>Explore modeling of irregular applications</li><li>Long term</li></ul>                                                                | <ul> <li>Program instrumentation</li> <li>—EEL, QPT [Ball, Larus, Schnarr]</li> </ul>                                                                            |  |  |
| —Model parallel applications<br>– Present modeling applies between synchronization points                                                     | <ul> <li>Scalable analytic models</li> <li>—[Vernon et al; Hoisie et al.]</li> </ul>                                                                             |  |  |
| <ul> <li>Combine with manually constructed parallel models</li> <li>Semi-automatically recover parallel trends</li> </ul>                     | <ul> <li>Cross-architecture models at scale         —[Snavely et al.; Cascaval et al.]</li> </ul>                                                                |  |  |
|                                                                                                                                               | Simulation (trace-based and execution-driven)                                                                                                                    |  |  |
| Mellor-Crummey <sup>1</sup>                                                                                                                   | Mellor-Crummey         154                                                                                                                                       |  |  |

| Instruction Based Memory Distance Analysis<br>and its Application to Optimization | Motivation                                                                                                                                                                                                                                                                                                                                                                                                                                      |  |  |
|-----------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|
| > Changpeng Fang<br>> Steve Carr<br>> Soner Önder<br>> Zhenlin Wang               | <ul> <li>Memory distance</li> <li>A dynamic quantifiable distance in terms of memory reference between tow access to the same memory location.</li> <li>reuse distance</li> <li>access distance</li> <li>value distance</li> <li>Is memory distance predictable across both integer and floating-point codes?</li> <li>predict miss rates</li> <li>predict critical instructions</li> <li>identify instructions for load speculation</li> </ul> |  |  |
| Michigan Tech Carr, Fang, Onder, Wang 155                                         | MongenTech Carr, Fang, Onder, Wang 156                                                                                                                                                                                                                                                                                                                                                                                                          |  |  |







# Experimental Methodology

- Use 11 CFP2000 and 11 CINT2000 benchmarks
   others don't compile correctly
- > Use ATOM to collect reuse distance statistics
- > Use test and train data sets for training runs
- > Evaluation based on dynamic weighting
- Report reuse distance prediction accuracy
  - value and access very similar

Mongen Tech Carr, Fang, Onder, Wang

# **Reuse Distance Prediction**

| Suite    | Patt      | erns    | Coverage | Accuracy |
|----------|-----------|---------|----------|----------|
|          | %constant | %linear | ~ %      | %        |
| CFP2000  | 85.1      | 7.7     | 93.0     | 97.6     |
| CINT2000 | 81.2      | 5.1     | 91.6     | 93.8     |

MengenTech Carr, Fang, Onder, Wang

## Coverage issues

159

161

Reasons for no coverage

- 1. instruction does not appear in at least one test run
- 2. reuse distance of test is larger than train
- 3. number of patterns does not remain constant in both training runs

| Suite    | Reason 1 | Reason 2 | Reason 3 |
|----------|----------|----------|----------|
| CFP2000  | 4.2%     | 0.3%     | 2.5%     |
| CINT2000 | 2.2%     | 4.4%     | 1.8%     |

Mchigan**Tech** Carr, Fang, Onder, Wang

# Number of Patterns

| Suite    | 1     | 2     | 3    | 4    | ≥5   |
|----------|-------|-------|------|------|------|
| CFP2000  | 81.8% | 10.5% | 4.8% | 1.4% | 1.5% |
| CINT2000 | 72.3% | 10.9% | 7.6% | 4.6% | 5.3% |

Carr, Fang, Onder, Wang



# Cache Configurations

| config no. | L1                | L2         |              |  |
|------------|-------------------|------------|--------------|--|
| 1          | 32K, fully assoc. | 1 <b>M</b> | fully assoc. |  |
| 2          |                   |            | 8-way        |  |
| 3          | 32K, 2-way        | 1 <b>M</b> | 4-way        |  |
| 4          |                   |            | 2-way        |  |
|            |                   |            |              |  |
|            |                   |            |              |  |

# L1 Miss Rate Prediction Accuracy

| Suite    | PRD  | RRD  | TCS  |
|----------|------|------|------|
| CFP2000  | 97.5 | 98.4 | 95.1 |
| CINT2000 | 94.4 | 96.7 | 93.9 |

Mongen Tech Carr, Fang, Onder, Wang

# L2 Miss Rate Accuracy

| Suite    | 2-way |     |     | Fully Associative |       |     |  |
|----------|-------|-----|-----|-------------------|-------|-----|--|
|          | PRD   | RRD | TCS | PRD               | RRD   | TCS |  |
| CFP2000  | 91%   | 93% | 87% | 97%               | 99.9% | 91% |  |
| CINT2000 | 91%   | 95% | 87% | 94%               | 99.9% | 89% |  |

166

Mchigan Tech Carr, Fang, Onder, Wang

Critical Instruction Prediction Critical Instructions Given reuse distance for an instruction · Can we determine which instructions are critical in terms of cache Suite PRD TCS performance? RRD %pred %act > An instruction is critical if it is in the set of instructions that CPF2000 92% 98% 51% 1.66% 1.67% generate the most L2 cache misses Those top miss-rate instructions whose cumulative total misses account for 95% of the misses in a program. **CINT2000** 89% 98% 53% 0.94% 0.97% > Use the execution frequency of one training run to determine the relative contribution number of misses for each instruction > Compare the actual critical instructions with predicted • Use cache configuration 2 Carr, Fang, Onder, Wang Carr, Fang, Onder, Wang 167 168

| Suite   | 1    | 2    | 3    | 4    | ≥5  | PRD performs better than TCS when data size is a factor                                                                                                    |
|---------|------|------|------|------|-----|------------------------------------------------------------------------------------------------------------------------------------------------------------|
| CFP2000 | 22.1 | 38.4 | 20.0 | 12.8 | 6.7 | <ul> <li>TCS performs better when data size doesn't change<br/>much and there are conflict misses</li> </ul>                                               |
|         | 10.7 | 14.5 | 23.5 | 22.5 | 10  | <ul> <li>PRD is much better at identifying the critical<br/>instructions than TCS</li> <li>these instructions should be targets of optimization</li> </ul> |
|         |      |      |      |      |     |                                                                                                                                                            |



