
- Wire delay is not a problem for SMT (in the near future) Zeshan Chishti and T. N. Vijaykumar
- Microarchitectural Techniques to Reduce Interconnect Power in Clustered Karthik Ramani
- Recent works [14] show that delays introduced in the issue and bypass logic will become critical for wide issue super-
- Lecture 21: Coherence and Interconnection Networks Flexible Snooping: Adaptive Filtering and Forwarding
- LEVERAGING MIXED PROCESS THREE-DIMENSIONAL DIE STACKING
- Journal of Instruction-Level Parallelism 8 (2005) 1-16 Submitted 6/05; published 10/05 A Case for Thermal-Aware Floorplanning at the
- Towards Scalable, Energy-Efficient, Bus-Based On-Chip Networks Aniruddha N. Udipi
- To appear at the 25th International Symposium on Computer Architecture, June 1998. Threaded Multiple Path Execution
- Lecture 13: Interconnection Networks Topics: flow control, router pipelines, case studies
- Lecture 15: Large Cache Design Topics: innovations for multi-mega-byte cache hierarchies
- Lecture 19: Scalable Protocols & Synch Topics: coherence protocols for distributed
- As device dimensions shrink, conventional global on-chip wires are not scaling
- RETROSPECTIVE: Improving Direct-Mapped Cache Performance by the Addition of a Small
- Lecture 11: SMT and Caching Basics Today: SMT, cache access basics
- Lecture 7: Transactional Memory Intro Topics: introduction to transactional memory,
- Appears in the 11th International Conference on Architectural Support for Programming Languages and Operating Systems Coherence Decoupling: Making Use of Incoherence
- Lecture 2: System Metrics and Pipelining Today's topics: (Sections 1.6, 1.7, 1.9, A.1)
- 0018-9162/03/$17.00 2003 IEEE December 2003 49 COVER FEATURE
- Introduction Background: CS 3810 or equivalent, based on Hennessy
- CACTI 6.0: A Tool to Understand Large Caches Naveen Muralimanohar
- Lecture 5: Synchronization Topics: synchronization primitives and optimizations
- Complexity-Effective Superscalar Processors Subbarao Palacharla Norman P. Jouppi J. E. Smith
- Appears in the Proceedings of the 29th International Symposium on Computer Architecture
- Appears in the Proceedings of the 27th Annual International Symposium on Computer Architecture
- A Case for Increased Operating System Support in Chip Multi-Processors
- Lecture 6: Static ILP Topics: loop analysis, SW pipelining, predication, speculation
- Optimizing Communication and Capacity in a 3D Stacked Reconfigurable Cache Hierarchy
- URCS Technical Report 745 A High Performance Two-Level Register File Organization
- Lecture 10: TM Pathologies Topics: scalable lazy implementation, paper on TM
- The Microarchitecture of the Pentium 4 Processor
- Optimizing a Multi-Core Processor for Message-Passing Workloads Niladrish Chatterjee, Seth H. Pugsley, Josef Spjut, Rajeev Balasubramonian
- Reducing the Complexity of the Register File in Dynamic Superscalar Processors Rajeev Balasubramonian, Sandhya Dwarkadas, and David H. Albonesi
- Lecture 23: Interconnection Networks Topics: communication latency, centralized and
- Advances in IC processing allow for more microprocessor design options. The increasing gate density and cost of wires in advanced
- Dynamically Managing the Communication-Parallelism Trade-off in Future Clustered Processors
- CHOP: Adaptive Filter-Based DRAM Caching for CMP Server Platforms
- Lecture 14: Virtual Memory Topics: virtual memory (Section 5.4)
- Energy-Efficient Processor Design Using Multiple Clock Domains with Dynamic Voltage and Frequency Scaling
- Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors
- Lecture 17: Large Cache Design Managing Distributed, Shared L2 Caches through
- Lecture 16: Cache Innovations / Case Studies Topics: prefetching, blocking, processor case studies
- Dynamic Management of Microarchitecture Resources in Future Microprocessors
- Maximizing CMP Throughput with Mediocre Cores John D. Davis, James Laudon
- Power Efficient Processor Architecture and The Cell Processor H. Peter Hofstee
- A First-Order Analysis of Power Overheads of Redundant Multi-Threading
- SWEL: Hardware Cache Coherence Protocols to Map Shared Data onto Shared Caches
- Parallel Algorithms III Topics: graph and sort algorithms
- Power-driven Design of Router Microarchitectures in On-chip Networks Hangsheng Wang Li-Shiuan Peh Sharad Malik
- CS 7810 Lecture 2 Complexity-Effective Superscalar Processors
- To appear in the Proceedings of the 26th International Symposium on Computer Architecture (ISCA 26), 1999. Is SC + ILP = RC?
- Scalable and Reliable Communication for Hardware Transactional Memory
- Parallel Algorithms IV Topics: image analysis algorithms
- ARCHITECTING EFFICIENT INTERCONNECTS FOR LARGE CACHES
- Trace Cache: a Low Latency Approach to High Bandwidth Instruction Fetching Eric Rotenberg
- Lecture 2: Intro and Snooping Protocols Topics: multi-core cache organizations, programming
- Lecture 23: Interconnection Networks Express Virtual Channels: Towards the Ideal
- Lecture 20: Synchronization & Consistency Topics: synchronization, consistency models
- Leveraging 3D Technology for Improved Reliability Niti Madan, Rajeev Balasubramonian
- Microarchitectural Wire Management for Performance and Power in Partitioned Architectures
- Design Tradeoffs for the Alpha EV8 Conditional Branch Predictor Andre Seznec Stephen Felix Venkata Krishnan Yiannakis Sazeides
- Power-Efficient Approaches to Redundant Multithreading
- Lecture 26: Storage Systems Topics: Storage Systems (Chapter 6), other innovations
- Proceedings of 12th Intl Conference on Parallel Architectures and Compilation Techniques, September 2003. Initial Observations of the Simultaneous Multithreading Pentium 4 Processor
- In Proceedings of the 34th International Symposium on Microarchitecture, December, 2001 Reducing Power with Dynamic Critical Path Information
- Lecture 6: Directory Protocols Topics: directory-based cache coherence implementations
- Lecture 25: Interconnection Networks Topics: flow control, router microarchitecture
- Hot-and-Cold: Using Criticality in the Design of Energy-Efficient Caches Rajeev Balasubramonian
- ViChaR: A Dynamic Virtual Channel Regulator for Network-on-Chip Routers Chrysostomos A. Nicopoulos, Dongkook Park, Jongman Kim,
- JUNE 1993 Technical Note TN-36
- NOVEMBER 1993 Research Report 93/6
- Power dissipation limits have emerged as a major constraint in the design
- Managing Wire Delay in Large Chip-Multiprocessor Caches Bradford M. Beckmann and David A. Wood
- IEEE TRANSACTIONS ON COMPUTERS, VOL. 52, NO. 10, OCTOBER 2003 1 A Dynamically Tunable Memory Hierarchy
- Lecture 24: Interconnection Networks Topics: topologies, routing, deadlocks, flow control
- The Potential for Using Thread-Level Data Speculation to Facilitate Automatic Parallelization
- Dynamically Managing the Communication-Parallelism Tradeoff in Future Clustered Processors
- Memory Hierarchy Reconfiguration for Energy and Performance in General-Purpose Processor Architectures
- Dynamically Allocating Processor Resources between Nearby and Distant ILP Rajeev Balasubramonian, Sandhya Dwarkadas, and David H. Albonesi
- Dynamic Memory Hierarchy Performance Optimization Rajeev Balasubramonian, David Albonesi, Alper Buyuktosunoglu, and Sandhya Dwarkadas
- Interconnect Design Considerations for Large NUCA Naveen Muralimanohar
- Cache Decay: Exploiting Generational Behavior to Reduce Cache Leakage Power
- Lecture 21: Router Design Power-Driven Design of Router Microarchitectures
- Lecture 19: Networks for Large Cache Design Interconnect Design Considerations for Large NUCA
- Meanwhile, numerous mechanisms have been proposed and implemented to eliminate false data dependences and toler-
- The Microarchitecture of Superscalar Processors James E. Smith
- Lecture 3: Snooping Protocols Topics: snooping-based cache coherence implementations
- Flexible Snooping: Adaptive Forwarding and Filtering of Snoops in Embedded-Ring Multiprocessors
- Memory Dependence Prediction using Store Sets George Z. Chrysos and Joel S. Emer
- Exploiting Eager Register Release in a Redundantly Multi-Threaded Processor Niti Madan, Rajeev Balasubramonian
- Appears in the Proceedings of the 33rd Annual International Symposium on Microarchitecture
- Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers
- Wire Management for Coherence Traffic in Chip Multiprocessors Liqun Cheng, Naveen Muralimanohar, Karthik Ramani, Rajeev Balasubramonian, John Carter
- Lecture 16: Large Cache Design An Adaptive, Non-Uniform Cache Structure for
- Pipeline Gating: Speculation Control For Energy Reduction
- Microarchitectural Tradeoffs in the Design of a Scalable Clustered Microprocessor
- URCS Technical Report 743 Dynamically Allocating Processor Resources between Nearby and Distant ILP
- Shared Memory Consistency Models: A Tutorial Sarita V. Adve and Kourosh Gharachorloo
- Power Efficient Resource Scaling in Partitioned Architectures through Dynamic Heterogeneity
- Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores
- Micro-Pages: Increasing DRAM Efficiency with Locality-Aware Data Placement
- Non-Uniform Power Access in Large Caches with Low-Swing Wires Aniruddha N. Udipi
- OS Execution on Multi-Cores: Is Out-Sourcing Worthwhile? David Nellans, Rajeev Balasubramonian, Erik Brunvand
- Dynamic Hardware-Assisted Software-Controlled Page Placement to Manage Capacity Allocation and Sharing within Large Caches
- Optimizing NUCA Organizations and Wiring Alternatives for Large Caches With CACTI 6.0
- Journal of Instruction-Level Parallelism 9 (2007) 1-27 Submitted 2/07; published 6/07 Understanding the Impact of 3D Stacked Layouts on ILP
- Interconnect-Aware Coherence Protocols for Chip Multiprocessors Liqun Cheng, Naveen Muralimanohar, Karthik Ramani, Rajeev Balasubramonian, John B. Carter
- WIRE AWARE CACHE ARCHITECTURE Naveen Muralimanohar
- QUANTIFYING THE IMPACT OF INTERBLOCK WIRE-DELAYS ON PROCESSOR
- Exploring the Design Space for 3D Clustered Architectures Manu Awasthi, Rajeev Balasubramonian
- The Effect of Interconnect Design on the Performance of Large L2 Caches Naveen Muralimanohar, Rajeev Balasubramonian
- Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures
- Leveraging Bloom Filters for Smart Search Within NUCA Caches Robert Ricci, Steve Barrus, Dan Gebhardt, and Rajeev Balasubramonian
- Re-Visiting the Performance Impact of Microarchitectural Floorplanning Anupam Chakravorty, Abhishek Ranjan, Rajeev Balasubramonian
- Interference Aware Cache Designs for Operating System Execution
- Scalable, Reliable, Power-Efficient Communication for Hardware
- Power-Efficient Approaches to Reliability
- Combining Memory and a Controller with Photonics through 3D-Stacking to Enable Scalable and
- Lecture 3: Pipelining Basics Biggest contributors to performance: clock speed, parallelism
- Lecture 4: Advanced Pipelines Control hazards, multi-cycle in-order pipelines, static ILP
- Lecture 5: Static ILP Basics Topics: loop unrolling, VLIW
- Lecture 7: Static ILP and branch prediction Topics: static speculation and branch prediction
- Lecture 8: Branch Prediction, Dynamic ILP Topics: static speculation and branch prediction
- Lecture Notes: Out-of-Order Processors Rajeev Balasubramonian
- Lecture 9: Dynamic ILP Topics: out-of-order processors
- Lecture 10: ILP Innovations Today: ILP innovations and SMT
- Lecture 12: Cache Innovations Today: cache access basics and innovations
- Lecture 13: Memory Design Topics: virtual memory, DRAMs
- Lecture 17: Multiprocessors Topics: multiprocessor intro and taxonomy, symmetric
- Lecture 18: Coherence Protocols Topics: coherence protocols for symmetric and distributed
- Lecture 21: Transactional Memory Topics: consistency model recap,
- Lecture 22: Transactional Memory Topics: transactional memory implementations
- Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor
- Lecture 1: Parallel Architecture Intro Course organization
- Lecture 4: Directory Protocols Topics: directory-based cache coherence implementations
- Lecture 5: Directory Protocols Topics: directory-based cache coherence implementations
- Lecture 12: Hardware/Software Trade-Offs Topics: COMA, Software Virtual Memory
- Lecture 8: Transactional Memory TCC Topics: "lazy" implementation (TCC)
- Lecture 9: TM Implementations Topics: wrap-up of "lazy" implementation (TCC),
- Lecture 10: TM Implementations Topics: wrap-up of eager implementation (LogTM),
- Lecture 12: Interconnection Networks Topics: dimension/arity, routing, deadlock, flow control
- Lecture 15: Consistency Models Topics: sequential consistency, requirements to implement
- Lecture 12: Relaxed Consistency Models Topics: sequential consistency recap, relaxing various
- Lecture 20: Speculation Is SC+ILP=RC?, Purdue, ISCA'99
- In-Network Cache Coherence Noel Eisley
- A Gracefully Degrading and Energy-Efficient Modular Router Architecture for On-Chip Networks
- To provide high dependability in a multithreaded system despite hardware faults, the system must detect and cor-
- Lecture 22: Fault Tolerance Token Coherence: Decoupling Performance and
- Lecture 24: Parallel Algorithms I Topics: sort and matrix algorithms
- Lecture 25: Parallel Algorithms II Topics: matrix, graph, and sort algorithms
- Parallel Algorithms II Topics: matrix and graph algorithms
- Simultaneous Multithreading: Maximizing On-Chip Parallelism Dean M. Tullsen, Susan J. Eggers, and Henry M. Levy
- Delaying Physical Register Allocation Through Virtual-Physical Registers
- Commit Algorithms for Scalable Hardware Transactional Memory
- Refining the Utility Metric for Utility-Based Cache Partitioning Xing Lin, Rajeev Balasubramonian
- Prediction Based DRAM Row-Buffer Management in the Many-Core Era Manu Awasthi, David W. Nellans, Rajeev Balasubramonian, Al Davis
- Buses and Crossbars Rajeev Balasubramonian
- CHOP: INTEGRATING DRAM CACHES FOR CMP SERVER PLATFORMS
- Understanding the Behavior of Pthread Applications on Non-Uniform Cache Architectures
- USIMM: the Utah SImulated Memory Module A Simulation Infrastructure for the JWAC Memory