Massed Refresh: An Energy-Efficient Technique to
Reduce Refresh Overhead in Hybrid Memory
Cube Architectures
Ishan Thakkar, Sudeep Pasricha
Department of Electrical and Computer Engineering
Colorado State University, Fort Collins, CO, U.S.A.
{ishan.thakkar, sudeep}@colostate.edu
VLSID 2016
KOLKATA, INDIA
January 4-8, 2016
DOI 10.1109/VLSID.2016.13
• Introduction
• Background on DRAM Structure and Refresh
Operation
• Related Work
• Contributions
• Evaluation Setup
• Evaluation Results
• Conclusion
Outline
1
Introduction
3
• Main memory is DRAM
• It is a critical component of all computing systems: server, desktop,
mobile, embedded, sensor
• DRAM stores data in cell capacitor
• Fully charged cell-capacitor → logic ‘1’
• Fully discharged cell-capacitor → logic ‘0’
• A DRAM cell loses its data over time, as the cell-capacitor leaks charge
• At temperatures below 85°C, a DRAM cell loses its data within 64 ms
• At higher temperatures, a DRAM cell loses its data at a faster rate
DRAM:
Dynamic Random Access Memory
[Figure: DRAM cell — an access transistor and a cell capacitor connected to a word line and a bitline]
To preserve data integrity, the charge on each DRAM cell
(cell-capacitor) must be periodically restored or refreshed.
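The retention numbers above imply how often refresh commands must be issued. A minimal back-of-the-envelope sketch (the per-bank row count here is an assumed illustrative figure, not from the slides):

```python
# Back-of-the-envelope refresh arithmetic for the 64 ms retention window.
RETENTION_MS = 64.0   # retention time below 85 degrees C (from the slide)
NUM_ROWS = 8192       # assumed rows to refresh per retention window

# Average interval between refresh commands so that every row is
# visited once per retention window.
trefi_us = RETENTION_MS * 1000.0 / NUM_ROWS
print(f"tREFI ~= {trefi_us:.4f} us")  # ~7.8 us
```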
Background on DRAM Structure
5
• Based on their structure, DRAMs are classified into
two categories:
1. 2D DRAMs: Planar single layer DRAMs
2. 3D-Stacked DRAMs: Multiple 2D DRAM layers stacked
on one-another using TSVs
• 2D DRAM structure
TSV:
Through Silicon Via
2D DRAM Structure
Hierarchy: Rank → Chip → Bank → Subarray → Bitcell
2D DRAM: Rank and Chip Structure
6
[Figure: DRAM rank of multiple chips working in tandem and sharing a mux; internal structure of one DRAM chip]
• 2D DRAM rank:
• Multiple chips work in tandem
3D-Stacked DRAM Structure
7
HMC Structure
Hierarchy: Vault → Bank → Subarray → Bitcell
Hybrid Memory Cube
In this paper, we consider the Hybrid Memory Cube (HMC), a standard
for 3D-Stacked DRAMs defined by an industry consortium
DRAM Bank Structure
8
[Figure: bank core of subarrays (rows × columns) with sense amplifiers and a row address decoder; bank peripherals with a row buffer, column mux, data bits, and column address decoder]
3D-Stacked and 2D DRAMs have similar bank structures
DRAM Subarray Structure
9
[Figure: subarray of DRAM cells (access transistor + cell capacitor) at word line/bitline crossings; the row address selects a word line, and each bitline terminates in a sense amp]
3D-Stacked and 2D DRAMs have similar subarray structures
Basic DRAM Operations
10
PRECHARGE: all bitlines of the bank are precharged to 0.5 VDD
[Figure: bank with global row decoder, subarray decoders, sense amplifiers, global address latch, row buffer, and column mux/column address decoder]
Basic DRAM Operations
11
ACTIVATION: the target row (Row 4 in Subarray 1) is opened
[Figure: the row address is decoded by the global row decoder and subarray decoders to select Row 4]
Basic DRAM Operations
12
The opened row’s charge is captured by the sense amplifiers (SAs)
[Figure: Row 4 sensed by the subarray’s sense amplifiers]
Basic DRAM Operations
13
The SAs drive each bitline fully to either VDD or 0V, restoring the open row
[Figure: Row 4 restored by the sense amplifiers]
Basic DRAM Operations
14
The open row is stored in the global row buffer
[Figure: Row 4 copied into the global row buffer]
Basic DRAM Operations
15
READ: the target data block (Column 1) is selected and then multiplexed out of the row buffer
[Figure: the column address decoder selects Column 1 from the row buffer]
Basic DRAM Operations
16
[Figure: complete PRECHARGE-ACTIVATION-READ sequence for Row 4, Column 1]
A pair of PRECHARGE-ACTIVATION operations restores/refreshes the target
row → dummy PRECHARGE-ACTIVATION operations are performed to
refresh the rows
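The command sequence walked through on the preceding slides can be summarized as a short sketch (our own illustration; a refresh is the same PRECHARGE-ACTIVATION pair, just without the READ):

```python
# Sketch of the DRAM command sequences described on the slides.
def access_commands(row, col):
    return [
        ("PRECHARGE", None),  # bitlines precharged to 0.5 VDD
        ("ACTIVATE", row),    # open the row; SAs capture and restore it
        ("READ", col),        # mux the target block out of the row buffer
    ]

def refresh_commands(row):
    # Dummy PRECHARGE-ACTIVATION pair: restores the row without reading it.
    return [("PRECHARGE", None), ("ACTIVATE", row)]

print(access_commands(4, 1))
print(refresh_commands(4))
```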
Refresh: 2D Vs 3D-Stacked DRAMs
17
• 3D-Stacked DRAMs have
• Higher capacity/density → more rows need to be refreshed
• Higher power density → higher operating temperature (>85°C)
→ smaller retention period (time before DRAM cells lose data) of
32 ms, versus 64 ms for 2D DRAMs
• Thus, the refresh problem is more critical for 3D-Stacked DRAMs
• Therefore, in this study, we target HMC, a standardized 3D-Stacked
DRAM architecture
Refresh
Dummy ACTIVATION-PRECHARGE operations are performed on all rows every
retention cycle (32 ms)
To prevent long pauses → a JEDEC-standardized Distributed Refresh
method is used
Background: Refresh Operation
18
• Distributed Refresh – JEDEC standardized method
• A group of 𝑛 rows is refreshed every 3.9 μs
• Such a group of 𝑛 rows forms a ‘Refresh Bundle (RB)’
• The size of an RB increases with DRAM capacity → increases tRFC
Example Distributed Refresh Operation – 1Gb HMC Vault
[Timing diagram: refresh bundles RB1…RB8192, one per tREFI = 3.9 µs interval, cover the 32 ms retention cycle; each RB holds 16 rows (Row1…Row16), refreshed back-to-back within tRFC, each taking tRC plus a recovery time tREC]
tREFI:
Refresh Interval
tRFC:
Refresh Cycle Time
tRC:
Row Cycle Time
tRFC = time taken to refresh the entire RB
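The example’s numbers check out with simple arithmetic; a minimal sketch using only the figures stated on this slide:

```python
# Distributed refresh arithmetic for the 1 Gb HMC vault example:
# 8192 refresh bundles (RBs) of 16 rows over a 32 ms retention cycle.
RETENTION_US = 32_000.0
NUM_BUNDLES = 8192
ROWS_PER_BUNDLE = 16

trefi_us = RETENTION_US / NUM_BUNDLES        # interval between RBs
total_rows = NUM_BUNDLES * ROWS_PER_BUNDLE   # rows refreshed per cycle
print(f"tREFI = {trefi_us:.5f} us")          # ~3.9 us, as on the slide
print(f"rows per vault = {total_rows}")      # 131072
```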
Performance Overhead of Distributed
Refresh
19
Source: J Liu+, ISCA 2012
The performance overhead of refresh increases with device capacity
Energy Overhead of Distributed Refresh
20
Source: J Liu+, ISCA 2012
The energy overhead of refresh increases with device capacity
Refresh is a growing problem that must be addressed to realize
low-latency, low-energy DRAMs
Related Work
23
We improve upon Scattered Refresh
Scattered Refresh improves upon Per-bank Refresh and All-bank Refresh
All-Bank Refresh Vs Per-Bank Refresh
24
• Distributed Refresh can be implemented at two
different granularities
• All-bank Refresh: all banks are refreshed simultaneously,
and none of the banks is allowed to serve any request until
the refresh is complete
• Supported by all general-purpose DDRx DRAMs
• DRAM operation is completely stalled → the number of available banks
(#AB) is zero
• Exploits bank-level parallelism (BLP) for refreshing → smaller tRFC
• Per-bank Refresh: only one bank is refreshed at a time, so
all other banks are allowed to serve other requests
• Supported by LPDDRx DRAMs
• #AB > 0
• No BLP → larger value of tRFC
tRFC:
Refresh Cycle Time
All-Bank Refresh Vs Per-Bank Refresh
25
[Timing diagram: dummy ACTIVATION-PRECHARGE operations issued per refresh command, for All-Bank and Per-Bank Refresh]
All-Bank Refresh:
• Smaller value of tRFC
• Number of available banks (#AB) = 0
→ DRAM operation is completely stalled
Per-Bank Refresh:
• #AB > 0
• No BLP → larger value of tRFC
(tRC: Row Cycle Time; tRFC: Refresh Cycle Time)
Both All-bank Refresh and Per-bank Refresh have
drawbacks and they can be improved
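The trade-off on this slide can be captured in a first-order model (our own illustration with assumed timings, not figures from the paper): All-bank Refresh spreads the refresh bundle across all banks in parallel, while Per-bank Refresh walks it through one bank row by row.

```python
# Illustrative first-order tRFC model (assumed timings).
def trfc_all_bank(rb_size, num_banks, t_rc):
    rows_per_bank = -(-rb_size // num_banks)  # ceiling division
    return rows_per_bank * t_rc               # banks refresh in parallel

def trfc_per_bank(rb_size, t_rc):
    return rb_size * t_rc                     # strictly sequential rows

T_RC = 50  # ns, assumed row cycle time
print(trfc_all_bank(16, 8, T_RC))  # smaller tRFC, but #AB = 0
print(trfc_per_bank(16, T_RC))     # larger tRFC, but #AB > 0
```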
L = Layer ID
B = Bank ID
SA = Subarray ID
R = Row ID
Scattered Refresh
26
Example Scattered Refresh Operation – HMC Vault – Refresh Bundle size of 4
• Improves upon Per-bank Refresh – uses subarray-level parallelism
(SLP) for refresh
• Each row of the RB is mapped to a different subarray
• SLP gives the opportunity to overlap a PRECHARGE with the next ACTIVATE
→ reduces tRFC
Source: T Kalyan+, ISCA 2012
[Timing diagram: Scattered Refresh]
(L = Layer ID, B = Bank ID, SA = Subarray ID, R = Row ID)
How does Scattered Refresh compare to Per-bank Refresh and All-bank Refresh?
Scattered Refresh
27
Example Scattered Refresh Operation – HMC Vault – Refresh Bundle size of 4
[Timing diagram comparing All-Bank, Scattered, and Per-Bank Refresh]
Room for improvement in Scattered Refresh:
tRFC for All-bank Refresh < tRFC for Scattered Refresh
< tRFC for Per-bank Refresh
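The PRECHARGE/ACTIVATE overlap can be modeled in one line. A hedged sketch with assumed timings (treating tRC = tRAS + tRP, which is a common simplification, not a figure from the paper): with SLP, each row’s PRECHARGE hides behind the next row’s ACTIVATE, so only the last PRECHARGE is exposed.

```python
# Illustrative timing model for Scattered Refresh (assumed timings).
T_RAS, T_RP = 35, 15          # ns, assumed; tRC = T_RAS + T_RP = 50 ns
RB_SIZE = 16

trfc_per_bank = RB_SIZE * (T_RAS + T_RP)   # no overlap between rows
trfc_scattered = RB_SIZE * T_RAS + T_RP    # PRECHARGEs hidden by next ACTIVATE
print(trfc_per_bank, trfc_scattered)
```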
• Introduction
• Background on DRAM Structure and Refresh
Operation
• Related Work
• Contributions
• Evaluation Setup
• Evaluation Results
• Conclusion
Outline
28
Contributions
29
• Crammed Refresh: Per-bank Refresh + All-bank Refresh
• 2 banks are refreshed in parallel, instead of 1 bank in Per-bank Refresh
and all banks in All-bank Refresh
• Massed Refresh: Crammed Refresh + Scattered Refresh
• 2 banks are refreshed in parallel
• Uses SLP in both banks being refreshed
#AB:
Number of
banks available
to serve other
requests while
remaining banks
are being
refreshed
#BLP:
Bank-level
Parallelism
#SLP:
Subarray-level
Parallelism
Only 2 banks are refreshed in parallel – a proof of concept
More than 2 banks can also be chosen
The idea is to keep a balance between #AB and BLP for refresh
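Combining the two contributions in the same first-order model (our own illustration with assumed timings): Crammed Refresh splits the RB across 2 banks refreshed in parallel; Massed Refresh additionally overlaps each PRECHARGE with the next ACTIVATE inside each bank via SLP.

```python
# Illustrative tRFC model for Crammed and Massed Refresh (assumed timings).
T_RAS, T_RP = 35, 15                 # ns, assumed; tRC = 50 ns
RB_SIZE, BANKS_IN_PARALLEL = 16, 2

rows_per_bank = RB_SIZE // BANKS_IN_PARALLEL     # RB split across 2 banks
trfc_crammed = rows_per_bank * (T_RAS + T_RP)    # BLP only
trfc_massed = rows_per_bank * T_RAS + T_RP       # BLP + SLP overlap
print(trfc_crammed, trfc_massed)
```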
Crammed Refresh – tRFC Timing
30
Example Crammed Refresh Operation – HMC Vault – Refresh Bundle size of 4
[Timing diagram comparing Scattered, Crammed, and Per-Bank Refresh]
• Bank-level parallelism (BLP) for refresh
• Only 2 banks are refreshed in parallel → #AB > 0
(L = Layer ID, B = Bank ID, SA = Subarray ID, R = Row ID)
tRFC for Crammed Refresh < tRFC for Scattered Refresh
Massed Refresh – tRFC Timing
31
Example Massed Refresh Operation – HMC Vault – Refresh Bundle size of 4
[Timing diagram comparing Massed, Crammed, and Per-Bank Refresh]
• Bank-level parallelism (BLP) +
subarray-level parallelism (SLP) for refresh
(L = Layer ID, B = Bank ID, SA = Subarray ID, R = Row ID)
tRFC for Massed Refresh
< tRFC for Crammed Refresh
How to implement BLP and SLP together?
Subarray-level Parallelism (SLP)
32
[Figure: Global Row-address Latch replaced by Per-Subarray Row-address Latches]
Source: Y Kim+, ISCA 2012
The Global Row-address Latch hinders SLP
Bank-level Parallelism (BLP)
33
[Figure: Refresh Controller in the vault controller on the Logic Base (LoB) — a 17-bit address counter (LayerAddr[2], BankAddr[1], RowAddr[14]) feeds an address calculator and a physical address decoder; the decoded LayerID and BankID (with a mask enable) and the latched row address are driven over TSV launch pads to the banks on memory dies 1–4]
BLP is implemented by masking the BankID during refresh
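The masking idea can be sketched in a few lines. The field layout (17-bit counter = LayerAddr[2] | BankAddr[1] | RowAddr[14]) is from the slide; the decode logic itself is our own illustration:

```python
# Sketch of BankID masking: with the mask enabled, every bank on the
# addressed layer matches the refresh address, so both banks refresh
# in parallel.
LAYER_BITS, BANK_BITS, ROW_BITS = 2, 1, 14

def decode(addr):
    row = addr & ((1 << ROW_BITS) - 1)
    bank = (addr >> ROW_BITS) & ((1 << BANK_BITS) - 1)
    layer = (addr >> (ROW_BITS + BANK_BITS)) & ((1 << LAYER_BITS) - 1)
    return layer, bank, row

def banks_selected(addr, mask_bank_id):
    layer, bank, row = decode(addr)
    # Masking BankID makes the bank-ID comparison a wildcard.
    return list(range(1 << BANK_BITS)) if mask_bank_id else [bank]

addr = (1 << (ROW_BITS + BANK_BITS)) | (1 << ROW_BITS) | 7  # layer 1, bank 1, row 7
print(banks_selected(addr, mask_bank_id=False))  # [1]
print(banks_selected(addr, mask_bank_id=True))   # [0, 1]
```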
Evaluation Setup
35
• Trace-driven simulation for PARSEC benchmarks
• Memory access traces extracted from detailed cycle-accurate simulations
using gem5
• These memory traces were then provided as inputs to the DRAM simulator
DRAMSim2
• Energy, timing and area analysis
• CACTI-3DD simulations, based on a 4Gb HMC quad model
• DRAMSim2 configuration
• Configured DRAMSim2 using CACTI-3DD results
36
Results I – Energy, Timing, Area
37
Results II – Throughput
38
Crammed refresh achieves 7.1% and 2.9% more throughput on average
over distributed per-bank refresh and scattered refresh respectively
PARSEC Benchmarks
Massed refresh achieves 8.4% and 4.3% more throughput on average
over distributed per-bank refresh and scattered refresh respectively
Results III – Energy Delay Product (EDP)
39
Crammed refresh achieves 6.4% and 2.7% less EDP on average over
distributed per-bank refresh and scattered refresh respectively
PARSEC Benchmarks
Massed refresh achieves 7.5% and 3.9% less EDP on average over
distributed per-bank refresh and scattered refresh respectively
40
Conclusions
41
• The proposed Massed Refresh technique exploits
• Bank-level as well as subarray-level parallelism during refresh
operations
• Proposed Crammed Refresh and Massed Refresh techniques
• Improve throughput and energy-efficiency of DRAM
• Crammed Refresh improves upon state-of-the-art
• 7.1% & 6.4% improvements in throughput and EDP over the
distributed per-bank refresh
• 2.9% & 2.7% improvements in throughput and EDP over the
scattered refresh schemes respectively
• Massed Refresh improves upon state-of-the-art
• 8.4% & 7.5% improvements in throughput and EDP over the
distributed per-bank refresh
• 4.3% & 3.9% improvements in throughput and EDP over the
scattered refresh schemes respectively
• Questions / Comments ?
Thank You
42

More Related Content

What's hot

System On Chip
System On ChipSystem On Chip
System On Chipanishgoel
 
Oracle Clusterware Node Management and Voting Disks
Oracle Clusterware Node Management and Voting DisksOracle Clusterware Node Management and Voting Disks
Oracle Clusterware Node Management and Voting DisksMarkus Michalewicz
 
RISC (reduced instruction set computer)
RISC (reduced instruction set computer)RISC (reduced instruction set computer)
RISC (reduced instruction set computer)LokmanArman
 
Memory Organisation in embedded systems
Memory Organisation in embedded systemsMemory Organisation in embedded systems
Memory Organisation in embedded systemsUthraSowrirajan1
 
Implementing Useful Clock Skew Using Skew Groups
Implementing Useful Clock Skew Using Skew GroupsImplementing Useful Clock Skew Using Skew Groups
Implementing Useful Clock Skew Using Skew GroupsM Mei
 
Oracle Database – Mission Critical
Oracle Database – Mission CriticalOracle Database – Mission Critical
Oracle Database – Mission CriticalMarkus Michalewicz
 
Introduction to PCB Design (Eagle)
Introduction to PCB Design (Eagle)Introduction to PCB Design (Eagle)
Introduction to PCB Design (Eagle)yeokm1
 
In memory computing
In memory computingIn memory computing
In memory computingGagan Reddy
 
Arm instruction set
Arm instruction setArm instruction set
Arm instruction setPriyangaKR1
 
Unit 1 Introduction to Embedded computing and ARM processor
Unit 1 Introduction to Embedded computing and ARM processorUnit 1 Introduction to Embedded computing and ARM processor
Unit 1 Introduction to Embedded computing and ARM processorVenkat Ramanan C
 
Db2 for z os trends
Db2 for z os trendsDb2 for z os trends
Db2 for z os trendsCuneyt Goksu
 
Oracle 12c and its pluggable databases
Oracle 12c and its pluggable databasesOracle 12c and its pluggable databases
Oracle 12c and its pluggable databasesGustavo Rene Antunez
 
Power Reduction Techniques
Power Reduction TechniquesPower Reduction Techniques
Power Reduction TechniquesRajesh M
 
System-on-Chip Design, Embedded System Design Challenges
System-on-Chip Design, Embedded System Design ChallengesSystem-on-Chip Design, Embedded System Design Challenges
System-on-Chip Design, Embedded System Design Challengespboulet
 

What's hot (20)

System On Chip
System On ChipSystem On Chip
System On Chip
 
Oracle Clusterware Node Management and Voting Disks
Oracle Clusterware Node Management and Voting DisksOracle Clusterware Node Management and Voting Disks
Oracle Clusterware Node Management and Voting Disks
 
RISC (reduced instruction set computer)
RISC (reduced instruction set computer)RISC (reduced instruction set computer)
RISC (reduced instruction set computer)
 
Embedded systems
Embedded systemsEmbedded systems
Embedded systems
 
Memory Organisation in embedded systems
Memory Organisation in embedded systemsMemory Organisation in embedded systems
Memory Organisation in embedded systems
 
Memory Organization
Memory OrganizationMemory Organization
Memory Organization
 
Implementing Useful Clock Skew Using Skew Groups
Implementing Useful Clock Skew Using Skew GroupsImplementing Useful Clock Skew Using Skew Groups
Implementing Useful Clock Skew Using Skew Groups
 
Oracle Database – Mission Critical
Oracle Database – Mission CriticalOracle Database – Mission Critical
Oracle Database – Mission Critical
 
normaliztion
normaliztionnormaliztion
normaliztion
 
Introduction to PCB Design (Eagle)
Introduction to PCB Design (Eagle)Introduction to PCB Design (Eagle)
Introduction to PCB Design (Eagle)
 
In memory computing
In memory computingIn memory computing
In memory computing
 
Introduction to EDA Tools
Introduction to EDA ToolsIntroduction to EDA Tools
Introduction to EDA Tools
 
Arm instruction set
Arm instruction setArm instruction set
Arm instruction set
 
Unit 1 Introduction to Embedded computing and ARM processor
Unit 1 Introduction to Embedded computing and ARM processorUnit 1 Introduction to Embedded computing and ARM processor
Unit 1 Introduction to Embedded computing and ARM processor
 
Db2 for z os trends
Db2 for z os trendsDb2 for z os trends
Db2 for z os trends
 
Oracle 12c and its pluggable databases
Oracle 12c and its pluggable databasesOracle 12c and its pluggable databases
Oracle 12c and its pluggable databases
 
Physical design
Physical design Physical design
Physical design
 
Power Reduction Techniques
Power Reduction TechniquesPower Reduction Techniques
Power Reduction Techniques
 
ROS distributed architecture
ROS  distributed architectureROS  distributed architecture
ROS distributed architecture
 
System-on-Chip Design, Embedded System Design Challenges
System-on-Chip Design, Embedded System Design ChallengesSystem-on-Chip Design, Embedded System Design Challenges
System-on-Chip Design, Embedded System Design Challenges
 

Viewers also liked

Process Variation Aware Crosstalk Mitigation for DWDM based Photonic NoC Arch...
Process Variation Aware Crosstalk Mitigation for DWDM based Photonic NoC Arch...Process Variation Aware Crosstalk Mitigation for DWDM based Photonic NoC Arch...
Process Variation Aware Crosstalk Mitigation for DWDM based Photonic NoC Arch...Ishan Thakkar
 
Mp So C 18 Apr
Mp So C 18 AprMp So C 18 Apr
Mp So C 18 AprFNian
 
Blue gene technology
Blue gene technologyBlue gene technology
Blue gene technologyVivek Jha
 
Modern Control - Lec 02 - Mathematical Modeling of Systems
Modern Control - Lec 02 - Mathematical Modeling of SystemsModern Control - Lec 02 - Mathematical Modeling of Systems
Modern Control - Lec 02 - Mathematical Modeling of SystemsAmr E. Mohamed
 
Tidal scale short_story_v2
Tidal scale short_story_v2Tidal scale short_story_v2
Tidal scale short_story_v2Chuck Piercey
 
Speed power exploration of 2-d intelligence network-on-chip for multi-clock m...
Speed power exploration of 2-d intelligence network-on-chip for multi-clock m...Speed power exploration of 2-d intelligence network-on-chip for multi-clock m...
Speed power exploration of 2-d intelligence network-on-chip for multi-clock m...eSAT Publishing House
 
DSP_FOEHU - Lec 07 - Digital Filters
DSP_FOEHU - Lec 07 - Digital FiltersDSP_FOEHU - Lec 07 - Digital Filters
DSP_FOEHU - Lec 07 - Digital FiltersAmr E. Mohamed
 
Hybrid Memory Cube: Developing Scalable and Resilient Memory Systems
Hybrid Memory Cube: Developing Scalable and Resilient Memory SystemsHybrid Memory Cube: Developing Scalable and Resilient Memory Systems
Hybrid Memory Cube: Developing Scalable and Resilient Memory SystemsMicronTechnology
 

Viewers also liked (10)

Process Variation Aware Crosstalk Mitigation for DWDM based Photonic NoC Arch...
Process Variation Aware Crosstalk Mitigation for DWDM based Photonic NoC Arch...Process Variation Aware Crosstalk Mitigation for DWDM based Photonic NoC Arch...
Process Variation Aware Crosstalk Mitigation for DWDM based Photonic NoC Arch...
 
Mathematical Modeling Experimental Approach of the Friction on the Tool-Chip ...
Mathematical Modeling Experimental Approach of the Friction on the Tool-Chip ...Mathematical Modeling Experimental Approach of the Friction on the Tool-Chip ...
Mathematical Modeling Experimental Approach of the Friction on the Tool-Chip ...
 
Mp So C 18 Apr
Mp So C 18 AprMp So C 18 Apr
Mp So C 18 Apr
 
Blue gene technology
Blue gene technologyBlue gene technology
Blue gene technology
 
Modern Control - Lec 02 - Mathematical Modeling of Systems
Modern Control - Lec 02 - Mathematical Modeling of SystemsModern Control - Lec 02 - Mathematical Modeling of Systems
Modern Control - Lec 02 - Mathematical Modeling of Systems
 
Blue brain
Blue brainBlue brain
Blue brain
 
Tidal scale short_story_v2
Tidal scale short_story_v2Tidal scale short_story_v2
Tidal scale short_story_v2
 
Speed power exploration of 2-d intelligence network-on-chip for multi-clock m...
Speed power exploration of 2-d intelligence network-on-chip for multi-clock m...Speed power exploration of 2-d intelligence network-on-chip for multi-clock m...
Speed power exploration of 2-d intelligence network-on-chip for multi-clock m...
 
DSP_FOEHU - Lec 07 - Digital Filters
DSP_FOEHU - Lec 07 - Digital FiltersDSP_FOEHU - Lec 07 - Digital Filters
DSP_FOEHU - Lec 07 - Digital Filters
 
Hybrid Memory Cube: Developing Scalable and Resilient Memory Systems
Hybrid Memory Cube: Developing Scalable and Resilient Memory SystemsHybrid Memory Cube: Developing Scalable and Resilient Memory Systems
Hybrid Memory Cube: Developing Scalable and Resilient Memory Systems
 

Similar to Massed Refresh: An Energy-Efficient Technique to Reduce Refresh Overhead in Hybrid Memory Cube Architectures

Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3
Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3
Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3Hsien-Hsin Sean Lee, Ph.D.
 
Computer organization memory
Computer organization memoryComputer organization memory
Computer organization memoryDeepak John
 
sramanddram.ppt
sramanddram.pptsramanddram.ppt
sramanddram.pptAmalNath44
 
Computer architecture for HNDIT
Computer architecture for HNDITComputer architecture for HNDIT
Computer architecture for HNDITtjunicornfx
 
Computer Organisation and Architecture
Computer Organisation and ArchitectureComputer Organisation and Architecture
Computer Organisation and ArchitectureSubhasis Dash
 
DRAM Cell - Working and Read and Write Operations
DRAM Cell - Working and Read and Write OperationsDRAM Cell - Working and Read and Write Operations
DRAM Cell - Working and Read and Write OperationsNaman Bhalla
 
High Bandwidth Memory(HBM)
High Bandwidth Memory(HBM)High Bandwidth Memory(HBM)
High Bandwidth Memory(HBM)HARINATH REDDY
 
Aerospike Hybrid Memory Architecture
Aerospike Hybrid Memory ArchitectureAerospike Hybrid Memory Architecture
Aerospike Hybrid Memory ArchitectureAerospike, Inc.
 
Memory Hierarchy PPT of Computer Organization
Memory Hierarchy PPT of Computer OrganizationMemory Hierarchy PPT of Computer Organization
Memory Hierarchy PPT of Computer Organization2022002857mbit
 
isca22-feng-menda_for sparse transposition and dataflow.pptx
isca22-feng-menda_for sparse transposition and dataflow.pptxisca22-feng-menda_for sparse transposition and dataflow.pptx
isca22-feng-menda_for sparse transposition and dataflow.pptxssuser30e7d2
 
Project Slides for Website 2020-22.pptx
Project Slides for Website 2020-22.pptxProject Slides for Website 2020-22.pptx
Project Slides for Website 2020-22.pptxAkshitAgiwal1
 
Basic Computer Architecture
Basic Computer ArchitectureBasic Computer Architecture
Basic Computer ArchitectureYong Heui Cho
 
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...ScyllaDB
 
Critical Performance Metrics for DDR4 based Systems
Critical Performance Metrics for DDR4 based SystemsCritical Performance Metrics for DDR4 based Systems
Critical Performance Metrics for DDR4 based SystemsBarbara Aichinger
 
Chapter5 the memory-system-jntuworld
Chapter5 the memory-system-jntuworldChapter5 the memory-system-jntuworld
Chapter5 the memory-system-jntuworldPraveen Kumar
 

Similar to Massed Refresh: An Energy-Efficient Technique to Reduce Refresh Overhead in Hybrid Memory Cube Architectures (20)

Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3
Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3
Lec11 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Memory part3
 
Computer organization memory
Computer organization memoryComputer organization memory
Computer organization memory
 
sramanddram.ppt
sramanddram.pptsramanddram.ppt
sramanddram.ppt
 
Computer architecture for HNDIT
Computer architecture for HNDITComputer architecture for HNDIT
Computer architecture for HNDIT
 
Computer Organisation and Architecture
Computer Organisation and ArchitectureComputer Organisation and Architecture
Computer Organisation and Architecture
 
DRAM Cell - Working and Read and Write Operations
DRAM Cell - Working and Read and Write OperationsDRAM Cell - Working and Read and Write Operations
DRAM Cell - Working and Read and Write Operations
 
Dd sdram
Dd sdramDd sdram
Dd sdram
 
Adaptive bank management[1]
Adaptive bank management[1]Adaptive bank management[1]
Adaptive bank management[1]
 
Memory Access Scheduling
Memory Access SchedulingMemory Access Scheduling
Memory Access Scheduling
 
High Bandwidth Memory(HBM)
High Bandwidth Memory(HBM)High Bandwidth Memory(HBM)
High Bandwidth Memory(HBM)
 
Aerospike Hybrid Memory Architecture
Aerospike Hybrid Memory ArchitectureAerospike Hybrid Memory Architecture
Aerospike Hybrid Memory Architecture
 
Memory Hierarchy PPT of Computer Organization
Memory Hierarchy PPT of Computer OrganizationMemory Hierarchy PPT of Computer Organization
Memory Hierarchy PPT of Computer Organization
 
isca22-feng-menda_for sparse transposition and dataflow.pptx
isca22-feng-menda_for sparse transposition and dataflow.pptxisca22-feng-menda_for sparse transposition and dataflow.pptx
isca22-feng-menda_for sparse transposition and dataflow.pptx
 
Project Slides for Website 2020-22.pptx
Project Slides for Website 2020-22.pptxProject Slides for Website 2020-22.pptx
Project Slides for Website 2020-22.pptx
 
Basic Computer Architecture
Basic Computer ArchitectureBasic Computer Architecture
Basic Computer Architecture
 
Memory management
Memory managementMemory management
Memory management
 
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
 
Critical Performance Metrics for DDR4 based Systems
Critical Performance Metrics for DDR4 based SystemsCritical Performance Metrics for DDR4 based Systems
Critical Performance Metrics for DDR4 based Systems
 
RAMinate ACM SoCC 2016 Talk
RAMinate ACM SoCC 2016 TalkRAMinate ACM SoCC 2016 Talk
RAMinate ACM SoCC 2016 Talk
 
Chapter5 the memory-system-jntuworld
Chapter5 the memory-system-jntuworldChapter5 the memory-system-jntuworld
Chapter5 the memory-system-jntuworld
 

Massed Refresh: An Energy-Efficient Technique to Reduce Refresh Overhead in Hybrid Memory Cube Architectures

  • 1. Massed Refresh: An Energy-Efficient Technique to Reduce Refresh Overhead in Hybrid Memory Cube Architectures Ishan Thakkar, Sudeep Pasricha Department of Electrical and Computer Engineering Colorado State University, Fort Collins, CO, U.S.A. {ishan.thakkar, sudeep}@colostate.edu VLSID 2016 KOLKATA, INDIA January 4-8, 2016 DOI 10.1109/VLSID.2016.13
  • 2. • Introduction • Background on DRAM Structure and Refresh Operation • Related Work • Contributions • Evaluation Setup • Evaluation Results • Conclusion Outline 1
  • 3. • Introduction • Background on DRAM Structure and Refresh Operation • Related Work • Contributions • Evaluation Setup • Evaluation Results • Conclusion Outline 2
  • 4. Introduction 3 • Main memory is DRAM • It is a critical component of all computing systems: server, desktop, mobile, embedded, sensor • DRAM stores data in cell capacitor • Fully charged cell-capacitor  logic ‘1’ • Fully discharged cell-capacitor  logic ‘0’ • DRAM cell loses data over time, as cell-capacitor leaks charge over time • For temperatures below 85°C, DRAM cell loses data in 64ms • For higher temperatures, DRAM cell loses data at faster rate DRAM: Dynamic Random Access Memory Word Line BitLine Cell CapacitorAccess Transistor To preserve data integrity, the charge on each DRAM cell (cell-capacitor) must be periodically restored or refreshed.
  • 5. • Introduction • Background on DRAM Structure and Refresh Operation • Related Work • Contributions • Evaluation Setup • Evaluation Results • Conclusion Outline 4
  • 6. Background on DRAM Structure 5 • Based on their structure, DRAMs are classified in two categories: 1. 2D DRAMs: Planar single layer DRAMs 2. 3D-Stacked DRAMs: Multiple 2D DRAM layers stacked on one-another using TSVs • 2D DRAM structure TSV: Through Silicon Via 2D DRAM Structure Hierarchy Chip Bank Subarray BitcellRank
  • 7. 2D DRAM: Rank and Chip Structure 6 <N> <N> <N> . . . <N> Mux DRAM Chip <N> DRAM Rank DRAM Chip • 2D DRAM rank: • Multiple chips work in tandem
  • 8. 3D-Stacked DRAM Structure 7 HMC Structure Hierarchy Vault Bank Subarray Bitcell Hybrid Memory Cube In this paper, we consider Hybrid Memory Cube (HMC), which is as a standard for 3D-Stacked DRAMs defined by a consortium of industries
  • 9. DRAM Bank Structure 8 Sense Amplifiers Sense Amplifiers RowAddressDecoder Row Buffer Columns Rows Subarray Column Mux Data bits Bank Core Bank Peripherals Column Address Decoder 3D-Stacked and 2D DRAMs have similar bank structures
  • 10. DRAM Subarray Structure 9 Sense Amps Row Address Word Line BitLine Cell CapacitorAccess Transistor Word Line BitLine Sense Amp Sense Amp Sense Amp DRAM Cell DRAM Cell 3D-Stacked and 2D DRAMs have similar subarray structures
  • 11. Basic DRAM Operations 10 PRECHARGE: all bitlines of the bank are pre-charged to 0.5 VDD [Diagram: bank with global row decoder, subarray decoders, sense amplifiers, global address latch, row buffer, column address decoder and column mux]
  • 12. Basic DRAM Operations 11 ACTIVATION: the target row (Row 4, Subarray ID 1) is opened
  • 13. Basic DRAM Operations 12 ACTIVATION: the target row is opened, then its charge is captured by the sense amplifiers (SAs)
  • 14. Basic DRAM Operations 13 SAs drive each bitline fully to either VDD or 0V, restoring the open row
  • 15. Basic DRAM Operations 14 The open row is stored in the global row buffer
  • 16. Basic DRAM Operations 15 READ: the target data block (Column 1) is selected, then multiplexed out from the row buffer
  • 17. Basic DRAM Operations 16 A pair of PRECHARGE-ACTIVATION operations restores/refreshes the target row → dummy PRECHARGE-ACTIVATION operations are performed to refresh the rows
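The PRECHARGE → ACTIVATION → READ walkthrough on slides 11-17 can be sketched as a toy bank state machine. This is a minimal illustrative model, not the HMC's actual control logic; the `Bank` class, its methods, and the stored data are assumptions made for illustration.

```python
# Toy model of the bank operations on slides 11-17: PRECHARGE closes the
# open row, ACTIVATION opens (and implicitly refreshes) a row into the
# row buffer, READ muxes one column block out of the buffer.

class Bank:
    def __init__(self, rows):
        self.rows = rows            # row_id -> list of column data blocks
        self.open_row = None        # row currently latched in the row buffer
        self.row_buffer = None

    def precharge(self):
        """Close any open row; bitlines return to 0.5 * VDD."""
        self.open_row = None
        self.row_buffer = None

    def activate(self, row_id):
        """Open a row: sense amps capture and restore it into the row
        buffer. Activation implicitly refreshes the row's cell capacitors."""
        assert self.open_row is None, "must PRECHARGE before opening a new row"
        self.open_row = row_id
        self.row_buffer = self.rows[row_id]

    def read(self, column):
        """Multiplex one data block out of the open row buffer."""
        assert self.open_row is not None, "no open row"
        return self.row_buffer[column]

bank = Bank({4: ["blk0", "blk1", "blk2"]})
bank.precharge()
bank.activate(4)        # open Row 4, as in the slide example
print(bank.read(1))     # -> blk1 (Column 1 of Row 4)
```

A dummy refresh is then just `precharge()` followed by `activate(row)` with no intervening READ, which is exactly the PRECHARGE-ACTIVATION pair slide 17 describes.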
  • 18. Refresh: 2D Vs 3D-Stacked DRAMs 17 • 3D-Stacked DRAMs have • Higher capacity/density → more rows need to be refreshed • Higher power density → higher operating temperature (>85°C) → smaller retention period (time before DRAM cells lose data) of 32ms, compared to 64ms for 2D DRAMs • Thus, the refresh problem is more critical for 3D-Stacked DRAMs • Therefore, in this study, we target a standardized 3D-Stacked DRAM architecture: the HMC Refresh: dummy ACTIVATION-PRECHARGE operations are performed on all rows every retention cycle (32 ms) To prevent long pauses, a JEDEC-standardized Distributed Refresh method is used
  • 19. Background: Refresh Operation 18 • Distributed Refresh – JEDEC-standardized method • A group of 𝑛 rows is refreshed every 3.9μs • A group of 𝑛 rows forms a 'Refresh Bundle (RB)' • Size of RB increases with DRAM capacity → increases tRFC Example Distributed Refresh Operation – 1Gb HMC Vault [Timing diagram: RB1 … RB8192, one RB per tREFI = 3.9µs, each taking tRFC; Retention Cycle = 32ms; RB size is 16; within an RB, Row1 … Row16 each take tRC, separated by tREC] tREFI: Refresh Interval tRFC: Refresh Cycle Time tRC: Row Cycle Time tRFC = time taken to refresh an entire RB
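A back-of-the-envelope check of the slide's numbers, assuming the stated 32 ms retention cycle, 8192 refresh bundles, and 16 rows per bundle. A sketch for intuition only, not a cycle-accurate model:

```python
# Illustrative distributed-refresh arithmetic for the 1Gb HMC vault
# example above; the parameter values are taken from the slide.

RETENTION_MS = 32.0      # retention cycle at >85°C, in ms
NUM_BUNDLES = 8192       # refresh bundles (RBs) per retention cycle
RB_SIZE = 16             # rows per RB

# Refresh interval between consecutive bundles
tREFI_us = RETENTION_MS * 1000.0 / NUM_BUNDLES   # 3.90625 us, i.e. ~3.9 us

# Total rows refreshed every retention cycle
total_rows = NUM_BUNDLES * RB_SIZE               # 131072 rows

print(tREFI_us, total_rows)
```

The 3.9 µs tREFI on the slide is simply the 32 ms retention cycle divided evenly over the 8192 bundles, and 8192 bundles of 16 rows cover 131072 rows per vault.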
  • 20. Performance Overhead of Distributed Refresh 19 Source: J Liu+, ISCA 2012 Performance overhead of refresh increases with device capacity
  • 21. Energy Overhead of Distributed Refresh 20 Source: J Liu+, ISCA 2012 Energy overhead of refresh increases with device capacity
  • 22. Energy Overhead of Distributed Refresh 21 Source: J Liu+, ISCA 2012 Energy overhead of refresh increases with device capacity Refresh is a growing problem, which needs to be addressed to realize low-latency, low-energy DRAMs
  • 23. • Introduction • Background on DRAM Structure and Refresh Operation • Related Work • Contributions • Evaluation Setup • Evaluation Results • Conclusion Outline 22
  • 24. Related Work 23 We improve upon Scattered Refresh; Scattered Refresh improves upon Per-bank Refresh and All-bank Refresh
  • 25. All-Bank Refresh Vs Per-Bank Refresh 24 • Distributed Refresh can be implemented at two different granularities • All-bank Refresh: All banks are refreshed simultaneously, and none of the banks is allowed to serve any request until refresh is complete • Supported by all general-purpose DDRx DRAMs • DRAM operation is completely stalled → number of available banks (#AB) is zero • Exploits bank-level parallelism (BLP) for refreshing → smaller tRFC • Per-bank Refresh: Only one bank is refreshed at a time, so all other banks are allowed to serve other requests • Supported by LPDDRx DRAMs • #AB > 0 • No BLP → larger value of tRFC tRFC: Refresh Cycle Time
  • 26. All-Bank Refresh Vs Per-Bank Refresh 25 [Timing diagrams: dummy ACTIVATION-PRECHARGE operations for the refresh command] All-Bank Refresh • Smaller value of tRFC • Number of available banks (#AB) = 0 → DRAM operation is completely stalled Per-Bank Refresh • #AB > 0 • No BLP → larger value of tRFC Both All-bank Refresh and Per-bank Refresh have drawbacks, and they can be improved tRC: Row Cycle Time tRFC: Refresh Cycle Time L = Layer ID B = Bank ID SA = Subarray ID R = Row ID
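A hedged first-order model of the tRFC trade-off described above: all-bank refresh spreads a bundle across the banks (BLP), while per-bank refresh serializes it inside one bank. The bundle size, bank count, and tRC value below are illustrative assumptions, not HMC datasheet numbers, and real devices face additional limits such as refresh current delivery.

```python
# First-order tRFC model for the two JEDEC refresh granularities.

RB_SIZE = 16       # rows per refresh bundle (assumption)
NUM_BANKS = 8      # banks refreshed in parallel by all-bank refresh (assumption)
TRC_NS = 50.0      # row cycle time in ns (assumption)

# All-bank: rows are refreshed in parallel across banks -> short tRFC,
# but #AB = 0 (no bank can serve requests meanwhile).
trfc_all_bank = (RB_SIZE / NUM_BANKS) * TRC_NS   # 100 ns

# Per-bank: one bank refreshes its bundle rows back-to-back -> long
# tRFC, but #AB > 0 (the other banks keep serving requests).
trfc_per_bank = RB_SIZE * TRC_NS                 # 800 ns

print(trfc_all_bank, trfc_per_bank)
```

The model captures the slide's point: BLP shrinks tRFC but zeroes out #AB, while per-bank refresh keeps #AB > 0 at the cost of a much longer tRFC.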
  • 27. Scattered Refresh 26 Example Scattered Refresh Operation – HMC Vault – Refresh Bundle size of 4 • Improves upon Per-bank Refresh – uses subarray-level parallelism (SLP) for refresh • Each row of an RB is mapped to a different subarray • SLP gives the opportunity to overlap a PRECHARGE with the next ACTIVATE → reduces tRFC Source: T Kalyan+, ISCA 2012 L = Layer ID B = Bank ID SA = Subarray ID R = Row ID How does Scattered Refresh compare to Per-bank Refresh and All-bank Refresh?
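The benefit of overlapping one subarray's PRECHARGE with the next subarray's ACTIVATE can be modeled roughly as below. Splitting tRC into an activate/restore part (tRAS) and a precharge part (tRP) is standard DRAM timing practice, but the specific nanosecond values here are assumptions for illustration only.

```python
# Rough tRFC model for Scattered Refresh's subarray-level parallelism.

RB_SIZE = 16      # rows per refresh bundle (assumption)
TRAS_NS = 35.0    # activation + restore time (assumption)
TRP_NS = 15.0     # precharge time (assumption)
TRC_NS = TRAS_NS + TRP_NS   # row cycle time = 50 ns

# Per-bank refresh: rows land in the same subarray, fully serialized.
trfc_per_bank = RB_SIZE * TRC_NS                  # 800 ns

# Scattered refresh: each row sits in a different subarray, so each
# row's PRECHARGE hides behind the next row's ACTIVATE; only the last
# precharge stays exposed.
trfc_scattered = RB_SIZE * TRAS_NS + TRP_NS       # 575 ns

print(trfc_per_bank, trfc_scattered)
```

Under these assumed timings, SLP shaves the hidden precharge time off every row but the last, which is why tRFC for Scattered Refresh lands between All-bank and Per-bank Refresh.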
  • 28. Scattered Refresh 27 Example Scattered Refresh Operation – HMC Vault – Refresh Bundle size of 4 [Timing diagrams: All-Bank, Scattered, Per-Bank] Room for improvement: Scattered Refresh tRFC for All-bank Refresh < tRFC for Scattered Refresh < tRFC for Per-bank Refresh
  • 29. • Introduction • Background on DRAM Structure and Refresh Operation • Related Work • Contributions • Evaluation Setup • Evaluation Results • Conclusion Outline 28
  • 30. Contributions 29 • Crammed Refresh: Per-bank Refresh + All-bank Refresh • 2 banks are refreshed in parallel, instead of 1 bank as in Per-bank Refresh or all banks as in All-bank Refresh • Massed Refresh: Crammed Refresh + Scattered Refresh • 2 banks are refreshed in parallel • Uses SLP in both banks being refreshed #AB: Number of banks available to serve other requests while the remaining banks are being refreshed BLP: Bank-level Parallelism SLP: Subarray-level Parallelism Only 2 banks are refreshed in parallel as a proof of concept – more than 2 banks can also be chosen The idea is to keep a balance between #AB and BLP for refresh
  • 31. Crammed Refresh – tRFC Timing 30 Example Crammed Refresh Operation – HMC Vault – Refresh Bundle size of 4 [Timing diagrams: Per-Bank, Scattered, Crammed] • Bank-level parallelism (BLP) for refresh • Only 2 banks are refreshed in parallel → #AB > 0 L = Layer ID B = Bank ID SA = Subarray ID R = Row ID tRFC for Crammed Refresh < tRFC for Scattered Refresh
  • 32. Massed Refresh – tRFC Timing 31 Example Massed Refresh Operation – HMC Vault – Refresh Bundle size of 4 [Timing diagrams: Per-Bank, Crammed, Massed] • Bank-level parallelism (BLP) + Subarray-level parallelism (SLP) for refresh tRFC for Massed Refresh < tRFC for Crammed Refresh How to implement BLP and SLP together? L = Layer ID B = Bank ID SA = Subarray ID R = Row ID
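Extending the same rough timing model, Crammed Refresh halves the serialized row count by refreshing two banks in parallel, and Massed Refresh additionally hides precharges via SLP inside each of those banks. All timing values and the bundle size are illustrative assumptions, not measured HMC parameters.

```python
# Rough tRFC comparison across the four refresh schemes in the slides.

RB_SIZE = 16        # rows per refresh bundle (assumption)
BANKS_PAR = 2       # banks refreshed in parallel by Crammed/Massed
TRAS_NS = 35.0      # activation + restore time (assumption)
TRP_NS = 15.0       # precharge time (assumption)
TRC_NS = TRAS_NS + TRP_NS   # 50 ns

trfc_per_bank = RB_SIZE * TRC_NS                         # 800 ns, no BLP/SLP
trfc_scattered = RB_SIZE * TRAS_NS + TRP_NS              # 575 ns, SLP only
# Crammed: 2 banks in parallel (BLP), no SLP inside a bank.
trfc_crammed = (RB_SIZE // BANKS_PAR) * TRC_NS           # 400 ns
# Massed: 2 banks in parallel (BLP) + SLP hiding precharges per bank.
trfc_massed = (RB_SIZE // BANKS_PAR) * TRAS_NS + TRP_NS  # 295 ns

print(trfc_per_bank, trfc_scattered, trfc_crammed, trfc_massed)
```

Even this crude model reproduces the ordering the slides argue for: tRFC(Massed) < tRFC(Crammed) < tRFC(Scattered) < tRFC(Per-bank).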
  • 33. Subarray-level Parallelism (SLP) 32 Global Row-address Latch Per-Subarray Row-address Latch Source: Y Kim+, ISCA 2012 Global Row-address Latch hinders SLP
  • 34. Bank-level Parallelism (BLP) 33 [Diagram: vault controller on the Logic Base (LoB) with a 17-bit address counter (LayerAddr[2], BankAddr[1], RowAddr[14]) in a physical address latch, refresh scheduler, address calculator, refresh controller, and a physical address decoder whose LayerID/BankID outputs pass through a mask with enable; TSV launch pads connect to banks on memory dies 1-4] BLP is implemented by masking BankID during refresh
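The BankID-masking idea can be sketched as a decoder that treats one bank-address bit as a don't-care during refresh, so a single refresh address selects two banks at once. The function below, its bit widths, and the masked bit position are illustrative assumptions, not the HMC's actual decode logic (the slide's vault has a 1-bit BankAddr; a 3-bit field is used here just to make the matching visible).

```python
# Sketch: masking one BankID bit in the decoder selects two banks at once.

def banks_selected(bank_addr, num_bank_bits, mask_bit=None):
    """Return the set of bank IDs matching bank_addr when mask_bit
    (if given) is treated as a don't-care in the comparison."""
    selected = set()
    for bank in range(1 << num_bank_bits):
        if mask_bit is None:
            match = (bank == bank_addr)          # normal exact decode
        else:
            care = ~(1 << mask_bit)              # ignore the masked bit
            match = (bank & care) == (bank_addr & care)
        if match:
            selected.add(bank)
    return selected

# Normal access: the decoder selects exactly one bank.
print(banks_selected(0b101, 3))              # {5}
# Refresh with bit 1 masked: banks 0b101 and 0b111 refresh together.
print(banks_selected(0b101, 3, mask_bit=1))  # {5, 7}
```

Enabling the mask only while a refresh command is in flight gives BLP for refresh without changing the decode path for normal reads and writes.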
  • 35. • Introduction • Background on DRAM Structure and Refresh Operation • Related Work • Contributions • Evaluation Setup • Evaluation Results • Conclusion Outline 34
  • 36. Evaluation Setup 35 • Trace-driven simulation for PARSEC benchmarks • Memory access traces extracted from detailed cycle-accurate simulations using gem5 • These memory traces were then provided as inputs to the DRAM simulator DRAMSim2 • Energy, timing and area analysis • CACTI-3DD simulation based on a 4Gb HMC quad model • DRAMSim2 configuration • Configured DRAMSim2 using the CACTI-3DD results
  • 37. • Introduction • Background on DRAM Structure and Refresh Operation • Related Work • Motivation • Massed Refresh Technique • Evaluation Setup • Evaluation Results • Conclusion Outline 36
  • 38. Results I – Energy, Timing, Area 37
  • 39. Results II – Throughput 38 Crammed refresh achieves 7.1% and 2.9% more throughput on average over distributed per-bank refresh and scattered refresh respectively PARSEC Benchmarks Massed refresh achieves 8.4% and 4.3% more throughput on average over distributed per-bank refresh and scattered refresh respectively
  • 40. Results III – Energy Delay Product (EDP) 39 Crammed refresh achieves 6.4% and 2.7% less EDP on average over distributed per-bank refresh and scattered refresh respectively PARSEC Benchmarks Massed refresh achieves 7.5% and 3.9% less EDP on average over distributed per-bank refresh and scattered refresh respectively
  • 41. • Introduction • Background on DRAM Structure and Refresh Operation • Related Work • Motivation • Massed Refresh Technique • Evaluation Setup • Evaluation Results • Conclusion Outline 40
  • 42. Conclusions 41 • Proposed Massed Refresh technique exploits • Bank-level as well as subarray-level parallelism during refresh operations • Proposed Crammed Refresh and Massed Refresh techniques • Improve throughput and energy-efficiency of DRAM • Crammed Refresh improves upon the state-of-the-art • 7.1% & 6.4% improvements in throughput and EDP over distributed per-bank refresh • 2.9% & 2.7% improvements in throughput and EDP over scattered refresh, respectively • Massed Refresh improves upon the state-of-the-art • 8.4% & 7.5% improvements in throughput and EDP over distributed per-bank refresh • 4.3% & 3.9% improvements in throughput and EDP over scattered refresh, respectively
  • 43. • Questions / Comments ? Thank You 42