This paper presents a novel, energy-efficient DRAM refresh technique called massed refresh that simultaneously leverages bank-level and subarray-level concurrency to reduce the overhead of distributed refresh operations in the Hybrid Memory Cube (HMC). In massed refresh, a bundle of DRAM rows in a refresh operation is composed of two subgroups mapped to two different banks, with the rows of each subgroup mapped to different subarrays within the corresponding bank. Both subgroups of DRAM rows are refreshed concurrently during a refresh command, which greatly reduces the refresh cycle time and improves bandwidth and energy efficiency of the HMC. Our experimental analysis shows that the proposed massed refresh technique achieves up to 6.3% and 5.8% improvements in throughput and energy-delay product on average over JEDEC standardized distributed per-bank refresh and state-of-the-art scattered refresh techniques.
Massed Refresh: An Energy-Efficient Technique to Reduce Refresh Overhead in Hybrid Memory Cube Architectures
1. Massed Refresh: An Energy-Efficient Technique to
Reduce Refresh Overhead in Hybrid Memory
Cube Architectures
Ishan Thakkar, Sudeep Pasricha
Department of Electrical and Computer Engineering
Colorado State University, Fort Collins, CO, U.S.A.
{ishan.thakkar, sudeep}@colostate.edu
VLSID 2016
KOLKATA, INDIA
January 4-8, 2016
DOI 10.1109/VLSID.2016.13
2. Outline
• Introduction
• Background on DRAM Structure and Refresh Operation
• Related Work
• Contributions
• Evaluation Setup
• Evaluation Results
• Conclusion
4. Introduction
• Main memory is DRAM (Dynamic Random Access Memory)
• It is a critical component of all computing systems: server, desktop, mobile, embedded, sensor
• DRAM stores data in a cell capacitor
• Fully charged cell capacitor: logic ‘1’
• Fully discharged cell capacitor: logic ‘0’
• A DRAM cell loses data over time, as the cell capacitor leaks charge
• For temperatures below 85°C, a DRAM cell loses data in 64 ms
• For higher temperatures, a DRAM cell loses data at a faster rate
[Figure: DRAM cell; labels: Word Line, BitLine, Access Transistor, Cell Capacitor]
To preserve data integrity, the charge on each DRAM cell (cell capacitor) must be periodically restored or refreshed.
6. Background on DRAM Structure
• Based on their structure, DRAMs are classified into two categories:
1. 2D DRAMs: planar, single-layer DRAMs
2. 3D-Stacked DRAMs: multiple 2D DRAM layers stacked on one another using through-silicon vias (TSVs)
• 2D DRAM structure hierarchy: Rank → Chip → Bank → Subarray → Bitcell
7. 2D DRAM: Rank and Chip Structure
[Figure: DRAM rank and DRAM chip; labels: DRAM Chip, <N>-bit data paths, Mux]
• 2D DRAM rank:
• Multiple chips work in tandem
8. 3D-Stacked DRAM Structure
[Figure: Hybrid Memory Cube]
HMC structure hierarchy: Vault → Bank → Subarray → Bitcell
In this paper, we consider the Hybrid Memory Cube (HMC), which is a standard for 3D-Stacked DRAMs defined by an industry consortium
9. DRAM Bank Structure
[Figure: DRAM bank; labels: Bank Core, Bank Peripherals, Subarray, Rows, Columns, Sense Amplifiers, Row Address Decoder, Row Buffer, Column Mux, Column Address Decoder, Data bits]
3D-Stacked and 2D DRAMs have similar bank structures
10. DRAM Subarray Structure
[Figure: DRAM subarray; labels: Row Address, Word Line, BitLine, DRAM Cell (Access Transistor, Cell Capacitor), Sense Amps]
3D-Stacked and 2D DRAMs have similar subarray structures
11. Basic DRAM Operations
[Figure: DRAM bank; labels: Sense Amplifiers, Global Row Dec., Subarray Dec., =ID?, EN, Global Address Latch, Row Buffer, Column Mux, Column Address Decoder]
Operations: PRECHARGE
All bitlines of the bank are pre-charged to 0.5 VDD
12. Basic DRAM Operations
[Figure: DRAM bank; the Row Address selects Row 4 in Subarray ID: 1]
Operations: PRECHARGE, ACTIVATION
The target row is opened
13. Basic DRAM Operations
[Figure: DRAM bank; the Row Address selects Row 4 in Subarray ID: 1]
Operations: PRECHARGE, ACTIVATION
The target row is opened, then it is captured by the sense amplifiers (SAs)
14. Basic DRAM Operations
[Figure: DRAM bank; Row 4 of Subarray ID: 1 is held in the sense amplifiers]
Operations: PRECHARGE, ACTIVATION
The SAs drive each bitline fully to either VDD or 0 V, which restores the open row
15. Basic DRAM Operations
[Figure: DRAM bank; Row 4 of Subarray ID: 1 is copied to the row buffer]
Operations: PRECHARGE, ACTIVATION
The open row is stored in the global row buffer
16. Basic DRAM Operations
[Figure: DRAM bank; Column 1 of Row 4 is read out of the row buffer]
Operations: PRECHARGE, ACTIVATION, READ
The target data block is selected and then multiplexed out from the row buffer
17. Basic DRAM Operations
[Figure: DRAM bank; Column 1 of Row 4 is read out of the row buffer]
Operations: PRECHARGE, ACTIVATION, READ
A pair of PRECHARGE-ACTIVATION operations restores/refreshes the target row
Therefore, dummy PRECHARGE-ACTIVATION operations are performed to refresh the rows
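The operation sequence on these slides can be mimicked with a toy model. The following is a minimal sketch, not the authors' simulator: the `ToyBank` class, the normalized `VDD`, and the proportional `leak` model are all assumptions made for illustration. It shows why a dummy ACTIVATION (sense and restore) bracketed by PRECHARGE is enough to refresh a row without any READ.

```python
VDD = 1.0  # normalized supply voltage (illustrative value)


class ToyBank:
    """Toy DRAM bank: each cell stores a charge between 0 and VDD."""

    def __init__(self, rows, cols):
        self.cells = [[0.0] * cols for _ in range(rows)]
        self.bitlines = [VDD / 2] * cols  # bank starts precharged
        self.open_row = None
        self.row_buffer = None

    def write_row(self, row, bits):
        # Store fully charged (1) or fully discharged (0) cells.
        self.cells[row] = [VDD if b else 0.0 for b in bits]

    def leak(self, fraction):
        # Hypothetical leakage model: every cell loses a fraction of its charge.
        self.cells = [[c * (1.0 - fraction) for c in r] for r in self.cells]

    def precharge(self):
        # Close any open row and bring all bitlines to 0.5 VDD.
        self.bitlines = [VDD / 2] * len(self.bitlines)
        self.open_row = None

    def activate(self, row):
        if self.open_row is not None:
            raise RuntimeError("bank must be precharged before ACTIVATION")
        # Sense amplifiers compare each cell against 0.5 VDD and drive the
        # bitline fully to VDD or 0 V, which also restores the cell charge.
        self.bitlines = [VDD if c > VDD / 2 else 0.0 for c in self.cells[row]]
        self.cells[row] = list(self.bitlines)
        self.row_buffer = [1 if v == VDD else 0 for v in self.bitlines]
        self.open_row = row

    def read(self, col):
        if self.open_row is None:
            raise RuntimeError("no open row")
        return self.row_buffer[col]

    def refresh(self, row):
        # Dummy PRECHARGE-ACTIVATION; the final PRECHARGE leaves the bank closed.
        self.precharge()
        self.activate(row)
        self.precharge()


bank = ToyBank(rows=8, cols=4)
bank.write_row(4, [1, 0, 1, 1])
bank.leak(0.3)   # charged cells drop to 0.7 VDD
bank.refresh(4)  # cells are driven back to VDD or 0 V
```

In this model, a cell that leaks below 0.5 VDD before it is refreshed would be sensed as ‘0’, which is the data loss that periodic refresh prevents.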
18. Refresh: 2D vs. 3D-Stacked DRAMs
• 3D-Stacked DRAMs have
• Higher capacity/density → more rows need to be refreshed
• Higher power density → higher operating temperature (>85°C) → shorter retention period (time before DRAM cells lose data) of 32 ms, compared with 64 ms for 2D DRAMs
• Thus, the refresh problem is more critical for 3D-Stacked DRAMs
• Therefore, in this study, we target a standardized 3D-Stacked DRAM architecture, the HMC
Refresh: dummy ACTIVATION-PRECHARGE operations are performed on all rows every retention cycle (32 ms)
To prevent long pauses, a JEDEC-standardized Distributed Refresh method is used
19. Background: Refresh Operation
• Distributed Refresh: JEDEC-standardized method
• A group of 𝑛 rows is refreshed every 3.9 μs
• A group of 𝑛 rows forms a ‘Refresh Bundle (RB)’
• The size of the RB increases with DRAM capacity, which increases tRFC
[Figure: Example Distributed Refresh Operation – 1Gb HMC Vault. The retention cycle of 32 ms is divided into refresh intervals of tREFI = 3.9 µs; one refresh bundle (RB1, RB2, ..., RB8192) is refreshed per interval, taking tRFC. The size of an RB is 16: Row1 to Row16 are refreshed one after another, each taking tRC; tREC is also marked on the timeline]
tREFI: Refresh Interval
tRFC: Refresh Cycle Time (time taken to refresh an entire RB)
tRC: Row Cycle Time
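The numbers in this example follow from simple arithmetic, sketched below in Python. The retention cycle, bundle count, and bundle size are taken from the slide; the `t_rc_ns` value of 50 ns is a hypothetical placeholder, and approximating tRFC as 𝑛 × tRC (ignoring tREC) is a simplifying assumption, not a figure from the paper.

```python
RETENTION_MS = 32.0     # retention cycle from the slide
NUM_BUNDLES = 8192      # RB1 ... RB8192
ROWS_PER_BUNDLE = 16    # size of one refresh bundle


def refresh_interval_us(retention_ms, num_bundles):
    """tREFI: one bundle must be refreshed per interval to cover all rows."""
    return retention_ms * 1000.0 / num_bundles


def refresh_cycle_ns(rows_per_bundle, t_rc_ns):
    """tRFC, approximated as rows refreshed back to back (tREC ignored)."""
    return rows_per_bundle * t_rc_ns


def refresh_busy_fraction(t_rfc_ns, t_refi_us):
    """Fraction of each refresh interval spent refreshing."""
    return t_rfc_ns / (t_refi_us * 1000.0)


t_refi_us = refresh_interval_us(RETENTION_MS, NUM_BUNDLES)
t_rfc_ns = refresh_cycle_ns(ROWS_PER_BUNDLE, t_rc_ns=50.0)  # hypothetical tRC
busy = refresh_busy_fraction(t_rfc_ns, t_refi_us)
total_rows = NUM_BUNDLES * ROWS_PER_BUNDLE

print(f"tREFI = {t_refi_us:.2f} us, tRFC = {t_rfc_ns:.0f} ns, busy = {busy:.1%}")
```

Under these assumptions, 8192 bundles of 16 rows cover 2^17 rows per retention cycle. Doubling the bundle size at a fixed tREFI doubles the time spent refreshing in each interval, which is how larger capacities raise tRFC.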
20. Performance Overhead of Distributed Refresh
Source: J Liu+, ISCA 2012
Performance overhead of refresh increases with device capacity
22. Energy Overhead of Distributed Refresh
Source: J Liu+, ISCA 2012
Energy overhead of refresh increases with device capacity
Refresh is a growing problem that needs to be addressed to realize low-latency, low-energy DRAMs
24. Related Work
We improve upon Scattered Refresh
Scattered Refresh improves upon Per-bank Refresh and All-bank Refresh
25. All-Bank Refresh vs. Per-Bank Refresh
• Distributed Refresh can be implemented at two different granularities
• All-bank Refresh: All banks are refreshed simultaneously, and no bank is allowed to serve any request until the refresh is complete
• Supported by all general-purpose DDRx DRAMs
• DRAM operation is completely stalled → the number of available banks (#AB) is zero
• Exploits bank-level parallelism (BLP) for refreshing → smaller tRFC
• Per-bank Refresh: Only one bank is refreshed at a time, so all other banks are allowed to serve other requests
• Supported by LPDDRx DRAMs
• #AB > 0
• No BLP → larger value of tRFC
tRFC: Refresh Cycle Time
26. All-Bank Refresh vs. Per-Bank Refresh
[Figure: timelines of the dummy ACTIVATION-PRECHARGE operations issued for a refresh command under All-Bank Refresh and Per-Bank Refresh; L = Layer ID, B = Bank ID, SA = Subarray ID, R = Row ID; tRC: Row Cycle Time, tRFC: Refresh Cycle Time]
All-Bank Refresh
• Smaller value of tRFC
• Number of available banks (#AB) = 0 → DRAM operation is completely stalled
Per-Bank Refresh
• #AB > 0
• No BLP → larger value of tRFC
Both All-bank Refresh and Per-bank Refresh have drawbacks, and they can be improved
27. Scattered Refresh
Example Scattered Refresh Operation – HMC Vault – Refresh Bundle size of 4
[Figure: Scattered Refresh timeline; L = Layer ID, B = Bank ID, SA = Subarray ID, R = Row ID]
• Improves upon Per-bank Refresh: uses subarray-level parallelism (SLP) for refresh
• Each row of the RB is mapped to a different subarray
• SLP gives the opportunity to overlap PRECHARGE with the next ACTIVATE → reduces tRFC
Source: T Kalyan+, ISCA 2012
How does Scattered Refresh compare to Per-bank Refresh and All-bank Refresh?
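The effect of overlapping PRECHARGE with the next ACTIVATE can be sketched as a small timeline generator. This is an illustrative model only: it assumes tRC = tRAS + tRP, that the next row's ACTIVATE (in a different subarray) may start at the moment the previous row's PRECHARGE starts, and hypothetical timing values of 35 ns and 15 ns. The function name `refresh_timeline` is made up for this example.

```python
def refresh_timeline(num_rows, t_ras, t_rp, overlap_precharge):
    """Return (events, t_rfc) for refreshing num_rows rows in one bank.

    Each event is (row, act_start, pre_start, pre_end).
    overlap_precharge=False models per-bank refresh (rows share a subarray);
    overlap_precharge=True models scattered refresh (one row per subarray).
    """
    events = []
    t = 0.0
    for row in range(num_rows):
        act_start = t
        pre_start = act_start + t_ras  # row is sensed and restored
        pre_end = pre_start + t_rp     # bitlines back at 0.5 VDD
        events.append((row, act_start, pre_start, pre_end))
        # Without SLP the next ACTIVATE waits for PRECHARGE to finish;
        # with SLP it starts in another subarray as PRECHARGE begins.
        t = pre_start if overlap_precharge else pre_end
    return events, events[-1][3]


T_RAS, T_RP = 35.0, 15.0  # hypothetical timings in ns
per_bank_events, per_bank_trfc = refresh_timeline(4, T_RAS, T_RP, False)
scattered_events, scattered_trfc = refresh_timeline(4, T_RAS, T_RP, True)
print(per_bank_trfc, scattered_trfc)
```

With a refresh bundle of 4, only the last PRECHARGE remains exposed in the scattered case, so tRFC shrinks from 4 × (tRAS + tRP) to 4 × tRAS + tRP under these assumptions.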
28. Scattered Refresh
Example Scattered Refresh Operation – HMC Vault – Refresh Bundle size of 4
[Figure: tRFC timelines for All-Bank, Scattered, and Per-Bank Refresh]
tRFC for All-bank Refresh < tRFC for Scattered Refresh < tRFC for Per-bank Refresh
Scattered Refresh leaves room for improvement
30. Contributions
• Crammed Refresh: Per-bank Refresh + All-bank Refresh
• 2 banks are refreshed in parallel, instead of 1 bank in Per-bank Refresh and all banks in All-bank Refresh
• Massed Refresh: Crammed Refresh + Scattered Refresh
• 2 banks are refreshed in parallel
• Uses SLP in both banks being refreshed
#AB: Number of banks available to serve other requests while the remaining banks are being refreshed
#BLP: Bank-level Parallelism
#SLP: Subarray-level Parallelism
Only 2 banks are refreshed in parallel, as a proof of concept
More than 2 banks can also be chosen
The idea is to keep a balance between #AB and BLP for refresh
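The way a refresh bundle is spread over banks and subarrays under each scheme can be sketched as a small mapping function. This is an illustration of the idea only: the `RowAddr` tuple, the `build_bundle` function, the scheme names, and the choice of bank 0 and bank 1 are assumptions for this example, and the Layer ID is omitted for brevity.

```python
from collections import namedtuple

RowAddr = namedtuple("RowAddr", ["bank", "subarray", "row"])

# scheme name -> (banks refreshed in parallel, rows spread over subarrays?)
SCHEMES = {
    "per_bank": (1, False),
    "scattered": (1, True),
    "crammed": (2, False),
    "massed": (2, True),
}


def build_bundle(scheme, size, num_subarrays=4):
    """Map the rows of one refresh bundle to (bank, subarray, row)."""
    banks, use_slp = SCHEMES[scheme]
    if size % banks != 0:
        raise ValueError("bundle size must divide evenly across banks")
    rows_per_bank = size // banks
    if use_slp and rows_per_bank > num_subarrays:
        raise ValueError("not enough subarrays for one row per subarray")
    bundle = []
    for bank in range(banks):
        for i in range(rows_per_bank):
            if use_slp:
                # One row in each of several subarrays of the same bank.
                bundle.append(RowAddr(bank, subarray=i, row=0))
            else:
                # Consecutive rows of a single subarray.
                bundle.append(RowAddr(bank, subarray=0, row=i))
    return bundle


crammed = build_bundle("crammed", size=4)
massed = build_bundle("massed", size=4)
for addr in massed:
    print(addr)
```

For a bundle of 4, the massed mapping yields two subgroups of two rows, one subgroup per bank, with the rows of each subgroup in different subarrays.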
31. Crammed Refresh – tRFC Timing
Example Crammed Refresh Operation – HMC Vault – Refresh Bundle size of 4
[Figure: tRFC timelines for Scattered, Crammed, and Per-Bank Refresh; L = Layer ID, B = Bank ID, SA = Subarray ID, R = Row ID]
• Bank-level parallelism (BLP) for refresh
• Only 2 banks are refreshed in parallel → #AB > 0
tRFC for Crammed Refresh < tRFC for Scattered Refresh
32. Massed Refresh – tRFC Timing
Example Massed Refresh Operation – HMC Vault – Refresh Bundle size of 4
[Figure: tRFC timelines for Massed, Crammed, and Per-Bank Refresh; L = Layer ID, B = Bank ID, SA = Subarray ID, R = Row ID]
• Bank-level parallelism (BLP) + subarray-level parallelism (SLP) for refresh
tRFC for Massed Refresh < tRFC for Crammed Refresh
How to implement BLP and SLP together?
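Before turning to the implementation, the tRFC orderings stated on the timing slides can be sanity-checked with a first-order model. The sketch below is not the paper's timing analysis. It assumes tRC = tRAS + tRP, full PRECHARGE/ACTIVATE overlap when SLP is used, 8 banks per vault, and hypothetical timings of 35 ns and 15 ns. The function `t_rfc` and the `SCHEMES` table are names invented for this example.

```python
import math

TOTAL_BANKS = 8           # assumed banks per vault
T_RAS, T_RP = 35.0, 15.0  # hypothetical timings in ns
BUNDLE_SIZE = 4           # refresh bundle size used in the slide examples

# scheme name -> (banks refreshed in parallel, uses SLP)
SCHEMES = {
    "per_bank": (1, False),
    "scattered": (1, True),
    "crammed": (2, False),
    "massed": (2, True),
    "all_bank": (TOTAL_BANKS, False),
}


def t_rfc(bundle_size, banks_in_parallel, use_slp, t_ras, t_rp):
    """First-order refresh cycle time for one refresh bundle."""
    rows_per_bank = math.ceil(bundle_size / banks_in_parallel)
    if use_slp:
        # PRECHARGE overlaps the next ACTIVATE; only the last one is exposed.
        return rows_per_bank * t_ras + t_rp
    return rows_per_bank * (t_ras + t_rp)


trfc = {name: t_rfc(BUNDLE_SIZE, banks, slp, T_RAS, T_RP)
        for name, (banks, slp) in SCHEMES.items()}
available_banks = {name: TOTAL_BANKS - banks
                   for name, (banks, _) in SCHEMES.items()}

for name in sorted(trfc, key=trfc.get):
    print(f"{name:10s} tRFC = {trfc[name]:5.0f} ns  #AB = {available_banks[name]}")
```

Under these assumptions the model reproduces the orderings on the slides: all-bank refresh has the smallest tRFC but leaves no bank available, while massed refresh has a smaller tRFC than crammed, scattered, and per-bank refresh and keeps #AB > 0.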
34. Bank-level Parallelism (BLP)
[Figure: refresh path in an HMC vault; labels: Logic Base (LoB), Vault Controller, Refresh Controller, Refresh Scheduler, Address Calculator, Control, 17-bit Address Counter, Physical Address Latch (LayerAddr[2], RowAddr[14], BankAddr[1]), Physical Addr Decoder, Row Addr Latch, LayerID (LID), BankID (BID), Mask, EN, TSV Launch Pads, To Banks, Memory die 1 to Memory die 4]
BLP is implemented by masking BankID during refresh
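The masking idea can be sketched in a few lines of Python. The field widths (2-bit layer, 14-bit row, 1-bit bank) come from the slide, but the bit ordering within the 17-bit counter, the function names, and the assumption that the two banks refreshed together share a layer are illustrative guesses rather than details confirmed by the slides.

```python
LAYER_BITS, ROW_BITS, BANK_BITS = 2, 14, 1  # 17 bits in total


def split_counter(counter):
    """Split a 17-bit refresh address; assumed layout [layer | row | bank]."""
    bank = counter & ((1 << BANK_BITS) - 1)
    row = (counter >> BANK_BITS) & ((1 << ROW_BITS) - 1)
    layer = (counter >> (BANK_BITS + ROW_BITS)) & ((1 << LAYER_BITS) - 1)
    return layer, row, bank


def bank_enables(counter, mask_bank_id):
    """Return (row, enables) where enables[(layer, bank)] is the EN signal.

    A bank is enabled when its LayerID matches and its BankID either matches
    or is masked. Masking the BankID comparison therefore enables both banks
    of the selected layer for the same row address.
    """
    layer, row, bank = split_counter(counter)
    enables = {}
    for lid in range(1 << LAYER_BITS):
        for bid in range(1 << BANK_BITS):
            bank_match = mask_bank_id or bid == bank
            enables[(lid, bid)] = (lid == layer) and bank_match
    return row, enables


counter = (2 << 15) | (5 << 1) | 1  # layer 2, row 5, bank 1
row, normal = bank_enables(counter, mask_bank_id=False)
_, masked = bank_enables(counter, mask_bank_id=True)
print([k for k, en in normal.items() if en])  # one bank enabled
print([k for k, en in masked.items() if en])  # two banks enabled
```

With the mask deasserted (normal accesses), exactly one bank sees the row address. With the mask asserted (refresh), two banks receive it at once, which is the bank-level parallelism used by crammed and massed refresh.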
36. Evaluation Setup
• Trace-driven simulation for PARSEC benchmarks
• Memory access traces extracted from detailed cycle-accurate simulations using gem5
• These memory traces were then provided as inputs to the DRAM simulator DRAMSim2
• Energy, timing and area analysis
• CACTI-3DD based simulation, based on a 4Gb HMC quad model
• DRAMSim2 configuration
• Configured DRAMSim2 using CACTI-3DD results
39. Results II – Throughput
[Figure: throughput of the refresh schemes across PARSEC benchmarks]
Crammed refresh achieves 7.1% and 2.9% more throughput on average over distributed per-bank refresh and scattered refresh, respectively
Massed refresh achieves 8.4% and 4.3% more throughput on average over distributed per-bank refresh and scattered refresh, respectively
40. Results III – Energy Delay Product (EDP)
[Figure: EDP of the refresh schemes across PARSEC benchmarks]
Crammed refresh achieves 6.4% and 2.7% less EDP on average over distributed per-bank refresh and scattered refresh, respectively
Massed refresh achieves 7.5% and 3.9% less EDP on average over distributed per-bank refresh and scattered refresh, respectively
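For reference, the energy-delay product is the product of the energy consumed and the execution time, so a scheme lowers EDP by saving energy, by finishing sooner, or both. The short sketch below shows how a percentage reduction of this kind is computed. The energy and delay values are made-up placeholders, not measurements from the evaluation.

```python
def edp(energy_j, delay_s):
    """Energy-delay product: lower is better."""
    return energy_j * delay_s


def percent_reduction(baseline, proposed):
    """Relative improvement of `proposed` over `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline


# Hypothetical, normalized numbers: 3% less energy and 4% less delay.
baseline_edp = edp(energy_j=1.00, delay_s=1.00)
proposed_edp = edp(energy_j=0.97, delay_s=0.96)
reduction = percent_reduction(baseline_edp, proposed_edp)
print(f"EDP reduction: {reduction:.2f}%")
```

Because the two factors multiply, modest gains in both energy and delay compound into a larger EDP reduction than either gain alone.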
42. Conclusions
• The proposed Massed Refresh technique exploits
• Bank-level as well as subarray-level parallelism during refresh operations
• The proposed Crammed Refresh and Massed Refresh techniques
• Improve throughput and energy efficiency of DRAM
• Crammed Refresh improves upon the state of the art
• 7.1% and 6.4% improvements in throughput and EDP, respectively, over distributed per-bank refresh
• 2.9% and 2.7% improvements in throughput and EDP, respectively, over scattered refresh
• Massed Refresh improves upon the state of the art
• 8.4% and 7.5% improvements in throughput and EDP, respectively, over distributed per-bank refresh
• 4.3% and 3.9% improvements in throughput and EDP, respectively, over scattered refresh