Massed Refresh: An Energy-Efficient Technique to Reduce Refresh Overhead in Hybrid Memory Cube Architectures

Ishan Thakkar, Sudeep Pasricha
Department of Electrical and Computer Engineering
Colorado State University, Fort Collins, CO, U.S.A.
{ishan.thakkar, sudeep}@colostate.edu
VLSID 2016
KOLKATA, INDIA
January 4-8, 2016
DOI 10.1109/VLSID.2016.13

Outline
• Introduction
• Background on DRAM Structure and Refresh Operation
• Related Work
• Contributions
• Evaluation Setup
• Evaluation Results
• Conclusion

Introduction
• Main memory is DRAM (Dynamic Random Access Memory)
• It is a critical component of all computing systems: server, desktop, mobile, embedded, sensor
• DRAM stores data in a cell capacitor
• Fully charged cell capacitor → logic ‘1’
• Fully discharged cell capacitor → logic ‘0’
• A DRAM cell loses data over time, as its cell capacitor leaks charge
• For temperatures below 85°C, a DRAM cell loses data in 64 ms
• For higher temperatures, a DRAM cell loses data at a faster rate
[Figure: DRAM cell. Labels: word line, bitline, access transistor, cell capacitor]
To preserve data integrity, the charge on each DRAM cell (cell capacitor) must be periodically restored, or refreshed.

Background on DRAM Structure
• Based on their structure, DRAMs are classified into two categories:
1. 2D DRAMs: planar, single-layer DRAMs
2. 3D-Stacked DRAMs: multiple 2D DRAM layers stacked on one another using TSVs (through-silicon vias)
• 2D DRAM structure hierarchy: Rank → Chip → Bank → Subarray → Bitcell

2D DRAM: Rank and Chip Structure
[Figure: DRAM rank and DRAM chip. Labels: DRAM chips with <N>-bit connections, Mux]
• 2D DRAM rank: multiple chips work in tandem

3D-Stacked DRAM Structure
• HMC structure hierarchy: Vault → Bank → Subarray → Bitcell
[Figure: Hybrid Memory Cube]
In this paper, we consider the Hybrid Memory Cube (HMC), a standard for 3D-Stacked DRAMs defined by an industry consortium.

DRAM Bank Structure
[Figure: DRAM bank. Labels: bank core (subarrays made of rows and columns, sense amplifiers), bank peripherals, row address decoder, row buffer, column mux, column address decoder, data bits]
3D-Stacked and 2D DRAMs have similar bank structures

DRAM Subarray Structure
[Figure: DRAM subarray. Labels: row address, word lines, bitlines, DRAM cells (access transistor and cell capacitor), sense amps]
3D-Stacked and 2D DRAMs have similar subarray structures

Basic DRAM Operations
[Figure, repeated for each step: DRAM bank with sense amplifiers, global row decoder, subarray decoders (=ID? comparison with EN), global address latch, row buffer, column mux, and column address decoder. Running example: Row Address = Row 4, Subarray ID: 1, Column 1]
• PRECHARGE: all bitlines of the bank are pre-charged to 0.5 VDD
• ACTIVATION: the target row is opened, then it is captured by the sense amplifiers (SAs)
• The SAs drive each bitline fully to either VDD or 0 V, which restores the open row
• The open row is stored in the global row buffer
• READ: the target data block is selected and then multiplexed out from the row buffer
A pair of PRECHARGE-ACTIVATION operations restores (refreshes) the target row → dummy PRECHARGE-ACTIVATION operations are performed to refresh the rows
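The PRECHARGE, ACTIVATION, and READ steps above can be sketched as a toy bank model. This is only an illustration, not the authors' simulator: the class name `ToyBank`, the charge levels, and the leak factor of 0.8 are assumptions chosen to show why a row refreshed in time keeps its data while a row refreshed too late does not.

```python
class ToyBank:
    """Toy DRAM bank: each cell stores a charge level as a fraction of VDD."""

    def __init__(self, rows):
        # rows: list of lists of bits (0 or 1), stored as full charge levels.
        self.cells = [[float(bit) for bit in row] for row in rows]
        self.bitlines = None     # None means "not precharged"
        self.row_buffer = None   # currently open row, if any

    def leak(self, factor=0.8):
        # Hypothetical leakage step: every cell keeps `factor` of its charge.
        self.cells = [[c * factor for c in row] for row in self.cells]

    def precharge(self):
        # PRECHARGE: all bitlines go to 0.5 VDD; any open row is closed.
        self.bitlines = [0.5] * len(self.cells[0])
        self.row_buffer = None

    def activate(self, row):
        # ACTIVATION: the sense amplifiers compare each cell with the
        # 0.5 VDD bitline level, drive the bitline fully to VDD or 0 V,
        # and thereby restore the cells of the open row.
        if self.bitlines is None:
            raise RuntimeError("bank must be precharged before activation")
        sensed = [1 if c > 0.5 else 0 for c in self.cells[row]]
        self.cells[row] = [float(bit) for bit in sensed]
        self.row_buffer = sensed
        self.bitlines = None

    def read(self, column):
        # READ: select one column of the open row from the row buffer.
        if self.row_buffer is None:
            raise RuntimeError("no open row")
        return self.row_buffer[column]

    def refresh(self, row):
        # Refresh: a dummy PRECHARGE-ACTIVATION pair, with no data read out.
        # The closing precharge (an assumption) leaves the bank idle.
        self.precharge()
        self.activate(row)
        self.precharge()


bank = ToyBank([[1, 0, 1, 1]])
bank.leak()                    # charge decays to 0.8, still above 0.5
bank.refresh(0)                # row is restored to full charge
refreshed_in_time = bank.cells[0]

late = ToyBank([[1, 0, 1, 1]])
for _ in range(4):             # four leak steps push the charge below 0.5
    late.leak()
late.refresh(0)                # the '1's are now sensed as '0's
refreshed_too_late = late.cells[0]
```

In this sketch the refresh itself never fails; data is lost only when the charge has already dropped below the sensing threshold before the refresh arrives.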

Refresh: 2D vs. 3D-Stacked DRAMs
• 3D-Stacked DRAMs have
• Higher capacity/density → more rows need to be refreshed
• Higher power density → higher operating temperature (>85°C) → shorter retention period (time before DRAM cells lose data) of 32 ms, compared with 64 ms for 2D DRAMs
• Thus, the refresh problem is more critical for 3D-Stacked DRAMs
• Therefore, in this study, we target a standardized 3D-Stacked DRAM architecture, HMC
Refresh
• Dummy ACTIVATION-PRECHARGE operations are performed on all rows every retention cycle (32 ms)
• To prevent long pauses, a JEDEC-standardized Distributed Refresh method is used

Background: Refresh Operation
• Distributed Refresh: JEDEC-standardized method
• A group of 𝑛 rows is refreshed every 3.9 µs
• A group of 𝑛 rows forms a ‘Refresh Bundle’ (RB)
• Size of RB increases with DRAM capacity → increases tRFC
Example Distributed Refresh Operation – 1Gb HMC Vault
[Figure: timeline of one retention cycle (32 ms), divided into refresh intervals of tREFI = 3.9 µs. Refresh bundles RB1, RB2, ..., RB8192 are refreshed one per interval, each taking tRFC. Size of RB is 16: within one tRFC, Row1 through Row16 are refreshed, each taking tRC. tREC is also marked in the figure]
• tREFI: Refresh Interval
• tRFC: Refresh Cycle Time (time taken to refresh an entire RB)
• tRC: Row Cycle Time
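The numbers in the 1Gb HMC vault example can be checked with a few lines of arithmetic. The sketch below uses only the values from the slide (32 ms retention cycle, 8192 refresh bundles, 16 rows per bundle). The tRFC value used for the busy fraction is a made-up placeholder, not a figure from the paper.

```python
RETENTION_CYCLE_MS = 32.0   # retention cycle from the example
NUM_BUNDLES = 8192          # RB1 ... RB8192
RB_SIZE = 16                # rows per refresh bundle


def refresh_interval_us(retention_ms, num_bundles):
    """tREFI: time between the starts of two consecutive refresh bundles."""
    return retention_ms * 1000.0 / num_bundles


def rows_per_retention_cycle(num_bundles, rb_size):
    """Rows refreshed once every bundle has been issued."""
    return num_bundles * rb_size


def refresh_busy_fraction(t_rfc_ns, t_refi_us):
    """Fraction of each refresh interval spent refreshing."""
    return t_rfc_ns / (t_refi_us * 1000.0)


t_refi = refresh_interval_us(RETENTION_CYCLE_MS, NUM_BUNDLES)
total_rows = rows_per_retention_cycle(NUM_BUNDLES, RB_SIZE)

# HYPOTHETICAL_TRFC_NS is an illustrative placeholder only.
HYPOTHETICAL_TRFC_NS = 200.0
busy = refresh_busy_fraction(HYPOTHETICAL_TRFC_NS, t_refi)

print(f"tREFI = {t_refi:.5f} us")
print(f"rows per retention cycle = {total_rows}")
print(f"busy fraction at tRFC = {HYPOTHETICAL_TRFC_NS} ns: {busy:.2%}")
```

The interval works out to 3.90625 µs, which the slide rounds to 3.9 µs. The row count is 131072 = 2^17, consistent with the 17-bit address counter shown later on the Bank-level Parallelism slide.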

Performance Overhead of Distributed Refresh
Source: J Liu+, ISCA 2012
Performance overhead of refresh increases with increase in device capacity

Energy Overhead of Distributed Refresh
Energy overhead of refresh increases with increase in device capacity
Refresh is a growing problem that needs to be addressed to realize low-latency, low-energy DRAMs

Related Work
We improve upon Scattered Refresh
Scattered Refresh improves upon Per-bank Refresh and All-bank Refresh

All-Bank Refresh vs. Per-Bank Refresh
• Distributed Refresh can be implemented at two different granularities
• All-bank Refresh: all banks are refreshed simultaneously, and no bank is allowed to serve any request until the refresh is complete
• Supported by all general-purpose DDRx DRAMs
• DRAM operation is completely stalled → number of available banks (#AB) is zero
• Exploits bank-level parallelism (BLP) for refreshing → smaller tRFC
• Per-bank Refresh: only one bank is refreshed at a time, so all other banks are allowed to serve other requests
• Supported by LPDDRx DRAMs
• #AB > 0
• No BLP → larger tRFC
• tRFC: Refresh Cycle Time

All-Bank Refresh vs. Per-Bank Refresh
[Figure: timing diagrams of the dummy ACTIVATION-PRECHARGE operations issued for a refresh command under All-bank Refresh and Per-bank Refresh. Legend: L = Layer ID, B = Bank ID, SA = Subarray ID, R = Row ID; tRC: Row Cycle Time; tRFC: Refresh Cycle Time]
All-Bank Refresh
• Smaller value of tRFC
• Number of available banks (#AB) = 0 → DRAM operation is completely stalled
Per-Bank Refresh
• #AB > 0
• No BLP → larger value of tRFC
Both All-bank Refresh and Per-bank Refresh have drawbacks, and both can be improved

Scattered Refresh
• Improves upon Per-bank Refresh: uses subarray-level parallelism (SLP) for refresh
• Each row of the RB is mapped to a different subarray
• SLP gives the opportunity to overlap PRECHARGE with the next ACTIVATE → reduces tRFC
Source: T Kalyan+, ISCA 2012
[Figure: Example Scattered Refresh Operation – HMC Vault – Refresh Bundle size of 4. Legend: L = Layer ID, B = Bank ID, SA = Subarray ID, R = Row ID]
How does Scattered Refresh compare to Per-bank Refresh and All-bank Refresh?

Scattered Refresh
[Figure: Example Scattered Refresh Operation – HMC Vault – Refresh Bundle size of 4, comparing the All-Bank, Scattered, and Per-Bank timing diagrams]
Room for improvement in Scattered Refresh:
tRFC for All-bank Refresh < tRFC for Scattered Refresh < tRFC for Per-bank Refresh
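The tRFC ordering above can be reproduced with a simple closed-form model. This is a sketch under stated assumptions, not the paper's timing equations. It treats tRC as tRAS + tRP, lets Scattered Refresh hide every PRECHARGE except the last one, and ignores command-bus and power constraints. The timing values and the bank count of 8 are illustrative assumptions; only the bundle size of 4 comes from the slide.

```python
import math


def trfc_per_bank(n_rows, t_ras, t_rp):
    # One bank, rows refreshed back to back: each row costs a full
    # row cycle tRC = tRAS + tRP.
    return n_rows * (t_ras + t_rp)


def trfc_all_bank(n_rows, n_banks, t_ras, t_rp):
    # Rows are spread across all banks, which refresh in parallel (BLP).
    return math.ceil(n_rows / n_banks) * (t_ras + t_rp)


def trfc_scattered(n_rows, t_ras, t_rp):
    # One bank, each row in a different subarray (SLP): the PRECHARGE of
    # one row overlaps the ACTIVATE of the next, so only the final
    # PRECHARGE adds to the total.
    return n_rows * t_ras + t_rp


RB_SIZE = 4          # refresh bundle size from the example
N_BANKS = 8          # assumed number of banks per vault
T_RAS_NS = 35.0      # hypothetical timing value
T_RP_NS = 15.0       # hypothetical timing value

all_bank = trfc_all_bank(RB_SIZE, N_BANKS, T_RAS_NS, T_RP_NS)
scattered = trfc_scattered(RB_SIZE, T_RAS_NS, T_RP_NS)
per_bank = trfc_per_bank(RB_SIZE, T_RAS_NS, T_RP_NS)

print(f"All-bank:  {all_bank} ns")
print(f"Scattered: {scattered} ns")
print(f"Per-bank:  {per_bank} ns")
```

With these placeholder timings the model gives 50 ns, 155 ns, and 200 ns, which matches the ordering stated on the slide. The absolute values carry no meaning beyond that.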

Contributions
• Crammed Refresh: Per-bank Refresh + All-bank Refresh
• 2 banks are refreshed in parallel, instead of 1 bank as in Per-bank Refresh or all banks as in All-bank Refresh
• Massed Refresh: Crammed Refresh + Scattered Refresh
• 2 banks are refreshed in parallel
• Uses SLP in both banks being refreshed
• #AB: number of banks available to serve other requests while the remaining banks are being refreshed
• BLP: bank-level parallelism
• SLP: subarray-level parallelism
• Only 2 banks are refreshed in parallel as a proof of concept; more than 2 banks can also be chosen
• The idea is to keep a balance between #AB and BLP for refresh
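The balance between #AB and BLP can be made concrete with a small sweep over the number of banks refreshed in parallel. This is a hedged sketch: the function `refresh_tradeoff`, the 8-bank vault, the bundle size of 16, and the 50 ns row cycle are illustrative assumptions, and the tRFC expression is a first-order model without SLP.

```python
import math


def refresh_tradeoff(total_banks, banks_in_parallel, rb_size, t_rc_ns):
    """Return (#AB, tRFC in ns) when `banks_in_parallel` banks refresh at once.

    #AB is the number of banks left free to serve requests; tRFC assumes
    the bundle's rows are split evenly across the refreshing banks.
    """
    if not 1 <= banks_in_parallel <= total_banks:
        raise ValueError("banks_in_parallel must be in 1..total_banks")
    available = total_banks - banks_in_parallel
    t_rfc = math.ceil(rb_size / banks_in_parallel) * t_rc_ns
    return available, t_rfc


TOTAL_BANKS = 8     # assumed banks per vault
RB_SIZE = 16        # assumed refresh bundle size
T_RC_NS = 50.0      # hypothetical row cycle time

results = {}
for k in (1, 2, 4, 8):
    results[k] = refresh_tradeoff(TOTAL_BANKS, k, RB_SIZE, T_RC_NS)
    available, t_rfc = results[k]
    print(f"{k} bank(s) in parallel: #AB = {available}, tRFC = {t_rfc} ns")
```

In this model, k = 1 corresponds to Per-bank Refresh, k = 8 to All-bank Refresh, and k = 2 to the Crammed Refresh proof of concept. Each doubling of k halves tRFC but takes more banks out of service.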

Crammed Refresh – tRFC Timing
[Figure: Example Crammed Refresh Operation – HMC Vault – Refresh Bundle size of 4, comparing the Scattered, Crammed, and Per-Bank timing diagrams. Legend: L = Layer ID, B = Bank ID, SA = Subarray ID, R = Row ID]
• Bank-level parallelism (BLP) for refresh
• Only 2 banks are refreshed in parallel → #AB > 0
tRFC for Crammed Refresh < tRFC for Scattered Refresh

Massed Refresh – tRFC Timing
[Figure: Example Massed Refresh Operation – HMC Vault – Refresh Bundle size of 4, comparing the Massed, Crammed, and Per-Bank timing diagrams. Legend: L = Layer ID, B = Bank ID, SA = Subarray ID, R = Row ID]
• Bank-level parallelism (BLP) + Subarray-level parallelism (SLP) for refresh
tRFC for Massed Refresh < tRFC for Crammed Refresh
How to implement BLP and SLP together?
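One way to see how BLP and SLP combine is to build the refresh schedule row by row and read tRFC off the last event. The scheduler below is a sketch under simplifying assumptions, not the authors' controller. Rows are dealt round-robin to the refreshing banks, and with SLP the next ACTIVATE in a bank starts as soon as the previous row's tRAS has elapsed. The names `RowRefresh` and `schedule` and the timing values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RowRefresh:
    bank: int
    subarray: int
    row: int
    start_ns: float   # start of ACTIVATE
    end_ns: float     # end of PRECHARGE


def schedule(rb_rows, banks_in_parallel, use_slp, t_ras, t_rp):
    """Build a refresh schedule for one refresh bundle."""
    events = []
    next_start = {b: 0.0 for b in range(banks_in_parallel)}
    rows_issued = {b: 0 for b in range(banks_in_parallel)}
    for i, row in enumerate(rb_rows):
        bank = i % banks_in_parallel              # BLP: round-robin over banks
        # SLP: each row of a bank goes to a different subarray.
        subarray = rows_issued[bank] if use_slp else 0
        start = next_start[bank]
        end = start + t_ras + t_rp
        events.append(RowRefresh(bank, subarray, row, start, end))
        # With SLP the PRECHARGE overlaps the next ACTIVATE in the same bank;
        # without it the bank must finish the PRECHARGE first.
        next_start[bank] = start + t_ras if use_slp else end
        rows_issued[bank] += 1
    return events


def trfc(events):
    return max(e.end_ns for e in events)


ROWS = [0, 1, 2, 3]              # refresh bundle size of 4, as in the example
T_RAS, T_RP = 35.0, 15.0         # hypothetical timing values in ns

per_bank = trfc(schedule(ROWS, 1, False, T_RAS, T_RP))
scattered = trfc(schedule(ROWS, 1, True, T_RAS, T_RP))
crammed = trfc(schedule(ROWS, 2, False, T_RAS, T_RP))
massed = trfc(schedule(ROWS, 2, True, T_RAS, T_RP))

print(per_bank, scattered, crammed, massed)
```

With these placeholder timings the schedule gives Massed < Crammed < Scattered < Per-bank, which agrees with the orderings stated on the timing slides.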

Subarray-level Parallelism (SLP)
[Figure: bank with a Global Row-address Latch compared with a bank with Per-Subarray Row-address Latches]
Source: Y Kim+, ISCA 2012
Global Row-address Latch hinders SLP

Bank-level Parallelism (BLP)
[Figure: Logic Base (LoB) below memory dies 1 to 4. Labels: vault controller, refresh controller (refresh scheduler, address calculator, control), 17-bit address counter, physical address latch with fields LayerAddr[2], RowAddr[14], BankAddr[1], physical address decoder, row address latch, LayerID (LID), BankID (BID), Mask, EN, TSV launch pads, to banks]
BLP is implemented by masking BankID during refresh
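The effect of masking BankID can be sketched with a small address decoder. The field widths (2-bit layer, 14-bit row, 1-bit bank) come from the slide. The bit order, the function names `decode` and `refresh_targets`, and the tuple layout are assumptions made for illustration and do not describe the actual hardware.

```python
LAYER_BITS, ROW_BITS, BANK_BITS = 2, 14, 1   # field widths from the slide
COUNTER_BITS = LAYER_BITS + ROW_BITS + BANK_BITS


def decode(counter):
    """Split a 17-bit refresh address into (layer, row, bank).

    Assumed bit order, MSB to LSB: LayerAddr | RowAddr | BankAddr.
    """
    if not 0 <= counter < (1 << COUNTER_BITS):
        raise ValueError("counter does not fit in 17 bits")
    bank = counter & ((1 << BANK_BITS) - 1)
    row = (counter >> BANK_BITS) & ((1 << ROW_BITS) - 1)
    layer = counter >> (BANK_BITS + ROW_BITS)
    return layer, row, bank


def refresh_targets(counter, mask_bank_id):
    """Return the (layer, bank, row) tuples selected by a refresh address."""
    layer, row, bank = decode(counter)
    if mask_bank_id:
        # BankID is ignored, so every bank in the selected layer matches
        # and the same row is refreshed in all of them in parallel.
        banks = range(1 << BANK_BITS)
    else:
        banks = [bank]
    return [(layer, b, row) for b in banks]


address = (1 << 15) | (5 << 1) | 1        # layer 1, row 5, bank 1
unmasked = refresh_targets(address, mask_bank_id=False)
masked = refresh_targets(address, mask_bank_id=True)
print(unmasked)
print(masked)
```

With the mask off, the address selects a single bank. With the mask on, the same address reaches both banks of the layer, which is the two-bank parallelism used as the proof of concept.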

Evaluation Setup
• Trace-driven simulation for PARSEC benchmarks
• Memory access traces extracted from detailed cycle-accurate simulations using gem5
• These memory traces were then provided as inputs to the DRAM simulator DRAMSim2
• Energy, timing, and area analysis
• CACTI-3DD based simulation, based on a 4Gb HMC quad model
• DRAMSim2 configuration
• Configured DRAMSim2 using CACTI-3DD results

Results I – Energy, Timing, Area

Results II – Throughput
[Figure: throughput across the PARSEC benchmarks]
• Crammed Refresh achieves 7.1% and 2.9% more throughput on average than distributed per-bank refresh and scattered refresh, respectively
• Massed Refresh achieves 8.4% and 4.3% more throughput on average than distributed per-bank refresh and scattered refresh, respectively

Results III – Energy Delay Product (EDP)
[Figure: EDP across the PARSEC benchmarks]
• Crammed Refresh achieves 6.4% and 2.7% less EDP on average than distributed per-bank refresh and scattered refresh, respectively
• Massed Refresh achieves 7.5% and 3.9% less EDP on average than distributed per-bank refresh and scattered refresh, respectively
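EDP is the product of energy and delay, so a lower value is better, and a percentage reduction compares the product for a new scheme against a baseline. The sketch below shows that calculation. The input values are hypothetical and normalized to the baseline; they are not measurements from the paper.

```python
def edp(energy, delay):
    """Energy-delay product: lower is better."""
    return energy * delay


def edp_reduction_percent(base_energy, base_delay, new_energy, new_delay):
    """Percentage by which the new scheme lowers EDP relative to the baseline."""
    base = edp(base_energy, base_delay)
    new = edp(new_energy, new_delay)
    return 100.0 * (base - new) / base


# Hypothetical, normalized inputs: the new scheme uses 2% less energy
# and finishes 5% sooner than the baseline.
reduction = edp_reduction_percent(
    base_energy=1.0, base_delay=1.0,
    new_energy=0.98, new_delay=0.95,
)
print(f"EDP reduction: {reduction:.1f}%")
```

Because the two factors multiply, a small gain in each compounds: here 2% less energy and 5% less delay give roughly a 6.9% lower EDP.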

Conclusions
• The proposed Massed Refresh technique exploits bank-level as well as subarray-level parallelism during refresh operations
• The proposed Crammed Refresh and Massed Refresh techniques improve the throughput and energy efficiency of DRAM
• Crammed Refresh improves upon the state of the art
• 7.1% and 6.4% improvements in throughput and EDP, respectively, over distributed per-bank refresh
• 2.9% and 2.7% improvements in throughput and EDP, respectively, over scattered refresh
• Massed Refresh improves upon the state of the art
• 8.4% and 7.5% improvements in throughput and EDP, respectively, over distributed per-bank refresh
• 4.3% and 3.9% improvements in throughput and EDP, respectively, over scattered refresh

• Questions / Comments?
Thank You