## ENHANCING MICROPROCESSOR POWER EFFICIENCY THROUGH CLOCK-DATA COMPENSATION

A Thesis Presented to The Academic Faculty

by

Ashwin Srinath Subramanian

In Partial Fulfillment of the Requirements for the Degree Master of Science in the School of Electrical and Computer Engineering

> Georgia Institute of Technology December 2015

Copyright © 2015 by Ashwin Srinath Subramanian

#### ENHANCING MICROPROCESSOR POWER EFFICIENCY THROUGH CLOCK-DATA COMPENSATION

Approved by:

Dr. Arijit Raychowdhury, Advisor School of Electrical and Computer Engineering Georgia Institute of Technology

Dr. Saibal Mukhopadhyay School of Electrical and Computer Engineering Georgia Institute of Technology

Dr. Hua Wang School of Electrical and Computer Engineering Georgia Institute of Technology

Date Approved: December 4th 2015

To my parents,

without whom none of this would have been possible.

#### ACKNOWLEDGEMENTS

I have interacted with many amazing people during the course of my graduate studies at Georgia Tech, and their contributions have helped me in various ways. It's only fitting that I convey my deep sense of gratitude.

First and foremost, I am extremely indebted to my advisor, Professor Arijit Raychowdhury for his insightful guidance, patience and encouragement throughout my research. His profound knowledge of many diverse areas in circuit design relevant to industry and his ability to integrate them to offer unique and novel solutions to pressing problems is worthy of emulation.

I am grateful to Professor Saibal Mukhopadhyay and Professor Hua Wang for agreeing to be part of my reading committee. Their constructive criticism and suggestions based on experience were really valuable.

I am thankful to the present and past members of the Integrated Circuits and Systems Research Laboratory including Samantak Gangopadhyay, Ashwin Chintaluri, Soham Desai, Anvesha Amravati and Insik Yoon for their support and friendship. I would also like to thank my room-mates and friends for their encouragement and their help in proof-reading my thesis.

Finally, I would like to thank my parents without whose support none of this would have been possible.

# TABLE OF CONTENTS

| Chapter | Section | Title                                     |
|---------|---------|-------------------------------------------|
|         |         | DEDICATION                                |
|         |         | ACKNOWLEDGEMENTS                          |
|         |         | LIST OF FIGURES                           |
|         |         | SUMMARY                                   |
| I       |         | INTRODUCTION                              |
| II      |         | BACKGROUND AND MOTIVATION                 |
|         | 2.1     | Voltage Droops                            |
|         | 2.2     | Clock-Data Compensation                   |
|         | 2.3     | PLL Clock Modulation                      |
|         | 2.4     | Resilient Circuit Design                  |
|         | 2.5     | Power Management                          |
| III     |         | INTEGRATING POWER MANAGEMENT AND CLOCKING |
|         | 3.1     | Circuit Design                            |
|         | 3.2     | Architecture                              |
|         | 3.3     | Mathematical Analysis                     |
| IV      |         | SYSTEM PERFORMANCE                        |
|         | 4.1     | Steady State Characteristics              |
|         | 4.2     | Transient Characteristics                 |
|         | 4.3     | Regulator Efficiency                      |
| V       |         | CONCLUSION                                |
|         |         | REFERENCES                                |

# LIST OF FIGURES

| No. | Figure                                                                                     |
|-----|--------------------------------------------------------------------------------------------|
| 1   | Simulated Voltage Droops[22]                                                               |
| 2   | Voltage Droop Occurrence in SPEC Benchmarks[18]                                            |
| 3   | Production Processor Voltage Margins[18]                                                   |
| 4   | Clocking a Typical Microprocessor Datapath                                                 |
| 5   | Timing Margins during a Voltage Droop                                                      |
| 6   | Adaptive Frequency control through Supply Tracking[13]                                     |
| 7   | Adaptive Phase-Shifting PLL[11]                                                            |
| 8   | Desired Clock-Data Compensation from the Adaptive PLL[11]                                  |
| 9   | Adaptive Clock Distribution[2]                                                             |
| 10  | Tunable Replica Circuit[2]                                                                 |
| 11  | Dynamic Variation Monitor[1]                                                               |
| 12  | Adaptive Clock Gating on a Voltage Droop[2]                                                |
| 13  | Power distribution using DC-DC switching converters[15]                                    |
| 14  | Power loss in switched-capacitor converters[8]                                             |
| 15  | Integrating Power Management and Clocking                                                  |
| 16  | Phase-Locked Linear Regulator                                                              |
| 17  | Tunable Replica Circuit based Voltage Controlled Oscillator                                |
| 18  | Overrun Protection Unit[6]                                                                 |
| 19  | Dynamic Frequency Scaling                                                                  |
| 20  | Feedback Divider in the System                                                             |
| 21  | Output Voltage at Different Clock Frequencies                                              |
| 22  | Delay Scaling of Two Different Circuits                                                    |
| 23  | Response to a Current Transient Inside the Regulation Bandwidth                            |
| 24  | Response to a Current Transient Outside the Regulation Bandwidth                           |
| 25  | Response to a Current Transient Outside the Regulation Bandwidth with a Divide Ratio of 16 |
| 26  | Severity of a Voltage Droop at different Divide Ratios                                     |
| 27  | Time taken to recover from a Voltage Droop at different Divide Ratios                      |
| 28  | Minimum Core Clock frequency during a Voltage Droop at different Divide Ratios             |
| 29  | Voltage Ripple vs No of Stages at Constant Load Current                                    |
| 30  | Controller Power vs No of Stages at Constant Load Current                                  |
| 31  | Current Efficiency vs No of Stages at Constant Load Current                                |
| 32  | Power Efficiency vs No of Stages at Constant Load Current                                  |
| 33  | Voltage Ripple vs No of Stages at Different Load Currents                                  |
| 34  | Controller Power vs No of Stages at Different Load Currents                                |
| 35  | Current Efficiency vs No of Stages at Different Load Currents                              |
| 36  | Power Efficiency vs No of Stages at Different Load Currents                                |

#### SUMMARY

The smartphone revolution and the Internet of Things (IoT) have triggered rapid advances in complex systems-on-chip (SoCs) that increasingly provide more functionality within a tight power budget. Highly power-efficient on-die switched-capacitor voltage regulators suffer from large output voltage ripple, preventing their widespread use in modern integrated circuits. With technology scaling and increasing architectural complexity, the number of transistors switching in a power domain is growing rapidly, leading to major issues with respect to voltage noise. The large voltage and frequency guardbands present in current microprocessor designs to combat voltage noise both degrade the performance and erode the energy efficiency of the design. In an effort to reduce guardbands, adaptive-clocking-based systems combat the problem of voltage noise by adjusting the clock frequency during a voltage droop to avoid timing failure. This thesis presents an integrated power management and clocking scheme that utilizes clock-data compensation to achieve adaptive clocking. The design is capable of automatically configuring the supply voltage given a target clock frequency for the load circuit. Furthermore, during a voltage droop the design adjusts clock frequency to meet critical path timing margins while simultaneously increasing the current delivered to the load to recover from the droop. The design was implemented in IBM's 130 nm technology and simulation results show that the design is able to clock the load circuit from 30 MHz to 800 MHz with current efficiencies as high as 97%.

#### CHAPTER I

## INTRODUCTION

In an effort to keep up with Moore's Law, the microprocessor industry has been following an aggressive cadence of technology scaling. Newer technology nodes offer the promise of faster transistors with more control over channel current characteristics, more transistors per unit area and lower power consumption. In practice, however, power density and leakage concerns often limit the performance that can be derived, and innovative circuit techniques are required to fully realize the potential offered by advanced technology nodes. The rising trend of mobile devices and their associated applications has pushed power to the forefront of the concerns that designers have to deal with. Power efficiency has been shown to be the key metric based on which modern mobile SoCs succeed.

Voltage noise has become a major issue at advanced technology nodes, with the guardbands implemented to over-provision for voltage noise severely degrading the operating frequency and power efficiency of microprocessors. The increasing demand for fine-grained DVFS in modern SoCs has fueled the development of on-die switching converters that achieve higher efficiency than linear regulators across a broad range of output voltages.

This thesis aims to design an integrated power management and clocking circuit that is capable of automatically setting the core voltage given the target clock frequency while operating at the point of maximum energy efficiency. It also explores the architectural implications of such an integrated power delivery and clocking scheme, investigates a potential opportunity to reduce clock distribution power and studies the performance of the system through extensive simulation.

#### CHAPTER II

#### BACKGROUND AND MOTIVATION

#### 2.1 Voltage Droops

The power delivery network of a microprocessor tries to maintain a constant supply voltage on the power rails across a wide range of operating conditions. However, abrupt changes in microprocessor switching activity induce large current transients in the power delivery system. Such changes can arise for several reasons, such as processor wake-up from a sleep state, transition to a turbo state, or the execution of a compute-intensive block of instructions with high core utilization. Large current transients strain the power delivery network and lead to voltage droops on the power grid. Voltage droops can severely impact the performance of a microprocessor when a critical path is activated during the voltage droop, with potential errors in data and control logic. To complicate matters, the different elements that make up the critical path, such as the logic gates and interconnects, have different delay sensitivities to the processor supply voltage. Therefore, delay scaling is not straightforward during a Vdd droop, and at low enough supply voltages the critical path for a given block may change.

Figure 1: Simulated Voltage Droops[22]

Figure 2: Voltage Droop Occurrence in SPEC Benchmarks[18]

Voltage droops can be divided into three distinct categories: a first droop that depends on the package inductance and on-die decoupling capacitors and lasts for a few nanoseconds, a second droop that depends on board and package capacitance and lasts for hundreds of nanoseconds, and a third droop that is primarily dependent on the bulk capacitors on the board and lasts for several microseconds[22]. Figure 1 shows the three characteristic voltage droops.
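As a rough illustration of why the first droop is both fast and severe, the sketch below treats the package inductance and the on-die decoupling capacitance as an undamped LC tank. The component values and the current step are hypothetical, chosen only for illustration, and damping from grid resistance is ignored, so the droop estimate is an upper bound under that assumption.

```python
import math


def first_droop_resonance_hz(l_package_h, c_die_f):
    """Resonant frequency of an ideal LC tank: f = 1 / (2*pi*sqrt(L*C))."""
    return 1.0 / (2.0 * math.pi * math.sqrt(l_package_h * c_die_f))


def undamped_droop_v(delta_i_a, l_package_h, c_die_f):
    """Peak droop for a current step into an undamped LC network: dI * sqrt(L/C)."""
    return delta_i_a * math.sqrt(l_package_h / c_die_f)


# Hypothetical values, not taken from the thesis.
L_PACKAGE = 50e-12   # 50 pH package inductance
C_DIE = 50e-9        # 50 nF on-die decoupling capacitance
DELTA_I = 2.0        # 2 A load current step

f_res = first_droop_resonance_hz(L_PACKAGE, C_DIE)
droop = undamped_droop_v(DELTA_I, L_PACKAGE, C_DIE)
print(f"resonance ~ {f_res / 1e6:.1f} MHz, undamped droop ~ {droop * 1e3:.1f} mV")
```

Under these assumed values the resonance period is on the order of ten nanoseconds, which is consistent with a first droop that lasts a few nanoseconds.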

The third kind of voltage droop, which lasts for several microseconds, is primarily dependent on the on-board bulk capacitors and can therefore be minimized through liberal use of bulk capacitors on the board. Since the second kind of voltage droop is a function of the motherboard and package decoupling, it can be readily minimized by improving the motherboard and package routing and increasing the amount of on-chip decoupling capacitors. Finally, of the three different droops, the first type is the most severe and affects circuit operation the most. However, mitigating the first kind of voltage droop is quite difficult because it is generally impractical to place sufficient capacitance near the core supply to achieve perfect filtering. Increasing demand for core functionality has propelled a significant increase in the number of switching transistors, and consequently the dynamic power, of each new core micro-architecture. Furthermore, increasing variation of transistor characteristics, both across the die and under different process, voltage and temperature conditions at each new technology node, exacerbates these voltage droops.

Figure 3: Production Processor Voltage Margins[18]

In [18], Reddi *et al.* show that a large fraction of the voltage droops that occur during the execution of a program are within 2-3% of the nominal voltage with very few outliers. Furthermore, as seen in figure 2, there are typically few voltage droops over a large number of cycles and the number of droops correlates very well with the stall ratio (0.97 correlation coefficient). As alluded to earlier, the large current transients that cause a lot of these voltage droops occur due to micro-architectural events like the processor recovering from various stalls.

From figure 3 it is evident that typical production processors have highly conservative voltage margins. Creating a hefty guard band is an extremely effective way of handling these voltage droops and ensuring reliable microprocessor performance under various process, voltage and temperature variations. However, this significantly degrades the performance and energy efficiency of such processor designs.



Figure 4: Clocking a Typical Microprocessor Datapath

#### 2.2 Clock-Data Compensation

Figure 4 shows how clocking is done in a typical microprocessor pipeline stage. The root clock signal is generated at the PLL, is distributed through buffers along the clock tree and finally reaches the leaf flops in the pipeline registers of the datapath.

An interesting effect occurs in the microprocessor datapath during a voltage droop. The PLL operates on an independent voltage rail and therefore the voltage droops on the digital core power grid do not affect the root clock. However, both the clock distribution buffers and the datapath are connected to the digital core power grid and therefore the clock tree delay and the datapath delay are affected by voltage droops. As the voltage begins to droop on the digital core power grid, the clock distribution delay to the leaf nodes on the clock tree increases. Thus, at the onset of a voltage droop, the clock delay increases temporarily to match the datapath delay and this phenomenon is known as clock-data compensation[22].

Figure 5: Timing Margins during a Voltage Droop

The benefits of clock-data delay compensation depend on a complex interaction between the clock frequency, the distribution delay, the datapath delay, the supply noise frequency and the sensitivity of the clock distribution and datapath to voltage noise[22][10][9]. Conventional schemes that try to take advantage of clock-data compensation need to carefully tune the aforementioned parameters to achieve maximum timing slack benefit. For instance, a very short clock path in the distribution network makes the clock modulation effect weaker and a very long path makes each clock edge see a similar average supply noise. If the supply noise is close to DC, then two consecutive clock edges will see almost the same supply noise, and if the supply noise is of very high frequency, then once again two consecutive clock edges see almost the same supply noise due to an averaging effect. Furthermore, if the clock distribution network is weakly sensitive to voltage noise then proper clock-data compensation cannot occur. Finally, if the clock distribution network is very sensitive to voltage noise then the benefits of clock-data compensation abate since there will be extremely compressed clock periods during upswings. This also illustrates a major issue with using clock-data compensation to improve timing slack.
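The dependence on noise frequency described above can be sketched numerically. The model below is an assumption made for illustration, not the analysis used in the cited works: the clock-tree delay is taken to scale linearly with sinusoidal supply noise averaged over the nominal traversal time of the tree, and all parameter values are hypothetical.

```python
import math


def leaf_arrival(t, f_noise, d_nom=2e-9, sensitivity=2.0, amp=0.05, v_nom=1.0):
    """Arrival time at a leaf flop of a root clock edge launched at time t.

    Assumption: the tree delay scales linearly with the supply noise
    amp*sin(2*pi*f*t) averaged over the window [t, t + d_nom].
    """
    w = 2.0 * math.pi * f_noise
    # Closed-form mean of amp*sin(w*tau) over the traversal window.
    mean_noise = amp * (math.cos(w * t) - math.cos(w * (t + d_nom))) / (w * d_nom)
    # A higher supply shortens the delay, a lower supply lengthens it.
    return t + d_nom * (1.0 - sensitivity * mean_noise / v_nom)


def peak_period_modulation(f_noise, t_clk=1e-9, n_cycles=2000):
    """Largest deviation of the leaf clock period from the root clock period."""
    arrivals = [leaf_arrival(k * t_clk, f_noise) for k in range(n_cycles + 1)]
    return max(abs((b - a) - t_clk) for a, b in zip(arrivals, arrivals[1:]))


for f in (1e6, 200e6, 10.3e9):
    print(f"{f / 1e6:9.1f} MHz noise -> {peak_period_modulation(f) * 1e12:7.2f} ps")
```

In this simplified model the period modulation is small for near-DC noise, because consecutive edges see nearly the same supply, and small again for very high-frequency noise, because the noise averages out along the tree. It is largest in between.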

Clock-data compensation occurs only for a few clock edges while the voltage is drooping. Consider the timing margin of the datapath of the circuit shown in figure 4. When the voltage begins to recover from the droop, the clock distribution network delay reduces and clock edges start arriving at the leaf flops earlier than anticipated. While the delay of the datapath also reduces, it is however relative to the previous slow clock edge and therefore the next clock edge arrives at the output flops faster than the data does. This is illustrated in figure 5 with the P denoting a pass and the F denoting a fail in that particular clock cycle due to timing violations.
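The stretch-then-compress behavior described above can be reproduced with a minimal sketch. The triangular droop, the linear delay model and every numeric value below are assumptions chosen for illustration (time in nanoseconds), not data from the thesis.

```python
def droop_voltage(t, v_nom=1.0, v_min=0.9, t_start=5.0, t_fall=4.0, t_rise=4.0):
    """Assumed triangular droop: nominal, linear fall to v_min, linear recovery."""
    if t < t_start or t >= t_start + t_fall + t_rise:
        return v_nom
    if t < t_start + t_fall:
        return v_nom - (v_nom - v_min) * (t - t_start) / t_fall
    return v_min + (v_nom - v_min) * (t - t_start - t_fall) / t_rise


def scaled_delay(d_nom, v, v_nom=1.0, sensitivity=2.0):
    """Assumed linear delay model: delay grows as the supply falls."""
    return d_nom * (1.0 + sensitivity * (v_nom - v) / v_nom)


def simulate(n_cycles=18, t_clk=1.0, tree_delay=2.0, path_delay=0.95):
    """Return (leaf clock period, 'P' or 'F') for each cycle."""
    # The root clock is unaffected by the droop; only the tree delay varies.
    arrivals = [k * t_clk + scaled_delay(tree_delay, droop_voltage(k * t_clk))
                for k in range(n_cycles + 1)]
    results = []
    for launch, capture in zip(arrivals, arrivals[1:]):
        period = capture - launch
        data = scaled_delay(path_delay, droop_voltage(launch))
        results.append((period, "P" if period >= data else "F"))
    return results


for cycle, (period, flag) in enumerate(simulate()):
    print(f"cycle {cycle:2d}: leaf period {period:.3f} ns  {flag}")
```

With these assumed parameters the leaf clock period stretches while the supply falls and compresses while it recovers, and every compressed cycle fails timing because the period drops below even the nominal datapath delay.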

#### 2.3 PLL Clock Modulation

Conventional approaches to improve microprocessor performance in the presence of voltage noise employed circuits to detect voltage droops. After a voltage droop was detected, operating conditions like clock frequency were adjusted to reduce the effect of these droops. However, the response time of the "droop detector" circuits was slow. They could only be used for detecting low frequency droops and could not be successfully used to prevent critical path timing failure during high frequency droops.

The timing failures occurring in figure 5 happen primarily because the edges of the leaf clock begin to "compress" as the supply voltage recovers from the voltage droop. If, however, the leaf clock edges did not "compress" during the upswing then the timing failures could be avoided. This could be achieved by stretching the PLL root clock in response to a voltage droop, which would compensate for the "compression" experienced by the leaf clock edges during the upswing.

An interesting approach proposed by Kurd et al.[13] tries to achieve PLL root clock stretching by directly modulating the phase-locked-loop clock output with voltage changes in the power grid. In the design, an analog mixer combined the digital core supply voltage with the control voltage of the voltage-controlled oscillator in the PLL. Voltage noise on the supply was therefore coupled to the phase output of the PLL, resulting in a stretched PLL clock output during a voltage droop. The mixer circuit controlled the influence of the digital core Vdd on the VCO voltage to ensure PLL stability. In advanced power states, when the digital core operated at a voltage below the safe level for the analog circuitry, the mixer core disabled the influence of the digital core on the VCO control voltage to avoid large short-circuit currents. Figure 6 shows the adaptive supply tracking circuit proposed in [13].

Figure 6: Adaptive Frequency control through Supply Tracking[13]

A similar approach is adopted in [11] where the authors also consider the phase of the supply noise as seen by the clock buffers. This is not a new idea and the Intel Pentium processors utilized clock-data compensation by applying a separate RC lowpass filtered supply to the clock distribution network to control the phase of voltage noise seen by the clock buffers[14].

In [11] the authors design a supply tracking modulator unit that is embedded in the PLL. This unit couples the digital core voltage with the control voltage of the VCO through programmable binary-weighted capacitor banks and a bias generation circuit. The capacitor banks and transistors M1 and M2 in figure 7 form a high-pass filter so that the supply noise from the digital core voltage can be coupled with the bias voltage of the PLL. The PLL can be programmed to have the desired noise sensitivity and phase shift to generate timing like in figure 8.

Figure 7: Adaptive Phase-Shifting PLL[11]

While these schemes were successful, they have their shortcomings. The adaptive PLLs were only able to act on and produce stretched clocks for fast droops that were outside the loop bandwidth of the PLLs. Any voltage noise that was within the loop bandwidth of the adaptive PLLs could not be acted upon since the closed loop control of the PLLs suppressed the noise automatically. Furthermore, PLL stability requirements and analog circuit complexities limited the potential benefits of this technique for a wide range of operating conditions.
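The loop-bandwidth limitation can be illustrated with a first-order approximation. Noise injected at the VCO of a PLL is commonly modeled as seeing a high-pass response to the output, and the sketch below assumes the simplest such form, H(s) = s / (s + ω_c). Real loops are higher order, and the bandwidth value used here is hypothetical.

```python
import math


def vco_noise_gain(f_noise_hz, loop_bandwidth_hz):
    """Magnitude of the assumed first-order high-pass H(s) = s / (s + w_c)."""
    ratio = f_noise_hz / loop_bandwidth_hz
    return ratio / math.sqrt(1.0 + ratio * ratio)


LOOP_BW_HZ = 1e6  # hypothetical PLL loop bandwidth

for f in (1e4, 1e5, 1e6, 1e7, 1e8):
    gain = vco_noise_gain(f, LOOP_BW_HZ)
    print(f"droop at {f / 1e6:7.2f} MHz -> {gain:.3f} of the injected modulation survives")
```

Under this assumption, supply noise well below the loop bandwidth is largely removed by the loop, so the root clock is not stretched for slow droops, while noise well above the bandwidth passes almost unattenuated.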



Figure 8: Desired Clock-Data Compensation from the Adaptive PLL[11]



Figure 9: Adaptive Clock Distribution[2]

#### 2.4 Resilient Circuit Design

All-digital resilient timing-error detection and recovery circuits provide an alternate method to deal with voltage noise, in contrast to the clock period modulation circuits discussed that prevent timing failures in critical paths. Such resilient timing-error detection and recovery circuits, like those in [4] and [3], relax the response-time constraints and avoid the limitations of analog circuits by allowing timing failures to occur. The resilient design detects the timing violations and isolates the error from corrupting the architecture state of the micro-processor. The design then corrects the failure by re-executing the instructions that were deemed to have errored. Such resilient designs are highly effective at improving performance in the presence of high frequency Vdd droops. However, the architectural complexities involved in such a resilient design prevent practical implementations and it is an area of ongoing research.

Bowman *et al.* in [2] and [1] use clock-data compensation to generate sufficient margins to detect voltage droops. Their design integrates a tunable length delay[20] prior to the global clock distribution to prolong the clock-data delay compensation in critical paths during a voltage droop. The tunable length delay is achieved through a tunable replica circuit (TRC) as shown in figure 10. The TRC includes programmable length transistor and interconnect delay components which allow the circuit to be tuned for the appropriate delay and voltage supply sensitivity. It is typically tuned to match the delay profile and voltage sensitivity of the critical path in the microprocessor pipeline. The tunable length delay prevents critical-path timing-margin degradation for multiple cycles after the voltage droop occurs.

Figure 10: Tunable Replica Circuit[2]

The on-die dynamic variation monitor (DVM) detects the onset of the voltage droop and generates the corresponding signals to drive a finite state machine (FSM) that either gates the clock or adjusts its frequency. Figure 11 shows the circuit used to implement the dynamic variation monitor and its operation is described below. The DVM contains a TRC element between the launch and capture flip-flops that has configuration bits that determine the TRC delay and configuration bits to tune the rise and fall transitions. The circuit around the launch flip-flop toggles the input to the flip-flop every cycle. During a rising or falling transition the input data arrives at the check output flip-flop with minimal delay while simultaneously propagating through the TRC to either the rise or fall output flip-flops. In the next clock cycle the check output is compared with the rise or fall output to determine whether an error, that is, insufficient timing margin for the data to propagate through the TRC, has occurred. An additional flip-flop is added to capture the error signal after another clock cycle to combat metastability issues.
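A behavioral sketch of the DVM described above follows. It is a cycle-level simplification, not the circuit in [1]: the TRC delay model, its parameters and the per-cycle supply samples are hypothetical, rise and fall paths are not distinguished, and the extra flip-flop is modeled as a single cycle of latency on the comparison result.

```python
def trc_delay(v, d_nom=0.9, v_nom=1.0, sensitivity=2.0):
    """Assumed linear TRC delay model (in clock periods) versus supply voltage."""
    return d_nom * (1.0 + sensitivity * (v_nom - v) / v_nom)


def dvm_error_flags(voltages, t_clk=1.0):
    """Return the registered error flag for each cycle."""
    flags = []
    pending = False  # comparison result waiting for the extra flip-flop
    data = 0
    for v in voltages:
        flags.append(pending)
        data ^= 1                      # launch flip-flop toggles every cycle
        check = data                   # check path: minimal delay
        on_time = trc_delay(v) <= t_clk
        trc_out = data if on_time else data ^ 1  # late data leaves the old level
        pending = check != trc_out     # mismatch means insufficient margin
    return flags


supply = [1.0, 1.0, 0.97, 0.93, 0.90, 0.95, 1.0, 1.0]  # hypothetical droop
print(dvm_error_flags(supply))
```

In this model an error is flagged one cycle after the TRC delay first exceeds the clock period, and the flag clears once the supply has recovered enough for the TRC to meet timing again.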

Figure 11: Dynamic Variation Monitor[1]

Figure 12: Adaptive Clock Gating on a Voltage Droop[2]

In [2], when the DVM detects a voltage droop, it produces an error signal that drives an FSM which gates the clock signal for the next couple of cycles. The working of the design during a voltage droop is shown in figure 12. Note that by gating the clock, the design eliminates the clock edges that would have caused timing failure. In [1], by contrast, when the DVM detects a voltage droop, it produces an error signal that drives another signal which adjusts the clock divider to reduce the clock frequency from $F_{CLK}$ to $F_{CLK}/2$. An FSM maintains the $F_{CLK}/2$ operation for a programmed number of clock cycles after the voltage recovers from the droop as a safety guardband.
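The frequency-halving response described above can be sketched as a small state machine. This is an illustrative model only: the function name, the `hold_cycles` parameter and the use of the de-asserted error flag as a proxy for voltage recovery are assumptions, not details from [1].

```python
def clock_divider_schedule(errors, hold_cycles=4):
    """Return the clock divide ratio (1 or 2) for each cycle.

    The divider switches to 2 while the DVM error flag is set and stays
    there for `hold_cycles` further cycles after the flag clears.
    """
    remaining = 0
    ratios = []
    for err in errors:
        if err:
            remaining = hold_cycles   # re-arm the guardband counter
            ratios.append(2)
        elif remaining > 0:
            remaining -= 1
            ratios.append(2)
        else:
            ratios.append(1)
    return ratios


flags = [False, True, True, False, False, False, False, False]
print(clock_divider_schedule(flags, hold_cycles=3))
```

The guardband counter is re-armed on every cycle in which the error flag is set, so a droop that lasts longer than expected keeps the divider at 2 until the programmed number of error-free cycles has elapsed.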

The all-digital adaptive clock distribution schemes proposed in [2] and [1] mitigate the impact of voltage droops on performance and energy efficiency while avoiding the limitations of analog circuits and the complex error recovery in a resilient design. There are, however, issues that arise when trying to use them in practical designs. The scheme in [2] involves stalling the pipeline of a complex microprocessor, which introduces a host of issues when considering the design and architecture of the overall system. Also, the scheme introduced in [1] involves transitioning to half the clock frequency, $F_{CLK}/2$, after detecting a voltage droop. While this would successfully avoid timing failure in the microprocessor pipeline, it would not operate the circuitry in the most efficient manner during, and for a few cycles after, the droop. Furthermore, transitioning the microprocessor from $F_{CLK}/2$ to $F_{CLK}$ could potentially introduce a new voltage droop in the power grid, and clock dithering logic needs to be used to avoid this scenario.

#### 2.5 Power Management

Modern systems-on-chip (SoCs) contain several CPU cores along with many GPU cores and other functional units. As the number of cores and units integrated on a chip increases, there is a growing need and potential benefit to having each module on an independent power supply to stay within the power budget while maximizing performance. To achieve this goal, a rising trend in modern SoCs is the capability to perform individual core dynamic voltage and frequency scaling (DVFS) [5] for power saving, and it is highly desirable to control the voltage and frequency of each core independently as shown in [16] and [12].

It is increasingly difficult to satisfy this demand through off-chip power supplies since this approach not only degrades the power supply impedance but also wastes the limited pins on the package. Therefore, there is strong motivation to integrate voltage conversion on die to achieve the power domain specifications, and figure 13 illustrates this notion. Today, DC-DC converters are primarily implemented using linear regulators, which are not very efficient across a broad range of output voltages. Switched-capacitor-based DC-DC switching converters that have high efficiency across a broad range of voltages have been proposed as a replacement for the linear regulators [15], [19]. Furthermore, since integrated capacitors can achieve low series resistance and high capacitance density, and can be fabricated without special steps, it is very easy to integrate switched-capacitor-based converters on die.

Figure 13: Power distribution using DC-DC switching converters[15]

A big disadvantage of Switched-Capacitor based DC-DC converters is the inherent loss caused by voltage ripple across the flying capacitors and figure 14 illustrates this phenomenon. A typical method to reduce the voltage ripple at the output of the switching converter is to use the converter with multiple switching phases - called the interleaved converter. Optimizing the converter involves adjusting the fly capacitor size, the switch width and the switching frequency[15]. Usually, capacitor size is fixed by the power density and aggressively optimizing the converter for efficiency involves choosing the appropriate switch width to minimize the switching loss and the conduction loss. It also involves adjusting the switching frequency to reduce the intrinsic switched-capacitor loss and the bottom plate losses. However, such an energy efficient converter has the problem of large output voltage ripple. This is a problem since conventional microprocessor systems are designed to operate based on the minimum supply voltage and the problem is further exacerbated in practice since voltage noise adds to the already significant voltage ripple.
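The effect of interleaving on output ripple can be estimated with a first-order charge-balance argument. The sketch below assumes an idealized converter in which each phase delivers one charge packet per switching period and the load discharges the output capacitance at constant current between packets. The formula and all component values are illustrative assumptions, not results from [15].

```python
def output_ripple_v(i_load_a, c_out_f, f_sw_hz, n_phases=1):
    """Idealized peak-to-peak ripple: dV = I_load / (N * f_sw * C_out)."""
    return i_load_a / (n_phases * f_sw_hz * c_out_f)


# Hypothetical operating point.
I_LOAD = 0.05      # 50 mA load current
C_OUT = 10e-9      # 10 nF output capacitance
F_SW = 100e6       # 100 MHz switching frequency

for phases in (1, 2, 4, 8):
    ripple = output_ripple_v(I_LOAD, C_OUT, F_SW, phases)
    print(f"{phases} phase(s): ripple ~ {ripple * 1e3:.2f} mV")
```

In this simple model the ripple falls inversely with the number of interleaved phases, which is the motivation for interleaving. It does not capture the switching and bottom-plate losses that also influence the choice of switching frequency.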

Figure 14: Power loss in switched-capacitor converters[8]

An aggressive form of adaptive clocking is therefore necessary to fully exploit the advantages offered by switched-capacitor DC-DC converters. The authors in [8] study the energy and performance impact of having aggressive adaptive clocking with switched-capacitor DC-DC converters. They optimize a switched-capacitor-based DC-DC converter to operate at maximum efficiency with a high output voltage ripple and seek to mitigate the effect of the ripple through adaptive clocking. Integrating this onto a RISC-V processor [21] implemented in a 28-nm FDSOI technology, they collect power and timing data through simulation and create mathematical models for the CPU energy and CPU frequency. They also create a mathematical model of the microprocessor critical path to track the dynamic CPU frequency from cycle to cycle as the supply voltage varies. The results of their simulations indicate that allowing switched-capacitor DC-DC converters to be optimized heavily without output voltage ripple constraints results in efficiencies greater than 90%. Furthermore, fine-grained DVFS along with adaptive clocking results in energy savings between 5% and 25% over a range of performance constraints.
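To illustrate why adaptive clocking pairs well with a rippling supply, the sketch below compares a fixed clock, which must be set for the minimum supply voltage, with an adaptive clock that tracks the instantaneous voltage. The linear frequency-voltage model, the sinusoidal ripple and all values are assumptions made for illustration and are not the models developed in [8].

```python
import math


def max_frequency_hz(v, f_nom=1.0e9, v_nom=1.0, v_th=0.3):
    """Assumed linear frequency-voltage model above a threshold voltage."""
    return f_nom * (v - v_th) / (v_nom - v_th)


def compare_clocking(v_avg=1.0, ripple_pp=0.1, n_samples=1000):
    """Return (fixed, adaptive) average clock frequency over one ripple period."""
    volts = [v_avg + 0.5 * ripple_pp * math.sin(2.0 * math.pi * k / n_samples)
             for k in range(n_samples)]
    # A fixed clock must be safe at the lowest voltage seen.
    fixed = max_frequency_hz(min(volts))
    # An adaptive clock follows the supply, so throughput is the time average.
    adaptive = sum(max_frequency_hz(v) for v in volts) / n_samples
    return fixed, adaptive


fixed_hz, adaptive_hz = compare_clocking()
print(f"fixed: {fixed_hz / 1e6:.0f} MHz, adaptive: {adaptive_hz / 1e6:.0f} MHz")
```

Under these assumptions the adaptive clock recovers the frequency that a fixed clock gives up to cover the ripple trough, and the gap widens as the peak-to-peak ripple grows.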

#### CHAPTER III

#### INTEGRATING POWER MANAGEMENT AND CLOCKING

#### 3.1 Circuit Design

In the previous chapter we discussed voltage droops and their impact on performance and energy efficiency. We also discussed methods of achieving adaptive clocking to mitigate the effect of voltage noise. Finally, we showed that adaptive clocking in conjunction with power management can yield great benefits. This thesis is based on the premise that combining power management and clocking circuitry is the best way to achieve aggressive adaptive clocking and to reduce both voltage and frequency guardbands.

To integrate clocking and power management in our design, we tightly couple the power management circuitry and the clock generation circuitry. The output voltage of the regulator is fed to the oscillator and the output clock of the oscillator is fed back to the regulator. Figure 15 illustrates this notion.

Power management is implemented in our design using a phase-locked linear regulator based on the work done in [6]. The choice is motivated by the regulator's capability to adjust its output voltage based on clock edge information, which provides an ideal interface to the clock output of the voltage controlled oscillator. The phase-locked linear regulator is shown in figure 16. It takes two input reference clocks, $R_{CLK}$ and $S_{CLK}$, and regulates the input voltage, $V_{IN}$, to produce the output voltage, $V_{OUT}$. $R_{CLK}$ and $S_{CLK}$ are used to clock N-stage Johnson counters whose outputs at each stage are denoted as $R_i$ and $S_i$. The input to the power MOSFET at each stage, $P_i$, is the XOR of $R_i$ and $S_i$. Note that the power MOSFET at a stage is switched on if $R_i = S_i$; otherwise the power MOSFET is switched off.

Figure 15: Integrating Power Management and Clocking

At steady state, the phase difference between $R_{CLK}$ and $S_{CLK}$ settles to a constant value such that the amount of current provided by the power MOSFETs is sufficient to match the load current and maintain the output voltage at a suitable level. The phase locking happens at each stage of the Johnson counter, and the switching on and off of the power MOSFETs in a time-interleaved manner provides the load current and voltage regulation. Our design has four Johnson counters: two running off $R_{CLK}$ and $S_{CLK}$, and another two running off the complements of $R_{CLK}$ and $S_{CLK}$. This allows phase lock and interleaved regulation at both the positive and negative clock edges, improving both the efficiency and the response of the linear regulator.
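As a behavioral illustration of the phase detection described above, the following Python sketch models two N-stage Johnson counters and counts the stages for which $R_i = S_i$, i.e. the stages whose power MOSFET is switched on. It is only a sketch under stated assumptions: the standard twisted-ring connection (inverted last stage fed to the first stage), an 8-stage counter chosen purely for illustration, and a phase lag restricted to whole clock edges. The overrun protection unit and the complementary-edge counters are not modeled, and the function names are hypothetical.

```python
def johnson_step(state):
    """Advance a Johnson (twisted-ring) counter by one clock edge.

    The inverted last stage feeds the first stage; every other stage
    takes the previous stage's old value.
    """
    return [1 - state[-1]] + state[:-1]


def johnson_state(n_stages, edges):
    """State of an n-stage counter, starting from all zeros, after `edges` clock edges."""
    state = [0] * n_stages
    for _ in range(edges):
        state = johnson_step(state)
    return state


def conducting_stages(r_state, s_state):
    """Count stages with R_i == S_i (XOR output low, power MOSFET on)."""
    return sum(1 for r, s in zip(r_state, s_state) if r == s)


if __name__ == "__main__":
    N = 8  # hypothetical stage count, chosen only for illustration
    for lag in range(N + 1):
        # The S_CLK counter trails the R_CLK counter by `lag` clock edges.
        r = johnson_state(N, 20)
        s = johnson_state(N, 20 - lag)
        print(f"lag = {lag} edges -> {conducting_stages(r, s)} stages conducting")
```

Under these assumptions a lag of k edges (with k between 0 and N) leaves N - k stages conducting, so the number of active power MOSFETs varies linearly with the phase offset between the two counters.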

The output voltage of the linear regulator is fed to both the core (load circuitry) and the clocking circuitry. The clocking circuitry comprises a novel tunable replica circuit (TRC) based voltage controlled oscillator (VCO) that generates the core clock, as shown in figure 17. Tuning the TRC to match the profile of the critical path causes the VCO to generate a clock signal whose time period is just sufficient to meet the timing margins of the core. Furthermore, as the output voltage of the linear regulator undergoes a ripple, the TRC in the VCO tracks the timing changes experienced by the critical path and ensures that a timing failure does not occur. Note that since the TRC is employed as an oscillator, its delay should be tuned to match half the delay of the critical path plus some additional guardband margin. This ensures that the period of the generated clock signal is just long enough to meet timing on the critical path.

Figure 16: Phase-Locked Linear Regulator

Figure 17: Tunable Replica Circuit based Voltage Controlled Oscillator
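The tracking argument above can be sketched numerically. The Python snippet below is not the thesis model: it assumes an alpha-power-law delay dependence on supply voltage with hypothetical parameters (`v_th`, `alpha`, `k`), assumes the oscillator period is twice the TRC delay (as the "half the delay" tuning rule implies), and applies the guardband as a multiplicative fraction. It compares a fixed clock, set at a nominal voltage without guardband, against a TRC-derived clock during a droop.

```python
def path_delay(v, v_th=0.3, alpha=1.3, k=1.0):
    """Critical-path delay versus supply voltage (arbitrary time units).

    Alpha-power-law form, delay ~ V / (V - Vth)^alpha; all parameter values
    are hypothetical and used only to give the delay a realistic trend.
    """
    return k * v / (v - v_th) ** alpha


def trc_clock_period(v, guardband=0.05):
    """Clock period of a TRC-based oscillator that shares the supply voltage.

    The TRC delay is tuned to half the guard-banded critical-path delay and is
    assumed to scale with voltage exactly like the critical path; the period
    is taken as twice the TRC delay.
    """
    trc_delay = 0.5 * (1.0 + guardband) * path_delay(v)
    return 2.0 * trc_delay


if __name__ == "__main__":
    v_nominal, v_droop = 0.9, 0.8  # hypothetical supply voltages in volts
    fixed_period = path_delay(v_nominal)
    print("fixed clock meets timing during droop:", fixed_period >= path_delay(v_droop))
    print("TRC clock meets timing during droop:  ", trc_clock_period(v_droop) >= path_delay(v_droop))
```

With a perfectly matched TRC the generated period stretches by the same factor as the critical path, whereas the fixed clock violates timing as soon as the supply drops below the voltage at which it was set.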

The clock generated by the TRC based VCO is fed to both the core and the linear regulator, thus closing the control loop. The dynamics of this loop are similar to those of a phase-locked loop used in clock generation and recovery, with the XOR gate at each stage of the Johnson counter acting as a phase detector. An overrun protection unit [6] prevents the phase difference between $R_{CLK}$ and $S_{CLK}$ from being cyclic at $2\pi$ and does not allow the locking range to be limited by the phase detector. Each stage of the Johnson counter has an overrun protection unit, and figure 18 shows its implementation. At stage *i* of the Johnson counter, the overrun protection unit propagates $R_{i-1}$ to $R_i$ if $R_i \neq S_i$ and propagates $S_{i-1}$ to $S_i$ if $R_i = S_i$. In extreme cases when $S_{CLK}$ is either too slow or too fast compared to $R_{CLK}$, the overrun protection unit saturates the phase difference to either 0 or $2\pi$, respectively.

Figure 18: Overrun Protection Unit [6]

To understand how this happens, consider the scenario when $S_{CLK}$ is much faster than $R_{CLK}$. At any given instant of time, the Johnson counters driven by $S_{CLK}$ and $R_{CLK}$ contain either a series of 1's followed by a series of 0's or a series of 0's followed by a series of 1's. In every cycle of $S_{CLK}$, the overrun protection unit will propagate $S_{i-1}$ to $S_i$ at each stage of the Johnson counter where $S_i = R_i$ until $S_i \neq R_i$ at every stage of the Johnson counter. Therefore, the power MOSFETs would be switched off at every stage of the linear regulator and the output voltage will begin to drop until $S_{CLK}$ slows down to match $R_{CLK}$. This represents the phase difference saturating at $2\pi$, and the converse occurs when $S_{CLK}$ is much slower than $R_{CLK}$, with the phase difference saturating at 0.
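The saturation scenario above can be reproduced with a small behavioral model. The sketch below applies the propagation rule as stated in the text to the $S_{CLK}$ counter while holding the $R_{CLK}$ counter frozen, which approximates $S_{CLK}$ being much faster than $R_{CLK}$. The inverted wrap-around into stage 0 is the usual Johnson-counter feedback and is an assumption here, as are the function names and the 8-stage example state; the gate-level implementation of figure 18 is not modeled.

```python
def s_edge_with_overrun_protection(r_state, s_state):
    """Apply one S_CLK edge: stage i loads S_{i-1} only while R_i == S_i.

    Stages where R_i != S_i hold their value. Stage 0 takes the inverted
    last stage (assumed Johnson-counter feedback).
    """
    nxt = list(s_state)
    for i in range(len(s_state)):
        if r_state[i] == s_state[i]:
            nxt[i] = 1 - s_state[-1] if i == 0 else s_state[i - 1]
    return nxt


def saturate(r_state, max_edges=64):
    """Clock S repeatedly with R frozen, starting from S == R.

    Returns the final S state and the number of edges after which it
    stopped changing.
    """
    s_state = list(r_state)
    for edge in range(1, max_edges + 1):
        new_state = s_edge_with_overrun_protection(r_state, s_state)
        if new_state == s_state:
            return s_state, edge - 1
        s_state = new_state
    return s_state, max_edges


if __name__ == "__main__":
    r = [1, 1, 1, 0, 0, 0, 0, 0]  # an arbitrary valid 8-stage Johnson state
    s, edges = saturate(r)
    print("R =", r)
    print("S =", s, "after", edges, "S_CLK edges")
    print("stages with R_i == S_i:", sum(a == b for a, b in zip(r, s)))
```

In this model the $S$ counter stops advancing once every stage satisfies $S_i \neq R_i$, so no power MOSFET conducts and the phase difference cannot wrap around, which is the saturation behavior described in the text.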

#### 3.2 Architecture

The use of Dynamic Voltage and Frequency Scaling to maximize energy efficiency in low power digital systems is well established. Conventionally, the voltage and frequency of a given unit were adjusted independently through complex DVFS control circuitry that used several lookup tables to determine the permissible clock frequency for a given voltage at a particular temperature. Such systems also had to account for voltage noise and therefore imposed frequency guardbands to ensure that timing failure did not occur during voltage droops. As mentioned earlier, while the guardbands were very effective in ensuring reliable operation, they degraded both the performance and energy efficiency of the processor. The lookup table based implementations require on-chip space to store the voltage and frequency pairs and maintaining a large number of such pairs for fine-grained DVFS may not be practically possible. Furthermore, such schemes may not be able to hop between states very quickly, especially if the DVFS is implemented as a mostly software solution.

Motivated by the strong coupling of supply voltage and clock frequency in our design, we propose the notion of Dynamic Frequency Scaling as opposed to Dynamic Voltage and Frequency Scaling. The idea is to use the design to both deliver power and generate the clock for each individual unit or core in an SoC. As shown in figure 19, the master PLL generates a reference clock that is distributed to each core on the chip. Each core has an instance of our design that generates a core clock with frequency identical to the reference clock and regulates the supply voltage to the core to ensure reliable operation. Also, each core would operate at a much higher energy efficiency since no frequency or voltage guardbands are required. During low-speed or idle states, the reference clock frequency can be reduced and our design would automatically scale the supply voltage to each core. Furthermore, note that this is a fully hardware solution that does not require the use of hardware lookup tables and provides a much finer granularity of voltage and frequency scaling.



Figure 19: Dynamic Frequency Scaling

To enable scaling on each individual core, a small modification to the design shown in figure 15 is proposed in figure 20. Similar to a PLL, the divider in the feedback path causes a sub-sampled version of the core clock to be compared with the reference clock, essentially scaling up the core clock frequency by the divide ratio. By changing the divide ratio in the instance of our design associated with a particular core, we can scale the frequency of that core without impacting the frequency and voltage states of other cores.

Using the design in figure 20 in the architecture of figure 19 allows us to have a relatively low reference clock frequency while having high core clock frequencies through the use of appropriate divide ratios. This has the interesting effect of reducing the dynamic power of the clock distribution network, and it is advantageous to use high divide ratios to greatly limit the power consumption. However, a high divide ratio implies that a highly sub-sampled core clock is compared with the reference clock, degrading the performance of the linear regulator. The lower clock frequencies increase the response time of the linear regulator to current transients, since load regulation starts at the next clock edge following the occurrence of the transient. The longer response time also increases the severity of the voltage droop accompanying the current transient, due to the slower rate at which the power MOSFETs in the regulator turn on to supply current to the load. Since the supply voltage is tightly coupled with the core clock in our design, the core clock frequency also experiences larger drops during voltage droops when high divide ratios are used. This results in worse performance of the core during the intermittent voltage droops and worse overall SoC performance compared to a design that uses lower divide ratios. Furthermore, the use of high divide ratios would necessitate the use of very large ratio FIFOs to transfer data across clock domains - from the core to other units or the interconnection network. The exact divide ratios to be used in a design depend heavily on the application, and the mathematical analysis presented in the next section provides an analytic framework to assess the impact of the divide ratio on the aforementioned parameters.

Figure 20: Feedback Divider in the System
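The tradeoff described above can be put into rough numbers. The following Python sketch is a back-of-the-envelope illustration only: the 1.6 GHz core clock is a hypothetical target, the regulator is assumed to act once per reference-clock period (the dual-edge counters in the design would shorten this), and the clock-network power is assumed to scale linearly with the distributed frequency at fixed voltage and capacitance.

```python
def reference_frequency_hz(f_core_hz, divide_ratio):
    """Reference clock needed for a target core clock: F_REF = F_CORE / N."""
    return f_core_hz / divide_ratio


def regulator_update_interval_s(f_core_hz, divide_ratio):
    """One reference-clock period.

    Used here as a rough upper bound on how long a load transient waits for
    the next regulating clock edge (assumption: one update per period).
    """
    return divide_ratio / f_core_hz


if __name__ == "__main__":
    f_core = 1.6e9  # hypothetical core clock target, not a value from the thesis
    for n in (1, 2, 4, 8, 16):
        f_ref = reference_frequency_hz(f_core, n)
        wait_ns = regulator_update_interval_s(f_core, n) * 1e9
        # Relative dynamic power of the distribution network, assumed ~ frequency.
        rel_power = 1.0 / n
        print(f"N={n:2d}  F_REF={f_ref / 1e6:7.1f} MHz  "
              f"update interval={wait_ns:5.2f} ns  relative clock power={rel_power:.3f}")
```

The printed table makes the opposing trends explicit: each doubling of the divide ratio halves the distributed clock frequency but doubles the interval between regulation opportunities.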

#### 3.3 Mathematical Analysis

The dynamics of our system are similar to those of a second order phase-locked loop [17], [7] and the mathematical analysis proceeds in a similar fashion. Assuming a linear model for the tunable replica circuit based voltage controlled oscillator, we can relate the output frequency of the oscillator with the output voltage of the linear regulator as

$$F_{OUT} = K_{VCO} V_{OUT}. \tag{1}$$

Integrating the output frequency of the system with respect to time, we obtain the output phase of the TRC based oscillator as

$$\phi_{OUT} = \frac{K_{VCO}V_{OUT}}{s}. \tag{2}$$

Let the phase of the reference clock be denoted by  $\phi_{REF}$ . The phase difference of the system at steady state is therefore,

$$\phi_{SS} = \phi_{REF} - \phi_{OUT},\tag{3}$$

$$\phi_{SS} = \phi_{REF} - \frac{K_{VCO}V_{OUT}}{s}. \tag{4}$$

Replacing the reference phase with the reference frequency we get,

$$\phi_{SS} = \frac{F_{REF} - K_{VCO}V_{OUT}}{s}. \tag{5}$$

Unlike conventional second order PLL designs that seek to minimize the static phase error, the steady state phase difference given by equation (5) is necessary to maintain the output current under regulation.

Both the TRC oscillator clock and the reference clock drive Johnson counters whose outputs at stage *i* are denoted as $S_i$ and $R_i$ respectively. The XOR of $S_i$ and $R_i$ performs a phase detection operation at each stage, with the output of the XOR gate being high for a time equal to the phase difference each cycle. At each stage, the XOR gate drives a power MOSFET, and $V_{IN}$ is connected to $V_{OUT}$ for the period of time that the power MOSFET is switched on. Therefore, assuming the transconductance of the power MOSFETs to be $g_M$, the current delivered to the load each cycle is

$$I_L = g_M V_{IN} \frac{\phi_{SS}}{2\pi}. \tag{6}$$

Modeling the load circuit as a parallel combination of a resistance,  $R_{LOAD}$ , and a capacitance,  $C_{LOAD}$ , we obtain the output voltage as

$$V_{OUT} = I_L \frac{R_{LOAD}}{(1 + sR_{LOAD}C_{LOAD})},\tag{7}$$

$$V_{OUT} = g_M V_{IN} \frac{\phi_{SS} R_{LOAD}}{2\pi (1 + s R_{LOAD} C_{LOAD})}. \tag{8}$$

To simplify the analysis let

$$K_P = \frac{g_M V_{IN}}{2\pi C_{LOAD}} \tag{9}$$

and

$$\tau = \frac{1}{R_{LOAD}C_{LOAD}}. \tag{10}$$

Equation (8) therefore becomes

$$V_{OUT} = \frac{\phi_{SS} K_P}{s+\tau}. \tag{11}$$

Substituting for  $\phi_{SS}$  in equation (11) and simplifying we obtain the transfer function between  $V_{OUT}$  and  $F_{REF}$  as

$$V_{OUT} = \frac{K_P}{s^2 + s\tau + K_P K_{VCO}} F_{REF}. \tag{12}$$

Equation (12) is similar to the control dynamics of a second order PLL. Note that the loop gain is primarily determined by the gain of the power MOSFETs ($K_P$) and that the poles, and consequently the stability of the system, are affected by the load resistance and the load current.
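Comparing the denominator of equation (12) with the standard second-order form $s^2 + 2\zeta\omega_n s + \omega_n^2$ gives a natural frequency $\omega_n = \sqrt{K_P K_{VCO}}$ and a damping ratio $\zeta = \tau / (2\sqrt{K_P K_{VCO}})$, while the $s \to 0$ limit gives a DC gain of $1/K_{VCO}$, consistent with equation (1) at lock. The Python sketch below evaluates these quantities and the pole locations; the function name and the normalized parameter values are illustrative assumptions and are not taken from the simulated design.

```python
import cmath
import math


def loop_parameters(k_p, k_vco, tau):
    """Second-order quantities for V_OUT/F_REF = K_P / (s^2 + tau*s + K_P*K_VCO).

    Returns (natural frequency, damping ratio, poles, DC gain).
    """
    wn = math.sqrt(k_p * k_vco)            # natural frequency
    zeta = tau / (2.0 * wn)                # damping ratio
    disc = cmath.sqrt(tau * tau - 4.0 * k_p * k_vco)
    poles = ((-tau + disc) / 2.0, (-tau - disc) / 2.0)
    dc_gain = 1.0 / k_vco                  # s -> 0 limit of the transfer function
    return wn, zeta, poles, dc_gain


if __name__ == "__main__":
    # Normalized, hypothetical values chosen only to show the two regimes.
    for k_p, k_vco, tau in ((1.0, 1.0, 1.0), (2.0, 2.0, 4.0)):
        wn, zeta, poles, dc_gain = loop_parameters(k_p, k_vco, tau)
        print(f"K_P={k_p}, K_VCO={k_vco}, tau={tau}: "
              f"wn={wn:.3f}, zeta={zeta:.3f}, poles={poles}, DC gain={dc_gain:.3f}")
```

Since $\tau = 1/(R_{LOAD}C_{LOAD})$, a lighter load (larger $R_{LOAD}$) lowers $\tau$ and hence the damping ratio in this model, which is one way to read the dependence of stability on the load noted above.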

#### CHAPTER IV

#### SYSTEM PERFORMANCE

#### 4.1 Steady State Characteristics

Circuit performance is studied in this chapter through extensive simulations, beginning with the steady state characteristics of the system. Figure 21 plots the output voltage of the system as the reference frequency and the process and temperature conditions are varied. At a particular reference clock frequency, the output voltage of the system reduces at higher temperatures and faster process conditions, as expected. Also, as the reference clock frequency increases, the output voltage of the system increases as well to ensure reliable operation of the load.

The TRC based oscillator models the processor critical path, and it is important for the TRC to accurately capture the scaling of delay with voltage for the given critical path circuit. Tunable replica circuits are made up of components that represent gate dominated delay and interconnect dominated delay. Our design uses a three stage tunable replica circuit for accurate modeling, in which the first stage contains a chain of inverters, the second stage contains a chain of NAND and NOR gates and the last stage contains resistor and capacitor components. Figure 22 shows the results of simulations carried out to determine the accuracy with which our TRC design can model two arbitrary circuit paths - a chain of inverters and a few inverters followed by a heavy RC (interconnect) component. As seen in the figure, our tunable replica circuit can model the delay sensitivities of the two circuits to within picoseconds of delay.



Figure 21: Output Voltage at Different Clock Frequencies



Figure 22: Delay Scaling of Two Different Circuits



Figure 23: Response to a Current Transient Inside the Regulation Bandwidth

#### 4.2 Transient Characteristics

Figure 23 shows the transient response of the system without a feedback divider to a gradual increase in the load current by 1 mA over a time duration of 500 ns. At each point during the transition, the phase difference between the reference clock and the core clock (same as $S_{CLK}$ in this case) increases sufficiently to match the increased load current demand. Consequently, the output voltage is well regulated and the current transient is within the "regulation bandwidth" of the system.

Voltage droops occur when the current transient is outside the "regulation bandwidth" of the system. Figure 24 shows the response of the system when the load current increases by 3 mA over a time period of 10 ns. The initial increase in phase difference, at the start of the transient, is insufficient to match the load current demand and consequently the output voltage drops. The decreased output voltage slows down the core clock, which rapidly increases the phase difference, causing the regulator to supply more current to the load, and the system eventually returns to its steady state. Note that the core clock frequency drops during the voltage droop to ensure that timing failure does not occur.



Figure 24: Response to a Current Transient Outside the Regulation Bandwidth
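The droop-and-recover behavior can also be reproduced qualitatively with the linear model of Chapter III. Written in the time domain, equations (5) and (8) to (10) give $\dot{V}_{OUT} = K_P\phi_{SS} - \tau V_{OUT}$ and $\dot{\phi}_{SS} = F_{REF} - K_{VCO}V_{OUT}$. The Python sketch below integrates these with a forward-Euler step and adds an abrupt extra load term; that extra term ($\Delta I / C_{LOAD}$, called `disturbance`), the normalized parameter values and the function name are assumptions of this sketch, and the result is not a substitute for the circuit simulations in figures 23 to 25.

```python
def simulate_droop(k_p=1.0, k_vco=1.0, tau=1.0, f_ref=1.0,
                   disturbance=0.2, dt=1e-3, t_end=40.0):
    """Forward-Euler simulation of the linearized loop after a load step.

    State equations (normalized units):
        dV/dt   = K_P * phi - tau * V - disturbance
        dphi/dt = F_REF - K_VCO * V
    Returns (final V_OUT, final phase difference, minimum V_OUT).
    """
    v = f_ref / k_vco        # pre-step steady state: F_OUT equals F_REF
    phi = tau * v / k_p      # phase difference that holds V before the step
    v_min = v
    for _ in range(round(t_end / dt)):
        dv = k_p * phi - tau * v - disturbance
        dphi = f_ref - k_vco * v
        v += dv * dt
        phi += dphi * dt
        v_min = min(v_min, v)
    return v, phi, v_min


if __name__ == "__main__":
    v_final, phi_final, v_min = simulate_droop()
    print(f"V_OUT dips to {v_min:.3f}, then settles at {v_final:.3f}")
    print(f"phase difference settles at {phi_final:.3f} (1.000 before the step)")
```

In this model the output voltage returns to $F_{REF}/K_{VCO}$ after the droop, while the steady state phase difference grows to supply the additional load current, mirroring the sequence of events described for figure 24.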

As mentioned earlier, the introduction of a divider in the system enables power savings in the clock distribution network but slows the response of the regulator. Figure 25 shows the response of the system to an increase of load current from 1 mA to 4 mA over a time span of 10 ns when the system is configured with a divide ratio of 16. Compared with the simulation results in figure 24, the response of the regulator is much slower, the voltage droop is more pronounced and consequently the core clock frequency experiences a steeper drop.

The intensity of the voltage droop for a given current transient increases as the divide ratio increases and this is shown in figure 26. Similarly figure 27 shows how the recovery time increases with the divide ratio. Finally, figure 28 shows the minimum core clock frequency during the current transient event at different divide ratios.

#### 4.3 Regulator Efficiency

The efficiency of the linear regulator used in the system depends on the reference clock frequency, the number of Johnson counter stages and the load current drawn. In the first set of simulations, the reference clock frequency was set to 600 MHz, 400 MHz, 200 MHz and 100 MHz while changing the number of stages from 2 to 16 and keeping the load current constant at 3 mA.

Figure 25: Response to a Current Transient Outside the Regulation Bandwidth with a Divide Ratio of 16

Figure 26: Severity of a Voltage Droop at different Divide Ratios

Figure 27: Time taken to recover from a Voltage Droop at different Divide Ratios

Figure 28: Minimum Core Clock frequency during a Voltage Droop at different Divide Ratios

Figure 29: Voltage Ripple vs No of Stages at Constant Load Current

Increasing the number of stages used in the Johnson counter increases the number of power MOSFETs through which current is delivered to the load. This increases the interleaved delivery of current to the load, leading to a smaller voltage ripple at the output. Figure 29 shows the output voltage ripple seen at steady state as the number of stages is progressively increased from two to sixteen. Also, as seen in the plots, increasing the reference frequency reduces the voltage ripple at the output. As the linear regulator operates on a faster clock, the time interleaving of current delivery improves and consequently the output voltage ripple reduces.

Each additional stage of the Johnson counter also adds flip-flops and logic that switch at each clock edge, significantly increasing the controller power. Furthermore, the stage is added to all four Johnson counters: one each for the positive and negative edge counters driven by $R_{CLK}$ and one each for the positive and negative edge counters driven by $S_{CLK}$. This amplifies the current consumption of the controller and, as shown in figure 30, the controller power increases rapidly with the number of stages.

Figure 30: Controller Power vs No of Stages at Constant Load Current

The current and power efficiency of a linear regulator are typically calculated as shown below,

$$\eta_{CURRENT} = \frac{I_{LOAD}}{I_{LOAD} + I_{CONTROLLER}} \tag{13}$$

$$\eta_{POWER} = \frac{I_{LOAD}}{I_{LOAD} + I_{CONTROLLER}} \times \frac{V_{OUT}}{V_{IN}}. \tag{14}$$

Since the load current was kept constant at 3 mA for these simulations, the current efficiency decreases as the controller current (and consequently power) increases, and this is shown in figure 31. However, the current efficiency of the regulator is quite high in general and increasing the number of stages only degrades the current efficiency slightly. Power efficiency, shown in figure 32, on the other hand is dominated by the ratio of the output to the input voltage and has almost negligible degradation as the number of stages increases. Furthermore, it is interesting to note that out of the four reference frequencies simulated, 600 MHz produced the worst current efficiency but the best power efficiency. The high reference clock frequency increases the controller power, decreasing current efficiency, but it also results in a much higher $V_{OUT}$, leading to a higher $V_{OUT}/V_{IN}$ ratio and consequently a higher power efficiency.

Figure 31: Current Efficiency vs No of Stages at Constant Load Current
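Equations (13) and (14) are simple enough to evaluate directly, and doing so shows how current efficiency and power efficiency can rank two operating points in opposite order. In the Python sketch below only the 3 mA load current comes from the text; the controller currents, output voltages and the 1.0 V input are hypothetical values chosen to mimic a fast and a slow reference clock, and the function names are illustrative.

```python
def current_efficiency(i_load, i_controller):
    """Equation (13): fraction of the total current that reaches the load."""
    return i_load / (i_load + i_controller)


def power_efficiency(i_load, i_controller, v_out, v_in):
    """Equation (14): current efficiency scaled by the V_OUT / V_IN ratio."""
    return current_efficiency(i_load, i_controller) * v_out / v_in


if __name__ == "__main__":
    i_load = 3e-3   # load current from the first set of simulations
    v_in = 1.0      # hypothetical input voltage
    # (label, controller current in A, output voltage in V), all hypothetical.
    operating_points = (
        ("fast reference clock", 150e-6, 0.9),
        ("slow reference clock", 30e-6, 0.6),
    )
    for label, i_ctrl, v_out in operating_points:
        eta_i = current_efficiency(i_load, i_ctrl)
        eta_p = power_efficiency(i_load, i_ctrl, v_out, v_in)
        print(f"{label}: current efficiency = {eta_i:.3f}, power efficiency = {eta_p:.3f}")
```

With these illustrative numbers the fast clock has the lower current efficiency but the higher power efficiency, because the $V_{OUT}/V_{IN}$ term outweighs the small change in the current term.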

From figures 29, 31 and 32 it is interesting to note that decreasing the reference clock frequency increases the current efficiency of the regulator but slows down its transient response to voltage droops. Therefore, a design tradeoff exists, and choosing the right reference clock frequency is non-trivial and requires careful analysis based on the application. Also note that an equivalent argument applies when using the feedback divider at a fixed core clock frequency: increasing the divide ratio improves the regulator efficiency but slows down the transient response to voltage droops.

In the second set of simulations, the reference clock frequency is kept constant at 600 MHz while the load current is varied from a light load of $500\,\mu A$ to a heavy load of 3 mA and the number of stages is increased from 2 to 16. The stability of the linear regulator depends heavily on the load current, as shown in the mathematical analysis. At low enough load currents, linear regulator configurations with a small number of stages in the Johnson counter fail due to insufficient time-interleaved current delivery to the load. Note that in the simulation results that follow, data for such configurations are omitted since the system does not settle to a steady state.

Figure 32: Power Efficiency vs No of Stages at Constant Load Current

At a load current of 1 mA, linear regulator configurations with fewer than 6 stages fail, and at a load current of $500\,\mu A$, linear regulator configurations with fewer than 12 stages fail. As seen from figure 33, reducing the load current while keeping the number of stages constant increases the voltage ripple. For a given configuration, reducing the load current increases the controller current slightly. As the load current decreases, increased switching in the controller is necessary to maintain the time-interleaved delivery of smaller amounts of current. Figures 35 and 36 show the variation of the current and power efficiency of the linear regulator at different load currents and numbers of stages. Reducing the load current while keeping the number of stages constant reduces both the current and power efficiency, since the controller current increases slightly and the denominator of equation (13) becomes larger relative to the numerator.



Figure 33: Voltage Ripple vs No of Stages at Different Load Currents



Figure 34: Controller Power vs No of Stages at Different Load Currents



Figure 35: Current Efficiency vs No of Stages at Different Load Currents



Figure 36: Power Efficiency vs No of Stages at Different Load Currents

#### CHAPTER V

#### CONCLUSION

This thesis describes the design of an integrated power management and clocking scheme based on a phase-locked linear regulator and a tunable replica circuit based voltage controlled oscillator. After calibrating the tunable replica circuit, the design is capable of automatically setting the output voltage of the load circuit given a frequency target to operate with minimal guardbands. When the design experiences current noise outside the regulation bandwidth of the linear regulator, there is a droop of the output voltage which can last for several cycles and the frequency adjusts accordingly to avoid timing failures on the critical paths.

A design tradeoff exists with respect to the number of Johnson counter stages clocking the power MOSFETs - increasing the number of stages will reduce the output voltage ripple but consume more controller power, leading to decreased current efficiency. Observations based on simulations indicate that the optimal balance lies between 8 and 12 stages, where the output ripple is reduced to a few millivolts while there is minimal degradation to the current efficiency.

The design also functions as a secondary PLL capable of frequency multiplication and can be used to decrease the power of the clock distribution network. However, this trades off the clock network power against the regulator response time, and not all multiplication ratios may be permissible due to stability concerns.

The output voltage and clock frequency of the design are tightly coupled, and the design can be used effectively with high-efficiency DC-DC switching converters that usually have high output voltage ripple. The design can suppress the voltage ripple and generate an adaptive clock that ensures that timing demands are met.

#### REFERENCES

- [1] BOWMAN, K. A., RAINA, S., BRIDGES, T., YINGLING, D., NGUYEN, H., APPEL, B., KOLLA, Y., JEONG, J., ATALLAH, F., and HANSQUINE, D., "A 16 nm all-digital auto-calibrating adaptive clock distribution for supply voltage droop tolerance across a wide operating range," *IEEE Journal of Solid-State Circuits*, vol. 51, Jan. 2016.
- [2] BOWMAN, K. A., TOKUNAGA, C., KARNIK, T., DE, V. K., and TSCHANZ, J. W., "A 22 nm all-digital dynamically adaptive clock distribution for supply voltage droop tolerance," *IEEE Journal of Solid-State Circuits*, vol. 48, pp. 907–916, Apr. 2013.
- [3] BOWMAN, K. A., TOKUNAGA, C., TSCHANZ, J. W., RAYCHOWDHURY, A., KHELLAH, M. M., GEUSKENS, B. M., LU, S.-L. L., ASERON, P. A., KARNIK, T., and DE, V. K., "All-digital circuit-level dynamic variation monitor for silicon debug and adaptive clock control," *IEEE Transactions on Circuits and Systems*, vol. 58, pp. 2017–2025, Sept. 2011.
- [4] BOWMAN, K. A., TSCHANZ, J. W., LU, S.-L. L., ASERON, P. A., KHELLAH, M. M., RAYCHOWDHURY, A., GEUSKENS, B. M., TOKUNAGA, C., WILKERSON, C. B., KARNIK, T., and DE, V. K., "A 45 nm resilient microprocessor core for dynamic variation tolerance," *IEEE Journal of Solid-State Circuits*, vol. 46, pp. 194–208, Jan. 2011.
- [5] BURD, T. D., PERING, T. A., STRATAKOS, A. J., and BRODERSEN, R. W., "A dynamic voltage scaled microprocessor system," *IEEE Journal of Solid-State Circuits*, vol. 35, pp. 1571–1580, Nov. 2000.
- [6] GANGOPADHYAY, S., SOMASEKHAR, D., TSCHANZ, J. W., and RAYCHOWDHURY, A., "A 32 nm embedded, fully-digital, phase-locked low dropout regulator for fine grained power management in digital circuits," *IEEE Journal of Solid-State Circuits*, vol. 49, pp. 2684–2693, Nov. 2014.
- [7] GARDNER, F. M., *Phaselock Techniques*. John Wiley & Sons, Inc., Copyright 2005 by John Wiley & Sons, Inc.
- [8] JEVTIC, R., LE, H.-P., BLAGOJEVIC, M., BAILEY, S., ASANOVIC, K., ALON, E., and NIKOLIC, B., "Per-core dvfs with switched-capacitor converters for energy efficiency in manycore processors," *IEEE Transactions on Very Large Scale Integration (VLSI) Systems*, vol. 23, pp. 723–730, Apr. 2015.
- [9] JIAO, D., GU, J., JAIN, P., and KIM, C. H., "Enhancing beneficial jitter using phase-shifted clock distribution," pp. 21–26, ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED), Aug. 2008.

- [10] JIAO, D., GU, J., and KIM, C. H., "Circuit design and modeling techniques for enhancing the clock-data compensation effect under resonant supply noise," *IEEE Journal of Solid-State Circuits*, vol. 45, pp. 2130–2140, Oct. 2010.
- [11] JIAO, D., KIM, B., and KIM, C. H., "Design, modeling, and test of a programmable adaptive phase-shifting pll for enhancing clock data compensation," *IEEE Journal of Solid-State Circuits*, vol. 47, pp. 2505–2516, Oct. 2012.
- [12] KIM, W., GUPTA, M. S., WEI, G.-Y., and BROOKS, D., "System level analysis of fast, per-core dvfs using on-chip switching regulators," pp. 123–134, IEEE International Symposium on High Performance Computer Architecture, Feb. 2008.
- [13] KURD, N., MOSALIKANTI, P., NEIDENGARD, M., DOUGLAS, J., and KUMAR, R., "Next generation intel core micro-architecture (nehalem) clocking," *IEEE Journal of Solid-State Circuits*, vol. 44, pp. 1121–1129, Apr. 2009.
- [14] KURD, N. A., BARKATULLAH, J. S., DIZON, R. O., FLETCHER, T. D., and MADLAND, P. D., "A multigigahertz clocking scheme for the pentium 4 microprocessor," *IEEE Journal of Solid-State Circuits*, vol. 36, pp. 1647–1653, Nov. 2001.
- [15] LE, H.-P., SANDERS, S. R., and ALON, E., "Design techniques for fully integrated switched-capacitor dc-dc converters," *IEEE Journal of Solid-State Circuits*, vol. 46, pp. 2120–2131, Sept. 2011.
- [16] LEE, J. and KIM, N. S., "Optimizing throughput of power- and thermal-constrained multicore processors using dvfs and per-core power-gating," pp. 47–50, ACM/IEEE Design Automation Conference, July 2009.
- [17] RAZAVI, B., *Design of Analog CMOS Integrated Circuits*. The McGraw-Hill Companies, Inc., Copyright 2001 by The McGraw-Hill Companies, Inc.
- [18] REDDI, V. J., KANEV, S., KIM, W., CAMPANONI, S., SMITH, M. D., WEI, G.-Y., and BROOKS, D., "Voltage noise in production processors," *IEEE Micro*, vol. 31, pp. 20–28, Jan. 2011.
- [19] SANDERS, S. R., ALON, E., LE, H.-P., SEEMAN, M. D., JOHN, M., and NG, V. W., "The road to fully integrated dc-dc conversion via the switched-capacitor approach," *IEEE Transactions on Power Electronics*, vol. 28, pp. 4146–4155, Sept. 2013.
- [20] TSCHANZ, J., BOWMAN, K., WALSTRA, S., AGOSTINELLI, M., KARNIK, T., and DE, V., "Tunable replica circuits and adaptive voltage-frequency techniques for dynamic voltage, temperature, and aging variation tolerance," pp. 112–113, IEEE Symposium on VLSI Circuits, June 2009.

- [21] WATERMAN, A., LEE, Y., PATTERSON, D. A., and ASANOVIC, K., The RISC-V instruction set manual, volume I: Base user-level ISA. Department of Electrical Engineering and Computer Sciences, University of California Berkeley, May 2011.
- [22] WONG, K. L., RAHAL-ARABI, T., MA, M., and TAYLOR, G., "Enhancing microprocessor immunity to power supply noise with clock-data compensation," *IEEE Journal of Solid-State Circuits*, vol. 41, pp. 749–757, Apr. 2006.