# Multipliers Featuring Approximate 4-2 Compressors Alongside Error Recovery Modules

Prasant Sethi, College of Engineering Bhubaneswar

Abstract— One often utilized operation in approximate computing techniques for high performance and low power is approximate multiplication. computing With an approximation 4-2 compressor, power-efficient circuits for approximate multiplication can be made possible. This letter proposes a brand-new design that incorporates an error recovery module and is based on a modified version of an earlier approximate 4-2 compressor concept. Compared to earlier suggested 4-2 compressor-based approximate multiplier designs, the suggested design is more accurate, uses less hardware, and consumes less power-even with the added error recovery module.

## I. INTRODUCTION

PPROXIMATE computing is a promising solution for fast computation with restricted resource budgets [1]. As a commonly used arithmetic unit, multipliers are addes irrest and the target for circuit-level approximate computing [2]–[9]. Multipliers are commonly implemented using three phases. Given two *n*-bit operands to be multiplied.

- 1) The bits of the multiplicand are AND'ed with the bits of the multiplier (and shifted left by the multiplier bit position) to produce a set of *n* partial product terms.
- 2) The *n* partial product terms are compressed (or reduced) down to two terms by adding them together in some manner.
- 3) The final two terms are added together to form the final product.

Oftheabovethreephasesusedinamultiplier, phase2) typ- ically requires the dominant proportion of area, delay, and power [10]. Also, methods developed for phase 2) can be included with other methods developed for phases 1) and 3). Thus, focusing specifically on phase 2), an *n*-to-2 compressor can be designed by using a sequence of k-2 compressors [10], fulladders, and half addersplaced at strategic locations in ann-to-2compressortree.Althoughthe3-2compressor, also known as a carry-save is most adder, the well known, Changetal. [10] have shown that a 4-2 compressor can

ManuscriptreceivedJanuary11,2017;revisedMay23,2017;accepted

August 23, 2017. Date of publication August 29, 2017; date of currentversion February 24, 2018. This work was supported by the SamsungResearch Funding and Incubation Center of Samsung Electronics underProject SRFC-TB1703-07. This manuscript was recommended for publica-tion by S. Hellebrand, J. Henkel, A. Raghunathan, and H.-J. Wunderlich.(*Corresponding author: Sunggu Lee.*)

The authors are with the Department of Electrical Engineering, PohangUniversityofScienceandTechnology,Pohang790-784,SouthKorea(e-mail:mh0205@postech.ac.kr; slee@postech.ac.kr).

Color versions of one or more of the figures in this paper are availableonline at http://ieeexplore.ieee.org.

DigitalObjectIdentifier10.1109/LES.2017.2746084



Fig. 1.Previous 4-2 compressor designs: (a) accurate 4-2 compressor designin [10] (XOR\* is an XOR–XNORmodule), (b) approximate 4-2 compressordesign in [11] and [12], and (c) approximate 4-2 compressor design in [13](AO222 refers to a custom AND-OR complex compound gate).

beusedtoproducebetterresults.Byusing*approximate*4-2compressors(showninFig.1)[11]–[13],itisalsopossible todesignapproximatemultiplierswithsignificantlydecreased power, area, and delay characteristics.

This letter improves upon the results of [12] and [13] by presenting an approximate multiplier design-based on a modified 4-2 compressor and a novel error recovery unit design. Example image processing applications are used to show that the proposed approximate multiplier is practical and produces acceptable results with real applications. Error analysis and application-specified integrated circuit (ASIC) postsynthesis results are used to show that, even with the addition of the errorrecoverymodule,theproposeddesignimprovesuponthe previous work. 8-32 bit multipliers with the proposed design require 23.2%–24.4% smaller hardware area, 22.4%–24.5% lowerpowerusage,and11.2%–17.0%shorterdelaythanacorresponding exact multiplier. Also, when compared to [13], whichoutperforms[12],theaccuracyisimprovedby11.7% and theerrordistributionshowsalowerandmoreevendistribution.

### II. MULTIPLICATIONUSINGAPPROXIMATE 4-2COMPRESSORS

#### A. PreviousApproximate4-2Compressor Designs

Recent research in [11] and [12] used the approximate 4-2 compressor design of Fig. 1(b). The main idea behind their design is the observation that, if a small number of erroneousentriesarepermitted,thetruthtableforanexact 4-2 compressor can be modified to yield a simpler logic implementation. Yang *et al.* [13] then proposed an improvement on the approximate 4-2 compressor design of [12]. A low error rate approximate 4-2 compressor was designed by considering the characteristics of the AND gate that is used in partial product generation. The probability that a partial product bit equals 1 and 0 are 1/4 and 3/4, respectively. Thus, an input to

1943-06 Co2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications\_standards/publications/rights/index.htmlformore information.

TABLEI TRUTHTABLEOFAPPROXIMATE4-2COMPRESSOR DESIGNIN[13]ANDPROPOSEDDESIGN

| $X_4X_3X_2X_1$ | $C'_{rang}$ | $S'_{ m wag}$ | $D'_{rang}$ | $C'_{proposed}$ | $S'_{proposal}$ | D'proposed |
|----------------|-------------|---------------|-------------|-----------------|-----------------|------------|
| 0 0 0 0        | 0           | 0             | 0           | 0               | 0               | 0          |
| 0001           | 0           | 1             | 0           | 0               | 1               | 0          |
| 0010           | 0           | 1             | 0           | 0               | 1               | 0          |
| 0011           | 1           | 0             | 0           | 1               | 0               | 0          |
| 0100           | 0           | 1             | 0           | 0               | 1               | 0          |
| 0101           | 1           | 0             | 0           | 1               | 0               | 0          |
| 0110           | 1           | 0             | 0           | 1               | 0               | 0          |
| 0 1 1 1        | 1           | 1             | 0           | 1               | 1               | 0          |
| 1000           | 0           | 1             | 0           | 0               | 1               | 0          |
| 1001           | 1           | 0             | 0           | 1               | 0               | 0          |
| 1010           | 1           | 0             | 0           | 1               | 0               | 0          |
| 1011           | 1           | 1             | 0           | 1               | 1               | 0          |
| 1100           | 1           | 1             | 1           | 0               | 1               | -1         |
| 1101           | 1           | 0             | -1          | 1               | 0               | -1         |
| 1110           | 1           | 0             | - 1         | 1               | 0               | = <u> </u> |
| 1111           | 1           | 1             | -1          | 1               | 1               | -1         |

a 4-2 compressor is three times more likely to be a 0 than a 1. Also, if there is no carry-in, a carry-out can only occur when allfourinputbitsare1.Denotingthetwooutputsofa4-2com-

pressor as *C*<sup>r</sup>and *S*<sup>r</sup>and simply letting *C*<sup>r</sup>*S*<sup>r</sup>equal 11 (instead of 00 with a carry-out of 1) when all inputs ( $X_1$ ,  $X_2$ ,  $X_3$ , and  $X_4$ ) are 1, (1) and (2) can be used for the carry *C*<sup>r</sup>and sum *S*<sup>r</sup>bits.Thisnot onlyobviates theneed for carry-inand carry-out

bits, it also makes the resulting logic implementation simpler. The tradeoff is a small loss in accuracy, as reflected in the truth table of Table I. In Table I, D refers to the difference between the outputs of approximate and exact 4-2 compressor circuits

$$C_{\text{yang}}^{\text{r}} = X_1 \cdot X_2 + X_1 \cdot X_3 + X_1 \cdot X_4 + X_2 \cdot X_3 + X_2 \cdot X_4 + X_3 \cdot X_4$$
(1)

$$S_{\text{yang}}^{r} = (X_1 \bigoplus X_2) \bigoplus (X_3 + X_4).$$
(2)

# B. Proposed 4-2 Compressor and Error Recovery Module

Inthisletter, an improved multiplier design has been devised by manipulating the truth table entries that result in erroneous outputs in a 4-2 compressor. For inputs 1100, the exact outputs should be 10, while the compressor in [13] produces 11 ( $S^{r}=1$  instead of 0). Thus,  $D_{yang}=11-10=1$  using binary arithmetic. However, of the four inexact entries in Table I, this is the only row that has a difference of 1 (the others have difference -1).

As shown in the last column of Table I, in the proposed design, all truth table entries are designed to either pro-

duceD=0orD=-1.Thisisdonebyproducing

outputs01whentheinputsare1100,whileleavingallof

the other entries of Table I the same. Since the rearenow

thesamenumberofinexactentries(inthelastfourrows of Table I), but the error is always in the same direction (toward making the outputs smaller), a simple error recov-ery circuit can be used to correct the error when  $X_4X_3$ = 11. Equations(3)and(4)showthelogicequationsforthe *C*<sup>r</sup>and *S*<sup>r</sup>termsi ntheproposed4-

2compressor.Notably,C<sup>r</sup>hasonefewerproducttermthaninthedesi gnof [13]. New approximate multiplier circuits can be designed usingtheproposed4-2compressordesignincombinationwith



Fig. 2.Partial product accumulation process using truncation and proposed approximate compressors for an 8-bit approximate multiplier with errordetection module and error recovery module.

asimpleerrorrecoverymodule

$$C_{\text{proposed}}^{r} = X_{1} \cdot X_{2} + X_{1} \cdot X_{3} + X_{1} \cdot X_{4} + X_{2} \cdot X_{3} + X_{2} \cdot X_{4}$$
(3)

$$S_{\text{proposed}}^{i} = (X_1 \bigoplus X_2) \bigoplus (X_3 + X_4).$$
(4)

To illustrate the proposed multiplier design, the phase 2) multiplier part (*n*-2 compression) for an 8-bit approximate multiplierdesignisshowninFig.2.Asin[13],thepar-

tial product accumulation part of an approximate multiplier is divided into three regions: 1)a7-bit accurate region; 2)a4-bit

approximate region; and 3) a 4-bit truncated region. Carry propagation, shown with dashed and solid arrows in Fig. 2, onlyoccursinafewplacesandinanoncascadingmanner (a carry propagates only once and can be computed in par- allel with other sum and carry terms). The dashed arrowsshow the carry propagation used in an exact 4-2 compressor (the interested reader is referred to [13] for details). The solid arrow shows the only carry propagation used in the proposed multiplierdesign. Thetwopairsoffourentries in themselves and the solid arrow state.

nificant portion of the approximate region produce two error recovery terms. These are then OR'ed together to produce the carry into the least significant bit of the accurate region. Error recovery is only used at this position because it has the largest influenceonaccuracy.A16-bit(32-bit)approximatemultiplier

would have 2 (4) such carry terms produced from 4 (8) error recovery terms in the most significant portion of the approximateregion.Sincetheerrorrecoverymoduleisonlyused in the most significant column of the approximate region, the overall hardware overhead is small.

In the proposed *n*-bit approximate multiplier, partial product accumulation requires  $\log n-1$  steps since the *n* partial producttermscanbereducedto2termswithlog*n*-1stepsof 4-2 compression. Using truncation and approximate compres- sors in the less significant bits of the partial and accumulated products results in decreased power consumption and circuit area, while using accurate compressors in the most significant

#### TABLEII MEDANDNUMBEROFCORRECTOUTPUTS COMPARISON(8-BITCASE)



Fig.3.Errordistributionin8-bitapproximatemultiplierdesigns:(a)approximatemultiplierdesignin [12];(b)approximatemultiplierdesignin [13];and (c) proposed approximate multiplier.

bits can mitigate the errors caused by truncation and the useofapproximatecompressors.AsshowninFig.2,theproposed error recovery module is a simple 2-level AND–OR circuit.

#### III. ANALYSISANDEXPERIMENTALRESULTS

Approximate multipliers using the proposed 4-2 compressoranderrorrecoverymodulehavebeencompared with possible alternatives [12] and [13] ([11] uses the same compressor as [12]) using analysis and ASIC postsynthesis implementation.

#### A. ErrorAnalysis

Twocommonmetricsforanalyzingtheaccuracyofapproximate multipliers are the mean error distance (MED) and number of exact outputs. The MED of an n-bit approximate multiplier can be defined as

MED= 
$$\frac{\sum_{2^{n}-1}\sum_{j=0}^{2^{n}-1} \text{product}}{2^{2^{n}}} \frac{(i,j)-\text{product}}{2^{2^{n}}}.$$
(5)

Using n = 8 as an example, Table IIshows the MED, maximum error distance (ED), standard deviation, and number of correct outputs with the proposed approximate multiplier and the approximate multipliers in [12] and [13]. Approximate multipliers based on the design of [13] and the proposed design have the same number of correct outputs. However, because of the error recovery module, the proposed approximate multiplier has the smallest MED value.

TABLEIII **POSTSYNTHESISRESULTSFORMULTIPLIERS** WITH4-2COMPRESSORS Area Power Delay (mW)(um², (ns) Accurate design 674.56 0.2153 3.04 2.79 620.80 0.1760 Design in [12] 8-bit Design in [13] 521.28 0.1724 2.70510.08 **Proposed Design** 0.1646 2.70Accurate design 2667.52 0.97725.905.55 Design in [12] 2401.60 0.824916-bit Design in [13] 2101.12 0.79855.070.7584 5.02 **Proposed Design** 2048.32 10750.40 4.5206 11.66 Accurate design Design in [12] 9453.44 3.6205 11.35 32-bit Design in [13] 8391.68 9.79 3.5874**Proposed Design** 8163.52 3.4149 9.73 25 20 15 10



Fig.4.(a)Area,(b)power,and(c)delayreductionratiooftheproposed multipliers and multipliers in [12] and [13].

In order to analyze the behavior of approximate multiplier designs, a plot of the distribution of errors were generated, as showninFig.3.Theproposedmethodhasanerrordistribution with most values concentrated at lower ED values.

## B. ASICPostsynthesisResults

Forahardwareoverheadandpower/delaycomparison, the proposed approximate multiplier and the approximate multipliersin[12]and[13]havebeensynthesizedusing

65 nm CMOS cell library. To synthesize these approximate multiplier designs, Synopsys Design Compiler has been used. The postsynthesis results are summarized in Table III. Even though the proposed design has an added error recovery module, the proposed design still has betterarea, power, and delay characteristics than the previous approximate multiplier designs in [12] and [13]. Fig. 4shows the reductions in area, power, and delay for 8, 16, and 32-bit multipliers.



Fig.5.Imagesharpeningexampleusingtheproposedapproximatemultiplierdesign: (a) original image and (b) sharpened image.

TABLEIV PSNRResultsofImageSharpeningMethod

|                 | PSNR <sub>image1</sub> | PSNR <sub>hmage2</sub> | PSNR <sub>image3</sub> |
|-----------------|------------------------|------------------------|------------------------|
| Design in [12]  | 33.7787                | 40.9170                | 45.6257                |
| Design in [13]  | 48.1978                | 45.5306                | 50.1626                |
| Proposed Design | 48.2951                | 48.1506                | 53.5515                |

#### IV. PRACTICALUSAGEOFPROPOSEDMETHOD

Todemonstrate the practicality of the proposed approximate multiplier for error-resilient applications, an image processing application, *image sharpening*, has been used. The main calculation required for image sharpening, as shown in [13], is

$$S(x,y)=2I(x,y)-1 \sum_{273} \sum_{i=-2j=-2}^{2} G(i+3,j+3)I(x-i,y-j)$$

where *I*(*S*) is the original (sharpened) image and *G* is

$$G = \begin{bmatrix} 1 & 4 & 7 & 4 & 1 \\ 4 & 16 & 26 & 16 & 4 \\ 7 & 26 & 41 & 26 & 7 \\ 4 & 16 & 26 & 16 & 4 \\ 1 & 4 & 7 & 4 & 1 \end{bmatrix}$$
(7)

Inthismethod, onlythemultiplications are approximate, while all others are exact operations. This method has been tested using MATLABR2016a. Fig. 5 shows an example of the use of this method with the proposed approximate multiplier design.

Thepeaksignal-to-noiseratio(PSNR)isacommon metric used to evaluate the performance of image processingalgorithms.TableIVcomparesthePSNRval-

uesachievedwiththeproposeddesignandprevious designs in [12] and [13] when tested with three test images. ThePSNRoftheproposeddesignisbetterthan [12] and [13], by up to 30.0%, for all three test images.

#### V. CONCLUSION

The operations of addition and multiplication in a digital signal processor are arguably the most crucial.. With an emphasis on the latter, a novel 4-2 compressor design has been put out in this letter and can be applied during the binary multiplier's partial product compression phase. The suggested design makes a minor alteration to an already-existing 4-2 compressor design to alter the error profile so that all errors lead to somewhat lower values (over the exact values). The resultant error profile makes it easy to detect every one of these fault locations using a straightforward two-input AND. When needed, such faults can then be recovered from using a basic error recovery module. It is possible to create an approximation multiplier with precise and near regions.Next, a few error recovery modules can be employed in the most important bit position of the approximate region, and the proposed 4-2 compressor can be applied in the approximate region.

With this suggested design, an approximate 8-32 bit multiplier uses 23.2%-24.4% less hardware area, 22.4%-24.5% less power consumption, and 11.2%-17.0% less latency than an exact multiplier. Notably, improvements have been made to all three metrics without requiring any trade-offs. Ultimately, the suggested multiplier design raises the MED of the multiplication results by 11.7%, according to error analysis. additionally earlier approximate multiplier designs.

#### REFERENCES

- [1] S. Mittal, "A survey of techniques for approximate computing,"
- ACMComput. Surveys, vol. 48, no. 4, p. 62, May 2016.
  [2] S.Narayanamoorthy,H.A.Moghaddam,Z.Liu,T.Park,and N.S.Kim, "Energy-efficientapproximatemultiplicationfordigitalsignalprocessingandclassificationapplications,"*IEEETrans.VeryLarge*
- ScaleIntegr.(VLSI)Syst.,vol.23,no.6,pp. 1180–118,Jun. 2015.
  [3] B.ShaoandP.Li,"Array-basedapproximatearithmeticcomputing:A general model and applications to multiplier and squarer design,"*IEEETrans.CircuitsSyst.I,Reg.Papers*,vol.62,no.4,pp. 1081–1090,Apr. 2015.
- [4] H.Jiang,J.Han,F.Qiao,andF.Lombardi, "Approximateradix-8Boothmultipliersforlow-powerandhighperformanceoperation,"*IEEETrans. Comput.*,vol.65,no.8,pp.2638–2644,Aug.2016.
- [5] G. Zervakis, S. Xydis, K. Tsoumanis, D. Soudris, and K. Pekmestzi, "Hybrid approximate multiplier architectures for improved power-accuracy trade-offs," in *Proc. IEEE/ACM Int. Symp. Low PowerElectron. Design*, Rome, Italy, 2015, pp. 79–84.
- [6] S. Hashemi, R. I. Bahar, and S. Reda, "DRUM: A dynamic range unbiased multiplier for approximate applications," in *Proc. IEEE/ACM Int.Conf. Comput.-Aided Design*, Austin, TX, USA, 2015, pp. 418–425.
- [7] K. Bhardwaj, P. S. Mane, and J. Henkel, "Power- and areaefficientapproximatewallacetreemultiplierforerrorresilientsystems,"in *Proc. Int. Symp. Qual. Electron. Design*, SantaClara, CA , USA, 2014, pp.263–269.
- [8] R.Zendegani,M.Kamal,M.Bahadori,A.Afzali-Kusha,and M.Pedram, "RoBAmultiplier: Arounding-based approximatemultiplier for high-speed yet energy-efficient digital signal processing," *IEEETrans.VeryLargeScaleIntegr.(VLSI)Syst.*,vol.25,no.2,pp. 393– 401,Feb. 2017.
- [9] M. Shafique, R. Hafiz, S. Rehman, W. El-Harouni, and J. Henkel, "Cross-layer approximate computing: From logic to architectures," in *Proc. ACM/EDAC/IEEE Design Autom. Conf.*, Austin, TX, USA, 2016, pp.1–6.
- [10] C.-H. Chang, J. Gu, and M. Zhang, "Ultra low-voltage low-powerCMOS 4-2 and 5-2 compressors for fast arithmetic circuits," *IEEETrans. Circuits Syst. I, Reg. Papers*, vol. 51, no. 10, pp. 1985–1997, Oct. 2004.
- [11] N. Maheshwari, Z. Yang, J. Han, and F. Lombardi, "A design approachfor compressor based approximate multipliers," in *Proc. Int. Conf. VLSIDesign*, Bengaluru, India, 2015, pp. 209–214.
- [12] A. Momeni, J. Han, P. Montuschi, and F. Lombardi, "Design andanalysis of approximate compressors for multiplication," *IEEE*

(6)

*Trans.Comput.*, vol. 64, no. 4, pp. 984–994, Apr. 2015. [13] Z. Yang, J. Han, and F. Lombardi, "Approximate compressors forerror-

resilientmultiplierdesign,"in*Proc.IEEEInt.Symp.DefectFaultTolerance VLSINanotechnol.Syst.*,Amherst,MA,USA,2015,

[14] J. Liang, J. Han, and F. Lombardi, "New metrics for the reliability of approximate and probabilistic adders," *IEEE Trans. Comput.*, vol. 62,no. 9, pp. 1760–1771, Sep. 2013.