The Intel MMX™ Technology

A SEMINAR REPORT ON
THE INTEL MMX TECHNOLOGY

Submitted in partial fulfillment of the requirements for the award of the Degree of Bachelor of Technology in Electronics & Communication Engineering
Submitted By: Renu Kanwar, B. Tech. (8th Sem.)
Submitted to: Mr. Yogendra Aboti (H.O.D.), Department of ECE
DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING, MARWAR ENGINEERING COLLEGE & RESEARCH CENTRE, JODHPUR (RAJASTHAN). RAJASTHAN TECHNICAL UNIVERSITY, KOTA (RAJASTHAN). (2010-2011)
Introduction:-
Intel's MMX™ technology [1, 2] is an extension to the basic Intel Architecture (IA) designed to improve performance of multimedia and communication algorithms. The technology includes new instructions and data types, which achieve new levels of performance for these algorithms on host processors. MMX technology exploits the parallelism inherent in many of these algorithms. Many of these algorithms exhibit the property of "fixed" computation on a large data set.

The definition of MMX technology evolved from earlier work in the i860™ architecture [3]. The i860 architecture was the industry's first general purpose processor to provide support for graphics rendering. The i860 processor provided instructions that operated on multiple adjacent data operands in parallel, for example, four adjacent pixels of an image. After the introduction of the i860 processor, Intel explored extending the i860 architecture in order to deliver high performance for other media applications, for example, image processing, texture mapping, and audio and video decompression. Several of these algorithms naturally lent themselves to SIMD processing. This effort laid the foundation for similar support for Intel's mainstream general purpose architecture, IA.

The MMX technology extension was the first major addition to the instruction set since the Intel386™ architecture. Given the large installed software base for the IA, a significant extension to the architecture required special attention to backward compatibility and design issues. MMX technology provides benefits to the end user by improving the performance of multimedia-rich applications by a factor of 1.5x to 2x, and improving the performance of key kernels by a factor of 4x on the host processor. MMX technology also provides benefits to software vendors by enabling new multimedia-rich applications for a general purpose processor with an established customer base. Additionally, MMX technology provides an integrated software development environment for software vendors for media applications.
This paper provides insight into the process and considerations used to define the MMX technology. It also provides specifics on MMX instructions that were added to the IA as well as the approach taken to add this significant capability without adding a new software visible architectural state. The paper also presents application examples that show the usage and benefits of MMX instructions. Data showing the performance benefits for the applications is also presented.
ACKNOWLEDGEMENT
Theoretical knowledge is improved through seminar preparation, as it contributes significantly to the student's understanding and gives him first-hand knowledge of the complexities of the engineering arena. First I would like to thank the almighty and my parents, who gave me their valuable support and blessings to complete this minor project. I would also like to thank Er. ?. K. Bhansali (Director, M.E.C.R.C. Jodhpur) & Er. Yogendra Aboti (Head of Department, Electronic & Communication Engineering) for their encouragement & appreciation. An accomplishment of any significance depends on the synergy and cooperation of resources both material and human. I express my heartfelt gratitude to all those who have contributed directly or indirectly in this endeavor. My first and foremost regards are for my family members, who patiently and painstakingly helped me out in every way they can.
INDEX
S. No.  Content
01  Definition process
02  Basic concepts
03  Packed data format
04  Conditional execution
05  Saturating arithmetic
06  Fixed point arithmetic
07  Repositioning of data elements within packed data format
08  Data alignment
09  Features
10  New data types
11  Enhanced Instruction Set
12  64-Bit MMX Registers
13  Preserving Full Backward Compatibility
14  No New State
15  No New Exceptions
16  Choice of Opcodes for MMX Instructions
17  Use of FP DLL Model for MMX Code
18  Performance Advantage
19  Potential Applications of MMX
20  The floating point registers
21  Advantages of using the floating point registers in MMX
22  Goals
23  Conclusion
24  References
Definition Process:-
MMX technology's definition process was an outstanding adventure for its participants, a path with many twists and turns. It was a bottom-up process. Engineering input and managerial drive made MMX technology happen. The definition of MMX technology was guided by a clear set of priorities and goals set forth by the definition team. Priority number one was to substantially improve the performance of multimedia, communications, and emerging Internet applications.
Although targeted at this market, any application that has execution constructs that fit the SIMD architecture paradigm can enjoy substantial performance speed-ups from the technology. It was also imperative that processors with MMX technology retain backward compatibility with existing software, both operating systems and applications. The addition of MMX technology to the IA processor family had to be seamless, having no compatibility or negative performance effect on any existing IA software or operating systems. Applications that use MMX technology had to run on any existing IA operating system without any operating system modifications whatsoever and coexist in a seamless way with the existing IA application base. For example, any existing version of an operating system (e.g., Windows NT) would have to run without modifications. New applications that use MMX technology together with existing IA applications would also have to run without modifications on a processor with MMX technology.

The key principle that allowed compatibility to be maintained was that MMX technology was defined to map inside the existing IA floating-point architecture and registers [4]. Since existing operating systems and applications already knew how to deal with the IA floating-point (FP) state, mapping the MMX technology inside the floating-point architecture was a clean way to add SIMD without adding any new architectural state. The operating system does not need to know if an application is using MMX technology. Existing techniques to perform multitasking (sharing execution time among multiple applications by frequently switching among them) would take care of any application with MMX technology.

Another important guideline that we followed was to make it possible for application developers to easily migrate their applications to use MMX technology. Realizing that IA processors with and without MMX technology would be on the market for some time, we wanted to make sure that migration would not become a problem for software developers. By enabling a software program to detect the presence of MMX technology during run time, a software developer need develop only one version of an application that can run both on newer processors that support MMX technology and older ones which do not. When reaching a point in the execution of a program where a code sequence enhanced with MMX instructions can boost performance, the program checks to see if MMX technology is supported and executes the new code sequence. On older processors without MMX technology, a different code sequence would be executed. This calls for duplication of some key application code sequences, but our experience showed it to average less than 10% growth in program size.

We wanted to keep MMX technology simple so that it would not depend on any complex implementation which would not scale easily with future advanced microarchitecture techniques and increasing processor frequencies, thus making it a burden on the future. We made sure MMX technology would add a minimal amount of incremental die area, making it practical to incorporate MMX technology into all future Intel microprocessors.
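The run-time check described above can be sketched as a one-time dispatch between two code sequences. This is only an illustration in Python, not the mechanism used by real applications: on real hardware the check is made with the CPUID instruction, whereas here the feature list is assumed to arrive as a whitespace-separated string (similar in spirit to the "flags" line Linux reports for x86 processors). The names has_mmx, scale_generic, scale_simd_style and select_kernel are hypothetical.

```python
def has_mmx(cpu_flags: str) -> bool:
    """Return True if the whitespace-separated feature list names 'mmx'."""
    return "mmx" in cpu_flags.lower().split()


def scale_generic(samples, gain):
    """Fallback code sequence: one element at a time."""
    return [s * gain for s in samples]


def scale_simd_style(samples, gain):
    """Stand-in for the MMX-enhanced sequence: four elements per step."""
    out = []
    for i in range(0, len(samples), 4):
        out.extend(s * gain for s in samples[i:i + 4])
    return out


def select_kernel(cpu_flags: str):
    """Choose the code sequence once, based on the detected feature."""
    return scale_simd_style if has_mmx(cpu_flags) else scale_generic


kernel = select_kernel("fpu vme mmx")
print(kernel([1, 2, 3, 4, 5], 3))
```

Both sequences must produce the same result, which is why only the key routines need to be duplicated and the rest of the application stays common.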
We also wanted to keep MMX technology general enough so that it would support new algorithms or changes to existing ones. As a result, we avoided algorithm-specific solutions, sometimes sacrificing potential performance but avoiding the risk of having to support features in the future if they become redundant. The decision of whether to add specific instructions was based on a cost-benefit analysis for a large set of existing and futuristic applications in the area of multimedia and communications. These applications included MPEG1/2 video, music synthesis, speech compression, speech recognition, image processing, 3D graphics in games, video conferencing, modem, and audio applications.

The definition team also met with external software developers of emerging multimedia applications to understand what they needed from a new Intel Architecture processor to enhance their products. Applications were collected from different sources, and in some cases where no application was readily available, we developed our own. Applications we collected were broken down to reveal that, in most cases, they were built out of a few key compute-intensive routines where the application spends most of its execution time. These key routines were then analyzed in detail using advanced computer-aided profiling tools. Based on these studies, we found that key code sequences had the following common characteristics:
- Small, native data types (for example, 8-bit pixels, 16-bit audio samples)
- Regular and recurring memory access patterns
- Localized, recurring operations performed on the data
- Compute-intensive
This common behavior enabled us to come up with MMX technology, which is a solution that supports well a wide variety of applications from different domains.
Basic Concepts:-
Our observations of multimedia and communications applications pointed us in the direction of an architecture that would enable exploiting the parallelism noted in our studies. Beyond the obvious performance enhancement potential gained by packing relatively small data elements (8 and 16 bits) together and operating on them in parallel, this kind of packing also naturally enables utilizing wide data paths and execution capabilities of state-of-the-art processors. An efficient solution for media applications necessitates addressing some concepts that are fundamental to the SIMD approach and multimedia applications:
- Packed data format
- Conditional execution
- Saturating arithmetic vs. wrap-around arithmetic
- Fixed-point arithmetic
- Repositioning data elements within packed data format
- Data alignment
Packed Data Format:-

MMX technology defines new register formats for data representation. The key feature of multimedia applications is that the typical data size of operands is small. Most of the data operands' sizes are either a byte or a word (16 bits). Also, multimedia processing typically involves performing the same computation on a large number of adjacent data elements. These two properties lend themselves to the use of SIMD computation.

One question to answer when defining the SIMD computation model is the width or the data type for SIMD instructions. How many elements of data should we operate on in parallel? The answer depends on the characteristics of the natural organization and alignment of the data for targeted applications and design considerations. For example, for a motion estimation algorithm, data is naturally organized in 16 rows, with each row containing only 16 bytes of data. In this case, operating on more than 16 data elements at a time will require reformatting the input data. Design considerations involve issues such as the practical width of the data path and how many times functional units will replicate. Given that current Intel processors already have 64-bit data paths (for example, floating-point data paths, as well as a data path between the integer register file and memory subsystem due to dual load/store capability in the Pentium processor), we chose the width of MMX data types to be 64 bits.
Conditional Execution:-

Operating on multiple data operands using a single instruction presents an interesting issue. What happens when a computation is only done if the operand value passes some conditional check? For example, in an absolute value calculation, only if the number is already negative do we perform a 2's complement on it:

    for i = 1, 100
        if a[i] < 0 then b[i] = -a[i]
        else b[i] = a[i]          ; Absolute value calculation

There are different approaches possible, and some are simpler than others. Using a branch approach does not work well for two reasons: first, a branch-based solution is slower because of the inherent branch misprediction penalty, and second, because of the need to convert packed data types to scalars. Direct conditional execution support does not work well for the IA since it requires three independent operands (source, source/destination, and predicate vector). Keeping with the philosophy of performance and simplicity, we chose a simpler solution. The basic idea was to convert a conditional execution into a conditional assignment.

Conditional assignment in turn can be implemented through different approaches. One approach would be to provide the flexibility of specifying a dynamically generated mask with an assignment instruction. Such an approach would have required defining instructions with three operands (source, source/destination, and mask). Here also, we adopted a solution that is more amenable to higher performance designs. Compare operations in MMX technology result in a bit mask corresponding to the length of the operands. For example, a compare operation operating on packed byte operands produces byte-wide masks. These masks then can be used in conjunction with logical operations to achieve conditional assignment. Consider the following example:

    If True Ra := Rb else Ra := Rc

Let us say register Rx contains all 1's if the condition is true and all 0's if the condition is false. Then we can compute Ra with the following logical expression:

    Ra = (Rb AND Rx) OR (Rc ANDNOT Rx)

This approach works for operations with a register as the destination. Conditional assignment to memory can be implemented as a sequence of load, conditional assignment, and store. We rejected more efficient support for conditional stores for two reasons: first, the support requires three source operands, which does not map well to high-performance architectures, and second, the benefit of such support is dependent on support from the platform for efficient partial transfers.
The MMX instruction set contains a packed compare instruction that generates a bit mask, enabling data dependent calculations to be executed without branch instructions and to be executed on several data elements in parallel. The bit mask result of the packed compare instruction has all 1's in elements where the relation tested for is true and all 0's otherwise (see Figure 1).
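The mask-based conditional assignment can be tried out with ordinary integers. The following is a minimal sketch in Python, assuming four signed 16-bit elements packed into one 64-bit value with element 0 in the lowest bits; the helper names (pack_words, compare_gt_words, select, abs_words) are hypothetical and only emulate the behavior described in the text. It computes the absolute value example without any per-element branch in the selection step.

```python
MASK16 = 0xFFFF
MASK64 = (1 << 64) - 1


def pack_words(words):
    """Pack four signed 16-bit values into one 64-bit integer."""
    value = 0
    for i, w in enumerate(words):
        value |= (w & MASK16) << (16 * i)
    return value


def unpack_words(value):
    """Split a 64-bit integer into four signed 16-bit values."""
    out = []
    for i in range(4):
        w = (value >> (16 * i)) & MASK16
        out.append(w - 0x10000 if w & 0x8000 else w)
    return out


def compare_gt_words(a, b):
    """Per-element signed a > b; each result element is all 1's or all 0's."""
    mask = 0
    for i, (x, y) in enumerate(zip(unpack_words(a), unpack_words(b))):
        if x > y:
            mask |= MASK16 << (16 * i)
    return mask


def negate_words(a):
    """Per-element 2's complement (wraps for -32768)."""
    return pack_words([-x for x in unpack_words(a)])


def select(rx, rb, rc):
    """Ra = (Rb AND Rx) OR (Rc ANDNOT Rx)."""
    return (rb & rx) | (rc & ~rx & MASK64)


def abs_words(a):
    """b[i] = -a[i] where a[i] < 0, else a[i], using a compare mask."""
    is_negative = compare_gt_words(0, a)  # mask of elements where 0 > a[i]
    return select(is_negative, negate_words(a), a)


print(unpack_words(abs_words(pack_words([5, -3, 0, -100]))))
```

Note that both candidate results (the negated and the original values) are computed for every element; the mask only decides which one is kept.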
Saturating Arithmetic:-

Operand sizes typically used in multimedia are small (for example, 8 bits for representing a color component). An 8-bit number allows only 256 different shades of a color to be displayed. While this resolution is more than enough for what the eye can see, it presents us with a problem in computation. Given only an 8-bit representation, the accumulation of color values of a large number of pixels is likely to exceed the maximum value that can be represented by the 8-bit number. In the default computational model, if the addition of two numbers results in a value that is more than the maximum value that can be represented by the destination operand, a wrapped-around value is stored in the destination. If an application wants to safeguard against such a possibility, it has to check explicitly for the occurrence of an overflow.
In media applications, typically the desired behavior is to provide not the wrap-around value but the maximum value as the result. MMX technology leaves this choice to the application program, which determines whether a wrap-around result or the maximum result is produced in case of an overflow.
There may be cases where an application wants to examine the occurrence of an overflow in a computation. Providing a flag to indicate this (i.e., indicating whether or not the value was saturated) would have been desirable. However, we decided against providing this flag, since we did not want to add any additional new states to the architecture to preserve the backward compatibility. Our analysis also showed that it was not critical to provide this information in most applications. If needed, an application can determine if saturation was encountered by comparing the result of a computation with the maximum and minimum value; typically, saturation is the correct behavior.
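The difference between the two behaviors is easy to see on unsigned 8-bit color values. The sketch below is an illustration in Python, assuming each packed operand is modeled as a plain list of values in the range 0 to 255; the function names are hypothetical. The last helper shows the comparison against the maximum value mentioned above as a way to notice possible saturation without a dedicated flag.

```python
def add_bytes_wrap(a, b):
    """Wrap-around addition: the result is kept modulo 256."""
    return [(x + y) & 0xFF for x, y in zip(a, b)]


def add_bytes_saturate(a, b):
    """Saturating addition: results above 255 are clamped to 255."""
    return [min(x + y, 255) for x, y in zip(a, b)]


def elements_at_limit(result, limit=255):
    """Indices whose value equals the limit and so may have been clamped."""
    return [i for i, v in enumerate(result) if v == limit]


bright = [200, 100, 255, 10]
extra = [100, 20, 1, 10]
print(add_bytes_wrap(bright, extra))      # a bright pixel turns dark
print(add_bytes_saturate(bright, extra))  # a bright pixel stays at maximum
```

With wrap-around, adding light to an already bright pixel produces a dark one, which is the visual artifact that saturation avoids.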
Fixed-Point Arithmetic:-
Media applications involve working on fraction values, for example, the use of a weighting coefficient in filtering, averaging, etc. One way to support operations on fraction values is to provide SIMD operations for floating-point operands. However, floating-point units are hardware intensive. Also, for several media applications, even precision of 10 to 12 binary bits and dynamic range of 4 to 6 bits are sufficient. Industry-standard floating-point (IEEE FP) requires a minimum of 23 bits of precision. Looking at application requirements and the trade-off of performance and design complexity leads to the use of a fixed-point arithmetic paradigm for several media applications. Note that some of the computations may still require the dynamic range and the precision supported by IEEE floating-point, for example, geometry transformation for state-of-the-art 3D applications.
In fixed-point computation, from the point of view of the processor architecture, computations are done on integer values, but programmers/applications interpret the integer values as fraction values. Some number of leading bits (determined by the application) are interpreted as an integer, while the remaining bits of the value are interpreted as a fraction. It is the application's responsibility to perform appropriate shifts in order to scale the number.
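A short worked example may make the scaling responsibility concrete. The sketch below is in Python and assumes an application-chosen format with 8 fraction bits; the constant FRACTION_BITS and the helper names are hypothetical. The hardware only ever sees integers; the shift after the multiplication is what keeps the binary point in place.

```python
FRACTION_BITS = 8  # application-chosen position of the binary point


def to_fixed(x):
    """Convert a real number to its scaled integer representation."""
    return int(round(x * (1 << FRACTION_BITS)))


def from_fixed(v):
    """Interpret a scaled integer as a real number again."""
    return v / (1 << FRACTION_BITS)


def fixed_mul(a, b):
    """Integer multiply, then shift right to restore the scale."""
    return (a * b) >> FRACTION_BITS


# Weighted average of two pixel values with weights 0.75 and 0.25.
blend = fixed_mul(to_fixed(0.75), to_fixed(100)) + fixed_mul(to_fixed(0.25), to_fixed(200))
print(from_fixed(blend))
```

Addition needs no correction because both operands share the same scale, while a product carries twice the fraction bits and must be shifted back.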
Repositioning of Data Elements Within Packed Data Format:-

The packed data format presents one other issue. There are several cases where elements of packed data may be required to be repositioned within the packed data, or the elements of two packed data operands may need to be merged. There are cases where either the input or the desired output representation of the data may not be ideal for maximizing computation throughput. For example, it may be preferable to compute on color components of a pixel in "planar format" while the input may be in "packed format."
There are also situations where one needs to perform intermediate computations in wider format (perhaps packed word format), while the result is presented in packed byte format. In the above cases, there is a need to extract some elements of a packed data type and write them into a different position in the packed result. One general solution to this issue is to provide an instruction that takes two packed data operands and allows merging of their bytes in any arbitrary order into the destination packed data operand. However, such a general solution is expensive to implement. This solution essentially will require a full cross bar connection. In the MMX technology architecture, we defined an instruction that requires a relatively easy swizzle network and yet allows the efficient repositioning and combining of elements from packed data operands in most cases. The instruction unpack takes two packed data operands and merges them as shown in Figure 2.
The unpack instruction can be used for a variety of efficient repositioning of data elements, including data replication, within packed data. For example, consider converting a color representation from packed form (i.e., for each pixel, four consecutive bytes represent R, G, B, and Alpha values) to planar format (i.e., four consecutive bytes represent the red component of four consecutive pixels).
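The packed-to-planar conversion can be sketched with two rounds of byte interleaving. The Python code below is an emulation under stated assumptions: each 64-bit operand is modeled as a list of eight byte values with element 0 lowest, and unpack_low_bytes / unpack_high_bytes are hypothetical helpers that interleave the low or high halves of two operands, which is how the unpack operation is described here. The exact instruction sequence a real routine would use is not given in the text.

```python
def unpack_low_bytes(dst, src):
    """Interleave the low four bytes of dst and src: d0 s0 d1 s1 d2 s2 d3 s3."""
    out = []
    for d, s in zip(dst[:4], src[:4]):
        out.extend([d, s])
    return out


def unpack_high_bytes(dst, src):
    """Interleave the high four bytes of dst and src: d4 s4 d5 s5 d6 s6 d7 s7."""
    out = []
    for d, s in zip(dst[4:], src[4:]):
        out.extend([d, s])
    return out


def packed_to_planar(pixels):
    """Convert four (R, G, B, A) pixels into [R0..R3, G0..G3], [B0..B3, A0..A3]."""
    reg_a = list(pixels[0]) + list(pixels[1])   # R0 G0 B0 A0 R1 G1 B1 A1
    reg_b = list(pixels[2]) + list(pixels[3])   # R2 G2 B2 A2 R3 G3 B3 A3
    t0 = unpack_low_bytes(reg_a, reg_b)         # R0 R2 G0 G2 B0 B2 A0 A2
    t1 = unpack_high_bytes(reg_a, reg_b)        # R1 R3 G1 G3 B1 B3 A1 A3
    reds_greens = unpack_low_bytes(t0, t1)      # R0 R1 R2 R3 G0 G1 G2 G3
    blues_alphas = unpack_high_bytes(t0, t1)    # B0 B1 B2 B3 A0 A1 A2 A3
    return reds_greens, blues_alphas


pixels = [(10, 20, 30, 40), (11, 21, 31, 41), (12, 22, 32, 42), (13, 23, 33, 43)]
print(packed_to_planar(pixels))
```

Passing the same operand twice to unpack_low_bytes replicates each byte, and passing an all-zero operand as src widens bytes to words, which covers the data replication and wider intermediate format cases mentioned above.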
Data Alignment:-
Use of packed data also presents data alignment issues. In some cases, the data may be aligned on its natural boundary and not on the size of the packed data operand. For example, in a motion estimation routine, the 16x16 block is aligned at an arbitrary byte boundary and not at a 64-bit boundary. Therefore, in some cases, there is a need to support efficient access of unaligned data for media applications. One approach is to support unaligned accesses directly in hardware, which generally does not work well with the high-performance cache design. Alternatively, one can limit memory accesses to aligned data and extract out the desired data from the accessed data using explicit instructions. MMX technology includes logical shift-left and shift-right operations on 64 bits. These instructions enable using a sequence of Shift left, Shift right, and Or operations to assemble the desired bytes from the aligned data that encompasses the desired bytes.
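The Shift left, Shift right, and Or sequence can be sketched as follows. This Python illustration assumes little-endian byte order (the lowest-addressed byte occupies the lowest bits of the 64-bit value) and a memory buffer that extends at least to the end of the second aligned quadword; load_aligned and load_unaligned are hypothetical names.

```python
MASK64 = (1 << 64) - 1


def load_aligned(memory, address):
    """Read 8 bytes from an 8-byte aligned address as one 64-bit value."""
    if address % 8 != 0:
        raise ValueError("address must be 8-byte aligned")
    return int.from_bytes(memory[address:address + 8], "little")


def load_unaligned(memory, address):
    """Assemble 8 bytes at any address from two aligned loads."""
    base = address - (address % 8)
    offset_bits = 8 * (address - base)
    low = load_aligned(memory, base)
    if offset_bits == 0:
        return low
    high = load_aligned(memory, base + 8)
    # Shift right drops the unwanted low bytes of the first quadword,
    # shift left positions the needed bytes of the second, Or merges them.
    return ((low >> offset_bits) | (high << (64 - offset_bits))) & MASK64


memory = bytes(range(32))
print(hex(load_unaligned(memory, 3)))
```

Only aligned accesses reach memory; the realignment itself is done entirely in registers.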
Features:-
MMX technology features include:
- New data types built by packing independent data elements together into one register.
- An enhanced instruction set that operates on all independent data elements in a register, using parallel SIMD fashion.
- New 64-bit MMX registers that are mapped on the IA floating-point registers.
- Full IA compatibility.
New Data Types:-

MMX technology introduces four new data types: three packed data types and a new 64-bit entity. Each element within the packed data types is an independent fixed-point integer. The architecture does not specify the place of the fixed point within the elements, because it is the user's responsibility to control its place within each element throughout the calculation. This adds a burden on the user, but it also leaves a large amount of flexibility to choose and change the precision of fixed-point numbers during the course of the application in order to fully control the dynamic range of values. The following four data types are defined (see Figure 3):
Data type            Contents
Packed byte          8 bytes packed into 64 bits
Packed word          4 words packed into 64 bits
Packed double word   2 double words packed into 64 bits
Packed quad word     64 bits
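The four data types are simply four ways of viewing the same 64 bits. As a small illustration in Python, assuming element 0 occupies the lowest bits, the hypothetical helper split_elements below divides one 64-bit value into 8, 4, 2, or 1 elements.

```python
def split_elements(value64, element_bits):
    """View a 64-bit value as 64 // element_bits unsigned elements."""
    if element_bits not in (8, 16, 32, 64):
        raise ValueError("element width must be 8, 16, 32 or 64 bits")
    mask = (1 << element_bits) - 1
    return [(value64 >> shift) & mask for shift in range(0, 64, element_bits)]


value = 0x0807060504030201
print(split_elements(value, 8))                      # packed byte
print([hex(w) for w in split_elements(value, 16)])   # packed word
print([hex(d) for d in split_elements(value, 32)])   # packed double word
print([hex(q) for q in split_elements(value, 64)])   # quad word
```

Nothing in the stored bits records which view is intended; the instruction applied to the register decides how the elements are delimited.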
Enhanced Instruction Set:-

MMX technology defines a rich set of instructions that perform parallel operations on multiple data elements packed into 64 bits (8x8-bit, 4x16-bit, or 2x32-bit fixed point integer data elements). We view the MMX technology instruction set as an extension of the basic operations one would perform on a single datum in the SIMD domain. Instructions that operate on packed bytes were defined to support frequent image operations that involve 8-bit pixels or one of the 8-bit color components of 24/32-bit pixels (Red, Green, Blue, Alpha channel). We defined full support for packed word (16-bit) data types. This is because we found 16-bit data to be a frequent data type in many multimedia algorithms (e.g., MODEM, Audio), and it also serves as the higher precision backup for operations on byte data.
A basic instruction set is provided for packed doubleword data types to support operations that need intermediate higher precision than 16 bits and a variety of 3D graphics algorithms. Because MMX technology is a 64-bit capability, new instructions to support 64 bits were added, such as 64-bit memory moves or 64-bit logical operations. Overall, 57 new MMX instructions were added to the Intel Architecture instruction set.
The MMX instructions vary from one another by a few characteristics. The first is the data type on which they operate. Instructions are supplied to do the same operation on different data types. There are also instructions for both signed and unsigned arithmetic. MMX technology supports saturation on packed add, subtract, and data type conversion instructions. This facilitates a quick way to ensure that values stay within a given range, which is a frequent need in multimedia operations. In most cases, it is more important to save the execution time spent on checking if a value exceeds a certain range than to worry about the inaccuracy introduced by clamping values to minimum or maximum range values. Saturation is not a mode activated by setting a control bit but is determined by the instruction itself. Some instructions have saturation as part of their operation. MMX technology added data type conversion instructions to address the need to convert between the new data types and to enable some intermediate calculations to have more bits available for extended precision. Also, many algorithms used in multimedia and communications applications perform multiply-accumulate computations. MMX technology addressed this with a special multiply add instruction. MMX instructions are non-privileged instructions and can be used by any software: applications, libraries, drivers, or operating systems.
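The multiply-add operation can be illustrated with a small emulation. The Python sketch below assumes the commonly documented behavior of the packed multiply-add on words: four pairs of signed 16-bit elements are multiplied, and adjacent 32-bit products are summed into two 32-bit results. The text above does not spell out these details, so they are an assumption of the sketch, as are the names multiply_add_words and dot_product.

```python
def wrap_signed32(x):
    """Keep a value within signed 32-bit range using wrap-around."""
    return ((x + 2**31) % 2**32) - 2**31


def multiply_add_words(a, b):
    """[a0*b0 + a1*b1, a2*b2 + a3*b3] for four signed 16-bit elements each."""
    products = [x * y for x, y in zip(a, b)]
    return [wrap_signed32(products[0] + products[1]),
            wrap_signed32(products[2] + products[3])]


def dot_product(samples, coefficients):
    """Multiply-accumulate over a sequence whose length is a multiple of 4."""
    total = 0
    for i in range(0, len(samples), 4):
        low, high = multiply_add_words(samples[i:i + 4], coefficients[i:i + 4])
        total += low + high
    return total


print(multiply_add_words([1, 2, 3, 4], [5, 6, 7, 8]))
print(dot_product([1, 2, 3, 4, 5, 6, 7, 8], [1] * 8))
```

A filter or transform kernel built this way performs four multiplications and two additions per step, and the 32-bit results give the extended intermediate precision mentioned above.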
Table 1 summarizes the instructions introduced by MMX technology:
64-Bit MMX Registers:-

MMX technology provides eight new 64-bit general purpose registers that are mapped on the floating-point registers. Each can be directly addressed within the assembly by designating the register names MM0 - MM7 in MMX instructions. MMX registers are random access registers, that is, they are not accessed via a stack model like the floating-point registers. MMX registers are used for holding MMX data only. MMX instructions that specify a memory operand use the IA integer registers to address that operand.
As the MMX registers are mapped over the floating-point registers, applications that use MMX technology have 16 registers to use. Eight are the MMX registers, each 64 bits in size, that hold packed data, and eight are integer registers, which can be used for different operations like addressing, loop control, or any other data manipulation. MMX data values reside in the low order 64 bits (the mantissa) of the IA 80-bit floating-point registers (see Figure 4).
The exponent field of the corresponding floating-point register (bits 64-78) and the sign bit (bit 79) are set to ones (1's), making the value in the register a NaN (Not a Number) or infinity when viewed as a floating-point value. This helps to reduce confusion by ensuring that an MMX data value will not look like a valid floating-point value. MMX instructions only access the low-order 64 bits of the floating-point registers and are not affected by the fact that they operate on invalid floating-point values. The dual usage of the floating-point registers does not preclude applications from using both MMX code and floating-point code. Inside the application, the MMX code and floating-point code should be encapsulated in separate code sequences. After one sequence completes, the floating-point state is reset and the next sequence can start. The need to use floating-point data and MMX (fixed-point integer) data at the same time is infrequent. At a given time in an application, data being operated upon is usually of one type. This enabled us to use the floating-point registers to store the MMX technology values and achieve our full backward compatibility goal.
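The register layout described above can be modeled by building the 80-bit image directly. The Python sketch below follows the bit positions given in the text (low 64 bits for data, bits 64-78 exponent, bit 79 sign); the function names are hypothetical, and the encoding of 1.0 used for comparison (exponent 0x3FFF with the top mantissa bit set) is the standard extended-precision representation rather than something stated in the text.

```python
MASK64 = (1 << 64) - 1
EXPONENT_ALL_ONES = 0x7FFF


def mmx_register_image(value64):
    """80-bit image after an MMX write: sign and exponent bits all 1's."""
    return (0xFFFF << 64) | (value64 & MASK64)


def fields(image80):
    """Return (sign bit 79, exponent bits 64-78, low-order 64 bits)."""
    sign = (image80 >> 79) & 1
    exponent = (image80 >> 64) & 0x7FFF
    low64 = image80 & MASK64
    return sign, exponent, low64


def looks_like_ordinary_fp(image80):
    """False when the exponent is all 1's (NaN or infinity encodings)."""
    return fields(image80)[1] != EXPONENT_ALL_ONES


image = mmx_register_image(0x1122334455667788)
print(fields(image), looks_like_ordinary_fp(image))
```

Whatever packed data is stored, the all-ones exponent keeps the register from being mistaken for a normal floating-point number.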
Preserving Full Backward Compatibility:-

One of the important requirements for MMX technology was to enable use of MMX instructions in applications without requiring any changes in the IA system software. An additional requirement was that an application should be able to utilize performance benefits of MMX technology in a seamless fashion, i.e., it should be able to employ MMX instructions in part of the application, without requiring the whole of the application to be MMX technology-aware. Primary backward compatibility requirements and their implications are:
- Applications using MMX instructions should work on all existing multitasking and non-multitasking operating systems. This requires that MMX technology should not add any new architecturally visible states or events (exceptions).
- Existing applications that do not use MMX instructions should run unchanged. This requires that MMX technology should not redefine the behavior of any existing IA 32-bit instructions. Only those undefined opcodes that are not relied on for causing illegal exceptions by existing software should be used to define MMX instructions. Also, MMX instructions should only affect the IA 32-bit state when in use.
- Existing applications should be able to utilize MMX technology without being required to make the whole application MMX technology-aware. It should be possible to employ MMX instructions within a procedure in an existing application without requiring any changes in the rest of the application. This requires that MMX instructions work well within the context of existing IA calling conventions for procedure calls.
- It should be possible to run an application even on an older generation of processors that does not support MMX technology. Using dynamically linked libraries (DLLs) for MMX and non-MMX technology processors is an easy way to do this.
- MMX instructions should be semantically compatible with other IA instructions, i.e., it should be easy to support new MMX instructions in existing assemblers. They should also have minimal impact on the instruction decoder. Another aspect of this is that MMX instructions should not require programmers to think in new ways regarding the basic behavior of instructions. For example, addressing modes and the availability of operations with memory should conceptually work the same.
No New State:-
The MMX technology state overlaps with the Floating-Point state. Overlapping the MMX state with the FP stack presented an interesting challenge. For performance reasons as well as for ease of implementation for some microarchitectures, we wanted to allow the accessing of the MMX registers in a flat register model. We needed to enable overlapping MMX registers with the FP stack while still allowing a flat register access model for MMX instructions. This was accomplished by enforcing a fixed relationship between the logical and physical registers for the FP stack when accessed via MMX instructions. Additionally, every MMX instruction makes the whole MMX register file valid. This is different from the floating-point stack model, where new stack entries are made valid only if the instruction specifies a "push" operation.

MMX instructions themselves do not update FP instruction state registers (for example, FP opcode (FOP), FP data selector (FDS), FP IP (FIP), etc.). The FP instruction state is used only by FP exception handlers. Since MMX instructions do not create any computation exceptions, this state is really not meaningful for MMX instructions. Additionally, not updating these states eliminates the complexity of maintaining this state for MMX technology implementations. Therefore, we made a decision to let the FP instruction state register point to the last FP instruction executed even though future MMX instructions will update the FP stack and TAG register. Eventually, when an FP instruction is executed, all of the FP instruction state gets updated. Therefore, FP exception handlers always see consistent FP instruction state.
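The contrast between the stack model and the flat model can be sketched with a toy state object. This Python model is heavily simplified and rests on assumptions: the tag information is reduced to one valid flag per register, and the "fixed relationship" is modeled by resetting the top-of-stack pointer to 0 on every MMX access so that MMn always names physical register n. The class and method names are hypothetical.

```python
class FpMmxState:
    """Eight physical registers shared by the FP stack and the MMX registers."""

    def __init__(self):
        self.regs = [0] * 8
        self.valid = [False] * 8   # simplified tag word: one flag per register
        self.top = 0               # FP top-of-stack pointer

    def fp_push(self, value):
        """Stack model: only the newly pushed entry becomes valid."""
        self.top = (self.top - 1) % 8
        self.regs[self.top] = value
        self.valid[self.top] = True

    def mmx_write(self, n, value):
        """Flat model: MMn is physical register n; whole file becomes valid."""
        self.top = 0               # assumption: fixed logical-to-physical map
        self.regs[n] = value
        self.valid = [True] * 8


state = FpMmxState()
state.fp_push(1.5)
print(state.top, state.valid.count(True))
state.mmx_write(3, 0xAB)
print(state.top, all(state.valid))
```

Because the MMX path only reuses the existing stack pointer and tag information, no state is introduced that an existing operating system would fail to save and restore.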
No New Exceptions:-
MMX instructions can be viewed as new non-IEEE floating-point instructions that do not generate computation exceptions. However, similar to FP instructions, they do report any pending FP exceptions. For compatibility with existing software, it is critical that any pending FP exception is reported to the software prior to execution of any MMX instruction which could update the FP state. At the point of raising the pending FP exception, the FP exception state still points to the last FP instruction that created the FP condition. Therefore, the fact that the exception gets reported by an MMX instruction instead of an FP instruction is transparent to the FP exception handler. Additional exceptions that are pertinent to MMX technology are memory exceptions, device-not-available (DNA, INT 7) exceptions, and FP emulation exceptions.

Handling of memory exceptions, in general, does not depend on the opcode of the instruction causing the exception. Therefore, MMX technology exceptions do not cause a malfunction of any memory-access-related exception handler. Our extensive compatibility verification validated this further.

A DNA exception is caused when the TS bit in CR0 is set and any instruction that could modify the FP state is issued. This includes execution of an MMX instruction when the TS bit is set. In this case, similar to the FP case, a DNA exception is invoked. The response to this exception is to save the FP state and free it up for use by future FP/MMX instructions. This exception handler also has no use for the opcode of the instruction causing the exception.

When the CR0.EM bit is set, a floating-point instruction causes an FP emulation exception. In this case, instead of using FP hardware, FP functionality is supported via software emulation. Since the MMX technology architecture state overlaps with the FP architecture state, the issue arises as to the correct behavior for MMX instructions when the CR0.EM bit is set. Causing an emulation exception for MMX instructions when CR0.EM is set is not the right behavior, since the existing FP emulator does not know about MMX instructions. Therefore, the first natural choice seemed to be to ignore CR0.EM for MMX technology. However, this choice has a problem. Ignoring CR0.EM for MMX instructions would result in two separate contexts for the FP stack and TAG words: one context in the emulator memory for FP and one context in the hardware for MMX instructions. This leads to an architectural inconsistency between the cases when CR0.EM is set and when it is not set. We had to find some other logical way to deal with this without defining any new exceptions. We chose to define the CR0.EM = 1 case to result in an illegal opcode exception. Thus, essentially, when CR0.EM is set, the MMX technology architecture extension is disabled.
Choice of Opcodes for MMX Instructions:-

The MMX instruction opcodes were chosen after extensive analysis of the undefined opcode map. We had to make sure that the available opcodes were really unused. This required ensuring that no software was relying on the illegal opcode fault behavior of these opcodes. Intel was already working with software vendors to ensure that they relied only on one specific encoding, 0FFF, to cause an illegal opcode fault. Other encodings may not cause an illegal opcode fault in future implementations. Except for a few cases, we found that software was using only the prescribed encoding for causing a program-controlled invalid opcode fault.

Only address prefixes are defined to be meaningful for MMX instructions. Use of a Repeat, Lock, or Data prefix is illegal for MMX instructions. The address prefix has the same behavior as for any other instruction.
Use of FP DLL Model for MMX Code:-

To enable common multimedia applications for processors with and without MMX technology, we chose to promote the Dynamic Linked Library (DLL) model as the primary model to support MMX instructions.

In the DLL model, depending upon whether the processor provides MMX technology support in hardware (the processor CPUID provides this information), the appropriate version of the media library function is linked dynamically. MMX technology DLLs follow the same guidelines as FP DLLs. The primary guidelines are:
• At the end of a DLL, leave the floating-point registers in the correct state for the calling procedure. This generally means leaving the floating-point stack empty, unless a procedure has a return value. This also means that the caller should check for, and handle, any FP exceptions that it might have generated.
• Do not assume that the floating-point state remains the same across procedures. The callee can typically assume that at entry the FP stack is empty, unless there is some set convention for parameter passing.

Note that nothing in the MMX technology architecture depends on these guidelines for functional correctness. MMX technology can be used in any other usage model. MMX technology provides an instruction to clear all of the FP state with a single instruction (the EMMS instruction). If some DLL is written to return with the FP stack only partially empty, one needs to use a combination of EMMS and floating-point loads to create the correct FP stack state. In short, clean the MMX state with the EMMS instruction.
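The selection step of the DLL model can be sketched as follows. The example is an illustration in Python, not the way a real media library is built: all function names are hypothetical, the feature word is passed in as a plain integer instead of being read with the CPUID instruction, and `dot_mmx_style` is only a stand-in for a routine that would really be written with MMX instructions and would end with EMMS. The one hardware fact relied upon is that CPUID function 1 reports MMX support in bit 23 of EDX.

```python
MMX_FEATURE_BIT = 23   # CPUID function 1: EDX bit 23 indicates MMX support


def has_mmx(cpuid_edx):
    """Check the MMX feature flag in a CPUID(1).EDX value."""
    return bool((cpuid_edx >> MMX_FEATURE_BIT) & 1)


def dot_scalar(a, b):
    """Fallback version: one multiply-accumulate at a time."""
    total = 0
    for x, y in zip(a, b):
        total += x * y
    return total


def dot_mmx_style(a, b):
    """Stand-in for the MMX build: consumes four elements per step."""
    total = 0
    for i in range(0, len(a), 4):
        total += sum(x * y for x, y in zip(a[i:i + 4], b[i:i + 4]))
    return total


def select_dot_product(cpuid_edx):
    """Pick the library routine once, the way a loader would pick a DLL."""
    return dot_mmx_style if has_mmx(cpuid_edx) else dot_scalar
```

The application calls whatever `select_dot_product` returns and does not need to know which version it received; both versions must return the same result.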
Performance Advantage:-
We will analyze the performance enhancement due to MMX technology through an example of a matrix-vector multiplication very much like the one in Figure 5. The multiply-accumulate (MAC) operation is one of the most frequent operations in multimedia and communications applications, used in basic mathematical primitives like matrix multiply and filters.

A multiply-accumulate (MAC) operation is defined as the product of two operands added to a third operand (the accumulator). This operation requires two loads (the operands of the multiplication), a multiply, and an add (to the accumulator). MMX technology does not support three-operand instructions; therefore, it does not have a full MAC capability. On the other hand, the packed multiply-add instruction (PMADDWD) is defined, which computes four 16-bit x 16-bit multiplies generating four 32-bit products and does two 32-bit adds (out of the four needed). A separate packed add doubleword (PADDD) adds the two 32-bit results of the packed multiply-add to another MMX register, which is used as an accumulator.

For this performance example, we will assume both input vectors to be 16 elements long, each element being a signed 16-bit value. Accumulation will be performed in 32-bit precision. The Pentium processor, for example, would have to process each of the operations one at a time in a sequential fashion. This amounts to 32 loads, 16 multiplies, and 15 additions, a total of 63 instructions. Assuming we perform 4 MACs (out of the 16) per iteration, we need to add 12 instructions for loop control (3 instructions per iteration: increment, compare, branch) and one instruction for storing the result. The total is 76 instructions. Assuming all data and instructions are in the on-chip caches and that exiting the loop will incur one branch misprediction, the integer assembly-optimized version of this code (utilizing both pipelines) takes just over 200 cycles on a Pentium processor microarchitecture. The cycle count is dominated by the integer multiply being a non-pipelined 11-cycle operation. Under the same conditions, but assuming the data is in a floating-point format, the floating-point optimized assembly version executes in 74 cycles. The floating-point version is faster (assuming the data is in floating-point format) since the floating-point multiply takes three cycles to execute and is a pipelined unit.
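The arithmetic performed by PMADDWD and PADDD on the 16-element example can be emulated in a few lines. The code below is a sketch in Python that models only the numerical behavior of the two instructions; the function names are hypothetical, packed registers are represented as short Python lists, and the loop issues four PADDD steps for simplicity, whereas hand-written code could use the first PMADDWD result directly as the accumulator.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def pmaddwd(a, b):
    """Four signed 16-bit words in, two signed 32-bit doublewords out."""
    assert len(a) == len(b) == 4
    p = [to_signed(x, 16) * to_signed(y, 16) for x, y in zip(a, b)]
    # Adjacent products are summed pairwise into 32-bit results.
    return [to_signed(p[0] + p[1], 32), to_signed(p[2] + p[3], 32)]


def paddd(acc, src):
    """Packed add of two doublewords with wraparound (no saturation)."""
    return [to_signed(x + y, 32) for x, y in zip(acc, src)]


def dot16(u, v):
    """Dot product of two 16-element vectors of signed 16-bit values."""
    acc = [0, 0]                                   # 32-bit accumulators
    for i in range(0, 16, 4):                      # four PMADDWD steps
        acc = paddd(acc, pmaddwd(u[i:i + 4], v[i:i + 4]))
    return to_signed(acc[0] + acc[1], 32)          # combine the two halves


result = dot16(list(range(1, 17)), [2] * 16)       # 2 * (1 + ... + 16)
```

Each call to `pmaddwd` covers four of the 16 multiply-accumulate steps, so four calls cover the whole vector.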
MMX technology, on the other hand, computes four elements at a time. This reduces the instruction count to eight loads, four PMADDWD instructions, three PADDD instructions, one store instruction, and three additional instructions (overhead due to packed data types), totaling 19 instructions. Unrolling the loop into four PMADDWD instructions eliminates the need to insert any loop control instructions, because four PMADDWDs already perform all the 16 required MACs. The MMX instruction count is four times smaller than when using integer or floating-point operations. With the same assumptions as above, on a Pentium processor with MMX technology, an MMX technology-optimized assembly version of the code utilizing both pipelines will execute in only 12 cycles.

Continuing the above example, assume a 16x16 matrix is multiplied by a 16-element vector. This operation is built of 16 Vector-Dot-Products (VDPs) of length 16. Repeating the same exercise as before, and assuming a loop unrolling that performs four VDPs each iteration, the regular Pentium processor code will total 4 * (4 * 76 + 3) = 1228 instructions. Using MMX technology will require 4 * (4 * 19 + 3) = 316 instructions. The MMX instruction count is 3.9 times smaller than when using regular operations. The best regular code implementation (the floating-point optimized version) takes just under 1200 cycles to complete, in comparison to 207 cycles for the MMX code version.

Intel has introduced two processor families with MMX technology: the Pentium processor with MMX technology and the Pentium II processor. The performance of both processors was compared on the Intel Media Benchmark (IMB) [5, 6], which measures the performance of processors running algorithms found in multimedia applications. The IMB incorporates audio and video playback, image processing, wave sample rate conversion, and 3D geometry.

Figure 6 and Table 2 compare the Pentium processor with MMX technology and the Pentium II processor against the Pentium processor and the Pentium Pro processor.
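The instruction counts quoted for the dot-product and matrix-vector examples in this section can be re-derived with a few lines of arithmetic. The sketch below is in Python and uses hypothetical variable names; it only reproduces the counting argument of the text and says nothing about cycle counts.

```python
# Scalar dot product of two 16-element vectors.
scalar_core = 32 + 16 + 15             # loads + multiplies + additions
scalar_loop = 4 * 3                    # 4 iterations x (increment, compare, branch)
scalar_dot = scalar_core + scalar_loop + 1   # plus one store of the result

# MMX dot product: loads, PMADDWD, PADDD, store, packed-data overhead.
mmx_dot = 8 + 4 + 3 + 1 + 3


def matrix_vector(per_dot_product):
    """16 dot products, done as 4 iterations of 4, with 3 loop instructions each."""
    return 4 * (4 * per_dot_product + 3)


scalar_matrix = matrix_vector(scalar_dot)
mmx_matrix = matrix_vector(mmx_dot)

dot_ratio = scalar_dot / mmx_dot            # reduction for one dot product
matrix_ratio = scalar_matrix / mmx_matrix   # reduction for the full product

print(scalar_dot, mmx_dot, scalar_matrix, mmx_matrix)
print(round(dot_ratio, 1), round(matrix_ratio, 1))
```

The ratio for the full matrix-vector product is slightly lower than for a single dot product because the loop-control instructions are the same in both versions.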
Potential Applications of MMX:-

1. Graphics
2. MPEG video/image processing
3. Music synthesis
4. Speech compression/recognition
5. Video conferencing
6. Matrix and vector calculations
7. Advanced 3D graphics (SSE2)
8. Speech recognition (SSE2)
9. Scientific and engineering applications (SSE2)
The floating point registers:-

1. Floating point is processed by eight 80-bit registers, ST(0), ST(1), ..., ST(7), in the floating-point unit.
2. When doing floating-point arithmetic, these registers are organized as a stack.
3. Programming floating point is quite different from programming integer arithmetic.
4. Floating-point calculations are done using 80 bits even when the program specifies storing 32-bit or 64-bit data values.
Advantages of using the floating point registers in MMX:-

1. The registers already exist. Only logic had to be added to the chip.
2. The operating system already knows about the floating-point registers.
3. When a computer switches from one program to another, the state (registers) of the current program must be saved so that it can be restored when the program becomes the active program once again.
4. The floating-point registers are automatically saved as part of the state of a program.
5. As a result, MMX worked under existing operating systems.
Goals:-
• Accelerate multimedia and communications applications.
• Maintain full compatibility with existing operating systems and applications.
• Exploit the inherent parallelism in multimedia and communication algorithms.
• Include new instructions and data types to improve performance.
Conclusion:-
MMX technology implements a high-performance technique that enhances the performance of Intel Architecture microprocessors for media applications. The core algorithms in these applications are compute intensive. These algorithms perform operations on a large amount of data, use small data types, and provide many opportunities for parallelism. These algorithms are a natural fit for a SIMD architecture. MMX technology defines a general-purpose and easy-to-implement set of primitives to operate on packed data types. MMX technology, while delivering a performance boost to media applications, is fully compatible with the existing application and operating system base. MMX technology is general by design and can be applied to a variety of software media problems. Some examples of this variety were described in this report. Future media-related software technologies for use on the Intranet and Internet should benefit from MMX technology. Pentium processors with MMX technology provide a significant performance boost (approximately 4x for some of the kernels) for media applications. Performance gains from the technology will scale well with increased processor operating frequency and future microarchitectures.
References:-
[1] A. Peleg, U. Weiser, MMX Technology Extension to the Intel Architecture, IEEE Micro, Vol. 16, No. 4, August 1996, pp. 42-50.
[2] A. Peleg, S. Wilkie, U. Weiser, Intel MMX for Multimedia PCs, Communications of the ACM, Vol. 40, No. 1, January 1997, pp. 25-38.
[3] Intel Corporate Literature, i860 Microprocessor Family Programmer's Reference Manual, Order number 240875, Intel Corporate Literature Sales, 1991.
[4] Pentium Family User's Manual, Volume 3: Architecture and Programming Manual, Order number 241430, Intel Corporate Literature Sales, Mt. Prospect, IL, 1994.
[5] M. Slater, The Land Beyond Benchmarks, Computer and Communications OEM Magazine, Vol. 4, No. 31, September 1996, pp. 64-77.
[6] Intel Media Benchmark URL: http://pentium.intel.com/procs/perf/icomp/imb.htm