Disks for Data Centers

White paper for FAST 2016

Eric Brewer, Lawrence Ying,
Lawrence Greenfield, Robert Cypher, and Theodore Ts'o
Google, Inc.

February 23, 2016
Version 1.1, revised February 29, 2016

Online at:
http://research.google.com/pubs/pub44830.html

Copyright 2016 Google Inc. All rights reserved.

Google makes no warranties concerning the information contained in this document.

Google Inc

Abstract
Disks form the central element of Cloud-based storage, whose demand far outpaces the
considerable rate of innovation in disks. Exponential growth in demand, already in progress for
15+ years, implies that most future disks will be in data centers and thus part of a large
collection of disks. We describe the "collection view" of disks and how it and the focus on tail
latency, driven by live services, place new and different requirements on disks. Beyond defining
key metrics for data-center disks, we explore a range of new physical design options and
changes to firmware that could improve these metrics.

We hope this is the beginning of a new era of "data center" disks and a new broad and open
discussion about how to evolve disks for data centers. The ideas presented here provide some
guidance and some options, but we believe the best solutions will come from the combined
efforts of industry, academia and other large customers.

Table of Contents
Introduction
Physical Design Options
1) Alternative Form Factors [TCO, IOPS, capacity]
Parallel Accesses [IOPS]
Multi-Disk Packages [TCO]
Power Delivery [TCO]
2) Cache Memory [TCO, tail latency]
3) Optimized SMR Implementation [capacity]
Firmware Directions
4) Profiling Data [IOPS, tail latency]
5) Host-Managed Read Retries [tail latency]
6) Target Error Rate [capacity, tail latency]
7) Background Tasks and Background Management APIs [tail latency]
Background scanning [tail latency]
8) Flexible Capacity [TCO, capacity]
9) Larger Sector Sizes [capacity]
10) Optimized Queuing Management [IOPS]
Summary
Bibliography

Acknowledgements: Joe Heinrich, Sean Quinlan, Andrew Fikes, Rob Ewaschuk, Marilee
Schultz, Urs Hölzle, Brent Welch, Remzi Arpaci-Dusseau (University of Wisconsin), Garth
Gibson (Carnegie Mellon University), Rick Boyle, Kaushal Upadhyaya, Maire Mahony, Danner
Stodolsky, and Florentina Popovici

Introduction

The rise of portable devices and services in the Cloud has the consequence that (spinning) hard
disks will be deployed primarily as part of large storage services housed in data centers. Such
services are already the fastest growing market for disks and will be the majority market in the
near future. For example, for YouTube alone, users upload over 400 hours of video every
minute, which at one gigabyte per hour requires a petabyte (1M GB) of new storage every day.
As shown in the graph, this continues to grow exponentially, with a 10x increase every five
years.
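
As a rough check on these figures, the short Python sketch below redoes the arithmetic. The constant names are ours, not the paper's, and the calculation covers a single copy of the uploaded data only.

```python
# Back-of-the-envelope check of the upload figures quoted in the text.
HOURS_UPLOADED_PER_MINUTE = 400   # "over 400 hours of video every minute"
GB_PER_HOUR_OF_VIDEO = 1.0        # "one gigabyte per hour"
MINUTES_PER_DAY = 60 * 24

gb_per_day = HOURS_UPLOADED_PER_MINUTE * MINUTES_PER_DAY * GB_PER_HOUR_OF_VIDEO
pb_per_day = gb_per_day / 1_000_000   # 1 PB = 1M GB, as in the text

# A 10x increase every five years, expressed as a compound annual factor.
annual_growth_factor = 10 ** (1 / 5)

print(f"{gb_per_day:,.0f} GB/day = {pb_per_day:.3f} PB/day (single copy)")
print(f"annual growth factor: {annual_growth_factor:.3f}x")
```

The single-copy figure comes to a bit under 0.6 PB per day, so the petabyte quoted above is best read as an order-of-magnitude statement. Growth of 10x per five years corresponds to roughly 58% per year.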

But the current generation of modern disks, often called "nearline enterprise" disks, is not
optimized for this new use case, and instead is designed around the needs of traditional
servers. We believe it is time to develop, jointly with the industry and academia, a new line of
disks that are specifically designed for large-scale data centers and services.

This paper introduces the concept of "data center" disks and describes the key goals and
metrics for these disks. The key differences fall into three broad categories: 1) the "collection
view", in which we focus on aggregate properties of a large collection of disks, 2) a focus on tail
latency derived from the use of storage for live services, and 3) variations in security
requirements that stem from storing others' data. After covering each in more detail, we explore
a range of future directions to meet these goals.

The Collection View
The essential difference: disks for data centers are always part of a large collection of disks;
designs thus optimize the collection with less regard for individual disks. In this sense, current
enterprise disks represent a local maximum. The real target is the global maximum using the
following five key metrics:
1. Higher I/Os per second (IOPS), typically limited by seeks,
2. Higher capacity, in GB,
3. Lower tail latency, covered below,
4. Meet security requirements, and
5. Lower total cost of ownership (TCO).

At the data center level, the TCO of a disk-based storage solution is dominated by the disk
acquisition cost, the disk power cost, the disk connection overhead (disk slot, compute, and
network), and the repairs and maintenance overhead.

Achieving durability in practice requires storing valuable data on multiple drives. Within a data
center, high availability in the presence of host failures also requires storing data on multiple
disks, even if the disks were perfect and failure free. Tolerance to disasters and other rare
events requires replication of data to multiple independent locations. Although it is a goal for
disks to provide durability, they can at best be only part of the solution and should avoid
extensive optimization to avoid losing data.

This is a variation of the "end-to-end argument": avoid doing in lower layers what you have to do
anyway in upper layers [1]. In particular, since data of value is never just on one disk, the bit
error rate (BER) for a single disk could actually be orders of magnitude higher (i.e. lose more
bits) than the current target of 1 in 10^15, assuming that we can trade off that error rate (at a
fixed TCO) for something else, such as capacity or better tail latency.

The collection view also implies higher-level maintenance of the bits, including background
checksumming to detect latent errors, data rebalancing for more even use of disks (including
new disks), as well as data replication and reconstruction. Modern disks do variations of these
internally, which is partially redundant, and a single disk by itself cannot always do them as well.
At the same time, the disk contains extensive knowledge about the low-level details, which in
general favors new APIs that enable better cooperation between the disk and higher-level
systems.

A third aspect of the collection view is that we optimize for the overall balance of IOPS and
capacity, using a carefully chosen mix of drives that changes over time. We select new disks so
that the marginal IOPS and capacity added bring us closer to our overall goals for the collection.
Workload changes, such as better use of SSDs or RAM, can shift the aggregate targets.

An obvious question is why we are talking about spinning disks at all, rather than SSDs, which
have higher IOPS and are the "future" of storage. The root reason is that the cost per GB
remains too high, and more importantly that the growth rates in capacity/$ between disks and
SSDs are relatively close (at least for SSDs that have sufficient numbers of program-erase
cycles to use in data centers), so that cost will not change enough in the coming decade. We do
make extensive use of SSDs, but primarily for high-performance workloads and caching, and
this helps disks by shifting seeks to SSDs. However, SSDs are really outside the scope of this
paper, which is about disks and their role in very large scale storage systems.

The following graph shows one way to think about this: the letters represent various models of
disks in terms of their $/GB (in TCO) versus their IOPS per GB (see footnote 1). The ideal drive
would be down and to the right. Newer drives of the same form factor tend to be lower (less
$/GB), but towards the left (since they get bigger more than they get faster). Changing to
smaller platter size is one way to move toward the right. The aggregate properties of a mix of
any two drives form a line between the points, for example the line between C and D. In
general, such lines must go up and to the right; if they go down and to the right, the drive on the
right is strictly superior and there is no need to mix them (e.g. F dominates E). The optimal mix
for a certain IOPS/GB target can be met by combining drives along the bottom of the convex
hull, shown as a set of red lines.

Footnote 1: The letters do not represent actual disk models; the graph represents how we think
about the space. An earlier version of this graph had the y-axis as GB/$ (rather than $/GB),
which is more intuitive, but the convex hull property does not hold in that case. This updated
version is what we use internally.
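
The convex-hull argument can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' tooling: the drive letters and their numbers are invented, `Drive`, `efficient_frontier` and `cheapest_mix` are hypothetical names, and the mix is expressed as fractions of total capacity, which is what makes both $/GB and IOPS/GB combine linearly.

```python
import math
from typing import List, NamedTuple, Tuple

class Drive(NamedTuple):
    name: str
    iops_per_gb: float   # x-axis of the graph
    cost_per_gb: float   # y-axis of the graph ($/GB, in TCO)

def _cross(o: Drive, a: Drive, b: Drive) -> float:
    # > 0 when o -> a -> b turns counter-clockwise in the (IOPS/GB, $/GB) plane.
    return ((a.iops_per_gb - o.iops_per_gb) * (b.cost_per_gb - o.cost_per_gb)
            - (a.cost_per_gb - o.cost_per_gb) * (b.iops_per_gb - o.iops_per_gb))

def efficient_frontier(drives: List[Drive]) -> List[Drive]:
    """Bottom of the convex hull, minus drives dominated by one to their right."""
    hull: List[Drive] = []
    for p in sorted(drives, key=lambda d: (d.iops_per_gb, d.cost_per_gb)):
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    # A leading segment that slopes down means the left drive is strictly worse.
    while len(hull) >= 2 and hull[1].cost_per_gb <= hull[0].cost_per_gb:
        hull.pop(0)
    return hull

def cheapest_mix(drives: List[Drive],
                 target_iops_per_gb: float) -> Tuple[List[Tuple[str, float]], float]:
    """Return ([(drive name, capacity fraction)], blended $/GB) for the target."""
    hull = efficient_frontier(drives)
    if target_iops_per_gb <= hull[0].iops_per_gb:
        return [(hull[0].name, 1.0)], hull[0].cost_per_gb
    for a, b in zip(hull, hull[1:]):
        if target_iops_per_gb <= b.iops_per_gb:
            f = (target_iops_per_gb - a.iops_per_gb) / (b.iops_per_gb - a.iops_per_gb)
            cost = a.cost_per_gb + f * (b.cost_per_gb - a.cost_per_gb)
            return [(a.name, 1.0 - f), (b.name, f)], cost
    raise ValueError("target IOPS/GB exceeds the fastest drive on the frontier")

# Invented example points; they do not describe real disk models.
DRIVES = [Drive("C", 1.0, 0.030), Drive("D", 3.0, 0.052), Drive("E", 0.5, 0.040),
          Drive("F", 0.8, 0.028), Drive("G", 2.0, 0.045)]

if __name__ == "__main__":
    print([d.name for d in efficient_frontier(DRIVES)])
    print(cheapest_mix(DRIVES, 2.0))
```

With these made-up numbers, drive G meets a target of 2.0 IOPS/GB on its own, but an even capacity split of C and D meets the same target at a lower blended $/GB, which is the point of buying along the bottom of the hull.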


This works well as long as we can get most of both the capacity and IOPS out of each drive,
which is not easy. The industry is relatively good at improving $/GB, but less so at IOPS/GB,
and this distinction affects most of the suggested directions below.

A related consequence is that we do not care about precise capacity, e.g. exactly 6 TB, or even
that the usable capacity of a drive be constant over time. For example, the loss of a head
should not prevent the use of the rest of the capacity and IOPS.

Services and Tail Latency
Data center disks require better control over tail latency, due to the focus on serving data for
remote users while they wait. Broadly speaking, we distinguish between live services, in which a
user is waiting for a response and low tail latency is critical, and throughput workloads, such as
MapReduce or Cloud Dataflow, in which performance is throughput oriented and the operations
are lower priority. We can often hide the latency of disk writes, so we focus on the tail latency of
reads, which we cannot hide in general (and which caching only improves slightly). Our primary
metric is thus some form of tail latency for reads, such as the time for the 99th percentile read;
i.e., 99% of reads are within this time bound.
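
To make the metric concrete, here is a minimal sketch of a nearest-rank percentile over a set of read latencies. The sample values are invented and `percentile` is our own helper, not an API from the paper; production monitoring would more likely use streaming histograms.

```python
import math
from typing import Sequence

def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples or not 0 < pct <= 100:
        raise ValueError("need at least one sample and 0 < pct <= 100")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)   # 1-based rank
    return ordered[rank - 1]

# Invented latencies in milliseconds: mostly fast reads plus two slow ones.
READ_LATENCIES_MS = [10.0] * 98 + [50.0, 400.0]

if __name__ == "__main__":
    mean = sum(READ_LATENCIES_MS) / len(READ_LATENCIES_MS)
    print(f"mean={mean:.1f} ms  p50={percentile(READ_LATENCIES_MS, 50)} ms  "
          f"p99={percentile(READ_LATENCIES_MS, 99)} ms")
```

In this example the median and mean look healthy while the 99th percentile is several times worse, which is why the tail, rather than the average, is the figure a live service tracks.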

As with the collection view, the focus on tail latency has a range of consequences:
- We label our reads as "low latency" or "throughput" so that we can optimize them
  appropriately throughout the stack, and so we can monitor them independently. We also
  have other background "best effort" workloads.
- We do admission control on low-latency reads by having a quota for them; otherwise, all
  reads would have to be "faster than average". Over-quota reads are not promised low
  latency.
- For better overall tail latency, we sometimes issue the same read multiple times to
  different disks and use the first one to return. This consumes extra resources, but is still
  sometimes worthwhile. Once we get a returned value, we sometimes cancel the other
  outstanding requests (to reduce the wasted work). This is pretty likely if one disk (or
  system) has it cached, but the others have queued up real reads. (See Dean and
  Barroso [2] for more information on tail latency and cancellation.)
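
The duplicate-read-and-cancel pattern in the last item can be sketched with Python's asyncio. Everything here is a stand-in: `replica_read` simulates a disk with a sleep, and the disk names and delays are invented. It illustrates the control flow only, not the storage stack described in the paper.

```python
import asyncio
from typing import Awaitable, List, Tuple

async def replica_read(disk: str, delay_s: float, data: bytes) -> Tuple[str, bytes]:
    """Stand-in for a read served by one replica after some service time."""
    await asyncio.sleep(delay_s)
    return disk, data

async def hedged_read(reads: List[Awaitable[Tuple[str, bytes]]]) -> Tuple[str, bytes]:
    """Issue the same read to several replicas and keep the first response."""
    tasks = [asyncio.ensure_future(r) for r in reads]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()                      # drop the work we no longer need
    await asyncio.gather(*pending, return_exceptions=True)
    # If the first finisher failed, result() re-raises; a fuller version
    # would fall back to the remaining replicas instead.
    return next(iter(done)).result()

async def _demo() -> Tuple[str, bytes]:
    return await hedged_read([
        replica_read("disk-a", 0.05, b"block"),    # queued behind other reads
        replica_read("disk-b", 0.001, b"block"),   # already cached
    ])

def demo() -> Tuple[str, bytes]:
    return asyncio.run(_demo())

if __name__ == "__main__":
    print(demo())
```

Cancellation only saves work if the slower replica has not started its seek yet, which is one reason the paper later asks for lightweight per-I/O cancellation at the disk itself.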

Security
Although most of this paper is about changes in prioritization due to a new use case, security
requirements fall in the "must have" category, at least over time.

The first general problem is that the size and complexity of the firmware inside a modern disk
lead to bugs, including security bugs that can be used to attack the disk or even the host.
Hard-disk firmware attacks are not only possible, but appear to have been used [3,4]. Solving
this is beyond the scope of this paper, but it is clear that it must be easier to assure correct
firmware and restrict unauthorized changes, and in the long term we must apply the full range of
hardening techniques already used in other systems. We approach this problem in the short
term by restricting physical access to the disks and by isolation of untrusted code from the host
OS (which has the power to reflash the disk firmware).

The second broad category is the need to ensure the data is encrypted at rest (i.e. the data on
a platter is encrypted in case a device is stolen or incorrectly reused), and that access to
different types of data is controlled through strict authentication. All modern enterprise disks are
already encrypted at rest today, but traditionally with a single key. A side effect of the collection
view is that we mix a wide variety of types of data from different customers and with a range of
restrictions on the same physical disks. Fine-grained access control, using different keys for
different areas of the disk, allows us to use different keys for different customers or restrictions,
and in general better implements the principle of least privilege.

The Trusted Computing Group (TCG) has developed some good solutions (TCG OPAL [5] and
Enterprise [6]), but the adoption of fine-grained access control is still in its infancy. Further, the
use of TCG, especially in SATA drives, has been quite limited and hence there is limited
enthusiasm so far for investing in this area. Our hope is that fine-grained access control will
become widely deployed, well implemented and tested, and viewed as a standard aspect of
best practices.

We will not dig deeper into security, but instead explore a variety of design directions in more
detail, broadly grouped into physical changes to disks and firmware changes. For each topic, we
label which of the four key metrics it helps to address (TCO, tail latency, IOPS, capacity).

Physical Design Options
Standards last a very long time and thus should be developed thoughtfully and with broad
participation. These options are thus the beginning of a long conversation with many
constituents. But the shift in usage is real and the potential of new designs increases every
year, and thus some new design will make sense in the near future. These options are but one
set of proposals, and we know the industry and others will contribute many more.

1) Alternative Form Factors [TCO, IOPS, capacity]
The current 3.5" HDD geometry was adopted for historic reasons; its size was inherited from
the PC floppy disk. An alternative form factor should yield a better TCO overall. Changing the
form factor is a long-term process that requires a broad discussion, but we believe it should be
considered. Although we could spec our own form factor (with high volume), the underlying
issues extend beyond Google, and developing new solutions together will better serve the
whole industry, especially once standardized.

In terms of the platter size, we question whether 3.5" or 2.5" form factors are optimal points. A
larger platter size increases GB/$, but lowers IOPS/GB, while a smaller platter size results in
the opposite. A similar argument applies to the common rotational speed of 7200 RPM, where
higher RPM (rotations per minute) increases IOPS/GB, but lowers GB/$, while a lower RPM
results in the opposite.

We propose increasing the allowable height ("Z height"). Current disks have a relatively small
fixed height: typically 1" for 3.5" disks and 15 mm maximum for 2.5" drives. Taller drives allow
for more platters per disk, which adds capacity, and amortizes the costs of packaging, the
printed-circuit board, and the drive motor/actuator. Given a fixed total capacity per disk, smaller
platters can yield smaller seek distances and higher RPM (due to platter stability), and thus
higher IOPS, but worse GB/$. The net result is a higher GB/$ for any specific IOPS/GB than
could be achieved by altering any other single aspect, such as platter sizes or RPM alone. It
may also be that a mix of different platter sizes (in different disks) provides the best aggregate
solution.

There are a range of secondary optimizations as well, some of which may be significant. These
include system-level thermal optimization, system-level vibration optimization, automation and
robotics handling optimization, system-level helium optimization (see footnote 2), and
system-level weight optimizations.

Footnote 2: Some disks use helium inside the drive to reduce seek time (due to its low density),
but it requires sealing the platter area, which would be easier with different dimensions.

Parallel Accesses [IOPS]
There are multiple ways to enable parallel accesses of two or more IO streams simultaneously,
and some implementations benefit a particular workload more than another. Companies have
mostly avoided these designs in the past due to their extra cost, but as capacity continues to
outpace growth in IOPS, these options become more compelling over time. Here are four
options that could make sense in the future:
1) Two standard-sized actuator arm units, mounted diagonally from each other across the
   disk platter, each with heads that cover all platter surfaces. This implementation is the
   most expensive, but can benefit all workloads. Note that this has been implemented
   before [7,8], although perhaps before its time.
2) One actuator location, but two half-height actuator arm units, mounted on top of each
   other, each with heads that cover half of the platter surfaces. This implementation can
   benefit both multi-queue random-access and single-queue sequential workloads.
3) One actuator arm unit, with a small dual-stage actuator design that allows two heads to
   track two platter surfaces simultaneously. This implementation can double sequential
   workload throughput, but does not help with random access.
4) One actuator arm with heads that can read two adjacent tracks per platter surface
   simultaneously. This implementation can double sequential workload throughput.

Although both sequential throughput and random access rates are important, random accesses
are particularly important in the disk collection model described here, as many users share each
drive and some accesses have strict tail latency bounds. Furthermore, areal density
improvements have already resulted in more improvement in sequential throughput than in
random access performance.

Multi-Disk Packages [TCO]
Given that we buy so many disks, it would make sense to buy groups of disks as one unit. Many
vendors sell disk systems, such as NAS devices, but such systems are too high level and do not
provide enough control. Instead we would group disks in order to 1) share larger caches
(covered below), 2) amortize fixed costs, and 3) improve power distribution (covered next). The
larger package might also help with vibration and yield (see Flexible Capacity below). The
package could still use one or more SATA interfaces, or could be changed to PCIe (see
Cache Memory below).

It is not clear yet how many disks should be in a group. Larger groups have better amortization
and power savings, but also take out more data in a failure, which implies higher recovery costs.
A group typically also has a higher failure rate than its constituent disks, assuming a failure of
one disk requires replacing the group. Four disks might be a reasonable place to start.
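
The failure-rate remark can be quantified under a simplifying assumption that is ours, not the paper's: disks in a package fail independently, and any single failure retires the whole package. The 2% annual failure probability below is an invented placeholder.

```python
def group_failure_probability(disk_failure_prob: float, disks_per_group: int) -> float:
    """P(at least one of n independent disks fails) = 1 - (1 - p)^n."""
    if not 0.0 <= disk_failure_prob <= 1.0 or disks_per_group < 1:
        raise ValueError("need 0 <= p <= 1 and at least one disk per group")
    return 1.0 - (1.0 - disk_failure_prob) ** disks_per_group

if __name__ == "__main__":
    ASSUMED_ANNUAL_DISK_FAILURE = 0.02   # hypothetical figure, for illustration only
    for n in (1, 2, 4, 8):
        p = group_failure_probability(ASSUMED_ANNUAL_DISK_FAILURE, n)
        print(f"{n} disk(s) per group: {p:.2%} chance the group needs replacing")
```

For small per-disk probabilities the group rate is close to n times the disk rate, so a four-disk package roughly quadruples the replacement events while each event also removes four disks' worth of data.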

Power Delivery [TCO]
The current power delivery from the server to the disk is typically via a connector that delivers
some of 12 V, 5 V and 3.3 V (all DC). Each voltage has a current spec that the server must
meet, and overall there is considerable mismatch between what is needed to support a generic
disk (all voltages at full current) and what is needed for any specific model.

Instead, we propose a single 12 V DC supply with a max current (or alternative voltages such
as 48 V; see footnote 3). This can be useful in some modern designs where the overall storage
system does not even use 5 V or 3.3 V anywhere else.

2) Cache Memory [TCO, tail latency]
Currently, we connect many disks to the same host. Each disk has its own read and write
caches of considerable size (30-100 MB). From a TCO perspective it makes more sense to
move RAM caching from the disks to the host or tray, as a single big cache will be both cheaper
and more effective. It is more effective in part due to being larger and shared, but also due to
the fact that we can more easily improve the caching policy over time.

That being said, based on traces we know that the cache hit rates in the disks are relatively
high, despite the fact that we already have caching in multiple higher layers. This is surprising
and still needs more exploration, but we suspect the main reason for this is the effectiveness of
read-ahead (and read-behind). If so, most of the RAM actually in the disk should be used for
this purpose, in addition to write buffering. Furthermore, the disk can do read-ahead and
read-behind when it is "free", even when there are pending requests, since those requests
might require some rotational time to service. The host does not have visibility into when it is
free to extend a read.

One possibility longer term is to keep the same interface and the same fixed amount of RAM
per disk, but have the disks use host memory via PCIe, without any other changes to the disk
caching algorithms. Host memory is cheaper per MB, so this should be a cost savings. Host
memory is also farther away, so perhaps this would be a performance problem. This is a pretty
big change and would only make sense in the context of moving to PCIe, perhaps as part of
supporting multi-disk packages, where the extra bandwidth matters and the cost is easier to
amortize.

We can try to combine the disk's ability to read data without performance cost and the host's
ability to better manage the lifetime of speculative reads by having the host give reads with
flexible bounds: "I want at least 12 KB starting at this LBA, and I'm willing to accept up to 900
KB if it doesn't delay servicing another I/O." This API change for variable reads is similar in
spirit to range writes, in which the disk has multiple options and picks one based on its extra
internal knowledge [10].
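
One way to picture such a flexible-bounds read is sketched below. No such command exists in SATA or SCSI as far as the paper states; `FlexibleRead`, `bytes_to_return` and the `free_transfer_bytes` input (the disk's private estimate of how much it can stream before the next queued I/O would be delayed) are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FlexibleRead:
    """Hypothetical request: a hard minimum plus an optional upper bound."""
    lba: int
    min_bytes: int   # the host needs at least this much
    max_bytes: int   # the host will accept up to this much if it is free

def bytes_to_return(req: FlexibleRead, free_transfer_bytes: int) -> int:
    """Disk-side choice: always satisfy the minimum, extend only while it is free.

    free_transfer_bytes stands for the disk's internal knowledge of how many
    bytes from req.lba it can transfer without delaying another queued I/O.
    """
    if req.min_bytes <= 0 or req.min_bytes > req.max_bytes:
        raise ValueError("need 0 < min_bytes <= max_bytes")
    return max(req.min_bytes, min(req.max_bytes, free_transfer_bytes))

if __name__ == "__main__":
    KB = 1024
    req = FlexibleRead(lba=1_000_000, min_bytes=12 * KB, max_bytes=900 * KB)
    for free in (0, 300 * KB, 5000 * KB):
        print(f"free={free // KB} KB -> return {bytes_to_return(req, free) // KB} KB")
```

The host states intent and the disk applies knowledge only it has, which is the same division of labor as in range writes.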

3) Optimized SMR Implementation [capacity]
Shingled magnetic recording (SMR) allows a hard disk to achieve a higher areal density
(typically 10-20% today, with potentially higher capacity gains in the future) by limiting the ability
to perform random writes on the physical media. In particular, writes must be done in order, like
shingles, and the writes destroy the next track. As discussed, the value of a disk comes not just
from its capacity, but also the number of useful I/O requests it can service. Because of the write
restrictions imposed by SMR, when data is deleted, that deleted capacity cannot be reused
until the system copies the remaining live data in that SMR zone to another part of the disk, a
form of garbage collection (GC).

Footnote 3: Google, Intel, and others presented "Future of Power Efficiency and Technology for
Green Computing" during DesignCon 2016 to advocate for 48 V server designs [9].

Some data has lifetimes that are more predictable, which allows the storage system using a
host-managed SMR drive to reduce the GC overhead by grouping data with the same predicted
lifetime into an SMR zone. Without this, the GC overhead in terms of both the IOPS and
capacity becomes significant. Storing only SMR-friendly data on the SMR disks could mitigate
the problem, but this data is generally colder than the rest of the data, which means some of the
IOPS on the SMR disks will be wasted. Worse, removing the SMR-friendly data concentrates
the IOPS requirements for the remaining hotter data on the conventional spindles (conventional
magnetic recording or CMR), which increases the IOPS needed on the CMR disks. More
broadly, SMR drives reveal a bias towards GB/$ improvements over IOPS/GB improvements.

One way of addressing this is to mix CMR and SMR technologies in a single, hybrid hard drive.
This better mixes hot and cold data on the same disk and increases the chance of effectively
using both the capacity and the IOPS. In particular, this allows new data to be stored in CMR,
with SMR space used for 1) new data known to be long-lived or to have a predictable lifetime
(when initially written), or 2) older data stored in CMR that has aged to the point where it is
presumed to be long-lived and can be moved to SMR.

The simplest implementation of a hybrid drive is to use a mix of platters and heads, with some
optimized for SMR, and the rest for CMR. This gives a permanent fixed ratio of CMR to SMR,
which may become suboptimal over time. Alternatively, if a CMR-optimized head is used for
both SMR and CMR recording, the outer tracks could be used for CMR, while the inner tracks
could be used for SMR. Note that the SMR capacity gain will be less in this case, but the CMR
portion of the disk will see IOPS improvements due to the lower average seek distance.

Another way to reduce GC overhead is by relaxing the write restrictions imposed by defining a
host-managed SMR abstraction. An example is Caveat Scriptor [11], which allows the host to
write randomly, knowing that it will destroy (specific) nearby data. The Caveat Scriptor proposal
also supports other more interesting write patterns, including a circular buffer, and effectively
allows the host to have dynamically resizable SMR zones. In the long term, an even more
aggressive approach would be to allow dynamically configurable CMR and SMR tracks, which
may be achievable with more advanced head-media interface (HMI) technologies.

Firmware Directions
Although the long term should include physical design changes, there are a wide range of
firmware-only changes that can be made in the short term using existing form factors. We thus
expect these to be the primary focus over the next several years, hopefully in parallel with
longer-term discussions about new physical designs.

4) Profiling Data [IOPS, tail latency]
An important goal, both short term and long term, is to enable us to understand what is going on
with the disk in terms of performance. This data would improve both the host software and the
design of disk trays.

One approach would be a profile data log implemented as either counters or a ring buffer in
RAM. This data can be read using either a new command or normal reads of magic block
numbers. The contents of the log would be important performance or other firmware
background events, such as rewrites for data integrity. The host would periodically read this
log, without any performance penalty from the disk (other than using some bandwidth). For
example, when the host noticed a high-latency read, it could then read the log for diagnosis.

A related idea would be to increase the number of possible error codes returned by a read to
enable fine-grain explanations of what caused a delay. In most cases, such reads return correct
data but too slowly, and the "error" code reveals why.

At a minimum, we envision profiling returning the following information:
- time spent seeking (including rotational delay) for a host-initiated read,
- time spent transferring data for a host-initiated read,
- time spent in error recovery for a host-initiated read.
Likewise, profile data should include counters for host-initiated writes and for drive-initiated
reads and writes. We also want drives to accumulate separately the time for major background
activities (e.g. background media scans, track rewrites).
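
A host-side view of such a profile record might look like the sketch below. The record layout is purely hypothetical: the paper lists which times should be reported but does not define a format, so `HostReadProfile` and its field names are our own.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HostReadProfile:
    """Hypothetical per-read profile entry, all times in microseconds."""
    seek_us: int             # seeking, including rotational delay
    transfer_us: int         # moving data off the media
    error_recovery_us: int   # retries, re-reads, ECC recovery

    def total_us(self) -> int:
        return self.seek_us + self.transfer_us + self.error_recovery_us

    def dominant_phase(self) -> str:
        """Name the phase that contributed most to this read's latency."""
        phases = {
            "seek": self.seek_us,
            "transfer": self.transfer_us,
            "error_recovery": self.error_recovery_us,
        }
        return max(phases, key=phases.get)

if __name__ == "__main__":
    slow_read = HostReadProfile(seek_us=8_000, transfer_us=1_200, error_recovery_us=45_000)
    print(slow_read.total_us(), slow_read.dominant_phase())
```

With a breakdown like this, a host that sees a slow read can tell queueing or seek pressure (which it can schedule around) from media trouble (which may call for re-replication).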

5) Host-Managed Read Retries [tail latency]
The goal is to allow for different policies for reads. High-end disks allow some management of
reads at a global level, but the goal here is to be able to change the read policy on each read,
which requires an API change. The obvious default policy would be the current one, "really try
hard", in which the disk goes to great lengths to complete the read. However, most reads would
be issued with a "limited retry" policy, which returns an error to the host (quickly) if a few initial
attempts at reading return bad data. We prefer this approach for tail latency and it fits well with
the parallel reads and cancellations mentioned above (and in more detail below). In addition,
after a "limited retry" read fails, the host still has the option to use the "really try hard" read for
the same data.
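
The host-side logic this enables can be sketched as follows. The per-read policy flag does not exist in current drive interfaces, so `RetryPolicy`, `FakeDisk` and `host_read` are hypothetical stand-ins; the fake disk simply fails limited-retry reads of blocks marked as marginal.

```python
from enum import Enum
from typing import Dict, Iterable, List, Tuple

class RetryPolicy(Enum):
    LIMITED = "limited_retry"      # fail fast after a few attempts
    TRY_HARD = "really_try_hard"   # today's default: go to great lengths

class ReadError(Exception):
    pass

class FakeDisk:
    """Toy disk: marginal blocks are readable only under TRY_HARD."""
    def __init__(self, name: str, data: Dict[int, bytes], marginal: Iterable[int] = ()):
        self.name = name
        self._data = dict(data)
        self._marginal = set(marginal)

    def read(self, block: int, policy: RetryPolicy) -> bytes:
        if block not in self._data:
            raise ReadError(f"{self.name}: block {block} not present")
        if block in self._marginal and policy is RetryPolicy.LIMITED:
            raise ReadError(f"{self.name}: block {block} needs more retries")
        return self._data[block]

def host_read(replicas: List[FakeDisk], block: int) -> Tuple[bytes, List[Tuple[str, RetryPolicy]]]:
    """Try every replica with LIMITED first; only then pay for TRY_HARD."""
    attempts: List[Tuple[str, RetryPolicy]] = []
    for policy in (RetryPolicy.LIMITED, RetryPolicy.TRY_HARD):
        for disk in replicas:
            attempts.append((disk.name, policy))
            try:
                return disk.read(block, policy), attempts
            except ReadError:
                continue
    raise ReadError(f"block {block} unreadable on all replicas")

if __name__ == "__main__":
    d1 = FakeDisk("d1", {7: b"x"}, marginal=[7])
    d2 = FakeDisk("d2", {7: b"x"})
    print(host_read([d1, d2], 7))
```

The expensive retry is reached only when every copy has failed the cheap one, so the common case stays fast and the rare case remains recoverable.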

6) Target Error Rate [capacity, tail latency]
From the view of a single server, it is very important that disk drives do not lose data. Disk
drives thus promise incredibly low loss rates, typically losing 1 bit in 10^15 or less. However, in
the data center context, durable data must already be spread across multiple disks and typically
multiple locations, so that disasters are unlikely to destroy all copies of the data. High availability
also implies the need for multiple copies.

At the same time, current drives make many sacrifices to achieve these low error rates. They
sacrifice capacity by limiting how tight the areal density can be and by using more extensive
coding, and they worsen tail latency by requiring more background repair processes that
interfere with the normal performance of the disk.

Overall, a higher target error rate would have little effect on the overall durability, while at the
same time enabling higher capacity and fewer repair operations. It might be possible to
eliminate repair operations inside the disk altogether, although the drive still must enable
detection of errors (and possibly predict risk of errors) so that higher-level systems can initiate
repair and/or shift responsibility for that data to another drive.

7) Background Tasks and Background Management APIs [tail latency]
The disk uses background tasks to fix or prevent problems. One example is refreshing tracks
for data integrity, more commonly known as ATI mitigation (for adjacent track interference).
Background tasks damage tail latency severely, primarily because they can impact important
latency-sensitive reads.

Since we cannot remove the need for background tasks altogether, we need to explore
mechanisms to better control the timing. Our best suggestion so far is the following contract:
a) Most background tasks should be preemptible.
b) There are no non-preemptible background tasks without a timer. The tasks can
   potentially be grouped into "urgent" (needs to be executed relatively quickly) vs.
   "non-urgent" (can be executed less quickly).
c) The host will periodically issue a command that suggests when the disk drive can do
   background work. This could be a repair for a single track, or for up to k tracks when
   we believe there is more time. The latter allows for better seek ordering.
d) The disk will have an estimate and reporting mechanism for how much background work
   is piling up.
e) When the timer expires, the disk can also do maintenance, regardless of tail latency. The
   host should not allow this to happen in general. It might also make sense to have the
   disk enter maintenance mode any time the refresh count is above some high watermark
   (and stop when the count gets below some low watermark). It would still process
   commands, but with no promises on tail latency. The host should have a way to tell that
   the disk is in this mode.
f) If new commands are not arriving, something may be wrong with the host and thus doing
   all the maintenance work now is a good idea. This implies some kind of idle timer that
   initiates maintenance mode. The disk stays in that mode until a command arrives.
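
Items (e) and (f) of the contract describe a small state machine, sketched below. This is one reading of the text rather than a specification: the class name, thresholds and the `update` interface are hypothetical, and the per-task timers of item (b) are left out.

```python
class MaintenanceController:
    """Toy model of when a disk is in maintenance mode under items (e) and (f)."""

    def __init__(self, high_watermark: int, low_watermark: int, idle_timeout_s: float):
        if low_watermark >= high_watermark:
            raise ValueError("low watermark must be below high watermark")
        self.high = high_watermark
        self.low = low_watermark
        self.idle_timeout_s = idle_timeout_s
        self._backlog_mode = False   # latched between the two watermarks

    def update(self, pending_refreshes: int, idle_seconds: float) -> bool:
        """Return True if the disk should be in maintenance mode right now.

        pending_refreshes: the backlog estimate the disk reports (item d).
        idle_seconds: time since the last host command; 0 when one just arrived.
        """
        if pending_refreshes > self.high:
            self._backlog_mode = True        # item (e): enter above the high mark
        elif pending_refreshes < self.low:
            self._backlog_mode = False       # item (e): leave below the low mark
        idle_mode = idle_seconds >= self.idle_timeout_s   # item (f): idle timer
        return self._backlog_mode or idle_mode

if __name__ == "__main__":
    ctl = MaintenanceController(high_watermark=100, low_watermark=20, idle_timeout_s=30.0)
    for backlog, idle in [(50, 1), (101, 1), (50, 1), (19, 1), (5, 30), (5, 0)]:
        print(backlog, idle, ctl.update(backlog, idle))
```

The two watermarks give hysteresis, so a disk hovering near the threshold does not flap in and out of the mode, and the idle path ends as soon as a command arrives, as item (f) requires.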

There are some problems with this contract as it stands that merit discussion. First, a bad
pattern, such as overwriting the same track many times in a row, may require background work
that does not happen in time. This will result in the background task being escalated from the
non-urgent to the urgent bucket. If the host does not allow the disk to initiate the background
task while in the urgent bucket for too long, then data may potentially be lost or corrupted.

One option would be to allow the data loss, which might be fine depending on the frequency.
When the refresh does finally occur, it should be clear whether data was lost or not, and we can
return an error at that time.

Another option would be to fail a new write that is about to cause excess interference with a
nearby track, with an error that means "urgent refresh required" (for that area of the disk).
Completing the write a little later is fine if rare. Plus, this kind of error should itself have low
latency. Even better might be to do the refresh instead of the write (and return an error), or do
the refresh and the write, since they are physically close. The latter is bad only if the latency is
too high, since we are starving reads during the double write. An advantage of this approach is
that it handles pathological cases (better than timers), such as the host repeatedly writing the
same track. The best approach here remains an open question.

Background scanning [tail latency]
Many disks do media scanning across all sectors, whether or not they're allocated, as a way to
detect latent errors. Alternatively, we can do background scanning on the host side, but with a
more end-to-end approach. However, the primary source of latent errors is the disk itself, due
to physically nearby writes in the form of ATI (adjacent track interference) or FTI (far track
interference), and thus the disk has a better estimate of which tracks or sectors are most at risk.

One approach is to use a mix of host- and disk-initiated background scans, assuming we can
control the timing as discussed above. In particular, we could make the disk responsible for
estimation of risk and then report to the host the tracks in need of refresh writes. This leaves the
scheduling in the hands of the host, which provides better control over the timing and thus helps
tail latency. Similarly, if the disk detects a partial (or complete) error it could notify the host, even
if the errors were corrected via coding. Such notifications reduce the window of vulnerability to
losing this copy; i.e., higher-level repairs can be made before there is a problem.

For LBA ranges that are not specifically called out by the disk as being at risk, the host could
promise to issue a special read command (or overwrite) to every allocated sector every X days
(and have some way of knowing what the manufacturer thinks is the right value of X).

For LBA ranges that are not scanned by the host, the disk should continue to ensure that they
are protected using its own BER guarantee scheme and whatever scanning that implies.
Auditing for host-initiated scans can be done and reported, which would be useful, as long as
the number of distinct LBA ranges (or bands) is relatively small.

8) Flexible Capacity [TCO, capacity]
Currently, drives come in fixed capacities (such as X TB). There are three ways in which
moving to flexible capacity should reduce TCO.

First, in order to provide marketable drives with fixed numbers of TBs of capacity, vendors must
provision platters with somewhat larger capacities in order to accommodate an unknown
number of defects. Drives that have average or below-average numbers of defects will
therefore be remanufactured, or sold as having a smaller capacity than they can actually
support.

Second, drives have numerous reserved sectors for reallocation, cache, or other uses. Some of
those reserved sectors could be exposed to store data early in the life of the drive, provided that
the drive is able to reclaim them at a slow rate during its lifetime.

Third, drive head failures could be handled by mapping out the failed head (and recovering the
lost data from other drives), although this likely requires reinitializing the drive to achieve
contiguous LBA numbering. Continuing to use the drive after one or more head failures would
extend the lifetime of the drive.

9) Larger Sector Sizes [capacity]
Disk drive error correcting codes (ECCs) are able to tolerate a given fraction of errors with a
smaller fraction of check bits as the codeword size is increased. However, the codeword cannot
exceed the drive's sector size (unless multiple physical sectors are read per logical sector read).
As a result, drive vendors have increased the drive sector size from 512 B to 4 KiB and reduced
the ECC overhead. Since storage services typically write their own distributed file systems,
there may be an advantage to further increasing this to 64 KiB or larger.

Furthermore, since host software typically adds CRC to data written on disks, having SATA
drives that support extended sector sizes such as 4k+16B or 64k+256B would be more
efficient overall, and would allow the ECC to be exposed to the host for end-to-end
checksumming. (Some SCSI disks already support this feature.)

10) Optimized Queuing Management [IOPS]
Native command queueing (NCQ) allows IOs to be queued at the disk instead of the host,
enabling higher throughput to be achieved through real-time geometry optimizations. Currently,
NCQ by itself does not satisfy our need (nor does TCQ), as it does not convey the host's per-I/O
requirements to the disk to allow optimal I/O reordering. There are a number of traffic classes
mixed with a quota management system in a shared distributed file system. Any attempt to
increase throughput by reordering requests has to address two key challenges.

First, it is necessary to meet the latency targets for in-quota low-latency and throughput reads.
Second, it is necessary to meet per-user throughput targets for in-quota reads and writes.

Given the state of NCQ and its expected evolution in the future, it seems natural to have the
host manage the quota system through throttling, and have the disk manage the per-traffic-class
throughput through informed I/O reordering. A much more detailed investigation and a
well-thought-out design is needed in this area, which is beyond the scope of this document. The
disk scheduler will need to look very much like a real-time operating system (RTOS), capable of
handling very complex per-I/O information, matching that against real-time head and
media-relative positioning, and allowing scheduling interrupts to happen instantaneously as they
arise. The disk scheduler will also need to support the queueing and proper scheduling of
system management commands, such as periodic environmental monitoring commands.

Some potential APIs needed for the host in this area might include: fine-grained per-I/O priority
level information, fine-grained per-I/O deadline information, per-I/O lightweight cancellation,
queue barrier insertion, head-of-queue insertion, head-of-priority-group insertion, and general
queue management information (such as pulling the detailed state of the current queue and
expected execution ordering). Some of the ideas in NCQ ICC can potentially be used for this
purpose, but NCQ ICC itself is often considered to be too complex to be fully implemented at
the HDD firmware level, and is not fundamentally designed to support the needs of a distributed,
data-center-scale file system.
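
A few of these hooks (per-I/O priority, per-I/O deadline, lightweight cancellation, head-of-queue insertion) can be illustrated with a toy queue. This is a sketch of the interface shape only: the class and method names are hypothetical, lower priority numbers are assumed to be more urgent, and the ordering ignores head position and rotational geometry, which a real disk scheduler would weigh heavily.

```python
import heapq
import itertools
from typing import List, Optional, Set, Tuple

class IoQueue:
    """Toy command queue ordered by (head-of-queue flag, priority, deadline, arrival)."""

    def __init__(self) -> None:
        self._heap: List[Tuple[Tuple[int, int, int, int], str]] = []
        self._arrival = itertools.count()
        self._cancelled: Set[str] = set()

    def submit(self, io_id: str, priority: int, deadline_ms: int,
               head_of_queue: bool = False) -> None:
        # Assumption: a lower priority number means a more urgent request.
        key = (0 if head_of_queue else 1, priority, deadline_ms, next(self._arrival))
        heapq.heappush(self._heap, (key, io_id))

    def cancel(self, io_id: str) -> None:
        # Lightweight cancellation: mark now, discard lazily when dequeued.
        self._cancelled.add(io_id)

    def next_io(self) -> Optional[str]:
        while self._heap:
            _, io_id = heapq.heappop(self._heap)
            if io_id in self._cancelled:
                self._cancelled.discard(io_id)
                continue
            return io_id
        return None

if __name__ == "__main__":
    q = IoQueue()
    q.submit("throughput-1", priority=2, deadline_ms=500)
    q.submit("lowlat-1", priority=0, deadline_ms=20)
    q.submit("lowlat-2", priority=0, deadline_ms=10)
    q.submit("mgmt-temp-poll", priority=5, deadline_ms=1000, head_of_queue=True)
    q.cancel("lowlat-2")   # e.g. another replica already answered
    print([q.next_io() for _ in range(4)])
```

Lazy cancellation keeps `cancel` cheap, which matters when the host cancels the losing half of many duplicated reads.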

A related use would be to use NCQ commands to issue a "giant" streaming read. The idea is
that if we know we want to stream many megabytes over the next k milliseconds, we would
issue a series of low-priority reads, with the more near-term reads having higher priority. The
goal is a simple way to interleave blocks for the stream with regular reads/writes. It would also
be good to have cache controls for these blocks: once a (streaming) read is delivered to the
host, it can be removed from the cache, as there is no expected temporal locality.

Summary
Disks are the central element of Cloud-based storage, whose demand far outpaces the
considerable rate of innovation in disks. Exponential growth in demand implies that most future
disks will be in data centers and thus part of a large collection of disks. The collection view,
along with a focus on tail latency and security, places new and different requirements on disks.

We hope this is the beginning of a new era of "data center" disks and a new broad and open
discussion about how to evolve disks for data centers. The ideas presented here provide some
guidance and some options, but we believe the best solutions will come from the combined
efforts of industry, academia and other large customers.

Bibliography
[1] J. H. Saltzer, D. P. Reed and D. D. Clark. "End-to-End Arguments in System Design". In:
Proceedings of the Second International Conference on Distributed Computing Systems. April
1981. IEEE Computer Society. See Wikipedia: https://en.wikipedia.org/wiki/Endtoend_principle
[2] J. Dean and L. Barroso. "The Tail at Scale". Communications of the ACM, Vol. 56, No. 2,
pp. 74-80, February 2013.
[3] S. Malenkovich. "Indestructible malware by Equation cyberspies is out there, but don't panic
(yet)". Kaspersky Blog. February 17, 2015.
https://blog.kaspersky.com/equationhddmalware/7623/
[4] K. Zetter. "How the NSA's Firmware Hacking Works and Why It's So Unsettling". Wired online,
February 2015. http://www.wired.com/2015/02/nsafirmwarehacking/
[5] Trusted Computing Group. TCG Storage Security Subsystem Class: Opal Specification Version
2.00, Revision 1.00, February 24, 2012.
https://www.trustedcomputinggroup.org/files/resource_files/B15F1F8F1A4BB294D03F09D5122B21F6/Opal_SSC_2%2000_rev1%2000_final.pdf
[6] Trusted Computing Group. TCG Storage Security Subsystem Class: Enterprise Specification
Version 1.0, Revision 1.0, January 27, 2009.
http://www.trustedcomputinggroup.org/files/resource_files/87FE68471D093519ADF6E65680700A9F/TCG_SWG_SSC_Enterprisev1r1090120.pdf
Also presentation:
https://www.trustedcomputinggroup.org/files/resource_files/0B968DDB1A4BB294D02FF4F402F72707/SWG_TCG_Enterprise%20Introduction_Sept2010.pdf
[7] C. Kozierok. "Single vs. Multiple Actuators" in The PC Guide (http://www.PCGuide.com). April
2001. http://www.pcguide.com/ref/hdd/op/actMultiplec.html
[8] Wikipedia. "Conner Peripherals", see "Chinook" dual-actuator drive.
https://en.wikipedia.org/wiki/Conner_Peripherals#Performance_issues_and_the_.22Chinook.22_dualactuator_drive
[9] R. Merritt. "Google, Intel Prep 48V Servers". EE Times, January 21, 2016.
http://www.eetimes.com/document.asp?doc_id=1328741
[10] A. Anand, S. Sen, A. Krioukov, F. Popovici, A. Akella, A. Arpaci-Dusseau, R. Arpaci-Dusseau,
S. Banerjee. "Avoiding File System Micromanagement with Range Writes". Proceedings of the 8th
USENIX Conference on Operating Systems Design and Implementation (OSDI 2008). Pp.
161-176.
[11] T. Feldman and G. Gibson. "Shingled Magnetic Recording: Areal Density Increase Requires New
Data Management". USENIX ;login:, Vol. 30, No. 3, June 2013.
https://www.cs.cmu.edu/~garth/papers/05_feldman_022030_final.pdf
