White paper for FAST 2016
Eric Brewer, Lawrence Ying,
Lawrence Greenfield, Robert Cypher, and Theodore Ts'o
Google, Inc.
February 23, 2016
Version 1.1, revised February 29, 2016
Online at:
http://research.google.com/pubs/pub44830.html
Copyright 2016 Google Inc. All rights reserved.
Google makes no warranties concerning the information contained in this document.
Google Inc
Abstract
Disks form the central element of Cloud-based storage, whose demand far outpaces the
considerable rate of innovation in disks. Exponential growth in demand, already in progress for
15+ years, implies that most future disks will be in data centers and thus part of a large
collection of disks. We describe the "collection view" of disks and how it and the focus on tail
latency, driven by live services, place new and different requirements on disks. Beyond defining
key metrics for data-center disks, we explore a range of new physical design options and
changes to firmware that could improve these metrics.
We hope this is the beginning of a new era of "data center" disks and a new broad and open
discussion about how to evolve disks for data centers. The ideas presented here provide some
guidance and some options, but we believe the best solutions will come from the combined
efforts of industry, academia and other large customers.
Table of Contents
Introduction
Physical Design Options
1) Alternative Form Factors [TCO, IOPS, capacity]
Parallel Accesses [IOPS]
Multi-Disk Packages [TCO]
Power Delivery [TCO]
2) Cache Memory [TCO, tail latency]
3) Optimized SMR Implementation [capacity]
Firmware Directions
4) Profiling Data [IOPS, tail latency]
5) Host-Managed Read Retries [tail latency]
6) Target Error Rate [capacity, tail latency]
7) Background Tasks and Background Management APIs [tail latency]
Background scanning [tail latency]
8) Flexible Capacity [TCO, capacity]
9) Larger Sector Sizes [capacity]
10) Optimized Queuing Management [IOPS]
Summary
Bibliography
Acknowledgements: Joe Heinrich, Sean Quinlan, Andrew Fikes, Rob Ewaschuk, Marilee
Schultz, Urs Hölzle, Brent Welch, Remzi Arpaci-Dusseau (University of Wisconsin), Garth
Gibson (Carnegie Mellon University), Rick Boyle, Kaushal Upadhyaya, Maire Mahony, Danner
Stodolsky, and Florentina Popovici
Introduction
The rise of portable devices and services in the Cloud has the consequence that (spinning) hard
disks will be deployed primarily as part of large storage services housed in data centers. Such
services are already the fastest growing market for disks and will be the majority market in the
near future. For example, for YouTube alone, users upload over 400 hours of video every
minute, which at one gigabyte per hour requires a petabyte (1M GB) of new storage every day.
As shown in the graph, this continues to grow exponentially, with a 10x increase every five
years.
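A rough check of these figures, using only the numbers quoted above, can be sketched as follows. Counting each uploaded hour once, the raw product comes to a little under 0.6 PB per day, the same order of magnitude as the petabyte cited, and a 10x increase every five years corresponds to growth of roughly 58% per year. The variable names are illustrative only.

```python
# Back-of-the-envelope check of the upload figures quoted in the text.
HOURS_UPLOADED_PER_MINUTE = 400   # from the text
GB_PER_HOUR_OF_VIDEO = 1          # from the text
MINUTES_PER_DAY = 24 * 60

daily_gb = HOURS_UPLOADED_PER_MINUTE * GB_PER_HOUR_OF_VIDEO * MINUTES_PER_DAY
daily_pb = daily_gb / 1_000_000   # the text equates 1 PB with 1M GB

# A 10x increase every five years corresponds to this yearly growth factor.
annual_growth = 10 ** (1 / 5)

print(f"{daily_gb:,} GB/day = {daily_pb:.2f} PB/day")
print(f"annual growth factor = {annual_growth:.2f}x")
```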
But the current generation of modern disks, often called "nearline enterprise" disks, is not
optimized for this new use case, and instead is designed around the needs of traditional
servers. We believe it is time to develop, jointly with the industry and academia, a new line of
disks that are specifically designed for large-scale data centers and services.
This paper introduces the concept of "data center" disks and describes the key goals and
metrics for these disks. The key differences fall into three broad categories: 1) the "collection
view" in which we focus on aggregate properties of a large collection of disks, 2) a focus on tail
latency derived from the use of storage for live services, and 3) variations in security
requirements that stem from storing others' data. After covering each in more detail, we explore
a range of future directions to meet these goals.
The Collection View
The essential difference: disks for data centers are always part of a large collection of disks;
designs thus optimize the collection with less regard for individual disks. In this sense, current
enterprise disks represent a local maximum. The real target is the global maximum using the
following five key metrics:
1. Higher I/Os per second (IOPS), typically limited by seeks,
2. Higher capacity, in GB,
3. Lower tail latency, covered below,
4. Meet security requirements, and
5. Lower total cost of ownership (TCO).
At the data center level, the TCO of a disk-based storage solution is dominated by the disk
acquisition cost, the disk power cost, the disk connection overhead (disk slot, compute, and
network), and the repairs and maintenance overhead.
Achieving durability in practice requires storing valuable data on multiple drives. Within a data
center, high availability in the presence of host failures also requires storing data on multiple
disks, even if the disks were perfect and failure-free. Tolerance to disasters and other rare
events requires replication of data to multiple independent locations. Although it is a goal for
disks to provide durability, they can at best be only part of the solution and should avoid
extensive optimization to avoid losing data.
This is a variation of the "end-to-end" argument: avoid doing in lower layers what you have to do
anyway in upper layers [1]. In particular, since data of value is never just on one disk, the bit
error rate (BER) for a single disk could actually be orders of magnitude higher (i.e. lose more
bits) than the current target of 1 in 10^15, assuming that we can trade off that error rate (at a fixed
TCO) for something else, such as capacity or better tail latency.
The collection view also implies higher-level maintenance of the bits, including background
checksumming to detect latent errors, data rebalancing for more even use of disks (including
new disks), as well as data replication and reconstruction. Modern disks do variations of these
internally, which is partially redundant, and a single disk by itself cannot always do them as well.
At the same time, the disk contains extensive knowledge about the low-level details, which in
general favors new APIs that enable better cooperation between the disk and higher-level
systems.
A third aspect of the collection view is that we optimize for the overall balance of IOPS and
capacity, using a carefully chosen mix of drives that changes over time. We select new disks so
that the marginal IOPS and capacity added bring us closer to our overall goals for the collection.
Workload changes, such as better use of SSDs or RAM, can shift the aggregate targets.
An obvious question is why are we talking about spinning disks at all, rather than SSDs, which
have higher IOPS and are the "future" of storage. The root reason is that the cost per GB
remains too high, and more importantly that the growth rates in capacity/$ between disks and
SSDs are relatively close (at least for SSDs that have sufficient numbers of program-erase
cycles to use in data centers), so that cost will not change enough in the coming decade. We do
make extensive use of SSDs, but primarily for high-performance workloads and caching, and
this helps disks by shifting seeks to SSDs. However, SSDs are really outside the scope of this
paper, which is about disks and their role in very large scale storage systems.
The following graph shows one way to think about this: the letters represent various models of
disks in terms of their $/GB (in TCO) versus their IOPS per GB (see footnote 1). The ideal drive
would be down and to the right. Newer drives of the same form factor tend to be lower (less
$/GB), but towards the left (since they get bigger more than they get faster). Changing to smaller
platter size is one way to move toward the right. The aggregate properties of a mix of any two
drives form a line between the points, for example the line between C and D. In general, such
lines must go up and to the right; if they go down and to the right, the drive on the right is strictly
superior and there is no need to mix them (e.g. F dominates E). The optimal mix for a certain
IOPS/GB target can be met by combining drives along the bottom of the convex hull, shown as
a set of red lines.
This works well as long as we can get most of both the capacity and IOPS out of each drive,
which is not easy. The industry is relatively good at improving $/GB, but less so at IOPS/GB,
and this distinction affects most of the suggested directions below.
Footnote 1: The letters do not represent actual disk models; the graph represents how we think
about the space. An earlier version of this graph had the y-axis as GB/$ (rather than $/GB),
which is more intuitive, but the convex hull property does not hold in that case. This updated
version is what we use internally.
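The convex-hull reasoning above can be sketched in a few lines. This is a minimal illustration, not the tooling used internally: the drive letters, their coordinates, and the function names `lower_hull` and `cheapest_mix` are all hypothetical. Because both axes are per-GB quantities, a mix that takes a fraction w of its capacity from one drive lies on the straight line between the two points, which is why linear interpolation along the bottom of the hull gives the cost of the mix.

```python
from typing import Dict, List, Tuple

Drive = Tuple[str, float, float]  # (name, IOPS per GB, $ per GB); values are made up

def _cross(o: Drive, a: Drive, b: Drive) -> float:
    """> 0 when o -> a -> b turns counter-clockwise in the (IOPS/GB, $/GB) plane."""
    return (a[1] - o[1]) * (b[2] - o[2]) - (a[2] - o[2]) * (b[1] - o[1])

def lower_hull(drives: List[Drive]) -> List[Drive]:
    """Bottom edge of the convex hull (monotone-chain construction)."""
    hull: List[Drive] = []
    for d in sorted(drives, key=lambda d: (d[1], d[2])):
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], d) <= 0:
            hull.pop()          # the middle point lies on or above the new edge
        hull.append(d)
    return hull

def cheapest_mix(drives: List[Drive], target: float) -> Tuple[Dict[str, float], float]:
    """Capacity fractions and $/GB of the cheapest mix reaching `target` IOPS/GB."""
    hull = lower_hull(drives)
    # Edges left of the cheapest drive slope down and to the right, so the drive
    # on the right dominates; only the rising part of the hull is worth mixing.
    start = min(range(len(hull)), key=lambda i: hull[i][2])
    frontier = hull[start:]
    if target <= frontier[0][1]:
        return {frontier[0][0]: 1.0}, frontier[0][2]
    for left, right in zip(frontier, frontier[1:]):
        if left[1] <= target <= right[1]:
            w = (target - left[1]) / (right[1] - left[1])   # capacity share of `right`
            return {left[0]: 1 - w, right[0]: w}, (1 - w) * left[2] + w * right[2]
    raise ValueError("target exceeds the IOPS/GB of the fastest drive")

DRIVES = [("C", 0.02, 0.030), ("E", 0.03, 0.045), ("F", 0.04, 0.036), ("D", 0.08, 0.060)]
mix, cost = cheapest_mix(DRIVES, target=0.06)
print(mix, round(cost, 4))
```

In this made-up data set E sits above the C-F edge, so it never appears in an optimal mix; the target of 0.06 IOPS/GB is met by splitting capacity evenly between F and D.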
A related consequence is that we do not care about precise capacity, e.g. exactly 6 TB, or even
that the usable capacity of a drive be constant over time. For example, the loss of a head
should not prevent the use of the rest of the capacity and IOPS.
Services and Tail Latency
Data center disks require better control over tail latency, due to the focus on serving data for
remote users while they wait. Broadly speaking, we distinguish between live services, in which a
user is waiting for a response and low tail latency is critical, and throughput workloads, such as
MapReduce or Cloud Dataflow, in which performance is throughput oriented and the operations
are lower priority. We can often hide the latency of disk writes, so we focus on the tail latency of
reads, which we cannot hide in general (and which caching only improves slightly). Our primary
metric is thus some form of tail latency for reads, such as the time for the 99th percentile read;
i.e., 99% of reads are within this time bound.
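As a concrete illustration of the metric, the sketch below computes a nearest-rank percentile over a set of read latencies. The function name `tail_latency` and the sample values are illustrative assumptions; production monitoring would more likely use histograms than a full sort.

```python
import math

def tail_latency(latencies_ms, percentile=99.0):
    """Nearest-rank percentile: the smallest sample such that at least
    `percentile` percent of all samples are less than or equal to it."""
    if not latencies_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(latencies_ms)
    rank = math.ceil(percentile * len(ordered) / 100.0)
    return ordered[max(rank, 1) - 1]

# Hypothetical sample: most reads are fast, two are slow.
samples_ms = [10] * 98 + [50, 400]
mean_ms = sum(samples_ms) / len(samples_ms)
print(mean_ms, tail_latency(samples_ms, 50), tail_latency(samples_ms, 99))
```

The mean and the median of this sample say little about the slow reads, which is why the tail percentile is tracked separately.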
As with the collection view, the focus on tail latency has a range of consequences:
• We label our reads as "low latency" or "throughput" so that we can optimize them
  appropriately throughout the stack, and so we can monitor them independently. We also
  have other background best-effort workloads.
• We do admission control on low-latency reads by having a quota for them; otherwise, all
  reads would have to be "faster than average". Over-quota reads are not promised low
  latency.
• For better overall tail latency, we sometimes issue the same read multiple times to
  different disks and use the first one to return. This consumes extra resources, but is still
  sometimes worthwhile. Once we get a returned value, we sometimes cancel the other
  outstanding requests (to reduce the wasted work). This is pretty likely if one disk (or
  system) has it cached, but the others have queued up real reads. (See Dean and
  Barroso [2] for more information on tail latency and cancellation.)
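The duplicate-read-and-cancel pattern in the last item can be sketched with Python's asyncio. This is only a toy model under stated assumptions: `hedged_read` and `make_replica` are hypothetical names, the replicas are simulated with sleeps, and error handling for a failing first response is omitted.

```python
import asyncio

async def hedged_read(replicas, block_id):
    """Issue the same read to several replicas, return the first result,
    and cancel whatever is still outstanding."""
    tasks = [asyncio.create_task(read(block_id)) for read in replicas]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()                       # reduce the wasted work
    await asyncio.gather(*pending, return_exceptions=True)
    return next(iter(done)).result()

def make_replica(name, delay_s, cancel_log):
    """Simulated replica: answers after `delay_s` seconds unless cancelled."""
    async def read(block_id):
        try:
            await asyncio.sleep(delay_s)
            return f"{name}:{block_id}"
        except asyncio.CancelledError:
            cancel_log.append(f"{name} cancelled")
            raise
    return read

cancelled = []
replicas = [make_replica("cached", 0.01, cancelled),
            make_replica("queued", 0.5, cancelled)]
result = asyncio.run(hedged_read(replicas, 42))
print(result, cancelled)
```

Here the replica that has the block cached answers first and the read still queued on the other replica is cancelled before it completes.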
Security
Although most of this paper is about changes in prioritization due to a new use case, security
requirements fall in the "must have" category, at least over time.
The first general problem is that the size and complexity of the firmware inside a modern disk
lead to bugs, including security bugs that can be used to attack the disk or even the host.
Hard-disk firmware attacks are not only possible, but appear to have been used [3,4]. Solving
this is beyond the scope of this paper, but it is clear that it must be easier to assure correct
firmware and restrict unauthorized changes, and in the long term we must apply the full range of
hardening techniques already used in other systems. We approach this problem in the short
term by restricting physical access to the disks and by isolation of untrusted code from the host
OS (which has the power to reflash the disk firmware).
The second broad category is the need to ensure the data is encrypted at rest (i.e. the data on
a platter is encrypted in case a device is stolen or incorrectly reused), and that access to different
types of data is controlled through strict authentication. All modern enterprise disks are already
encrypted at rest today, but traditionally with a single key. A side effect of the collection view is
that we mix a wide variety of types of data from different customers and with a range of
restrictions on the same physical disks. Fine-grained access control, using different keys for
different areas of the disk, allows us to use different keys for different customers or restrictions,
and in general better implements the principle of least privilege.
The Trusted Computing Group (TCG) has developed some good solutions (TCG OPAL [5] and
Enterprise [6]), but the adoption of fine-grained access control is still in its infancy. Further, the
use of TCG, especially in SATA drives, has been quite limited and hence there is limited
enthusiasm so far for investing in this area. Our hope is that fine-grained access control will
become widely deployed, well implemented and tested, and viewed as a standard aspect of
best practices.
We will not dig deeper into security, but instead explore a variety of design directions in more
detail, broadly grouped into physical changes to disks and firmware changes. For each topic, we
label which of the four remaining key metrics it helps to address (TCO, tail latency, IOPS, capacity).
Physical Design Options
Standards last a very long time and thus should be developed thoughtfully and with broad
participation. These options are thus the beginning of a long conversation with many
constituents. But the shift in usage is real and the potential of new designs increases every
year, and thus some new design will make sense in the near future. These options are but one
set of proposals, and we know the industry and others will contribute many more.
1) Alternative Form Factors [TCO, IOPS, capacity]
The current 3.5" HDD geometry was adopted for historic reasons; its size was inherited from the
PC floppy disk. An alternative form factor should yield a better TCO overall. Changing the form
factor is a long-term process that requires a broad discussion, but we believe it should be
considered. Although we could spec our own form factor (with high volume), the underlying
issues extend beyond Google, and developing new solutions together will better serve the
whole industry, especially once standardized.
In terms of the platter size, we question whether 3.5" or 2.5" form factors are optimal points. A
larger platter size increases GB/$, but lowers IOPS/GB, while smaller platter size results in the
opposite. A similar argument applies to the common rotational speed of 7200 RPM, where
higher RPM (rotations per minute) increases IOPS/GB, but lowers GB/$, while a lower RPM
results in the opposite.
We propose increasing the allowable height ("Z height"). Current disks have a relatively small
fixed height: typically 1" for 3.5" disks and 15 mm maximum for 2.5" drives. Taller drives allow for
more platters per disk, which adds capacity, and amortizes the costs of packaging, the
printed-circuit board, and the drive motor/actuator. Given a fixed total capacity per disk, smaller
platters can yield smaller seek distances and higher RPM (due to platter stability), and thus
higher IOPS, but worse GB/$. The net result is a higher GB/$ for any specific IOPS/GB than
could be achieved by altering any other single aspect, such as platter sizes or RPM alone. It
may also be that a mix of different platter sizes (in different disks) provides the best aggregate
solution.
There is a range of secondary optimizations as well, some of which may be significant. These
include system-level thermal optimization, system-level vibration optimization, automation and
robotics handling optimization, system-level helium optimization (see footnote 2), and
system-level weight optimizations.
Footnote 2: Some disks use helium inside the drive to reduce seek time (due to its low density),
but it requires sealing the platter area, which would be easier with different dimensions.
Parallel Accesses [IOPS]
There are multiple ways to enable parallel accesses of two or more I/O streams simultaneously,
and some implementations benefit a particular workload more than another. Companies have
mostly avoided these designs in the past due to their extra cost, but as capacity continues to
outpace growth in IOPS, these options become more compelling over time. Here are four
options that could make sense in the future:
1) Two standard-sized actuator arm units, mounted diagonally from each other across the
   disk platter, each with heads that cover all platter surfaces. This implementation is the
   most expensive, but can benefit all workloads. Note that this has been implemented
   before [7,8], although perhaps before its time.
2) One actuator location, but two half-height actuator arm units, mounted on top of each
   other, each with heads that cover half of the platter surfaces. This implementation can
   benefit both multi-queue random access and single-queue sequential workloads.
3) One actuator arm unit, with a small dual-stage actuator design that allows two heads to
   track two platter surfaces simultaneously. This implementation can double sequential
   workload throughput, but does not help with random access.
4) One actuator arm with heads that can read two adjacent tracks per platter surface
   simultaneously. This implementation can double sequential workload throughput.
Although both sequential throughput and random access rates are important, random accesses
are particularly important in the disk collection model described here as many users share each
drive and some accesses have strict tail latency bounds. Furthermore, areal density
improvements have already resulted in more improvement in sequential throughput than in
random access performance.
Multi-Disk Packages [TCO]
Given we buy so many disks, it would make sense to buy groups of disks as one unit. Many
vendors sell disk systems, such as NAS devices, but such systems are too high level and do not
provide enough control. Instead we would group disks in order to 1) share larger caches
(covered below), 2) amortize fixed costs, and 3) improve power distribution (covered next). The
larger package might also help with vibration and yield (see Flexible Capacity below). The
package could still use one or more SATA interfaces, or could be changed to PCI-E (see
Cache Memory below).
It is not clear yet how many disks should be in a group. Larger groups have better amortization
and power savings, but also take out more data in a failure, which implies higher recovery costs.
A group typically also has a higher failure rate than its constituent disks, assuming a failure of
one disk requires replacing the group. Four disks might be a reasonable place to start.
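The effect of group size on failure rate can be made concrete with a small calculation. The sketch assumes independent disk failures and that any single failure retires the whole group; the 2% annual failure rate is a placeholder chosen for illustration, not a measured value, and `group_failure_rate` is a hypothetical helper.

```python
def group_failure_rate(disk_afr: float, disks_per_group: int) -> float:
    """Probability that at least one disk in the group fails within a year,
    assuming independent failures and that one failure retires the group."""
    return 1.0 - (1.0 - disk_afr) ** disks_per_group

ASSUMED_DISK_AFR = 0.02   # placeholder annual failure rate per disk

for n in (1, 2, 4, 8):
    print(n, round(group_failure_rate(ASSUMED_DISK_AFR, n), 4))
```

Under these assumptions the group failure rate grows almost linearly with group size for small per-disk rates, which is the cost to weigh against the amortization benefits.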
Power Delivery [TCO]
The current power delivery from the server to the disk is typically via a connector that delivers
some of 12V, 5V and 3.3V (all DC). Each voltage has a current spec that the server must meet,
and overall there is considerable mismatch between what is needed to support a generic disk
(all voltages at full current) and what is needed for any specific model.
Instead, we propose a single 12V DC supply with a max current (or alternative voltages such
as 48V; see footnote 3). This can be useful in some modern designs where the overall storage
system does not even use 5V or 3.3V anywhere else.
2) Cache Memory [TCO, tail latency]
Currently, we connect many disks to the same host. Each disk has its own read and write
caches of considerable size (30-100 MB). From a TCO perspective it makes more sense to
move RAM caching from the disks to the host or tray, as a single big cache will be both cheaper
and more effective. It is more effective in part due to being larger and shared, but also due to
the fact that we can more easily improve the caching policy over time.
That being said, based on traces we know that the cache hit rates in the disks are relatively
high, despite the fact that we already have caching in multiple higher layers. This is surprising
and still needs more exploration, but we suspect the main reason for this is the effectiveness of
read-ahead (and read-behind). If so, most of the RAM actually in the disk should be used for
this purpose, in addition to write buffering. Furthermore, the disk can do read-ahead and
read-behind when it's "free" even when there are pending requests, since those requests might
require some rotational time to service. The host does not have visibility into when it is free to
extend a read.
One possibility longer term is to keep the same interface and the same fixed amount of RAM
per disk, but have the disks use host memory via PCI-E, without any other changes to the disk
caching algorithms. Host memory is cheaper per MB, so this should be a cost savings. Host
memory is also farther away, so perhaps this would be a performance problem. This is a pretty
big change and would only make sense in the context of moving to PCI-E, perhaps as part of
supporting multi-disk packages, where the extra bandwidth matters and the cost is easier to
amortize.
We can try to combine the disk's ability to read data without performance cost and the host's
ability to better manage the lifetime of speculative reads by having the host give reads with
flexible bounds: "I want at least 12 KB starting at this LBA, and I'm willing to accept up to 900
KB if it doesn't delay servicing another I/O". This API change for variable reads is similar in
spirit to range writes in which the disk has multiple options and picks one based on its extra
internal knowledge [10].
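One possible shape for such a variable read is sketched below. Nothing here is an existing disk command: `FlexibleRead`, `bytes_to_return`, the transfer rate, and the idle-time estimate are all hypothetical, and a real drive would base the decision on rotational position and its own queue state rather than on a single idle-time number.

```python
from dataclasses import dataclass

@dataclass
class FlexibleRead:
    lba: int
    min_bytes: int   # the host needs at least this much
    max_bytes: int   # the host will accept up to this much if it costs nothing

def bytes_to_return(req: FlexibleRead, bytes_per_ms: float, idle_ms: float) -> int:
    """Disk-side decision: always deliver min_bytes, then keep reading only for
    as long as the head would otherwise sit idle before the next queued I/O."""
    free_bytes = int(bytes_per_ms * idle_ms)
    return max(req.min_bytes, min(req.max_bytes, free_bytes))

req = FlexibleRead(lba=123456, min_bytes=12 * 1024, max_bytes=900 * 1024)
RATE = 150_000   # assumed sequential transfer rate in bytes per millisecond

print(bytes_to_return(req, RATE, idle_ms=0.0))    # busy disk: minimum only
print(bytes_to_return(req, RATE, idle_ms=2.0))    # some slack: partial extension
print(bytes_to_return(req, RATE, idle_ms=10.0))   # idle disk: capped at the maximum
```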
3) Optimized SMR Implementation [capacity]
Shingled magnetic recording (SMR) allows a hard disk to achieve a higher areal density
(typically 10-20% today, with potentially higher capacity gains in the future) by limiting the ability
to perform random writes on the physical media. In particular writes must be done in order, like
shingles, and the writes destroy the next track. As discussed, the value of a disk comes not just
from its capacity, but also the number of useful I/O requests it can service. Because of the write
restrictions imposed by SMR, when data is deleted, that deleted capacity cannot be reused
until the system copies the remaining live data in that SMR zone to another part of the disk, a
form of garbage collection (GC).
Footnote 3: Google, Intel, and others presented "Future of Power Efficiency and Technology for
Green Computing" during DesignCon 2016 to advocate for 48V server designs [9].
Some data has lifetimes that are more predictable, which allows the storage system using a
host-managed SMR drive to reduce the GC overhead by grouping data with the same predicted
lifetime into an SMR zone. Without this, the GC overhead in terms of both the IOPS and
capacity becomes significant. Storing only SMR-friendly data on the SMR disks could mitigate
the problem, but this data is generally colder than the rest of the data, which means some of the
IOPS on the SMR disks will be wasted. Worse, removing the SMR-friendly data concentrates
the IOPS requirements for the remaining hotter data on the conventional spindles (conventional
magnetic recording or CMR), which increases the IOPS needed on the CMR disks. More
broadly, SMR drives reveal a bias towards GB/$ improvements over IOPS/GB improvements.
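A toy model shows why grouping by predicted lifetime reduces GC work. The object sizes, expiry days, and zone size below are invented for illustration, and `gc_copy_cost` and `fill_zones` are hypothetical helpers; the model only counts the live bytes that would have to be copied out before a zone containing dead data could be reused.

```python
def fill_zones(objects, objects_per_zone):
    """Write objects sequentially into fixed-size zones, in the given order."""
    return [objects[i:i + objects_per_zone]
            for i in range(0, len(objects), objects_per_zone)]

def gc_copy_cost(zones, now):
    """Live bytes that must be copied to reclaim every zone that holds at
    least one object whose expiry day has passed."""
    cost = 0
    for zone in zones:
        has_dead = any(expiry <= now for _, expiry in zone)
        if has_dead:
            cost += sum(size for size, expiry in zone if expiry > now)
    return cost

# (size_bytes, expiry_day): short- and long-lived objects arrive interleaved.
objects = [(100, 1), (100, 30)] * 4

mixed = fill_zones(objects, 4)                                  # arrival order
grouped = fill_zones(sorted(objects, key=lambda o: o[1]), 4)    # by predicted lifetime

print(gc_copy_cost(mixed, now=2), gc_copy_cost(grouped, now=2))
```

With arrival-order placement every zone holds a mix of dead and live data, so reclaiming space means copying; with lifetime grouping one zone is entirely dead and can be reused without copying anything.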
One way of addressing this is to mix CMR and SMR technologies in a single, hybrid hard drive.
This better mixes hot and cold data on the same disk and increases the chance of effectively
using both the capacity and the IOPS. In particular, this allows new data to be stored in CMR,
with SMR space used for 1) new data known to be long lived or to have a predictable lifetime
(when initially written), or 2) older data stored in CMR that has aged to the point where it is
presumed to be long lived and can be moved to SMR.
The simplest implementation of a hybrid drive is to use a mix of platters and heads, with some
optimized for SMR, and the rest for CMR. This gives a permanent fixed ratio of CMR to SMR,
which may become suboptimal over time. Alternatively, if a CMR-optimized head is used for
both SMR and CMR recording, the outer tracks could be used for CMR, while the inner tracks
could be used for SMR. Note that the SMR capacity gain will be less in this case, but the CMR
portion of the disk will see IOPS improvements due to the lower average seek distance.
Another way to reduce GC overhead is by relaxing the write restrictions imposed by defining a
host-managed SMR abstraction. An example is Caveat Scriptor [11], which allows the host to
write randomly, knowing that it will destroy (specific) nearby data. The Caveat Scriptor proposal
also supports other more interesting write patterns, including a circular buffer, and effectively
allows the host to have dynamically resizable SMR zones. In the long term, an even more
aggressive approach would be to allow dynamically configurable CMR and SMR tracks, which
may be achievable with more advanced head-media interface (HMI) technologies.
Firmware Directions
Although the long term should include physical design changes, there is a wide range of
firmware-only changes that can be made in the short term using existing form factors. We thus
expect these to be the primary focus over the next several years, hopefully in parallel with
longer-term discussions about new physical designs.
4) Profiling Data [IOPS, tail latency]
An important goal, both short term and long term, is to enable us to understand what is going on
with the disk in terms of performance. This data would improve both the host software and the
design of disk trays.
One approach would be a profile data log implemented as either counters or a ring buffer in
RAM. This data can be read using either a new command or normal reads of "magic" block
numbers. The contents of the log would be important performance or other firmware
background events, such as rewrites for data integrity. The host would periodically read this
log, without any performance penalty from the disk (other than using some bandwidth). For
example, when the host noticed a high-latency read, it could then read the log for diagnosis.
A related idea would be to increase the number of possible "error" codes returned by a read to
enable fine-grained explanations of what caused a delay. In most cases, such reads return correct
data but too slowly, and the error code reveals why.
At a minimum, we envision profiling returning the following information:
• time spent seeking (including rotational delay) for a host-initiated read,
• time spent transferring data for a host-initiated read,
• time spent in error recovery for a host-initiated read.
Likewise, profile data should include counters for host-initiated writes and for drive-initiated
reads and writes. We also want drives to accumulate separately the time for major background
activities (e.g. background media scans, track rewrites).
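A minimal sketch of such a profile log, combining a ring buffer of recent reads with running counters, might look as follows. The record layout, field names, and the `ProfileLog` class are assumptions made for illustration; the text does not define a format.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class ReadProfile:
    lba: int
    seek_us: int        # seek time, including rotational delay
    transfer_us: int    # time spent transferring data
    recovery_us: int    # time spent in error recovery

    def total_us(self) -> int:
        return self.seek_us + self.transfer_us + self.recovery_us

    def dominant_cause(self) -> str:
        parts = {"seek": self.seek_us, "transfer": self.transfer_us,
                 "recovery": self.recovery_us}
        return max(parts, key=parts.get)

class ProfileLog:
    """Fixed-size ring buffer of recent reads plus cumulative counters."""
    def __init__(self, capacity: int = 1024):
        self.events = deque(maxlen=capacity)   # the oldest entry is overwritten
        self.totals_us = {"seek": 0, "transfer": 0, "recovery": 0}

    def record(self, event: ReadProfile) -> None:
        self.events.append(event)
        self.totals_us["seek"] += event.seek_us
        self.totals_us["transfer"] += event.transfer_us
        self.totals_us["recovery"] += event.recovery_us

    def slowest(self) -> ReadProfile:
        return max(self.events, key=ReadProfile.total_us)

log = ProfileLog(capacity=2)
log.record(ReadProfile(lba=10, seek_us=8000, transfer_us=500, recovery_us=0))
log.record(ReadProfile(lba=20, seek_us=6000, transfer_us=500, recovery_us=90000))
log.record(ReadProfile(lba=30, seek_us=7000, transfer_us=500, recovery_us=0))
slow = log.slowest()
print(slow.lba, slow.dominant_cause())
```

A host that notices a slow read could pull the log and see that, in this made-up trace, the delay came from error recovery rather than seeking.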
5) Host-Managed Read Retries [tail latency]
The goal is to allow for different policies for reads. High-end disks allow some management of
reads at a global level, but the goal here is to be able to change the read policy on each read,
which requires an API change. The obvious default policy would be the current one, "really try
hard", in which the disk goes to great lengths to complete the read. However, most reads would
be issued with a "limited retry" policy, which returns an error to the host (quickly) if a few initial
attempts at reading return bad data. We prefer this approach for tail latency and it fits well with
the parallel reads and cancellations mentioned above (and in more detail below). In addition,
after a limited-retry read fails, the host still has the option to use the "really try hard" read for
the same data.
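The host-side logic this enables can be sketched as follows. The per-read `policy` argument, the attempt limits, and the `SimulatedDisk` class are hypothetical stand-ins for the proposed API change, and the same LBA is used on every replica purely to keep the example short.

```python
class ReadError(Exception):
    """Raised when a disk gives up on a read under the requested policy."""

class SimulatedDisk:
    """Hypothetical disk whose sector needs `attempts_needed` tries to read."""
    ATTEMPT_LIMIT = {"limited": 2, "try_hard": 100}   # illustrative values

    def __init__(self, name, attempts_needed):
        self.name = name
        self.attempts_needed = attempts_needed
        self.calls = []                  # policies requested, kept for inspection

    def read(self, lba, policy):
        self.calls.append(policy)
        if self.attempts_needed > self.ATTEMPT_LIMIT[policy]:
            raise ReadError(f"{self.name}: gave up on LBA {lba} ({policy})")
        return f"data@{lba} from {self.name}"

def read_with_fallback(replicas, lba):
    # Pass 1: fail fast on every replica so one marginal sector cannot stall us.
    for disk in replicas:
        try:
            return disk.read(lba, policy="limited")
        except ReadError:
            pass
    # Pass 2: every copy failed quickly, so now spend the time to try hard.
    for disk in replicas:
        try:
            return disk.read(lba, policy="try_hard")
        except ReadError:
            pass
    raise ReadError(f"LBA {lba} unreadable on all replicas")

marginal = SimulatedDisk("a", attempts_needed=10)
healthy = SimulatedDisk("b", attempts_needed=1)
print(read_with_fallback([marginal, healthy], lba=7))
```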
6) Target Error Rate [capacity, tail latency]
From the view of a single server, it is very important that disk drives do not lose data. Disk
drives thus promise incredibly low loss rates, typically losing 1 bit in 10^15 or less. However, in
the data center context, durable data must already be spread across multiple disks and typically
multiple locations, so that disasters are unlikely to destroy all copies of the data. High availability
also implies the need for multiple copies.
At the same time, current drives make many sacrifices to achieve these low error rates. They
sacrifice capacity by limiting how tight the areal density can be and by using more extensive
coding, and they worsen tail latency, by requiring more background repair processes that
interfere with the normal performance of the disk.
Overall, a higher target error rate would have little effect on the overall durability, while at the
same time enabling higher capacity and fewer repair operations. It might be possible to
eliminate repair operations inside the disk altogether, although the drive still must enable
detection of errors (and possibly predict risk of errors) so that higher-level systems can initiate
repair and/or shift responsibility for that data to another drive.
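A rough calculation illustrates why replication leaves room for a higher per-disk error rate. The sketch assumes bit errors are independent within a read and across copies and ignores whole-disk failures; the relaxed rate of 1 in 10^12, the 4 KiB read size, and the three copies are arbitrary choices for illustration.

```python
import math

def p_unreadable(ber: float, bits: int) -> float:
    """Probability that a read of `bits` bits contains at least one bad bit."""
    return -math.expm1(bits * math.log1p(-ber))

def p_all_copies_unreadable(ber: float, bits: int, copies: int) -> float:
    """Probability that the same data is unreadable on every copy."""
    return p_unreadable(ber, bits) ** copies

SECTOR_BITS = 4096 * 8

single_copy_today = p_unreadable(1e-15, SECTOR_BITS)
three_copies_relaxed = p_all_copies_unreadable(1e-12, SECTOR_BITS, copies=3)

print(f"{single_copy_today:.3e}  {three_copies_relaxed:.3e}")
```

Under these assumptions, three copies on disks with an error rate a thousand times higher are still far less likely to lose a given sector than a single copy at today's target.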
7) Background Tasks and Background Management APIs [tail latency]
The disk uses background tasks to fix or prevent problems. One example is refreshing tracks
for data integrity, more commonly known as ATI mitigation (for adjacent track interference).
Background tasks damage tail latency severely, primarily because they can impact important
latency-sensitive reads.
Since we cannot remove the need for background tasks altogether, we need to explore
mechanisms to better control the timing. Our best suggestion so far is the following contract:
a) Most background tasks should be preemptible.
b) There are no non-preemptible background tasks without a timer. The tasks can
   potentially be grouped into urgent (needs to be executed relatively quickly) vs.
   non-urgent (can be executed less quickly).
c) The host will periodically issue a command that suggests when the disk drive can do
   background work. This could be a repair for a single track, or for "up to k" tracks when
   we believe there is more time. The latter allows for better seek ordering.
d) The disk will have an estimate and reporting mechanism for how much background work
   is piling up.
e) When the timer expires, the disk can also do maintenance, regardless of tail latency. The
   host should not allow this to happen in general. It might also make sense to have the
   disk enter maintenance mode any time the refresh count is above some high watermark
   (and stop when the count gets below some low watermark). It would still process
   commands, but with no promises on tail latency. The host should have a way to tell the
   disk is in this mode.
f) If new commands are not arriving, something may be wrong with the host and thus doing
   all the maintenance work now is a good idea. This implies some kind of idle timer that
   initiates maintenance mode. The disk stays in that mode until a command arrives.
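Parts of this contract can be sketched as a small state model. It covers only the host grant of "up to k" tracks (item c), the backlog estimate (item d), the high/low watermark hysteresis (item e), and the idle timer (item f); the per-task urgency timer is left out. The class name, method names, and thresholds are hypothetical.

```python
class DiskBackgroundModel:
    def __init__(self, high=100, low=20, idle_timeout_s=5.0):
        self.high, self.low, self.idle_timeout_s = high, low, idle_timeout_s
        self.backlog = 0            # tracks awaiting refresh, reportable to the host
        self.backlog_mode = False   # maintenance entered because of the watermark
        self.idle_mode = False      # maintenance entered because the host went quiet
        self.last_command_s = 0.0

    @property
    def in_maintenance(self):
        return self.backlog_mode or self.idle_mode

    def _update_backlog_mode(self):
        if self.backlog > self.high:
            self.backlog_mode = True
        elif self.backlog < self.low:
            self.backlog_mode = False   # between the watermarks the mode is unchanged

    def add_work(self, tracks):
        self.backlog += tracks
        self._update_backlog_mode()

    def host_grant(self, max_tracks):
        """Host says: you may repair up to `max_tracks` tracks now."""
        done = min(max_tracks, self.backlog)
        self.backlog -= done
        self._update_backlog_mode()
        return done

    def on_command(self, now_s):
        self.last_command_s = now_s
        self.idle_mode = False          # a command ends idle-initiated maintenance

    def tick(self, now_s):
        if now_s - self.last_command_s >= self.idle_timeout_s:
            self.idle_mode = True
        if self.in_maintenance and self.backlog > 0:
            self.backlog -= 1           # disk-initiated repair of one track
            self._update_backlog_mode()

disk = DiskBackgroundModel(high=10, low=3, idle_timeout_s=5.0)
disk.add_work(12)
print(disk.in_maintenance, disk.host_grant(5), disk.in_maintenance)
```

The hysteresis keeps the disk from flapping in and out of maintenance mode when the backlog hovers near a single threshold.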
There are some problems with this as-is that merit discussion. First, a bad pattern, such as
overwriting the same track many times in a row, may require background work
that does not happen in time. This will result in the background task being escalated from the
non-urgent to the urgent bucket. If the host does not allow the disk to initiate the background
task while in the urgent bucket for too long, then data may potentially be lost or corrupted.
One option would be to allow the data loss, which might be fine depending on the frequency.
When the refresh does finally occur, it should be clear whether data was lost or not, and we can
return an error at that time.
Another option would be to fail a new write that is about to cause excess interference with a
nearby track, with an error that means "urgent refresh required" (for that area of the disk).
Completing the write a little later is fine if rare. Plus, this kind of error should itself have low
latency. Even better might be to do the refresh instead of the write (and return an error), or do
the refresh and the write, since they are physically close. The latter is bad only if the latency is
too high, since we are starving reads during the double write. An advantage of this approach is
that it handles pathological cases (better than timers), such as the host repeatedly writing the
same track. The best approach here remains an open question.
Background scanning [tail latency]
Many disks do media scanning across all sectors, whether or not they're allocated, as a way to
detect latent errors. Alternatively, we can do background scanning on the host side, but with a
more end-to-end approach. However, the primary source of latent errors is the disk itself, due
to physically nearby writes in the form of ATI (adjacent track interference) or FTI (far track
interference), and thus the disk has a better estimate of which tracks or sectors are most at risk.
One approach is to use a mix of host- and disk-initiated background scans, assuming we can
control the timing as discussed above. In particular, we could make the disk responsible for
estimation of risk and then report to the host the tracks in need of refresh writes. This leaves the
scheduling in the hands of the host, which provides better control over the timing and thus helps
tail latency. Similarly, if the disk detects a partial (or complete) error it could notify the host, even
if the errors were corrected via coding. Such notifications reduce the window of vulnerability to
losing this copy; i.e., higher-level repairs can be made before there is a problem.
For LBA ranges that are not specifically called out by the disk as being at risk, the host could
promise to issue a special read command (or overwrite) to every allocated sector every X days
(and have some way of knowing what the manufacturer thinks is the right value of X).
For LBA ranges that are not scanned by the host, the disk should continue to ensure that they
are protected using its own BER guarantee scheme and whatever scanning that implies.
Auditing for host-initiated scans can be done and reported, which would be useful, as long as
the number of distinct LBA ranges (or bands) is relatively small.
8) Flexible Capacity [TCO, capacity]
Currently, drives come in fixed capacities (such as X TB). There are three ways in which
moving to flexible capacity should reduce TCO.
First, in order to provide marketable drives with fixed numbers of TBs of capacity, vendors must
provision platters with somewhat larger capacities in order to accommodate an unknown
number of defects. Drives that have average or below-average numbers of defects will
therefore be remanufactured, or sold as having a smaller capacity than they can actually
support.
Second, drives have numerous reserved sectors for reallocation, cache, or other uses. Some of
those reserved sectors could be exposed to store data early in the life of the drive, provided that
the drive is able to reclaim them at a slow rate during its lifetime.
Third, drive head failures could be handled by mapping out the failed head (and recovering the
lost data from other drives), although this likely requires reinitializing the drive to achieve
contiguous LBA numbering. Continuing to use the drive after one or more head failures would
extend the lifetime of the drive.
9) Larger Sector Sizes [capacity]
Disk drive error correcting codes (ECCs) are able to tolerate a given fraction of errors with a
smaller fraction of check bits as the codeword size is increased. However, the codeword cannot
exceed the drive's sector size (unless multiple physical sectors are read per logical sector read).
As a result, drive vendors have increased the drive sector size from 512 B to 4 KiB and reduced
the ECC overhead. Since storage services typically write their own distributed file systems,
there may be an advantage to further increasing this to 64 KiB or larger.
Furthermore, since host software typically adds a CRC to data written on disks, having SATA
drives that support extended sector sizes such as 4k + 16 B or 64k + 256 B would be more
efficient overall, and would allow the ECC to be exposed to the host for end-to-end
checksumming. (Some SCSI disks already support this feature.)
10) Optimized Queuing Management [IOPS]
Native command queueing (NCQ) allows I/Os to be queued at the disk instead of the host,
enabling higher throughput to be achieved through real-time geometry optimizations. Currently,
NCQ by itself does not satisfy our need (nor does TCQ), as it does not convey the host's per-I/O
requirements to the disk to allow optimal I/O reordering. There are a number of traffic classes
mixed with a quota management system in a shared distributed file system. Any attempt to
increase throughput by reordering requests has to address two key challenges.
First, it is necessary to meet the latency targets for in-quota low-latency and throughput reads.
Second, it is necessary to meet per-user throughput targets for in-quota reads and writes.
Given the state of NCQ and its expected evolution in the future, it seems natural to have the
host manage the quota system through throttling, and have the disk manage the per-traffic-class
throughput through informed I/O reordering. A much more detailed investigation and a
well-thought-out design are needed in this area, which is beyond the scope of this document.
The disk scheduler will need to look very much like a real-time operating system (RTOS),
capable of handling very complex per-I/O information, matching that against real-time head and
media relative positioning, and allowing scheduling interrupts to happen instantaneously as they
arise. The disk scheduler will also need to support the queueing and proper scheduling of
system management commands, such as periodic environmental monitoring commands.
Some potential APIs needed for the host in this area might include: fine-grained per-I/O priority
level information, fine-grained per-I/O deadline information, per-I/O lightweight cancellation,
queue barrier insertion, head-of-queue insertion, head-of-priority-group insertion, and general
queue management information (such as pulling the detailed state of the current queue and
expected execution ordering). Some of the ideas in NCQ ICC can potentially be used for this
purpose, but NCQ ICC itself is often considered to be too complex to be fully implemented at
the HDD firmware level, and is not fundamentally designed to support the needs of a distributed,
data-center-scale file system.
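A few of these queue operations can be sketched with a priority heap. This is a host-side toy, not an NCQ interface: the `IoQueue` class and its methods are hypothetical, head and media position are ignored entirely, and multiple head-of-queue insertions are served in arrival order, which is only one of several reasonable choices.

```python
import heapq
import itertools

class IoQueue:
    def __init__(self):
        self._heap = []                  # entries: (priority, deadline_ms, seq, io_id)
        self._seq = itertools.count()    # tie-breaker that preserves arrival order
        self._cancelled = set()

    def submit(self, io_id, priority, deadline_ms):
        """Lower priority number runs first; the earlier deadline breaks ties."""
        heapq.heappush(self._heap, (priority, deadline_ms, next(self._seq), io_id))

    def submit_head(self, io_id):
        """Head-of-queue insertion: runs before everything queued via submit()."""
        heapq.heappush(self._heap, (float("-inf"), 0, next(self._seq), io_id))

    def cancel(self, io_id):
        """Lightweight cancellation: mark now, discard lazily on pop."""
        self._cancelled.add(io_id)

    def pending(self):
        """Expected execution order of the I/Os still queued."""
        return [e[3] for e in sorted(self._heap) if e[3] not in self._cancelled]

    def pop(self):
        while self._heap:
            io_id = heapq.heappop(self._heap)[3]
            if io_id in self._cancelled:
                self._cancelled.discard(io_id)
                continue
            return io_id
        return None

q = IoQueue()
q.submit("scan", priority=2, deadline_ms=500)
q.submit("read-a", priority=0, deadline_ms=20)
q.submit("read-b", priority=0, deadline_ms=10)
q.cancel("read-a")
q.submit_head("urgent")
order = q.pending()
print(order)
```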
A related use would be to use NCQ commands to issue a "giant" streaming read. The idea is
that if we know we want to stream many megabytes over the next k milliseconds, we would
issue a series of low-priority reads, with the more near-term reads having higher priority. The
goal is a simple way to interleave blocks for the stream with regular reads/writes. It would also
be good to have cache controls for these blocks: once a (streaming) read is delivered to the
host, it can be removed from the cache, as there is no expected temporal locality.
Summary
Disks are the central element of Cloud-based storage, whose demand far outpaces the
considerable rate of innovation in disks. Exponential growth in demand implies that most future
disks will be in data centers and thus part of a large collection of disks. The collection view,
along with a focus on tail latency and security, places new and different requirements on disks.
We hope this is the beginning of a new era of "data center" disks and a new broad and open
discussion about how to evolve disks for data centers. The ideas presented here provide some
guidance and some options, but we believe the best solutions will come from the combined
efforts of industry, academia and other large customers.
Bibliography
[1] J. H. Saltzer, D. P. Reed and D. D. Clark. "End-to-End Arguments in System Design". In:
Proceedings of the Second International Conference on Distributed Computing Systems. April
1981. IEEE Computer Society. See Wikipedia: https://en.wikipedia.org/wiki/Endtoend_principle
[2] J. Dean and L. Barroso. "The Tail at Scale". Communications of the ACM, Vol. 56, No. 2,
pp. 74-80, February 2013.
[3] S. Malenkovich. "Indestructible malware by Equation cyberspies is out there - but don't panic
(yet)", Kaspersky Blog. February 17, 2015.
https://blog.kaspersky.com/equationhddmalware/7623/
[4] K. Zetter. "How the NSA's Firmware Hacking Works and Why It's So Unsettling". Wired online,
February 2015, http://www.wired.com/2015/02/nsafirmwarehacking/
[5] Trusted Computing Group. "TCG Storage Security Subsystem Class: Opal Specification Version
2.00, Revision 1.00", February 24, 2012.
https://www.trustedcomputinggroup.org/files/resource_files/B15F1F8F1A4BB294D03F09D51
22B21F6/Opal_SSC_2%2000_rev1%2000_final.pdf
[6] Trusted Computing Group. "TCG Storage Security Subsystem Class: Enterprise Specification
Version 1.0, Revision 1.0", January 27, 2009.
http://www.trustedcomputinggroup.org/files/resource_files/87FE68471D093519ADF6E656807
00A9F/TCG_SWG_SSC_Enterprisev1r1090120.pdf Also presentation:
https://www.trustedcomputinggroup.org/files/resource_files/0B968DDB1A4BB294D02FF4F40
2F72707/SWG_TCG_Enterprise%20Introduction_Sept2010.pdf
[7] C. Kozierok. "Single vs. Multiple Actuators" in The PC Guide (http://www.PCGuide.com). April
2001.
http://www.pcguide.com/ref/hdd/op/actMultiplec.html
[8] Wikipedia. "Conner Peripherals", see "Chinook" dual-actuator drive,
https://en.wikipedia.org/wiki/Conner_Peripherals#Performance_issues_and_the_.22Chinook.22
_dualactuator_drive
[9] R. Merritt. "Google, Intel Prep 48V Servers". EE Times, January 21, 2016.
http://www.eetimes.com/document.asp?doc_id=1328741
[10] A. Anand, S. Sen, A. Krioukov, F. Popovici, A. Akella, A. Arpaci-Dusseau, R. Arpaci-Dusseau, S.
Banerjee. "Avoiding File System Micromanagement with Range Writes". Proceedings of the 8th
USENIX Conference on Operating Systems Design and Implementation (OSDI 2008). pp.
161-176.
[11] T. Feldman and G. Gibson. "Shingled Magnetic Recording: Areal Density Increase Requires New
Data Management". Usenix ;login:, Vol. 30, No. 3, June 2013.
https://www.cs.cmu.edu/~garth/papers/05_feldman_022030_final.pdf