
INTRODUCTION TO WAFL "WRITE ANYWHERE FILE LAYOUT"

Hello, I'm John Edwards, the Technical Director for WAFL. Today I'll be giving an overview of WAFL. This presentation complements and extends the information in Engineering 101, and it prepares for the later WAFL modules, which will go into more detail on similar subjects.

Outline
First we'll start with a brief introduction to WAFL, talking about its overall theory and intent. Then we'll go into depth about the on-disk organization, both the high-level and the low-level parts, and finally we'll talk about WAFL process structure and programming, then logging, consistency points, and miscellany. This will give a good general overview of how WAFL is laid out on disk and a high-level view of how it's processed in memory and serves data.

Introduction
So, a quick introduction to the introduction. We'll talk about historical notes and trends, then go into some of the main ideas of WAFL, and I'll briefly touch at a high level on some of the key WAFL features.

Acknowledgments
I'm not going to read this acknowledgment slide. Suffice it to say that these slides are a large group effort by many people in WAFL over the years and have been repeatedly stolen and reused from a variety of people.

Historical notes
This one I stole from, I think, Blake Lewis. It's a historical note. The original version of WAFL was written by Dave Hitz about 14 years ago. A very small amount of his code is still there. You can see it's mostly comments, brackets, white space. There's some actual real code in there too.

Historical notes (2)
So today WAFL is still fundamentally the same in many ways as the original WAFL of 1992, but there have been many changes, and the scope and functionality of WAFL have been greatly extended over that time. Just as a quick reference point, the current on-disk format version of WAFL is somewhere in the seventies. I don't have the exact number, and even if I did it would be obsolete by the time you saw this.

Some scaling trends
Here we'll look at some of the scaling trends. You can see that the original filers had about 256 megabytes of memory, and that's now up to 32 gigabytes. Similarly, the number of ops we served was originally 626 SFS ops a second; that's roughly 65,000 now. The numbers aren't exactly comparable because of benchmark changes in the meantime, but they give a general feel. Both of those numbers are up by about a factor of a hundred. In the meantime, the maximum capacity of our systems has gone up by well over a factor of a thousand -- geometrically, half the way to 10,000. This is important to notice, because the system-head-to-disk ratio has been becoming increasingly unfavorable over time: we're managing more and more disks with proportionally less and less head. And you can see the number of lines of code has gone up tremendously as well.

Main original ideas of WAFL
Just to touch briefly on some of the main original ideas in WAFL, which are still present in the system today. The NVLOG, or non-volatile memory log, is used for low latency. The whole structure of how WAFL writes out data was designed around the notion of efficient writing to RAID-4. This is still true even though we have different versions of RAID now -- we have the RAID dual parity that you heard about in Engineering 101.
But the main ideas of organizing data for efficient writes still exist today. The on-disk consistency of WAFL is guaranteed at all times, assuming nothing goes wrong -- which of course things sometimes do. The file system jumps atomically ahead from one consistent image to another. One reason this can be made true is that we never overwrite ordinary WAFL data on disk; we're always writing into places on disk that are currently unused by the file system. There are a couple of exceptions to this that we'll touch on a little later, but it's almost always true. And the original implementation was targeted at short code paths, with WAFL controlling all the resources. WAFL is still in many ways the center of the system. More than 10 years ago, Data ONTAP basically consisted of WAFL with enough stuff bolted onto the sides to control disks and to serve data. It's a more mature system now, but WAFL is still very much the center of memory management and of the process flow of the system.
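To make the write-anywhere idea concrete, here is a minimal sketch in Python (an illustrative toy, not NetApp's implementation; all the names are made up) of how new data always lands in currently unused blocks, and the superblock is the one thing rewritten in place:

```python
class Disk:
    def __init__(self, nblocks):
        self.blocks = [None] * nblocks
        self.free = set(range(1, nblocks))  # block 0 reserved for the superblock
        self.superblock_root = None         # the one fixed-location hook

    def write_new_block(self, data):
        # Write-anywhere: data always lands in a block the file system
        # is not currently using, so the old tree stays intact on disk.
        blk = self.free.pop()
        self.blocks[blk] = data
        return blk

    def consistency_point(self, new_root):
        # The only in-place update: atomically repoint the superblock
        # from one consistent tree to the next.
        self.superblock_root = new_root

disk = Disk(100)
root_v1 = disk.write_new_block({"file.txt": disk.write_new_block(b"version 1")})
disk.consistency_point(root_v1)

# Rewriting the file allocates fresh blocks; v1's blocks are untouched,
# so a crash before the next consistency point leaves v1 fully consistent.
root_v2 = disk.write_new_block({"file.txt": disk.write_new_block(b"version 2")})
disk.consistency_point(root_v2)
```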

General overview
So now that we've talked a little bit about the historical things, we'll go on to a general overview. For the on-disk format, there are currently two types of volumes, which we'll talk about a little more. The on-disk format for most of our volumes is pretty conventional. There are inodes, which we'll talk about in detail, directories, indirect blocks, bitmaps. It's basically all just files on disk, and trees of blocks make up files. All the metadata in the system resides in files; WAFL was one of the early file systems to do this. And there's only one superblock, at a fixed location. That's the piece of data you read in order to get started at boot time -- the hook you can find to descend into the tree of the file system. Volumes and what we call aggregates are the basic units of administration, and we'll see some pictures of those in just a moment.

Metadata basics
Before we go into that, I want to give a little bit of metadata basics. There are the inodes, which we'll talk about in detail in the section on the disk structures. An inode is essentially the top of a file: it's the piece of information that represents the file on disk. We have an inode file and an allocation bitmap. If you've worked in any file system this is a fairly familiar concept. In some file systems the inodes are in fixed locations on disk, but essentially we have a large group of inodes, and then we have blocks and bitmaps. One thing I do want to mention is that we currently have one large inode file. LSM, the Logical SnapMirror project, is about to split this into two inode files, one for user data and one for our own metadata. That's going to facilitate mirroring, because we can isolate the data that we want to mirror from the data that's really local to a given filer. That's a little bit in the future, but since it's in the pipeline I wanted to warn you.
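As a rough illustration of these basics, here is a toy sketch in Python (hypothetical names, not the real on-disk format) of an inode rooting a tree of 4 KB blocks, with the inodes themselves stored in an inode file:

```python
BLOCK_SIZE = 4096  # WAFL uses 4 KB blocks

class Inode:
    """The 'top of a file': the metadata that roots its tree of blocks."""
    def __init__(self):
        self.size = 0
        self.block_ptrs = []  # direct pointers only in this toy; real WAFL
                              # adds indirect blocks for larger files

class ToyFS:
    def __init__(self):
        self.blocks = {}      # block number -> up to 4 KB of data
        self.next_blk = 0
        self.inode_file = []  # inodes live in a file, not at fixed locations

    def alloc_inode(self):
        self.inode_file.append(Inode())
        return len(self.inode_file) - 1  # inode number = index in inode file

    def write(self, inum, data):
        inode = self.inode_file[inum]
        for off in range(0, len(data), BLOCK_SIZE):
            self.blocks[self.next_blk] = data[off:off + BLOCK_SIZE]
            inode.block_ptrs.append(self.next_blk)
            self.next_blk += 1
        inode.size += len(data)

fs = ToyFS()
inum = fs.alloc_inode()
fs.write(inum, b"x" * 10000)  # 10,000 bytes -> three block pointers
assert len(fs.inode_file[inum].block_ptrs) == 3
```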
Volumes -- sorry. Volumes generally contain several file system images. Snapshots are really a core technology of WAFL, so any given volume will typically have multiple versions of the file system at any given time: one active file system and a bunch of read-only snapshots. And there are a bunch of other metadata files, which we'll discuss in a later sub-module.

Aggregates+FlexVols vs. TradVols
Here's a picture I'm going to talk to for a little while. If you look at the bottom of this picture, you see a lot of little cylinders: green cylinders marked D and gray cylinders marked P. These represent disks, data disks and parity disks. Disks are assembled in the system into RAID groups to provide data integrity and resiliency in the case of disk failures. On the left-hand portion of the slide there are two RAID groups, and on the right-hand portion there's one RAID group. And those are assembled by RAID into something we call a RAID plex. This is really just a big concatenation of all the blocks on all the disks, and this is the organization that's then presented up to WAFL so WAFL can deal with it. There's also some geometry information so WAFL knows where the disks are -- it doesn't lose all that information along the way -- but for the most part it's just presented with essentially one big disk. Now, the interesting things happen up on top. On the right-hand side of this picture you see the traditional volume. That's essentially one big container that takes up all the blocks that RAID has presented us with. And you can see several small things inside. Those represent files and directories. There's something called a QTree, which is a special kind of directory we'll see in a volume. There's a LUN, which is essentially like a big disk presented over Fibre Channel or iSCSI; in WAFL it's implemented just as a file. Now, the thing about the traditional volume you see here is that it's sized to the RAID plex beneath it. Given the nature of RAID and the fact that WAFL is on top of it, it's very easy to add more disks to the system, and you can grow the traditional volume by adding more disks. But once you've added them you can't take them away. It's very inflexibly sized. The traditional volume that we see here is the only kind of volume we had until 7.0, the 7G release. Now, on the left-hand side you see a somewhat different picture. There's the large container in yellow, called an aggregate. Essentially it's the WAFL version of the container of all those blocks that RAID is presenting. And inside you see, in this case, four flexible volumes of different sizes, sort of free-floating in the aggregate, and they all contain their own files and LUNs and QTrees -- whatever a volume can contain, they each contain their own sets of it. The important thing here is that the flexible volumes are not tied to any particular portions of those disks in any way. They're really floating freely inside the aggregate, and they can be resized, made larger or smaller. That's the one key part of the flexibility. And they also share many spindles. One benefit of aggregates is that when you have a large aggregate and can move space easily between different volumes, you don't have as much disincentive against grouping large sets of disks together. It was common in the past for people with traditional volumes to be afraid to add a disk to a volume, because once they added it they couldn't get it out and couldn't use that disk for some other part of their data set, so they would really hoard their spares. The aggregate picture on the left-hand side encourages you to add all, or most, of your disks into your aggregate right away, and then you can parcel it up among flexible volumes as you want. This has a bunch of benefits. First of all, you're getting more of your disks into use right away, which means your data can use all your disks, as opposed to having some sitting idle on the side waiting for something bad to happen. And since the flexible volumes in the aggregate aren't tied to any particular part of the disks, they get to use all portions of the disks.

WAFL Volumes and Aggregates
Now let's move on -- I think I've hit most of these points already. An aggregate really is just a bunch of disks, assembled by RAID and presented as a container to WAFL. A volume, on the other hand, is a file system image plus its snapshots. Our volume size, like our aggregate size, is limited to 16 terabytes. That's because we have up to four billion blocks -- you can guess that's a 32-bit block pointer -- and they are 4-kilobyte blocks. If you multiply those out it comes to 16 terabytes. Currently we allow 500 volumes per filer. That's in the process of being changed dramatically by the Sentinel-Extreme project; the current target is around 10,000 volumes per single-head filer. In a GX system, which is clustered, that will be multiplied by the number of machines in the cluster.
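The 16-terabyte limit falls straight out of that multiplication:

```python
# 32-bit block pointers address 2**32 (about four billion) blocks,
# each 4 KB, which is where the 16 TB volume/aggregate limit comes from.
blocks = 2**32
block_size = 4 * 1024
print((blocks * block_size) // 2**40, "TB")  # -> 16 TB
```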
And there are really two flavors of volumes, flexible and traditional. If you noticed in the previous picture, the traditional volume really sort of looked like the

aggregate. And the notion of flexible volumes was really that we were separating the data management properties of volumes -- files and snapshots and sizes and all the properties of your data -- from the physical properties of your disks. Viewed in reverse time, the traditional volume is a blend of a flexible volume, with all the data management features of one, and the physical properties of the disks. What we did with flexible volumes and aggregates was to separate those two concerns from each other: the disk management concerns and the data management concerns.

Traditional vs. flexible
A little more on traditional volumes and flexible volumes. Traditional volumes and aggregates can be grown by adding disks to them, and they have a very large size granularity. That's a problem for traditional volumes much more than for aggregates, because traditional volumes are really a data management entity as well, so you're getting two concerns tied together. The flexible volume, on the other hand, shares all of its RAID groups with the other flexible volumes in the same aggregate, and you can vary its size. You can also do things like oversubscribe your space. That's a very key property for a lot of customers: they can create many volumes that they know aren't going to get filled up right away, over-provision them, and then supply disks as needed. There is a performance penalty. It says "small" here; it ranges from about 5 percent on SFS and can get somewhat higher than 10 percent on random writes. There's a variety of work going on that will reduce that penalty and actually provide some real benefits over time.

Traditional vs. flexible (2)
One more thing I want to highlight: FlexVols really support a lot of our newer features, and a lot of our newer features are only on FlexVols -- things like FlexCache and FlexClone, which I'll talk about, segment cleaning, write-in-place, restore on demand. All of the significant new WAFL development really focuses on flexible volumes, because they are our data management entity for the future. We're not removing support for tradvols right away, but we're trying to deprecate them as much as we can. In fact, in a GX cluster you will not be able to have any traditional volumes; it only supports flexible volumes.

Key WAFL features (1)
Now I'm going to go into a couple of key WAFL features. The original main WAFL feature that attracted people was the snapshot, and WAFL was really one of the first places to make this usable for average customers and real end users. The notion is that you get an instant, frozen-in-time copy of the active file system. Users are writing all their files to the filer, and at some point, via a command or a schedule, you get a snapshot, which is all of the data in that file system, frozen. You don't have to worry about any more changes to it; you get a read-only copy. Snapshots can be exported and accessed via a sort of magical .snapshot directory. This allowed end users -- who are constantly going to system administrators saying "I deleted my files" or "I edited this and I shouldn't have", or some other sad tale of woe, as users always have -- to recover their own files from inside a snapshot. They could go to whatever directory their files had been in, move from that directory into a magical directory the filer made appear there, .snapshot, and then go into, as you can see on this particular slide, hourly 0, hourly 1, hourly 2 -- scheduled snapshots taken maybe every four hours -- plus a nightly. They could go, say, to the nightly, get the state of their files from last night, and copy it back into the file system. This greatly reduced the burden on administrators.
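Here is a toy model in Python of that recovery flow (the structures are invented for illustration; real snapshots share blocks rather than copying anything): a path that passes through .snapshot is resolved against a frozen image instead of the active file system.

```python
import copy

class Volume:
    def __init__(self):
        self.active = {}     # path -> contents (the active file system)
        self.snapshots = {}  # snapshot name -> frozen, read-only image

    def take_snapshot(self, name):
        # Real WAFL freezes the tree by sharing blocks in O(1); the deep
        # copy here just models the read-only, frozen-in-time semantics.
        self.snapshots[name] = copy.deepcopy(self.active)

    def read(self, path):
        parts = path.split("/")
        if ".snapshot" in parts:
            # "dir/.snapshot/<name>/rest" resolves against the old image.
            i = parts.index(".snapshot")
            image = self.snapshots[parts[i + 1]]
            return image["/".join(parts[:i] + parts[i + 2:])]
        return self.active[path]

vol = Volume()
vol.active["home/thesis.txt"] = "six months of work"
vol.take_snapshot("nightly.0")
del vol.active["home/thesis.txt"]  # the classic sad tale of woe
print(vol.read("home/.snapshot/nightly.0/thesis.txt"))  # recovered
```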
Key WAFL Features (2)
The fact that snapshots are frozen-in-time copies of the file system enables a whole bunch of other value-added features. One I won't be talking about much in this talk, because it sits outside and above WAFL, is SnapMirror, which transfers snapshots one at a time from a local system to a remote system and is really the core of our data replication strategy. But there are many other features just inside of WAFL. One is SnapRestore. As I said a moment ago, users could go and copy their files from a snapshot to the active file system and recover whatever they'd lost. Sometimes that wasn't really convenient: it may be that the file was a multi-terabyte Oracle database, and copying it would be a big overhead and a big problem. So we implemented SnapRestore, which allows the filer, in a short period of time, to revert any file in the active file system to its copy in a snapshot. We also have whole-volume SnapRestore, which, because of the quirks and properties of WAFL, is actually much, much faster than single-file SnapRestore; it takes your entire file system and rolls it back to some snapshot. So if I do a SnapRestore this afternoon back to last night's snapshot, as far as the resulting file system is concerned, it's as if nothing happened after midnight last night. That's really useful for applications like Exchange, where maybe your Exchange database gets corrupted -- because that's something Exchange databases do -- and nobody can get at their mail. You're able to rapidly roll back to a time when you knew the data was good and then replay your recovery logs. Oracle databases, for example, can do this too. You can really get back up and running much quicker after something's gone wrong. Internally we had to use it in one instance because a virus had gotten onto our internal net and deleted many, many, many files, and we recovered it all from snapshots; it saved us. Another feature is file folding. One use case customers had was that they might have many people in their company, each with their own files on their Windows desktop, and each day they would tell people to copy all their files onto the filer as a backup. What would happen is people would be copying all their files via CIFS from Windows, and the filer would just see a whole bunch of new files coming in. As it happened, those were exact copies of the old files, and we had the snapshotted copies of the old files -- the frozen-in-time copies we were keeping around. Then we'd write the new copies and end up storing two copies of all this data. File folding is a mechanism that allows us to fairly intelligently notice: oh, you just wrote all this new data in, but it's exactly the same as the old data, so I'm not going to store two copies of it, I'll just store one.
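A sketch of the folding idea, with invented structures (the real mechanism works on 4 KB blocks inside WAFL): if incoming data is byte-identical to the frozen copy, share the old block instead of storing a second one. De-duplication, discussed next, generalizes the same point-at-one-block trick using reference counts.

```python
class FoldingStore:
    def __init__(self):
        self.blocks = {}    # block number -> data
        self.refcount = {}  # block number -> how many pointers share it
        self.next_blk = 0

    def write_block(self, data, snapshot_blk=None):
        # File folding: if the incoming write matches the frozen copy
        # byte for byte, share the existing block rather than storing
        # a second copy of the same data.
        if snapshot_blk is not None and self.blocks.get(snapshot_blk) == data:
            self.refcount[snapshot_blk] += 1
            return snapshot_blk
        blk = self.next_blk
        self.next_blk += 1
        self.blocks[blk] = data
        self.refcount[blk] = 1
        return blk

store = FoldingStore()
old = store.write_block(b"Tuesday's desktop backup")
new = store.write_block(b"Tuesday's desktop backup", snapshot_blk=old)
assert new == old and store.refcount[old] == 2  # one copy, two pointers
```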

Key WAFL Features (3)
More features have been added over time. De-duplication, which also goes by the more marketing-friendly name of advanced single-instance storage, is again a more productized notion of the file folding case: many streams of data might be coming into the filer, perhaps tape backup streams, and they might have many, many blocks that all carry the same data. What the de-dup does -- that's the internal name -- is allow us to point at the same block many times, and we can get a large multiple of space savings by doing so. SnapLock allows you to set retention dates on data. This is very important for executives who don't want to go to jail because something's been changed in their database: Sarbanes-Oxley requires retention of data over certain periods of time, and SnapLock allows you to set a date. For example, anything written to the filer has to stay there for ten years: it can't be deleted within ten years, and you can't format the disks in the filer. It would take something really, really physically intrusive to actually destroy that data. On a more prosaic scale we have quotas, which are another data management feature. There are various levels of quotas: users and groups; SIDs, which are the Windows version of an ID that might be a user or a group; and QTree quotas -- QTrees are a special kind of directory we'll talk about in a little bit. You can also set default quotas so you don't have to set up a quota for everybody. If you have, say, 10,000 students at a university, you don't want to be managing 10,000 different quotas.
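A minimal sketch of the default-quota idea (the names and lookup rule here are illustrative, not the actual quota subsystem): a few explicit quotas plus one default that covers everyone else.

```python
class QuotaTable:
    def __init__(self, default_limit):
        self.explicit = {}  # principal (user/group/SID/QTree) -> limit
        self.default_limit = default_limit

    def limit_for(self, principal):
        # Use an explicit quota if one is configured, else the default.
        return self.explicit.get(principal, self.default_limit)

GB = 2**30
quotas = QuotaTable(default_limit=5 * GB)  # one rule covers 10,000 students
quotas.explicit["prof_smith"] = 100 * GB   # a handful of explicit overrides
assert quotas.limit_for("student_0042") == 5 * GB
```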
Key WAFL Features (4)
A few more key WAFL features. FlexShare is a mechanism that allows us to set the relative importance of workloads, so you can ensure that no workload you care about will get starved on the filer. It's a lightweight version of Quality of Service. It's not really a Quality of Service guarantee, though, in that it's very easy to get Quality of Service wrong; this is a lightweight version of that. FlexCache is an integrated WAFL feature that allows us to connect two filers to each other, where one filer has the real active data. For example, DreamWorks might be making a movie in their Northern California offices and want access to that data in their Southern California offices. So they'll install machines with FlexCache at the Southern California office, and it will pull data in, on a file-system-caching basis, as needed by the Southern California workers, animators, or whoever. Any time the data changes in Northern California, the FlexCache will know and be able to go get the new data, and the Southern California people can write to the FlexCache and that will get pushed to Northern California. This works really well for accelerating data access to filers, particularly over a WAN.

Introduction - Summary
Okay, a quick summary of everything we've been through so far. We talked a little bit about the historical notes and trends. A couple of the big points were that the CPU, memory, and disks are all getting bigger and faster, but the CPU and memory are not keeping up with the disks in terms of size. This is creating an increasing imbalance in the systems and is something we'll be working on more in the future. We talked about the main ideas of WAFL, which are really optimizing for write performance and always having a consistent image on disk, so you never have to run fsck or chkdsk, or any of those if you're familiar with them, after rebooting. And then we went through a long list of key WAFL features, just to give you a feel for the rich feature set we've built over the last 12 to 14 years.
