
Chapter 0: Introduction

What is source control? Sometimes we call it "version control". Sometimes we call it "SCM", which stands for either "software configuration management" or "source code management". Sometimes we call it "source control". I use all these terms interchangeably and make no distinction between them (for now, anyway; configuration management actually carries more advanced connotations I'll discuss later). By any of these names, source control is an important practice for any software development team.

The most basic element in software development is our source code. A source control tool offers a system for managing this source code. There are many source control tools, and they are all different. However, regardless of which tool you use, it is likely that your source control tool provides some or all of the following basic features:

- It provides a place to store your source code.
- It provides a historical record of what you have done over time.
- It can provide a way for developers to work on separate tasks in parallel, merging their efforts later.
- It can provide a way for developers to work together without getting in each other's way.

HOWTO

My goal for this series of articles is to help people learn how to do source control. I work for SourceGear, a developer tools ISV. We sell an SCM tool called Vault. Through the experience of selling and supporting this product, I have learned something rather surprising:

Nobody is teaching people how to do source control.


Our universities often don't teach people how to do source control. We graduate with Computer Science degrees. We know more than we'll ever need to know about discrete math, artificial intelligence and the design of virtual memory systems. But many of us enter the workforce with no knowledge of how to use any of the basic tools of software development, including bug-tracking, unit testing, code coverage, source control, or even IDEs.

Our employers don't teach people how to do source control. In fact, many employers provide their developers with no training at all.

SCM tool vendors don't teach people how to do source control. We provide documentation on our products, but the help and the manuals usually amount to simple explanations of the program's menus and dialogs. We sort of assume that our customers come to us with a basic background.

Here at SourceGear, our product is positioned specifically as a replacement for SourceSafe. We assume that everyone who buys Vault already knows how to use SourceSafe. However, experience is teaching us that this assumption is often untrue. One of the most common questions received by our support team is from users asking for a solid explanation of the basics of source control.

We need some materials that explain how source control is done. My goal for this series of articles is to create a comprehensive guide to help meet this need.

Best Practice: Use source control

Some surveys indicate that 70% of software teams do not use any kind of source control tool. I cannot imagine how they cope. Throughout this series of articles, I will be sprinkling Best Practices that will appear in sidebar boxes like this one. These boxes will contain pithy and practical tips for developers and managers using SCM tools.

Somewhat tool-specific

Ideally, a series of articles on the techniques of source control would be tool-neutral, applicable to any of the available SCM tools. It simply makes sense to teach the basic skills without teaching the specifics of any single tool. We learn the basic skills of writing before we learn to use a word processor.

However, in the case of SCM tools, this tool-agnostic approach is somewhat difficult to achieve. Unlike writing, source control is simply not done without the assistance of specialized tools. With no tools at all, the methods of source control are not practical. Complicating matters further is the fact that not all source control tools are alike. There are at least dozens of SCM tools available, but there is no standard set of features or even a standard terminology. The word "checkout" has different meanings for CVS and SourceSafe. The word "branch" has very different semantics for Subversion and PVCS.

So I will keep the tool-neutral ideal in mind as I write, but my articles will often be somewhat tool-specific. Vault is the tool I know best, since I have played a big part in its design and coding. Furthermore, I freely acknowledge that I have a business incentive to talk about my own product. Although I will often mention other SCM tools, the articles in this series will use the terminology of Vault.

The world's most incomplete list of SCM tools

Several SCM tools that I mention in this series are listed below.

- Vault. Our product. 'Nuff said.
- SourceSafe. Microsoft. Old. Loved. Hated.
- Subversion. Open source. New. Neato.
- CVS. Open source. Old. Reliable. Dusty.
- Perforce. Commercial. A competitor of SourceGear, but one that I admire.
- BitKeeper. Commercial. Uses a distributed repository architecture, so I won't be talking about this one much.
- Arch. Open source. Distributed repository architecture. Again, I spend most of my words here on tools with a centralized server.

This is a very incomplete list. There are many SCM tools, and I am not interested in trying to produce and maintain an accurate listing of them all.

Audience

I am writing about source control for programmers and web developers.

When we apply some of the concepts of source control to the world of traditional documents, the result is called "document management". When we apply them to the world of graphic design, the result is called "asset management". I'm not writing about either of those usage scenarios. My audience here is the group of people who deal primarily with source code files or HTML files.

Warnings about my writing style

First of all, let me say a thing or two about political correctness. Throughout these articles, I will occasionally find the need for gender-specific pronouns. In such situations, I generally try to use the male and female variants of the words with approximately equal frequency.

Second of all, please accept my apologies if my dry sense of humor ever becomes a distraction from the material. I am writing about source control and trying to make it interesting. That's like writing about sex and trying to make it boring, so please cut me some slack if I try to make you chuckle along the way.

Looking Ahead

Source control is a large topic, so there is much to be said. I plan for the chapters of this series to be sorted very roughly from the very basic to the very advanced. In the next chapter, I'll start by defining the most fundamental terminology of source control.

Chapter 1: Basics
A tale of two trees

Our discussion of source control must begin by defining the basic terms and describing the basic operations. Let's start by defining two important terms: repository and working folder.

An SCM tool provides a place to store your source code. We call this place a repository. The repository exists on a server machine and is shared by everyone on your team. Each individual developer does her work in a working folder, which is located on a desktop machine and accessed using a client.

Each of these things is basically a hierarchy of folders. A specific file in the repository is described by its path, just like we describe a specific file on the file system of your local machine. In Vault and SourceSafe, a repository path starts with a dollar sign. For example, the path for a file might look like this:

$/trunk/src/myLibrary/hello.cs

The workflow of a developer is an infinite loop which looks something like this:

1. Copy the contents of the repository into a working folder.
2. Make changes to the code in the working folder.
3. Update the repository to incorporate those changes.
4. Repeat.

I've omitted certain details like staff meetings and vacations, but this loop essentially describes the life of a developer who is working with an SCM tool. The repository is the official place where all completed work is stored. A task is not considered to be completed until the repository contains the result of that task.

Let's imagine for a moment what life would be like without this distinction between working folder and repository. In a single-person team, the situation could be described as tolerable. However, for any plurality of developers, things can get very messy. I've seen people try it. They store their code on a file server. Everyone uses Windows file sharing and edits the source files in place. When somebody wants to edit main.cpp, they shout across the hall and ask if anybody else is using that file. Their Ethernet is saturated most of the time because the developers are actually compiling on their network drives. When we sell our source control tool to someone in this situation, I feel like an ER doctor. I go home that night with a feeling of true contentment, because I know that I have saved a life.

With an SCM tool, working on a multi-person team is much simpler. Each developer has a working folder which is a private workspace. He can make changes to his working folder without adversely affecting the rest of the team.

Terminology note: Not all SCM tools use the exact terms I am using here. Many systems use the word "directory" instead of "folder". Some SCM tools, including SourceSafe, use the word "database" instead of "repository". In the context of Vault, these two words have different meanings. Vault allows multiple repositories to exist within a single SQL database. For this reason, I use the word "database" only when I am referring to the SQL database.

In and Out

Best Practice: Don't break the tree

The benefit of working folders is mostly lost if the contents of the repository become "broken". At all times, the contents of the repository should be in a state which allows everyone on the team to continue to work. If a developer checks in some code which won't build or won't pass the test suite, the entire team grinds to a halt.

Many teams have some sort of a social penalty which is applied to developers who break the tree. I'm not talking about anything severe, just a little incentive to remind developers to be careful. For example, require the guilty party to put a dollar in a glass jar. (Use the money to take the team to go see a movie after the product is shipped.) Another idea is to require the guilty developer to make the coffee every morning. The point is to make the developer feel embarrassed, but not punished.

The repository exists on a server machine which is far away from the desktop machine containing the working folder where the developer does her work. The word "far" in the previous sentence is intended to mean anything from a few centimeters to thousands of kilometers. The physical distance doesn't really matter. The SCM tool provides the ability to communicate between the client and the server over TCP/IP, whether the network is a local Ethernet or an Internet connection to another continent.

Because of this separation between working folder and repository, the most frequently used features of an SCM tool are the ones which help us move things back and forth between them. Let's define some terms:

- Add: A repository starts out completely empty, so we need to "Add" things to it. Using the "Add Files" command in Vault, you can specify files or folders on your desktop machine which will be added to the repository.
- Get: When we copy things from the repository to the working folder, we call that operation "Get". Note that this operation is usually used when retrieving files that we do not intend to edit. The files in the working folder will be read-only.
- Checkout: When we want to retrieve files for the purpose of modifying them, we call that operation "Checkout". Those files will be marked writable in our working folder. The SCM server will keep a record of our intent.
- Checkin: When we send changes back to the repository, we call that operation "Checkin". Our working files will be marked back to read-only and the SCM server will update the repository to contain new versions of the changed files.

Note that these definitions are merely starting points. The descriptions above correspond to the behavior of SourceSafe and Vault (with its default settings). However, we will see later that other tools (such as CVS) work somewhat differently, and Vault can optionally be configured in a mode which matches the behavior of CVS.

Terminology note: Some SCM tools use these words a bit differently. Vault and SourceSafe use the word "checkout" as a command which specifically communicates the intent to edit a file. For CVS, the "checkout" command is used to retrieve files from the repository regardless of whether the user intends to edit the files or not. Some SCM tools use the word "commit" instead of the word "checkin". Actually, Vault uses either of these terms, for reasons that will be explained in a later chapter.

H.G. Wells would be proud

Your repository is more than just an archive of the current version of your code. Actually, it is an archive of every version of your code. Your repository contains history. It contains every version of every file that has ever been checked in to the repository. For this reason, I like to think of a source control tool as a time machine.

The ability to travel back in time can be extremely useful for a software project. Suppose we need the ability to retrieve a copy of our source code exactly as it looked on April 28th, 2002. An SCM tool makes this kind of thing easy to do. An even more common case is the situation where a piece of code looks goofy and nobody can figure out why. It's handy to be able to look back at the history and understand when and why a certain change happened.

Over time, the complete history of a repository can become large and overwhelming, so SCM tools provide ways to cope. For example, Vault provides a History Explorer which allows the history entries to be queried and searched and sorted. Perhaps more importantly, most SCM tools provide a feature called a "label" or a "tag". A label is basically a way to mark a specific instant in the history of the repository with a meaningful name. The label makes it easy to later retrieve a snapshot of exactly what the repository contained at that instant.
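To make the idea of a repository full of history a bit more concrete, here is a toy sketch in Python. Nothing here is Vault's actual implementation; ToyRepository and all of its methods are invented for illustration.

```python
# A minimal sketch of a repository that keeps every version of a file
# and supports labels. All names here are hypothetical.

class ToyRepository:
    def __init__(self):
        self.versions = []   # versions[i] holds the contents of version i+1
        self.labels = {}     # label name -> version number

    def checkin(self, contents):
        """Store a new version; nothing is ever overwritten or destroyed."""
        self.versions.append(contents)
        return len(self.versions)          # the new version number

    def label(self, name):
        """Mark the current instant in history with a meaningful name."""
        self.labels[name] = len(self.versions)

    def get(self, version=None, label=None):
        """Retrieve the latest version, a specific version, or a labeled one."""
        if label is not None:
            version = self.labels[label]
        if version is None:
            version = len(self.versions)   # default: the latest
        return self.versions[version - 1]

repo = ToyRepository()
repo.checkin("rev 1 of hello.cs")
repo.label("BETA-1")
repo.checkin("rev 2 of hello.cs")
assert repo.get() == "rev 2 of hello.cs"                # the latest
assert repo.get(label="BETA-1") == "rev 1 of hello.cs"  # time travel
```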

Looking Ahead

This chapter merely scratches the surface of what an SCM tool can provide, making brief mention of two primary benefits:

- Working folders provide developers with a private workspace which is distinct from the main repository.
- Repository history provides a complete archive of every change and why it was made.

In the next chapter, I'll be going into much greater detail on the topic of checkins.

Chapter 2: Checkins
In this chapter, I will explore the various situations wherein a repository is modified, starting with the simplest case of a single developer making a change to a single file.

Editing a single file

Consider the simple situation where a developer needs to make a change to one source file. This case is obviously rather simple:

1. Checkout the file
2. Edit the working file as needed
3. Checkin the file

I won't talk much about step 2 here, as it doesn't really involve the SCM tool directly. Editing the file usually involves the use of some other tools, like an integrated development environment (IDE). But I do want to explore steps 1 and 3 in greater detail.

Step 1: Checkout

Checking out a file has two basic effects:

- On the server, the SCM tool will remember the fact that you have the file checked out so that others may be informed.
- On your client, the SCM tool will prepare your working file for editing by changing it to be writable.
The server side of checkout

File checkouts are a way of communicating your intentions to others. When you have a file checked out, other users can be aware and avoid making changes to that file until you are done with it. The checkout status of a file is usually displayed somewhere in the user interface of the SCM client application. For example, in the following screendump from Vault, users can see that I have checked out libsgdcore.cpp:

This screendump also hints at the fact that there are actually two kinds of checkouts. The issue here is the question of whether two people can checkout a file at the same time. The answer varies across SCM tools. Some SCM tools can be configured to behave either way.

Sometimes the SCM tool will allow multiple people to checkout a file at the same time. SourceSafe and Vault both offer this capability as an option. When this "multiple checkouts" feature is used, things can get a bit more complicated. I'll talk more about this later.

If the SCM tool prevents anyone else from checking out a file which I have checked out, then my checkout is "exclusive" and may be described as a "lock". In the screendump above, the user interface is indicating that I have an exclusive lock on libsgdcore.cpp. Vault will allow no one else to checkout this file.
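Here is a little Python sketch of this server-side rule. The CheckoutTable class and its behavior are hypothetical, not the code of any real SCM server; it simply shows how an exclusive lock differs from the "multiple checkouts" option.

```python
# Hypothetical sketch of the server-side checkout rules described above.

class CheckoutTable:
    def __init__(self, allow_multiple=False):
        self.allow_multiple = allow_multiple   # the "multiple checkouts" option
        self.checkouts = {}                    # path -> set of user names

    def checkout(self, path, user):
        holders = self.checkouts.setdefault(path, set())
        if holders and not self.allow_multiple:
            # Someone already holds an exclusive lock on this file.
            raise PermissionError(f"{path} is locked by {', '.join(holders)}")
        holders.add(user)

    def checkin(self, path, user):
        self.checkouts.get(path, set()).discard(user)

server = CheckoutTable(allow_multiple=False)
server.checkout("$/src/libsgdcore.cpp", "eric")
try:
    server.checkout("$/src/libsgdcore.cpp", "jane")  # refused: exclusive lock
except PermissionError as e:
    print(e)
```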

Best Practice: Use checkouts and locks carefully

It is best to use checkouts and locks only when you need them. A checkout discourages others from modifying a file, and a lock prevents them from doing so. You should therefore be careful to use these features only when you actually need them.

- Don't checkout files just because you think you might need to edit them.
- Don't checkout whole folders. Checkout the specific files you need.
- Don't checkout hundreds or thousands of files at one time.
- Don't hold exclusive locks any longer than necessary.
- Don't go on vacation while holding exclusive locks on files.

The client side of checkout

On the client side, the effect of a checkout is quite simple: If necessary, the latest version of the file is retrieved from the server. The working file is then made writable, if it was not in that state already. All of the files in a working folder are made read-only when the SCM tool retrieves them from the repository. A file is not made writable until it is checked out. This prevents the developer from accidentally editing a file.
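The client-side effect can be illustrated with a couple of lines of Python. This assumes a POSIX-style permission model; real SCM clients use whatever mechanism the platform provides.

```python
# A sketch of the client-side effect of get and checkout.
import os
import stat

def after_get(path):
    """After a 'get', the working file is read-only."""
    os.chmod(path, stat.S_IREAD)

def after_checkout(path):
    """A 'checkout' makes the working file writable."""
    os.chmod(path, stat.S_IREAD | stat.S_IWRITE)
```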
Undoing a checkout

Normally, a checkout ends when a checkin happens. However, sometimes we checkout a file and subsequently decide that we did not need to do so. When this happens, we "undo the checkout". Most SCM tools have a command which offers this functionality. On the server side, the command will remove the checkout and release any exclusive lock that was being held. On the client side, Vault offers the user three choices for how the working file should be treated:

- Revert: Put the working file back in the state it was in when I checked it out. Any changes I made while I had the file checked out will be lost.
- Leave: Leave the working file alone. This option will effectively leave the file in a state which we call "Renegade". It is a bad idea to edit a file without checking it out. When I do so, Vault notices my transgression and chastises me by letting me know that the file is "Renegade".
- Delete: Delete the working file.

I usually prefer to work with "Revert" as my option for how the Undo Check Out command behaves.

Step 3: Checkin

After the file is checked out, the developer proceeds to make her changes. She edits the file and verifies that her change is correct. Having completed all this, she is ready to submit her changes to the repository. Doing so will make her change permanent and official. Submitting her changes to the repository is the operation we call "checkin".

The process of a checkin isn't terribly complicated:

1. The new version of the file is sent to the SCM server where it is stored.
2. The version number of the file in the repository is incremented by one.
3. The file is no longer considered to be checked out or locked.
4. The working file on the client side is made read-only again.

One issue does deserve special mention. Most SCM tools ask the user to enter a comment when making a checkin. This comment will be stored in the repository forever along with the changes being submitted. The comment provides a place for the developer to explain what was changed and why the change was made.

Best Practice: Explain your checkins completely

Every SCM tool provides a way to associate a comment when checking changes into the repository. This comment is important. If we consistently use good checkin comments, our repository's history contains not only every change we have ever made, but it also contains an explanation of why those changes happened. These kinds of records can be invaluable later as we forget things.

I believe developers should be encouraged to enter checkin comments which are as long as necessary to explain what is going on. Don't just type "minor change". Tell us what the minor change was. Don't just tell us "fixed bug 1234". Tell us what bug 1234 is and tell us a little bit about the changes that were necessary to fix it.

The following screendump shows the checkin dialog box from Vault:

Checkins are additive

It is reassuring to remember one fundamental axiom of source control: Nothing is ever destroyed.

Let us suppose that we are editing a file which is currently at version 4. When we checkin our changes, our new version of the file becomes version 5. Clients will be notified that the latest version is now 5. Clients that are still holding version 4 in their working folder will be warned that the file is now "Old".

But version 4 is still there. If we ask the server for the latest version, we will get 5. But if we specifically ask for version 4, or for any previous version, we can still get it. Each checkin adds to the history of our repository. We never subtract anything from that history.
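A few lines of Python can illustrate this axiom. The names here are invented; the point is simply that a checkin appends a new version and never disturbs the old ones.

```python
# Sketch of the "nothing is ever destroyed" axiom. Hypothetical names.
history = ["v1 contents", "v2 contents", "v3 contents", "v4 contents"]

def checkin(new_contents):
    history.append(new_contents)       # add, never subtract
    return len(history)                # the new latest version number

def get(version=None):
    version = version or len(history)  # default to the latest
    return history[version - 1]

my_baseline = 4
new_latest = checkin("v5 contents")    # someone checks in version 5
if my_baseline < new_latest:
    print("working file is Old")       # clients holding v4 are warned
assert get(4) == "v4 contents"         # ...but version 4 is still there
```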

Other kinds of checkins

We will informally use the word "checkin" to refer to any change which is made to the repository. It is common for a developer to say, "I made some checkins this afternoon to fix that bug", using the word "checkin" to include any of the following types of changes to the repository:

- Create a new folder
- Add a file
- Rename a file or folder
- Delete a file or folder
- Move a file or folder

It may seem odd to refer to these operations using the word "checkin", because there is no corresponding "checkout" step. However, this looseness is typical of the way people use the word "checkin", so you'll get used to it.

I will take this opportunity to say a few things about how these operations behave. If we conceptually think of a folder as a list of files and subfolders, each of these operations is actually a modification of a folder. When we create a folder inside folder A, then we are modifying folder A to include a new subfolder in its list. When we rename a file or folder, the parent folder is being modified.

Just as the version number of a file is incremented when we modify it, these folder-level changes cause the version number of a folder to be incremented. If we ask for the previous version of a folder, we can still retrieve it just the way it was before. The renamed file will be back to the old name. The deleted file will reappear exactly where it was before. It may bother you to realize that the "delete" command in your SCM tool doesn't actually delete anything. However, you'll get used to it.

Atomic transactions

I've been talking mostly about the simple case of making a change to a single source code file. However, most programming tasks require us to make multiple repository changes. Perhaps we need to edit more than one file to accomplish our task. Perhaps our task requires more than just file modifications, but also folder-level changes like the addition of new files or the renaming of a file.

When faced with a complex task that requires several different operations, we would like to be able to submit all the related changes together in a single checkin operation. Although tools like SourceSafe and CVS do not offer this capability, some source control systems (like Vault and Subversion) do include support for "atomic transactions". The concept is similar to the behavior of atomic transactions in a SQL database. The Vault server guarantees that all operations within a transaction will stay together. Either they will all succeed, or they will all fail. It is impossible for the repository to end up in a state with only half of the operations done. The integrity of the repository is assured. A minimal sketch of this all-or-nothing idea appears below.

Best Practice: Group your checkins logically

I recommend that each transaction you check into the repository should correspond to one task. A "task" might be a bug fix or a feature. Include all of the repository changes which were necessary to complete that task, and nothing else. Avoid fixing multiple bugs in a single checkin transaction.
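Here is that minimal sketch, written in Python with invented names. A real SCM server would do this inside a database transaction; the sketch just shows that a changeset either applies completely or not at all.

```python
# Sketch of an atomic changeset commit, in the spirit of a SQL transaction.
import copy

class TransactionError(Exception):
    pass

def apply_changeset(tree, operations):
    """Apply every operation, or none of them: work on a copy and
    only publish the result if the whole changeset succeeds."""
    candidate = copy.deepcopy(tree)
    for op, path, *args in operations:
        if op == "edit":
            if path not in candidate:
                raise TransactionError(f"cannot edit missing file {path}")
            candidate[path] = args[0]
        elif op == "add":
            candidate[path] = args[0]
        elif op == "delete":
            if path not in candidate:
                raise TransactionError(f"cannot delete missing file {path}")
            del candidate[path]
    return candidate   # replaces the tree only if we got this far

tree = {"$/Makefile": "...", "$/foo.c": "..."}
ops = [("edit", "$/Makefile", "new contents"),
       ("delete", "$/foo.c"),
       ("add", "$/feature_creep.c", "int main() {}")]
tree = apply_changeset(tree, ops)      # all three ops succeed together
```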

To ensure that a transaction can contain all kinds of operations, Vault supports the notion of a pending change set. Essentially, the Vault client keeps a running list of changes you have made which are waiting to be sent to the server. When you invoke the Delete command, not only will it not actually delete anything, but it doesn't even send the command to the server. It merely adds the Delete operation to the pending change set, so that it can be sent later as part of a group. In the following screen dump, my pending change set contains three operations. I have modified libsgdcore.cpp. I have renamed libsgdcore.h to headerfile.h. And I have deleted libsgdcore_diff_file.c.

Note that these operations have not actually happened yet. They won't happen unless I submit them to the server, at which time they will take place as a single atomic transaction. Vault persists the pending change set between sessions. If I shut down my Vault client and turn off my computer, the next time I launch the Vault client the pending change set will contain the same items it does now.

The Church of "Edit-Merge-Commit"

Up until now, I have explained everything about checkouts and checkins in a very "matter of fact" fashion. I have claimed that working files are always read-only until they are checked out, and I have claimed that files are always checked out before they are checked in. I have made broad generalizations and I have explained things in terms that sound very absolute.

I lied.

In reality, there are two very distinct doctrines for how this basic interaction with an SCM tool can work. I have been describing the doctrine I call "checkout-edit-checkin". Reviewing the simple case when a developer needs to modify a single file, the practice of this faith involves the following steps:

1. Checkout the file
2. Edit the working file as needed
3. Checkin the file

Followers of the "checkout-edit-checkin" doctrine are effectively submitting to live according to the following rules:

- Files in the working folder are read-only unless they are checked out.
- Developers must always checkout a file before editing it. Therefore, the entire team always knows who is editing which files.
- Checkouts are made with exclusive locks, so only one developer can checkout a file at one time.

This approach is the default behavior for SourceSafe and for Vault. However, CVS doesn't work this way at all. CVS uses the doctrine I call "edit-merge-commit". Practitioners of this religion will perform the following steps to modify a single file:

1. Edit the working file as needed
2. Merge any recent changes from the server into the working file
3. Commit the file to the repository

The edit-merge-commit doctrine is a liberal denomination which preaches a message of freedom from structure. Its followers live by these rules:

- Files in the working folder are always writable.
- Nobody uses checkouts at all, so nobody knows who is editing which files.
- When a developer commits his changes, he is responsible for ensuring that his changes were made against the latest version in the repository.

As I said, this is the approach which is supported by CVS. Vault supports edit-merge-commit as an option. In fact, when this option is turned on, we informally say that Vault is running in "CVS mode".

Each of these approaches corresponds to a different style of managing concurrent development on a team. People tend to have very strong feelings about which style they prefer. The religious flame war between these two churches can get very intense.

Holy Wars

The "checkout-edit-checkin" doctrine is obviously more traditional and conservative. When applied strictly, it is impossible for two people to modify a given file at the same time, thus avoiding the necessity of merging two versions of a file into one.

The "edit-merge-commit" doctrine teaches a lifestyle which is riskier. The risk is that the merge step may be tedious or cause problems. However, the acceptance of this risk rewards us with a concurrent development style which causes developers to trip over each other a lot less often.

Still, these risks are real, and we will not flippantly disregard them. A detailed discussion of file merging appears in the next chapter. For now I will simply mention that most SCM tools include features that can safely do a three-way merge automatically. Not all developers are willing to trust this feature, but many do. So, when using the "edit-merge-commit" approach, the merge must happen, and we are left with two choices:

- Attempt the automerge. (can be scary)
- Merge the files by hand. (can be tedious)

Developers who prefer "checkout-edit-checkin" often find both of these choices to be unacceptable.

I will confess that I am a disciple of the edit-merge-commit religion. People who use edit-merge-commit often say that they cannot imagine going back to what life was like before. I agree. It is so very convenient to never be required to checkout a file. All the files in my working folder are always writable. If I want to start working on a bugfix or a feature, I simply open a text editor and begin making my changes.

This benefit is especially useful when I am disconnected from the server. When people ask me about the best way to use Vault while "offline", I tell them to consider using edit-merge-commit. Since I don't have to contact the server to checkout a file, I can simply proceed with my changes. The only time I need the server is when it comes time to merge and commit.

As I said, automerge is amazingly safe in practice. Thousands of teams use it every day without incident. I have been actively using edit-merge-commit as my development style for over five years, and I cannot remember a situation where automerge produced an incorrect file. Experience has made me a believer.

Best Practice: Get the best of both worlds

Here at SourceGear we are quite proud of the fact that Vault allows each developer to choose their own concurrent development style. Developers who prefer "checkout-edit-checkin" can work that way. Developers who prefer "edit-merge-commit" can use that approach, and they still have exclusive locks available to them for those times when they are needed. As far as I know, Vault is the only product that offers this flexibility.

I apologize for this completely shameless plug. I won't do it very often.

Looking Ahead

In the next chapter, I will be talking in greater detail about the process of merging two modified versions of a file.

Chapter 3: File Merge


How did we get ourselves into this mess?

There are several reasons why we may need to merge two modified versions of a file:

- When using "edit-merge-commit" (sometimes called "optimistic locking"), it is possible for two developers to edit the same file at the same time.
- Even if we use "checkout-edit-checkin", we may allow multiple checkouts, resulting once again in the possibility of two developers editing the same file.
- When merging between branches, we may have a situation where the file has been modified in both branches.

In other words, this mess only happens when people are working in parallel. If we serialize the efforts of our team by never branching and never allowing two people to work on a module at the same time, we can avoid ever facing the need to merge two versions of a file.

However, we want our developers to work concurrently. Think of your team as a multithreaded piece of software, each developer running in its own thread. The key to high performance in a multithreaded system is to maximize concurrency. Our goal is to never have a thread which is blocked on some other thread.

So we embrace concurrent development, but the threading metaphor continues to apply. Multithreaded programming can sometimes be a little bit messy, and the same can be said of a multithreaded software team. There is a certain amount of overhead involved in things like synchronization and context switching. This overhead is inevitable. If your team is allowing concurrent development to happen, it will periodically face a situation where two versions of a file need to be merged into one. In rare cases, the situation can be properly resolved by simply choosing one version of the file over the other. However, most of the time, we actually need to merge the two versions to create a new version.

What do we do about it?

Let's carefully state the problem as follows: We have two versions of a file, each of which was derived from the same common ancestor. We sometimes call this common ancestor the "original" file. Each of the other versions is merely the result of someone applying a set of changes to the original. What we want to create is a new version of the file which is conceptually equivalent to starting with the original and applying both sets of changes. We call this process "merging".

The difficulty of doing this merge varies greatly for different types of files. How would we perform a merge of two Excel spreadsheets? Two PNG images? Two files which have digital signatures? In the general case, the only way to merge two modified versions of a file is to have a very smart person carefully construct a new copy of the file which properly incorporates the correct elements from each of the other two.

However, in software and web development there is a special case which is very common. As luck would have it, most source code files are plain text files with an average of less than 80 characters per line. Merging files of this kind is vastly simpler than the general case. Many SCM tools contain special features to assist with this sort of a merge. In fact, in a majority of these cases, the two files can be automatically merged without requiring the manual effort of a developer.

An example

Let's call our two developers Jane and Joe. Both of them have retrieved version 4 of the same file and both of them are working on making changes to it. One of these developers will checkin before the other one. Let's assume it is Jane who gets there first. When Jane tries to checkin her changes, nothing unusual will happen. The current version of the file is 4, and that was the version she had when she started making her changes. In other words, version 4 was her baseline for these changes. Since her baseline matches the current version, there is no merge necessary. Her changes are checked in, and a new version of the file is created in the repository. After her checkin, the current version of the file is now 5.

The responsibility for merging is going to fall upon Joe. When he tries to checkin his changes, the SCM tool will protest. His baseline version is 4, but the current version in the repository is now 5. If Joe is allowed to checkin his version of the file, the changes made by Jane in version 5 will be lost.
Therefore, Joe will not be allowed to checkin this file until he convinces the SCM tool that he has merged Jane's version 5 changes into his working copy of the file. Vault reports this situation by setting the status on this file to be "Needs Merge", as shown in the screen dump below:

In order to resolve this situation, Joe effectively needs to do a three-way comparison between the following three versions of the file:

- Version 4 (the baseline from which he and Jane both started)
- Version 5 (Jane's version)
- Joe's working file (containing his own changes)

Version 4 is the common ancestor for both Joe's version and Jane's version of the file. By running a diff between version 4 and version 5, Joe can see exactly what changes Jane made. He can use this information to apply those changes to his own version of the file. Once he has done so, he can credibly claim that his version is a merge of his changes and Jane's.

Best Practice: Keep the repository in sight

This example happens to involve the need to merge only a single checkin. Since Joe's baseline is 4 and the current repository version is 5, Joe is only 1 version out of date. If the repository version were 25 instead of 5, then Joe would be 21 versions out of date instead of just 1, but the technique is the same. No matter how old his baseline is, Joe still needs to retrieve the latest version and do a three-way merge. However, the older his baseline, the more likely he is to encounter conflicts in the merge.

Keep in touch with the repository. Update your working folder as often as you can without interrupting your own work. Commit your work to the repository as often as you can without breaking the build. It isn't wise to let the distance between your working folder and the repository grow too large.

Strictly speaking, Joe is responsible for whatever changes Jane made, regardless of how difficult the merge may be. He must perform the changes to his file that Jane would have made if she had started with his file instead of with version 4. In theory, this could be very difficult:

- What happens if Jane changed some of the same lines that Joe changed, but in different ways?
- What happens if Jane's changes are functionally incompatible with Joe's?

- What happens if Jane made a change to a C# function which Joe has deleted?
- What happens if Jane changed 80 percent of the lines in the file?
- What happens if Jane and Joe each changed 80 percent of the lines in the file, but each did so for entirely different reasons?
- What happens if Jane's intent was not clear and she cannot be reached to ask questions?

All of these situations are possible, and all of them are Joe's responsibility. He must incorporate Jane's changes into his file before he can checkin a version 6.

In certain rare situations, Joe may examine Jane's changes and realize that his version needs nothing from Jane's version 5. Maybe Jane's change simply isn't relevant anymore. In these cases, the merge isn't needed, and Joe can simply declare the merge to be resolved without actually doing anything. This decision remains subject to Joe's judgment. However, most of the time it will be necessary for the merge to actually happen. In these cases, Joe has the following options:

- Attempt to automerge
- Use a visual merge tool
- Redo one set of changes by hand

Each of these will be explained further in the sections below.

Attempt to automerge

As I mentioned above, a surprising number of cases can be easily handled automatically. Most source control tools include the ability to attempt an automatic merge. The algorithm uses all three of the involved versions of the file and attempts to safely produce a merged version.

The reason that automerge is so safe in practice is that the algorithm is extremely conservative. Automerge will refuse to produce a merged version if Joe's changes and Jane's changes appear to be in conflict. In the most obvious case, if Joe and Jane both modified the same line, automerge will detect this "conflict" and refuse to proceed. In other cases, automerge may fail with conflicts if two changes are too close to each other. A toy sketch of this conservative behavior appears below.

Best Practice: Only use "automerge on get"

It is widely accepted that SCM tools should only attempt automerge on the "get" of a file. In other words, when Joe realizes that he must merge in the changes Jane made between version 4 and version 5, he will tell his SCM client application to "get" version 5 and attempt to automatically merge it into his working file. CVS, Subversion and Vault all function in this manner.

Unfortunately, SourceSafe attempts to "automerge on checkin". This is just a really bad idea. When Joe tries to checkin his changes, SourceSafe attempts the automerge. If it believes that it has succeeded, then his changes are checked in and version 6 is created. However, it is possible that Joe never examined version 6, or even compiled it. The repository now contains a file which has never existed in the working folder of any developer on earth. Its contents have never been seen by human eyes, and it has never been run through a compiler. Automerge is safe, but it's not that safe.

It is much better to "automerge on get". This way, the developer can (and should) examine the file after the automerge has happened. This simple change makes it easier to trust automerge. Instead of trying to do the developer's job, automerge simply becomes a tool which the developer can use to get his job done faster.
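For the curious, here is a toy Python version of a conservative three-way merge. It is not the algorithm used by Vault or any other real tool; it just demonstrates the behavior described above: apply both sets of changes when they touch different parts of the base file, and refuse when they overlap or even sit right next to each other.

```python
import difflib

def automerge(base, jane, joe):
    """Conservative three-way merge of lists of lines. Returns the
    merged lines, or None when the two sets of changes touch
    overlapping (or adjacent) regions of the base file."""
    def changed_regions(side):
        # Each entry: (start, end) in the base, plus the replacement lines.
        sm = difflib.SequenceMatcher(None, base, side)
        return [(i1, i2, side[j1:j2])
                for tag, i1, i2, j1, j2 in sm.get_opcodes() if tag != "equal"]

    jane_ch = changed_regions(jane)
    joe_ch = changed_regions(joe)

    # Refuse to merge if any of Jane's regions touches any of Joe's.
    for i1, i2, _ in jane_ch:
        for k1, k2, _ in joe_ch:
            if i1 <= k2 and k1 <= i2:
                return None  # conflict: a human must decide

    # No conflicts: apply both sets of changes in one pass over the base.
    repl = {i1: (i2, text) for i1, i2, text in jane_ch + joe_ch}
    merged, i = [], 0
    while i < len(base):
        if i in repl:
            end, text = repl[i]
            merged.extend(text)
            if end == i:                 # a pure insertion before base[i]
                merged.append(base[i])
                i += 1
            else:                        # a replacement or a deletion
                i = end
        else:
            merged.append(base[i])
            i += 1
    if len(base) in repl:                # an insertion at end of file
        merged.extend(repl[len(base)][1])
    return merged

base = ["int x;", "// Global settings", "int y;"]
jane = ["// Jane's comment"] + base          # Jane inserts at the top
joe = base + ["// Joe's comment"]            # Joe inserts at the bottom
print(automerge(base, jane, joe))            # both changes are applied

jane2 = ["int x;", "// Worldwide settings", "int y;"]
joe2 = ["int x;", "// Rampant settings", "int y;"]
print(automerge(base, jane2, joe2))          # None: a real conflict
```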

Use a visual merge tool

In cases where automerge cannot automatically resolve conflicts, we can use a visual merge tool to make the job easier. These tools provide a visual display which shows all three files and highlights exactly what has changed. This makes it much easier for the developer to perform the merge, since she can zero in on the conflicts very quickly.

There are several excellent visual merge tools available, including Guiffy and Araxis Merge. The following screen dump is from "SourceGear DiffMerge", the visual merge tool which is included with Vault. (Please note that sometimes I have to reduce the size of screen dumps to make them fit. In those cases, you can click on the image to see it at full resolution.)

This picture is typical of other three-way visual merge applications. The left pane shows Jane's version of the file. The right pane shows Joe's version. The center pane shows the original file, the common ancestor from which they both started to make changes. As you can see, Jane and Joe have each inserted a one-line comment. By right-clicking on each change, the developer can choose whether to apply that change to the middle pane. In this example, the two changes don't conflict. There is no reason that the resulting file cannot incorporate both changes. The following picture shows an example of changes which are conflicting.

Both Jane and Joe have tried to change the wording of this comment. In the original file, the word used in the comment was "Global". Jane decided to change this word to "Worldwide", but Joe has changed it to the word "Rampant". These two changes are conflicting, as indicated by the yellow background color being used to display them. Automerge cannot automatically handle cases like these. Only a human being can decide which change to keep. The visual merge tool makes it easy to handle this situation. I can decide which change I want to keep and apply it to the center pane.

A visual merge tool can make file merging a lot easier by quickly showing the developer exactly what has changed and allowing him to specify which changes should be applied to get the final merged result. However, as useful as these kinds of tools can be, they're not magic.

Redo one set of changes by hand

Some situations are so complicated that a visual merge tool just isn't very helpful. In the worst case scenario, Joe might have to manually redo one set of changes. This situation recently happened here at SourceGear. We currently have Vault development happening in two separate branches:

- When we shipped version 2.0, we created a branch for maintenance of the 2.0 release. This is the tree where we develop minor bug fix releases like 2.0.1.
- Our "trunk" is the place where active development of the next major release is taking place.

Obviously we want any bug fixes that happen in the 2.0 branch to also happen in the trunk so that they can be included in our upcoming 2.1 release. We use Vault's "Merge Branches" command to migrate changes from one place to the other. I will talk more about branching and merging in a later chapter. For now, suffice it to say that the merging of branches can create exactly the same kind of three-way merge situation that we've been discussing in this chapter.

In this case, we ended up with a very difficult merge in the sections of code that deal with logins. In the 2.0 branch, we implemented a fix to prevent dictionary attacks on passwords. We considered this a bug fix, since it is related to the security of our product. In concept this change was simple. We simply block login for any account which is seeing too many failed

login attempts. However, implementing this mini-feature required a surprising number of lines to be changed. In the trunk, we added the ability for Vault to authenticate logins against Active Directory. In other words, we made substantial changes to the login code in both these branches. When it came time to merge, the DiffMerge was extremely colorful.

In this case, it was actually simpler to just start with the trunk version and reimplement the dictionary attack code. This may seem crazy, but it's actually not that bad. Redoing the changes takes a lot less time than coding the feature the first time. We could still copy and paste code from the 2.0 version.

Getting back to the primary example, Joe has a choice to make. His current working file already contains his own set of changes. He could therefore choose to redo Jane's change starting with his current working file. The problem here is that he might not really know how. He might have no idea what Jane's approach was. Jane's office might be 10,000 miles away. Jane might have written a lousy comment explaining her checkin. As an alternative, Joe could set aside his working file, start with the latest repository version and redo his own changes.

Bottom line: If a merge gets this bad, it takes some time and care to resolve it properly. Luckily, this situation doesn't happen very often.

Verifying the merge

Regardless of which of the above methods is used to complete the merge, it is highly recommended for Joe to verify the correctness of his work. Obviously he should check that the entire source tree still compiles. If a test suite is available, he should build and verify that the tests still pass.

After Joe has completed the merge and verified it, he can declare the merge to be "resolved", after which the SCM tool will allow him to checkin the file. In the case of Vault, this is done by using the Resolve Merge Status command, which explicitly tells the Vault client application that the merge is completed. At this time, Vault would change the baseline version number from 4 to 5, indicating that as far as anyone knows, Joe made his changes by starting with version 5 of the file, not with version 4. Since his baseline version now matches the current version of the file, the Vault server will now allow Joe to do his checkin.

Worth the trouble

I hope I have not scared you away from concurrent development by explaining the gory details of merging files. In fact, my goal is quite the opposite. Remember that easily-resolved merges are the most common case. Automerge handles a large percentage of situations with no problems at all. A large percentage of the remaining cases can be easily handled with a visual merge tool. The difficult situations are rare, and can still be handled easily by a developer who is patient and careful.

Many software teams have discovered that the tradeoff here is worth the trouble. Concurrent development can bring substantial gains in the productivity of a team. The extra effort to deal with merge situations is usually a small price to pay.

Best Practice: Give concurrent development a try

Many teams avoid all forms of concurrent development. Their entire team uses "checkout-edit-checkin" with exclusive locks, and they never branch. For some small teams, this approach works just fine. However, the larger your team, the more frequently a developer becomes "blocked" by having to wait for someone else. Modern source control systems are designed to make concurrent development easy. Give them a try.

Looking Ahead

In the next chapter I will be discussing the concept of a repository in a lot more detail.

Chapter 4: Repositories
Cars and clocks

In previous chapters I have mentioned the concept of a repository, but I haven't said much further about it. In this chapter, I want to provide a lot more detail. Please bear with me as I spend a little time talking about how an SCM tool works "under the hood". I am doing this because an SCM tool is more like a car than a clock.

An SCM tool is not like a clock. Clock users have no need to know how a clock works inside. We just want to know what time it is. Those who understand the inner workings of a clock cannot tell time any more skillfully than the rest of us.

An SCM tool is more like a car. Lots of people do use cars without knowing how they work. However, people who really understand cars tend to get better performance out of them.

Rest assured that this book is still a "HOWTO". My goal here remains to create a practical explanation of how to do source control. However, I believe that you can use an SCM tool more effectively if you know a little bit about what's happening inside.

Repository = File System * Time

A repository is the official place where you store all your source code. It keeps track of all your files, as well as the layout of the directories in which they are stored. It resides on a server where it can be shared by all the members of your team. But there has to be more. If the definition in the previous paragraph were the whole story, then an SCM repository would be no more than a network file system. A repository is much more than that. A repository contains history.

A file system is two-dimensional: its space is defined by directories and files. In contrast, a repository is three-dimensional: it exists in a continuum defined by directories, files and time. An SCM repository contains every version of your source code that has ever existed. The additional dimension creates some rather interesting challenges in the architecture of a repository and the decisions about how it manages data.
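A tiny sketch may help make this "third dimension" concrete: a file system maps a path to contents, while a repository maps a path and a version to contents. The data below is invented, of course.

```python
# Sketch of the file-system-times-time idea. Hypothetical data, not any
# tool's real storage format.

file_system = {
    "$/trunk/src/hello.cs": "latest contents",
}

repository = {
    ("$/trunk/src/hello.cs", 1): "contents as first checked in",
    ("$/trunk/src/hello.cs", 2): "contents after the second checkin",
    ("$/trunk/src/hello.cs", 3): "latest contents",
}

# A file system can only answer "what is in hello.cs?"
# A repository can also answer "what was in hello.cs at version 2?"
print(repository[("$/trunk/src/hello.cs", 2)])
```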

How do we store all those old versions of everything?

As a first guess, let's not be terribly clever. We need to store every version of the source tree. Why not just keep a complete copy of the entire tree for every change that has happened?

We obviously use Vault as the SCM tool for our own development of Vault. We began development of Vault in the fall of 2001. In the summer of 2002, we started "dogfooding". On October 25th, 2002, we abandoned our repository history and started a fresh repository for the core components of Vault. Since that day, this tree has been modified 4,686 times.

This repository contains approximately 40 MB of source code. If we chose to store the entire tree for every change, those 4,686 copies of the source tree would consume approximately 183 GB, without compression. At today's prices for disk space, this option is worth considering. However, this particular repository is just not very large. We have several others as well, but the sum total of all the code we have ever written still doesn't qualify as "large". Many of our Vault customers have trees which are a lot bigger.

As an example, consider the source tree for OpenOffice.org. This tree is approximately 634 MB. Based on their claim of 270 developers and the fact that their repository is almost four years old, I'm going to conservatively estimate that they have made perhaps 20,000 checkins. So, if we used the dumb approach of storing a full copy of their tree for every change, we'll need around 12 TB of disk space. That's 12 terabytes.

At this point, the argument that "disk space is cheap" starts to break down. The disk space for 12 TB of data is cheaper than it has ever been in the history of the planet. But this is mission critical data. We have to consider things like performance and backups and RAID and administration. The cost of storing 12 TB of ultra-important data is more than just the cost of the actual disk platters. So we actually do have an incentive to store this information a bit more efficiently.

Fortunately, there is an obvious reason why this is going to be easy to do. We observe that tree N is often not terribly different from tree N-1. By definition, each version of the tree is derived from its predecessor. A checkin might be as simple as a one-line fix to a single file. All of the other files are unchanged, so we don't really need to store another copy of them. So, we don't want to store the full contents of the tree for every single change. Instead, we want a way to store a tree represented as a set of changes to another tree. We call this a "delta".

Delta direction

As we decide to store our repositories using deltas, we must be concerned about performance. Retrieving a tree which is in a deltified representation requires more effort than retrieving one which is stored in full. For example, let's suppose that version 1 of the tree is stored in full, but every subsequent revision is represented as a delta from its predecessor. This means that in order to retrieve version 4,686, we must first retrieve version 1 and then apply 4,685 deltas. Obviously, this approach would mean that retrieving some versions will be faster than others. When using this approach we say that we are using "forward deltas", because each delta expresses the set of changes from one version to the next.

We observe that not all versions of the tree are equally likely to be retrieved. For example, version 83 of the Vault tree is not special in any way. It is likely that we have not retrieved that version in over a year. I suspect that we will never retrieve it again. However, we retrieve the latest version of the tree many times per day. In fact, as a broad generalization, we can say that at any given moment, the most recent version of the tree is probably the most likely one to be needed. The simplistic use of forward deltas delivers its worst performance for the most common case. Not good.
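Here is a sketch of why. The deltas below are just placeholder functions; real deltas are compact binary descriptions of changes, but the retrieval cost works the same way. All names are invented.

```python
# Sketch of naive forward-delta retrieval cost.

def materialize(version, full_tree_v1, deltas):
    """Rebuild any version by starting from the full copy of version 1
    and applying forward deltas, one per intervening version."""
    tree = full_tree_v1
    for apply_delta in deltas[:version - 1]:
        tree = apply_delta(tree)
    return tree

# To fetch version 2 we apply a single delta. But to fetch version
# 4,686 -- the latest, and the version requested most often -- we must
# apply 4,685 deltas. The most expensive retrieval is the most common one.
```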
Another idea is to use "reverse deltas". In this approach, we store the most recent tree in full. Every other tree N is represented as a set of differences from tree N+1. This approach delivers its best performance for the most common case, but it can still take an awfully long time to retrieve older trees.

Some SCM tools use some sort of a compromise design. In one approach, instead of storing just one full tree and representing every other tree as a delta, we sprinkle a few more full trees along the way. For example, suppose that we store a full tree for every 10th version. This approach uses more disk space, but the SCM server never has to apply more than 9 deltas to retrieve any tree.

What is a delta?

I've been throwing around this concept of deltas, but I haven't stopped to describe them. A tree is a hierarchy of folders and files. A delta is the difference between two trees. In theory, those two trees do not need to be related. However, in practice, the only reason we calculate the difference between them is because one of them is derived from the other. Some developer started with tree N and made one or more changes, resulting in tree N+1.

We can think of the delta as a set of changes. In fact, many SCM tools use the term "changeset" for exactly this purpose. A changeset is merely a list of the changes which express the difference between two trees. For example, let's suppose that Wilbur starts with tree N and makes the following changes:

1. He deletes $/top/subfolder/foo.c because it is no longer needed
2. He edits $/top/subfolder/Makefile to remove foo.c from the list of file names
3. He edits $/top/bar.c to remove all the calls to the functions in foo.c
4. He renames $/top/hello.c and gives it the new name hola.c
5. He adds a new file called feature_creep.c to $/top/
6. He edits $/top/Makefile to add feature_creep.c to the list of filenames
7. He moves $/top/subfolder/readme.txt into $/top

At this point, he commits all of these changes to the repository as a single transaction. When the SCM server stores this delta, it must remember all of these changes.

For changeset item 1 above, the delete of foo.c is easily represented. We simply remember that foo.c existed in tree N but does not exist in tree N+1.

For changeset item 4, the rename of hello.c is a bit more complex. To handle renames, we need each object in the repository to have an identifier which never changes, even when the name or location of the item changes.

For changeset item 7, the move of readme.txt is another example of why repositories need IDs for each item. If we simply remember every item by its path, we cannot remember the occasions when that path changes.

Changeset item 5 is going to be a lot bulkier than some of the other items here. For this item we need to remember that tree N+1 has a file called feature_creep.c which was never present in tree N. However, a full representation of this changeset item needs to contain the entire contents of that file.

Changeset items 2, 3 and 6 represent situations where a file which already existed has been modified in some way. We could handle these items the same way as item 5, by storing the entire contents of the new version of the file. However, we will be happier if we can do deltas at the file level just as we are doing deltas at the tree level.
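Here is one hypothetical way Wilbur's changeset might be recorded, with invented item IDs. The structure is not Vault's actual storage format; it just shows why stable IDs matter for renames and moves.

```python
# A hypothetical changeset for Wilbur's commit. Item IDs are invented.
changeset = [
    {"op": "delete", "id": 101},                          # foo.c
    {"op": "edit",   "id": 102, "delta": "<file delta>"}, # subfolder/Makefile
    {"op": "edit",   "id": 103, "delta": "<file delta>"}, # bar.c
    {"op": "rename", "id": 104, "new_name": "hola.c"},    # hello.c
    {"op": "add",    "id": 105, "parent": 100,
     "name": "feature_creep.c", "contents": "<full file>"},
    {"op": "edit",   "id": 106, "delta": "<file delta>"}, # top/Makefile
    {"op": "move",   "id": 107, "new_parent": 100},       # readme.txt
]
# Because item 104 is identified by an ID rather than by a path, asking
# for the previous tree can still find it under its old name, hello.c.
```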

File deltas

A file delta merely expresses the difference between two files. Once again, the reason we calculate a file delta is because we believe it will be smaller than the file itself, usually because one of the files is derived from the other.

For text files, a well-known approach to the file delta problem is to compare line-by-line and output a list of lines which have been modified, inserted or deleted. These are the same kind of results produced by the Unix 'diff' command. The bad news is that this approach only works for text files. The good news is that software developers and web developers have a lot of text files.

CVS and Perforce use this approach for repository storage. Text files are deltified using a line-oriented diff. Binary files are not deltified at all, although Perforce does reduce the penalty somewhat by compressing them.
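If you have never seen this kind of output, Python's standard difflib module can produce it (shown purely as an illustration; CVS and Perforce have their own implementations):

    import difflib

    old = ["int main(void)\n", "{\n", "    return 0;\n", "}\n"]
    new = ["int main(void)\n", "{\n", "    greet();\n", "    return 0;\n", "}\n"]

    # Only the inserted, deleted, or modified lines are emitted, plus a
    # little context, so the output is small when the files mostly match.
    for line in difflib.unified_diff(old, new, fromfile="hello.c@6", tofile="hello.c@7"):
        print(line, end="")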

Subversion and Vault are examples of tools which use binary file deltas for repository storage. Vault uses a file delta algorithm called VCDiff, as described in RFC 3284. This algorithm is byte-oriented, not line-oriented. It outputs a list of byte ranges which have been changed. This means it can handle any kind of file, binary or text. As an ancillary benefit, the VCDiff algorithm compresses the data at the same time.

Binary deltas are a critical feature for some SCM tool users, especially in situations where the binary files are large. Consider the case where a user checks out a 10 MB file, changes a few bytes, and checks it back in. In CVS, the size of the repository will increase by 10 MB. In Subversion and Vault, the repository will only grow by a small amount.

Deltas and diffs are different

Please note that I make a distinction between the terms "delta" and "diff". A "delta" is the difference between two versions. If we have one full file and a delta, then we can construct the other full file. A delta is used primarily because it is smaller than the full file, not because it is useful for a human being to read. The purpose of a delta is efficiency. When deltas are done at the level of bytes instead of textual lines, that efficiency becomes available to all kinds of files, not just text files.

A "diff" is the human-readable difference between two versions of a text file. It is usually line-oriented, but really cool visual diff tools can also highlight the specific characters on a line which differ. The purpose of a diff is to show a developer exactly what has changed between two versions of a file. Diffs are really useful for text files, because human beings tend to read text files. Most human beings don't read binary files, and human-readable diffs of binary files are similarly uninteresting.

As mentioned above, some SCM tools use binary deltas for repository storage or to improve performance over slow network lines. However, those tools also support textual diffs. Deltas and diffs serve two distinct purposes, both of which are important. It is merely coincidence that some SCM tools use textual diffs as their repository deltas.

The evolution of source control technology

At this point I should admit that I have presented a somewhat idealized view of the world. Not all SCM tools work the way I have described. In fact, I have presented things exactly backwards, discussing tree-wide deltas before file deltas. That is not the way the history of the world unfolded.

Prehistoric ancestors of modern programmers had to live with extremely primitive tools. Early version control systems like RCS only handled file deltas. There was no way for the system to remember folder-level operations like adding, renaming or deleting files.

Over time, the design of SCM tools matured. CVS is probably the most popular source control tool in the world today. It was originally developed as a set of wrappers around RCS which essentially provided support for some folder-level operations. Although CVS still has some important limitations, it was a big step forward.

Today, several modern source control systems are designed around the notion of tree-wide deltas. By accurately remembering every possible operation which can happen to a repository, these tools provide a truly complete history of a project.

What can be stored in a repository?

People sometimes ask us what kind of things can be stored in a repository. In general, the answer is: "Any file". It is true that I am focusing on tools which are designed for software developers and web developers. However, those tools don't really care what kind of file you store inside them. Vault doesn't care. Perforce, Subversion and CVS don't care. Any of these tools will gratefully accept any file you want to store.

If you will be storing a lot of binary files, it is helpful to know how your SCM tool handles them. A tool which uses binary deltas in the repository may be a better choice. If all of your files are binary, you may want to explore other solutions. Tools like Vault and Subversion were designed for programmers. These products contain features designed specifically for use with source code, including diff and automerge. You can use these systems to store all of your Excel spreadsheets, but they are probably not the best tool for the job. Consider exploring "document management" systems instead.

Best Practice: Checkin all the canonical stuff, and nothing else

Although you can store anything you want in a repository, that doesn't mean you should. The best practice here is to store everything which is necessary to do a build, and nothing else. I call this "the canonical stuff". To put this another way, I recommend that you do not store any file which is automatically generated. Checkin your hand-edited source code. Don't checkin EXEs and DLLs. If you use a code generation tool, checkin the input file, not the generated code file. If you generate your product documentation in several different formats, checkin the original format, the one that you manually edit. If you have two files, one of which is automatically generated from the other, then you just don't need to checkin both of them. You would in effect be managing two expressions of the same thing. If one of them gets out of sync with the other, then you have a problem.

How is the repository itself stored?

We need to descend through one more layer of abstraction before we turn our attention back to more practical matters. So far I have been talking about how things are stored and managed within a repository, but I have not broached the subject of how the repository itself is stored.

A repository must store every version of every file. It must remember the hierarchy of files and folders for every version of the tree. It must remember metadata, information about every file and folder. It must remember checkin comments, explanations provided by the developer for each checkin. For large trees and trees with very many revisions, this can be a lot of data that needs to be managed efficiently and reliably. There are several different ways of approaching the problem.
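As a rough sketch of the bookkeeping involved, a repository has to track something like the following (these record layouts are hypothetical, not any real tool's schema):

    from dataclasses import dataclass, field

    @dataclass
    class RepoItem:
        item_id: int                   # permanent ID, survives rename/move
        versions: list = field(default_factory=list)   # deltas or full copies

    @dataclass
    class Checkin:
        revision: int                  # tree-wide version number
        author: str
        comment: str                   # the checkin comment
        changes: list = field(default_factory=list)    # the changeset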

RCS kept one archive file for every file being managed. If your file was called "foo.c", then the archive file was called "foo.c,v". Usually these archive files were kept in a subdirectory of the working directory, just one level down. RCS files were plain text; you could just look at them with any editor. Inside the file you would find a bunch of metadata and a full copy of the latest version of the file, plus a series of line-oriented file deltas, one for each previous version. (Please forgive me for speaking of RCS in the past tense. Despite all the fond memories, that particular phase of my life is over.)

CVS uses a similar design, albeit with a lot more capabilities. A CVS repository is distinct, completely separate from the working directory, but it still uses ",v" files just like RCS. The directory structure of a CVS repository contains some additional metadata.

When managing larger and larger source trees, it becomes clear that the storage challenges of a repository are exactly the same as the storage challenges of a database. For this reason, many SCM tools use an actual database as the backend data store. Subversion uses Berkeley DB. Vault uses SQL Server 2000. The benefit of this approach is enormous, especially for SCM tools which support atomic transactions. Microsoft has invested lots of time and money to ensure that SQL Server is a safe place to store important information. Data corruption simply doesn't happen. All of the ultra-tricky details of transactions are handled by the underlying database.

Perforce uses somewhat of a hybrid approach, storing all of the metadata in a database but keeping all of the actual file contents in RCS files. This approach trades some safety for speed. Since Perforce manages its own archive files, it has to take responsibility for all the strange things that threaten to corrupt them. On the other hand, writing a file is a bit faster than writing a blob into a SQL database. Perforce has the reputation of being one of the fastest SCM tools.

Managing repositories

Creating a source control repository is kind of a special event. It's a little bit like adopting a cat. People often get a cat without realizing the animal is going to be around for 10-20 years. Your repository may have similar longevity, or even longer.

Shortly after SourceGear was founded in 1997, we created a SourceSafe repository. Over seven years later, that repository is still in use, almost every day. (Along with a whole bunch of legacy projects, it contains the source code for SourceOffSite. We never migrated that project to Vault because we wanted the SourceOffSite developers to continue eating their own dogfood.) That repository is well over a gigabyte in size (which is actually rather small, but then SourceGear has never been a very big company). It contains thousands of files, thousands of checkins, and has been backed up thousands of times.

Best Practice: Use separate repositories for things which are truly separate

Most SCM tools offer the ability to have multiple distinct repositories. Vault can even host multiple repositories on the same Vault server. People often ask us when this capability should be used. In general, you should store related items in the same repository. Start a separate repository only in situations where the contents of the two are completely unrelated. In a small ISV, it may be quite logical to have only one repository which contains every project.

Treat your repository well and it will serve you well:

- Obviously you should do regular backups. That repository contains everything your fussy and expensive programmers have ever created. Don't risk losing it. Just for fun, take an hour this week and check your backup to see if it actually works. It's shocking how many people are doing daily backups that cannot actually be restored when they are needed.

- Put your repository on a reliable server. If your repository goes down, your entire team is blocked from doing work. Disk drives like to fail, so use RAID. Power supplies like to fail, so get a server with redundant power supplies. The electrical grid likes to fail, so get a good Uninterruptible Power Supply (UPS).

- Be conservative in the way your SCM server machine is managed. Don't put anything on that machine that doesn't need to be there. Don't feel the need to install every single Service Pack on the day it gets released. I've been shocked how many times one of our servers went south simply because we installed a service pack or hotfix from Windows Update. Obviously I want our machines to be kept current with the latest security fixes, but I've been burned too many times not to be cautious. Install those patches on some other machine before you put them on critical servers.

- Keep your SCM server inside a firewall. If you need to allow your developers to access the repository from home, carefully poke a hole, but leave everything else as tight as you can. Make sure your developers are using some sort of bulk encryption. Vault uses SSL. Tools like Perforce, CVS and Subversion can be tunneled through ssh or something similar.

This brief list of tips is hardly a complete guide for administrators. I am merely trying to describe the level of care and caution which should be used for your SCM repository.

Undo

As I have mentioned, one of the best things about source control is that it contains your entire history. Every version of everything is stored. Nothing is ever deleted. However, sometimes this benefit can be a real pain. What if I made a mistake and checked in something that should not be checked in? My history contains something I would rather forget. I want to pretend that it never happened. Isn't there some way to really delete from a repository?

In general, the recommended way to fix a problem is to checkin a new version which fixes it. Try not to worry about the fact that your repository contains a full history of the error. Your mistakes are a part of your past. Accept them and move on with your life.

However, most SCM tools do provide one or more ways of dealing with this situation. First, there is a command I call "rollback". This command is essentially an "undo" for revisions of a file. For example, let's say that a certain file is at version 7 and we want to go back to version 6. In Vault, we select version 6 and choose the Rollback command.

To be fair, I should admit that the rollback command is not always destructive. In some SCM tools, the rollback feature really does make version 7 disappear forever. Vault's rollback is nondestructive. It simply creates a version 8 which is identical to version 6. The designers of Vault are fanatical purists, or at the very least, one of them is.

As a concession to those who are less fanatical, Vault does support a way to truly destroy things in a repository. We call this feature "obliterate". I believe Subversion and Perforce use the same term. The obliterate command is the only way to delete something and make it truly gone forever.
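The difference between the two flavors of rollback is easy to show with a toy model (my own sketch, not Vault's implementation):

    history = ["v1", "v2", "v3", "v4", "v5", "v6", "v7"]  # versions 1..7

    def rollback_destructive(history, target):
        # Some tools really do erase the newer versions.
        del history[target:]

    def rollback_nondestructive(history, target):
        # The Vault way: version 7 stays in the record; a new version 8
        # is created whose contents are identical to version 6.
        history.append(history[target - 1])

    rollback_nondestructive(history, 6)
    assert len(history) == 8 and history[7] == history[5]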
Best Practice: Never obliterate anything that was real work

The purist in me wants to recommend that nothing should ever be obliterated. However, my pragmatist side prevails. There are situations where obliterate is not sinful. However, obliterate should never be used to delete actual work. Don't obliterate a file simply because you discovered it to be a bad idea. Don't obliterate a file simply because you don't need it anymore. Obliterate is for situations where something in the repository should never have been there at all. For example, if you accidentally checkin a gigabyte of MP3s alongside your C++ include files, obliterate is a justifiable choice.

In my original spec for Vault, I had decided that we would not implement any form of destructive delete. We eventually decided to compromise and implement this command, but I really wanted to discourage its use. SourceSafe makes it far too easy to rewrite history and pretend that something never happened. In the Delete dialog box, SourceSafe includes a checkbox called "Destroy Permanently". This is an atrocious design decision, roughly equivalent to leaving a sledgehammer next to the server machine so that people can bash the hard disks with it every once in a while. This checkbox is almost irresistible. It simply begs to be checked, even though it is very rarely the right thing to do.

When we first designed the obliterate command for Vault, I wanted its user interface to somehow make the user feel guilty. I argued that the obliterate dialog box should include a photograph of a 75-year-old Catholic nun scowling and holding a yardstick. The rest of the team agreed that we should discourage people from using this command, but in the end, we settled on a less graphical approach. In Vault, the obliterate command is available only in the Admin client, not the regular client people use every day. In effect, we made the obliterate command available, but inconvenient. People who really need to obliterate can find the command and get it done. Everyone else has to think twice before they try to rewrite history and pretend something never happened.

Kimchi again?

Recently when I asked my fifth grade daughter what she had learned in school, she proudly informed me that "everyone in Korea eats kimchi at every meal, every day". In the world of a ten-year-old, things are simpler. Rules don't have exceptions. Generalizations always apply. This is how we learn. We understand the basic rules first and see the finer points later. First we learn that memory leaks are impossible in the CLR. Later, when our app consumes all available RAM, we learn more.

My habit as I write these chapters is to first present the basics in a "matter of fact" fashion, rarely acknowledging that there are exceptions to my broad generalizations. I did this during the chapter on checkins, failing to mention "edit-merge-commit" until I had thoroughly explored "checkout-edit-checkin". In this chapter, I have written everything from the perspective of just one specific architecture. SCM tools like Vault, Perforce, CVS and Subversion are based on the concept of a centralized server which hosts a single repository. Each client has a working folder. All clients contact the same server.

I confess that not all SCM tools work this way. Tools like BitKeeper and Arch are based on the concept of distributed repositories. Instead of one repository, there can be several, or even many. Things can be retrieved or committed to any repository at any time. The repositories are synchronized by migrating changesets from one repository to another. This results in a merge situation which is not altogether different from merging branches.

From the perspective of this SCM geek, distributed repositories are an attractive concept. Admittedly, they are advanced and complex, requiring a bit more of a learning curve on the part of the end user. But for the power user, this paradigm for source control is very cool. Having no experience in the implementation of these systems, I will not be explaining their behavior in any detail. Suffice it to say that this approach is similar in some ways, but very different in others.
This series of articles will continue to focus on the more mainstream architecture for source control.

Looking ahead

In this chapter, I discussed the details of repositories. In the next chapter, I'll go back over to the client side and dive into the details of working folders.

Chapter 5: Working Folders


The joy of indifference

CVS calls it a sandbox. Subversion calls it a working directory. Vault calls it a working folder. By any of these names, a working folder is a directory hierarchy on the developer's client machine. It contains a copy of the contents of a repository folder. The very basic workflow of using source control involves three steps:

1. Update the working folder so that it exactly matches the latest contents of the repository.
2. Make some changes to the working folder.
3. Checkin (or commit) those changes to the repository.

The repository is the official archive of our work. We treat our repository with great respect. We are extremely careful about what gets checked in. We buy backup disks and RAID arrays and air conditioners and whatever it takes to make sure our precious repository is always comfortable and happy.

In contrast, we treat our working folder with very little regard. It exists for the purpose of being abused. Our working folder starts out worthless, nothing more than a copy of the repository. If it is destroyed, we have lost nothing, so we run risky experiments which endanger its life. We attempt code changes which we are not sure will ever work. Sometimes the contents of our working folder won't even compile, much less pass the test suite. Sometimes our code changes turn out to be a Really Bad Idea, so we simply discard the entire working folder and get a new one.

Best Practice: Don't let your working folder become too valuable

Checkin your work to the repository as often as you can without breaking the build.

But if our code changes turn out to be useful, things change in a very big way. Our working folder suddenly has value. In fact, it is quite precious. The only copy of our most recent efforts is sitting on a crappy, laptop-grade hard disk which gets physically moved four times a day and never gets backed up. The stress of this situation is almost intolerable. We want to get those changes checked in to the repository as quickly as possible. Once we do, we breathe a sigh of relief. Our working folder has once again become worthless, as it should be.

Hidden state information

Once again I need to spend some time explaining grungy details of how SCM tools work.

I don't want to repeat the analogy I used in the last chapter, so the following line of "code" should suffice:

Response.Write(previousChapter.Section["Cars and Clocks"]);

Let's suppose I have a brand new working folder. In other words, I started with nothing at all and I retrieved the latest versions from the repository. At this moment, my new working folder is completely in sync with the contents of the repository. But that condition is not likely to last for long. I will be making changes to some of the files in my working folder, so it will be "newer" than the repository. Other developers may be checking in their changes to the repository, thus making my working folder "out of date". My working folder is going to be new and old at the same time. Things are going to get confusing. The SCM tool is responsible for keeping track of everything. In fact, it must keep track of the state of each file individually.

For housekeeping purposes, the SCM tool usually keeps a bit of extra information on the client side. When a file is retrieved, the SCM client stores its contents in the corresponding working file, but it also records certain information for later. Examples:

- Your SCM tool may record the timestamp on the working file, so that it can later detect if you have modified it.
- It may record the version number of the repository file that was retrieved, so that it may later know the starting point from which you began to make your changes.
- It may even tuck away a complete copy of the file that was retrieved, so that it can show you a diff without accessing the server.

I call this information "hidden state information". Its exact location depends on which SCM tool you are using. Subversion hides it in invisible subdirectories in your working directory. Vault can work similarly, but by default it stores hidden state information in the current user's "Application Data" directory.

Best Practice: Use non-working folders when you are not working

SCM tools need this hidden state information so they can efficiently keep track of things as you make changes to your working folder. However, sometimes you want to retrieve files from the repository with no plan of making changes to them. For example, if you are retrieving files to make a source tarball, or for the purpose of doing an automated build, you don't really need the hidden state information at all. Your SCM tool probably has a way to retrieve things "plain", without writing the hidden state information anywhere. I call this a "non-working folder". In Vault, this is done automatically whenever you retrieve files to a destination which is not configured as the working folder, although I sometimes wish we had made this functionality a completely separate command.
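A toy model of the per-file hidden state might look like this (field names are mine, for illustration only):

    from dataclasses import dataclass
    from pathlib import Path

    @dataclass
    class HiddenState:
        repo_path: str          # e.g. "$/top/foo.cpp"
        baseline_version: int   # version that was retrieved
        retrieved_mtime: float  # timestamp at retrieval time
        baseline_copy: Path     # stashed copy, for offline diff and undo

    def locally_modified(state, working_file: Path) -> bool:
        # Cheap timestamp check; a real tool might fall back to
        # comparing actual contents when the timestamps differ.
        return working_file.stat().st_mtime != state.retrieved_mtime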

Working file states

Because of the changes happening on both the client and the server, a working file can be in one of several possible states. SCM tools typically have some way of displaying the state of each file to the user. Vault shows file states in the main window. CVS shows them in response to the 'cvs status' command. The table below shows the possible states for a working file. The column on the left shows my particular name for each of these states, which through no coincidence is the name that Vault uses. The column on the far right shows the name shown by the 'cvs status' command. However, the terminology doesn't really matter. One way or another, your SCM tool is probably keeping track of all these things and can tell you the state of any file in your working folder hierarchy.

State Name  | Working file modified? | Repository has newer version? | Remarks | 'cvs status'
None        | No  | No  | The working file matches the latest version in the repository. | Up-to-date
Old         | No  | Yes | | Needs Patch
Edited      | Yes | No  | | Locally Modified
Needs Merge | Yes | Yes | | Needs Merge
Missing     | N/A | N/A | The working file does not exist. | Needs Checkout
Renegade    | Yes | No  | You have modified a file without first checking it out. | N/A
Unknown     | No  | No  | There is a working file, but the SCM tool has no hidden state information about it. | Unknown
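The table reduces to a few lines of logic. Here is a toy classifier in Python, assuming the checkout-edit-checkin style where unannounced edits are possible (not any particular tool's code):

    def working_file_state(exists, has_hidden_state, modified,
                           repo_newer, checked_out):
        # Mirrors the table above.
        if not exists:
            return "Missing"
        if not has_hidden_state:
            return "Unknown"
        if modified and not checked_out:
            return "Renegade"
        if modified and repo_newer:
            return "Needs Merge"
        if modified:
            return "Edited"
        return "Old" if repo_newer else "None"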

Refresh

In order to keep all this file status information current, the SCM client must have ways of staying up to date with everything that is happening. Whenever something changes in the working folders or in the repository, the SCM client wants to know.

Changes in the working folders on the client side are relatively easy. The SCM client can quickly scan files in the working folders to determine what has changed. On some operating systems, the client can register to be notified of changes to any file.

Notification of changes on the server can be a bit trickier. The Vault client periodically queries the server to ask for the latest version of the repository tree structure. Most of the time, the server will simply respond that "nothing has changed". However, when something has in fact changed, the client receives a list of things which have changed since the last time that client asked for the tree structure.

For example, let's assume Laura retrieves the tree structure and is informed that foo.cpp is at version 7. Later, Wilbur checks in a change to foo.cpp and creates version 8. The next time Laura's Vault client performs a refresh, it will ask the server if there is anything new. The server will send down a list, informing her client that foo.cpp is now at version 8. The actual bits for foo.cpp will not be sent until Laura specifically asks for them. For now, we just want the client to have enough information so that it can inform Laura that her copy of foo.cpp is now "Old".
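In sketch form, a refresh amounts to merging a small "what changed since version X" reply into a client-side cache (the data layout here is invented for illustration):

    def refresh(cache, server_reply):
        # server_reply lists (path, new_version) pairs plus the new tree
        # version; file contents are NOT downloaded, so the affected
        # working files simply show up as "Old".
        for path, new_version in server_reply["changed"]:
            cache["latest"][path] = new_version
        cache["tree_version"] = server_reply["tree_version"]

    cache = {"tree_version": 7, "latest": {"$/foo.cpp": 7}}
    refresh(cache, {"tree_version": 8, "changed": [("$/foo.cpp", 8)]})
    assert cache["latest"]["$/foo.cpp"] == 8   # Laura's copy is now "Old"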

Operations that involve a working folder

OK, let's go back to speaking a bit more about practical matters. In terms of actual usage, most interaction with your SCM tool happens in and around your working folder. The following operations are the basic things I can do to a working folder:

- Make changes: This is the whole point.
- Review changes: Show me the changes I have made to my working folder so far.
- Undo changes: Some of my changes didn't work out the way I planned. Undo them, restoring my working folder back to the way it was when I started.
- Update: The repository has changes which I want to be included in my working folder.
- Commit changes: I'm ready to send my changes to the repository and make them permanent.

In the following sections, I will cover each of these operations in a bit more detail.

Make the changes

The primary thing you do to a working folder is make changes to it. In an idealized world, it would be really nice if the SCM tool didn't have to be involved at all. The developer would simply work, making all kinds of changes to the working folder while the SCM tool eavesdrops, keeping an accurate list of every change that has been made. Unfortunately, this perfect world isn't quite available. Most operations on a working folder cannot be automatically detected by the SCM client. They must be explicitly indicated by the user. Examples:

- It would be unwise for the SCM client to notice that a file is "Missing" and automatically assume it should be deleted from the repository.
- Automatically inferring an "Add" operation is similarly unsafe. We don't want our SCM tool automatically adding any file which happens to show up in our working folder.
- Rename and move operations also cannot be reliably divined by mere observation of the result. If I rename foo.cpp to bar.cpp, how can my SCM client know what really happened? As far as it can tell, I might have deleted foo.cpp and added bar.cpp as a new file.

All of these so-called "folder-level" operations require the user to explicitly give a command to the SCM tool. The resulting operation is added to the pending change set, which is the list of all changes that are waiting to be committed to the repository.

However, it just so happens that in the most common case, our "eavesdropping" ideal is available. Developers who use the edit-merge-commit model typically do not issue any explicit command telling the SCM tool of their intention to edit a file. The files in their working folder are left in a writable state, so they simply open their text editor or their IDE and begin making changes. At the appropriate time, the SCM tool will notice the change and add that file to the pending change set.

Users who prefer "checkout-edit-checkin" actually have a somewhat more consistent rule for their work.

The SCM tool must be explicitly informed of all changes to the working folder. All files in their working folder are usually marked read-only. The SCM tool's Checkout command not only informs the server of the checkout request, but it also flips the bit on the working file to make it writable.

Review changes

One of the most important features provided by a working folder is the ability to review all of the changes I have made. For SCM tools that do keep track of a pending change set (Vault, Perforce, Subversion), this is the place to start. The following screen dump shows the pending change set pane from the Vault client, which is showing me that I have currently made two changes in my working folder:

The pending change set view shows all kinds of changes, including adds, deletes, renames, moves, and modified files. It is helpful to keep an eye on the pending change set as I work, verifying that I have not forgotten anything. However, for the case of a modified file, this visual display only shows me which files have changed. To really review my changes, I need to actually look inside the modified files. For this, I invoke a diff tool. The following screen dump is from a popular Windows diff tool called Beyond Compare:

This picture is fairly typical of the visual diff tool genre, showing both files side-by-side and highlighting the parts that are different. There are quite a few tools like this. The following screen dump is from the visual diff tool which is provided with Vault:

The left panel shows version 21 of sgdmgui_props.cpp, which is the current version in the repository. The right panel shows my working file. The colored regions show exactly what has changed:

On line 33 I changed the type of this function from long to short. At line 35 I inserted a one-line comment. Note that SourceGear's diff tool shows inserted lines by drawing lines in the center gap to indicate exactly where the insertion occurs. In contrast, Beyond Compare is showing a dead region on the left side across from the inserted line on the right. This particular issue is a matter of personal preference. The latter approach does have the benefit that identical lines are always across from each other.

Best Practice: Run diff just before you checkin, every time

Never checkin your changes without giving them a quick review in some sort of a diff tool.

Both of these tools do a nice job on the modification to line 33, showing exactly which part of the line was changed. Most of the recent visual diff tools support this ability to highlight intraline differences. Visual diff tools are indispensable. They give me a way to quickly review exactly what has changed. I strongly recommend you make a habit of reviewing all of your changes just before you checkin. You can catch a lot of silly mistakes by taking the time to be sure that your changes look the way you think they look.

Undo changes

Sometimes I make changes which I simply don't intend to keep. Perhaps I tried to fix a bug and discovered that my fix introduced five new bugs that are worse than the one I started with. Or perhaps I just changed my mind. In any case, a very nice feature of a working folder is the ability to undo.

In the case of a folder-level operation, perhaps the Undo command should actually be called "Nevermind". After all, the operation is pending. It hasn't happened yet. I'm not really saying that I want to Undo something which has already happened. Rather, I am just saying that I no longer want to do something that I previously said I wanted to do. For example, if I tell the Vault client to delete a file, the file isn't really deleted until I commit that change to the repository. In the meantime, it is merely waiting around in my pending change set. If I then tell the Vault client to Undo this operation, the only thing that actually has to happen is to remove it from my pending change set.

In the case of a modified file, the Undo command simply overwrites the working file with the "baseline" version, the one that I last retrieved. Since Vault has been keeping a copy of this baseline version, it merely needs to copy this baseline file from its place in the hidden state information over the working file.

Best Practice: Be careful with undo

When you tell your SCM client to undo the changes you have made to a file, those changes will be lost. If your working folder has become valuable, be careful with it.
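For a modified file, Undo really is as simple as it sounds. A sketch, assuming the baseline copy is stashed as described in the hidden state section:

    import shutil

    def undo_file_changes(working_file, baseline_copy):
        # Clobber the working file with the stashed baseline. The local
        # edits are gone for good, so a real client prompts first.
        shutil.copyfile(baseline_copy, working_file)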

For users who use the checkout-edit-checkin style of development, closely related here is the need to undo a checkout. This is essentially similar to undoing the changes in a file, but involves the extra step of informing the server that I no longer want the file to be checked out.

Digression: Your skillet is not a working folder

Source control tools have been a daily part of my life for well over a decade. I can't imagine doing software development without them. In fact, I have developed habits that occasionally threaten my mental health. Things would be so much easier if the concept of a working folder were available in other areas of life:

"Hmmm. I can't remember which of these pool chemicals I have already done. Luckily, I can just diff against the version of the pool water from an hour ago and see exactly what changes I have made."

"Boy am I glad I remembered to set the read-only bit on my front lawn to remind me that I'm not supposed to cut the grass until a week after the fertilizer was applied."

"No worries -- if I accidentally put too much pepper on this chicken, I can just revert to the latest version in the repository."

Unfortunately, SCM tools are unique. When I make a mistake in my woodshop, I can't undo it. Only in software development do I have the luxury of a working folder. It's a place where I can work without constantly worrying about making a mistake. It's a place where I can work without having to be too careful. It's a place where I can experiment with ideas that may not work out. I wish I had working folders everywhere.

Update the working folder

Ten milliseconds after I retrieve a fresh working folder, it might be out of date. An SCM repository is a busy hub of activity. New stuff arrives regularly as team members finish tasks and checkin their work. I don't like to let my working folder get too far behind the current state of the repository. SCM tools typically allow the user to invoke a diff tool to compare two repository versions of a file. When I am working on a feature, I periodically like to review the recent changes in the repository. Unless those changes look likely to disrupt my own work, I usually proceed to retrieve the latest versions of things so that my working folder stays up to date.

In CVS, the command to update a working folder is [rather conveniently] called 'update'. In Vault, this operation is done with the Get Latest Version command. The screen dump below is the corresponding dialog box:

I want to update my working folder to contain all of the changes available on the server, so I have invoked the Get Latest Version operation starting at the very top folder of my repository. The Recursive checkbox in the dialog above indicates that this operation will recursively apply to every subfolder.

Best Practice: Don't get too far behind

Update your working folder as often as you can.

Note that this dialog box gives me a few choices for how I may want to handle situations where a change has happened on both the client and the server. Let us suppose for a moment that I am not using exclusive checkouts and that somebody else has also modified sgdmgui_props.cpp. In this case, I have three choices available when I want to update my working folder:

- Overwrite my working file. The effect here is similar to an Undo. My changes will be lost. Use with care.
- Attempt automatic merge. The Vault client will attempt to construct a file which contains my changes and the changes which were made on the server. If the automerge succeeds, my working file will end up in the "Edited" status. If the automerge fails, the status of my working file will be "Needs Merge", and the Vault client will nag and pester me until I resolve the situation.
- Do not overwrite/Merge later. This option leaves my working file untouched. However, the status of the file will change to "Needs Merge". Vault will not allow me to checkin my changes until I affirm that I have done the right thing and merged in the changes from the repository.

Note also that the "Prompt for modified files" checkbox allows me to specify that I want the Vault client to let me choose between these options for every file that ends up in this situation. As you can see, the Get Latest Version dialog box includes a few other options which I won't describe in detail here. Other SCM tools have similar abilities, although the user interface may be very different. In any case, it's a good idea to update your working folder as often as you can.

Commit changes

In most situations, I eventually decide that my changes are Good and should be sent back to the repository so they can become a permanent part of the history of my project. In Vault, Subversion and CVS, the command is called Commit. The following screen dump shows the Commit dialog box from Vault:

Note that the listbox at the top contains all of the items in my pending change set. In this particular example, I only have two changes, but this listbox typically has a scrollbar and contains lots of items. I can review all of the operations and choose exactly which ones I want to commit to the repository. It is possible that I may want to checkin only some of my currently pending changes. (Perforce has a nifty solution to this problem. The user can have multiple pending change sets, so that changes can be logically grouped together even as they are waiting to be checked in.)

The "Change Set Comment" textbox offers a place for me to type an explanation of what I changed and why I did it. Please note that this textbox has a scrollbar, encouraging you to type as much text as necessary to give a full explanation of the problem. In my opinion, checkin comments are more important than the comments in the actual code.

When I click OK, all of the selected items will be sent to the server to be committed to the repository. Since Vault supports atomic checkin transactions, I know that my changes will succeed or fail as a united group. It is not possible for the repository to end up in a state where only some of these changes made it.

#region CARS_AND_CLOCKS

Remember the discussion in chapter 4 about binary file deltas? This same technology is also used for checkin operations. When Vault sends a modified version of a file up to the server, it actually sends only the bytes which have changed, using the same VCDiff format which is used to make repository storage more efficient. The reason this is possible is because it has kept a copy of the baseline file in the hidden state information. The Vault client simply runs the VCDiff algorithm to construct the difference between this baseline file and the current working file.
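Here is a toy version of that round trip. Real tools use VCDiff, which can match byte ranges anywhere in the file; this sketch only trims a common prefix and suffix, but the shape of the exchange -- delta, baseline version, checksum -- is the same:

    import zlib

    def make_delta(old, new):
        # Toy byte-oriented delta: keep the common prefix and suffix,
        # ship only the bytes in between.
        p = 0
        while p < min(len(old), len(new)) and old[p] == new[p]:
            p += 1
        s = 0
        while s < min(len(old), len(new)) - p and old[-1 - s] == new[-1 - s]:
            s += 1
        return p, s, new[p:len(new) - s]

    def apply_delta(old, delta):
        p, s, middle = delta
        return old[:p] + middle + old[len(old) - s:]

    baseline, working = b"int x = 1;", b"int x = 2;"
    upload = {
        "baseline_version": 21,
        "delta": make_delta(baseline, working),
        "crc": zlib.crc32(working),
    }
    # Server side: rebuild from its own copy of version 21, then verify.
    rebuilt = apply_delta(baseline, upload["delta"])
    assert zlib.crc32(rebuilt) == upload["crc"]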

So in the case of my running example, the Vault client will send three pieces of information:

- The binary delta. Since the pending change set pane shows that my working file is 40 bytes larger than the baseline where I started, the binary delta is going to be somewhere in the vicinity of 40 bytes long, perhaps with a few extra bytes for overhead.
- The fact that this binary delta was computed against version 21 of the file. Since version 21 is known and exists on both the client and the server, the SCM server can simply apply the binary delta to its own copy of version 21 to reconstruct an exact copy of the contents of my working file.
- The CRC checksum of the original working file. When the server reconstructs its copy of the working file, the CRC will be compared to ensure that nothing was corrupted during transit. The file that is stored in the repository will be exactly the same as the working file. No corruption, no surprises.

Whenever possible, Vault uses binary file deltas "over the wire" in both directions, from client to server as well as from server to client. In this example, the entire file is only 3,762 bytes, so the savings in network bandwidth isn't all that significant. However, for larger files, the increase in network performance for offsite users can be quite dramatic. This capability of using binary file deltas between client and server is supported by some other SCM tools as well, including (I believe) Subversion and Perforce.

#endregion

When the checkin has completed successfully, if I am working in "checkout-edit-checkin" mode, the SCM tool will flip the read-only bit on my working files to prevent me from accidentally making changes without informing the server of my intentions. Having completed my checkin, the cycle is completed. My working folder is once again worthless, since my changes are a permanent part of the repository. I am ready to start again on my next development task.

Looking ahead

In the next chapter, it's time to start talking about some of the more advanced stuff. I'll start with an overview of labels and history.

Chapter 6: History


Confronting your past

You may now be tired of hearing me say it, but I will say it again: Your repository contains every version of everything which has ever been checked in to the repository. This is a Good Thing. We sleep better at night because we know that our efforts are always additive, never subtractive. Nothing is ever lost. As the team regularly checks in more stuff, the complete historical record is preserved, just in case we ever need it.

But this feature is also a Bad Thing. It turns out that keeping absolutely everything isn't all that useful if you can't find anything later. My woodshop is a painfully vivid illustration of this problem. I have a habit of never throwing anything away. When I build a piece of furniture, I save every scrap of wood, telling myself that I might need it someday. I keep every screw, nail, bolt or nut, just in case I ever need it. But I don't organize these things very well. So when the time comes that I need something, I usually can't find it. I'm not necessarily proud of this confession, but my workshop stands as an expression of who I am. Those who love me sometimes find my habits to be endearing. But there is nothing endearing about a development team that can't find something when they need it.

A good SCM tool must do more than just keep every version of everything. It must also provide ways of searching and viewing and sorting and organizing and finding all that stuff. In the rest of this chapter, I will discuss several mechanisms that SCM tools provide to help make the historical data more useful.

Labels

Perhaps the most important feature for dealing with old versions is the notion of a "label". In CVS, this feature is called a "tag". By either name, the concept is the same -- labels offer the ability to associate a name with a specific version of something in the repository. A label assigns a meaningful symbolic name to a snapshot of your code so you can later find that snapshot more easily. This is not altogether different from the descriptive and memorable names we use for variables and constants in our code. Which of the following two lines of code is easier to understand?

if (errorcode == ERR_FILE_NOT_FOUND)
if (e == -43)

Similarly, which of the following is a more intuitive description of a specific version of your code?

LAST_VERSION_BEFORE_COREY_FOULED_EVERYTHING_UP

378

We create (or "apply") a label by specifying a few things:

1. The string for the name of the label. This should be something descriptive that you can either remember or recognize later. Don't be afraid to put enough information in the name of the label. Note that CVS has strict rules for the syntax of a tag name (it must start with a letter; no spaces; almost no punctuation allowed). I still follow that tradition even though Vault is more liberal.
2. The folder to which the label will be applied. (You can apply a label or tag to a single file if you want, but why? Like most source control operations, labels are most useful when applied recursively to a whole folder.)
3. Which versions of everything should be included in the snapshot. Often this is implicitly understood to be the latest version, but your SCM tool will almost certainly allow you to label something in the past. If it won't, take it out back and shoot it.
4. A comment explaining the label. This is optional, and not all SCM tools support it (CVS doesn't), but a comment can be handy when you want to explain more than might be appropriate to say in the name of the label. This is particularly handy if your team has strict rules for the syntax of label names (V1.3.2.1426.prod) which prevent you from putting in other information you need.

For example, in the following screen dump from Vault, I am labeling version 155 of the folder $/src/sgd/libsgdcore:

It is worth clarifying here that labels play a slightly different role in some SCM tools. In Subversion or Vault, folders have version numbers. Using the example from my screen dump above, the folder $/src/sgd/libsgdcore is at version 155. Each of the various files inside that folder has its own version number, but every time one of those files changes, the version number of the folder is increased by one as well. So the version number of a folder is a little bit like a label because it maps to a specific snapshot of the contents of the folder.

However, CVS doesn't work this way. There is no folder version number which can be mapped to a specific snapshot of the contents of that folder.

For this reason, tags are all the more important in CVS, since there is no other way to easily mark specific versions of multiple items as a snapshot.

When to use a label

Labels are cheap. They don't consume a lot of resources. Your SCM tool won't slow down if you use lots of them. Having more labels does not increase your responsibilities. So you can use them as often as you like. The following situations are examples of when you might want to use a label:

- When you make a release. A release is the most obvious time to apply a label. When you release a version of your application to customers, it can be very important to later know exactly which version of the code was released.
- When something is about to change. Sometimes it is necessary to make a change which is widespread or fundamental. Before destabilizing your code, you may want to apply a label so you can easily find the version just before things started getting messed up.
- When you do an automated build. Some automated build systems apply a label every time a build is done. The usual approach is to first apply the label and then do a "get by label" operation to retrieve the code to be used for the build. Using one of these tools can result in an awful lot of labels, but I still like the idea. It eliminates the guesswork of trying to figure out exactly which code was in the build.
- When you move some changes from one place to another. Labels are handy ways to mark the sync points between two branches or two copies of the same tree. For example, suppose your company has two groups with separate source control systems. Group A has a library called SuperDuperNeatoUtilityLib. Group B uses this library as well, but they keep their own copy in their own source control repository. Every so often, they log in to Group A's repository and see if there are any bug fixes they want to migrate into their own copy. By applying a label to Group A's repository, they can more easily remember the latest point at which their two versions were in sync.

Best Practice: Use labels often

Labels are very lightweight. Don't hesitate to use them as often as you want.

Once you have a label, the question is what you can do with it. The truth is that some labels never get used. That's okay. Like I said, they're cheap. But many labels do get used. The "get by label" operation is the most common way that a label comes in handy. By specifying a label as the version you want to retrieve, you can get a copy of every file exactly as it was when the label was created.

It's also very handy to diff against a label. For example, in the following screen dump from Vault, I am asking to see all the differences between the contents of my working folder and the contents of the label named "Build 3.0.0.2752". (This label was applied by our automated build system when it made build 2752.)

Admonishments on the evils of "Label Promotion"

Sometimes after you apply a label you realize that you want to make a small change. As an example, consider the following scenario:

One week ago, you finalized the code for the 4.0 release of your product. You applied a label to the tree, and your team has proceeded with development on a few post-4.0 tasks. But now Bob (one of your QA guys) comes crawling into your office. His clothes are torn and his face is covered with soot. While gasping for air he informs you that he has found a potential showstopper bug in the 4.0 release candidate. Apparently if you are running your app on the Elbonian version of Windows NT 3.5 with the time zone set to Pacific Standard Time and you enter a page margin size of 57 inches while printing a 42-page document on a Sunday morning before 9am, the whole machine locks up. In fact, if you don't quickly kill the app, the computer will soon burst into flame. As Bob finishes explaining the situation, a developer walks in and announces that he has already found the fix for this bug, and it affects only one line of code in FOO.CPP. Should he make the fix and generate a new release candidate?

After scolding Bob for not being more diligent in finding this bug sooner, you begrudgingly decide that the severity of this bug does indeed make it a showstopper for the 4.0 release. But how to proceed? The label for the 4.0 build has already been applied. You want a new release candidate which contains exactly the contents of the 4.0 label plus this one-line change. None of the other stuff which has been checked in during the past week should be included.

I'm sure it was this very situation which prompted Microsoft to implement a feature in SourceSafe 6.0 called "label promotion". The idea is that a minor change to a label can be made after it was originally created. Returning to our example, let's suppose that the 4.0 label contained version 6 of FOO.CPP. So now we would make the one-line change and check it in, resulting in version 7 of that file. Then we "promote" version 7 of the file to be included in the 4.0 label, instead of version 6.

Personally I think "label promotion" is a terrible name for this feature. In fact, I think label promotion is a terrible feature. I am doctrinally opposed to any SCM feature which allows the user to alter the historical record. The history of the repository should be a complete record of what really happened. If we use label promotion in this situation, there will be no record of the fact that the original 4.0 release candidate actually contained version 6 of that file. In situations where label promotion seems necessary, a fanatical purist like me would just create a new branch, which is a topic I will discuss in the next chapter.

Best Practice: Avoid using label promotion

Your repository should contain an accurate reflection of what really happened. Don't use label promotion. If you must, do at least try to feel guilty about it.

However, even though I dislike this feature for philosophical reasons, customers really want it. Here at SourceGear, I tell people that "the customer is not always right, but the customer is always the customer". So in order to remain true to our goal of making Vault a painless transition from SourceSafe, we implemented label promotion. But that doesn't mean I have to be happy about it.

History

Another important feature is the ability to view and browse historical versions of the repository. In its simplest form, this can be just a list of changes with the following information about each change:

- What was changed
- When the change was made
- Who did it
- Why (the comment entered at checkin time)

But without a way of filtering and sorting this information, using history is like trying to take a drink from a fire hose. Fortunately, most SCM tools provide plenty of flexibility in helping you see the data you need. In CVS, history is obtained using the 'cvs log' command. In the Vault GUI client, we use the History Explorer. In either case, the first way to filter history is to decide where to invoke the command. Requesting the full history from the root folder of a repository is like the aforementioned fire hose. Instead, invoke the command on a subfolder or even on a file. In this way, you will only see the changes which have been made to the item you selected.

Most SCM tools provide other ways of filtering history information as well:

- Show only changes made during a specific range of dates
- Show only changes made by a specific user
- Show only changes made to files of a certain extension
- Show only changes where the checkin comment contains specific words

The following screen dump from Vault shows all the changes I made to one of the Vault libraries during October 2004:

Sometimes the history features of your SCM tool are used merely to figure out what happened in the past, but often we need to dig even deeper. Perhaps we want to retrieve ("get") an old version? Perhaps we want to diff against an old version, or diff two old versions against each other? We may want to apply a label to a version that happened in the past. We may even want to use an old version as the starting point for a new branch. Good SCM tools make all of these things easy to do.

Best Practice: Do as I say, not as I do

It is while using the history features of an SCM tool that we notice what a lousy job our developers do on their checkin comments. Please, make your checkin comments as complete as possible. The screen dump above contains an example of checkin comments written by a slacker who was in too much of a hurry.

A word about changesets and history

For tools like Subversion and Vault which support atomic transactions and changesets, history can be slightly different. Because changesets are a grouping of individual changes, history is no longer just a flat list of individual changes, but rather, can now be viewed as a hierarchy which is two levels deep. To ease the transition for SourceSafe users, Vault allows history to be viewed either way. You can ask Vault's History Explorer to display individual changes. Or, you can ask to see a list of changesets, each of which can be expanded to see the individual changes contained inside it. Personally, I prefer the changeset-oriented view. I like the mindset of thinking about the history of my repository in terms of groups of related changes.

Blame

Vault has a feature which can produce an HTML view of a file with each line annotated with information about the last person who changed that line. We call this feature "Blame". For example, the following screen dump shows the Blame output for the source code to the Vault command line client:

This poor function has had all kinds of people stomping through it. I was the last person to change line 828, which I apparently did in revision 106 of the file. However, line 829 was last modified by Jeff, and line 830 belongs to Dan.

By now the reason for the silly-sounding name of this feature should be obvious. If I find a bug on line 832, the Blame feature makes it easy for me to see that it must be Dan's fault!

Note that we here at SourceGear take absolutely no credit or blame for the name of this command. We took our inspiration for this feature from the blame feature found in the CVS world, popularized by the Bonsai tool from the Mozilla project. The following screen dump shows this CVS Blame feature in action using the Bonsai installation on www.abisource.com. I was delighted to discover that the AbiWord layout engine actually still contains some of my code:

Best Practice: Don't actually use the blame feature to be harsh with people about their mistakes

Even though this Best Practice box is more about team management than source control, I don't feel like I'm straying too far off topic to offer the following tidbit: Tim Krauskopf, an early mentor of mine, said many wise things to me, including the following piece of management advice which I have never forgotten: "Spend more time on credit than on blame, and don't spend very much time on either one."

Whether you like the name or not, the Blame feature can be awfully handy sometimes.

Looking ahead

In the next chapter, we'll start talking about branches.

Chapter 7: Branches
What is a branch?

A branch is what happens when your development team needs to work on two distinct copies of a project at the same time. This is best explained by citing a common example: Suppose your development team has just finished and released version 1.0 of UltraHello, your new flagship product, developed with the hope of capturing a share of the rapidly growing market for "Hello World" applications.

But now that 1.0 is out the door, you have a new problem you have never faced before. For the last two years, everybody on your team has been 100% focused on this release. Everybody has been working in the same tree of source code. You have had only one "line of development", but now you have two:

- Development of 2.0. You have all kinds of new features which just didn't make it into 1.0, including "multilingual Hello", DirectX support for animated Hellos, and of course, the ability to read email.
- Maintenance of 1.0. Now that real customers are using UltraHello, they will probably find at least one bug your testing didn't catch. For bug fixes or other minor improvements requested by customers, it is quite possible that you will need to release a version 1.0.1.

It is important for these two lines of development to remain distinct. If you release a version 1.0.1, you don't want it to contain a half-completed implementation of a 2.0 feature. So what you need here is two distinct source trees so your team can work on both lines of development without interfering with each other.

The most obvious way to solve this problem would simply be to make a copy of your entire source control repository. Then you can use one repository for 1.0 maintenance and the other repository for 2.0 development. I know people who do it this way, but it's definitely not a perfect solution. The two-repository approach becomes disappointing in situations where you want to apply a change to both trees. For example, every time we fix a bug in the 1.0 maintenance tree, we probably also want to apply that same bug fix to the 2.0 development tree. Do we really want to have to do this manually? If the bug fix is a simple change, like fixing the incorrect spelling of the word "Hello", then it won't take a programmer very long to make the change twice. But some bug fixes are more involved, requiring changes to multiple files. It would be nice if our source control tool would help. A primary goal for any source control tool should be to help software teams be more concurrent: everybody busy, all at the same time, without getting in each other's way.

To address this very type of problem, source control tools support a feature which is usually called "branching". This terminology arises from the tendency of computer scientists to use the language of a physical tree every time hierarchy is involved. In this particular situation, the metaphor breaks down very quickly, but we keep the name anyhow. A somewhat better metaphor is a nature path which forks into two directions. Before the fork, there was one path. Now there are two, but they share a common history.

When you use the branching feature of your source control tool, it creates a fork in the path of your development progress. You now have two trees, but the source control tool has not forgotten the fact that these two trees used to be one. For this reason, the SCM tool can help make it easier to take code changes from one fork and apply those changes to the other. We call this operation "merging branches", a term which highlights why the physical tree metaphor fails. The two forks of a nature path can merge back into one, but two branches of an oak tree just don't do that. I'll talk a lot more about merging branches in the next chapter.

At this point I should take a step back and admit that my example of doing 1.0 maintenance and 2.0 features is very simplistic.
Real life examples are sometimes far more complicated, involving multiple branches, active development in each branch, and the need to easily migrate changes between any two of them. Branching and merging is perhaps the most complex operation offered by a source control tool, and there is much to say about it. I'll begin with some "cars and clocks" stuff and talk about how branching works "under the hood".

Two branching models

First of all, let's acknowledge that there are [at least] two popular models for branching.

In the first approach, a branch is like a parallel universe. The hierarchy of files and folders in the repository is sort of like the regular universe. For each branch, there is another universe which contains the same hierarchy of files and folders, but with different contents.

Best Practice: Organize your branches

The "folder" model of branching usually requires you to have one extra level of hierarchy in your repository tree. Keep your main development in a folder named $/trunk. Then create another folder called $/branches. Each time you create a branch off of the trunk, put it in $/branches.

In order to retrieve a file, you specify not just a path but the name of the universe, er, branch, from which you want the file retrieved. If you don't specify a branch, then the file will be retrieved from the "default branch". This is the approach used by CVS and PVCS.

In the other branching model, a branch is just another folder, located in the same repository hierarchy as everything else. When you create a branch of a folder, it shows up as another folder. With this approach, a repository path is sufficient to describe a location. Personally, I prefer the "folder" style of branching over the "parallel universe" style of branching, so my writing will generally come from this perspective. This is the approach used by most modern source control tools, including Vault, Subversion (they call it "copy"), Perforce (they call it "Inter-File Branching") and Visual Studio Team System (it looks like they call it branching in "path space").

Under the hood

Good source control tools are clever about how they manage the underlying storage issues of branching. For example, let us suppose that the source code tree for UltraHello is stored in $/projects/Hello/trunk. This folder contains everything necessary to do a complete build of the shipping product, so there are quite a few subfolders and several hundred files in there. Now that you need to go forward with 1.0 maintenance and 2.0 development simultaneously, it is time to create a branch. So you create a folder called $/projects/Hello/branches. Inside there, you create a branch called 1.0. At the moment right after the branch, the following two folders are exactly the same:

$/projects/Hello/trunk
$/projects/Hello/branches/1.0

It appears that the source control tool has made an exact copy of everything in your source tree, but actually it hasn't. The repository database on disk has barely increased in size. Instead of duplicating the contents of every file, it has merely pointed the branch at the same contents as the trunk. As you make changes in one or both of these folders, they diverge, but they continue to share a common history.
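In Subversion, for example, this kind of cheap branch is created with the copy command. A minimal sketch (the repository URL is hypothetical):

    # create the 1.0 maintenance branch; this is a quick, constant-time
    # operation because the repository records a pointer to the trunk's
    # contents rather than duplicating every file
    svn copy -m "Create 1.0 maintenance branch" \
        file:///var/repos/projects/Hello/trunk \
        file:///var/repos/projects/Hello/branches/1.0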

The Pitiful Lives of Nelly and Eddie

In order to use your source control tool most effectively, you need to develop just the right amount of fear of branching. This delicate balance seems to be very difficult to find. Most people either have too much fear or not enough.

Nelly is an example of a person who has too much fear of branching. Nelly has a friend who has a cousin with a neighbor who knows somebody whose life completely fell apart after they tried using the branch and merge features of their source control tool. So Nelly refuses to use branching at all. In fact, she wrote a 45-page policy document which requires her development team to never use branching, because after all, "it's not safe".

So Nelly's development team goes to great lengths to avoid using branching, but eventually they reach a point where they need to do concurrent development. When this happens, they do anything they can to solve the problem, as long as it doesn't involve the word "branch". They fork a copy of their tree and begin working with two completely separate repositories. When they need to make a change to both repositories, they simply make the change by hand, twice. Obviously these people are still branching, but they keep Nelly happy by never using "the b word". These folks are happy, and we should probably just leave them alone, but the whole situation is kind of sad. Their source control tool has features which were specifically designed to make their lives easier.

Best Practice: Don't be afraid of branches

If you're doing parallel development, let your source control tool help. That's what it was designed to do.

At the other end of the spectrum is Eddie, who uses branching far too often. Eddie started out just like Nelly, afraid of branching because he didn't understand it. But to his credit, Eddie overcame his fear and learned how powerful branching and merging can be. And then he went off the deep end. After he tried branching and had a good first experience with it, Eddie now uses it all the time. He sometimes branches multiple times per week. Every time he makes a code change, he creates a private branch.

Eddie arrives on Monday morning and discovers that he has been assigned bug 7136 (in the Elbonian version, the main window is too narrow because the Elbonian language requires 9 words to say "Hello World"). So Eddie sits down at his desk and begins the process of fixing this bug. The first thing he does is create a branch called "bug_7136". He makes his code change there in his "private branch" and checks it in. Then, after verifying that everything is working okay, he uses the Merge Branches feature to migrate all changes from the trunk into his private branch, just to make sure his code change is compatible with the very latest stuff. Then he runs his test suite again. Then he notices that the repository has changed yet again, so he does this loop once more. Finally, he uses Merge Branches to apply his code fixes to the trunk. Then he grabs a copy of the trunk code, builds it and runs the test suite to verify that he didn't accidentally break anything. When at last he is satisfied that his code change is proper, he marks bug 7136 as complete. By now it is Friday afternoon at 4:00pm, and there's no point in starting anything new at this point, so he just decides to go home.

Eddie never checks anything into the main trunk. He only checks stuff into his private branch, and then merges changes into the trunk. His care and attention to detail are admirable, but he's spending far more time using his source control tool than working on his code.

Let's not even think about what the kids would be like if Eddie and Nelly were to get married.

Dev--Test--Prod

Once you have established the proper level of comfort with the branching features of your source control tool, the next question is how to use those features effectively. One popular methodology for SCM is often called "code promotion". The basic idea here is that your code moves through three stages, "dev" (stuff that is in active development), "test" (stuff that is being tested) and "prod" (stuff that is ready for production release):

- As code gets written by programmers, it is placed in the dev tree. This tree is "basically unstable". Programmers are only allowed to check code into dev.
- When the programmers decide they are done with the code, they "promote" it from dev to test. Programmers are not allowed to check code directly into the test tree. The only way to get code into test is to promote it. By promoting code to test, the programmers are handing the code over to the QA team for testing.
- When the testers decide the code meets their standards, they promote it from test to prod. Code can only be part of a release when it has been promoted to prod.

For a variety of reasons, I personally don't like working this way, but there's nothing wrong with it. Lots of people use this code promotion model effectively, especially in larger companies where the roles of programmer and tester are very clearly separated. I understand that PVCS has specific feature support for "promotion groups", although I've never used this product personally. With other source control tools, the code promotion model can be easily implemented using three branches, one for dev, one for test, and one for prod. The Merge Branches feature is used to promote code from one level to the next, as the sketch below illustrates.
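In a tool like Subversion, where branches are just folders, a promotion is simply a merge between two of those folders. A rough sketch (the repository URL, working copy path and revision numbers are all hypothetical):

    # promote the dev changes made since the last promotion
    # into a working copy of the test branch
    svn merge -r 840:902 file:///var/repos/dev test-wc
    # build and run the tests, then commit the promotion
    svn commit -m "Promote dev changes r840:902 to test" test-wc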

Eric's Preferred Branching Practice

Here at SourceGear our main development tree is called the "trunk". In our repository it is rooted at $/trunk and it contains all the source code and documentation for our entire product.

Best Practice: Keep a "basically unstable" trunk

Do your active development in the trunk, the stability of which increases as you approach release. After you ship, create a maintenance branch and always keep it very stable.

Most new code is checked into the trunk. In general, our developers try to never "break the tree". Anyone who checks in code which causes the trunk builds to fail will be the recipient of heaping helpings of trash talk and teasing until he gets it fixed. The trunk should always build, and as much as possible, the resulting build should always work. Nonetheless, the trunk is the place where active development of new features is happening. The trunk could be described as "basically unstable", a philosophy of branching which is explained in Essential CVS, a fine book on CVS by O'Reilly.

In our situation, the stability of the trunk build fluctuates over the months during our development cycle. During the early and middle parts of a development cycle, the trunk is often not very stable at all. As we approach alpha, beta and final release, things settle down and the trunk gets more and more stable. Not long before release, the trunk becomes almost sacred. Every code change gets reviewed carefully to ensure that we don't regress backwards. At the moment of release, a branch gets created. This branch becomes our maintenance tree for that release. Our current maintenance branch is called "3.0", since that's the current major version number of our product. When we need to do a bug fix or patch release, it is done in the maintenance branch. Each time we do a release out of the maintenance branch (like 3.0.2), we apply a label.

After the maintenance branch is created, the trunk once again becomes "basically unstable". Developers start adding the risky code changes we didn't want to include in the release. New feature work begins. The cycle starts over and repeats itself.

When to branch? Part 1: Principles

Best Practice: Don't create a branch unless you are willing to take care of it

Your decisions about when to branch should be guided by one basic principle: When you create a branch, you have to take care of it. There are responsibilities involved. A branch is like a puppy. In most cases, you will eventually have to perform one or more merge operations. Yes, the SCM tool will make that merge easy, but you still have to do it. If a merge is never necessary, then you probably have the responsibility of maintaining the branch forever. If you create a branch with the intention of never merging to or from it, and never making changes to it, then you should not be creating a branch. Use a label instead.

Be afraid of branches, but not so afraid that you never use the feature. Don't branch on a whim, but do branch when you need to branch.

When to branch? Part 2: Scenarios

There are some situations where branching is NOT the recommended way to go:

- Simple changes. As I mentioned above in my "Eddie" scenario, don't branch for every bug fix or feature.
- Customer-specific versions. There are exceptions to this rule, but in general, you should not branch simply for the sake of doing a custom version for a specific customer. Find a way to build the customizability into your app.

And there are some situations where branching is the best practice:

- Maintenance and development. The classic example, and the one I used above in my story about UltraHello. Maintaining version N while developing version N+1 is the perfect example of a time to use branching.
- Subteam. Sometimes a subset of your team needs to work on something experimental that will take several weeks. When they finish, their work will be folded into the main tree, but in the meantime, they need a separate place to work.
- Code promotion. If you want to use the dev-test-prod methodology I mentioned above, use a branch to model each of the three levels of code promotion.

When to branch? Part 3: Pithy Analogy

A branch is like a working folder for multiple people. A working folder facilitates parallel development by allowing each person to have their own private place to work. When multiple people need a private place to work together, they need a branch.

Looking Ahead

In the next chapter I will delve into the topic of merging branches.

Chapter 8: Merge Branches


What is "merge branches"?

Many users find the word "merge" to be confusing, since it seems to imply that we start out with two things and end up with only one. I'm not going to start trying to invent new vocabulary. Instead, let's just try to be clear about what we mean when we speak about merging branches. I define "merge branches" like this:

To "merge branches" is to take some changes which were done to one branch and apply them to another branch.

Sounds easy, doesn't it? In practice, merging branches often is easy. But the edge cases can be really tricky.

Consider an example. Let's say that Joe has made a bunch of changes in $/branch and we want to apply those changes to $/trunk. At some point in the past, $/branch and $/trunk were the same, but they have since diverged. Joe has been making changes to $/branch while the rest of the team has continued making changes to $/trunk. Now it is time to bring Joe back into the team. We want to take all the changes Joe made to $/branch, no matter what those changes were, and we want to apply those changes to $/trunk, no matter what changes have been made to $/trunk during Joe's exile.

The central question about merge branches is the matter of how much help the source control tool can provide. Let's imagine that our SCM tool provided us with a slider control:

If we drag this slider all the way to the left, the source control tool does all the work, requiring no help at all from Joe. Speaking as a source control vendor, this is the ideal scenario that we strive for. Most of us don't make it. However, here at SourceGear we made the decision to build our source control product on the .NET Framework, which luckily has full support for the kind of technology needed to implement this. The code snippet below was pasted from our implementation of the Merge Branches feature in Vault:
public void MergeBranches(Folder origin, Folder target)
{
    ArrayList changes = GetSelectedChanges(origin);

    DeveloperIntention di = System.Magic.FigureOutWhatDeveloperWasTryingToDo(changes);

    di.Apply(target);
}

Boy do I feel sorry for all those other source control vendors trying to implement Merge Branches without the DeveloperIntention class! And to think that so many people believe the .NET Framework is too large. Sheesh!

OK, I lied. (Stop trying to add a reference to the System.Magic DLL. It doesn't exist.) The actual truth is that this slider can never be dragged all the way to the left.

If we drag the slider all the way to the right, we get a situation which is actually closer to reality. Joe does all the work and the source control tool is no help at all. In essence, Joe sits down with $/trunk and simply re-does the work he did in $/branch. The context is different, so the changes he makes this time may be very different from what he did before. But Joe is smart, and he can figure out The Right Thing to do.

Best Practice: Take responsibility for the merge

Successfully using the branching and merging features of your source control tool is first a matter of attitude on the part of the developer. No matter how much help the source control tool provides, it is not as smart as you are. You are responsible for doing the merge. Think of the tool as a tool, not as a consultant.

In practice, we find ourselves somewhere between these two extremes. The source control tool cannot do magic, but it can usually help make the merge easier. Since the developer must still take responsibility for the merge, things will go more smoothly if she understands what's really going on. So let's talk about how merge branches works.

First I need to define a bit of terminology. For the remainder of this chapter I will be using the words "origin" and "target" to refer to the two branches involved in a merge branches operation. The origin is the folder which contains the changes. The target is the folder to which we want those changes to be applied. Note that my definition of merge branches is a one-way operation. We apply changes from the origin to the target. In my example above, $/branch is the origin and $/trunk is the target. That said, there is nothing which prevents me from switching things around and applying changes in the opposite direction, with $/trunk as the origin and $/branch as the target, but that would simply be a separate merge branches operation.

Conceptually, a merge branches operation has four steps:

1. Developer selects changes in the origin
2. Source control tool applies some changes automatically to the target
3. Developer reviews the results and resolves any conflicts
4. Commit

Each of these steps is described a bit more in the following sections.

1. Selecting changes in the origin

When you begin a merge branches operation, you know which changes from the origin you want to be applied over in the target. Most of the time you want to be very specific about which changes from the origin are to be merged. This is usually evident in the conversation which preceded the merge:

"Dan asked me to merge all the bug fixes from 3.0.5 into the main trunk."
"Jeff said we need to merge the fix for bug 7620 from the trunk into the maintenance tree."
"Ian's experimental rewrite of feature X is ready to be merged into the trunk."

One way or another, you need to tell your source control tool which changes are involved in the merge.
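On the command line, this usually means identifying a revision range before you merge. A sketch in Subversion (the branch URL is hypothetical):

    # list the candidate changes on the branch, with the paths each one touched,
    # so you can decide which revision range to merge
    svn log -v -r 100:HEAD file:///var/repos/branches/3.0.5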

The interface for this operation can vary significantly depending on which tool you are using. The screen shot below is the point where the Merge Branches Wizard in Vault is asking me to specify which changes should be merged. I'm selecting everything back to the last build label:

2. Applying changes automatically to the target

After selecting the changes to be applied, it's time to try and make those changes happen in the target. It is important here to mention that merging branches requires us to consider every kind of change, not just the common case of edited files. We need to deal with renames, moves, deletes, additions, and whatever else the source control tool can handle. I won't spell out every single case. Suffice it to say that each operation should be applied to the target in the way that Makes Sense. This won't succeed in every situation, but when it does, it is usually safe. Examples:

- If a file was edited in the origin and a file with the same relative path exists in the target, try to make the same edit to the target file. Use the automerge algorithm I mentioned in chapter 3. If automerge fails, signal a conflict and ask the user what to do.
- If a file was renamed in the origin, try doing the same rename in the target. Here again, if the rename isn't possible, signal a conflict and ask the user what to do. For example, the target file may have been deleted.
- If a file was added in the origin, add it to the target. If doing so would cause a name clash, signal a conflict and ask the user what to do.

What happens if an edited file in the origin has been moved in the target to a different subfolder? Should we try to apply the edit? I'd say yes. If the automerge succeeds, there's a good chance it is safe. Bottom line: a source control tool should do all the operations which seem certain to be safe. And even then, the user needs a chance to review everything before the merge is committed to the repository.

Let's consider a simple example from Subversion. I created a folder called trunk, added a few files, and then branched it. Then I made three changes to the trunk:

- Deleted __init__.py
- Modified panel.py
- Added a file called anydbm.py

Then I asked Subversion to merge all changes between version 2 and 4 of my trunk into my branch:
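The command sequence for a merge like this looks roughly as follows (a sketch; the repository URL and working copy layout are assumptions, not taken from the original example):

    # from within a working copy of the branch,
    # pull in the trunk changes made between revisions 2 and 4
    svn merge -r 2:4 file:///var/repos/trunk
    # review what the merge did before committing anything
    svn status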

Subversion correctly detected all three of my changes and applied them to my working copy of the branch.

3. Developer review

The next step in a merge branches operation is a review by the developer. The developer is ultimately responsible, and is the only one smart enough to declare that the merge is correct. So we need to make sure that the developer is given final approval before we commit the results of our merge to the repository. This is the developer's opportunity to take care of anything which could not be done automatically by the source control tool in step 2. For example, suppose the tree contains a file which is in a binary format that cannot be automatically merged, and that this file has been modified in both the origin and the target. In this case, the developer will need to construct a version of this file which correctly incorporates both changed versions.

Best Practice: Review the merge before you commit

After your source control tool has done whatever it can do, it's your turn to finish the job. Any conflicts need to be resolved. Make sure the code still builds. Run the unit tests to make sure everything still works. Use a diff tool to review the changes.

Merging branches should always take place in a working folder. Your source control tool should give you a chance to do these checks before you commit the final results of a merge branches operation.

4. Commit

The very last step of a merge branches operation is to commit the results to the repository. Simplistically, this is a commit like any other. Ideally, it is more. The difference is whether or not the source control tool supports "merge history".

The benefits of merge history

Merge history contains special historical information about all merge branches operations. Each time you use the merge branches feature, it remembers what happened. This allows us to handle two cases with a bit more finesse:

- Repeated merge. Frequently you want to merge from the same origin to the same target multiple times. Let's suppose you have a sub-team working in a private branch. Every few weeks you want to merge from the branch into the trunk. When it comes time to select the changes to be merged over, you only want to select the changes that haven't already been merged before. Wouldn't it be nice if the source control tool would just remember this for you? Merge history allows this and makes things more convenient. (Without merge history, the workaround is simply to use a label to mark the point of your last merge.)
- Merge in both directions. A similar case happens when you have two branches and you sometimes want to merge back and forth in both directions. For example:

1. Create a branch
2. Do some work in both the branch and the trunk
3. Merge some changes from the branch to the trunk
4. Do some more work
5. Merge some changes from the trunk to the branch

At step 5, when it comes time to select changes to be merged, you want the changes from step 3 to be ignored. There is no need to merge those changes from the trunk to the branch because the branch is where those changes came from in the first place! A source control tool with a smart implementation of merge history will know this.

Not all source control tools support merge history. A tool without merge history can still merge branches. It simply requires the developer to be more involved, to do more thinking. In fact, I'll have to admit that at the time of this writing, my own favorite tool falls into this category. We're planning some major improvements to the merge branches feature for Vault 4.0, but as of version 3.x, Vault does not support merge history. Subversion doesn't either, as of version 1.1. Perforce is reported to have a good implementation of merge history, so we could say that its "slider" rests a bit further to the left.
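Without merge history, the developer tracks the merge points by hand. In Subversion (as of 1.1), one common convention is to record the merged revision range in the commit message, or to mark the endpoint with a label, so the next merge can start where the last one ended. A sketch with hypothetical revision numbers:

    # first merge: bring branch changes r1100 through r1205 into a trunk working copy
    svn merge -r 1100:1205 file:///var/repos/branches/subteam
    svn commit -m "Merged subteam branch r1100:1205 into trunk"
    # weeks later, start the next merge at the recorded endpoint
    # so already-merged changes are not applied twice
    svn merge -r 1205:1300 file:///var/repos/branches/subteam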

Summary

I don't want this chapter to be a step-by-step guide to using any one particular source control tool, so I'm going to keep this discussion fairly high-level. Each tool implements the merging of branches a little differently. For some additional information, I suggest you look at Version Control with Subversion, a book from O'Reilly. It is obviously Subversion-specific, but it contains a discussion of branching and merging which I think is pretty good.

The one thing all these tools have in common is the need for the developer to think. Take the time to understand exactly how the branching and merging features work in your source control tool.

Chapter 9: Source Control Integration with IDEs


Background: What is an IDE?
The various applications used by software developers are traditionally called "tools". When we speak of "developer tools", we're talking about the essential items that programmers use every day, like compilers and text editors and syntax checkers and Mountain Dew. Just as a master carpenter uses her tools to build a house, developers use their tools to build software applications.

In the old days, each developer would assemble their own collection of their favorite tools. Back around 1991, my preferred toolset looked something like this:

- gcc (for compiling source code)
- gdb (for debugging)
- make (for managing builds)
- rcs (for managing versions)
- emacs (for editing source code)
- vi (for editing the emacs makefile)

Fifteen years later, most developers would consider this approach to be strange. Today, everything is "integrated". Instead of selecting one of each kind of tool, we select an Integrated Development Environment (IDE), an application which collects all the necessary tools together in one place. To continue the metaphor, we would say that the focus today is not on the individual tools, but rather, on the workshop in which those tools are used.

This trend is hardly new. Ever since Borland released Turbo Pascal in 1983, IDEs have become more popular every year. In the last ten years, many IDE products have disappeared as the industry has consolidated. Today, it is only a small exaggeration to say that there are just two IDEs left: Visual Studio and Eclipse. But despite the industry consolidation, the trend is clear. Developers want their tools to be very well integrated together.

Most recently, Microsoft's Visual Studio Team System takes this trend to a higher level than we have previously seen. Mainstream IDEs in the past have provided base operations such as editing, compiling, building and documentation. Now Visual Studio also has unit tests, visual modeling, code generators, and work item tracking. Furthermore, the IDE isn't just for coders anymore. Every task performed by every person involved in the software development process is moving into the IDE.

Benefits of source control integration with IDEs


Source control is one of the development tools which has been commonly integrated into IDEs for quite some time. The fit is very natural. Here at SourceGear, our source control product has two main client applications:

1. A standalone client application which is specifically designed to talk with the source control server.
2. A client plugin which adds source control features into Visual Studio.

Unsurprisingly, the IDE client is very popular with our users. Many of our users would never think about using source control without IDE integration. Why does version control work so nicely inside an IDE? Because it makes the three most common operations a lot easier:

Checkout
When using the checkout-edit-checkin model, files must be checked out before they are edited. With source control integrated into an IDE, this task can be quite automatic. Specifically, when you begin to edit a file, the IDE will notice that you do not have it checked out yet and check the file out for you. Effectively, this means developers never need to remember to checkout a file.

Add
A common and frustrating mistake is to add a new file to a project but forget to place it under source control. So when I am done with my coding task, I checkin my changes to the existing files, but the newly added file never makes it into the repository. The build is broken. When using source control integration with an IDE, this mistake is basically impossible to make. Most IDEs today support the notion of a "project", a list of all files which are considered part of the build process. When used with source control, the IDE decides what files to place under source control because it knows every file that is part of the project. The act of adding a file to the project also adds it to source control.

Checkin
IDEs excel at nagging developers. The user interface of an IDE has special places to nag the developer about compiler errors and unsaved files and even unfixed bugs. Similarly, visual indicators in the IDE can be used to remind the developer that he has not yet checked in his changes. When source control is integrated into an IDE, developers don't have to think about it very much. They don't have to try to remember to Checkout, Add or Checkin because the IDE is either performing those actions automatically or reminding them to do it.

Bigger benefits
Once you integrate source control into an IDE, you open the possibility for cool features that go beyond the basics. For example, source control integration can be incredibly helpful when used with refactoring. When I use the refactoring features of Eclipse to rename a Java class, it is obviously nice that Eclipse figures out all the changes that need to be made. It's even nicer that Eclipse automatically handles all the necessary source control operations. It even performs the name change of the source file.

For another example, here is a screen shot of a Blame feature integrated into Eclipse:

The user story for this feature goes like this: The developer is coding and she encounters something that deserves to be on The Daily WTF. She wants to immediately know who is responsible, so she right-clicks on the offensive line and selects the Blame feature. The source control plugin queries the repository for history and determines who made the change. The task was simpler because the Blame feature is conveniently located in the place where it is most likely to be needed.

Tradeoffs and Problems


For source control, IDE integration is great in theory, but it has not always been so great in practice. The tradeoffs of having your IDE do source control for you are the same as the tradeoffs of having your IDE do anything else. It's easier, but you have less control over the process.

Before I continue, I need to make a confession: I personally have never used source control integration with an IDE. Heck, for a long time I didn't use IDEs at all. I'm a control freak. It's not enough for me to know what's going on under the hood. Sometimes I prefer to just do everything myself. I don't like project-based build systems where I add a few files and the IDE magically builds my app. I like driving make systems where I can control exactly where everything is and where the build targets are placed. Except for a brief and passionate affair with Think C during the late eighties, I didn't really start using IDE project files until Visual Studio .NET. Today, I am gradually becoming more and more of an IDE user, but I still prefer to do all source control operations using a standalone GUI client. Eventually, that will change, and my transformation to IDE user will be complete.

Anyway, for the sake of completeness, I will explain the tradeoffs I see with using source control integration with IDEs. This should be taken as information, not as an argument against the feature. IDE integration is the most natural way to use source control on a daily basis.

The first observation is that IDE clients have fewer features than standalone clients. The IDE is great for basic source control operations, but it is definitely not the natural place to perform all source control operations. Some things, such as branching, don't fit very well at all. However, this is a minor point which merely illustrates that an IDE client cannot be the only user interface for accessing a source control repository. If this were the only problem, it would not be a problem. This is the sort of tradeoff that I would consciously accept.

The real problem with source control integration for IDEs is that it just doesn't work very well. For this sad state of affairs, I put most of the blame on MSSCCI.

MSSCCI
It's pronounced "misskee", and it stands for Microsoft Source Code Control Interface. MSSCCI is the API which defines the interaction between Microsoft Visual Studio and source control tools. A source control tool which wants to support integration with Visual Studio must implement this API. Basically, it's a DLL which defines a number of documented entry points. When configured properly, the IDE makes calls into the DLL to perform source control operations as needed or as requested by the user.

Originally, Microsoft's developer tools were the only host environments for MSSCCI. Today, MSSCCI is used by lots of other IDEs as well. It has become sort of a de facto standard. Source control vendors implemented MSSCCI plugins so that their products could be used within Microsoft IDEs. In turn, vendors of other IDEs implemented MSSCCI hosting so that their products could be used with the already-available source control plugins.

The ubiquity of MSSCCI is very unfortunate. MSSCCI was designed to be a bridge between SourceSafe and early versions of Microsoft Visual Studio. It served this purpose just fine, but now the API is being used for lots of other version control tools besides SourceSafe and lots of other IDEs besides Visual Studio. It is being used in ways that it was never designed to be used, resulting in lots of frustration. The top three problems with MSSCCI are:

1. Poor performance. SourceSafe has no support for networking, but the architecture of most modern version control tools involves a client and a server with TCP in between. To get excellent performance from a client-server application, careful attention must be paid to the way the networking is done. Things like threading and blocking and buffering are very important. Unfortunately, MSSCCI makes this rather difficult.
2. No Edit-Merge-Commit. SourceSafe is basically built around the Checkout-Edit-Checkin approach, so that's how MSSCCI works. Building a satisfactory MSSCCI plugin for the Edit-Merge-Commit paradigm is very difficult.
3. No atomic transactions. SourceSafe has no support for atomic transactions, so MSSCCI and Visual Studio were not designed to use them. This means that sometimes modern version control tools like Vault can't group things together properly at commit time.

On top of all this, all the world's MSSCCI hosts tend to implement their side of the API a little differently. If you implement a MSSCCI plugin and get everything working with Visual Studio 2003, you have approximately zero chance of it working well with Visual Basic 6, FoxPro or Visual Interdev. After you code in all the special hacks to get things compatible with these fringe Microsoft environments, your plugin still has no real chance of working with third party products like MultiEdit. Every IDE requires some different tweaks and quirky behavior to make it work. By the time you get your plugin working with some of these other IDEs, your regression testing shows that it doesn't work with Visual Studio 2003 anymore. Lather. Rinse. Repeat. Most developers who work with MSSCCI eventually turn to recreational pharmaceuticals in a futile effort to cope.

A brighter future
Luckily, MSSCCI is fading away. Earlier in this article I flippantly joked that Visual Studio and Eclipse were the only IDEs left in the world. This is of course an exaggeration, but the fact remains that these two products have the lion's share, so we can take some comfort in their dominance when we think about the prevalence of MSSCCI in the future:

- Eclipse does not use MSSCCI. It has its own source control integration APIs.
- Visual Studio 2005 introduced a new and greatly improved API for source control integration.

So, the two dominant IDEs today inspire us to dream of a MSSCCI-free world. The planet will certainly be a nicer place to live when MSSCCI is a distant memory.

Here at SourceGear, the various problems with MSSCCI have caused us to hold a cautious and reserved stance toward IDE integration. Most of our customers really would prefer an IDE client, so we give them one. But we consider our standalone GUI client to be the primary UI because it is faster and more full-featured. And internally, most of us on the Vault team use the standalone GUI client for our everyday work.

But our posture is changing dramatically. We are currently working on an Eclipse plugin as well as a completely new plugin for the new source control API in Visual Studio 2005. Sometime in early 2007, we will be ready to consider our IDE clients to be primary, with our other client applications available for less common operations. What do I mean when I say "primary"? Well, among other things, I mean that the IDE clients will be the way we use our own product. Including me. :-)

It's not yet terribly impressive to look at, but here's a screen shot of our new Visual Studio 2005 client:

Final thoughts
The direction of this industry right now is toward more and more integration. This is a very good thing. We're going to see many new improvements. Users will be happier. Just as a spice rack belongs near the stove, source control should always be available where the developer is working.
