
Lesson 1: Controlling Input and Output

Either from your own experience or in the SAS® Programming I: Essentials course, you've learned how to use a DATA
step to create a new SAS data set from an existing SAS data set. Typically, SAS reads in one observation from a SAS data
set and writes out one observation. But sometimes, you need to control input and output more precisely in a DATA step.

For example, you might want to create multiple observations from one observation, or conditionally write observations
to specific output data sets. You might want to control which variables are read from an input data set and written to an
output data set. You might even want to control how many observations are processed from an input data set and
written out in a procedure.

In this lesson, you learn to control input and output by using additional options and statements in your code.

Objectives

In this lesson, you learn to do the following:

• explicitly control the output of multiple observations to a SAS data set
• create multiple SAS data sets in a single DATA step
• use conditional processing to control the data set or sets to which an observation is written
• control which variables are written to an output data set during a DATA step
• control which variables are read from an input data set during a DATA step
• control how many observations are processed from an input data set during a DATA or PROC step

Outputting Multiple Observations


The SAS data set growth has three variables: Department, Total_Employees, and Increase. There are six observations in
the data set. The growth rate of each department at Orion Star is forecast in the Increase variable.

Suppose you want to create the SAS data set forecast and calculate the total number of employees in each department
at Orion Star at the end of the first year and the total number of employees at the end of the second year. The new data
set will have two observations for each observation in the growth data set.

Let's see how you accomplish this task by using DATA step processing.

Understanding Implicit Output Statements


You know how to use a SET statement to read in an existing SAS data set and create a new SAS data set. As the DATA
step executes, SAS reads an observation from the existing SAS data set into the program data vector. At the end of the
DATA step, an implicit OUTPUT statement directs SAS to write the observation to the SAS data set or data sets being
created. Then, an implicit RETURN statement returns processing to the top of the DATA step.

Remember, when SAS reads an existing data set with a SET statement, it retains the values in the program data vector.
When the next observation is read, the values of the variables in that observation overwrite the values in the program
data vector. Processing continues this way until all the observations in the existing SAS data set are read. This results in
one observation in the output data set for every observation in the input data set.
Question
The SAS data set orion.growth contains 6 observations and 3 variables (Department, Total_Employees, and
Increase). How many observations and variables will the forecast data set contain?

data forecast;
   set orion.growth;
   TotalYear1=Total_Employees*(1+Increase);
run;

a. 6 observations, 4 variables
b. 6 observations, 3 variables
c. 12 observations, 4 variables

The correct answer is a. The forecast data set will contain the 6 observations from the growth data set and it will have 4
variables, the 3 variables in the growth data set plus the new variable, TotalYear1.

Using Explicit OUTPUT Statements


You can control when SAS writes an observation to a SAS data set by using an explicit OUTPUT statement in your code.

The syntax for the OUTPUT statement begins with the keyword OUTPUT. Optionally, the keyword can be followed by the
data set name to which the observation should be written. If you do not specify a data set name in the OUTPUT
statement, the observation is written to the data set named in the DATA statement.

Once you use an explicit OUTPUT statement in the DATA step, there is no implicit OUTPUT statement at the bottom of
the DATA step, and SAS writes an observation to a data set only when an explicit OUTPUT statement executes. Every
DATA step still has an implicit RETURN as its last executable statement. OUTPUT statements can be used alone or they
can be used in conditional or DO group processing. You'll learn more about that later.

Remember, we want to use the growth data set to forecast the number of employees in each department at the end of
year one and at the end of year two. There will be two observations in the data set forecast for every observation in the
data set growth. Let's look at how you can use the OUTPUT statement to control when SAS writes an observation to
output data.

The SET statement reads the first observation in the SAS data set growth into the program data vector. Additional
programming statements create the variable Year and calculate the total number of employees at the end of the first
year. You learn about these additional statements later. For now, just focus on the OUTPUT statement. The first OUTPUT
statement directs SAS to write the contents of the PDV to the output data set. No data set is listed in the OUTPUT
statement, so SAS writes the observation to the forecast data set.

Processing continues with the additional programming statements that assign a new value to Year and calculate the
total number of employees at the end of the second year. Then, the second OUTPUT statement directs SAS to write the
contents of the PDV to the forecast data set. Now there are two observations in the forecast data set from one
observation that SAS read from the growth data set. An implicit RETURN statement returns processing to the top of the
DATA step, and SAS reads the next observation from the growth data set.
Using Explicit OUTPUT Statements
In this demonstration, you run the code that creates the SAS data set forecast with two observations for every
observation in the growth data set.

1. In the orion library, open the data set growth and view the data. Remember, we want to create a data set that
forecasts the growth of each department at the end of year one and year two.
2. Copy and paste the following code into the editor.

data forecast;
   set orion.growth;
   Year=1;
   Total_Employees=Total_Employees*(1+Increase);
   output;
   Year=2;
   Total_Employees=Total_Employees*(1+Increase);
   output;
run;

This DATA step creates the Year variable and assigns it a value of 1. To calculate Total_Employees, we multiply
the value in Total_Employees by 1 plus the value in Increase. This is the data for the first year, so we need to add
an OUTPUT statement.

Next we want the data for the second year, so Year=2. We'll use the same calculation for
Total_Employees. Because we used an OUTPUT statement earlier in the code, there is no implicit OUTPUT
statement at the bottom of the DATA step, so we add a second OUTPUT statement.

3. Submit the DATA step and check the log. How many observations are in the data set forecast?

Looking at the log, you can see that SAS read 6 observations from the growth data set, and the forecast data set
has 12 observations and 4 variables.

4. Add a PRINT procedure to view the data. Copy and paste the following code into the editor.

proc print data=forecast noobs;
   var Department Total_Employees Year;
run;

The NOOBS option eliminates the OBS column. The VAR statement specifies the variables to print in the report:
Department, Total_Employees, and Year.

5. Submit the PROC PRINT step and view the output.

If the Administration department grows at the predicted rate, approximately how many employees will the
department have at the end of the second year?

There will be approximately 53 employees.

Processing Explicit Output Statements


To better understand explicit output statements, let's explore how SAS compiled the code in the previous
demonstration.

During compilation, the program data vector is created and a slot is added for each variable in the input data set. The
input data set supplies the variable names, as well as the type, length, and other variable attributes. Any variables that are
created in the DATA step are also added to the program data vector. The attributes of each new variable are determined
by the expression in the statement. In this case, the numeric variable Year is created with a default length of 8 bytes.

At the bottom of the DATA step, the compilation phase is complete, and the descriptor portion of the output data set is
created. There are no observations in the output data because the DATA step has not yet executed.

The DATA step executes once for each observation in the input data set. There are six observations in the growth data
set, so the DATA step will execute six times. As execution begins, the program data vector is initialized and the values of
the variables are set to missing.

The SET statement reads the first observation from the input data set and writes the values to the program data vector.
The first assignment statement executes, and Year is assigned the value of 1. Then the second assignment statement
executes and Total_Employees is recalculated as 34 times 1.25. The new value of 42.5 replaces 34 as the value for
Total_Employees. The first OUTPUT statement writes the observation to the forecast data set. Notice that there's no
data set named in the OUTPUT statement.

Here's a question: How does SAS know where to write the observation? When no data set is listed in the OUTPUT
statement, the observation is written to the data set named in the DATA statement.

Processing continues and the variable Year is assigned the value of 2. The value of Total_Employees is 42.5, which is the
value from the first year. The assignment statement executes and the value of Total_Employees is recalculated again.

The second OUTPUT statement directs SAS to write the observation to the forecast data set. The OUTPUT statement is
needed here because implicit output is canceled when you use an OUTPUT statement in your code. The forecast data set
contains two observations after the first iteration of the DATA step.

Here's another question: Which variables will be set to missing in the program data vector at the top of the DATA step?
Only the variable Year will be reinitialized. Variables created by INPUT and assignment statements are reinitialized.
Variables read with a SET statement are not. Those values are overwritten when the next observation is read.

Processing continues with the second observation in the input data set and continues until SAS reaches the end of the
file.
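One way to watch this execution sequence is to write the contents of the program data vector to the log just before each OUTPUT statement. The following is a minimal sketch, assuming the orion library is assigned; the PUTLOG statements and their message text are additions for illustration and are not part of the original demonstration.

```sas
/* Sketch: trace the PDV in the log before each explicit OUTPUT. */
/* Assumes orion.growth is available, as in the demonstration.   */
data forecast;
   set orion.growth;
   Year=1;
   Total_Employees=Total_Employees*(1+Increase);
   /* Values that the first OUTPUT statement is about to write */
   putlog 'Before first OUTPUT:  ' _N_= Department= Total_Employees= Year=;
   output;
   Year=2;
   Total_Employees=Total_Employees*(1+Increase);
   /* Total_Employees now reflects two years of growth */
   putlog 'Before second OUTPUT: ' _N_= Department= Total_Employees= Year=;
   output;
run;
```

The log should show two lines for each value of _N_, one per OUTPUT statement, which matches the two observations written during each iteration of the DATA step.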

Writing to Multiple SAS Data Sets


The SAS data set orion.employee_addresses contains addresses for all Orion Star employees.

Suppose you want to create three separate data sets based on the value of the variable Country. If the value of Country
is US, the observation should be written to the usa data set. If the value of Country is AU, the observation should be
written to the australia data set. If the value of Country is anything else, the observation should be written to the other
data set.

Creating Multiple SAS Data Sets


The first thing you need to do is create three SAS data sets. So far you've specified a single data set in the DATA
statement. To create more than one data set, you simply specify the names of the SAS data sets you want to create,
separated by a space.
This DATA step creates the data sets usa, australia, and other in the default work library. The implicit OUTPUT statement
writes every observation in the employee_addresses data set to each of the three data sets listed in the DATA
statement, so this code creates three identical data sets. But remember, you want to create three data sets that contain
observations based on the value of the Country variable, so you'll need to write some additional DATA step code.
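The DATA step described above is not reproduced in this text. A minimal sketch that is consistent with the description, assuming the orion library is assigned, might look like this:

```sas
/* Three data set names in the DATA statement, separated by spaces. */
/* With no explicit OUTPUT statement, the implicit OUTPUT writes    */
/* every observation to usa, australia, and other alike.            */
data usa australia other;
   set orion.employee_addresses;
run;
```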

Using Conditional Logic to Create Multiple SAS Data Sets


You know how to create multiple data sets in a DATA statement and how to use an explicit OUTPUT statement to
control when SAS writes an observation to a data set.

You need to add conditional processing to this DATA step so that SAS writes an observation to one of the three data sets
based on the value of the Country variable. One way to do this is by using IF-THEN/ELSE statements.

Here is the complete DATA step with the conditional logic. Remember that you create the three data sets by listing them
in the DATA statement. SAS evaluates the first condition. If the condition is true, SAS takes the action you specify. If the
condition is false, SAS evaluates the next condition. The condition we want to test is the value of Country. For each value
of Country, we want to write the observation to the data set named in the OUTPUT statement. If neither condition is
true, we want to output the observation to the other data set.

Each data set listed in the OUTPUT statement must also be listed in the DATA statement.

Creating Multiple SAS Data Sets Using IF-THEN/ELSE and OUTPUT Statements
In this demonstration, you run a DATA step that creates three data sets.

1. Here's the DATA step that we've been working with. Copy and paste this DATA step into the editor.

data usa australia other;
   set orion.employee_addresses;
   if Country='AU' then output australia;
   else if Country='US' then output usa;
   else output other;
run;

The code will create three different output data sets based on the value of the Country variable. Here's a
question: What data set will contain an observation where the value of Country is lowercase 'us'?

The observation will be written to the other data set because character values are case sensitive and the
lowercase value will not match the uppercase value that is specified.

2. Submit the code and check the log.

The three data sets were created. Which value of Country occurs most frequently in the input data set?

You can tell that the value US occurs most frequently because the USA data set has the greatest number of
observations.

3. Examine the code again. For conditional processing, it's most efficient to check for values in order of decreasing
frequency. Revise the program, as shown below, to check Country='US' first.

data usa australia other;
   set orion.employee_addresses;
   if Country='US' then output usa;
   else if Country='AU' then output australia;
   else output other;
run;

4. You can print a listing report for each of the data sets. Copy and paste the following PROC PRINT steps into
the editor. Each PROC PRINT step can print only one data set.

title 'Employees in the United States';
proc print data=usa;
run;

title 'Employees in Australia';
proc print data=australia;
run;

title 'Non US and AU Employees';
proc print data=other;
run;
title;

Here's a question: What is the effect of the null TITLE statement that follows the third PROC PRINT step? The null
TITLE statement cancels all titles.

5. View the reports. They show that observations were selectively written to the appropriate data sets.

Take a closer look at the other data set. This data set contains observations where the values of Country were
miscoded as lowercase. Now that you've seen the data in the other data set, you could fix the data or revise the
conditional logic to look for both uppercase and lowercase values, or you could use a function in the IF
statement to change the values of Country to uppercase.
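As a rough sketch of the last option, the UPCASE function can be applied to Country inside each IF condition. This assumes the miscoded values differ only in case; the stored values of Country are not changed, because the function result is used only for the comparison.

```sas
/* Sketch: compare an uppercased copy of Country in each condition. */
/* The value of Country in the output data sets is left as it was.  */
data usa australia other;
   set orion.employee_addresses;
   if upcase(Country)='US' then output usa;
   else if upcase(Country)='AU' then output australia;
   else output other;
run;
```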

Using a SELECT Group for Conditional Processing


Another way to perform conditional processing in a DATA step is to use a SELECT group. It's more efficient to use a
SELECT group rather than a series of IF-THEN statements when you have a long series of mutually exclusive conditions.

A SELECT group has these statements: a SELECT statement that begins the group, one or more WHEN statements that
identify statements to execute when a condition is true, an optional OTHERWISE statement that specifies a statement to
execute if none of the WHEN conditions are true, and an END statement that ends the group. Although the OTHERWISE
statement is optional, omitting it results in an error if all WHEN conditions are false. You can use OTHERWISE with a
null statement to prevent SAS from issuing an error message.

Let's see how SAS processes the SELECT group by looking at a simple example. The SELECT group begins with the
keyword SELECT. The optional SELECT expression specifies any SAS expression that evaluates to a single value. Often a
variable name is used as the SELECT expression. In this example, the variable a is specified in the SELECT statement.

The WHEN statement begins with the keyword WHEN and is followed by at least one WHEN expression. The WHEN
expression can be any SAS expression, including a compound expression. In this statement, SAS finds the value of a and
compares that value to 1. If the value of a is equal to 1, the comparison is true and x is multiplied by 10. If the
comparison is false, SAS compares the value of a to 3, 4, or 5. Notice that you can list multiple values and separate them
with commas. If the value of a matches 3, 4, or 5, the comparison is true, and x is multiplied by 100.
If no comparisons are true, the OTHERWISE statement executes. However, nothing happens because there is no
executable statement on the OTHERWISE statement. The SELECT group ends and DATA step processing continues.
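The example discussed above is not reproduced in this text. The following self-contained sketch follows the description; the data set name work.select_demo and the three input lines are made up for illustration.

```sas
/* Sketch of the SELECT group described above, using hypothetical data. */
data work.select_demo;
   input a x;
   select (a);
      when (1) x=x*10;          /* a equals 1             */
      when (3,4,5) x=x*100;     /* a equals 3, 4, or 5    */
      otherwise;                /* null statement: x is left unchanged */
   end;
   datalines;
1 2
4 2
7 2
;
run;

proc print data=work.select_demo noobs;
run;
```

With these input values, x becomes 20 when a is 1, becomes 200 when a is 4, and stays 2 when a is 7, because the null OTHERWISE statement does nothing.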

Now let's use a SELECT group to create the same three data sets that we created with IF-THEN/ELSE statements. You
start with the same DATA and SET statements. You’re evaluating the value of Country, so Country is the SELECT
expression.

Here's a question: What value in the input data should you check for first? You should check for the value US. As with
IF-THEN/ELSE statements, SAS processes the WHEN statements in a SELECT group from top to bottom, so it is most
efficient to check the values in order of decreasing frequency.

When the value of Country is US, you want to output the observation to the usa data set. Notice that you must enclose
the character value in quotation marks. Character values are case sensitive, so they must match the values in the data.

When the value of Country is AU, you want to output the observation to the australia data set. Otherwise, if the value of
Country is anything else, you want to output the observation to the other data set. Finally, you end the select group and
the DATA step.

Question
If you submit the code shown here, what happens if the value of Country is canada?

data northamerica asia check;
   set salesdata;
   select(Country);
      when ("US","CANADA","MEXICO") output northamerica;
      when ("JAPAN") output asia;
      otherwise output;
   end;
run;
a. Nothing happens because the value does not match any value listed in a WHEN expression.

b. The observation is written only to the check data set.

c. The observation is written to all three output data sets.

The correct answer is c. The lowercase value canada does not match any WHEN expression, so the OTHERWISE
statement executes and writes the observation to all three data sets.

Using a SELECT Group to Create Multiple Data Sets

In this demonstration, you run a DATA step that uses a SELECT group to create multiple data sets.

1. Copy and paste the following code into the editor.

data usa australia other;
   set orion.employee_addresses;
   select (Country);
      when ('US') output usa;
      when ('AU') output australia;
      otherwise output other;
   end;
run;

This DATA step uses a SELECT group to create the data sets usa, australia, and other.

2. Submit the code and check the log.

You see the same results you got when you submitted the DATA step with the IF-THEN/ELSE statements. There
are three data sets and they each have observations.

3. Revise the code to eliminate the optional OTHERWISE statement as follows:

data usa australia other;
   set orion.employee_addresses;
   select (Country);
      when ('US') output usa;
      when ('AU') output australia;
   end;
run;

Earlier you learned that the OTHERWISE statement is optional, but you’ll get an error if no WHEN statement is
true.

4. Submit the revised program to see what happens when you delete the OTHERWISE statement.

5. Check the log. You can see that DATA step processing stopped when it encountered a value that did not match
the value in the WHEN expression.

6. Revise the code as follows to add a null OTHERWISE statement.

data usa australia other;
   set orion.employee_addresses;
   select (Country);
      when ('US') output usa;
      when ('AU') output australia;
      otherwise;
   end;
run;

7. Submit the revised program and check the log.

This time the code ran successfully but values of Country that do not match US and AU are not written to any
data set.

A null OTHERWISE statement can be useful when you want to ignore certain values. For example, if you only
want to create data sets for employees in the United States and Australia, then you would want to ignore values
for other countries. In this data, however, the observations we are ignoring by using a null OTHERWISE
statement are observations that should be written to one of the data sets.
8. Correct this problem as follows by listing multiple values, separated by commas, in each WHEN expression.

data usa australia other;
   set orion.employee_addresses;
   select (Country);
      when ('US','us') output usa;
      when ('AU','au') output australia;
      otherwise;
   end;
run;

9. Resubmit the program and check the log to see if any observations were written to the other data set.

108 observations were written to the australia data set, 316 observations were written to the usa data set,
and no observations were written to the other data set.

Using Functions in a SELECT Expression


In the demonstration, you handled the lowercase values au and us by testing for multiple conditions in the WHEN
expression. Another way to handle the miscoded values in the Country variable is to use the SAS function UPCASE to
change the case of the values in the variable to uppercase before the condition is tested.

When this code runs, the function changes each value of Country to all uppercase letters. Then, the SELECT expression
resolves to uppercase AU, the second WHEN expression is true, and the observation is written to the australia data set.
All values will match one of the two WHEN expressions, and no observations will be written to the other data set.

This solution is more versatile than checking for multiple values in the WHEN statement because you don't have to know
how the values are coded. The UPCASE function handles values coded in lowercase or mixed case. You could also use
the LOWCASE function in a SELECT expression if you were testing for lowercase values in the WHEN expression. You
learn more about these case-changing functions in another lesson.
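The code that this section refers to is not reproduced in this text. A minimal sketch of the technique, assuming the same input data set as in the demonstration, places the UPCASE function in the SELECT expression:

```sas
/* Sketch: UPCASE in the SELECT expression, so each WHEN expression */
/* only needs the uppercase form of the value.                      */
data usa australia other;
   set orion.employee_addresses;
   select (upcase(Country));
      when ('US') output usa;
      when ('AU') output australia;
      otherwise output other;   /* anything that is neither US nor AU */
   end;
run;
```

Because UPCASE is applied only in the SELECT expression, the values of Country that are written to the output data sets keep their original case.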

Using DO-END Groups in a SELECT Group


Suppose you want to do more than write an observation to a particular data set when the WHEN expression is true. You
also want to create a new variable named Benefits and assign a value to the variable based on the WHEN expression.

To execute multiple statements when a WHEN expression is true, you can put the statements in a DO group. A DO group
begins with a DO statement and ends with an END statement.

In this example, when Country is US, SAS executes the assignment statement that creates the variable Benefits and
assigns it the value 1. Then SAS writes the observation to the usa data set.
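The example is not reproduced in this text. The sketch below shows one way the DO groups might be written. Only the value Benefits=1 for US comes from the description above; the values 2 and 0 for the other branches are assumptions chosen for illustration.

```sas
/* Sketch: DO groups let each WHEN statement run several statements. */
data usa australia other;
   set orion.employee_addresses;
   select (Country);
      when ('US') do;
         Benefits=1;            /* value given in the lesson text */
         output usa;
      end;
      when ('AU') do;
         Benefits=2;            /* assumed value, for illustration only */
         output australia;
      end;
      otherwise do;
         Benefits=0;            /* assumed value, for illustration only */
         output other;
      end;
   end;
run;
```

Note that each DO group has its own END statement, in addition to the END statement that closes the SELECT group.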

Omitting the SELECT Expression


So far we've seen SELECT statements that have a SELECT expression, but the SELECT expression is optional.

The way SAS evaluates a WHEN expression in a SELECT group depends on whether or not you specify a SELECT
expression. When you specify a SELECT expression in the SELECT statement, SAS finds the value of the SELECT expression
and then compares the value to each WHEN expression to return a value of true or false.

For example, suppose SAS executes the SELECT statement here and the value of Country is AU. SAS compares the value
AU to the first WHEN expression. If the comparison is false, SAS proceeds either to the next WHEN expression in the
current WHEN statement, or to the next WHEN statement. If the comparison is true, SAS executes the statement in the
WHEN statement. If no WHEN expression is true, SAS executes the OTHERWISE statement if one is present.

If you don't specify a SELECT expression, SAS evaluates each WHEN expression to produce a result of true or false. If we
write the same SELECT group without a SELECT expression, we must add the variable to the WHEN expression so that it
can be evaluated.
Suppose SAS is evaluating the same observation that has a value of AU for Country. In the first WHEN expression, SAS
evaluates Country and finds AU, then compares it to US. The WHEN expression is false, so SAS proceeds either to the
next WHEN expression in the current WHEN statement, or to the next WHEN statement.

In the next WHEN expression, SAS evaluates Country and finds the value AU, then compares it to AU. The WHEN
expression is true, so SAS executes the statement in the WHEN statement. Again, if no WHEN expression is true, SAS
executes the OTHERWISE statement if one is present.
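The two forms discussed above are not reproduced in this text. As a minimal sketch, the same SELECT group written without a SELECT expression might look like this, with the full comparison moved into each WHEN expression:

```sas
/* Sketch: no SELECT expression, so each WHEN expression must be a */
/* complete condition that evaluates to true or false.             */
data usa australia other;
   set orion.employee_addresses;
   select;
      when (Country='US') output usa;
      when (Country='AU') output australia;
      otherwise output other;
   end;
run;
```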

Although this is a simple example, you can see that it's more efficient to use a SELECT expression when you are
evaluating one condition. Here, SAS only has to evaluate the Country once in the SELECT expression rather than in each
WHEN expression. If you had many WHEN expressions, checking the value of Country once would be more efficient to
process and more concise in your code.

There are times when you cannot use a SELECT expression. For example, you might want to check the condition of more
than one variable in a WHEN expression. One thing to keep in mind is that SAS executes WHEN statements in the order
that you write them and once a WHEN expression is true, no other WHEN expressions are evaluated.

Suppose you want to write observations to the newoffice data set when Country=AU and City=Melbourne. You can write
a WHEN statement with a WHEN expression that checks the value of the two variables. Now we have two WHEN
expressions that check for Country=AU, and both of these WHEN expressions will be true for some observations.

When SAS executes this WHEN statement, observations that have Country=AU and City=Melbourne are written to the
newoffice data set. If the WHEN expression is false, SAS executes the next WHEN statement. If Country=AU, the
observation is written to the australia data set.

The australia data set will contain all observations in which Country is AU and City is NOT Melbourne. This is the
result that we want. If you reverse the order of these two WHEN statements, all observations in which Country=AU will
be written to the australia data set and no observations will be written to the newoffice data set.
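The code for this example is not reproduced in this text. A sketch consistent with the description follows; it assumes that City is stored as 'Melbourne' in mixed case, which should be checked against the actual data.

```sas
/* Sketch: the more specific WHEN expression must come first, because */
/* SAS stops evaluating WHEN expressions after the first true one.    */
data usa australia newoffice other;
   set orion.employee_addresses;
   select;
      when (Country='AU' and City='Melbourne') output newoffice;
      when (Country='AU') output australia;   /* AU, but not Melbourne */
      when (Country='US') output usa;
      otherwise output other;
   end;
run;
```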

Question
Which of the following programs correctly assigns values of Bonus based on values of Promotion?
a.
select;
   when (2) Bonus=Salary*.6;
end;

b.
select (Promotion);
   when (Promotion=2) Bonus=Salary*.6;
end;

c.
select (Promotion);
   when (2) Bonus=Salary*.6;
end;
The correct answer is c. To compare the value of Promotion to a specific value, you specify the variable name in the
SELECT statement and the value in the WHEN statement.

Controlling Variable Input and Output


You've learned how to create multiple data sets by selectively writing observations to different data sets. Suppose you
want to create the usa, australia, and other data sets as you did before, based on the value of Country in the
employee_addresses data set. However, this time you want to write different variables to each output data set.

Using DROP and KEEP Statements (Review)


By default, SAS writes all variables from the input data set to every output data set, but there are several ways to specify
which variables are written to output data. Remember, you can use DROP and KEEP statements to control which
variables are written to output data sets.

The DROP statement drops the specified variables from the output data set. The KEEP statement keeps only the
specified variables in the output data set. So, in this example, there are only two variables, a and b, in outputdata.
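The example referred to here is not reproduced in this text. The following sketch illustrates the idea with a made-up input data set; the name inputdata and the third variable c are assumptions, and only outputdata, a, and b come from the description above.

```sas
/* Hypothetical input data set with three variables: a, b, and c */
data inputdata;
   a=1; b=2; c=3;
run;

/* The KEEP statement lists the variables to write to the output. */
/* Here it has the same effect as the statement: drop c;          */
data outputdata;
   set inputdata;
   keep a b;
run;
```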

Question
The data set employee_addresses has 9 variables. After this code runs, how many variables will there be in the
other data set?

data usa australia other;
   drop Street_ID;
   set orion.employee_addresses;
   if Country='US' then output usa;
   else if Country='AU' then output australia;
   else output other;
run;

a. 9
b. 8
c. 0

The correct answer is b. Street_ID is dropped from every data set. Each output data set, including other, has 8 variables.

Using the DROP= and KEEP= Data Set Options


You know you can use the DROP statement to drop a variable from all output data sets, but what if you want to drop
Street_ID and Country from the usa data set, and you want to drop Street_ID, Country, and State from the australia data
set, and you don't want to drop anything from the other data set?

The DROP or KEEP statements cannot be used in this case. Instead, you can use the DROP= or KEEP= data set option.
Data set options refer to a particular data set, so with these options, you can exclude or include variables from specific
output data sets.

The DROP= data set option drops variables from a specific output data set – but all variables remain available in the
program data vector. The syntax for the option is simple. After the data set name, you specify the DROP= option and the
variables you want to drop in parentheses. If you list more than one variable, you separate them with a space. All data
set options, including DROP=, apply ONLY to the data set name that they follow.
The KEEP= data set option works like the DROP= data set option, but as you might suspect, it lists the variables to be
written to a specific output data set. As with the DROP= data set option, all of the variables remain available in the PDV.
When you have more variables to drop than keep, it's more concise to list the variables to keep. If you attempt to drop
and keep the same variable, the DROP= option takes effect first, and SAS generates a warning for the KEEP= option.

Data set options affect how SAS reads and writes data. The effect they have depends on what statement they are
associated with in a program. When the DROP= option is associated with output data in the DATA statement, SAS does
not write the specified variables to the output data set. However, all variables are available for processing. When the
DROP= option is associated with input data in a SET statement, the variables are not read into the program data vector
and are not available for processing. When the DROP= option is associated with data in a PROC PRINT step, the variable
is excluded from the report.
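To make the three cases concrete, here is a minimal sketch that uses the same DROP= option in each position. The output data set names work.out_drop and work.in_drop are hypothetical and chosen only for illustration.

```sas
/* 1. On an output data set in the DATA statement: Street_ID is read  */
/*    into the PDV and can be used in the step, but is not written.   */
data work.out_drop(drop=Street_ID);
   set orion.employee_addresses;
run;

/* 2. On an input data set in the SET statement: Street_ID is never   */
/*    read into the PDV, so it is not available for processing.       */
data work.in_drop;
   set orion.employee_addresses(drop=Street_ID);
run;

/* 3. On the data set in a PROC PRINT step: Street_ID is left out of  */
/*    the report, and the stored data set is unchanged.               */
proc print data=orion.employee_addresses(drop=Street_ID);
run;
```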

Question
The data set employee_addresses has 9 variables. After this code runs, how many variables will there be in the usa data
set?

data usa(keep=Employee_Name City State) australia other;
   set orion.employee_addresses;
   if Country='US' then output usa;
   else if Country='AU' then output australia;
   else output other;
run;

a. 6

b. 3

c. 9

The correct answer is b. The KEEP= option specifies only 3 variables to keep in the output data set.

Controlling Variable Output Using the DROP= and KEEP= Data Set Options
In this demonstration, you use the DROP= and KEEP= data set options to control variable output.

1. Copy and paste the following code into the editor.

data usa(keep=Employee_Name City State)
     australia(drop=Street_ID State Country)
     other;
   set orion.employee_addresses;
   if Country='US' then output usa;
   else if Country='AU' then output australia;
   else output other;
run;

This DATA step creates 3 data sets: usa, australia, and other. The input data set, employee_addresses, contains 9
variables. We want to drop 6 of the variables from the output data set usa, so it's easier to add the KEEP= data
set option to specify the variables we want to keep. For usa, we want to keep Employee_Name, City, and State.
In the australia data set we want to keep 6 variables, so it's easier to use a DROP= data set option to drop the 3
variables we want to drop. We want to drop Street_ID, State, and Country.
2. Submit the code and check the log. Notice that the usa data set contains three variables, the australia data
set contains six variables, and the other data set contains nine variables.

Controlling Variable Input Using the DROP= and KEEP= Data Set Options
You've seen how you can use the DROP= and KEEP= data set options to control variable output. Now let's see how you
can use them to control variable input.

Remember that when you associate the DROP= and KEEP= data set options with an output data set, the variables are
still available for processing. In contrast, when you associate these options with an input data set in a SET statement, the
variables are not read into the program data vector, and therefore they are not available for processing.

In this program, the DROP= option is used in the SET statement. In the SET statement, these options determine which
variables SAS reads from the input data set, and that affects how the program data vector is built. When the program
data vector is created, Street_ID, Street_Number, Street_Name, and Country are not read in. For cases where you don't
need all the variables in an input data set, this is an efficient way to drop them so that they aren't processed at all.
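
The program that this passage describes isn't reproduced here. A minimal sketch of the idea, assuming the orion library is assigned as it is elsewhere in this lesson, might look like the following. The output data set name addresses_short is made up for illustration.

```sas
/* Sketch only: addresses_short is a hypothetical output name.           */
/* DROP= on the SET statement keeps these four variables out of the PDV. */
data addresses_short;
   set orion.employee_addresses
       (drop=Street_ID Street_Number Street_Name Country);
run;
```

Because the four variables never enter the program data vector, later statements in this DATA step can't use their values.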

Although you can drop unnecessary variables in the SET statement, you have to be careful which variables you drop.
Suppose you want to drop Country and Employee_ID from every output data set and drop State from the australia data
set. To accomplish this, you use the DROP= data set option in the SET statement to drop Country and Employee_ID from
all output data sets, and you use the DROP= data set option in the DATA statement to drop State from the australia
output data set.

Here's a question: What's wrong with this code? This program drops Country from the input data, so it's not
available for processing, but it's used in the IF-THEN/ELSE statements. Let’s look at the log for this code. The
first note indicates that Country was never assigned a value in an assignment statement or from an input data
set. The value in the Country variable never equals US or AU, so all observations are written to the data set
other. Two variables are dropped from the usa and other data sets, and three variables are dropped from the
australia data set.
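
The code under discussion isn't shown in this text. A reconstruction that matches the description, assuming the orion library is assigned, could look like this. Treat it as a sketch of the mistake, not as the original program.

```sas
/* Sketch of the problem: Country is dropped on input, yet the */
/* IF-THEN/ELSE statements still test it.                      */
data usa australia(drop=State) other;
   set orion.employee_addresses(drop=Country Employee_ID);
   if Country='US' then output usa;              /* never true */
   else if Country='AU' then output australia;   /* never true */
   else output other;                            /* every observation lands here */
run;
```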

Combining Options and Statements


You can use a combination of DROP= and KEEP= data set options and DROP and KEEP statements to control variable
input and output. You just need to remember how the statements and options affect your input and output.

The DROP= and KEEP= options applied to the input data set will affect the contents of the program data vector. DROP
and KEEP statements affect all output data sets named in the DATA statement. DROP= and KEEP= options affect only the
data sets that they are associated with. If a DROP or KEEP statement is used at the same time as a data set option, the
statement is applied first.

Remember that you want to drop Employee_ID and Country from every data set, and you want to drop State from the
australia data set. You can do this by using a combination of options and statements. Let's start over with the code that
creates the three data sets with all nine variables. You can use the DROP= data set option in the SET statement to drop
Employee_ID from the input data because it's not used for processing in the DATA step. Next you want to drop Country
from every output data set, but the variable needs to be available for processing.

Here's a question: What's the simplest way to drop Country from all three output data sets? It’s easiest to use a DROP
statement. You could use the DROP= data set option to drop the variable from each output data set individually, but it's
more concise to use a DROP statement.

Finally, you use the DROP= data set option to drop State from the australia data set. When the code compiles, only the
Employee_ID variable is dropped from the input data, and all other variables are included in the program data vector
and are available for processing.

Question
The SAS data set car has the variables CarID, CarType, Miles, and Gallons. Select the DATA step or steps that create the
ratings data set with the variables CarType and MPG. Select all that apply.

a.

data ratings(keep=CarType MPG);
   set car(drop=CarID);
   MPG=Miles/Gallons;
run;

b.

data ratings;
set car(drop=CarID);
drop Miles Gallons;
MPG=Miles/Gallons;
run;

The correct answer is a and b. Both programs produce the same output. You can use various combinations of data set
options and statements to achieve the same result.

Controlling Variable Input and Output Using Options and Statements


In this demonstration, you run the code we've been working on and then revise it.

1. Copy and paste the following code into the editor.

data usa australia(drop=State) other;
   drop Country;
   set orion.employee_addresses(drop=Employee_ID);
   if Country='US' then output usa;
   else if Country='AU' then output australia;
   else output other;
run;

2. Submit the code and check the log. The usa and other data sets should have seven variables, and the australia
data set should have six variables.

3. Revise the program, as shown below, to read in only the four variables Employee_Name, City, State, and
Country. From the usa data set, drop Country. From the australia data set, drop State and Country.

data usa(drop=Country)
australia(drop=State Country) other;
set orion.employee_addresses
(keep=Employee_Name City State Country);
if Country='US' then output usa;
else if Country='AU' then output australia;
else output other;
run;
To read in only specific variables, we use a data set option on the SET statement. Because we are dropping more
variables than we’re keeping, we use the KEEP= data set option to save a little typing.

We have four variables from our input data. But we want to drop some variables from the output. We want to
drop Country from the usa data set and Country and State from the australia data set. We want all 4 of the
variables in the other data because we want to list the values in the Country variable so we can see how the
observations were written to that data set.

4. Submit the code and view the log. The usa data set should have three variables, the australia data set should
have two variables and the other data set should have four variables.

5. Copy and paste the following PROC PRINT steps into the editor.

proc print data=usa;
run;

proc print data=australia;
run;

proc print data=other;
run;

6. Submit the PROC PRINT steps and view the results. As you can see, combining DROP and KEEP statements with
DROP= and KEEP= data set options gives you flexible control over the variables in each data set.

Controlling Which Observations are Read


You've learned to control variable input and output. But suppose you want to control observation input and output?

The employee_addresses data set has more than 400 observations. Though this data set isn't particularly large, you
might want to process only a subset of the observations as you test your SAS programs. For example, you might want to
process only the first 100 observations, or only the observations from 50 through 100 in which the value of Country is AU.

You can also limit the number of observations in your reports by limiting the number of observations that a procedure
step processes. For example, you might want to use the PRINT procedure to print only observations 10 through 15 from
the employee_addresses data set.

Controlling Which Observations Are Read from Input Data


By default, SAS processes every observation in a SAS data set, from the first observation to the last. You can use the
FIRSTOBS= data set option to control where SAS starts processing an input data set. You can use the OBS= data set
option to control where SAS stops processing an input data set.

Both the FIRSTOBS= and the OBS= options are used with input data sets. You cannot use either option with output data
sets. When you limit the number of observations that SAS reads from input data, the number of observations in your
output data is also limited.

Using the OBS= and the FIRSTOBS= Data Set Options


First let's look at OBS=. The OBS= data set option is enclosed in parentheses and follows the data set name. N specifies a
positive integer that is less than or equal to the number of observations in the data set, or zero. This data set option
specifies the number of the last observation to process, not how many observations should be processed. There are
over 400 observations in the employee_addresses data set, but the OBS=100 data set option in this SET statement
causes the DATA step to stop processing after observation 100.
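
The example that this paragraph refers to might look like the sketch below. It assumes the orion library is assigned; the output data set name first100 is hypothetical.

```sas
/* Sketch only: first100 is a hypothetical output name.       */
/* OBS=100 stops reading after observation 100 of the input.  */
data first100;
   set orion.employee_addresses(obs=100);
run;
```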

Another way to control which observations are read from input data is to specify where to start reading data. You can
use the FIRSTOBS= data set option to specify a starting point for processing an input data set. The FIRSTOBS= data set
option is enclosed in parentheses and follows the data set name. N specifies a positive integer that is less than or equal
to the number of observations in the data set. In this example, FIRSTOBS=20, so the SET statement starts reading
observations from the input data set at observation number 20 and continues processing until the last observation is
read.
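
A sketch of the FIRSTOBS=20 case follows, again assuming the orion library is assigned. The output data set name from20 is made up for illustration.

```sas
/* Sketch only: from20 is a hypothetical output name.               */
/* FIRSTOBS=20 starts reading at observation 20 and, with no OBS=,  */
/* continues through the last observation.                          */
data from20;
   set orion.employee_addresses(firstobs=20);
run;
```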

FIRSTOBS= and OBS= are often used together to define a range of observations in the data set. In this example,
FIRSTOBS=50 and OBS=100. These data set options cause the SET statement to read 51 observations from the
employee_addresses data set. Processing begins with observation 50 and ends after observation 100.
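
Used together, the two options might be written as in this sketch. It assumes the orion library is assigned; the output name range50_100 is hypothetical.

```sas
/* Sketch only: range50_100 is a hypothetical output name.          */
/* Reads observations 50 through 100: 100 - 50 + 1 = 51 in total.   */
data range50_100;
   set orion.employee_addresses(firstobs=50 obs=100);
run;
```

The count is inclusive at both ends, which is why the step reads 51 observations instead of 50.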

In addition to using FIRSTOBS= and OBS= in a SET statement, you can also use these options in an INFILE statement to
control which records are read when you read raw data files. Notice that the syntax is different. In an INFILE statement,
the options follow the filename, but they are not enclosed in parentheses.
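
To see the INFILE form without needing an external file, here is a small sketch that reads in-stream data. The data set name, variable names, and values are all invented for illustration.

```sas
/* Sketch only: sample, Name, Amount, and the data values are hypothetical. */
/* On INFILE, the options follow the file reference with no parentheses.    */
data sample;
   infile datalines firstobs=2 obs=4;   /* read records 2 through 4 only */
   input Name $ Amount;
   datalines;
Ann 10
Bob 20
Cid 30
Dee 40
Eve 50
;
run;

proc print data=sample;   /* three observations: Bob, Cid, Dee */
run;
```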

Question
Which program reads data from observation 100 to observation 200 and prints 50 observations?

a.

data new(firstobs=100 obs=200);
   set old;
run;

proc print data=new(obs=50);
run;

b.

data new;
   set old(firstobs=100 obs=200);
run;

proc print data=new(obs=50);
run;

The correct answer is b. The data set options in the SET statement direct SAS to begin reading at observation 100 and
stop after observation 200. The data set option in the PROC PRINT step directs SAS to stop printing after 50
observations.

Controlling Observations in Procedure Output


Just as you can use the FIRSTOBS= and OBS= data set options on input data in a SET statement, you can also use them in
SAS procedures. In this example, the PRINT procedure begins processing the data set employee_addresses at
observation 10 and ends after observation 15.

In the report, you can see that the procedure read six observations from the employee_addresses data set. Notice that
the original observation numbers are listed in the Obs column.
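
The PROC PRINT step described here could be written as in the following sketch, assuming the orion library is assigned.

```sas
/* Sketch: print observations 10 through 15 (six observations). */
proc print data=orion.employee_addresses(firstobs=10 obs=15);
run;
```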

When FIRSTOBS= or OBS= and a WHERE statement are used together, the subsetting WHERE is applied first, and then
the FIRSTOBS= and OBS= data set options are applied to the resulting observations. In this code, the procedure selects
all the observations from employee_addresses where the value of Country is AU. Then from that subset, it begins
processing at observation 1 and ends after observation 10.

Summary: Controlling Input and Output

This summary contains topic summaries, syntax, and sample programs.

Topic Summaries

Outputting Multiple Observations


You can control when SAS writes an observation to a SAS data set by using explicit OUTPUT statements in a DATA step.
When an explicit OUTPUT statement is used, implicit output does not occur at the bottom of the DATA step.

OUTPUT <SAS-data-set(s)>;

Writing to Multiple SAS Data Sets


To create more than one data set, you specify the names of the SAS data sets you want to create in the DATA statement.
Separate the data set names with a space.

DATA SAS-data-set-name SAS-data-set-name-n;

You can use OUTPUT statements with IF-THEN-ELSE statements to conditionally write observations to a specific data set
based on the value of a variable in the input data set.

IF expression THEN statement;
ELSE IF expression THEN statement;
<ELSE IF expression THEN statement;>
<...>
<ELSE statement;>

Another way to perform conditional processing in a DATA step is to use a SELECT group. A SELECT group contains these
statements:

 a SELECT statement that begins the group, followed by an optional SELECT expression
 one or more WHEN statements that identify statements to execute when a condition is true
 an optional OTHERWISE statement that specifies a statement to execute if none of the WHEN conditions are
true
 an END statement that ends the group.

SELECT <(select-expression)>;
WHEN-1 (when-expression-1 <…, when-expression-n>) statement;
WHEN-n (when-expression-1 <…, when-expression-n>) statement;
<OTHERWISE statement;>
END;

The optional SELECT expression specifies any valid SAS expression. Often a variable name is used as the SELECT
expression. When you specify a SELECT expression, SAS evaluates the expression and then compares the result to each
when-expression. When a true condition is encountered, the associated statement is executed and the remaining WHEN
statements are skipped.

If you omit the SELECT expression, SAS evaluates each when-expression until it finds a true condition, then behaves as
described above. This form of SELECT is useful when you want to check the value of more than one variable using a
compound condition, or check for an inequality. One thing to keep in mind is that SAS executes WHEN statements in the
order that you write them and once a when-expression is true, no other when-expressions are evaluated.

Although the OTHERWISE statement is optional, omitting it will result in an error if all when-expressions are false. You
can use OTHERWISE with a null statement to prevent SAS from issuing an error message.

You can execute multiple statements when a when-expression is true by using DO-END groups in a SELECT group.
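
The sample programs in this summary use a SELECT expression. As a complement, here is a sketch of a SELECT group without one, combined with a null OTHERWISE statement. It assumes the orion library is assigned; the output data set names and the threshold values are invented for illustration.

```sas
/* Sketch only: small_growing, large, and the cutoffs are hypothetical. */
data small_growing large;
   set orion.growth;
   select;                                   /* no SELECT expression */
      when (Total_Employees < 50 and Increase > 0.2)
         output small_growing;               /* compound condition */
      when (Total_Employees >= 200)
         output large;                       /* inequality */
      otherwise;                             /* null statement: write nothing, avoid an error */
   end;
run;
```

Without the null OTHERWISE statement, an observation that satisfies neither when-expression would cause an error.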

Controlling Variable Input and Output

By default, SAS writes all variables from the input data set to every data set listed in the DATA statement. You can use
DROP and KEEP statements to control which variables are written to output data sets. DROP and KEEP statements affect
all output data sets listed in the DATA statement.

When used on a DATA statement, the DROP= and KEEP= data set options specify the variables to drop or keep in each
output data set. Remember, the dropped variables are still in the program data vector, and therefore available for
processing in the DATA step.

When you use the DROP= or KEEP= data set options in a SET statement, the variables are dropped on input. In other
words, they are not read into the program data vector, and therefore they are not available for processing.

SAS-data-set-name(DROP=variable(s))
SAS-data-set-name(KEEP=variable(s))

You can use both DROP and KEEP statements and DROP= and KEEP= options in the same step, but do not try to drop and
keep the same variable. If you use a DROP or KEEP statement at the same time as a data set option, the statement is
applied first.

Controlling Which Observations Are Read

You can use the OBS= and FIRSTOBS= data set options to limit the number of observations that SAS processes. The OBS=
data set option specifies the number of the last observation to process. It does not specify how many observations
should be processed. The FIRSTOBS= data set option specifies a starting point for processing an input data set. By
default, FIRSTOBS=1. You can use FIRSTOBS= and OBS= together to define a range of observations for SAS to process.

SAS-data-set-name(OBS=n)
SAS-data-set-name(FIRSTOBS=n)
SAS-data-set-name(OBS=n FIRSTOBS=n)

You can also use FIRSTOBS= and OBS= in a procedure step to limit the number of observations that are processed. If a
WHERE statement is used to subset the observations, it is applied before the data set options.

Sample Programs

Outputting Multiple Observations

data forecast;
   set orion.growth;
   Year=1;
   Total_Employees=Total_Employees*(1+Increase);
   output;
   Year=2;
   Total_Employees=Total_Employees*(1+Increase);
   output;
run;

Writing to Multiple SAS Data Sets (Using a SELECT Group)

data usa australia other;
   set orion.employee_addresses;
   select (Country);
      when ('US') output usa;
      when ('AU') output australia;
      otherwise output other;
   end;
run;

Writing to Multiple Data Sets (Using a SELECT Group with DO-END Group in the WHEN statement)

data usa australia other;
   set orion.employee_addresses;
   select (upcase(Country));
      when ('US') do;
         Benefits=1;
         output usa;
      end;
      when ('AU') do;
         Benefits=2;
         output australia;
      end;
      otherwise do;
         Benefits=0;
         output other;
      end;
   end;
run;

Controlling Variable Input and Output

data usa australia(drop=State) other;
   drop Country;
   set orion.employee_addresses
       (drop=Employee_ID);
   if Country='US' then output usa;
   else if Country='AU' then output australia;
   else output other;
run;

Controlling Observation Input and Output

data australia;
set orion.employee_addresses
(firstobs=50 obs=100);
if Country='AU' then output;
run;

proc print data=orion.employee_addresses
           (obs=10);
   where Country='AU';
   var Employee_Name City State Country;
run;

Lesson 2: Summarizing Data


In the previous lesson, you expanded your knowledge of DATA step techniques in order to control input and output
more precisely. In this lesson, you learn how to summarize your data using the DATA step. You learn to retain variable
values across iterations in the program data vector to accumulate totals.

Objectives

In this lesson, you learn to do the following:

 explain how SAS initializes the value of a variable in the PDV
 prevent reinitialization of a variable in the PDV
 create an accumulating variable
 define FIRST. and LAST. processing
 calculate an accumulating total for groups of data
 use a subsetting IF statement to output selected observations

Creating an Accumulating Variable


Suppose a retail manager for Orion Star Sportswear asked to see her department’s daily sales for April, as well as a
month-to-date total for each day.

The SAS data set orion.aprsales contains daily sales data. Notice there is one observation for each day in April 2011
showing the sale date and sale amount. You need to create a new data set named mnthtot. The data set needs to
include a new variable, Mth2Dte, that represents the month-to-date sales total for each day in April.

Creating a Variable Using the Assignment Statement


You can use the DATA step in SAS to summarize your data in meaningful ways. You already know how to create a new
variable from an expression using an assignment statement. Suppose you decide to use this code to create the mnthtot
data set. The assignment statement creates the variable Mth2Dte by adding together the SaleAmt and Mth2Dte variable
values.

Here's a question: Will this program create the correct values for Mth2Dte? No, it won't. If you run this code, it creates
the Mth2Dte variable, but all the values for this variable are missing. Let's see why.

By default, when the assignment statement creates a new variable, the value of the variable is reset to missing at
the top of each iteration of the DATA step. So the initial value of Mth2Dte is missing. When you add the value of Mth2Dte
to the value of SaleAmt, the resulting value is missing because an arithmetic operator applied to a missing value returns
a missing value.

This program doesn't create the month-to-date totals you need. To generate the month-to-date totals, you need to
create an accumulating variable. Let's learn more about accumulating variables.
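
The code that this section discusses isn't reproduced in this text. Based on the description, it is presumably close to the sketch below, which assumes the orion library is assigned.

```sas
/* Sketch of the attempt that does NOT work.                        */
/* Mth2Dte is reset to missing at the top of every iteration, so    */
/* missing + SaleAmt yields missing for every observation.          */
data mnthtot;
   set orion.aprsales;
   Mth2Dte=Mth2Dte+SaleAmt;
run;
```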

What Is an Accumulating Variable?


An accumulating variable accumulates the value of another variable and keeps its own value from one observation to
the next.

In this simple example, X is the accumulating variable. X accumulates the value of Y. The new value of X is 10. This value
remains for the next iteration. SAS adds the next value of Y, 25, to the current value of X, 10. The new value of X is 35.
This value remains for the next iteration. SAS adds the new value of Y, 12, to the current value of X. As you can see, X
holds the cumulative total of the Y values.
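
To make the X and Y example concrete, here is a small self-contained sketch. It uses the RETAIN statement, which is introduced in the next section, and the data set name accumulate is made up for illustration.

```sas
/* Sketch only: accumulate is a hypothetical data set name. */
data accumulate;
   input Y;
   retain X 0;     /* keep X across iterations, starting at 0 */
   X=X+Y;          /* X becomes 10, then 35, then 47 */
   datalines;
10
25
12
;
run;

proc print data=accumulate;
run;
```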

Now let's see how to create an accumulating variable for our task.

Creating an Accumulating Variable Using the RETAIN Statement


The Mth2Dte variable needs to accumulate the values of SaleAmt and retain its own value from one observation to the
next. You can use the RETAIN statement to create the Mth2Dte variable. The RETAIN statement is a compile-time-only
statement that prevents SAS from reinitializing the variable at the top of the DATA step. Because the variable is not
reinitialized, it retains its value across multiple iterations of the DATA step.

Let's take a look at the syntax for the RETAIN statement. The statement starts with the keyword RETAIN followed by the
name of the variable whose values you want to retain. You can optionally specify any initial value for the variable. If you
don't specify an initial value, the RETAIN statement initializes the retained variable to missing before the first execution
of the DATA step.

This RETAIN statement sets the initial value of Mth2Dte to 0. Remember that you want to see month-to-date sales
totals, so the first value of Mth2Dte needs to be 0 to indicate that there were no existing sales on the first day of the
month. If you don't supply an initial value, all of the values for Mth2Dte will be missing. Notice also that the program still
contains an assignment statement, which creates the variable Mth2Dte. But now you're using it in combination with the
RETAIN statement.
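
Putting the RETAIN statement and the assignment statement together, the program described here can be sketched as follows, assuming the orion library is assigned.

```sas
/* Sketch: RETAIN gives Mth2Dte an initial value of 0 and keeps */
/* its value from one iteration to the next.                    */
data mnthtot;
   set orion.aprsales;
   retain Mth2Dte 0;
   Mth2Dte=Mth2Dte+SaleAmt;
run;
```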

Processing the DATA Step


Now let's look at how SAS processes this code and stores the values in the PDV.

Remember that the RETAIN statement is a compile-time-only statement. So during compilation, SAS puts a retain
flag on the Mth2Dte variable in the PDV. As the DATA step executes, SAS initializes the PDV.

The RETAIN statement assigns an initial value of 0 to Mth2Dte. At this point, the values of SaleDate and SaleAmt are set
to missing. The SET statement reads the first observation into the PDV. As SAS processes the program, it skips the
RETAIN statement because it's compile-time only. The assignment statement calculates Mth2Dte: it increases the initial
Mth2Dte value of 0 by the SaleAmt value of 498.49. The result appears in the Mth2Dte column in the PDV.

At the bottom of the DATA step, SAS writes the observation to the mnthtot data set. SAS continues processing but
doesn't reinitialize the Mth2Dte value at the top of the DATA step. It retains the value that was created during the first
iteration. Next, the SET statement reads the second observation into the PDV. Because the RETAIN statement created
Mth2Dte as an accumulating variable, SAS adds the current Mth2Dte value to the second SaleAmt value for a total of
1444.99.

At the bottom of the DATA step, SAS writes the second observation to the mnthtot data set. Again, SAS continues
processing but does not reinitialize Mth2Dte. SAS continues processing the data until it reaches the end of the file,
incrementally adding the sale amounts for each sale date.

Question
Which DATA step creates Mth2Dte as an accumulating variable with an initial value of 50?

a.

data mnthtot;
set orion.aprsales;
retain SaleAmt 50;
Mth2Dte=Mth2Dte+SaleAmt;
run;

b.

data mnthtot;
set orion.aprsales;
retain Mth2Dte 50;
Mth2Dte=Mth2Dte+SaleAmt;
run;

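The correct answer is b. This code assigns an initial value of 50 to Mth2Dte and has the correct calculation to create the
Mth2Dte variable (Mth2Dte=Mth2Dte+SaleAmt). The code in a retains SaleAmt instead, so Mth2Dte is still reset to missing at
the top of each iteration.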

Using the SUM Statement


Now that you've seen how the RETAIN statement works, let's take it one step further. Suppose one of the values for
SaleAmt is missing in the orion.aprsales2 data set. Maybe it was a holiday and the store was closed, but instead of
entering 0 for the daily sales, the employee didn't enter anything.

Here's a question: When SAS processes this code, what happens when it reaches a missing value for SaleAmt in the input
data set? When there's a missing value for SaleAmt, the assignment statement calculates a missing value, and SAS
retains the missing value across iterations of the DATA step.

Remember that the result of any mathematical operation on a missing value equals missing. One missing value for
SaleAmt causes all subsequent values of Mth2Dte to be missing. This is a common problem with data. When you create
an accumulating variable, and need the initial value of the variable to be 0, an alternative to using the RETAIN statement
with an assignment statement is to use the sum statement. You can use the sum statement in SAS to ignore missing
input values. Let's learn more about using the sum statement.

Notice the syntax of the sum statement here. Like the assignment statement, the sum statement doesn't include a
keyword. During compilation, the sum statement creates the variable on the left side of the plus sign if it doesn't already
exist.

This statement creates the Mth2Dte variable. As the execution phase begins, SAS automatically initializes the value of
Mth2Dte to 0 before reading the first observation. SAS retains the value of Mth2Dte from one DATA step execution to
the next. During execution, the sum statement adds the value of the expression SaleAmt to the initial 0 value of the
accumulator variable, Mth2Dte.

In the second observation, there's a missing value for SaleAmt. Because of the sum statement, SAS ignores this missing
value, and the accumulating total for this iteration is 498.49. As you can see, even though the SaleAmt value for April
2nd is missing, the sum statement forces the calculations to continue by ignoring this missing value. Also, the sum
statement allows the value of the Mth2Dte variable to be retained across data step iterations, and the result is an
accumulating total of daily sales.
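
A sketch of the sum statement version follows. It assumes the orion library is assigned and reads orion.aprsales2, the data set with the missing SaleAmt value; the output name mnthtot2 is hypothetical.

```sas
/* Sketch only: mnthtot2 is a hypothetical output name.              */
/* The sum statement has the form variable+expression. It creates    */
/* Mth2Dte, initializes it to 0, retains it, and ignores a missing   */
/* SaleAmt instead of propagating it.                                */
data mnthtot2;
   set orion.aprsales2;
   Mth2Dte+SaleAmt;
run;
```

No RETAIN statement is needed here because the sum statement retains the accumulator variable automatically.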

Accumulating Totals for a Group of Data


The SAS data set orion.specialsals contains information about employees working on special projects. The Salary variable
represents the portion of the employee's salary allocated to the project, and the Dept variable represents the
employee’s department.

Suppose you'd like to see these salary totals by department. You need to create a new data set, deptsals, that has the
total salaries allocated to special projects for each department. To create deptsals, you first need to sort the input data.
Then you need to use the DATA step to summarize the observations by department group. Each group of data needs to
be summarized into a single value.

Using BY-Group Processing


Let’s get started. Because your goal is to see the total salaries allocated to special projects for each department, the
input data needs to be arranged, or sorted, by the Dept variable. To sort the input data by department, you can use the
SORT procedure.

Remember how PROC SORT works. First, SAS rearranges the observations in the input data set. Then, SAS creates a data
set that contains the rearranged observations either by replacing the original data set or by creating a new data set. By
default, SAS replaces the original SAS data set.

Here’s the PROC SORT step that sorts the input data set specialsals by the Dept variable. Salsort is the name of the new
sorted data set. Now that the data is sorted, you need to process the data in groups. You can do this using a BY
statement in the DATA step. Here's the syntax. When you use a BY statement with the SET statement, the data must be
sorted by the BY variable, or have an appropriate index based on the BY variable. That’s why you had to first sort the
data by Dept.

This is a good start for the SAS program. The sort is by Dept, and the processing is specified by Dept.

Finding the First and Last Observations in a Group


When using BY-group processing, you need some way to identify the beginning and ending of each department's group
of observations. By default, a BY statement creates two temporary variables for each BY variable listed. These variables
identify the first and last observation in each BY group. The FIRST. variable has a value of 1 for the first observation in a
BY group; otherwise, it equals 0. The LAST. variable has a value of 1 for the last observation in a BY group; otherwise, it
equals 0.

In this example, the BY statement creates FIRST.Dept and LAST.Dept. To get a sense of how SAS assigns FIRST. and LAST.
values, let's look at an example.

SAS reads the first observation. ADMIN is the first value for Dept, so SAS assigns a value of 1 to FIRST.Dept. How can SAS
determine the value of LAST.Dept? It's unclear at this point whether this is the last occurrence of ADMIN in the data set.
So SAS looks ahead to the second observation to obtain the value of LAST.Dept in the first observation. In the second
observation, ADMIN is the value for Dept, so SAS assigns a value of 0 to LAST.Dept in the first observation. This is not the
last occurrence of ADMIN in the data set.

In the second observation, ADMIN is the value for Dept. This isn't the first occurrence of ADMIN, so SAS assigns a value
of 0 to FIRST.Dept. SAS then looks ahead to the third observation to obtain the value of LAST.Dept. ADMIN is the value
for Dept, so the value of LAST.Dept for the second observation is also 0.

In the third observation, ADMIN is the value for Dept. Again, this isn't the first occurrence of ADMIN, so SAS assigns a
value of 0 to FIRST.Dept. SAS then looks ahead to the fourth observation to obtain the value of LAST.Dept. The
department changes to ENGINR, which means the third observation held the last occurrence of ADMIN, and SAS assigns
a value of 1 to LAST.Dept in the third observation.

In the fourth observation in the data set, ENGINR is the value for Dept. This is the first occurrence of ENGINR, so SAS
assigns a value of 1 to FIRST.Dept. SAS looks ahead at the fifth observation to obtain the value of LAST.Dept. ENGINR is
the value for Dept, so the value of LAST.Dept for the fourth observation is 0. This process continues through all of the
observations.
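
Because FIRST. and LAST. variables are temporary, they aren't written to the output data set. One way to inspect them is to copy them into ordinary variables, as in this self-contained sketch. The data set names, the flag variable names, and the salary values are invented for illustration; the department order follows the walkthrough above.

```sas
/* Sketch only: salsort_demo, flags, First_Flag, Last_Flag, and the */
/* salary values are hypothetical. The data is already in Dept order. */
data salsort_demo;
   input Dept $ Salary;
   datalines;
ADMIN 20000
ADMIN 100000
ADMIN 50000
ENGINR 25000
ENGINR 20000
FINANC 10000
;
run;

data flags;
   set salsort_demo;
   by Dept;
   First_Flag=first.Dept;   /* 1 on the first observation of each Dept */
   Last_Flag=last.Dept;     /* 1 on the last observation of each Dept  */
run;

proc print data=flags;
run;
```

In the printed result, the single FINANC observation has both flags set to 1, because it is both the first and the last observation in its BY group.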

Question
What are the values for FIRST.Dept and LAST.Dept when the DATA step processes the observation indicated by the
arrow?

Dept      Salary    FIRST.Dept    LAST.Dept
ADMIN     20000
ADMIN     50000
ENGINR    25000
FINANC    10000

a. FIRST.Dept is 1 and LAST.Dept is unknown.


b. FIRST.Dept is 0 and LAST.Dept is 1.

c. FIRST.Dept is 1 and LAST.Dept is 1.

d. FIRST.Dept is 0 and LAST.Dept is 0.

The correct answer is c. FIRST.Dept and LAST.Dept are both 1. This happens when a group consists of a
single observation.

Summarizing Data by Groups
Now that you understand how SAS assigns the values of the FIRST. and LAST. variables, you can use these variables in
the DATA step to summarize the grouped data.

Remember that you need to produce salary totals by department. Think of using the DATA step to summarize the
grouped data as a three-step process. The first step is to set the value of the accumulating variable to 0 at the start of
each BY group. Next, you increment the accumulating variable with a sum statement. Then, you output only the last
observation of each BY group. Let's take a closer look at each step.

In the first step, you set the accumulating variable value to 0 at the start of each BY group. The condition in the IF-THEN
statement is true when FIRST.Dept is neither 0 nor missing, that is, when its value is 1. SAS tests every observation. If it's
the first observation for a department, SAS sets DeptSal to 0.

In the second step, you increment the accumulating variable using a sum statement. Remember that when you use the
sum statement, SAS initializes the value of the accumulating variable to 0 and retains the current value across iterations
of the DATA step. Here, you add the value of Salary to the accumulating variable, DeptSal.

In the third step, you need to output only the last observation of each BY group. The data set salsort shows every
observation read by the DATA step.

Here's a question: What statement can you use to write out just a subset of the observations processed in a DATA step?
You can use the subsetting IF statement. Remember that the subsetting IF statement defines a condition that the
observation must meet in order for the DATA step to continue processing. SAS tests the current observation. If the value
of LAST.Dept is 1, SAS continues processing and writes the observation. Otherwise, SAS stops processing that observation
and returns to the top of the DATA step without writing it.

Using BY-Group Processing to Summarize Data


In this demonstration, you summarize data using BY-group processing.

1. Copy and paste the following code into the editor.

proc sort data=orion.specialsals out=salsort;
   by Dept;
run;

data deptsals(keep=Dept DeptSal);
   set salsort;
   by Dept;
   if First.Dept then DeptSal=0;
   DeptSal+Salary;
   if Last.Dept;
run;

The PROC SORT step sorts the data set by Dept in preparation for summarization. The DATA step includes the BY
statement, which specifies processing by the Dept variable. The conditional IF statement tests each observation
and assigns a value of 0 to DeptSal when the observation is the first in its BY group. The sum statement creates
the DeptSal variable and increases it by the value of Salary. Lastly, the subsetting IF statement writes only the
last observation of each BY group to the data set. So essentially, this code creates a data set that shows salary
totals by department.

2. Submit the code and view the log. We want to see a difference in observations between the input data set and
the output. This will tell us that our department and salary totals were summarized. Notice the notes in the log.
SAS read 39 observations from the input data set and created 5 observations in the deptsals data set.

3. Copy and paste the following PROC PRINT step into the editor.

proc print data=deptsals;
   format DeptSal 7.;
   title 'Employee Salaries by Department';
run;

4. Submit the PROC PRINT step and view the output. In the output, you can see that each department is listed
once, and DeptSal contains the total salaries allocated to special projects for each department.

Question
What must happen in the DATA step to summarize data by groups? Select all that apply.

a. Sort the input data.

b. Set the accumulating variable to its initial value at the start of each BY group.

c. Increment the accumulating variable.

d. Output only the last observation of each BY group.

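The correct answer is b, c, and d. To summarize data by groups, you must include all three of these tasks in the DATA
step. The input data must also be sorted, but sorting happens beforehand in a PROC SORT step, not in the DATA step.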

Business Scenario
In the previous task, you summarized data by one group, department. But suppose you need to summarize your data by
multiple groups.

Each employee listed in orion.projsals is assigned to a special project. You need to create a new data set that shows
the number of employees and the salary totals from each department for each special project. So there are two
accumulating variables in the output, NumEmps and DeptSal.

Using Multiple BY Variables


As with the previous task, the first step is to use PROC SORT to sort the input data set. Projsort is the name of the sorted
data set. The code for this task is very similar to the code you used in the previous task. This time, you use two BY
variables, Proj and Dept. Proj is the primary sort variable because you want to summarize department salaries within
each special project. Dept is the secondary sort variable.

The second step is to process the data in groups. You use the BY statement with the SET statement in the DATA step to
process the data by both the Proj and Dept groups. You need both groupings in order to complete the task.

Finding the First and Last Observations in Multiple Groups


The next step is to identify the beginning and ending of the Proj observations and the beginning and ending of the Dept
observations. Let’s see how SAS assigns these values.

Remember that the DATA step creates a FIRST. variable and a LAST. variable for each of the variables in the BY
statement. Let’s start by looking at the FIRST. variables.

In the first observation, FIRST.Proj and FIRST.Dept both equal 1 because this is the first occurrence of the CAP1 and
ADMIN values, respectively. SAS looks ahead to the second observation to obtain the values of LAST.Proj and LAST.Dept.
The second observation also contains CAP1 for the project and ADMIN for the department, so the values for LAST.Proj
and LAST.Dept in the first observation are 0.

In the second observation, this is not the first occurrence of CAP1 or ADMIN, so FIRST.Proj and FIRST.Dept equal 0. SAS
looks ahead to the third observation to assign the LAST. values.

The third observation also contains CAP1 for the project and ADMIN for the department, so the values for LAST.Proj and
LAST.Dept for the second observation are 0. In the third observation, the value of Proj is still CAP1, and the value of Dept
is still ADMIN, so FIRST.Proj and FIRST.Dept equal 0. SAS looks ahead to the fourth observation to determine the LAST.
values. Because there’s a change in the project name, the value of LAST.Proj in the third observation is 1.

Here's a question. What will the value for LAST.Dept be in the third observation? The value of Dept in the fourth
observation is ADMIN, which is the same as in the third observation. So it might seem to make sense to assign a value of
0 to LAST.Dept in the third observation, because it's not the last occurrence of ADMIN in the data. But it is the last
occurrence of the ADMIN department within the CAP1 project.

When you specify multiple BY variables in a BY statement, a change in the primary LAST.BY-variable value forces a
change in the secondary LAST.BY-variable value. In other words, when LAST.Proj equals 1, LAST.Dept also equals 1. It's
the last department within this project. This process continues until all of the observations are read.

Here you can see the values for the FIRST. and LAST. variables for the first five observations. Remember that when you
use more than one variable in the BY statement, a value of 1 for the primary variable forces a value of 1 for the
secondary variable.
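
A small sketch can make the forced value visible. The in-line data below is invented for illustration: only the first
three CAP1/ADMIN rows and the ADMIN value in the fourth row follow the description above, while the project name EZ,
the fifth row, and all Salary values are assumptions. The data set and flag variable names are also hypothetical.

```sas
/* Hypothetical sample data, already in Proj/Dept order. */
data work.projsample;
   input Proj $ Dept $ Salary;
   datalines;
CAP1 ADMIN 20000
CAP1 ADMIN 50000
CAP1 ADMIN 30000
EZ ADMIN 10000
EZ ENGINR 25000
;
run;

/* Copy the temporary FIRST. and LAST. variables so they can be printed. */
data work.projflags;
   set work.projsample;
   by Proj Dept;
   FirstProj=first.Proj;
   LastProj=last.Proj;
   FirstDept=first.Dept;
   LastDept=last.Dept;
run;

proc print data=work.projflags noobs;
run;
```

In the third row, LastProj is 1 because the project changes in the next row, and LastDept is also 1 even though the next
row is ADMIN again.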

Summarizing Data by Multiple Groups


Here's the complete program. The input data set is projsort. Remember that the data has already been sorted by Proj
and Dept. The BY statement tells SAS to process the data by the Proj and Dept groups. Now let’s see how to use the
FIRST. and LAST. information.

Remember that a DO statement specifies a group of statements to be executed as a unit until a matching END statement
appears. In this DO group, the accumulating variables DeptSal and NumEmps are set to 0 when FIRST.Dept is 1, that is, at
the first observation of each department within a project. The sum statements create the accumulating variables.

The first sum statement increments the DeptSal variable by the value of Salary. It's summarizing the salaries within a
particular department. The second sum statement increments NumEmps by 1 for each row within a particular
department. When a new department or a new project begins, SAS sets DeptSal and NumEmps equal to 0 and then starts
accumulating new totals.

Finally, this subsetting IF statement tests the LAST.Dept value. Every time SAS encounters the last observation for a
department, you want SAS to output the result. Remember that every time SAS reaches the last occurrence of a project
name, the value of LAST.Proj is set to 1. This change forces a change for the value of LAST.Dept. This is why the
subsetting IF statement is based on the LAST.Dept value.

Summarizing Data by Multiple Groups


In this demonstration, you summarize data using multiple BY variables.

1. Copy and paste the following code into the editor.

proc sort data=orion.projsals out=projsort;
   by Proj Dept;
run;

data pdsals(keep=Proj Dept DeptSal NumEmps);
   set projsort;
   by Proj Dept;
   if First.Dept then
      do;
         DeptSal=0;
         NumEmps=0;
      end;
   DeptSal+Salary;
   NumEmps+1;
   if Last.Dept;
run;

Examine the code. Will the number of observations be the same in the input data set as in the output data set?
Since we have the subsetting IF statement, we expect to see a difference in the log between the incoming and
outgoing observations.

2. Submit the code and check the log. The log shows that SAS read 39 observations and wrote 14 observations.

3. Copy and paste the following PROC PRINT step into the editor.

proc print data=pdsals noobs;
run;

4. Submit the PROC PRINT step and view the results. You can see that the data is listed first by project, then by
department. For each unique combination of project and department, the output shows the total salary and the
number of employees in the department who worked on that project.

Summary: Summarizing Data

This summary contains topic summaries, syntax, and sample programs.


Topic Summaries

Creating an Accumulating Variable


An accumulating variable accumulates the value of another variable and keeps its value from one observation to the
next.

An assignment statement that uses the addition operator, together with a RETAIN statement, is often used to create an
accumulating variable. The SUM function could be used instead of the addition operator in the expression in the
assignment statement. The SUM function also sums the arguments, but ignores missing values. The RETAIN statement is
a compile-time-only statement that prevents SAS from reinitializing the variable at the top of the DATA step. Because
the variable is not reinitialized, it retains its value across multiple iterations of the DATA step.

The RETAIN statement includes an optional initial value for the variable. If you don't specify an initial value, the RETAIN
statement initializes the variable to missing before the first iteration of the DATA step. Be sure to specify an initial value
of zero when creating an accumulating variable.

RETAIN variable-name <initial-value> …;

As an alternative to using an assignment statement and the RETAIN statement, you can use a sum statement. The
accumulating variable is specified on the left side of the plus sign, and an expression is specified on the right side.

variable+expression;

By default, the sum statement retains the accumulating variable and initializes it to zero. During each iteration of the
DATA step, the expression is evaluated and the resulting value is added to the accumulating variable, ignoring missing
values.
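
The difference in how missing values are handled can be seen in a short sketch. The in-line data and the names sales,
totals, PlusTotal, FuncTotal, and StmtTotal are invented for this illustration and are not part of the course data.

```sas
/* Hypothetical sample data with one missing sale amount. */
data work.sales;
   input SaleAmt;
   datalines;
100
.
50
;
run;

data work.totals;
   set work.sales;
   retain PlusTotal 0 FuncTotal 0;
   PlusTotal=PlusTotal+SaleAmt;        /* addition operator: a missing SaleAmt makes the total missing */
   FuncTotal=sum(FuncTotal,SaleAmt);   /* SUM function: ignores the missing SaleAmt                    */
   StmtTotal+SaleAmt;                  /* sum statement: retains, starts at 0, ignores missing values  */
run;

proc print data=work.totals noobs;
run;
```

After the second observation, PlusTotal is missing and stays missing because the retained missing value is carried into
every later addition. FuncTotal and StmtTotal both end at 150.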

Accumulating Totals for Grouped Data


When you need to accumulate totals for groups of data (for example, if you need to see total salaries by department),
the input data set must be sorted by the BY variable. You can then use a BY statement in the DATA step to process the
data in groups.

DATA output-SAS-data-set;
SET input-SAS-data-set;
BY BY-variable …;
<additional SAS statements>
RUN;

When a BY statement is included in a DATA step, SAS creates two temporary variables (FIRST.BY-variable and
LAST.BY-variable) for each BY variable listed in the BY statement. SAS sets the value of these variables to identify the
first and last observation in each BY group. FIRST.BY-variable is set to 1 when the first observation in a group is read;
otherwise, its value is 0. LAST.BY-variable is set to 1 when the last observation in a group is read; otherwise, its
value is 0.

FIRST.BY-variable
LAST.BY-variable

You can use the values of the FIRST. and LAST. variables in a DATA step to summarize the grouped data.

 First, set the accumulating variable equal to 0 at the start of each BY group.
 Second, increment the accumulating variable with a sum statement on each iteration of the DATA step.
 Third, output only the last observation of each BY group.

To accumulate totals for multiple groups (for example, if you need to see total salaries allocated to special projects by
department), you can specify two or more BY variables on the BY statement. Be sure to list the primary grouping
variable first, then the secondary grouping variable, etc. Remember that the input data set must be sorted by the BY
variables.

The BY statement creates a FIRST. and LAST. variable for each BY variable. When the last observation for the primary
variable is encountered, SAS sets the LAST. variable to 1 for the primary and all subsequent BY variables.

Sample Programs

Creating an Accumulating Variable

data mnthtot2;
   set orion.aprsales;
   retain Mth2Dte 0;
   Mth2Dte=sum(Mth2Dte,SaleAmt);
run;

data mnthtot2;
   set orion.aprsales2;
   Mth2Dte+SaleAmt;
run;

Accumulating Totals for a Group of Data

proc sort data=orion.specialsals out=salsort;
   by Dept;
run;

data deptsals (keep=Dept DeptSal);
   set salsort;
   by Dept;
   if First.Dept then DeptSal=0;
   DeptSal+Salary;
   if Last.Dept;
run;

proc sort data=orion.projsals out=projsort;
   by Proj Dept;
run;

data pdsals (keep=Proj Dept DeptSal NumEmps);
   set projsort;
   by Proj Dept;
   if First.Dept then
      do;
         DeptSal=0;
         NumEmps=0;
      end;
   DeptSal+Salary;
   NumEmps+1;
   if Last.Dept;
run;

Lesson 3: Reading Raw Data


Raw data can be organized in many ways. The data might be arranged in columns, or fixed fields. Other times, the data
you’re working with might be free format, meaning that the values for a particular field do not begin and end in the
same columns. The data might even require special formatting. In some cases, information for one observation might be
spread out over several records. Or perhaps each record in the raw data file contains data for multiple observations.

How your data is organized determines which input style you should use to read the data. Either from your own
experience or in the SAS® Programming 1: Essentials course, you've learned how to use list input to read raw data. In
this lesson, you'll learn additional techniques for reading raw data.

Objectives

In this lesson, you learn to do the following:

 use a subsetting IF statement to output selected observations


 read a raw data file with multiple records per observation
 read a raw data file with mixed record types
 subset from a raw data file with mixed record types

Using Formatted Input


The raw data file offers.dat contains information about Orion Star's discount offers.

Suppose you need to create a SAS data set named discounts from the raw data. The discounts data set needs to have
one observation for each record in the raw data file.

Selecting an Input Style


Remember, the INPUT statement describes the arrangement of values in the input data record and assigns input values
to the corresponding SAS variables. You can use list input to read standard or nonstandard data that is separated by
delimiters. You can use column input to read standard data that is arranged in columns or fixed fields.

The data in offers.dat is arranged in columns. There's a predictable beginning and ending column for each field. So, you
might consider using column input to read the data. Column input specifies actual column locations for values.

The syntax for the INPUT statement with column input is shown here. When you use column input, you specify a valid
variable name followed by the starting column number and the ending column number. If the variable you're creating is
a character variable, you must include a dollar sign after the variable name. For example, you could use this INPUT
statement to read the values in the first field of offers.dat.

But column input is appropriate only in certain situations. Remember, when you use column input, your data must be
standard data in fixed columns. You could use column input to read the values for Item_gp as well as Cust_type. But, the
values of Offer_dt and Discount don't contain standard data. So, column input won't work to read these values.
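
As a rough sketch of column input, the step below reads only the two standard fields. It assumes the column positions
described later in this lesson (Cust_type in columns 1 through 4 and Item_gp in columns 14 through 21) and the &path
macro variable used in the course programs. The output data set name offers_std is hypothetical.

```sas
/* Column input sketch: variable name, optional $ for character, then start-end columns. */
data work.offers_std;
   infile "&path/offers.dat";
   input Cust_type 1-4       /* standard numeric data in columns 1-4  */
         Item_gp $ 14-21;    /* character data in columns 14-21       */
run;
```

Offer_dt and Discount are left out on purpose, because column input cannot apply the informats that those values need.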

Identifying Nonstandard Data


So, what is nonstandard data? Let's start by exploring standard data.

Standard data is data that SAS can read without any special instructions. Standard data values can contain only numbers,
decimal points, numbers in scientific or E-notation, plus signs, and minus signs.

Nonstandard data is any data that SAS cannot read without special instructions. Nonstandard data includes values that
contain special characters, such as percent signs, dollar signs, and commas. Nonstandard data also includes date and
time values as well as data in fraction, integer binary, real binary, and hexadecimal forms.

Using Formatted Input


Nonstandard data values require an input style that has more flexibility than column input. You can use formatted input,
which combines the features of column input with the ability to read both standard and nonstandard data in fixed fields.

The syntax for the INPUT statement with formatted input is shown here. The INPUT statement describes the
arrangement of values in the raw data file and assigns input values to the corresponding SAS variables. When you use
formatted input, you move the input pointer to the starting position of the field using a column pointer control, name
the variable, and specify an informat. Remember, an informat is the special instruction that specifies how SAS reads raw
data.


Question
Which raw data value below is an example of nonstandard numeric data?

a. 234.908

b. -234.908

c. $234,908.00

d. 23E4

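The correct answer is c. The value $234,908.00 contains a dollar sign and a comma, so SAS cannot read it without an
informat. Standard numeric data values can include a minus sign, decimals, and numbers expressed in scientific notation.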

Using the @n Column Pointer Control


In this lesson, you'll be working with two column pointer controls: the @n pointer control and the +n pointer control.
Let's first take a look at the @n column pointer control.

The @n is an absolute pointer control that moves the input pointer to a specific column number. The @ moves the
pointer to column n, which is the first column of the field that you want to read. Here, the INPUT statement reads the
first value for Cust_type as a numeric value, beginning in column 1. By default, the pointer is already positioned at
column 1, so the @1 is optional here. The value for the next variable, Offer_dt, begins in column 5.

Using the +n Column Pointer Control


Now, let's take a look at the +n column pointer control.

The +n pointer control moves the input pointer from left to right, to a column number that is relative to the current
position. The + sign moves the pointer forward n columns. In order to count correctly, it's important to understand
where the column pointer control is located after each data value is read. Let's look at an example.

With formatted input, the column pointer control moves to the first column following the field that was just read. In this
example, after Offer_dt is read, the pointer moves to column 13. To start reading Item_gp, which begins in column 14,
you move the column pointer control ahead 1 column with +1.
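
The counting described above might be sketched as follows. The step uses only relative movement and assumes the field
layout given in this lesson: Cust_type in columns 1-4, Offer_dt in columns 5-12, Item_gp in columns 14-21, and
Discount in columns 22-24. The output data set name discounts_rel is hypothetical.

```sas
data work.discounts_rel;
   infile "&path/offers.dat";
   input Cust_type 4.          /* reads columns 1-4; pointer is now at column 5       */
         Offer_dt mmddyy8.     /* reads columns 5-12; pointer is now at column 13     */
         +1 Item_gp $8.        /* +1 skips column 13; reads columns 14-21             */
         Discount percent3.;   /* pointer is already at column 22; reads 22-24        */
run;
```

No pointer control is needed before Offer_dt or Discount, because each of those fields starts in the column that
immediately follows the previous field.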

Column pointer controls are very useful. For example, you can use the @n to move a pointer forward or backward when
reading a record. Here, the INPUT statement reads the values for Discount, beginning in column 22, before the values
for Cust_type. But, if you use this technique, you should keep in mind that it's more efficient to read the data in the
input buffer from left to right.
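
A minimal sketch of moving the pointer backward with @n is shown below, assuming the same offers.dat layout as the rest
of this lesson. The output data set name discount_first is hypothetical.

```sas
data work.discount_first;
   infile "&path/offers.dat";
   input @22 Discount percent3.   /* jump forward to column 22 first      */
         @1  Cust_type 4.;        /* then move back to column 1           */
run;
```

Because Discount is read first, it also becomes the first variable in the output data set.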

Using Formatted Input

In this demonstration, you use formatted input to read the values in offers.dat in consecutive order.

1. Copy and paste the following code into the editor.

data work.discounts;
   infile "&path/offers.dat";
   input @1 Cust_type 4.
         @5 Offer_dt mmddyy8.
         @14 Item_gp $8.
         @22 Discount percent3.;
run;

The values for Cust_type begin in column 1. We could use the @1 column pointer control. However, it's not
necessary because the default column pointer location is column 1. We'll name the variable Cust_type. We need
to read in four columns of standard numeric data. So, we'll use the 4. informat.

Let's continue with the INPUT statement. The values for Offer_dt begin in column 5. In this case, the data values
are nonstandard numeric data. Here's a question. What informat should we use to read in the values for
Offer_dt? We'll use MMDDYY8. because the values are in the form month/day/year and we need to read in
eight columns of data.

The values for Item_gp begin in column 14. We need to read in eight columns of character data, so we'll use the
$8. informat. Finally, we'll read in the values for Discount, which begin in column 22. These values include
percent signs, so we'll use the PERCENT3. informat.

2. Submit the code and check the log. Verify that the discounts data set contains four variables.

3. Copy and paste the following PROC PRINT step into the editor.

proc print data=work.discounts;
run;

4. Submit the PROC PRINT step and view the results. Notice that the values of Offer_dt are unformatted SAS date
values.

5. Add a FORMAT statement to the PROC step, as shown below, to temporarily associate the DATE9. format with
Offer_dt.

proc print data=work.discounts;
   format Offer_dt date9.;
run;

6. Resubmit the PROC PRINT step and view the results. Notice that the values of Offer_dt now appear with the
DATE9. format. You could add a FORMAT statement to the DATA step to permanently associate the DATE9.
format with Offer_dt.

Processing the DATA Step


Now that you've seen the program that creates the discounts data set, let's take a closer look at how the code is
processed.

The input buffer and the program data vector are created when the program is compiled. Prior to execution, the values
in the PDV are initialized to missing. When the INPUT statement executes, SAS loads the input buffer with the first row
of data from offers.dat.

SAS reads the value for Cust_type, in columns 1 through 4, using the 4. numeric informat. SAS uses the MMDDYY8.
informat to read the value for Offer_dt, in columns 5 through 12, as a SAS date value. Next, SAS reads the value for
Item_gp, in columns 14 through 21, into the PDV using the $8. character informat. Finally, SAS reads the value for
Discount, in columns 22 through 24, into the PDV using the PERCENT3. informat.

At the bottom of the DATA step, an implicit OUTPUT statement directs SAS to write the observation to the discounts
data set. Then, an implicit RETURN statement returns processing to the top of the DATA step. SAS continues to process
the data until the end of the file is reached.

Notice that in the PDV, the character variable Item_gp is defined with the exact length specified by the informat, 8. But,
the lengths that are defined for Cust_type, Offer_Dt, and Discount in the PDV are different from the lengths that are
specified by their informats.

Remember, by default, SAS stores numeric values, no matter how many digits they contain, as floating-point
numbers in 8 bytes of storage. The length of a stored numeric variable isn't affected by an informat's width or by other
column specifications in an INPUT statement. But, the informat does specify the number of columns to be read. So, you
still need to specify the actual width of a raw data field in an INPUT statement. Otherwise, SAS reads the wrong columns
and you get incorrect variable values when the program executes.

Question
Can you use column input to read both standard and nonstandard data in fixed fields?

a. Yes

b. No
The correct answer is b. Nonstandard data values require an input style that has more flexibility than column input. You
can use formatted input, which combines the features of column input with the ability to read both standard and
nonstandard data in fixed fields.

Question
Which informat below should you use to read the raw data value $1,230,456 as a numeric variable?

a. w.

b. $w.

c. COMMAw.

The correct answer is c. The COMMAw. informat strips out the dollar sign and commas and assigns the value to a
numeric variable.
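
As a quick check, the value from the question can be read from in-line data. This is only a sketch: the data set name
amounts and the variable name Amount are made up, and the width 10 is chosen because the raw value is ten characters
wide.

```sas
data work.amounts;
   input Amount comma10.;   /* strips the dollar sign and commas while reading */
   datalines;
$1,230,456
;
run;

proc print data=work.amounts noobs;
run;
```

The stored value of Amount is the number 1230456.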

Creating a Single Observation from Multiple Records


The raw data file address.dat contains name, mailing address, and phone information. Notice that the information for
each person is on four lines in the raw data file.

Suppose you need to create a SAS data set, mycontacts, with one observation per person, which contains the name,
phone, and second line of the mailing address.

Using Multiple INPUT Statements


You can use multiple INPUT statements to read each group of records as a single observation. By default, SAS loads a
new record into the input buffer when it encounters an INPUT statement. Let's take a look at how SAS uses multiple
INPUT statements to process the data in address.dat.

SAS executes the first INPUT statement and loads the first line of raw data into the input buffer. In this example, we
don't want to read any values from the second line of raw data. But, we do need to include a second INPUT statement, with
no variables specified, to move the line pointer to the second record. If we don't include this INPUT statement, we won't
be able to correctly read the values in the third and fourth records.

Because this INPUT statement doesn't read in any values, the data in the second record won't become part of the
output data set. However, the record is still loaded into the input buffer. As the DATA step continues to execute, SAS
loads the third and fourth lines of raw data into the input buffer.

Line Pointer Controls


As an alternative to writing multiple INPUT statements, you can write one INPUT statement that contains line pointer
controls to specify the record(s) from which values are to be read. There are two line pointer controls, the forward slash
and the #n.

The forward slash is known as a relative line pointer control that moves the line pointer relative to the line on which it is
currently positioned. The forward slash only moves the input pointer forward and must be specified after the
instructions for reading the values in the current record.

The #n line pointer control specifies the absolute number of the line to which you want to move the input pointer. The
#n pointer control can read lines in any order. Notice that in the INPUT statement, you must specify the #n before the
instructions for reading the values.

Using the / Line Pointer Control


Let's take a look at how SAS processes the input data when you use a / line pointer control.

The INPUT statement reads the first line of raw data. Remember, we don't want to read any values from the second line
of raw data. But, we do need to include a forward slash, without any following variable specifications, to move the line
pointer to the second record.

Because no variables are specified for the second record, the data in that record won't become part of the
output data set. However, the record is still loaded into the input buffer. A second forward slash advances the input
pointer to the third line of raw data. A final forward slash advances the input pointer to the fourth line of raw data.

Using the #n Line Pointer Control


Now let's see how SAS processes the same input data when the #n line pointer control is used.

This INPUT statement reads the values for FullName in the first record, followed by the values for Address2 in the third
record, and the values for Phone in the fourth record. Remember, the #n is an absolute line pointer control. It specifies
the absolute number of the line to which you want to move the input pointer.

In this example, we don't need an instruction for the second record, because no values are read from it. The #n line
pointer control also enables you to read raw data records in any order. For example, you could use the #n to read the
values for Phone, in the fourth record, before the values for Address2, in the third record.
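
An INPUT statement along these lines might look like the following sketch. It assumes the same address.dat layout and
informats that are used in the demonstration for this topic. The output data set name mycontacts_n is hypothetical.

```sas
data work.mycontacts_n;
   infile "&path/address.dat";
   input #1 FullName $30.    /* name from the first record of each group   */
         #3 Address2 $25.    /* second address line from the third record  */
         #4 Phone $8.;       /* phone number from the fourth record        */
run;
```

The highest line number used here is 4, which matches the four records that make up each person's entry.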

Creating a Single Observation from Multiple Records


In this demonstration, you explore two ways to create a single observation from multiple records.

1. Copy and paste the following DATA step into the editor.

data mycontacts;
   infile "&path/address.dat";
   input FullName $30.;
   input;
   input Address2 $25.;
   input Phone $8.;
run;

Notice that the code uses multiple INPUT statements to read the data in address.dat. We need to read
the values for FullName in the first record, followed by the values for Address2 in the third record. Then we
need to read the value for Phone from the fourth record. In order to skip the first address record, we also need
to include an INPUT statement with no variables specified.

2. Submit the code and check the log. Notice that 48 records were read from address.dat and that the mycontacts
data set has 12 observations and 3 variables.

3. Copy and paste the following PROC PRINT step into the editor.

proc print data=mycontacts;
run;

4. Submit the PROC PRINT step and view the results. Notice that there is one observation per person, which
contains the name, phone, and second line of the mailing address.

5. Edit the DATA step, as shown below, so that it contains a single INPUT statement.

data work.mycontacts;
   infile "&path/address.dat";
   input FullName $30. / /
         Address2 $25. /
         Phone $8.;
run;

We still want to read the values for FullName from the first record. The first forward slash line pointer control
advances the input pointer to the second record. We want to skip the second record, so a second forward slash
follows immediately and advances the pointer to the third record, where we read the values for Address2. The last
forward slash advances the input pointer to the fourth record, where we read the value of Phone.

6. Submit the revised DATA step and the PROC PRINT step.

7. Check the log. Verify that 48 records were read from address.dat and that mycontacts contains 12
observations and 3 variables.

8. View the results. Notice that the output is the same as when you used multiple INPUT statements: there is one
observation per person, which contains the name, phone, and second line of the mailing address.

Question
Which set of instructions reads the values for Quantity (in the third field) after the values for Item (in the first
field)? Select all that apply.

         1    1    2    2    3    3
1---5----0----5----0----5----0----5
COMPAPER  $13.25   500

a. input Item $9. @20 Quantity 3.;

b. input Item $9. +10 Quantity 3.;

c. input Item $9. +11 Quantity 3.;

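The correct answer is a and b. After Item is read with the $9. informat, the pointer is at column 10. You can either
move the pointer forward 10 columns (+10) or give the exact location (@20). Both place the pointer at column 20, where
Quantity begins. Moving forward 11 columns (+11) places the pointer at column 21, one column too far.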

Controlling When a Record Loads


The raw data file sales.dat contains data about the largest sales in the first quarter of 2011. Notice that the decimal
points and commas are reversed for the European and United States sales figures. Also notice that the sales dates for
Europe are represented in the form daymonthyear, while the sales dates for the United States are represented in the
form month-day-year.

Suppose that you want to create a SAS data set, named salesQ1, from the raw data in sales.dat. The SAS data set should
contain the variables SaleID, Location, SaleDate, and Amount.

Working with Mixed Record Types


To create the salesQ1 data set, you might write a program that uses conditional logic.

If the value of Location is USA, the MMDDYY10. and the 7. informats are used to read in the values for SaleDate and
Amount. If the value of Location is Europe, the DATE9. and COMMAX7. informats are used to read in the values for
SaleDate and Amount.

This program is a good start. However, it produces some unexpected results.

Processing the DATA Step


To understand why this program doesn't work, let's take a look at what happens when SAS processes the code.

As the DATA step begins, SAS initializes the PDV. The INFILE statement specifies the input data file. The first INPUT
statement loads the input buffer with the first record in the raw data file. SAS loads the values for SaleID and Location
into the PDV. The specified condition Location = 'USA' is true, so the program continues to execute.

Think about this: What happens when the program reaches the second INPUT statement? The second INPUT statement
loads the second record from sales.dat into the input buffer. As the INPUT statement continues, SAS tries to read the
values in the input buffer into the PDV using the specified informats. In this case the informats don't match the data, so
an invalid data message is written to the SAS log and the values for SaleDate and Amount remain set to missing.

At the bottom of the DATA step, an implicit OUTPUT statement directs SAS to write the observation to the salesQ1 data
set. Then, an implicit RETURN statement returns processing to the top of the DATA step. SAS continues to process the
data until the end of the file is reached.

Line Hold Specifiers


Normally, each INPUT statement in a DATA step reads a new data record into the input buffer. To get the correct results,
you need to keep the second INPUT statement from moving to the next line of raw data. You can do this by using a
line-hold specifier, the single trailing @, in the first INPUT statement.

The syntax for an INPUT statement with a single trailing @ is as follows:

INPUT specifications . . . @ ;

The term trailing indicates that the @ sign must be the last item specified in the INPUT statement. When you use a
trailing @, the pointer position doesn’t change and a new record isn’t loaded into the input buffer. So, the next INPUT
statement for the same iteration of the DATA step continues to read the same record as the previous INPUT statement.

The single trailing @ holds a raw data record in the input buffer until an INPUT statement without a trailing @ executes
or the next iteration of the DATA step begins.

Using a Single Trailing @


In this demonstration, you add a single trailing @ to the program that we've been working with.

1. Copy and paste the following code into the editor. Notice that the first INPUT statement contains a single trailing
@.

data salesQ1;
   infile "&path/sales.dat";
   input SaleID $4. @6 Location $3. @;
   if Location='USA' then
      input @10 SaleDate mmddyy10.
            @20 Amount 7.;
   else if Location='EUR' then
      input @10 SaleDate date9.
            @20 Amount commax7.;
run;

proc print data=work.salesQ1;
run;

2. Submit the code and check the log. Verify that six records were read from sales.dat and that salesQ1 contains six
observations and four variables. Notice that no errors were reported.

3. View the results. Notice that there are no missing or incorrect values.

Processing the DATA Step


You've seen that the single trailing @ works to create the output that you want. Now let's take a look at exactly what
happens when SAS processes this code.

The first INPUT statement loads the first record in the raw data file into the input buffer. Next, SAS reads the values for
SaleID and Location into the PDV. The single trailing @ indicates that a new record should not be read when the next
INPUT statement is reached. The condition Location='USA' is true, so SAS executes the following INPUT statement.
Remember, the previous record is being held in the input buffer.

The second INPUT statement reads the values for SaleDate and Amount into the PDV. There's no trailing @ in the
second INPUT statement, so SAS releases the record in the input buffer. At the bottom of the DATA step, an implicit
OUTPUT statement directs SAS to write the observation to the salesQ1 data set. Then, an implicit RETURN statement
returns processing to the top of the DATA step. The program continues to execute until the end of the file is reached.

Subsetting the Data


Now, let's modify the program that we've been working with. Suppose you want to create a SAS data set, named
europeQ1, that contains only the European records from sales.dat. To accomplish this task, you can use a subsetting IF
statement to process only the records in which Location equals EUR.

Notice the placement of the subsetting IF statement. Generally, it's most efficient to place a subsetting IF statement just
after the variables that are needed to evaluate the condition have been assigned values. In this case, the most efficient
place to put the subsetting IF is after the first INPUT statement, because that is where Location is assigned a value.
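
The program described here can be sketched as follows. To keep the sketch self-contained, it reads three made-up records from in-stream DATALINES instead of sales.dat. The record values and the FORMAT statement are assumptions added for illustration; the column positions and informats follow the lesson's sales program.

```sas
/* Hypothetical records laid out like sales.dat:
   SaleID in columns 1-4, Location in 6-8,
   date starting in column 10, amount starting in column 20. */
data europeQ1;
   infile datalines;
   input @6 Location $3. @;      /* read Location and hold the record */
   if Location='EUR';            /* other records stop processing here */
   input @1  SaleID   $4.
         @10 SaleDate date9.
         @20 Amount   commax7.;
   format SaleDate date9.;       /* display choice, not part of the lesson */
   datalines;
1001 USA 01-20-2011 295.50
1002 EUR 30JAN2011  876,30
1003 EUR 15FEB2011  412,75
;
run;

proc print data=europeQ1;
run;
```

Because the subsetting IF sits directly after the first INPUT statement, SAS never spends time reading SaleID, SaleDate, and Amount for the USA record. With these sample records, only the two EUR records should reach the output data set.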

Summary: Reading Raw Data

This summary contains topic summaries, syntax, and sample programs.

Topic Summaries

Using Formatted Input


The way data is organized in a raw data file determines which input style you should use to read the data. You can use
list input to read standard or nonstandard data that is separated by delimiters. You can use column input to read
standard data that is arranged in columns or fixed fields.

INPUT variable <$> startcol-endcol . . . ;

Standard data is data that SAS can read without any special instructions. Nonstandard data is data that SAS cannot read
without special instructions. Nonstandard data values require an input style that has more flexibility than column input.
You can use formatted input, which combines the features of column input with the ability to read both standard and
nonstandard data in fixed fields.

INPUT column-pointer-control variable informat . . . ;

When you use formatted input, you specify the starting position of a field using a column pointer control, name the
variable, and specify an informat. An informat is the special instruction that specifies how SAS reads raw data.

There are two choices for column pointer control: absolute and relative. With absolute pointer control, @n , you specify
the column in which the field begins. SAS moves the input pointer directly to column n, which is the first column of the
field that you want to read.

INPUT @n variable informat . . . ;

Relative pointer control, +n , moves the input pointer from left to right, to a column position that is relative to the
current position. The + sign moves the pointer forward n columns. With this style of pointer control, it's important to
understand the position of the input pointer after a data value is read. With formatted input, the input pointer moves to
the first column following the field that was just read.

INPUT +n variable informat . . . ;


The informat indicates the type and length of the variable to be created. It also tells SAS the width of the input field in
the raw data file, and how to convert the data value before copying it to the program data vector.

Creating a Single Observation from Multiple Records


You can use multiple INPUT statements to read a group of records and create a single observation. By default, SAS loads
a new record into the input buffer when it encounters an INPUT statement.

DATA SAS-data-set;
INFILE 'raw-data-file-name';
INPUT specifications;
INPUT specifications;
<additional SAS statements>

As an alternative to writing multiple INPUT statements, you can write one INPUT statement that contains line pointer
controls to specify the record(s) from which values are to be read. There are two line pointer controls, the forward slash
and the #n.

DATA SAS-data-set;
INFILE 'raw-data-file-name';
INPUT specifications /
#n specifications;
<additional SAS statements>

The forward slash moves the line pointer relative to the line on which it is currently positioned, causing it to read the
next record. The forward slash only moves the input pointer forward and must be specified after the instructions for
reading the values in the current record.

The #n line pointer control specifies the absolute number of the line to which you want to move the input pointer. The
#n pointer control can read lines in any order. You must specify the #n before the instructions for reading the values.

Controlling When a Record Loads


By default, each INPUT statement in a DATA step reads the next record into the input buffer, overwriting the previous
contents of the buffer. You can use a line-hold specifier, the single trailing @, to prevent the second INPUT statement
from reading a record.

INPUT specifications . . . @ ;

The single trailing @ holds the record in the input buffer, causing the next INPUT statement to read from the buffer,
instead of loading a new record. The input buffer is held until an INPUT statement without a trailing @ executes or the
next iteration of the DATA step begins.

Sample Programs

Using Formatted Input

data work.discounts;
   infile "&path/offers.dat";
   input @1 Cust_type 4.
         @5 Offer_dt mmddyy8.
         +1 Item_gp $8.;
run;

Creating a Single Observation from Multiple Records

data mycontacts;
   infile "&path/address.dat";
   input FullName $30. / /
         Address2 $25. /
         Phone $8. ;
run;

data mycontacts;
   infile "&path/address.dat";
   input #1 FullName $30.
         #3 Address2 $25.
         #4 Phone $8. ;
run;

Controlling When a Record Loads

data salesQ1;
   infile "&path/sales.dat";
   input SaleID $4. @6 Location $3. @;
   if Location='USA' then
      input @10 SaleDate mmddyy10.
            @20 Amount 7.;
   else if Location='EUR' then
      input @10 SaleDate date9.
            @20 Amount commax7.;
run;

data EuropeQ1;
   infile "&path/sales.dat";
   input @6 Location $3. @;
   if Location='EUR';
   input @1 SaleID $4.
         @10 SaleDate date9.
         @20 Amount commax7.;
run;

Lesson 4: Manipulating Character Values


SAS functions are built-in programming routines that enable you to complete many types of data manipulation quickly
and easily. Functions are often categorized by the type of manipulation they perform, with categories such as arithmetic,
financial, character, descriptive statistics, date and time, and many more. You can't possibly learn to use all the
SAS functions in one course, so be sure to browse the SAS Help and Documentation to explore the many functions that
you can use in your SAS programs.

In the next lesson, you learn to use some of the functions that manipulate numeric variables. In this lesson you learn
how to use some of the functions that manipulate character variables.

You can use these character functions in assignment statements or conditional statements to create new variables, or
you can use them to replace the contents of a variable. For example, you might want to extract a portion of a character
value, or change the case of a character value, or separate a single value into multiple values, or concatenate multiple
values into one value, or find and replace a particular string in a character value, or remove the blanks in a character
value. These are just a few of the examples of the character value manipulations you can perform by using functions.

Objectives

In this lesson, you learn to do the following:

 extract a portion of a character value from a specified position
 change the case of a character value
 separate the values of one variable into multiple variables
 put multiple values together into one value
 remove blanks from a character value
 search a character value to find a particular string
 replace a portion of or all of the contents of a character value

Using SAS Functions


A SAS function is a routine that performs a calculation on, or a transformation of, the arguments listed in parentheses
and returns a value. You should be familiar with functions, but let's quickly review the syntax.

To use a SAS function, you specify the function name followed by the function arguments, which are enclosed in
parentheses. An argument can be a constant, a variable, or any SAS expression, including another function. You can use
functions in DATA step statements anywhere that you can use an expression. You can also use functions in WHERE
statements in many procedures and in SQL syntax.

In this example, the MEAN function calculates the average of three exam scores that are stored in the numeric variables
exam1, exam2, and exam3. The three variables are the arguments for the MEAN function. The resulting value is assigned
to the numeric variable AvgScore.
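
A minimal sketch of such an assignment statement is shown below. The three score values and the data set name are made up for illustration; only the variable names come from the text.

```sas
data work.scores;
   /* Hypothetical exam scores */
   exam1=80;
   exam2=90;
   exam3=85;
   /* The three variables are the arguments; AvgScore receives the result */
   AvgScore=mean(exam1,exam2,exam3);
   put AvgScore=;   /* writes AvgScore=85 to the log */
run;
```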

Target Variables and Functions


Now that you've reviewed the syntax of SAS functions, let's talk about target, or assignment, variables. A target variable
is a variable to which the result of the function is assigned. For example, AvgScore is the target variable here.

If the target variable is a new variable, its type and length are determined by the expression on the right side of the
equals sign. In this example, the result of the function on the right side of the equals sign is a number, so AvgScore is
created as a numeric variable with a length of 8 bytes.

In this next example, the function on the right side of the equals sign returns a character string, so the variable B is
created as a character variable. The length of B is determined by the specific character function that is used.

Some character functions produce target variables with a length that is equal to the length of the first argument. Other
functions produce target variables with a default length of 200 bytes. This is important to understand, because default
lengths could cause variables to use more space than necessary in your data set, or cause the value to be truncated if
the value exceeds the length of the variable. You can avoid this by using a LENGTH statement to specify a length for the
target variable before the assignment statement with the function. You learn more about this as you learn each new
character function in this lesson.
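
The effect of the LENGTH statement can be checked with a small sketch like the one below. The variable names and the sample value are hypothetical, and PROPCASE is used only because this lesson later describes it as a function with a 200-byte default. The VLENGTH function reports the defined length of each variable so that you can compare the two in the log.

```sas
data _null_;
   length Sized $ 20;             /* length declared before the assignment */
   Name='ORION STAR SPORTS';      /* hypothetical value */
   Sized=propcase(Name);          /* target keeps its declared length of 20 */
   Unsized=propcase(Name);        /* new variable: length comes from the function */
   SizedLen=vlength(Sized);
   UnsizedLen=vlength(Unsized);
   put SizedLen= UnsizedLen=;     /* compare the two defined lengths in the log */
run;
```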

Question
Which of the following statements is false?

a. Functions are used in SAS statements.

b. A function is written by specifying the function name followed by the function arguments.

c. Function arguments must be enclosed in parentheses.

d. All function arguments are variables.

The correct answer is d. Function arguments can be variables, constants, or expressions.

Extracting and Transforming Character Values


Suppose a manager in the Orion Star Finance Department has asked for a list of all the charities that Orion Star
contributes to. The report she wants must contain the charity's identification code and name.

The Orion Star accounting system has a data set named biz_list that is close to what the manager wants. You can use
this input data set to create a new data set, charities, that contains the information you need in the report. To create the
new data set, you need to do several things. The biz_list data set contains the names of Orion Star's US suppliers,
charities, and consultants. The last character in the Acct_Code variable represents the type of organization: 1 denotes a
supplier, 2 denotes a charity, and 3 denotes a consultant.

The first step is to subset the data based on the last position of Acct_Code so that you have only charities in your new
data set. The other characters in the Acct_Code variable represent the identification code for the organization.

The next step is to extract the identification code from Acct_Code and store it in a new variable named ID. The name of
the organization is stored in uppercase letters in the variable Name.

The final step is to change the values in the Name variable from uppercase to proper case. You'll leave the variable
Acct_Code in the charities data set so that you can easily compare it to the ID variable. Otherwise, the variable could be
dropped from the output data.

Here's a list of the character functions you’ll use to transform the data. You use the SUBSTR and LENGTH functions to
extract a string from a value and use it to subset the data. Then you learn an alternate way to get the same result by
using the RIGHT, LEFT, and CHAR functions. Finally, you use the PROPCASE function to change the case of character
values.

Using the SUBSTR Function to Extract a String


The first thing you need to do is subset the biz_list data based on the last character of Acct_Code. You can use the SAS
character function SUBSTR to do this. The SUBSTR function can be used in different ways. When you use the function on
the right side of an assignment statement, it extracts a substring of characters from a character string, starting at a
specified position in the string.

The syntax for the SUBSTR function is the function name with the required arguments string and start, and the optional
argument length. String can be a character constant, a variable, or an expression. Start specifies the starting position.
Length specifies the number of characters to extract. If you don't specify a length, the remainder of the string is
extracted.

Let's look at an example. This SUBSTR function specifies that a string be extracted from the value of the variable
Item_Code. The string to be extracted begins in position 1 and contains 3 characters. The string is then assigned to the
variable Item_Type. If the length of the new variable is not previously defined with a LENGTH statement, it has the same
length as the first argument to SUBSTR.
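
The example described above might be written as in the following sketch. The value assigned to Item_Code is an assumption made for illustration, and the Rest variable is a hypothetical addition that shows what happens when the length argument is omitted.

```sas
data _null_;
   Item_Code='978-1-59994-397-8';      /* assumed sample value */
   Item_Type=substr(Item_Code,1,3);    /* start at position 1, take 3 characters */
   Rest=substr(Item_Code,13);          /* no length: the remainder of the string */
   put Item_Type= Rest=;               /* Item_Type=978 Rest=397-8 */
run;
```

Because no LENGTH statement precedes the assignments, Item_Type and Rest are both created with the same length as Item_Code.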

Question
Which SUBSTR function extracts the group of five numbers from the middle of the Item_Code value?

Item_Code

978-1-59994-397-8

a. substr(Item_Code,5,7)

b. substr(Item_Code,5)

c. substr(Item_Code,7,5)

d. substr(Item_Code,'mid',5)

The correct answer is c. It must start in position 7 (counting numbers and dashes) and extract 5 characters.

Using the LENGTH Function to Find the Length of a String


Now that you know how the SUBSTR function works, you might be thinking that you'll subset the data by extracting the
last character in the Acct_Code variable and writing an IF statement that compares the extracted value to 2. You're on
the right track, but there's a little problem.

When you look at the values in the Acct_Code variable, you see that in most cases the account code has four characters,
but in some cases it has three characters. That means the starting position for extraction varies. This is a problem for the
SUBSTR function because you must specify a starting location that works for all values. To solve this problem, you can
use the LENGTH function to determine the number of characters in the value. Once you know the length of the value,
you can determine a starting position for extracting the particular value.

The LENGTH function returns the length of a character string, excluding trailing blanks. If the character string is blank,
the LENGTH function returns 1. The function requires one argument, which can be a variable, constant, or expression.

Let's look at an example. Here, the variable Code is defined as ABCD with two trailing blanks. The LENGTH function
returns the integer 4 because trailing blanks are not counted. The value is assigned to the variable Last_Position.

Let's look at another example. Here, the variable Code is defined as EFGH with two leading blanks. This time the LENGTH
function returns the integer 6 because leading blanks in a value are counted. The value is assigned to the variable
Last_Position.
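
Both examples, plus the all-blank case, can be sketched in one DATA step. The declared length of 6 for Code is an assumption chosen so that each value fits exactly.

```sas
data _null_;
   length Code $ 6;
   Code='ABCD  ';                 /* two trailing blanks */
   Last_Position=length(Code);
   put Last_Position=;            /* 4: trailing blanks are not counted */

   Code='  EFGH';                 /* two leading blanks */
   Last_Position=length(Code);
   put Last_Position=;            /* 6: leading blanks are counted */

   Code=' ';                      /* blank value */
   Last_Position=length(Code);
   put Last_Position=;            /* 1 */
run;
```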

Question
What value is returned by the LENGTH function if the value in Code contains two leading blanks and three trailing
blanks?

Code ($9): ABCD, with two leading blanks and three trailing blanks
Expression: length(Code)

a. 4

b. 6

c. 9

The correct answer is b. Leading blanks are counted, but trailing blanks are not, so 6 is the length of the value.

Extracting a Value Using the SUBSTR and LENGTH Functions


Now you can use the SUBSTR and LENGTH functions together to subset the data and extract a string for the ID variable
in the charities data set. Remember that you need to use the LENGTH function because the values in the Acct_Code
variable have different lengths.

First you use the two functions in a subsetting IF statement to subset the data. You want to select observations from
biz_list where the last character in the value of Acct_Code is 2. You use the LENGTH function to get the length of
Acct_Code. The integer that the LENGTH function returns specifies the start position for the SUBSTR function. Then you
also use the two functions in an assignment statement that extracts a string. You use the LENGTH function again to find
the length of Acct_Code. Here the LENGTH function returns an integer that specifies the length of the string to extract.

Now, let's take a look at this code. Although it's technically accurate, it's not efficient to repeat a function to extract the
same value over and over again. Instead, it's better to create a variable to hold the value in the program data vector, and
then reference the variable rather than repeat the function. Let's revise the code to create the Len variable to hold the
length of Acct_Code. Then you can use the Len variable in the arguments for each SUBSTR function. Since you don't
want the Len variable in the charities data, you can drop it by using the DROP= data set option or the DROP statement.

Let's step through execution for the first charity observation to see how the functions transform the data. The SET
statement reads the first observation that is a charity. The LENGTH function reads the value in Acct_Code and returns
the value 4. The value is assigned to the variable Len. The SUBSTR function begins reading the value in Acct_Code at
position 4 and reads one character. The SUBSTR function returns the value 2. The IF statement is true, so the assignment
statement executes. In the assignment statement, the SUBSTR function begins reading the value in Acct_Code at
position 1. The value stored in Len is 4, and this argument is 4-1, so the SUBSTR function reads three characters. The
SUBSTR function returns the value AQI. The value is assigned to the ID variable. SAS writes the observation to the
charities data set, and control returns to the top of the DATA step.
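
For comparison, the first version described above, which calls LENGTH inside each SUBSTR function, might look like the following sketch. The work.biz_demo data set and its three account codes are hypothetical stand-ins for orion.biz_list. Only the value AQI2 is taken from the walkthrough.

```sas
/* Hypothetical stand-in for orion.biz_list */
data work.biz_demo;
   length Acct_Code $ 6;
   Acct_Code='AQI2'; output;   /* charity, four characters */
   Acct_Code='CB1';  output;   /* supplier, three characters */
   Acct_Code='XY2';  output;   /* charity, three characters */
run;

data work.charities_demo;
   length ID $ 5;
   set work.biz_demo;
   /* LENGTH is evaluated twice per observation in this version */
   if substr(Acct_Code,length(Acct_Code),1)='2';
   ID=substr(Acct_Code,1,length(Acct_Code)-1);
run;

proc print data=work.charities_demo;
run;
```

Storing the result of LENGTH in a variable such as Len, as the text recommends, avoids the repeated function call.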

Extracting a String from a Specific Position in a Character Value


In this demonstration, you extract a string from a specific position in a character value.

1. Copy and paste the following code into the editor.

data charities(drop=Len);
   length ID $ 5;
   set orion.biz_list;
   Len=length(Acct_Code);
   if substr(Acct_Code,Len,1)='2';
   ID=substr(Acct_Code,1,Len-1);
run;

This DATA step subsets the orion.biz_list and creates the variable ID by extracting a substring from the variable
Acct_Code.

2. Submit the code and check the log. Notice that there are 12 observations in the charities data set.

3. Copy and paste the following PROC PRINT step into the editor.

proc print data=charities;
run;

4. Submit the PROC PRINT step and view the results. Notice that the values in ID vary from 2 to 4 characters. The
LENGTH function gives you a lot of flexibility for extracting substrings. Looking at the Acct_Code variable, you
can see that all values have a 2 in the last position. This tells us that all observations in the new data set are
charities.

Extracting a Value Using the RIGHT, LEFT, and CHAR Functions


SAS provides a rich set of functions for transforming character data. Often, you can perform identical transformations by
using different combinations of functions. You've seen one way to extract a string from a character value. Now let's look
at another way to extract the identification code from Acct_Code by using SUBSTR with the RIGHT, LEFT, and CHAR
functions.

The RIGHT function right-aligns a value. If there are trailing blanks, they are moved to the beginning of the value. In this
example, the RIGHT function right-aligns the value of Acct_Code. There are three trailing blanks at the end of the value.
The new variable has the value right-aligned with three leading blanks.

The LEFT function left-aligns the value and moves leading blanks to the end of the value. In this example, the LEFT
function left-aligns the value in the NewCode_Rt variable.

The CHAR function returns a single character from a specified position in a character string. In this example,
the CHAR function returns the third character in the value of Acct_Code.

You can use the RIGHT, LEFT, CHAR and SUBSTR functions together to subset the biz_list data and extract a
string to create the ID variable from the Acct_Code variable. Notice here that the Code_Rt variable stores the
right-aligned value of Acct_Code. This variable is used in the arguments of both the CHAR and LEFT
functions.

Stepping through execution of the first charity observation is the easiest way to understand how the functions
transform the values. As before, let's start with the SET statement that has just read the first observation that is a
charity.

The first assignment statement executes. The length of the Acct_Code variable is 6. The RIGHT function right-
aligns the value and moves the two trailing blanks to the front of the string. The IF statement executes and the
CHAR function extracts the character in the sixth, or last, position in the value of Code_Rt. Notice that the
length of Code_Rt is 6. Because the value of Code_Rt is right-aligned, it doesn't matter how many characters
there are in the value of Acct_Code. The last character in the value is always in the sixth position when the
value is right-aligned. In this case, the last character is 2 and the IF statement is true.

Next, the second assignment statement executes. The SUBSTR function reads the value in Code_Rt, starts
extracting at position 1, and extracts five characters. The result in this case is a string with two leading blanks
and the letters AQI. Then the LEFT function left-aligns the value and the result is AQI with two trailing blanks.
The value is assigned to the ID variable. At the bottom of the DATA step, the observation is written to the
charities data set.

This program creates the same output data as the program that uses the LENGTH and SUBSTR functions.
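
A sketch of the program that this walkthrough describes is shown below. It is reconstructed from the steps in the text (right-align, test position 6, extract five characters, left-align), so the exact statements are an assumption. The work.biz_demo data set is a hypothetical stand-in for orion.biz_list, and the LENGTH statement for ID is borrowed from the earlier demonstration.

```sas
/* Hypothetical stand-in for orion.biz_list; Acct_Code has a length of 6 */
data work.biz_demo;
   length Acct_Code $ 6;
   Acct_Code='AQI2'; output;   /* charity, four characters */
   Acct_Code='CB1';  output;   /* supplier, three characters */
   Acct_Code='XY2';  output;   /* charity, three characters */
run;

data work.charities_alt(drop=Code_Rt);
   length ID $ 5;
   set work.biz_demo;
   Code_Rt=right(Acct_Code);        /* trailing blanks move to the front */
   if char(Code_Rt,6)='2';          /* last character is always in position 6 */
   ID=left(substr(Code_Rt,1,5));    /* first five characters, then left-align */
run;

proc print data=work.charities_alt;
run;
```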

Changing the Case of a Value Using the PROPCASE Function


Now that you've subset the data and created the ID variable with the account code, the final step is to change the case
of the charity name.

Remember that the name is stored in uppercase in the biz_list data, and you want to change the values in Name so that
the first letter in each word is uppercase and the remaining letters are lowercase. This is called proper case. You can
make this change by using the PROPCASE function.

The syntax for the PROPCASE function is the function name followed by the argument, which can be a character
constant, a variable, or an expression. The PROPCASE function uses delimiters, or characters, to specify when a new
word begins. Let's look at some examples to see how delimiters work.

In this example, no delimiter is specified, so SAS looks for one of the default delimiters: a blank, forward slash, hyphen,
open parenthesis, period, or tab. The value in Fruit has spaces, so the PROPCASE function returns each word with initial
capital letters. Notice that PROPCASE assigns a length of 200 bytes to a new variable that has not been previously
defined with a LENGTH statement.

Let's look at another example. Here the comma is the only delimiter separating the words in the value of Fruit. The
comma is not one of the default delimiters, so if you submit the function without specifying the comma as a delimiter,
the case change will apply only to the first character in the string. If you specify the delimiter, the case changes for each
word.

Let's look at one more example. Suppose there is a space and a comma between two of the words in the value of Fruit.
In your code, you specify the comma as the delimiter in the function arguments. You might think that the word Lilikoi
would be capitalized, but it would not be. The function capitalizes the next character after the delimiter. Once you
specify a delimiter, the default delimiters do not apply, so the space is not considered a delimiter. Capitalization starts
with the character after the comma, which is a space, so the rest of the string is lowercase. The solution is to specify
multiple delimiters. If you specify both a comma and a space as delimiters, then the word Lilikoi is capitalized.
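
The three cases above can be tried side by side with a sketch like this one. The Fruit values and the result variable names are made up for illustration; compare the values written to the log with the behavior described in the text.

```sas
data _null_;
   length Fruit $ 30;

   Fruit='papaya mango lilikoi';
   Spaces=propcase(Fruit);            /* default delimiters include the blank */

   Fruit='papaya,mango,lilikoi';
   NoDelim=propcase(Fruit);           /* comma is not a default delimiter */
   CommaOnly=propcase(Fruit,',');     /* comma named as the delimiter */

   Fruit='papaya mango, lilikoi';
   CommaThenBlank=propcase(Fruit,',');    /* blank is no longer a delimiter */
   CommaAndBlank=propcase(Fruit,', ');    /* comma and blank both listed */

   put Spaces= / NoDelim= / CommaOnly= / CommaThenBlank= / CommaAndBlank=;
run;
```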

Remember, you want the charities data set to have charity names in proper case, so you need to use the PROPCASE
function with Name as the argument. You don't need to specify delimiters because each word is separated by a space,
which is one of the default delimiters.

Question
Which assignment statement converts the current value of Name to the new value of Name? Select all that
apply.

Current value of Name: HEATH*BARR*LITTLE EQUIPMENT SALES
New value of Name: Heath*Barr*Little Equipment Sales

a. propcase(Name,'*');

b. propcase(Name,' *');

c. propcase(Name,'* ');

The correct answer is b and c. The second argument to PROPCASE must list all the characters to use as delimiters. The
space and the asterisk must both be listed, but they can be listed in any order.

Changing the Case of a Character Value


In this demonstration, you use a function to change the case of character values.

1. If necessary, copy and paste the following code into the editor. If you still have the code from the
last demonstration in the editor, you only need to add the PROPCASE function to the code to change
the values of Name to proper case.

data charities(drop=Len);
   length ID $ 5;
   set orion.biz_list;
   Len=length(Acct_Code);
   if substr(Acct_Code,Len,1)='2';
   ID=substr(Acct_Code,1,Len-1);
   Name=propcase(name);
run;

proc print data=charities;
run;

2. Submit the code. Then, check the log and view the results.

3. You don't need the account code in the report, so revise the PROC PRINT step, as shown below, to drop the Obs
column and include only the ID and Name variables in the listing report.

proc print data=charities noobs;
   var ID Name;
run;

4. Submit the revised PROC PRINT step and view the results. You've created the report that the manager
wants by using functions to transform the data.

Using Other Functions to Change Case


In addition to PROPCASE, SAS has other functions that change the case of values in a character variable. The LOWCASE
function changes all characters in a value to lowercase. For this function, you do not need to specify a delimiter. The
UPCASE function changes all characters in a value to uppercase. Again, there is no need to specify a delimiter. Both the
UPCASE and LOWCASE functions produce target variables with the same length as their argument.
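
As a quick illustration, the following sketch applies both functions to a made-up value. The variable names are hypothetical.

```sas
data _null_;
   Name='Orion Star';
   Lower=lowcase(Name);   /* orion star */
   Upper=upcase(Name);    /* ORION STAR */
   put Lower= Upper=;
run;
```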

Separating and Concatenating Character Values


Suppose you want to make mailing labels for a letter that Orion Star is going to send out to charities.

The contacts data set contains the contact information for each charity's representative. The data in the Address1 and
Address2 variables is in the correct form to use for a mailing address, but the data in the Title and Name variables needs
to be combined into a new variable named FullName. To do this, you first separate the last name from the first and
middle names based on the position of the comma.

Here's a question: Would it be easy to use the SUBSTR function to separate the contact's name into two parts? No, it
would be difficult because the comma isn't in the same position for each value of Name. You can use the SCAN function
to separate the names. Then you combine Title, FMName, and LName to create the variable FullName. Here's a preview
of the data set labels that you create for the mailing labels.

Here's a list of the character functions you'll use to transform the data. You use the SCAN function to extract a word
from a character string. You use the CATX function to concatenate the values of several variables, and you also learn
about other concatenation functions. You use the TRIM and STRIP functions to remove blanks from a value.

Separating a Word from a String Using the SCAN Function


The SCAN function enables you to separate a character value into words and to return a specified word. You use the
SCAN function when you know the relative order of words, but their starting positions vary.

Let's look at the syntax. The value of string can be a character constant, variable, or expression. N specifies which word
to read. A missing value is returned if there are fewer than N words in the string. If N is negative, SCAN finds the word in
the character string by starting from the end of the string. Delimiters list the characters that serve as word separators.
You can specify as many delimiters as you need to correctly separate the character expression. Delimiters must be
enclosed in quotation marks. If you don't specify a delimiter, SAS looks for default delimiters such as a blank, a comma, a
forward slash, or a period. The maximum length of the word that is returned by the SCAN function depends on the
environment from which it is called. In a DATA step, if the SCAN function returns a value to a variable that has not yet
been given a length, that variable is given the length of the first argument. If you need the variable to have a length
that is different from the length of the first argument, use a LENGTH statement for that variable before
the statement that uses the SCAN function.

Let's look at an example. Suppose you want to create two variables from the value stored in the Phrase variable. The
space is specified as a delimiter and the SCAN function separates the value in the variable Phrase into three words. The
first word is returned and assigned to the variable Item1. The third word is returned and assigned to the variable Item2.
The default length of both variables is 21 bytes unless they’ve been defined earlier with a LENGTH statement.
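
The example described above might look like the following sketch. The value of Phrase is an assumption, chosen because it contains three words and is 21 characters long; the data set name scan_demo is also hypothetical.

```sas
data scan_demo;
   Phrase='software and services';   /* assumed 21-character value */
   Item1=scan(Phrase,1,' ');         /* first word: software       */
   Item2=scan(Phrase,3,' ');         /* third word: services       */
run;

/* PROC CONTENTS shows the lengths that SAS assigned to Item1 and Item2 */
proc contents data=scan_demo;
run;
```

The lengths that PROC CONTENTS reports for Item1 and Item2 can depend on your SAS release, which is one more reason to set them with a LENGTH statement.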

Let's look at the value of Fruit in the program data vector. Here's a question: What value will be assigned to the variable
Third? The variable will contain the value papaya with one leading blank. There is a leading blank in the value because
only the comma is specified as the delimiter. Once you specify a delimiter, the default delimiters are not in effect.

Here's another question: How would you extract the value papaya with no leading blank? You could specify both the
comma and a blank as delimiters.
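
Here is a minimal sketch of both versions. The value of Fruit and the names fruit_demo and Third_Clean are hypothetical, because the original value appears only on the slide.

```sas
data fruit_demo;
   Fruit='kiwi, guava, papaya';      /* hypothetical value              */
   Third=scan(Fruit,3,',');          /* ' papaya' (one leading blank)   */
   Third_Clean=scan(Fruit,3,', ');   /* 'papaya' (no leading blank)     */
run;

proc print data=fruit_demo noobs;
run;
```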

Now you can use the SCAN function to extract words from the Name variable in the contacts data. Remember, you want
to separate the last name from the first and middle names.

Here's a question: What delimiter should you use to do this? The comma separates the last name from the first and
middle names. Although the comma is one of the default delimiters, you need to specify it here. If you don't specify it,
the space is also used as a delimiter and the first name is separated from the middle initial.

Let's look at the code. Notice that here a LENGTH statement specifies 15 bytes for the length of the two new character
variables. It's a good practice to use a LENGTH statement in your code. The SCAN function returns the second word from
the Name variable. The value is assigned to FMName. LName is assigned the first word from the Name string.

Question
If you specify the period, comma, and blank as delimiters for the SCAN function, what is the fourth word in the text
string below?

MR. JONATHAN E. MATTHEWS, PERSONNEL DIRECTOR

a. E

b. PERSONNEL DIRECTOR

c. MATTHEWS

d. PERSONNEL

The correct answer is c. This text string would be divided into six words: MR, JONATHAN, E, MATTHEWS, PERSONNEL,
and DIRECTOR. MATTHEWS is the fourth word.

Question
If you specify the colon as the delimiter for the SCAN function, how many words appear in the text string below?

Washington:New York:California:New Mexico

a. 4

b. 6

The correct answer is a. The SCAN function uses the colon as the delimiter and divides the string into four words.

Question
In this DATA step, which SCAN function completes the assignment statement to correctly extract the four-digit
year from the Text variable? Select all that apply.

data Scan_Quiz;
Text="New Year's Day, January 1st, 2007";
Year=________________________;
run;

a. scan(Text,-1)

b. scan(Text,6)

c. scan(Text,6,', ')

The correct answers are a, b, and c. All of these SCAN functions extract 2007 from the string in the Text variable.

Choosing between the SCAN and SUBSTR Functions


Both the SCAN and SUBSTR functions can extract text strings from a character value. SCAN extracts words within a value
that are marked by delimiters. SUBSTR extracts a substring of characters from a value starting at a specified location.

You use the SUBSTR function when you know the exact position of the string that you want to extract from a character
value. For example, the first two characters of the variable ID identify the class level of college students. Because the
position of these characters does not vary with the values of ID, you can use the SUBSTR function to extract them.

You use the SCAN function when you know the order of the words in a character value, the starting position of the word
varies, and the words are marked by some delimiter. In this example, the first name starts in different positions in the
two values, so you can't use the SUBSTR function. However, in every case the first name is the second word and the
words are separated by spaces and/or commas. You can extract the first name from the Name variable by using the
SCAN function.

Question
Look at the values of the two variables below. Which functions would you use to extract the bold portion of the string
from each variable?

ShipCode SiteMgr

TR112K_T Montry, 75

TD107M_R Lee, 124

a. Use the SUBSTR function to extract values from ShipCode, and the SCAN function to extract values from
SiteMgr.

b. Use the SCAN function to extract values from ShipCode, and the SUBSTR function to extract values from SiteMgr.

The correct answer is a. The three numbers in ShipCode are in the same position within each string, so you can use
the SUBSTR function to extract them from the string. The numbers in SiteMgr are in different locations, but they are in
the second position after a delimiter that can be specified. You can use the SCAN function to extract them.
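
The sketch below applies both functions to the two values from the question. It assumes that the portion to extract from ShipCode is the three digits in positions 3 through 5; the names extract_demo, Code, and Num are hypothetical.

```sas
data extract_demo;
   /* Column input keeps the embedded comma and blank in SiteMgr */
   input ShipCode $ 1-8 SiteMgr $ 10-21;
   Code=substr(ShipCode,3,3);    /* fixed position: 112, 107      */
   Num=scan(SiteMgr,2,', ');     /* second word:    75, 124       */
   datalines;
TR112K_T Montry, 75
TD107M_R Lee, 124
;
run;

proc print data=extract_demo noobs;
run;
```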

Extracting a Word from a Delimited Character String


In this demonstration you run the code for separating the name string into two strings: the first and middle name, and
the last name.

1. Copy and paste the following code into the editor.

data labels;
set orion.contacts;
length FMName LName $ 15;
FMName = scan(Name,2,',');
LName = scan(Name,1,',');
run;

2. Submit the code and check the log. Verify that no errors occurred.

3. Copy and paste the following PROC PRINT step into the editor.

proc print data=labels noobs;
run;

4. Submit the PROC PRINT step and view the results. Notice that the FMName and LName variables contain text
from the separated Name variable.

Concatenating Strings Using the CATX Function


You've separated the Name variable into two variables. Now you need to join the values of Title, FMName, and LName
and store the concatenated value in a new variable. To do this, you can use one of several functions that concatenate
strings.

The CATX function removes leading and trailing blanks, inserts separators, and returns a concatenated character string.
The syntax for the CATX function is shown here. Let's look at the arguments for the function. Separator is a character
string that is inserted between the concatenated strings. String-1 to string-n can be character constants, variables, or
expressions. Leading and trailing blanks are removed from each argument.

Let's look at how the function concatenates the variables. The separator is a blank, and the strings are concatenated in
the order that you list them in the function arguments. The CATX function removes the leading and trailing blanks from
each variable value and inserts a blank between each one. The function returns the string Ms. Sue Farr, which is assigned
to FullName. Notice that the length of FullName is 200 bytes because that is the default length set by the CATX function.
You can use a LENGTH statement to set the length of the FullName variable.
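
As a small, self-contained sketch of this behavior, the step below builds the three values by hand instead of reading them from orion.contacts. The data set name catx_demo and the variable lengths are assumptions.

```sas
data catx_demo;
   length Title $ 5 FMName LName $ 15 FullName $ 50;
   Title='Ms.';
   FMName=' Sue';   /* leading blank, as scan(Name,2,',') might leave it */
   LName='Farr';
   /* CATX strips the blanks from each value and inserts one blank */
   /* between them: Ms. Sue Farr                                   */
   FullName=catx(' ',Title,FMName,LName);
run;

proc print data=catx_demo noobs;
run;
```

Because the LENGTH statement names FullName before the assignment, FullName is 50 bytes here rather than the 200-byte default.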

Concatenating Character Strings

In this demonstration, you complete the program to create the data for mailing labels by concatenating character
strings.

1. Copy and paste the following code into the editor.

data labels;
set orion.contacts;
length FMName LName $ 15
FullName $ 50;
FMName = scan(Name,2,',');
LName = scan(Name,1,',');
FullName=catx(' ',Title,FMName,LName);
run;

proc print data=labels noobs;
var FullName Address1 Address2;
run;

To finish up the data for the mailing labels, we need to concatenate Title, FMName, and LName. The CATX
function concatenates the variables and inserts a blank between each string. Remember, a new variable created
by an assignment statement that uses the CATX function will have a default length of 200 bytes. The LENGTH
statement assigns a length of 50 bytes to FullName.

2. Submit the code and check the log. Verify that no errors occurred.

3. View the results.

4. Revise the code, as shown below. Add a KEEP= data set option to the labels data so that you keep only the
variables FullName, Address1 and Address2. Add a VAR statement to the PRINT procedure so that the variables
are listed in this order: FullName, Address1, Address2.

data labels(keep=FullName Address1 Address2);
set orion.contacts;
length FMName LName $ 15
FullName $ 50;
FMName = scan(Name,2,',');
LName = scan(Name,1,',');
FullName=catx(' ',Title,FMName,LName);
run;

proc print data=labels noobs;
var FullName Address1 Address2;
run;

5. Re-submit the code and view the results. The data is ready for making mailing labels.

Concatenating Strings Using the Concatenation Operator


In addition to the CATX function, SAS provides a variety of other ways to concatenate strings.

You can use the concatenation operator to join character strings. The concatenation operator is written as two
exclamation points. You can also use two solid vertical bars or two broken vertical bars for the concatenation operator if
the keys are available on your keyboard.

Let's look at an example. Suppose you want to concatenate the values in Area_Code and Number and store them in the
variable Phone. This assignment statement uses the concatenation operator to create the value in Phone.

Let's look at this concatenation piece by piece. The value of Phone is an open parenthesis, concatenated to the value of
Area_Code, concatenated to a close parenthesis and a space, concatenated to the value of Number. The length of the
variable Phone is the sum of the length of each variable and the constants that are used to create the new variable.
Here, the length of Phone is 26. The value of the new variable has embedded blanks because Area_Code contains
trailing blanks. Whenever the value of a character variable does not match the length of the variable, SAS pads the value
with trailing blanks. You can use other SAS functions to remove blanks from strings.
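
The assignment statement described above might be sketched as follows. The lengths of Area_Code and Number are assumptions, chosen so that the length of Phone works out to 26 (1 + 8 + 2 + 15); the values and the data set name phone_demo are also hypothetical.

```sas
data phone_demo;
   length Area_Code $ 8 Number $ 15;   /* assumed lengths   */
   Area_Code='919';                    /* hypothetical data */
   Number='555-0100';
   /* The trailing blanks in Area_Code are kept: (919     ) 555-0100 */
   Phone='(' !! Area_Code !! ') ' !! Number;
run;

/* PROC CONTENTS confirms the length of Phone */
proc contents data=phone_demo;
run;
```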

Removing Blanks Using the TRIM and STRIP Functions


The TRIM and STRIP functions both remove blanks from their argument. In both functions, argument can be a character
constant, variable, or expression. If argument is blank, TRIM returns a blank and STRIP returns a null string.

TRIM removes trailing blanks from its argument. STRIP removes both leading and trailing blanks from its argument. You
can use the TRIM function to remove the trailing blanks from Area_Code before concatenating the string.
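
For example, a minimal sketch with TRIM applied to Area_Code could look like this. The lengths, the values, and the data set name trim_demo are hypothetical.

```sas
data trim_demo;
   length Area_Code $ 8 Number $ 15;   /* assumed lengths   */
   Area_Code='919';                    /* hypothetical data */
   Number='555-0100';
   /* TRIM removes the trailing blanks before the close parenthesis: */
   /* (919) 555-0100                                                  */
   Phone='(' !! trim(Area_Code) !! ') ' !! Number;
run;

proc print data=trim_demo noobs;
run;
```

If Area_Code could also contain leading blanks, STRIP would be the safer choice because it removes blanks from both ends.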

Concatenating Strings Using Other CAT Functions


You've seen how you can use the CATX function to concatenate the values of variables. SAS has a family of CAT functions
that you can use to perform a variety of concatenations. The CAT, CATT, and CATS functions are similar in that they
return a value that is the concatenation of the named strings. For each function, the size of the created variable is 200
bytes unless it's defined by a LENGTH statement. The difference between the three is how they treat leading and trailing
blanks in the concatenated string.

The CAT function does not remove any leading or trailing blanks. The CATT function trims trailing blanks. The CATS
function strips leading and trailing blanks.

Let's look at some examples. First, let's use the CAT function to concatenate the area code and number. Remember, the
CAT function does not remove leading or trailing blanks. You can use character constants in the argument for the
function. Here, a parenthesis is concatenated before and after the area code. The result is similar to using the
concatenation operator except that the length of the new variable is 200 bytes.

The same assignment statement using the CATT function returns this result. CATT trims the trailing blanks from
Area_Code.
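
The two results can be compared side by side in a sketch like the one below. The variable lengths, the values, and the names cat_demo, Phone_Cat, and Phone_Catt are assumptions made for illustration.

```sas
data cat_demo;
   length Area_Code $ 8 Number $ 15;   /* assumed lengths   */
   Area_Code='919';                    /* hypothetical data */
   Number='555-0100';
   /* CAT keeps the trailing blanks in Area_Code: (919     )555-0100 */
   Phone_Cat=cat('(',Area_Code,')',Number);
   /* CATT trims the trailing blanks of every argument: (919)555-0100 */
   Phone_Catt=catt('(',Area_Code,')',Number);
run;

proc print data=cat_demo noobs;
run;
```

Keep in mind that CATT trims trailing blanks from character constants too, so a constant such as ') ' would lose its trailing blank.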

Question

Consider the values in the First, Middle, and Last variables and then match the function to the value it returns. Type the
letter in the box.

cat(First,Middle,Last)          a. SueLFarmer

cats(First,Middle,Last)         b. Sue L Farmer

catx(' ',First,Middle,Last)     c. Sue L Farmer (with each variable's trailing blanks kept between the words)

The correct answers from top to bottom are c, a, b. The CAT function does not remove leading or trailing blanks, so the
trailing blanks in First and Middle remain in the result. The CATS function strips leading and trailing blanks and does not
insert a separator. The CATX function (the last function listed) removes leading and trailing blanks and separates each
string with the specified delimiter.

Finding and Modifying Character Values


At Orion Star, the Internet Sales Group accidentally used the wrong data files for the Orion Star Catalog Web site. They
corrected the problem as soon as they noticed it, but some orders were created with data errors in them.

The clean_up data shows some of the errors. The Product_ID for mittens should have been 005 instead of 002 for the
third group of numbers. Luci is typed incorrectly. The correct word is Lucky. Each word in the Product variable should
start with a capital letter. Product_ID values should have no internal spaces. The correct data set shows what the data
should look like.

Here are the functions you'll use to clean up the data. The first three errors occur in observations that include Mittens as
part of the product value. You use the FIND function to find those observations. Then you use the SUBSTR function to
correct the Product_ID values for those observations. You use the TRANWRD function to fix the misspelled word, and
you use the PROPCASE function to fix the case. Finally, you use the COMPRESS function to remove the blanks in all the
values of Product_ID.

Searching Strings Using the FIND Function


The FIND function searches for a specific substring of characters within a character string that you specify.

The function searches for the first occurrence of the substring and returns the starting position of that substring. If the
substring is not found in the string, FIND returns a value of 0.

Let's look at the syntax for the FIND function. String specifies a character constant, variable, or expression to search in.
Substring specifies the characters to search for in string. If string or substring is a character literal, you must enclose it in
quotation marks.

Optional arguments can be used to modify the search. They can be specified in any order. The modifier I causes the FIND
function to ignore character case during the search. If I is not specified, FIND searches for the same case as the
characters specified in substring. The modifier T trims trailing blanks from string and substring. Start indicates where in
the string to start searching for the substring. It also specifies the direction of the search. A positive value indicates a
forward or right search. A negative value indicates a backward or left search. If you don't specify anything for start, the
search starts at position 1 and moves forward.

Let's look at several different FIND functions and what they return. Here, the variable Text is defined as AUSTRALIA,
DENMARK, and US. For the variable P1, the FIND function searches Text for the first occurrence of US.

Here's a question: What does the function return for the value of P1? The value of P1 is 2 because the first occurrence of
US is in AUSTRALIA. The value of P2 is 20 because the substring includes a leading blank.

Here's a question: What will the value of P3 be? The function returns 0 because lowercase us is not found in the value of
Text. Adding the I modifier causes the function to ignore case, so the function returns 2 to P4.

The next function has the I modifier and also directs the function to begin searching at position 10, so the function
returns 21 to P5.

Finally, the last function begins searching at position 10, and the negative sign specifies that the search goes to the left,
so the function returns 2 to P6.
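
The six functions described above can be reconstructed roughly as follows. The exact arguments are assumptions inferred from the returned values (for example, that P5 and P6 also use the I modifier); the data set name find_demo is hypothetical.

```sas
data find_demo;
   Text='AUSTRALIA, DENMARK, US';
   P1=find(Text,'US');           /* 2:  first US is in AUSTRALIA        */
   P2=find(Text,' US');          /* 20: substring has a leading blank   */
   P3=find(Text,'us');           /* 0:  lowercase us is not found       */
   P4=find(Text,'us','I');       /* 2:  case is ignored                 */
   P5=find(Text,'us','I',10);    /* 21: search starts at 10, goes right */
   P6=find(Text,'us','I',-10);   /* 2:  search starts at 10, goes left  */
run;

proc print data=find_demo noobs;
run;
```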

Now that you know how the FIND function works, let's look at the clean_up data again. You need to find observations
that include the word Mittens in the Product variable, so the function searches Product for the substring Mittens.

Here's a question: Do you need to specify a modifier for the FIND function? In this case, you need the I modifier to make
the search case-insensitive because mittens is in lowercase in at least one observation.

Question
Which of the following FIND functions returns the value 1?

a. find('She sells seashells? Yes, she does.','she')

b. find('She sells seashells? Yes, she does.','she','i')

c. find('She sells seashells? Yes, she does.','she',22)

d. find('She sells seashells? Yes, she does.','she',-10)

The correct answer is b.

The modifier I causes the FIND function to disregard case in the string and match the first instance of the substring that
it finds, beginning from the left.

Replacing Characters in a Value Using the SUBSTR Function


Now that you know how to find the observations that have the substring Mittens in Product, you can use the SUBSTR
function to fix the incorrect values in Product_ID for those observations.

Earlier in this lesson, you learned to use the SUBSTR function on the right side of the assignment statement to extract
characters at a specified position from a string. This time you use the SUBSTR function on the left side of the assignment
statement to replace characters at a specified position in a string.

Let's look at the syntax. String specifies the character variable whose value is to be modified. Start specifies the
starting position where the replacement is to begin. Length specifies the number of characters to replace in string. If you
don't specify length, all characters from the start position to the end of the string are replaced. If you specify a value for
length, it can't be greater than the remaining length of string (including trailing blanks).

Let's look at an example. In the Location variable, the SUBSTR function starts at position 11 and replaces two characters
with the string OH.
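
A minimal sketch of this example follows. The value of Location is an assumption, chosen so that the two characters to replace sit in positions 11 and 12; the data set name substr_demo is hypothetical.

```sas
data substr_demo;
   Location='Columbus, GA 43227';   /* assumed value                   */
   /* SUBSTR on the left side replaces 2 characters from position 11  */
   substr(Location,11,2)='OH';      /* Columbus, OH 43227              */
run;

proc print data=substr_demo noobs;
run;
```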

Question
Select the statement that replaces the first three characters in values of Dept with the characters PUR.

a. Dept=substr(PUR,1,3);

b. substr(Dept,1,3)='PUR';

c. PUR=substr(Dept,1,3);

d. none of these

The correct answer is b. The SUBSTR function replaces variable values when it is used on the left side of an assignment
statement. This statement replaces the first three characters in Dept with PUR.

Putting It Together


Now that you know how to find and replace characters in a string, you can use an IF-THEN statement and a DO group to
fix the Product_ID values for mittens.

Let's take a look at the DATA step. In the IF statement, the FIND function searches for the string Mittens in the values of
Product. If the FIND function returns a value greater than zero, then the DO group executes. In the DO group, the
SUBSTR function replaces one character in position 9 in Product_ID with a 5.

This DATA step accomplishes the first task of fixing the clean_up data.

Finding and Replacing a Character String


In this demonstration, you run the DATA step that uses the FIND and SUBSTR functions.

1. Copy and paste the following code into the editor.

data correct;
set orion.clean_up;
if find(Product,'Mittens','I')>0 then do;
substr(Product_ID,9,1) = '5';
end;
run;

proc print data=correct;
run;

2. Submit the code and check the log. Verify that no errors occurred.

3. View the results. As you can see, the product IDs now contain the appropriate 005 code in the middle for the
observations that contain Mittens.

Replacing Characters in a Value Using the TRANWRD Function


Remember, another problem with the clean_up data was the misspelled word, Luci instead of Lucky. You can use the
TRANWRD function to replace the misspelled word. The TRANWRD function replaces or removes all occurrences of a
given word (or a pattern of characters) within a character string. It is different from the SUBSTR function because you
don't have to specify the position of the string you want to replace.

Let's look at the syntax for the function. Source specifies the source string that you want to change. Target specifies the
string to search for in source and replacement specifies the string that replaces target. Target and replacement can be
variables or character strings. If you specify character strings, be sure to enclose the strings in quotation marks. New
variables created by the function have a default length of 200 bytes if a LENGTH statement is not used.

In the first example, the TRANWRD function looks in Product for Small and replaces it with Medium. The value is
assigned to Rename1.

In the second example, the TRANWRD function looks in Product for Red and replaces it with purple. Notice that purple is
lowercase because it was specified in lowercase in the function arguments.

In the third example, the TRANWRD function looks in Product for lowercase red. In Product, Red has an initial cap, so
lowercase red does not match and the string is not replaced with blue as specified in the function arguments.

Keep in mind that when you specify the target, it must match the case of the values in the variable. In this example, the
TRANWRD functions were used to create the new variables Rename1 through Rename3. When the TRANWRD function
is used in an assignment statement that creates a new variable, the default length of the variable is 200 bytes.
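
The three examples might look like the following sketch. The value of Product is hypothetical (a 13-character value that contains Small and Red), and so is the data set name tranwrd_demo.

```sas
data tranwrd_demo;
   Product='Small Red Hat';                     /* hypothetical value     */
   Rename1=tranwrd(Product,'Small','Medium');   /* Medium Red Hat         */
   Rename2=tranwrd(Product,'Red','purple');     /* Small purple Hat       */
   /* Lowercase red does not match Red, so nothing is replaced */
   Rename3=tranwrd(Product,'red','blue');       /* Small Red Hat          */
run;

proc print data=tranwrd_demo noobs;
run;
```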

Let's see what happens if you use the TRANWRD function to replace an existing string in a variable. In this code, the
TRANWRD function replaces the string Red with the string Purple and assigns it to Product. Product was originally
created in this assignment statement and assigned a length of 13. The new value is 16 characters, so the value is
truncated. If you use the TRANWRD function to replace an existing string with a longer string, you may need to use a
LENGTH statement so that the value is not truncated.
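
The sketch below contrasts the truncated result with the result you get when a LENGTH statement comes first. The value of Product, the length of 20, and the data set names are assumptions made for illustration.

```sas
/* Product gets a length of 13 from its first assignment, */
/* so the 16-character result is cut off                  */
data truncated;
   Product='Small Red Hat';
   Product=tranwrd(Product,'Red','Purple');   /* stored as: Small Purple */
run;

/* A LENGTH statement before the first assignment leaves room */
data not_truncated;
   length Product $ 20;
   Product='Small Red Hat';
   Product=tranwrd(Product,'Red','Purple');   /* Small Purple Hat */
run;
```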

Now you can use the TRANWRD function to replace the misspelled word Luci in the clean_up data. Because the
misspellings all occur in the string Lucky Knit Mittens, you can add the function to the DO group that you wrote earlier. If
the FIND function finds the string Mittens, the TRANWRD function looks in Product for the string Luci and a space and
replaces it with the string Lucky and a space.

You also want all values of Product in proper case. You can use the PROPCASE function to fix this. Here's a question: Will
all the values of Product be fixed by the assignment statement shown here? No. Not all of the lowercase characters that
need to be fixed are in observations that contain Mittens, so you need to put this assignment statement outside the DO
group.

Question
Select the assignment statement that changes the string Burma to Myanmar in values of the variable Country. Assume
that Country has a length of 25 bytes.

a. Country=tranwrd('Burma','Myanmar');

b. Country=tranwrd(Country,Burma,Myanmar);

c. Country=tranwrd(Country,'Burma','Myanmar');

The correct answer is c. With the TRANWRD function, first you specify the variable, then the string to search for in
quotes, and then the string to use for replacement.

Removing Characters in a Value Using the COMPRESS Function


The last thing you need to do to clean up the data is to remove the blanks in Product_ID. For this, you can use the
COMPRESS function. The COMPRESS function removes the characters listed in the chars argument from the source. If no
characters are specified, the COMPRESS function removes all blanks from the source. If the function creates a new
variable, the new variable has the same length as the source.

Let's look at a few examples of the COMPRESS function. In the first example, only the source, ID, is specified, so the blanks
are removed and the new value is assigned to New_ID1.

In the second example, a hyphen is specified, so the hyphen is removed and the new value is assigned to New_ID2. As in
other functions, if you specify a character to remove in the arguments, the default blank is no longer in effect.

In the third example, the blank and hyphen are both specified and both blanks and the hyphen are removed from the
value. The new value is assigned to New_ID3.
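
Here is a minimal sketch of the three examples. The value of ID is hypothetical, and so is the data set name compress_demo.

```sas
data compress_demo;
   ID='20 01-005 024';           /* hypothetical value              */
   New_ID1=compress(ID);         /* blanks removed:  2001-005024    */
   New_ID2=compress(ID,'-');     /* hyphen removed:  20 01005 024   */
   New_ID3=compress(ID,' -');    /* both removed:    2001005024     */
run;

proc print data=compress_demo noobs;
run;
```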

Putting It All Together


You're almost done with the DATA step to fix the clean_up data. You just need to add the COMPRESS function to take
the blanks out of all values of Product_ID.

Replacing Strings and Removing Blanks from Character Values

In this demonstration, you replace strings and remove blanks for character values so that you can clean up the
data.

1. If necessary, copy and paste the following code into the editor. If you still have the code from the last
demonstration in the editor, you only need to add the code shown in bold to your existing code.

data correct;
set orion.clean_up;
if find(Product,'Mittens','I')>0 then do;
substr(Product_ID,9,1) = '5';
Product=tranwrd(Product,'Luci ','Lucky ');
end;
Product=propcase(Product);
Product_ID=compress(Product_ID);
run;

proc print data=correct;
run;

The TRANWRD function replaces the misspelled word, Luci, with the correct word, Lucky. The PROPCASE
function changes the case of values in the Product variable. The COMPRESS function removes blanks from values
in the Product_ID variable.

2. Submit the code and view the log. Verify that no errors occurred.

3. View the results. You've been able to use character functions to fix all the problems in the clean_up data.

Summary: Manipulating Character Values

This summary contains topic summaries, syntax, and sample programs.

Topic Summaries

Using SAS Functions


A SAS function is a routine that performs a calculation on, or a transformation of, the arguments listed in parentheses
and returns a value.

target-variable=function-name(<argument-1><,argument-n>)

A target variable is a variable to which the result of the function is assigned. If the target variable is a new variable, the
type and length are determined by the expression on the right side of the equals sign. If the expression uses a function
whose result is numeric, then the target variable is numeric with a length of 8 bytes. If the expression uses a function
whose result is character, then the target variable is character and the length is determined by the function.

Extracting and Transforming Character Values


Function Purpose

var=SUBSTR(string,start<,length>) On the right side of an assignment statement, the SUBSTR function extracts a
substring of length characters from a character string, starting at a specified
position in the string.

LENGTH(argument) The LENGTH function returns the length of a character string, excluding trailing
blanks.

RIGHT(argument) The RIGHT function right-aligns a value. If there are trailing blanks, they are
moved to the beginning of the value.

LEFT(argument) The LEFT function left-aligns a character value. If there are leading blanks, they are
moved to the end of the value.

CHAR(string,position) The CHAR function returns a single character from a specified position in a
character string.

PROPCASE(argument<,delimiter(s)>) The PROPCASE function converts all letters in a value to proper case.

UPCASE(argument) The UPCASE function converts all letters in a value to uppercase.

LOWCASE(argument) The LOWCASE function converts all letters in a value to lowercase.

Separating and Concatenating Character Values


You use the SCAN function when you know the relative order of words but their starting positions vary. You use the
SUBSTR function when you know the exact position of the string that you want to extract from a character value.

Function Purpose

SCAN(string,n<,'delimiter(s)'>) The SCAN function separates a character value into words and returns the nth word.

CATX(separator,string-1,…,string-n) The CATX function removes leading and trailing blanks, inserts separators, and returns a
concatenated character string.

NewVar=string-1 !! string-2; The concatenation operator joins character strings.

TRIM(argument) The TRIM function removes trailing blanks from a character string.

STRIP(argument) The STRIP function removes leading and trailing blanks from a character string.

CAT(string-1,…,string-n)
CATT(string-1,…,string-n)
CATS(string-1,…,string-n) These functions return concatenated character strings. The CAT function does not remove
any leading or trailing blanks. The CATT function trims trailing blanks. The CATS function
strips leading and trailing blanks.

Finding and Modifying Character Values


Function Purpose

FIND(string,substring<,modifiers,start>) The FIND function searches for a specific substring of characters within a
character string. The function returns the starting position of the first
occurrence of the substring. If the substring is not found in the string, FIND
returns a value of 0.

SUBSTR(string,start<,length>)=value; On the left side of the assignment statement, the SUBSTR function replaces
length characters at a specified position with value.

TRANWRD(source,target,replacement) The TRANWRD function replaces or removes all occurrences of a given word (or
a pattern of characters) within a character string.

COMPRESS(source<,chars>) The COMPRESS function removes the characters listed in the chars argument
from the source. If no characters are specified, the COMPRESS function
removes all blanks from the source.

Sample Programs
Extracting and Transforming Character Values

data charities(drop=Code_Rt);
length ID $ 5;
set orion.biz_list;
Code_Rt=right(Acct_Code);
if char(Code_Rt,6)='2';
ID=left(substr(Code_Rt,1,5));
run;

data charities;
length ID $ 5;
set orion.biz_list;
if substr(Acct_Code,length(Acct_Code),1)='2';
ID=substr(Acct_Code,1,length(Acct_Code)-1);
Name=propcase(Name);
run;

data charities(drop=Len);
length ID $ 5;
set orion.biz_list;
Len=length(Acct_Code);
if substr(Acct_Code,Len,1)='2';
ID=substr(Acct_Code,1,Len-1);
Name=propcase(Name);
run;

Separating and Concatenating Character Values

data labels;
set orion.contacts;
length FMName LName $ 15
FullName $ 50;
FMName=scan(Name,2,',');
LName=scan(Name,1,',');
FullName=catx(' ',title,fmname,lname);
run;

Finding and Modifying Character Values

data correct;
set orion.clean_up;
if find(Product,'Mittens','I')>0 then do;
substr(Product_ID,9,1) = '5';
Product=tranwrd(Product,'Luci ','Lucky ');
end;
Product=propcase(Product);
Product_ID=compress(Product_ID);
run;

Lesson 5: Manipulating Numeric Values


You know that there are many categories of SAS functions, and you've learned how SAS functions can be used to modify
the values of character variables. SAS also provides functions to create or modify numeric values. In this lesson you'll
learn how to use several descriptive statistics functions, several truncation functions, and several special functions that
enable you to convert variables from one type to another.

Objectives

In this lesson, you learn to do the following:

 use SAS functions to compute descriptive statistics of numeric values
 work with SAS variable lists
 use SAS functions to truncate numeric values
 explain the automatic conversion that SAS uses to convert values between data types
 explicitly convert values between data types

Using Descriptive Statistics Functions


The SAS data set orion.employee_donations contains data about charitable contributions that employees made
throughout the year.
Suppose you need to create a new data set, named orion.donation_stats, that contains each employee's total donation
for the year, the average donation for the quarters that the employee made a donation, and the number of quarters
that the employee made a donation.

Descriptive Statistics Functions


You can use descriptive statistics functions to easily calculate the values needed for the donation_stats data set.

The descriptive statistics functions include the following: SUM returns the sum of the nonmissing arguments, MEAN returns
the arithmetic mean (average) of the nonmissing arguments, MIN returns the smallest value from the arguments, MAX returns
the largest value from the arguments, N returns the number of nonmissing arguments, NMISS returns the number of
missing numeric arguments, and CMISS returns the number of missing numeric or character arguments.

These are just a few of the many statistical functions that are available in SAS. Remember that a SAS function is a routine
that performs a calculation on, or a transformation of, the arguments listed in parentheses and returns a value. The
descriptive statistics functions share the same general syntax. The function name is followed by numeric arguments in
parentheses.

In this example, the SUM function returns the sum of the variables Qtr1 through Qtr4 and assigns the value to the
variable Total. Notice that the arguments are separated by commas.
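
A minimal sketch of the kind of statement this paragraph describes. The data set name sum_demo and the quarterly values are made up for illustration; they are not taken from orion.employee_donations.

```sas
/* Hypothetical one-row data set; the Qtr values are invented for illustration */
data sum_demo;
   Qtr1=10; Qtr2=.; Qtr3=25; Qtr4=15;
   /* The arguments are separated by commas; the missing Qtr2 is ignored */
   Total=sum(Qtr1,Qtr2,Qtr3,Qtr4);
   put Total=;   /* writes the name and value of Total to the log */
run;
```

Because SUM adds only the nonmissing arguments, Total here is 10+25+15, or 50.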

Working with SAS Variable Lists


Let's take another look at this example. In this case there are only four variables: Qtr1 through Qtr4. But what if you
need the sum of more variables? For example, Qtr1 through Qtr40? You could list all the variables in the function, but it
would be easier to use a variable list as the argument to this function.

A SAS variable list is an abbreviated method of referring to a group of variable names. When you use a SAS variable list
in a function, you use the keyword OF in front of the first variable name in the list. Be sure to include the OF. If you omit
this keyword, SAS won't interpret the function arguments correctly.

In this example, if you don't include the OF, SAS treats the hyphen as a minus sign, so the function returns the value of Qtr1 minus Qtr40.
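
The difference can be sketched with a shorter list. This example assumes four made-up quarterly values and a placeholder data set name, of_demo; it uses Qtr1 through Qtr4 rather than Qtr1 through Qtr40 to stay brief.

```sas
/* Hypothetical values; of_demo is a placeholder data set name */
data of_demo;
   Qtr1=10; Qtr2=20; Qtr3=30; Qtr4=40;
   WithOf=sum(of Qtr1-Qtr4);     /* variable list: Qtr1+Qtr2+Qtr3+Qtr4 */
   WithoutOf=sum(Qtr1-Qtr4);     /* single argument: Qtr1 minus Qtr4 */
   put WithOf= WithoutOf=;
run;
```

Comparing the two values in the log is a quick way to confirm whether a function call was read as a variable list or as a subtraction.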

Types of Variable Lists


SAS enables you to use several types of variable lists: numbered range, name range, name prefix, and special SAS name
lists. Let's take a look at each type of list.

You can use a numbered range list to specify all variables from x1 to xn, inclusive. The starting variable and ending
variable are separated by a hyphen. You can begin the range with any number and end with any number as long as the
numbers are consecutive. In this example, the values of Qtr1, Qtr2, Qtr3, and Qtr4 are included in the calculation. The
value of Var1 is not included.

You can use a name range list to specify all variables ordered as they are in the program data vector, from x to a
inclusive. The starting and ending variables are separated by two dashes. Here, the values of Qtr1, Second, Q3, and
Fourth are included in the calculation. The value of Var2 is not included. You can also use a name range list to specify all
numeric variables from x to a inclusive (x-numeric-a) or all character variables from x to a inclusive (x-character-a).

You can use a name prefix list to specify all the variables that begin with the same string, such as sum(of Tot:). Tot:
indicates the starting prefix for the variable names to be included in the calculation. In this example, the values of
TotJan, TotFeb, and TotMar are included in the calculation. The values of Qtr2 and Qtr3 are not included in the calculation.

Special SAS name lists enable you to use the keyword _ALL_ to specify all of the variables that are already defined in the
current DATA step. You can also use the keyword _NUMERIC_ to specify all of the numeric variables that are already
defined in the current DATA step or the keyword _CHARACTER_ to specify all of the character variables that are already
defined in the current DATA step. Here, the values of the numeric variables Qtr1, Q2, Q3, and Fourth are included in the
calculation. The values of the character variable, Name, are not included.
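
Three of the list types can be sketched in one DATA step. The data set name list_demo and all values are assumptions made for illustration; the variable names echo the ones mentioned above. The order in which the variables are created determines their order in the program data vector, which is what the name range list relies on.

```sas
/* Hypothetical data; variables are created in the order they should appear in the PDV */
data list_demo;
   Qtr1=1; Second=2; Q3=3; Fourth=4;
   Qtr2=10; Qtr3=20; Qtr4=30;
   TotJan=100; TotFeb=200; TotMar=300;
   NumRange=sum(of Qtr1-Qtr4);       /* numbered range: Qtr1, Qtr2, Qtr3, Qtr4 */
   NameRange=sum(of Qtr1--Fourth);   /* name range: Qtr1, Second, Q3, Fourth */
   Prefix=sum(of Tot:);              /* name prefix: TotJan, TotFeb, TotMar */
   put NumRange= NameRange= Prefix=;
run;
```

Note that the numbered range and the name range both start at Qtr1 but select different variables: the first follows the numeric suffixes, and the second follows the position of the variables in the program data vector.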

Using Descriptive Statistics Functions

In this demonstration you use descriptive statistics functions to generate data for the donation_stats data set.

1. Copy and paste the following code into the editor. Remember that the donation_stats data set needs to contain
each employee's total donation for the year, the average quarterly donation for each employee, and the number
of quarters that the employee made a donation.

data donation_stats;
set orion.employee_donations;
keep Employee_ID Total AvgQT NumQT;
Total = sum(of Qtr1-Qtr4);
AvgQT = mean(of Qtr1-Qtr4);
NumQt = n(of Qtr1-Qtr4);
run;

proc print data=donation_stats;
run;

The SUM function calculates the total donation for the year and the MEAN function calculates the average
quarterly donation. The average quarterly donation will be calculated using the number of quarters in which an
employee made a contribution. For example, if an employee contributed during only two quarters, the value of
AvgQT will be calculated from just those two quarters. The missing quarter values will be ignored. The N
function calculates the number of quarters that the employee made a donation.

The variables are specified using a numbered range list, but you could specify the variables other ways. For
example, you could list all of the variables separated by commas. The KEEP statement keeps only the ID variable
and the new variables that the DATA step creates.

2. Submit the code and check the log. Notice that 124 observations were read in from employee_donations.

3. View the results. The functions generated the statistics you need. donation_stats contains each employee's total
donation for the year, the average quarterly donation for each employee, and the number of quarters that the
employee made a donation.

Question
Which of the following assignment statements correctly calculates the average of Rest1, Rest2, Rest3, and Rest4?

a. RestAvg=mean of Rest1-Rest4;

b. RestAvg=mean(Rest1 Rest2 Rest3 Rest4);

c. RestAvg=sum(Rest1,Rest2,Rest3,Rest4);

d. RestAvg=mean(Rest1,Rest2,Rest3,Rest4);

The correct answer is d. The MEAN function calculates the arithmetic mean (average) of the arguments. The arguments
must be enclosed in parentheses and separated by commas.

Truncating Numeric Values


You've seen how you can use descriptive statistics functions to calculate the values that you need for the donation_stats
data set. Suppose you want to modify the output so that the values of AvgQT are rounded to the nearest dollar.

Rounding Values
You can use the ROUND function to round values. ROUND is one of the functions that you can use to truncate numeric
variable values. The truncation functions also include the CEIL, FLOOR, and INT functions. Let's start by taking a look at
the ROUND function, and then we'll explore these other truncation functions.

The ROUND function returns a value rounded to the nearest multiple of the round-off unit. The keyword ROUND is
followed by an argument in parentheses. The argument must be a number or a numeric expression. The round-off unit
must be numeric and positive. If you don't specify a round-off unit, the argument is rounded to the nearest integer. For
example, this assignment statement rounds the values of TotalSales to the nearest integer.

Let's take a look at some examples. This program uses assignment statements to create four variables. Each assignment
statement uses a ROUND function. The first assignment statement rounds the value 12.12. Because no round-off unit is
specified, the function rounds the value of NewVar1 to the nearest integer, which is 12. The second assignment
statement rounds the value of 42.65. Here, the round-off unit is .1. So the function rounds the value of NewVar2 up to
the nearest tenth, which is 42.7.

Now, let's take a look at what happens when you round a negative value. The third assignment statement rounds the
value of -6.478. Because no round-off unit is specified, the function rounds the value of NewVar3 to the nearest
integer, which is -6.

Here's a question: What is the value of NewVar4? The fourth statement rounds the value of 96.47. Here the round-off
unit is 10. So the function rounds the value of NewVar4 to the nearest multiple of 10, which is 100.

So far, we've used multiples of 10 or the default for the round-off unit, but you can use other multiples as well. The next
assignment statement rounds the value of 12.69. Here, the round-off unit is .25. So the function rounds the value of
NewVar5 to the nearest multiple of .25, which is 12.75. The last assignment statement rounds the value of 42.65. In this
example the round-off unit is .5. So the function rounds the value of NewVar6 to the nearest multiple of .5, which is
42.5.
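
The six assignment statements described above can be sketched as a single DATA step. The data set name round_demo is a placeholder; the arguments, round-off units, and results in the comments are the ones stated in the text.

```sas
/* round_demo is a placeholder name; values come from the discussion above */
data round_demo;
   NewVar1=round(12.12);        /* nearest integer: 12 */
   NewVar2=round(42.65,.1);     /* nearest tenth: 42.7 */
   NewVar3=round(-6.478);       /* nearest integer: -6 */
   NewVar4=round(96.47,10);     /* nearest multiple of 10: 100 */
   NewVar5=round(12.69,.25);    /* nearest multiple of .25: 12.75 */
   NewVar6=round(42.65,.5);     /* nearest multiple of .5: 42.5 */
   put (NewVar1-NewVar6) (=);   /* write each name and value to the log */
run;
```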

Exploring the CEIL, FLOOR, and INT Functions


The CEIL, FLOOR, and INT functions are closely related to the ROUND function. To use one of these functions, you specify
the corresponding function name followed by the argument in parentheses. In this example, the keyword CEIL is
followed by the argument TotalSales.

Let's take a closer look at the CEIL, FLOOR, and INT functions. As you work with these functions, you might find it helpful
to think about a number line. The CEIL function returns the smallest integer greater than or equal to the argument. So, if
the argument is 4.4, the CEIL function returns a value of 5 for x.

The FLOOR function returns the greatest integer less than or equal to the argument. In this example the argument is 3.6.
So, the FLOOR function returns a value of 3 for y.

The INT function returns the integer portion of the argument. In this example the argument is 3.9. So, the
INT function returns a value of 3 for z, even though the value of the argument is almost 4.

Let's look at a few more examples. This DATA step creates four variables: Var1, CeilVar1, FloorVar1, and IntVar1. Var1 is
assigned a value of 6.478 and is used as the argument for each function. The CEIL function returns a value of 7 for
CeilVar1. This is the smallest integer greater than or equal to the argument. The FLOOR function returns a value of 6 for
FloorVar1. This is the greatest integer less than or equal to the argument. The INT function returns a value of 6 for
IntVar1. This is the integer portion of the argument.

Here's a question: Given the same value as an argument, do the INT and the FLOOR functions always return the same
result? For values greater than or equal to 0, FLOOR and INT return the same value. For values less than 0, for example
-6.478, CEIL and INT return the same value.
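
A short sketch that runs the three functions on both the positive and the negative value discussed above. The data set name truncate_demo is a placeholder.

```sas
/* truncate_demo is a placeholder name; the DO loop runs once for each listed value */
data truncate_demo;
   do Var1=6.478, -6.478;
      CeilVar1=ceil(Var1);     /* smallest integer >= Var1:  7, then -6 */
      FloorVar1=floor(Var1);   /* greatest integer <= Var1:  6, then -7 */
      IntVar1=int(Var1);       /* integer portion of Var1:   6, then -6 */
      put Var1= CeilVar1= FloorVar1= IntVar1=;
   end;
run;
```

For 6.478, INT matches FLOOR; for -6.478, INT matches CEIL, because INT always drops the fractional part and so moves toward zero.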

Using the ROUND Function

In this demonstration you round the values of AvgQT in the donation_stats data set.

1. If necessary, copy and paste the following code into the editor. If you still have the code from the last
demonstration in the editor, you only need to modify the second assignment statement, as shown in bold.

data donation_stats;
set orion.employee_donations;
keep Employee_ID Total AvgQT NumQT;
Total = sum(of Qtr1-Qtr4);
AvgQT = round(mean(of Qtr1-Qtr4));
NumQt = n(of Qtr1-Qtr4);
run;

proc print data=donation_stats;
run;

Remember that you need to round the values of AvgQT in the donation_stats data set. Notice that the
MEAN function is nested within the ROUND function. You can nest a function as long as the function
that is used as the argument meets the requirements for the argument.

2. Submit the code and check the log. Verify that no errors occurred.

3. View the results. Notice the rounded values for AvgQT.

Question
Which program changes the values of the variable Examples, found in the data set Before, to the values shown in After?

Before        After
Examples      Examples

326.54        327
98.2          98
-32.66        -33
1401.75       1402

a.

data after;
set before;
Examples=int(Examples);
run;

b.

data after;
set before;
Examples=round(Examples);
run;

c. both a and b

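The correct answer is b. The values shown in After have been rounded to the nearest integer. This is done by using the
ROUND function without a round-off unit. The INT function would instead return only the integer portion of each value,
such as 326 for 326.54.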

Converting Values Between Data Types


Orion Star has just acquired a small marketing firm. You’ve been asked to convert the firm’s personnel data into a data
set that can be easily transferred to Orion’s Human Resources system.

A sample of the marketing firm’s data is stored in the SAS data set orion.convert. To successfully transfer the data, you
need to convert the character values in ID, GrossPay, and Hired into numeric values. You also need to convert the
numeric values in Code into character values.

Automatic Character-to-Numeric Conversion


Data types can be converted two ways. The data can be converted automatically, by allowing SAS to do it for you. The
data can also be converted explicitly, using SAS functions. Let's first take a look at automatic character-to-numeric
conversion.

By default, if you reference a character variable in a numeric context such as an arithmetic operation, SAS tries to
convert the variable values to numeric. Specifically, automatic character-to-numeric conversion occurs when a character
value is assigned to a previously defined numeric variable, such as the numeric variable Rate. Automatic conversion also
occurs when a character value is used in an arithmetic operation, compared to a numeric value using a comparison
operator, or specified in a function that requires numeric arguments.

The automatic conversion uses the w. informat, where w is the width of the character value that is being converted.
Automatic conversion produces a numeric missing value from any character value that does not conform to standard
numeric notation. For example, a character value that contains commas is converted to missing.

Let's revisit our business task. To avoid conflicts with the existing Orion Star employee IDs, we need to add 11000 to the
values of the character variable ID. We can use an assignment statement to create the variable EmpID, which will be the
current value of ID plus 11000.

Here's a question: What will happen to the character values of ID when this DATA step executes? Because we're using
the character variable ID in an arithmetic operation, SAS automatically converts the character values of ID to numeric
values so that the calculation can occur. This conversion is completed by creating a temporary numeric value for each
character value of ID. This temporary value is used in the calculation.

The character values of ID are not replaced by numeric values. After a variable’s type is established, it can't be changed.
Whenever data is automatically converted, a message is written to the SAS log stating that the conversion has occurred.
The conversion note indicates a line and column number for the location in the log where the conversion occurred. In
this example the statement that caused the conversion can be found on line 31 of the log.
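
A minimal sketch of automatic character-to-numeric conversion. The data set name auto_demo and the ID value '36' are assumptions made for illustration; they are not taken from orion.convert.

```sas
/* Hypothetical data: ID is a character variable that holds digits */
data auto_demo;
   length ID $ 2;
   ID='36';
   /* ID is used in arithmetic, so SAS converts it to a temporary numeric value */
   EmpID=ID+11000;
   put ID= EmpID=;   /* ID is still character; EmpID is numeric (11036) */
run;
```

When this step runs, the log should contain a note reporting that character values were converted to numeric values, along with the line and column where the conversion took place.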

Explicit Character-to-Numeric Conversion


Now let's consider another character variable in the convert data set. We need to create a new variable, named Bonus.
The value of Bonus should be 10% of GrossPay.

Here's another question: Can you use automatic conversion to create Bonus? Automatic conversion won't work in this
situation. The values of GrossPay contain commas, which can't be converted by the w. informat. So, the values of Bonus
are set to missing. Since automatic conversion won't work, you need to explicitly convert the values in GrossPay.

You can use the INPUT function to explicitly convert character values to numeric values. The INPUT function returns the
value produced when the source, which contains a SAS character expression, is read with a specified informat. Here's an
example of the INPUT function. The function uses the numeric informat COMMA9. to read the values of the character
variable SaleTest. Then the resulting numeric values are stored in the variable Test.

When you use the INPUT function, be sure to select a numeric informat that can read the form of the values. You must
also specify an informat width that is wide enough to read the entire value. A width equal to the length of the character
variable that you need to convert is sufficient. To create
Bonus, you can use the numeric informat COMMA6. The comma informat can read the form of the values of GrossPay. A
width of 6 is specified on the informat since the length of GrossPay is 6. As you saw with automatic conversion, the
resulting new variable, Bonus, is numeric. However, GrossPay remains a character variable.
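
The contrast between automatic and explicit conversion can be sketched as follows. The data set name input_demo and the value '52,000' are made up for illustration; only the COMMA6. informat and the 10% calculation come from the text.

```sas
/* Hypothetical data: GrossPay is a 6-character value that contains a comma */
data input_demo;
   length GrossPay $ 6;
   GrossPay='52,000';
   AutoBonus=GrossPay*.10;              /* automatic conversion cannot read the comma: missing */
   Bonus=input(GrossPay,comma6.)*.10;   /* explicit conversion reads 52000, so Bonus is 5200 */
   put AutoBonus= Bonus=;
run;
```

The AutoBonus assignment is included only to show the failure. It should produce an invalid numeric data note in the log, which is the signal to switch to the INPUT function.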

Converting an Existing Variable to Another Type


Suppose though that you do want to convert GrossPay to a numeric variable. You might think about using an
assignment statement with an INPUT function to perform this task. But that won't work. Remember that after a
variable’s type is established, it can't be changed. But, by following a few steps, you can get a new variable with the
same name and a different type.

The first step is to use the RENAME= data set option to rename the variable you want to convert. In this example,
GrossPay is renamed CharGross. The next step is to use the INPUT function in an assignment statement to create a new
variable with the original name of the variable you just renamed. You can then use a DROP= data set option in the DATA
statement to exclude CharGross from the output data set.
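
The three steps can be sketched as follows. The data sets convert_demo and hrdata_demo and the value '52,000' are hypothetical stand-ins, used so that the example runs without the orion library; the names GrossPay and CharGross follow the text.

```sas
/* Hypothetical stand-in for the input data: GrossPay is character */
data convert_demo;
   length GrossPay $ 6;
   GrossPay='52,000';
run;

data hrdata_demo(drop=CharGross);                      /* step 3: drop the renamed character variable */
   set convert_demo(rename=(GrossPay=CharGross));      /* step 1: rename GrossPay as it is read */
   GrossPay=input(CharGross,comma6.);                  /* step 2: create a numeric GrossPay */
run;
```

Because the RENAME= option is applied as the data set is read, the name GrossPay is free by the time the assignment statement executes, so the new GrossPay is created as a numeric variable.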

Automatic Numeric-to-Character Conversion


The automatic conversion of numeric data to character data is very similar to character-to-numeric conversion. Numeric
data values are converted to character values whenever they are used in a character context.

For example, the numeric values of the variable Site are converted to character values if you assign the variable to a
previously defined character variable such as SiteCode, use the variable with an operator that requires a character value
(such as the concatenation operator), or use the variable in a function that requires character arguments (such as the
SUBSTR function).

Remember that the orion.convert data set contains a numeric variable Code (area code) and a character variable Mobile
(mobile telephone number). To convert the data, you need to create a character variable, Phone, that contains the area
code in parentheses followed by the mobile telephone number. Since SAS automatically converts a numeric value to a
character value when you use the concatenation operator, you might try concatenating the value of Code with
parentheses and the value of Mobile. If you submit this code, SAS automatically converts the numeric values in Code
into character values. Notice that numeric-to-character conversion causes a message to be written to the SAS log
indicating that the conversion has occurred.

The conversion worked. But, the PROC PRINT output shows something interesting. Notice that SAS inserted the extra
blanks – nine to be exact – before the area code. To understand why the extra blanks were inserted, it's helpful to know
how automatic numeric-to-character conversion works.

SAS writes the numeric value with the BEST12. format, and the resulting character value is right-aligned. This conversion
occurs before the value is assigned or used with any operator or function. The values for the numeric variable, Code,
have fewer than 12 digits. So, when automatic conversion occurs, the resulting character value will have leading blanks.
Leading blanks might cause problems when you perform an operation or function.
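
A minimal sketch of the behavior described above. The data set name phone_auto and the Code and Mobile values are made up for illustration.

```sas
/* Hypothetical data: Code is numeric, Mobile is character */
data phone_auto;
   length Mobile $ 8;
   Code=303;
   Mobile='393-0956';
   /* Code is converted with BEST12., so nine blanks precede the three digits */
   Phone='(' !! Code !! ') ' !! Mobile;
   put Phone=;
run;
```

The log should also contain a note reporting that numeric values were converted to character values.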

Explicit Numeric-to-Character Conversion


To convert the values of Code without adding leading blanks, you can use the PUT function. The PUT function enables
you to explicitly control the numeric-to-character conversion using a format.

The syntax for the PUT function is shown here. Source indicates the numeric variable, constant, or expression to be
converted to a character value. A format matching the data type of the source must also be specified. So, you can use
the PUT function to convert the values of Code to character using the 3. numeric format. The values of Code can then be
concatenated with parentheses and the values of Mobile.

In the previous lesson, you learned about concatenation functions. You can use the CAT function as an alternative to the
PUT function. Remember that the CAT function returns a value that is the concatenation of the named strings. This
assignment using the CAT function produces the same results as this assignment using the PUT function. If you use the
CAT function, no note is written to the log.
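
Both explicit approaches can be sketched side by side. The data set name phone_explicit and the Code and Mobile values are assumptions made for illustration; the general form of the PUT function is PUT(source, format).

```sas
/* Hypothetical data: Code is numeric, Mobile is character */
data phone_explicit;
   length Mobile $ 8;
   Code=303;
   Mobile='393-0956';
   PhonePut='(' !! put(Code,3.) !! ') ' !! Mobile;   /* PUT writes Code with the 3. format */
   PhoneCat=cat('(',Code,') ',Mobile);               /* CAT converts Code without leading blanks */
   put PhonePut= / PhoneCat=;
run;
```

Neither assignment relies on automatic conversion, so neither should add a conversion note to the log.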

Converting Data Values between Types

In this demonstration you use several data conversion techniques to convert the data in orion.convert.

1. Copy and paste the following DATA step into the editor.

data hrdata;
keep EmpID GrossPay Bonus Phone HireDate;
set orion.convert;
EmpID = input(ID,2.)+11000;
Bonus = input(GrossPay,comma6.)*.10;
Phone = '(' !! put(Code,3.) !! ') ' !! Mobile;
HireDate = input(Hired,mmddyy10.);
run;

2. Submit the DATA step and check the log. Verify that no errors occurred.

3. Copy and paste the following PROC PRINT step into the editor.

proc print data=hrdata;
format HireDate mmddyy10.;
run;

The FORMAT statement specifies that the MMDDYY10. format is applied to the values of HireDate.

4. Submit the PROC PRINT step and view the results. Notice that the values of HireDate appear in the form
month/day/year with a 4-digit year.

Question
Which statement correctly describes the result of submitting the DATA step that is shown below? The variables TotalPay
and Commission are numeric. The variable Pay is character.

data reg.newsales;
set reg.sales;
TotalPay=Pay+Commission;
run;

a. The values of the numeric variables TotalPay and Commission are converted to character values because these
variables are used in a character context.

b. The values of the character variable Pay are converted to numeric values because this variable is used in a
numeric context.

The correct answer is b. The character variable Pay appears in an arithmetic operation, so its values are converted to
numeric to complete the calculation.

Question
A typical value for the character variable Target is 123,456. Which statement correctly converts the values of Target to
numeric values when creating the variable TargetNo?

a. TargetNo=input(Target,comma6.);

b. TargetNo=input(Target,comma7.);

c. TargetNo=put(Target,comma6.);

d. TargetNo=put(Target,comma7.);

The correct answer is b. You explicitly convert character values to numeric values by using the INPUT function. Be sure to
select an informat that can read the form of the values. The value 123,456 is seven characters wide, including the comma,
so the informat needs a width of 7.

Summary: Manipulating Numeric Values

This summary contains topic summaries, syntax, and sample programs.

Topic Summaries

Using Descriptive Statistics Functions


A SAS function is a routine that performs a calculation on, or a transformation of, the arguments listed in parentheses
and returns a value.

function-name(argument-1, argument-2,…,argument-n)

Function Returns

SUM the sum of the nonmissing arguments

MEAN the arithmetic mean (average) of the arguments

MIN the smallest value from the arguments

MAX the largest value from the arguments

N the number of nonmissing arguments

NMISS the number of missing numeric arguments

CMISS the number of missing numeric or character arguments

You can list all the variables in the function, or you can use a variable list by preceding the first variable name in the list
with the keyword OF. There are several types of variable lists, including numbered ranges, name ranges, name prefixes,
and special SAS names.

Variable List: Numbered Range (x1-xn)
Description: all variables x1 to xn, inclusive
Example: Total=sum(of Qtr1-Qtr4);

Variable List: Name Range (x--a)
Description: all variables ordered as they are in the program data vector, from x to a inclusive
Example: Total=sum(of Qtr1--Fourth);

Variable List: Name Prefix
Description: all variables that begin with the same string
Example: Total=sum(of Tot:);

Variable List: Special SAS Name List
Description: all of the variables, all of the character variables, or all of the numeric variables that are already
defined in the current DATA step
Examples: Total=sum(of _All_);
          Total=sum(of _Character_);
          Total=sum(of _Numeric_);

Truncating Numeric Values


There are four truncation functions that you can use to truncate numeric values. They are the ROUND, CEIL, FLOOR, and
INT functions.

The ROUND function returns a value rounded to the nearest multiple of the round-off unit. If you don't specify a round-
off unit, the argument is rounded to the nearest integer.

ROUND(argument<,round-off-unit>)

The CEIL function returns the smallest integer greater than or equal to the argument.

CEIL(argument)

The FLOOR function returns the greatest integer less than or equal to the argument.

FLOOR(argument)

The INT function returns the integer portion of the argument.

INT(argument)

Converting Values Between Data Types


You can allow SAS to automatically convert data to a different data type for you, but it can be more efficient to use SAS
functions to explicitly convert data to a different data type. By default, if you reference a character variable in a numeric
context, SAS tries to convert the variable values to numeric. Automatic character-to-numeric conversion uses the w.
informat, and it produces a numeric missing value from any character value that does not conform to standard numeric
notation.

You can use the INPUT function to explicitly convert character values to numeric values. The INPUT function returns the
value that is produced when the source is read with a specified informat.

INPUT(source, informat)

Numeric data values are automatically converted to character values whenever they are used in a character context. For
example, SAS automatically converts a numeric value to a character value when you use the concatenation operator.
When SAS automatically converts a numeric value to a character value, SAS writes the numeric value with the BEST12.
format and right aligns the value. The resulting value might contain leading blanks.

You can use the PUT function to explicitly control the numeric-to-character conversion using a format.

PUT(source, format)

Sample Programs
Using Descriptive Statistics Functions and Truncating Numeric Values

data donation_stats;
set orion.employee_donations;
keep Employee_ID Total AvgQT NumQT;
Total=sum(of Qtr1-Qtr4);
AvgQT=round(Mean(of Qtr1-Qtr4),1);
NumQt=n(of Qtr1-Qtr4);
run;

proc print data=donation_stats;
run;

Converting Values Between Data Types

data hrdata;
keep EmpID GrossPay Bonus Phone HireDate;
set orion.convert;
EmpID=input(ID,2.)+11000;
Bonus=input(GrossPay,comma6.)*.10;
Phone='(' !! put(Code,3.) !! ') ' !! Mobile;
HireDate=input(Hired,mmddyy10.);
run;

proc print data=hrdata;
format HireDate mmddyy10.;
run;

Lesson 6: Debugging Techniques


Debugging is the process of removing logic errors from a program. Unlike syntax errors, logic errors don't stop a program
from running. Instead, they cause the program to produce unexpected results.

For example, if you create a DATA step that keeps track of inventory and your program shows that you're out of stock
even though your warehouse is full, your program might contain a logic error.

In this lesson, you learn to use the PUTLOG statement to identify logic errors in your SAS programs.

Objectives

In this lesson, you learn to do the following:

 use the PUTLOG statement to identify logic errors

Understanding Logic Errors


Suppose you want to send a targeted mailing to potential customers in the United States.

You’ve been given a program that's designed to find and extract only values for United States addresses from
orion.mailing_list. The extracted values will be stored in a new SAS data set named us_mailing and should include only
the customer’s name, street address, city, state, and zip code.

Identifying a Logic Error

In this demonstration, you submit the program that creates us_mailing and view the output.

1. Copy and paste the following code into the editor.

data us_mailing;
set orion.mailing_list;
drop Address3;
length City $ 25 State $ 2 Zip $ 5;
if find(Address3,'US');
Name=catx(' ',scan(Name,2,','),scan(Name,1,','));
City=scan(Address3,1,',');
State=scan(address3,2,',');
Zip=scan(Address3,3,',');
run;

proc print data=us_mailing;
title 'Current Output from Program';
run;
title;

The goal of the program is to create a data set that's formatted to meet the needs of a mailing to only United
States customers. The new data set should include only the customer’s name, street address, city, state, and zip
code.

2. Submit the code and view the log. Notice that 424 observations were read from the input data set and that no
errors were reported.

3. View the results. Notice that the values of State and Zip are truncated. The log doesn't contain any error
messages, but the output isn't what we expect.

Exploring Logic Errors


When we run the mailing list program, we see that the values of State and Zip are truncated in the output. The program
contains a logic error.

Logic errors are different from syntax errors. Remember, syntax errors occur when programming statements don’t
conform to the rules of the SAS language. When a syntax error occurs, SAS writes an error message to the log.

A logic error occurs when the programming statements follow the rules but the results aren’t correct. Since the
statements conform to the rules, SAS doesn't write an error message to the log.

The lack of messages can make logic errors more difficult to detect and correct than syntax errors. Now that you know
that there's a problem with the mailing list program, you need to identify and correct the problem so that the mailing
can be sent.

Question
In the box next to each statement, type the letter representing the matching type of error.

a. syntax error
b. logic error

Program statements don't conform to the rules of the SAS language.

Program statements follow the rules but the results aren't correct.

An error message is written to the log.

No error messages are written to the log.

The correct answers from top to bottom are a, b, a, b.

Using PUTLOG Statements


When you need to debug a program it's often useful to display messages and variable values in the log. You can do that
using PUTLOG statements. To use a PUTLOG statement, you specify the keyword PUTLOG, followed by your
specifications. You can use PUTLOG statements in batch or interactive mode. There are a number of ways to write the
specifications. Let's take a look at some examples.

Sometimes it's helpful to write a string of text to the log. For example, you might want to write a message to indicate
when a particular statement executes. To write a string of text to the log, you simply specify the keyword PUTLOG,
followed by the quoted message text. The text must be enclosed in quotation marks. The PUTLOG statement here writes
the text string Looking for country to the log. You can even precede your message text with WARNING, MESSAGE, or
NOTE to help identify it in the log.

You can also use a PUTLOG statement to write the name and value of a variable to the log. This technique is helpful
when you suspect that the value of a variable might be causing a logic error. For example, if the value of the variable City
is San Diego, this statement writes the message City=San Diego to the log.
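As a minimal sketch of the two forms just described (the DATA _NULL_ step and the value assigned to City are assumptions added for illustration; they are not part of the mailing list program):

```sas
/* Illustrative only: DATA _NULL_ executes the statements without creating a data set */
data _null_;
   City='San Diego';                      /* assumed sample value */
   putlog 'Looking for country';          /* writes a quoted text string to the log */
   putlog 'NOTE: Looking for country';    /* same text, preceded by NOTE */
   putlog City=;                          /* writes City=San Diego */
run;
```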

Formatting Character Values with the PUTLOG Statement


By default, the PUTLOG statement writes character values with the standard character format $w. This format left-
justifies values and removes leading blanks. But, sometimes you might want to apply a different format.

For example, suppose your data values contain leading spaces. The leading spaces won't appear in the log message
unless you specify a format that preserves leading spaces in the PUTLOG statement. To add a format, you specify the
name of the variable followed by the format name and width. The format width must be wide enough to display the
value of the variable, as well as any additional characters such as commas or quotation marks.

For example, the $QUOTEw. format writes a character value enclosed in double quotation marks and preserves any
leading spaces. If the value of the variable City is Philadelphia with a leading space, the statement shown here writes
City=" Philadelphia", including the leading space, to the log.

You can increase the value of the format width beyond the minimum to ensure that you can see the full value of the
variable in the log. The specified format width of 22 is more than wide enough to accommodate the 12 characters in
Philadelphia, the leading space, and the quotation marks which are part of the format.
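The effect of the $QUOTEw. format can be sketched as follows. The DATA _NULL_ step and the assigned value are assumptions for illustration; only the $QUOTE22. specification comes from the discussion above.

```sas
/* Illustrative only: compare default output with $QUOTEw. output */
data _null_;
   City=' Philadelphia';      /* assumed value with one leading space */
   putlog City=;              /* default: the leading space does not appear */
   putlog City=$quote22.;     /* quoted: City=" Philadelphia" */
run;
```

A width of 22 is larger than needed here, which is harmless. A width that is too small is the case to avoid, because the value might not be shown in full.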

Viewing Automatic Variables with the PUTLOG Statement


When you're debugging a program, it's often helpful to see what SAS has stored in the program data vector. To write the
current contents of the PDV to the log, use the _ALL_ option in the PUTLOG statement.

When you use the _ALL_ option, the values of the automatic variables _ERROR_ and _N_ are included in the log.
Remember that these variables are part of the PDV even though SAS doesn't write them to the output data set. Let’s
take a closer look at the _ERROR_ and _N_ variables to see how you can use them to help debug your programs.

SAS creates an _ERROR_ variable for every DATA step. SAS changes the value of _ERROR_ from 0 to 1 when it
encounters certain types of errors. For example, input data errors, conversion errors, and math errors are among the
types of errors that cause the value of _ERROR_ to change. If the value of _ERROR_ changes to 1, SAS writes the
contents of the PDV and a note to the log.

Here's a question. If your program contains a logic error, will SAS change the value of _ERROR_? In the case of a logic
error, the programming statements conform to the rules of the SAS language. So, the value of _ERROR_ will remain 0.

Now, let's take a look at the _N_ variable. SAS also creates an _N_ variable for every DATA step. This variable represents
the number of times that the DATA step has iterated. SAS initially sets the value of _N_ to 1. Each time the DATA step
loops past the DATA statement, SAS increments _N_ by 1. SAS retains the value of _N_ between iterations, so you can
use the value of _N_ to determine how many passes you’ve made through the DATA step.
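To see _ERROR_ and _N_ alongside the other variables in the PDV, a short sketch such as the following can help. It assumes the sample data set sashelp.class that ships with SAS, which is not part of the Orion Star course data.

```sas
/* Illustrative only: write the PDV, including _ERROR_ and _N_, on each iteration */
data _null_;
   set sashelp.class(obs=3);   /* assumed sample data; OBS= limits the log output */
   putlog _all_;               /* all variables, followed by _ERROR_= and _N_= */
run;
```

Each log line ends with the values of _ERROR_ and _N_, and _N_ increases by 1 on each iteration.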

Combining PUTLOG Statements with Conditional Logic


You can use the value of _N_ to conditionally execute a PUTLOG statement or a series of PUTLOG statements. For
example, suppose you want to write a text string to the log during the first iteration of a DATA step. You know that the
value of _N_ will be 1 during the first iteration, so you can use conditional logic to execute a PUTLOG statement when
the condition _N_ equals 1 is true.

Now suppose that when SAS processes the last observation in the data set, you want to write another text string, as well
as the contents of the PDV, to the log. To do that, you need to identify the last observation in the input data set. You can
use the END= option in the SET statement to create and name a temporary variable that acts as an end-of-file indicator.
The syntax for the END= option is shown here. You can also use the END= option in an INFILE statement to indicate the
end of a raw data file.

The value of the END= variable is initialized to 0 and is set to 1 when the SET statement reads the last observation from
the input data set. So, you can check the value of the END= variable in an IF statement to determine if the current
iteration of the DATA step is the last iteration.

In this example, if the value of _N_ is 1, SAS writes the text string First Iteration to the log. If the value of the END=
variable Last equals 1, SAS executes two PUTLOG statements. The first PUTLOG statement writes the text string Final
values of variables: to the log. The second PUTLOG statement writes the contents of the PDV to the log.
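A minimal sketch of the example described above follows. The input data set sashelp.class is an assumption for illustration; the variable name Last and the message text come from the description.

```sas
/* Illustrative only: conditional PUTLOG statements using _N_ and END= */
data _null_;
   set sashelp.class end=Last;      /* Last is 1 only when the last observation is read */
   if _n_=1 then putlog 'First Iteration';
   if Last=1 then do;
      putlog 'Final values of variables:';
      putlog _all_;                 /* contents of the PDV on the final iteration */
   end;
run;
```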

Since SAS considers any numeric value other than 0 or missing to be true, you can also write the IF statement for the
variable Last without the equals 1.

Question
Which of the following represent acceptable syntax for the PUTLOG statement? Select all that apply.

a. putlog 'customer'=;

b. putlog customer=;

c. putlog 'Note: write this in the log';

d. putlog write this in the log;

e. putlog customer $quote13.;

The correct answers are b, c, and e. After the keyword PUTLOG, you specify a variable name followed by an equal sign,
a variable name followed by a format, or the text to be written to the log enclosed in quotation marks.

Question
What will this PUTLOG statement do?

putlog _all_;

a. write text to the log

b. write the values of all the variables to the log

c. write formatted values to the log

d. write all the logic errors to the log

The correct answer is b. This statement will write the values of all the variables to the log.

Question
Which of the following represents acceptable syntax for the END= option in the INFILE statement?

a. infile work.catalog end=last;

b. infile work.catalog end.last;

c. infile catalog end=last;

d. infile catalog end.last;

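The correct answer is c. The INFILE statement takes a fileref or a quoted file path rather than a two-level SAS data set
name, and the option is written as END=variable.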


Question
Is it acceptable to use the END= option in both SET and INFILE statements?

a. yes

b. no

The correct answer is a. It is acceptable to use the END= option in both SET and INFILE statements.

Using PUTLOG Statements


In this demonstration, you use PUTLOG statements to debug the mailing list program.

1. Copy and paste the following DATA step into the editor.

data us_mailing;
set orion.mailing_list (obs=10);
drop Address3;
length City $ 25 State $ 2 Zip $ 5;
putlog _n_=;
putlog "Looking for country";
if find(Address3,'US');
putlog "Found US";
Name=catx(' ',scan(Name,2,','),scan(Name,1,','));
City=scan(Address3,1,',');
State=scan(address3,2,',');
Zip=scan(Address3,3,',');
putlog State= Zip=;
run;

We probably don't need to process all of the observations in the input data set in order to debug this program.
So, we use the OBS= data set option to process only 10 observations. That will save processing time.

The first PUTLOG statement displays the value of _N_. The second PUTLOG statement displays the text
message "Looking for country". The third PUTLOG statement displays the message "Found US" when a US
address is located, which makes it easier to locate the data for the US addresses in the log. The fourth PUTLOG
statement displays the values of State and Zip when a US address is located.

2. Submit the DATA step and check the log. Notice the output from the PUTLOG statements. The US observations
are identified, but the values of State and Zip are incorrect.

3. Add $QUOTEw. formats to the fourth PUTLOG statement as shown below.

data us_mailing;
set orion.mailing_list (obs=10);
drop Address3;
length City $ 25 State $ 2 Zip $ 5;
putlog _n_=;
putlog "Looking for country";
if find(Address3,'US') ;
putlog "Found US";
Name=catx(' ',scan(Name,2,','),scan(Name,1,','));
City=scan(Address3,1,',');
State=scan(address3,2,',');
Zip=scan(Address3,3,',');
putlog State=$quote4. Zip=$quote7.;
run;

The values of State should contain two characters. The values of Zip should contain five characters. The width
specifications of 4 and 7 will accommodate the characters as well as the quotation marks that the format will
add.

4. Submit the DATA step again and check the log. Notice that this time, the log messages indicate that each value
of State and Zip contains a leading blank.

5. We can use the LEFT function to remove the leading blanks. Add the LEFT function to the assignment statements
for State and Zip as shown below.

data us_mailing;
set orion.mailing_list (obs=10);
drop Address3;
length City $ 25 State $ 2 Zip $ 5;
putlog _n_=;
putlog "Looking for country";
if find(Address3,'US') ;
putlog "Found US";
Name=catx(' ',scan(Name,2,','),scan(Name,1,','));
City=scan(Address3,1,',');
State=left(scan(address3,2,','));
Zip=left(scan(Address3,3,','));
putlog State=$quote4. Zip=$quote7.;
run;

6. Submit the DATA step and check the log. Notice that the values of State and Zip are now correct.

7. Since the DATA step is now correct, remove the OBS= data set option and the PUTLOG statements as shown below.

data us_mailing;
set orion.mailing_list;
drop Address3;
length City $ 25 State $ 2 Zip $ 5;
if find(Address3,'US') ;
Name=catx(' ',scan(Name,2,','),scan(Name,1,','));
City=scan(Address3,1,',');
State=left(scan(address3,2,','));
Zip=left(scan(Address3,3,','));
run;

8. Copy and paste the following PROC PRINT step into the editor.

proc print data=us_mailing;
title 'Current Output from Program';
run;
title;

9. Submit both the DATA step and the PROC PRINT step and view the results. Notice that the values of State
and Zip are no longer truncated.

Summary: Debugging Techniques


This summary contains topic summaries, syntax, and sample programs.

Topic Summaries

Understanding Logic Errors


Debugging is the process of identifying and removing logic errors from a program. Logic errors are different from syntax
errors.

Syntax errors occur when programming statements don’t conform to the rules of the SAS language. When a syntax error
occurs, SAS writes an error message to the log.

Logic errors occur when the programming statements follow the rules but the results are not correct. Since the
statements conform to the rules, SAS continues to process the program and doesn't write an error message to the log.
The lack of messages can make logic errors more difficult to detect and correct than syntax errors.

Using PUTLOG Statements


You can use PUTLOG statements to display messages, variable names, and variable values in the log. This technique is
helpful when you suspect that the value of a variable might be causing a logic error.

By default, the PUTLOG statement writes character values with the standard character format $w. To use a different
format, specify the name of the variable followed by the format name and width.

PUTLOG <specifications>;
PUTLOG 'text';
PUTLOG variable-name=;
PUTLOG variable-name format-namew.;
PUTLOG _ALL_;

It's often helpful to see what SAS has stored in the program data vector. To write the current contents of the PDV to the
log, use the _ALL_ option in the PUTLOG statement. When you use the _ALL_ option, the values of the automatic
variables _ERROR_ and _N_ are included in the log.

PUTLOG statements can be executed conditionally using IF-THEN statements. For example, you might want to display
the value of all variables when _ERROR_ is equal to 1, when _N_ is greater than 5000, or on the last iteration of the
DATA step. But how do you know when the DATA step is on the last iteration?

You can use the END= option in the SET statement to create a temporary variable that acts as an end-of-file indicator.
You can also use the END= option in an INFILE statement to indicate the end of a raw data file. The END= variable is
initialized to 0 and is set to 1 when the last observation or record is read.

SET SAS-data-set END=variable <options>;

INFILE 'raw-data-file' END=variable <options>;


Sample Program
data us_mailing;
set orion.mailing_list (obs=10);
drop Address3;
length City $ 25 State $ 2 Zip $ 5;
putlog _n_=;
putlog "Looking for country";
if find(Address3,'US');
putlog "Found US";
Name=catx(' ',scan(Name,2,','),scan(Name,1,','));
City=scan(Address3,1,',');
State=left(scan(address3,2,','));
Zip=left(scan(Address3,3,','));
putlog State=$quote4. Zip=$quote7.;
run;

Lesson 7: Using Iterative DO Loops


One of the challenges in SAS programming is how to simplify your code so that you can easily read and manage it. You
might find yourself writing DATA steps that are very long and that contain many repeated lines of code. For instance,
you might need to perform the same calculation over and over again in a DATA step. In this lesson, you learn to use DO
loops to simplify your code to avoid repetitive code and redundant calculations.

Objectives

In this lesson, you learn to do the following:

 explain iterative DO loops
 construct a simple DO loop to eliminate redundant code and repetitive calculations
 execute DO loops conditionally
 construct nested DO loops

Constructing a Simple DO Loop


Suppose an Orion Star employee has $50,000 to invest and needs to choose between two funds: one will make 4.5%
interest compounded quarterly, and the other will make 4.5% interest compounded annually. Your job is to inform the
employee how much money each investment will accrue. You can use a SAS program to easily compare these two
investment opportunities.

Identifying Redundant Code and Repetitive Calculations


You can use a single DATA step to calculate the growth of two investments. The DATA step shown here calculates the
variable Yearly to hold the yearly compounded interest in one investment, and calculates the variable Quarterly to hold
the quarterly compounded interest in the other investment. Notice the four identical lines of code that calculate the
variable Quarterly. Each line calculates the interest for a subsequent quarter, so that at the end of the DATA step the
variable Quarterly has the value of the interest, compounded quarterly, at the end of one year. This is a useful DATA
step and using four identical lines of code is manageable.
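The DATA step referred to above is not reproduced in these notes. The following is a hedged reconstruction, modeled on the compound program used later in this lesson. The single assignment for Yearly is an assumption; the four repeated sum statements follow the description.

```sas
/* Reconstruction for illustration: one year of interest without a DO loop */
data compound;
   Amount=50000;
   Rate=.045;
   Yearly=Amount*Rate;                       /* assumed: one annual compounding period */
   Quarterly+((Quarterly+Amount)*Rate/4);    /* quarter 1 */
   Quarterly+((Quarterly+Amount)*Rate/4);    /* quarter 2 */
   Quarterly+((Quarterly+Amount)*Rate/4);    /* quarter 3 */
   Quarterly+((Quarterly+Amount)*Rate/4);    /* quarter 4 */
run;
```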

But here's a question: how would you need to alter this code if the employee wanted to determine the annual and
quarterly compounded interest in these investments for a period of 20 years? You'd need 20 identical lines of code to
determine the value of Yearly, and you'd need 80 identical lines of code to calculate the value of Quarterly. In that case,
the redundant code and repetitive calculations would make the code difficult to read and manage.

There is a better way to write the DATA step, by using DO loops. Instead of using a new line of code for each increment
to the Yearly or Quarterly variables, you can use DO loops to iteratively execute a single line of code multiple times. So,
you need far fewer lines of code in your DATA step, which makes your program much easier to read and manage. Let's
look more closely at DO loops.

Constructing an Iterative DO Loop


An iterative DO loop executes the statements between the DO statement and the END statement repetitively. The
syntax of a simple iterative DO loop is shown here. In this form, the DO loop is controlled by an index variable that
increments from a start value to a stop value.

The DO statement begins the loop. You name the index variable to count the number of times that the loop repeats, and
set its initial value. Then, you specify the stop value of the index variable after the keyword TO. You can use the optional
BY clause to specify an amount by which to increment the index variable on each iteration. The values of start, stop, and
increment must be numbers or expressions that yield numbers. If you omit the keyword BY and the increment value, the
increment defaults to a value of 1. SAS increments the value of the index variable after each iteration of the loop, and
the DO loop iterates until the value of the index variable exceeds the specified stop value. This means that when the DO
loop stops executing, the value of the index variable is one increment beyond the specified stop value.
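The general form described above, followed by a small runnable sketch. The variable names i and Total and the use of DATA _NULL_ with PUTLOG are assumptions chosen for illustration.

```sas
/* General form, as described in the text:
      DO index-variable=start TO stop <BY increment>;
         statements
      END;
*/
data _null_;
   do i=1 to 20;       /* no BY clause, so the increment defaults to 1 */
      Total+1;         /* sum statement counts the iterations */
   end;
   putlog i= Total=;   /* writes i=21 Total=20 */
run;
```

After the loop ends, i is one increment beyond the stop value, which is why it is 21 rather than 20.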

So, in the example shown here, the DO loop iterates until the value of i reaches 21. Let's look at a few more examples to
further illustrate the syntax. Remember, if you don't specify an increment, the increment defaults to 1. So in this first
example, i has an initial value of 1 and it increments by 1 after each iteration of the loop. When the value of i reaches 11,
the loop stops executing because 11 is out of the specified range. So, the loop has executed 10 times and the value of
Total is 10.

In the second example, i has an initial value of 1 and it increments by 2 after each iteration of the loop. When the value
of i reaches 11, the loop stops executing. At that point, the loop has executed 5 times and Total has a value of 5. If your
start value is greater than your stop value, you must specify increment as a negative number. A negative value in the BY
clause tells SAS to decrement the value of the index variable after each iteration of the loop, and to execute the loop
until the value of the index variable is less than the specified stop value.

In the third example, i has an initial value of 10 and it decrements by 2 after each iteration of the loop. When the value
of i reaches 0, the loop stops executing. At that point, the value of Total is 5.

Look at the fourth example. In this case, the start value is greater than the stop value, but there is no increment
specified. This loop never executes because the initial value of i, 10, is already greater than the stop value. So, the value
of Total is 0.

In the fifth example, the statements within the loop modify the value of i. So, in addition to the increment at the end of
each iteration of the loop, i increases by 10 within each loop. At the top of the second iteration of the loop, i already
exceeds the stop value and the loop stops executing. At that point, Total has a value of 1. It's also possible to create an
infinite loop by changing the value of the index variable within the loop, so you need to be careful.
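The five examples are not reproduced in these notes. The sketch below is one set of DO statements that is consistent with the descriptions. The stop values, and the statement that modifies i in the fifth example, are assumptions.

```sas
/* Illustrative only: five loops matching the behavior described above */
data _null_;
   Total=0;
   do i=1 to 10;            /* example 1: default increment of 1 */
      Total=Total+1;
   end;
   putlog 'Example 1: ' i= Total=;    /* i=11 Total=10 */

   Total=0;
   do i=1 to 10 by 2;       /* example 2: i takes the values 1 3 5 7 9 */
      Total=Total+1;
   end;
   putlog 'Example 2: ' i= Total=;    /* i=11 Total=5 */

   Total=0;
   do i=10 to 1 by -2;      /* example 3: i takes the values 10 8 6 4 2 */
      Total=Total+1;
   end;
   putlog 'Example 3: ' i= Total=;    /* i=0 Total=5 */

   Total=0;
   do i=10 to 1;            /* example 4: start exceeds stop, no negative increment */
      Total=Total+1;
   end;
   putlog 'Example 4: ' i= Total=;    /* i=10 Total=0 */

   Total=0;
   do i=1 to 10;            /* example 5: the loop body changes the index variable */
      Total=Total+1;
      i=i+10;
   end;
   putlog 'Example 5: ' i= Total=;    /* i=12 Total=1 */
run;
```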

Understanding DO-Loop Logic


You've seen some examples of DO loops. Now, let's look at the logic that SAS uses to process a DATA step that contains a
DO loop. When the DO statement first executes, SAS evaluates index and if it's within the range of start to stop, SAS
executes the statements within the loop. Then, SAS increments the value of index and loops back to execute the DO
statement again. If the value of index is still within the range of start to stop, SAS executes the statements within the DO
loop and then increments index again.

This process continues until index is out of the range of start to stop. At that point, SAS moves past the loop and
executes additional program statements. Take some time to examine this diagram, because understanding the logic that
SAS uses to process DO loops in a DATA step will help you when you write your own DO loops.

Question
Match the DO statements with the values of the index variables on the right.

do i=1 to 5; a. 1 2 3 4 5 6

do j=2 to 8 by 2; b. 10 8 6 4 2 0

do k=10 to 2 by -2; c. 3.5 3.55 3.6 3.65 3.7 3.75 3.8

do m=3.5 to 3.75 by .05; d. 2 4 6 8 10

The correct answers from top to bottom are a, d, b, c.

Using an Iterative DO Loop

An Orion Star employee has $50,000 to invest. The employee can either invest in a fund that will make 4.5%
interest compounded quarterly or in a fund that will make 4.5% interest compounded annually. In this
demonstration, you use an iterative DO loop to calculate how much money each investment will accrue.

1. Copy and paste the following code into the editor.

data compound;
Amount=50000;
Rate=.045;
do i=1 to 20;
Yearly+(Yearly+Amount)*Rate;
end;
do i=1 to 80;
Quarterly+((Quarterly+Amount)*Rate/4);
end;
run;

proc print data=work.compound;
run;

The DATA step calculates two variables: Yearly is the growth on an investment that has the interest
compounded annually. Quarterly is the growth on an investment that has the interest compounded quarterly.
The DATA step calculates both of these variables for an investment period of 20 years on the initial deposit of
$50,000.

The PROC PRINT step enables you to see and compare the values of Yearly and Quarterly at the end of the
investment period.

2. Submit the code and view the results. Notice that the output data set, compound, contains one observation.
Amount represents the initial investment. Rate is the interest rate, which is the same for both accounts. Yearly
represents the interest accumulated over 20 years in the account that has the interest compounded annually,
and Quarterly represents the interest accumulated over 20 years in the account that has the interest
compounded quarterly. Because both variables start at 0, neither includes the initial $50,000.

Also notice that the index variable, i, is included in the output data set and that it has a value of 81, which is
beyond the range that we specified in the second DO loop. When you use a DO loop in a DATA step, SAS
includes the index variable in the output data set by default. Here's a question: how could you exclude this
variable from the output data set? You could use the DROP= option in the DATA statement.
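As a sketch of that suggestion, the same DATA step with the DROP= data set option added to the DATA statement might look like this:

```sas
/* DROP= on the output data set keeps i out of work.compound */
data compound(drop=i);
   Amount=50000;
   Rate=.045;
   do i=1 to 20;
      Yearly+(Yearly+Amount)*Rate;
   end;
   do i=1 to 80;
      Quarterly+((Quarterly+Amount)*Rate/4);
   end;
run;
```

The variable i is still available in the PDV while the step executes, so both loops work as before; it is simply not written to the output data set.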

Understanding DO-Loop Processing


Let's take a look at another example to see what goes on behind the scenes when SAS processes a DO loop. In this
example, an Orion Star employee invests $5,000 in an account on January 1st of each year. You need to determine the
value of that account after three years based on a constant annual interest rate of 4.5%, beginning in 2008.

This DATA step creates a data set named invest that contains two variables: Year, the index variable, indicates the
investment period from 2008 until 2010. Capital contains the value of the investment account at the end of each year.
When the execution phase begins, the PDV contains the automatic variable _N_ as well as the variables Year and Capital.
Because Capital appears in a sum statement, SAS automatically initializes it to 0 and retains it. SAS initializes Year to
missing.

The DO statement executes and SAS sets the values for Year and establishes the start and stop range. Then, SAS
evaluates whether Year is within the specified range. The first time the DO statement executes, the value of Year is
2008, so it is within the range and the statements within the DO loop execute. SAS increases the value of Capital by
5000. Then, SAS updates Capital with the new value, 5000 plus 5000*.045, which equals 5225. The END statement
executes and SAS increments the value of Year by 1.

Next, SAS loops back and executes the DO statement again. The value of Year is now 2009, which is still within the range,
so the sum statements execute again. SAS increases the value of Capital by 5000 to account for the new year's deposit.
Then, SAS uses the new value of Capital, 10225, in the second sum statement to calculate the end-of-year value
including interest. The END statement executes and SAS increments the value of Year by 1 again, and then loops back to
execute the DO statement.

Year now has a value of 2010, which is within the specified range. SAS executes the sum statements again and increases
the value of Capital by 5000 and then by the interest earned. The END statement executes and SAS increments Year to
2011. When the DO statement executes this time, Year is outside of the specified range. So, the statements within the
DO loop do not execute again.

Instead, SAS continues with the next statement in the DATA step, after the END statement. In this case, there are no
more statements in the DATA step, so SAS writes out the values from the PDV to the output data set. Remember, we
want to find out what the investment is worth after three years if an employee began the investment on January 1st,
2008. The data in work.invest shows the value of the investment as of the year 2011, which is three years after the initial
investment.
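The DATA step traced above is not reproduced at this point in the notes. It appears to be the same invest step that is used in a later demonstration in this lesson, sketched here with comments that follow the walkthrough:

```sas
/* The invest DATA step, annotated with the values from the walkthrough */
data invest;
   do Year=2008 to 2010;
      Capital+5000;             /* deposit on January 1st; Capital is retained */
      Capital+(Capital*.045);   /* add the year's interest; 5225 after 2008 */
   end;
run;                            /* one observation is written, with Year=2011 */
```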

Using the DO Statement with an Item List


You've seen one form of an iterative DO loop. Here, you see the syntax for another form of an iterative DO loop. The
basic logic is the same, but this DO statement uses a list of values for the index variable rather than a range. The items in
the list can be either all numeric or all character constants, or they can be variables. You must enclose character
constants in quotation marks. You separate the values by commas.

In this form of the iterative DO loop, SAS executes the loop once for each item in the list. At the end of each iteration of
the DO loop, instead of incrementing the index variable, SAS assigns the next value in the list to the index variable. When
there are no more items in the list, the loop stops executing. The last item in the list provides the final value for the
index variable, assuming that the statements within the loop do not affect that value.

So, in the example shown here, when SAS writes the index variable Year to the output data set invest2, it has a value of
2010. Capital has a value of 16390.96, which is the same value that Capital had in the invest data set from the last
example.
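A sketch of the item-list form that is consistent with the description above (the exact statement from the original example is not reproduced in these notes, so the list values are inferred from the text):

```sas
/* General form, as described in the text:
      DO index-variable=item-1 <, ... item-n>;
         statements
      END;
*/
data invest2;
   do Year=2008, 2009, 2010;    /* the loop executes once for each item */
      Capital+5000;
      Capital+(Capital*.045);
   end;
run;                            /* Year is 2010 in the output, not 2011 */
```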

Question
Match the DO statements with the number that represents how many times the DO loop will execute. Assume that the
statements within the DO loop do not affect the number of iterations.

do Month='JAN', 'FEB', 'MAR'; a. five times

do Odd=1,3,5,7,9; b. three times

do i=Var1, Var2; c. not enough information to tell

do j=BeginDate to Today() by 7; d. two times

The correct answers from top to bottom are b, a, d, c.

Using the OUTPUT Statement in a DO Loop


As you've seen, the DO loop shown here results in only one observation being written out to the output data set. But
what if you want to include a separate observation for each year? You can use an OUTPUT statement inside the DO loop
to explicitly write the values out to the data set on each iteration of the DO loop.
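For example, adding an OUTPUT statement to the invest DATA step might look like the following sketch:

```sas
/* OUTPUT inside the loop writes one observation per iteration */
data invest;
   do Year=2008 to 2010;
      Capital+5000;
      Capital+(Capital*.045);
      output;                   /* writes the current Year and Capital */
   end;
run;
```

Because the step contains an explicit OUTPUT statement, SAS does not also write an observation at the end of the DATA step. The data set has three observations, for 2008, 2009, and 2010.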

Using a DO Loop to Reduce Redundant Code


In this demonstration, you use a DO loop to reduce redundant code.

1. Copy and paste the following code into the editor.

data forecast;
set orion.growth;
Year=1;
Total_Employees=Total_Employees*(1+Increase);
output;
Year=2;
Total_Employees=Total_Employees*(1+Increase);
output;
run;

proc print data=forecast noobs;
run;

You've seen this program before, in Lesson 1; it creates the SAS data set forecast with two observations for
every observation in the growth data set. The first observation predicts the increase in employees for one year
and the second observation predicts the increase in employees for two years.

2. Submit the code and check the log. Verify that no errors occurred.

3. View the results. Notice that the forecast data set contains 12 observations, 2 for each department. But what if
you want to forecast the company's growth over the next six years? Instead of adding separate lines of code for
each year, you could use a DO loop.

4. Replace the six lines of code between the SET statement and the RUN statement with a DO loop. Begin with a
DO statement, and make Year the index variable with a range of 1 to 6. Within the loop, write two lines of code:
an assignment statement that calculates the number of employees at the end of each year, and an OUTPUT
statement. Finish the loop with an END statement. Your program should look like this:

data forecast;
set orion.growth;
do Year=1 to 6;
Total_Employees=Total_Employees*(1+Increase);
output;
end;
run;

proc print data=forecast noobs;
run;

5. Submit the code and check the log. Verify that no errors occurred.

6. View the results. Notice that the output is very similar to the output from the previous program except this time,
the data set contains six observations for each department.

Here's a question: how could we alter the DATA step if we wanted to forecast the number of years it would take
for the total number of employees in the Engineering department to exceed 75? We could guess how long it will
take. For instance, we could change the upper limit of the index variable Year to 10 and run the program again
to see if Engineering is forecasted to have more than 75 employees by then. But there is a more efficient way,
which we'll look at next.

Conditionally Executing DO Loops


Suppose you want to determine the number of years it will take for an account to exceed $1,000,000 if an employee
invests $5,000 annually at 4.5%. You can easily determine that investment period by creating a DO loop that executes
repetitively until the value of the account exceeds $1,000,000.

Using the DO UNTIL Statement


You can use a DO UNTIL statement instead of a simple DO statement. The syntax of the DO UNTIL statement is shown
here. In a DO UNTIL statement, you can specify a condition in the form of an expression and SAS executes the loop until
that expression is true. Although the conditional expression appears at the top of the loop, SAS evaluates the expression
at the bottom of the loop after each iteration.

So, the statements in the loop always execute at least once. Be careful, because it is possible to create an infinite loop by
writing an expression that never becomes true. In the example shown here, the DO loop executes until the value of
Capital is greater than one million.

Using the DO WHILE Statement


Another way to execute a DO loop conditionally is to use the DO WHILE statement. The syntax of the DO WHILE
statement is shown here. The DO WHILE statement executes the DO loop while the expression is true. In a DO loop that
uses the DO WHILE statement, SAS evaluates the expression at the top of the loop, and executes the statements within
the loop if the expression is true.

So, it's possible to create a DO loop that never executes because the expression is initially false. It is also possible to
create an infinite loop by creating a condition that never becomes false, so write your expressions and loops carefully. In
the example shown here, the DO loop continues to execute as long as the value of Capital is less than or equal to
1,000,000. As soon as Capital exceeds 1,000,000, the expression is false and the loop ends.
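The difference in when the expression is evaluated can be seen in a small sketch. The starting value of Capital and the counter Passes are assumptions chosen so that the condition is already satisfied before either loop begins.

```sas
/* Illustrative only: DO UNTIL tests at the bottom, DO WHILE tests at the top */
data _null_;
   Capital=2000000;                   /* assumed: already above 1,000,000 */

   Passes=0;
   do until (Capital>1000000);        /* tested after the first pass */
      Passes=Passes+1;
   end;
   putlog 'DO UNTIL: ' Passes=;       /* Passes=1 */

   Passes=0;
   do while (Capital<=1000000);       /* tested before the first pass */
      Passes=Passes+1;
   end;
   putlog 'DO WHILE: ' Passes=;       /* Passes=0 */
run;
```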

Conditionally Executing DO Loops


You've seen the syntax and rules for conditionally executing DO loops. In this demonstration, you view a couple of
examples in SAS.

1. Copy and paste the following code into the editor.

data invest;
do Year=2008 to 2010;
Capital+5000;
Capital+(Capital*.045);
end;
run;

proc print data=invest noobs;
run;

This DATA step calculates the value of an investment account with an annual deposit of $5,000 and an interest
rate of 4.5%, compounded yearly. The DATA step calculates the growth of the account over a period of three
years.

2. Alter this DATA step, as shown below, to determine how long it will take this account to exceed $1,000,000 in
value, assuming that the annual $5,000 deposits continue. Instead of using an index variable to control the loop,
add the UNTIL keyword and use the expression Capital > 1000000. Add a sum statement inside the loop to track
the Year, because you still need to have that information.

data invest;
do until (Capital>1000000);
Year+1;
Capital+5000;
Capital+(Capital*.045);
end;
run;

proc print data=invest noobs;
run;

3. Submit the program and check the log. Verify that no errors occurred.

4. View the results. You can see that it will take 52 years for this account to exceed $1,000,000.

5. Can you think of another way to write this DATA step to find out how many years the value of the account will
be less than or equal to $1,000,000? You could use the WHILE keyword instead of the UNTIL keyword and
change the expression to Capital <= 1000000, like this:

data invest2;
   do while (Capital<=1000000);
      Year+1;
      Capital+5000;
      Capital+(Capital*.045);
   end;
run;

proc print data=invest2 noobs;
run;

6. If you submit this code and examine the results, you can see that the output data set invest2 contains exactly
the same information as invest.
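
One way to see how the balance grows, rather than only the final value, is to add an OUTPUT statement inside the loop. The sketch below is an illustration, not part of the demonstration; the data set name invest_trace is hypothetical. It uses the FIRSTOBS= data set option to print only the last few years.

```sas
data invest_trace;
   do until (Capital>1000000);
      Year+1;
      Capital+5000;               /* annual deposit */
      Capital+(Capital*.045);     /* 4.5% interest, compounded yearly */
      output;                     /* one observation per year */
   end;
run;

/* Start printing at observation 50 to show only the final years */
proc print data=invest_trace (firstobs=50) noobs;
   format Capital dollar14.2;
run;
```

If the loop runs for 52 years as reported above, invest_trace contains 52 observations and the listing shows years 50 through 52.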

Question
Select the DO WHILE statement that would generate the same result as the program below.

data work.invest;
Capital=100000;
do until(Capital gt 500000);
Year+1;
Capital+(Capital*.10);
end;
run;

a. do while(Capital ge 500000);

b. do while(Capital=500000);

c. do while(Capital le 500000);

d. do while(Capital>500000);

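The correct answer is c. The DO UNTIL loop stops when Capital gt 500000 becomes true, so the equivalent DO WHILE loop
must continue while the opposite condition, Capital le 500000, is true. Because the DO WHILE expression is evaluated at
the top of the loop, you specify the condition that must exist to execute the enclosed statements.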

Using an Iterative DO Loop with a Conditional Clause


You can combine DO UNTIL and DO WHILE statements with the iterative DO statement. This is one way to avoid creating
an infinite loop. The syntax for using an iterative DO statement with a conditional clause is shown here. In this case, the
loop executes either until the value of the index variable exceeds the specified range or until the expression is true for
an UNTIL clause or false for a WHILE clause. Let's look at a couple of examples.

In the first example shown here, the loop executes either until the value of Year reaches 31 or until the value of Capital
exceeds 250,000. SAS evaluates Capital at the bottom of the loop to make sure the UNTIL condition is still false, and then
increments the index variable Year.

In the second example, the loop executes either until the value of Year reaches 31 or until the value of Capital is no
longer less than or equal to 250,000. SAS increments Year at the bottom of the loop, and then evaluates Capital at the
top of the next iteration of the loop to see if the condition is still true.

Using an Iterative DO Loop with a Conditional Clause


You've seen two examples of DATA steps that use an iterative DO loop with a conditional clause. In this demonstration,
you submit the two DATA steps and compare the results.

1. Copy and paste the following DATA steps into the editor.

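data invest;
   /* Stops when year passes 30 or when Capital exceeds 250,000 */
   do year=1 to 30 until (Capital>250000);
      Capital+5000;
      Capital+(Capital*.045);
   end;
run;

data invest2;
   /* Continues while year is within range and Capital is at most 250,000 */
   do year=1 to 30 while (Capital<=250000);
      Capital+5000;
      Capital+(Capital*.045);
   end;
run;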

The first DATA step creates a data set named invest, and uses a DO loop that executes either until the value of
Year reaches 31 or until the value of Capital exceeds 250,000. The second DATA step creates a data set named
invest2, and uses a DO loop that executes either until the value of Year reaches 31 or until the value of Capital is
no longer less than or equal to 250,000.

2. Add two PROC PRINT steps so that you can compare the results of each of these DATA steps. Use the
DOLLAR14.2 format to format the Capital variables.

proc print data=invest;
   title "Invest";
   format Capital dollar14.2;
run;

proc print data=invest2;
   title "Invest2";
   format Capital dollar14.2;
run;

title;

3. Submit the code and check the log. Verify that no errors occurred.

4. Examine the results. You can see that the value of Capital is the same in invest and invest2. The value of Year is
27 in invest and 28 in invest2. Why is that?

Look back at the DATA steps. Neither DO loop executed enough times for the value of Year to exceed the
specified range. Remember, in a DO UNTIL loop, SAS checks the condition before incrementing the index
variable. In a DO WHILE loop, SAS checks the condition after incrementing the index variable. So both loops
executed 27 times, but the DO WHILE loop incremented the index variable one more time after the 27th
execution of the statements within the loop.
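
If you need the number of completed iterations regardless of which clause you use, one option is to count them in a separate variable instead of relying on the index variable. This is a sketch, not part of the demonstration; the data set name invest_count and the variable name Iterations are hypothetical.

```sas
data invest_count;
   do Year=1 to 30 while (Capital<=250000);
      Capital+5000;
      Capital+(Capital*.045);
      Iterations+1;             /* incremented only when the loop body runs */
   end;
run;

proc print data=invest_count noobs;
   format Capital dollar14.2;
run;
```

Based on the explanation above, Iterations should be 27 while Year is 28, because the index variable is incremented once more before the WHILE expression is found to be false.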

Question
How many times does this DO loop execute?

data work.test;
x=15;
do while(x<12);
x+1;
end;
run;

a. 0

b. 1

c. 12

d. x

The correct answer is a. The DO WHILE expression is evaluated at the top of the loop. Because the expression is false,
the DO loop does not execute.

Question
What is the value of X at the completion of the DATA step?

data work.test;
x=15;
do while(x<12);
x+1;
end;
run;

a. 12

b. 15

c. 16

d. unknown

The correct answer is b. The value of X is set to 15 at the beginning of the DATA step, and because the DO loop never
executes, the value of X never increases.

Nesting DO Loops


Suppose an employee wants to see a report that shows the yearly totals for an investment account for five years. The
employee deposits $5,000 in the account annually, and the account earns 4.5% interest, compounded quarterly. You can
easily write a DATA step to track this information for the employee's report.

Nesting DO Loops


You can nest DO loops in a DATA step. That is, you can create a loop within a loop. When you nest DO loops, you must
use different index variables for each loop, and you must be certain that each DO statement has a corresponding END
statement.

In the example shown here, the outside DO loop uses the index variable Year. This loop executes five times. Each time, it
adds 5000 to the value of Capital, executes the inner loop four times, and then writes the current observation to the
output data set. The inner loop uses the index variable Quarter and calculates the interest that the account earns in
each quarter of the current year. The inner loop executes 4 times for each value of Year, which means that it executes
a total of 20 times.

Notice that index variable Quarter is dropped from the output data set. Remember that, by default, all index variables
are included in the output data set.

Nesting DO Loops


In this demonstration, you submit a DATA step that contains nested DO loops.

1. Copy and paste the following DATA step into the editor.

data invest (drop=Quarter);
   do Year=1 to 5;
      Capital+5000;                      /* annual deposit */
      do Quarter=1 to 4;
         Capital+(Capital*(.045/4));     /* quarterly share of the 4.5% rate */
      end;
      output;                            /* one observation per year */
   end;
run;

2. Submit the DATA step and check the log. Notice that the output data set contains 5 observations and 2
variables.

3. Copy and paste the following PROC PRINT step into the editor.

proc print data=invest;
run;

4. Submit the code and examine the results. In the results you can see the increases for Capital over the five-year
period. Here's a question: how could we change the DATA step to make it generate one observation for each
quarterly amount? We could move the OUTPUT statement inside the nested DO loop.

5. Move the OUTPUT statement inside the nested DO loop, as shown below.

data invest (drop=Quarter);
   do Year=1 to 5;
      Capital+5000;
      do Quarter=1 to 4;
         Capital+(Capital*(.045/4));
         output;                         /* one observation per quarter */
      end;
   end;
run;

proc print data=invest;
run;

6. Submit the program again and view the results. This time the invest data set contains 20 observations, 4 for
each year.

Here's another question: what would have happened if we had just added a second OUTPUT statement in the
nested DO loop instead of moving the one that was already there? The output data set would have had 25
observations: 1 observation for each quarter, plus a 5th observation for each year that repeats the information
from the fourth quarter observation. That’s not what we want. So, we need only one OUTPUT statement in the
DATA step.
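
If you want each quarterly observation to be identifiable in the listing, one option is to keep the inner index variable instead of dropping it. This is a sketch under that assumption; the data set name invest_qtr is hypothetical.

```sas
data invest_qtr;                         /* no DROP= option, so Quarter is kept */
   do Year=1 to 5;
      Capital+5000;
      do Quarter=1 to 4;
         Capital+(Capital*(.045/4));
         output;                         /* one observation per Year/Quarter pair */
      end;
   end;
run;

proc print data=invest_qtr noobs;
   format Capital dollar14.2;
run;
```

The output data set should contain 20 observations and the three variables Year, Capital, and Quarter.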

Exploring Nested DO Loop Processing


Let's look at another example in detail. Suppose you want to compare the final results of investing $5,000 a year for five
years in three different banks that compound interest quarterly. Assume that each bank has a fixed interest rate. The
orion.banks data set contains three observations that contain the interest rates of the three banks.

This DATA step iterates three times because the input data set contains three observations. Capital must be set to 0 for
each iteration of the DATA step. Then, the nested DO loops calculate the growth of the investments by the annual
deposits and the quarterly-compounded interest. During the first iteration of the DATA step, the value of Rate is 0.0318.
During the second iteration, the value of Rate is 0.0321, and during the third iteration, the value of Rate is 0.0328.

Notice that both of the index variables are dropped from the output data set. Also, because neither of the DO loops
contains an OUTPUT statement, there is only one observation in the output data set for each iteration of the DATA step,
even though the DO loops within the DATA step execute 5 and 20 times for each iteration of the DATA step.
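
The DATA step described above is not reproduced in this text. A minimal reconstruction, assuming that orion.banks stores each bank's interest rate in a variable named Rate (as the description implies) and using the hypothetical output name invest_banks, might look like this:

```sas
/* No OUTPUT statement is used, so the implicit output at the end of
   each DATA step iteration writes one observation per bank */
data invest_banks (drop=Quarter Year);
   set orion.banks;                      /* one DATA step iteration per bank */
   Capital=0;                            /* reset the retained total for each bank */
   do Year=1 to 5;
      Capital+5000;                      /* annual deposit */
      do Quarter=1 to 4;
         Capital+(Capital*(Rate/4));     /* quarterly compounding at this bank's rate */
      end;
   end;
run;

proc print data=invest_banks noobs;
   format Capital dollar14.2;
run;
```

Capital is reset explicitly because the sum statement retains its value from one iteration of the DATA step to the next.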

Question
In the example below, the variable X is summed during each iteration of the DO loop. What is the value of X at the
completion of this DATA step?

data test;
do i=1 to 5;
do j=1 to 4;
x+1;
end;
end;
run;

a. 0

b. 5

c. 20

d. 21

The correct answer is c. The nested DO loop executes four times during each of the five executions of the outside DO
loop. The value of X is 20.

Summary: Using Iterative DO Loops

This summary contains topic summaries, syntax, and sample programs.

Topic Summaries

Constructing a Simple DO Loop


Iterative DO loops simplify your code, avoiding repetitive code and redundant calculations. An iterative DO loop
executes the statements between the DO and the END statements repetitively.

DO index-variable=start TO stop <BY increment>;
   iterated SAS statements…
END;

Start and stop values must be numeric constants or expressions that result in a numeric value. If you do not specify an
increment for a DO loop, the increment defaults to 1. If the start value is greater than the stop value, you must specify
an increment that is negative.
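
As a small illustration of the last rule, the sketch below counts down from a start value that is greater than the stop value. The data set name countdown is hypothetical.

```sas
data countdown;
   /* Start is greater than stop, so the increment must be negative */
   do Year=2010 to 2008 by -1;
      output;                   /* writes Year=2010, 2009, and 2008 */
   end;
run;

proc print data=countdown noobs;
run;
```

Without BY -1, the default increment of 1 applies, the start value already exceeds the stop value, and the loop body does not execute.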

You can use an item list rather than a start and stop value to control your DO loop. The items can be variables or
constants, and must be all numeric or all character. The DO loop is executed once for each item in the list.

DO index-variable=item-1 <,… item-n>;
   iterated SAS statements…
END;
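
Here is a minimal sketch of an item list in use. The rates, the deposit amount, and the names rates and Interest are made up for illustration.

```sas
data rates;
   /* The loop executes once for each item in the list */
   do Rate=.03, .045, .06;
      Interest=5000*Rate;       /* one year of simple interest on 5,000 */
      output;
   end;
run;

proc print data=rates noobs;
run;
```

The output data set should contain three observations, one for each rate in the list.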

You can use an OUTPUT statement within a DO loop to explicitly write an observation to the data set on each iteration
of the DO loop.

Conditionally Executing DO Loops


There are two types of conditional DO loops: DO UNTIL and DO WHILE. In a DO UNTIL loop, SAS executes the loop until
the condition specified in the DO UNTIL statement is true. Even though the expression is written at the top of the loop,
in a DO UNTIL it is evaluated at the bottom of the loop, after each iteration. The statements in a DO UNTIL loop will
always execute at least once.

DO UNTIL (expression);
   iterated SAS statements…
END;

In a DO WHILE statement, SAS executes the loop while the specified condition is true. The condition is evaluated at the
top of the loop, before the statements in the loop are executed. If the condition is initially false, the statements in a DO
WHILE loop will not execute at all.

DO WHILE (expression);
   iterated SAS statements…
END;

It is possible to create a DO WHILE or a DO UNTIL loop that executes infinitely. You should write your conditions and
iterated statements carefully.

You can also combine a DO UNTIL or DO WHILE loop with an iterative DO loop. In this case, the loop executes either until
the value of the index variable exceeds the stopping value or until the condition is met. This can be used to avoid
creating an infinite loop.

DO index-variable=start TO stop <BY increment>
   UNTIL | WHILE (expression);
   iterated SAS statements…
END;

Nesting DO Loops
You can nest DO loops in a DATA step. Be sure to use different index variables for each loop, and be certain that each DO
statement has a corresponding END statement.

DO index-variable-1=start TO stop <BY increment>;
   iterated SAS statements…
   DO index-variable-2=start TO stop <BY increment>;
      iterated SAS statements…
   END;
   iterated SAS statements…
END;

Sample Programs

Using Iterative DO Loops

data compound;
   Amount=50000;
   Rate=.045;
   /* 20 iterations at the full annual rate */
   do i=1 to 20;
      Yearly+(Yearly+Amount)*Rate;
   end;
   /* 80 iterations at one quarter of the annual rate */
   do i=1 to 80;
      Quarterly+((Quarterly+Amount)*Rate/4);
   end;
run;

proc print data=work.compound;
run;

Using a DO Loop to Reduce Redundant Code

data forecast;
   set orion.growth;
   do Year=1 to 6;
      Total_Employees=Total_Employees*(1+Increase);
      output;                   /* one observation per department per year */
   end;
run;

proc print data=forecast noobs;
run;

Conditionally Executing DO Loops

data invest;
   do until (Capital>1000000);
      Year+1;
      Capital+5000;
      Capital+(Capital*.045);
   end;
run;

proc print data=invest noobs;
run;

data invest2;
   do while (Capital<=1000000);
      Year+1;
      Capital+5000;
      Capital+(Capital*.045);
   end;
run;

proc print data=invest2 noobs;
run;

Using an Iterative DO Loop with a Conditional Clause

data invest;
   do year=1 to 30 until (Capital>250000);
      Capital+5000;
      Capital+(Capital*.045);
   end;
run;

data invest2;
   do year=1 to 30 while (Capital<=250000);
      Capital+5000;
      Capital+(Capital*.045);
   end;
run;

proc print data=invest;
   title "Invest";
   format Capital dollar14.2;
run;

proc print data=invest2;
   title "Invest2";
   format Capital dollar14.2;
run;

title;

Nesting DO Loops

data invest (drop=Quarter);
   do Year=1 to 5;
      Capital+5000;
      do Quarter=1 to 4;
         Capital+(Capital*(.045/4));
      end;
      output;
   end;
run;

proc print data=invest;
run;

Lesson 8: Using SAS Arrays


In DATA step programming, you often need to perform the same action on more than one variable. For example, you
might have 365 days of temperature readings recorded in the Fahrenheit scale that you want to convert to Celsius.
Suppose each reading is stored in a separate variable. You can perform the calculation on each variable individually, but
it would be easier to handle them as a group.

You can simplify your program by grouping the variables in an array and then referencing the array in your calculation.
By using an array and DO loop processing, you eliminate the need for 365 separate programming statements. In addition
to performing repetitive calculations, you can also use arrays to create new variables with the same attributes, or to
initialize variables with values that can be used to perform a table lookup. You also learn how to restructure a data set
using DO loops and arrays.

Objectives

In this lesson, you learn to do the following:

 define a SAS array
 reference elements of the SAS array
 use SAS arrays to perform repetitive calculations
 use SAS arrays to create new variables
 use SAS arrays to perform a table lookup
 use a DATA step with arrays and DO loop processing to restructure a data set

Understanding SAS Arrays


A SAS array is a temporary grouping of elements that exists only for the duration of the DATA step. An array is identified
by a single unique name. The array name must be different from any variable name in the data set you are referencing.
All variables that are grouped together in an array must be the same type: either all character or all numeric.

If you are familiar with other programming languages, you should know that SAS arrays are not data structures. In SAS,
arrays are simply a convenient way to temporarily identify a group of variables. Arrays can be multi-dimensional, but in
this lesson you will only work with one-dimensional arrays.

Question
Which statement about SAS arrays is true?

a. When you create an array in one DATA step, you can reference it by name in another DATA step.

b. An array can contain a mixture of numeric and character variables if they are explicitly named.

c. An array exists only for the duration of the DATA step.

d. It is a good practice to name the array the same name as one of the variables in the array.

The correct answer is c. The ARRAY statement is a compile-time only statement and the array exists only for the duration
of the DATA step.

Creating SAS Arrays


The employee_donations data set contains quarterly contribution data for each Orion Star employee. The company is
considering a 25 percent matching program. Suppose you need to create a new data set named charity that contains the
new contribution amount, including the proposed 25 percent company supplement.

You can do this by writing a relatively simple DATA step that includes four assignment statements. As you can see, the
assignment statements perform the same calculation over and over; only the name of the variable changes. This DATA
step will create the data set that you need, but you can simplify the program. You cannot replace the four calculations
with a single calculation inside a DO loop because the variables have different names. However, you can group the
variables in a SAS array and then process the array in a DO loop.
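
The four-assignment DATA step referred to above is not shown in this text. A plausible version, assuming the variables Employee_ID and Qtr1 through Qtr4 that are used later in this lesson and an orion libref that is already assigned, is sketched here:

```sas
data charity;
   set orion.employee_donations;
   keep employee_id qtr1-qtr4;
   /* The same calculation repeated; only the variable name changes */
   Qtr1=Qtr1*1.25;
   Qtr2=Qtr2*1.25;
   Qtr3=Qtr3*1.25;
   Qtr4=Qtr4*1.25;
run;
```

With only four variables the repetition is tolerable, but the same pattern with dozens of variables is where an array becomes worthwhile.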

Defining a SAS Array


You use the ARRAY statement to define an array. Let's look at the syntax. Array-name specifies the name of the array.
Dimension describes the number and arrangement of elements in the array. Array-elements lists the variables to include
in the array. Array elements can be listed in any order, but they must be either all numeric or all character. If no
elements are listed, SAS creates new variables with default names. You learn more about creating variables later. This
ARRAY statement creates an array named contrib that groups the four numeric variables Qtr1 through Qtr4.
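
In text form, the general shape of the statement described above is ARRAY array-name {dimension} <array-elements>;. The sketch below places the contrib example in a DATA step. The input data set is taken from the scenario, and DATA _NULL_ is used here only so that no output data set is created.

```sas
data _null_;
   set orion.employee_donations;
   /* array-name = contrib, dimension = 4, array-elements = Qtr1 Qtr2 Qtr3 Qtr4 */
   array contrib{4} Qtr1 Qtr2 Qtr3 Qtr4;
run;
```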

Specifying the Array Dimension


Let's look at a few syntax variations for the ARRAY statement. Another way to indicate the dimension of a one-
dimensional array is by using an asterisk. When you use an asterisk, SAS determines the dimension of the array by
counting the variables in the list of array elements. The list of array elements must be included if you use an asterisk.
The array dimension must be enclosed in either braces, brackets, or parentheses. It is best to use braces or brackets so
that there is no confusion with functions.
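
The variations above can be compared side by side. In this sketch the array names byBrace, byBracket, byParen, and byCount are hypothetical, and all four arrays group the same Qtr1 through Qtr4 variables from the scenario.

```sas
data _null_;
   set orion.employee_donations;
   array byBrace{4}   Qtr1 Qtr2 Qtr3 Qtr4;   /* dimension in braces */
   array byBracket[4] Qtr1 Qtr2 Qtr3 Qtr4;   /* dimension in brackets */
   array byParen(4)   Qtr1 Qtr2 Qtr3 Qtr4;   /* parentheses work, but resemble a function call */
   array byCount{*}   Qtr1 Qtr2 Qtr3 Qtr4;   /* asterisk: SAS counts the listed elements */
run;
```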

Specifying the Array Elements


There are also various ways to specify the list of array elements. Remember, all the elements in an array must be the
same type: either all numeric or all character. As you've seen, you can list each variable name separated by a space. You
already know how to use variable lists in functions.

You can use variable lists (either a numbered range list or a name range list) to specify array elements. You can also
use the special SAS names _NUMERIC_ or _CHARACTER_. If you use one of these keywords, the array includes either all
the numeric variables or all the character variables that were already defined in the current DATA step. Here's an
example of another valid ARRAY statement.

Variables that are elements of an array do not need to have similar, related, or numbered names. They also do not need
to be stored sequentially or be adjacent to one another. The order of elements in an array is determined by the order in
which they are listed in the ARRAY statement.
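
The different ways of listing elements can be sketched as follows. The array names are hypothetical, and the name range list assumes that Qtr1 through Qtr4 are stored next to each other in employee_donations with no character variables between them.

```sas
data _null_;
   set orion.employee_donations;
   array byNumber{*} Qtr1-Qtr4;     /* numbered range list */
   array byName{*}   Qtr1--Qtr4;    /* name range list: depends on variable order in the data set */
   array allNums{*}  _numeric_;     /* every numeric variable defined so far in this step */
   array picked{2}   Qtr4 Qtr1;     /* elements in any order; picked{1} is Qtr4 */
run;
```

The _NUMERIC_ array is defined after the SET statement so that the variables read from the input data set are already known when SAS compiles the ARRAY statement.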

Question
Which of the following ARRAY statements is correct?

a. array{*} Ans1-Ans10;

b. array quiz{6} Q1 Q2 Q3 Q4 Q5 Q6 Q7;

c. array fruit{*} Apples Pears Grapes Bananas Oranges;

d. quiz array{7} Q1-Q6;

The correct answer is c. The ARRAY statement begins with the keyword ARRAY, followed by the array name, the
dimension, and the array elements.

Question
The trial data set contains five numeric variables that store data collected during an experimental trial. Which ARRAY
statement groups the variables T1, T2, T3, T4, and T5 into an array named test?

a. array test{*} T1 T2 T3 T4 T5;

b. array test{5} T1-T5;

c. array test{*} _numeric_;

d. all of the above

The correct answer is d. All the ARRAY statements listed are valid ways to group the five numeric variables into an array.

Referencing Elements of an Array


Now that you know how to define an array, let's look at how you reference arrays in your DATA step program. When you
define an array in a DATA step, a subscript is assigned to each array element. The subscripts are assigned in the order of
the array elements. These subscripts are used to reference a specific element in the array. The syntax for an array
reference is the name of the array, followed by a subscript enclosed in braces, brackets, or parentheses. The subscript
can be an integer, a variable, or a SAS expression.

For the contrib array, the subscripts are 1 through 4. You can use an array reference to perform an action on an array
element during execution. The four variables, Qtr1, Qtr2, Qtr3, and Qtr4, can now be referenced by the array reference,
that is, the array name contrib, with an appropriate subscript. Here's a question: What array reference refers to the
value stored in Qtr2? The array reference contrib{2}.
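
The three kinds of subscript can be tried in a short sketch. The variable names Q2Value, Q4Value, and Q1Value are hypothetical, and the OBS= data set option simply limits the log output to the first three observations.

```sas
data _null_;
   set orion.employee_donations (obs=3);
   array contrib{4} Qtr1-Qtr4;
   Q2Value=contrib{2};        /* integer subscript: same value as Qtr2 */
   i=4;
   Q4Value=contrib{i};        /* variable subscript: same value as Qtr4 */
   Q1Value=contrib{i-3};      /* expression subscript: same value as Qtr1 */
   put Qtr2= Q2Value= Qtr4= Q4Value= Qtr1= Q1Value=;
run;
```

Each pair of values written to the log should match, because an array reference is just another way to name the underlying variable.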

Using a DO Loop to Reference Elements in an Array


The ability to reference the elements of an array by a subscript is what gives arrays their power. Typically, arrays are
used with DO loops to process multiple variables and to perform repetitive calculations. When you use a DO loop, the
index variable is used as the array subscript. As the index variable i is incremented from 1 to 4, the array reference
increments from 1 to 4 and processes each element of the array. Remember that you started with a DATA step that
calculated a new quarterly donation amount by using four assignment statements.

Now let's simplify the code by using an array to process the calculations. We’ll start with the same DATA, SET, and KEEP
statements. Now we add an ARRAY statement that creates the contrib array referencing the variables Qtr1 through
Qtr4. Now we write a DO statement that creates the index variable i which starts at 1 and stops at 4. The stop value is 4
because there are 4 array elements to process. Remember that by default, the index variable i is included in the output
data set. In this code, the KEEP statement that we already have prevents the index variable i from being written to the
output data. Inside the DO loop you use the array reference to perform the calculation for the employee and employer
charity contribution.

Processing a DO Loop to Reference Elements in an Array


Now, let's step through the compilation and execution phases of the DATA step. Compilation begins. SAS adds the
variables from the employee_donations data set to the program data vector. Only some of the variables are shown here.

Next, SAS creates the contrib array and assigns the subscripts for the array elements. Notice that the array name and the
array references are not included in the PDV. Arrays exist only for the duration of the DATA step. SAS also creates the
index variable, i. The descriptor portion of the SAS data set is complete.

Execution begins, and SAS initializes the variables in the PDV to missing. The SET statement executes and reads the first
observation from employee_donations into the PDV. SAS ignores the KEEP and ARRAY statements because they are
compile time only statements.

Next, SAS executes the DO loop and sets the index variable i to 1. Because the value of i is equal to 1, the array reference
contrib{i} becomes contrib{1}. Contrib{1} refers to the first array element, Qtr1, and the value of Qtr1 is multiplied by
1.25. In this case, the value in Qtr1 is missing, so a missing value is the result. As the DO loop iterations continue, SAS
increments the index variable i from 1 to 2, 3, and 4, which causes Qtr1 through Qtr4 to receive new values in the PDV.
When the index variable is incremented to 5, the value is out of range and the DO loop stops processing. Processing
returns to the top of the DATA step.

Question
Which DO loop completes the program so that the values stored in the variables Weight1-Weight8 are converted from
kilograms to pounds?

data hrd.convert;
set hrd.fitclass;
array wt{*} Weight1-Weight8;
___________
__________
___________
run;

a.
do i=1 to 8;
wt{i}=wt{i}*2.2046;
end;

b.

do i=1 to *;
wt[i]=wt[i]*2.2046;
end;

c.

do i=1 to 8;
Weight{i}=Weight{i}*2.2046;
end;

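The correct answer is a. There are eight elements in the array, so you specify 8 as the stop value for the DO loop. Then you
use the array reference wt{i} in the calculation to convert kilograms to pounds. Option b is invalid because an asterisk
cannot be used as a stop value, and option c fails because Weight is not the name of the array.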

Using the DIM Function to Determine the Number of Elements in an Array


You know how to specify a stop value for the iterative DO statement. Another way to specify the stop value is to use the
DIM function. The argument for the DIM function is the array name. The DIM function returns the number of elements
in the array. Let's look at an example.

Here, the DIM function is used instead of a stop value in the DO statement. Here's a question: What value does the DIM
function return? The function returns 6 because the array has six elements. One convenient feature of using the DIM
function is that you do not have to change the stop value of the iterative DO statement if you change the dimension of
the array.

Using an Array to Process Variables


In this demonstration you use an array to process variables.

1. Copy and paste the following code into the editor.

data charity;
   set orion.employee_donations;
   keep employee_id qtr1-qtr4;
   array contrib{*} qtr1-qtr4;
   do i=1 to 4;
      contrib{i}=contrib{i}*1.25;   /* add the 25% company match */
   end;
run;

proc print data=orion.employee_donations;
   var employee_id qtr1-qtr4;
run;

proc print data=charity;
run;

We need to calculate the total donation amount if Orion Star contributes an additional 25% of the employee
donation. The program creates the contrib array to reference the variables Qtr1 through Qtr4. Then, we
reference the array in a DO loop to perform the calculations for the quarterly donations.

The PRINT procedures enable us to compare the input and output data sets. We include a VAR statement for
the employee_donations data set to make the comparison easier.

2. Submit the code and check the log. Notice that there's a note that missing values were generated in the charity
data set.

3. View the PRINT procedure output for the charity data. Notice that there are many missing values.

4. View the PRINT procedure output for the employee_donations data. You can see that a missing value was
generated in the output data set when SAS tried to perform a calculation on a missing value in the input data
set.

5. Edit the code, as shown below, to use the DIM function in place of the stop value in the DO statement. Use
contrib as the argument.

data charity;
   set orion.employee_donations;
   keep employee_id qtr1-qtr4;
   array contrib{*} qtr1-qtr4;
   do i=1 to dim(contrib);          /* DIM returns the number of elements */
      contrib{i}=contrib{i}*1.25;
   end;
run;

proc print data=orion.employee_donations;
   var employee_id qtr1-qtr4;
run;

proc print data=charity;
run;

6. Submit the DATA step again and view the results. You can see that the program with the DIM function created
the same results as the program with the stop value.
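
SAS propagates missing values through the multiplication, which is why the log note appears in step 2. If you prefer to skip elements that are missing, one option (an addition here, not part of the demonstration) is to test each element with the MISSING function first. The data set name charity_checked is hypothetical.

```sas
data charity_checked;
   set orion.employee_donations;
   keep employee_id qtr1-qtr4;
   array contrib{*} qtr1-qtr4;
   do i=1 to dim(contrib);
      /* Only adjust quarters that actually have a donation recorded */
      if not missing(contrib{i}) then contrib{i}=contrib{i}*1.25;
   end;
run;
```

The values in the output data set should be the same as in charity, because a missing quarter stays missing either way. The difference is that no arithmetic is attempted on missing values, so the note should no longer be written to the log.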

Using SAS Arrays to Create Variables and Perform Calculations


Suppose you want to calculate the percentage that each employee's quarterly donation represents of their total annual
charitable contribution. To do this, you need to calculate the total annual contribution for each employee. Then you can
calculate the percentages for each quarter and store them in four new variables. You can use an array to perform those
calculations. You can also use an array to create the new variables and to calculate a total contribution.

Passing an Array to a Function


Let's see how you can use an array with a function to calculate the total. You used the SUM function before to calculate
a total. The argument for the SUM function is a list of variables to add together. To calculate a value for the Total
variable, you could list each of the four variables.

You can also use a variable list with the SUM function. Remember that when you pass a variable list to a function, you
use the keyword OF. Otherwise, SAS will interpret Qtr1-Qtr4 as an expression and find the difference. Because you are
using an array to reference the Qtr variables, it would be convenient to pass the array to the function. You can pass the
array to the function as if it were a variable list. Now you have a variable to hold the total annual contribution for each
employee.
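
The three ways of calling the SUM function described above can be sketched together. The data set name totals and the variable names Total_List, Total_Range, and Total_Array are hypothetical; in practice you would use only one of them.

```sas
data totals;
   set orion.employee_donations;
   array contrib{4} Qtr1-Qtr4;
   Total_List=sum(Qtr1, Qtr2, Qtr3, Qtr4);   /* each variable listed */
   Total_Range=sum(of Qtr1-Qtr4);            /* variable list, with the keyword OF */
   Total_Array=sum(of contrib{*});           /* array passed as if it were a variable list */
   /* Without OF, sum(Qtr1-Qtr4) would treat Qtr1-Qtr4 as a subtraction */
run;
```

All three Total variables should hold the same value for each employee. The SUM function ignores missing arguments, so a missing quarter does not make the total missing.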

Using an Array to Create Variables


So far, you created arrays of existing variables. In the array-elements list, you listed the variables from the data set that
you want to include in the array. Notice that the array-elements list is optional.

When you do not reference existing variables in the ARRAY statement, SAS automatically creates the variables for you
and assigns default names to them. The default variable names are created by concatenating the array name and the
numbers 1, 2, 3, and so on, up to the array dimension. During compilation of the DATA step, the variables that the
ARRAY statement creates are added to the PDV and are stored in the resulting data set. When you create variables in an
ARRAY statement, the default variable names will match the case that you used for the array name. If you prefer, you
can specify the new variable names by listing them in the ARRAY statement.

This ARRAY statement creates the variables Discount1, Discount2, Discount3, and Discount4. The array name does not
have to match the new variable names.

This ARRAY statement creates the numeric variables Oct12, Oct19, and Oct26. Variables that you create in an ARRAY
statement all have the same variable type. If you want to create an array of character variables, you must add a dollar
sign after the array dimension. By default, all character variables that are created in an ARRAY statement are assigned a
length of 8. You can assign a different length by specifying the length after the dollar sign. The length that you specify is
automatically assigned to all variables that are created by the ARRAY statement. If you want to create several variables
of different lengths, you can use a LENGTH statement before the ARRAY statement.
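
The ARRAY statements that this section refers to are not reproduced in the text. The sketch below shows one plausible form of each case. Apart from the variable names Discount1 through Discount4 and Oct12, Oct19, and Oct26, which come from the description above, the array and variable names are hypothetical.

```sas
data _null_;
   array Discount{4};                  /* creates numeric Discount1-Discount4 */
   array Weekly{3} Oct12 Oct19 Oct26;  /* creates numeric variables with the listed names */
   array Answer{5} $ 1;                /* creates character Answer1-Answer5, each of length 1 */
   length Code1 $ 3 Code2 $ 10;        /* different lengths are set before the ARRAY statement */
   array Codes{2} Code1 Code2;
   put _all_;                          /* lists every variable in the PDV in the log */
run;
```

Because none of these variables is assigned a value, the log also contains notes that the variables are uninitialized.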

Question
Which ARRAY statement creates the numeric variables Comp1 through Comp25? Select all that apply.

a. array Comp{25};

b. array compute{25} Comp1-Comp25;

c. array comp{25} $;

d. array comp{*} Comp1-Comp25;

The correct answers are a, b, and d. These ARRAY statements all create the numeric variables Comp1 through Comp25.
The dollar sign in option c creates character variables instead.

Using an Array to Create New Variables


In this demonstration, you use an array to create variables and calculate the percentage that each quarterly contribution
represents of the employee's annual contribution.

1. Copy and paste the following code into the editor.

data percent(drop=i);
   set orion.employee_donations;
   array contrib{4} qtr1-qtr4;
   /* add additional programming statements here */
run;

proc print data=percent;
   var Employee_ID pct1-pct4;
run;

The DATA step uses the contrib array to reference the existing Qtr1-Qtr4 variables. We need to add the code
that calculates the percentage that each quarterly contribution represents of the employee's annual
contribution.

2. Add a second array to create four numeric variables to store the calculated percentages. Name this array Pct
and specify a dimension of 4. These are the two arrays we need.

array Pct{4};

3. Add an assignment statement to create the Total variable. This variable will hold each employee's total
annual contribution. Use the SUM function with the contrib array as the argument. Remember, you must use
the keyword OF when you pass an array to a function.

Total=sum(of contrib{*});

4. Write a DO loop to calculate the values for the Pct variables. Divide the quarterly contribution by the total
annual contribution.

do i=1 to 4;
   pct{i}=contrib{i}/Total;
end;

5. In the PROC PRINT step, use the PERCENTw.d format to format the values in the Pct variables as percentages.

format pct1-pct4 percent6.;

6. Submit the code (your code should look like the code shown below).

data percent(drop=i);
   set orion.employee_donations;
   array contrib{4} qtr1-qtr4;
   array Pct{4};                      /* creates Pct1-Pct4 */
   Total=sum(of contrib{*});          /* total annual contribution */
   do i=1 to 4;
      pct{i}=contrib{i}/Total;        /* each quarter's share of the total */
   end;
run;

proc print data=percent;
   var Employee_ID Pct1-Pct4;
   format Pct1-Pct4 percent6.;
run;

7. Check the log. Verify that no errors occurred.

8. View the results. You can see that the new variables were created.
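
If an employee has no contribution in any quarter, Total is missing and the division cannot produce a meaningful percentage. The following is a minimal sketch of a guarded variant, assuming you prefer to skip the calculation in that case. The Total > 0 test is an assumption added for illustration and is not part of the demonstration.

```sas
data percent(drop=i);
   set orion.employee_donations;
   array contrib{4} qtr1-qtr4;
   array Pct{4};
   Total=sum(of contrib{*});
   /* Assumption: skip the division when Total is missing or zero, */
   /* which leaves Pct1-Pct4 missing for that employee             */
   if Total > 0 then do i=1 to 4;
      pct{i}=contrib{i}/Total;
   end;
run;
```

A missing quarterly value still produces a missing percentage for that quarter, because the division operator propagates missing values.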

Business Scenario
Suppose this time you want to calculate the difference in each employee's contribution from one quarter to the next
and store the values in the data. You know that you can use arrays to do this! You need one array to reference the
existing variables, and another array to create the three variables for the differences between quarters.

Using Arrays to Perform Calculations


Let's take a look at the code. The contrib array references the existing variables. The Diff array creates three variables:
Diff1, Diff2, and Diff3. Notice that each value of diff{i} is calculated by subtracting contrib{i} from contrib{i+1}. By
manipulating the index variable, you can easily reference any array element. Let's see what happens in the three
iterations of the DO loop. When i=1, the value of the variable Diff1 is contrib{2}-contrib{1}, or Qtr2-Qtr1. When i=2, the
value of the variable Diff2 is contrib{3}-contrib{2}, or Qtr3-Qtr2. When i=3, the value of the variable Diff3 is contrib{4}-
contrib{3}, or Qtr4-Qtr3.

Using an Array to Perform Calculations and Create Variables


In this demonstration, you use an array to calculate the difference in employee contributions from one quarter to the
next.

1. Copy and paste the following DATA step into the editor.

data change;
   set orion.employee_donations;
   drop i;
   array contrib{4} Qtr1-Qtr4;
   array Diff{3};
   do i=1 to 3;
      diff{i}=contrib{i+1}-contrib{i};
   end;
run;

2. Add a PRINT procedure that prints only Employee_ID, Qtr1-Qtr4, and Diff1-Diff3.

proc print data=change;
   var Employee_ID Qtr1-Qtr4 Diff1-Diff3;
run;

3. Submit the code and check the log. Verify that no errors occurred.

4. View the results. Remember that each Diff1 value is the Qtr2 contribution minus the Qtr1 contribution. As
you've seen in the past, if either value in a subtraction is missing in the input data, the result is a missing
value in the output data.

Assigning Initial Values to an Array


Suppose you need to find the difference between the amount that the employee contributed each quarter and Orion
Star's quarterly goals of $10, $20, $20, and $15. The quarterly goals are not stored in the employee_donations data set.
To do this, you can create an array of new variables that store the values of the quarterly goals. Then you can reference
the variables in your calculations. This is called table lookup.

To find the difference between the employee contribution and the quarterly goals, you need to create three arrays. The
first array references the actual employee contributions (Qtr1 through Qtr4). The second array creates four variables
and assigns initial values that represent Orion Star's goal amounts for each quarter. The third array creates four variables
to store the calculated difference between the goal amount and the actual contribution amount. The only new task that
you need to learn is how to assign initial values to variables in an ARRAY statement.

Creating an Array with Initial Values


To assign initial values in an ARRAY statement, you place the values in an initial value list. In this list, you specify one
initial value for each corresponding array element. Elements and values are matched by position, so the values must be
listed in the order of the array elements. You separate each value with a comma or blank, and you enclose the values in
parentheses. If you are assigning character values, each value must be enclosed in quotation marks.

This array creates the variables Target1 through Target5 and assigns the initial values that are listed in order. If there are
more array elements than initial values, SAS assigns missing values to the remaining array elements and issues a
warning.
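
As a minimal sketch of both cases, the step below defines one array with a full initial value list and one with a short list. The initial values and the array name Partial are hypothetical and chosen only for illustration.

```sas
data _null_;
   /* Five elements, five initial values (hypothetical amounts) */
   array Target{5} (50,100,125,150,200);
   /* Five elements, three initial values: Partial4 and Partial5 */
   /* start as missing, and SAS writes a warning to the log      */
   array Partial{5} (50,100,125);
   /* Write selected values to the log to check the assignment */
   put Target1= Target5= Partial3= Partial4= Partial5=;
run;
```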

When you specify an initial value list, all elements behave as if they were named in a RETAIN statement. This creates a
lookup table, that is, a list of values to refer to during DATA step processing. You created the contrib and Diff arrays in
previous tasks. Now you need to create the array that stores the goal values. Then, in the DO loop, you use the arrays to
calculate the difference between the actual contribution and Orion Star's goals.

Here's a question: What variables are created during compilation? Nine new variables are created. The Diff array creates
four variables, the Goal array creates four variables, and the DO statement creates one variable—the index variable i.
The contrib array references existing variables. You do not need the variables i and Goal1-Goal4 in your output data, so
you can add the DROP= data set option to drop them.

Let's see what happens during compilation and initialization. Both retain and drop flags are set for the variables Goal1
through Goal4. A drop flag is set for the variable i. SAS initializes the PDV. Notice that the initial values that you specified
in the ARRAY statement are set for Goal1 through Goal4. The values in variables Goal1 through Goal4 are retained in the
PDV.
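
The program described here can be sketched as follows. This is a reconstruction from the description above rather than the course's exact listing; it assumes the goal values of 10, 20, 20, and 15 given earlier.

```sas
data compare(drop=i Goal1-Goal4);
   set orion.employee_donations;
   array contrib{4} Qtr1-Qtr4;      /* references existing variables  */
   array Diff{4};                   /* creates Diff1-Diff4            */
   array Goal{4} (10,20,20,15);     /* creates and retains Goal1-Goal4 */
   do i=1 to 4;
      Diff{i}=contrib{i}-Goal{i};
   end;
run;
```

Because Goal1 through Goal4 are retained, SAS does not reset them to missing at the top of each DATA step iteration, so the goal values are available for every observation.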

Creating a Temporary Array of Values


Rather than creating new variables and then dropping them, you can use the keyword _TEMPORARY_ in an ARRAY
statement to indicate that the elements are not needed in the output data set. You use the keyword after the array
name and the dimension. Temporary arrays can improve performance and are useful when you only need the array to
perform a calculation or to look up a value. When you create a temporary array, SAS sets aside an area of memory and
accesses that memory directly, instead of creating slots in the PDV. Since a temporary array doesn’t create variables in
the PDV, you can’t refer to the values by variable names. Instead, you must refer to the table values using the array
name and subscript. For example, in this program you could refer to Diff1, but you can’t refer to Goal1.

Using an Array to Create a Lookup Table

In this demonstration you assign initial values to variables in an array.

1. Copy and paste the following DATA step into the editor.

data compare(drop=i);
   set orion.employee_donations;
   array contrib{4} Qtr1-Qtr4;
   array Diff{4};
   array goal{4} (10,20,20,15);
   do i=1 to 4;
      diff{i}=contrib{i}-goal{i};
   end;
run;

We need to find the difference between the amount that the employee contributed each quarter and Orion
Star's quarterly goals. The contrib array references the variables Qtr1-Qtr4. The Diff array creates four variables
to store the calculated differences. The goal array creates the variables goal1-goal4 and assigns the initial values
to the variables in order. These values represent Orion Star's quarterly goal amounts.

2. Submit the DATA step and check the log. The compare data set should have 15 variables.
Here's a question: There are 7 variables in the employee_donations data set. Why are there 15 variables in the
compare data set? The compare data includes 7 variables from the input data set, 4 variables created by the Diff
array, and 4 variables created by the goal array. Because of the DROP= data set option, the index variable i is not
written to the output data set.

3. Edit the code, as shown below, to make goal a temporary array.

data compare(drop=i);
   set orion.employee_donations;
   array contrib{4} Qtr1-Qtr4;
   array Diff{4};
   array goal{4} _temporary_ (10,20,20,15);
   do i=1 to 4;
      diff{i}=contrib{i}-goal{i};
   end;
run;

4. Submit the revised DATA step and check the log. Notice that this time the compare data set has 11 variables.
Temporary arrays are created in memory. No corresponding variables are created in the program data vector, so
there is no need to drop the array elements.

5. Add a PROC PRINT step for the compare data set and list only the Employee_ID and Diff variables.

proc print data=compare;
   var Employee_ID Diff1-Diff4;
run;

6. Submit the PROC PRINT step and view the results. As you've seen in previous demonstrations, there are missing
values in the compare data set because there are missing values in the employee_donations data set.

7. Suppose you know that these missing values indicate that no contribution was made in that quarter. You know
from a previous lesson that the SUM function ignores missing values. You can use the SUM function to treat a
missing value as a 0.

Edit the DATA step, as shown below, to use the SUM function. As arguments to the SUM function, list the
contrib array reference and the negative goal array reference.

data compare(drop=i);
   set orion.employee_donations;
   array contrib{4} Qtr1-Qtr4;
   array Diff{4};
   array goal{4} _temporary_ (10,20,20,15);
   do i=1 to 4;
      diff{i}=sum(contrib{i},-goal{i});
   end;
run;

8. Submit the revised DATA step and the PROC PRINT step. Then view the results. As you can see, the compare
data set no longer contains missing values.
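
The difference between the subtraction operator and the SUM function can be seen in isolation. This is a minimal sketch with hypothetical variable names and values, not part of the demonstration.

```sas
data _null_;
   contrib=.;                      /* hypothetical missing contribution */
   goal=10;
   bySubtraction=contrib-goal;     /* operator propagates the missing value */
   bySum=sum(contrib,-goal);       /* SUM ignores the missing argument */
   put bySubtraction= bySum=;
run;
```

Here bySubtraction is missing, while bySum is -10, which is the result you would get if the missing contribution were 0.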

Question
Which ARRAY statement correctly defines a temporary lookup table named country with three elements that are
initialized to AU, NZ, and US?

a. array country{3} _temporary_ ('AU','NZ','US');


b. array country{3} $ 2 _temporary_ (AU,NZ,US);

c. array country{3} $ 2 _temporary_ ('AU','NZ','US');

The correct answer is c. $ 2 indicates that each value in the lookup table is a character data type with a length of 2. You
use the keyword _TEMPORARY_ to specify a temporary array. You must enclose the initial character values in quotation
marks.

Restructuring a Data Set


When you write a program, you need to consider the data available, the output you want, and the processing required.
To create your output, you might need to restructure the data. Some data sets store all the information about one entity
in a single observation. This type of data set is referred to as a wide data set.

For example, in this wide data set, information for each employee who made a donation is a separate observation. The
donations for each quarter are stored in separate variables. Other data sets have multiple observations per entity. Each
observation typically contains a small amount of data, and missing values might or might not be stored. This type of data
set is known as a narrow or long data set. Depending on the type of analysis or report you want to run, you might need
to convert a narrow data set to a wide data set, or a wide data set to a narrow data set.
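
To make the two layouts concrete, here is a minimal sketch that builds the same information both ways. The data set names, Employee_ID values, and amounts are hypothetical.

```sas
/* Wide: one observation per employee, one variable per quarter */
data wide;
   input Employee_ID Qtr1 Qtr2 Qtr3 Qtr4;
   datalines;
1001 . . . 25
1002 15 15 15 15
;
run;

/* Narrow: one observation per non-missing quarterly contribution */
data narrow;
   input Employee_ID Period $ Amount;
   datalines;
1001 Qtr4 25
1002 Qtr1 15
1002 Qtr2 15
1002 Qtr3 15
1002 Qtr4 15
;
run;
```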

Question
In the box next to each data set characteristic, type the letter of the matching data set structure.

___ There are multiple observations per entity.
___ All the information about one entity is contained in a single observation.
___ Each observation typically contains a small amount of data.

a. wide data set
b. narrow data set

The correct answers from top to bottom are b, a, b. Some data sets store all the information about one entity in a single
observation. This type of data set is referred to as a wide data set. Narrow data sets typically have multiple observations
per entity, with a small amount of data in each observation.

Business Scenario
The Payroll Manager at Orion Star asked for a report that shows the number of employees who made charitable
donations in each quarter. You can use the FREQ procedure to generate the report. The orion.employee_donations data
set contains the information that you need, but it’s not in a form that can be easily analyzed using PROC FREQ.

To simplify the task of creating the frequency report, you need to restructure the data set so it contains a separate
observation for each non-missing quarterly contribution. Restructuring a data set is sometimes called rotating a data set.

Restructuring a SAS Data Set with a DATA Step


You can use a DATA step to restructure a SAS data set. If the value of contrib{i} is not equal to missing, the statements in
the DO group within the DO loop create the values for Period and Amount. The OUTPUT statement writes the
observation to the output data set.
Processing the DATA Step
To better understand how this DATA step works, let’s take a step-by-step look at how SAS processes the code. When the
program is compiled, SAS creates the program data vector. As execution begins, SAS initializes the variables in the PDV
to missing. The SET statement reads the first observation from employee_donations into the PDV. Remember that SAS
ignores the ARRAY statement during execution because it’s a compile-time-only statement.

Next, SAS executes the DO loop and sets the index variable i to 1. In the first iteration of the DO loop, the value of i is
equal to 1. So, the array reference contrib{i} becomes contrib{1}, which refers to the first array element, Qtr1. The value
of Qtr1 in the first observation is missing, so SAS skips the remaining statements in the DO group. At the bottom of the
DO loop, the value of i is incremented by 1, and control returns to the top of the DO loop. Now contrib{i} becomes
contrib{2}, which refers to the second array element, Qtr2. The value of Qtr2 in the first observation is also missing, so
SAS skips the remaining statements in the DO group. At the bottom of the DO loop, the value of i is incremented by 1,
and control returns to the top of the DO loop. Now the value of contrib{i} becomes contrib{3}.

Once again, the condition specified in the IF-THEN statement is false. So, the remaining statements in the DO group
don’t execute. The value of i is incremented by 1, and control returns to the top of the DO loop. Now the value of
contrib{i} becomes contrib{4}. This time, the value of the variable being referenced, Qtr4, has a value of 25. The
condition specified in the IF-THEN statement is true. So, the remaining statements in the DO group execute.

The first assignment statement concatenates the text Qtr with the value of i (in this case, 4) to create a value for Period.
The second assignment statement assigns the value of contrib{4} to Amount. The explicit OUTPUT statement outputs
the observation to the rotate data set. At the bottom of the DO loop, the value of i is incremented by 1. When control
returns to the top of the DO loop, the value of i is out of range. So, SAS exits the DO loop.

At the bottom of the DATA step, there is an implicit RETURN statement. Control returns to the top of the DATA step, and
SAS initializes the values of Period, Amount, and i. The SET statement loads the values from the second observation into
the PDV. SAS executes the DO loop and sets the index variable i to 1. The value contrib{1} is now 15. The IF-THEN
statement is true, so SAS executes the assignment statements in the DO group and outputs the current observation to
the rotate data set. This process continues until the end of the file is reached.
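
The walkthrough above can be reproduced on a small scale. The sketch below assumes a hypothetical two-row input data set, work.donations, whose first row matches the walkthrough (Qtr1 through Qtr3 missing, Qtr4 equal to 25) and whose second row begins with Qtr1 equal to 15. The Employee_ID values and the remaining amounts are invented for illustration.

```sas
/* Hypothetical input that mirrors the first two observations */
data donations;
   input Employee_ID Qtr1-Qtr4;
   datalines;
1001 . . . 25
1002 15 15 15 15
;
run;

data rotate(keep=Employee_ID Period Amount);
   set donations;
   array contrib{4} Qtr1-Qtr4;
   do i=1 to 4;
      /* Write an observation only for non-missing contributions */
      if contrib{i} ne . then do;
         Period=cats("Qtr",i);
         Amount=contrib{i};
         output;
      end;
   end;
run;

proc print data=rotate;
run;
```

With this input, the first row contributes one observation (Qtr4) and the second row contributes four, so rotate contains five observations.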

Rotate a Data Set Using a DATA Step


In this demonstration, you use a DATA step to rotate a data set.

1. Copy and paste the following code into the editor.

data rotate (keep=Employee_ID Period Amount);
   set orion.employee_donations
       (drop=Recipients Paid_By);
   array contrib{4} qtr1-qtr4;
   do i=1 to 4;
      if contrib{i} ne . then do;
         Period=cats("Qtr",i);
         Amount=contrib{i};
         output;
      end;
   end;
run;

proc print data=rotate;
run;

The DATA step restructures the data in orion.employee_donations to create a new output data set named
rotate.

2. Submit the code and check the log. Notice that 124 observations were read from the
employee_donations data set and that the rotate data set has 417 observations and 3 variables.

3. View the results. Notice that there are multiple observations for some values of Employee_ID. You need a report
that shows the number of donations for each quarter.

4. Copy and paste the following PROC FREQ step into the editor.

proc freq data=rotate;
   tables Period / nocum nopct;
run;

This step creates a report that shows the number of donations for each quarter. You only need a simple
frequency count. The NOCUM option suppresses the display of cumulative frequencies and cumulative
percentages. The NOPCT option suppresses the display of percentages.

5. Submit the PROC FREQ step and check the log. Verify that no errors occurred.

6. View the results. You can see the number of donations for each quarter.

Summary: Using SAS Arrays

This summary contains topic summaries, syntax, and sample programs.

Topic Summaries
Understanding SAS Arrays


A SAS array is a temporary grouping of variables that exists only for the duration of the DATA step in which it is defined.
An array is identified by a single, unique name. The variables in the array are called elements, and are referred to by
combining the array name and a subscript.

The variables that are grouped together in an array must be the same type: either all character or all numeric. Unlike
arrays in other programming languages, SAS arrays are not data structures.

Creating SAS Arrays


You use the ARRAY statement to define an array. Array-name specifies the name of the array. Dimension describes the
number of elements in the array. Array-elements lists the variables to include in the array. Array elements can be listed
in any order. If no elements are listed, SAS creates new variables with default names by combining the array name with
consecutive integers.

ARRAY array-name {dimension} <array-elements>;


There are a few syntax variations for the ARRAY statement. When you use an asterisk for the dimension, SAS determines
the number of elements by counting the variables in the array-elements list. This list must be included if you use an
asterisk. The array dimension must be enclosed in either braces, brackets, or parentheses.

There are also various ways to specify the list of array elements. You can list each variable name separated by a space,
use variable lists – either a numbered range list or a name range list – to specify the array elements, or use the special
SAS names _NUMERIC_ or _CHARACTER_. If you use one of these keywords, the array includes either all the numeric
variables or all the character variables that were already defined in the current DATA step.
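
A minimal sketch of these variations, assuming the orion.employee_donations data set used elsewhere in this lesson; the array names nums and same and the variables n_contrib and n_nums are hypothetical.

```sas
data _null_;
   set orion.employee_donations;
   /* Asterisk: SAS counts the variables in the list */
   array contrib{*} Qtr1-Qtr4;
   /* _NUMERIC_: all numeric variables defined so far in this step */
   array nums{*} _numeric_;
   /* Brackets work the same way as braces */
   array same[4] Qtr1-Qtr4;
   n_contrib=dim(contrib);
   n_nums=dim(nums);
   /* Write the element counts to the log, then stop after one row */
   put n_contrib= n_nums=;
   stop;
run;
```

Because n_contrib and n_nums are created after the ARRAY statements, they are not included in the nums array.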

Processing SAS Arrays


When you define an array in a DATA step, each array element is referenced using a subscript.

array-name{subscript}

The subscripts are assigned in the order of the array elements. The first element has a subscript of 1, the second
element has a subscript of 2, etc. The syntax for an array reference is the name of the array, followed by a subscript
enclosed in braces, brackets, or parentheses. The subscript can be an integer, a variable, or a SAS expression.

Typically, DO loops are used to process arrays. This allows you to process multiple variables and to perform repetitive
calculations using the array name with the DO loop index variable as the subscript, allowing you to reference a different
array element on each iteration of the loop.

The DIM function returns the number of elements in an array.

DIM(array-name)

You can use the DIM function to specify the stop value of a DO loop. This is useful when an asterisk is used in the ARRAY
statement to determine the array size dynamically.

DO index-variable=start TO DIM(array-name);

Using SAS Arrays to Create Variables and Perform Calculations


You can use an ARRAY statement to create many variables of the same type. The variable names are created by
concatenating the array name and the numbers 1, 2, 3, and so on, up to the array dimension. In this case, you must
specify a dimension. You can also create variables by listing the new variable names in the array-elements list in the
ARRAY statement. The array name does not have to match the new variable names.

ARRAY array-name {dimension} <$> <length> <array-elements>;

Variables that you create in an ARRAY statement all have the same variable type. To create an array of character
variables, you must include a dollar sign after the array dimension, optionally followed by a length that applies to each
character variable being created. If the length is omitted, all character variables created in an ARRAY statement are
assigned a length of 8. If you want to create several variables of different lengths, you can use a LENGTH statement
before the ARRAY statement.
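
The following sketch illustrates these rules. The array names Abbr, Item, and place and the variables City and State are hypothetical; the VLENGTH function is used only to report each variable's length in the log.

```sas
data _null_;
   /* Abbr1-Abbr3: character variables, each with a length of 2 */
   array Abbr{3} $ 2;
   /* Item1-Item3: no length given, so each has a length of 8 */
   array Item{3} $;
   /* Different lengths: declare the variables first, then group them */
   length City $ 20 State $ 2;
   array place{2} $ City State;
   len_abbr=vlength(Abbr1);
   len_item=vlength(Item1);
   len_city=vlength(City);
   put len_abbr= len_item= len_city=;
run;
```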

If the array-elements list names variables that do not exist, SAS automatically creates them. If no elements are listed,
SAS assigns default names based on the array name.

You can pass an entire array to a function as if it were a variable list. Remember that when you pass a variable list to a
function, you must use the keyword OF.

SUM(OF array-name{*})

When using arrays to perform calculations, you can easily reference any array element by manipulating the index
variable.

Assigning Initial Values to an Array


To assign initial values to array elements, you place the values in an initial-value-list in the ARRAY statement. Elements
and values are matched by position, so the initial values must be listed in the order of the array elements. The values
must be comma or blank separated, and you must enclose the list in parentheses. If you are assigning character values,
each value must be enclosed in quotation marks.

ARRAY array-name {dimension} <_TEMPORARY_>
      <array-elements> <(initial-value-list)>;

When you specify an initial value list, all elements behave as if they were named in a RETAIN statement. This creates a
static list of values that can be used as a lookup table during DATA step processing.

Temporary arrays can improve performance and are useful when you only need the array to perform a calculation or to
look up a value. To create a temporary array, you use the keyword _TEMPORARY_ following the dimension in the ARRAY
statement. Temporary arrays are created in memory. No corresponding variables are created in the program data
vector, so there is no need to drop the array elements.

Restructuring a Data Set


Some data sets store all the information about one entity in a single observation. Other data sets have multiple
observations per entity, and each observation typically contains a small amount of data. Missing values might or might
not be stored in the observations. At times you might want to restructure a data set from one form to another to
prepare it for further processing. Restructuring a data set is sometimes referred to as rotating a data set.

You can use arrays and DO loops to restructure a SAS data set.

Sample Programs
Processing SAS Arrays

data charity;
   set orion.employee_donations;
   keep employee_id qtr1-qtr4;
   array contrib{*} qtr1-qtr4;
   do i=1 to dim(contrib);
      contrib{i}=contrib{i}*1.25;
   end;
run;

proc print data=charity;
run;

Using an Array to Create Variables

data percent(drop=i);
   set orion.employee_donations;
   array contrib{4} qtr1-qtr4;
   array Pct{4};
   Total=sum(of contrib{*});
   do i=1 to 4;
      pct{i}=contrib{i}/Total;
   end;
run;

proc print data=percent;
   var Employee_ID Pct1-Pct4;
   format Pct1-Pct4 percent6.;
run;

Using an Array to Perform Calculations and Create Variables

data change;
   set orion.employee_donations;
   drop i;
   array contrib{4} Qtr1-Qtr4;
   array Diff{3};
   do i=1 to 3;
      diff{i}=contrib{i+1}-contrib{i};
   end;
run;

proc print data=change;
   var Employee_ID Qtr1-Qtr4 Diff1-Diff3;
run;

Assigning Initial Values to an Array

data compare(drop=i);
   set orion.employee_donations;
   array contrib{4} Qtr1-Qtr4;
   array Diff{4};
   array goal{4} _temporary_ (10,20,20,15);
   do i=1 to 4;
      diff{i}=sum(contrib{i},-goal{i});
   end;
run;

proc print data=compare;
   var Employee_ID Diff1-Diff4;
run;

Restructuring a Data Set

data rotate (keep=Employee_ID Period Amount);
   set orion.employee_donations
       (drop=Recipients Paid_By);
   array contrib{4} qtr1-qtr4;
   do i=1 to 4;
      if contrib{i} ne . then do;
         Period=cats("Qtr",i);
         Amount=contrib{i};
         output;
      end;
   end;
run;

proc freq data=rotate;
   tables Period / nocum nopct;
run;

Lesson 9: Combining SAS Data Sets


In the SAS Programming I: Essentials course, you learned how to use match-merging in a DATA step to combine
observations from two or more data sets into a single observation in a new data set. In this lesson, you learn to use
match-merging in combination with other data manipulation techniques that you already know, such as the OUTPUT
statement, the DROP= and KEEP= data set options, FIRST. and LAST. processing, the sum statement, and functions. You
also learn how to match-merge data sets that lack a common variable, as well as how to match-merge a SAS data set
and an Excel workbook.

Objectives

In this lesson, you learn to do the following:

 use data manipulation techniques in a DATA step to perform a match-merge


 perform a match-merge on three SAS data sets that lack a common variable
 perform a match-merge on a SAS data set and a Microsoft Excel workbook
 perform a match-merge on SAS data sets that have identically named variables other than the BY variables

Match-Merging SAS Data Sets


The orion.customer data set contains information about Orion Star customers. The orion.order_fact data set contains
customer order information. Suppose you need to combine these data sets so that the customer information and the
product information for 2007 are in one data set.

Match-Merging SAS Data Sets


Merging these data sets is a simple process. You should already know that you can use the MERGE statement to
combine SAS data sets with related data based on the values of one or more common variables. This process is called
match-merging.

In this case, you need to merge the data sets by the common variable Customer_ID. But remember, each input data set
must first be sorted in order of the values of the BY variable or variables, or it must have an appropriate index.
Orion.customer is already sorted by Customer_ID. Orion.order_fact isn't sorted, so you need to sort this data set by
Customer_ID before the merge. You also need to subset the order_fact data so that the output data set contains only
order information for 2007. You can use the YEAR function to extract the year value from order_date and then subset
the value using a WHERE statement. We'll write the sorted subset to work.orders_2007.

Once you sort the data sets, you can merge them. The MERGE statement lists the data sets to combine. SAS
joins observations from the listed data sets into a single observation. The BY statement performs the match-
merge by matching observations on the common BY variable, which is Customer_ID. Here you can see the log
messages that this code generates. A match-merge produces matches and non-matches by default. The output
data set, custord, contains more observations than either of the input data sets.

Identifying Data Set Contributors


Here's a partial listing of the new combined data set, custord. The highlighted rows represent the matches. Both data
sets contributed to these observations.

These highlighted rows represent the non-matches. Non-matches are observations that contain data from only one
input data set. These customers didn't place orders, so the orders_2007 data set didn't contribute to these observations.

Suppose you want the output data set to contain only the matches. Which data set option can you add to the MERGE
statement to help identify observations that contain data from both input data sets? You can use the IN= data set option
to identify which input data sets contribute data to each observation that SAS processes. In the code, you can use two
IN= data set options and a subsetting IF statement to determine whether an observation contains data from both input
data sets.

In this example, the first IN= option creates the temporary variable cust. This temporary variable is set to 1 when an
observation from the orion.customer data set contributes to the current observation; otherwise, it is set to 0. The
second IN= option creates the temporary variable order. The value of order depends on whether the orders_2007 data
set contributes to an observation.

The subsetting IF statement controls which observations will be in the output. Once SAS reaches the IF statement in the
DATA step shown here, the only processing that remains is the implicit output at the bottom of the DATA step. So, if
both of the IN= values equal 1, the condition in the IF statement evaluates to true and SAS writes the data that's
currently in the PDV to the output data set. If one of the IN= values is 0, the condition evaluates to false and SAS does
not write the observation to the output data set. Because SAS considers any numeric value other than 0 or missing as
true, you can also write the IF statement as if cust and order.

Match-Merging SAS Data Sets


In this demonstration, you'll see how to match-merge data sets, and then you'll see how to control output using the IN=
data set option and a subsetting IF statement.

1. Submit the following code:

proc sort data=orion.order_fact
          out=work.orders_2007;
   by Customer_ID;
   where year(Order_Date)=2007;
run;

data custord;
   merge orion.customer
         work.orders_2007;
   by Customer_ID;
run;

proc print data=custord;
run;

2. Check the log. Notice that custord contains 163 observations and 22 variables.
3. Look at the output. The custord data set contains both matches and non-matches.
4. Now submit the following code to control which observations SAS writes to the output data set. You only need
to include the observations that contain data from both input data sets. You can use the IN= option to identify
which input data set contributes data to each observation. Then you can use an IF statement to check the IN=
values. If both of the IN= values equal 1, SAS will write the observation to the output data set.

data custord;
   merge orion.customer(in=cust)
         work.orders_2007(in=order);
   by Customer_ID;
   if cust and order;
run;

proc print data=custord;
run;

5. Check the log. custord now contains 128 observations.


6. Look at the output. There's a match for every order.

Question
Which of the following IF statements creates the non-matches seen in the combine data set?

products
Product    ID
XYZ Shoe   A123
ABC Coat   B456

costs
ID     Cost
B456   59.99
C789   35.75

combine
Product    ID     Cost
XYZ Shoe   A123   .
           C789   35.75

data combine;
   merge products(in=InProd) costs(in=InCost);
   by ID;
   _________________________________
run;

a. if InProd=1 or InCost=1;

b. if InProd=0 or InCost=0;

c. if InProd=0 and InCost=0;

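The correct answer is b. An IN= variable equal to 0 means that the corresponding data set didn't contribute to the
current observation. A non-match is an observation to which one of the data sets didn't contribute, so you use the "or"
logical operator in the IF statement. If you use the "and" operator, the condition is never true, because at least one
data set always contributes to each observation, so no observations are written to the new data set.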
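
To check this answer, you can rebuild the two small input data sets and run the merge. This is a minimal sketch: the INPUT statements and the lengths chosen for ID are assumptions made so that the data in the question can be typed in directly.

```sas
/* Column input keeps the embedded blank in "XYZ Shoe" */
data products;
   input Product $ 1-8 ID $ 10-13;
   datalines;
XYZ Shoe A123
ABC Coat B456
;
run;

/* :$4. gives ID the same length (4) as in products */
data costs;
   input ID :$4. Cost;
   datalines;
B456 59.99
C789 35.75
;
run;

/* Both data sets are already in ID order, so no sort is needed */
data combine;
   merge products(in=InProd) costs(in=InCost);
   by ID;
   if InProd=0 or InCost=0;   /* keep only the non-matches */
run;

proc print data=combine;
run;
```

The result contains two observations: A123, which has no cost, and C789, which has no product name. B456 appears in both data sets, so it is a match and is excluded.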

Using Data Manipulation Techniques with a Match-Merge


Now, let's see how to combine match-merging with other techniques that you already know. Using these techniques
with a match-merge will enable you to have more control of your output. We'll start by looking at how you can use the
OUTPUT statement and the DROP= and KEEP= data set options with a match-merge. Then we'll explore other
techniques, such as using FIRST. and LAST. processing, the sum statement, and functions.

Controlling Match-Merge Output


You know you can add an explicit OUTPUT statement to the DATA step to control when SAS writes an observation to a
SAS data set. When you use the OUTPUT statement in combination with the MERGE statement, you have more options
for controlling your results. For example, suppose you want to copy customers with matching orders to one data set and
customers with no orders to another data set. You can do this by adding OUTPUT statements to your code.

This program creates two data sets, orderdata and noorders. The MERGE statement lists the data sets to combine. The
IN= option creates the temporary variable order, which we can use to identify observations from the orders_2007 data
set. If the value of order is 1, the first OUTPUT statement writes the contents of the program data vector to the
orderdata data set. If the value of order isn't 1, the second OUTPUT statement writes the contents of the PDV to the
noorders data set. If you submitted this code, the orderdata data set would contain 128 observations, indicating that
there were 128 matches of customers to orders. The noorders data set would contain 35 observations, indicating that
there were 35 customers who did not place orders. Both data sets would have 22 variables.

Selecting Variables
Now suppose you need to control which variables to include in each new data set. Remember, you can use the KEEP=
data set option, the KEEP statement, the DROP= data set option, or the DROP statement to control which variables are
written to a data set. In this example, you need to keep Customer_Name, Product_ID, Quantity, and Total_Retail_Price in the
orderdata data set. In the noorders data set, you need to keep Customer_Name and Birth_Date. These customers didn't
place orders, so product information isn't available or necessary.

Controlling Match-Merge Output


In this demonstration, you create two output data sets, with specific variables, from a match-merge.

1. Copy and paste the following code into the editor.

data orderdata(keep=Customer_Name Product_ID
                    Quantity Total_Retail_Price)
     noorders(keep=Customer_Name Birth_Date);
   merge orion.customer
         work.orders_2007(in=order);
   by Customer_ID;
   if order=1 then output orderdata;
   else output noorders;
run;

proc print data=orderdata;
run;

proc print data=noorders;
run;

This program creates two data sets, orderdata and noorders. If the orders_2007 data set contributes to the
observation in the program data vector (meaning that the customer placed an order), SAS writes the
observation to the orderdata data set. If the orders_2007 data set doesn't contribute to the observation
(meaning that the customer didn't place an order), SAS writes the observation to the noorders data set.

2. Submit the code and check the log. Notice that the orderdata data set contains 128 observations. This indicates
that there are 128 matches of customers to orders. The noorders data set contains 35 observations. This
indicates that there are 35 customers who did not place orders.

3. View the results. The new data sets contain just the variables necessary for understanding the data. The
orderdata data set contains customer names and order information. The noorders data set contains only
customer information.

Question
What are the names of the two temporary variables created by the BY statement in the code below?

data orderdata(keep=Customer_Name Quantity Total_Retail_Price)
     noorders(keep=Customer_Name Birth_Date);
   merge orion.customer
         work.orders_2007(in=order);
   by Customer_ID;
   if order=1 then output orderdata;
   else output noorders;
run;

a. FIRST.Order, LAST.Order

b. FIRST.Customer, LAST.Customer

c. FIRST.Customer_ID, LAST.Customer_ID

The correct answer is c. The BY statement creates the temporary variables FIRST.Customer_ID and LAST.Customer_ID.
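Because FIRST. and LAST. variables are temporary, they never appear in an output data set. The following sketch copies them into ordinary variables so that you can inspect them. The orders data set and its values are made up for this illustration, and the sketch uses a SET statement rather than a MERGE statement to keep it short; the BY statement creates the same temporary variables in either case.

```sas
/* Hypothetical input: two orders for customer 4 and one for customer 5. */
data orders;
   input Customer_ID Quantity;
   datalines;
4 1
4 2
5 1
;
run;

data flags;
   set orders;
   by Customer_ID;
   /* Copy the temporary BY-group flags into variables that are written out. */
   First=first.Customer_ID;
   Last=last.Customer_ID;
run;

proc print data=flags noobs;
run;
```

For customer 4, First is 1 on the first observation and Last is 1 on the second. For customer 5, which has a single observation, both First and Last are 1.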

Summarizing Merged Data


You can use FIRST. and LAST. processing along with a sum statement to accumulate the total number of orders for each
customer. Let's take a look at an example.

Suppose you want to see the total number of orders for each customer. You need to create a variable, NumOrders,
which accumulates based on the matching observations. To do this, you modify the code from the previous example. You
need to create a new data set, summary, to hold the order summary information. The KEEP= data set option specifies
Customer_Name and the new variable, NumOrders, as the variables to keep. The MERGE statement specifies the
data sets to merge, as well as the IN= variable, order.

Remember, when you use a BY statement, SAS creates two temporary variables for each BY variable: FIRST.variable and
LAST.variable. The program uses a simple DO group. The statements between DO and END are performed only when the IN=
variable, order, equals 1. If order equals 0, the statements in the DO group don't execute, and the program continues
with the ELSE statement. At a high level, this group specifies that the summary data set will include the matches of
customers and orders, that each customer is listed once, and that each customer's orders are summarized into one
variable, NumOrders.

Now let's look in detail at each line of code. If the orders_2007 data set contributes, SAS executes the statements in
the DO group. SAS writes the matches of the customers and orders to the orderdata data set. SAS then checks the value of
the FIRST. variable. If FIRST.Customer_ID equals 1 (meaning that the current observation contains the first occurrence of
a customer ID), SAS sets the NumOrders variable equal to 0. You can use a sum statement to create an accumulating
variable. Remember, SAS retains the value of a variable created with a sum statement across iterations of the DATA
step. This sum statement creates the summary variable NumOrders. SAS initializes the value of NumOrders to 0
and increments the value by 1 each time this Customer_ID value occurs. Then, SAS checks the value of the LAST.
variable. If LAST.Customer_ID equals 1 (meaning that the current observation contains the last occurrence of a
customer ID), SAS writes the contents of the PDV to the summary data set.

The END statement closes the DO group. If the statements within the DO group aren't processed, SAS continues with the
ELSE statement and writes the contents of the PDV to the noorders data set. The resulting summary data set lists each
customer once with his or her accumulated total number of orders. Now suppose you want to see the customer’s name
concatenated with the total number of orders. To do that, you can use a function to further manipulate the data created
from the merged data sets.

Summarizing Merged Data


In this demonstration, you use a function to concatenate the data from merged data sets.

1. Copy and paste the following code into the editor.

proc sort data=orion.order_fact
          out=work.orders_2007;
   by Customer_ID;
   where year(Order_Date)=2007;
run;

data orderdata(keep=Customer_Name Quantity
                    Total_Retail_Price)
     noorders(keep=Customer_Name Birth_Date)
     summary(keep=Customer_Name NumOrders);
   merge orion.customer
         work.orders_2007(in=order);
   by Customer_ID;
   if order=1 then
      do;
         output orderdata;
         if first.Customer_ID then NumOrders=0;
         NumOrders+1;
         if last.Customer_ID then output summary;
      end;
   else output noorders;
run;

You want to see the customer's last name concatenated with the customer's number of orders in the summary
data set. We can use a function to create a variable that holds this information. You need to insert the new code
within the DO group because this is where the summary data set is being created.

2. Name the new variable NameNumber. Next add a function. You can use the CATX function here, but you could
use other functions to achieve the same results. Remember that the CATX function removes leading and trailing
blanks, inserts delimiters, and returns a concatenated character string. The delimiter is a blank, and the two
variable values we want to concatenate are Customer_LastName, from the orion.customer data set, and our
accumulating variable, NumOrders. Keep the new variable, NameNumber, in the summary data set.

data orderdata(keep=Customer_Name Quantity
                    Total_Retail_Price)
     noorders(keep=Customer_Name Birth_Date)
     summary(keep=Customer_Name NumOrders NameNumber);
   merge orion.customer
         work.orders_2007(in=order);
   by Customer_ID;
   if order=1 then
      do;
         output orderdata;
         if first.Customer_ID then NumOrders=0;
         NumOrders+1;
         /* The first argument is the delimiter: a single blank. */
         NameNumber=catx(' ',Customer_LastName,NumOrders);
         if last.Customer_ID then output summary;
      end;
   else output noorders;
run;

3. Add the following title statement and PROC PRINT step, and then submit the code.

title 'Summary';
proc print data=summary;
run;
title;

4. Check the log. Verify that no errors occurred.

5. View the results. The data set contains the new variable, NameNumber, which contains the customer
last name, a space, and the customer's total number of orders. The function produced the correct
results.
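To see what CATX does in isolation, you can try it on a single pair of values. This is a minimal sketch: the name Smith and the count 3 are made-up values, not rows from orion.customer.

```sas
data _null_;
   LastName='Smith   ';   /* trailing blanks are removed by CATX */
   NumOrders=3;           /* numeric value; SAS converts it to text */
   NameNumber=catx(' ', LastName, NumOrders);
   put NameNumber=;       /* log should show: NameNumber=Smith 3 */
run;
```

The blank delimiter appears only between the two values, and the padding after the name is not carried into the result.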

Match-Merging Data Sets That Lack a Common Variable


Suppose you need to create a customer activity report using data from three SAS data sets: orion.customer,
orion.order_fact, and orion.product_dim. The customer data set contains the customer information, the order_fact data
set contains the order information, and the product_dim data set contains the product information.

Match-Merging Multiple Data Sets


You need data from all three data sets to create the activity report. Any number of data sets can be merged in a single
DATA step, but the data sets must share a common BY variable. All of the data sets must be sorted by the variables
that you want to use for the merge. When data sets don’t share a common BY variable, you can merge them using
multiple, separate DATA steps. Remember that the data sets must be sorted by the appropriate BY variables first.

When you think about combining the customer, order_fact and product_dim data sets for the activity report, the first
question you need to answer is whether or not they share at least one common variable. These three data sets don't
share one common variable. The first two data sets have Customer_ID as a common variable, and the last two data sets
have Product_ID as a common variable. So for these three data sets, you can first perform a match-merge on
orion.customer and order_fact. They share the common variable Customer_ID. Once you sort the data sets, you can
merge them by specifying Customer_ID as the BY variable.

Match-Merging Multiple Data Sets: Step 1


In this demonstration, you merge the orion.customer and the order_fact data sets by the Customer_ID variable.

1. Copy and paste the following code into the editor.

proc sort data=orion.order_fact
          out=work.orders_2007;
   by Customer_ID;
   where year(Order_Date)=2007;
run;

data custord;
   merge orion.customer(in=cust)
         work.orders_2007(in=order);
   by Customer_ID;
   if cust=1 and order=1;
   keep Customer_ID Customer_Name Quantity
        Total_Retail_Price Product_ID;
run;

proc print data=custord;
run;

The first step is to sort the order_fact data set by Customer_ID. We sorted orion.customer by the values of
Customer_ID in an earlier demonstration.

The DATA step specifies the output data set, custord. The MERGE statement lists the orion.customer data set
and the new data set from the sort, work.orders_2007. Each data set has an IN= data set option. The subsetting
IF statement evaluates the matches of customer and orders based on the values of the IN= variables. The BY
statement specifies the common variable, Customer_ID. Finally, the KEEP statement lists the variables we want
to see in the custord data set.

Notice that we're keeping the Product_ID variable because it's the common variable needed in the second
match-merge, which we'll do in an upcoming demonstration.

2. Submit the code and check the log. Verify that no errors occurred.

3. View the results. The new data set, custord, includes the matches of customers and orders by Customer_ID.

Finding a Common Variable


The custord data set contains the variables Customer_ID, Customer_Name, Quantity, Total_Retail_Price, and Product_ID.
Notice that only the pertinent customer and order variables remain. Now that you’ve merged the first two data sets, it’s
time to perform a match-merge on custord and the third data set, product_dim. Remember that these data sets share the
common variable Product_ID. Once you sort the data sets, you can merge them by specifying Product_ID as the BY variable.

Match-Merging Multiple Data Sets: Step 2


In this demonstration, you'll merge the custord and product_dim data sets. This completes the two-part merge for the
three data sets that didn't share one common variable.

1. Copy and paste the following code into the editor.

proc sort data=custord;
   by Product_ID;
run;

data custordprod;
   merge custord(in=ord)
         orion.product_dim(in=prod);
   by Product_ID;
   if ord=1 and prod=1;
   Supplier=catx(' ',Supplier_Country,Supplier_ID);
   keep Customer_Name Quantity
        Total_Retail_Price Product_ID Product_Name Supplier;
run;

proc print data=custordprod(obs=15) noobs;
run;

As in the previous demonstration, we start by sorting the custord data set by Product_ID, the common variable.
Next, the DATA step match-merges custord with product_dim by Product_ID. If both data sets contribute to the
observation, SAS writes the contents of the PDV to the CustOrdProd data set.

2. Submit the code and check the log. Verify that no errors occurred.

3. View the results. The CustOrdProd data set is the result of merging all three of the data sets. The observations
are ordered by Product_ID. For each ID, you can see the customer who placed the order and the specific order
information. Now, all the customer activity information is in one data set for easy access, and there's no
extraneous information such as non-matches.

Match-Merging with an Excel Worksheet and Renaming Variables


As you’ve seen, merging two or more SAS data sets is fairly straightforward, but suppose one of your input data sets is in
the form of an Excel workbook. Management at Orion Star wants a list of customers who qualify for bonus gifts. The SAS
data set work.custordprod contains purchasing information for each supplier. The Excel workbook BonusGift.xls contains
a list of suppliers that want to send gifts to customers who purchased more than a specified minimum quantity of a
product.

To determine which customers qualify for a bonus gift, you need to merge the SAS data set with the Excel workbook.
You need to merge the data sets by the supplier variable, but those variables have different names in each data set. The
BY variables must have the same name for the match-merge to work correctly, so you will have to use the RENAME=
data set option. Also, the two Quantity variables do not have the same meaning. In the work.custordprod data set,
Quantity refers to the number of items purchased. In the BonusGift data set, Quantity refers to the number of each item
that must be purchased in order to receive a bonus gift. You want to keep merged observations where the value of
Quantity in work.custordprod is greater than or equal to the value of Quantity in BonusGift.xls. You need to use the
RENAME= option here as well to make this comparison for the match-merge.

Accessing an Excel Workbook in SAS


To perform the merge, you need to access the BonusGift workbook in SAS. You’ve learned that you can use the
SAS/ACCESS LIBNAME statement to assign a libref to an Excel workbook. That way, SAS treats each worksheet within the
workbook as though it is a SAS data set. In its basic form, the statement specifies the libref followed by the physical
file name of the workbook in quotation marks.

Question
Which statements correctly access the Supplier worksheet in the Excel workbook?

a.

libname bonus "&path/BonusGift.xls";

proc print data=bonus.Supplier;
run;

b.

libname bonus "&path/BonusGift.xls";

proc print data=bonus.'Supplier$'n;
run;

The correct answer is b. You use a SAS name literal to refer to an Excel worksheet in SAS code. You enclose the name of
the worksheet, including the dollar sign, in quotation marks followed by the letter n.

Match-Merging a SAS Data Set and an Excel Worksheet


Now that you know how to access the BonusGift workbook, and treat it as a SAS data set, you can merge it with the SAS
work.custordprod data set. The first step in a match-merge is to identify the common variable.

In this example the common variable contains supplier information. The supplier variable is named SuppID in the
BonusGift data and Supplier in the custordprod data set. So, you need to rename one of the variables. You can use the
RENAME= data set option to match the variable SuppID in BonusGift to the variable Supplier in work.custordprod. You
specify the RENAME= option in parentheses immediately after the associated SAS data set name. Inside an inner set of
parentheses, you specify, for each variable that you want to rename, the existing variable name, an equal sign, and the
new variable name.

In this example, there's something else to consider. Remember, each input data set must first be sorted in order of the
values of the BY variable or variables, or it must have an appropriate index. The supplier values in BonusGift are already
sorted. However, the supplier values in custordprod are not. You need to sort custordprod by Supplier before you merge
the data.

Now let’s take a look at the complete code to merge this customer and supplier information. The PROC SORT step sorts
the data in custordprod by the values of Supplier. The libname statement assigns the libref bonus to the BonusGift.xls
file. Now SAS can merge the worksheet with another SAS data set. The new data set is custordprodgift. The MERGE
statement specifies the two data sets to combine. Notice that you're accessing the Excel worksheet Supplier using a SAS
name literal. You want only the matches of customer and supplier, so you need to include two IN= options. The first IN=
variable equals 1 when an observation from custordprod contributes, and the second IN= variable equals 1 when an
observation from the Supplier worksheet contributes. The RENAME= option inside the data set options list for the
worksheet changes the variable name in the Supplier worksheet from SuppID to Supplier. The BY statement specifies
Supplier as the common variable. The subsetting IF statement writes an observation to the output data set only if both
IN= variables equal 1.

Notice the additional LIBNAME statement that clears the libref. Remember, once you associate a libref with an Excel workbook, the workbook
is locked until you clear the libref. As long as you have the libref associated with the workbook in an active session of
SAS, you can’t open the Excel file outside of SAS. So, it's very important to clear the libref at the end of your program.

Match-Merging Data Sets with Same-Named Variables


Remember, you want to identify only customers whose orders meet or exceed the minimum quantity requirements for
specific suppliers. So for the targeted customers, the quantity information in the work.custordprod data set must be
greater than or equal to the quantity requirements in the BonusGift workbook.

Here's a question. How can you obtain this information? You can use a subsetting IF statement to test the condition of
whether the quantity the customer purchased is greater than or equal to the minimum quantity requirement. If the
statement is true, SAS continues processing and writes the results to the output data set. You need to compare the
value of Quantity in the customer data set to see if it's greater than or equal to the supplier quantity requirement.

Could you simply compare Quantity with Quantity in a subsetting IF statement? No. The variables share the same name, so
you can't compare them in an IF statement. When you match-merge data sets that contain same-named variables (other than
the BY variables), the DATA step overwrites values of the same-named variable in the first data set in which it appears
with values of the same-named variable in subsequent data sets. Since you need to compare the values of Quantity in
each data set, you need to rename Quantity in one of the data sets. Since you're already renaming the SuppID variable
in the BonusGift data, you can easily rename the Quantity variable in the BonusGift data as well. Let's rename the
variable Minimum, since it will represent the minimum value needed to receive a gift.
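The overwriting behavior and the RENAME= fix can be seen side by side in a small sketch. The purchases and minimums data sets, the supplier codes, and the quantities are hypothetical stand-ins for work.custordprod and the Supplier worksheet.

```sas
/* Hypothetical purchases: quantity bought from each supplier. */
data purchases;
   input Supplier $ Quantity;
   datalines;
AU1 5
US2 1
;
run;

/* Hypothetical gift rules: minimum quantity needed for each supplier. */
data minimums;
   input Supplier $ Quantity;
   datalines;
AU1 3
US2 2
;
run;

/* Without RENAME=, Quantity from minimums overwrites Quantity from purchases. */
data overwritten;
   merge purchases minimums;
   by Supplier;
run;

/* With RENAME=, both values survive and can be compared. */
data qualified;
   merge purchases(in=p)
         minimums(in=m rename=(Quantity=Minimum));
   by Supplier;
   if p=1 and m=1 and Quantity>=Minimum;
run;

proc print data=overwritten noobs;
run;

proc print data=qualified noobs;
run;
```

In overwritten, Quantity holds 3 and 2, which are the minimums rather than the purchased amounts. In qualified, only AU1 remains, because 5 is at least 3 while 1 is less than 2.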

Now you can use a subsetting IF statement that compares Quantity with Minimum to test the condition of whether the
quantity the customer purchased is greater than or equal to the minimum quantity requirement. The following
demonstration shows the final code to sort, merge, and calculate which customers should receive the bonus gifts.

Match-Merging Data Sets with Same-Named Variables


If you are using a client application (such as SAS Enterprise Guide or SAS Studio) to access SAS on a remote server, you
cannot use the SAS/ACCESS Interface to PC Files engine that is necessary to assign a LIBNAME statement to an Excel file.
This demonstration uses the SAS windowing environment with the PC Files engine installed.

In this demonstration, an Excel workbook and a SAS data set share a same-named variable other than the BY variable.
You match-merge this data using the RENAME= data set option.

1. Copy and paste the following code into the editor:

proc sort data=custordprod;
   by Supplier;
run;

libname bonus pcfiles path="&path/BonusGift.xls";

data custordprodgift;
   merge custordprod(in=c)
         bonus.'Supplier$'n(in=s
                            rename=(SuppID=Supplier
                                    Quantity=Minimum));
   by Supplier;
   if c=1 and s=1 and Quantity>=Minimum;
run;

libname bonus clear;

proc sort data=custordprodGift;
   by Customer_Name;
run;

proc print data=custordprodGift;
   var Customer_Name Gift;
run;

The SAS/ACCESS LIBNAME statement enables SAS to access the Excel workbook, and in turn, treat it as a SAS
data set in the code. Optionally, you can use the IMPORT procedure to read the worksheet and write the data to
a SAS data set. To learn about the IMPORT procedure, visit the online documentation at support.sas.com.

The MERGE statement includes the RENAME= data set option, which specifies that the variables SuppID and
Quantity in BonusGift.xls should be renamed Supplier and Minimum. The MERGE statement also includes the
SAS name literal for the Excel worksheet and the IN= options. We can use the IN= variables with the subsetting
IF statement to determine the data set contributors.
If both data sets contribute to the data that's currently in the PDV and the conditions for Quantity are met, SAS
continues to the next line in the program. All parts of this IF statement must be true in order for SAS to write the
contents of the PDV to the output data set.

Next, we need to issue the libname statement to clear the libref and unlock the Excel workbook. We sort the
data set by customer name before printing the list of customers and the gifts that they should receive.

2. Submit the code and check the log. The log shows that the code ran successfully.

3. View the results. The output shows the list of customers and gifts. Notice that some customers will receive more
than one gift.

Summary: Combining SAS Data Sets

This summary contains topic summaries, syntax, and sample programs.

Topic Summaries

Match-Merging SAS Data Sets


In a process called match-merging, you use the MERGE statement in a DATA step to combine SAS data sets with related
data into a new data set, based on the values of one or more common variables. Each observation in the new data set
contains data from multiple input data sets. Before the merge, each input data set must be sorted by the common
variable(s).

A match-merge produces matches (observations containing data from both input data sets) and non-matches
(observations containing data from only one input data set) by default. You can control this by identifying data set
contributors and using explicit OUTPUT or subsetting IF statements.

DATA SAS-data-set;
   MERGE SAS-data-set1 SAS-data-set2 . . . ;
   BY <DESCENDING> BY-variable(s);
   <additional SAS statements>
RUN;

To identify the data sets that contribute data to an observation, you use the IN= data set option following the input data
set name in the MERGE statement. SAS creates a temporary variable with the name you specify, and assigns a value of 1
or 0 to indicate whether or not the data set contributed data to the observation in the program data vector. You can use
a subsetting IF statement to output only those observations that contain data from all the input data sets or from just
one data set.

SAS-data-set (IN=variable)

Using Data Manipulation Techniques with a Match-Merge

You can use explicit OUTPUT statements in a match-merge to control your output. For example, you can direct the
matches (observations containing data from both input data sets) to one data set, and non-matches (observations
containing data from only one input data set) to another data set.

OUTPUT <SAS-data-set(s)>;

To control which variables appear in your output, you can use the KEEP= or DROP= data set option, or the KEEP or DROP
statement.

You can summarize merged data by using FIRST. and LAST. processing along with a sum statement.

Match-Merging SAS Data Sets That Lack a Common Variable

If you need to merge three or more data sets that don’t share a common variable, you can use multiple separate DATA
steps. Remember, the data sets must be sorted by the appropriate BY variable, so you may need to sort the
intermediate data sets in this process.

You need to find a common variable to complete the sort and match-merge.

Match-Merging with an Excel Worksheet and Renaming Variables

You can access an Excel workbook in SAS using the SAS/ACCESS LIBNAME statement. This assigns a libref to the Excel
workbook, allowing SAS to access the workbook as if it were a SAS library, and each worksheet as if it were a SAS data
set.

LIBNAME libref 'physical-file-name';

You can also merge a SAS data set with an Excel workbook. You need to identify the common variable, and may need to
use the RENAME= data set option to complete the merge.

You use a SAS name literal to refer to an Excel worksheet in SAS code. You enclose the name of the worksheet, including
the dollar sign, in quotation marks followed by the letter n.

Remember to clear the libref to unlock the Excel workbook when you are finished accessing it.

When you match-merge data sets that contain same-named variables (other than the BY variables), the data value from
the second and subsequent data sets overwrites the value of the same-named variable from the first data set in the
program data vector. You can use the RENAME= data set option to rename the variable(s) to unique names to avoid
overwriting.

SAS-data-set (RENAME = (old-name-1 = new-name-1
                        <…old-name-n = new-name-n>))

Sample Programs

Match-Merging SAS Data Sets

proc sort data=orion.order_fact
          out=work.orders_2007;
   by Customer_ID;
   where year(Order_Date)=2007;
run;

data custord;
   merge orion.customer(in=cust)
         work.orders_2007(in=order);
   by Customer_ID;
   if cust and order;
run;

proc print data=custord;
run;

Controlling Match-Merge Output

proc sort data=orion.order_fact
          out=work.orders_2007;
   by Customer_ID;
   where year(Order_Date)=2007;
run;

data orderdata(keep=Customer_Name Product_ID Quantity Total_Retail_Price)
     noorders(keep=Customer_Name Birth_Date);
   merge orion.customer
         work.orders_2007(in=order);
   by Customer_ID;
   if order=1 then output orderdata;
   else output noorders;
run;

proc print data=orderdata;
run;

proc print data=noorders;
run;

Summarizing Merged Data

proc sort data=orion.order_fact
          out=work.orders_2007;
   by Customer_ID;
   where year(Order_Date)=2007;
run;

data orderdata(keep=Customer_Name Quantity
                    Total_Retail_Price)
     noorders(keep=Customer_Name Birth_Date)
     summary(keep=Customer_Name NumOrders
                  NameNumber);
   merge orion.customer
         work.orders_2007(in=order);
   by Customer_ID;
   if order=1 then do;
      output orderdata;
      if first.Customer_ID then NumOrders=0;
      NumOrders+1;
      /* The first argument is the delimiter: a single blank. */
      NameNumber=catx(' ',Customer_LastName,NumOrders);
      if last.Customer_ID then output summary;
   end;
   else output noorders;
run;

title 'Summary';
proc print data=summary;
run;
title;

Match-Merging Multiple Data Sets: Step 1

proc sort data=orion.order_fact
          out=work.orders_2007;
   by Customer_ID;
   where year(Order_Date)=2007;
run;

data custord;
   merge orion.customer(in=cust)
         work.orders_2007(in=order);
   by Customer_ID;
   if cust=1 and order=1;
   keep Customer_ID Customer_Name Quantity
        Total_Retail_Price Product_ID;
run;

proc print data=custord;
run;

Match-Merging Multiple Data Sets: Step 2

proc sort data=custord;
   by Product_ID;
run;

data custordprod;
   merge custord(in=ord)
         orion.product_dim(in=prod);
   by Product_ID;
   if ord=1 and prod=1;
   Supplier=catx(' ',Supplier_Country,Supplier_ID);
   keep Customer_Name Quantity
        Total_Retail_Price Product_ID Product_Name Supplier;
run;

proc print data=custordprod(obs=15) noobs;
run;

Match-Merging with an Excel Worksheet and Renaming Variables

proc sort data=custordprod;
   by Supplier;
run;

libname bonus pcfiles path="&path/BonusGift.xls";

data custordprodGift;
   merge custordprod(in=c)
         bonus.'Supplier$'n(in=s
                            rename=(SuppID=Supplier
                                    Quantity=Minimum));
   by Supplier;
   if c=1 and s=1 and Quantity >= Minimum;
run;

libname bonus clear;

proc sort data=custordprodGift;
   by Customer_Name;
run;

proc print data=custordprodGift;
   var Customer_Name Gift;
run;

Lesson 10: Creating and Maintaining Permanent Formats


You should already know how to use the FORMAT procedure to change the appearance of variable values. A format
affects the way variables look when you display them in reports, but not the way SAS stores them in the SAS data set. In
this lesson, you learn to create and access permanent formats, create formats from SAS data sets, and maintain formats.

Objectives

In this lesson, you learn to do the following:

 create permanent formats
 create formats from SAS data sets
 access permanent formats
 maintain formats

Creating Permanent Formats


Orion Star Sports & Outdoors is a global company, and some of its data sets contain geographic codes rather than
names. For example, the data set orion.country contains information about some of the countries that Orion Star serves.
The values for the variables Country_ID and Continent_ID are coded. The data includes a corresponding country name
for each value of Country_ID. However, continent names are not included. Management has requested that country and
continent names, instead of codes, be used in reports so that the output is easier to interpret.

Creating a Format
Let's start by using the orion.country data set to create PROC PRINT output that displays continent names, rather than
Continent_ID values. As you probably know, you can apply a format to the data. Although SAS has many predefined
formats, such as date and time formats and monetary formats, a predefined format won't work here. However, you can
use PROC FORMAT to create a custom format for displaying data in a particular way.

Let's briefly review the PROC FORMAT syntax. The VALUE statement defines the format. First, you specify a name for the
format. Then, you specify how you want the values to be displayed. For this report, the VALUE statement creates the
Continent format and specifies the numeric data values and the descriptive text values to apply. To use the format, you
add a PROC PRINT step after the PROC FORMAT step. The FORMAT statement applies the Continent format to the
variable Continent_ID. Notice that when you refer to a user-defined format in the FORMAT statement, you must specify
a period after the format name, the same way that you do for a format name that is supplied by SAS. Remember, you
don't have to include a period after a user-defined format when you create it. The PROC PRINT output then contains the
continent names instead of the codes.
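The steps just described can be sketched as follows. The country_sample data set is a small stand-in for orion.country, and the specific codes and continent labels are assumptions chosen for illustration; check the actual Continent_ID values in your data before reusing them.

```sas
/* Stand-in for orion.country; the rows are illustrative assumptions. */
data country_sample;
   input Country $ Continent_ID;
   datalines;
AU 96
DE 93
US 91
;
run;

proc format;
   value continent                  /* no period needed when defining */
      91='North America'
      93='Europe'
      96='Australia/Pacific';
run;

proc print data=country_sample noobs;
   format Continent_ID continent.;  /* period required when applying */
run;
```

The stored values of Continent_ID remain numeric codes. Only the displayed values change.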


Business Scenario
A number of Orion Star data sets contain two-letter country codes rather than country names. For example, the SAS
data set orion.customer includes the variable Country. Suppose you want to format the two-letter codes as country
names.

Using a Control Data Set to Create a Format


You've seen that you can create formats by specifying values and labels in a PROC FORMAT step. You can also create a
format from a SAS data set, called a control data set, that contains value and label information. You can use the
orion.country data set to create the format you need. This data set contains information such as country codes, country
names, and continent IDs. To create a format from this data set, you use the CNTLIN= option to read the data and create
the format. You specify the data set name in the CNTLIN= option in the PROC FORMAT statement. The
CNTLIN= option builds formats without using a VALUE statement. The code for creating a format from a
SAS data set is simple.

But the control data set must contain certain variables. So, you need to restructure most data sets before you can use
them. The control data set that you use to create a format must contain variables that supply the same information that
a VALUE statement would. So, the data set specified in the CNTLIN= option must contain the variables FmtName, Start,
and Label. If the format specifies a range, the data set must also contain the variable End. If no End variable exists, SAS
assumes that the ending value of the format range is equal to the value of Start. The control data set must also contain
the variable Type for character formats unless the value for FmtName begins with a dollar sign.
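
To make the role of each variable concrete, here is a small sketch of a control data set for a range format. The data set name agegrp_ctrl, the format name agegrp, and the ranges are hypothetical and are not part of the course data. Type is optional for a numeric format; it is included only to show where it goes.

```sas
/* Hypothetical control data set for a numeric range format. */
data agegrp_ctrl;
   retain FmtName 'agegrp'   /* format name, same value in every row */
          Type 'N';          /* N = numeric format                   */
   input Start End Label $;  /* Start and End bound each range       */
   datalines;
0 12 Child
13 19 Teen
20 120 Adult
;
run;

/* Build the format from the control data set (stored in work.formats). */
proc format cntlin=agegrp_ctrl;
run;
```

If the End variable were dropped, each row would describe a single value rather than a range, because SAS would treat End as equal to Start.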

You can create a control data set using a DATA step or another PROC step. You can also create a control data set using
an interactive application such as the VIEWTABLE window in the SAS windowing environment or the New Data Wizard in
SAS Enterprise Guide. You can use the control data set to create new formats or re-create existing formats.

Restructuring the Data


Now let's see how you create a correctly structured control data set. Remember, you need to have the variables Start,
Label, and FmtName in the control data set. Orion.country doesn't contain these variables, so you need to restructure
the data set.

Let's use a DATA step to manipulate the data. The variable Country contains the two-letter country codes, and the
variable Country_Name contains the country names. You could use this DATA step to create the control data set. The
code includes a KEEP statement to write only the required variables Start, Label, and FmtName to the output data set.
The first assignment statement assigns the value $country to the variable FmtName. The next two assignment
statements assign the values of Country and Country_Name to the variables Start and Label respectively.
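
The DATA step described above is not reproduced in this text. A sketch that matches the description, assuming orion.country contains the variables Country and Country_Name as stated, might look like this:

```sas
data country_info;
   keep Start Label FmtName;      /* write only the required variables */
   set orion.country;
   FmtName='$country';            /* format name, assigned on every row */
   Start=Country;                 /* two-letter country code            */
   Label=Country_Name;            /* text to display for that code      */
run;
```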

The DATA step described above works, but it is inefficient. The three assignment statements execute once per row, and SAS
initializes the three new variables to missing each time the DATA step iterates. The DATA step in the following
demonstration is more efficient. The RETAIN statement initializes the variable FmtName with the value $country and
retains the value, and the RENAME= data set option renames variables on input.

Creating a Control Data Set
In this demonstration, you restructure the orion.country data set so you can use it to create a format using the CNTLIN=
option.

1. Copy and paste the following code into the editor.

data country_info;
   keep Start Label FmtName;
   retain FmtName '$country';
   set orion.country(rename=(Country=Start
                             Country_Name=Label));
run;

proc print data=country_info noobs;
   title 'Country Information';
run;
title;

The DATA step restructures the orion.country data set and creates the variables Start, Label, and FmtName in
the output data set.

2. Submit the code and check the log. Verify that no errors occurred.

3. View the results. Notice that the country_info data set is now properly structured as a control data set, so you can
use it to create a format using the CNTLIN= option.

Question
The control data set that you use to create a format must contain variables that supply the same information as which of
the following statements?

a. assignment

b. VALUE

c. INPUT

d. FORMAT

The correct answer is b. The data set specified in the CNTLIN= option must contain variables that supply the same
information that a VALUE statement would supply: FmtName, Start, and Label, as well as the variable End if a format
specifies a range.

Question
SAS stores formats as entries in which of the following?

a. SAS data sets

b. the SAS registry


c. SAS catalogs

d. SAS programs

The correct answer is c. When you use PROC FORMAT to create a format, SAS stores the format as a catalog entry. SAS
catalogs are special SAS files within a SAS library that store information in smaller units called entries.

Question
Which of the following PROC FORMAT steps creates a format from the temporary data set citycode and saves the
format to sasuser.formats?

a.

proc format lib=work cntlin=sasuser.citycode;
run;

b.

proc format lib=sasuser cntlin=citycode;
run;

c.

proc format lib=work cntlin=citycode;
run;

The correct answer is b. In the PROC FORMAT statement, you specify the sasuser library. You can optionally specify
sasuser.formats, but SAS stores the format in the Formats catalog by default. Then you use the CNTLIN= option to
specify the SAS data set citycode.

Applying Permanent Formats


Using Formats
When you reference a format, SAS loads the format from the catalog entry into memory, performs a binary search on
the format to execute the lookup, and returns a single result for each lookup. You can reference a format using FORMAT
statements, FORMAT= options, PUT statements, and PUT functions in assignment, WHERE, or IF statements. For
example, in this DATA step, the character format $country is referenced within the PUT function in the assignment
statement.

Creating a Permanent Format


In this demonstration, you use the control data set country_info to create a permanent format named $country.

1. Copy and paste the following PROC FORMAT step into the editor.

proc format library=orion cntlin=country_info;
run;

SAS will store the $country format in the orion.formats catalog. The CNTLIN= option specifies country_info as the
input control data set.

2. Submit the code and check the log. The log shows that the format was successfully written to the orion.formats
catalog.

3. Now you are ready to use the format. Copy and paste the following DATA step into the editor.

data supplier_country;
   set orion.shoe_vendors;
   Country_Name=put(Supplier_Country, $country.);
run;

This DATA step creates the SAS data set work.supplier_country. Notice that the assignment statement
references the $country format within the PUT function.

4. Submit just the DATA step. Then check the log.

Notice that the log includes an error message which indicates that the $country format was not found or could
not be loaded.

Note: If you are working in SAS Enterprise Guide, using the practice data for this course, the permanent format
is found and is applied. This behavior occurs because of the course setup. In this course, for the SAS Enterprise
Guide Environment, the work library is used as the orion library. By default, SAS searches in the work and library
libraries for formats, so the format is found and is applied. For more information on how SAS searches for
formats, see the subtopics Accessing Permanent Custom Formats and Using the FMTSEARCH= System Option
within this lesson.

Business Scenario
Suppose you've created a permanent format and you want to use it. For example, you successfully created the $country
format and stored it in the orion.formats catalog. Now, you want to use the format in a DATA step. However, when you
submit the code, the log indicates that SAS didn't find the requested user-defined format. SAS stopped processing the
step because of errors. Why didn't SAS find the format? Remember, you successfully created $country and stored it in
the orion.formats catalog. This code doesn't indicate the location of the format, so SAS doesn't know where to find it.
Let's see how you can access your permanently stored custom formats.

Accessing Permanent Custom Formats


To access your custom formats, it's helpful to understand how SAS looks for formats. By default, SAS searches the
work.formats catalog and then the library.formats catalog. Given that SAS automatically searches library.formats, one
strategy for accessing your formats is to assign the libref library to the SAS library that contains your format catalog,
and to name the catalog formats.
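
A minimal sketch of this strategy follows. The path is a placeholder, not a real location; point it at the directory that holds your formats catalog.

```sas
/* 'your-format-library-path' is a placeholder, not a real location. */
libname library 'your-format-library-path';

/* Formats stored in library.formats are now found automatically,    */
/* because library.formats is part of the default search path.       */
```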

But, what if you have more than one permanent format catalog, or if you named the format catalog something other
than formats? How will SAS know which catalog stores your custom format?

A better strategy is to use the FMTSEARCH= system option to control the order in which SAS searches for format
catalogs. Let's learn about the FMTSEARCH= option.

Using the FMTSEARCH= System Option


Here's the syntax for the FMTSEARCH= system option. After the keyword, you list the catalog names in parentheses. By
default, SAS searches in the work and library libraries first unless you specify work or library in the FMTSEARCH= list. SAS
searches the catalogs in the FMTSEARCH= list in the order that they are listed until it finds the format. If you specify the
catalogs in the order shown here (orion, work, and library), SAS first searches the orion library for your format, followed
by the work and library libraries. Now that SAS knows where to find your custom format, you can successfully use the
format.
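
The search order described above (orion, then work, then library) corresponds to an OPTIONS statement along these lines. Specifying only a libref such as orion refers to the formats catalog in that library.

```sas
/* Search orion.formats first, then work.formats, then library.formats. */
options fmtsearch=(orion work library);
```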

Using the NOFMTERR System Option


Remember, you successfully created the $country format and saved it in the orion.formats catalog. By default, the
FMTERR system option is in effect. If you use a format that SAS can't load, SAS issues an error message and stops
processing the step.

To prevent this default action, you can change the system option to NOFMTERR. This option replaces missing formats
with the w. or $w. default format, and issues a note. SAS will continue processing the program.

Using the NOFMTERR and FMTSEARCH= Options


In this demonstration, you use the NOFMTERR and FMTSEARCH= options.

1. Copy and paste the following code into the editor.

options nofmterr;
data supplier_country;
   set orion.shoe_vendors;
   Country_Name=put(Supplier_Country, $country.);
run;

proc print data=supplier_country;
run;

The DATA step creates the SAS data set work.supplier_country. The assignment statement creates the variable
Country_Name and references the $country format. Notice the NOFMTERR option. With this option, SAS will
process the program even if it can't find the specified formats, but the $country format won't be applied.

The PROC PRINT step prints the supplier_country data set.

2. Submit the code and check the log. Notice that SAS processed the code without errors. However, there is a note
that indicates that the format, $country, was not found or could not be loaded. We used the NOFMTERR option,
so SAS still processed the program and created results.

3. View the results. Did SAS apply the $country format to the Country_Name variable? No, it didn't. SAS still
doesn't know where to find the $country format.

Note: If you are working in SAS Enterprise Guide, using the practice data for this course, the permanent format
is found and is applied. This behavior occurs because of the course setup. In this course, for the SAS Enterprise
Guide Environment, the work library is used as the orion library. By default, SAS searches in the work and library
libraries for formats, so the format is found and is applied. For more information on how SAS searches for
formats, see the subtopics Accessing Permanent Custom Formats and Using the FMTSEARCH= System Option
within this lesson.

4. Modify the OPTIONS statement as shown below, to turn the FMTERR option back on.

options fmterr;

5. Add the FMTSEARCH= option to identify the orion.formats catalog to SAS. Your code should look like this:

options fmterr fmtsearch=(orion);

data supplier_country;
   set orion.shoe_vendors;
   Country_Name=put(Supplier_Country, $country.);
run;

proc print data=supplier_country;
run;

6. Submit the code and check the log. The log doesn't contain any errors.

7. View the results. As you can see, the variable Country_Name now has the $country format applied.

Remember, SAS system options are global. When you use the FMTSEARCH= option or the NOFMTERR option,
the option remains in effect for the entire session unless you change the option or turn it off.
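
Because these settings persist, it can help to check them before relying on them. The sketch below writes the current values to the log and then resets both options. The reset values are assumed here to match the defaults of a standard SAS installation; your site configuration may differ.

```sas
/* Write the current settings to the log. */
proc options option=fmtsearch;
run;

proc options option=fmterr;
run;

/* Reset both options (values assumed to match the usual defaults). */
options fmterr fmtsearch=(work library);
```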

Question
Given the following OPTIONS statement, in what order will SAS search to find a user-defined format?

options fmtsearch=(rpt prod.newfmt);

a. work.formats→library.Prod→library.Newfmt

b. work.formats→library.formats→rpt.Formats→prod.Newfmt

c. rpt.Formats→prod.Newfmt→work.formats→library.formats

The correct answer is b. SAS searches in the order specified in the FMTSEARCH= option. By default, SAS searches in the
work and library libraries first unless you specify them in the option.

Code Challenge
Write an OPTIONS statement to specify the following search order for format catalogs:

 formats stored in the abc library
 formats stored in the newfmt catalog of the def library
 formats stored in the formats catalog of the work library

You specify the libraries and catalogs in the order you want them to be searched. You must specify work in the
FMTSEARCH= option if you want to change the default search order.

options fmtsearch=(abc def.newfmt work);

Question
Given the following PROC FORMAT step, select the OPTIONS statement to submit before SAS can use the region format.

proc format lib=sasuser;
   value region
      1='North'
      2='South'
      3='East'
      4='West';
run;

a. options fmtsearch=(sasuser);

b. options nofmterr;

c. options nofmterr fmtsearch=(work);

The correct answer is a. The LIBRARY= option in the PROC FORMAT step tells SAS to save the region format in the
sasuser library. You must tell SAS where to find this format, so you add the FMTSEARCH= option and specify sasuser.

Question
By default, the FMTERR system option is in effect. If you use a format that SAS cannot load, SAS issues an error message
and stops processing the step.

a. True

b. False

The correct answer is a. By default, the FMTERR system option is in effect. If you want to avoid error messages and
continue to process the step when SAS cannot load a format, use the NOFMTERR system option.

Business Scenario
Suppose you need to create a report using the data set orion.employee_addresses2. The data contains missing values
for the variable Country. The missing values appear as blanks when you apply the $country format to the data. You want
to assign the label Unknown to the missing values and apply the $country format to the nonmissing values.

Nesting Formats
You can create another format that references and adds to the existing $country format. This technique is called nesting
formats. In general, you should try to avoid nesting formats more than one level because the resource requirements
increase dramatically with each additional level. In this example, PROC FORMAT creates the $extra format and nests the
$country format that we created earlier. The $extra format associates missing values with the label Unknown and uses
the label of the $country format for all other values. You enclose the name of the nested format in square brackets.
Notice that a length of 30 is specified for the $country format. The default length would be 40.

Nesting Formats with the VALUE Statement


In this demonstration you use the VALUE statement to nest a format.

1. Copy and paste the following code into the editor:

proc format library=orion;
   value $extra
      ' '='Unknown'
      other=[$country30.];
run;

2. Submit the code and check the log. The log indicates that SAS created the $extra format successfully.

3. In the editor, copy and paste the following PROC PRINT step to format the variable Country with the $extra
format. Submit just this step.

proc print data=orion.employee_addresses2;
   format Country $extra.;
run;

4. Examine the results. Notice that the value of Country in the first observation is Unknown. The other values of
Country are all formatted with the $country format. Remember, you nested the $country format in the $extra
format, so whenever you use the $extra format, it also applies the $country format.

Managing Permanent Formats


Using PROC CATALOG to Manage Formats
When you create a large number of permanent formats, it's easy to forget the exact spelling of a specific format name or
its range of values. Because SAS stores formats as catalog entries, you can use the CATALOG procedure to manage the
entries in SAS catalogs.

For example, you can list the contents of a catalog. This program displays the contents of the orion.formats catalog. As
you can see, the orion.formats catalog contains just one format, continent. You can see its type, creation date, and
modification date. Alternatively, if you are using the SAS windowing environment, you can use the Explorer window to
locate and view stored formats.
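
The program referred to above is not reproduced in this text. Based on the sample programs in the lesson summary, it might look like the following sketch. The QUIT statement is an addition here, because PROC CATALOG stays active after RUN until it is ended.

```sas
/* List the entries stored in the orion.formats catalog. */
proc catalog catalog=orion.formats;
   contents;
run;
quit;   /* ends the interactive procedure */
```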

Using the FMTLIB Option to Document Formats


To document your formats, you can add the FMTLIB option to the PROC FORMAT statement. FMTLIB prints a table of
information for each of the format entries in the catalog that you specify in the LIBRARY= option. This program
documents the formats in the orion.formats catalog.

As you create more permanent formats, it's helpful to subset the catalog by adding the SELECT or EXCLUDE statement to
the PROC FORMAT step to document only certain formats. This program subsets the orion.formats catalog to return just
the $country format. Here's the output. In addition to the name, range, and label, the format description includes the
length of the longest label, the number of values defined by this format, the version of SAS this format is compatible
with, and the date and time of creation.
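
The two programs mentioned above are not reproduced in this text. Sketches consistent with the description and with the sample programs in the lesson summary might look like this:

```sas
/* Document every format in the orion.formats catalog. */
proc format library=orion fmtlib;
run;

/* Document only the $country format. */
proc format library=orion fmtlib;
   select $country;   /* no period after the name in SELECT or EXCLUDE */
run;
```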

Question
Which of the following statements can you add to this PROC FORMAT step to document all of the formats in the
sasuser.formats catalog except the $geoarea and geopop formats?

proc format lib=sasuser fmtlib;
   ___________________________
run;

a. select geoarea. geopop.;

b. exclude geoarea. geopop.;


c. select $geoarea geopop;

d. exclude $geoarea geopop;

e. exclude $geoarea. geopop.;

The correct answer is d. You specify the EXCLUDE statement to exclude catalog entries from processing by the FMTLIB
option. Catalog entry names are the same as the name of the format they store. You precede names of entries that
contain character formats with a dollar sign.

Question
Which option do you use with PROC FORMAT to document the formats in a particular format catalog?

a. FMTSEARCH=

b. FMTERR

c. CATALOG

d. FMTLIB

The correct answer is d. Use the FMTLIB keyword in the PROC FORMAT statement to document the formats in a catalog.
You can add the SELECT and EXCLUDE statements to process specific formats rather than the entire catalog.

Maintaining Permanent Formats


After you create formats, how can you maintain them? For example, suppose you need to add new countries to the
$country format. How can you add new labels or modify the existing labels?

Updating a Format
If you created the format from a VALUE statement and saved the code, you can modify and resubmit the original PROC
FORMAT step. But if you didn't save the code, or if you created the format from a SAS data set, you can use the
CNTLOUT= option. This option creates an output control data set. You can follow a three-step process for using the
CNTLOUT= option. First, you create a SAS data set from the values in the format using the CNTLOUT= option. Second,
you edit the data set. And third, you re-create the format from the updated SAS data set using the CNTLIN= option. Let's
look at each of these steps.

Creating a SAS Data Set from a Format


To add new countries to the $country format, let's start with the first step: using the CNTLOUT= option to create an
output control data set from the format. Here's the syntax for the CNTLOUT= option. After the keyword, you specify the
name of the SAS data set. In this code, the output data set is countryfmt. You include the SELECT statement to specify
that the resulting data set contain only the data for the $country format. As you can see, the code for creating a SAS
data set from a format is simple.
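
The code referred to above does not appear in this text. A sketch that matches the description, and the step used later in the demonstration, is:

```sas
/* Write the $country format out as the control data set work.countryfmt. */
proc format library=orion cntlout=countryfmt;
   select $country;   /* without SELECT, every format in the catalog is written */
run;
```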

Question
Which PROC FORMAT option do you use to create a SAS data set from a format?

a. CNTLIN=

b. LIBRARY=

c. CNTLOUT=

d. FMTLIB=

The correct answer is c. You use the CNTLOUT= option to create a SAS data set from a format.

Code Challenge
Complete the PROC FORMAT step to create the SAS data set sasuser.runs from the format $marathon stored in the
library.formats catalog.

;
select $marathon;
run;

The following PROC FORMAT statements are acceptable. You specify the library where the format is stored in the
LIBRARY= option and the name of the data set (including the libref) in the CNTLOUT= option.

proc format library=library.formats cntlout=sasuser.runs;
proc format lib=library cntlout=sasuser.runs;
proc format lib=library.formats cntlout=sasuser.runs;

Editing the Data Set


The next step in updating the format is editing the data set. You can edit the data set using a DATA step or PROC SQL
step. Alternatively, if you are using the SAS windowing environment, you can edit the data set through an interactive
window such as the VIEWTABLE window. If you are using SAS Enterprise Guide, you can edit the data set through the
data grid. Whatever the method, you must add values for the FmtName, Start, End, and Label variables. Even if the
values for Start and End are the same, you must enter values for both variables or SAS will return an error. You don't
have to add values for the other variables in the data set.

Creating a Format from the SAS Data Set


After you edit and save the data set, you can re-create a format from it using the CNTLIN= option. Remember, you want
the $country format to include the three new countries, so you use the input control data set countryfmt to add these
values. Here's the code. The first PROC FORMAT step re-creates the $country format from the control data set
countryfmt. The second PROC FORMAT step includes the FMTLIB option and a SELECT statement to document the new
$country format with the additional countries added.

Here's a question. Can you write these two PROC FORMAT steps as one PROC FORMAT step? Yes you can, and it's more
efficient to do so. This PROC FORMAT step creates the $country format and uses FMTLIB to document it.

Updating a Format Using the CNTLOUT= and CNTLIN= Options


In this demonstration, you use the CNTLOUT= and CNTLIN= options to update a format.

1. Copy and paste the following code into the editor.


proc format library=orion cntlout=countryfmt;
   select $country;
run;

proc print data=countryfmt;
run;

The CNTLOUT= option creates a new data set named countryfmt from the existing version of the $country
format.

2. Submit the code and check the log. The log indicates that the new data set, countryfmt, was successfully
created.

3. View the results. When you use the CNTLOUT= option, SAS creates an output control data set that has many
variables for storing information about the format.

As you can see, the countryfmt data set contains the variable End and many other variables that were not in the
input control data set. When you use the CNTLIN= option, if there is no End variable in the data set, SAS assumes
that the Start and End variables have the same value. When you write the format out as a data set using the
CNTLOUT= option, both variables are in the data set.

4. Data for the additional countries is stored in orion.newcodes. Copy and paste the following code into the editor.

data countryfmt;
   set countryfmt orion.newcodes;
run;

proc print data=countryfmt;
run;

This DATA step concatenates the new country codes to countryfmt.

5. Submit the code and check the log. The log shows that SAS inserted the three new rows into the countryfmt
data set.

6. View the results. Notice the codes for Brazil, Switzerland, and Mexico.

7. Copy and paste the following PROC FORMAT step into the editor.

proc format library=orion.formats cntlin=countryfmt fmtlib;
   select $country;
run;

This PROC FORMAT step creates the $country format and uses FMTLIB to document it.

8. Submit the code and check the log. Verify that the code ran successfully.

9. View the results. Notice that the $country format now includes the three new countries: Brazil, Switzerland, and
Mexico. You can apply this format just like any other user-defined or SAS supplied format.

Summary: Creating and Maintaining Permanent Formats


This summary contains topic summaries, syntax, and sample programs.

Topic Summaries

Creating Permanent Formats


You can use the FORMAT procedure to define your own custom formats for displaying values of variables. The VALUE
statement defines the format. First, you specify a name for the format. Then, you specify how you want each value or
range of values to be displayed.

PROC FORMAT;
VALUE format-name value-or-range1='formatted-value1'
value-or-range2='formatted-value2'
. . . ;
RUN;

After the format is defined, you can apply it to a variable by using a FORMAT statement in a DATA step or PROC step.
In a FORMAT statement you must specify a period after the format name, the same way that you do for a SAS supplied
format. You do not include a period after a user-defined format name when you create it.

By default, user-defined formats are stored in a SAS catalog named work.formats and are available for the duration of
the SAS session. Work is a temporary library, and its contents are deleted when the session ends. Therefore, the
user-defined formats stored in work.formats are deleted as well.

You can store user-defined formats in a permanent library. You can also create a format from an existing SAS data set
instead of including hard-coded values in a VALUE statement.

To create a format from a SAS data set, the data set must contain value and label information, in specially named
variables. This is called a control data set. You specify the control data set name using the CNTLIN= option in the PROC
FORMAT statement. The CNTLIN= option builds formats without using a VALUE statement.

PROC FORMAT CNTLIN=SAS-data-set;

The control data set must contain variables that supply the same information that a VALUE statement would: FmtName,
Start, Label, and possibly End and Type. FmtName must be assigned the name of the format, and has the same value in
every observation in the control data set. Start contains a data value to be formatted, and Label contains the associated
label to be displayed instead of that data value. If the format applies to a range of values, then Start specifies the first
value in the range, and another variable, End, specifies the last value in the range. If no End variable exists, SAS assumes
that the ending value of the format range is equal to the value of Start. The control data set must also contain the
variable Type for character formats unless the value for FmtName begins with a dollar sign.

Most data sets do not contain the required variables, so you need to restructure them before you can use them as a
control data set. You can restructure the data by using a DATA step or another PROC step. You can also create a control
data set using an interactive application such as the VIEWTABLE window in the SAS windowing environment or the New
Data Wizard in SAS Enterprise Guide.

Recall that user-defined formats are stored in the temporary catalog, work.formats. To permanently store frequently
used formats in a permanent catalog, you add the LIBRARY= option to the PROC FORMAT statement. If you specify only
a libref without a catalog name, SAS permanently stores the format in the Formats catalog in that library. If you specify a
libref and a catalog name, SAS permanently stores the format in that catalog.

PROC FORMAT LIBRARY=libref<.catalog>;

Applying Permanent Formats


When you reference a format, SAS searches for the format, loads it from the catalog entry into memory, performs a
binary search on the format to execute the lookup, and returns a single result for each lookup.

When you use permanent formats, SAS automatically searches for formats in work.formats and then in library.formats.
You can take advantage of this automatic search path by assigning the libref, library, to the SAS library that contains
your formats catalog. A better option is to use the FMTSEARCH= option to identify the locations of your permanent
formats, and to control the order in which SAS searches for format catalogs.

You can use the OPTIONS statement to set the FMTSEARCH= option. After the keyword FMTSEARCH=, you list the
catalog names in parentheses. SAS searches work.formats, then library.formats, and then the catalogs in the
FMTSEARCH= list in the order that they are listed, until it finds the format.

OPTIONS FMTSEARCH=(catalog-specification-1... catalog-specification-n);

If you use a format that SAS can't load, SAS issues an error message and stops processing the step. This behavior is due
to the FMTERR system option which is in effect by default. To prevent this default action, you can change the system
option to NOFMTERR. This option replaces missing formats with the w. or $w. default format, issues a note and
continues processing the program.

OPTIONS NOFMTERR;

You can create a format that references and adds to an existing format. This technique is called nesting formats. In
general, you should try to avoid nesting formats more than one level because the resource requirements increase
dramatically with each additional level.

Managing Permanent Formats


When you create a large number of permanent formats, it's easy to forget the exact spelling of a specific format name or
its range of values. You can use the CATALOG procedure to manage the entries in SAS catalogs.

PROC CATALOG CATALOG=<libref.>catalog <options>;

To document your formats, you can add the FMTLIB option to the PROC FORMAT statement. The FMTLIB option prints a
table of information for each format entry in the catalog specified in the LIBRARY= option. In addition to the name,
range, and label, the format description includes the length of the longest label, the number of values defined by this
format, the version of SAS this format is compatible with, and the date and time of creation.

PROC FORMAT LIBRARY=libref.catalog FMTLIB;

As you create additional permanent formats, it's helpful to subset the catalog by adding the SELECT or EXCLUDE
statement to the PROC FORMAT step to document only certain formats.

Maintaining Permanent Formats


After you create a format, you might need to update it. If you created the format from a VALUE statement and saved the
code, you can modify and resubmit the original PROC FORMAT step.

If the original program or control data set is not available, you can use the CNTLOUT= option to update a format. This
option creates an output control data set from an existing format.

PROC FORMAT LIBRARY=libref.catalog
            CNTLOUT=SAS-data-set;

You follow a three-step process for using the CNTLOUT= option to update a format. First, you submit a PROC FORMAT
step using the CNTLOUT= option to create a SAS data set from the values in the format specified in the SELECT
statement. Second, you edit the data set. To edit the data set you can use a DATA or PROC SQL step, or you can edit the
data set interactively through the VIEWTABLE window or a data grid. Third, you submit another PROC FORMAT step to re-create
the format from the updated SAS data set using the CNTLIN= option.

Sample Programs

Creating Permanent Formats

data country_info;
   keep Start Label FmtName;
   retain FmtName '$country';
   set orion.country(rename=(Country=Start
                             Country_Name=Label));
run;

proc format library=orion cntlin=country_info;
run;

Applying Permanent Formats

options nofmterr fmtsearch=(orion);

data supplier_country;
   set orion.shoe_vendors;
   Country_Name=put(Supplier_Country,$country.);
run;

proc print data=supplier_country;
run;

Nesting Formats

proc format library=orion;
   value $extra
      ' '='Unknown'
      other=[$country30.];
run;

Managing Permanent Formats

proc catalog catalog=orion.formats;
   contents;
run;

proc format library=orion fmtlib;
   select $country;
run;

Maintaining Permanent Formats

proc format library=orion cntlout=countryfmt;
   select $country;
run;

Note: The countryfmt data set needs to be modified with code or by editing
interactively.

proc format library=orion.formats cntlin=countryfmt fmtlib;
   select $country;
run;

Lesson 11: An Introduction to the SQL Procedure


Structured Query Language, or SQL, is a standardized language that many software products use to retrieve, join, and
update data in tables. In SAS, the SQL procedure enables you to include ANSI standard SQL queries in your SAS
programs. Like other SAS procedures, PROC SQL can access SAS data sets. PROC SQL can also access data that is stored in
other databases if you have the corresponding SAS/ACCESS engines licensed and installed. In this lesson, you learn how
to use PROC SQL to work with SAS data sets.

Objectives

In this lesson, you learn to do the following:

• identify the uses of the SQL procedure
• identify the main differences between PROC SQL and the DATA step for joining tables
• query a SAS data set and create a report by using PROC SQL
• create a table that contains the results of a PROC SQL query
• join multiple SAS data sets by using PROC SQL

Understanding PROC SQL


What Is PROC SQL?

Using PROC SQL, you can retrieve and manipulate data that is stored in tables. In PROC SQL terminology, a table is a SAS
data set or any other type of data file that you can access by using SAS/ACCESS engines. A row in a table is the same as
an observation in a SAS data set. And, a column is the same as a variable.

The process of retrieving data from tables by using PROC SQL is called querying tables. PROC SQL stores the retrieved
data in a result set. By default, a PROC SQL query generates a report that displays the result set. You can also specify that
PROC SQL output the result set as a table. Using PROC SQL, you can retrieve data from one table, or from multiple tables
by joining them.

Why Use PROC SQL?


So, why would you want to use PROC SQL? After all, you can use the DATA step to retrieve data from tables. Actually,
each of these techniques has advantages. Let's look at the advantages of each.

PROC SQL can perform some tasks more efficiently than the DATA step can. For example, using PROC SQL, you can join
tables and produce a report in just one step without creating a SAS data set. PROC SQL also enables you to join tables
without presorting the input data sets. Another advantage of PROC SQL is that you can specify complex matching
criteria.

In some situations, however, the DATA step is your best choice. Using the DATA step, you can create multiple data sets.
You can also direct output to data sets based on which input data set contributed to the observation. You can use FIRST.
and LAST. processing as well as DO loops, arrays, and the hash object. Finally, using the DATA step, you can perform
complex data manipulation.

The Elements of a PROC SQL Step


Let's look at the main elements of a PROC SQL step. As you'll see, there are some differences between the syntax of
a PROC SQL step and the syntax of other PROC steps. To start the SQL procedure, you specify the PROC SQL statement.
When SAS executes a PROC SQL statement, the SQL procedure continues to run until SAS encounters a step boundary. A
step boundary stops PROC SQL from running and removes it from memory. A QUIT statement at the end of a PROC SQL
step is an explicit boundary. The beginning of a PROC step or a DATA step is also a step boundary. All of the PROC SQL
programs shown in this lesson end with a QUIT statement.

To retrieve data from one or more tables, you use a SELECT statement. A SELECT statement is also called a query. In a
single PROC SQL step, you can include one SELECT statement or multiple SELECT statements. Each SELECT statement
creates a separate report. Here's an example of a basic SELECT statement. Later in this lesson, you'll learn about the
SELECT statement in more detail. Other statements besides the SELECT statement can also appear in a PROC SQL step.
You'll learn about some of these additional statements later in this lesson.
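
As an illustration of these elements, here is a minimal sketch of a PROC SQL step that contains two SELECT statements and ends with a QUIT statement. It assumes that the orion library is already defined and uses only tables and columns that are named elsewhere in this lesson.

```sas
proc sql;
   /* First query: produces one report */
   select Employee_ID, Job_Title, Salary
      from orion.sales_mgmt;

   /* Second query in the same step: produces a separate report */
   select Employee_ID, Employee_Name
      from orion.employee_addresses;
quit;   /* explicit step boundary: ends PROC SQL */
```

Each query runs as soon as SAS reaches its semicolon, so no RUN statement is needed between the queries.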

Question
Which statement about PROC SQL is false?

a. PROC SQL stops running only when SAS encounters a QUIT statement.

b. A PROC SQL step can contain more than one SELECT statement.

c. A PROC SQL step creates a report by default.

d. When you use PROC SQL to join tables, you do not need to presort the input tables.

The correct answer is a. PROC SQL also stops running if SAS encounters another PROC step or a DATA step.

Using PROC SQL to Query a Table

The table orion.sales_mgmt contains information about Orion Star sales managers. Suppose you want to create a report
that lists the employee IDs, titles, and salaries of the sales managers. You can write a simple PROC SQL query to retrieve
and display the data in a report.

Using the SELECT Statement to Query the Data


Let's take a closer look at the SELECT statement, which you use to retrieve data from one or more tables. Like many
PROC SQL statements, the SELECT statement is composed of smaller elements called clauses. Each clause begins with a
keyword. Also, notice that a semicolon appears at the end of the entire SELECT statement, not after each clause. SAS
executes a PROC SQL query as soon as it reaches this semicolon, so you don't need to specify a RUN statement at the
end of the step.

The first two clauses (the SELECT clause and the FROM clause) are required in a SELECT statement. All other clauses,
including the WHERE clause, are optional. The SELECT clause specifies the columns to include in the query result. The
FROM clause specifies the table or tables that contain the columns. Additional clauses that can appear in the SELECT
statement are the GROUP BY clause, the HAVING clause, and the ORDER BY clause.

For more information about these additional clauses, see SAS Help and Documentation. In the SELECT statement, you
must specify the clauses in the order shown here.
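
The required order is SELECT, FROM, WHERE, GROUP BY, HAVING, and ORDER BY. The following sketch uses all six clauses in one query. It assumes that orion.sales_mgmt contains the columns Gender and Salary, as the other examples in this lesson suggest; the column alias Avg_Salary and the 50000 threshold are arbitrary choices for illustration.

```sas
proc sql;
   select Gender, avg(Salary) as Avg_Salary   /* 1. SELECT   */
      from orion.sales_mgmt                   /* 2. FROM     */
      where Salary is not missing             /* 3. WHERE    */
      group by Gender                         /* 4. GROUP BY */
      having avg(Salary) > 50000              /* 5. HAVING   */
      order by Avg_Salary desc;               /* 6. ORDER BY */
quit;
```

Only one semicolon appears, at the end of the last clause.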

Selecting Columns Using the SELECT Clause


In the SELECT clause, after the keyword SELECT, you can specify the names of one or more columns in the table that
you're querying. The column names are separated by commas, but there is no comma following the last column in the
list. You can specify columns from the queried table, text strings, or calculated columns. In this example, the SELECT
clause specifies the columns Employee_ID, Job_Title, and Salary.

If you want your result set to include all columns in the table or tables that you're working with, you can specify an
asterisk instead of listing all of the columns by name. In this example, the query output will now display all columns in
the table. Before you specify an asterisk in the SELECT clause, make sure that you really need all of those columns.
Depending on the size of the table, a PROC SQL query that requests many columns can take a long time to run. We're
going to use our earlier example, which selects only three columns.
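
To show the three kinds of items that a SELECT clause can contain, the sketch below lists columns from the table, a text string, and a calculated column. The 10 percent rate and the alias Bonus are hypothetical and appear here only to illustrate the syntax.

```sas
proc sql;
   select Employee_ID,                               /* column from the table */
          Job_Title,
          Salary,
          'Proposed bonus:',                         /* text string           */
          Salary * 0.10 as Bonus format=comma10.2    /* calculated column     */
      from orion.sales_mgmt;
quit;
```

Commas separate the items, and no comma follows the last item in the list.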

Using the FROM Clause to Specify Tables


In the FROM clause, after the keyword, you specify the name of the table. You can also specify multiple table names, as
you'll see later in this lesson. In this example, the FROM clause specifies the table orion.sales_mgmt, which contains the
columns that are listed in the SELECT clause.

Question
Which code creates a report that displays two columns from orion.donate? (Assume that the orion library is already
defined.)

a.

proc sql;
from orion.donate
select Employee_ID, Amount;
quit;

b.

proc sql;
select Employee_ID, Amount
from orion.donate;
quit;

c.

proc sql;
select Employee_ID Amount
from orion.donate;
quit;

d.

proc sql;
select Employee_ID, Amount;
from orion.donate;
quit;

The correct answer is b. The SELECT clause must be first, followed by the FROM clause. In the SELECT clause, a comma
should appear between column names but not at the end of the clause. One semicolon appears at the end of the SELECT
statement, not at the end of each clause.

Using the WHERE Clause to Subset Rows


So far, you've learned to select all or some of the columns in a table to display in a report. Now suppose you want to
subset the rows that your query returns, based on a condition. To subset rows, you can use the optional WHERE clause.
The WHERE clause identifies a condition that must be satisfied for each row to be included in the PROC SQL output. The
condition is stated in an expression. As in the WHERE statement, which you've used in other SAS procedures, the
expression in the WHERE clause can be any valid SAS expression. You can use a simple expression or a compound
expression that has logical operators.

In the WHERE clause shown here, the expression selects only the rows in which the manager's salary is greater than
$95,000. In this example, the WHERE clause specifies a column that is also specified in the SELECT clause. However, in
the WHERE clause, you can specify any columns from underlying tables, whether or not the columns are specified in the
SELECT clause. For example, this WHERE clause now references the Gender column, which is not specified in the
SELECT clause.
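
The two cases described above might look like the following sketch, which assumes that orion.sales_mgmt contains the columns Salary and Gender. The first query uses a simple expression on a selected column. The second uses a compound expression that also references Gender, a column that is not in the SELECT clause.

```sas
proc sql;
   /* Simple expression on a column that is also selected */
   select Employee_ID, Job_Title, Salary
      from orion.sales_mgmt
      where Salary > 95000;

   /* Compound expression; Gender is not in the SELECT clause */
   select Employee_ID, Job_Title, Salary
      from orion.sales_mgmt
      where Gender = 'M' and Salary > 95000;
quit;
```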

Creating a Report from a PROC SQL Query

In this demonstration, you run a PROC SQL step to create a report that contains information about Orion Star
sales managers.

1. Copy and paste the following PROC SQL step into the editor.

proc sql;
select *
from orion.sales_mgmt;
quit;

This query contains an asterisk to display all columns and specifies the name of the table. Without a WHERE
clause, PROC SQL will display all of the rows in the input table.

2. Submit the PROC SQL step and view the results. The report shows all columns and all rows in
orion.sales_mgmt. Notice that orion.sales_mgmt has only 10 columns and four rows. If this table were larger,
the query might have taken a long time to run.

3. Suppose you want to display only the columns Employee_ID, Job_Title, and Salary. Modify the SELECT clause, as
shown below, to specify these three columns instead of an asterisk.

proc sql;
select Employee_ID, Job_Title, Salary
from orion.sales_mgmt;
quit;

4. Submit the modified PROC SQL step and view the results. The report now displays three columns instead of 10.
This is the report that you originally wanted to create.

5. Now suppose you want to create a report that displays only the rows for sales managers whose salaries are
greater than $95,000. To subset the data, add a WHERE clause, as shown below. Remember to move the
semicolon from the end of the FROM clause to the end of the WHERE clause.

proc sql;
select Employee_ID, Job_Title, Salary
from orion.sales_mgmt
where Salary>95000;
quit;

6. Submit the modified PROC SQL step and view the results. This report displays only one row. Only one of the four
sales managers in orion.sales_mgmt has a salary greater than $95,000.

Using the CREATE TABLE Statement


You've seen how PROC SQL creates a report by default. Now, suppose you want to create an output table from the
query results. To create an output table instead of a report, you can use the CREATE TABLE statement in your PROC SQL
step. The output table is a SAS data set.

After the keywords CREATE TABLE, you specify the name of the output table, followed by the keyword AS. The rest of
the statement consists of the clauses that are used in a query: SELECT, FROM, and any optional clauses. As in the SELECT
statement, the query clauses must appear in the order shown here. One semicolon appears at the end of the entire
statement.

This example shows the PROC SQL step that you saw earlier, with a CREATE TABLE statement instead of a SELECT
statement. This CREATE TABLE statement creates the output table direct_reports from the results of a simple query. No
libref is specified before the output table name, so direct_reports is a temporary table that's stored in the work library.

Creating an Output Table from a PROC SQL Query


In this demonstration, you run a PROC SQL step to create an output table that contains information about Orion Star
sales managers. This demonstration shows two ways to display an output table that PROC SQL creates.

1. Copy and paste the following PROC SQL step into the editor.

proc sql;
create table direct_reports as
select Employee_ID, Job_Title, Salary
from orion.sales_mgmt;
quit;

This program creates the temporary output table direct_reports. Direct_reports contains the three columns
specified in the SELECT clause and all the rows in orion.sales_mgmt.

2. Submit the PROC SQL step. Notice that this program did not create a report. The CREATE TABLE statement
creates an output table instead of a report. To verify that the output table was created, view the log. A note in
the log indicates that the table work.direct_reports was created with four rows and three columns.

3. To create the output table and display it in a report in the same step, add a SELECT statement after the
CREATE TABLE statement, as shown below. To display all columns, specify an asterisk in the second SELECT
clause. In the second FROM clause, specify the output table name.

proc sql;
create table direct_reports as
select Employee_ID, Job_Title, Salary
from orion.sales_mgmt;
select *
from direct_reports;
quit;

4. Submit the PROC SQL step and view the output.

5. To display the output table another way, copy and paste the following PROC PRINT step into the editor.

proc print data=direct_reports;
run;

The output table is a SAS data set so we can use a PROC PRINT step.

6. Submit the PROC PRINT step and view the results. Notice that the PROC PRINT output contains the three data
columns and an Obs column.

7. Compare the PROC PRINT output and the PROC SQL output. Notice that the PROC SQL output shows variable
labels. PROC PRINT displays variable names instead of variable labels by default.
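
If you want the PROC PRINT report to resemble the PROC SQL report more closely, one option is sketched below. The LABEL option displays variable labels as column headings, and the NOOBS option suppresses the Obs column. The sketch assumes that work.direct_reports was created by the PROC SQL step in this demonstration.

```sas
proc print data=direct_reports label noobs;
run;
```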

Question
Which PROC SQL step creates only an output table named clinic.bills?

a.

proc sql;
create table clinic.bills;
select ID, Date, Fee
from clinic.admit;
quit;

b.

proc sql;
create table clinic.bills as
select ID, Date, Fee
from clinic.admit;
quit;

c.

proc sql;
select ID, Date, Fee
from clinic.admit
create table as clinic.bills;
quit;

The correct answer is b. To create an output table, you use a CREATE TABLE statement, which has multiple clauses. The
CREATE TABLE clause names the output table, and the SELECT and FROM clauses query the input table.

Using PROC SQL to Join Tables


Earlier, you used PROC SQL to create this report, which lists the employee IDs, job titles, and salaries of the sales
managers at Orion Star. The data in this report is from the table orion.sales_mgmt. Now, suppose you want to add
employee names to your report. The problem is that employee names are not stored in orion.sales_mgmt. Instead,
employee names are stored in the table orion.employee_addresses. To create the report that you want, you can use
PROC SQL to join the two tables. Joining tables enables you to select data from multiple tables as if the data were
contained in one table. Joins do not alter the original tables.

Ways of Joining Tables by Using PROC SQL


Using PROC SQL, you can join tables in different ways. In the most basic type of join, PROC SQL combines each row from
the first table with every row from the second table. The result of joining tables in this way is called the Cartesian
product. In a Cartesian product, the number of rows is equal to the product of the number of rows in each of the source
tables. The Cartesian product of large tables can be huge. Typically, you want your result set to contain only a subset of
the Cartesian product.

The other way that PROC SQL can join tables is by matching rows based on the values of a common column. An inner
join is a specific type of join that returns only the rows from the first table that match rows in the
second table. The result of an inner join can be represented as the overlapping area of two circles in a Venn diagram.
Conceptually, for an inner join, PROC SQL creates a Cartesian product as an internal or intermediate table. Then, PROC
SQL selects only the matching rows for the final result.
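
The difference between a Cartesian product and an inner join is easier to see with very small tables. The following sketch builds two hypothetical tables, work.staff and work.pay, which are not part of the course data, and then joins them both ways.

```sas
/* Hypothetical table with three rows */
data work.staff;
   input ID Name $;
   datalines;
1 Ann
2 Bob
3 Cho
;
run;

/* Hypothetical table with two rows */
data work.pay;
   input ID Salary;
   datalines;
2 50000
3 62000
;
run;

proc sql;
   /* Cartesian product: 3 rows x 2 rows = 6 rows */
   select staff.ID, Name, Salary
      from work.staff, work.pay;

   /* Inner join: only the rows with matching ID values (IDs 2 and 3) */
   select staff.ID, Name, Salary
      from work.staff, work.pay
      where staff.ID = pay.ID;
quit;
```

The first report pairs every staff row with every pay row, so most of its rows are meaningless. The second report keeps only the two rows in which the ID values match.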

Remember that you want to join orion.sales_mgmt and orion.employee_addresses so that you can add the
Employee_Name column to your report. To add the employee names to your report, you need to match the rows in the
two input tables by the values of the common column Employee_ID. You can use an inner join to create your report.

Using the SELECT Statement to Join Tables


To join tables, you use the SELECT statement. The SELECT clause specifies the columns that appear in the report. The
FROM clause specifies the tables to be joined. The WHERE clause specifies one or more join-conditions that PROC SQL
uses to combine and select rows for the result set. In addition to join-conditions, the WHERE clause can also specify an
expression that subsets the rows. Additional clauses may also be used in a SELECT statement that joins tables. Here's an
example of a PROC SQL join. Next, we'll take a closer look at how each clause in the SELECT statement is used in a join.

Specifying the Tables and the Columns

The FROM clause specifies the tables to be joined. You can list the table names in any order. The table names are
separated by commas. In this example, the FROM clause specifies the input tables orion.sales_mgmt and
orion.employee_addresses. Notice that a comma appears between the two table names but not after the last table
name.

The SELECT clause specifies the columns that you want to appear in the output. You can specify columns from any of the
tables listed in the FROM clause. In this example, the SELECT clause specifies four columns. The second column in the
list, Employee_Name, comes from the second table listed in the FROM clause. The other three columns come from the
first table. Notice that the first column in the SELECT clause, Employee_ID, now has the table name sales_mgmt
specified as a prefix. A column named Employee_ID appears in both of the input tables. If you select a column that has
the same name in multiple tables, you must indicate which column you want to select by specifying the table name and
a period before the column name.

Prefixing the table name is called qualifying a column. Notice that you don't have to specify the libref when you qualify a
column in the SELECT clause. In PROC SQL, you can use a two-level name for any column, but you are only required to
qualify the columns that appear in multiple tables.

Specifying the Join Conditions


A basic join uses only the two required clauses: SELECT and FROM. Here's a question: If you specify only a SELECT clause
and a FROM clause, what rows will appear in the result set? If you specify only a SELECT clause and a FROM clause, PROC
SQL matches all the rows in one table with all the rows in the other table. The result set contains all possible
combinations of rows, or a Cartesian product.

A basic join is very resource intensive and is rarely used. In an inner join, you use the WHERE clause to include only
matching rows in the result set. The WHERE clause specifies one or more join-conditions that PROC SQL uses to subset
the rows. The join-conditions are expressed as an sql-expression, which produces a value from a sequence of operands
and operators. An sql-expression can be any valid SAS expression. In the output from an inner join, PROC SQL includes all
rows that meet the condition stated in the sql-expression. The join-condition does not have to specify an equijoin,
which is a join in which values are matched for equality.

In this example, the WHERE clause specifies one condition. The value of Employee_ID in orion.sales_mgmt must match
(or equal) the value of Employee_ID in orion.employee_addresses. You must qualify the columns that appear in both
tables wherever you reference them in the SELECT statement. In this example, the Employee_ID column names are
qualified in both the WHERE clause and the SELECT clause.
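
The join-condition described above is an equijoin. For contrast, here is a minimal sketch of a join-condition that is not an equijoin: each employee is matched to the salary band whose range contains the employee's salary. The tables work.emp and work.bands and all of their values are hypothetical.

```sas
/* Hypothetical employee table */
data work.emp;
   input Name $ Salary;
   datalines;
Ann 48000
Bob 72000
;
run;

/* Hypothetical salary bands, each with a low and a high limit */
data work.bands;
   input Band $ Low High;
   datalines;
A 0 59999
B 60000 99999
;
run;

proc sql;
   /* Range condition instead of an equality condition */
   select Name, Salary, Band
      from work.emp, work.bands
      where Salary between Low and High;
quit;
```

No column name appears in both tables, so none of the columns in this query has to be qualified.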

Joining Tables on a Common Column


In this demonstration, you run a PROC SQL step to join two tables and create a report that contains columns from both
tables and only the matching rows.

1. Copy and paste the following PROC SQL query into the editor.

proc sql;
select sales_mgmt.Employee_ID, Employee_Name,
Job_Title, Salary
from orion.sales_mgmt,
orion.employee_addresses
where sales_mgmt.Employee_ID =
employee_addresses.Employee_ID;
quit;

This query selects four columns from the two tables orion.sales_mgmt and orion.employee_addresses. The
WHERE clause selects only the rows that have matching values of Employee_ID. One advantage of using PROC
SQL to join tables is that you don't have to sort the input tables first.

2. Submit the query and view the results. The report lists the four employees who are sales managers, as in the
report you created in the last demonstration. Now, the name of each employee is also included.

3. Remove the WHERE clause from the query and add a semicolon at the end of the SELECT statement as shown
below.

proc sql;
select sales_mgmt.Employee_ID, Employee_Name,
Job_Title, Salary
from orion.sales_mgmt,
orion.employee_addresses;
quit;

4. Submit the modified query and view the results. This report is much longer than the previous report because
PROC SQL generated a Cartesian product.

For example, starting at the top of the report, notice that the same employee ID number, job title, and salary
appear in multiple rows. These three columns are all from the first table, orion.sales_mgmt, which contained
data for only the four sales managers. However, each of these report rows contains a different employee name.
PROC SQL combined the first row in orion.sales_mgmt, for the manager who has the employee ID number
121143, with every row in orion.employee_addresses.

This report is not helpful. PROC SQL did not match rows by the values of Employee_ID so you don't know which
name belongs to employee number 121143.

5. View the log. A note in the log indicates that this query created a Cartesian product, so PROC SQL cannot
optimize the join.

6. Remember, you can also combine tables and generate a report by using a DATA step merge. A PROC SQL inner
join is equivalent to a DATA step merge in which both data sets contribute to the merge.
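
For comparison with the PROC SQL join, a DATA step merge of the same two tables might be sketched as follows. The output data set names and the IN= variable names are assumptions made for illustration. Unlike PROC SQL, the merge requires both inputs to be sorted by Employee_ID, and the results match the inner join only when the Employee_ID values are not repeated in both tables.

```sas
/* Sort copies of both inputs by the common variable */
proc sort data=orion.sales_mgmt out=work.mgmt_sorted;
   by Employee_ID;
run;

proc sort data=orion.employee_addresses out=work.addr_sorted;
   by Employee_ID;
run;

/* Keep only the observations to which both data sets contribute */
data work.mgr_names;
   merge work.mgmt_sorted(in=inMgmt)
         work.addr_sorted(in=inAddr);
   by Employee_ID;
   if inMgmt and inAddr;
   keep Employee_ID Employee_Name Job_Title Salary;
run;

/* A separate step is needed to produce the report */
proc print data=work.mgr_names noobs;
run;
```

The merge takes four steps and creates three data sets, whereas the PROC SQL join produces the report in a single step.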

Question
The tables orion.customer and orion.country both store country abbreviations in a column named Country. The other
columns have unique names. In a PROC SQL step, which SELECT statement correctly joins the two tables on matching
values of Country?

a.

select Customer_ID, Country, Country_Name
from orion.customer, orion.country
where Country = Country;

b.

select Customer_ID, Country, Country_Name
from orion.customer, orion.country
where customer.Country = country.Country;

c.

select Customer_ID, customer.Country, Country_Name
from orion.customer, orion.country
where country.Country = customer.Country;

The correct answer is c. In the SELECT clause and the WHERE clause, you must qualify references to any columns that
have the same name in multiple tables.

Assigning Aliases to Tables


This SELECT statement shows that qualified column names can be long. To make your code shorter and easier to read,
you can replace the full table name in qualified column names with an alias. You can specify aliases for tables in the
FROM clause. After the table name, you can optionally specify the keyword AS. Then, you specify the alias that you want
to use for the table. An alias can be any valid SAS name.

In this example, aliases have now replaced the full table names. The FROM clause assigns the alias s to the first table and
the alias a to the second table. The keyword AS appears before each table alias here, but remember that you can omit
this keyword if you want. In the SELECT and WHERE clauses, the qualified column names now use an alias as a prefix
instead of full table names. Note that assigning aliases does not change the names of the underlying tables. Here's the
complete PROC SQL step that uses aliases in the SELECT statement. This code produces the same report as the earlier
version, which used full table names.
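
Because the keyword AS is optional, the same join can also be written without it, as in the sketch below. Every column is qualified here, which is allowed but required only for Employee_ID; the sketch relies on the lesson's statement that Job_Title and Salary come from orion.sales_mgmt.

```sas
proc sql;
   select s.Employee_ID, a.Employee_Name, s.Job_Title, s.Salary
      from orion.sales_mgmt s,              /* alias s, AS omitted */
           orion.employee_addresses a       /* alias a, AS omitted */
      where s.Employee_ID = a.Employee_ID;
quit;
```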

Question
This PROC SQL step joins two tables that have only the column Student_Name in common. Which FROM clause correctly
completes this step? Select all that apply.

proc sql;
select r.Student_Name,
Course_Number,
Balance
___________________
where r.Student_Name=
s.Student_Name;
quit;

a.

from school.register as r,
school.students as s

b.

from school.register as r,
school.students s

c.

from school.students as s,
school.register as r

d.

from school.students as
s.students,
school.register as
r.register

The correct answer is a, b, and c. The FROM clause must assign the table aliases that are used in the SELECT and WHERE clauses. The
keyword AS is optional. You can list the tables in the FROM clause in either order.

Summary: An Introduction to the SQL Procedure

This summary contains topic summaries, syntax, and sample programs.

Topic Summaries

Understanding PROC SQL


Structured Query Language, or SQL, is a standardized language that many software products use to retrieve, join, and
update data in tables. Using PROC SQL, you can write ANSI standard SQL queries that retrieve and manipulate data.

PROC SQL;
SELECT column-1<, column-2>…
FROM table-1…
<WHERE expression>
<additional clauses>;
QUIT;

In PROC SQL terminology, a table is a SAS data set or any other type of data file that you can access by using SAS/ACCESS
engines. A row in a table is the same as an observation and a column is the same as a variable.

The process of retrieving data from tables by using PROC SQL is called querying tables. PROC SQL stores the retrieved
data in a result set. By default, a PROC SQL query generates a report that displays the result set. You can also specify that
PROC SQL output the result set as a table.

You can perform many of the same tasks by using PROC SQL or the DATA step. Each technique has advantages.

There are some differences between the syntax of a PROC SQL step and the syntax of other PROC steps. To start the SQL
procedure, you specify the PROC SQL statement. When SAS executes a PROC SQL statement, the SQL procedure
continues to run until SAS encounters one of the following step boundaries: the QUIT statement, or the beginning of
another PROC or DATA step.

A PROC SQL step can contain one or more statements. The SELECT statement (a query) retrieves data from one or more
tables and creates a report by default. A PROC SQL step can contain one or more SELECT statements. Other statements
besides the SELECT statement can also appear in a PROC SQL step.

Using PROC SQL to Query a Table


Like many PROC SQL statements, the SELECT statement is composed of smaller elements called clauses. Each clause
begins with a keyword. A semicolon appears at the end of the entire SELECT statement, not after each clause. SAS
executes a PROC SQL query as soon as it reaches this semicolon, so you don't need to specify a RUN statement at the
end of the step.

SELECT column-1<, column-2>…
FROM table-1…
<WHERE expression>
<additional clauses>;

The SELECT and FROM clauses are required in a SELECT statement. All other clauses, including the WHERE clause, are
optional. The SELECT clause specifies the columns to include in the query result. An asterisk indicates that all columns
should be included. The FROM clause specifies the table or tables that contain the columns.

SELECT column-1<, column-2>…

SELECT *

FROM table-1…

To subset rows, you use the optional WHERE clause to specify a condition that must be satisfied for each row to be
included in the PROC SQL output. The condition, which is stated in an expression, can be any valid SAS expression.

<WHERE expression>

To create an output table instead of a report, you can use the CREATE TABLE statement in your PROC SQL step. The
output table is a SAS data set.

CREATE TABLE table-name AS
SELECT column-1<, column-2>…
FROM table-1…
<WHERE expression>
<additional clauses>;

Using PROC SQL to Join Tables


Using PROC SQL, you can join tables in different ways. In the most basic type of join, PROC SQL combines each row from
the first table with every row from the second table. The result of joining tables in this way is called the Cartesian
product. In a Cartesian product, the number of rows is equal to the product of the number of rows in each of the source
tables. The Cartesian product of large tables can be huge. Typically, you want your result set to contain only a subset of the
Cartesian product.

The other way that PROC SQL can join tables is by matching rows based on the values of a common column. An inner
join is a specific type of join that returns only the rows from the first table that match rows in the
second table.

SELECT column-1<, column-2>…
FROM table-1, table-2…
<WHERE join-condition(s)>
<additional clauses>;

To join tables, you use the SELECT statement. A basic join uses only the two required clauses: SELECT to specify the
columns that appear in the report, and FROM to specify the tables to be joined. The WHERE clause specifies one or more
join-conditions used to combine and select rows for the result set. In addition to join-conditions, the WHERE clause can
also specify an expression that subsets the rows.

If the SELECT, FROM, or WHERE clause references a column that has the same name in multiple tables, you must qualify
the column by specifying the table name and a period before the column name.

To make your code shorter and easier to read, you can replace the full table name in qualified column names with an
alias.

FROM table-1 <AS> alias-1,
     table-2 <AS> alias-2…

Sample Programs

Querying a Table to Create a Report

proc sql;
select Employee_ID, Job_Title, Salary
from orion.sales_mgmt
where Gender='M';
quit;

Querying a Table to Create an Output Data Set

proc sql;
create table direct_reports as
select Employee_ID, Job_Title, Salary
from orion.sales_mgmt;
quit;

Joining Tables by Using Full Table Names to Qualify Columns


proc sql;
select sales_mgmt.Employee_ID, Employee_Name,
Job_Title, Salary
from orion.sales_mgmt,
orion.employee_addresses
where sales_mgmt.Employee_ID =
employee_addresses.Employee_ID;
quit;

Joining Tables by Using Table Aliases to Qualify Columns


proc sql;
select s.Employee_ID, Employee_Name, Job_Title, Salary
from orion.sales_mgmt as s,
orion.employee_addresses as a
where s.Employee_ID =
a.Employee_ID;
quit;
