DN1000019.1091
Table of Contents
1 Introduction
    1.1 Sources of Data
        1.1.1 External Files
        1.1.2 FOCUS Files
    1.2 Preparing Data for Analysis
        1.2.1 Direct Analysis
        1.2.2 Selective Data Analysis
        1.2.3 Specifying Variables
        1.2.4 Selecting Data
    1.3 Environmental Considerations
    1.4 The Examples in this Manual
2 Preparing for ANALYSE Sessions
    2.1 Entering the Environment
    2.2 Statistical Operations Summary
    2.3 Preparing the Environment: STATSET
    2.4 Limiting the Sample Size: FILESIZE
    2.5 Reviewing Online Documentation: EXPLAIN
3 The Statistical Operations
    3.1 Analysis of Variance: ANOVA
    3.2 Correlation Analysis: CORRE
    3.3 Exponential Smoothing: EXSMO
    3.4 Factor Analysis: FACTO
    3.5 Discriminant Analysis: MDISC
    3.6 Multiple Linear Regression: MULTR
    3.7 Polynomial Regression: POLRG
    3.8 Descriptive Statistics: STATS
        3.8.1 Control Options
    3.9 Stepwise Multiple Regression: STEPR
    3.10 Time-Series Analysis: TIMESER
        3.10.1 The Time Variable
        3.10.2 Commands that Create New Variables
        3.10.3 Other TIMESER Commands
        3.10.4 Saving Forecast Equations
        3.10.5 Ease-of-Use Features
    3.11 Crosstabulations: XTABS
        3.11.1 Specifying Variables
        3.11.2 Specifying Options
        3.11.3 Summary Statistics
        3.11.4 Cell Statistics
        3.11.5 Ease-of-Use Features
        3.11.6 Control Options
        3.11.7 Specifying Columns and Rows
        3.11.8 General Notes
A.
B.
1 Introduction
FOCUS provides a full range of statistical tools. These facilities have been designed for ease of use in a conversational environment. The spectrum of facilities covers two types of statistical operations:

Simple functions supplied by the FOCUS report request language as part of the regular report writer, including:
    Minimums.
    Maximums.
    Averages.
    Average sums of squares.
    Percentage counts.
    User-supplied functions.

Complex statistical operations with interactive prompts to determine what to perform next (based on results obtained so far). These include:
    Descriptive Statistics.
    Correlations.
    Multiple Linear Regressions.
    Stepwise Regressions.
    Polynomial Regressions.
    Analyses of Variance.
    Discriminant Analyses.
    Factor Analyses.
    Exponential Smoothing.
    Time Series Analyses and Forecasting.
Statistical operations are performed on selected sets of variables and their values (observations) collected from one or more FOCUS or external files by the FOCUS report writer. Up to 64 variables are permitted. The number of observations is limited only by the amount of storage available to the user (e.g., 20,000 observations with 20 numeric variables would need 1.6 million bytes of virtual storage).

The steps for performing statistical analysis of data in FOCUS databases are:
1. Create a HOLD file of selected and/or redefined data using the report writer.
2. Issue the ANALYSE command.
3. Respond to the prompts.

The steps for performing statistical analysis of data in external files are:
1. Create a Master File Description describing the data to be analyzed (attributes of the variables: names, formats and lengths, etc., described in the FOCUS Users Manual),
   or
   create a HOLD file of selected and redefined data using the report generation language (only necessary if the external file is not a fixed-format file).
2. Issue the ANALYSE command.
3. Respond to the prompts.
Usage:
Information Builders
Variables can be created or transformed directly with the DEFINE or COMPUTE facilities, which are described in the FOCUS Users Manual.
The prepared data is placed in a HOLD file. All analysis is then performed on the data in the HOLD file. Such procedures may be run reiteratively since FOCUS treats HOLD files as external files (with their own descriptions), which can be overwritten by subsequent requests. An alternative method is to create subsets of data using the SELECT operand in a STATSET operation (see Section 2.3 Preparing the Environment: STATSET on page 2-2).
Figure 1-1. Extracting Data for Analysis

Several sets of extracted data may be held for subsequent analysis with the HOLD AS feature.

Example 1: Selected Variables

TABLE FILE SMSA
PRINT POPULATION
ON TABLE HOLD AS POP
END
ANALYSE FILE POP
.
.
.
TABLE FILE PROPERTY
PRINT REGION CTYPE AND PROPERTY
ON TABLE HOLD AS PROPHOLD
IF REGION EQ "NORTHEAST"
END
ANALYSE FILE PROPHOLD
.
.
.
in which case you would supply a name for variable &1 on the command line when you executed the FOCEXEC (see the FOCUS Users Manual). You can also type ahead in live sessions:
To direct ANALYSE output to an offline device (usually a line printer), set PRINT=OFFLINE in FOCUS before entering the ANALYSE environment, or select STATSET option ONLINE=OFF. The FOCUS LET facility is also supported in ANALYSE, and offers an easy way to store row and column labels in crosstabs before starting your ANALYSE session (see Section 3.11.7 Specifying Columns and Rows on page 356). The following example illustrates the combined use of ANALYSE, GRAPH, and TABLE facilities.

ANALYSE FILE HOLD                          Enter ANALYSE.
STATSET EQVAR=TEMP1,EQFILE=TEMP1           Set up EQFILE TEMP1.
MULTR,TEMP,2,DIST,DEPTH,YES,NO             Specify a regression.
STATSET EQVAR=TEMP2,EQFILE=TEMP2           Set up EQFILE TEMP2.
MULTR,TEMP,3,DIST,DEPTH,BACTERIA,YES,NO    Specify a second regression.
QUIT                                       End the ANALYSE session.
EX TEMP1 HOLD                              Execute TEMP1 FOCEXEC (DEFINEs).
EX TEMP2 HOLD ADD                          Execute TEMP2 FOCEXEC (DEFINEs).
GRAPH FILE HOLD                            Graph TEMP1, TEMP2 and the difference across DIST.
WRITE TEMP1 AND TEMP2 AND COMPUTE DIFF=TEMP1-TEMP2
ACROSS DIST
END
RETYPE

Figure 1-3. Using TABLE, ANALYSE, and GRAPH in a Session

Note the use of the ANALYSE option STATSET at the start of the FOCEXEC in Figure 1-3. STATSET provides facilities for altering the default parameters (called STATSET flags) that control the ANALYSE environment during your session. Use these facilities to:
    Set up special HOLD files for ANALYSE output, variables, and equation files.
    Specify handling of records with missing data.
    Specify the print destination.
    Set the print width for your terminal.
If you are a new user, you may wish to review online descriptions of each of the operations. You can summon these by selecting EXPLAIN in response to the ANALYSE prompt for a statistical operation.
You select an operation by typing in a selection. The analysis chosen then proceeds, issuing prompts for any required information, and concludes with a request for another operation:
ENTER STATISTICAL OPERATION DESIRED -
Thus, after reading a file just once you can perform unlimited analyses on the data. Sections 2.3 and 2.4 of this chapter discuss ANALYSE facilities for preparing the environment and limiting the size of the sample file. Section 2.5 describes the online ANALYSE documentation facilities that explain the various statistical facilities.
Figure 2-1. Summary of Statistical Operations

NOTE: Each operation is initiated by a mnemonic subcommand and may be repeated in any order, as often as needed. Whenever possible, analyses are integrated and use the same input as the previous steps. For example, all regression analyses use the results of a single set of CORRE calculations instead of reproducing the input data for each analysis.
Type ? STATSET and press ENTER. This initiates a display of the current STATSET flag settings. Type STATSET and make your own assignments for STATSET flags before pressing ENTER (e.g., STATSET MISSING=ON, MISSVAL=-999, HOLD=STYDY1).
To issue a status request for the current settings of the STATSET flags, issue the following command:
? STATSET
This is illustrated in the STATSET terminal session at the end of this section. The STATSET flags are summarized in the following table.

Flag      Value           Description
EQFILE    name EQFILE     Names the FOCEXEC that will hold the regression equation as a defined variable. Set to NONE or not issued if no FOCEXEC is required.
EQVAR     name            The dependent variable name used in the regression equation (stored in the EQFILE FOCEXEC).
HOLD      name STATHOLD   The name of the HOLD file being created, similar to the usage HOLD AS HOLD1 in the FOCUS TABLE command.
MISSING   ON OFF          An ON or OFF setting specifies whether records with missing data will be processed.
MISSVAL   value -999      The value used to identify missing data fields. All variables are compared to this value when MISSING=ON.
ONLINE    ON OFF          An ON or OFF setting specifies where to route output (ONLINE/OFFLINE) from within ANALYSE.
PRINT     ON OFF          An ON or OFF setting routes the output from analyses to print (useful when HOLD files are created and analyzed in subsequent iterations).
SELECT    criteria        The selection criteria used to select records for analysis.
WIDTH     n               Specifies the FOCUS terminal print width setting. Note that this affects report formatting by XTABS.
EQFILE
The EQFILE flag names the equation FOCEXEC (maximum eight characters) being created. Equation file processing creates a FOCEXEC containing the regression equation, with EQVAR as the left-hand side of the equation. The command syntax is as follows:
STATSET EQFILE=FIT1
EQVAR
The EQVAR flag sets the variable name (maximum 12 characters) that appears on the left-hand side of the EQFILE. Using STATSET EQFILE=TFIT1, EQVAR=TEMP1, the regression fitting TEMP by DIST and DEPTH would produce an EQFILE that looks like the following:
-DEFAULT &2="";
DEFINE FILE &1 &2
TEMP1 = ;
END
>
In this form &1 may be used as needed (e.g., any HOLD file, etc.); &2 can be ADD or a blank (see the FOCUS Users Manual). Note that you may create as many equation files as needed in addition to the original data for subsequent use in comparative graphing and analysis reporting. The command syntax is as follows:
STATSET EQVAR=FIT1
HOLD
The HOLD flag sets the name of the HOLD file to be created. A number of statistical operations (e.g., STATS, TIMESER) provide the means for holding varying types of data. This data is written into a HOLD file. A Master File Description is also created. Such files may then be used directly by FOCUS. The default name for the HOLD flag is STATHOLD. The command for setting this flag is
STATSET HOLD=mychoice
where:
mychoice
Is any name of up to eight characters.

The HOLD flag may be set as often as needed. In this way, many different HOLD files can be created in one ANALYSE session. A given file may be replaced (overwritten) as necessary. When the file created has the same name as the ANALYSE subject, the following message is displayed:

(FOC133) WARNING: ANALYSE FILE LOST...HAD SAME NAME AS HOLD FILE.
(FOC134) RE-READING DATA UNDER NEW SELECTION CRITERIA IS INHIBITED.
The file being analyzed remains available for further ANALYSE processing throughout the current ANALYSE session. Processing with new selection criteria (STATSET flag SELECT) is inhibited (because the disk file is now different) and the following error message is displayed:
(FOC134) RE-READING DATA UNDER NEW SELECTION CRITERIA IS INHIBITED.
MISSING
The MISSING flag specifies whether missing data should be included in the analysis (ON) or omitted (OFF). The default value is OFF. When ON, individual variables within records are ignored if they are numerically equal to the value in the STATSET flag MISSVAL (see below). This special processing of records with missing data is supported in the following operations: CORRE, MULTR, POLRG, STEPR, STATS and XTABS. To include missing data issue the following command:
STATSET MISSING=ON
MISSVAL
The MISSVAL flag sets the value that identifies missing data. The default value is -999. It is recommended that you use only whole numbers. The command syntax is as follows:
STATSET MISSVAL=-999999
ONLINE
The ONLINE flag determines the destination of statistical output. When set ON (the default value), statistical output is sent to the terminal (SYSPRINT). When set OFF, statistical output is sent to the offline device (usually a line printer). Error messages and prompts are directed to the terminal. Use the following command to turn off this flag:
STATSET ONLINE=OFF
PRINT
The PRINT flag controls the statistical output of the regression analyses (MULTR, POLRG, STEPR). When ON (the default value), statistical output is sent to the terminal or printer, depending on the setting for ONLINE. When OFF, statistical output is inhibited. This is useful when the purpose of a regression, or series of regressions, is to produce an equation file for future use. Error messages and prompts are directed to the terminal. To inhibit the output, use the following STATSET command:
STATSET PRINT=OFF
SELECT
The SELECT flag sets the record selection criteria for subsequent statistical analyses. The default value (OFF) specifies no screening. SELECT may be set to include a series of valid FOCUS screening conditions of the following form:
FIELDNAME TEST RELATION LITERAL (or LITERAL or...)
Any number of screening tests may be used. Separate multiple tests with commas, and terminate the last with a comma/dollar sign (,$), as shown below:
STATSET SELECT=CITY EQ STAMFORD, PROD_CODE EQ B10, UNIT SOLD GT 25,$
Note that multiple fields may be tested on one or more lines; however, a test for a single field may not exceed one line.
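The effect of a SELECT specification can be pictured with a small Python sketch. This is only an illustration and not how ANALYSE is implemented: it assumes that a record is kept when every test succeeds, and that a test with several literals succeeds when any one literal matches. The helper name `passes`, the record layout, and the field key `UNIT_SOLD` are hypothetical.

```python
import operator

# Relation mnemonics as used in FOCUS screening tests; the mapping to
# Python operators is illustrative only.
RELATIONS = {
    "EQ": operator.eq, "NE": operator.ne,
    "GT": operator.gt, "GE": operator.ge,
    "LT": operator.lt, "LE": operator.le,
}

def passes(record, tests):
    """Return True when the record satisfies every screening test.

    tests is a list of (fieldname, relation, literals). A test succeeds
    when the field matches any one of its literals (assumed OR); a record
    is kept only when all tests succeed (assumed AND).
    """
    return all(
        any(RELATIONS[relation](record[field], literal) for literal in literals)
        for field, relation, literals in tests
    )

# Hypothetical records and the three tests from the STATSET SELECT example.
tests = [
    ("CITY", "EQ", ["STAMFORD"]),
    ("PROD_CODE", "EQ", ["B10"]),
    ("UNIT_SOLD", "GT", [25]),
]
records = [
    {"CITY": "STAMFORD", "PROD_CODE": "B10", "UNIT_SOLD": 30},
    {"CITY": "STAMFORD", "PROD_CODE": "B10", "UNIT_SOLD": 20},
    {"CITY": "UNIONDALE", "PROD_CODE": "B10", "UNIT_SOLD": 40},
]
selected = [r for r in records if passes(r, tests)]
```

Only the first record survives the screening, because the second fails the UNIT_SOLD test and the third fails the CITY test.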
WIDTH
The WIDTH flag sets the number of characters in the print line (the default is 80 characters). In XTABS (the ANALYSE cross tabulation facility) WIDTH is used to control automatic paneling. The command syntax is as follows:
STATSET WIDTH=130
summons a display of the online documentation for ANOVA. FOCUS command level help is also available with:
>>HELP ANALYSE
EXPLAIN
Enter EXPLAIN in response to the ANALYSE prompt for a statistical operation to invoke the following display:

ENTER STATISTICAL OPERATION DESIRED - explain
ENTER COMMAND (ALL, LIST, FOR SUMMARY, OR NAME OF SPECIFIC OPERATION)
where:
ALL
After selecting ANOVA, you are prompted for the name of each factor desired. All other required information is taken from the input file, including the number of factors and the number of groups or levels in each factor (in a factorial manner). An example follows:

TABLE FILE PROPERTY                      Initiate the request.
WRITE CNT.REGION                         Retrieve the number of REGIONS (factor 1).
WRITE CNT.CTYPE BY REGION                Retrieve number of City types (factor 2) within factor 1.
WRITE CNT.PROPERTY BY REGION BY CTYPE    Property by region (factor 3).
PRINT PROPERTY                           The actual variable PROPERTY.
BY REGION NOPRINT BY CTYPE NOPRINT       Specify the sort sequence.
ON TABLE HOLD                            The file for analysis.
END
ANALYSE FILE HOLD                        The ANALYSE request.
ANOVA                                    The statistical function.
A,B,C                                    The factors.
END
>
References
The ANOVA analysis is done as a factorial design using a three-operator method.
A. Ralston (1967), Mathematical Methods for Digital Computers, Analysis of Variance (Chapter 20), New York: John Wiley and Sons.
The results of these calculations and the sums of cross-product deviations (not displayed) are saved for use in subsequent regression analyses. The method used calculates product-moment correlation coefficients. If you choose to process records with missing data (SET MISSING=ON, MISSVAL=some value), pair-wise deletion is performed. (In pair-wise deletion, cases are omitted from the computation when either of the two variables under consideration is missing.) With the input data $X_{ij}$, where i=1, 2,..., n (observations) and j=1, 2,..., m (variables), the following equations are used:

Means

$$\bar{X}_j = \frac{\sum_{i=1}^{n} X_{ij}}{n}$$

where:
j=1, 2,..., m

Correlation Coefficients

$$r_{jk} = \frac{S_{jk}}{\sqrt{S_{jj}}\,\sqrt{S_{kk}}}$$

where:
j=1, 2,..., m
k=1, 2,..., m

Standard Deviation

$$S_j = \sqrt{\frac{S_{jj}}{n-1}} = \sqrt{\frac{\sum_{i=1}^{N_j} \left( X_{ij} - \bar{X}_j \right)^2}{n-1}}$$

where:
j=1, 2,..., m

$$S_{jk} = \sum_{i=1}^{n} \left( X_{ij} - T_j \right) \left( X_{ik} - T_k \right)$$

where
j=1, 2,..., m
k=1, 2,..., m

and

$$T_j = \frac{\sum_{i=1}^{n} X_{ij}}{n}$$

are used for computational accuracy.

NOTE: Any alphanumeric data is detected and treated as numeric zeroes. When processing records with missing data (MISSING=ON), the CORRE output includes the number of observations and pairs present (i.e., not missing). CORRE requires a greater number of observations than variables (by at least 1).
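The CORRE calculations can be sketched in a few lines of Python. This is a minimal illustration of the means, standard deviations, and product-moment correlations described above, not the FOCUS implementation. The function name `corre` and the `missing` argument (standing in for the MISSVAL flag) are hypothetical, and the sketch assumes that each standard deviation is computed from the values present for that variable.

```python
import math

def corre(rows, missing=None):
    """Means, standard deviations, and product-moment correlations.

    rows    -- list of observations, each a list of m numbers
    missing -- optional sentinel marking a missing value (a hypothetical
               stand-in for the MISSVAL flag)
    """
    m = len(rows[0])

    def present(j):
        # Values of variable j that are not flagged as missing.
        return [r[j] for r in rows if r[j] != missing]

    means = [sum(present(j)) / len(present(j)) for j in range(m)]

    stdevs = []
    for j in range(m):
        xs = present(j)
        s_jj = sum((x - means[j]) ** 2 for x in xs)
        stdevs.append(math.sqrt(s_jj / (len(xs) - 1)))

    corr = [[1.0] * m for _ in range(m)]
    for j in range(m):
        for k in range(j + 1, m):
            # Pair-wise deletion: keep an observation only when both
            # variables of the pair are present.
            pairs = [(r[j], r[k]) for r in rows
                     if r[j] != missing and r[k] != missing]
            t_j = sum(a for a, _ in pairs) / len(pairs)
            t_k = sum(b for _, b in pairs) / len(pairs)
            s_jk = sum((a - t_j) * (b - t_k) for a, b in pairs)
            s_jj = sum((a - t_j) ** 2 for a, _ in pairs)
            s_kk = sum((b - t_k) ** 2 for _, b in pairs)
            corr[j][k] = corr[k][j] = s_jk / math.sqrt(s_jj * s_kk)
    return means, stdevs, corr
```

For example, `corre([[1.0, 2.0], [2.0, -999.0], [3.0, 6.0], [4.0, 8.0]], missing=-999.0)` drops the second observation only when the pair involving the second variable is evaluated, while the mean of the first variable still uses all four values.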
To initiate this analysis, enter EXSMO in response to the ANALYSE prompt for a statistical operation.
ENTER STATISTICAL OPERATION DESIRED - exsmo
The variable (in the above example sales) may be specified by its fieldname, alias, position number in the file, or any unique truncation of the alias or fieldname. EXSMO names smoothed variables (those created for examination, holding, or further smoothing) by prefixing the first 10 characters of the names or aliases (aliases are assigned if HOLD files are created) with "S.". For example, if RIDER is the second variable in the file AIRLINE, EXSMO creates a smoothed variable with the following attributes:
FIELDNAME=S.RIDER,ALIAS=S.E02, FORMAT=D15.3,$
If you prefer to name the smoothed variable yourself, use the AS phrase. For example:
EXSMO RIDER AS SMOOTH_RIDER
These alternate names may contain up to 12 characters and must be enclosed in single quotation marks if they contain any embedded blanks. The values of variables that exist when you enter EXSMO are protected and cannot be overwritten by EXSMO. For example, if NRIDER is the name of a field in the file being analyzed, the expression
EXSMO RIDER AS NRIDER
results in an error message (and a repeat prompt). However, you can overwrite fields created by EXSMO during the current session. EXSMO prompts the user for a smoothing constant α (alpha), where 0.0 < α < 1.0.
ENTER SMOOTHING CONSTANT (E.G. 0.1) -
The larger the value of the constant, the larger the influence of previous data on the next smoothed point. EXSMO then prompts the user for the source of the three additional constants (A, B, and C) required for the smoothing:
DO YOU WISH TO SPECIFY INITIAL COEFFICIENTS (YES/NO)
If you specify NO, EXSMO uses the default method to calculate starting values for A, B, and C. The initial default values of A, B, C are determined as follows:
$$C = X_1 - 2X_2 + X_3$$
$$B = X_2 - X_1 - 1.5C$$
$$A = X_1 - B - 0.05C$$

where $X_1$, $X_2$, and $X_3$ are the first three input time series data points, and the calculation is done in the order shown (C, B, and then A).
Starting with initial values of A, B, and C, EXSMO calculates the first smoothed series data point $S_1$, updates coefficients A, B, and C, and then proceeds step by step through the time series values generating the smoothed series. The formulas used at each step are defined below:

$$S_i = A + B + 0.5C$$

where:
$S_i$
Is the smoothed data value for the next time period (the ith period).
A, B, and C
Are the values that exist at the (i-1)th time period.

After $S_i$ is calculated for one time period ahead, the A, B, and C coefficients are updated with the following formulas:

$$A = X_i + (1 - \alpha)^3 (S_i - X_i)$$
$$B = B + C - 1.5\,\alpha^2 (2 - \alpha)(S_i - X_i)$$
$$C = C - \alpha^3 (S_i - X_i)$$

where:
$X_i$
Is the input time series data point for the one time period ahead.
$\alpha$
Is the smoothing constant (alpha).

The calculation is done in the order shown (A, B, and then C) with the B and C on the right-hand side of the equations taking on their previous values. EXSMO then proceeds, one time period at a time, until an entire smoothed time series ($S_i$) is calculated covering the same extent as the input time series. At the end, EXSMO provides final values for the coefficients A, B, and C in the expression

$$F_i = A + B(T) + \frac{C(T)^2}{2}$$

where:
$F_i$
Are the forecast values for the first, second, third, etc. forecasted time period(s).

This expression is used to find estimates (or forecasts) for the specified number of time periods ahead (T).
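The smoothing recurrence above can be traced with a short Python sketch. It is an illustration under stated assumptions rather than the FOCUS code: the function names `exsmo` and `forecast` are hypothetical, the constants (including 0.05 in the starting value of A) are taken as printed in this section, and the default starting values require at least three data points.

```python
def exsmo(x, alpha, a=None, b=None, c=None):
    """Triple exponential smoothing following the formulas in this section.

    x     -- input time series (list of numbers)
    alpha -- smoothing constant, 0.0 < alpha < 1.0
    a,b,c -- optional initial coefficients; defaults are derived from the
             first three data points
    Returns the smoothed series and the final (A, B, C).
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if a is None or b is None or c is None:
        # Default starting values, computed in the order C, B, A.
        c = x[0] - 2.0 * x[1] + x[2]
        b = x[1] - x[0] - 1.5 * c
        a = x[0] - b - 0.05 * c  # constant as printed in the manual text

    smoothed = []
    for xi in x:
        s = a + b + 0.5 * c          # smoothed value one period ahead
        smoothed.append(s)
        diff = s - xi
        # Update in the order A, B, C; B and C on the right-hand side
        # are still the previous values when they are used.
        a = xi + (1.0 - alpha) ** 3 * diff
        b = b + c - 1.5 * alpha ** 2 * (2.0 - alpha) * diff
        c = c - alpha ** 3 * diff
    return smoothed, (a, b, c)

def forecast(a, b, c, t):
    """Forecast t periods ahead from the final coefficients."""
    return a + b * t + c * t * t / 2.0

smoothed, (a, b, c) = exsmo([1.0, 2.0, 3.0, 4.0, 5.0], 0.1)
next_two = [forecast(a, b, c, t) for t in (1, 2)]
```

On this perfectly linear series the smoothed values reproduce the input and the two forecasts continue the line, which makes it a convenient sanity check before trying real data.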
When the smoothing calculations are complete, EXSMO prints the initial and final values of the coefficients used (see the example below). The user then receives a series of prompts calling for dispositions for the input and smoothed data for the actual and forecasted time periods (display, hold, etc.). EXSMO requests the data with the following prompt:
ENTER COMMAND (E.G. PRINT,LAST,FORECAST,SHOW,EXSMO,KEEP,HOLD,QUIT)
Note that at any point before a new smoothed series is computed, a response of EXSMO restarts the analysis at the first prompt.
COMMAND       DESCRIPTION
PRINT         A table of data values and smoothed-series values is displayed for data points numbered from p to q, where p and q are positive integers. For values greater than N (the number of observations) only predicted smoothed values can be displayed.
LAST p        Displays the last p data values, along with their corresponding smoothed series values.
FORECAST p    Displays the predicted smoothed series values for the first p data points starting with N + 1.
SHOW p        Displays information (for point p only) on one line.
EXSMO         Allows the user to specify another variable for smoothing, or resmooth the present variable using different parameters.
HOLD          The data for all variables currently defined is written to a HOLD file.
KEEP          Exit from EXSMO, while retaining the data for all created variables in core.
QUIT          Exit from EXSMO, discarding all variables created in the session.

Figure 3-2. EXSMO Control Commands

Each command may be specified by name or with a unique truncation. The PRINT, LAST, and FORECAST commands each display three columns of output (the data point number, the input, and the smoothed data values). Data points beyond the extent of the input file are represented as blanks. The EXSMO control commands support reiterative smoothing and display of variables. The smoothed variables created with EXSMO are saved and are available thereafter as normal variables for use in any subsequent ANALYSE statistical operation (e.g., CORRE, MULTR, etc.). Thus, you can perform multiple analyses using both the original and the smoothed variables.
PRINT
The PRINT command initiates two prompts for the time period to be printed. For example:
print
ENTER FIRST DATA POINT TO OUTPUT - 1
ENTER LAST DATA POINT TO OUTPUT - 3
LAST
The LAST command initiates the prompt:
ENTER NUMBER OF DATA POINTS -
When smoothing a large file this is a useful technique for examining just the end of the file.
FORECAST
The FORECAST command initiates the following prompt:
ENTER NUMBER OF DATA POINTS -
This command is useful when only the forecasted data points are required, particularly when used in conjunction with the LAST command. For example, LAST 4 FORECAST 5 produces a display of the last four actual time periods together with the first five forecasted time periods. NOTE: PRINT, LAST, and FORECAST may be entered together to produce an integrated display.
SHOW
The SHOW command initiates the following prompt:
ENTER POINT TO DISPLAY -
HOLD
The HOLD command initiates the following prompts:
ENTER FIRST DATA POINT TO OUTPUT -
ENTER LAST DATA POINT TO OUTPUT -
HOLD creates a raw data file (and Master File Description) that can be used immediately in graphs, reports, etc. The file created, using the name specified by the STATSET HOLD flag, will contain all of the new variables created by the smoothing process in addition to the original fields from the target file. All pre-existing alpha and integer fields (for maintaining identity and date facilities) are preserved in their original formats. All other numeric fields are held in decimal (D) format with the number of decimal places that they originally held. All smoothed fields created by EXSMO are held in D15.3 format. Numeric variables not forecasted are assigned values as set in the STATSET MISSVAL flag (alpha fields are set to blank) for the forecasted time periods.
KEEP
The KEEP command causes an exit from EXSMO, while retaining all of the newly created variables in core. You can then perform other ANALYSE functions and later return to EXSMO for further smoothing.
QUIT
The QUIT command exits EXSMO and returns to ANALYSE, deleting all of the smoothed data created during the EXSMO session.
The analysis performs a principal components solution and a varimax rotation of the factor matrix. The results of the principal component analysis determine the minimum number of dimensions needed to account for most of the original variable set variance. The varimax rotation is used to simplify columns (factors), rather than rows (variables), in the factor matrix. In the extraction, eigenvalues equal to or greater than the supplied eigenvalue are retained. This is done to minimize the number of factors. FACTO produces:
    Eigenvalues.
    Cumulative percentage eigenvalues.
    Eigenvectors.
    Factor matrix.
    Variances for each iteration cycle.
    Rotated factor matrix.
    Communalities for initial extraction and final rotation.
Reference
For information about the principal component, varimax rotation method, see: W.J. Dixon (1973), Biomedical Computer Programs manual, Los Angeles: University of California Press.
The required information is taken directly from the input file without further prompting. The input file created by the TABLE request must contain the number of groups, the number of observations in each group, and the name of each group, in addition to the variables. For example:
TABLE FILE SMSA                                     Initiates the request.
WRITE CNT.STATE                                     The number of groups.
WRITE CNT.SMSA BY STATE                             The number of observations (in each group).
PRINT POPULATION MILLIONARES REMARRIAGE BY STATE    Names three variables.
ON TABLE HOLD
END
ANALYSE FILE HOLD
MDISC

Figure 3-3. Creating an Extract File to Analyze with MDISC

NOTE: The counts must contain the number of unique items, not records. The calculation checks for and requires that the number of variables be equal to or greater than the number of groups.

MDISC produces:
    The mean of each variable by group.
    The pooled dispersion matrix.
    The common means for each variable.
    The generalized Mahalanobis D-square.
    Numbered discriminant functions.
        a. Constant.
        b. Coefficients.

        a. Observation.
        b. The probability associated with largest discriminant function.
        c. The largest function number.
References
T.W. Anderson (2nd Edition - 1984), Introduction to Multivariate Statistical Analysis, (Section 6.6 through 6.8), New York: John Wiley and Sons.
W.J. Dixon (1973), Biomedical Computer Programs, Los Angeles: University of California Press.
MULTR supports processing of both equation files and records with missing data. A linear relationship of the following form is sought:

$$Y = \sum_{i=1}^{n} C_i X_i + C_0$$

where:
$Y$
Is the dependent variable.
$C_i$, $C_0$
Are the n regression coefficients relating the change in Y caused by a change in each $X_i$, and the intercept $C_0$.

Since MULTR searches for a relationship between variables, it automatically calls CORRE to produce the data from which it selects the Y and $X_i$ variables for the regression (if not called previously, in which case the data would already be available). After examining the means, standard deviations, and correlation coefficients for all variables, the system prompts the user to determine:
    The dependent variable (name, truncation or number).
    Each independent variable (name, truncation or number).
    Whether a table of residuals is desired.
    Whether the regression equation is desired.
To simplify variable selection, especially when many variables are included (a maximum of 64 is permitted), variable names are also assigned position numbers which you can use when selecting them. After the analysis is performed and the report appears, you may select as many alternate analyses (other variables) as desired, perform other types of analysis and return to evaluate other models displayed by MULTR. MULTR displays:

The Variable.
    Mean.
    Standard deviation.
    Correlation vs. the dependent variable.
    Regression coefficient.
    Standard error of regression coefficient.
    Computed T value.
The Intercept.
The Multiple Correlation Coefficient.
The Standard Error of Estimate.
An Analysis of Variance for the Regression.
    Source of variation.
    Degrees of freedom.
    Sum of squares.
    Mean squares.
    F values.
The Table of Residuals and/or Durbin-Watson upon prompted request.
    Observation.
    Actual dependent variable value.
    Estimated dependent variable value.
    Residual (actual - estimate).
    Durbin-Watson statistic.
If EQFILE processing is on (see Section 2.3 Preparing the Environment: STATSET on page 2-2), the regression equation is stored in an equation file FOCEXEC. This equation, along with the results of other regressions, can be used for graphics, analysis, reporting, etc. To create a regression equation for further processing, set the STATSET PRINT flag OFF, and EQFILE processing ON; this suppresses printed output.
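The fitting step behind MULTR can be sketched in Python. The manual states that the Gauss-Jordan method is used to solve the normal equation; the sketch below follows that idea but is only an illustration, not the FOCUS code. The function names `gauss_jordan_solve`, `multr`, and `durbin_watson` are hypothetical, and the Durbin-Watson formula used is the standard textbook definition rather than anything taken from this manual.

```python
def gauss_jordan_solve(a, b):
    """Solve a small dense system a * c = b by Gauss-Jordan elimination."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

def multr(xs, y):
    """Fit Y = C0 + sum(Ci * Xi); returns ([C0, C1, ...], residuals)."""
    rows = [[1.0] + list(x) for x in xs]   # leading 1.0 carries the intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    coef = gauss_jordan_solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coef, r)) for r in rows]
    residuals = [yi - fi for yi, fi in zip(y, fitted)]  # actual - estimate
    return coef, residuals

def durbin_watson(residuals):
    """Standard Durbin-Watson statistic for a sequence of residuals."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den
```

For example, `multr([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]], y)` returns the intercept followed by one coefficient per independent variable, in the order the variables were supplied.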
References:
The Gauss-Jordan Method is used in the solution of the normal equation.

W.W. Cooley and P.R. Lohnes (1971), Multivariate Procedures for the Behavioral Sciences, (Chapter 3), New York: Kreiger.
Bernard Ostle and Rick Mensing (1975), Statistics in Research, (Chapter 8), Ames, Iowa: Iowa State College Press.
Norman Draper and Harry Smith (1981), Applied Regression Analysis, NY: Wiley.
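The approach named above (building the normal equations and solving them by Gauss-Jordan elimination) can be sketched in Python as follows. This is only an illustration of the technique, not the MULTR code; the function names `gauss_jordan_solve` and `fit_multiple_regression` and the sample data are hypothetical.

```python
def gauss_jordan_solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], converted to floats.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Pick the row with the largest pivot candidate for stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        # Eliminate this column from every other row (above and below).
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [rv - f * cv for rv, cv in zip(m[r], m[col])]
    return [row[n] for row in m]


def fit_multiple_regression(xs, y):
    """Return [C0, C1, ..., CN] for Y = C0 + C1*X1 + ... + CN*XN."""
    design = [[1.0] + list(row) for row in xs]  # leading 1 gives the intercept
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(k)]
    return gauss_jordan_solve(xtx, xty)


# Hypothetical data generated from Y = 1 + 2*X1 + 3*X2.
xs = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
y = [1 + 2 * a + 3 * b for a, b in xs]
coeffs = fit_multiple_regression(xs, y)
```

With exact data such as this, `coeffs` recovers the intercept and the two regression coefficients to within rounding error.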
Subsequently, you will be prompted for the following:
Highest degree polynomial to be used (1 - 10).
The dependent and independent variables (name, truncation or number).
Is a table of residuals desired?
Is the regression equation desired?

The analysis calculates powers of the independent variable to calculate polynomials of increasing degree. The calculation proceeds by degree until there is no further reduction in the residual sum of squares or the maximum degree polynomial specified is reached.

POLRG produces the following for the polynomial degree at each step:
The intercept.
The regression coefficients.
The analysis of variance for the degree step.
-The source of variation.
-The degrees of freedom.
-The sum of squares.
-The mean square.
-The F value.
-The sum of squares improvement.
The regression equation for the degree step.
A table of residuals and/or Durbin-Watson upon request.
POLRG and MULTR share the same methodology. EQFILE and PRINT control are also supported.
You will be prompted for all required input and may specify any or all statistics (optionally grouped by a sort field). You can hold the statistical output as a HOLD file for use by FOCUS (reporting, graphics, etc.). Processing of missing data is automatically supported. STATS is useful for obtaining a significant quantity of descriptive statistics on up to 64 variables. This information provides both a valuable end-product report (e.g., modes, medians, deciles, quartiles, etc.), as well as a necessary step in the selection of subsequent analyses. Each statistical option has a name or number, as shown in Figure 3-4. When you enter STATS in response to the ANALYSE prompt, you are prompted for a statistical operation:
ENTER STATISTICAL OPERATION DESIRED - stats
ENTER OPTION NAME(S) OR NUMBER(S) DESIRED (E.G. 3,MEAN,VAR,14) all
Enter the name, number, or unique truncation of the options desired. The selected statistics are displayed for all the numerical variables. A response of ? produces a help display similar to the following one.

Number  Name      Explanation
1       ALL       Options 2 - 14.
2       MEAN      The average value.
3       MEDIAN    The midpoint, or 50th percentile, derived after all values are in order (lowest to highest).
4       RANGE     The maximum minus the minimum value.
5       STDEV     A measure of dispersion. The square root of the variance.
6       MODE      The most frequent value of a variable.
7       MAXIMUM   The highest value.
8       STERROR   The standard error measure of sample mean stability, estimated by the standard deviation divided by the square root of the number of observations.
9       SKEWNESS  A measure of the symmetry of a distribution.
10      MINIMUM   The lowest value.
11      VARIANCE  A measure of data dispersion about its mean.
12      KURTOSIS  A measure of a distribution's peaks or flatness.
13      DECILES   A table of deciles or variable values at each of 10 10% population points.
14      QUARTILE  A table of quartiles or variable values at each of 4 25% population points.
15      NUM_OBS   Number of observations present is placed in a HOLD file if the HOLD option is selected.
40      GROUPS    Data is grouped according to the first field.
41      HOLD      Statistics are written to a HOLD file specified by the STATSET command. Printing of statistics is suppressed.
42      PRINT     Overrides print suppression by HOLD.

Figure 3-4. Table of STATS Statistical Options

The following response to a STATS prompt for an option
2, 3, 4, DEC, Q
produces means, medians, ranges, the 10 deciles, and the four quartiles for each numeric variable in the subject file. Note that you can mix numbers and names in a response and that typing ahead is supported. The options entered above could have been input on the initial response to the ANALYSE prompt for an operation, as shown below:
ENTER STATISTICAL OPERATION DESIRED -- stats 2,3,r,d,q
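How a mixed response of numbers and truncated names could be resolved is sketched below in Python. This is an illustration only, not the STATS parser: the function `resolve_options` is hypothetical, and the option names for numbers 4 through 13 are assumed from the option headings used later in this section.

```python
# Option numbers and names; 4-13 are assumed from the section headings.
STATS_OPTIONS = {
    1: "ALL", 2: "MEAN", 3: "MEDIAN", 4: "RANGE", 5: "STDEV", 6: "MODE",
    7: "MAXIMUM", 8: "STERROR", 9: "SKEWNESS", 10: "MINIMUM",
    11: "VARIANCE", 12: "KURTOSIS", 13: "DECILES", 14: "QUARTILE",
}


def resolve_options(response):
    """Turn a response such as '2,3,r,d,q' into a list of option names."""
    chosen = []
    # Commas and blanks are both treated as separators.
    for token in response.replace(",", " ").split():
        if token.isdigit():
            chosen.append(STATS_OPTIONS[int(token)])
            continue
        # A name is accepted only if the truncation matches exactly one option.
        matches = [n for n in STATS_OPTIONS.values()
                   if n.startswith(token.upper())]
        if len(matches) != 1:
            raise ValueError(f"{token!r} is not a unique truncation")
        chosen.append(matches[0])
    return chosen


selected = resolve_options("2,3,r,d,q")
```

In this sketch an ambiguous truncation such as `M` (MEAN, MEDIAN, MODE, MAXIMUM, MINIMUM) is rejected rather than guessed.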
ALL Option
All of the statistical options selected are calculated for each numeric variable (except for the group field) in the subject file. If you choose to include missing data in the analysis (see Section 2.3 Preparing the Environment: STATSET on page 2-2), the number of observations present for each variable is displayed against the size of the sample. The ALL option produces all of the statistics (options 2 - 14).
MEAN Option
The MEAN option produces the mean, or average value, for the variable. It is a simple measure of the variable's central tendency (the sum of all variable values divided by the number of values). The formula used is as follows

$$\bar{X} = \frac{\sum_{i=1}^{N} X_i}{N}$$

where:

i
= 1, 2, 3, ..., N.
MEDIAN Option
The MEDIAN option calculates the middle case value for each numeric variable. When the variable is ranked from its lowest to highest values, 50% of the cases lie above the median and 50% lie below it. The median value lies precisely on the 50th percentile. If the number of cases (N) is odd, the median is the value of case (N + 1)/2 in the ranked order. If N is even, the median is linearly interpolated according to the following formula

$$\text{median} = X_{N/2} + 0.5 \times \left( X_{N/2+1} - X_{N/2} \right)$$

where:

$X_{N/2}$
Is the value of case N/2 in the ranked order.
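The odd and even cases of the MEDIAN rule can be written out as a short Python sketch. The function name `median` is illustrative; the ranks in the text are 1-based, so the code subtracts 1 when indexing.

```python
def median(values):
    """Median following the odd/even rule described for the MEDIAN option."""
    ranked = sorted(values)
    n = len(ranked)
    if n == 0:
        raise ValueError("no observations")
    if n % 2 == 1:
        # Odd N: the value of case (N + 1)/2 in ranked order.
        return ranked[(n + 1) // 2 - 1]
    lower = ranked[n // 2 - 1]   # X at rank N/2
    upper = ranked[n // 2]       # X at rank N/2 + 1
    return lower + 0.5 * (upper - lower)


odd_case = median([5, 1, 3])
even_case = median([4, 1, 3, 2])
```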
RANGE Option
The RANGE option calculates the difference between the maximum and minimum value for each numeric variable.
STDEV Option
The STDEV option calculates the standard deviation for each numeric variable. In normal distributions, about 68% of the cases lie within the mean ± 1 standard deviation. It is a measure of the spread of values and is equal to the square root of the variance ($S^2$) defined below.

$$\mathrm{STDEV} = \sqrt{\mathrm{VARIANCE}} = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1}}$$
MODE Option
The MODE option calculates the value that occurs most frequently for each numeric variable. If more than one value occurs the same number of times, the lowest value is deemed the mode. If the mode cannot be calculated (e.g., all values occur only once), a message to that effect is printed and the mode is set to the missing value as set in STATSET.
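The tie-breaking and "no mode" rules can be sketched in Python as follows. This is an illustration, not the STATS code: the function name is hypothetical, and the missing value -999.0 is an assumption standing in for whatever value is set in STATSET.

```python
from collections import Counter

MISSING = -999.0  # assumed stand-in for the STATSET missing value


def mode(values, missing=MISSING):
    """Most frequent value; lowest value wins ties; missing if no value repeats."""
    if not values:
        return missing
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        # Every value occurs only once, so no mode can be calculated.
        return missing
    return min(v for v, c in counts.items() if c == top)


tied = mode([3, 1, 3, 1, 2])     # 1 and 3 both occur twice
no_mode = mode([1, 2, 3])
```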
MAXIMUM Option
The MAXIMUM option determines the highest value for each numeric variable.
STERROR Option
The STERROR option calculates the standard error for each numeric variable. Given a sample (a given group of cases), the true population mean can be estimated by examining the means for a large number of equal sized samples chosen from that population. The array of these sample means forms a normal distribution. This distribution has a standard deviation which is called the standard error. It is an estimate of the difference between a given sample mean and an estimated population mean. The standard error is determined by dividing the standard deviation by the square root of the number of observations:
$$\mathrm{STERROR} = \frac{S}{\sqrt{N}}$$
SKEWNESS Option
The SKEWNESS option calculates the skewness for each numeric variable. It is a measure of the deviation from symmetry for a distribution. Since skewness (or third moment) is an odd power of $(X_i - \bar{X})$, a value of zero indicates symmetry, a positive value indicates a longer tail above the mean ($\bar{X}$), and a negative value indicates a longer tail below it. The following equation defines skewness:

$$\mathrm{SKEWNESS} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{X_i - \bar{X}}{S} \right)^3$$

where:

$\bar{X}$
Is the mean.

i
= 1, 2, ..., N.

The denominator is the calculation formula for $S^3$ (where $S^2$ is the variance). If skewness cannot be calculated, a message is printed and the held value is set to the missing value from STATSET.
MINIMUM Option
The MINIMUM option determines the lowest value for each numeric variable.
VARIANCE Option
The VARIANCE option calculates the variance ($S^2$) for each numeric variable. It is a measure of variation from the sample mean. As an even power (second moment), both positive and negative differences count equally with large variations counting more than small ones. The variance is literally the average squared deviation from the mean:

$$S^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1}$$

where N-1 is generally taken as the denominator (instead of N) assuming sample data rather than the entire population. The difference is negligible for large samples (large N). A small variance occurs when there is little variation in the sample. For computational purposes, the following formula is used:

$$S^2 = \frac{\sum_{i=1}^{N} X_i^2 - N \bar{X}^2}{N - 1}$$
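That the definitional and computational forms of the variance agree can be checked with a small Python sketch. The function names and the sample data are illustrative only.

```python
def variance_definition(values):
    """Sum of squared deviations from the mean, divided by N - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def variance_computational(values):
    """(Sum of squares - N * mean^2) / (N - 1), a single pass over the data."""
    n = len(values)
    mean = sum(values) / n
    return (sum(x * x for x in values) - n * mean * mean) / (n - 1)


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
v_def = variance_definition(data)
v_comp = variance_computational(data)
```

For this data the mean is 5 and the sum of squared deviations is 32, so both forms give 32/7. The computational form avoids a second pass but can lose precision when the mean is large relative to the spread.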
KURTOSIS Option
The KURTOSIS option calculates the kurtosis for each numeric variable. As a fourth power of the difference from the mean (the fourth moment), it is a measure of the flatness or sharp definition of a sample distribution. The kurtosis for a normal distribution is zero. A positive or negative value indicates, respectively, a distribution narrower or flatter than a normal one. It is defined by:

$$\mathrm{KURTOSIS} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{X_i - \bar{X}}{S} \right)^4 - 3$$

Note the denominator is the square of the variance ($S^4$) and the minimum value of the KURTOSIS is -3. If the kurtosis is not available, a message is printed and the value, if held, is set to the missing value set by STATSET.
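The SKEWNESS and KURTOSIS definitions can be sketched together in Python, since both are averages of a power of the standardized deviation. This is a hedged reading of the formulas: it assumes each deviation is divided by S, that S uses the N-1 denominator of the VARIANCE option, and that the sum is divided by N. The function names are hypothetical.

```python
import math


def standardized_moment(values, power):
    """Average of ((X_i - mean) / S) ** power, with S from the N-1 variance."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    if s == 0:
        raise ValueError("moment cannot be calculated: no variation")
    return sum(((x - mean) / s) ** power for x in values) / n


def skewness(values):
    return standardized_moment(values, 3)


def kurtosis(values):
    # Subtracting 3 makes the value zero for a normal distribution.
    return standardized_moment(values, 4) - 3.0


symmetric = skewness([1, 2, 3, 4, 5])
right_tailed = skewness([1, 1, 1, 10])
flat = kurtosis([1, 2, 3, 4, 5])
```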
DECILES Option
The DECILES option calculates the 10 deciles for each numeric variable. If a variable is sorted from low to high, then the 10 deciles are the exact values below which 10%, 20%, 30%, ..., 90%, 100% of the values lie. The kth decile is calculated as follows

$$\mathrm{DECILE}_k = X_i + (X_{i+1} - X_i) \left[ \frac{N \times k}{10} - \mathrm{INT}\!\left( \frac{N \times k}{10} \right) \right]$$

where:

N
Is the number of observations.

k
Is the decile.

i
$= \mathrm{INT}\!\left( \frac{N \times k}{10} \right)$
QUARTILES Option
The STATS option QUARTILES calculates the four quartiles for each numeric variable. They are calculated similarly to deciles and represent the 25%, 50%, 75%, 100% population points. The formula for the kth quartile is

$$\mathrm{QUARTILE}_k = X_i + (X_{i+1} - X_i) \left[ \frac{N \times k}{4} - \mathrm{INT}\!\left( \frac{N \times k}{4} \right) \right]$$

where:

i
$= \mathrm{INT}\!\left( \frac{N \times k}{4} \right)$
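Deciles and quartiles share one interpolation rule, which the following Python sketch applies for any number of parts. It is an illustration only: the function name is hypothetical, and clamping the rank at the ends of the data (when i is 0 or i + 1 exceeds N) is an assumption, since the text does not say how those cases are handled.

```python
def quantile_points(values, parts):
    """Population points for parts=10 (deciles) or parts=4 (quartiles)."""
    ranked = sorted(values)
    n = len(ranked)
    points = []
    for k in range(1, parts + 1):
        position = n * k / parts
        i = int(position)            # INT(N * k / parts), a 1-based rank
        fraction = position - i
        # Assumed edge handling: clamp ranks to the range 1..N.
        lower = ranked[max(i, 1) - 1]
        upper = ranked[min(i + 1, n) - 1]
        points.append(lower + (upper - lower) * fraction)
    return points


quartiles = quantile_points([10, 20, 30, 40, 50, 60], 4)
deciles = quantile_points(list(range(1, 21)), 10)
```

With six values the first quartile falls at position 1.5, halfway between the first and second ranked values.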
NUM_OBS Option
The NUM_OBS option determines the number of values present for each numeric variable. It is always displayed.
GROUPS Option
The GROUPS option causes all selected options to be calculated for each numeric variable, one for each value of the first analyzed field in the file. The group field is assumed to be sorted in ascending order and may be numeric or alphanumeric. No statistics are calculated for the first field (or variable) if the GROUPS option is specified. The group field is then the first field in the original HOLD file and keeps its original field name.
HOLD Option
The HOLD option causes the output of STATS to be held as a HOLD file with the filename set by STATSET. Printing is suppressed unless PRINT is also specified. If STATS ALL, HOLD, and GROUPS are specified, and the STATSET HOLD flag is left as the default value, a Master File Description is created for STATHOLD. The HOLD option generates a HOLD file containing all of the selected STATS options. A record is created for each variable, and for each group field value (if GROUPS is specified). The HOLD file can then be used for reports, graphs, and/or relational matches with other data files. The latter can be used to create standardized data (data with zero mean and unit deviation).
PRINT Option
The STATS option PRINT sends output to a print queue, and is used to override the normal print suppression of the HOLD option.
Instructions are provided for the last prompt (nature of variables) the first time STEPR is executed during an ANALYSE session, but not thereafter. Each step of the analysis looks at the reduction of the sum of squares for each variable. Each step adds the next independent variable that shares the highest partial correlation with the dependent variable. Forced or deleted designations always take precedence.
STEPR produces:
The Dependent Variable.
The number of forced variables.
The number of deleted variables.

For each step in the regression:
The Step Number.
The name, number and (forced or available) designation of the variable entered.
The sum of squares reduced.
The cumulative proportion reduced.
The multiple correlation coefficient.
The multiple correlation coefficient adjusted for degrees of freedom.
The F value for analysis of variance.
The standard error of estimate.
For each variable name and number.
-The mean.
-The standard deviation.
-The regression coefficient.
-The standard error of the regression coefficient.
-The T value.
-The Beta weight.
The regression equation at that step.
A table of residuals and/or Durbin-Watson upon prompted request.
The analysis continues step by step and is terminated if the proportion reduced is less than the limiting constant specified by the user (0 is acceptable) or upon completion of the link. As with MULTR, EQFILE and PRINT control is supported. The analysis uses the Abbreviated Doolittle Method to enter variables in the regression and compute their regression coefficients.
Reference
Carl A. Bennett and others, Statistical Analysis in Chemistry and the Chemistry Industry, (Appendix 6A), Ann Arbor, Michigan: Books on Demand (313-761- 4700).
TIMESER forecasting functions create additional records containing calculated variables along with existing actual variables. The actual variables are marked as missing for these new time series records by the ANALYSE missing value indicator. If MISSING is not set ON, TIMESER assumes a missing value with the default message:
MISSING DATA VALUE UNDEFINED: ASSUMED TO BE: -999.00
TIMESER then prompts for the field to be used as the time series variable and the value of 1 interval. For example:
ENTER NAME OF TIME VARIABLE- QUARTER
ENTER TIME FOR 1 INTERVAL (e.g. "1", "2 DAYS", "1 MONTH") - 3
The time-variable, time intervals, and TIMESER commands are described in the following sections.
you may enter the fieldname, alias, a position number, or unique truncation of the variable that specifies position of "time" in the series. TIMESER then prompts for the increment between successive time periods. For example:
ENTER ITEM FOR 1 INTERVAL (e.g., "1", "2 DAYS", "1 MONTH") - .25 MONTH
Respond with just a number for a series that is not date-oriented (e.g., 1, 3, .5) or a number and a unit for a date-oriented time series (e.g., "1 DAY" for a YMD or MDY formatted time-variable, "1 MONTH" for a YM or MY formatted time-variable, etc.). TIMESER considers a time variable date-oriented if it has a FOCUS date format in the file being analyzed (e.g., I4YM). The only restrictions on the time-variable are that the values must be in ascending order and an integral multiple of the TIME interval.

TIMESER supports not-present values or gaps in the series being analyzed. Based on the interval specified, the full series is formed by assigning the missing value indicator to records not originally provided. For example, consider a YM (year/month) formatted time variable. Quarterly data (every third month provided) is formed into a monthly series by specifying a 1 MONTH interval. Similarly, monthly data is formed into "weekly" data by specifying a .25 MONTH interval. The series points for missing values may then be valued by either linear interpolation (LINTERP) or exponential interpolation (EINTERP) as described below.
TIMESER commands may be specified in full or by a unique truncation of the name (e.g., "MTOTAL SALES 4" and "MT SA 4" are equivalent). Fields may be referenced by full fieldname, alias or unique truncation. (The default prefixes are shown in parentheses following the commands. When using them, separate prefixes from fieldnames with periods.)
produces

$$\mathrm{LE}.Q_i = Q_{i+N}$$

where:

$Q_i$
Is the ith value of the variable Q.

Note that this example illustrates typing ahead. If you enter only LEAD, FOCUS will prompt you for a variable.

produces:

$$\mathrm{LG}.Q_i = Q_{i-N}$$
which results in a moving average extrapolation that produces values based on maintaining the last average calculated.
results in:

$$\mathrm{MT}.Q_i = \sum_{j=i-N+1}^{i} Q_j$$
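The LE., LG., and MT. definitions above can be sketched in Python on a plain list. This is an illustration, not TIMESER itself: the function names are hypothetical, and assigning the missing value (assumed here to be the -999.0 default quoted earlier) where the window runs off the series is an assumption about edge handling.

```python
MISSING = -999.0  # assumed missing-value indicator


def lead(q, n):
    """LE.Q_i = Q_(i+N); missing where i + N runs past the end."""
    return [q[i + n] if i + n < len(q) else MISSING for i in range(len(q))]


def lag(q, n):
    """LG.Q_i = Q_(i-N); missing where i - N falls before the start."""
    return [q[i - n] if i - n >= 0 else MISSING for i in range(len(q))]


def moving_total(q, n):
    """MT.Q_i = sum of Q_j for j = i-N+1 .. i; missing until N points exist."""
    return [sum(q[i - n + 1:i + 1]) if i - n + 1 >= 0 else MISSING
            for i in range(len(q))]


sales = [10, 20, 30, 40, 50]
le = lead(sales, 1)
lg = lag(sales, 2)
mt = moving_total(sales, 3)
```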
results in

$$\mathrm{CA}.Q_i = \frac{1}{N} \sum_{j = i - \frac{N-1}{2}}^{i + \frac{N-1}{2}} Q_j$$

when N is odd. If N is even, the interval extends 1 period further forward than backward.
CA.Q i =
Qj
j = i (N 1 ) ----------------2
LINTERP Command
Substitute values for missing values in the series are determined and supplied using linear interpolation. There is no prefix. "LINTERP Q" results in a linear fit (using Q=a+bT) between the nearest two present values to assign values for missing values.
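A minimal Python sketch of this kind of linear fill is shown below. It is not the LINTERP implementation: the function name is hypothetical, the missing value -999.0 is assumed from the default quoted earlier, equally spaced periods are assumed, and leaving leading or trailing gaps untouched is an assumption.

```python
MISSING = -999.0  # assumed missing-value indicator


def linterp(q, missing=MISSING):
    """Fill gaps with a straight line between the nearest present values."""
    filled = list(q)
    present = [i for i, v in enumerate(q) if v != missing]
    # Walk over consecutive pairs of present points and fill what lies between.
    for left, right in zip(present, present[1:]):
        slope = (q[right] - q[left]) / (right - left)   # b in Q = a + bT
        for t in range(left + 1, right):
            filled[t] = q[left] + slope * (t - left)
    return filled


# Quarterly values expanded to a monthly series, with two gaps per quarter.
monthly = linterp([10, MISSING, MISSING, 40, MISSING])
```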
EINTERP Command
Same as LINTERP but an exponential interpolation is performed by entering
EINTERP Q
where:
$$Q = a e^{bT}$$
Is used to find an exponential fit between the two adjacent values to assign values for variables with null values.
where:
LD.Q i = Q i + N Q i
where:
GF.Q i = Q i Q i N
New points are calculated by adding either the user-provided growth factor (user) or the difference between the last two points to the last point.
$$Q_i = Q_{i-1} + \text{user}$$

or

$$Q_i = Q_{i-1} + [Q_{i-1} - Q_{i-2}]$$
New points are calculated by increasing the last point by either the user-provided percentage growth factor (user) or percentage difference between the last two points.

$$Q_i = Q_{i-1} \times \text{user}$$

or

$$Q_i = Q_{i-1} \times \left[ \frac{Q_{i-1}}{Q_{i-2}} \right]$$
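The two extrapolation rules described above (additive growth and percentage growth) can be sketched in Python as follows. The function names are hypothetical, and reading the percentage "user" value as a multiplier applied to the last point is an assumption about the garbled formulas.

```python
def extrapolate_additive(q, periods, user=None):
    """Add a fixed step: the user value, or the last observed difference."""
    out = list(q)
    for _ in range(periods):
        step = user if user is not None else out[-1] - out[-2]
        out.append(out[-1] + step)
    return out


def extrapolate_percentage(q, periods, user=None):
    """Multiply by a growth factor: the user value, or the last observed ratio."""
    out = list(q)
    for _ in range(periods):
        factor = user if user is not None else out[-1] / out[-2]
        out.append(out[-1] * factor)
    return out


additive = extrapolate_additive([100, 110], 3)
additive_user = extrapolate_additive([100, 110], 2, user=5)
doubling = extrapolate_percentage([100.0, 200.0], 2)
```

Because each new point is appended before the next one is computed, the default step or ratio stays constant across the extrapolated periods.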
FIT Command
The FIT command performs regressions to fit a number of equations to the specified variable and optionally forecasts the fitted equation forward. The command
FIT Q
DO YOU WISH TO KEEP PREDICTED VALUES (TYPE "KEEP" OR "NOKEEP")
HOW MANY PERIODS DO YOU WISH TO EXTRAPOLATE
DO YOU WISH TO KEEP RESIDUALS (TYPE "RESID" OR "NORESID")
ENTER THE TYPE(S) OF EQUATION YOU WISH TO FIT -
If you do not name a variable, FOCUS prompts you for one. If predicted values are kept ("KEEP" entered) and residuals are kept, they then become new variables available for holding (see Section 3.10.3 Other TIMESER Commands on page 3-42) and analysis by any ANALYSE function.

The equations that may be fitted are listed below. "Y" represents the variable to be fitted and "T" the time variable. The default prefixes for the predicted-value variables appear in parentheses in the table. The names of the residual variables are constructed by further prefixing the predicted-value variable name with an "R."

Name/(Default Prefix)  Type of Function      Equation
LINEAR (LF.)           Linear                Y = A + B*T
EXP (EF.)              Exponential           Y = A * EXP(B * T)
POWER (PF.)            Power                 Y = A*T**B
HYP1 (H1.)             Simple Hyperbolic     Y = A + B/T
HYP2 (H2.)             Hyperbolic (Type 2)   Y = 1/(A + B*T)
HYP3 (H3.)             Hyperbolic (Type 3)   Y = T/(A + B*T)
HYP4 (H4.)             Hyperbolic (Type 4)   Y = A/(1 + N*B*T)**(1/N)
ALL                    All functions
Figure 3-5. TIMESER Commands that Create New Variables

If the time variable has a date format, FIT will treat the data variable as a function not of the time variable, but of a pseudo variable named T.INDEX, which is defined as the integer K for the Kth data point. T.INDEX is not saved in core, but it is included in the HOLD file if one is requested.

All of the resulting FIT equations may be saved for further FOCUS use in TABLE and GRAPH (this is discussed as a function of STATSET, and in Section 3.10.4, "Saving Forecast Equations"). If the STATSET PRINT option is ON, the equation, the Durbin-Watson and the regression statistics are output for each type of equation selected. The Durbin-Watson is a measure of serial correlation of adjacent residuals in a regression; a value close to 2 indicates a reasonable fit (a value under 2 indicates positive autocorrelation; a value over 2 indicates negative autocorrelation).

NOTE: In each case if only the TIMESER command is entered (e.g., "LEAD"), you are prompted for the specified variable and a number of periods (if required).
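The Durbin-Watson statistic mentioned above is conventionally computed as the sum of squared differences between adjacent residuals divided by the sum of squared residuals. The Python sketch below uses that standard definition; the manual does not show the exact computation FIT performs, so this is an assumption, and the function name and sample residuals are illustrative.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    numerator = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    denominator = sum(e * e for e in residuals)
    if denominator == 0:
        raise ValueError("all residuals are zero")
    return numerator / denominator


# Residuals that alternate in sign: negative autocorrelation, value above 2.
alternating = durbin_watson([1, -1, 1, -1])
# Residuals that stay on one side for a while: positive autocorrelation, below 2.
trending = durbin_watson([1, 1, 1, -1, -1, -1])
```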
Command / Description

DISPLAY
Displays values for the time variable (T.INDEX) and up to 4 specified variables for the user-specified portion of the time series. T.INDEX is a sequential index generated by TIMESER. DISPLAY sends the following prompts:

ENTER UP TO 4 VARIABLES TO DISPLAY --
ENTER FIRST TIME PERIOD TO DISPLAY OR "BOT"->
ENTER LAST TIME PERIOD TO DISPLAY OR "TOP" ->

Enter values for the time variable to specify the first and last periods to be displayed.

HOLD
Writes a hold file (default name STATHOLD, see STATSET) containing all variables, original or created.

KEEP
Saves all variables (original and created) for use by other ANALYSE operations and returns you to the ANALYSE prompt.

QUIT
Deletes new variables and returns you to the ANALYSE prompt.

REPLACE
Replaces a specified variable. REPLACE prompts:

ENTER NAME OF VARIABLE --
ENTER PERIOD TO REPLACE --
ENTER NEW VALUE --

If you type ahead "REPLACE Q N" only the prompt for new value appears.

SUBSET
Restricts all TIMESER operations to a range of data you specify with values entered as the first and last values for the time variable. "SUBSET ?" produces the following prompts:

ENTER SUBSET STATUS: "ON OR OFF" --
ENTER FIRST PERIOD IN SUBSET --
ENTER LAST PERIOD IN SUBSET --

where N1 and N2 define the beginning and end of your subset.

Figure 3-6. TIMESER Control Commands
The resulting FOCEXEC (FITDEFS) will contain DEFINE commands for the fields named LF.Y and PF.Y.
Typing Ahead
Responses to TIMESER prompts may be stacked (i.e., the specifications for an entire command may be typed on one line). For example, the following series of prompts
ENTER COMMAND (OR ? FOR HELP) -- display
ENTER UP TO 4 VARIABLES TO DISPLAY -- sales lg.sales
ENTER FIRST TIME PERIOD TO DISPLAY OR "BOT" -- 8004
ENTER LAST TIME PERIOD TO DISPLAY OR "TOP" -- 8112
Abbreviations
TIMESER accepts the shortest unique truncations of command names in place of the full names. Variable names (fieldnames) and their aliases can also be referenced by unique truncations. Equation names in FIT can be abbreviated (e.g., "LIN" is a suitable replacement for "LINEAR"). All other words must be typed in full.
A simple crosstab, showing the counts for salaries by department (into "Low Paid" vs. "Better Paid"), follows. Each count is a cell, identified by its column and row. (A complete example appears in the "Sample XTABS Terminal Session" at the end of this section.)
In order to clarify such joint distributions, other cell statistics (such as percentages of rows, columns and totals) are offered as options, along with various overall statistics (such as chi-square, Cramer's V, and contingency coefficients). The widths of these tables (panel sizes) are taken from the STATSET flag WIDTH. The number of rows and columns (and therefore, the number of panels) is limited only by available computer memory. (You should be careful not to use XTABS with continuous variables or those with many values, because tables with empty cells are difficult to interpret and may violate many of the assumptions applied in developing the associated statistics.)
The specification of variables consists of a series of variable names or variable lists, each separated by the word BY:
(variable name or list) BY (variable name or list) BY...
The first variable(s) specified becomes the horizontal classifier(s) and those following the BY are the vertical classifiers, running down the page. For example
SALARY BY DEPARTMENT
produces a single crosstab of cases displayed across SALARY by DEPARTMENT (down the page). Variable names may be fieldnames, aliases or unique truncations of any fields in the file being analyzed. A variable list is a series of variable names separated by the word AND. All words and names must be separated by either blanks or commas. For example
UNIT_SOLD AND RETURNS BY PROD_CODE AND DATE
produces four crosstabs: UNIT_SOLD BY PROD_CODE, RETURNS BY PROD_CODE, UNIT_SOLD BY DATE, and RETURNS BY DATE. Multiple BY phrases may also be used. For example
produces three crosstabs: TEMP BY DIST, TEMP BY BACTERIA, and DIST BY BACTERIA. Since typing ahead is supported, a convenient way to generate the above request is as follows:
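How a variable specification expands into individual crosstabs can be sketched in Python. This is an illustration inferred from the two examples in this section (AND lists and multiple BY phrases), not the XTABS parser; the function name `expand_xtabs` is hypothetical.

```python
from itertools import combinations


def expand_xtabs(spec):
    """Expand 'A AND B BY C AND D BY ...' into (across, down) pairs."""
    groups, current = [], []
    # Words may be separated by blanks or commas.
    for token in spec.replace(",", " ").split():
        word = token.upper()
        if word == "BY":
            groups.append(current)
            current = []
        elif word != "AND":
            current.append(word)
    groups.append(current)
    tables = []
    # Every earlier variable list is crossed with every later one.
    for across_group, down_group in combinations(groups, 2):
        for down in down_group:
            for across in across_group:
                tables.append((across, down))
    return tables


four = expand_xtabs("UNIT_SOLD AND RETURNS BY PROD_CODE AND DATE")
three = expand_xtabs("temp by dist by bacteria")
```

The pairs come out in the same order the text lists them, for example `("UNIT_SOLD", "PROD_CODE")` first and `("RETURNS", "DATE")` last.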
ENTER STATISTICAL OPERATION DESIRED - xtabs temp by dist by bacteria
You can then enter option numbers, names, or unique truncations, using the word BY to separate the variable names or variable lists.
All of the XTABS options are shown in the following table in Figure 3-7.

Number  Name      Description

Summary Statistics
1       CHISQ     Chi-square.
2       CRAMV     Cramer's V; phi for 2 by 2 tables.
3       CONT      Contingency coefficient.
4       LAMBDA    Asymmetric lambdas with each variable taken as dependent, and symmetric lambda.
6       TAUB      Kendall's tau b.
7       TAUC      Kendall's tau c.
8       GAMMA     Gamma.
9       SOMERSD   Asymmetric Somers' D with each variable dependent plus symmetric Somers' D.
20      ALL       All summary statistical options.

Cell Statistics
27      EXPECTED  The frequency expected (assuming independence) is printed in each cell.
28      DEVIATN   The deviation of the observed frequency from the expected frequency is printed in each cell.
29      CELLCHI2  The contribution of the cell to the chi-square is printed in each cell. (The sum of the cell chi-squares is the table's chi-square.)
30      ROWPCT    The cell frequency's percentage of the row total is printed in each cell.
31      COLPCT    The cell frequency's percentage of the column total is printed in each cell.
32      TOTPCT    The cell frequency's percentage of the grand total is printed in each cell.
33      COUNT     The count (frequency) is printed in each cell.
34                The frequency weighted according to the values of the selected field.

Control Options
35      NOGRID    Suppresses the default vertical grid normally printed on all tables.
36      DASH      Suppresses the dashed lines that are normally printed between horizontal rows of cells.
40      NORANGE   All variables are assumed to be unranged and no prompting is done for ranges or user-defined.
42                Suppresses page headings.
44                Suppresses all statistics.
45                Suppresses row totals (and total percentages).
46                Suppresses column totals (and total percentages).
50                Data grouped by first field; data must be sorted by first field.
?                 Generates a display of online information about XTABS operation.
CHISQ Option
The CHISQ option calculates and displays the chi-square statistic which tests the independence of the joint distribution of the variables in the table. The statistic does not measure the strength of the relation, but can be interpreted as a test of whether or not the variables are related. Empty cells and extremely small or large sample values lessen the significance of this statistic. Chi-square is defined by the following formula

$$\chi^2 = \sum_{i=1}^{N} \frac{\left( f_{obs}(i) - f_{exp}(i) \right)^2}{f_{exp}(i)}$$

where:

$f_{obs}(i)$
Is the observed frequency in the ith cell.

$f_{exp}(i)$
Is the expected frequency in the ith cell (assuming no relationship between the variables).
The larger the difference between the observed and randomly expected frequencies, the larger chi-square becomes. Large values for chi-square thus indicate the presence of a systematic relationship, while small values imply the absence of a relationship or statistical independence. Chi-square varies with the number of rows and columns used to determine the number of degrees of freedom (provided along with chi-square by XTABS) and the sample size.
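The chi-square calculation can be sketched in Python from a table of observed counts, using the expected frequency under independence (row total times column total, divided by the grand total) for each cell. The function name and sample table are illustrative; the sketch assumes every row and column total is non-zero, and it uses the usual (rows - 1) x (columns - 1) degrees of freedom.

```python
def chi_square(table):
    """Return (chi-square, degrees of freedom) for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for r, row in enumerate(table):
        for c, observed in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[r] * col_totals[c] / n
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof


related, dof = chi_square([[10, 20], [20, 10]])
unrelated, _ = chi_square([[10, 10], [20, 20]])
```

In the first table every expected count is 15, so each cell contributes 25/15 and the total is 20/3; in the second the observed counts equal the expected ones and chi-square is zero.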
CRAMV Option
The CRAMV statistic calculates and displays Cramer's V ($\phi$ for 2 x 2 tables) which makes a correction for the sample size (chi-square does not). It is defined by the following formula

$$V = \left[ \frac{\phi^2}{\min\left( (r - 1), (c - 1) \right)} \right]^{1/2}$$

where:

"r" and "c"
Are the numbers of rows and columns in the table.

$\phi$
Is the phi statistic (appropriate only for 2x2 crosstabs) defined by:

$$\phi = \left[ \frac{\chi^2}{N} \right]^{1/2}$$

$\phi$ (for 2 x 2 crosstabs) corrects $\chi^2$ for the number of cases (N). For 2 x 2 cross-tabulations, $\phi$ ranges from 0 to +1 for a perfect relationship. For larger crosstabs (greater than 2 x 2), $\phi$ has no upper limit. Cramer's V is used to adjust for the minimum of the rows and columns. V ranges from 0 to +1: values near 0 indicate minimal relationships, and large Cramer's V values indicate a strong association.
CONT Option
The CONT option calculates and displays the contingency coefficient. It is another chi-square-based statistic adjusted for sample size. It is defined by the following formula:

$$C = \left[ \frac{\chi^2}{\chi^2 + N} \right]^{1/2}$$
The contingency coefficient, C, runs from 0 to a maximum value dependent on the size of the table. In comparisons it should be used with crosstabs of identical dimensions (same numbers of rows and columns).
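The three chi-square-based measures (phi, Cramer's V, and the contingency coefficient) can be sketched in Python from a chi-square value and the number of cases. The function names are hypothetical, and the sample figures (chi-square of 20/3 on 60 cases in a 2 x 2 table) are illustrative.

```python
import math


def phi(chi2, n):
    """Phi: chi-square corrected for the number of cases."""
    return math.sqrt(chi2 / n)


def cramers_v(chi2, n, rows, cols):
    """Cramer's V: phi squared adjusted by the smaller of (rows-1), (cols-1)."""
    return math.sqrt(chi2 / (n * min(rows - 1, cols - 1)))


def contingency_coefficient(chi2, n):
    """C = sqrt(chi2 / (chi2 + N))."""
    return math.sqrt(chi2 / (chi2 + n))


chi2, n = 20 / 3, 60
phi_value = phi(chi2, n)
v_value = cramers_v(chi2, n, rows=2, cols=2)
c_value = contingency_coefficient(chi2, n)
```

For a 2 x 2 table the adjustment divides by 1, so V and phi coincide.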
LAMBDA Option
The LAMBDA option calculates and outputs the three lambda statistics, one symmetric lambda, and two asymmetric lambdas, one with an independent row variable and a dependent column variable, and one with the reverse (dependent row variable and independent column variable). Asymmetric lambda is based on the proportional reduction in error in estimating the distribution of the dependent variable when the independent variable is known. In other words, it measures how well you can identify the value of the dependent variable.
When all of the occurrences for any given value of the independent variable occur in a single cell (i.e., the remaining cells for that row or column are all zero), then the value of lambda is 1. Asymmetric lambda is defined by the following formula
$$\lambda_{asym} = \frac{\sum_{j} \max(f_{jk}) - \max(f_k)}{N - \max(f_k)}$$

where:

$\sum_{j} \max(f_{jk})$
Is the sum of j maximum values of all cell frequencies for each category of the independent variable.

$\max(f_k)$
Is the maximum value for each category of the dependent variable.

The symmetric lambda is an average of the asymmetric lambda. No assumption of dependency is assumed. It is defined by the following formula
TAUB Option
The TAUB option calculates and displays Kendall's Tau b statistic. Tau b, which is most appropriate in a square table (number of rows and columns equal), evaluates all cases pair-wise relative to the ordering (low to high) of each variable. Pairs with both variables higher are called concordant pairs and those with both variables reversed are called discordant. Other cases are considered "tied." Tau b, Tau c, gamma, and Somers' D are all measures of association between two variables and differ mainly in the manner of counting tied pairs. Tau b is defined by the following formula

$$\mathrm{Tau}_b = \frac{P - Q}{\left[ 0.5 \left( N(N-1) - \sum T_{ri}(T_{ri} - 1) \right) \times 0.5 \left( N(N-1) - \sum T_{ci}(T_{ci} - 1) \right) \right]^{1/2}}$$

where:

"P" and "Q"
Are the number of concordant and discordant pairs respectively. Note that if there is a general ordering of pairs in the same direction on both variables, Tau b will be positive.

$T_{ri}$ and $T_{ci}$
Are the number of ties on the row and column variable, respectively. They turn out to also be the respective row and column totals.
TAUC Option
The TAUC option calculates and outputs Kendall's Tau c statistic. It is most appropriate in rectangular cases (the number of rows and columns unequal). Tau c is basically an average value per pair of (P-Q), where a row and column adjusted approximation is taken for the number of pairs. Tau c is defined by the following formula

$$\mathrm{Tau}_c = \frac{2m(P - Q)}{N^2 (m - 1)}$$

where:

P and Q
Are as defined for Tau b.

m
Is the smaller of the number of rows and the number of columns.
GAMMA Option
The GAMMA option calculates and displays the Gamma statistic. It is independent of ties or table size (dimensions) and is defined by the formula:

$$\mathrm{Gamma} = \frac{P - Q}{P + Q}$$

Its value is positive, zero, or negative, if, respectively, there are more, equal, or fewer concordant pairs than discordant pairs. For a 2 by 2 table, gamma equals another statistic called "Yule's Q."
SOMERSD Option
The SOMERSD option calculates and outputs the Somers D statistic. Ties are taken into consideration in a different way than in the Tau statistics. As in the LAMBDA case, three statistics are produced: two asymmetric cases, one for each variable taken (or the dependent variable), and the symmetric case. The asymmetric Somers D is calculated with the following formula
\[
\text{Asymmetric Somers' D} = \frac{P - Q}{P + Q + T_i}
\]
where:
Ti, i = 1, 2
Are the row and column ties when the dependent variable defines the rows and columns, respectively.
P and Q
Are as defined for the Tau statistic.
The symmetric case does not account for which variable is dependent, and is defined by the following formula:
\[
\text{Symmetric Somers' D} = \frac{P - Q}{P + Q + 0.5\,(T_1 + T_2)}
\]
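The three Somers' D values can be sketched in Python as follows. This is an illustration, not FOCUS code; the function names and sample table are hypothetical. It assumes the usual convention that a row tie is a pair of cases in the same row but different columns, and a column tie is a pair in the same column but different rows. The manual does not state whether pairs tied on both variables are counted, so that exclusion is an assumption of the sketch.

```python
def pair_counts(table):
    """Return (P, Q, row ties, column ties) for an ascending-ordered table."""
    n_rows = len(table)
    p = q = t_row = t_col = 0
    for i in range(n_rows):
        for j, count in enumerate(table[i]):
            t_row += count * sum(table[i][j + 1:])          # same row, later column
            for k in range(i + 1, n_rows):
                t_col += count * table[k][j]                # same column, later row
                p += count * sum(table[k][j + 1:])          # concordant
                q += count * sum(table[k][:j])              # discordant
    return p, q, t_row, t_col

def somers_d(table):
    """Return (D with rows dependent, D with columns dependent, symmetric D)."""
    p, q, t_row, t_col = pair_counts(table)
    d_rows = (p - q) / (p + q + t_row)
    d_cols = (p - q) / (p + q + t_col)
    d_sym = (p - q) / (p + q + 0.5 * (t_row + t_col))
    return d_rows, d_cols, d_sym

example = [[10, 5], [3, 12]]                 # hypothetical 2 by 2 table
print(pair_counts(example))                  # (120, 15, 86, 90)
print([round(d, 4) for d in somers_d(example)])
```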
ALL Option
The ALL option calculates and produces all of the available summary statistics.
EXPECTED
The EXPECTED cell statistic calculates and displays the expected cell frequency (assuming independence). It is defined by the formula
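\[
f_{\mathrm{exp}}(i) = \frac{c_i \, r_i}{N}
\]
where:
ci
Is the total of the column containing the ith cell.
ri
Is the total of the row containing the ith cell.
N
Is the table grand total.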
DEVIATN
This cell statistic calculates and displays the deviation between the observed and expected frequency for each cell. It is defined by the formula
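\[
\text{DEVIATN}_i = \mathrm{ABS}\left( f_{\mathrm{obs}}(i) - f_{\mathrm{exp}}(i) \right)
\]
where:
fobs(i)
Is the observed frequency in the ith cell.
fexp(i)
Is the expected frequency in the ith cell.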
CELLCHI2
The CELLCHI2 option calculates and displays the contribution to the chi-square summation for each cell. It is defined by the formula:
\[
\text{CELLCHI2}_i = \frac{\left( f_{\mathrm{obs}}(i) - f_{\mathrm{exp}}(i) \right)^{2}}{f_{\mathrm{exp}}(i)}
\]
where:
fobs(i)
Is the observed frequency in the ith cell.
fexp(i)
Is the expected frequency in the ith cell assuming no relationship between the variables.
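The EXPECTED, DEVIATN, and CELLCHI2 cell statistics are closely related, and a short Python sketch may make the arithmetic concrete. It is not FOCUS code; the function name and sample table are hypothetical, and every expected frequency is assumed to be nonzero.

```python
def cell_statistics(table):
    """Return (expected, deviation, cell chi-square) tables for observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Expected frequency under independence: column total * row total / N
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    # Absolute difference between observed and expected frequency
    deviation = [[abs(o - e) for o, e in zip(obs, exp)]
                 for obs, exp in zip(table, expected)]
    # Contribution of each cell to the chi-square summation
    cell_chi2 = [[(o - e) ** 2 / e for o, e in zip(obs, exp)]
                 for obs, exp in zip(table, expected)]
    return expected, deviation, cell_chi2

example = [[10, 5], [3, 12]]                 # hypothetical observed frequencies
expected, deviation, cell_chi2 = cell_statistics(example)
print(expected)      # [[6.5, 8.5], [6.5, 8.5]]
print(deviation)     # [[3.5, 3.5], [3.5, 3.5]]
print(round(sum(sum(row) for row in cell_chi2), 4))   # table chi-square: 6.6516
```

Summing the CELLCHI2 values over all cells gives the chi-square statistic for the table.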
ROWPCT
The ROWPCT option calculates and displays the cell frequency's percentage of the row total for each cell. It is defined by the formula:
\[
\text{ROWPCT}_i = 100 \, \frac{f_{\mathrm{obs}}(i)}{r_i}
\]
where:
ri
Is the total of the row containing the ith cell.
COLPCT
The COLPCT option calculates and displays the cell frequency's percentage of the column total for each cell. It is defined by the following formula:
\[
\text{COLPCT}_i = 100 \, \frac{f_{\mathrm{obs}}(i)}{c_i}
\]
where:
ci
Is the total of the column containing the ith cell.
TOTPCT
The TOTPCT option calculates and displays the cell frequency's percentage of the table grand total for each cell. It is defined by the following formula:
\[
\text{TOTPCT}_i = 100 \, \frac{f_{\mathrm{obs}}(i)}{N}
\]
where:
N
Is the table grand total.
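The three percentage statistics differ only in the total used as the divisor. The following Python sketch is an illustration, not FOCUS code; the function name and sample table are hypothetical, and all row and column totals are assumed to be nonzero.

```python
def cell_percentages(table):
    """Return (ROWPCT, COLPCT, TOTPCT) tables for a table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    row_pct = [[100 * f / r for f in row] for row, r in zip(table, row_totals)]
    col_pct = [[100 * f / c for f, c in zip(row, col_totals)] for row in table]
    tot_pct = [[100 * f / n for f in row] for row in table]
    return row_pct, col_pct, tot_pct

example = [[10, 5], [3, 12]]                 # hypothetical observed frequencies
row_pct, col_pct, tot_pct = cell_percentages(example)
print(round(row_pct[0][0], 2))   # 66.67  (10 of the 15 cases in row 1)
print(round(col_pct[0][0], 2))   # 76.92  (10 of the 13 cases in column 1)
print(round(tot_pct[0][0], 2))   # 33.33  (10 of the 30 cases in the table)
```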
COUNT
The COUNT statistic calculates and displays the cell frequency (fobs(i)) for each cell.
ALL Option
The ALL option calculates and produces all of the available summary statistics.
Counts and row, column, and total percentages are displayed in each cell. Row and column totals and percentages, along with grand totals, are automatically calculated and displayed (whenever cell row and column percentages are produced). If no variables are specified (e.g., XTABS ALL), the first two variables (or the second and third) in the file being analyzed are used as the ACROSS and BY variables. Output is automatically paneled in accord with the STATSET PANEL setting.
After you specify the crosstab variables and options, XTABS determines whether the required data is available. If this analysis is the first process for the existing selection criteria (see the STATSET option SELECT in Section 2.3 Preparing the Environment: STATSET on page 2-2) the data will be read in and the number of observations displayed as follows:
NUMBER OF OBSERVATIONS = 48
Action
Suppresses the printing of the GRID (printed by default).
Suppresses automatic range prompting (see Section 3.11.7 Specifying Columns and Rows on page 3-56). In this case, one row or column is produced for each discrete variable value and appears with the value as the heading.
Suppresses the printing of a heading in the table of crosstabs (default).
Suppresses automatic calculation and display of all summary statistics (default).
Suppresses automatic calculation of row totals and percentages (default).
Suppresses automatic calculation of column totals and percentages (default).
Produces full crosstab and summary statistics for each discrete value of the first field. The sorting field is assumed to be in ascending order. This is similar to the GROUPS option in the ANALYSE Statistics (STATS) facility.
Displays the online documentation for XTABS.
The response to this prompt determines the range of values included in each column and row, along with their printed labels. Columns and rows are automatically sorted in ascending order. A response of NONE for a variable (alpha or numeric) causes one column or row to be created for each discrete value of the variable, with the label set to the variable value. A column or row range specifies the column's (or row's) lower inclusive bound (values greater than or equal to it are included) and its upper non-inclusive bound (only values less than it are included). A lower bound of BOT and an upper bound of TOP may be specified for the first and last row or column. For example:
BOT -100, 0 100, 100 TOP
where:
BOT -100
Specifies that the first row or column will include the lowest values up to, but not including, -100 (negative 100).
0 100
Specifies that the next row or column will contain values from 0 up to, but not including, 100.
100 TOP
Specifies that the final row or column will contain values of 100 and over.
Ranges must be in ascending order. Discrete ranges may be alpha or numeric, but bounded ranges must be numeric. Either ranges or discrete values may be specified for a single variable, but the two may not be mixed. The resulting columns and rows will be displayed in the ascending order of the data values regardless of the labels specified. Ranges may have gaps, or different lengths, but they may not overlap. The general form of the range specifications is illustrated by the following example:
SALARY bot 16000 as "Low Paid", 16000 top as "Better Paid"
The "AS label" phrase is optional and may be used for none, any, or all ranges specified. The label (truncated to 12 characters) becomes the row or column heading. Labels containing embedded blanks must be enclosed in single quotation marks. For example:
90 100 as "Very Hot"
If no label is specified, then the upper and lower boundaries (or the discrete values) become the column or row labels. Each range or discrete value response is checked for ascending order and another prompt is issued if the values supplied are in error. (Such input for a variable is generally entered on one line.) If additional lines are needed, enter a comma as the last character on each line, to direct XTABS to continue on the following line with another prompt for the same variable. (If you accidentally type a comma at the end of the current line, and do not wish to continue, type NEXT on the next line to satisfy the prompt.) Type LAST, if you wish to respecify the complete ranges (label, etc.) for the preceding variable. Note that prompting for ranges (labels, etc.) continues until the variable is syntactically complete.
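The range rules described above (inclusive lower bound, non-inclusive upper bound, ascending order, gaps allowed, no overlaps, labels truncated to 12 characters) can be sketched in Python. This is an illustration of the rules only, not the XTABS implementation; the function names are hypothetical, and None is used here to stand for BOT and TOP.

```python
import math

def make_ranges(specs):
    """Validate (lower, upper, label) specs; None stands for BOT or TOP."""
    ranges = []
    previous_upper = -math.inf
    for lower, upper, label in specs:
        lo = -math.inf if lower is None else lower
        hi = math.inf if upper is None else upper
        if lo >= hi:
            raise ValueError("lower bound must be below upper bound")
        if lo < previous_upper:
            raise ValueError("ranges must be ascending and must not overlap")
        if label is None:
            # With no label, the bounds themselves become the heading
            low_text = "BOT" if lower is None else str(lower)
            high_text = "TOP" if upper is None else str(upper)
            label = low_text + " " + high_text
        ranges.append((lo, hi, label[:12]))   # headings are cut to 12 characters
        previous_upper = hi
    return ranges

def classify(value, ranges):
    """Return the heading whose range holds value, or None if it falls in a gap."""
    for lo, hi, label in ranges:
        if lo <= value < hi:                  # lower inclusive, upper non-inclusive
            return label
    return None

salary = make_ranges([(None, 16000, "Low Paid"), (16000, None, "Better Paid")])
print(classify(15999, salary))   # Low Paid
print(classify(16000, salary))   # Better Paid
```

A value equal to a boundary, such as 16000 in the sample, falls in the range for which it is the lower bound.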
The FOCUS SET PAUSE=ON facility causes XTABS to wait before printing to allow time for aligning forms on the output device. When STATSET missing value processing is in effect, the number of missing cases is displayed on the generated crosstabulation. Missing cases are not displayed or included in the compilation. If missing cases are required on the table, the range response "-999 as missing" (included with other column or row specifications) will generate appropriately labeled columns or rows. Since typing ahead is supported throughout ANALYSE, a complete crosstabulation may be generated by the following ANALYSE prompt and response:
ENTER STATISTICAL OPERATION DESIRED
xtabs salary by department all, none, none
References
John Mueller, Karl Schuessler, and Herbert Costner (1970), Statistical Reasoning in Sociology (2nd edition), Boston: Houghton Mifflin.
A Error Messages
(FOC108) NUMBER OF ANALYSE VARIABLES EXCEEDS 64
The statistical analysis sub-system cannot handle more than 64 independent variables.
(FOC109) (FOC110) INSUFFICIENT CORE FOR FACTOR ANALYSIS
The ANALYSE option STATSET does not have the parameter requested.
(FOC111) SET VALUE MISSING
The value after the STATSET parameter is missing. A valid value must be provided.
(FOC112) (FOC113) VALUE MUST BE NUMERIC:
The value in response to the ANALYSE prompt, or STATSET parameter must be numerical.
AN ILLEGAL RESPONSE HAS BEEN ENTERED:
The response to the ANALYSE prompt is not recognized. Type EXPLAIN in the ANALYSE mode if assistance is needed.
(FOC114) AN INDEPENDENT VARIABLE IS ALSO A DEPENDENT VARIABLE:
A regression cannot use the same variable as both independent and dependent.
(FOC115) THE NUMBER ENTERED EXCEEDS THE NUMBER OF VARIABLES:
The numerical variable identity provided in the ANALYSE prompt exceeds the number of variables in the file being analyzed.
(FOC116) FLUSHING TO QUIT OR TO NEXT STATISTICAL OPERATION
Sequential processing of ANALYSE commands cannot be continued because of parameter errors. The stacked commands are ignored until the next valid STATMODE is encountered.
(FOC117) VALID RESPONSES ARE ON OR OFF:
There are only two valid responses to this prompt. These are ON or OFF.
(FOC118) DEPENDENT VARIABLE DOES NOT EXIST:
The dependent variable provided in response to the ANALYSE prompt is not specified in the MASTER description for the data.
(FOC119) INDEPENDENT VARIABLE DOES NOT EXIST:
The data format is different from the value supplied to the ANALYSE prompt, i.e., not numeric, etc.
(FOC121) (FOC122) INVALID LEVEL ENCOUNTERED IN DATA FOR FACTOR:
(FOC124)
There is non-numerical data for a variable which must be numerical. This observation must be eliminated from the ANALYSE statistical process.
(FOC125) RECAP CALCULATIONS MISSING :
The word RECAP is not followed by a calculation. Either it should be removed, or a calculation provided.
(FOC126) NUMBER OF DATA VALUES NOT CONSISTENT WITH FACTOR LEVELS:
In the ANALYSE mode the factor analysis procedure FACTO requires at least the same number of fields as levels of factors.
(FOC127) LITERAL TEST VALUE FOR GROUP IS INCORRECT:
In the request statement a screening phrase against a group field has inconsistent values based on the contents of the fields in the group. The size of one or more sections is either incorrect or not numerical.
(FOC128) NOTE..LIMIT USES EQ OR LE TEST CONDITION ONLY:
The RECORDLIMIT and READLIMIT phrases test on EQ or LE only. The meaning is the same.
(FOC133) WARNING: ANALYSE FILE LOST...HAD SAME NAME AS HOLD FILE
The prior HOLD file is over-written. Use new names with the EQFILE option.
All of these files can be created by FOCEXECs distributed with FOCUS. EMPLOYEE, EDUCFILE and JOBFILE are created by EMPTEST (for CMS) and EMPTSO (for TSO) FOCEXECs. To create AIRLINE, PROPERTY, and SMSA, execute the AIRLINE, PROPERTY and SMSA FOCEXECs.