A short introduction to using STATA (version 5.0)
1) Keeping a log file.
To keep a log file of both the commands entered and the resulting output we use the command log. The syntax of the command is given by
log using [filename]
    This will cause STATA to start logging to the file [filename]. The standard extension is “.log”.
log off
    This will stop STATA from logging.
log on
    This will resume logging.
log close
    This will stop the log and close the file currently being used as a log file.
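As a minimal sketch of how these four commands fit together in one session (the file name session1.log is a hypothetical choice, and display is used only to produce some output to record):

```stata
* Start recording commands and output in session1.log (hypothetical name)
log using session1.log

* This command and its result are written to the log
display 2 + 2

* Suspend logging; what follows is not recorded
log off
display 3 + 3

* Resume logging into the same file
log on
display 4 + 4

* Stop logging and close session1.log
log close
```

After log close the file can be opened in any text editor.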
2) Loading data the simple way.
If you have an existing file with data that you want to use in STATA the simplest way to load it into the program is to use the command insheet. This command will expect a file where the data is TAB or comma separated and there is only one observation per line. The syntax of the command is
insheet <varnames> using [filename] <,options>
This will read data from [filename], where the standard extension is “.raw”. If no variable names are specified, it is assumed that the first line contains variable names. The options include:
tab or comma
    This specifies the type of file being read. The default is TAB separation.
names or nonames
    This specifies whether variable names are to be found in the first line of the file.
clear
    This will replace the data currently in memory with the data from the file being read.
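The following sketch shows both ways of calling insheet. The file names wages.raw and wages2.raw and the variable names id, wage and educ are hypothetical; the first file is assumed to be comma separated with variable names in its first line, the second to contain data only.

```stata
* First line of wages.raw holds the variable names
insheet using wages.raw, comma names clear

* Check what was read
describe
list in 1/5

* wages2.raw has no header line, so the names are supplied here
insheet id wage educ using wages2.raw, comma nonames clear
```

The clear option is included so that the commands also work when another dataset is already in memory.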
3) More flexible loading of data.
For data that is contained in files with complex or nonstandard formats STATA provides the dictionary form of the infile command, referred to here as infile2. This form requires the creation of a dictionary file (with extension “.dct”) that explains how the data is to be read. The syntax of the dictionary file is as follows:
dictionary using [filename] {
    * comments
    _firstlineoffile(#)
    _lines(#)
    _line(#)
    _column(#)
    [type] [name] [format] “description”
}
The elements of the dictionary file are:
* comments
    Comments can be added to the dictionary file on lines that start with *.
_firstlineoffile(#)
    This specifies the line of [filename] on which the data starts. The default is the first line.
_lines(#)
    This specifies how many lines contain one observation of the variables. The default is one.
_line(#)
    If more than one line is used for one observation, this specifies which line of the observation the variables that follow are to be read from. This is not needed if _lines(#) was not specified or was set equal to one.
_column(#)
    If the data is in fixed-position format, this specifies that the next variable is to be read starting at the given column number. This is not required if the data is tab, comma or space separated.
[type] [name]
    This specifies the variables to be read.
The [filename] specified in the dictionary file will refer to the file where the data is actually contained. The standard extension is still “.raw”. To specify variable names and types in the dictionary file we use the following syntax:
[type] [name] [format] “description”
The variable type is one of int, long, float, double or str. If a variable is of “string” type we must specify its maximum length. For example, if we require a variable that will contain at most 8 characters we declare its type as str8. The maximum length of a string is 80 characters. It is usually not required to specify the format of the variable except when we are reading from a fixed-width file. The most frequently used formats have the following syntax:
%#.#g
    General numeric format. The first number is the total width of the number and the second is the number of decimals. It is not required to specify the number of decimals.
%#.#e
    Scientific format. The numbers specified in # are as before.
%#s
    String format. The number specifies the (maximum) length of the string.
Finally, once we have a dictionary file completed, we can read the data in the following way (the command referred to above as infile2 is typed as infile):
infile using [filename] <options>
where [filename] is the dictionary file we have created. The options include the clear option (explained above) and the if [expression] and in [range] options, which are explained in section 7.
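To make the pieces concrete, here is a sketch of a small dictionary file for a fixed-width data file. Everything in it is an assumption made for illustration: the file names people.dct and people.raw, the three variables, and the column positions. The numeric formats use %#f, STATA's fixed numeric format, which is written the same way as the %#.#g format described above.

```stata
dictionary using people.raw {
    * Hypothetical layout: id in columns 1-4, name in 5-12, wage in 13-18
    _column(1)   int    id     %4f   "Person identifier"
    _column(5)   str8   name   %8s   "First name"
    _column(13)  float  wage   %6f   "Hourly wage"
}
```

Assuming this is saved as people.dct, the data can then be read from within STATA. To the best of our knowledge the command is typed as infile, whatever name the documentation files it under:

```stata
* Read people.raw as described by the dictionary, replacing data in memory
infile using people.dct, clear

* Check the result
describe
list in 1/5
```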
4) Loading and Saving files in STATA’s proprietary format.
If you plan to use STATA extensively it is very useful to transform your original data into STATA’s own format. Saving and loading in STATA format is accomplished with the following commands:
save [filename]
use [filename]
The standard extension of STATA data files is “.dta”.
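A short sketch of a save and reload cycle, assuming some data is already in memory and using the hypothetical file name wages. The replace option is not described above; it is added here because save refuses to overwrite an existing file without it.

```stata
* Write the data in memory to wages.dta
save wages

* Remove the data from memory, then load it back
clear
use wages

* After making changes, overwrite the existing file
save wages, replace
```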
5) Some additional notes on saving and loading data.
Aside from the commands included here, STATA provides a number of other commands designed for loading and saving data. In particular, the command infix is specifically designed for fixed-width format files, while the command infile1 is a variation of the command discussed here. If it is required to save files in non-STATA format, the commands outsheet and outfile are available, both of which have syntax similar to their loading counterparts.
6) Manipulating data in STATA.
Once your data is loaded into STATA you can manipulate it in a number of different ways. Some of the more useful commands are summarized here:
drop [varnames]
    This will eliminate the listed variables from the dataset.
sort [varnames]
    This will sort the data according to the variables listed.
gen [newvar]=[formula]
    This will generate a new variable according to the specified formula.
edit
    (PC version) This will bring up the graphical data editor.
merge [varname] using [filename]
    This merges the dataset currently in memory with the data contained in the STATA-format file specified by [filename]. The merging is done through the variable in [varname].
clear
    This will clear all data from memory.
It is important to note that the observations in STATA are numbered in the order in which they were read. The command list will list the observations in that order. The sort command will naturally change that order.
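The commands of this section can be combined as in the sketch below. The dataset wages, the second file extra.dta and the variables id, wage and name are hypothetical. The sketch also assumes that extra.dta was saved sorted by id, since merge with a match variable expects both datasets to be sorted by that variable.

```stata
use wages, clear

* New variable: natural logarithm of the hourly wage
gen lnwage = ln(wage)

* Remove a variable that is no longer needed
drop name

* Sort by the identifier; this changes the order in which list shows the data
sort id
list in 1/5

* Match observations with those in extra.dta through the variable id
merge id using extra

* merge creates the variable _merge, which records where each observation came from
tabulate _merge
```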
7) Some common options in STATA commands.
A large number of STATA commands will accept the following options:
by [varnames]: command
    This will execute the command several times, one time for each distinct combination of values of the variables specified in [varnames].
command if [expression]
    This will execute the command only on the observations that match the specified expression. Note that equality in [expression] must be specified with the “==” symbol.
command in [range]
    This will execute the command on the observations contained in the specified range. [range] is usually specified as [startobs.]/[endobs.].
These options are extremely useful since they let us restrict almost all commands in STATA to the variables or observations that are of most interest to us.
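The sketch below applies each of the three options to a hypothetical dataset with variables id, wage and educ. Note that by expects the data to be sorted by the grouping variable, so a sort comes first.

```stata
* by: one set of summary statistics for each value of educ
sort educ
by educ: summarize wage

* if: only observations with exactly 12 years of education (note the ==)
summarize wage if educ == 12

* if with a compound condition
summarize wage if wage > 10 & educ >= 12

* in: only the first ten observations in the current order
list id wage in 1/10
```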
8) Some common summarization commands.
STATA provides a number of useful summarization commands. Here we list some of the most common ones:
su [varnames] <,detail>
    This will display summary statistics for the variables listed. The “detail” option results in an expanded set of summary statistics.
tabulate [varnames] <,generate([varname])>
    This will create a frequency table for the variables listed. The “generate” option (for a single variable) will create a set of dummy variables, one for each value of the tabulated variable.
correlate [varlist]
    This will display the correlation matrix of the variables listed.
Note that these commands are most useful when combined with the if, by or in options. As plots tend to be the easiest way of summarizing the structure of the data we list some simple “diagnostic” graphics commands:
kdensity [varname] <options>
    This will generate a kernel density plot. Several options allow choice of kernel and bandwidth.
qnorm [varname]
    This will generate a qq-plot against a normal distribution.
qqplot [var1] [var2]
    This will generate a qq-plot of the quantiles of the first variable against those of the second.
plot [var1] [var2]
    This will generate a scatterplot of the two variables listed.
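As an illustration, the summarization and graphics commands of this section might be used as follows on a hypothetical dataset with variables wage, educ and exper. The stub educd passed to generate is also a made-up name.

```stata
* Expanded summary statistics (percentiles, skewness, kurtosis)
summarize wage, detail

* Frequency table of educ, plus dummy variables educd1, educd2, ...
tabulate educ, generate(educd)

* Correlation matrix of the three variables
correlate wage educ exper

* Summary statistics for a subset of the data
summarize wage if educ >= 12

* Diagnostic plots
kdensity wage
qnorm wage
qqplot wage exper
plot wage educ
```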
9) Batch files.
If you have a large dataset and want to execute a series of repetitive commands on several variables it is often useful to create a batch file with the necessary commands. The commands can then be executed with the command do [filename], where the standard extension is “.do” (the extension “.ado” is reserved for files that define new commands). Note that in some versions of STATA you will need to put two blank lines at the end of the batch file for it to execute correctly.
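A batch file is simply a text file containing the commands in the order in which they should run. The sketch below shows what such a file might contain. The name analysis.do, the dataset wages and the variables are hypothetical, and the file is assumed to be saved with the “.do” extension, which to our knowledge is the one do looks for when no extension is given.

```stata
* Contents of the hypothetical batch file analysis.do
log using analysis.log
use wages, clear
summarize wage educ
sort educ
by educ: summarize wage
log close
```

From within STATA the file would then be run by typing do analysis.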