Here are the definitions of some common and some uncommon data terms:
- Alphanumeric: A variable that can have letters and/or numbers as its values. See also: numeric; string; character.
- ASCII: stands for American Standard Code for Information Interchange. It represents each character in a file as a numeric code. Although not strictly correct, "ASCII" is often used interchangeably with "text" or "plain text". It simply means that the information in the file is not in any system or proprietary format. See also: binary; text; raw.
- Binary: the way a computer actually stores the information in a file; at its most basic level, a series of ones and zeroes. Some older data files are stored in special binary formats such as "column binary" or "zoned decimal" to save storage space. SAS and SPSS are capable of reading some binary files; Stata is not. See also: ASCII; text.
- Byte: a byte is a group of eight bits, each of which is a zero or a one, in a computer file. A kilobyte (KB) is 1,000 bytes; a megabyte is 1,000,000 bytes; a gigabyte is 1,000,000,000 bytes; a terabyte is 1,000,000,000,000 bytes.
- Card: originally referred to a punch-card, an old means of entering data into a computer. Now refers to a single line of data in a file. "Card-image" data files have two or more lines of data for each observation. See also: record; observation.
- Codebook: A document that describes in detail the data with which you are working. Although there is no standard format for a codebook, a good one has the full wording of the questions and answers, a list of all the codes or values used to enter the data, and either the begin and end columns or the begin column and the length of each variable. See also: dictionary; data definition statements.
- Column: a column refers to a single character position, including spaces, on a line of a raw data file. This is not the same as a column in a spreadsheet. See also: field; record length.
- Comma-separated values: also called a "csv" file, this is a raw data file in which the variables are separated by commas. CSV files are often used to move data from one software package, such as Excel, to another, such as Stata. See also: delimiter; tab-delimited file; fixed format (file); free format (file).
- Data definition statements: program code that tells the computer in which columns and lines each variable can be found. In SAS, this is the "input" statement; in SPSS, it is the "data list" statement; and in Stata, it is an "infix" or "infile" dictionary. See also: codebook; dictionary.
- Delimiter: a character or characters used to separate variables in a raw data file. The most common delimiters are commas, tabs and blank spaces. The choice of delimiter depends on whether commas and/or blank spaces appear in the values of any of the variables. If they do, one must use some other character as the delimiter, because the computer will not be able to distinguish between those used as delimiters and those that are part of actual values. See also: comma-separated values; tab-delimited; fixed format (file); free format (file).
- Dictionary: There are two types of dictionaries: a data dictionary and a Stata dictionary. A data dictionary is a document that lists all the variables and their locations (columns) in the data file, and, sometimes, the values for those variables. Data dictionaries are appropriate for any statistical package. A Stata dictionary is a program or file that Stata uses to read a raw data file; the information in this program can be obtained from either a data dictionary or a codebook. See also: codebook; data definition statements.
- Extension (filename extension): a filename extension, or simply extension, is the second part of a filename, the part that comes after the ".". Some common extensions are ".sav" for SPSS files, ".csv" for a comma-separated values file, and ".doc" for Microsoft Word files.
- Field: a term typically used in database management, a field is the same as a variable.
- Fixed format (file): a raw data file in which each variable occupies the same column or columns on each line for each observation. Fixed format files typically do not have any delimiters between variables. See also: free format (file); comma-separated values; tab-delimited (file).
- Free format (file): a raw data file in which each variable may occupy different columns on each line for each observation. Free format files must have some type of delimiter, usually spaces, between variables. See also: fixed format (file); comma-separated values; tab-delimited (file).
- Flat (file): a flat file is a data file that has one record for each observation. These are sometimes called "rectangular" files. See also: rectangular; multiple records.
- Hierarchical (file): Sometimes a data file will have different types or levels of information on different records. An example is when a survey collects information about a household, each family within that household, and each person within each family. The different types of variables will be on different records within the file. See also: rectangular; flat.
- Length: The number of columns (or characters) a variable in a raw data file will occupy. Sometimes this also means the number of bytes the same variable will take in a system or binary file. See also: record length.
- Multiple Records: Sometimes there is more than one record or line of data for each observation in a raw data file. Each record has a different set of variables on it, so each record must be read differently. See also: hierarchical; flat.
- Numeric (variable): These are numbers, plain and simple. Decimal points and minus signs are the only non-digit characters allowed. See also: string (variable).
- Observation: An observation is a unit of analysis, such as a respondent in a survey. See also: record.
- Osiris: data storage format used by the ancient Egyptians. No, just kidding; it isn't quite that old. Osiris is a storage format where the data and the dictionary are in binary format. Many older datasets available from ICPSR are in Osiris format; SAS and SPSS are capable of reading Osiris files.
- Raw data: Raw data are data that have not been read into a software package like SAS or Stata. If you were to open a raw data file you would see numbers and, perhaps, letters. You need a codebook or data dictionary to be able to read the data. These are sometimes called "text" or "ASCII" files. See also: text; ASCII; system (file).
- Record: A record is one line in a data file. Sometimes there is more than one record per observation or there are different types of records in a single file. Records are sometimes referred to as "cards." See also: observation; card.
- Record length: literally, the length of a record in a data file. This is measured in columns for raw data files and in bytes for system or binary files. This information is necessary for files that have more than 256 columns on one record, as the total length must then be specified in the data definition statement. Often abbreviated as "lrecl". Also referred to as "logical record length." See also: length.
- Rectangular: a data file in which there is one line or record of data for each observation. To "rectangularize" a file means to convert a multiple-record file or hierarchical file to this format. See also: hierarchical; flat.
- String (variable): a string variable is one that has letters and/or numbers as opposed to just numbers. An example would be a person's name. Numbers can be treated as strings, but strings cannot be treated as numbers. Strings are also referred to as "character" or "alphanumeric" variables. See also: numeric.
- System File: a binary file created by a software package such as SPSS or Excel. These typically have default extensions in the file name, e.g., ".sav" for SPSS and ".xls" for Excel. See also: raw; text; ASCII.
- Tab-delimited: a tab-delimited file is a raw data file that has tab characters to separate the variables. Tab-delimited files are often used to move data from one software package, such as Excel, to another, such as Stata. Tab-delimited files are useful when variables can have commas or spaces as part of their values (e.g., a person's name). See also: delimiter; comma-separated values; fixed format (file); free format (file).
- Text (file): a text file is the same as a raw or ASCII file. It consists entirely of letters, numbers and other plain characters and is not in any system or binary format. See also: ASCII; raw; system (file).
- Value label: Value labels assign words to numeric values in a data file. Labels are used simply to make the output easier to read. Instead of printing "1", "2", "3", the computer will print "Yes", "No", "Maybe". See also: variable label.
- Variable label: A variable label is a short description of a variable. Variable labels are not required, but they make the output easier to understand. See also: value label.
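
Several of these terms (data dictionary, fixed format, delimiter, value label) are easier to see in a small example. The Python sketch below is only an illustration: the column layout, variable names, labels, and sample records are all made up for this example and do not come from any real codebook or dataset.

```python
import csv
import io

# Hypothetical data dictionary: variable name -> (begin column, length).
# Columns are 1-based, the way a codebook usually lists them.
DICTIONARY = {
    "id": (1, 3),
    "age": (4, 2),
    "answer": (6, 1),
}

# Hypothetical value labels for the "answer" variable.
ANSWER_LABELS = {"1": "Yes", "2": "No", "3": "Maybe"}


def read_fixed(line, dictionary):
    """Slice one fixed-format record into variables by column position."""
    record = {}
    for name, (begin, length) in dictionary.items():
        start = begin - 1  # convert the 1-based column to a 0-based index
        record[name] = line[start:start + length].strip()
    return record


def read_delimited(text, delimiter=","):
    """Read delimited raw data whose first line holds the variable names."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


# Fixed format: no delimiters, so the dictionary says where each variable is.
fixed_record = read_fixed("001342", DICTIONARY)

# Comma-separated values: a value that contains a comma must be quoted.
csv_records = read_delimited('id,name,answer\n001,"Smith, Jane",2\n')

# Tab-delimited: commas inside values need no special handling.
tab_records = read_delimited("id\tname\tanswer\n001\tSmith, Jane\t2\n",
                             delimiter="\t")

# Every value arrives as a string; numeric variables are converted explicitly,
# and value labels turn codes into readable words.
age = int(fixed_record["age"])
answer_label = ANSWER_LABELS[fixed_record["answer"]]
```

In the fixed-format record, columns 1 to 3 hold the id, columns 4 and 5 the age, and column 6 the answer code, so the dictionary plays the role that an "input", "data list", or "infix" statement plays in a statistical package.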