Analyze the data values for at least one variable, such as the annual salaries of employees at a company. Organize the data values into a specific kind of structure from which analysis proceeds.
Data Table: Organize data values into a rectangular data table with the data values for each variable in a column, and the name of the variable at the top of the column.
Store the structured data values within a computer file, such as an Excel formatted file or text file. This file can be stored on your computer, an accessible local network, or the world wide web. The data table in the following figure, formatted as an Excel file, contains four variables: Years, Gender, Dept, and Salary plus an ID field called Name for a total of five columns.
Structure of a data table.
Describe the data table by its columns, rows, and cell entries.
Data value: The contents of a single cell of a data table, a specific measurement.
For example, according to the data values for employee Darnell Ritchie, he has worked at the company for 7 years, identifies as a male, and works in administration with an annual salary of $53,788.26.
Missing data: A cell for which there is no recorded data value.
Two data values in this section of the data table are missing. The number of years James Wu has worked at the company is not recorded, nor is the department in which Alissa Jones works.
Variable name: A short, concise word or abbreviation that identifies a column of data values in a data table.
Case: Each row of the data table, the data for a specific instance of a single person, organization, place, event, or whatever is being studied.
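The vocabulary above can be sketched with a small data table built directly in base R. This is only an illustration: the case names and every data value below are made up, not taken from the employee data file.

```r
# A small illustrative data table; every name and value below is made up.
d <- data.frame(
  Name   = c("Case A", "Case B", "Case C"),
  Years  = c(7L, NA, 15L),              # NA marks a missing data value
  Dept   = c("ADMN", "SALE", NA),
  Salary = c(50000.50, 61000.25, 72000.00),
  stringsAsFactors = FALSE
)

names(d)            # variable names: one per column
nrow(d)             # number of cases: one per row
d[1, ]              # a single case (the first row)
d[1, "Salary"]      # a single data value (one cell)
colSums(is.na(d))   # count of missing data values for each variable
```

Each column holds one variable, each row holds one case, and a cell with NA has no recorded data value.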
Encode the data table in one of a variety of computer file formats. Common formats include Excel files, indicated by a file type of .xlsx, and text files in the form of comma-separated value files (csv). Identify a text file with one of several potential file types, such as .txt, or more informatively, .csv.
Analysis of data can only proceed with the data table identified and the relevant variables identified by their name.
All R functions analyze the data values for one or more specified variables, identified by their names, such as Salary.
Analysis requires the correct spelling of each variable name, including the same pattern of capitalization.
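As a minimal base R sketch of why capitalization matters, consider a data frame with a single variable named Salary. The data values are made up for illustration.

```r
# Illustrative values only; the variable name is spelled with a capital S.
d <- data.frame(Salary = c(50000.50, 61000.25))

"Salary" %in% names(d)   # TRUE: exact spelling and capitalization
"salary" %in% names(d)   # FALSE: lower-case s does not match

mean(d$Salary)           # analysis proceeds with the correct name
is.null(d$salary)        # TRUE: no variable exists under the misspelled name
```

R treats Salary and salary as two different names, so a reference with the wrong capitalization does not find the variable.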
Your data organized as a data table exists somewhere as a data file stored on a computer system. To analyze data in a data table stored in a computer file, first read the data table from the computer file into a corresponding data table within a running R session. R refers to data tables within R with its own name.
Data frame: A data table stored within an R session, referenced by its name.
Each variable in a data table has a name, and so does the data table itself. Reference the data table stored on your computer system by its file name and location. When read into R, name the data table, the R data frame, with a name of your choice. Regardless of the file name on your computer system, typically name the data table within the active R session, the data frame, as simply d for data. Not only is d easy to type, but it is also the lessR default data frame name for the data processed by its various analysis functions.
When analyzing data read into R, the same data exists in two different locations: as a computer file on your computer system, and as an R data frame within a running R app. Different locations, different names: same data. On your computer system, identify the data table by its file name and location. Within a running R app, identify the same data by the name of the data frame, such as d, within which the data from the computer file was read.
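The idea of the same data in two locations under two names can be sketched in base R. The sketch uses read.csv() as a stand-in for Read(), and the file name salaries_demo.csv and the data values are hypothetical.

```r
# Made-up data, written to a computer file in a temporary folder.
d_original <- data.frame(Years = c(7L, 15L), Salary = c(50000.5, 61000.25))
data_file <- file.path(tempdir(), "salaries_demo.csv")  # hypothetical file name
write.csv(d_original, data_file, row.names = FALSE)

# Read the file into a data frame named d within the running R session.
d <- read.csv(data_file)

basename(data_file)        # the name of the data on the computer system
all.equal(d, d_original)   # the data frame d holds the same data
```

The file is identified by its path name, while the copy inside R is identified only by the data frame name d.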
To read the data from a file into a data frame of a running R application, as with every other task in R, accomplish the task with a function. The R ecosystem, base R and its many packages, presents many such functions. We use the lessR function Read() for its simplicity and for its useful output that helps you understand the data that was read.
The lessR function Read() can read data files in many file formats, including MS Excel. The most generic format is the csv format, for comma separated values. Read() also reads SPSS and SAS data files, as well as data files in R’s own native format, of type .rda.
To read the data, direct R to the location of the data file. R cannot read the data file until it knows where the data is stored. One option has you locate the data file on your computer system by browsing for it, navigating your file system until you locate the file.
To locate your data file by browsing through your file system, call the Read() function with an empty file reference, (""), literally nothing between the quotes.
The following Read() statement reads the data stored as a rectangular data table from an external file stored on your computer system into an R data frame called d. The empty quotes direct R to open your file browser for you to locate the data file that already exists somewhere on your computer system.
d <- Read("")
As with all R (and Excel) functions, the call to invoke the function includes a matching set of parentheses. Any information within the parentheses specifies the information provided to the function for analysis.
The <- indicates to assign what is on the right of the expression, here the data read from an external file, to the object on the left, here the R data frame stored within R, named d in this example. You can also use an ordinary equals sign, =, to indicate the assignment, but the <- is more descriptive, and more widely used by R practitioners.
You can also explicitly specify the location of the data file to be read within the quotes and parentheses. Specify either the full path name of a file on your computer system, or specify a web address that locates the data table on the web. Again, read the data into the d data frame.
d <- Read("path name" or "web address")
With Excel, R, or any other computer app that processes data, enclose values that are character strings, such as a file name, in quotes. For example, to read the data stored on the web in the data file called employee.xlsx into the data frame d, invoke the following Read() function call.
d <- Read("http://web.pdx.edu/~gerbing/data/employee.xlsx")
This example reads a data file on the web. To specify a location on your computer, provide the full path name of your data file, its name and location. To obtain this path name, first browse for the file with Read(""). The resulting output displays the path name of the identified file. Copy this path name and insert it between the quotes of Read(""). Save this and other R function calls in a text file, and then run the saved code to read the data file directly in future analyses without needing to browse for its location.
In summary, with the Read() function, either put nothing between the quotes to browse for a data file, or specify the location of the data file on your computer system or the web. Direct the data read from a file into an R data frame, usually named d, though you can choose any valid name.
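The following sketch shows one way to read from an explicit full path name without browsing. It first writes a tiny csv file so there is something to read. The file name employee_demo.csv, the temporary folder, and the data values are all hypothetical, and the fall-back to base R read.csv() is an assumption added only so the code runs when lessR is not installed.

```r
# Write a tiny illustrative csv file; the name and values are made up.
path <- file.path(tempdir(), "employee_demo.csv")
write.csv(data.frame(Years = c(7L, 15L), Salary = c(50000.5, 61000.25)),
          path, row.names = FALSE)

# normalizePath() returns the full path name of an existing file.
full_path <- normalizePath(path)

if (requireNamespace("lessR", quietly = TRUE)) {
  library(lessR)
  # d <- Read("")           # interactive alternative: browse for the file
  d <- Read(full_path)      # explicit full path name, no browsing needed
} else {
  d <- read.csv(full_path)  # base R stand-in when lessR is not installed
}

names(d)                    # check the exact variable names that were read
```

Saving a call such as Read(full_path) in a script is what allows the same data file to be read again later without browsing.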
The Read() function displays useful output. Because R organizes analyses by variable name, it is crucial to know the exact variable names, including the pattern of capitalization.
In addition to the variable names, Read() also displays the type of each variable as stored in the computer, as numbers with or without decimal digits, or as character strings. Also listed are the number of complete and missing values for each variable, the number of unique values for each variable, and sample data values.
The following lists the output from reading from a data file downloaded with lessR. All that is needed to read these data files is the name. A file on the web cannot be specified here because you may not have web access when this file is generated.
d <- Read("Employee")
## 
## >>> Suggestions
## Details about your data, Enter:  details()  for d, or  details(name)
## 
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## integer: Numeric data values, integers only
## double: Numeric data values with decimal digits
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1     Years   integer     36       1      16   7  NA  15 ... 1  2  10
##  2    Gender character     37       0       2   M  M  M ... F  F  M
##  3      Dept character     36       1       5   ADMN  SALE  SALE ... MKTG  SALE  FINC
##  4    Salary    double     37       0      37   53788.26  94494.58 ... 56508.32  57562.36
##  5    JobSat character     35       2       3   med  low  low ... high  low  high
##  6      Plan   integer     37       0       3   1  1  3 ... 2  2  1
##  7       Pre   integer     37       0      27   82  62  96 ... 83  59  80
##  8      Post   integer     37       0      22   92  74  97 ... 90  71  87
## ------------------------------------------------------------------------------------------
To allow for many variables, Read() lists the information for each variable in a row. Note that the data file organizes the variables by column.
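The kind of per-variable summary shown above can be approximated in base R. This is a sketch of the idea only, not the code that Read() uses: the data values are made up, and the choice to exclude missing values from the count of unique values is an assumption.

```r
# Made-up data with one integer, one character, and one double variable.
d <- data.frame(
  Years  = c(7L, NA, 15L),
  Gender = c("M", "M", "F"),
  Salary = c(50000.50, 61000.25, 72000.00),
  stringsAsFactors = FALSE
)

# One row of summary information for each variable (column) of d.
var_summary <- data.frame(
  Variable = names(d),
  Type     = vapply(d, typeof, character(1)),
  Values   = vapply(d, function(x) sum(!is.na(x)), integer(1)),
  Missing  = vapply(d, function(x) sum(is.na(x)), integer(1)),
  Unique   = vapply(d, function(x) length(unique(x[!is.na(x)])), integer(1)),
  row.names = NULL
)
var_summary
```

As in the Read() output, each variable occupies a row of the summary even though it occupies a column of the data table.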
Always distinguish continuous variables from categorical variables. This distinction between these two types of variables is fundamental in data analysis.
Continuous (quantitative) variable: A numerical variable with many possible values.
Categorical (qualitative) variable: A variable with relatively few unique labels as data values.
Examples of continuous variables are Salary or Time. Examples of categorical variables are Gender or State of Residence, each with relatively few possible values compared to a numerical variable. This distinction between continuous and categorical variables is common to all data analytics.
Sometimes that distinction gets a little confusing because variables with integer values, which are numeric, could be either quantitative or qualitative. For example, sometimes Male, Female, and Other are encoded as 0, 1, and 2, respectively, for three levels of the categorical variable Gender. However, these integer values are just labels for different non-numeric categories. It is best to avoid this confusion. Instead, encode categorical variables with non-numeric values, such as M, F, and O for Other in the case of Gender.
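If a categorical variable arrives with integer codes, one possible remedy in base R is to convert the codes to a factor with non-numeric labels. The sketch below assumes the 0, 1, 2 coding described above. The data values are made up.

```r
# Made-up integer codes: 0 = Male, 1 = Female, 2 = Other.
gender_code <- c(0, 1, 2, 1, 0)

# Replace the numeric codes with non-numeric category labels.
Gender <- factor(gender_code, levels = c(0, 1, 2), labels = c("M", "F", "O"))

as.character(Gender)   # the labels for each case
table(Gender)          # a count for each category

mean(gender_code)      # computes, but is meaningless for category labels
```

With the labels in place, R no longer treats the variable as a number, so a meaningless calculation such as a mean of the codes is not computed by accident.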
A variable label is a longer description of the corresponding variable than the variable name. The variable label displays in conjunction with the variable name on the text and visualization output to make the output easier to interpret.
The variable label file has two columns. The first column lists the variable names and the second column lists the corresponding labels. The file can be of type .csv or .xlsx, or contained within lessR, as with the following example.
The variable labels must be read into the data frame l. Specify the var_labels parameter as TRUE to instruct the Read() function to read variable labels instead of data.
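The layout of a variable label file can be sketched in base R as follows. The file name labels_demo.csv is hypothetical, the pairing of the two labels with Years and Post is an assumption based on the order of the variables, and read.csv() here is only a stand-in that mimics the idea of Read() with var_labels=TRUE, not its actual implementation.

```r
# Two columns, no column titles: the variable name, then the variable label.
label_file <- file.path(tempdir(), "labels_demo.csv")   # hypothetical name
writeLines(c("Years,Time of Company Employment",
             "Post,Test score on legal issues after instruction"),
           label_file)

# Read the labels into l, using the variable names as the row names.
l <- read.csv(label_file, header = FALSE,
              col.names = c("Variable", "label"), row.names = 1,
              stringsAsFactors = FALSE)

l["Years", "label"]   # look up the label of a variable by its name
```

The result is a data frame with a single column of labels, indexed by variable name.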
l <- Read("Employee_lbl", var_labels=TRUE)
## 
## >>> Suggestions
## Details about your data, Enter:  details()  for d, or  details(name)
## 
## Data Types
## ------------------------------------------------------------
## character: Non-numeric data values
## ------------------------------------------------------------
## 
##     Variable                  Missing  Unique 
##         Name     Type  Values  Values  Values   First and last values
## ------------------------------------------------------------------------------------------
##  1     label character      8       0       8   Time of Company Employment ... Test score on legal issues after instruction
## ------------------------------------------------------------------------------------------
lessR can write data with the function Write() in three formats: csv, Excel, and native R format. The R format preserves a binary copy of the data frame as it is stored within R.
A recommended procedure is to begin the analysis with data in .csv or .xlsx format. Then proceed with data cleaning and preparation, including needed transformations and re-coding. When the data is ready for analysis, save the cleaned, prepared data as a native R file of format .rda. This format is the most efficient for size and for speed of reading back into R, with all data preparations already completed. Sometimes it is also useful to write the data in a format such as Excel so that team members who do not use R can access it.
The input data frame name defaults to d, consistent with lessR analysis functions. Otherwise, use the data parameter to specify the input data frame. With the Write() function, specify the name of the file for the output, as well as the type of file with the format parameter, with the default of .csv.
The following R statements that call Write() are not run here as the intent is not to create additional files.
Write the current contents of default data frame d to GoodData.csv.
Write("GoodData")
Write the data as an Excel data table in an Excel file.
Write("GoodData", format="Excel")
You can also use the abbreviation for an Excel file, wrt_x().
wrt_x("GoodData")
Write the data as an R data table.
Write("GoodData", format="R")
You can also use the abbreviation for an R file, wrt_r().
wrt_r("GoodData")
The output of Write() indicates the full path name of the written file.
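For comparison, the csv and native R formats can be sketched with base R functions. This is a rough analogue under stated assumptions, not a description of how Write() works internally: the data values are made up, and the files go to a temporary folder so that nothing is added to your working directory.

```r
# Made-up data standing in for a cleaned, prepared data frame.
d <- data.frame(Years = c(7L, 15L), Salary = c(50000.5, 61000.25))

csv_file <- file.path(tempdir(), "GoodData.csv")
rda_file <- file.path(tempdir(), "GoodData.rda")

write.csv(d, csv_file, row.names = FALSE)   # text file, comma-separated values
save(d, file = rda_file)                    # binary copy in native R format

# Remove d from the session, then restore it from the .rda file.
d_saved <- d
rm(d)
load(rda_file)                              # re-creates the object named d
identical(d, d_saved)
```

The .rda file restores the data frame exactly as it was stored within R, including its name, which is why it suits data that has already been prepared.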
The haven package has an excellent function for reading SPSS files, read_spss(). In particular, the SPSS variable labels and value labels are preserved. Read() invokes this function to read SPSS files, with the default filetype of .sav.
Within SPSS, a (usually) integer-scored variable can have value labels. An example is Likert scaled data with 1 representing Strongly Disagree and 5 representing Strongly Agree. The corresponding SPSS variable has integer values but displays the more informative value labels, such as in bar charts. This type of categorical variable corresponds to a factor in the R system.
The read_spss() function preserves the value labels with a special variable type called haven_labelled. These variables can be converted to an R factor for processing in the R system with the haven function as_factor(). Read() performs this conversion automatically for each relevant variable.
One problem is that the factor conversion preserves the value labels listed in the correct order, but loses the original integer scoring information. To preserve both the labels as a standard R factor and the original scoring, Read() converts the original read variable with the labels (type haven_labelled) into two variables. The first variable, an integer variable with the original integer scoring, has the name of the read variable. The corresponding factor variable has the same name as the read variable with the suffix _f.
For example, the use of read_spss() results in the following, here showing just the first four lines of data. The variable region contains both the integer scoring and the value label that are part of the SPSS data file that was read.
# A tibble: 4 x 4
  city                         region growth income
  <chr>                     <dbl+lbl>  <dbl>  <dbl>
1 ALBANY-SCHNTADY-TROY,N.Y.    1 [NE]    -71   3313
2 ATLANTA,GA.                  2 [SE]    264   3153
3 BALTIMORE,MD.                1 [NE]     38   3540
4 BIRMINGHAM,ALA.              2 [SE]   -178   2528
With Read(), obtain the following standard R data frame. The variable region is now a standard R integer variable, and region_f is the corresponding factor.
                       city region region_f growth income
1 ALBANY-SCHNTADY-TROY,N.Y.      1       NE    -71   3313
2               ATLANTA,GA.      2       SE    264   3153
3             BALTIMORE,MD.      1       NE     38   3540
4           BIRMINGHAM,ALA.      2       SE   -178   2528
With these data, the analyst can run a numerical analysis with the integer variable and, for analyses such as a bar chart, display the corresponding value labels instead.
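The two-variable result can be emulated in base R from the four rows shown above. This sketch does not call read_spss() or Read(). It only illustrates the idea of keeping the integer scoring alongside a factor built from the value labels, and the name value_labels is introduced here for illustration.

```r
# Integer scoring and value labels as shown in the four rows above.
region <- c(1L, 2L, 1L, 2L)
value_labels <- c(NE = 1L, SE = 2L)   # hypothetical helper: label = code

# Factor version of region, with the value labels as the factor levels.
region_f <- factor(region, levels = value_labels, labels = names(value_labels))

d <- data.frame(
  city = c("ALBANY-SCHNTADY-TROY,N.Y.", "ATLANTA,GA.",
           "BALTIMORE,MD.", "BIRMINGHAM,ALA."),
  region, region_f,
  growth = c(-71, 264, 38, -178),
  income = c(3313, 3153, 3540, 2528),
  stringsAsFactors = FALSE
)

mean(d$region)      # numerical analysis uses the integer scoring
table(d$region_f)   # counts by category use the value labels
```

Keeping both variables means neither the numeric scoring nor the labels has to be reconstructed later.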
lessR automatically accesses variable labels stored in a data frame named l, and then displays each label with its variable name in text and visualization output. Usually the variable labels are stored in an Excel or .csv file with two columns, the variable name and the variable label. There are no column titles, just the names and labels. Read the file into the l data frame with the var_labels parameter set to TRUE.
l <- Read(file_reference, var_labels=TRUE)
Read() also processes the variable labels of these (usually) integer-scored variables with value labels in the SPSS data file. The overseers of the R system do not permit package authors to create stored data structures from internal R code. Only the user can create these structures. As such, Read() lists each variable name, a comma, and then the corresponding label. This example has only one such relevant variable, region, and its factor equivalent.
Variable and Variable Label  --> See vignette("Read"), SPSS section
---------------------------
region,  region of US 
region_f,  region of US
To access these labels, copy the names and labels, paste into a text file, and then save as a file. Then read the file of names/labels into R with the preceding Read() statement.
Use the base R help() function to view the full manual for Read() or Write(). Simply enter a question mark followed by the name of the function.
?Read
?Write