Getting Help, Importing Data and Some More Vocabulary
Getting help
An important part of any R package is its documentation. The documentation of any function can be obtained by using ? followed by the name of the function or by using help("function_name"). After the execution of a help command a new page should appear in the lower right window in RStudio. If not, click on the tab Help in the lower right window. The help page contains the title of the function followed by a short description. For now the most important parts of the documentation are the following:
- Usage describes the arguments of the function, which can be specified by the user. The values to the right of the equals signs are the default values, which are used if the user does not specify those arguments manually.
- Arguments provides further information on the possible specifications of the arguments.
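The default values listed under Usage can also be inspected from the console. The following is a minimal sketch that uses the base R function paste purely as an illustration (it is not part of the original example); formals() returns a function's arguments together with their defaults.

```r
# Inspect the arguments and defaults of a function without opening its help page.
# paste() is used only for illustration; any function name works here.
paste_args <- formals(paste)

names(paste_args)   # argument names, as listed under "Usage"
paste_args$sep      # default separator, used when sep is not specified

# Relying on the default versus overriding it:
default_sep <- paste("Penn", "World", "Table")
custom_sep  <- paste("Penn", "World", "Table", sep = "-")
default_sep
custom_sep
```

The first call falls back to the default separator, while the second call overrides it by naming the argument.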
For example, assume that you want to read some data into R. We will use the Penn World Table data (version 9.0) for this. Go to this link and download the Excel version of the data set into your working directory. I saved it under pwt90.xlsx. [1]
Since the data come in an xlsx file, we need a function that can read it. Such a function is not included in the standard installation of R, but the readxl package contains one: read_xlsx. To understand what kind of input this function needs in order to work, you can read its documentation by using the help function:
library(readxl) # install.packages("readxl")
?read_xlsx
which can also be written as help("read_xlsx"). Alternatively, you can click on the tab Help in the lower right window and enter read_xlsx into the search field. As soon as you start typing, RStudio will suggest functions that fit your input and for which documentation is available.
For our purpose we only need to specify the first two arguments:
- path contains the name of the xlsx file you want to read.
- sheet is the number or name of the sheet you want to read.
It might also be worthwhile to take a minute to go through the other arguments that could be specified.
Importing data
Since the name of our data file is pwt90.xlsx, the first argument of the read_xlsx function is path = "pwt90.xlsx". Furthermore, after opening the xlsx file in a spreadsheet program we know that the file consists of three sheets. We only want to read the content of the third sheet named Data. Thus, we can either use the argument sheet = 3 or sheet = "Data". I recommend the latter approach, because it is more reliable in case the order of the Excel sheets is changed in the future. So, the following lines of code should be just right for our purpose:
# Load the readxl package
library(readxl) # install.packages("readxl")
# Read the data
read_xlsx(path = "pwt90.xlsx", sheet = "Data")
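The sheet names can also be checked from within R instead of opening the file in a spreadsheet program. The sketch below uses excel_sheets from the readxl package. It assumes that pwt90.xlsx sits in the working directory and that readxl is installed; if either assumption fails, it only prints a message.

```r
# Sketch: list the sheet names of the workbook from within R.
# Assumes pwt90.xlsx is in the working directory and readxl is installed.
data_file <- "pwt90.xlsx"

if (requireNamespace("readxl", quietly = TRUE) && file.exists(data_file)) {
  sheet_names <- readxl::excel_sheets(data_file)
  print(sheet_names)
  # Confirm that the sheet we want to read is present
  print("Data" %in% sheet_names)
} else {
  message("File or package not found; check getwd() and install.packages(\"readxl\").")
}
```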
Note that the order of a function's arguments is not important as long as you provide the names of the arguments followed by the equals sign. The spaces between the entries are also not important, but it is good practice to use them, since they make the code more readable. Furthermore, make sure that you write the names of objects and arguments correctly. This applies especially to lower and upper case: R is case-sensitive, so Data and data are different names.
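Both points can be tried out quickly in the console. The following sketch uses the base R function seq and the built-in constant pi, which are chosen here only for illustration.

```r
# Named arguments can be given in any order.
a <- seq(from = 1, to = 10, by = 3)
b <- seq(by = 3, to = 10, from = 1)
identical(a, b)   # TRUE

# R distinguishes lower and upper case: pi is built in,
# whereas Pi is a different name that is not defined by default.
exists("pi")
exists("Pi")
```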
If the command worked, the console will be filled with a lot of numbers and signs and it will indicate that some rows were omitted. This is because the function read_xlsx only reads the xlsx file and R does not know what else to do with the output than to display its content in the console. In order to change this behaviour, we use the operator <-. It assigns a value to a named object. In this example, it takes the output of the read_xlsx function as input and assigns it to a new object. The following line of code assigns the content of the sheet Data of the pwt90.xlsx file to the object pwt. Of course, you could also use another name for the object, e.g., this_is_data_from_pwt90, but pwt seems more convenient.
pwt <- read_xlsx(path = "pwt90.xlsx", sheet = "Data")
Note that this line also represents the typical form of code in R: A function with some arguments is executed and its output is assigned to an object via the <- operator. [2]
Also note that object names may never contain spaces, so pwt data <- read_xlsx(path = "pwt90.xlsx", sheet = "Data") would result in an error.
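The naming rule can be demonstrated without the data file. In the sketch below, parse() is used only to show the failure without stopping the script, and make.names() is a base R helper that turns arbitrary text into a syntactically valid name; the object pwt_data is a placeholder.

```r
# A name with a space is a syntax error. parse() only checks the syntax,
# and tryCatch() turns the resulting error into a plain message.
outcome <- tryCatch(
  {
    parse(text = "pwt data <- 1")
    "valid"
  },
  error = function(e) "syntax error"
)
outcome

# Underscores or dots are common substitutes for spaces.
pwt_data <- 1
suggested <- make.names("pwt data")   # builds a valid name from the text
suggested
```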
If your data file has a different format, you have to find a package that can read it. This might seem a bit laborious, but fortunately the structure of those functions is very similar, so this example should be representative of most other file formats. If not, open a search engine, enter, for example, r read xyz data, and search the results for answers on the website stackoverflow.com. It is a forum where people post questions on coding issues. Usually, the respondents give the right advice. The quality of the answers to a question can be assessed by the approval rating of the community and, even better, by a check mark, which indicates that the answer was accepted as the solution to the respective problem.
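As an illustration of how similar these functions are, here is the same read-and-assign pattern for a CSV file, using read.csv from the standard installation of R. The sketch creates a small temporary file so that it runs on its own; the column names and values are made up.

```r
# Sketch: the same "read and assign" pattern with a CSV file and base R.
# A small temporary file is created so the example is self-contained;
# its content is made up for illustration.
csv_file <- tempfile(fileext = ".csv")
writeLines(c("country,year,value", "A,2000,1.5", "B,2001,2.5"), csv_file)

example_data <- read.csv(file = csv_file)   # the file argument plays the role of path
example_data

unlink(csv_file)   # delete the temporary file again
```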
Once the file is successfully loaded into R, an object will appear in the upper right window. This window contains all the objects that are loaded in R and that can be used for further processing. At the moment this is only one object, pwt. But with every use of the <- operator you can create a new object or overwrite an existing one with something else. For example, you could just copy the object pwt and save the copy as data_copy:
data_copy <- pwt
Now there are two objects in the upper right window, one of which is obviously redundant. In order to get rid of it, you can use the command rm(data_copy), which will remove the object from memory. This can be very helpful when working with large samples that have millions of observations.
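The copy-and-remove cycle can also be followed in the console with ls(), which lists the objects currently in memory. The objects in this sketch are small placeholders standing in for pwt and data_copy.

```r
# Create an object, copy it, and remove the copy again.
demo_values <- c(1, 2, 3)   # hypothetical object standing in for pwt
demo_copy   <- demo_values

ls()                        # lists the objects currently in memory
rm(demo_copy)               # remove the redundant copy

# Check which of the two objects still exists
copy_exists     <- exists("demo_copy", inherits = FALSE)
original_exists <- exists("demo_values", inherits = FALSE)
copy_exists
original_exists
```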
A very useful application of the rm function is rm(list = ls()), which removes all the objects in the upper right window and, thus, from your computer's memory. I usually put this command at the top of a script to make sure that no remaining objects from a different project can cause problems in the new project. Feel free to try the command out and load the data into R again by executing
# Remove objects in the memory
rm(list = ls())
pwt <- read_xlsx(path = "pwt90.xlsx", sheet = "Data") # Re-load PWT data
Note that the lines above contain comments, which are indicated by the hash sign #. Comments are an essential part of any code, because they allow other people to understand your code more easily. It is good practice to use comments to explain why you added a certain line of code. You should definitely make a habit of using them.
R can handle a broad variety of data formats such as boolean [3] or numeric values, dates or text. These different kinds of data can also be structured differently. These categories of data formats and structures are called classes. One of the most popular classes in R is the so-called data frame. Data frames serve as the data input for most of the basic R functions and, hence, we will focus on them in this introduction.
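The class of any value can be checked with the class function. The sketch below applies it to one made-up value per data format mentioned above. Note that R calls boolean values logical.

```r
# Check the class of some example values (all values are made up).
class_boolean <- class(TRUE)                       # "logical"
class_number  <- class(3.14)                       # "numeric"
class_text    <- class("some text")                # "character"
class_date    <- class(as.Date("2017-01-01"))      # "Date"
class_frame   <- class(data.frame(x = c(1, 2, 3))) # "data.frame"

c(class_boolean, class_number, class_text, class_date, class_frame)
```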
Data frames are quite similar to standard spreadsheets, which might become a bit clearer when you either execute View(pwt) or, equivalently, click on pwt in the upper right window. This will open an additional tab in the editor window, where you can explore the data just like in a spreadsheet program.
Following a different approach you can execute the lines
names_pwt <- names(pwt) # Save column names as a distinct object
names_pwt # Show object
to extract the column names of the data frame pwt and save them as a new object with the name names_pwt (first line) and to display the names in the console (second line). Note that R will print the content of an object in the console if the object name is entered and executed, just like before when we used the read_xlsx function without assigning its output to an object.
The output of the names function is a new object named names_pwt. It is not a data frame, but a so-called character vector, which is another way to structure data in R. The quotation marks around its elements indicate that the content of the vector is text. Note that this is also indicated by the abbreviation chr to the right of the object name in the upper right window. Moreover, we get the same information when using the command class(names_pwt), which gives "character" as output.
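The same steps can be reproduced without the PWT file. The sketch below builds a small data frame by hand; the object toy and its columns are hypothetical and serve only to illustrate the behaviour of names and class.

```r
# A small hand-made data frame standing in for pwt (values are made up).
toy <- data.frame(country = c("A", "B"), gdp = c(1.5, 2.5))

names_toy <- names(toy)   # save the column names as a distinct object
names_toy                 # entering the name prints the object

class(toy)                # the data frame itself
class(names_toy)          # the extracted names are a character vector
length(names_toy)         # one element per column
```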
If you expand the data frame object by clicking on the white button to the left of the object name, you will also notice the abbreviation chr in some lines, whereas num (numeric) will appear in others. This indicates that every column of a data frame is a vector of a certain format. Note that although a data frame can consist of vectors with different data formats, a single vector can only contain one kind of data.
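The console offers the same overview via str, and sapply can collect the class of every column. The sketch below again uses a small made-up data frame and also shows what happens when different kinds of data are combined in one vector.

```r
# Made-up data frame with one text column and one numeric column.
toy <- data.frame(country = c("A", "B"), gdp = c(1.5, 2.5),
                  stringsAsFactors = FALSE)

str(toy)                              # compact overview with chr / num labels
column_classes <- sapply(toy, class)  # class of each column
column_classes

# A single vector holds only one kind of data:
# mixing a number and text converts everything to text.
mixed <- c(1, "a")
mixed
class(mixed)
```

Setting stringsAsFactors = FALSE is a precaution for R versions before 4.0, in which text columns were converted to factors by default.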
[1] Note that a newer version of the PWT is available. Go to the PWT website to get the most recent version.
[2] Sometimes you will encounter code which uses the equals sign = instead of <-. This is perfectly fine, since for simple assignments like the ones shown here the two can be considered equivalent. However, <- seems more widespread.
[3] This means entries that can only take two values, true or false.