A step-by-step guide to formatting an external dataset so that it can be imported into R.
Here, we focus on a data frame.
2-D table: To import into R, most data you will use needs to be arranged in a 2-dimensional table (exceptions include GIS spatial data and other specialized data formats).
Extra lines and/or columns: Remove or comment out any headers, footers, or unneeded text.
Special characters: Remove or replace text characters that give R trouble. The most common ones are hashes (#) and apostrophes (').
Rename some or all columns: Choose sensible names. By default, R's base import functions replace white space in column names with periods (.).
Columns: Make sure that the data are in columns, separated by your delimiter of choice (usually tab or comma).
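The special-character and column-name rules above can also be applied with a script instead of by hand. The following is a minimal Python sketch, not part of the original workflow: the function names `clean_name` and `clean_cell` and the example headers are made up for illustration.

```python
import re

# Characters that tend to cause trouble on import: hashes and apostrophes
# (straight and curly).
SPECIAL = r"[#'‘’]"

def clean_name(name: str) -> str:
    """Turn a raw column header into a simple name without spaces."""
    name = re.sub(SPECIAL, "", name)   # drop hashes and apostrophes
    name = name.strip()                # trim leading/trailing white space
    return re.sub(r"\s+", "_", name)   # remaining white space -> underscore

def clean_cell(value: str) -> str:
    """Remove hashes and apostrophes from a single data value."""
    return re.sub(SPECIAL, "", value)

# Hypothetical headers, invented for this example.
header = ["Bib #", "Runner's name", "Net time"]
print([clean_name(h) for h in header])
```

Using underscores rather than spaces in the names means R has no white space to replace with periods.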
Here is data from the 2015 New Haven Road Race, openly available online.
It all gets copied into one cell.
How can we fix this?
Paste into Notepad or equivalent.
These top lines are not needed.
Nor is this table header rule.
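If the same cleanup has to be repeated, dropping the top lines and the header rule can be scripted. This is a rough Python sketch under assumptions: the function name `drop_extra_lines` is hypothetical, the rule is assumed to consist only of dashes or equals signs, and the sample rows are invented placeholders, not the real race results.

```python
def drop_extra_lines(lines, n_preamble):
    """Skip the first n_preamble lines, then drop blank lines and rule lines."""
    kept = []
    for line in lines[n_preamble:]:
        stripped = line.strip()
        if not stripped:
            continue  # blank line
        if set(stripped) <= set("-= "):
            continue  # rule line made only of dashes, equals signs, spaces
        kept.append(line)
    return kept

# Invented sample text with one title line, a blank line, and a header rule.
raw = [
    "Example race results",
    "",
    "Place Name          Time",
    "===== ============= =====",
    "    1 Runner One    00:00",
    "    2 Runner Two    00:00",
]
cleaned = drop_extra_lines(raw, 1)
print("\n".join(cleaned))
```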
Select all, copy and paste…
You can split on delimiter characters (tab, space, etc.) or use a fixed width.
In our case, fixed width is better.
Delete the extra column breaks and place ours at the beginning of each column.
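The fixed-width split that the spreadsheet dialog performs can be sketched in a few lines of Python. This is an illustration only: `split_fixed_width` is a hypothetical helper, and the line and column start positions are invented to match each other, not taken from the race data.

```python
def split_fixed_width(line, starts):
    """Cut a line at the given column start positions (0-based)."""
    bounds = list(starts) + [None]  # None means "to the end of the line"
    return [line[a:b] for a, b in zip(bounds[:-1], bounds[1:])]

# Invented example line: place, name, and time in fixed-width columns.
line = "    1 Runner One    00:00"
starts = [0, 6, 20]  # a break at the beginning of each column

fields = split_fixed_width(line, starts)
print(fields)
```

Because the breaks are positional, the space inside the name stays within a single field, which is why fixed width suits data like this better than splitting on spaces.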
Press ‘OK’, and we should have a nice-looking spreadsheet!
White space within columns is easier to remove within the spreadsheet (otherwise, the process would also remove the white space between the first name and surname in the ‘name’ column).
So, find and replace within selected columns only.
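A scripted alternative to find and replace, shown here as a sketch rather than the method used above, is to strip only the leading and trailing white space from each field. That leaves the space inside a name untouched. The helper name `trim_fields` and the sample values are hypothetical.

```python
def trim_fields(fields):
    """Strip white space around each field, keeping spaces inside it."""
    return [field.strip() for field in fields]

# Invented fields as they might look after a fixed-width split.
fields = ["    1 ", "Runner One    ", "00:00"]
row = "\t".join(trim_fields(fields))  # tab-delimited row
print(repr(row))
```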
That’s better.
Revise column names in two columns
Select all
Paste into Notepad (or equivalent text editor)
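For completeness, here is one way the cleaned rows could be written out as tab-delimited text by a script instead of pasting. It is a sketch with invented rows, and it writes to an in-memory buffer; passing a file opened with `open(path, "w", newline="")` instead is an assumption about how you would adapt it.

```python
import csv
import io

# Invented rows: a header line followed by one cleaned record.
rows = [
    ["Place", "Name", "Time"],
    ["1", "Runner One", "00:00"],
]

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(rows)  # one tab-delimited line per row

text = buffer.getvalue()
print(text)
```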
Woo hoo!