The most basic steps in getting and cleaning data look like this. I am using a data file of US housing data, and I want to analyze it the way a website such as Zillow might.
1. First, fetch the data into R. The code below reads a CSV file from your working directory and loads it into a data frame called housing.
housing <- read.csv("us-housing-data.csv", stringsAsFactors = FALSE)
The sample code above shows how to fetch data from a CSV file in a local directory. There are similar functions for fetching data from an XML file, an Excel file, a JSON file, or an HTML web page. Once you understand the fundamentals of fetching data, it is only a matter of knowing which function to use.
2. Second, you may want to remove rows where data is missing in a particular column. Here you create another data frame that keeps only the rows where column 37 has a value (is not NA).
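As a rough illustration of "knowing the function", here is a minimal sketch that reads the same made-up records from a CSV file and from a JSON file. The file contents are invented for the example, and the choice of the jsonlite package is an assumption, not something the post specifies. Other add-on packages often used for the remaining formats include readxl (read_excel) for Excel, xml2 (read_xml) for XML, and rvest (read_html) for web pages.

```r
# Write a tiny CSV to a temporary file so the example is self-contained.
# The columns and values are made up for illustration.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,VAL", "1,250000", "2,1200000"), csv_path)

# Same call as in the post, pointed at the temporary file.
from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)
str(from_csv)

# JSON needs an add-on package; jsonlite is one common choice (an assumption here).
# The guard skips this part if the package is not installed.
if (requireNamespace("jsonlite", quietly = TRUE)) {
  json_path <- tempfile(fileext = ".json")
  writeLines('[{"id": 1, "VAL": 250000}, {"id": 2, "VAL": 1200000}]', json_path)
  # fromJSON turns an array of records into a data frame.
  from_json <- jsonlite::fromJSON(json_path)
  str(from_json)
}
```

Either way you end up with a data frame, so the cleaning steps that follow do not depend on the source format.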
cleanhousingdata <- housing[complete.cases(housing[,37]),]
3. Third, you may want to filter on a condition in that column. In this case, I am looking for homes valued at more than 1 million USD. VAL is the name of the column.
costlyhouses <- subset(cleanhousingdata, VAL > 1000000)
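To try steps 2 and 3 without the real file, here is a minimal sketch on a small made-up data frame. The id column and all values are hypothetical; only the VAL column name comes from the post. It selects the column by name rather than by position, which is an alternative to the index 37 used above and keeps working if the column order changes.

```r
# Made-up data standing in for the housing file; VAL mirrors the post's column name.
housing <- data.frame(
  id  = 1:5,
  VAL = c(250000, NA, 1500000, 1000000, 2300000)
)

# Step 2: keep rows where VAL is not missing. Indexing by name avoids
# depending on the column's position (37 in the post).
cleanhousingdata <- housing[complete.cases(housing[, "VAL"]), ]

# Step 3: keep homes valued above 1 million USD (strictly greater than).
costlyhouses <- subset(cleanhousingdata, VAL > 1000000)

print(nrow(cleanhousingdata))  # 4 rows remain after dropping the NA
print(costlyhouses$id)         # ids 3 and 5; the home at exactly 1000000 is excluded
```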
Once you have done these basic steps, you can start looking for answers to your questions in the data. Coming up with questions is another interesting area.