Learning R in Practice

Data IO

Data input

Data output

Data preview

summary

head/tail

table

  • xtabs: cross tabulation. Suppose the data contain three variables X, Y and Z. To sum Z within each combination of the categories of X and Y:
xtabs(Z ~ X + Y, data = data)
  • ftable: flat table; lays out a multi-way contingency table in a flat, two-dimensional form
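
As a small illustration of the two functions above, the following sketch uses a made-up data frame `toy` (hypothetical, not part of these notes) together with the `mtcars` dataset that ships with R. It is meant only to show the shape of the calls.

```r
# Hypothetical toy data: X and Y are categories, Z is a value to be summed.
toy <- data.frame(
  X = c("a", "a", "b", "b", "a"),
  Y = c("u", "v", "u", "v", "u"),
  Z = c(1, 2, 3, 4, 5)
)

# Sum Z within each combination of X and Y.
cross <- xtabs(Z ~ X + Y, data = toy)
print(cross)

# With an empty left-hand side, xtabs counts rows instead of summing.
counts <- xtabs(~ X + Y, data = toy)
print(counts)

# ftable flattens a three-way table into a two-dimensional layout.
flat <- ftable(xtabs(~ cyl + gear + am, data = mtcars))
print(flat)
```

The `data =` argument keeps the formula free of `$` prefixes; without it, xtabs looks for X, Y and Z in the calling environment.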

Data Manipulation

cut2

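cut2, from the Hmisc library, cuts a numeric vector into intervals. With the g argument it creates the requested number of quantile groups of roughly equal size and returns them as a factor.

library(Hmisc)
groups <- cut2(data$var, g = number_of_groups)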
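
A minimal sketch of the call, assuming Hmisc is installed; the vector `x` is a made-up example and not part of these notes.

```r
library(Hmisc)  # assumed to be installed

# Hypothetical numeric vector to be binned.
x <- 1:100

# g = 4 requests four quantile groups of roughly equal size.
groups <- cut2(x, g = 4)

# The result is a factor whose levels are interval labels.
print(table(groups))
```

Because the result is a factor, it can be passed directly to table or used as a grouping variable.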

dplyr

Basic

The dplyr library provides a series of functions that operate on a data frame much like queries on a database table.

library(dplyr)
df <- as_tibble(origin_df)                  # tbl_df() is deprecated in current dplyr
select(df, var2:var4)                       # keep var2 through var4
select(df, -(var2:var4))                    # drop var2 through var4
filter(df, var1 == 1)
filter(df, a <= 3 | b == "IN")              # or
filter(df, !is.na(var1))                    # is not missing
arrange(df, var1)                           # ascending
arrange(df, desc(var1))                     # descending
mutate(df, new_var1 = var1 + var2, new_var2 = new_var1^2)
summarize(df, avg_var = mean(var))
summarize(df, sd_var = sd(var))
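  • select: create a new data frame containing only the specified variables
  • filter: create a new data frame containing only the rows that satisfy a condition
  • arrange: sort the rows by the given variables
  • mutate: create new variables in the data frame
  • summarize: collapse the data frame into summary statistics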
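
The verbs above can be tried on real data. The sketch below assumes dplyr is installed and uses the built-in `mtcars` dataset as a stand-in for `origin_df`; the column choices are illustrative only.

```r
library(dplyr)  # assumed to be installed

df <- as_tibble(mtcars)  # mtcars ships with R

# Keep three columns, then only the rows with 4 cylinders.
small <- select(df, mpg, cyl, wt)
four_cyl <- filter(small, cyl == 4)

# Sort by descending miles per gallon.
sorted <- arrange(four_cyl, desc(mpg))

# Add derived columns; later expressions may use earlier ones.
with_ratio <- mutate(sorted, mpg_per_wt = mpg / wt, ratio_sq = mpg_per_wt^2)

# Collapse to a single row of summary statistics.
stats <- summarize(with_ratio, avg_mpg = mean(mpg), sd_mpg = sd(mpg))
print(stats)
```

Each verb returns a new data frame and leaves its input unchanged, so intermediate results can be inspected step by step.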

Advanced

group_by defines groups within the data frame; later operations such as summarize are then applied to each group separately.

by_var <- group_by(data, var)

summarize(by_var, avg_var2= mean(var2))
arrange(data, desc(var))
  • desc sorts a variable in descending order
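
A short sketch of grouped summaries, again assuming dplyr is installed and using the built-in `mtcars` dataset in place of `data`; the column names are illustrative.

```r
library(dplyr)  # assumed to be installed

# Group the cars by number of cylinders.
by_cyl <- group_by(mtcars, cyl)

# One output row per group: mean mpg and group size.
avg <- summarize(by_cyl, avg_mpg = mean(mpg), n = n())

# Order the groups from highest to lowest average mpg.
ranked <- arrange(avg, desc(avg_mpg))
print(ranked)
```

Without the group_by step, the same summarize call would return a single row for the whole data frame.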

We can also use the pipe operator %>% to chain these operations into a single data flow.
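
As a sketch of what such a pipeline might look like (assuming dplyr is installed, with the built-in `mtcars` dataset as example data), each step passes its result as the first argument of the next:

```r
library(dplyr)  # assumed to be installed; it re-exports %>%

result <- mtcars %>%
  filter(!is.na(mpg)) %>%            # drop rows with missing mpg
  group_by(gear) %>%                 # one group per gear count
  summarize(avg_mpg = mean(mpg)) %>% # mean mpg per group
  arrange(desc(avg_mpg))             # highest average first

print(result)
```

The piped form reads top to bottom in the order the steps run, which avoids nested calls and temporary variables.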

tidyr

tidyr is a library for tidying datasets.

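  • gather: collapse several columns into key-value pairs (wide to long)
  • separate: split one column into several columns
  • Note that column names are written directly in these functions, without the data$ prefix.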
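
A minimal sketch of the two functions, assuming tidyr is installed. The data frame `wide` and its column names are invented for illustration. gather is still available, although newer tidyr versions recommend pivot_longer for the same task.

```r
library(tidyr)  # assumed to be installed

# Hypothetical wide data: one column per subject.
wide <- data.frame(
  id = c(1, 2),
  score_math = c(90, 80),
  score_art = c(70, 85)
)

# gather: stack the two score columns into key-value pairs.
long <- gather(wide, key = "measure", value = "score", score_math, score_art)

# separate: split "score_math" into "type" and "subject" at the underscore.
tidy <- separate(long, measure, into = c("type", "subject"), sep = "_")
print(tidy)
```

Columns are named bare (score_math, measure) rather than as wide$score_math, which is the point made in the last bullet above.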