skimr
The motivation for this project was to create a frictionless way to quickly view summary statistics as part of a pipeline. There are many existing summary functions, but we found each of them lacking in one way or another: they can be too generic, they don’t always return data structures that are easy to operate on, and they are not pipeable.
So at rOpenSci #unconf17, we set out to build a new package that would let you quickly skim useful, tidy summary statistics directly from a pipe. And so we created skimr.
In a nutshell, skimr creates a skimr object that can be operated on further or printed as a human-readable summary in the console. It presents reasonable default summary statistics for numerics, factors, etc., and lists counts as well as missing and unique values.
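As a minimal sketch of that workflow, assuming the skimr and magrittr packages are installed and that `skim()` is the entry point, the call below summarises `mtcars` from a pipe and keeps the result for further use. The exact columns and printout depend on the installed skimr version, so none are shown here.

```r
# Assumption: install.packages(c("skimr", "magrittr")) has been run.
library(magrittr)  # provides the %>% pipe
library(skimr)

# Skim the data directly from a pipe and keep the returned object
skimmed <- mtcars %>% skim()

# Printing gives the human-readable summary in the console
print(skimmed)

# The same object is also a data frame, so it can be operated on further
is.data.frame(skimmed)
```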
Amelia McNamara
Job Title: Visiting Assistant Professor of Statistical & Data Sciences at Smith College
Project Contributions: Coder
Eduardo Arino de la Rubia
Job Title: Chief Data Scientist at Domino Data Lab
Project Contributions: Coder
Hao Zhu
Job Title: Programmer Analyst at the Institute for Aging Research
Project Contributions: Coder
Julia Lowndes
Job Title: Marine Data Scientist at the National Center for Ecological Analysis and Synthesis
Project Contributions: Documentation and test scripts
Shannon Ellis
Job Title: Postdoctoral fellow in the Biostatistics Department at the Johns Hopkins Bloomberg School of Public Health
Project Contributions: Test scripts
Elin Waring
Job Title: Professor at Lehman College Sociology Department, City University of New York
Project Contributions: Coder
Michael Quinn
Job Title: Quantitative Analyst at Google
Project Contributions: Coder
Hope McLeod
Job Title: Data Engineer at Kobalt Music
Project Contributions: Documentation
We started off by brainstorming what we liked about existing summary packages and what other features we wanted. We began by looking at an example dataset, mtcars.
str(mtcars)
summary(mtcars)
# "I like what we get here because mpg is numeric so these stats make sense:"
summary(mtcars$mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   10.40   15.42   19.20   20.09   22.80   33.90
# "But I don’t like this because cyl should really be a factor and shouldn't have these stats:"
summary(mtcars$cyl)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   4.000   4.000   6.000   6.188   8.000   8.000
# "This is OK, but not descriptive enough. It could be clearer what I'm looking at."
mosaic::tally(~cyl, data=mtcars) # install.packages('mosaic')
## cyl
##  4  6  8
## 11  7 14
# "But this output isn't labeled, not ideal."
table(mtcars$cyl, mtcars$vs)
##
##       0  1
##   4   1 10
##   6   3  4
##   8  14  0
# "I like this because it returns 'sd', 'n' and 'missing'":
mosaic::favstats(~mpg, data=mtcars)
##   min     Q1 median   Q3  max     mean       sd  n missing
##  10.4 15.425   19.2 22.8 33.9 20.09062 6.026948 32       0
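The wishes in the comments above can be roughly approximated in base R. The sketch below is only an illustration using the built-in `mtcars` data, not something from the brainstorming session itself, and the variable names (`cyl_counts`, `cyl_by_vs`, `mpg_extra`) are made up for this example.

```r
# Treating cyl as a factor makes summary() report counts per level
# instead of quantiles that are not meaningful for a categorical variable
cyl_counts <- summary(factor(mtcars$cyl))
cyl_counts

# Naming the arguments to table() labels both dimensions of the output
cyl_by_vs <- table(cyl = mtcars$cyl, vs = mtcars$vs)
cyl_by_vs

# The extra statistics that favstats() adds, computed by hand for mpg
mpg_extra <- c(
  sd      = sd(mtcars$mpg),
  n       = sum(!is.na(mtcars$mpg)),
  missing = sum(is.na(mtcars$mpg))
)
mpg_extra
```

Each of these takes a separate call with its own output shape, which is the kind of friction a single pipeable summary function is meant to remove.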
skimr