Chapter 4 Datasets

Below are two real, open-access datasets that we will use for the analyses. For each dataset, you will find a description, minor data wrangling (i.e., data manipulation), and descriptive statistics in both numeric and visual form.

4.1 Salaries Dataset

One of the datasets that we’ll use is the Salaries dataset within the carData package. The dataset consists of nine-month salaries collected from 397 college professors in the U.S. during 2008 to 2009. In addition to salary, each professor’s rank, sex, discipline, years since Ph.D., and years of service were also collected. Thus, there are 6 variables in total, which are described below.

Variable | Variable Type | Description
rank | Categorical | Professor’s rank: assistant professor, associate professor, or professor
discipline | Categorical | Type of department the professor works in: applied or theoretical
yrs.since.phd | Continuous | Number of years since the professor obtained their Ph.D.
yrs.service | Continuous | Number of years the professor has served the department and/or university
sex | Categorical | Professor’s sex: male or female
salary | Continuous | Professor’s nine-month salary (USD)
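As a minimal sketch, the dataset described above can be loaded and inspected as follows. This assumes the carData package is already installed. Note that the raw factor labels are abbreviated (for example, `AsstProf` for rank and `A`/`B` for discipline); the fuller labels that appear in this chapter's output presumably come from a data wrangling step that is not reproduced here.

```r
# Assumes carData is installed: install.packages("carData")
library(carData)

data("Salaries")   # load the data frame into the workspace

dim(Salaries)      # number of rows (professors) and columns (variables)
str(Salaries)      # variable names, types, and factor levels
head(Salaries)     # first few rows
```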

4.1.2 Descriptive Statistics

It’s always a good idea to examine the data numerically and visually. Let’s first look at the categorical variables and then the continuous variables.

Categorical Variables

## Assistant Professor Associate Professor           Professor 
##                  67                  64                 266

To visualize our data, we will be using the function ggplot(). We won’t go into detail about plotting since data visualization is outside the book’s scope. However, if you are interested, check out Grolemund and Wickham’s data visualization chapter to learn more about ggplot().

In this dataset, full professors (266) outnumber assistant and associate professors combined (67 + 64 = 131) by roughly two to one.
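The rank counts and a bar chart of them could be produced along the following lines. This is only a sketch: it assumes the carData and ggplot2 packages are installed, it uses the raw rank labels (`AsstProf`, `AssocProf`, `Prof`) rather than the relabelled ones shown in the output, and the axis labels are illustrative choices rather than the book's.

```r
library(carData)
library(ggplot2)

data("Salaries")

# Frequency table of professor rank
rank_counts <- table(Salaries$rank)
rank_counts

# Bar chart of the same counts; geom_bar() counts rows per category
rank_plot <- ggplot(Salaries, aes(x = rank)) +
  geom_bar() +
  labs(x = "Rank", y = "Number of professors")
rank_plot
```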

##     Applied Theoretical 
##         216         181

There are slightly more professors in the applied discipline than in the theoretical discipline (35 more).

## Female   Male 
##     39    358
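As a rough sketch, the discipline and sex counts shown above can be reproduced and compared directly. It assumes carData is installed and works with the raw labels, where the package documentation codes discipline `A` as theoretical and `B` as applied.

```r
library(carData)
data("Salaries")

# Frequency tables for the two remaining categorical variables
discipline_counts <- table(Salaries$discipline)
sex_counts        <- table(Salaries$sex)

# How many more applied (B) than theoretical (A) professors?
discipline_counts[["B"]] - discipline_counts[["A"]]

# Ratio of male to female professors
sex_counts[["Male"]] / sex_counts[["Female"]]

# The same information expressed as proportions
prop.table(sex_counts)
```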

There are a little over 9 times as many male professors as female professors.

Continuous Variables

##                    mean       sd median   min    max skew kurtosis      se
## yrs.since.phd     22.31    12.89     21     1     56 0.30    -0.81    0.65
## yrs.service       17.61    13.01     16     0     60 0.65    -0.34    0.65
## salary        113706.46 30289.04 107300 57800 231545 0.71     0.18 1520.16

On average, professors have had their Ph.D. for about 22 years.

On average, professors have served their department or university for about 17 years and 7 months.

On average, a professor’s nine-month salary is $113,706.46.
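Most of the statistics discussed above can be computed with base R, as in the sketch below. The source does not show which function produced the output (its layout resembles that of describe() in the psych package), so the helper `summarise_var()` is a hypothetical stand-in. It omits skew and kurtosis and assumes carData is installed.

```r
library(carData)
data("Salaries")

# Continuous variables in the Salaries data
num_vars <- Salaries[, c("yrs.since.phd", "yrs.service", "salary")]

# Hypothetical helper: a few summary statistics for one numeric vector
summarise_var <- function(x) {
  c(mean   = mean(x),
    sd     = sd(x),
    median = median(x),
    min    = min(x),
    max    = max(x),
    se     = sd(x) / sqrt(length(x)))  # standard error of the mean
}

# One row per variable, one column per statistic
desc <- t(sapply(num_vars, summarise_var))
round(desc, 2)

# Fractional part of mean years of service, converted to months
(desc["yrs.service", "mean"] %% 1) * 12
```

The standard error here is the standard deviation divided by the square root of the sample size, which is how the se column relates to the sd column in the output (for example, 30289.04 / sqrt(397) is about 1520.16).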

4.2 Anorexia Dataset

Another dataset that we’ll use is the anorexia dataset within the MASS package. The dataset consists of the weight (in lbs.) of 72 female patients with anorexia before and after either cognitive behavioral therapy, family therapy, or no therapy (control condition).

Variable | Variable Type | Description
Treatment | Categorical | Treatment received by the patient: cognitive behavioral therapy (CBT), family therapy (FT), or no therapy (CONT)
PreWeight | Continuous | Patient’s weight before treatment (lbs.)
PostWeight | Continuous | Patient’s weight after treatment (lbs.)
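A minimal sketch of loading this dataset is shown below. In the MASS package the raw columns are named `Treat`, `Prewt`, and `Postwt`, so the renaming step is an assumption about how the variable names in the table above were obtained, not the book's confirmed wrangling code.

```r
library(MASS)   # MASS ships with standard R installations

data("anorexia")
str(anorexia)   # raw columns: Treat, Prewt, Postwt

# Assumed renaming so the columns match the names used in this chapter
names(anorexia) <- c("Treatment", "PreWeight", "PostWeight")

head(anorexia)
```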

4.2.2 Descriptive Statistics

Categorical Variables

##     CBT      FT Control 
##      29      17      26

There are more participants in the CBT condition (29) than in either the FT condition (17) or the control group (26).

Continuous Variables

##             n  mean   sd median  min   max  skew kurtosis   se
## PreWeight  72 82.41 5.18  82.30 70.0  94.9 -0.05    -0.16 0.61
## PostWeight 72 85.17 8.04  84.05 71.3 103.6  0.36    -0.81 0.95

On average, the patients weighed 82.41 lbs. before treatment.

On average, the patients weighed 85.17 lbs. after treatment, a gain of about 2.76 lbs. compared to before treatment.
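The average weight change can also be computed per patient, which gives the same overall figure and makes it easy to break the change down by treatment group. The sketch below uses the raw MASS column names (`Treat`, `Prewt`, `Postwt`); the variable name `change` is introduced here purely for illustration.

```r
library(MASS)
data("anorexia")

# Change in weight (lbs.) for each patient: after minus before
change <- anorexia$Postwt - anorexia$Prewt

# Overall average change; equals mean(Postwt) - mean(Prewt)
mean(change)

# Average change within each treatment group
tapply(change, anorexia$Treat, mean)
```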

4.3 Use Your Own Dataset

If you have an interesting dataset of your own, we encourage you to use it alongside ours.