Penguin Themed Exam 3 Review

To prepare for Exam 3, be sure to review homework problems, readings, and class notes. In addition, below are some practice problems we have highlighted in particular.

Previous/Book Problems

Additional Problems

Packages

library(tidyverse)
library(palmerpenguins)
library(broom)
library(infer)
library(kableExtra)

penguins <- na.omit(penguins)

Consider the penguins data from the palmerpenguins package in R. This data set includes measurements for penguins on islands in the Palmer Archipelago. For simplicity, we removed the rows with NA values.

Problem 1

The table below contains summary statistics for the body_mass_g variable for each species. Find a 95% confidence interval for the mean body mass of Gentoo penguins. Be sure to mention any relevant conditions. If you use R, only consider the functions: pt(), qt(), pnorm(), qnorm().

penguins |> 
  group_by(species) |>
  summarize(n = n(), 
            sd = sd(body_mass_g), 
            min = min(body_mass_g), 
            max = max(body_mass_g), 
            mean = mean(body_mass_g), 
            median = median(body_mass_g))
# A tibble: 3 × 7
  species       n    sd   min   max  mean median
  <fct>     <int> <dbl> <int> <int> <dbl>  <dbl>
1 Adelie      146  459.  2850  4775 3706.   3700
2 Chinstrap    68  384.  2700  4800 3733.   3700
3 Gentoo      119  501.  3950  6300 5092.   5050

We assume the Gentoo penguins were sampled independently. With n = 119 the sample is large, so the normality condition matters less; it also seems plausible that penguin body mass is roughly normally distributed.

# Lower bound
5092.437 + qt(0.025, df =118) * 501.4762/sqrt(119)
[1] 5001.403
# Upper Bound
5092.437 - qt(0.025, df =118) * 501.4762/sqrt(119)
[1] 5183.471
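The same interval can be sketched with named quantities, which makes the formula (mean plus or minus critical value times standard error) easier to see. The variable names (`x_bar`, `s`, `n`, `t_star`) are illustrative and not part of the original solution; the numbers are the Gentoo summary statistics used above.

```r
# Gentoo summary statistics from the table above
x_bar <- 5092.437   # sample mean body mass (g)
s     <- 501.4762   # sample standard deviation (g)
n     <- 119        # sample size

se     <- s / sqrt(n)              # standard error of the mean
t_star <- qt(0.975, df = n - 1)    # critical value for 95% confidence
ci     <- x_bar + c(-1, 1) * t_star * se
ci
```

Using qt(0.975, ...) gives a positive critical value, so the lower bound is the one with the minus sign.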

Problem 2

Describe and contrast each of the following. Be as specific as you can. You can use the penguins data set as an example if you would like.

  • A Data Distribution

  • A Randomization Distribution

  • A Sampling Distribution

A data distribution is the distribution of a variable within a single sample. For example, a histogram of bill_length_mm shows the bill lengths of the individual penguins.

A randomization distribution is the distribution of a sample statistic computed from data simulated under the assumption that the null hypothesis is true. For example, take the bill lengths of penguins from two species, shuffle (relabel) the species, and calculate the difference in means. Repeating this process 1000 times gives a range of values you would expect to observe if the species label did not matter. A histogram of these values is the randomization distribution.

A sampling distribution is the distribution of a sample statistic over repeated samples from the population. Theory lets us approximate it using only one sample. For example, to estimate the mean bill length for one of these species (Gentoo), we would check conditions and then use the Central Limit Theorem: the standardized sample mean follows a t-distribution with 118 degrees of freedom.
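One way to see the contrast between a data distribution and a sampling distribution is to compare their spreads. The sketch below is only an illustration, with made-up variable names (`gentoo_bills`, `sd_data`, `se_mean`): the standard deviation describes how much individual bill lengths vary, while the standard error describes how much the sample mean would vary from sample to sample.

```r
library(palmerpenguins)

penguins_complete <- na.omit(penguins)  # same cleaning step as above
gentoo_bills <- penguins_complete$bill_length_mm[penguins_complete$species == "Gentoo"]

n <- length(gentoo_bills)

# Data distribution: spread of individual bill lengths
sd_data <- sd(gentoo_bills)

# Sampling distribution of the mean: estimated spread of the sample mean
se_mean <- sd_data / sqrt(n)

c(n = n, sd_data = sd_data, se_mean = se_mean)
```

The standard error is smaller than the standard deviation by a factor of sqrt(n), which is why a histogram of sample means is much narrower than a histogram of the raw data.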

Problem 3

Suppose we wanted to test if mean body mass of the three types of penguins is the same.

  1. Write the appropriate null and alternative hypotheses for this test. For each, write the hypothesis both in words (as one complete sentence) and using a mathematical statement involving population parameters.

  2. What conditions should I consider?

  3. Which conditions can be relaxed and how?

  4. The R output for this test is provided below. Fill in the three ? below (without using R).

results <- aov(body_mass_g ~ species, data = penguins)
tidy(results)
ANOVA output
term       df  sumsq      meansq      statistic  p.value
species    ?   146864214  73432107.1  ?          2.892368e-82
Residuals  ?   72443483   213697.6    NA         NA

a. The solution below lists the math symbol solution followed by the English translation of those symbols.

\[H_0: \mu_1 = \mu_2 = \mu_3\]
\[H_A: \mu_1 \ne \mu_2 \text{ or } \mu_2 \ne \mu_3 \text{ or } \mu_3 \ne \mu_1\]
\[H_0: \text{The mean body mass is the same for all three species.}\]
\[H_A: \text{At least one species has a mean body mass different from the others.}\]

b. We should consider:

  • Independence between and within the groups.

  • Equal variances between the groups.

  • The data within each group are about normal.

c. The first condition (independence) is the most critical. We can relax the remaining conditions as the sample size increases.

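d. \(k\) is the number of groups and \(n\) is the number of observations used in the fit. The statistic is the ratio of the two mean squares.

term       df       sumsq      meansq      statistic                                        p.value
species    k-1=2    146864214  73432107.1  \(\frac{73432107.1}{213697.6} \approx 343.63\)   2.892368e-82
Residuals  n-k=339  72443483   213697.6    NA                                               NA

The residual df can be checked from the output itself, since meansq = sumsq/df: \(72443483 / 213697.6 \approx 339\), which corresponds to \(n = 342\). If the model were fit to only the 333 rows left after na.omit(), \(n-k\) would be 330 and the sums of squares would differ from those shown.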
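As a check on the arithmetic, the missing entries can be recovered from the numbers printed in the ANOVA output alone, because each mean square is its sum of squares divided by its degrees of freedom. This is a sketch with illustrative variable names; pf() is used only to confirm the printed p-value is tiny and is not one of the functions allowed on the exam.

```r
# Values copied from the ANOVA output in the problem statement
ss_species <- 146864214
ms_species <- 73432107.1
ss_resid   <- 72443483
ms_resid   <- 213697.6

df_species <- ss_species / ms_species   # meansq = sumsq / df, so df = sumsq / meansq
df_resid   <- ss_resid / ms_resid
f_stat     <- ms_species / ms_resid     # F statistic: ratio of the mean squares

round(c(df_species = df_species, df_resid = df_resid, f_stat = f_stat), 2)

# Upper-tail area of the F distribution at the observed statistic
pf(f_stat, df1 = round(df_species), df2 = round(df_resid), lower.tail = FALSE)
```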

Problem 4

Test the hypothesis that the proportion of female penguins on Biscoe is 50%. Check conditions. Do as much of this by hand/calculator as you can. If you use R, only consider the functions: pt(), qt(), pnorm(), qnorm().

filter(.data = penguins, island == "Biscoe") |>
count(island, sex)
# A tibble: 2 × 3
  island sex        n
  <fct>  <fct>  <int>
1 Biscoe female    80
2 Biscoe male      83

Conditions:

  • A success is a female penguin. The success-failure condition is met because the expected counts under \(H_0\), \(163 \times 0.5 = 81.5\) females and 81.5 males, are both at least 10.

  • We must assume independence in this data.

Hypothesis:

\[H_0 : p = 0.5\] \[H_a : p \ne 0.5\] \[\alpha = 0.05\]

By the Central Limit Theorem, under \(H_0\), \(\hat{p} \sim N(0.5, SE)\)

\(SE = \sqrt{\frac{0.5*0.5}{163}} = 0.03916302\)

\(\hat{p} = 80/163 = 0.4907975\)

To find the p-value:

Method 1 (no test statistic):

  • pnorm(q = 0.4907975, mean = 0.5, sd = 0.03916302) = 0.4071 is the lower-tail area, so the two-sided p-value is 2 * 0.4071 = 0.8142

Method 2 (test statistic):

  • The test statistic is \(Z=\frac{0.4908-0.50}{SE} = -0.2349793\)

  • Our p-value is 2*pnorm(-0.2349793) = 2 * 0.4071 = 0.8142

The p-value is larger than \(\alpha\), so we fail to reject \(H_0\). We do not have evidence that the proportion of female penguins on Biscoe differs from 50%.

Note that we may have made a Type 2 error: we did not reject the null hypothesis, so if the null hypothesis is actually false, our conclusion is an error.
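The whole calculation can be collected into a few lines using only pnorm(). This is a sketch with illustrative variable names; the counts are the ones from the Biscoe table above.

```r
x  <- 80     # female penguins on Biscoe
n  <- 163    # all penguins on Biscoe (80 + 83)
p0 <- 0.5    # null value of the proportion

p_hat   <- x / n
se      <- sqrt(p0 * (1 - p0) / n)   # standard error computed under H0
z       <- (p_hat - p0) / se         # test statistic
p_value <- 2 * pnorm(-abs(z))        # two-sided p-value

c(p_hat = p_hat, se = se, z = z, p_value = p_value)
```

Taking -abs(z) before calling pnorm() gives the correct two-sided p-value whether the sample proportion falls below or above the null value.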

Problem 5

Consider the following R output.

term           estimate   std.error  statistic  p.value
(Intercept)    17.229501  3.2818483  5.249938   6.595397e-07
bill_depth_mm  2.020768   0.2185866  ?          1.015502e-15

  1. What are the conditions for a linear regression using a mathematical model (like the one above)? Do we know if the data meet those conditions? Explain.

  2. Fill in the blanks: The R output displays a linear model for the Gentoo penguins where the __________ variable is bill_depth_mm and the __________ variable is bill_length_mm.

  3. Without computing the model yourself find the value for ? above.

  4. Make a 90% confidence interval for the slope of the regression equation. If you use R, only consider the functions: pt(), qt(), pnorm(), qnorm().

  5. In plain English, interpret the slope 2.02 and intercept 17.23 from the output above.

a. Conditions:

  • The relationship between the two variables needs to be linear. (see graph)

  • Our observations need to be independent, and we assume they are.

  • The residuals should be roughly normal. (see residual graph)

  • The variability of the residuals should be relatively constant. (see graph)

Without using R it is tricky to check all of these conditions. Below are a scatterplot and a plot of the residuals to show that they are met. There is one outlier, but it's not too far out…

penguins_gentoo <- filter(penguins, species == "Gentoo")
penguins_gentoo |>
  ggplot(aes(x = bill_depth_mm, y = bill_length_mm)) +
  geom_point() +
  geom_smooth(method = "lm", se= FALSE)
`geom_smooth()` using formula = 'y ~ x'

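# Fit the regression shown in the output; `results` is the ANOVA fit from Problem 3
gentoo_fit <- lm(bill_length_mm ~ bill_depth_mm, data = penguins_gentoo)
pg_resid <- augment(gentoo_fit)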
pg_resid |>
  ggplot(aes(.fitted, .resid))+
  geom_point()+
  geom_hline(yintercept = 0, color= "red") +
  labs(title = "Residual Graph")

b. Fill in the blanks: The R output displays a linear model for the Gentoo penguins where the predictor variable is bill_depth_mm and the response variable is bill_length_mm.

c. We use the formula for the test statistic (estimate minus null value, divided by the standard error) to get ?.

\(? = \frac{2.020768-0}{0.2185866} \approx 9.244702\)

d. 90% CI, using \(n - 2 = 117\) degrees of freedom for the 119 Gentoo penguins:

#Lower
2.020768 + qt(0.05, df= 117)* 0.2185866
[1] 1.658355
#Upper
2.020768 - qt(0.05, df= 117)* 0.2185866
[1] 2.383181

e. For each additional 1 mm of bill depth, we expect bill length to increase by about 2.02 mm on average. If it were possible to have a bill depth of zero, we would expect a Gentoo penguin to have a bill length of 17.23 mm.
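With R available, the roles of the two variables and the interval for the slope can be checked directly. The sketch below assumes the model has bill_length_mm as the response and bill_depth_mm as the predictor, as the term name in the output suggests. The object names `gentoo` and `fit` are illustrative. The fitted numbers may differ slightly from the printed output if that output was produced from a different subset of rows.

```r
library(palmerpenguins)

gentoo <- subset(na.omit(penguins), species == "Gentoo")

# Response on the left of ~, predictor on the right
fit <- lm(bill_length_mm ~ bill_depth_mm, data = gentoo)

coef(summary(fit))            # estimates, standard errors, t statistics, p-values
confint(fit, level = 0.90)    # 90% confidence intervals for intercept and slope
df.residual(fit)              # residual degrees of freedom, n - 2
```

confint() uses the t-distribution with the residual degrees of freedom, the same recipe as the by-hand interval.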

Problem 6

Test to see if there is a difference between the average flipper length of the Adelie and Chinstrap Penguins. Be sure to discuss all conditions. If you use R, only consider the functions: pt(), qt(), pnorm(), qnorm().

filter(penguins, species %in% c("Adelie", "Chinstrap"))|>
  group_by(species) |>
  summarise(mean = mean(flipper_length_mm), 
            sd = sd(flipper_length_mm), 
            median = median(flipper_length_mm), 
            min = min(flipper_length_mm), 
            max = max(flipper_length_mm), 
            count = n()) |>
  kable(digits = 1)
species    mean   sd   median  min  max  count
Adelie     190.1  6.5  190     172  210  146
Chinstrap  195.8  7.1  196     178  212  68

Conditions:

  • Normality - Both samples are large (146 and 68), and flipper length is likely roughly normally distributed.

  • Independence - We have to assume independence within each group. The groups are independent of each other because one penguin cannot belong to two species.

\[H_0: \mu_A = \mu_C\] \[H_a: \mu_A \ne \mu_C \] \[\alpha = 0.05\]

\(SE = \sqrt{\frac{s^2_A}{n_A} + \frac{s^2_C}{n_C}}= 1.015234\)

Test statistic: \(\frac{190.1 - 195.8 - 0}{1.015234} = -5.614\)

P-value: 2 * pt(-5.614, df = 67) \(\approx\) 0, using the smaller sample size minus one (68 - 1 = 67) as the degrees of freedom.

We reject the null hypothesis and conclude that the mean flipper length differs between these two species.

Just a note: It's possible we made a Type 1 error. If we did, then the null hypothesis is actually true. It is not possible to know whether we made this error.
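The test can be reproduced from the rounded summary table using only pt(). This is a sketch with illustrative variable names. It uses the conservative degrees of freedom (smaller sample size minus one), as in the solution above.

```r
# Rounded summary statistics from the table above
mean_a <- 190.1; sd_a <- 6.5; n_a <- 146   # Adelie
mean_c <- 195.8; sd_c <- 7.1; n_c <- 68    # Chinstrap

se      <- sqrt(sd_a^2 / n_a + sd_c^2 / n_c)   # standard error of the difference
t_stat  <- (mean_a - mean_c - 0) / se          # null difference is 0
df      <- min(n_a, n_c) - 1                   # conservative degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = df)       # two-sided p-value

c(se = se, t_stat = t_stat, df = df, p_value = p_value)
```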

Problem 7

Consider the randomization distribution for the difference in average flipper length of the Adelie and Chinstrap Penguins. Using only information given in the previous problems and the graph below, test to see if there is a difference in average flipper length between the two selected species at the 1% significance level.

set.seed(62)
penguins |>
  filter(species %in% c("Adelie", "Chinstrap"))|>
  specify(flipper_length_mm ~ species) |>
  hypothesize(null = "independence") |>
  generate(reps = 1000, type = "permute") |>
  calculate(stat = "diff in means", order = c("Adelie", "Chinstrap")) |>
  visualise()

Conditions: Independence. The randomization distribution is clearly bell-shaped, roughly normal or t-distribution-ish.

From the previous problem, the point estimate is about -5.7 (190.1 - 195.8), which is so far to the left that it is off this graph, so the one-sided p-value is approximately zero. Doubling it for a two-sided test still gives approximately zero, which is below 0.01, so we reject the null hypothesis.

Note: We do not use the test statistic in this problem because this is a distribution of differences in means, not of standardized values. We use the test statistic when we are dealing with a theoretical t-distribution.
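To make the p-value idea concrete, the shuffling can also be done by hand in base R instead of reading it off the graph. This is a sketch, not the infer pipeline above: the helper `diff_in_means()` and the other names are made up for illustration, and the shuffles will not match the infer output even with the same seed.

```r
library(palmerpenguins)

two_species <- subset(na.omit(penguins), species %in% c("Adelie", "Chinstrap"))
flipper <- two_species$flipper_length_mm
label   <- as.character(two_species$species)

# Hypothetical helper: Adelie mean minus Chinstrap mean
diff_in_means <- function(y, g) mean(y[g == "Adelie"]) - mean(y[g == "Chinstrap"])

observed <- diff_in_means(flipper, label)

set.seed(62)  # arbitrary seed for reproducibility
# Shuffle the species labels 1000 times and recompute the difference each time
null_diffs <- replicate(1000, diff_in_means(flipper, sample(label)))

# Two-sided p-value: share of shuffled differences at least as extreme as observed
p_value <- mean(abs(null_diffs) >= abs(observed))
c(observed = observed, p_value = p_value)
```

Comparing absolute values counts both tails at once, which plays the same role as doubling the one-sided p-value.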