Project Overview
As part of the coursework requirement for SDS 220, you will conduct a quantitative analysis on a real data set with the goal of using data analysis to address a research question. Your final product will be a written report describing the motivation for the project, details of the data used in the analysis, and the conclusions drawn from your analysis. The project is an opportunity to integrate what you’ve learned about data wrangling, data visualization, and statistical inference with statistical communication skills.
The core project requirements pertain to the statistical analysis you perform and the narrative structure of your written report. Roughly speaking, your data analysis project will address a question like “How is {Variable X} related to {Variable Y}?” This means that your data set must measure several variables.
Your analysis must also include at least one hypothesis test that allows you to draw an inferential conclusion about how your outcome variable is related to your explanatory variable(s). For example, you may include a hypothesis test for a difference in means, a hypothesis test for a difference in proportions, or a hypothesis test on the coefficients of a regression model.
When choosing which variables to use in your analysis, it is prudent to keep the hypothesis testing requirement in mind. For example, before choosing a categorical variable with 5 levels as your outcome, and a numeric explanatory variable, ask yourself: Do I know how to conduct a hypothesis test that would answer a research question about how these kinds of variables relate? If not, it is recommended that you choose a different set of variables to analyze.
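As a rough illustration of the three kinds of tests mentioned above, the sketch below runs each one in base R. It uses the built-in mtcars data purely as a stand-in (it is not an approved project data set), and the variable pairings are assumptions chosen only to show which test matches which combination of variable types.

```r
# Illustration only: mtcars ships with R and is NOT a permitted project data set.
cars <- mtcars
cars$am <- factor(cars$am, labels = c("automatic", "manual"))

# 1. Difference in means: numeric outcome, two-level categorical explanatory variable
means_test <- t.test(mpg ~ am, data = cars)

# 2. Difference in proportions: two-level outcome, two-level explanatory variable
#    Rows are the groups (transmission); columns are the outcome (engine shape)
counts <- table(cars$am, cars$vs)
props_test <- prop.test(counts)

# 3. Test on a regression coefficient: numeric outcome, numeric explanatory variable
fit <- lm(mpg ~ wt, data = cars)
slope_test <- summary(fit)$coefficients["wt", ]

# p-values for each test
means_test$p.value
props_test$p.value
slope_test["Pr(>|t|)"]
```

If none of these patterns fits the outcome and explanatory variables you have in mind, that is a sign to reconsider your choice of variables.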
Your report should be structured as a narrative document that introduces the topic of your analysis (including your research question(s) and the goals of your data analysis), explains the nature of the data you are analyzing in sufficient detail, explains what analyses are being performed and what can be learned from them, and ends with overall conclusions related to your research question that are supported by your data analysis. Your writing should be supplemented with illustrative figures and tables that help the reader understand how these data help answer the questions of your investigation. Your report should also cite at least one outside source of information (in addition to citing the source of your data), though that source does not necessarily have to be a peer-reviewed ‘academic’ article.
Stages and Timeline
You will complete your project in the several stages outlined below. More details about what is required to complete each stage are given in the subsequent sections.
- Monday, November 20 - Group formation and Data Set Selection (2 points)
- Wednesday, November 29 - Data Analysis Plan (5 points)
- Thursday, December 14 - Informal Feedback (5 points)
- Wednesday, December 20 - Written Report (25 points)
- Wednesday, December 20 - Reflection (3 points)
Group formation and Data Set Selection
You will work in pairs to complete this project, and you will choose your partner yourself. If you do not have a good idea of who you want to work with, I encourage you to solicit potential project partners in the course Slack.
Informing the instructor who you are working with is considered the first checkpoint of the group project assignment. You are expected to fill out the provided form before the deadline posted. In the event you are unable to join a group by the deadline, or do not inform the instructor, you will be assigned to a group by the instructor.
Once you have formed your group, the next step in the project is choosing your data set and planning your analysis. You are free to use any data set that meets the requirements below:
- You may not use any variable representing a temporal measurement (months, days, years, etc.) as an explanatory variable.
- You may not use proportions/percentages as your outcome variable.
- The data set must be in a “tidy” data format to start with.
- There must be at least 10 observations per level in all categorical variables.
- You may not use any data sets used in this class, the Introduction to Modern Statistics [IMS] book, or any data used as part of example demonstrations.
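One way to check the requirement of at least 10 observations per level is to tabulate each categorical variable before committing to a data set. The following is a minimal sketch in base R; the built-in mtcars data and its cyl variable are stand-ins chosen for illustration only, not an approved project data set.

```r
# Illustration only: mtcars is a stand-in, not a permitted project data set.
# `cyl` (number of cylinders) is treated here as a categorical variable.
level_counts <- table(factor(mtcars$cyl))

level_counts                                            # observations in each level
small_levels <- names(level_counts)[level_counts < 10]  # levels with fewer than 10
small_levels
all(level_counts >= 10)                                 # TRUE only if every level qualifies
```

In this stand-in data one level of cyl falls short of 10 observations, so a variable like it would not meet the requirement as it stands.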
To simplify the process of choosing a data set, a curated list of data sets that are appropriate for this project is provided HERE. If you wish to use a data set that is not on the curated list, you must get approval.
You will inform the instructor who you are working with, and your chosen data set using the form provided HERE (only one group member needs to fill out the form). This form is graded on completion provided it is submitted by the due date.
Data Analysis Plan
Once you’ve chosen a data set, you must form a research question and plan an analysis suitable to address your question. One member of your group must fill out the Data Analysis form. The form is HERE and will ask you:
- What variables you plan to analyze
- What research question you have
- What hypothesis you have about your variables
- How you plan to test your hypothesis
- How you will divide the work
Completing the Data Analysis form is considered the second checkpoint of the group project assignment. You are expected to fill out the Data Analysis form before the deadline posted. If no one from your group fills out the Data Analysis form by the posted deadline, your group will be assigned a data set by the instructor and will receive a 0 for this portion of the grade. You will earn one point for thoroughly answering each question above in full sentences.
Note that your plan can have errors; this is a chance to get feedback. We are looking to see if you have thought carefully about a plan. You are also allowed to make minor changes to your plan after the form has been submitted.
Informal Feedback
You will have an opportunity to give informal feedback to, and get it from, your classmates. Rewriting and editing are crucial parts of the writing process. On the last day of class you will need to bring four printed copies of a draft of your written report. The written report does not need to be complete, but there should be at least some form of a rough draft. The more you have done, the better the feedback will be. During this class period your group will meet with at least two other groups to exchange drafts and grade each other using the rubric provided HERE. You may meet with more than two groups if you’d like.
The grade and comments you receive from other groups will not be a factor in the grade you get for this activity. To receive credit, you must submit the following:
- A certification that you provided feedback to at least two other groups, and received feedback from at least two other groups (1 point).
- Your draft (1 point).
- A summary of the feedback your group received (1 point).
- A detailed plan for adjustments and modifications, including how you plan to divide the work (2 points).
If you or a group member is absent you are still required to do the activity to receive points, but it will be on your own time. The form for submitting credit for informal feedback is HERE, and due no later than 11:55pm on Thursday Dec 14 (the day after the activity). All responses should be thorough and in full sentences to receive full credit. Only one group member needs to submit the form.
Data Analysis Report
In this final phase, you will execute the project by writing a technical report that introduces your topic, describes your data and analysis, and explains the conclusions that are supported by your analysis. Many of the key features of this report have already been described in the core project requirements under Project Overview. In general, your report should follow this basic format:
Introduction: an overview of your project. In a few paragraphs, you should explain clearly and precisely what your research question is, why it is interesting, and what contribution you have made towards answering that question. You should give an overview of the specifics of your model, but not the full details. Most readers never make it past the introduction, so this is your chance to hook the reader, and is in many ways the most important part of the paper!
Exploratory Data Analysis: a brief description of your data set. This section should introduce the reader to the data by answering questions such as: what variables are included? Where did they come from? What are units of measurement? What is the population that was sampled? How was the sample collected? This section should also acquaint the reader with your data set through the use of univariate and multivariate visualizations, and univariate and multivariate summaries.
Results: a hypothesis test set up to address your research question(s). You should interpret the results in the context of the data and explain their relevance. If you are using a regression model in your analysis, each of its coefficients should be clearly explained. You should include negative results, but be careful about how you interpret them. For example, you may want to say something along the lines of: ‘we found no evidence that explanatory variable x is associated with response variable y,’ or ‘explanatory variable x did not provide any additional explanatory power above what was already conveyed by explanatory variable z’. On the other hand, you probably shouldn’t claim: ‘there is no relationship between x and y’.
Discussion/Conclusion: a summary of your findings and a discussion of their limitations. Be sure to remind the reader of the question that you originally set out to answer, and summarize your answer to the question. Your discussion should also protect you against misinterpretation by being clear about what is not implied by your research. Finally, you should also discuss the limitations of your analysis and/or data, and how it could be augmented or improved with future research. This ‘future questions’ portion of the discussion should include a brief exploratory analysis showing how your project could be extended to include additional explanatory variables in the future.
Appendix (Potentially Optional): If you have any supplemental analyses that would be of interest (e.g., examining the assumptions of your two-sample t-test), you may wish to place those analyses in an appendix at the end of your report.
Bibliography: Your report should contain at least two references (one of those references must be the source of your data). You can use any reasonable citation style (e.g., APA or MLA format), but regardless of style, your bibliography should clearly identify the primary source you are referencing.
To complete the project, your group should submit:
- Your written report as a PDF document.
- The .qmd or .Rmd file that, when rendered, creates the self-contained PDF document for the report.
- Your BibTeX references file (and your citation style file, if any citation style file was used).
There is no limit to the length of the technical report, but it should not be longer than it needs to be; you should strive to express your ideas concisely and precisely in all situations. Even though you are expected to write your technical report as an R Markdown or Quarto document, the PDF file you submit should not display any R code, warnings, or messages. Your technical report should also be well formatted and well organized (e.g., using properly formatted headers to divide sections, using LaTeX markup for mathematical symbols where needed, and giving tables bolded column and row labels).
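One possible way to keep R code, warnings, and messages out of the rendered PDF is to set the relevant options once in the YAML header of the .qmd file rather than chunk by chunk. The header below is a sketch under assumptions: the title and the file name references.bib are placeholders, and your own header may need other fields.

```yaml
---
title: "Your project title"      # placeholder
format: pdf
bibliography: references.bib     # placeholder name for your BibTeX file
execute:
  echo: false                    # hide R code in the rendered report
  warning: false                 # hide warnings
knitr:
  opts_chunk:
    message: false               # hide messages (e.g., package start-up text)
---
```

In an .Rmd file, a roughly equivalent approach is to call knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE) in a setup chunk near the top of the document.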
Your report will be graded out of 25 points using the rubric HERE.
Guidelines for Successful Statistical Writing
This document should be written for peer reviewers, who comprehend statistics at least as well as you do. You should aim for a level of complexity that is more statistically sophisticated than an article in the Science section of The New York Times, but less sophisticated than an academic journal. For example, your report will use terms that you will likely never see in the Times (e.g. a \(t\)-statistic), but you should not dwell or expound on technical points with no obvious ramifications for the reader (e.g., explaining that your plots were made using ggplot2, or including the definition of a p-value after reporting the p-value of your test statistic). Your goal for this paper is to convince a statistically-minded reader (e.g. a student in this class, or a student from another school who has taken an introductory statistics class) that you have addressed an interesting research question in a meaningful way. But even a reader with no background in statistics should be able to read your report and get the gist of it.
A good example of how the writing in this report might differ from the writing you’ve done on a HW assignment or exam is how you describe (or rather, don’t describe) your null and alternative hypotheses. In a HW assignment or exam, you might be asked to explicitly state your null and alternative hypotheses in words and symbols, and include sentences like ‘The null hypothesis is that in the population of all babies, there is no difference in the average birth weight between babies born to smokers and babies born to non-smokers’ or equations like \(\mu_A - \mu_B = 0\). But in a research paper, you would not include such literal descriptions of the null hypothesis; rather, you would explain your research question (‘Does the smoking behavior of pregnant women affect their baby’s birth weight?’) and how you chose to address this research question with data analysis (e.g., a two-sample t-test, using a two-tailed p-value). In other words, you shouldn’t explicitly write the null and alternative hypotheses, but a statistically literate reader should be able to answer the question ‘What are the null and alternative hypotheses being tested?’ based on your writing.
This report is not simply a dump of all the figures, tables, and calculations that you made during this project, or a story about all the things you tried out in the beginning of the project but didn’t end up keeping. Rather, the technical report should be focused and concise, and based on the minimal set of R code that is necessary to understand your results and findings in full. If you make a claim about your research question or data, it must be justified by explicit calculation. A knowledgeable reviewer should be able to run each line of code in your .qmd or .Rmd file without modification, and verify every statement that you have made.
And even though your report may be written in a Quarto document, and rendered in RStudio, its primary purpose is to be a narrative report that describes how you addressed your research question, not a computer program or a lab notebook with technical details and no context. You should not present tables, figures, or calculations without a written explanation of the information that is supposed to be conveyed by that table or figure. Keep in mind the distinction between data and information. For example, simply displaying a table with the means and standard deviations of your variables with no explanation is not meaningful. And adding an accompanying sentence that reiterates the content of the table (e.g. ‘the mean of variable x was 34.5 and the standard deviation was 2.8…’) is equally meaningless. What you should strive to do is interpret these values in context (e.g. ‘although variables \(x_1\) and \(x_2\) have similar means, the variability of \(x_1\) is much larger, suggesting…’).
Reflection
Write a short reflection piece (no more than one page/two to three paragraphs in length) describing what you learned from working on this project and how well you and your group navigated the collaborative learning process. Some questions to consider as you write your reflection statement include:
- What aspects of the project did you most enjoy?
- What aspects of the project did you find most challenging? Why? What made those aspects difficult?
- What skill(s) used during the project are transferable to other areas of your life or studies? Explain.
- What statistical idea(s) are you curious to know more about as a result of doing this project?
- How well did you and your group members support one another and work as a team throughout the project? What did you do well in this regard? What might you approach differently next time?
- Is there anything (confidentially) that you would like to report about the group dynamics?
The second part of the reflection is reporting the proportion of the project you and your partner each completed. Try not to consider your personal feelings toward your partner, but instead consider the project experience. If you both contributed equally, then you would say 0.50 for each person. However, maybe you weren’t able to contribute equally to the project, and your partner picked up the slack. You might give yourself less, and them more. Or maybe your partner wasn’t pulling their weight, and you had to pick up the slack. You might give yourself more and them less. Maybe someone did some truly outstanding work and you think they deserve a larger proportion. Maybe someone was really helpful in working with the project data and you think they deserve a larger proportion. Provide a brief explanation for the proportions you assigned to yourself and your partner.
Your responses will not be seen by anyone but me. This information will be useful to me in understanding everyone’s contributions, and will serve as one small factor in helping me determine everyone’s grade.
Each person should submit their own private reflection by Wednesday, December 20 at 11:55pm. Submit this reflection as a PDF document. Please make sure that your name is included at the top of your reflection.