Rstudio data analysis: Problem Set 4 ́’Diving into the deep end.’
Ɣ The primary goal of this paper is to predict the overall popular vote of the 2020 American presidential election using͓multilevel regression with post́stratification̮
Ɣ As the point is to forecast the election before it happens̩ no late submissions can be accepted̮
Ɣ We expect you to work as part of a group of 4 people̩ but groups of size 1́4 are fine̮ We have suggested a split of the work based on a 4́person group̩ but these are just suggestions̮
Ɣ ̻Person 1̼ Individuaĺlevel survey data ̻please see videǫ https̴̴̨web̮microsoftstream̮com̴video̴˦0e25a4éf33é42˦é˧4c˧́b5˧
˧ec374f43̨̼
○ Request access to the Democracy Fund ͳ UCLA Nationscape ͋Full
Data Set̨͌
https̴̴̨www̮voterstudygroup̮org̴publication̴nationscapédatáset̮
This could take a day or two̮ Please start early̮
○ Given the expense of collecting this data̩ and the privilege of having access to it̩ if you don͌t properly cite this dataset then you will get zero for this problem set̮
○ Once you have access then pick a survey of interest̮ We will use ͉ns20200102̮dta͊ in the example ̻your number may be different̼̮
○ This will be a large file and is not yours to share̮ Do not push it to GitHub ̻use the ̮gitignore file ́ see herę
https̴̴̨carpentrieśincubator̮github̮io̴git́Rstudiócourse̴02́igno
re̴index̮html̼̮
○ Use the example R code to get started preparing͓this dataset̩ and then go on cleaning and preparing it based on what you need̮
○ Make graphs and tables about the survey data and write beautiful sentences and paragraphs explaining͓everything̮
Ɣ ̻Person 2̼ Post́stratification data ̻please see videǫ https̴̴̨web̮microsoftstream̮com̴video̴4e0770a4́˦˧ef́403b́˦4˦0́cad
62eaecd0ą̼
○ We will use the American Community Surveys ̻ACS̼̮
○ Please create an account with IPUMS̨
https̴̴̨usa̮ipums̮org̴usa̴index̮shtml
○ You want the 201˦ 1́year ACS̮ Then you need to select some variables̮ This will depend on what you want to model and the
survey data̩ but some options includę REGION̩ STATEICP̩ AGE̩ SEX̩ MARST̩ RACE̩ HISPAN̩ BPL̩ CITIZEN̩ EDUC̩ LABFORCE̩
21
INCTOT̮ Have a look around and see what you are interested in̩ remembering that you will need to establish a correspondence to
the survey̮
○ Download the relevant post́stratification data ̻it͌s probably easiest to change the data format to ̮dta̼̮ Again̩ this can take some time̮
Please start this early̮
○ This will be a large file and is not yours to share̮ Do not push it to GitHub ̻use the ̮gitignore file ́ see herę
https̴̴̨carpentrieśincubator̮github̮io̴git́Rstudiócourse̴02́igno
re̴index̮html̼̮
○ Given the expense of collecting this data̩ and the privilege of having access to it̩ if you don͌t properly cite this dataset then you will get zero for this problem set̮
○ Clean and prepare the post́stratification dataset̮
○ Remember that you need cell counts for the sub́populations in
your model̮ See examples in the readings̮
Ɣ ̻Person 3 ́ start with simulated data while waiting͓for the real data̼ Modelling̮
○ You will want to explain vote intention based on a variety of explanatory variables̮ Construct the vote intention variable so that it
is binary ̻either ͋supports Trump͌ or ͋supports Biden̼̮͌
○ You are welcome to use OP() but you would need to explain the nuances of this decision in the model section ̻Hint̨ start herę
https̴̴̨statmodeling̮stat̮columbia̮edu̴2020̴01̴10̴lineaŕoŕlogis tićregressiońwith́binarýoutcomes̴̼̮
○ That said̩ you should probably use logistic regression if it is at all possible for you̮ If you don͌t know where to start then look at ̻in
increasing͓levels of complexity̼ gOP()̩ OPe4::gOPeU()̩ or bUPV::bUP()̮ There are examples of each in the readings̮
○ Think very deeply about model fit̩ diagnostics̩ and other similar things that you need in order to convince someone that your model
is appropriate̮
○ You have flexibility of the model that you use̩ ̻and hence the cells that you͌ll need to create next̼̮ In general̩ the more cells the better̩
but you may want fewer cells for simplicity in the writing͓process and to ensure a decent sample in each cell̮
○ Apply your trained model to the post́ stratification dataset to make the best estimate of the election result that you can̮ The specifics will depend on your modelling approach but will likely involve SUedLcW()̩ add_SUedLcWed_dUaZV()̩ or similar̮ See the
examples in the readings̮ We are primarily interested in the distribution of your forecast of the overall Presidential popular vote̩
22
and how the explanatory variables affect this̮ But great submissions would go beyond that̮ Also̩ you͌re taking͓a statistics course̩ so if
you just gave a central estimate and nothing͓else̩ then that would not be great̮
○ Create beautiful graphs and tables of your model and results̮
○ Create wonderful paragraphs talking͓about and explaining everything̮
Ɣ ̻Person 4 ́ start with simulated data̴results while waiting̼ Write up̮
○ Using R Markdown̩ please write a very thorough paper about your analysis and compile it into a PDF̮
○ The paper must be welĺwritten̩ draw on relevant literature̩ and show your statistical skills by explaining͓all statistical concepts that
you draw on̮
Rstudio data analysis: Problem Set 4 ́’Diving into the deep end.’
○ The paper must have the following sections̨
■ title̩ name̴s̩ and date̩
■ abstract and keywords̩
■ introduction̩
■ data̩
■ model̩
■ results̩
■ discussion̩ and
■ references̮
○ The paper may use appendices for supporting̩ but not critical̩ material̮
○ The discussion needs to be substantial̮ For instance̩ if the paper
were 10 pages long then a discussion should be at least 2̮5 pages̮
In the discussion̩ the paper must include subsections on
weaknesses and next steps ́ but these must be in proportion̮
Ɣ The report must provide a link to a GitHub repo that contains everything
̻apart from the raw data that you git ignored because it is not yours to
share̼̮ The code must be entirely reproducible̩ documented̩ and
readable̮ The repo must be welĺorganised and appropriately use folders
and README files̮
Ɣ The graphs and tables must be of an incredibly high standard̩ well
formatted̩ and report́ready̮ They should be clean and digestible̮
Furthermore̩ you should label and describe each table̴figure̮
Ɣ When you discuss the datasets ̻in the data section̼ ̻remember there will
be at least two datasets to discuss̼ you should make sure to discuss ̻at
least̨̼
Rstudio data analysis: Problem Set 4 ́’Diving into the deep end.’
○ Their key features̩ strengths̩ and weaknesses generally̮
○ The survey questionnaire ́ what is good and bad about it̯
23
○ A discussion of the methodology including͓how they find people to take the survey̳ what their population̩ frame̩ and sample were̳
what sampling approach they took and what some of the tradéoffs may be̳ what they do about nońresponse̳ the cost̮
○ This is just some of the issues strong͓submissions will consider̮
Show off your knowledge̮ If this becomes too detailed then you should push some of this to footnotes or an appendix̮
Ɣ The dataset section is probably an appropriate place to include an explanation of what post́stratification is ̻in noństatistical language̼ and the strengths and weaknesses of it̩ although this discussion may fit more naturally in another section̮ Regardless̩ be sure to justify the inclusion of each explanatory variable̮
Rstudio data analysis: Problem Set 4 ́’Diving into the deep end.’
Ɣ When you discuss your model ̻in the model section̼̩ you must be extremely careful to spell out the statistical model that you are using̩ defining and explaining͓each aspect and why it is important̮ ̻For a Bayesian model̩ a discussion of priors and regularization is almost always important̮̼ You should mention the software that you used to run the model̮ You should be clear about model convergence̩ model checks̩ and diagnostic issues̩ although you may push the details of this to an appendix depending on how detailed you get̮ How do the sampling͓and survey aspects that you discussed assert themselves in the modelling͓decisions
that you make̯ How can you convince a reader that you͌ve neither overfit nor underfit the data̯ Again̩ if it becomes too detailed then push some of the details to footnotes or an appendix̮
Ɣ You should present model results̩ graphs̩ figures̩ etc̩ in the results section̮ This section should strictly relay results̮ It must include text explaining all of these and summary statistics and similar̮ However̩ interpretation of these results and conclusions drawn from the results should be left for the discussion section̮
Ɣ Your discussion should focus on your model results̩ but this time interpreting them̩ and explaining what they mean̮ Put them in context̮ What do we learn about the world having͓understood your model and its results̯ What caveats could apply̯ To what extent does your model represent the small world and the large world ̻to use the language of McElreath̩ Ch 2̼̯ What are some weaknesses and opportunities for future work̯ Who is going to win the election̯ How confident are you in that
forecast̯ Do you have a small or large distribution̯ What could that
mean̯ Are you more confident in certain states̯ Do certain explanatory
variables carry more weight than others̯ Etc̮
Ɣ Check that you have referenced everything̮ Strong͓submissions will draw
on related literature in the discussion ̻and other sections̼ and would be
24
sure to also reference those̮ The style of references does not matter̩ but it
must be consistent̮
Ɣ If you don͌t cite R then you will get zero for this problem set̮
Ɣ As a team̩ via Quercus̩ submit a PDF of your paper̮ Again̩ in your paper
you must have a link to the associated GitHub repo̮ And you must include
the R Markdown file that produced the PDF in that repo̮
Ɣ The R Markdown file must exactly produce the PDF̮ Don͌t edit it manually
ex post ́ that isn͌t reproducible̮
Rstudio data analysis: Problem Set 4 ́’Diving into the deep end.’
Ɣ A good way to work as a team would be to split up the work̩ so that one
person is doing each section̮ The people doing͓the sections that rely on
data ̻such as the analysis and the graphs̼ could just simulate it while they
are waiting for the person putting͓together the data to finisḫ We have
recommended a split above̩ but you do what works for you̮
Ɣ It is expected that your submission be well written and able to be
understood by the average reader of say 53˦̮ This means that you are
allowed to use mathematical notation̩ but you must be able to explain it
all in plain Englisḫ Similarly̩ you can ̻and hint̨ you should̼ use survey̩
sampling̩ observational̩ and statistical terminology̩ but again you need to
explain it̮ The average person doesn͌t know what a ṕvalue is nor what a
confidence interval is̮ You need to explain all of this in plain language the
first time you use it̮ Your work should have flow and should be easy to
follow and understand̮ To communicate well̩ anyone at the university
level should be able to read your report once and relay back the
methodology̩ overall results̩ findings̩ weaknesses and next steps without
confusion̮
Rstudio data analysis: Problem Set 4 ́’Diving into the deep end.’
Ɣ It is recommended that you ̻informally̼ proofread one another͌s work ́
why not exchange papers with another group̯
Ɣ Everyone in the team receives the same mark̮
Ɣ There should be no evidence that this is a class assignment̮
Ɣ Again̩ no eƹtensions are possible̩ for obƳious reasons̮ The submission
portal ƴill close soon after midnight̮
25
Rstudio data analysis: Problem Set 4 ́’Diving into the deep end.’
Rstudio data analysis: Problem Set 4 ́’Diving into the deep end.’