RStudio data analysis: Problem Set 4 - ‘Diving into the deep end.’

● The primary goal of this paper is to predict the overall popular vote of the 2020 American presidential election using multilevel regression with post-stratification.

● As the point is to forecast the election before it happens, no late submissions can be accepted.

● We expect you to work as part of a group of 4 people, but groups of size 1-4 are fine. We have suggested a split of the work based on a 4-person group, but these are just suggestions.

● (Person 1) Individual-level survey data (please see video: https://web.microsoftstream.com/video/80e25a4e-f33e-428e-94c9-b599ec374f43):

○ Request access to the Democracy Fund + UCLA Nationscape ‘Full Data Set’: https://www.voterstudygroup.org/publication/nationscape-data-set. This could take a day or two. Please start early.

○ Given the expense of collecting this data, and the privilege of having access to it, if you don’t properly cite this dataset then you will get zero for this problem set.

○ Once you have access, pick a survey of interest. We will use “ns20200102.dta” in the example (your number may be different).

○ This will be a large file and is not yours to share. Do not push it to GitHub (use the .gitignore file - see here: https://carpentries-incubator.github.io/git-Rstudio-course/02-ignore/index.html).

○ Use the example R code to get started preparing this dataset, and then go on cleaning and preparing it based on what you need.

○ Make graphs and tables about the survey data and write beautiful sentences and paragraphs explaining everything.
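The cleaning step above might look something like the following minimal sketch. It uses a tiny made-up data frame in place of the Nationscape file, and the column names (vote_2020, age, gender), the response labels, and the age bands are all assumptions for illustration rather than the survey's documented codebook. With the real .dta file you would read it in first (for instance with haven::read_dta()) and check the actual variable names.

```r
# Toy stand-in for the raw survey; every name and label here is hypothetical.
raw_survey <- data.frame(
  vote_2020 = c("Donald Trump", "Joe Biden", "I am not sure/don't know",
                "Joe Biden", "Donald Trump", "Someone else"),
  age       = c(23, 41, 35, 67, 58, 19),
  gender    = c("Female", "Male", "Female", "Female", "Male", "Male"),
  stringsAsFactors = FALSE
)

# Keep only respondents who chose one of the two major candidates.
keep <- raw_survey$vote_2020 %in% c("Donald Trump", "Joe Biden")
two_party <- raw_survey[keep, ]

# Binary outcome: 1 = supports Biden, 0 = supports Trump.
two_party$vote_biden <- as.integer(two_party$vote_2020 == "Joe Biden")

# Group age into bands that could later be matched to another dataset.
two_party$age_group <- cut(
  two_party$age,
  breaks = c(17, 29, 44, 59, Inf),
  labels = c("18-29", "30-44", "45-59", "60+")
)

print(two_party[, c("vote_biden", "age_group", "gender")])
```

Dropping undecided and third-party respondents is itself a modelling choice that the paper would need to justify.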

● (Person 2) Post-stratification data (please see video: https://web.microsoftstream.com/video/4e0770a4-89ef-403b-8480-cad62eaecd0a):

○ We will use the American Community Surveys (ACS).

○ Please create an account with IPUMS: https://usa.ipums.org/usa/index.shtml

○ You want the 2018 1-year ACS. Then you need to select some variables. This will depend on what you want to model and the survey data, but some options include: REGION, STATEICP, AGE, SEX, MARST, RACE, HISPAN, BPL, CITIZEN, EDUC, LABFORCE, INCTOT. Have a look around and see what you are interested in, remembering that you will need to establish a correspondence to the survey.

○ Download the relevant post-stratification data (it’s probably easiest to change the data format to .dta). Again, this can take some time. Please start this early.

○ This will be a large file and is not yours to share. Do not push it to GitHub (use the .gitignore file - see here: https://carpentries-incubator.github.io/git-Rstudio-course/02-ignore/index.html).

○ Given the expense of collecting this data, and the privilege of having access to it, if you don’t properly cite this dataset then you will get zero for this problem set.

○ Clean and prepare the post-stratification dataset.

○ Remember that you need cell counts for the sub-populations in your model. See examples in the readings.
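As a rough illustration of the cell counts mentioned above, the sketch below sums person weights within each age-by-gender cell of a toy extract. The data values, the age_group and gender columns, and the choice of cells are made up for the example. The weight column is named after the IPUMS person weight (PERWT), which is an assumption about what your extract contains.

```r
# Toy stand-in for an ACS extract; values and column names are hypothetical.
acs <- data.frame(
  age_group = c("18-29", "18-29", "30-44", "30-44", "30-44", "60+", "60+", "18-29"),
  gender    = c("Female", "Male", "Female", "Female", "Male", "Male", "Female", "Female"),
  perwt     = c(120, 95, 110, 130, 105, 90, 140, 100)  # person weights
)

# Sum the person weights within each cell to estimate how many people it represents.
cells <- aggregate(perwt ~ age_group + gender, data = acs, FUN = sum)
names(cells)[names(cells) == "perwt"] <- "n"

# Share of the (toy) population that falls in each cell.
cells$prop <- cells$n / sum(cells$n)

print(cells)
```

The cell definitions have to match the categories used in the survey model exactly, which is why the recoding on both datasets needs to be coordinated.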

● (Person 3 - start with simulated data while waiting for the real data) Modelling.

○ You will want to explain vote intention based on a variety of explanatory variables. Construct the vote intention variable so that it is binary (either ‘supports Trump’ or ‘supports Biden’).

○ You are welcome to use lm() but you would need to explain the nuances of this decision in the model section (Hint: start here: https://statmodeling.stat.columbia.edu/2020/01/10/linear-or-logistic-regression-with-binary-outcomes/).

○ That said, you should probably use logistic regression if it is at all possible for you. If you don’t know where to start then look at (in increasing levels of complexity) glm(), lme4::glmer(), or brms::brm(). There are examples of each in the readings.

○ Think very deeply about model fit, diagnostics, and other similar things that you need in order to convince someone that your model is appropriate.

○ You have flexibility in the model that you use (and hence the cells that you’ll need to create next). In general, the more cells the better, but you may want fewer cells for simplicity in the writing process and to ensure a decent sample in each cell.

○ Apply your trained model to the post-stratification dataset to make the best estimate of the election result that you can. The specifics will depend on your modelling approach but will likely involve predict(), add_predicted_draws(), or similar. See the examples in the readings. We are primarily interested in the distribution of your forecast of the overall Presidential popular vote, and how the explanatory variables affect this. But great submissions would go beyond that. Also, you’re taking a statistics course, so if you just gave a central estimate and nothing else, then that would not be great.

○ Create beautiful graphs and tables of your model and results.

○ Create wonderful paragraphs talking about and explaining everything.
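To make the modelling steps above concrete, here is a minimal sketch on simulated data: fit a logistic regression, predict a probability for each post-stratification cell, weight by cell size, and get a rough distribution for the estimate. Everything in it is an assumption for illustration. The variable names, cell counts, and effect sizes are made up. The model is an ordinary glm() rather than a multilevel model. A bootstrap of the survey rows stands in for the posterior draws that a Bayesian model fitted with brms::brm() would provide.

```r
set.seed(853)

# Simulated survey (a placeholder until the real data arrive); names are hypothetical.
n_respondents <- 2000
survey <- data.frame(
  age_group = sample(c("18-29", "30-44", "45-59", "60+"), n_respondents, replace = TRUE),
  gender    = sample(c("Female", "Male"), n_respondents, replace = TRUE),
  stringsAsFactors = FALSE
)

# Assumed "true" effects on the log-odds scale, used only to generate fake outcomes.
age_effect <- c("18-29" = 0.6, "30-44" = 0.3, "45-59" = -0.1, "60+" = -0.3)
log_odds <- age_effect[as.character(survey$age_group)] +
  ifelse(survey$gender == "Female", 0.3, -0.2)
survey$vote_biden <- rbinom(n_respondents, size = 1, prob = plogis(log_odds))

# Simulated post-stratification table: one row per cell with a population count n.
cells <- expand.grid(
  age_group = c("18-29", "30-44", "45-59", "60+"),
  gender    = c("Female", "Male"),
  stringsAsFactors = FALSE
)
cells$n <- c(210, 250, 260, 300, 200, 240, 250, 260)

# Fit the model, predict each cell, and weight the predictions by cell size.
poststratify <- function(data, cells) {
  fit <- glm(vote_biden ~ age_group + gender, data = data, family = binomial())
  cell_prob <- predict(fit, newdata = cells, type = "response")
  sum(cell_prob * cells$n) / sum(cells$n)
}

point_estimate <- poststratify(survey, cells)

# Refit on resampled survey rows to get a rough distribution for the estimate.
boot_estimates <- replicate(200, {
  resampled <- survey[sample(nrow(survey), replace = TRUE), ]
  poststratify(resampled, cells)
})
interval <- quantile(boot_estimates, probs = c(0.025, 0.975))

print(round(c(estimate = point_estimate, interval), 3))
```

Because the data are simulated with known effects, the estimate can be compared against the values that generated it, which is a useful check before the real data are plugged in.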

● (Person 4 - start with simulated data/results while waiting) Write up.

○ Using R Markdown, please write a very thorough paper about your analysis and compile it into a PDF.

○ The paper must be well-written, draw on relevant literature, and show your statistical skills by explaining all statistical concepts that you draw on.

○ The paper must have the following sections:

■ title, name/s, and date,

■ abstract and keywords,

■ introduction,

■ data,

■ model,

■ results,

■ discussion, and

■ references.

○ The paper may use appendices for supporting, but not critical, material.

○ The discussion needs to be substantial. For instance, if the paper were 10 pages long then a discussion should be at least 2.5 pages. In the discussion, the paper must include subsections on weaknesses and next steps - but these must be in proportion.
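One possible starting point for the R Markdown file described above is a YAML header along these lines. This is a sketch only: the title, author names, abstract text, and the references.bib file name are placeholders, and the required sections would follow as ordinary headings in the body of the document.

```yaml
---
title: "Placeholder title"
author: "Author One, Author Two"
date: "`r format(Sys.Date(), '%d %B %Y')`"
abstract: "A short summary of the question, data, method, and main result."
output: pdf_document
bibliography: references.bib
---
```

Keeping the date as inline R code means the compiled PDF always shows the day it was last knitted.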

● The report must provide a link to a GitHub repo that contains everything (apart from the raw data that you git ignored because it is not yours to share). The code must be entirely reproducible, documented, and readable. The repo must be well-organised and appropriately use folders and README files.
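A .gitignore that keeps the raw data out of the repo could be as small as the fragment below. The folder name is hypothetical, so adjust the patterns to wherever your group actually stores the downloaded files.

```
# Raw data files that are not ours to share
*.dta
*.dta.gz
inputs/data/raw/
```

The file only stops untracked files from being added. If a data file has already been committed, it has to be removed from the repo's history separately.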

● The graphs and tables must be of an incredibly high standard, well formatted, and report-ready. They should be clean and digestible. Furthermore, you should label and describe each table/figure.

● When you discuss the datasets (in the data section) (remember there will be at least two datasets to discuss) you should make sure to discuss (at least):

○ Their key features, strengths, and weaknesses generally.

○ The survey questionnaire - what is good and bad about it?

○ A discussion of the methodology including how they find people to take the survey; what their population, frame, and sample were; what sampling approach they took and what some of the trade-offs may be; what they do about non-response; the cost.

○ These are just some of the issues strong submissions will consider. Show off your knowledge. If this becomes too detailed then you should push some of this to footnotes or an appendix.

● The dataset section is probably an appropriate place to include an explanation of what post-stratification is (in non-statistical language) and the strengths and weaknesses of it, although this discussion may fit more naturally in another section. Regardless, be sure to justify the inclusion of each explanatory variable.

● When you discuss your model (in the model section), you must be extremely careful to spell out the statistical model that you are using, defining and explaining each aspect and why it is important. (For a Bayesian model, a discussion of priors and regularization is almost always important.) You should mention the software that you used to run the model. You should be clear about model convergence, model checks, and diagnostic issues, although you may push the details of this to an appendix depending on how detailed you get. How do the sampling and survey aspects that you discussed assert themselves in the modelling decisions that you make? How can you convince a reader that you’ve neither overfit nor underfit the data? Again, if it becomes too detailed then push some of the details to footnotes or an appendix.
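As one illustration of what spelling out a model can look like, a logistic regression with a varying intercept for age group, followed by the post-stratification step, could be written as below. The choice of predictors and all of the symbols are assumptions made for this example, not a prescribed specification.

```latex
\begin{align*}
y_i \mid p_i &\sim \text{Bernoulli}(p_i) \\
\text{logit}(p_i) &= \beta_0 + \beta_1 \, \text{male}_i + \alpha_{a[i]} \\
\alpha_a &\sim \text{Normal}(0, \sigma_{\alpha}^2), \quad a = 1, \dots, A \\
\hat{y}^{PS} &= \frac{\sum_{j} N_j \, \hat{p}_j}{\sum_{j} N_j}
\end{align*}
```

Here $y_i$ is 1 if respondent $i$ supports Biden and 0 if they support Trump, $a[i]$ is the age group of respondent $i$, $\hat{p}_j$ is the model's estimated probability for cell $j$, and $N_j$ is the number of people in that cell according to the post-stratification data. A Bayesian version would also state the priors on $\beta_0$, $\beta_1$, and $\sigma_{\alpha}$.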

● You should present model results, graphs, figures, etc., in the results section. This section should strictly relay results. It must include text explaining all of these and summary statistics and similar. However, interpretation of these results and conclusions drawn from the results should be left for the discussion section.

● Your discussion should focus on your model results, but this time interpreting them, and explaining what they mean. Put them in context. What do we learn about the world having understood your model and its results? What caveats could apply? To what extent does your model represent the small world and the large world (to use the language of McElreath, Ch 2)? What are some weaknesses and opportunities for future work? Who is going to win the election? How confident are you in that forecast? Do you have a small or large distribution? What could that mean? Are you more confident in certain states? Do certain explanatory variables carry more weight than others? Etc.

● Check that you have referenced everything. Strong submissions will draw on related literature in the discussion (and other sections) and would be sure to also reference those. The style of references does not matter, but it must be consistent.

● If you don’t cite R then you will get zero for this problem set.
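R can supply its own citation. The short sketch below prints it and converts it to BibTeX so that it can be pasted into a bibliography file. The same citation() function also accepts a package name, so any packages you rely on (assuming they are installed) can be cited the same way.

```r
# Print the recommended citation for R itself.
r_citation <- citation()
print(r_citation)

# Convert it to BibTeX for a .bib file.
r_bibtex <- toBibtex(r_citation)
print(r_bibtex)
```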

● As a team, via Quercus, submit a PDF of your paper. Again, in your paper you must have a link to the associated GitHub repo. And you must include the R Markdown file that produced the PDF in that repo.

● The R Markdown file must exactly produce the PDF. Don’t edit it manually ex post - that isn’t reproducible.

● A good way to work as a team would be to split up the work, so that one person is doing each section. The people doing the sections that rely on data (such as the analysis and the graphs) could just simulate it while they are waiting for the person putting together the data to finish. We have recommended a split above, but you do what works for you.

● It is expected that your submission be well written and able to be understood by the average reader of say 538. This means that you are allowed to use mathematical notation, but you must be able to explain it all in plain English. Similarly, you can (and hint: you should) use survey, sampling, observational, and statistical terminology, but again you need to explain it. The average person doesn’t know what a p-value is nor what a confidence interval is. You need to explain all of this in plain language the first time you use it. Your work should have flow and should be easy to follow and understand. To communicate well, anyone at the university level should be able to read your report once and relay back the methodology, overall results, findings, weaknesses and next steps without confusion.

● It is recommended that you (informally) proofread one another’s work - why not exchange papers with another group?

● Everyone in the team receives the same mark.

● There should be no evidence that this is a class assignment.

● Again, no extensions are possible, for obvious reasons. The submission portal will close soon after midnight.
