R Toolbox Assignment

The purpose of this project is to tie together all the concepts that you have learned throughout
this course in order to create a portfolio piece that you can talk about during interviews and
career fairs.

Your submission must be in Shiny or R Markdown, alongside all the resources we need to
run and test your code (e.g. data, code, documentation, links). Your submission must contain
the entire workflow of your code, from importing, manipulating and analyzing data through
visualizing and communicating results. We expect to be able to run your code
without any obvious errors. Please submit your files on GitHub by April 29 at 11:59pm. The link
to GitHub will be shared later in the semester.

Requirements
The question or problem that you want to solve is open ended. You are welcome to discuss it
with your instructional team, but it is up to you to decide what you want to pursue. You will
select your data, analyze it and create the visualizations you think will best convey the message
that you want your audience to see. The first and most important step is to identify and
describe the problem statement. Everything you do afterwards, from analysis to
visualizations, should support the answer to your problem statement. Below is an outline with
guidelines for how to develop the toolbox.

Guidelines

1. Identify and state the problem or question that you are seeking to solve or answer
a. Make sure that your question/problem is quantifiable and can be answered with
the data that you select.
b. Outline the goals and criteria for your success when completing your analysis.
Define key metrics and views of data and how these tie to your response.
2. Choose a unique dataset(s) that interests you and can answer your question
a. This dataset must have breadth and depth. In this context, depth refers to the
level of detail in the data and breadth refers to the number of dimensions that
you can explore with your data.
b. Here are some good data sources:
i. Kevin Chai’s Dataset List: http://kevinchai.net/datasets
ii. Kaggle: https://www.kaggle.com/datasets
iii. Federal Government Data
iv. Smart Data Collective list: http://www.smartdatacollective.com/bernardmarr/235366/big-data-20-free-big-data-sources-everyone-should-know
v. KDNuggets: http://www.kdnuggets.com/datasets/index.html
vi. Healthcare Data: http://www.healthdata.gov/
vii. Census Data: http://www.census.gov/data.html
viii. Socrata: https://opendata.socrata.com/browse?limitTo=datasets&utf8=%E2%9C%93

3. Import, prepare and clean your dataset for analysis.
a. Document and mitigate any outliers, missing data, or incorrectly recorded data.
Assess their impact on your data and explain how you mitigated this impact (e.g. what
values you chose to replace the missing values with, and why).
b. Thoroughly comment your code so that we can easily follow every step of what
you are trying to do and the logic/reason behind it.

4. Analyze your data set
a. Create segments of data that enable you to focus on portions of your answer
(these segments will eventually lead to a visualization that will answer your
question).
b. Summarize each segment into a simple tabulation of data that brings out the key
insights.

5. Visualize your data
a. Visualize these tabulations in a way that brings out the message that you are
trying to get across. Make sure that this message ties closely to, and supports,
your problem and its answer.
b. Format all your visualizations appropriately: use color/shape/size
aesthetics to differentiate groupings in your plots, include a legend if one is needed to
understand your visual, label your axes, and title your plots.

6. Put everything together in a Markdown PDF/HTML document or a Shiny application
a. Create 3-5 visualizations that can either each be included as their own page in a
Shiny Dashboard (https://rstudio.github.io/shinydashboard/) or in R Markdown.
b. If you decide to go for Shiny: use Shiny controls for each visual in some
form, such as choosing your axis values, filtering values, selecting values, selecting graph
geometry, or any interaction of similar or greater complexity.
c. Bonus: Create a Shiny application in R Markdown.
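
Guideline 3.a asks you to document outliers and missing data, mitigate them, and assess the impact. The base R sketch below shows one way this could look. The `income` vector is made-up data, the 1.5 * IQR rule is only one common convention for flagging outliers, and median imputation is only one possible replacement strategy. Whatever you choose, state it and justify it in your comments.

```r
# Hypothetical numeric column with one missing value and one suspicious entry.
income <- c(42, 38, 51, NA, 47, 40, 450, 44)

# Document: count missing values so the report can state how many rows are affected.
n_missing <- sum(is.na(income))

# Document: flag outliers with the 1.5 * IQR rule (an assumption, not the only option).
q <- unname(quantile(income, c(0.25, 0.75), na.rm = TRUE))
fence <- 1.5 * (q[2] - q[1])
is_outlier <- !is.na(income) & (income < q[1] - fence | income > q[2] + fence)

# Mitigate: treat flagged values as missing, then impute every missing value
# with the median of the remaining observations.
income_clean <- income
income_clean[is_outlier] <- NA
fill_value <- median(income_clean, na.rm = TRUE)
income_clean[is.na(income_clean)] <- fill_value

# Assess impact: compare the mean before and after cleaning.
impact <- c(
  mean_before = mean(income, na.rm = TRUE),
  mean_after  = mean(income_clean)
)
print(impact)
```

Keeping the counts (`n_missing`, `sum(is_outlier)`) and the before/after comparison in the output makes it easy to report exactly what was changed and why.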

Example
1.a. You are trying to understand the determinants of wages in the state of New York.
1.b. Some key drivers of wages can be location (zip code), age and educational attainment.
2.a. Census data will provide great breadth (hundreds or thousands of columns) and depth (all
of these columns can be drilled down from the state level all the way down to the county and
sometimes zip code).
3.a. Carefully and thoroughly review the Census data documentation so that you understand
how the data were gathered, how missing values are treated, and the availability of your data.
Download the necessary data from the Census website and use a package like readr to import it as
a tibble. You decide to use tidyr to turn your tibble into tidy data.
3.b. Some missing values are given a value of ‘99999998’ so you decide to write code to convert
these into NA and comment such code to make sure we can follow your logic and process.
4.a/b. Use dplyr to create buckets of wages broken out by age brackets to better understand
this variation across ages.
5.a. You visualize the data in 4.a using a bar chart.
6.a. The output of 5.a is in your final R Markdown or Shiny application.
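
Steps 3.b through 5.a of this example could be sketched roughly as follows, assuming dplyr and ggplot2 are installed. The `wages` tibble, its column names, and the age bracket boundaries are invented for illustration and are not taken from any real Census file. Only the missing-value code 99999998 comes from the example above.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for a Census extract (values are made up).
wages <- tibble(
  age  = c(22, 27, 34, 41, 48, 55, 63, 30),
  wage = c(28000, 41000, 99999998, 67000, 72000, 99999998, 58000, 52000)
)

# 3.b: 99999998 is a missing-value code, not a real wage, so convert it to NA.
wages_clean <- wages %>%
  mutate(wage = na_if(wage, 99999998))

# 4.a/b: bucket ages into brackets (boundaries are an assumption), then
# summarise wages per bracket into a simple tabulation.
wage_by_age <- wages_clean %>%
  mutate(age_bracket = cut(age,
                           breaks = c(18, 30, 45, 65),
                           labels = c("18-29", "30-44", "45-64"),
                           right = FALSE)) %>%
  group_by(age_bracket) %>%
  summarise(
    n_reported = sum(!is.na(wage)),        # rows with a usable wage
    mean_wage  = mean(wage, na.rm = TRUE), # average over reported wages only
    .groups = "drop"
  )

# 5.a: bar chart of the tabulation, with a title and labelled axes.
wage_plot <- ggplot(wage_by_age, aes(x = age_bracket, y = mean_wage)) +
  geom_col() +
  labs(title = "Mean wage by age bracket (toy data)",
       x = "Age bracket", y = "Mean wage (USD)")
```

Printing `wage_plot` in an R Markdown chunk, or returning it from `renderPlot()` in a Shiny server function, would cover step 6.a.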

Grading
Your submission will be graded on a scale from 1 to 3 points per section.

1. Code workflow (50% of final grade)
a. 3 points – your code runs without error (warnings and other messages are okay as
long as they are not fatal to runtime)
b. 3 points – your code exactly replicates the output in your Shiny application or
R Markdown submission.
c. 3 points – organization and workflow. The way your code is organized makes
sense. For example, we would not be able to manipulate data before importing it
or to visualize it before cleaning out missing values or outliers.
d. 3 points – your code is thoroughly commented so that we can follow everything
that you are trying to do in your script
e. 5 points – efficient use of R programming skills (e.g. leveraging functional
programming to work with repetitive tasks, using dplyr for creating segments of
data or tidyr to tidy data)

2. Analysis (20% of final grade)
a. 3 points – analysis questions and research plan are sufficiently defined as to be
answerable using data (e.g. asking "How much did revenue grow this year?" rather
than "How is the company doing?")
b. 3 points – the data you selected can appropriately answer the questions you set
out to explore
c. 3 points – appropriate use of existing metrics and creation of new metrics to
explore your data (e.g. year-over-year growth, composite metrics such as weighted
averages, dummy variables, etc.)
d. 3 points – appropriate and efficient segmentation of data to address the
questions in 2.a
e. BONUS: 5 points – create a slide deck that goes through your work as if you
were to present to a professional audience.

3. Visualization and communication (30% of final grade)
a. 3 points – creation of an appropriate visualization for your objectives in 2.a
b. 3 points – appropriate formatting of visualizations as outlined in Requirement #5
c. 3 points – adequate Shiny interactivity as outlined in Requirement #6
d. BONUS: 5 points – create a Shiny application within an R Markdown document
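
As a rough illustration of the new metrics mentioned in criterion 2.c (year-over-year growth, a weighted average, and a dummy variable), here is a minimal base R sketch. The `sales` data frame and all of its values are hypothetical and exist only to make the arithmetic easy to follow.

```r
# Hypothetical yearly figures (made up for illustration).
sales <- data.frame(
  year    = c(2019, 2020, 2021),
  revenue = c(100, 120, 150),
  units   = c(10, 20, 30),
  price   = c(10, 6, 5)
)

# Year-over-year growth: change relative to the prior year.
# The first year has no prior year, so its growth is NA.
sales$yoy_growth <- c(NA, diff(sales$revenue) / head(sales$revenue, -1))

# Composite metric: average price weighted by units sold.
avg_price_weighted <- weighted.mean(sales$price, w = sales$units)

# Dummy variable: 1 for years from 2020 onward, 0 otherwise.
sales$post_2020 <- as.integer(sales$year >= 2020)

print(sales)
print(avg_price_weighted)
```

Note that the weighted average (370 / 60) differs from the simple mean of the three prices (7), which is the reason to weight by volume when group sizes differ.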

Please note that although you are encouraged to use data, work and code that you have done
before, this work MUST BE original and not previously submitted to another class.
