Title : Preference of world cuisine survey
Topic : Exploratory Data Analysis
DataSet : food-world-cup-data.csv
Reference : https://fivethirtyeight.com/features/the-fivethirtyeight-international-food-associations-2014-world-cup/
CSV Ref. : https://github.com/fivethirtyeight/data/tree/master/food-world-cup
CSV Readme File :
1. The Readme file explains each rating value and its description.
# Food World Cup
This folder contains data behind the stories:
* [The FiveThirtyEight International Food Association’s 2014 World Cup](https://fivethirtyeight.com/features/the-fivethirtyeight-international-food-associations-2014-world-cup/)
* [What is Americans’ Favorite Global Cuisine?](https://fivethirtyeight.com/features/what-is-americans-favorite-global-cuisine/)
Answer key for the responses to the "Please rate how much you like the traditional cuisine of X:" questions.
Value | Description
--|--------------
5 | I love this country's traditional cuisine. I think it's one of the best in the world.
4 | I like this country's traditional cuisine. I think it's considerably above average.
3 | I'm OK with this country's traditional cuisine. I think it's about average.
2 | I dislike this country's traditional cuisine. I think it's considerably below average.
1 | I hate this country's traditional cuisine. I think it's one of the worst in the world.
N/A | I'm unfamiliar with this country's traditional cuisine.
Project Guide
The final project for the course is a technical blog post related to a data analysis project you will work on piecemeal over the course of the semester. The first task is to identify the dataset, understand the data and write questions you are planning to answer using that dataset. You may pick a data set from one of the resources mentioned on this webpage. The proposal should meet the following criteria:
1. Perform checks to determine quality of the data
(missing values, outliers, etc.)
- First, I checked for missing values in the CSV.
- On the first read, is.na found no missing values, since nothing was stored as "NA".
- A second look showed that the file uses empty strings ("") and "N/A" where a rating is absent.
- However, the Readme file defines N/A as a real answer: "I'm unfamiliar with this country's traditional cuisine."
- I then reloaded the file with na.strings so that blank cells are converted to NA, after which is.na could detect them.
- ** Open question: which values are truly missing data, and how should the Readme's meaning of N/A be handled in the dataset?
- Outliers
- There are no outliers, because this is a preference survey:
the answers in the dataset are limited to a fixed set of choices.
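The two checks above can be tried on a small inline CSV. This is only a sketch: the rows and the two country columns are made up for illustration, and only the "N/A" convention comes from the Readme. It keeps "N/A" as a real answer, counts blank cells as missing, and confirms that every remaining value is one of the fixed choices.

```r
# Toy CSV (hypothetical rows): "N/A" = unfamiliar, blank = no answer given.
csv_text <- "ID,Italy,Japan
1,5,N/A
2,,3
3,4,
4,N/A,2"

# Treat only blank cells as missing; "N/A" stays a real answer.
toy <- read.csv(text = csv_text, na.strings = c("", " "),
                colClasses = "character")

# Truly missing answers per column.
missing_per_col <- colSums(is.na(toy))

# "Unfamiliar" answers per column (na.rm skips the missing cells).
unfamiliar_per_col <- colSums(toy == "N/A", na.rm = TRUE)

# Outlier check: every non-missing rating must be one of the fixed choices.
allowed <- c("1", "2", "3", "4", "5", "N/A")
ratings <- unlist(toy[c("Italy", "Japan")])
all_valid <- all(ratings[!is.na(ratings)] %in% allowed)
```

Keeping the two counts separate makes it possible to decide later whether "unfamiliar" should be analyzed as its own category or excluded.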
#library
library(tidyverse)
library(gridExtra)
library(ggplot2)
library(dplyr)
library(xts)
library(PerformanceAnalytics)
# Read CSV
# check.names = FALSE keeps the spaces in the header instead of replacing them with dots.
indata <- read.csv("food-world-cup-data.csv", fileEncoding = "Latin1", check.names = FALSE, header = TRUE)
is.data.frame(indata)
# Remove the stray "Ê" characters (this turns every column into character)
indata <- sapply(indata, function(column) gsub("Ê", "", column))
indata <- as_tibble(indata)
is_tibble(indata)
# Checking missing values
sum(is.na(indata))
colSums(is.na(indata))
# Reload, this time treating blank cells as NA
indata <- read.csv("food-world-cup-data.csv", fileEncoding = "Latin1", check.names = FALSE, header = TRUE, na.strings = c("", " ", "NA"))
indata <- as_tibble(sapply(indata, function(column) gsub("Ê", "", column)))
sum(is.na(indata))
# Drop every row that contains at least one NA
indata <- na.omit(indata)
sum(is.na(indata))
is_tibble(indata)
# NA values are present, but there is no clear rule yet for which rows to remove.
# The columns from the food questions through gender are probably the important ones.
# A row with no food preference answers at all needs to be removed.
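The rule in the last comment can be sketched without na.omit, which removes any row that has even one NA. The toy data frame and the rating_cols vector below are hypothetical; on the real data the rating columns would be the country columns.

```r
# Toy data (hypothetical): two rating columns plus one demographic column.
toy <- data.frame(
  ID     = 1:4,
  Italy  = c("5", NA, NA, "N/A"),
  Japan  = c("3", NA, "4", NA),
  Gender = c("Female", "Male", NA, "Male"),
  stringsAsFactors = FALSE
)

rating_cols <- c("Italy", "Japan")

# Keep a respondent if at least one rating question was answered.
answered_any <- rowSums(!is.na(toy[rating_cols])) > 0
kept <- toy[answered_any, ]

# For comparison: na.omit() drops every row that has any NA at all.
strict <- na.omit(toy)
```

On this toy input the row-wise rule keeps three of the four respondents, while na.omit keeps only the one with no NA anywhere.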
2. Proposal on what questions you are interested in answering from the data.
- 1. How much do the answers vary from person to person?
- 2. What is correlated with traditional cuisine preference?
- 3. According to the dataset, which factors matter most for preferring a traditional cuisine?
- 4. How do the variables other than preference (education, income, gender, age) covary?
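One possible starting point for the correlation question, assuming it is read as "how do the ratings of different cuisines move together", is a rank correlation between rating columns. The toy values and the to_rating helper below are hypothetical and not part of the original analysis.

```r
# Toy ratings (hypothetical), stored as text the way the CSV is read in.
toy <- data.frame(
  Italy  = c("5", "4", "N/A", "3", "5"),
  Japan  = c("4", "4", "2", "N/A", "5"),
  Mexico = c("2", "3", "4", "4", "1"),
  stringsAsFactors = FALSE
)

# Turn "N/A" (unfamiliar) into NA, then convert the remaining text to numbers.
to_rating <- function(x) {
  x[x %in% "N/A"] <- NA
  as.numeric(x)
}
ratings <- as.data.frame(lapply(toy, to_rating))

# Spearman rank correlation suits ordinal 1-5 scores;
# pairwise deletion uses every pair of answers that is available.
rating_cor <- cor(ratings, method = "spearman", use = "pairwise.complete.obs")
```

Spearman is used here instead of Pearson because the ratings are ordered categories rather than measurements on an interval scale.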
3. Initial visualizations and if required transform to get the data ready.
- 1. Rename the columns in the dataset for the visualizations.
- 2. The visualizations below use the South Korea ratings.
- **Selection criteria for my goal and analysis: for now, only the relationships between one rating and a few personal-information columns are shown.**
#ggplot(indata, aes(Age)) + geom_bar()
#ggplot(indata, aes(Education)) + geom_bar()
# Short column names: ID, two knowledge/interest questions, 40 countries, 5 demographics
names(indata) <- c("ID", "Level", "Interest",
  "Algeria", "Argentina", "Australia", "Belgium", "Bosnia", "Brazil", "Cameroon", "Chile",
  "Colombia", "CostaRica", "Croatia", "Ecuador", "England", "France", "Germany", "Ghana",
  "Greece", "Honduras", "Iran", "Italy", "IvoryCoast", "Japan", "Mexico", "Netherlands",
  "Nigeria", "Portugal", "Russia", "SouthKorea", "Spain", "Switzerland", "UnitedStates",
  "Uruguay", "China", "India", "Thailand", "Turkey", "Cuba", "Ethiopia", "Vietnam", "Ireland",
  "Gender", "Age", "Income", "Education", "Location")
# One filled bar chart per personal-information column, split by the South Korea rating
l <- ggplot(indata, aes(SouthKorea)) + geom_bar(aes(fill = Level), position = "fill")
i <- ggplot(indata, aes(SouthKorea)) + geom_bar(aes(fill = Income), position = "fill")
e <- ggplot(indata, aes(SouthKorea)) + geom_bar(aes(fill = Education), position = "fill")
a <- ggplot(indata, aes(SouthKorea)) + geom_bar(aes(fill = Age), position = "fill")
grid.arrange(l, i, e, a, ncol = 2, nrow = 2)
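With position = "fill", each bar is rescaled to height 1, so the plots show the share of each group within a rating rather than raw counts. The numbers behind one such plot can be inspected with a proportion table; the toy rows below are hypothetical.

```r
# Toy data (hypothetical): one rating column and one grouping column.
toy <- data.frame(
  SouthKorea = c("5", "5", "3", "3", "3", "N/A"),
  Level      = c("Advanced", "Novice", "Novice", "Novice", "Intermediate", "Novice"),
  stringsAsFactors = FALSE
)

# Counts of each knowledge level within each rating.
counts <- table(toy$SouthKorea, toy$Level)

# Row-wise proportions: each rating's row sums to 1, like a filled bar.
shares <- prop.table(counts, margin = 1)
```

Checking counts alongside shares is useful because a filled bar built from very few respondents looks just as tall as one built from many.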
A good reference for ideas on questions and EDA in general: https://r4ds.had.co.nz/exploratory-data-analysis.html#questions