Project Proposal 1 - Food World Cup

2021. 9. 9. 09:27

Title : Preference of world cuisine survey

Topic : Exploratory Data Analysis
DataSet : food-world-cup-data.csv
Reference : https://fivethirtyeight.com/features/the-fivethirtyeight-international-food-associations-2014-world-cup/

CSV Ref. : https://github.com/fivethirtyeight/data/tree/master/food-world-cup

CSV README file:

1. The README lists each rating value together with its description.

# Food World Cup
This folder contains data behind the stories:
  * [The FiveThirtyEight International Food Association’s 2014 World Cup](https://fivethirtyeight.com/features/the-fivethirtyeight-international-food-associations-2014-world-cup/)
  * [What is Americans’ Favorite Global Cuisine?](https://fivethirtyeight.com/features/what-is-americans-favorite-global-cuisine/)

Answer key for the responses to the "Please rate how much you like the traditional cuisine of X:" questions.
Value | Description
--|--------------
5 | I love this country's traditional cuisine. I think it's one of the best in the world.
4 | I like this country's traditional cuisine. I think it's considerably above average.
3 | I'm OK with this country's traditional cuisine. I think it's about average.
2 | I dislike this country's traditional cuisine. I think it's considerably below average.
1 | I hate this country's traditional cuisine. I think it's one of the worst in the world.
N/A | I'm unfamiliar with this country's traditional cuisine.

 

Project Guide

The final project for the course is a technical blog post related to a data analysis project you will work on piecemeal over the course of the semester. The first task is to identify the dataset, understand the data and write questions you are planning to answer using that dataset. You may pick a data set from one of the resources mentioned on this webpage.  The proposal should meet the following criteria:

1. Perform checks to determine quality of the data
(missing values, outliers, etc.)

  • First, I checked the CSV for missing values.
    • With the default read.csv settings, is.na reported no missing values (there is no literal "NA" in the file).
    • On a second look, the file contains empty strings ("") and "N/A" entries.
    • According to the README, however, "N/A" is a valid answer: "I'm unfamiliar with this country's traditional cuisine."
    • I therefore reloaded the file with na.strings so that empty strings are read as NA. After that, is.na detects them.
    • ** Open question: which values are truly missing, and how should "N/A", as defined in the README, be treated in the analysis?
  • Outliers
    • There are no outliers, because this is a preference survey and the set of possible answers is fixed.
# Libraries
library(tidyverse)
library(gridExtra)
library(ggplot2)   # already attached by tidyverse; kept for clarity
library(dplyr)     # already attached by tidyverse; kept for clarity
library(xts)
library(PerformanceAnalytics)

# Read the CSV.
# check.names = FALSE keeps the spaces in the header instead of replacing them with dots.
indata <- read.csv("food-world-cup-data.csv", fileEncoding = "Latin1", check.names = FALSE, header = TRUE)
is.data.frame(indata)

# Remove the stray "Ê" characters from every column
indata <- sapply(indata, function(col) gsub("Ê", "", col))
indata <- as_tibble(indata)
is_tibble(indata)

# Check for missing values (total, then per column)
sum(is.na(indata))
colSums(is.na(indata))

# Reload, this time reading empty and blank strings as NA.
# "N/A" is not listed on purpose: the README defines it as a valid answer.
indata <- read.csv("food-world-cup-data.csv", fileEncoding = "Latin1", check.names = FALSE, header = TRUE, na.strings = c("", " ", "NA"))
indata <- as_tibble(sapply(indata, function(col) gsub("Ê", "", col)))
sum(is.na(indata))
# Drop every row that contains at least one NA
indata <- na.omit(indata)
sum(is.na(indata))
is_tibble(indata)

# I found NA values, but there is no clear criterion yet for removing rows.
# I think the columns from the food questions through gender are the important data.
# A row with no food-preference answers at all needs to be removed.
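The removal rule in the comments above, together with the fixed-answer argument for outliers, can be sketched as follows. This is only a minimal illustration in base R: the `toy` data frame, its column names, and its values are hypothetical stand-ins for the real CSV, and treating "N/A" as an answered value is an assumption based on the README.

```r
# Toy stand-in for the survey: two rating columns plus one demographic column.
# Names and values are hypothetical, not taken from the real CSV.
toy <- data.frame(
  Italy      = c("5", "N/A", NA, "3"),
  SouthKorea = c("4", "2",   NA, "N/A"),
  Gender     = c("Male", "Female", "Female", NA),
  stringsAsFactors = FALSE
)
rating_cols <- c("Italy", "SouthKorea")

# Outlier check: every non-missing answer must be one of the README values.
allowed    <- c("1", "2", "3", "4", "5", "N/A")
answers    <- unlist(toy[rating_cols])
unexpected <- setdiff(answers[!is.na(answers)], allowed)

# Keep a respondent if at least one rating column was answered.
# "N/A" counts as an answer ("unfamiliar"); only a true NA is missing.
answered <- rowSums(!is.na(toy[rating_cols])) > 0
kept     <- toy[answered, ]
```

Unlike na.omit, this keeps respondents who skipped only a demographic question, so it removes far fewer rows. Whether that is the right trade-off depends on which columns the later analysis needs.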


2. Proposal on what questions you are interested in answering from the data.

  • 1. How much do the ratings vary from one respondent to another?
  • 2. Which variables are correlated with preference for a traditional cuisine?
  • 3. According to the dataset, which factors matter most for preferring a traditional cuisine?
  • 4. How do the non-preference variables (education, income, gender, age) covary?

3. Initial visualizations and, if required, transformations to get the data ready.

  • 1. Rename the columns in the dataset for the visualizations.
  • 2. The visualizations below use the ratings given to South Korea.
  • ** The sorting and selection criteria for my goal and analysis are still to be defined. For now the plots only relate the ratings to a few personal-information variables. **
        
#ggplot(indata, aes(Age)) + geom_bar()
#ggplot(indata, aes(Education)) + geom_bar()

# Short column names: ID, two screening questions, 40 countries, 5 demographics
names(indata) <- c(
  "ID", "Level", "Interest",
  "Algeria", "Argentina", "Australia", "Belgium", "Bosnia", "Brazil", "Cameroon", "Chile",
  "Colombia", "CostaRica", "Croatia", "Ecuador", "England", "France", "Germany", "Ghana",
  "Greece", "Honduras", "Iran", "Italy", "IvoryCoast", "Japan", "Mexico", "Netherlands",
  "Nigeria", "Portugal", "Russia", "SouthKorea", "Spain", "Switzerland", "UnitedStates",
  "Uruguay", "China", "India", "Thailand", "Turkey", "Cuba", "Ethiopia", "Vietnam", "Ireland",
  "Gender", "Age", "Income", "Education", "Location"
)

# Share of each group within every South Korea rating
l <- ggplot(indata, aes(SouthKorea)) + geom_bar(aes(fill = Level), position = "fill")
i <- ggplot(indata, aes(SouthKorea)) + geom_bar(aes(fill = Income), position = "fill")
e <- ggplot(indata, aes(SouthKorea)) + geom_bar(aes(fill = Education), position = "fill")
a <- ggplot(indata, aes(SouthKorea)) + geom_bar(aes(fill = Age), position = "fill")

grid.arrange(l, i, e, a, ncol = 2, nrow = 2)
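The bars drawn with position = "fill" show proportions within each rating, and the same numbers can be computed directly, which may help when checking the plots. The sketch below is a base R illustration on a hypothetical `toy` data frame; the age labels are assumed for the example and are not read from the real CSV.

```r
# Hypothetical South Korea ratings with an age group per respondent.
toy <- data.frame(
  SouthKorea = c("5", "5", "3", "3", "N/A", "5"),
  Age        = c("18-29", "30-44", "18-29", "18-29", "30-44", "18-29"),
  stringsAsFactors = FALSE
)

# Cross-tabulate rating (rows) against age group (columns).
counts <- table(toy$SouthKorea, toy$Age)

# Row-wise proportions: each rating's row sums to 1, matching the
# height of one filled bar in the plot.
shares <- prop.table(counts, margin = 1)
```

Here `shares["5", "18-29"]` is the fraction of respondents who answered 5 and fall in the 18-29 group, i.e. the size of one colored segment in the Age panel.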

 

Visualization


A good reference for ideas on questions and EDA in general: https://r4ds.had.co.nz/exploratory-data-analysis.html#questions
