# COVID 19 Impact on the Airport → Data Analysis in R

Download the data from here.

**Data**

`head(data) `

dim(data)

> 7247,11 #not good head visualization

**Types of variables**

`str(data)`

# Important Variables (common sense)

## Continuous Variables

**PercentofBaseline**

## Categorical Variables

**Date**- City (inside State)
- State (inside Country)
**Country (We will focus on this important Categorical Factor)**- Airport Name (inside City)

# Observations

## This is time-series data.

We don’t know if we have information about all the airports, cities of the four countries. Thus, the movement of PecentofBaseline only matters along time in each of the different airports, cities, and countries given.

# Exploratory Data Analysis

# Univariate

**Kernel Density of PercentofBaseline**

`ggplot(data, aes(x = PercentOfBaseline, fill = "red"))+geom_density(alpha = 0.2)`

# Bivariate

## Continuous vs Categorical

**Kernel Density of PercentofBaseline wrt Country**

`ggplot(data, aes(x = PercentOfBaseline, fill = Country))+geom_density(alpha = 0.2)`

**Observations and Plans**:

- All of them have at least two modes.
- Canada behaves differently than others. Why?
- Australia has two distinct modes (normal mixtures). Why?
- We can estimate the MLE means of this normal mixture and test if this fits a normal mixture well.

**Exercise**: You can do it with respect to each of the airports, cities, and states.

# PercentofBaseline vs Time (country-wise)

`ggplot(data, aes(x = Date, y = PercentOfBaseline, group = Country$, colour = Country)) + geom_line() +facet_wrap(~ Country)`

# Understanding

- Australia has a dip in the Percent of Baseline over time (due to better government policies).
- Canada has all throughout high Percent of Baseline, which results in different behavior.
- Chile has an increasing Percent of Baseline.
- The United States overall has a fluctuating and high Percent of Baseline.

**Exercise**: You can do it with respect to each of the airports, cities, and states.

# Modeling

# Chile

`lyr)`

data_chile = data %>% filter(Country == "Chile")

chile = data_chile[,c(2,5)]

head(chile)

# Convert into time-series data structure*

- 1,2,3,4,… as time index with no NA values.

`library(zoo)`

z <- read.zoo(chile, format = "%Y-%m-%d")#zoo series for dates

time(z) <- seq_along(time(z))#sequential data as 1,2,3,4,...

ts_chile = as.ts(z) #conversion into time series data structure

head(ts_chile)

- Date as a time index with NA values to be removed.

`library(zoo)`

library(tseries)#for removing na from ts()

z <- read.zoo(chile, format = "%Y-%m-%d")#zoo series for dates

ts_chile = as.ts(z)

ts_chile = na.remove(ts_chile)#remove na from time series

head(ts_chile)

If you convert here, like this it will create weird numbers. I prefer to do it as 1,2,3,4,… time index.

# Plot the time series

`plot(ts_chile)`

Two parts of the Time Series, we'll discuss are

- Decomposition (Trend + Seasonal + Error)
- Forecasting

# Decomposition

`decompose(ts_chile)`

# Forecast

## Model (ARIMA model)

`library(forecast)`

model = auto.arima(ts_chile)

model

## ACF Plot

`acf(model$residuals, main = "Correlogram")`

## PACF Plot

`pacf(model$residuals, main = "Partial Correlogram")`

## Ljung Box Test

The test determines whether or not errors are iid (i.e. white noise) or whether there is something more behind them; whether or not the autocorrelations for the errors or residuals are non zero.

`Box.test(model$residuals, type = "Ljung-Box")`

We do not reject the hypothesis, that the model shows a good fit.

## Normality of Residuals

`hist(model$residuals, freq = FALSE)`

lines(density(model$residuals))

## Forecast

`forecast = forecast(model,4)#4 = number of units you want to `

library(ggplot2)

autoplot(forecast) #plot of the model

`accuracy(forecast) #performance of the model`

`forecast #forecast values`

**Exercise**: Apply it to other countries as well → as Australia, the USA, Canada.

All the best.