COVID 19 Impact on the Airport → Data Analysis in R
Download the data from here.
Data
head(data)
dim(data)
> 7247,11 #not good head visualization
Types of variables
str(data)

Important Variables (common sense)
Continuous Variables
- PercentofBaseline
Categorical Variables
- Date
- City (inside State)
- State (inside Country)
- Country (We will focus on this important Categorical Factor)
- Airport Name (inside City)
Observations
This is time-series data.
We don’t know if we have information about all the airports, cities of the four countries. Thus, the movement of PecentofBaseline only matters along time in each of the different airports, cities, and countries given.
Exploratory Data Analysis
Univariate
Kernel Density of PercentofBaseline
ggplot(data, aes(x = PercentOfBaseline, fill = "red"))+geom_density(alpha = 0.2)

Bivariate
Continuous vs Categorical
Kernel Density of PercentofBaseline wrt Country
ggplot(data, aes(x = PercentOfBaseline, fill = Country))+geom_density(alpha = 0.2)

Observations and Plans:
- All of them have at least two modes.
- Canada behaves differently than others. Why?
- Australia has two distinct modes (normal mixtures). Why?
- We can estimate the MLE means of this normal mixture and test if this fits a normal mixture well.
Exercise: You can do it with respect to each of the airports, cities, and states.
PercentofBaseline vs Time (country-wise)
ggplot(data, aes(x = Date, y = PercentOfBaseline, group = Country$, colour = Country)) + geom_line() +facet_wrap(~ Country)

Understanding
- Australia has a dip in the Percent of Baseline over time (due to better government policies).
- Canada has all throughout high Percent of Baseline, which results in different behavior.
- Chile has an increasing Percent of Baseline.
- The United States overall has a fluctuating and high Percent of Baseline.
Exercise: You can do it with respect to each of the airports, cities, and states.
Modeling
Chile
lyr)
data_chile = data %>% filter(Country == "Chile")
chile = data_chile[,c(2,5)]
head(chile)

Convert into time-series data structure*
- 1,2,3,4,… as time index with no NA values.
library(zoo)
z <- read.zoo(chile, format = "%Y-%m-%d")#zoo series for dates
time(z) <- seq_along(time(z))#sequential data as 1,2,3,4,...
ts_chile = as.ts(z) #conversion into time series data structure
head(ts_chile)
- Date as a time index with NA values to be removed.
library(zoo)
library(tseries)#for removing na from ts()
z <- read.zoo(chile, format = "%Y-%m-%d")#zoo series for dates
ts_chile = as.ts(z)
ts_chile = na.remove(ts_chile)#remove na from time series
head(ts_chile)
If you convert here, like this it will create weird numbers. I prefer to do it as 1,2,3,4,… time index.
Plot the time series
plot(ts_chile)

Two parts of the Time Series, we'll discuss are
- Decomposition (Trend + Seasonal + Error)
- Forecasting
Decomposition
decompose(ts_chile)

Forecast
Model (ARIMA model)
library(forecast)
model = auto.arima(ts_chile)
model

ACF Plot
acf(model$residuals, main = "Correlogram")

PACF Plot
pacf(model$residuals, main = "Partial Correlogram")

Ljung Box Test
The test determines whether or not errors are iid (i.e. white noise) or whether there is something more behind them; whether or not the autocorrelations for the errors or residuals are non zero.
Box.test(model$residuals, type = "Ljung-Box")

We do not reject the hypothesis, that the model shows a good fit.
Normality of Residuals
hist(model$residuals, freq = FALSE)
lines(density(model$residuals))

Forecast
forecast = forecast(model,4)#4 = number of units you want to
library(ggplot2)
autoplot(forecast) #plot of the model

accuracy(forecast) #performance of the model

forecast #forecast values

Exercise: Apply it to other countries as well → as Australia, the USA, Canada.
All the best.