COVID-19 Forecasting: A Statistical Approach

Background: SARS-coronavirus-2 is a new virus that infects people and causes COVID-19, a disease that has led to a worldwide pandemic. Although some infected people never develop any signs or symptoms of disease, others are at very high risk of severe disease and death. Objective: If we can intervene to prevent even some transmission, we can dramatically reduce the number of cases; this is the public health goal for controlling COVID-19. Methods: This article presents an approach for predicting, with comparative accuracy, the number of new cases and deaths on a particular day so that the predictions can inform preventive measures. The three statistical methods considered for forecasting are Fbprophet, Moving Average and the Autoregressive Integrated Moving Average (ARIMA) algorithm. Results: The results obtained are in line with the past and present trends of the COVID-19 data collected from the WHO website. Conclusion: The output is satisfactory for further consideration.


Introduction
COVID-19 is caused by a coronavirus. Coronaviruses are a large, diverse family of viruses that can only be seen with a powerful microscope. The name refers to a crown ("corona"). There are several different types of coronaviruses; they infect a wide variety of mammals and birds every year, and some also cause mild respiratory disease in humans. Hence, coronaviruses are not new. The virus that causes COVID-19 is called SARS coronavirus 2. This virus originated in bats, and bats are infected with this virus all the time. It developed the ability to jump between species and infect people. It is the third such coronavirus known to have emerged since 2002. All of these coronaviruses emerged in bats, now infect humans and can be transmitted from person to person.
The first of these viruses, SARS coronavirus, emerged in Guangdong, China in 2002 and causes severe acute respiratory syndrome. The next, Middle East Respiratory Syndrome coronavirus (MERS), appeared in the Middle East in 2012 and still causes human infections and minor outbreaks. Most recently, SARS coronavirus 2 appeared in Wuhan, China, at the end of 2019; it was given this name because it is similar to the first SARS coronavirus. Viruses must live inside other cells to survive, so they replicate in those cells and begin to kill other cells in the body.
Pre-existing medical conditions that raise the likelihood of severe COVID-19 include diabetes, hypertension and any form of lung disease such as asthma, emphysema or chronic obstructive pulmonary disease (COPD). Patients with heart failure, liver disease or some forms of kidney disease also have a high chance of severe COVID-19. People whose immune systems do not work well are likewise highly susceptible. Some people have weakened immune systems because of drugs taken to treat other illnesses, such as hormones or other drugs that affect the immune response. A person who is HIV-positive and whose HIV infection is managed by treatment is not at significantly higher risk of severe COVID-19. In general, the outcome depends on the person's health before infection and on access to treatment.
Death from COVID-19 is rare among young and healthy individuals, but it does happen. Death is much more frequent among elderly individuals and older adults who have COVID-19, and the risk rises with age. In the US, 2% to 5% of individuals aged 65 to 75 years are expected to die from COVID-19. For those aged 75 to 85, the risk rises to 4% to 10%, and it exceeds 10% in people older than 85 years. Many patients with severe lung disease require mechanical ventilation, which is delivered by a breathing machine called a ventilator. The ventilator supports the patient's lungs while the body fights the infection: this form of artificial breathing helps the lungs work so that the body continues to receive the oxygen it needs. At present, this is the only treatment available for patients with severe COVID-19.
"The reproductive number is the number of individuals who would be affected by one infectious person if anyone with whom they have contact is vulnerable to the disease." It is a quantitative indicator of how a disease spreads; in other words, it is a convenient way to assess how easily an illness can spread through a population. A higher reproductive number means that more individuals will be infected during an epidemic. Each person with COVID-19 infects, on average, two to three other individuals. If we can prevent just one of those infections, so that each person infects only one other person instead of two, the number of people infected falls sharply over time. With a reproductive number of two, the first person infects two people, who infect four, who in turn infect eight, giving fifteen infections over three generations of transmission. With a reproductive number of one, the first person infects one person, who infects one more, who again infects one more, so only four people are infected instead of fifteen. Preventing some infections in this way does not interrupt transmission entirely, but it does have a major effect on slowing down the epidemic.
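The arithmetic behind the "four instead of fifteen" comparison can be checked with a short sketch. It assumes, purely for illustration, that every case infects exactly the same number of people and that transmission is counted over three generations; the function name is hypothetical.

```python
def cumulative_infections(reproductive_number, generations):
    """Total infections, including the first case, after a number of
    transmission generations when every case infects the same number
    of other people (a deliberately simplified assumption)."""
    total = 0
    current = 1  # generation 0: the first infectious person
    for _ in range(generations + 1):
        total += current
        current *= reproductive_number
    return total


print(cumulative_infections(2, 3))  # 1 + 2 + 4 + 8
print(cumulative_infections(1, 3))  # 1 + 1 + 1 + 1
```

Real epidemics do not follow fixed generation counts, so this only illustrates why a small reduction in the reproductive number compounds over time.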

Material and Methods
Forecasting requires ample historical data, and no prediction is perfect. Forecasts depend on the reliability of the data and on the variables of interest for prediction. Ankarali et al. [1] modeled the outbreak with different time series models and predicted the indicators. They also evaluated the trends and seasonal effects.
Petropoulos and Makridakis [3] introduced an even-handed way to predict the continuation of COVID-19. They assumed that the data used are reliable and that the future will continue to follow the past pattern of the disease. They described the timeline of a live forecasting exercise with massive potential implications for planning and decision making and provided objective forecasts for the confirmed cases of COVID-19.
Roosa et al. [4] considered the COVID-19 epidemic that was triggered by the initial cluster of severe pneumonia cases identified in Wuhan, China in December 2019. They used phenomenological models, i.e., models that describe the empirical relationship of phenomena to each other in a way that is consistent with theory but not derived from it, which had been validated during previous epidemics, to generate and assess short-term forecasts of the cumulative number of confirmed reported cases in Hubei province, the epicenter of the epidemic, and for the overall trajectory in China, excluding the province of Hubei. They collected daily reported cumulative confirmed cases for the 2019-nCoV outbreak for each Chinese province from the National Health Commission of China. Mean estimates and uncertainty bounds for both Hubei and other provinces remained relatively stable over the last three reporting dates (February 7th-9th). According to their prediction, the epidemic had reached saturation in both Hubei and other provinces. Their findings suggest that the containment strategies enforced in China were substantially reducing transmission and that the epidemic growth had slowed in recent days.
Forecasting is a common data science technique used for tasks such as goal setting and anomaly detection, but producing high-quality, reliable forecasts involves serious challenges. Taylor et al. [5] addressed these challenges with a practical approach to forecasting at scale. They proposed a modular regression model with interpretable parameters that can be intuitively adjusted by analysts with domain knowledge about the time series. The paper by Zakariah et al. [6], "Laboratory Diagnostics in COVID-19: What We Know So Far", also provided significant direction for this article.

Data Preprocessing
The data were extracted from the Microsoft COVID tracker (https://www.bing.com/covid/local/india). The initial attributes were date, new cases, total cases, new deaths and total deaths. Total cases and total deaths are the cumulative sums of new cases and new deaths, respectively, up to a particular day. There were no significant errors in the data other than missing values. These were replaced with 0, because the gaps lay in the initial entries of the data, i.e., the early days of COVID-19 when the values were almost zero.
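The preprocessing described above can be sketched in a few lines. The sample values and the helper name are hypothetical, and missing entries are assumed to be represented as `None`; the actual file layout of the tracker export is not specified in this article.

```python
from itertools import accumulate


def fill_missing_with_zero(values):
    """Replace missing entries (None) with 0, as done for the early days."""
    return [0 if value is None else value for value in values]


# Hypothetical daily counts; the first two days have no recorded value.
new_cases = [None, None, 1, 2, 5]

clean_new_cases = fill_missing_with_zero(new_cases)
# Total cases are the running (cumulative) sum of new cases.
total_cases = list(accumulate(clean_new_cases))

print(clean_new_cases)
print(total_cases)
```

Filling with 0 is only reasonable here because the gaps fall at the start of the series; gaps in the middle of an epidemic curve would need a different treatment.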

Algorithms
A general assumption in time series analysis is that the data can be expressed as the sum of two separate components, one deterministic and one stochastic. An important feature of this analysis is the assumption that the random noise is produced by independent shocks to the system, although this assumption is often violated. For this reason, a forecasting approach such as exponential smoothing can be very promising.

Fbprophet
Fbprophet is an automatic forecasting procedure for time series data [Fig. 2]. Its main attraction is that it fits a nonlinear trend together with yearly, weekly and daily seasonality. The time complexity of this algorithm [Fig. 1] is comparatively lower than that of the other forecasting algorithms.
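For orientation, the decomposable model described by Taylor et al. can be written as below. The notation follows that paper's formulation rather than anything stated in this article, so it should be read as background, not as the exact configuration used here.

```latex
y(t) = g(t) + s(t) + h(t) + \varepsilon_t
```

Here $g(t)$ is the (possibly nonlinear) trend, $s(t)$ collects the periodic seasonal effects, $h(t)$ represents holiday effects, and $\varepsilon_t$ is the error term.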

Moving Average
A moving average is a calculation that analyses data points by creating a series of averages of different subsets of the full data set. It is also called a moving mean or rolling mean and is a type of finite impulse response filter [Fig. 5]. Moving average models also serve as a general model class for stationary time series: any non-deterministic weakly stationary time series can be represented as a weighted sum of present and past random inputs, with weights whose squares have a finite sum.
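A simple (unweighted) moving average can be sketched as follows. The function name and the sample values are illustrative only; the window length used in this study is not stated.

```python
def moving_average(values, window):
    """Simple moving average: the mean of each run of `window`
    consecutive values, computed with a running sum."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running = sum(values[:window])
    averages = [running / window]
    for i in range(window, len(values)):
        # Slide the window: add the newest value, drop the oldest one.
        running += values[i] - values[i - window]
        averages.append(running / window)
    return averages


print(moving_average([1, 2, 3, 4, 5], window=3))
```

The output has `len(values) - window + 1` entries, because the first full window is only available once `window` observations have been seen.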

Auto-Regression
Autoregression is a time series model that uses observations from previous time steps as input to a regression equation in order to predict the value at the next time step. It is a very simple idea that can yield accurate forecasts for a range of time series problems. In its standard form, an autoregressive model of order p is written $X_t = \beta_0 + \sum_{i=1}^{p} \beta_i X_{t-i} + \varepsilon_t$ [Fig. 6].
Here, $\varepsilon_t$ is a white noise process and $\beta_i$ are the weights applied to the input values.
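As a minimal sketch of the idea, the code below fits a first-order autoregression (one lagged value) by ordinary least squares and produces a one-step forecast. The function names are hypothetical and the series is synthetic; this is not the fitting routine used for the results in this article.

```python
def fit_ar1(series):
    """Least-squares fit of x_t = intercept + weight * x_{t-1}."""
    previous = series[:-1]
    current = series[1:]
    n = len(previous)
    mean_prev = sum(previous) / n
    mean_curr = sum(current) / n
    covariance = sum((a - mean_prev) * (b - mean_curr)
                     for a, b in zip(previous, current))
    variance = sum((a - mean_prev) ** 2 for a in previous)
    weight = covariance / variance
    intercept = mean_curr - weight * mean_prev
    return intercept, weight


def forecast_next(series, intercept, weight):
    """One-step-ahead forecast from the last observed value."""
    return intercept + weight * series[-1]


# Synthetic, noise-free series generated by x_t = 2 + 0.5 * x_{t-1}.
demo = [0.0]
for _ in range(10):
    demo.append(2 + 0.5 * demo[-1])

intercept, weight = fit_ar1(demo)
print(intercept, weight, forecast_next(demo, intercept, weight))
```

With real, noisy counts the fitted weights will not recover a generating equation exactly, and higher orders would need a multiple regression on several lags.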

Autocorrelation Function
Autocorrelation reflects the degree of similarity between a given time series and a lagged version of itself over successive time periods [Fig. 7]. It measures the relationship between a variable's present value and its previous values. The stationarity of a time series can be assessed by taking snapshots of the process at arbitrary points in time and examining the general behaviour of the series. A first check is the behaviour of the autocorrelation function (ACF): a strong and slowly decaying ACF indicates a deviation from stationarity.
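The sample autocorrelation at a given lag can be sketched as below, using the common estimator that divides the lagged covariance by the overall sum of squared deviations. The function name and the example series are illustrative.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(series)")
    mean = sum(series) / n
    deviations = [x - mean for x in series]
    denominator = sum(d * d for d in deviations)
    numerator = sum(deviations[t] * deviations[t - lag]
                    for t in range(lag, n))
    return numerator / denominator


# A steadily increasing (non-stationary) series.
trending = list(range(1, 11))
print([round(autocorrelation(trending, k), 3) for k in range(4)])
```

For this trending series the lag-1 autocorrelation is 0.7, and the values fall off only gradually at higher lags, which is the slow decay associated with non-stationarity.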

Partial Autocorrelation Function
The ACF is an important tool for identifying the order q of the moving average component, since it is expected to cut off after lag q. It is less useful for identifying the order p of the autoregressive (AR) component, because for an AR process the ACF is a mixture of exponential decay and damped sinusoids. It is therefore more informative to examine the partial autocorrelation function (PACF) of the time series. In time series analysis, the PACF gives the partial correlation of a stationary time series with its own lagged values, regressing out the values of the time series at all shorter lags [Fig. 8]. This contrasts with the autocorrelation function, which does not control for other lags.
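One standard way to obtain partial autocorrelations from autocorrelations is the Durbin-Levinson recursion, sketched below. The article does not say which estimator was used, so this is an assumption; the function name is hypothetical and the input is a list `rho` with `rho[k]` the autocorrelation at lag `k`.

```python
def pacf_from_acf(rho, max_lag):
    """Partial autocorrelations for lags 1..max_lag computed from
    autocorrelations rho[0..max_lag] via the Durbin-Levinson recursion."""
    partial = []
    phi_prev = []  # coefficients of the order-(k-1) autoregression
    for k in range(1, max_lag + 1):
        if k == 1:
            phi_kk = rho[1]
            phi_curr = [phi_kk]
        else:
            numerator = rho[k] - sum(phi_prev[j] * rho[k - 1 - j]
                                     for j in range(k - 1))
            denominator = 1 - sum(phi_prev[j] * rho[j + 1]
                                  for j in range(k - 1))
            phi_kk = numerator / denominator
            # Update the lower-order coefficients for order k.
            phi_curr = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j]
                        for j in range(k - 1)] + [phi_kk]
        partial.append(phi_kk)
        phi_prev = phi_curr
    return partial


# Theoretical ACF of an AR(1) process with weight 0.5: rho[k] = 0.5 ** k.
print(pacf_from_acf([1.0, 0.5, 0.25, 0.125], max_lag=3))
```

For this AR(1) autocorrelation sequence the PACF is 0.5 at lag 1 and zero afterwards, which is the cut-off pattern used to read the order p from a PACF plot.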

Results
The forecasting results are as follows:

Fbprophet
The forecasts for the number of cases and the number of deaths according to Fbprophet are shown in Figs. 9.1 and 9.2, and the tabular data are available in Tables 1.1 and 1.2, respectively.

Moving Average Function (MA)
The forecasts for the number of cases and the number of deaths according to Moving Average can be visualized in the following graph, and the tabular data are available in Table 2.

Auto-Regressive Integrated Moving Average (ARIMA)
The forecasts for the number of cases and the number of deaths according to ARIMA can be visualized in the following graph, and the tabular data are available in Table 3.

Analysis
Data analysis is an essential part of forecasting, since it determines the approach taken in the subsequent forecasting steps. Several views of the data are considered here: the original trend, the logarithmic trend and the exponential trend.
The two instances of data considered are:

Number of Cases
For the first instance, the number of cases was considered, i.e., the number of cases recorded per day from 01 March 2020 to 19 September 2020.

Original Trend
As the original trend shows, there was little variation or shift in the pattern for the better part of the initial two and a half months. The first variation is visible in May, and a shift follows after June. A strong shift is visible in mid-July, and from then until the end of the period the data show steep variation and peaks (Fig. 12).

Logarithmic Trend
The logarithmic trend suggests that the high variation occurs at the beginning and that the trend becomes linear with time, which is exactly the opposite of the original trend. Hence, the logarithmic trend was not considered for further steps (Fig. 13).
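A logarithmic view of daily counts can be sketched as below. Because missing early values were replaced with 0 and the logarithm of 0 is undefined, this sketch uses `log(1 + x)`; that choice is an assumption, as the article does not state how zeros were handled. The sample counts are hypothetical.

```python
import math


def log_transform(counts):
    """Apply log(1 + x) to each count so that zero counts map to zero."""
    return [math.log1p(count) for count in counts]


# Hypothetical daily counts that grow roughly tenfold at each step.
daily_cases = [0, 9, 99, 999]
print([round(value, 3) for value in log_transform(daily_cases)])
```

On the log scale, multiplicative growth becomes roughly evenly spaced steps, which is why early small counts look comparatively variable and later growth looks close to linear.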

Exponential Trend
The exponential trend appears stable with respect to the original trend. Hence, the exponential trend was also not considered for further analysis (Fig. 14).

Moving Average
The moving average of the original trend suggests that the data can be analyzed further and that there are no irregularities or missing values (Fig. 15).

Shifted Original Trend
The original trend is shifted to extract the trend of the data irrespective of the count (Fig. 16).
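One common reading of "shifting" in this context is differencing: subtracting from each value the value one or more steps earlier, so that the level of the counts is removed and only the changes remain. Whether this is exactly the operation applied here is an assumption; the sketch below shows it with a hypothetical function name.

```python
def difference(series, lag=1):
    """Subtract from each value the value `lag` steps earlier."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


print(difference([1, 3, 6, 10]))
```

The differenced series is shorter than the original by `lag` entries, and a steadily rising series becomes a series of its step sizes.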

Trend Decompose
Trend decomposition separates the different components of a trend in order to extract meaning from the variations in the data. It is applied to the shifted data for better interpretation (Figs. 17 and 18). Here, the trend is decomposed into three components:
1. General Trend: The increase and decrease in the data irrespective of seasonality and irregular variations.
2. Seasonal Trend: The impact on the data due to a particular season or a certain period of time.
3. Residual Trend: The irregular values over time and their impact on the data.
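The three components listed above can be illustrated with a classical additive decomposition. This is a simplified sketch under stated assumptions: the model is additive, the seasonal period is odd (for example 7 for weekly effects), and the function name is hypothetical. It is not necessarily the routine used to produce the figures.

```python
def decompose_additive(series, period):
    """Split a series into trend, seasonal and residual parts, assuming
    series = trend + seasonal + residual and an odd seasonal period."""
    if period % 2 == 0:
        raise ValueError("this sketch only handles odd periods")
    n = len(series)
    half = period // 2
    # General trend: centered moving average, undefined at the edges.
    trend = [None] * n
    for t in range(half, n - half):
        trend[t] = sum(series[t - half:t + half + 1]) / period
    # Seasonal part: mean detrended value at each position in the cycle.
    sums = [0.0] * period
    counts = [0] * period
    for t in range(half, n - half):
        sums[t % period] += series[t] - trend[t]
        counts[t % period] += 1
    phase_means = [s / c for s, c in zip(sums, counts)]
    offset = sum(phase_means) / period  # center the seasonal part on zero
    seasonal = [phase_means[t % period] - offset for t in range(n)]
    # Residual: whatever is left after removing trend and seasonality.
    residual = [None if trend[t] is None
                else series[t] - trend[t] - seasonal[t] for t in range(n)]
    return trend, seasonal, residual


# Synthetic example: a linear trend plus a repeating pattern of period 3.
pattern = [1, -1, 0]
demo = [t + pattern[t % 3] for t in range(12)]
trend, seasonal, residual = decompose_additive(demo, period=3)
print(trend[5], seasonal[:3], residual[5])
```

On the synthetic series the linear trend and the repeating pattern are recovered and the residual is essentially zero; real case counts would leave a visible residual.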

MA Model:
The Moving Average model is applied by setting p = 0 in the ARIMA model in order to eliminate the autoregressive component (Fig. 19). Here, RSS stands for Residual Sum of Squares, which measures the variation of the modelling errors. The general formula is $RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$, where $y_i$ is the observed value and $\hat{y}_i$ is the value predicted by the model.

ARIMA Model
The p and q values obtained from the PACF and ACF, respectively, are then combined to fit the ARIMA model and evaluate the RSS of the fitted data. The main goal of this step is to fit the data with the lowest possible RSS by increasing or decreasing p or q (Fig. 21).
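The RSS criterion used to compare the fitted models can be sketched as below. The function names are hypothetical, and the candidate orders and fitted values are made up purely to show the selection step; in practice the fitted values would come from the models themselves.

```python
def residual_sum_of_squares(observed, fitted):
    """Sum of squared differences between observed and fitted values."""
    if len(observed) != len(fitted):
        raise ValueError("observed and fitted must have the same length")
    return sum((y - y_hat) ** 2 for y, y_hat in zip(observed, fitted))


def select_lowest_rss(observed, candidates):
    """Return the key of the candidate whose fitted values give the
    smallest RSS; `candidates` maps a label such as (p, d, q) to a
    list of fitted values."""
    return min(candidates,
               key=lambda order: residual_sum_of_squares(observed,
                                                         candidates[order]))


observed = [1, 2, 3]
# Hypothetical fitted values for two made-up model orders.
candidates = {(0, 1, 1): [1, 2, 5], (2, 1, 2): [1, 2, 4]}
print(select_lowest_rss(observed, candidates))
```

Because RSS never increases when parameters are added on the same data, choosing p and q by RSS alone can favour overly complex models; it is shown here only because it is the criterion described above.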

Discussion
The purpose of this article is to describe the COVID-19 situation and to predict the future counts of new cases and deaths from the real-world data reported by WHO for India, which the research succeeded in doing. It also provides a concrete explanation of the algorithms used for forecasting the number of cases and the number of deaths.

Conclusion
This research concludes that the two main algorithms used, ARIMA (Autoregressive Integrated Moving Average) and Fbprophet, are well suited to the predictive analysis of seasonal as well as non-seasonal data, as they explicitly capture the seasonal and non-seasonal components of the trend that the data follow. The results are satisfactory and promising with regard to recent COVID-19 new cases and deaths.