Web Traffic Forecasting

Forecast future traffic to Wikipedia pages



Dataset
Web visits of 145k pages



The dataset comes from the Kaggle competition Web Traffic Time Series Forecasting, which ran from 07/13/17 to 11/15/17.
Each row is the daily visit series of one page, from 07/31/15 to 12/31/16 (training set 1) or to 09/01/17 (training set 2).




Page key format: name_project_access_agent
    ● Name: page name
    ● Project: website language or project
              German (de), English (en), Spanish (es), French (fr), Japanese (ja),
              Russian (ru), Chinese (zh), mediawiki, commons.wikimedia
    ● Access: type of access
              all-access, desktop, mobile
    ● Agent: type of agent
              all-agents, spider
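
As a minimal sketch of how a key in the name_project_access_agent format could be split into its four fields: page names can themselves contain underscores, so the split is taken from the right. The `PageKey` type, the `parse_page_key` helper, and the sample key are hypothetical, and the sketch assumes the project field is stored as a domain such as `en.wikipedia.org`.

```python
from typing import NamedTuple


class PageKey(NamedTuple):
    """The four fields encoded in a page key (hypothetical helper type)."""
    name: str
    project: str
    access: str
    agent: str


def parse_page_key(key: str) -> PageKey:
    # Split from the right: the last three underscore-separated fields are
    # project, access and agent; everything before them is the page name,
    # which may contain underscores of its own.
    name, project, access, agent = key.rsplit("_", 3)
    return PageKey(name, project, access, agent)


# Illustrative key only; assumes the project is written as a domain.
example = parse_page_key("Web_traffic_en.wikipedia.org_desktop_all-agents")
print(example)
```

Splitting from the left would break on any page name that contains an underscore, which is why `rsplit` with a limit of 3 is used here.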


Goal
Forecast Two Months of Web Visits



Forecast

The competition aims to forecast two months of visits for 145k Wikipedia pages. The competition ends on 09/11/17, and the expected forecast dates run from 09/13/17 to 11/13/17.
On this website, to make the results easier to visualize, we mostly use training set 1 at this stage and compare our forecasts against the true visit data in training set 2. (That is, we use the data from 07/31/15 to 12/31/16 to forecast the visits from 01/01/17 to 03/01/17.) After several trials, we will move on to training set 2 with our finalized model.
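
The train/holdout split described here could be sketched as follows. This is only an illustration under stated assumptions: a series is represented as a plain dict from date to visit count, `split_series` is a hypothetical helper, and the toy data is made up.

```python
from datetime import date, timedelta

# Dates taken from the text: train through 12/31/16, evaluate on
# 01/01/17 to 03/01/17 (inclusive).
TRAIN_END = date(2016, 12, 31)
HOLDOUT_START = date(2017, 1, 1)
HOLDOUT_END = date(2017, 3, 1)


def split_series(series):
    """Split a {date: visits} mapping into training and holdout parts."""
    train = {d: v for d, v in series.items() if d <= TRAIN_END}
    holdout = {d: v for d, v in series.items()
               if HOLDOUT_START <= d <= HOLDOUT_END}
    return train, holdout


# Toy series (made-up numbers) spanning the split boundary.
start = date(2016, 12, 25)
toy = {start + timedelta(days=i): 100 + i for i in range(80)}

train, holdout = split_series(toy)
horizon = (HOLDOUT_END - HOLDOUT_START).days + 1  # inclusive day count
print(len(train), len(holdout), horizon)
```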



Models

01

ARIMA
(Autoregressive Integrated Moving Average)

A common statistical model for forecasting time series data.
The many assumptions behind the model, however, can be challenging to satisfy.

02

LSTM
(Long short-term memory)

A common deep learning method for time series data. It is based on the RNN but addresses the problem of long-term dependencies.


Each model is discussed in its own section, where we go deeper to show how it works.
Explore the page for each model from the navigation bar, or follow the buttons under each section for the full story!


Evaluation
Symmetric Mean Absolute Percent Error



What is MAPE?

Mean Absolute Percent Error (MAPE) measures the prediction accuracy of a forecasting method. For each observation, take the difference between the true value and the predicted value, divide it by the true value, and take the absolute value; then sum these terms and divide by the number of observations:

\[ \mathrm{MAPE} = \frac{100\%}{n} \sum_{t=1}^{n} \left| \frac{A_t - F_t}{A_t} \right| \]

where A_t is the actual value, F_t is the forecast value, and n is the number of values. MAPE is simple and intuitive, but it cannot be used when any actual value is zero, because that causes a division by zero. This is why the competition asks for the SMAPE value instead.
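
A minimal sketch of MAPE as described above, written in plain Python. The function name `mape` and the sample numbers are illustrative assumptions, not code from the competition.

```python
def mape(actual, forecast):
    """Mean Absolute Percent Error, in percent.

    Raises ZeroDivisionError if any actual value is zero, which is the
    limitation discussed in the text.
    """
    if len(actual) != len(forecast) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for a, f in zip(actual, forecast):
        total += abs(a - f) / abs(a)  # fails when a == 0
    return 100.0 * total / len(actual)


# Made-up values: each forecast is off by 10% of the actual value.
print(mape([100, 200], [110, 180]))

# A single zero in the actual values makes MAPE undefined.
try:
    mape([0, 5], [1, 5])
except ZeroDivisionError:
    print("MAPE is undefined when an actual value is zero")
```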



What is SMAPE?

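Symmetric Mean Absolute Percent Error (SMAPE) is an alternative to MAPE for data with zero or near-zero values (here, pages with few or no visits). Under MAPE, such low-volume items can have arbitrarily large, even infinite, percentage errors that skew the overall error rate. SMAPE instead divides each error by the sum of the actual and forecast magnitudes, which limits each term to 200% and reduces the influence of low-volume items. In the form bounded by 200%, with A_t the actual value and F_t the forecast value:

\[ \mathrm{SMAPE} = \frac{100\%}{n} \sum_{t=1}^{n} \frac{2\,|F_t - A_t|}{|A_t| + |F_t|} \]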
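
A minimal sketch of SMAPE in the form that is bounded by 200%. The function name `smape` and the sample numbers are illustrative. Treating a term as 0 when both the actual and the forecast value are zero is an assumption made here so that the metric stays defined.

```python
def smape(actual, forecast):
    """Symmetric Mean Absolute Percent Error, in percent (0 to 200)."""
    if len(actual) != len(forecast) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for a, f in zip(actual, forecast):
        denom = abs(a) + abs(f)
        if denom == 0:
            # Assumption: a perfect forecast of zero contributes no error.
            continue
        total += 2.0 * abs(f - a) / denom
    return 100.0 * total / len(actual)


# Made-up values: a zero actual no longer breaks the metric; that term
# is capped at 200% instead of being infinite.
print(smape([0, 100], [10, 100]))
```

Unlike MAPE, the worst case for a single observation is 200%, reached when exactly one of the actual and forecast values is zero.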




Baseline

According to the Kaggle leaderboard, mean SMAPE values between 35.48065 and 40.33832 rank within the top 20%, and values below 41.94324 rank within the top 50% of this competition. We will judge our results against these leaderboard thresholds.