Contents

Preparing data using Pandas for time series forecasting

August 24th, 2018 time series forecast

If you are interested in predicting the future movements of a dataset, you are going to need to setup the data so that it conforms to a certain format for time series training.

This article is not going to be a thorough treatment about time series analysis, so concepts like seasonality and other factors that might affect the accuracy of time series prediction are not in the purview of this tutorial. This tutorial will, on the other hand, provide you with all the code and a full dataset that you can download in a variety of formats right here in the article. So follow along, grab the dataset, and hopefully I have demystified this problem for a few people.

Assumptions

I have included a dataset you can download right here in the article to follow along with, but the point of any article is that you probably have some interest in extrapolating these ideas to your own problem; So, the assumptions of this tutorial are that you have some dataset with various features and a time component. This could be corn futures with daily timestamps, it could be monthly solar flare information from a government institution, it could be minute data of a security you are interested in, or any other dataset as long as it has a time feature. Really, the only requirement is that you have some dataset with a uniform time component that you desire to predict some time in the future. We will be using python and a library called pandas to prepare our dataset.

Ethereum Dataset

In the table below, I have provided the Ethereum to U.S dollar dataset for the last few months of 2018 in hourly increments. It is a decent sized dataset for tutorial purposes. It will be loaded into the table asynchronously. Ideally, you already see it by the time you are reading this!

Date Symbol Open High Low Close Volume From Volume To

If you want to follow along, go ahead and grab the dataset above locally. I will be using it in all the steps below.

Getting Started

We are going include the pandas library in our file and then use it to open our ethereum dataset. I'm using the csv version. I have used the same filename as it would automatically create coming out of the table export on this site.


df = pd.read_csv('tableExport.csv')

df = df.drop(df.columns[0], axis=1)

We are also going to add in the next line to drop the first empty row, which is just an empty row that is not necessary:


df = df.drop(df.columns[0], axis=1)
df.head()

You should see the following output, sans the empty first column:


               Date  Symbol    Open    ...       Close  Volume From  Volume To
0  2018-08-25 02-AM  ETHUSD  281.91    ...      282.88         7.34    2070.20
1  2018-08-25 01-AM  ETHUSD  281.63    ...      281.91        57.88   16300.39
2  2018-08-25 12-AM  ETHUSD  280.90    ...      281.63       231.89   65393.41
3  2018-08-24 11-PM  ETHUSD  280.44    ...      280.90       140.49   39605.99
4  2018-08-24 10-PM  ETHUSD  282.88    ...      280.44       847.26  239035.86

[5 rows x 8 columns]

Tidying up the Data

This is a tutorial to understand the basic concept of preparing a dataset for time series forecasting. I always understand a problem better when I can simplify it to its bare working components. Since feature selection is not important, let's get rid of the cognitive load of looking at some of these columns that are unimportant to what we are doing today.


df.drop(['Open', 'High', 'Low', 'Volume From', 'Volume To'], axis=1, inplace=True)
df.head()

Your head() should look like so:


               Date  Symbol   Close
0  2018-08-25 02-AM  ETHUSD  282.88
1  2018-08-25 01-AM  ETHUSD  281.91
2  2018-08-25 12-AM  ETHUSD  281.63
3  2018-08-24 11-PM  ETHUSD  280.90
4  2018-08-24 10-PM  ETHUSD  280.44

Reverse the Dataset

This won't always be necessary, but our current dataset is listed in descending order, and we need to reverse it so that the earliest dates are at the beginning of the dataset.


df = df.iloc[::-1]
print(df.head())

Once again check your table's head to make sure we have what we want.


                  Date  Symbol   Close
1312  2018-07-01 12-AM  ETHUSD  452.00
1311  2018-07-01 01-AM  ETHUSD  451.42
1310  2018-07-01 02-AM  ETHUSD  451.00
1309  2018-07-01 03-AM  ETHUSD  451.00
1308  2018-07-01 04-AM  ETHUSD  450.31

The Dataset is ready

Now our dataset is right where we want it to modify for forecasting. The basic idea is that we want to add another column, let's call it eth_future_price, that will be some increment in the future of what we are looking for. For this tutorial we will try one hour and then two hours into the future. You will notice past a few hours, it is fairly difficult to visually see what is happening anymore.

Let's create this new field


 df['eth_future_price'] = df['Close']
df.head()

#head() results

                  Date  Symbol   Close  eth_future_price
1312  2018-07-01 12-AM  ETHUSD  452.00            452.00
1311  2018-07-01 01-AM  ETHUSD  451.42            451.42
1310  2018-07-01 02-AM  ETHUSD  451.00            451.00
1309  2018-07-01 03-AM  ETHUSD  451.00            451.00
1308  2018-07-01 04-AM  ETHUSD  450.31            450.31


Once again, we use the trusty head method to verify that we have indeed created a new column that has copied over all our values.

Shifting the Dataset

Next, we are going to use the pandas shift method to move all our future_eth_price values back. I have perhaps sacrificed speed or brevity for maximum readability using column names to specify changes. For instance, you may have no idea what df.ix[:,[2,3,6]] is doing if you haven't looked at a certain script in a few months without going through the program and debugging. So my recommendation is to use pandas ability to specify with column names as much as possible. If you have to work on the code again in the future, your future self with thank you.

Let's shift our dataset:


df['eth_future_price'] = df['eth_future_price'].shift(-1)
df.head()

                  Date  Symbol   Close  eth_future_price
1312  2018-07-01 12-AM  ETHUSD  452.00            451.42
1311  2018-07-01 01-AM  ETHUSD  451.42            451.00
1310  2018-07-01 02-AM  ETHUSD  451.00            451.00
1309  2018-07-01 03-AM  ETHUSD  451.00            450.31
1308  2018-07-01 04-AM  ETHUSD  450.31            454.21


You will notice that from your head() output, that each eth_future_price column value corresponds to the Close price one row into the future. Another important thing to take note of is null values you are creating at the end of your dataset. Every time increment you go into the future creates another null value. Let's look at the tail() of our dataset:


df.tail()

               Date  Symbol   Close  eth_future_price
4  2018-08-24 10-PM  ETHUSD  280.44            280.90
3  2018-08-24 11-PM  ETHUSD  280.90            281.63
2  2018-08-25 12-AM  ETHUSD  281.63            281.91
1  2018-08-25 01-AM  ETHUSD  281.91            282.88
0  2018-08-25 02-AM  ETHUSD  282.88               NaN

Predicting 2 hours into the future

It is the same process, and we will print out the head and tail to better illustrate the consequences of predicting further out



df['eth_future_price'] = df['eth_future_price'].shift(-2)

df.head()

#You can see you need to count down two rows into the future to match up with the eth_future_price
                  Date  Symbol   Close  eth_future_price
1312  2018-07-01 12-AM  ETHUSD  452.00            451.00
1311  2018-07-01 01-AM  ETHUSD  451.42            451.00
1310  2018-07-01 02-AM  ETHUSD  451.00            450.31
1309  2018-07-01 03-AM  ETHUSD  451.00            454.21
1308  2018-07-01 04-AM  ETHUSD  450.31            451.96

df.tail()
#consequently, we now have Nan values for the rows moved down.

               Date  Symbol   Close  eth_future_price
4  2018-08-24 10-PM  ETHUSD  280.44            281.63
3  2018-08-24 11-PM  ETHUSD  280.90            281.91
2  2018-08-25 12-AM  ETHUSD  281.63            282.88
1  2018-08-25 01-AM  ETHUSD  281.91               NaN
0  2018-08-25 02-AM  ETHUSD  282.88               NaN

Clear out null values

If you were planning on training a model using this data, you would want to clear out the rows containing these NaN values. It very easy to do with pandas.


df = df.dropna()
df.tail()

#no more null values, but remember we lose data points for every increment we move into the future
6  2018-08-24 08-PM  ETHUSD  279.83            280.44
5  2018-08-24 09-PM  ETHUSD  282.88            280.90
4  2018-08-24 10-PM  ETHUSD  280.44            281.63
3  2018-08-24 11-PM  ETHUSD  280.90            281.91
2  2018-08-25 12-AM  ETHUSD  281.63            282.88

Conclusion

In conclusion, that's all there is to preparing your data for time series forecasting. If you were interested in using regression, neural networks, or any other type of model to predict future price movements, you would want to use the shifted column as your target value.

Back