Preparing data using Pandas for time series forecasting
If you are interested in predicting the future movements of a dataset, you will need to set up the data so that it conforms to the format required for time series training.
This article is not a thorough treatment of time series analysis, so concepts like seasonality and other factors that affect the accuracy of time series prediction are outside the scope of this tutorial. It will, however, provide all the code and a full dataset that you can download in a variety of formats right here in the article. So follow along, grab the dataset, and hopefully this demystifies the problem for a few people.
The point of any article, though, is that you probably have some interest in extrapolating these ideas to your own problem. So the assumption of this tutorial is that you have some dataset with various features and a time component. It could be corn futures with daily timestamps, monthly solar flare readings from a government institution, minute data for a security you are interested in, or any other dataset, as long as it has a time feature. Really, the only requirement is a dataset with a uniform time component and a value you want to predict some time in the future. We will be using Python and a library called pandas to prepare the dataset.
In the table below, I have provided the Ethereum to U.S. dollar dataset for the last few months of 2018 in hourly increments. It is a decent-sized dataset for tutorial purposes. It is loaded into the table asynchronously, so ideally you already see it by the time you are reading this!
| Date | Symbol | Open | High | Low | Close | Volume From | Volume To |
If you want to follow along, go ahead and grab the dataset above locally. I will be using it in all the steps below.
We are going to include the pandas library in our file and then use it to open the Ethereum dataset. I'm using the CSV version, and I have kept the filename that the table export on this site produces automatically.
import pandas as pd

df = pd.read_csv('tableExport.csv')
We are also going to add the next line to drop the first column, which is empty and not necessary:
df = df.drop(df.columns[0], axis=1)
df.head()
You should see the following output, sans the empty first column:
               Date  Symbol    Open  ...   Close  Volume From  Volume To
0  2018-08-25 02-AM  ETHUSD  281.91  ...  282.88         7.34    2070.20
1  2018-08-25 01-AM  ETHUSD  281.63  ...  281.91        57.88   16300.39
2  2018-08-25 12-AM  ETHUSD  280.90  ...  281.63       231.89   65393.41
3  2018-08-24 11-PM  ETHUSD  280.44  ...  280.90       140.49   39605.99
4  2018-08-24 10-PM  ETHUSD  282.88  ...  280.44       847.26  239035.86

[5 rows x 8 columns]
Tidying up the Data
This is a tutorial for understanding the basic concept of preparing a dataset for time series forecasting. I always understand a problem better when I can simplify it to its bare working components. Since feature selection is not the point here, let's get rid of the cognitive load of looking at columns that are unimportant to what we are doing today.
df.drop(['Open', 'High', 'Low', 'Volume From', 'Volume To'], axis=1, inplace=True)
df.head()
Your head() should look like so:
               Date  Symbol   Close
0  2018-08-25 02-AM  ETHUSD  282.88
1  2018-08-25 01-AM  ETHUSD  281.91
2  2018-08-25 12-AM  ETHUSD  281.63
3  2018-08-24 11-PM  ETHUSD  280.90
4  2018-08-24 10-PM  ETHUSD  280.44
Reverse the Dataset
This won't always be necessary, but our current dataset is listed in descending order, and we need to reverse it so that the earliest dates are at the beginning of the dataset.
df = df.iloc[::-1]
print(df.head())
Once again, check your table's head to make sure we have what we want.
                  Date  Symbol   Close
1312  2018-07-01 12-AM  ETHUSD  452.00
1311  2018-07-01 01-AM  ETHUSD  451.42
1310  2018-07-01 02-AM  ETHUSD  451.00
1309  2018-07-01 03-AM  ETHUSD  451.00
1308  2018-07-01 04-AM  ETHUSD  450.31
The Dataset is ready
Now our dataset is right where we want it for the forecasting modifications. The basic idea is to add another column, let's call it eth_future_price, that holds the value some increment into the future. For this tutorial we will try one hour and then two hours ahead. You will notice that past a few hours, it becomes fairly difficult to visually follow what is happening.
Let's create this new field:
df['eth_future_price'] = df['Close']
df.head()

#head() results
                  Date  Symbol   Close  eth_future_price
1312  2018-07-01 12-AM  ETHUSD  452.00            452.00
1311  2018-07-01 01-AM  ETHUSD  451.42            451.42
1310  2018-07-01 02-AM  ETHUSD  451.00            451.00
1309  2018-07-01 03-AM  ETHUSD  451.00            451.00
1308  2018-07-01 04-AM  ETHUSD  450.31            450.31
Once again, we use the trusty head method to verify that we have indeed created a new column that has copied over all our values.
Shifting the Dataset
Next, we are going to use the pandas shift method to move all our eth_future_price values back. I have perhaps sacrificed speed or brevity for maximum readability by using column names to specify changes. For instance, if you haven't looked at a script in a few months, you may have no idea what df.iloc[:, [2, 3, 6]] is doing without stepping through the program and debugging (the older df.ix accessor you may see in examples has been removed from modern pandas entirely). So my recommendation is to use pandas' ability to select by column name as much as possible. If you have to work on the code again, your future self will thank you.
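To make that contrast concrete, here is a small sketch on a toy frame (not our Ethereum data) showing positional selection next to label-based selection. Both produce the same result, but only one of them is self-documenting:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2018-07-01 12-AM', '2018-07-01 01-AM'],
    'Symbol': ['ETHUSD', 'ETHUSD'],
    'Close': [452.00, 451.42],
})

# Positional: you have to remember what columns 1 and 2 are
positional = df.iloc[:, [1, 2]]

# Label-based: the intent is visible in the code itself
labeled = df[['Symbol', 'Close']]

print(positional.equals(labeled))  # True
```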
Let's shift our dataset:
df['eth_future_price'] = df['eth_future_price'].shift(-1)
df.head()

                  Date  Symbol   Close  eth_future_price
1312  2018-07-01 12-AM  ETHUSD  452.00            451.42
1311  2018-07-01 01-AM  ETHUSD  451.42            451.00
1310  2018-07-01 02-AM  ETHUSD  451.00            451.00
1309  2018-07-01 03-AM  ETHUSD  451.00            450.31
1308  2018-07-01 04-AM  ETHUSD  450.31            454.21
You will notice from your head() output that each eth_future_price value corresponds to the Close price one row into the future. Another important thing to note is the null values you are creating at the end of your dataset: every time increment you go into the future creates another null value. Let's look at the tail() of our dataset:
df.tail()

               Date  Symbol   Close  eth_future_price
4  2018-08-24 10-PM  ETHUSD  280.44            280.90
3  2018-08-24 11-PM  ETHUSD  280.90            281.63
2  2018-08-25 12-AM  ETHUSD  281.63            281.91
1  2018-08-25 01-AM  ETHUSD  281.91            282.88
0  2018-08-25 02-AM  ETHUSD  282.88               NaN
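A quick sanity check is to count the nulls directly; the count should equal the number of hours you shifted. This sketch rebuilds a tiny stand-in frame rather than the full dataset, but the same two lines work on our df as-is:

```python
import pandas as pd

# Tiny stand-in for the real dataset
df = pd.DataFrame({'Close': [452.00, 451.42, 451.00]})
df['eth_future_price'] = df['Close'].shift(-1)

# A one-hour shift produces exactly one NaN at the end
missing = df['eth_future_price'].isna().sum()
print(missing)  # 1
```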
Predicting 2 hours into the future
It is the same process, and we will print out both the head and the tail to better illustrate the consequences of predicting further out:
df['eth_future_price'] = df['Close'].shift(-2)
df.head()

#You can see you need to count down two rows into the future to match up with the eth_future_price
                  Date  Symbol   Close  eth_future_price
1312  2018-07-01 12-AM  ETHUSD  452.00            451.00
1311  2018-07-01 01-AM  ETHUSD  451.42            451.00
1310  2018-07-01 02-AM  ETHUSD  451.00            450.31
1309  2018-07-01 03-AM  ETHUSD  451.00            454.21
1308  2018-07-01 04-AM  ETHUSD  450.31            451.96

df.tail()

#consequently, we now have NaN values for the rows moved down.
               Date  Symbol   Close  eth_future_price
4  2018-08-24 10-PM  ETHUSD  280.44            281.63
3  2018-08-24 11-PM  ETHUSD  280.90            281.91
2  2018-08-25 12-AM  ETHUSD  281.63            282.88
1  2018-08-25 01-AM  ETHUSD  281.91               NaN
0  2018-08-25 02-AM  ETHUSD  282.88               NaN
Clear out null values
If you were planning on training a model with this data, you would want to clear out the rows containing these NaN values. It is very easy to do with pandas.
df = df.dropna()
df.tail()

#no more null values, but remember we lose data points for every increment we move into the future
               Date  Symbol   Close  eth_future_price
6  2018-08-24 08-PM  ETHUSD  279.83            280.44
5  2018-08-24 09-PM  ETHUSD  282.88            280.90
4  2018-08-24 10-PM  ETHUSD  280.44            281.63
3  2018-08-24 11-PM  ETHUSD  280.90            281.91
2  2018-08-25 12-AM  ETHUSD  281.63            282.88
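If you find yourself repeating these steps for different horizons, the copy, shift, and dropna can be folded into one small helper. This is a convenience function of my own (make_forecast_frame is not a pandas API), shown as a sketch on a tiny stand-in frame:

```python
import pandas as pd

def make_forecast_frame(df, source_col, target_col, hours):
    """Return a copy of df with target_col holding source_col
    shifted `hours` rows into the future, NaN rows dropped."""
    out = df.copy()
    out[target_col] = out[source_col].shift(-hours)
    return out.dropna(subset=[target_col])

# Usage on a tiny stand-in frame
prices = pd.DataFrame({'Close': [452.00, 451.42, 451.00, 450.31]})
two_hour = make_forecast_frame(prices, 'Close', 'eth_future_price', 2)
print(two_hour)
```

Because two of the four rows have no value two hours ahead, the returned frame keeps only the first two rows.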
In conclusion, that's all there is to preparing your data for time series forecasting. If you are interested in using regression, neural networks, or any other type of model to predict future price movements, you would use the shifted column as your target value.
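As a final sketch of where this leads, here is how the shifted column slots in as a regression target. This assumes scikit-learn is installed and uses a toy frame in the same shape as our prepared dataset; it illustrates the feature/target split, not a serious trading model:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy frame in the same shape as our prepared dataset
df = pd.DataFrame({'Close': [452.00, 451.42, 451.00, 450.31, 454.21]})
df['eth_future_price'] = df['Close'].shift(-1)
df = df.dropna()

X = df[['Close']]            # features: the current price
y = df['eth_future_price']   # target: the price one hour ahead

model = LinearRegression().fit(X, y)

# Predict one hour ahead from a hypothetical current price
print(model.predict(pd.DataFrame({'Close': [455.00]})))
```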