Sunday, December 13, 2015

The Halloween effect with python and pandas

The Halloween effect, aka “sell in May and go away”, is the observation that equity market returns tend to be worse over the northern hemisphere summer. Anyone who has followed markets for a while has probably noticed a distinct lull over the summer period.

But can we quantify this effect, and does it really exist? We can and it does, and it’s simple to show with fewer than 10 lines of Python.

Methods and madness

We create a two column data frame, one column with the monthly return, and the other a dummy variable that is 1 for our hold months (October – May) and 0 for our sell months (June – September).
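A minimal sketch of that two column frame, assuming a monthly price series is already in hand. The prices below are synthetic random numbers rather than SPY data, and the names `prices`, `ret` and `hold` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic monthly prices standing in for real data (assumption).
rng = np.random.default_rng(0)
dates = pd.date_range("1993-01-01", periods=120, freq="MS")
prices = pd.Series(
    100 * np.exp(np.cumsum(rng.normal(0.005, 0.04, len(dates)))),
    index=dates,
)

# Column 1: simple monthly return (the first month has no return, so drop it).
df = pd.DataFrame({"ret": prices.pct_change()}).dropna()

# Column 2: 1 for the hold months (October to May),
# 0 for the sell months (June to September).
sell_months = [6, 7, 8, 9]
df["hold"] = (~df.index.month.isin(sell_months)).astype(int)
```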

Once we have created our dummy variable factor marking the events we wish to distinguish between, we do an OLS regression and look at the coefficient of our factor.

If it is “significant”, we conclude there is a material difference between the periods when the factor is present and the periods when it is not.
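As a rough sketch of the regression step, here is a dummy variable OLS done with plain numpy instead of the pandas OLS function. The data is synthetic and the "true" parameters (0.0 and 0.01) are arbitrary, so the numbers it prints say nothing about any market.

```python
import numpy as np

# Synthetic example: y is a monthly return, x the 0/1 dummy (assumption).
rng = np.random.default_rng(1)
x = rng.integers(0, 2, size=200)
y = 0.0 + 0.01 * x + rng.normal(0, 0.04, size=200)

# Design matrix with an intercept column, solved by least squares.
X = np.column_stack([np.ones_like(x, dtype=float), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Standard errors from the residual variance, then t-stats.
resid = y - X @ beta
dof = len(y) - X.shape[1]
sigma2 = resid @ resid / dof
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_stats = beta / se

intercept, coef = beta
print(f"coef={coef:.4f}  std err={se[1]:.4f}  t-stat={t_stats[1]:.2f}")
```

The p-value for the coefficient then follows from a t distribution with `dof` degrees of freedom.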

If you are a commercial data scientist, you can use this same method to see if some key metric has actually changed after a marketing campaign or new release. This could be things like increasing user signups or revenue. Your dummy variable would be 0 before the campaign, and 1 afterwards.

If we can show our campaign worked, we can tell our boss how great we are and not to forget all our hard work come bonus time.


As an example, let’s look at SPY from 1993 onwards. First we download the data from Yahoo and create a column of monthly returns. Then we code our dummy variable as described above and run the regression.

Looking at the pandas OLS output, we see the following summary:

#-----------------------Summary of Estimated Coefficients------------------------
#      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
#             x     0.0141     0.0054       2.61     0.0096     0.0035     0.0246
#     intercept    -0.0038     0.0044      -0.87     0.3869    -0.0124     0.0048
#---------------------------------End of Summary---------------------------------

Here x is our Halloween dummy variable, with a p-value of 0.0096, which is significant at the 1% level. Take that, EMH!

Looking at the data, the average monthly return is -0.28% for summer, and +1.11% for the Halloween period.

End notes

For the Halloween effect, rather than looking at monthly returns, we should probably look at the difference between the monthly return and the risk-free rate. There’s a fairly comprehensive paper with a good historical review available here.
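For what it’s worth, the excess return idea is a one line change. This is only a sketch with made-up numbers, and it assumes the risk-free rate is quoted as a simple annual rate that can be divided by 12; `ret` and `rf_annual` are hypothetical names.

```python
import pandas as pd

dates = pd.to_datetime(["2015-01-31", "2015-02-28", "2015-03-31"])

# Hypothetical monthly returns and annualised risk-free rates (made-up numbers).
ret = pd.Series([0.012, -0.004, 0.021], index=dates)
rf_annual = pd.Series([0.012, 0.012, 0.024], index=dates)

# Convert the annual rate to a simple monthly rate and subtract it.
excess = ret - rf_annual / 12
```

The `excess` series would then replace the raw monthly return as the dependent variable in the regression.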

Also, there is a great and freely available book on working with time series data available here. The examples are in R but should be pretty easy to follow along.

Finally, code is up: here

Monday, May 11, 2015

Equity Ranking Backtest with Python/Pandas

I have been looking at equities a bit of late. I am particularly interested in ranking a universe of equities for “low frequency” manual trading on a weekly or monthly basis.

Every period I would rank each name on a bunch of different factors, then invest in the highest ranked ones for that month.

I was initially working in R but the code grew unwieldy, and I wanted a second opinion on my approach, so I took the time to reimplement it in Python using pandas.


For each symbol in our universe, we load the raw data and generate the information used for ranking. If we have 5 names, we end up with 5 dataframes.

Then we combine those dataframes into one big dataframe, and iterate through month by month, selecting the symbols that meet our ranking criteria. From those selected, we equally weight and sum the next period returns.
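A stripped down sketch of the combine-and-iterate step, with three made-up symbols and invented numbers. It uses `groupby` on the date index to walk through the months, which is just one way to do the loop, and it ranks on a single hypothetical `avg` column.

```python
import pandas as pd

dates = pd.to_datetime(["2015-01-31", "2015-02-28", "2015-03-31"])

# One small frame per symbol; all values are made up for illustration.
frames = []
for sym, avg, npr in [
    ("AAA", [0.01, 0.03, 0.02], [0.02, -0.01, 0.00]),
    ("BBB", [0.02, 0.01, 0.04], [0.01, 0.03, -0.02]),
    ("CCC", [0.03, 0.02, 0.01], [-0.01, 0.02, 0.01]),
]:
    frames.append(pd.DataFrame({"avg": avg, "npr": npr, "sym": sym}, index=dates))

# Stack them into one big frame; each date now appears once per symbol.
big = pd.concat(frames).sort_index()

top_n = 2
results = {}
for month, group in big.groupby(level=0):
    # Keep the top_n names by ranking score for this month.
    picks = group.sort_values("avg", ascending=False).head(top_n)
    # Equal weight across the selected names.
    results[month] = picks["npr"].mean()

strategy = pd.Series(results)
```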

One thing that is really cool about the pandas dataframe is that it allows multiple rows with the same index.

This makes it easy to get the data for the month under consideration. We just pass the month to the indexing function and get the subset of data for that month, e.g.

>>> df.loc['2015-02']
                 cpr       npr       avg   over  sym
2015-02-28  0.043302 -0.062449 -0.038914  False  DBC
2015-02-28 -0.025028  0.008524  0.006130   True  IEF
2015-02-28  0.056838 -0.014239  0.005434   True  VEU
2015-02-28 -0.037434  0.017171  0.015900   True  VNQ
2015-02-28  0.055832 -0.011697  0.009236   True  VTI

[5 rows x 5 columns]

In this example there are 5 symbols, and we see the ranking information for February 2015.
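To try the duplicate index lookup on its own, here is a tiny self-contained sketch. The symbols and numbers are invented, and it uses `.loc` with a partial date string on a `DatetimeIndex`.

```python
import pandas as pd

# Two symbols, two month ends, so every date appears twice in the index.
dates = pd.to_datetime(["2015-01-31", "2015-01-31", "2015-02-28", "2015-02-28"])
df = pd.DataFrame(
    {"npr": [0.01, 0.02, -0.01, 0.03], "sym": ["AAA", "BBB", "AAA", "BBB"]},
    index=dates,
)

# A partial date string selects every row that falls in that month.
feb = df.loc["2015-02"]
```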

Another option would be to use hierarchical indexing, with a sub-index for each month, but this way worked for my needs and I think is quite clean and simple.

If anyone knows an equivalent in R that is as clean and easy to work with for multiple time series I would love to hear about it. 

Code Notes

The demo code does a simple back test of the GTAA/Relative Strength trend following system using ETFs.  

I have stripped it down to the basics so hopefully it is easy to understand. Load the data, generate the dataframe with the info we want, make a combined data frame, then go through month by month.

The ranking is done by filtering out names under their 10-month moving average, then selecting the top n based on average 3-month return.

The “cpr” column is the current period return, and the “npr” column is the next period return, which is the return realized if we select a given security for that month.
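Here is a rough sketch of how the ranking columns for one symbol might be built. The prices are synthetic, and two readings are my own assumptions: `avg` is taken as the mean of the last three monthly returns, and `over` as a flag for the close being above its 10-month moving average.

```python
import numpy as np
import pandas as pd

# Synthetic monthly closes for one symbol (assumption).
rng = np.random.default_rng(2)
dates = pd.date_range("2013-01-01", periods=30, freq="MS")
close = pd.Series(
    100 * np.exp(np.cumsum(rng.normal(0.005, 0.03, len(dates)))),
    index=dates,
)

info = pd.DataFrame({"cpr": close.pct_change()})   # current period return
info["npr"] = info["cpr"].shift(-1)                # return over the following month
info["avg"] = info["cpr"].rolling(3).mean()        # mean of the last 3 monthly returns
# Months without 10 months of history compare against NaN and come out False.
info["over"] = close > close.rolling(10).mean()

# Drop rows where a return column is not yet defined.
info = info.dropna()
```

Shifting `cpr` back by one row is what keeps the selection honest: the ranking columns use data up to the current month, while `npr` is only realized afterwards.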

The data is just ETF data from Yahoo, which I have put up here. Code is here.

I found Python For Data Analysis a very useful book when working with pandas.

Tuesday, March 24, 2015

Simulation and relative performance

There have been some nice posts on randomness over the last week or so, in particular here and here.

I would like to look at how we can use simulations to get a better understanding of how some aspect of a trading system holds up relative to a bunch of random trades.

In this example, I look at entries on weekly data for SPY. The entry signal is to buy if the previous week closed down.
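A minimal sketch of that entry rule in pandas, using synthetic weekly closes rather than SPY. The names `signal` and `system_ret` are just illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic weekly closes standing in for SPY (assumption).
rng = np.random.default_rng(3)
weeks = pd.date_range("2005-01-07", periods=520, freq="W-FRI")
close = pd.Series(
    100 * np.exp(np.cumsum(rng.normal(0.001, 0.02, len(weeks)))),
    index=weeks,
)

ret = close.pct_change()

# Long this week if last week closed down; shift(1) avoids look-ahead.
signal = (ret.shift(1) < 0).astype(int)

# Weekly return of the system: the market return when long, zero otherwise.
system_ret = (signal * ret).dropna()

# Share of weeks spent long.
time_long = signal.mean()
```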

Over the time frame (2005-2014, about 10 years), it was long about 44% of the time, and out the rest.

In the simulation function, we generate random entry signals that will see us long about the same amount of time.

We track some metrics of system performance, in this case total return, average trade return and accuracy (i.e. how often a buy signal was correct).
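The simulation idea can be sketched roughly like this. The returns are synthetic, `simulate` is a hypothetical helper, and I am treating every week held as one "trade", which is a simplifying assumption.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic weekly returns standing in for SPY (assumption).
ret = rng.normal(0.001, 0.02, size=520)
frac_long = 0.44  # share of time the system was long, from the text


def simulate(ret, frac_long, n_sims, rng):
    """Random long/flat entries that are long roughly `frac_long` of the time."""
    out = np.empty((n_sims, 3))
    for i in range(n_sims):
        is_long = rng.random(len(ret)) < frac_long
        trades = ret[is_long]
        total = np.prod(1 + trades) - 1   # compounded total return
        avg = trades.mean()               # average return per week held
        acc = (trades > 0).mean()         # share of long weeks that were up
        out[i] = total, avg, acc
    return out


# Columns: total return, average trade return, accuracy.
sims = simulate(ret, frac_long, 1000, rng)
```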

I then use ggplot to make some density plots of the simulation metrics, marking the mean of the simulation results in red and the corresponding system metric in blue.

It looks like this

I basically want to see the blue line far away from the red line. In this case it seems fairly decent. You can also generate some p-values from the simulation data.
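One simple way to get such a p-value is to count how many random runs did at least as well as the system. The sketch below uses made-up stand-in numbers for both the simulated metric and the system's value.

```python
import numpy as np

rng = np.random.default_rng(5)

# Stand-ins: simulated metric values and the system's own value (made up).
sim_metric = rng.normal(0.0, 1.0, size=1000)
system_metric = 1.8

# One-sided empirical p-value: share of random runs doing at least as well.
# The +1 terms count the observed system as one of the runs, so p is never 0.
p_value = (np.sum(sim_metric >= system_metric) + 1) / (len(sim_metric) + 1)
```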

For comparison, here is a daily system that is long if the previous close was above the 200 day simple moving average.
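For reference, a moving average rule of that kind could be sketched as follows, again on synthetic daily closes and with illustrative names.

```python
import numpy as np
import pandas as pd

# Synthetic daily closes on business days (assumption).
rng = np.random.default_rng(6)
days = pd.date_range("2005-01-03", periods=1000, freq="B")
close = pd.Series(
    100 * np.exp(np.cumsum(rng.normal(0.0003, 0.01, len(days)))),
    index=days,
)

sma = close.rolling(200).mean()

# Long today if yesterday's close was above yesterday's 200 day average.
signal = (close > sma).shift(1, fill_value=False).astype(int)

ret = close.pct_change()
system_ret = (signal * ret).dropna()
```

The first 200 days are always flat here, because the average needs 200 closes before it produces a value.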

We can see there’s not a lot of difference between the moving average results and just entering randomly. (Note the accuracy metric has a different x-axis scale than the previous plot).

I use a similar idea for putting risk or open trade management ideas through their paces, seeing how well they hold up when managing random entries.

Code is up here. Thanks for reading.