Sunday, November 24, 2013

Just for fun: attractors in R

I have a borderline unhealthy obsession with attractors. I thought I got it out of my system, but here we are. For whatever reason, I felt like making some in R.

You can find the R code here. It uses the attractor function to define density in a matrix, which is how often a given point gets hit. Then we plot the log scaled version of that using image().

Be warned, with a lot of points it can be quite slow. If you play around you might want to drop the n value to 10,000 or so.

Some pictures

Dejong


Clifford



If you like this sort of thing, I made a realtime interactive dejong attractor, using a kinect, openFrameworks and a lot of GLSL shaders. You can see a clip of it below, the source is also on my github if you are so inclined.


Attractor from dizzy pete on Vimeo.

You can find me on twitter and G+

Book Review: Applied Predictive Modeling by Max Kuhn and Kjell Johnson

This is a gem of a book.

From the introduction:

We intend this work to be a practitioner’s guide to the predictive modeling process and a place where one can come to learn about the approach and to gain intuition about the many commonly used and modern, powerful models.
…it was our goal to be as hands-on as possible, enabling the readers to reproduce the results within reasonable precision as well as being able to naturally extend the predictive modeling approach to their own data.

The book is structured into four main sections. First is General Strategies, which provides an introduction and discusses things like pre-processing and tuning. 

The next two sections cover regression and classification, each with chapters on linear and non-linear methods, as well as tree and rule based methods, with one to two chapters on practical issues such as measuring performance. 

The final section covers feature selection, predictor importance and a discussion around model performance. 

Features


There are a few things I really like:

It is not an academic or mathematical treatise; the emphasis is on practice, discussing the issues that commonly arise and how they can be approached. Plenty of references are provided for those wanting to dig deeper.

Every example has its data set and code available so one can work through the examples as presented. In most cases they are real world datasets and there is great discussion of the real world issues that arise, what should be considered and the various tradeoffs that can be made.

Discussion and code are separate. Aside from the excellent content, this is probably what I appreciate the most. Each chapter presents its content, with charts where appropriate, while the actual walk through of the code and raw output is in a separate section of the chapter. 

This makes it much easier to focus on the material being presented. It is always difficult to present source code along with discussion. This is not a book about programming per-se, it is about using existing tools to make intelligent and reasoned decisions about the task at hand. It makes a lot of sense to have the code presented separately.

Also, as far as I have read, each chart is at most only one page away from the text discussing it. This is a small thing but I feel there has been serious consideration about the presentation of the material and it has been done very well. [Update: This is not strictly true but is mostly the case throughout the book]

It is not a book about caret, the package of author Max Kuhn. To be honest I would be pretty happy even if it were about caret, which certainly does get some use in the code, but it is relatively package agnostic. 

Final thoughts


This is a great book, providing both the trees and the forest so to speak. I am unaware of any other book with similar content, and I wish I had something like this when I was first getting interested in machine learning. 

There are books that are very introductory, books that cover the details of the algorithms, and books that provide rigorous coverage of the theory, but these are not really accessible to those without a serious amount of mathematics. There are a few equations presented where appropriate, but it is certainly not the focus of the book.  

There are no real shortcomings, though if there were ever a second edition, coverage of time series methods and deep learning would be welcome. I appreciate they are both book worthy topics by themselves, and the latter is still very much a moving target. 

In summary: Great content, well written and well presented. This book would be my top recommendation to anyone looking to get started or working with predictive modeling. Well worth checking out. 

You can read more reviews on amazon here: Applied Predictive Modeling

You can find me on twitter and G+

Friday, October 25, 2013

Another Rcpp/RcppArmadillo success story

I finally had an excuse to try out RcppArmadillo and was amazed at just how well it worked out.

The exact details are a bit beyond a blog post, but basically I had a matrix of roughly 80,000x2 double valued samples (call it A) and needed to find where in another, much smaller matrix (call it B) each of the samples was closest too.

There was some structure to the B matrix, which is 500x2, so it wasn't necessary to iterate through all 500, each lookup jumped around a bit and took 8 iterations.

The pseduocode in R looked a bit like this

find_closest <- function(A, B) {
    for(i in 1:8) {
        ...voodoo...
    }
    return index into B that A is closest too
}

calc_distance <- function(A, B)  apply(A, 1, find_closest, B=B) 

All up it took about 9.5 seconds on my system, and 15 seconds on another.

The first pass I implemented find_closest using RcppArmadillo, and saw a healthy speed up to around 400 ms, which was a big improvement.

Then I realised I might as well do calc_distance/the apply in C++ as it was just a simple vector of integers being returned.

This gave an amazing performance leap, the function now takes around 12 milliseconds all up, down from 9.5 seconds. On another machine it would take 15 seconds, and ended up taking 10 milliseconds.

I was very surprised at this. I haven't have a chance to dig into the details, but I am assuming there is a reasonable amount of overhead passing the data from R to RcppArmadillo. In the case of apply, this would be incurred for every row/column the apply was running find_closest on. By moving the apply to C++, all the data was passed from R to C++ only once, giving the large speedup. Or so I guess.

I tried two versions, one that traversed A & B row wise, and the other by column. The column one generally seemed faster, 12 ms vs 19 ms for rows. According to the Arma docs it stores matrices in column-major order, which might explain that difference.

Would appreciate any comments or pointers to documentation so I can better understand what is going on under the hood there.



The case for data snooping

When we are backtesting automated trading systems, accidental data snooping or look forward errors are an easy mistake to make. The nature of the error in this context is making our predictions using the data we are trying to predict. Typically, it comes from a mistake with our calculations of time offsets somewhere.

However, it can be a useful tool. If we give our system perfect forward knowledge:

1) We establish an upper bound for performance.
2) We can get a quick read if something is worth pursuing further, and
3) It can help highlight other coding errors.

The first two are pretty closely related. If our wonderful model is built using the values it is trying to predict, and still performs no better than random guessing, it’s probably not worth the effort trying to salvage it.

The flip side is when it performs well, that will be as good as it will ever get.

There are two main ways it can help identifying errors. Firstly, if our subsequent testing on non-snooped data provides comparable performance, we probably have another look ahead bug lurking somewhere.


Secondly, things like having amazing accuracy yet still performing poorly is another sign of a bug lurking somewhere.

Example

I wanted to compare SVM models when trained with actual prices vs a series of log returns, using the rolling model code I put up earlier. As a baseline, I also added in a 200 day simple moving average model.

(S) Indicates snooped data

A few things strike me about this.

For the SMA system, peeking ahead by a day only provides a small increase in accuracy. Given the longer-term nature of the 200 day SMA this is probably to be expected.

For the SVM trained systems, the results are somewhat contradictory.

For the look forward models, training on price data had much lower accuracy than the log returns, and the log return model performed much better. Note that both could have achieved 100% accuracy by predicting its first column of training data.

However, when not snooping, the models trained on closing prices did much better than those trained on returns. I’m not 100% sure there isn’t still some bug lurking somewhere, but hey if the code was off it would’ve shown up in the forward tested results no?

Feel free to take a look and have a play around with the code, which is up here.

Monday, September 23, 2013

Building models over rolling time periods

Often I have some idea for a trading system that is of the form “does some particular aspect of the last n periods of data have any predictive use for subsequent periods.”

I generally like to work with nice units of time, such as 4 weeks or 6 months, rather than 30 or 126 days. It probably doesn’t make a meaningful difference in most cases, but it’s nice to have the option.

At first this seemed like something rollapply() from the zoo package could help with, but there are a number of preconditions that need to be met and frankly I find them to be a bit of a pain.

In a nutshell I have not been able to find a nice way for it to apply a function to a rolling subset of time series data nicely aligned to weekly or monthly boundaries.

All is not lost, there is a neat function in xts called endpoints(), which takes a time series and a period (e.g. “weeks”, “months”) and returns the indexes into that time series for the corresponding periods.

Using this information it becomes easy to subset the time set data using normal row subset operations.

The xts package also has period.apply but it runs on non-overlapping intervals, which is close but still not quite what I want.

In the script for this post there are 4 or so functions of note.

The main one is roll_model, which takes a set of data to be subsetted and passed to the model, the size of per model training and test sets and the period size to split things up, which is anything valid for use with endpoints().

A utility function is train_test_split which also uses endpoints() to split a subset of data into 2 sets, one for training the model, one for testing. In practice it needs to be the same period type as you expect to use with roll_model.

The function that actually builds the model and returns some results is run_model(), which calls train_test_split to get the training and test set, builds a model using ksvm in this example, and sees how it goes based on the test set.

Another utility function is called before that, data_prep which builds the main data object to be passed to roll_model. In this example it takes a set of log closes to close returns, sets Y to be the return at time t, X1 the log return at t-1, X2 at t-2 and so on.


The example model is not a particularly useful way of looking at things, which is not surprising given close to close returns are effectively random noise. But perhaps the script itself is useful for other ideas, and if anyone knows better/easier/alternate ways of doing the same thing I would love to hear about them.

The script is available here.

Sunday, September 8, 2013

Snapshot of the global zeitgeist



John le Carré, author of Tinker Tailor Solider Spy (among many others) and Miley Cyrus give a timely reminder that Australia is a small fish in a big pond.

Thursday, September 5, 2013

Type conversion and you (or and R)

Types and type conversion can be a tricky and intricate topic, and sometimes can lead to some real head-scratcher issues in R. Hence a somewhat confusing title.

This is for people still relatively new to R, and I will skip some gory details. Actually I will skip most of them, the canonical source for type and conversion information is the official R documentation, and the help pages for the functions at hand.

Instead I thought I would walk through some examples of when the type engine can behave in seemingly odd ways, and take a look at what is going on when mysterious errors arise and what can be done to track down their source.

What are types. 


Types describe the nature of data R is dealing with, at least as far as R cares. If you want to use the + operator, R needs to look at the data type on either side and work out if + is defined for those, and what type the result should be.

For example 1 + 1 will behave as you might expect, summing the two number and returning a numeric result, but what should R do if you add a character to a matrix? It uses types to answer these questions, and give an error message if no answer can be found.

Let’s get started, first we will create a vector with some numeric data

a <- c(1,2,3)
sum(a)
#[1] 6
a
#[1] 1 2 3

Nothing too exciting, we created a vector with three numeric values, and the builtin sum function behaves as we would expect. Let’s see what happens when we try to mix types in the vector:

a <- c(a, "hi")
a
#[1] "1"  "2"  "3"  "hi"

When we look at the contents of our vector, the numeric values have been converted to strings. This is because a vector is for homogenous storage, that is, it can only hold data of the same type. When we appended the string to it, R converted the numeric values to strings.

This is an example of implicit type conversion aka type coercion. R knows the vector can only hold data of the same type, and since it does not know how to turn ‘hi’ into a numeric value, it has turned the integers into strings. This has done magically (or implicitly) in the background.

What happens when we try to sum our vector?

sum(a)
#Error in sum(a) : invalid 'type' (character) of argument

We get an error, because R does not know how to add strings, at least in the conventional numerical sense of addition.

The general rule of thumb for conversion is “the bigger type wins.” Strings can represent more data than numeric values, so the numeric values are converted to strings. There aren’t that many data types in R, so often you will end up with strings when types are being mixed.

Type conversion is not particularly smart, which is generally a good idea in programming languages. If, instead of “hi”, we passed in the character representation of a numeric one “1”, it still would have converted the existing numeric values to strings, even though in theory the literal 1 could be converted to a numeric type. The type engine just sees strings and numerics, the biggest wins and the numerics are coerced.

Type conversion and apply


Sometimes when using apply and friends, you will get errors that at first glance must clearly be bugs in R. Let’s take a look at another example, this time using a data frame.

Unlike a vector or a matrix, a data frame is a heterogeneous type container; it is possible for it to store columns with different types. We will have five rows, a categorical factor column and two columns of numeric data.

n <- 5
#make some dummy data
df <- data.frame(cbind(rbinom(n, 1, 0.5), rnorm(n, 10, 5), rnorm(n, 20, 10)))
#make the first column a factor
df[,1] <- as.factor(df[,1])
head(df)
#  X1        X2       X3
#1  1  8.911567 27.28325
#2  1  9.933021 13.74879
#3  0 10.177231 20.65490
#4  0  6.368177 27.10183
#5  1 12.084135 14.54369

Now let’s try and sum the two numeric columns using apply

apply(df, 1, function(x) x[2] + x[3] )
#Error in x[2] + x[3] : non-numeric argument to binary operator

That’s weird. I’m pretty sure they are numeric, so addition should work

df[,2] + df[,3]
#[1] 36.19481 23.68181 30.83213 33.47001 26.62783

And it does. What gives?

We can see what datatype R thinks it is working with by using the mode() function.

apply(df, 1, function(x) mode(x))
#[1] "character" "character" "character" "character" "character"

Each row is being passed as character types, not the mixed data types we were expecting (i.e. factor, numeric, numeric).

In a nutshell, apply will first convert our data frame to a matrix before passing the rows to the defined function. Which it says in the help page, and I never took any notice of until things started getting weird.

The matrix container wants to store data all of the same type, and much like our initial vector example, our numeric values are being coerced into strings. Only once this coercion has taken place do the rows of the matrix get passed to the function we supplied apply with.

This is why we see our “non-numeric argument to binary operator” error message. What we thought was numeric data has been converted to character data, which we subsequently try to add.

Just to really drive this home, let’s look at the first row of our data frame:

a <- df[1,]
a
#  X1       X2       X3
#1  1 8.911567 27.28325

Now let’s see what happens when we convert it to a matrix as apply does:

b <- as.matrix(a)
b
#  X1  X2         X3      
#1 "1" "8.911567" "27.28325"
b[2] + b[3]
#Error in b[2] + b[3] : non-numeric argument to binary operator

We can see it has been converted to strings, and we end up with the same error message we saw when we used apply.

In this case the resolution is quite simple, we use the index operator in our call to apply

apply(df[,2:3], 1, function(x) x[1] + x[2])
#[1] 36.19481 23.68181 30.83213 33.47001 26.62783

Which works as we expect. Note in our function we changed the indexes to 1 and 2 as we are now only passing two columns of data to apply.

So, remember that in many cases apply will convert what you pass it to a matrix. The matrix wants all data to be of the same type, and will coerce as required. For the specific conditions see the help page for apply.

A second look


I’d like to take a look at another example. This time instead of a factor, we will have one column as a string type, and two numeric columns as before:

df <- data.frame(cbind(paste("subject", 1:n, sep=''), rnorm(n, 10, 5), rnorm(n, 20, 10)))
#        X1               X2               X3
#1 subject1 14.6619839711866 6.94472759446703
#2 subject2  11.603910222178 27.6225162121889
#3 subject3 5.21881004622993 20.3409476386206
#4 subject4 16.3574724782284 39.0904723579448
#5 subject5 9.35407053787977 23.8568796326835

We know that apply will coerce its input data to a matrix, so pass in only the numeric columns:

apply(df[,2:3], 1, function(x) x[1] + x[2])
#Error in x[1] + x[2] : non-numeric argument to binary operator

Uh oh!

Lets take a look at the data we are passing in

a <- df[,2:3]
mode(a[,1])
#[1] "numeric"
as.matrix(a)
#     X2                  X3              
#[1,] "-3.89274205212847" "12.7336046818466"
#[2,] "12.3494043977024"  "17.9329667214396"
#[3,] "4.7419241278816"   "16.0664073330786"
#[4,] "8.50784944656814"  "8.65139145569206"
#[5,] "9.56191506080518"  "21.2114650777001"

What. Why are you strings? mode says you're numeric!

a[,1]
#[1] -3.89274205212847 12.3494043977024  4.7419241278816   8.50784944656814  9.56191506080518
#Levels: -3.89274205212847 12.3494043977024 4.7419241278816 8.50784944656814 9.56191506080518

Why has our numeric data turned into a factor??

Generally at this point, I take a quiet moment to reflect on what I’m doing with my life and why I'm not living in a yurt on a mountain somewhere, I mean people do it and they seem happy enough? Could it really be that bad? But where would I charge my laptop? Oh well, back to the task at hand. I bet the coffee would be terrible as well.

What is going on here? We created a data frame with a column of strings and two numeric columns. Why have we ended up with factors?

Lets take a look at our string column

df[,1]
#[1] subject1 subject2 subject3 subject4 subject5
#Levels: subject1 subject2 subject3 subject4 subject5

The strings were converted to factors, checking the data.frame help page we see a stringsAsFactors option. A likely culprit, lets see how we go

df <- data.frame(cbind(paste("subject", 1:n, sep=''), rnorm(n, 10, 5), rnorm(n, 20, 10)), stringsAsFactors=FALSE)
as.matrix(df[,2:3])
#     X2                 X3              
#[1,] "7.19530271823023" "26.4186991862312"
#[2,] "13.6715492467442" "25.452128137706"
#[3,] "8.89363806613213" "20.1618970554355"
#[4,] "16.296512734304"  "16.2581582721134"
#[5,] "11.6454577442585" "17.5241594066948"

Why are they still strings?

In this case, the culprit is actually the cbind call, which is coercing our mixed strings and numerics into strings. When cbind finishes, the resulting matrix is passed to data.frame and our now string data gets converted to factors as due to stringsAsFactors being TRUE.

cbind(paste("subject", 1:n, sep=''), rnorm(n, 10, 5), rnorm(n, 20, 10))
#     [,1]       [,2]                [,3]            
#[1,] "subject1" "14.0342542696833"  "30.5672885598002"
#[2,] "subject2" "8.44141744459018"  "35.1337567509022"
#[3,] "subject3" "11.6550656524794"  "10.1554349193507"
#[4,] "subject4" "18.0303214118231"  "14.9638066872277"
#[5,] "subject5" "0.180686583194847" "11.7124424267387"

Dropping the cbind gives us what we are after

df <- data.frame(paste("subject", 1:n, sep=''), rnorm(n, 10, 5), rnorm(n, 20, 10))
as.matrix(df[,2:3])
#     rnorm.n..10..5. rnorm.n..20..10.
#[1,]       13.557665        33.519719
#[2,]       15.086483        41.457651
#[3,]        7.010492         1.757224
#[4,]       11.008779        29.707944
#[5,]       15.777351        10.280138
apply(df[,2:3], 1, function(x) x[1] + x[2])
#[1] 47.077384 56.544134  8.767716 40.716723 26.057489

Phew.

The cbind call is not necessary when creating a data frame, which is designed to take a variable amount of data. It's excessive use is a bad habit I picked up learning to navigate the waters of R, I must be sure to use it only when necessary.

Summary


As you have seen, type conversion and coercion happens quite frequently, and usually you may not even realize it has happened. At least until mysterious error messages start appearing.

When they do, using mode() is the best way to see what is going on. If you are after specific type information I am wary of the is.* family, they are useful, but tend to be fairly generous when you are after specifics.

Also be sure to check the help pages for the functions you are using, they do usually note when conversion takes place.

A good rule of thumb is vector and matrix are used for data of the same type, list and data frame are used for data of mixed types.

The reason mode prints “numeric” for factors is that internally, factors are numeric. They just come with a list of strings (the factor levels), which are printed out in lieu of the internal numeric representation. You might also notice that calling mode on a matrix tells you it is a list, at which point I make a dry coughing sound and busy myself with a yurt catalogue.

You can find technical details about types here http://cran.r-project.org/doc/manuals/r-release/R-lang.html#Objects

Finally, it is possible to force R to use a specific mode using the storage.mode() function. In this case values that cannot be represented in the given mode will be converted to NA

a <- c(1:3, "hi")
storage.mode(a)
#[1] "character"
storage.mode(a) <- 'integer'
#Warning message:
#In storage.mode(a) <- "integer" : NAs introduced by coercion
a
#[1]  1  2  3 NA

Code in one file is here.