October, 2021 – I was attempting to fit two nested reinforcement learning models using R. One was a basic two-parameter delta model; the other added a third parameter that weighted evidence from a decay-rule learning model (see Don et al., 2019, Cognition). When this additional parameter is set to zero, the hybrid model is identical to the delta model, so the delta model is nested within the three-parameter hybrid model. Because of this, the three-parameter model cannot possibly provide a poorer fit to the data than the nested delta model; at worst, the two models fit equally well.
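To make the setup concrete, here is a minimal sketch of the two models for a two-armed bandit task. This is my own illustration, not the exact code behind these fits: the function names, the softmax choice rule, and the particular decay rule (with a fixed decay rate d) are all assumptions.

```r
# Softmax choice rule: higher beta (inverse temperature) = more deterministic.
softmax <- function(Q, beta) exp(beta * Q) / sum(exp(beta * Q))

# Two-parameter delta model: learning rate alpha, inverse temperature beta.
delta_nll <- function(par, choices, rewards) {
  alpha <- par[1]; beta <- par[2]
  Q <- c(0, 0)                                  # initial values, two options
  nll <- 0
  for (t in seq_along(choices)) {
    p <- softmax(Q, beta)
    nll <- nll - log(p[choices[t]])             # accumulate negative log-likelihood
    a <- choices[t]
    Q[a] <- Q[a] + alpha * (rewards[t] - Q[a])  # delta-rule update
  }
  nll
}

# Three-parameter hybrid: a weight w mixes in values from a decay-rule
# learner (values decay by d each trial; the chosen option is incremented
# by the reward). With w = 0 this reduces exactly to the delta model,
# which is the nesting described above. The decay rule here is a guess.
hybrid_nll <- function(par, choices, rewards, d = .9) {
  alpha <- par[1]; beta <- par[2]; w <- par[3]
  Q <- c(0, 0); D <- c(0, 0)
  nll <- 0
  for (t in seq_along(choices)) {
    V <- (1 - w) * Q + w * D                    # weighted evidence from both learners
    p <- softmax(V, beta)
    nll <- nll - log(p[choices[t]])
    a <- choices[t]
    Q[a] <- Q[a] + alpha * (rewards[t] - Q[a])
    D <- d * D                                  # decay all values
    D[a] <- D[a] + rewards[t]                   # increment the chosen option
  }
  nll
}
```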
These models were fit using R’s optim function with method = ‘L-BFGS-B’. For each fit, random starting values were drawn from a uniform distribution and the model was fit several times, with the best fit being kept. This is done to avoid local minima: cases where the optimizer reports that it has found the best solution, but it has not.
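Roughly, the procedure looked like the sketch below. Again, this is my own reconstruction; fit_model and the uniform draws between the parameter bounds are illustrative assumptions.

```r
# Multi-start fitting: run optim from several random starting points and
# keep the fit with the lowest negative log-likelihood.
fit_model <- function(nll_fn, n_starts, lower, upper, ...) {
  best <- NULL
  for (i in seq_len(n_starts)) {
    start <- runif(length(lower), min = lower, max = upper)  # random start
    fit <- optim(start, nll_fn, method = "L-BFGS-B",
                 lower = lower, upper = upper, ...)
    if (is.null(best) || fit$value < best$value) best <- fit
  }
  best
}
```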
The purpose of this post is to report that with only 10 random starting points, the three-parameter hybrid model sometimes fit the data more poorly than the delta model. In other words, the optimizer landed in local minima instead of recovering the delta model’s solution (the same two parameter values, with the additional parameter set to zero), which would have made the two fits identical.
I increased the number of random starting points from 10 to 50, which helped somewhat, but there were still a few data sets where the three-parameter model could not match the fit of the simpler two-parameter model.
When I increased the number of random starting points from 50 to 100, I am happy to report that there were no more local minima with this data set (N = 293). I also raised the lower bound on the learning rate parameter from .0001 to .01, because this shrank the parameter space and did not qualitatively change the model’s assumptions about behavior. Therefore I recommend that when fitting models using the maximum likelihood method, one should use at least 100 random starting points for models with three or more parameters.
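In terms of the sketches above, the two changes amount to the following. The toy data and the bounds on the other parameters are placeholders, not values from the actual study.

```r
set.seed(1)
choices <- sample(1:2, 100, replace = TRUE)              # toy data, for illustration
rewards <- rbinom(100, 1, ifelse(choices == 1, .7, .3))

# 100 random starting points, and a lower bound of .01 (rather than .0001)
# on the learning rate.
fit <- fit_model(hybrid_nll, n_starts = 100,
                 lower = c(.01, .01, 0), upper = c(1, 10, 1),
                 choices = choices, rewards = rewards)
```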
One should also consider whether the parameter space is unnecessarily large, given one’s theoretical goals. With RL models I think .01 is a reasonable lower bound for the learning rate or decay rate, because it keeps that parameter from becoming indistinguishable from the inverse temperature parameter (as either parameter approaches zero, the model predicts essentially random responding, so the two cannot be told apart).
The volume of the parameter space grows exponentially with the number of free parameters, so models with additional parameters will presumably need even more starting points. Perhaps I will expand this into a journal publication, but for now I wanted to give a warning to modelers out there to watch out for local minima. Think of ways to check that you are finding the true best-fitting set of parameters (one such check is sketched below), and think deeply about what your models are doing.
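As one example of such a check: because the delta model is nested in the hybrid model, the hybrid’s best negative log-likelihood should never exceed the delta model’s. A sketch, using the illustrative functions from above:

```r
delta_fit  <- fit_model(delta_nll, n_starts = 100,
                        lower = c(.01, .01), upper = c(1, 10),
                        choices = choices, rewards = rewards)
hybrid_fit <- fit_model(hybrid_nll, n_starts = 100,
                        lower = c(.01, .01, 0), upper = c(1, 10, 1),
                        choices = choices, rewards = rewards)

# A nested model fitting worse is a sure sign of a local minimum.
if (hybrid_fit$value > delta_fit$value + 1e-6) {
  warning("Hybrid fit worse than nested delta model: likely a local minimum.")
}
```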