I have created a model-testing pipeline for my internship that runs on Google Colab. The pipeline tests multiple sets of models and parameters back-to-back. It spins up a model and a set of parameters in a user-defined manner, then trains for 15 epochs, validating after every epoch. It uses two ModelCheckpoint callbacks to save models as h5 files: one saves every epoch, and the other saves only the best epoch under a known name in a different folder so that it can be easily loaded later.
For reference, every model/parameter set tested is identified by a unique tester id and a model count number, which is incremented for every model. The checkpoints saved every epoch also have the epoch number appended to the end of the file name.
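To make the setup concrete, here is a minimal sketch of how the two checkpoints and the naming scheme could be wired together. It is not my exact code: the folder names, the `_best` suffix, the `val_loss` monitor, and the helper names `checkpoint_paths` and `make_checkpoints` are all illustrative assumptions.

```python
def checkpoint_paths(tester_id, model_count,
                     every_dir="checkpoints/all",    # hypothetical folder
                     best_dir="checkpoints/best"):   # hypothetical folder
    """Build the two h5 file paths for one model/parameter set."""
    run_name = f"{tester_id}_{model_count}"
    # Keras fills in {epoch:02d} at save time, so the template keeps the braces.
    every_epoch = f"{every_dir}/{run_name}_{{epoch:02d}}.h5"
    # Known, fixed name so the best model can be reloaded after training.
    best_only = f"{best_dir}/{run_name}_best.h5"
    return every_epoch, best_only


def make_checkpoints(tester_id, model_count):
    """Create the two ModelCheckpoint callbacks for one run."""
    # Imported here so the path helper above can be used without TensorFlow.
    from tensorflow import keras

    every_epoch, best_only = checkpoint_paths(tester_id, model_count)
    return [
        # Saves the full model after every epoch.
        keras.callbacks.ModelCheckpoint(every_epoch),
        # Overwrites one file only when the monitored metric improves
        # (val_loss is an assumption, not necessarily what the pipeline monitors).
        keras.callbacks.ModelCheckpoint(
            best_only, monitor="val_loss", save_best_only=True
        ),
    ]
```

The returned list would be passed as `callbacks=` to `model.fit(...)` for each model in turn, with `model_count` incremented between runs.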
After all 15 epochs, the best model is loaded and evaluated on our testing set. Then the next model and set of parameters is spun up and the process repeats until it hits a user-defined stopping point.
At least, that is how it is supposed to work.
What happens instead is that the first model to be run goes according to plan. Then the next model is loaded up and trains and validates for one epoch. However, when it comes time to save the checkpoint for the first epoch, the following is thrown:
RuntimeError: Unable to create link (name already exists)
After that occurs, the only way I have found to avoid the error at the end of the first epoch is to reset the Colab runtime, at which point I get one additional model out of it before the error occurs again. (Note: this is not the same model that ran before; I adjusted the method parameters to start at the next model that needed to run.)
Finally, to firmly lay to rest the most common causes of this error, I have run both of the commonly suggested duplicate-name checks, including `for i, w in enumerate(model.weights): print(i, w.name)`. Neither of them indicates any duplicate names.
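For anyone who wants to repeat the check without reading the printed list by eye, it can be automated along these lines. This is a sketch rather than what I originally ran: `find_duplicate_names` is a hypothetical helper, and it works on any collection of objects that expose a `.name` attribute, such as `model.weights`.

```python
from collections import Counter
from types import SimpleNamespace


def find_duplicate_names(items):
    """Return the names that occur more than once, sorted alphabetically."""
    counts = Counter(item.name for item in items)
    return sorted(name for name, n in counts.items() if n > 1)


# Stand-in objects so the helper can be tried without building a model;
# in the pipeline this would be find_duplicate_names(model.weights).
fake_weights = [
    SimpleNamespace(name="dense/kernel:0"),
    SimpleNamespace(name="dense/bias:0"),
    SimpleNamespace(name="dense/kernel:0"),
]
print(find_duplicate_names(fake_weights))
```

An empty list means no two entries share a name, which matches what I see on my actual models.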
I am unsure why this behavior is occurring. My best guess is that some combination of Colab's caching behavior and whatever method ModelCheckpoint uses to save the files causes it to detect a name overlap where there is none.
Any further insight that can be provided as to why this is occurring and how to solve it would be greatly appreciated.