Post mortem of Feb. 22nd maintenance
Feb. 24th 2014 | by Stefan Schuster
After the eventful last two days it's time for a post mortem. As some of you have noticed, Mind42 was producing tons of errors after the maintenance (like "Temporary failure", or login not working). After trying to stabilize the situation during the first half of Monday, we decided to revert the changes made during the weekend. This also means that all changes to mind maps, new mind maps, and new user sign-ups made since the end of the maintenance (approximately between Sunday 3 PM and Monday 9 AM) have been reverted as well. Considering that no real work was possible during this time due to all the errors, there shouldn't be too much data loss. Of course we are sorry about how things turned out, and we learned a lot about the behavior of the database system we wanted to switch to. As compensation for the trouble, we've added 2 free days of ad-free Mind42 use for all paying customers. So much for the short summary of the events; now let's dive into the details (for all who are interested).
As mentioned in multiple previous blog posts, our number one priority currently is to make Mind42 more fail-safe. The biggest bottleneck in this undertaking is our database server. Since it's a single server, Mind42 stops working whenever this one server has a problem. Due to our hardware setup, though, it's not that easy to simply add a second database server that could take over in case of a problem, so we searched for an alternative. Nowadays there are a lot of databases available (known as NoSQL databases) that promise different advantages with regard to reliability and scalability. Our plan was to switch from our old monolithic SQL database (where we can't add a failover for technical reasons) to a new NoSQL database that is distributed over multiple servers by design, which should solve our bottleneck problem. So we changed our software to run on the new database, and some parts of the software (like the thumbnail images, user sessions and collaboration) had in fact already been running on this system for quite some time (and still are). Our plan for this maintenance window was to switch the rest of the system to the new database as well.
So this is where the maintenance started this Saturday: by copying all the data of Mind42 (users, mind maps, revisions) into the new database system. This took longer than anticipated, so it was Sunday afternoon before we were able to switch Mind42 back online. There were no errors during the data migration, and immediately after turning Mind42 on everything looked fine. But then a first little hiccup occurred. We didn't worry about this too much at that point, but monitored the situation. Mind42 is used much less during the weekend, so the real trouble started on Monday, when the usual load kicked in. Our new database system just couldn't handle the load and produced timeouts and error messages like "Temporary failure. Try again later". We tried to stabilize the situation and learned a lot about the quirks of this database (every piece of software behaves a little bit differently, so we also have to get used to the characteristics of this system under stress). In the end, though, we realized that there was no quick way of stabilizing the situation with our given server infrastructure. Therefore we decided to switch back to the old database system.
So, was all this downtime during the weekend (and the first half of Monday) worthless? No, not completely. We're still facing the problem of the single point of failure with our old monolithic database, but the changed Mind42 software (adapted to work with the new database) worked without problems. Only the new database itself didn't behave quite as we expected, at least not on our server infrastructure. So we've learned quite a lot and will continue to work on this issue, because ultimately our goal is to avoid unplanned downtime due to problems with the database (like last October). In the future, however, we definitely have to find ways to make the migration smoother, so that switches like this don't make Mind42 unusable for nearly two days.
I hope this detailed summary answered most of your questions. Again, we're terribly sorry about how things turned out. Thanks for your understanding.