Hey all,
About 20 minutes ago, all servers on the network crashed. The incident has now been resolved, and I'm going to do my best to explain what went wrong.
We noticed the disk usage of our main dedicated server getting extremely high (93%), so I was clearing out some old logs and database backups we didn't need anymore. Since some of our server-specific databases, like block logs, use flat files, clearing them only partially is a lot harder than clearing them entirely.
First you need to loop over and parse every single entry in the database, which is quite a lengthy task on its own. You then need to decide whether each entry is worth keeping or throwing away.
Before I continue, I want to say that I'm not certain this is the root cause of the issue, but it seems like the most probable contributing factor. Our main dedicated server has 128GB of RAM, which is a lot, but we also have a lot of private and public systems and servers to run, so it adds up quickly. The specific database I was clearing out was 131GB of flat file storage.
What I believe happened is that iterating over the database caused our memory to overflow into swap. Since the system disk usage was already so high (93%), the operation filled the rest of the disk with temporary data, which caused our main databases and systems to crash with out-of-disk-space errors.
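For anyone curious what the "parse every entry, then keep or drop it" step can look like without holding the whole file in memory, here is a rough sketch. It is illustrative only and makes several assumptions that are not from the incident itself: the flat file stores one entry per line, each line starts with a Unix timestamp followed by a comma, and the names `prune_flat_file`, `newer_than`, and `min_free_bytes` are made up for this example. Note that it writes the kept entries to a temporary file first, so it still needs enough free disk space for that copy, which is why it checks free space as it goes.

```python
import os
import shutil
import tempfile


def prune_flat_file(path, keep_entry, min_free_bytes=0):
    """Stream `path` line by line, keeping only lines where keep_entry(line) is true.

    Only one line is held in memory at a time. Returns (kept, dropped).
    """
    directory = os.path.dirname(os.path.abspath(path))
    kept = dropped = 0
    # Write survivors to a temp file in the same directory so the final
    # os.replace() is a rename on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as out, \
                open(path, "r", encoding="utf-8") as src:
            for n, line in enumerate(src):
                # Periodically abort if the disk is getting too full.
                if n % 100_000 == 0 and \
                        shutil.disk_usage(directory).free < min_free_bytes:
                    raise OSError("free disk space fell below the safety margin")
                if keep_entry(line):
                    out.write(line)
                    kept += 1
                else:
                    dropped += 1
        os.replace(tmp_path, path)
    except BaseException:
        # Leave the original file untouched and clean up the partial copy.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return kept, dropped


def newer_than(cutoff):
    """Build a keep_entry function for the assumed 'timestamp,rest' line format."""
    def keep(line):
        try:
            return int(line.split(",", 1)[0]) >= cutoff
        except ValueError:
            return True  # keep lines we cannot parse rather than lose data
    return keep
```

A call such as `prune_flat_file("blocks.log", newer_than(1700000000), min_free_bytes=5 * 1024**3)` would drop entries older than the cutoff and stop early if less than 5GB of disk remained; the file name, cutoff, and margin here are placeholders.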
We will hopefully be getting a third dedicated server soon so we can spread out some of our server load and prevent incidents like this from happening again.
The server is back in full working condition now.
Happy Playing!
Thanks,
BattleDash