28 March 2018

Alexander Bryukhanov

Chief technical officer at ISPsystem

How we developed our backup solution. Part one

This is a story of how ISPsystem developed a backup solution, told by Alexander Bryukhanov, the chief technical officer at ISPsystem.

All users can be divided into three groups:

those who never make backup copies,

those who do,

and those who check backups that have already been made.

Some will simply be amused by my story, while others will see themselves in it. This is a story about ISPsystem 15 years ago, what we have changed since then, and what it has led to. The first part tells how we started developing a solution for backing up virtual servers.
So, back to the early 2000s. OpenVZ and KVM do not exist yet. FreeBSD Jail has just come out, and we are the first to develop a solution for providing virtual servers based on this technology.
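As soon as you start collecting data, you face a question: how do you avoid losing it? We started by archiving the virtual servers' files, and that was enough, thanks to UnionFS: each server's own files sat in a separate layer on top of a shared template, so only that layer needed to be archived.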
There was one catch, though. When a file that came from the template is deleted, a so-called WHITEOUT entry is created, and tar cannot see it. As a result, when a server was restored from such a backup, deleted files (unless they had been replaced by other files) reappeared from the template.
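The effect can be reproduced with a toy model. This is not FreeBSD's unionfs code. Representing layers as Python dictionaries, using None as the whiteout marker, and the function names are all assumptions made for illustration.

```python
from typing import Dict, Optional

# Toy model: an upper-layer value of None stands for a whiteout entry,
# i.e. "this template file was deleted on this server".
Layer = Dict[str, Optional[bytes]]


def union_view(template: Dict[str, bytes], upper: Layer) -> Dict[str, bytes]:
    """What the server sees: upper-layer entries override the template,
    and a whiteout (None) hides the template file."""
    view = dict(template)
    for path, data in upper.items():
        if data is None:
            view.pop(path, None)  # whiteout: hide the template copy
        else:
            view[path] = data
    return view


def naive_archive(upper: Layer) -> Dict[str, bytes]:
    """Archive the upper layer like a whiteout-blind tool: whiteouts are not seen."""
    return {path: data for path, data in upper.items() if data is not None}


def restore(archive: Dict[str, bytes]) -> Layer:
    """Restoring recreates only the entries that made it into the archive."""
    return dict(archive)


template = {"/etc/motd": b"default", "/usr/bin/tool": b"binary"}
upper: Layer = {"/etc/motd": b"custom", "/usr/bin/tool": None}  # tool was deleted

before = union_view(template, upper)
after = union_view(template, restore(naive_archive(upper)))
print(sorted(before))  # ['/etc/motd']
print(sorted(after))   # ['/etc/motd', '/usr/bin/tool']  <- the deleted file is back
```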
The service was in high demand, so we developed incremental backups. The tar that shipped with FreeBSD could do this out of the box.
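The principle behind incremental backups can be shown as a comparison of two metadata snapshots. Whatever is new or has changed since the previous run gets archived, and whatever has disappeared gets recorded. This is a simplified illustration and not how FreeBSD's tar implements it. The snapshot format (modification time and size for each path) and the function names are hypothetical.

```python
import os
import tempfile
from typing import Dict, List, Tuple

# Hypothetical snapshot format: relative path -> (mtime in ns, size in bytes).
Snapshot = Dict[str, Tuple[int, int]]


def take_snapshot(root: str) -> Snapshot:
    """Record modification time and size of every file under `root`."""
    snap: Snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.lstat(full)
            snap[os.path.relpath(full, root)] = (st.st_mtime_ns, st.st_size)
    return snap


def plan_incremental(previous: Snapshot, current: Snapshot) -> Tuple[List[str], List[str]]:
    """Return (paths to archive, paths deleted since the previous backup)."""
    changed = sorted(p for p, meta in current.items() if previous.get(p) != meta)
    deleted = sorted(p for p in previous if p not in current)
    return changed, deleted


def write(root: str, name: str, data: bytes) -> None:
    with open(os.path.join(root, name), "wb") as f:
        f.write(data)


with tempfile.TemporaryDirectory() as root:
    for name in ("a.txt", "b.txt", "c.txt"):
        write(root, name, b"v1")
    first = take_snapshot(root)
    write(root, "a.txt", b"version 2")  # size changes, so it is caught even with coarse mtimes
    os.remove(os.path.join(root, "c.txt"))
    write(root, "d.txt", b"new")
    changed, deleted = plan_incremental(first, take_snapshot(root))

print(changed, deleted)  # ['a.txt', 'd.txt'] ['c.txt']
```

Recording deletions matters. If a restore replays a full backup plus increments without them, removed files come back, much like the WHITEOUT problem above.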
Backups remained in high demand. Our customers started ordering entire server racks instead of individual servers. For a company that had started out with a 56K Internet connection in a 20-square-meter room, this was a real success. Over time, however, problems began to arise.

Problem one: the processor

Around that time, we started looking at off-the-shelf solutions. The only one I found that could suit us was Bacula, a very young product back then. We tried deploying it in one of the data centers, but it did not live up to our expectations. It was quite difficult to configure, getting files out of it was less convenient than from a standard .tgz archive, and its performance was not impressive.
Archiving on the hosting servers themselves consumed a lot of CPU time, and lowering the priority of the backup process did not help either. Sometimes 24 hours were not enough to complete a backup, or backups simply failed.
The solution was obvious: we needed to archive on a separate machine! Fortunately, this was easy to do with a shell script, and so we did it. Instead of an ordinary file server, we now had a full-fledged backup server. This solved the CPU problem, but another issue appeared.
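The original was a shell script. To keep all the examples in one language, the same idea is shown here in Python. The sketch assumes a pull model: the backup server connects to the node over SSH with key-based authentication, the node streams an uncompressed tar, and gzip runs on the backup server, so that machine bears the compression cost. Host names, paths and function names are hypothetical.

```python
import shlex
import subprocess
from typing import List, Tuple


def build_pull_commands(node: str, src_dir: str) -> Tuple[List[str], List[str]]:
    """Argument lists for: ssh NODE 'tar -cf - -C SRC .' | gzip -c"""
    remote = "tar -cf - -C {} .".format(shlex.quote(src_dir))
    return ["ssh", node, remote], ["gzip", "-c"]


def pull_backup(node: str, src_dir: str, dest_path: str) -> None:
    """Run on the backup server: the node only streams tar, gzip runs locally."""
    ssh_cmd, gzip_cmd = build_pull_commands(node, src_dir)
    with open(dest_path, "wb") as out:
        ssh = subprocess.Popen(ssh_cmd, stdout=subprocess.PIPE)
        gz = subprocess.Popen(gzip_cmd, stdin=ssh.stdout, stdout=out)
        ssh.stdout.close()  # let ssh receive SIGPIPE if gzip exits early
        gz_status = gz.wait()
        ssh_status = ssh.wait()
    if ssh_status != 0 or gz_status != 0:
        raise RuntimeError(f"backup of {node}:{src_dir} failed "
                           f"(ssh={ssh_status}, gzip={gz_status})")


# Hypothetical node and path; nothing runs until pull_backup() is called.
print(build_pull_commands("node1.example.com", "/jails/vps101"))
```

Closing `ssh.stdout` in the parent follows the pipeline pattern from the Python `subprocess` documentation. This way ssh receives SIGPIPE if gzip exits early.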

Problem two: the disk

Full weekly backups suffered the most, since each one read every file from the server's disk. We found the answer quickly. The previous copy already contained most of the files, so our next step was to take files for the backup from the previous copy rather than from the server. That was the first implementation of our own tool, ispbackup, and it sped things up significantly!
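Here is a minimal sketch of building a full backup from the previous copy. It assumes that listing files with their metadata (here, modification time and size) is cheap, while reading file contents from the server's disk is the expensive part. The data structures and names are hypothetical, and this is not ispbackup's actual code.

```python
from typing import Callable, Dict, Tuple

Meta = Tuple[int, int]                   # hypothetical (mtime, size) pair
Archive = Dict[str, Tuple[Meta, bytes]]  # path -> (metadata, contents)


def synthetic_full(previous: Archive, current_meta: Dict[str, Meta],
                   read_from_server: Callable[[str], bytes]) -> Tuple[Archive, int]:
    """Build a new full backup, reading only new or changed files from the server.

    Returns the new archive and how many files had to be read from the server.
    Files that no longer exist on the server are simply not carried over.
    """
    new: Archive = {}
    server_reads = 0
    for path, meta in current_meta.items():
        old = previous.get(path)
        if old is not None and old[0] == meta:
            new[path] = old  # unchanged: take the bytes from the previous copy
        else:
            new[path] = (meta, read_from_server(path))  # new or changed: read the server's disk
            server_reads += 1
    return new, server_reads


server = {"/a": b"alpha", "/b": b"beta, v2", "/c": b"gamma"}
previous: Archive = {"/a": ((100, 5), b"alpha"), "/b": ((100, 4), b"beta")}
current_meta = {"/a": (100, 5), "/b": (200, 8), "/c": (200, 5)}
new, reads = synthetic_full(previous, current_meta, server.__getitem__)
print(reads)  # 2: only /b (changed) and /c (new) came from the server
```

Comparing modification time and size is a heuristic. A file rewritten with the same size and an unchanged timestamp would be carried over from the old copy.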
Our own tool also solved the WHITEOUT problem: readdir() could not see the deleted files, but fts_read() could. However, the gzip stream we used for compression did not allow reading from the middle. Repacking the data could also be resource-intensive.
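Python's standard gzip module can illustrate this limitation and one way around it. Without extra index data, a DEFLATE stream must be decompressed from the start of a gzip member. Reaching a file in the middle of one large .tgz therefore means decompressing everything before it. If each chunk is compressed as a separate gzip member and its offset is recorded, a single chunk can be read on its own. This layout is an assumption for illustration, since the text does not say exactly how ispbackup stored its data.

```python
import gzip
from typing import List, Tuple


def pack_parts(parts: List[bytes]) -> Tuple[bytes, List[Tuple[int, int]]]:
    """Compress each part as its own gzip member; return the blob and (offset, length) per part."""
    blob = bytearray()
    index: List[Tuple[int, int]] = []
    for data in parts:
        member = gzip.compress(data)
        index.append((len(blob), len(member)))
        blob += member
    return bytes(blob), index


def read_part(blob: bytes, index: List[Tuple[int, int]], n: int) -> bytes:
    """Decompress only part n by slicing out its gzip member."""
    offset, length = index[n]
    return gzip.decompress(blob[offset:offset + length])


parts = [b"files of part 0", b"files of part 1", b"files of part 2"]
blob, index = pack_parts(parts)
print(read_part(blob, index, 1))  # b'files of part 1'
```

Concatenated gzip members still form a valid gzip file, so the whole blob also decompresses in one pass.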
We split backup copies into smaller, separate parts. Each part contained a set of files whose offsets from the beginning of the part never changed. To avoid repacking files on later runs, a new archive could reuse several parts of the previous one. Reused parts could contain outdated data, so to get rid of it we developed backup compaction.
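The reuse-and-compaction scheme can be modeled roughly as follows. This is a simplified sketch, not ispbackup's algorithm. The `Part` class, the version numbers that stand in for file contents, and the `min_live` threshold that decides when a part is dropped and its live files repacked are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Part:
    part_id: int
    files: Dict[str, int]  # path -> version stored in this part; parts are never rewritten


def next_backup(prev_parts: List[Part], current: Dict[str, int],
                next_id: int, min_live: float = 0.5) -> List[Part]:
    """Build the part list for a new backup.

    A previous part is reused untouched if at least `min_live` of its entries are
    still current; otherwise it is dropped and its live files are repacked
    (a simple form of compaction). New and changed files go into one fresh part.
    """
    parts: List[Part] = []
    covered = set()
    for part in prev_parts:
        live = [p for p, v in part.files.items()
                if current.get(p) == v and p not in covered]
        if part.files and len(live) / len(part.files) >= min_live:
            parts.append(part)  # reused as-is; its stale entries remain until compaction
            covered.update(live)
    fresh = {p: v for p, v in current.items() if p not in covered}
    if fresh:
        parts.append(Part(next_id, fresh))
    return parts


def build_index(parts: List[Part], current: Dict[str, int]) -> Dict[str, int]:
    """Map each current file to the part that holds its current version."""
    index: Dict[str, int] = {}
    for part in parts:
        for path, version in part.files.items():
            if current.get(path) == version:
                index.setdefault(path, part.part_id)
    return index


prev = [Part(1, {"/a": 1, "/b": 1, "/c": 1, "/d": 1}),
        Part(2, {"/e": 1, "/f": 1, "/h": 1})]
current = {"/a": 1, "/b": 1, "/c": 1, "/d": 2, "/e": 2, "/f": 2, "/g": 1, "/h": 1}
new_parts = next_backup(prev, current, next_id=3)
print([p.part_id for p in new_parts])  # [1, 3]: part 1 reused, part 2 compacted away
print(build_index(new_parts, current))
```

In this example, the changed files end up together in the fresh part and the unchanged ones stay in the old part. This is one way the hot/cold separation described next can emerge without being designed in.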
We also got a nice bonus. “Hot” files gradually gathered in some parts and “cold” files in others, which made the process more efficient. It’s great when something good happens unexpectedly.

Problem three: What if something went wrong?

If something went wrong at some point, a broken archive could still be created, and it could go unnoticed for months, right up to the moment you needed it. We learned this lesson the hard way: if you care about your data, check your backups.
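Here is one way to act on this lesson, under some assumptions. A checksum is recorded for every file when the backup is made, and later the entire archive is read back and compared against those checksums. The manifest format and function names are hypothetical. The code uses only Python's standard tarfile, hashlib and zlib modules.

```python
import hashlib
import io
import tarfile
import zlib
from typing import Dict, List


def build_manifest(files: Dict[str, bytes]) -> Dict[str, str]:
    """SHA-256 of every file, recorded when the backup is made."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}


def make_archive(files: Dict[str, bytes]) -> bytes:
    """Write the files into an in-memory .tar.gz."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def verify_archive(archive: bytes, manifest: Dict[str, str]) -> List[str]:
    """Read every member back; return a list of problems (empty if the backup checks out)."""
    problems: List[str] = []
    seen = set()
    try:
        with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
            for member in tar:
                if not member.isfile():
                    continue
                data = tar.extractfile(member).read()
                seen.add(member.name)
                if manifest.get(member.name) != hashlib.sha256(data).hexdigest():
                    problems.append(f"checksum mismatch: {member.name}")
    except (tarfile.TarError, OSError, EOFError, zlib.error) as exc:
        problems.append(f"unreadable archive: {exc!r}")
    problems.extend(f"missing: {name}" for name in sorted(set(manifest) - seen))
    return problems


files = {"etc/motd": b"hello", "var/db/data": bytes(range(256)) * 64}
manifest = build_manifest(files)
archive = make_archive(files)
print(verify_archive(archive, manifest))  # []
print(verify_archive(archive[:len(archive) // 2], manifest))  # reports the damage
```

The key step is reading every member back. Opening or listing a damaged archive can succeed for the first members and fail only further in.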

Epilogue

Our backup system was a temporary solution that became permanent, but it worked! And we lived happily for a few years… until hard drive prices fell sharply and virtual machines grew significantly bigger.
For a while, we refused to believe it. We introduced daily, weekly and even no-backup options, but it was too late. Backups gave way to reliable RAID or network storage monitoring. KVM and OpenVZ appeared. Instead of backing up all the files, we started developing backups of user data for ISPmanager, but that is a different story.
The source code of ispbackup is available on GitHub.
