Let me start by apologizing for the troubles we have caused you over the last two weeks and for the lack of updates from me. It has truly been one of the worst moments of my life, and I expect it has been one of the most trying in yours as well. We have finally achieved stability in the network, and I can at last take a breath before diving into fixing the parts of the system that worked on our old network but are not working in the cloud — and, more importantly, helping as many people as possible get their sites back in order. I don't have time at this moment to explain every detail of this epic disaster, but I will try to cover the main points below.
On Tuesday, October 25th, our primary file server blew out a third drive in its RAID array, rendering it useless. Having learned a lesson from the March file server outage, we couldn't wait a month to recover the files. Since we were three to four weeks out from our planned migration to Amazon's AWS cloud, we figured it made sense to just go for it and move everyone up there. We got sites live within three days, but it immediately became apparent that the network our engineer had designed was massively underpowered. By Sunday we had added a few more servers that could deliver everyone's websites well enough. So many people needed to upload images to replace their lost ones that we opened uploads back up, and at that point the sheer number of uploads overwhelmed our system and we went down again. We fought like this for the next week: every time we reached stability, another group would start massive uploads and knock us back out of it. No number of servers could offset this load, because our bulk uploading was intimately tied to our ability to serve sites.

In our migration plan we had never intended to keep our old uploading system, because we knew this was a process we wanted to offload to a separate service we had almost finished building. So on Friday morning I started completing that separate upload system, and we launched it to our clients on Sunday. The new uploader works roughly ten times faster than the old one, is built on a better image-processing library, and never goes through our network. All uploads go straight to our S3 storage area, which triggers a function that grabs each image, processes it, updates the database, and then puts the images in your image folders. The important thing is that we wanted everyone using it so we could take the pressure off the network, but hundreds of thousands of images were still being pushed through the old upload system.
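For anyone curious how that new flow hangs together, here is a minimal, purely illustrative sketch in Python. All names here (handle_upload_event, process_image, the folder layout) are hypothetical, not our actual code; in production the storage is S3 and the trigger is a cloud function, whereas this sketch stands both in with plain dictionaries.

```python
# Hypothetical sketch of the S3-triggered upload flow described above.
# "storage" stands in for S3; "database" stands in for our image records.

def process_image(data):
    # Stand-in for the real image-processing library (resize, re-encode, etc.).
    return data.upper()

def handle_upload_event(key, storage, database):
    """Runs when a new file lands in the upload area.

    Grabs the raw image, processes it, updates the database,
    and files the result under the client's image folder.
    """
    raw = storage["uploads"][key]                       # grab the uploaded bytes
    processed = process_image(raw)                      # process the image
    client, filename = key.split("/", 1)                # e.g. "client123/photo.jpg"
    database.setdefault(client, []).append(filename)    # record it in the database
    storage["images"][f"{client}/{filename}"] = processed  # file into image folder
    del storage["uploads"][key]                         # clear the upload area

# Simulated run: one client uploads one photo.
storage = {"uploads": {"client123/photo.jpg": b"jpegdata"}, "images": {}}
database = {}
handle_upload_event("client123/photo.jpg", storage, database)
```

The key property this illustrates is the one described above: the upload never touches the web servers that deliver sites, so heavy upload traffic cannot take sites down.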
As of tonight we have made the new S3 upload process our primary upload system, so very few parts of our system still use the old one. That is how we were finally able to restore stability.
On top of this mess, our support system broke down. When we pointed everything to the cloud, support did not move (so we could ensure it stayed stable), but the ticket areas in the cloud admins were not connected back to our old network; they were pointed at a placeholder database in the cloud. No one noticed for a few days because support was still receiving hundreds of tickets — but those messages were coming from people emailing and replying directly to the support@ addresses for ifp3 and Redframe. Within a few hours of realizing what was going on, we had support connected in the cloud. All looked well, but then the server support was running on lost its outside internet connection, and for half a day none of support's replies made it off the server. So if you thought support was ignoring you, they were not. They truly were doing the best they could to answer everyone.
Since this fiasco began I have been working a cycle of 40 hours on, 8 hours of sleep, over and over. I have been doing everything in my power to restore your sites; there has not been a single waking moment that has not been focused on that end. No family moments, no personal moments: just making every Redframe and ifp3 site whole again. I apologize for my own lack of communication. I lost my network administrator three days after this went down and have essentially had a crash course in cloud network management over the last two weeks. I know our system better than anyone, since I built a large share of it, yet our old network was a tangled mess that I was less familiar with. Some of that tangle had also been built into our cloud network, and part of it I needed to undo to get things working properly. No one I could hire could undo it all and set things right, because they didn't know how all the pieces fit together. So that is what I had to become an expert in over the last ten days.
Initially, our migration plan was to move much smaller groups of clients into the network so we could test it under load; that would have been a two-to-three-week process in itself. We really did not expect to have to flip a switch and have everyone flood in at once — and certainly not with so many clients having to recover galleries lost over the last couple of weeks. Honestly, moving to AWS was meant to be a gift to everyone: thoroughly backed-up files, more powerful servers and database (backed up almost 30 times a week), faster image uploads, and an overall performance increase for every site. Every image is delivered through a service called CloudFront, which can serve images to clients in Sydney or the UK almost as fast as to someone in Oregon sitting in the same town as our servers. This is the peace of mind I had envisioned for us. Once this is all ironed out it will be by far the best place for us to have our businesses (yours and mine). We are almost there, but there are still a few kinks to work out of the system and a pile of clients to help. We honestly will do our very best to help each and every customer affected by this.
And just to wrap this up: we are a small family company, and this has affected us deeply. Everyone who works for me is either a friend or a family member, and we all took this very personally, working many, many extra hours to try to help everyone. I'm sorry if you were one of those who did not get the attention you needed. We will do our best to get to you and help you through this.