It can be the biggest disaster you have ever faced, or the best thing that has ever happened. Either way, it's a defining moment for your site, and the way you handle it can make or break your product. What am I talking about? I am talking about that abnormal burst of traffic. Sure you knew the servers were going to get busy, and sure you planned for it, even scaled for it, but when the day came you didn't get the expected 200% increase in traffic; you got 2000%.

What can you do in this situation? Well, a lot, but also not much. I recently had a client this happened to. This is how we dealt with it.

Pre-Launch

Before the day of the huge traffic hit, we knew we were going to have an increase in traffic. We expected a substantial increase, so to match we got a bigger slice. We host the site at SliceHost, so scaling vertically is not a big issue. To be honest, I don't think anyone expected the response we got, but we were scaled up to handle about 400% of the traffic that was currently on the site. That's quite an increase.

Launch Day

The Promo Videos and Product pages launched. The marketing affiliates swung into top gear, and the only thing I can say is WOW!

Launch Day - 10am

I get the call. I check the logs, and the server and code are moving along and working well, but there's just way too much traffic. The server is serving, but it's just not enough to handle the raw number of requests. So the first thing we do is drop all current connections and step up the server size (again, we host at SliceHost so it's not a big deal). We know it's not going to be enough, but baby steps; let's get at least some people served.

10:30am

The resize is complete and we're serving consumers again, but it won't hold long. We need a more stable solution.

11am

Using SliceHost's clone feature we set up 4 hosts with round-robin DNS. It has its problems and the load is not spread out very well, but the site is stable now, if a little slow.
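
Round-robin DNS is nothing fancy: the same hostname simply gets several A records, and resolvers rotate through them. A rough sketch of the idea in a BIND-style zone file (the name and addresses here are placeholders, not our real slices):

    ; Four A records for the same name; resolvers rotate through them.
    ; Keep the TTL low so the records can be changed again quickly.
    www    300    IN    A    10.0.0.11
    www    300    IN    A    10.0.0.12
    www    300    IN    A    10.0.0.13
    www    300    IN    A    10.0.0.14

The catch is that clients and intermediate resolvers cache whichever address they are handed, so the spread is uneven and a struggling server keeps getting hit anyway. That is why this was only a stopgap.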

1pm

We reconfigure the servers into a load-balanced configuration, test them, and get them into production, switching DNS back to a single server and preparing for the fact that we may need to move up to more servers.
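
To give a feel for it, here is a minimal sketch of that kind of front-end setup using Apache's mod_proxy_balancer; the exact balancer software and the addresses are illustrative, not our production config:

    # Illustrative only. Needs mod_proxy, mod_proxy_http and mod_proxy_balancer.
    <Proxy balancer://appcluster>
        BalancerMember http://10.0.0.11:80
        BalancerMember http://10.0.0.12:80
        BalancerMember http://10.0.0.13:80
        BalancerMember http://10.0.0.14:80
    </Proxy>

    ProxyPass        / balancer://appcluster/
    ProxyPassReverse / balancer://appcluster/

The nice part of fronting everything with one balancer is that adding capacity later is just cloning another slice and adding one more BalancerMember line, which is exactly what we ended up doing the next day.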

That Evening

With the traffic starting to die down a bit, we left the four-server configuration in place and started shoring up the parts of the site that we had duct-taped together to get things back up and running. Mostly the fact that when we went to round-robin DNS, each server had its own disconnected MySQL server. We also started making modifications to our host tracking software to accommodate the new servers. New Apache settings were tested and deployed.
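
Apache tuning in this situation usually comes down to how many simultaneous requests each slice will accept before it starts swapping. A sketch of the kind of prefork MPM settings that means; the numbers are purely illustrative and depend on slice RAM and process size, they are not lifted from our config:

    # Illustrative prefork tuning. Size MaxClients so that
    # MaxClients * (average Apache process size) still fits in RAM.
    KeepAlive On
    KeepAliveTimeout 2

    <IfModule mpm_prefork_module>
        StartServers          10
        MinSpareServers       10
        MaxSpareServers       20
        ServerLimit          150
        MaxClients           150
        MaxRequestsPerChild 1000
    </IfModule>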

Launch Day +1

Time for another WOW! moment, only this time the servers were handling the load better. It turned out even more people were hitting the website.

Early Morning

The website was responding fine, just slowly, but it's a video page that's the main focus of this launch, so slow just won't do.

Late Morning

We added 4 more servers to the load balancer (again using the clone feature from SliceHost). 

That afternoon and onward

We just sat back and watched progress bars and traffic reports. If the traffic got to a point where it might risk the servers' speed or stability, we would add a new server.

Here are some tips

  • Remember, when you get this type of traffic, it's a mixed bag. It's great that you're getting that much attention, but it's also bad that the servers were not prepared for it. Everyone eventually has these problems; it's the response time to get the problem resolved that makes the difference. In our case we had the issue "resolved" in about 40 minutes, even if we still wanted a better solution to the problem.
  • Try to keep in mind that 40 happy users are better than 400 angry ones. If you have to set limits, then set them. Try not to keep them long, but people who get a "Server Busy" response will understand more than someone who gets nothing but "the spinning globe". (There's a small sketch of this idea right after the list.)
  • DO NOT go switching out your server setup on the fly. Take your time. Test your solutions and make each change carefully. It takes longer, but 400 angry customers are better than none. And worse than that, when you start trying to switch server setups you're likely to introduce more issues than you're fixing.
  • Make sure you identify all your bottlenecks. Just adding servers won't work. For us, we had a CPU/RAM issue, which we handled by offloading processing to many backend servers, but we also had a problem with network throughput. Too many people watching a video meant our server was trying to send out more than it could handle there too. Adding 1000 servers would not have fixed that. We handled it by moving the movie to a dedicated set of servers that do nothing but serve the videos.
  • Simple is better. We all want a quick answer, something right now that magically makes everything better. Implementing those solutions ahead of time is great, but after the fact, in the middle of the storm, it's just silly. Use simple steps to make the situation better until you have time to test everything.
  • DO TEST EVERYTHING! Don't just do it in production. Test, test, test. The last thing you want is to crash the server and take 3 hours to get it back up. Just make tiny changes. We did round-robin DNS first. Not a great solution, but it worked to alleviate some pressure. Then we moved to 4 load-balanced servers. Then we added in the content server (the video server), and finally moved up to 8 servers, fail-over MySQL servers, several content servers, and then some. But because we did it step by step, the worst the customers saw after the first 30 minutes was a slow-playing video.
  • Set up host monitoring. Without it we never would have caught the issue, and without it we would not be able to stay ahead of it now. Simple things can make a huge difference. A ping check and a latency check work great to test network load. Load testing and hitting the actual URL make sure the servers are still working well. (The second sketch after this list shows the kind of simple check I mean.)
  • Remember, the traffic is temporary (probably). Meaning that after the initial rush is done, your load numbers will change. Be prepared to downsize; never sign up for something you have to keep long. But the traffic might not subside either, so something really temporary is a mistake as well.
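
On the "Server Busy" point: the idea is simply to cap how many requests you try to handle at once and turn the rest away politely, instead of letting everything queue up and time out. A minimal sketch in Python, as a hypothetical WSGI middleware rather than code from this project:

    # Hypothetical sketch: turn away excess requests with a 503 instead of
    # letting every request pile up behind an overloaded application.
    import threading

    class BusyLimiter(object):
        """WSGI middleware that caps concurrent requests into the app."""

        def __init__(self, app, max_in_flight=40):
            self.app = app
            self.max_in_flight = max_in_flight
            self.in_flight = 0
            self.lock = threading.Lock()

        def __call__(self, environ, start_response):
            with self.lock:
                if self.in_flight >= self.max_in_flight:
                    start_response('503 Service Unavailable',
                                   [('Content-Type', 'text/plain'),
                                    ('Retry-After', '30')])
                    return [b'Server busy, please try again in a moment.\n']
                self.in_flight += 1
            try:
                return self.app(environ, start_response)
            finally:
                # The slot is freed as soon as the app returns its response.
                with self.lock:
                    self.in_flight -= 1

Wrap your application with BusyLimiter(app, max_in_flight=40) and the users past the cap get an honest 503 with a Retry-After header rather than a spinning globe.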
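
And on monitoring: you do not need anything elaborate to start. Here is a tiny sketch of the kind of latency-and-URL check I mean; it is a hypothetical script with placeholder addresses, and our real tracking software does more than this:

    # Hypothetical sketch: hit each host's front page, note whether it
    # answered and how long it took. Addresses are placeholders.
    import time
    import urllib.request

    HOSTS = [
        "http://10.0.0.11/",
        "http://10.0.0.12/",
    ]

    def check(url, timeout=10):
        start = time.time()
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                ok = (resp.status == 200)
                detail = "HTTP %d" % resp.status
        except Exception as exc:
            ok, detail = False, str(exc)
        return ok, time.time() - start, detail

    if __name__ == "__main__":
        for url in HOSTS:
            ok, latency, detail = check(url)
            print("%-25s %-4s %6.3fs  %s"
                  % (url, "UP" if ok else "DOWN", latency, detail))

Run something like that from cron every minute or two and graph the latency column, and you will see a problem building long before the phone rings.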

All-in-All

A very memorable experience, but one that happens from time to time. The best thing to do is take a small set of steps, keep your cool, and TEST TEST TEST.

Coteyr.net Programming LLC is about one thing: getting your project done the way you like it. Using Agile development and management techniques, we are able to get even the most complex projects done in a short time frame and on a modest budget.

Feel free to contact me via any of the methods below. My normal hours are 10am to 10pm Eastern Standard Time. In case of emergency I am available 24/7.

Email: coteyr@coteyr.net
Phone: (813) 421-4338
GTalk: coteyr@coteyr.net
Skype: coteyr
Guru: Profile