Testing Web Services in Load Balanced Environments

Since the Internet emerged, I've found that effectively testing web services is a sort of arcane art form restricted to nerds like me.  I have profited from this talent, but I also think it's something that more people should know how to do.

Testing cleartext web services is actually simple, and I'm sure a number of you have done it.  Some readers may find this information trivial, while others might find it helpful.  In the past I've used the "open the web page" kind of approach - it either works or it doesn't - but that's not detailed enough if you really want to do the job right.  There's a lot of testing that can be done using Telnet that most people just don't know about.

For those of you who do use telnet, please keep in mind that just telnetting to the service port isn't good enough.  Just yesterday I got a call from some colleagues who "tested" a web server by "telnetting to it on port 80" - it connected, so they thought it was working.  Turns out they were wrong, and it caused a major outage to boot.  It wasn't their fault for not knowing what I know, though - that's why I get the big bucks.

Basic Remote Testing

Basic remote testing is done two ways - using a web browser and then with Telnet.

Let's start with an example server at 192.168.1.100.  You have a web server running there, so you try browsing to http://192.168.1.100.  If this works and returns a page of any kind from the server, then the server is at least answering.  This test is important.  I often skip it, because folks don't usually call me when browsing already works, but when testing this stuff it never hurts to ask "Did you try browsing to it?"  This could save you quite a bit of time on the phone!


Testing with a web browser does present some problems.  What if you get a blank page?  That's tough to interpret.  How long does it take to time out?  Does it time out at all?  Different browsers behave differently, so there's a big problem in result interpretation here.  Unless you get really good results and know what you're seeing, you should perform more advanced testing.


The telnet test is a little more complicated.  Those of you running Windows Vista or Windows 7 will need to actually install the Windows telnet client (it's in Control Panel, under Programs and Features, Turn Windows features on or off).  Those of us using Macs and Linux don't have to worry about this - telnet should be present on our systems already.

Telnet is a command line tool, so you'll have to open a terminal window in Linux or Mac, or start a command prompt in Windows.  From this prompt, to test a web server, all you have to do is type this:

        telnet 192.168.1.100 80

Breaking this down, the IP address 192.168.1.100 is the server we wish to connect to - this can be an IP address or a name like www.google.com - and the number 80 is the port we wish to connect to.  If you don't specify the port, telnet will assume that you want to use port 23, which is the standard telnet port.  Port 80 is the standard HTTP port.  If you're not familiar with the various ports and what they're used for, you can use Google or check Wikipedia to find a common TCP services listing.

Now, if things work right and the server is at least listening on port 80, you will see a "connected" message of some kind - this message is slightly different from one platform to another, but most of the time it will clear the screen if it works.  If you get this far, for basic testing, you've done well.  Press CTRL+], type quit, and you are returned to your command prompt.

If things aren't working right, you will see one of two different categories of problem - either a network response failure or a host response failure.

A network response failure will manifest as a long wait followed by a connection failure message of some kind.  These happen if you are being blocked by a firewall or if your traffic is somehow not getting to the server.  Some network failures are fast - like if routing fails and you get a destination unreachable error - however, firewall blockages usually take 15-30 seconds to manifest.

A host response failure will manifest much sooner, usually as a connection refused message.  This is a standard RFC 793 response (see pages 33 & 34) telling you that the remote service isn't running.  In most cases a refused message rules out interference from firewalls and other things getting in the way.  However, some firewalls can be programmed to send out connection refused messages, so it's a good idea to run a traceroute to validate the network path and make sure that none of the firewalls on the path, if any, are messing with things.
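For those who would rather script this check than type it, here is a minimal Python sketch of the same connect test.  It sorts the outcome into the categories described above.  The function name probe_port and the 5 second timeout are assumptions made for illustration, and it only tells you whether something is listening, nothing more.

```python
import socket


def probe_port(host, port, timeout=5.0):
    """Try a TCP connection and classify the result like a manual telnet test."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"        # something is listening on the port
    except ConnectionRefusedError:
        return "refused"              # host response failure: nothing listening
    except socket.timeout:
        return "timeout"              # network response failure: often a firewall drop
    except OSError as exc:
        return f"unreachable: {exc}"  # routing or name resolution failure
```

Calling probe_port("192.168.1.100", 80) plays the role of the telnet command shown earlier.  A "refused" result comes back almost immediately, while a "timeout" takes the full timeout to appear.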

HTTP Service Testing

Up to this point, we've done only very simple testing of the remote service - we've determined that it's listening, but we've not actually discovered if it's working yet.  If a server is broken, or a load balancer is messing with you, then basic testing with Telnet isn't enough to know what's really going on.

Service testing means that you don't just connect to the server; you must also interact with it.  There are rules for how the services interact, and each service is described in its own RFC.  In this case, we're going to communicate with a web server using the HTTP protocol on port 80.

  1. Open a command prompt

  2. Type in the command telnet 192.168.1.100 80

    At this point you should see a connected message of some kind.

  3. The next things you type may or may not be visible (depending on the OS you're using)

    Type:

    GET / HTTP/1.1
    Host: www.whatever.com:80

    (Press Enter/Return twice)

When this is done and everything is working right, a web page (in HTML form) should spit back at you - and the connection might even close - or it might stay open waiting for more commands.

Breakdown of the HTTP 1.1 directives used here:

With Step 2, we use the same command from the Basic Remote Testing section of this article above.  

Step 3 has two different command directives in it - these are from the HTTP 1.1 protocol.  The first line in step 3 is the GET command.  It tells the server what HTML page you want - replace the / with something like /about.html and you can change the page you wish to retrieve.  

The second line in step 3 is telling the server what website to get that web page from - remember that the HTTP 1.1 standard allows for virtual hosting - where multiple websites live behind a single IP address.  When using HTTP 1.1 you have to tell the server what website you're testing with - and it requires you to separate the server portion of the URL from the page portion.
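The two lines from step 3 can also be sent from a script.  The following is a rough Python sketch that opens the same kind of raw connection telnet does and sends the GET and Host lines by hand.  The names build_request and fetch are made up for this example, and the extra "Connection: close" line is an assumption that is not part of the steps above; it is there only so the server hangs up and the read loop knows when to stop.

```python
import socket


def build_request(host_header, path="/"):
    """Build the text you would type into telnet, with CRLF line endings."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host_header}",
        "Connection: close",  # assumption: asks the server to close when done
        "",                   # empty line ends the headers (the second Enter)
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch(address, port, host_header, path="/", timeout=5.0):
    """Send the request to address:port and return (status_line, raw_response)."""
    with socket.create_connection((address, port), timeout=timeout) as sock:
        sock.sendall(build_request(host_header, path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:      # server closed the connection
                break
            chunks.append(data)
    raw = b"".join(chunks)
    status_line = raw.split(b"\r\n", 1)[0].decode("iso-8859-1")
    return status_line, raw
```

Note that the address you connect to and the Host value are separate arguments, just as they are in the telnet session: fetch("192.168.1.100", 80, "www.whatever.com") asks the server at that IP for the front page of that website.  On a working server the first element returned is a status line beginning with HTTP/.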

Here's an example:

Example URL: (a yahoo search result for Open Source Anti-Spam)

http://search.yahoo.com/search;_ylt=Am8gS8IlXhBaelxqn9B0BGibvZx4?p=open+source+anti-spam&toggle=1&cop=mss&ei=UTF-8&fr=yfp-t-701

This URL can be fetched manually with:

  1. telnet search.yahoo.com 80

  2. GET /search;_ylt=Am8gS8IlXhBaelxqn9B0BGibvZx4?p=open+source+anti-spam&toggle=1&cop=mss&ei=UTF-8&fr=yfp-t-701 HTTP/1.1
    Host: search.yahoo.com:80

    (Press Enter/Return twice)
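Splitting a long URL by hand is easy to get wrong, so here is a small Python sketch that does the separation described above using the standard urllib.parse module.  The function name telnet_lines is invented for this example, and it assumes a plain http:// URL on port 80 unless the URL says otherwise.

```python
from urllib.parse import urlsplit


def telnet_lines(url):
    """Split a URL into the telnet command and the two request lines to type."""
    parts = urlsplit(url)
    port = parts.port or 80        # plain HTTP default when the URL has no port
    target = parts.path or "/"     # the page portion of the URL
    if parts.query:
        target += "?" + parts.query
    return (
        f"telnet {parts.hostname} {port}",
        f"GET {target} HTTP/1.1",
        f"Host: {parts.hostname}",
    )
```

Feeding it the Yahoo search URL above gives back the telnet command for search.yahoo.com, a GET line carrying everything after the host name, and the matching Host line.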

Dealing with Load Balancers

Up until now, all that I've illustrated here is basic testing.  In a load balanced environment there's another added layer of complexity.

Load balancers are designed to take traffic going to one IP, and distribute it to a bank of servers all doing the same job.  In this way, a complex website with large amounts of traffic can be hosted on 10 servers, but all look like the same exact website.

One popular form of cheap and easy load balancing is done using DNS.  Someone will set up DNS so that lookups for their website return multiple IP addresses.  A great example of this can be seen by running the command nslookup www.google.com - most people will see 2-10 servers in the results list, and these results will be customized depending on your ISP and a number of other factors.

This kind of load balancing isn't perfect, however.  Google spends a lot of time and effort to make sure every one of those servers works right, and they also try to point you at any servers your ISP might be hosting for them.  Long story there, but large ISPs actually do host their own Google caching servers, and this speeds things up substantially.

One common problem with DNS balancing is that if you have two servers in the DNS listing and one of them is offline, your customers will experience a condition called Every Other Request Works.  When this happens, folks will fail the first time, then refresh and things work fine and keep working until 5 minutes or so pass and their DNS cache expires.
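To catch a dead entry in a DNS-balanced name before your customers do, you can resolve the name and test every address it returns rather than just the first one.  The Python sketch below does a plain connect test against each address.  The function names and the 5 second timeout are assumptions for illustration, and it only looks at IPv4 records.

```python
import socket


def resolve_ipv4(name, port=80):
    """Return every IPv4 address the resolver gives for name, without duplicates."""
    infos = socket.getaddrinfo(name, port, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    addresses = []
    for _family, _type, _proto, _canon, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in addresses:
            addresses.append(ip)
    return addresses


def find_dead_records(name, port=80, timeout=5.0):
    """Connect to each address behind name and return the ones that fail."""
    dead = []
    for ip in resolve_ipv4(name, port):
        try:
            socket.create_connection((ip, port), timeout=timeout).close()
        except OSError:   # refused, timed out, or unreachable
            dead.append(ip)
    return dead
```

An empty list from find_dead_records means every address in the listing accepted a connection.  It does not prove each one serves pages correctly; that still takes the full GET test on each address.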

For the rest of us, there are load balancing features in firewalls, and there are dedicated load balancing systems as well.  nginx is a popular open-source load balancer, and F5s are popular commercial load balancers.  Both work very well and can be found in major networks all over the Internet.

A common term used with load balanced IPs is the VIP.  A VIP in this context is a Virtual IP address that the load balancer creates for your website.  This VIP is used as the destination for all your web traffic.

So let's take my example from before - my web server on http://192.168.1.100 - now imagine that I've actually got ten of these web servers running - http://192.168.1.100 through http://192.168.1.109 are all identical boxes running with the same content.

Using a load balancer, I can create a pool of web servers and in this pool, each of my ten servers will be defined as members of that pool.  I then create my VIP so that the IP address 192.168.1.10 maps into my pool of web servers.

When hunting down problems in this environment, you must test all 11 of these IP addresses to make sure things are working.  This means performing the testing described in the HTTP Service Testing section on all 11 IP addresses - from 192.168.1.100, 192.168.1.101, 192.168.1.102, through 192.168.1.109 and then on 192.168.1.10.

Problems with load balanced websites can be many and weird - so knowing that everything must be tested is very important.

Testing the front end VIP ensures that the load balancer is running correctly; testing the servers on the back end ensures that they're all working properly.
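Typing eleven addresses by hand invites mistakes, so a short loop helps.  The Python sketch below builds the list of targets from the example above (the VIP plus the ten pool members) and runs whatever check you give it against each one.  The names all_targets and run_checks are made up for this example, and check stands for any function of your own that takes an IP address and returns a result, such as one that performs the GET test from the HTTP Service Testing section.

```python
import ipaddress

VIP = "192.168.1.10"                                     # front end address
FIRST_MEMBER = ipaddress.IPv4Address("192.168.1.100")    # first back end server
MEMBER_COUNT = 10


def all_targets():
    """The VIP plus the ten pool members, 192.168.1.100 through 192.168.1.109."""
    members = [str(FIRST_MEMBER + i) for i in range(MEMBER_COUNT)]
    return [VIP] + members


def run_checks(check):
    """Run check(ip) against every target and return {ip: result}."""
    return {ip: check(ip) for ip in all_targets()}
```

If the VIP fails while all ten members pass, look at the load balancer; if one member fails, look at that server and then ask why the load balancer is still sending traffic to it.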

Load Balancer Health Checking

Something engineers working in load balanced environments should be aware of is that virtually all load balancers have integrated health checking - and if the person who set up the load balancer didn't set up the health checks properly, then you can experience many nasty, weird problems.

If health checks aren't enabled, it's possible that you can access a perfectly good VIP and get nothing back beyond a basic RFC 793 TCP handshake - because it may have forwarded your request to an offline server.

If health checks are set up with simple checks, then the load balancer might just do Basic Remote Testing - and that's it.  If a back end server in the pool is responding and connecting on port 80, but isn't actually responding to full GET commands, a load balancer doing Basic Remote Testing will eventually forward traffic to that dead server.  Some load balancers may only ping your webservers - and responding to a ping has nothing to do with checking a web service.

Health checks, when properly set up, perform full HTTP Service Testing - talking to and testing each server, fetching a complete web page.  When a server stops responding or doesn't pass the health check tests, then it's disabled in the pool - and the load balancer engineer should be notified by email.
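To make the difference concrete, here is a rough Python sketch of what a full HTTP health check does, as opposed to a ping or a bare connect.  This is not how any particular load balancer implements it; the function names, the expect parameter and the 5 second timeout are all assumptions for illustration.

```python
import http.client


def http_health_check(address, host_header, path="/", expect=b"",
                      port=80, timeout=5.0):
    """Return True only if a full GET comes back 200 and contains expect."""
    conn = http.client.HTTPConnection(address, port, timeout=timeout)
    try:
        conn.request("GET", path, headers={"Host": host_header})
        response = conn.getresponse()
        body = response.read()          # fetch the complete page
        return response.status == 200 and expect in body
    except (OSError, http.client.HTTPException):
        return False                    # refused, timed out, or not valid HTTP
    finally:
        conn.close()


def healthy_members(members, host_header, **kwargs):
    """Filter a pool down to the members that currently pass the check."""
    return [ip for ip in members if http_health_check(ip, host_header, **kwargs)]
```

A server that accepts connections on port 80 but never answers the GET fails this check after the timeout, which is exactly the case a connect-only or ping-only health check misses.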

In conclusion, I hope you find this article helpful - it covers a lot of ground, but the processes described here are easy enough to remember and follow - and you may just impress some folks if you can pull these out of your hat and call the right engineers first.  If everyone working in NOC or helpdesk environments knew these techniques, they could cut hours off their troubleshooting calls.

Cheers, 
-JS
