This page contains the proverbial frequently asked questions about HACMP. These FAQs are in no particular order.
The list is too short. Why don't you help fix that problem?
This page is part of the Matilda Team's HACMP Resources Collection. The home page of the collection is located here.
IMPORTANT: read the disclaimer BEFORE you use any information provided in this collection.
IMPORTANT: Versions of this FAQ prior to 2001/06/13 described an incorrect way to create globally unique alternate ethernet MAC addresses. Check out this version to see the correct way to do this.
Very high-level questions
What is high availability all about?
High availability is about making an application highly available. It isn't about making the hardware highly available. I've never met a user who really cared if the server was running. On the other hand, I've met a LOT of users who care if the application(s) running on the server is/are running. Focus on making the application highly available and use the hardware merely as a tool to achieve that goal.
I've taught a fair number of HACMP courses for IBM. I've found that a lot of students struggle with the basic HACMP concepts until they realize that high availability is about making an application highly available. Once they grasp this concept, the rest of the HACMP concepts start to look pretty obvious.
What does HACMP stand for?
HACMP is an abbreviation for High Availability Cluster Multi-Processing.
The High Availability part refers to the HACMP features which enable one to build a cluster of multiple nodes (i.e. multiple IBM RS/6000s) which work together to provide a highly available application.
The Cluster Multi-Processing part refers to the HACMP features which enable one to build a cluster of multiple nodes which work together to provide improved application performance (i.e. parallel processing).
Note that the two feature sets overlap and can be used together to build a cluster of multiple nodes which work together to provide improved application performance and high availability.
Do I need a highly available cluster for my application?
The short answer is (probably): if you aren't sure if you need HACMP then you don't need HACMP.
The medium answer is: If you don't know if you need a highly available cluster for your application then you almost certainly don't (or you really don't understand your application's requirements in which case, you're not ready to put your application into a highly available cluster).
The long answer (known as Jose's Law in honour of the person who came up with it) is:
1. Ask your manager if the application is business critical.
1a. If your manager says yes, investigate the tolerable downtime and economic loss of unavailability (you need to justify the cost of setting up an HA cluster).
1b. If the answer is no, ask his permission to investigate deeper.
2. In either case, investigate further by first powering down the server. Wait for the calls. Say you are fixing it. Count the calls. Ask if it's critical. Ask how critical it is.
2a. If nobody who matters complains then format the disks and install an MP3 & Quake server.
2b. If the house is on fire then go back to your manager. You need HACMP and you now have the business case to support the need.
If you don't like any of these answers, how about contributing a better one?
What's the practical difference between a rotating resource group and a cascading resource group with the (new with HACMP 4.4) cascading without fallback option enabled?
Short answer: They're quite similar.
Medium answer: In practical terms, the two are roughly equivalent. Use the one which best represents how you intend to manage the resource group:
- create it as a rotating resource group if you intend to leave the resource group on whichever node it happens to be running on today for an extended period of time (i.e. you're treating it like a rotating resource group).
- create it as a cascading without fallback resource group if you intend to normally run the application on the primary node (i.e. you intend to move it back to the primary node at the earliest appropriate and/or convenient opportunity).
There are, of course, other differences including that NFS exports, NFS cross mounts and such are supported only in cascading resource groups.
Long answer: Although the medium answer is (arguably) correct, there's a subtle difference between a rotating resource group and a cascading without fallback resource group that's worth considering:
- when the current node in a rotating resource group fails, the resource group is moved to the next node in the rotation. The service IP address in the resource group replaces the boot adapter's IP address on the takeover node.
- when the current node in a cascading resource group fails, the resource group moves to the next lower priority node in the resource group. The service IP address (if configured) replaces the IP address of one of the takeover node's standby adapters.
At first glance, this may appear to be a meaningless difference but the difference can be significant. If the takeover node's standby adapter(s) have already been used to recover from adapter failures on the takeover node then the fallover of a cascading resource group will fail due to the lack of an available standby adapter. On the other hand, a rotating resource group is less likely to run into this kind of trouble since the takeover node's boot adapter will tend to remain available since HACMP on the takeover node will do a swap adapter if the boot adapter's physical network interface dies.
Another way of looking at this is that any failure, even an apparently innocuous failure of a standby adapter on a backup node, needs to be taken seriously and fixed promptly.
Yet another way of looking at this is that one needs to make sure that takeover nodes have enough standby adapters to provide the level of redundancy that is required to provide the appropriate level of availability.
More detail oriented questions
Do I need a serial network in my cluster?
Short answer: yes.
Medium answer: failure to configure a serial network in your cluster will result in a cluster which has a single point of failure. A missing or improperly configured serial network will also increase the likelihood of getting a partitioned cluster (trust me - you don't want to get one of these, as one potential consequence is really quite nasty data corruption!).
Long answer: failure to configure a serial network could result in unnecessary failovers resulting from:
- the loss of the IP network (for example, switch death) resulting in each cluster node deciding that all other cluster nodes are dead and trying to take over applications which are still running on the other cluster nodes (you do NOT want this to happen).
Note that this scenario is pretty much the worst-case example of a partitioned cluster. Slightly less spectacular but still very unpleasant partitioned cluster scenarios can occur (in clusters without properly configured serial networks) if a network component failure results in groups of cluster nodes that can communicate within the group but not between the groups.
- bursts of high packet traffic on the IP networks resulting in the loss of a series of heartbeat packets
- failure of IP on a node resulting in other nodes deciding that the node with failed IP is down with the same result as the previous point.
- %%% Are there any others that are worth mentioning? %%%
If your cluster uses SSA disks, seriously consider using TMSSA (Target Mode SSA) to implement your serial network. In addition to being a robust serial network, this has the side-effect of causing HACMP to monitor your SSA loops since a failure of a TMSSA network (which HACMP will report) is, by definition, a serious problem w.r.t. your shared disks.
If you implement TMSSA, be sure to test it carefully (you'll need to sever all SSA connections between the hosts in order to cause a failure).
Ideally, every node in the cluster should be directly connected to every other node in the cluster via a serial network (one serial network for each pair of nodes). This rapidly becomes impractical as the number of nodes in the cluster gets larger. As a bare minimum, think of every node in the cluster as a router and ensure that every node in the cluster has a (conceptual) path to every other node in the cluster with said path only using serial networks. For example, a five node cluster with nodes A, B, C, D and E could meet this criterion with four serial networks - A to B, B to C, C to D and D to E. A somewhat better configuration would be to add a fifth network connecting E to A since this ensures that all surviving nodes have a (conceptual) "serial network path" to all other surviving nodes even if one of the nodes fails.
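The five node example above can be checked mechanically. The following is a minimal Python sketch, not anything HACMP provides: the function names are ours, and the node and link lists simply restate the A-to-E example. It treats the serial networks as an undirected graph and asks whether every node can reach every other node, both normally and after any single node has failed.

```python
from collections import deque


def reachable(nodes, links, start):
    """Return the set of nodes reachable from start over the given serial links."""
    neighbours = {n: set() for n in nodes}
    for a, b in links:
        # Links touching a node that is not in 'nodes' (e.g. a failed node) are ignored.
        if a in neighbours and b in neighbours:
            neighbours[a].add(b)
            neighbours[b].add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbours[current] - seen:
            seen.add(nxt)
            queue.append(nxt)
    return seen


def is_connected(nodes, links):
    """True if every node has a serial-only path to every other node."""
    nodes = set(nodes)
    if not nodes:
        return True
    return reachable(nodes, links, next(iter(nodes))) == nodes


def survives_any_single_node_failure(nodes, links):
    """True if the surviving nodes stay connected whichever single node dies."""
    return all(is_connected(set(nodes) - {dead}, links) for dead in nodes)


def full_mesh_size(n):
    """Number of serial networks needed to connect every pair of n nodes directly."""
    return n * (n - 1) // 2


nodes = ["A", "B", "C", "D", "E"]
chain = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E")]
ring = chain + [("E", "A")]

print(full_mesh_size(len(nodes)))                      # 10
print(is_connected(nodes, chain))                      # True
print(survives_any_single_node_failure(nodes, chain))  # False
print(survives_any_single_node_failure(nodes, ring))   # True
```

The chain meets the bare minimum but splits if a middle node (say C) dies, while the ring keeps all survivors connected. A full mesh of five nodes would need ten serial networks, which is why it rapidly becomes impractical.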
Should I use SCSI disks as shared disks in an HACMP cluster?
Short answer: only if you absolutely have to.
Medium answer: your cluster will almost certainly failover faster if you use SSA disks or some other disk technology which doesn't suffer from ghost disks. Also, the inability to connect or disconnect (without powering down) boxes which are cabled together using SCSI cables can result in longer outages when dealing with certain disk enclosure related failures.
Long answer: I've recently had experience building a cluster that uses a pair of IBM 2104-DL1 SCSI enclosures to provide shared disk storage. We've been unable to avoid getting ghost disks when a cluster node is rebooted at a point in time when the other node has the disks varied online. Since there are 24 shared disks in the cluster, dealing with the ghost disks results in failover times of about ten minutes for the cluster.
The other side of the coin is that using the 2104-DL1s saved the customer over $75,000 CAD. Whether or not the reduced failover time that one would experience with SSA disks is worth $75,000 CAD is a question that only the customer can answer.
Some additional points to ponder:
- It should be kept in mind that most if not all SCSI subsystems aren't supported by HACMP if you're using concurrent resource groups.
- I've been told that TMSCSI (Target Mode SCSI) is a bad idea because it significantly impacts the SCSI bus's performance due to the overhead involved in turning around the SCSI bus for each round-trip. I've not done any research, measurements or testing to see if what I've been told is correct. I'd suggest that if you must use shared SCSI disks then you should either avoid TMSCSI or do some VERY careful testing and measuring before you conclude that it's safe to use. Note that TMSSA is a completely different fish - it is a very lightweight protocol which (I've been told and which the SSA and TMSSA specifications seem to suggest) doesn't interfere with any other SSA activity on your shared SSA loops.
Which of the logical volumes on my shared VGs do I need to mirror?
Short answer: all of them.
Medium answer: if there is a logical volume on your shared disks that isn't worth mirroring then delete it. If it isn't worth mirroring then it isn't worth keeping.
Long answer: you need to mirror all logical volumes which your application uses. This specifically includes temporary space (e.g. a big file system used by the application for data caching purposes). The reason is simple: if you lose a disk that contains the only copy of any of your logical volumes then the application will either hang or suffer disk I/O errors when it tries to access the lost space. This could seriously affect the availability of your application.
Bottom line: mirror EVERYTHING on the shared VGs. Do NOT use the mirrorvg command unless you are REALLY careful or you have only a two-disk shared VG. You need to be careful to get the mirrors onto the right physical volumes to ensure that the two halves of each mirror are in/on different adapters, busses, paths, power supplies, disk cabinets, nuclear reactors, etc, etc, etc.
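One way to keep yourself honest about mirror placement is to write the layout down and check it. The sketch below is purely illustrative: the inventory tables, disk names, adapter names and the mirror_problems function are all hypothetical and you would have to fill them in by hand (or from your own tooling) for a real cluster. It flags logical volumes which have only one copy or whose copies share an adapter or enclosure.

```python
# Hypothetical inventory: which adapter and enclosure each physical volume sits behind.
PV_LOCATION = {
    "hdisk2": {"adapter": "ssa0", "enclosure": "drawer1"},
    "hdisk3": {"adapter": "ssa0", "enclosure": "drawer1"},
    "hdisk4": {"adapter": "ssa1", "enclosure": "drawer2"},
}

# Hypothetical layout: the physical volume holding each copy of each logical volume.
LV_COPIES = {
    "applv": ["hdisk2", "hdisk4"],  # copies on separate adapters and enclosures
    "tmplv": ["hdisk2", "hdisk3"],  # mirrored, but both copies share everything
    "loglv": ["hdisk2"],            # not mirrored at all
}


def mirror_problems(lv_copies, pv_location, components=("adapter", "enclosure")):
    """Return a list of human-readable problems with the mirror layout."""
    problems = []
    for lv, pvs in sorted(lv_copies.items()):
        if len(pvs) < 2:
            problems.append(f"{lv}: not mirrored")
            continue
        for component in components:
            values = [pv_location[pv][component] for pv in pvs]
            # Any value that appears more than once is shared between copies.
            shared = sorted({v for v in values if values.count(v) > 1})
            if shared:
                problems.append(f"{lv}: copies share {component} {', '.join(shared)}")
    return problems


for problem in mirror_problems(LV_COPIES, PV_LOCATION):
    print(problem)
```

Extending the components tuple (busses, power supplies, cabinets and so on) is just a matter of adding more keys to the inventory.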
What about rootvg? Do I need to mirror it?
Short answer: yes.
Medium answer: yes. Use a two-disk rootvg and check out the mirrorvg command (part of AIX since AIX 4.2.1) for an easy way to do this. Put the disks on separate controllers if at all possible. Read the man page carefully - there are some important issues that you need to get right when mirroring rootvg.
Long answer: you need to mirror all logical volumes which your application uses. The last time that we checked, all applications use the operating system and the operating system (i.e. AIX) uses rootvg logical volumes so mirror them. One that is sometimes missed is the paging space. Mirror the paging space if you want your node to be able to survive the loss of a physical volume containing paging space (i.e. if you don't mirror it then you've got a single point of failure).
Most existing HACMP clusters are configured to send system dumps to the primary paging space. Some (all?) versions of AIX won't generate a system dump into a mirrored paging space. Create a separate unmirrored dump space if this restriction applies to your version of AIX (assume that it does if you aren't sure). Even if AIX supported them, a mirrored dump space would be a waste of disk space since you won't need it unless the AIX kernel crashes and a kernel crash combined with a disk failure is either a double failure or was almost certainly caused by the disk failure.
Another factor to consider is that AIX doesn't respond well if an unmirrored rootvg physical volume is lost. Basically what happens is that the parts of AIX which notice the problem either hang or are terminated by the disk errors. The end result is that the operating system's features and facilities gradually stop working until, eventually, someone (i.e. a human) notices or something critical terminates (if the HACMP cluster manager is ever affected (rather unlikely but it happens) then the node will die about fifteen seconds later). Either eventuality could take a LONG time (think in terms of hours of elapsed time) during which random parts of your application will have probably either terminated or hung.
Bottom line: mirror rootvg. Use the mirrorvg command and then follow the extra steps described in the mirrorvg man page for your version of AIX.
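To spot a rootvg logical volume that was missed (paging space being the usual suspect), compare the LPs and PPs columns reported by lsvg -l rootvg: a logical volume with two copies has twice as many physical partitions as logical partitions. The Python sketch below parses a made-up sample of that output. The sample text, the dumplv name and the unmirrored_lvs function are our own illustrations, and the column layout is an assumption that you should check against your version of AIX.

```python
# Made-up sample in the general shape of 'lsvg -l rootvg' output (an assumption;
# verify the column layout on your own system before relying on this).
SAMPLE_LSVG_OUTPUT = """\
rootvg:
LV NAME             TYPE       LPs   PPs   PVs  LV STATE      MOUNT POINT
hd5                 boot       1     2     2    closed/syncd  N/A
hd6                 paging     16    16    1    open/syncd    N/A
hd4                 jfs        2     4     2    open/syncd    /
dumplv              sysdump    8     8     1    open/syncd    N/A
"""


def unmirrored_lvs(lsvg_output, ignore_types=("sysdump",)):
    """Return the names of logical volumes with fewer than two copies.

    Logical volumes whose type is in ignore_types (by default the dedicated,
    deliberately unmirrored dump space) are skipped.
    """
    unmirrored = []
    # Skip the volume group name line and the column header line.
    for line in lsvg_output.splitlines()[2:]:
        fields = line.split()
        if len(fields) < 6:
            continue
        name, lv_type = fields[0], fields[1]
        lps, pps = int(fields[2]), int(fields[3])
        if lv_type in ignore_types:
            continue
        if pps < 2 * lps:
            unmirrored.append(name)
    return unmirrored


print(unmirrored_lvs(SAMPLE_LSVG_OUTPUT))  # ['hd6']
```

In this sample the paging space hd6 is the one that was forgotten. The separate dump space is ignored on purpose since, as discussed above, it is meant to be unmirrored.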
How do I pick an alternate hardware address for hardware address takeover (HWAT)?
Short answer: carefully.
Medium answer: don't pick an address that is already in use on the subnet(s) that your cluster is attached to. Also, avoid picking an address which might be added to the subnet in the future (this is the tricky part!).
Long answer: the medium answer is correct but not very useful. Here's a way to do it:
Ethernet networks
Look at the default HW address that is assigned to one of your service/boot adapters. If you examine the first byte of the address as bits numbered 0 through 7 starting with the most significant bit, you should find that bit 1 is zero (if bit 1 is one then you've gone far too deep into the wilderness - try to find your way back out and see if you get luckier next time). Change bit 1 to one and use the resulting HW address as your alternative HW address.
For example, the HW address on an ethernet adapter on one of our RS/6000s is 08:00:5a:fc:32:b9. The first byte represented as bits is 00001000. Changing bit 1 to 1 makes it 01001000. Converting this back into hex gets the local HW address of 48:00:5a:fc:32:b9 which is what we'd use as our alternative HW address if we were configuring our service adapter for HW address takeover (of course, we'd drop the colons when we entered it into the HACMP "Change/Show Adapter" smit screen as 48005afc32b9).
Why does this work? The IEEE 802.3 folks (i.e. the folks responsible for the ethernet standard) decided to reserve bit 1 for locally assigned HW addresses. HW addresses assigned by ethernet card vendors are NEVER supposed to have bit 1 set to 1. Setting the bit to 1 turns the address into a local address (it is a global address when the bit is 0). As long as you don't change any of the other bits, you should have a worldwide unique local HW address (unless someone else on your subnet changed their bit 1 to 1 and happened to change their other bits to exactly match your adapter's default HW address (which would be rather foolish and hopefully quite unlikely)).
IMPORTANT: this section used to say that bit 6 is the bit to be flipped. This was a mistake. Toggle bit 1 as described above.
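The ethernet procedure above is easy to get wrong by hand, so here is a minimal Python sketch of it. It follows the bit numbering used in this answer (bit 0 is the most significant bit of the first byte, so bit 1 corresponds to the mask 0x40). The function name is ours, and the result is worth double-checking against your adapter's documentation before you type it into smit.

```python
BIT_1_MASK = 0b01000000  # bit 1 of the first byte, counting from the most significant bit


def alternate_hw_address(default_address):
    """Return the alternate HW address (no colons) for a default ethernet HW address."""
    octets = [int(part, 16) for part in default_address.split(":")]
    if len(octets) != 6 or any(not 0 <= octet <= 0xFF for octet in octets):
        raise ValueError(f"not a six byte HW address: {default_address!r}")
    if octets[0] & BIT_1_MASK:
        # The text above expects bit 1 to be zero on a vendor-assigned address.
        raise ValueError("bit 1 of the first byte is already set")
    octets[0] |= BIT_1_MASK
    # HACMP's smit screen wants the address without colons.
    return "".join(f"{octet:02x}" for octet in octets)


print(alternate_hw_address("08:00:5a:fc:32:b9"))  # 48005afc32b9
```

The function refuses to continue if bit 1 is already set, which matches the "you've gone far too deep into the wilderness" case above.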
Tokenring Networks
Change the first byte to 0x42 and leave the remaining bytes unchanged. This is analogous to setting bit 1 in an ethernet HW address.
What about other network technologies?
We're not sure. If you find out, please drop us a note so that we can augment this answer.
Gratuitous ARP
Most (all?) current versions of AIX support a feature called "gratuitous ARP". If your version of AIX supports this feature then you may not need to configure HWAT since an ARP packet that causes clients to update/flush their ARP caches will be broadcast whenever a network interface's IP address changes (including when HACMP moves an IP address as a result of a network adapter failure or a node failure). I'm not sure which version of AIX was the first to support gratuitous ARP. My recollection is that the feature was introduced in AIX 4.3.3 (does anyone happen to know for sure?).
A few things to keep in mind w.r.t. gratuitous ARP:
- Although support for it is fairly widespread, not all client operating systems respond properly to the gratuitous ARP packet so you need to do a fair bit of testing to make sure that the clients in your cluster respond correctly to the gratuitous ARP packet.
- Only hosts on the same logical subnet as the cluster need to support gratuitous ARP (i.e. don't worry about clients that are on the far side of a router).
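For the curious, a gratuitous ARP is just an ordinary ARP packet, broadcast on the local subnet, in which the sender and target IP addresses are both the address that has moved. The Python sketch below builds (but does not send) such an ethernet frame. It is our own illustration of the packet layout and makes no claim about exactly what AIX or HACMP puts on the wire (for example, whether the request or the reply opcode is used). The MAC and IP addresses are placeholders.

```python
import socket
import struct


def gratuitous_arp_frame(mac, ip):
    """Build a broadcast ethernet frame carrying a gratuitous ARP request."""
    mac_bytes = bytes.fromhex(mac.replace(":", ""))
    ip_bytes = socket.inet_aton(ip)
    if len(mac_bytes) != 6:
        raise ValueError(f"not a six byte HW address: {mac!r}")

    # Ethernet header: broadcast destination, our source MAC, EtherType 0x0806 (ARP).
    ethernet_header = b"\xff" * 6 + mac_bytes + struct.pack("!H", 0x0806)

    arp_payload = struct.pack(
        "!HHBBH6s4s6s4s",
        1,            # hardware type: ethernet
        0x0800,       # protocol type: IPv4
        6,            # hardware address length
        4,            # protocol address length
        1,            # opcode: request (an assumption; some stacks send a reply)
        mac_bytes,    # sender hardware address: where the IP address lives now
        ip_bytes,     # sender IP address: the address that moved
        b"\x00" * 6,  # target hardware address: unused in a request
        ip_bytes,     # target IP address: same as the sender, hence "gratuitous"
    )
    return ethernet_header + arp_payload


frame = gratuitous_arp_frame("48:00:5a:fc:32:b9", "192.0.2.10")
print(len(frame))   # 42
print(frame.hex())
```

Because the frame is a link-level broadcast, it never crosses a router, which is why only hosts on the same logical subnet need to handle it.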
Should I install the AIX man pages?
Short answer: yes.
Medium answer: yes.
Long answer: yes.
Really really long answer: yes. A system without man pages is a system which is harder to maintain. A system which is harder to maintain is a system which is more likely to be maintained incorrectly. Need we say more?
IMPORTANT: If you lack the appropriate skills, experience and/or competency, are unwilling to take responsibility for your actions, or if you don't like these disclaimers then don't use this information.