Booting LAM
Table of contents:
  1. Can I run LAM as root?
  2. What does "booting LAM" mean?
  3. Are there any tutorials available on getting started with LAM/MPI?
  4. Can I run LAM/MPI jobs under Globus?
  5. Can I run LAM/MPI jobs on a BProc cluster?
  6. Can I run LAM/MPI jobs under PBS?
  7. What conditions have to be met for LAM to be booted successfully?
  8. How do I add the LAM executables to my $PATH?
  9. I have more than one NIC on a host. Which IP name/address do I list in the boot schema?
  10. What is the recon tool? What do I use it for?
  11. recon succeeded, but lamboot failed. Why?
  12. What is a .rhosts file? Why do I need it?
  13. Should I use "+" in my .rhosts file?
  14. Can I use ssh with LAM (instead of rsh)?
  15. How do I make ssh not ask me for my password?
  16. recon/lamboot claims that it cannot find LAM executables on the remote node. What does that mean?
  17. Does LAM use static port numbers?
  18. Can I lamboot to hosts outside of my firewall?
  19. lamboot seems to hang -- why? And what do I do?
  20. Can I issue multiple lamboot's on a single machine?
  21. How do I lamboot multi-processor machines?

[ Return to FAQ ]


1. Can I run LAM as root?

No. It is a Very Bad Idea to run LAM as root. LAM explicitly disallows root from running all executables except recon (recon is allowed so that sysadmins who are installing LAM can test basic functionality).

The reasons why root should not run LAM executables are almost identical to those listed in the question "Should I run LAM as a root-level service for all my users to access?" in the "Typical setup of LAM" section.


2. What does "booting LAM" mean?

The LAM/MPI environment needs to be "booted" before any user MPI applications can be run.

LAM uses a daemon on each node for process control, meta environment control, and, in some cases, message passing. "Booting LAM" refers to the act of launching these daemons on each node. The lamboot command is used to boot LAM; after a successful lamboot, user programs can be run in the LAM/MPI environment.

Once the user is finished with LAM, the lamhalt command is used to shut down the LAM/MPI environment and remove the daemons from each node. Once the lamhalt command has been successfully run, no more LAM/MPI programs can be invoked until another lamboot is successfully issued.
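
The boot / run / halt life cycle can be sketched as a small Bourne-shell function. This is only an illustration: the name lam_session is hypothetical (it is not part of LAM), and it assumes that you already have a boot schema file and an MPI program of your own.

```shell
# Boot LAM, run one MPI program, then shut LAM down again.
# lam_session is a hypothetical helper name, not a LAM command.
lam_session() {
    schema=$1
    shift
    # Launch the LAM daemons on every node listed in the boot schema.
    lamboot "$schema" || return 1
    # Run the user program with LAM's mpirun; remember its exit status.
    mpirun "$@"
    status=$?
    # Remove the daemons whether or not the program succeeded.
    lamhalt
    return "$status"
}
```

For example, "lam_session myhostfile C ./my_program" would boot the nodes in myhostfile, run ./my_program on them, and then halt LAM (both file names are placeholders).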


3. Are there any tutorials available on getting started with LAM/MPI?

Yes, there are several. Click on the "tutorials" link in the left-hand navigation.

Here are a few of the tutorials available:


4. Can I run LAM/MPI jobs under Globus?
Applies to LAM 7.0 and above

Yes, but in limited scenarios.

LAM/MPI can boot LAM across a Globus grid using the fork scheduler only. Notes about the globus boot SSI module:

The following is an example boot schema for the globus boot module:

"inky:12853:/O=My/OU=Com/CN=HPC Grp" lam_install_path=/opt/lam cpu=2
"pinky:3245:/O=My/OU=Com/CN=HPC Grp" lam_install_path=/opt/lam cpu=4
"blinky:2345:/O=My/OU=Com/CN=HPC Grp" lam_install_path=/opt/lam cpu=4
"clyde:82342:/O=My/OU=Com/CN=HPC Grp" lam_install_path=/software/lam

Be sure to see the LAM/MPI User's Guide for more details about the globus module.


5. Can I run LAM/MPI jobs on a BProc cluster?
Applies to LAM 7.0 and above

Yes.

Ensure that LAM/MPI was compiled and installed with support for the bproc boot SSI module (you can run the laminfo command to see if the bproc boot module is included in your installation). When on a BProc head node, lamboot (etc.) should automatically choose to use the bproc boot module and launch the LAM daemons using native BProc mechanisms.

Notes about the bproc boot SSI module:

More details about the bproc boot module are available in the LAM/MPI Installation Guide and LAM/MPI User's Guide.


6. Can I run LAM/MPI jobs under PBS?
Applies to LAM 7.0 and above

Yes, LAM/MPI can be booted natively in PBS batch jobs (both OpenPBS and PBS Pro).

When used from within a PBS job, lamboot (etc.) will use the native PBS Task Management (TM) interface to launch the LAM daemons on the nodes that were allocated to the job. The tm boot SSI module does this task; use the laminfo command to see if support for the tm module is included in your LAM/MPI installation. Some notes about the tm boot SSI module:

Be sure to see the LAM/MPI User's Guide for more information about the tm boot SSI module.


7. What conditions have to be met for LAM to be booted successfully?

For each machine that LAM is to be booted on, all of the following conditions must be met:

All of these prerequisites must be met before LAM can be booted properly.

NOTE: OSCAR users should already have all of these conditions met. If you are having a problem with lamboot, check that a simple ssh between nodes works properly.


8. How do I add the LAM executables to my $PATH?

LAM must be able to find the LAM executables in your $PATH on every node. As such, your configuration/initialization files need to add the LAM executables to your $PATH properly.

How to do this may be highly dependent upon your local configuration, so you may need to consult with your local system administrator. Some system administrators take care of these details for you, some don't. YMMV. Some common examples are included below, however.

You must have at least a minimum understanding of how your shell works to get the LAM executables in your $PATH properly. Note that the LAM executables must be added to your $PATH in two situations: (1) when you log in to an interactive shell, and (2) when you log in to non-interactive shells on remote nodes.
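
As one illustration for Bourne-style shells, the fragment below prepends a LAM bin directory to $PATH only if it is not already listed. The directory /opt/lam/bin is an assumption; substitute the location where LAM is installed at your site. Which startup file the fragment belongs in depends on your shell, but it must be a file that is also read by non-interactive shells on remote nodes.

```shell
# Assumed install location; change this to match your site.
LAM_BIN=${LAM_BIN:-/opt/lam/bin}

# Prepend LAM_BIN to PATH unless it is already listed there.
case ":$PATH:" in
    *":$LAM_BIN:"*) ;;                 # already present, nothing to do
    *) PATH="$LAM_BIN:$PATH" ;;
esac
export PATH
```

The case test keeps $PATH from growing each time the startup file is read.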

NOTE: OSCAR users should already have this step taken care of. OSCAR uses a package called switcher to set up the $PATH for users. You may need to set your personal default to use LAM/MPI if it is not already the system default. Consult the OSCAR User's Manual for more details.


9. I have more than one NIC on a host. Which IP name/address do I list in the boot schema?

Two common configurations for setting up clusters for parallel computing are:

In each case, there's at least one node that has two IP addresses (and potentially two IP names) -- which one should be used in the LAM boot schema?

The answer is to use the IP name/address that refers to the NIC that you want LAM to use for TCP/IP communication (both LAM "meta" information and MPI message passing). LAM will use the NIC associated with the name/address used in the boot schema file. For example, in the first scenario above, the master node should be represented in the boot schema file with the IP address/name of its NIC on the private network. In the second scenario, the IP address/name of each node's 100Mbps NIC should be used to get maximum bandwidth for message passing.

Note that LAM can work fine in the first scenario if you specify the IP name/address of the NIC on the public network if the networking on the master node is configured to route traffic from the private network to the public network (usually behind Network Address Translation, or NAT). This is usually not a good idea, however, because it effectively causes extra network hops for traffic from the slave nodes to the master node, and therefore adds latency to message passing. In most cases, the IP name/address for the NIC on the private network should be used.

Also note that LAM will resolve all IP names only on the node where lamboot is executed. Hence, the local name resolution setup only matters on that node; name resolution does not occur on any other node. Internally, LAM only uses IP addresses.

For non-TCP/IP communication mechanisms, LAM will only use these IP addresses for "meta" information.


10. What is the recon tool? What do I use it for?

recon is used to verify that a user has the correct setup to boot LAM properly. It checks to see if LAM can be started on all the nodes in a given boot schema.

Users run recon to verify that their shell startup scripts (e.g., .cshrc, .profile, .bashrc) set up the environment so that LAM can be started on both the local and remote nodes.

recon does this by attempting a "fake" boot process on each node in the boot schema. recon will attempt to launch "tkill -N" on each node (the -N option indicates that tkill should not do anything).

If "tkill -N" can be executed successfully on each node, the following has been verified:

Note that this does not guarantee that lamboot will function properly; it only gives a pretty good indication that it will. lamboot can still fail for other reasons.
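
A common pattern is to run recon first and only attempt lamboot if it succeeds. The sketch below assumes that recon exits with a non-zero status when any node fails the check; check_and_boot is a hypothetical helper name, not a LAM command.

```shell
# Run recon on a boot schema, and only boot LAM if recon succeeds.
# check_and_boot is a hypothetical helper name.
check_and_boot() {
    schema=$1
    if recon "$schema"; then
        lamboot "$schema"
    else
        echo "recon failed for $schema; fix the setup before running lamboot" >&2
        return 1
    fi
}
```

A successful recon is still only a good indication, so the lamboot exit status should be checked as well.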


11. recon succeeded, but lamboot failed. Why?

There can be many reasons.

Note that recon does not do everything that lamboot does, which is why it is only a pretty good test, not a conclusive test. lamboot can sometimes fail with not-particularly-helpful error messages (particularly in LAM versions prior to 7.0).

A common cause for lamboot failure is that one of the hostnames in the boot schema resolved to the address 127.0.0.1. This is fine when there is only one hostname involved (i.e., lambooting on a single machine). However, when the LAM universe consists of more than one machine, none of the hostnames can resolve to the address 127.0.0.1. This is because 127.0.0.1 is a "special" IP address that always maps back to the local machine -- it's the localhost address. So if a node in the LAM universe tries to use the 127.0.0.1 address to try to contact another node in the LAM universe, it will actually be opening a socket to itself, not the intended destination node. And the connection will therefore fail.

Hence, all hostnames in the boot schema must resolve to the IP address of the network interface card (NIC) that you wish LAM to use.

You can tell if "the 127.0.0.1 problem" is happening to you if you lamboot with the -d switch -- see if any of the hboot lines in the debugging output show 127.0.0.1.

Unfortunately, some Linux distributions automatically put the hostname of the machine on the same line as localhost in /etc/hosts. For example, consider the following /etc/hosts file that is on the machine blinky, which is the "master" node in a cluster. blinky has a single NIC, with IP address 192.168.1.10:

127.0.0.1     localhost blinky
192.168.1.10  masternode.example.com masternode
192.168.1.11  node1.example.com node1
192.168.1.12  node2.example.com node2

If the name "blinky" is used in a boot schema with other hosts, the lamboot will fail. The following solutions are available:

NOTE: Starting with LAM 7.0, LAM will detect this situation and give an error immediately rather than trying to boot and failing. Versions prior to 7.0 will try to boot and abort with amorphous, undescriptive error messages.
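
A quick way to spot "the 127.0.0.1 problem" in a hosts file is to list every name on the 127.0.0.1 line other than localhost. The following is a minimal sketch; loopback_names is a hypothetical helper, and it only inspects the file you give it, so it will not notice names that resolve to 127.0.0.1 through DNS, NIS, or some other mechanism.

```shell
# Print every name that a hosts-format file maps to 127.0.0.1, other than
# the usual localhost aliases. loopback_names is a hypothetical helper.
loopback_names() {
    awk '
        { sub(/#.*/, "") }                 # strip comments
        $1 == "127.0.0.1" {
            for (i = 2; i <= NF; i++)
                if ($i != "localhost" && $i != "localhost.localdomain")
                    print $i
        }
    ' "$1"
}
```

Running "loopback_names /etc/hosts" on the example file above prints blinky, which is the name that must not be used in a multi-host boot schema.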


12. What is a .rhosts file? Why do I need it?

If you are using rsh to launch processes on remote nodes (either by setting this at configure time, letting configure use the default value of "rsh", or by setting the LAMRSH environment variable when you invoke recon or lamboot), you will probably need to have a $HOME/.rhosts file.

This file allows you to execute commands on remote nodes without being prompted for a password. The permissions on this file usually must be 0644 (rw-r--r--). It must exist in your home directory on every node that you plan to use LAM with.

Each line in the .rhosts file indicates a machine and user that programs may be launched from. For example, if the user steve wishes to launch programs from the machine stevemachine to the machines alpha, beta, and gamma, there must be a .rhosts file on each of the three remote machines (alpha, beta, and gamma) with at least the following line in it:

stevemachine steve

The first field indicates the name of the machine where jobs may originate from; the second field indicates the user ID who may originate jobs from that machine. It is better to supply a fully-qualified domain name for the machine name (for security reasons -- there may be many machines named stevemachine on the internet). So the above example should be:

stevemachine.example.com steve

The LAM Team strongly discourages the use of "+" in the .rhosts file. This is always a huge security hole.

If rsh does not find a matching line in the $HOME/.rhosts file, it will prompt you for a password. LAM requires the password-less execution of commands; if rsh prompts for a password, lamboot and recon will fail.

NOTE: Some implementations of rsh are very picky about the format of text in the .rhosts file. In particular, some do not allow leading white space on each line in the .rhosts file, and will give a misleading "permission denied" error if you have white space before the machine name.
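
The two pitfalls mentioned above (leading white space and "+" wildcards) can be checked mechanically. This is a sketch only; check_rhosts is a hypothetical helper, and it does not attempt to validate anything else about the file.

```shell
# Report lines in a .rhosts-style file that are likely to cause trouble:
# leading white space, or a "+" wildcard in either field.
# check_rhosts is a hypothetical helper; it exits non-zero if it finds any.
check_rhosts() {
    awk '
        /^[ \t]+[^ \t]/        { printf "line %d: leading white space\n", NR; bad = 1 }
        $1 == "+" || $2 == "+" { printf "line %d: \"+\" wildcard\n", NR; bad = 1 }
        END                    { exit bad + 0 }
    ' "$1"
}
```

For example, "check_rhosts $HOME/.rhosts" prints nothing and succeeds for a file that contains only lines such as "stevemachine.example.com steve".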

NOTE: It should be noted that rsh is not considered "secure" or "safe" -- .rhosts authentication is considered fairly weak. The LAM Team recommends that you use ssh ("Secure Shell") to launch remote programs, as it uses a much stronger authentication system.

NOTE: OSCAR users should not need .rhosts files. OSCAR is configured to automatically use user-level passwordless-ssh between all nodes in the cluster.


13. Should I use "+" in my .rhosts file?

No!

While there are a very small number of cases where using "+" in your .rhosts file may be acceptable, the LAM Team highly recommends that you do not.

Using a "+" in your .rhosts file indicates that you will allow any machine and/or any user to connect as you. This is extremely dangerous, especially on machines that are connected to the internet. Consider the fact that anyone on the internet can connect to your machine (as you) -- it should strike fear into your heart.

The + should not be used for either field of the .rhosts file.

Instead, you should use the full and proper hostname and username of accounts that are authorized to remotely login as you to that machine (or machines). This is usually just a list of your own username on a list of machines that you wish to run LAM over. See the "What is a .rhosts file? Why do I need it?" question for further explanation, as well as your local rsh documentation.

Additionally, the LAM Team strongly recommends that rsh not be used -- it is considered weak remote authentication. Instead, we recommend the use of ssh -- the secure remote shell. See the questions "Can I use ssh with LAM (instead of rsh)?" and "How do I make ssh not ask me for my password?" for more details.


14. Can I use ssh with LAM (instead of rsh)?

Yes, you can change the remote transport agent that LAM uses to spawn the LAM daemons. While rsh is the default, it can be changed to other agents, such as ssh. ssh is a popular choice because of the added security that it provides over the .rhosts security provided by rsh. And since ssh can pass AFS tokens, it presents an attractive, highly secure, yet fully-AFS-authenticated method, for invoking LAM.

If you choose to use ssh, the 1.x series of ssh may require the use of the "-x" command line flag. "-x" prevents X forwarding, which may prevent an xauth status message from being printed on stderr. lamboot, recon, and the other LAM tools interpret information on stderr to mean that a remote invocation has failed; ssh's "-x" may prevent this. The "-q" (quiet) option may also be useful for suppressing stderr output; see the ssh documentation.

You can specify to use ssh at configure time with the --with-rsh flag:

% ./configure --with-rsh="ssh -x"

Additionally, in LAM 7.1.1, you can override the remote shell agent that was specified at configure time with the LAMRSH environment variable. Setting this environment variable before invoking recon, lamboot, or any other LAM executable will force LAM to use that remote shell program instead. For example, using a Bourne shell (or some other sh derivative):

% LAMRSH="ssh -x"
% export LAMRSH
% recon myhostfile

Or, using the C shell (or some csh derivative):

% setenv LAMRSH "ssh -x"
% recon myhostfile

NOTE: OSCAR users typically already have LAM set up to use ssh by default.


15. How do I make ssh not ask me for my password?

There are multiple ways.

Note that there are two mainstream versions of ssh. One is the freeware package OpenSSH; the other is SSH, a commercial package from SSH Communications Security Corp.

This documentation provides an overview for using user keys and the OpenSSH 2.x key management agent (if your OpenSSH only supports 1.x key management, you should upgrade). See the OpenSSH documentation for more details and a more thorough description. The process is essentially the same for the commercial SSH, but the command names and filenames are slightly different. Consult the SSH documentation for more details.

References to ssh in this text refer to OpenSSH.

Normally, when you use ssh to connect to a remote host, it will prompt you for your password. However, in order for lamboot and recon to work properly, you need to be able to execute jobs on remote nodes without typing in a password. In order to do this, you will need to set up RSA (ssh 1.x and 2.x) or DSA (ssh 2.x) authentication. We recommend using DSA authentication as it is generally "better" (i.e., more secure) than RSA authentication. As such, this text will describe the process for DSA setup -- RSA setup is analogous, but takes slightly different commands and filenames.

This text will briefly show you the steps involved in doing this, but the ssh documentation is authoritative on these matters and should be consulted for more information.

The first thing that you need to do is generate a DSA key pair with ssh-keygen:

% ssh-keygen -t dsa

Accept the default value for the file in which to store the key ($HOME/.ssh/id_dsa) and enter a passphrase for your keypair. You may choose to not enter a passphrase and therefore obviate the need for using the ssh-agent. However, this weakens the authentication that is possible, because your secret key is [potentially] vulnerable to compromise because it is unencrypted. See the ssh documentation.

Next, copy the $HOME/.ssh/id_dsa.pub file generated by ssh-keygen to $HOME/.ssh/authorized_keys:

% cd $HOME/.ssh
% cp id_dsa.pub authorized_keys

In order for DSA authentication to work, you need to have the $HOME/.ssh directory in your home directory on all the machines you are running LAM on. If your home directory is on a common filesystem, this is already taken care of. If not, you will need to copy the $HOME/.ssh directory to your home directory on all LAM nodes (be sure to do this in a secure manner, perhaps using the scp command, particularly if your secret key is not encrypted).

ssh is very particular about file permissions. Ensure that your home directory on all your machines is set to mode 755, your $HOME/.ssh directory is also set to mode 755, and that the following files inside $HOME/.ssh have the following permissions:

-rw-r--r--  authorized_keys
-rw-------  id_dsa
-rw-r--r--  id_dsa.pub
-rw-r--r--  known_hosts
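
The permissions listed above can be applied with chmod. The function below is a minimal sketch; fix_ssh_perms is a hypothetical helper name, and it takes the home directory as its argument so that it can be tried on a scratch directory first.

```shell
# Apply the permissions listed above to one home directory.
# fix_ssh_perms is a hypothetical helper; pass the home directory to fix.
fix_ssh_perms() {
    home=$1
    # Home directory and .ssh directory: mode 755.
    chmod 755 "$home" "$home/.ssh" || return 1
    # Public files: mode 644. Skip any that do not exist yet.
    for f in authorized_keys id_dsa.pub known_hosts; do
        if [ -f "$home/.ssh/$f" ]; then
            chmod 644 "$home/.ssh/$f"
        fi
    done
    # The private key must be readable and writable by its owner only.
    if [ -f "$home/.ssh/id_dsa" ]; then
        chmod 600 "$home/.ssh/id_dsa"
    fi
    return 0
}
```

Typical use would be "fix_ssh_perms $HOME", repeated on each node if your home directory is not on a common filesystem.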

You are now set up to use DSA authentication. However, when you ssh to a remote host, you will still be asked for your DSA passphrase (as opposed to your normal password). This is where the ssh-agent program comes in. It allows you to type in your DSA passphrase once, and then have all successive invocations of ssh automatically authenticate you against the remote host. To start up the ssh-agent, type:

% eval `ssh-agent`

You will probably want to start the ssh-agent before you start X, so that all your windows will inherit the environment variables set by this command. Note that some sites invoke ssh-agent for each user upon login automatically; be sure to check and see if there is an ssh-agent running for you already.

Once the ssh-agent is running, you can tell it your passphrase by running the ssh-add command:

% ssh-add $HOME/.ssh/id_dsa

At this point, if you ssh to a remote host that has the same $HOME/.ssh directory as your local one, you should not be prompted for a password. If you are, a common problem is that the permissions in your $HOME/.ssh directory are not as they should be.
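
To test a node without risking an interactive prompt, ssh can be told to fail instead of asking for a password. The sketch below uses OpenSSH's BatchMode option for that, and also treats any output on stderr as a failure, since LAM does the same. can_ssh_quietly is a hypothetical helper name.

```shell
# Succeed only if ssh can run a command on the given host without
# prompting and without writing anything to stderr.
# can_ssh_quietly is a hypothetical helper name.
can_ssh_quietly() {
    # BatchMode=yes makes OpenSSH fail rather than prompt.
    # Capture stderr only; stdout is discarded.
    err=$(ssh -x -o BatchMode=yes "$1" true 2>&1 >/dev/null) || {
        echo "$1: ssh failed: $err" >&2
        return 1
    }
    if [ -n "$err" ]; then
        echo "$1: ssh wrote to stderr: $err" >&2
        return 1
    fi
}
```

Calling it once for each host in your boot schema (for example, "can_ssh_quietly node1.example.com") shows which nodes still need attention.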

Note that this text has covered the ssh commands in very little detail. Please consult the ssh documentation for more information.

NOTE: OSCAR users should already have passwordless ssh set up, and should not need to perform any of the above steps.


16. recon/lamboot claims that it cannot find LAM executables on the remote node. What does that mean?

When recon or lamboot cannot find the LAM executables on a remote node, it means that LAM tried to invoke a LAM executable on the remote node, and the shell failed to find it. This usually indicates that the directory where the LAM executables are found is not in the user's path.

That is, in the user's $HOME/.cshrc (not the user's $HOME/.login!), $HOME/.profile, $HOME/.bashrc, or whatever other shell startup script is used, the directory for the LAM executables must be put in the path environment variable.

Sometimes the directory is put in the path properly, but after the startup script has exited for non-interactive shells. That is, users typically put the extra path statement at the end of their .cshrc (or whatever) file -- this may not be the Right Thing to do.

If your .cshrc file has a line similar to the following:

if ($?USER == 0 || $?prompt == 0) exit

then you must set the path before this line.
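
The ordering can be checked with a rough heuristic. The Bourne-shell sketch below succeeds only if a csh startup file sets the path on a line that comes before the first "exit" line guarded by a $?prompt test. path_before_exit is a hypothetical helper, and it only recognizes lines that begin with "set path" or "setenv PATH".

```shell
# Succeed if the file sets the path before the first "exit" line that is
# guarded by a $?prompt test. path_before_exit is a hypothetical helper.
path_before_exit() {
    awk '
        /^[ \t]*(set[ \t]+path|setenv[ \t]+PATH)/ { if (!seen_exit) ok = 1 }
        /\$\?prompt/ && /exit/                     { seen_exit = 1 }
        END                                        { exit !ok }
    ' "$1"
}
```

For example, "path_before_exit $HOME/.cshrc" fails for a file in which the only path statement follows the exit line shown above.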


17. Does LAM use static port numbers?

No. The lamboot command sets up sockets between all nodes in the system. The sockets that are used, and the port numbers that are used to connect them, are completely dynamic.

Similarly, when MPI_INIT is invoked in user programs, additional sockets may be set up. These sockets, and the port numbers that are used to connect them, are also completely dynamic.

This may be changed in a future release if enough users ask for static port numbers.


18. Can I lamboot to hosts outside of my firewall?

Since LAM does not use static port numbers, it would be very difficult to map predictable holes through a firewall to allow LAM to boot properly. Additionally, in C2C mode, user MPI programs will establish further dynamic sockets.

Until LAM supports static port numbers, launching LAM jobs through a firewall is unlikely to work.


19. lamboot seems to hang -- why? And what do I do?

If lamboot seems to hang for no discernible reason, use the -d switch to either recon or lamboot. This will provide a lot of information on exactly what LAM is trying to do at each step of the way.

The -d switch also sends a lot of debugging output to the system logs (syslog) from the LAM daemons on each node. This output can also be quite helpful in finding problems. The system logs are typically located in directories such as /var/adm or /var/log, but your system's setup may be different.


20. Can I issue multiple lamboot's on a single machine?

While there is nothing to prevent you from executing lamboot multiple times on the same host, it probably does not do what you expect. lamboot will kill any running MPI programs and any pre-existing LAM daemon by the same user on a given node before starting up a new LAM daemon.

That is, in most cases, there can only be one LAM daemon per user on a node at any given time -- and this is usually sufficient for most users. It is a common misconception that you need multiple LAM environments to run multiple user MPI programs simultaneously. This is not true -- you can have a single LAM/MPI environment booted, and run multiple user MPI programs in the same environment (even on the same nodes).

The exception is when running under a batch queueing system -- the batch scheduler may schedule multiple jobs by the same user to the same node. In this case, there clearly need to be multiple LAM daemons owned by the same user on the same node.

LAM 7.1.1 will automatically do the Right Thing for lamboot's executed inside of PBS, SGE, and LSF batch jobs. That is, if LAM detects that it is running in a PBS, SGE, or LSF job, it will automatically adapt itself to allow one LAM daemon per PBS / SGE / LSF job (vs. the default behavior of one LAM daemon per node), even if PBS / SGE / LSF jobs overlap nodes.

Users of any other batch system may manually set this behavior with LAM 7.1.1 in their batch script files by inserting the following lines before the lamboot command (for sh-related shells):

  LAM_MPI_SESSION_SUFFIX="${BATCH_JOBID}"
  export LAM_MPI_SESSION_SUFFIX
  # ...rest of script, to include the lamboot command

and for csh-related shells:

  setenv LAM_MPI_SESSION_SUFFIX "${BATCH_JOBID}"
  # ...rest of script, to include the lamboot command

where users of other batch systems would use an appropriate environment variable that gives the batch job ID instead of $BATCH_JOBID. Consult your batch system's documentation.


21. How do I lamboot multi-processor machines?

lamboot has been extended to understand multiple CPUs on a single host, and is intended to be used in conjunction with the new "C" mpirun syntax for running on SMP machines (see the section on mpirun). Multiple CPUs can be indicated in two ways: list a hostname multiple times, or add a "cpu=N" phrase to the host line (where "N" is the number of CPUs available on that host). For example, the following hostfile:

        blinky
        blinky
        blinky
        blinky
        pinky cpu=2

indicates that there are four CPUs available on the "blinky" host, and that there are two CPUs available on the "pinky" host. Note that this works nicely in a PBS environment, because PBS will list a host multiple times when multiple vnodes on a single node have been allocated by the scheduler.

It is important to note that LAM has no concept of CPU scheduling issues -- that is the operating system's responsibility. Specifying "cpu=N" or listing a hostname multiple times in a boot schema file is simply shorthand for indicating to LAM how many processes you will want to launch on a given machine. In the above example, if the machine blinky really only has two processors (instead of four, as it is listed), LAM will still launch four user processes (see the "Running LAM/MPI applications" section of the FAQ) on blinky because it was listed this way in the boot schema. The operating system is responsible for scheduling those four processes between blinky's two CPUs.
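
To see what a boot schema adds up to, the per-host counts can be tallied with awk. This is a sketch rather than a reimplementation of LAM's parser: schema_cpus is a hypothetical helper, it treats "#" as starting a comment, and it assumes that repeated listings and "cpu=N" entries for the same host simply add together.

```shell
# Print each host in a boot schema with the number of CPUs listed for it:
# one per plain listing, or N for a "cpu=N" entry.
# schema_cpus is a hypothetical helper, not a LAM command.
schema_cpus() {
    awk '
        { sub(/#.*/, "") }                 # strip comments
        NF == 0 { next }                   # skip blank lines
        {
            n = 1
            for (i = 2; i <= NF; i++)
                if ($i ~ /^cpu=[0-9]+$/) { split($i, kv, "="); n = kv[2] }
            if (!($1 in cpus)) order[++hosts] = $1   # remember first-seen order
            cpus[$1] += n
        }
        END { for (h = 1; h <= hosts; h++) print order[h], cpus[order[h]] }
    ' "$1"
}
```

For the hostfile shown above, this prints "blinky 4" followed by "pinky 2".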

Note that different usernames can be specified for specific hosts as well. For example:

       blinky cpu=2 user=lamguest

specifies that the username "lamguest" should be used to log in to the machine "blinky". This is different than previous syntax for specifying usernames for remote nodes; the old use (not even described here :-) is still available, but its use is deprecated.
