Jacket HPC
From Jacket Wiki
Introduction
Jacket includes functionality to seamlessly utilize multiple GPUs across multiple computer nodes via the MATLAB® Distributed Computing Server. With the simple addition of well known parallel constructs such as PARFOR and SPMD, pre-existing code may be dispatched across all GPUs and CPUs in a cluster. In many cases, little to no code revision is required to take advantage of this new parallel computing capability.
Jacket's ability to span computation across multiple GPUs either locally or over a network allows for an unprecedented ability to transparently scale GPU and CPU computing resources. Additional GPUs added to a host may now be instantly utilized without a code modification - simply increment the number of MATLAB workers via the MATLAB command prompt. When a host is not capable of driving more GPUs, simply add a GPU to another host on the network and drive it via the MATLAB Distributed Computing Server. With the addition of Jacket, pre-existing CPU clusters may be upgraded through the installation of GPUs, significantly increasing the cluster's computational capability without investing in new development for specialized GPU code. MATLAB paired with Jacket is the easiest and most scalable solution for GPU computing available.
The Jacket Multi-GPU license (Jacket MGL) enables single-node, multi-GPU systems to run Jacket code. It is available with every Jacket trial license, along with an example application in the installation directory, <jacket_root>/examples/mgl_example.
The Jacket High-Performance Computing license (Jacket HPC) enables multi-node, multi-GPU systems to run Jacket code.
Jacket HPC: Getting Started
Parallel computing in MATLAB is built around the concept of workers. The number of workers is declared with the MATLABPOOL command. Although a user can create any number of workers, optimal performance is achieved only when one CPU core is dedicated to each worker. For example, on a quad-core machine, it is recommended to create only four workers. A similar philosophy applies to assigning workers to GPUs: the total number of workers in the compute pool should equal the number of GPUs present in the system. Moreover, best performance is achieved when only one host CPU is assigned to each GPU worker.
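The sizing rule above can be sketched as a small shell helper. This is a minimal illustration, not part of Jacket: the function name recommended_workers is hypothetical, and it assumes the rule reduces to "one worker per GPU, but never more workers than CPU cores".

```shell
# Suggest a worker count: one worker per GPU, never more workers than CPU cores.
# Arguments: $1 = number of CPU cores, $2 = number of GPUs.
# (Hypothetical helper for illustration; not shipped with Jacket or MATLAB.)
recommended_workers() (
    cores=$1
    gpus=$2
    if [ "$gpus" -le "$cores" ]; then
        echo "$gpus"
    else
        echo "$cores"
    fi
)

# Example: a quad-core host with two GPUs.
recommended_workers 4 2   # prints 2
```

On a live node, the two arguments could be filled in with, for example, "$(nproc)" and "$(nvidia-smi -L | wc -l)", assuming both tools are installed.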
As a pre-requisite, please read the Jacket MGL page, as well as the steps to set up a Concurrent Licensing Server, as they also apply to Jacket HPC.
Requirements
To get HPC working the following pre-requisites must be satisfied:
- Linux Operating System. Jacket HPC is currently only available on Linux 32-bit and 64-bit platforms. Recommended platforms at present are Red-Hat and Fedora-based systems.
- MATLAB Distributed Computing Server (DCS). In order to create parallel applications in a cluster setup, a user must have DCS installed on the cluster computers. MATLAB documentation is the best source of information for DCS documentation.
- Jacket Concurrent Networking Server Package. Available for download from the Manage Licenses page once Jacket HPC or Concurrent Network Licenses have been purchased.
- The Jacket Daemon. This daemon service enables coordination between the Jacket-enabled workers. It must run on every slave node. At startup each worker queries the local daemon to see which card it is assigned.
Please contact Support if you want to download the daemon for any other Operating System.
- Jacket HPC License. The Jacket HPC license add-on must be present in your license, with the appropriate number of GPUs.
Setting up Jacket HPC
Set up MATLAB
You need to set up your Distributed Computing environment so that it can create matlabpools and create and destroy workers. If you already have this set up, you may skip this step.
Note: The steps in this section are indicative (and for Linux machines). Actual steps to be performed may vary according to your specific system.
The mdce and admincenter services are typically found in the toolbox/distcomp/bin folder of MATLAB.
- Start the mdce service on all nodes: ./mdce start.
- Start the admincenter service: ./admincenter.
- Create as many hosts as the number of slave nodes in your cluster.
- Create a job manager on your master node.
- Try creating some workers. If you are unable to do so, please read MATLAB documentation.
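Starting mdce on every node can be scripted. The sketch below only prints the commands it would run, so they can be reviewed first. It assumes passwordless SSH to each node and the same MATLAB installation path everywhere; the function name, the path /opt/matlab, and the node names are placeholders.

```shell
# Print (do not execute) the command that would start mdce on each node.
# $1 = MATLAB installation root, remaining arguments = node names.
# All names here are placeholders for illustration.
print_mdce_start_commands() (
    matlab_root=$1
    shift
    for node in "$@"; do
        printf 'ssh %s "cd %s/toolbox/distcomp/bin && ./mdce start"\n' "$node" "$matlab_root"
    done
)

print_mdce_start_commands /opt/matlab node01 node02
```

Once the printed commands look right for your cluster, they can be piped to sh or run by hand.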
Next, create a configuration in MATLAB as follows:
- Open MATLAB in GUI Mode
- Open the Menu Parallel > Manage Configurations
- Create a job configuration with the name of the job manager and the head-node. (For example, hpc@head-node).
Click "Start Validation" to validate the setup.
Important Notes:
- You may need administrator access to run the MATLAB services above.
- If not running mdce as administrator, you can run it using the -u flag and specify a user that has appropriate privileges.
- If the validation of your job configuration fails with the following message:
- "Lab <lab number> on host <hostname> failed to connect to the MATLAB client on host <head-node name>".
- You may need to put your head-node's IP address into /etc/hosts on the head-node.
Set up Licensing System
- Download a concurrent license file. Take care to ensure that the port number you choose is different from mdce's port.
- Follow the steps here to unpack and start the license server, lmgrd.
- On all the slave nodes, start jacketd as the same user that started mdce.
./jacketd      # daemon mode (runs in background)
You can also run this with the following optional flags:
./jacketd -f   # optional foreground mode (output sent to terminal)
./jacketd -d   # optional diagnostic output as clients are connecting
Additional Steps
You may need to perform the following added steps in some cases. These are recommended but not always necessary.
- On slave nodes, you may need to set the environment variable FORCE_MGL, either with export FORCE_MGL=1 in the shell or with setenv('FORCE_MGL','1') from within MATLAB.
- To enable workers to communicate, give /tmp/jacketd.sock write permissions (this file is created by jacketd).
- Set an environment variable, LM_LICENSE_FILE=port-number@head-node
- Add a line, USE_SERVER, just below the VENDOR line on the downloaded license file jlicense.dat.
- If you are still having problems, set FORCE_HPC to 1 before running any program as shown below.
>> spmd; setenv('FORCE_HPC', '1'); run_my_code(); end
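The license-file edit listed above (a USE_SERVER line just below the VENDOR line) can be done by hand or scripted. The following is a minimal sketch; the function name add_use_server is hypothetical, and it assumes the VENDOR keyword starts at the beginning of its line.

```shell
# Insert a USE_SERVER line directly below the first VENDOR line of a license
# file, unless the file already contains one. The result goes to stdout; the
# input file is left untouched. (Hypothetical helper for illustration.)
add_use_server() (
    license_file=$1
    if grep -q '^USE_SERVER' "$license_file"; then
        # Already present: pass the file through unchanged.
        cat "$license_file"
    else
        awk '{ print } /^VENDOR/ && !seen { print "USE_SERVER"; seen = 1 }' "$license_file"
    fi
)
```

A typical use would be add_use_server jlicense.dat > jlicense.dat.new, followed by reviewing the new file and moving it into place.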
Testing the setup
All commands below are run from the master (head) node.
In a PCT or DCS environment, you can test your configuration with the following example, where the host computer is set up as a single-node, 2-worker cluster:
>> spmd; ginfo; end
Lab 1:
  GPU0(enabled) GeForce GTX295, 1212 MHz, 895 MB VRAM, Compute1.3(single)
  GPU1(enabled) GeForce GTX295, 1212 MHz, 895 MB VRAM, Compute1.3(single)(in use)
Lab 2:
  GPU0(enabled) GeForce GTX295, 1212 MHz, 895 MB VRAM, Compute1.3(single)(in use)
  GPU1(enabled) GeForce GTX295, 1212 MHz, 895 MB VRAM, Compute1.3(single)
Note that typing ginfo on the head node does not completely guarantee that your setup works. You need to run the command in a pool as shown above to properly test that the command is fanned out to the HPC workers.
To see Jacket HPC in action, try running the MGL Example (If you are in the Jacket directory, type addpath examples/mgl_example; mgl_example).
Troubleshooting
Issues running a Designated-Computer Jacket license on a Jacket HPC node
Read about Designated-Computer licenses here.
You may face some startup issues if a machine is configured as both:
- a slave node on a Jacket HPC cluster and
- a desktop machine running a single-node license of Jacket
Although running a machine in dual licensing modes is not recommended, this frequently happens if you use the slave node for daily Jacket work as well as run periodic Jacket HPC cluster tests.
It may manifest itself in the form of the following:
- license-related errors upon Jacket startup
- a MATLAB hang or crash when you type the first command in Jacket
Please be aware that the changes below may cause either your single-node or your Jacket HPC setup to behave incorrectly.
The licensing software is the same for both single-node and Jacket HPC, and relies upon an environment variable, LM_LICENSE_FILE that points to the licensing server (usually of the form port@server).
This variable might have been created when you set up your machine as a Jacket HPC slave. Unset this variable (for example, in Linux, you could set it to LM_LICENSE_FILE=) for the single-node session. Set it back for use with Jacket HPC.
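Switching between the two modes can be wrapped in a pair of shell functions. This is a sketch only: the function names are hypothetical, and the functions must be defined in (or sourced into) the shell that later launches MATLAB for the change to take effect.

```shell
# Clear the license-server pointer before a single-node session.
# (Hypothetical helper names; not part of Jacket.)
use_single_node_license() {
    unset LM_LICENSE_FILE
}

# Restore the pointer before a Jacket HPC session.
# $1 has the form port@server, for example 27001@head-node.
use_hpc_license() {
    export LM_LICENSE_FILE="$1"
}
```

For example, run use_hpc_license 27001@head-node before cluster tests and use_single_node_license before daily desktop work.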
lmgrd: Failed to open the TCP port number in the license
MATLAB's MDCE service uses a port to communicate with cluster nodes. Typically, this port number is 27000, and is displayed in messages when you start mdce.
If your concurrent license file (downloaded using the instructions here) contains the same port number, lmgrd will fail to open the port. Therefore, take care to choose a different port while downloading the license.
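A quick way to catch this conflict before starting lmgrd is to compare the two port numbers. The sketch below assumes the license file uses the common SERVER line layout (SERVER, host, host ID, port); both function names are hypothetical.

```shell
# Print the port from the first SERVER line of a license file.
# Assumes the layout: SERVER <host> <hostid> <port>.
license_port() (
    awk '$1 == "SERVER" { print $4; exit }' "$1"
)

# Succeed (exit status 0) only when a license port was found and it differs
# from the mdce port. $1 = license file, $2 = mdce port.
ports_differ() (
    license_file=$1
    mdce_port=$2
    port=$(license_port "$license_file")
    [ -n "$port" ] && [ "$port" != "$mdce_port" ]
)
```

For example, ports_differ jlicense.dat 27000 || echo "choose a different license port" warns when the file either clashes with mdce or has no port on its SERVER line.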
Job Configuration: Validation Failure
Lab <lab number> on host <hostname> failed to connect to the MATLAB client on host <head-node name>.
Try putting your head-node's IP address into /etc/hosts on the head-node as: IP-Addr head-node
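Adding the entry can be made repeatable with a helper that skips the edit when the name is already listed. This is a minimal sketch; the function name is hypothetical, and the address 192.0.2.10 in the usage note is a documentation placeholder, not a real head-node address.

```shell
# Append "<ip> <name>" to a hosts file unless <name> is already listed.
# $1 = hosts file (normally /etc/hosts), $2 = IP address, $3 = host name.
# (Hypothetical helper for illustration.)
ensure_hosts_entry() (
    hosts_file=$1
    ip=$2
    name=$3
    # Look for the name as a whole field on any non-comment line.
    if awk -v n="$name" '!/^[ \t]*#/ { for (i = 2; i <= NF; i++) if ($i == n) found = 1 } END { exit !found }' "$hosts_file"; then
        echo "entry for $name already present"
    else
        printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
        echo "added $ip $name"
    fi
)
```

For example, ensure_hosts_entry /etc/hosts 192.0.2.10 head-node, run with sufficient privileges to write to /etc/hosts.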
The Jacket License service could not be contacted
If you get this message while running spmd;ginfo;end, the jacketd service running on the workers cannot be reached.
To resolve this:
- Ensure that jacketd and mdce were started as the same user. One way to ensure this is to start both as root. Another is to start them at startup.
- To enable workers to communicate, give the file /tmp/jacketd.sock write-permissions (this file is created after you start jacketd).
- Set an environment variable, LM_LICENSE_FILE=port-number@head-node (for example, 27001@head-node)
- Add a line, USE_SERVER, just below the VENDOR line on the downloaded license file jlicense.dat on all servers and clients.
Please contact us if you need help.
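Before changing permissions, it can help to confirm the state of the socket file on each worker node. The check below is a sketch with a hypothetical function name; it only reports what it finds and changes nothing.

```shell
# Report whether the jacketd socket exists and is writable by the current user.
# $1 = socket path; defaults to /tmp/jacketd.sock as named in the text.
# (Hypothetical helper for illustration.)
check_jacketd_socket() (
    sock=${1:-/tmp/jacketd.sock}
    if [ ! -e "$sock" ]; then
        echo "missing: $sock (is jacketd running on this node?)"
        exit 1
    elif [ ! -w "$sock" ]; then
        echo "not writable: $sock (an administrator may need to run chmod a+w on it)"
        exit 2
    fi
    echo "ok: $sock"
)
```

Run it as the same user that runs the MATLAB workers, since write access is checked for the calling user.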
Running the MGL Example causes an error: No available Jacket license found or license is invalid
To work around this error, you need to set an environment variable on all your workers (Only the workers):
export FORCE_HPC=1
If you are unable to set this up, please contact Support and we'll walk you through the steps.
Hosts show as "Unavailable" in the Admin Center
There may be a situation where you are able to add hosts to the Admin Center, but they show as "Unavailable".
You can verify this by clicking on "Test connectivity". The result of running this is shown in the screenshot to the right.
Typically this means an external program is blocking access to the mdce service. This may be an incorrectly-configured firewall, for instance.
On Linux machines, you may try temporarily disabling the firewall (for example, with service iptables stop on Red Hat-based systems) and check whether your nodes become available. If they do, you need to reconfigure your firewall.
Starting lmgrd: /lib/ld-lsb.so.3: bad ELF interpreter: No such file or directory
Some Red-Hat Linux distributions do not include the Red-Hat LSB libraries. You may need to install them manually (available through Fedora yum or rpm repositories).
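To confirm that this is the cause before installing anything, check whether the loader named in the error message exists. This is a minimal sketch with a hypothetical function name.

```shell
# Report whether the LSB loader that lmgrd asks for is present.
# $1 = loader path; defaults to the path quoted in the error message.
# (Hypothetical helper for illustration.)
check_lsb_loader() (
    loader=${1:-/lib/ld-lsb.so.3}
    if [ -e "$loader" ]; then
        echo "found: $loader"
    else
        echo "missing: $loader (install your distribution's LSB package)"
        exit 1
    fi
)
```

If the loader is reported as found and lmgrd still fails with the same message, the cause lies elsewhere.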
