Setup Instructions
There are two tasks to complete to get set up with the WFA. First, the WFA library must be obtained. Due to export restrictions, the code is maintained on a private gitlab server; follow the instructions in the Obtaining the WFA Code section. Second, the compilers necessary to run the code are proprietary to Cerebras Systems, Inc. They are usually packaged with the hardware in singularity containers, and the compilers necessary for the WFA have been integrated into the SDK container at version 1.8.0 and above.
Obtaining the WFA Code
This github repository contains documentation only. The full code repository is available here. To obtain access to the code, register on the MFiX website and contact Terry Jordan at terry.jordan@netl.doe.gov to be added to the private gitlab repository. This process has been adopted until NETL makes a determination about the export control nature of this work.
After being added to the repository, clone it and check out the latest version:
git clone https://mfix.netl.doe.gov/gitlab/tjordan/cerebrasdev.git
Running on Neocortex
Neocortex is an NSF-funded computer run by the Pittsburgh Supercomputing Center (PSC) and consists of a Superdome Flex and two CS-2s (the system that contains a WSE). PSC runs regular user solicitations for time on Neocortex. NETL has been working with Cerebras and PSC to ensure the WFA can run on the Neocortex system.
The first thing to do is follow the instructions in the Obtaining the WFA Code section to get the repository for the WFA set up on Neocortex. If there is no special reason to use a different branch, check out master and make sure you pull the latest changes.
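For example, from inside the cloned repository (assuming it was cloned into the default cerebrasdev directory and that the default branch is named master, as above):
cd cerebrasdev
git checkout master
git pull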
The next thing to do is set up the container that provides the proprietary Cerebras compilers. The WFA compiles against the Paint and Angler compilers using Make, so the container you use will have to have this capability built in. According to Cerebras, the SDK container at 1.8.0 and later has this built in. If the container installed on the hardware is a lower version number, then one will have to obtain a custom-built container from Leighton Wilson at Cerebras (reach out to him on the Neocortex Slack channel). As of the writing of this version of the documentation (3/24/23), Neocortex has not upgraded to 1.8.0. Thus, users will have to generate a binary in the custom container and then submit the run on hardware using the latest installed container on Neocortex.
Interactive Singularity
Throughout the development and run cycle, you will need to generate images inside a singularity container. While it is possible to do this in batch mode, it is more convenient to launch an interactive singularity environment, since several python packages have to be installed and running in batch mode can waste a lot of time.
Because singularity is locked down, installing python packages in the normal way is not possible. To get around this, we create a directory which will contain python packages and then set the .local path in the home directory to this location.
mkdir <path_to>/Cerebras-extra-python-packages
We may need to add this location to PYTHONPATH.
export PYTHONPATH=<path_to>/Cerebras-extra-python-packages:$PYTHONPATH
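Once the container described below is running, additional python packages can then be installed into this bind-mounted location with pip's user scheme; for example (the package name here is only an illustration):
pip install --user matplotlib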
We also bind the path to the WFA repository to a root level folder /cerebras_dev inside the container. If <path_to_repo> is set correctly, then a cd /cerebras_dev inside the container should take you to the top level directory of the WFA repository where the README.md file exists. <path_to_sif_file> is either the path to a custom sif container or the default environment variable for the installed SDK container on Neocortex, depending on whether the installed container is at/above 1.8.0. Run on local SSD storage to reduce the time needed to generate and transfer hardware images: first check out an SDF node, then cd to local storage and create a project directory (a sketch of these commands follows the singularity run command below). Finally, launch the singularity container on the SDF node:
singularity run -B <path_to>/Cerebras-extra-python-packages:$HOME/.local -B <path_to_repo>/:/cerebras_dev --env PYTHONPATH=\$HOME/.local:\$PYTHONPATH <path_to_sif_file>
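The SDF node checkout and local project directory steps might look like the following sketch, which mirrors the srun command used later in the Retrieving Data section; <project_name> is a placeholder:
srun --nodelist=sdf-<1 or 2> --pty bash
mkdir -p /local1/<project_name>
cd /local1/<project_name>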
Because we have to use the python installed in the container, if you use a python manager like Conda, you will have to deactivate it so that python points to the container's python. For conda this is simple:
conda deactivate
Once this is set up, leave the terminal open so you can do development and generate simulator/hardware images.
Checking Installation
Now that the container environment is set up, it is good to check that the WFA is working correctly. Inside the terminal with the singularity container running:
cd /cerebras_dev
Then launch the tests by:
python build_and_test.py -neo 1
This should install all the python dependencies and start launching the tests. You should see output similar to this:
# lots of install messages
Finished processing dependencies for WSE-Field-Equations==0.0.1
test_FDNS_bc_Vz : pass
test_scalar_array_mul : pass
# lots more tests
All tests should pass if everything is installed correctly and the branch you are on shows a pass in the gitlab runner in the online repository. It may take 5-10 min to run all the tests depending on how many are checked in.
Launching Portrait from Neocortex
If doing development on Neocortex, which is headless, one will want to launch the visual debugging tool, Portrait.
Navigate to the input file directory and run the input file with one of the debug level options -dl <1,2,3>; all of them will launch the Portrait web server for you:
python <input_file>.py -dl <1,2,3>
The last line in the console after running the debug option will look something like this:
http://br023.ib.bridges2.psc.edu:8080/paint.html
This is the server to connect to in the format of:
http://<server>:<port>/paint.html
From your local machine ssh into Neocortex:
ssh -L <port>:<server>:<port> <username>@bridges2.psc.edu
Launch a web browser (Firefox and Chrome work best) and navigate to:
localhost:<port>/paint.html
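Using the example server line above, the tunnel and browser address would look something like:
ssh -L 8080:br023.ib.bridges2.psc.edu:8080 <username>@bridges2.psc.edu
and then browse to localhost:8080/paint.html on your local machine.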
Compiling for Hardware on Neocortex
By default the WFA will compile and run in Simfabric (the Cerebras-built hardware simulator). To compile for hardware, specify the hardware input name (-hin <hardware_run_name>) and the make goal as bench.img (-mg bench.img). A module or a local script can be used. For a module, use the following command:
python -m WFA.<module>.<project_script> -mg bench.img -vr 0 <additional_args>
For a local script, use the following command:
python <run_name>.py -mg bench.img -vr 0 <additional_args>
This produces a hardware image (binary) that is ready to run on hardware.
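For example, a full local-script invocation combining the flags described above might look like the following sketch (all bracketed names are placeholders):
python <run_name>.py -hin <hardware_run_name> -mg bench.img -vr 0 <additional_args>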
The job can then be submitted to Neocortex through slurm. Make sure the <hardware_run_name> passed to the -c option does not include the .tar.gz extension. The -o <field_variable> option specifies the name of the WSE_Array to save as output from the WSE. Use the same node on which you generated the hardware image.
(cd /cerebras_dev/WSE_Field_Equation/util; sbatch --nodelist=sdf-<1 or 2> neocortex_slurm_script_local -c <full path to the local project_dir> -o <field_variable>)
Retrieving Data
On Neocortex, the fastest way to run is to move data off the network drives and onto the local SSDs. The neocortex_slurm_script_bash script will move the <hardware_run_name>.tar.gz file to the local drive on the Superdome Flex at /local1/<project_name>/<hardware_run_name>. To access this data, log into the SDF node:
srun --nodelist=sdf-<1 or 2> --pty bash
and navigate to the local storage
cd /local1/<project_name>/<hardware_run_name>
The results will be in a series of .ckpt files. If the program ran successfully and got to the end, the progend.txt file should have a 1 in it.
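A quick way to confirm a completed run from within this directory (as noted above, progend.txt should contain a 1):
ls *.ckpt
cat progend.txt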
Saving Time-Dependent Data
Multiple checkpoints can be saved out in series as the calculations are completed. The WFA can be configured to write out files from one loop, starting at iteration <start_iteration> and writing out every <mod_iterations> iterations.
To do this, the controller puts itself in a pause state and waits for the host to check if it is ready to take data off. If it is, the host will issue a data retrieval command and get the data off the WSE. This is a far from optimal solution for IO and currently uses a very slow debug interface. It will be replaced with a different interface that puts the host in a wait-to-receive mode and uses the full 1.2 Tb/s data link to bring frame save rates down to a few milliseconds without interrupting solution progress.
To enable this, the conditional_pause method in the loop context manager must be called:
with for_loop('time_march', args.time_steps) as time_march:
    iter_counter[0] = iter_counter + 1
    # Push T to the host starting at the start iteration and every modulo iterations
    T[2:-2,0,0].push_to_host(args.start_iteration_push, args.modulo_iteration_push)
    if args.add_time_pause > 0:
        # Pause the time march so the host can take the data off the WSE
        time_march.conditional_pause()
    ns.fluid_iteration(save_debug_images=False, num_reduction_lim_its=nerr,
                       max_pseudo_its=max_pseudo_its, fluid_tol=nl_tol)
It is often beneficial to put this in a guard with a command line option set.
# Head of file
wse = WSE_Interface()
parser = wse.get_cmd_line_parser()

# Add program specific command line arguments
parser.add_argument("-ts", "--time_steps", help="specify number of time steps",
                    type=int, default=5)
parser.add_argument("-atp", "--add_time_pause",
                    help="adds a time march pause to save data", type=int,
                    default=-1)

# Some point after all the arguments are added
args = wse.get_available_cmd_args()

# Rest of problem setup
In this example, the program code would be compiled with -atp 1 set on the command line along with -ts <max_time_steps>.
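A compile command for this example might then look like the following sketch, combining the local-script invocation from the hardware compile section with these flags (bracketed names are placeholders):
python <run_name>.py -mg bench.img -vr 0 -atp 1 -ts <max_time_steps>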
Then to launch the job on Neocortex, additional command line arguments would be set in the submission.
(cd /cerebras_dev/WSE_Field_Equation/util; sbatch --nodelist=sdf-<1 or 2> neocortex_slurm_script_bash -c <hardware_run_name> -o <field_variable> -s <start_iteration> -p <mod_iterations>)
This allows one to set up saving data over a given time period. For example, setting <start_iteration> = 100, <mod_iterations> = 10, and compiling with <max_time_steps> = 200 will output 11 .ckpt files, equally spaced at time steps between 100 and 200.
Converting Checkpoints to VTR
The resulting checkpoint files need to be converted before they can be visualized. The process uses a post-processing script, tovtkfromBin.py, to generate a vtkRectilinearGrid (VTR) file.
Copy the conversion script to your results directory:
$ cp <WFA_Path>/WSE_Field_Equation/util/tovtkfromBin.py <Checkpoint_Location>
Run the script:
$ python ./tovtkfromBin.py <Checkpoint_filename> <Attribute_name>
ParaView Visualization
ParaView is a post-processing visualization tool. It can be run locally or remotely. Local execution uses typical point-and-click operation; client/server mode is a more involved process.
Client/Server Mode
Login to Bridges2
$ ssh <username>@bridges2.psc.edu
Download the latest version of Paraview
$ wget -O ParaView-5.11.0-MPI-Linux-Python3.9-x86_64.tar.gz "https://www.paraview.org/paraview-downloads/download.php?submit=Download&version=v5.11&type=binary&os=Linux&downloadFile=ParaView-5.11.0-MPI-Linux-Python3.9-x86_64.tar.gz"
Extract ParaView
$ tar -zxvf ParaView-5.11.0-MPI-Linux-Python3.9-x86_64.tar.gz
Check out a GPU node
$ interact -p GPU-shared --gres=gpu:v100-32:1
Run ParaView server
$ <ParaView_Directory>/bin/pvserver
Get the port information from the console which will look something like this:
Waiting for client...
Connection URL: cs://v003.ib.bridges2.psc.edu:11111
Accepting connection(s): v003.ib.bridges2.psc.edu:11111
Client connected.
This is the server to connect to in the format of:
cs://<server>:<port>/
From your local machine use ssh to link the Bridges2 port to your local port:
ssh -L <port>:<server>:<port> <username>@bridges2.psc.edu
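For the example connection URL above, this would look something like:
ssh -L 11111:v003.ib.bridges2.psc.edu:11111 <username>@bridges2.psc.edu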
On the local system, download and extract ParaView using the same process above. Run ParaView locally:
$ <ParaView_Directory>/bin/paraview
Use the ParaView GUI to connect to the server: File->Connect, choose Server Configuration/Add a server, then edit the server configuration to point at the cs://<server>:<port> address from above.
Use File->Open to open the VTR.
Running on Colab
Colab is a facility in California which hosts Cerebras' WSE farm. NETL has purchased time on this facility. The following are instructions for running on the Colab facilities. These instructions are largely specific to the NETL team and those working with the core WFA development team.
Connecting to Colab
Connection is done over ssh with private/public keys. It is not possible to connect to Colab without contacting NETL and Cerebras to arrange for ssh connection privileges. Connections are managed over a VPN system called GlobalProtect. The software can be found here. You will need a user name and password from Cerebras to log in and download.
Once GlobalProtect is installed, add the server access01.vpn.cerebras.net to the GlobalProtect client by navigating to hamburger > Settings > General > Add. Put the above portal address in the dialog box and click OK. Now log into GlobalProtect using your provided user name and password.
NETL has been provided access to two servers: sc-r11r13-s10 and sc-r11r13-s11. s10 is our development node and s11 is connected to the WSE. Both can be logged into with the following:
ssh -L 8080:<server>:865 <user_name>@<server> -i C:\<path\to\private\key>
This grants terminal access to either server with port forwarding so that Portrait can be accessed from your local machine. The -L 8080:<server>:865 portion can be omitted if there is no need to run Portrait, and it only has to be included in one active terminal to support the connection.
Running the WFA in Docker
Once the WFA is checked out and a docker container is copied to the home directory, log into s10. s10 is equipped with docker and the image can be run with the following:
docker run -it --rm -p 865:8080 -v $HOME:/home -v /netl:/netl -v /usr:/host_usr <image_name>
Once docker is loaded and running in a terminal, navigate to the root directory for the WFA and run python build_and_test.py. If the branch is passing on gitlab, it should pass all the tests locally.
Running on hardware
s11 is connected to the WSE and has access to the same directories as s10. It does not have docker for obvious reasons. Once logged in via ssh, set up two environment variables:
export PFAB=$HOME/pfabric.bin
export FABRIC=$HOME/vfabric.bin
export NODE=10.254.27.224
The above could be added to your .bashrc file.
Next, expand the stack space available to linux:
ulimit -s unlimited
Now check the current health of the WSE to ensure that it is not in a bad configuration:
cssurvey node=$NODE fabric=$PFAB
Check for one way links and that it found over 800K cores. A healthy output should look like:
Node at 10.254.27.224, 132 links over 12 ports
P0 = 10.254.27.225
P1 = 10.254.27.226
P2 = 10.254.27.227
P3 = 10.254.27.228
P4 = 10.254.27.229
P5 = 10.254.27.230
P6 = 10.254.27.231
P7 = 10.254.27.232
P8 = 10.254.27.233
P9 = 10.254.27.234
PA = 10.254.27.235
PB = 10.254.27.236
Use socket 3 for control messages
tilecount 818958
tilecount 818958
corecount=801864
corecount=801864
Detected 801864 cores
gcount: 0 in raw scan
0 one way links
translating fabric offset -746,-496 -> 0,0
801864 nodes; 1 components; 0 x-error; 0 y-error; 0 linkerror 0 taberror 0 inverror
fabric size: 774 x 1036
ghost sightings: 0
[ 0] 499R [ 1] 491R [ 2] 475R [ 3] 459R [ 4] 443R [ 5] 430R [ 6] 414R [ 7] 408R [ 8] 395R [ 9] 382R [ 10] 359R
[ 11] 351R [ 12] 337R [ 13] 321R [ 14] 305R [ 15] 289R [ 16] 273R [ 17] 265R [ 18] 249R [ 19] 233R [ 20] 209R [ 21] 201R
[ 22] 186R [ 23] 170R [ 24] 155R [ 25] 141R [ 26] 125R [ 27] 117R [ 28] 101R [ 29] 85R [ 30] 61R [ 31] 53R [ 32] 37R
[ 33] 1004R [ 34] 988R [ 35] 972R [ 36] 948R [ 37] 940R [ 38] 924R [ 39] 908R [ 40] 892R [ 41] 876R [ 42] 860R [ 43] 852R
[ 44] 836R [ 45] 820R [ 46] 796R [ 47] 788R [ 48] 773R [ 49] 757R [ 50] 741R [ 51] 727R [ 52] 711R [ 53] 703R [ 54] 687R
[ 55] 671R [ 56] 647R [ 57] 639R [ 58] 623R [ 59] 607R [ 60] 591R [ 61] 579R [ 62] 563R [ 63] 555R [ 64] 539R [ 65] 523R
[ 66] 521L [ 67] 536L [ 68] 552L [ 69] 559L [ 70] 575L [ 71] 591L [ 72] 607L [ 73] 623L [ 74] 638L [ 75] 646L [ 76] 670L
[ 77] 685L [ 78] 701L [ 79] 709L [ 80] 725L [ 81] 741L [ 82] 757L [ 83] 773L [ 84] 789L [ 85] 797L [ 86] 820L [ 87] 836L
[ 88] 852L [ 89] 860L [ 90] 876L [ 91] 892L [ 92] 908L [ 93] 924L [ 94] 939L [ 95] 947L [ 96] 971L [ 97] 987L [ 98] 1003L
[ 99] 38L [100] 53L [101] 61L [102] 84L [103] 100L [104] 115L [105] 123L [106] 139L [107] 155L [108] 171L [109] 187L
[110] 202L [111] 210L [112] 234L [113] 249L [114] 265L [115] 273L [116] 289L [117] 302L [118] 313L [119] 329L [120] 345L
[121] 353L [122] 377L [123] 393L [124] 408L [125] 416L [126] 432L [127] 444L [128] 460L [129] 475L [130] 490L [131] 498L
If the output does not look like this, the WSE needs to be restarted. Contact Cerebras through slack for help.
If the output looks healthy then proceed to create a virtual fabric big enough to run your program.
csvfabric pfabric=$PFAB vfabric=$FABRIC x0=<x_origin> y0=<y_origin> w=<width> h=<height>
Now pause all the tiles in that fabric section:
cscontrol fabric=$FABRIC pause
Now initialize the memory space on that fabric section:
cswipe fabric=$FABRIC
Now load your program:
csprog image=<path/to/my/image/file> fabric=$FABRIC
Now unpause the tiles:
cscontrol fabric=$FABRIC run
The tiles can be paused at any point and the memory state can be inspected:
csget fabric=$FABRIC map=<path/to/my/map/file> symbol=<symbol_in_map_file> data=<optional_binary_out_file>
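Putting these steps together, a typical load-and-run session might look like the following sketch (the origin, dimensions, and paths are placeholders):
csvfabric pfabric=$PFAB vfabric=$FABRIC x0=<x_origin> y0=<y_origin> w=<width> h=<height>
cscontrol fabric=$FABRIC pause
cswipe fabric=$FABRIC
csprog image=<path/to/my/image/file> fabric=$FABRIC
cscontrol fabric=$FABRIC run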
Useful Tools and Methods
Several runtime environments are headless, which makes doing development work difficult. Here are a few useful tools to make this easier.
Custom Spyder IDE
NETL is maintaining a custom version of the Spyder IDE which supports syntax highlighting for WSE kernel code. Installation can be done with conda by following the instructions below:
Follow the instructions in the Obtaining the WFA Code section and clone the repository from gitlab.
Check out the v.5.4.2_wse branch.
Navigate to the folder with the setup.py and env.yml files.
Create the conda environment:
conda env create -f env.yml --name spyder
on linux, or
conda env create -f env_win.yml --name spyder
on windows.
Activate the environment:
conda activate spyder
Install Spyder:
python setup.py install
Launch Spyder by running:
spyder
on linux, or spyder.bat in the path/to/spyder/repository/scripts directory on windows.
sshfs
sshfs is very handy for doing remote development as it allows one to mount a remote directory over standard ssh interfaces. Follow the instructions here to get it set up. It has been tested and works with Neocortex. This will allow you to edit files remotely in whatever editor you choose. Once installed do the following:
Neocortex from Linux
First, make a directory to mount
sudo mkdir /mnt/<mount_name>
Next mount whatever directory you wish
sudo sshfs <user_name>@neocortex.psc.edu:<path/to/directory> /mnt/<mount_name>
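When finished, the mount can be released with a standard FUSE unmount (not specific to Neocortex or the WFA):
sudo fusermount -u /mnt/<mount_name>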
Neocortex from Windows
Open the file browser
Right click on This PC
Select Map Network Drive…
Select a drive letter
Paste the following (with correct substitutions) in the Folder: text box
\\sshfs\<user_name>@neocortex.psc.edu[\<path\to\directory>]
Pycharm
Pycharm is another IDE similar to Spyder. It has a useful feature in that it will render .rst files and has some minimal word processing capabilities. This makes it very good for editing Sphinx documentation files.
To install:
Get the latest community edition here.
Untar the file.
Run with:
sh <path_to_pycharm>/bin/pycharm.sh