Setup Instructions
==================

There are two tasks to complete to get set up with the WFA. First, the WFA library must be obtained. Due to export restrictions, the code is maintained on a private gitlab server; follow the instructions in the `Obtaining the WFA Code`_ section. Second, the compilers necessary to run the code are proprietary to Cerebras Systems, Inc. The compilers are usually packaged with the hardware in singularity containers. The compilers necessary for the WFA have been integrated into version 1.8.0 and above of the SDK container.

Obtaining the WFA Code
----------------------

This github repository contains documentation only. The full code repository is available `here <https://mfix.netl.doe.gov/gitlab/tjordan/cerebrasdev/-/tree/master/WSE_Field_Equation>`_. To obtain access to the code, register on the `MFiX website <https://mfix.netl.doe.gov/register>`_ and then contact Terry Jordan at terry.jordan@netl.doe.gov to be added to the private gitlab repository. This process has been adopted until NETL makes a determination about the export control nature of this work. After being added to the repository, clone it:

.. code-block:: console

   git clone https://mfix.netl.doe.gov/gitlab/tjordan/cerebrasdev.git

Running on Neocortex
--------------------

`Neocortex <https://www.cmu.edu/psc/aibd/neocortex/>`_ is an NSF-funded computer operated by the Pittsburgh Supercomputing Center (PSC). It consists of a Superdome Flex and two CS-2s (the system that contains a WSE). PSC runs regular user solicitations for time on Neocortex. NETL has been working with Cerebras and PSC to ensure the WFA can run on the Neocortex system.

The first thing to do is follow the instructions in the `Obtaining the WFA Code`_ section to get the WFA repository set up on Neocortex. If there is no special reason to use a different branch, check out master and make sure you pull the latest changes.

The next thing to do is set up the container that holds the proprietary Cerebras compilers. The WFA compiles against the Paint and Angler compilers using Make, so the container you use must have this capability built in. According to Cerebras, the SDK container at 1.8.0 and later includes it. If the container installed on the hardware is a lower version, you will have to obtain a custom-built container from Leighton Wilson at Cerebras (reach out to him on the Neocortex Slack channel). As of the writing of this version of the documentation (3/24/23), Neocortex has not upgraded to 1.8.0. Thus, users will have to generate a binary in the custom container and then submit the run on hardware using the latest installed container on Neocortex.

Interactive Singularity
^^^^^^^^^^^^^^^^^^^^^^^

Throughout the development and run cycle, you will need to generate images inside a singularity container. While it is possible to do this in batch mode, it is more convenient to launch an interactive singularity environment, since several python packages have to be installed and iterating in batch mode wastes a lot of time.

Because singularity is locked down, installing python packages in the normal way is not possible. To get around this, we create a directory that will hold the python packages and then bind the :code:`.local` path in the home directory to this location.

.. code-block:: console

   mkdir <path_to>/Cerebras-extra-python-packages
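Once the container is launched with this directory bound to :code:`$HOME/.local` (see the :code:`singularity run` command below), user-level installs made inside the container will land in it. As a sketch of the mechanism, assuming pip is available in the container, a missing package could be added with a hypothetical package name like so:

.. code-block:: console

   python -m pip install --user <package_name>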
We may need to add this location to PYTHONPATH.

.. code-block:: console

   export PYTHONPATH=<path_to>/Cerebras-extra-python-packages:$PYTHONPATH

We also bind the path to the WFA repository to a root level folder :code:`/cerebras_dev` inside the container. If :code:`<path_to_repo>` is set correctly, then a :code:`cd /cerebras_dev` inside the container should take you to the top level directory of the WFA repository, where the :code:`README.md` file exists. :code:`<path_to_sif_file>` is either the path to a custom sif container or the default environment variable for the installed SDK container on Neocortex, depending on whether the installed container is at/above 1.8.0.

Run on local SSD storage to reduce the time needed to generate and transfer hardware images. First check out an SDF node:

.. code-block:: console

   srun --nodelist=sdf-<1 or 2> --mem 200000 --pty bash

Then cd to local storage and create a project directory:

.. code-block:: console

   cd /local1/<Charge ID>
   mkdir <project_dir>
   cd <project_dir>

Launch the singularity container on the SDF node:

.. code-block:: console

   singularity run -B <path_to>/Cerebras-extra-python-packages:$HOME/.local -B <path_to_repo>/:/cerebras_dev --env PYTHONPATH=\$HOME/.local:\$PYTHONPATH <path_to_sif_file>

Because we have to use the python installed in the container, if you use a python manager like Conda, you will have to deactivate it so that :code:`python` points to the container's python. For conda this is simple:

.. code-block:: console

   conda deactivate

Once this is set up, leave the terminal open so you can do development and generate simulator/hardware images.

Checking Installation
^^^^^^^^^^^^^^^^^^^^^

Now that the container environment is set up, it is good to check that the WFA is working correctly. Inside the terminal with the singularity container running:

.. code-block:: console

   cd /cerebras_dev

Then launch the tests:

.. code-block:: console

   python build_and_test.py -neo 1

This should install all the python dependencies and start launching the tests. You should see output similar to this:

.. code-block:: console

   # lots of install messages
   Finished processing dependencies for WSE-Field-Equations==0.0.1
   test_FDNS_bc_Vz : pass
   test_scalar_array_mul : pass
   # lots more tests

All tests should pass if the installation is correct and the branch you are on shows a pass in the gitlab runner in the online repository. It may take 5-10 min to run all the tests, depending on how many are checked in.

Launching Portrait from Neocortex
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

If doing development on Neocortex, which is headless, one will want to launch the visual debugging tool, Portrait. Navigate to the input file directory and run the input file with one of the debug level options (:code:`-dl <1,2,3>`); all of them will launch the Portrait web server for you:

.. code-block:: console

   python <input_file>.py -dl <1,2,3>

The last line in the console after running the debug option will look something like this:

.. code-block:: console

   http://br023.ib.bridges2.psc.edu:8080/paint.html

This is the server to connect to, in the format:

.. code-block:: console

   http://<server>:<port>/paint.html

From your local machine, ssh into Neocortex with port forwarding:

.. code-block:: console

   ssh -L <port>:<server>:<port> <username>@bridges2.psc.edu

Launch a web browser (firefox and chrome work best) and navigate to:

.. code-block:: console

   localhost:<port>/paint.html
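For the example Portrait server shown above, the tunnel and the local address would look like the following (port 8080 is taken from that example output and may differ for your run):

.. code-block:: console

   ssh -L 8080:br023.ib.bridges2.psc.edu:8080 <username>@bridges2.psc.edu

and then, in the browser:

.. code-block:: console

   localhost:8080/paint.html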
Compiling for Hardware Neocortex
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

By default the WFA will compile and run in Simfabric (the Cerebras-built hardware simulator). To compile for hardware, specify the hardware input name (:code:`-hin <hardware_run_name>`) and the make goal bench.img (:code:`-mg bench.img`). A module or a local script can be used. For a module, use the following command:

.. code-block:: console

   python -m WFA.<module>.<project_script> -mg bench.img -vr 0 <additional_args>

For a local script, use the following command:

.. code-block:: console

   python <run_name>.py -mg bench.img -vr 0 <additional_args>

This produces a hardware image (binary) that is ready to run on hardware. The job can then be submitted to Neocortex through slurm. Make sure the :code:`.tar.gz` extension is not included in the value passed to the :code:`-c` option. The :code:`-o <field_variable>` option specifies the name of the :code:`WSE_Array` to save as output from the WSE. Use the same node on which you generated the hardware image.

.. code-block:: console

   (cd /cerebras_dev/WSE_Field_Equation/util; sbatch --nodelist=sdf-<1 or 2> neocortex_slurm_script_local -c <full path to the local project_dir> -o <field_variable>)

Retrieving Data
^^^^^^^^^^^^^^^

On Neocortex, the fastest way to run is to move data off the network drives and onto the local SSDs. The :code:`neocortex_slurm_script_bash` script will move the :code:`<hardware_run_name>.tar.gz` file to the local drive on the Superdome Flex at :code:`/local1/<project_name>/<hardware_run_name>`. To access this data, log into the SDF node:

.. code-block:: console

   srun --nodelist=sdf-<1 or 2> --pty bash

and navigate to the local storage:

.. code-block:: console

   cd /local1/<project_name>/<hardware_run_name>

The results will be in a series of :code:`.ckpt` files. If the program ran successfully and got to the end, the :code:`progend.txt` file should have a :code:`1` in it.

Saving Time Dependent Data
^^^^^^^^^^^^^^^^^^^^^^^^^^

Multiple checkpoints can be saved out in series as the calculations are completed. The WFA can be configured to write out files from one loop, starting at iteration :code:`<start_iteration>` and writing out every :code:`<mod_iterations>` iterations. To do this, the controller puts itself in a pause state and waits for the host to check whether it is ready to take data off. If it is, the host issues a data retrieval command and gets the data off the WSE. This is far from an optimal solution for IO and currently uses a very slow debug interface. It will be replaced with a different interface that puts the host in a wait-to-receive mode and uses the full 1.2Tb/s data link, bringing frame save times down to a few milliseconds without interrupting solution progress. To enable this, the :code:`conditional_pause` method in the loop context manager must be called.

.. code-block:: python

   with for_loop('time_march', args.time_steps) as time_march:
       iter_counter[0] = iter_counter + 1
       T[2:-2,0,0].push_to_host(args.start_iteration_push, args.modulo_iteration_push)
       if args.add_time_pause > 0:
           time_march.conditional_pause()
       ns.fluid_iteration(save_debug_images=False, num_reduction_lim_its=nerr, max_pseudo_its=max_pseudo_its, fluid_tol=nl_tol)

It is often beneficial to put this in a guard with a command line option set.
.. code-block:: python

   # Head of file
   wse = WSE_Interface()
   parser = wse.get_cmd_line_parser()

   # Add program specific command line arguments
   parser.add_argument("-ts", "--time_steps", help="specify number of time steps", type=int, default=5)
   parser.add_argument("-atp", "--add_time_pause", help="adds a time march pause to save data", type=int, default=-1)

   # Some point after all the arguments are added
   args = wse.get_available_cmd_args()

   # Rest of problem setup

In this example, the program would be compiled with :code:`-atp 1` and :code:`-ts <max_time_steps>` set on the command line. Then, to launch the job on Neocortex, additional command line arguments would be set in the submission.

.. code-block:: console

   (cd /cerebras_dev/WSE_Field_Equation/util; sbatch --nodelist=sdf-<1 or 2> neocortex_slurm_script_bash -c <hardware_run_name> -o <field_variable> -s <start_iteration> -p <mod_iterations>)

This allows one to save data over a given time period. For example, setting :code:`<start_iteration> = 100`, :code:`<mod_iterations> = 10`, and compiling with :code:`<max_time_steps> = 200` will output 11 :code:`.ckpt` files, equally spaced at time steps between 100 and 200.

Converting Checkpoints to VTR
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The resulting checkpoints need to be converted before they can be visualized. The process uses a post-processing script, :code:`tovtkfromBin.py`, to generate a vtkRectilinearGrid (VTR) file. Copy the conversion script to your results directory:

.. code-block:: console

   $ cp <WFA_Path>/WSE_Field_Equation/util/tovtkfromBin.py <Checkpoint_Location>

Run the script:

.. code-block:: console

   $ python ./tovtkfromBin.py <Checkpoint_filename> <Attribute_name>

ParaView Visualization
^^^^^^^^^^^^^^^^^^^^^^

ParaView is a post-processing visualization tool. It can be run locally or remotely. Local execution uses typical point-and-click operation. Client/Server mode is a more involved process.

Client/Server Mode
""""""""""""""""""

Log in to Bridges2:

.. code-block:: console

   $ ssh <username>@bridges2.psc.edu

Download the latest version of ParaView (the URL is quoted so the shell does not interpret the :code:`&` characters):

.. code-block:: console

   $ wget 'https://www.paraview.org/paraview-downloads/download.php?submit=Download&version=v5.11&type=binary&os=Linux&downloadFile=ParaView-5.11.0-MPI-Linux-Python3.9-x86_64.tar.gz'

Extract ParaView:

.. code-block:: console

   $ tar -zxvf ParaView-5.11.0-MPI-Linux-Python3.9-x86_64.tar.gz

Check out a GPU node:

.. code-block:: console

   $ interact -p GPU-shared --gres=gpu:v100-32:1

Run the ParaView server:

.. code-block:: console

   $ <ParaView_Directory>/bin/pvserver

Get the port information from the console, which will look something like this:

.. code-block:: console

   Waiting for client...
   Connection URL: cs://v003.ib.bridges2.psc.edu:11111
   Accepting connection(s): v003.ib.bridges2.psc.edu:11111
   Client connected.

This is the server to connect to, in the format:

.. code-block:: console

   cs://<server>:<port>/

From your local machine, use ssh to link the Bridges2 port to your local port:

.. code-block:: console

   ssh -L <port>:<server>:<port> <username>@bridges2.psc.edu

On the local system, download and extract ParaView using the same process as above. Run ParaView locally:

.. code-block:: console

   $ <ParaView_Directory>/bin/paraview

Use the ParaView GUI to connect to the server. File->Connect:

.. image:: images/connect.png
   :width: 335

Choose Server Configuration/Add a server:

.. image:: images/add_server.png
   :width: 532

Edit Server Configuration:

.. image:: images/edit_server.png
   :width: 537

Use File->Open to open the VTR.
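If a quick scripted check of a converted file is preferred over the GUI, the VTR can also be read with the :code:`vtk` Python package. The following is a minimal sketch, assuming :code:`vtk` is installed on the local machine and :code:`<file>.vtr` is the output of :code:`tovtkfromBin.py`:

.. code-block:: python

   import vtk

   # Read the rectilinear grid written by tovtkfromBin.py
   reader = vtk.vtkXMLRectilinearGridReader()
   reader.SetFileName("<file>.vtr")
   reader.Update()

   grid = reader.GetOutput()
   print("Grid dimensions:", grid.GetDimensions())

   # The saved attribute may live in point data or cell data depending on the script
   for data in (grid.GetPointData(), grid.GetCellData()):
       names = [data.GetArrayName(i) for i in range(data.GetNumberOfArrays())]
       print("Arrays:", names)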
Running on Colab
----------------

Colab is a facility in California that hosts Cerebras' WSE farm. NETL has purchased time on this facility. The following are instructions for running on the Colab facilities. These instructions are largely specific to the NETL team and those working with the core WFA development team.

Connecting to Colab
^^^^^^^^^^^^^^^^^^^

Connection is done over ssh with private/public keys. It is not possible to connect to Colab without contacting NETL and Cerebras to arrange for ssh connection privileges. Connections are managed over a VPN system called GlobalProtect. The software can be found `here <https://access01.vpn.cerebras.net>`_. You will need a user name and password from Cerebras to log in and download. Once GlobalProtect is installed, add the server `access01.vpn.cerebras.net` to the GlobalProtect client by navigating to `hamburger > Settings > General > add`. Put the above portal address in the dialog box and click ok. Now log into GlobalProtect using your provided user name and password.

NETL has been provided access to two servers: `sc-r11r13-s10` and `sc-r11r13-s11`. s10 is our development node and s11 is connected to the WSE. Both can be logged into with the following:

.. code-block:: console

   ssh -L 8080:<server>:865 <user_name>@<server> -i C:\<path\to\private\key>

This grants terminal access to either server with port forwarding so that Portrait can be accessed from your local machine. The :code:`-L 8080:<server>:865` option can be omitted if there is no need to run Portrait, and the forwarding only has to be set up once in an active terminal to support the connection.

Running the WFA in Docker
^^^^^^^^^^^^^^^^^^^^^^^^^

Once the WFA is checked out and a docker container is copied to the home directory, log into s10. s10 is equipped with docker and the image can be run with the following:

.. code-block:: console

   docker run -it --rm -p 865:8080 -v $HOME:/home -v /netl:/netl -v /usr:/host_usr <image_name>

Once docker is loaded and running in a terminal, navigate to the root directory of the WFA and run :code:`python build_and_test.py`. If the branch is passing on gitlab, it should pass all the tests locally.

Running on hardware
^^^^^^^^^^^^^^^^^^^

s11 is connected to the WSE and has access to the same directories as s10. It does not have docker for obvious reasons. Once logged in via ssh, set up the following environment variables:

.. code-block:: console

   export PFAB=$HOME/pfabric.bin
   export FABRIC=$HOME/vfabric.bin
   export NODE=10.254.27.224

The above could be added to your :code:`.bashrc` file. Next, expand the stack space available to linux:

.. code-block:: console

   ulimit -s unlimited

Now check the current health of the WSE to ensure that it is not in a bad configuration:

.. code-block:: console

   cssurvey node=$NODE fabric=$PFAB

Check for one way links and that it found over 800K cores. A healthy output should look like:
.. code-block:: console

   Node at 10.254.27.224, 132 links over 12 ports
   P0 = 10.254.27.225
   P1 = 10.254.27.226
   P2 = 10.254.27.227
   P3 = 10.254.27.228
   P4 = 10.254.27.229
   P5 = 10.254.27.230
   P6 = 10.254.27.231
   P7 = 10.254.27.232
   P8 = 10.254.27.233
   P9 = 10.254.27.234
   PA = 10.254.27.235
   PB = 10.254.27.236
   Use socket 3 for control messages
   tilecount 818958
   tilecount 818958
   corecount=801864
   corecount=801864
   Detected 801864 cores
   gcount: 0 in raw scan
   0 one way links
   translating fabric offset -746,-496 -> 0,0
   801864 nodes; 1 components; 0 x-error; 0 y-error; 0 linkerror 0 taberror 0 inverror
   fabric size: 774 x 1036
   ghost sightings: 0
   [ 0] 499R [ 1] 491R [ 2] 475R [ 3] 459R [ 4] 443R [ 5] 430R [ 6] 414R [ 7] 408R
   [ 8] 395R [ 9] 382R [ 10] 359R [ 11] 351R [ 12] 337R [ 13] 321R [ 14] 305R [ 15] 289R
   [ 16] 273R [ 17] 265R [ 18] 249R [ 19] 233R [ 20] 209R [ 21] 201R [ 22] 186R [ 23] 170R
   [ 24] 155R [ 25] 141R [ 26] 125R [ 27] 117R [ 28] 101R [ 29] 85R [ 30] 61R [ 31] 53R
   [ 32] 37R [ 33] 1004R [ 34] 988R [ 35] 972R [ 36] 948R [ 37] 940R [ 38] 924R [ 39] 908R
   [ 40] 892R [ 41] 876R [ 42] 860R [ 43] 852R [ 44] 836R [ 45] 820R [ 46] 796R [ 47] 788R
   [ 48] 773R [ 49] 757R [ 50] 741R [ 51] 727R [ 52] 711R [ 53] 703R [ 54] 687R [ 55] 671R
   [ 56] 647R [ 57] 639R [ 58] 623R [ 59] 607R [ 60] 591R [ 61] 579R [ 62] 563R [ 63] 555R
   [ 64] 539R [ 65] 523R [ 66] 521L [ 67] 536L [ 68] 552L [ 69] 559L [ 70] 575L [ 71] 591L
   [ 72] 607L [ 73] 623L [ 74] 638L [ 75] 646L [ 76] 670L [ 77] 685L [ 78] 701L [ 79] 709L
   [ 80] 725L [ 81] 741L [ 82] 757L [ 83] 773L [ 84] 789L [ 85] 797L [ 86] 820L [ 87] 836L
   [ 88] 852L [ 89] 860L [ 90] 876L [ 91] 892L [ 92] 908L [ 93] 924L [ 94] 939L [ 95] 947L
   [ 96] 971L [ 97] 987L [ 98] 1003L [ 99] 38L [100] 53L [101] 61L [102] 84L [103] 100L
   [104] 115L [105] 123L [106] 139L [107] 155L [108] 171L [109] 187L [110] 202L [111] 210L
   [112] 234L [113] 249L [114] 265L [115] 273L [116] 289L [117] 302L [118] 313L [119] 329L
   [120] 345L [121] 353L [122] 377L [123] 393L [124] 408L [125] 416L [126] 432L [127] 444L
   [128] 460L [129] 475L [130] 490L [131] 498L

If the output does not look like this, the WSE needs to be restarted. Contact Cerebras through Slack for help. If the output looks healthy, then proceed to create a virtual fabric big enough to run your program.

.. code-block:: console

   csvfabric pfabric=$PFAB vfabric=$FABRIC x0=<x_origin> y0=<y_origin> w=<width> h=<height>

Now pause all the tiles in that fabric section:

.. code-block:: console

   cscontrol fabric=$FABRIC pause

Now initialize the memory space on that fabric section:

.. code-block:: console

   cswipe fabric=$FABRIC

Now load your program:

.. code-block:: console

   csprog image=<path/to/my/image/file> fabric=$FABRIC

Now unpause the tiles:

.. code-block:: console

   cscontrol fabric=$FABRIC run

The tiles can be paused at any point and the memory state can be inspected:

.. code-block:: console

   csget fabric=$FABRIC map=<path/to/my/map/file> symbol=<symbol_in_map_file> data=<optional_binary_out_file>
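Putting the fabric setup and program load steps together, a complete launch sequence might look like the following sketch; the origin, region size, and image path are illustrative only and should be replaced with values appropriate for your compiled program:

.. code-block:: console

   # Example values only; choose an origin and size that fit your program
   csvfabric pfabric=$PFAB vfabric=$FABRIC x0=1 y0=1 w=100 h=100
   cscontrol fabric=$FABRIC pause
   cswipe fabric=$FABRIC
   csprog image=<path/to>/bench.img fabric=$FABRIC
   cscontrol fabric=$FABRIC run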
Useful Tools and Methods
------------------------

Several runtime environments are headless, which makes development work difficult. Here are a few useful tools to make this easier.

Custom Spyder IDE
^^^^^^^^^^^^^^^^^

NETL maintains a custom version of the Spyder IDE that supports syntax highlighting for WSE kernel code. Installation can be done with conda by following the instructions below:

1. Follow the instructions in the `Obtaining the WFA Code`_ section
2. Clone the `spyder gitlab <https://mfix.netl.doe.gov/gitlab/tjordan/spyder/-/tree/wse_support>`_ repository
3. Check out the v.5.4.2_wse branch
4. Navigate to the folder with the setup.py and env.yml files
5. Run :code:`conda env create -f env.yml --name spyder` on linux or :code:`conda env create -f env_win.yml --name spyder` on windows
6. Run :code:`conda activate spyder`
7. Run :code:`python setup.py install`
8. Run :code:`spyder` on linux, or :code:`spyder.bat` in the :code:`path/to/spyder/repository/scripts` directory on windows

sshfs
^^^^^

sshfs is very handy for remote development, as it allows one to mount a remote directory over standard ssh interfaces. Follow the instructions `here <https://phoenixnap.com/kb/sshfs>`_ to get it set up. It has been tested and works with Neocortex. This will allow you to edit remote files in whatever editor you choose. Once installed, do the following:

Neocortex from Linux
""""""""""""""""""""

First, make a directory to mount:

.. code-block:: console

   sudo mkdir /mnt/<mount_name>

Next, mount whatever directory you wish:

.. code-block:: console

   sudo sshfs <user_name>@neocortex.psc.edu:<path/to/directory> /mnt/<mount_name>

Neocortex from Windows
""""""""""""""""""""""

1. Open the file browser
2. Right click on `This PC`
3. Select `Map Network Drive...`
4. Select a drive letter
5. Paste the following (with correct substitutions) in the `Folder:` text box

.. code-block:: console

   \\sshfs\<user_name>@neocortex.psc.edu[\<path\to\directory>]

Pycharm
^^^^^^^

Pycharm is another IDE similar to Spyder. A useful feature is that it renders :code:`.rst` files and has some minimal word processing capabilities, which makes it very good for editing Sphinx documentation files. To install:

1. Get the latest community edition `here <https://www.jetbrains.com/pycharm/download/#section=linux>`_
2. Untar the file
3. Run it with :code:`sh <path_to_pycharm>/bin/pycharm.sh`