WSE Array Module
================

The WSE_Array class is designed to allow users to work with 3D arrays in a similar way to NumPy. In fact, user defined WSE_Arrays are initialized from 3D NumPy arrays through the :code:`initData` keyword on instantiation. The first two axes of the NumPy array map to tile coordinates, and axis 2 maps to the memory axis on each tile.

Each WSE_Array is designed to represent field data within a cell/voxel in a 3D structured finite element/volume simulation. Thus, the shape of each WSE_Array is set to the number of cells along the cardinal axes (including boundary cells). Each tile holds a column of cells corresponding to the third axis of the NumPy data.

Arithmetic is written from the perspective of a single tile in the array. Slicing a WSE_Array object is similar to NumPy, but in relative coordinates. The first axis is usually a slice range that selects the elements of the local vector on each tile; it is equivalent to slicing along the third axis of a 3D NumPy array. The last two axes specify the tile position relative to the local tile coordinate: the second axis gives the relative position to the east or west, and the third axis gives the relative position to the north or south.

Arithmetic operations are done in the worker field. When neighbor operations are involved, moats send their data into the worker field. Moats are designed to hold NSEW boundary conditions and are usually only updated at the tail end of a time step. Boundary conditions can be updated through a copy operation, which moves neighboring values from the worker field into the moat.

WSE_Array format:

.. code-block:: python

    dst[1:-1, 0, 0] = s0[2:,0,0] + s1[:-2,0,0]
    dst[1:-1, 0, 0] = s0[1:-1,1,0] + s1[1:-1,0,0]

The equivalent NumPy format:

.. code-block:: python

    dst[1:-1,1:-1,1:-1] = s0[1:-1,1:-1,2:] + s1[1:-1,1:-1,:-2]
    dst[1:-1,1:-1,1:-1] = s0[1:-1,2:,1:-1] + s1[1:-1,1:-1,1:-1]

.. autoclass:: WFA.WSE_Array.WSE_Array
    :members:
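As a minimal instantiation sketch only: the :code:`initData` keyword is described above, but the full constructor signature (including any additional required arguments) should be checked against the class reference, and the shape and dtype here are purely illustrative.

.. code-block:: python

    import numpy as np

    from WFA.WSE_Array import WSE_Array

    # Axes 0 and 1 map to tile coordinates; axis 2 maps to the memory
    # axis on each tile. The shape, which includes boundary cells, is
    # arbitrary here.
    field = np.zeros((12, 12, 34), dtype=np.float32)

    # Assumed call: any other constructor arguments are omitted; see the
    # WSE_Array class reference above for the full signature.
    u = WSE_Array(initData=field)

With :code:`u` defined this way, the relative slicing described above applies; for example, :code:`u[1:-1, 0, 0]` addresses the interior elements of the local vector on each tile.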
WSE Array: Advanced Performance Optimization
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For clarity, the discussion in this section is not related to solution accuracy; it is about maximizing performance when possible.

The memory system of the WSE is set up with multiple banks, as is common in many architectures. The WSE allows up to 128 bytes of reads and 64 bytes of writes per cycle through a 7-stage pipeline. If the processing pathway allows it, an instruction can be executed in Single Instruction, Multiple Data (SIMD) fashion. The WSE allows SIMD 2 in single precision and SIMD 4 in half precision. However, obtaining the performance of SIMD 2/4 requires that data be accessed on the correct banks. In short, a memory bank can only be accessed once per cycle, which places restrictions on the placement of data within the memory space.

For example, addition operations can be done in SIMD 2 in single precision. The nature of the high-performance operation necessitates that S0 and S1 be separated by 4 banks. To address this, the WFA maintains eight words (4 single-precision values) at the head of each local contiguous memory space so that banks can be properly managed. The precompiler in the WFA is designed to identify and eliminate bank conflicts in the graph when dealing with temporaries in mathematical expressions, but it does not (yet) have the freedom to adjust the banking for user defined arrays. Thus, the WFA provides a few methods to ensure banking correctness that will maximize performance.

In the example below, :code:`temp` and :code:`spatial_grad` are user defined arrays. That is, they are instantiated without the :code:`_isWorkingVector` keyword option set to :code:`True`. Do not set this option manually; it is discussed here only for illustration and understanding. Working vectors are recycled in expressions, and user data will be lost if the option is set manually.

Here, the final expression :code:`spatial_grad[1:-1,0,0] = spatial_grad[1:-1,0,0] + temp[1:-1,0,0]` is an addition (a subtraction would behave identically). To maximize performance, the data in :code:`spatial_grad[1:-1,0,0]` must be separated from the data in :code:`temp[1:-1,0,0]` by four banks. The statement :code:`temp._next_offset = temp[1:-1,0,0]._find_offset_for_add(spatial_grad[1:-1,0,0])` forces the bank for :code:`temp` to align with :code:`spatial_grad[1:-1,0,0]` at the next write into :code:`temp[1:-1,0,0]`. That write happens in :code:`temp[1:-1,0,0] = self.dHx[1:-1,0,0] - self.dHx[1:-1,1,0]`.

.. code-block:: python

    # Set the bank alignment of temp[1:-1,0,0] to be compatible with
    # spatial_grad[1:-1,0,0] at the next write into temp[1:-1,0,0]
    # for an addition operation
    temp._next_offset = temp[1:-1,0,0]._find_offset_for_add(spatial_grad[1:-1,0,0])

    # do an operation that writes into temp
    temp[1:-1,0,0] = self.dHx[1:-1,0,0] - self.dHx[1:-1,1,0]

    # temp is now properly aligned with spatial_grad to maximize performance
    spatial_grad[1:-1,0,0] = spatial_grad[1:-1,0,0] + temp[1:-1,0,0]

Similarly, a multiply will not execute at its maximum performance (SIMD 1) unless S0 and S1 are on different banks. The WFA provides a :code:`_find_offset_for_mul` method that ensures the operands of a multiply operation are on different banks; it is used with :code:`_next_offset` in the same pattern shown above.

Performance Limitations
_______________________

It is not always possible to ensure correct banking. This often happens with additions involving offset memory references into the same array. For example, it is common to add top and bottom cells together from the same array:

.. code-block:: python

    sum[1:-1,0,0] = x[:-2,0,0] + x[2:,0,0]

This configuration will always run in SIMD 1 because the banking does not allow for SIMD 2; it therefore runs at one add per cycle.

Future WFA Precompiler Improvements
___________________________________

It is certainly possible to develop a better compiler that manages banks more transparently. This is one of the goals for a new version of the WFA.