WSE Array Module

The WSE_Array class is designed to allow users to work with 3D arrays in a similar way to NumPy. In fact, user defined WSE_Arrays are initialized from 3D NumPy arrays through the initData keyword on instantiation. The first two axes in the NumPy arrays map to tile coordinates and axis 2 maps to the memory axis on each tile.

Each WSE_Array is designed to represent field data within a cell/voxel in a 3D structured finite element/volume simulation. Thus, the shape of each WSE_Array is set to the number of cells in the cardinal axes (including boundary cells).

Each tile holds a column of cells in the third axis of the NumPy data. Arithmetic is written from the perspective of a single tile in the array.

Slicing a WSE_Array object is similar to NumPy but is in relative coordinates. The first axis is usually a slice range which specifies the elements of each local vector on each tile. It is equivalent to the slicing in the third axis of a 3d NumPy array. The second two axes specify the tile position relative to the local tile coordinate. The second axis specifies the relative position east and west, and the third axis specifies the relative position north and south.

Arithmetic operations are done in the worker field. When neighbor operations are involved, moats send their data into the worker field. Moats are designed to hold NSEW boundary conditions and are usually only updated at the tail end of a time step. Boundary conditions can be updated through a copy operation which will move neighboring values from the worker field into the moat.

WSE_Array format:

dst[1:-1, 0, 0] = s0[2:,0,0] + s1[:-2,0,0]
dst[1:-1, 0, 0] = s0[1:-1,1,0] + s1[1:-1,0,0]

NumPy format:

dst[1:-1,1:-1,1:-1] = s0[1:-1,1:-1,2:] + s1[1:-1,1:-1,:-2]
dst[1:-1,1:-1,1:-1] = s0[1:-1,2:,1:-1] + s1[1:-1,1:-1,1:-1]

class WFA.WSE_Array.WSE_Array(name='array', **kwargs)

Class to handle expressive math with array objects.

abs()

Calculate element wise absolute value of a tensor.

In this example x is a 10x10x10 domain and initialized it as random. The absolute value of x is calculated and reassigned it to x.

def _calc_errors(self, detlaV, V, error_num, error_denom):
    with ASM() as calc_errors_asm:
        abs_numerator = calc_errors_asm.malloc_array()
        abs_numerator[2:-2,0,0] = detlaV[2:-2,0,0].abs()
        abs_numerator[2:-2,0,0]._reduce_max_2_control(error_num)

Raises:: ValueError – Only valid on center.
Returns:: result – The negated WSE_Array.
Return type:: WSE_Array

copy(other, moatArray=None)

Copies other into self. If moatArray is specified, copies other into moatArray memory space on the moat involved rather than self.

# Create the initial data
T_init = np.random.rand(10, 10, 10)

# Instantiate the WSE Array objects needed
x = WSE_Array(name='x', initData=T_init, bank=0)
result = WSE_Array(name='result', initData=np.zeros_like(T_init), bank=2)

# Copy x into result
result[:,0,0].copy(x[:,0,0])

Parameters:

other (WSE_Array) – The WSE_Array to copy.
moatArray (WSE_Array, optional) – The WSE_Array to copy into on the moat if specified. The default is None.

Raises:

ValueError – Slices are different lengths. Moat Array direction must be center. WSE Array copy from current direction is invalid.

Returns:

result – A copy of self.

Return type:

WSE_Array

exp()

Take the nature exponent value of a WSE_Array, out = exp(in).

# Create the initial Temperature field
init_np = np.random.rand(args.x, args.y, args.z)

# Instantiate the WSE Array objects needed
data = WSE_Array(initData=init_np)

data[1:-1, 0, 0] = data[1:-1, 0, 0].exp()

Raises:: ValueError – Only valid on center.
Returns:: result – The natural exponent of the WSE_Array.
Return type:: WSE_Array

float32_to_int16(word=0)

converts a float32 to an int 16 representation. The inherent data structure is based on 32 bit representation, with an upper and lower word of 16 bit width. This function will take a float value and round to the nearest integer and put it in either the upper or lower word based on the word parameter. A 0 will put it in the lower word and a 1 will put it in the upper word. It is possible to store two int16 values in a standard WSE_Array

init_np = np.random.random(args.x, args.y, args.z)
init_np = init_np.astype(np.float32)

# Instantiate the WSE Array objects needed
data = WSE_Array(name='data', initData=init_np)
data2 = WSE_Array(name='data2', initData=np.zeros_like(init_np))

data2[1:-1, 0, 0] = data[1:-1, 0, 0].to_int16()

Parameters:: word (integer) – optional parameter to set the word position of the int16 data, default 0
Returns:: result – the WSE_Array containing the integer representation
Return type:: WSE_Array

gather(index_array, length_scalar)

Gathers data from self according to an index array with lengths specified in length scalar. The resulting data gets compressed in the memory axis to be a dense representation on each tile with dense data filling the first length_scalar elements on the resulting array

init_np = np.random.random(args.x, args.y, args.z)
init_ls = np.zeros((args.x, args.y))
init_np = init_np > 0.5

init_np = init_np.astype(np.float32)

# Instantiate the WSE Array objects needed
data = WSE_Array(name='data', initData=init_np)
data2 = WSE_Array(name='data2', initData=np.zeros_like(init_np))
data3 = WSE_Array(name='data3', initData=np.zeros_like(init_np))
count = WSE_Local_Scalar('count', initData=init_ls)

data2[1:-1, 0, 0], count[0, 0] = data[1:-1, 0, 0].where_true()
data3[1:-1, 0, 0] = data[1:-1, 0, 0].gather(data2[1:-1, 0, 0], count[0, 0])

Parameters:

index_array (WSE_Array) – A WSE_Array containing the values of indices to gather
length_scalar (WSE_Local_Scalar) – A WSE_Local_Scalar containing the number of elements to gather

Raises:

ValueError – Only valid on center.

Returns:

result – The result of the gather with values copied from self at the indices specified in index_array

Return type:

WSE_Array

get_value_at_index(index, local_scalar_value)

populates a local scalar with the value from self defined at the index specified

array1 = WSE_Array()
ls1 = WSE_Local_Scalar()

array1.get_value_at_index(2, ls1)

Parameters:

index (int, WSE_Local_Scalar) – An integer or local scalar that defines the index into the array to get data from
local_scalar_value – The local scalar to store the value in

local_reduce_sum(local_scalar)

Does a reduction along the memory axis into a local scalar

array1 = WSE_Array()
ls1 = WSE_Local_Scalar()

array1[1:-1,0,0].local_reduce_sum(ls1)

Parameters:: local_scalar (WSE_Local_Scalar) – the local scalar to store the result in

max_with(other)

Computes the element wise maximum of self with other.

self._wse.debug_message('FD_Stokes._compute_dT_dt: calculate first order upwind convection in x',
active=debug_active) dT_dt[2:-2,0,0] = dT_dt[2:-2,0,0] - self.Vx[2:-2,0,0].max_with(0.0)*(self.T[2:-2,0,
0] - self.T[2:-2,-1,0])*self._dx Vd_p[2:-2,0,0] = self.Vx[2:-2,1,0] dT_dt[2:-2,0,0] = dT_dt[2:-2,0,
0] + Vd_p[2:-2,0,0].min_with(0.0)*(self.T[2:-2,0,0] - self.T[2:-2,1,0])*self._dx

Parameters:: other (int, float) – A value to max with.
Raises:: ValueError – Only max with constant is implemented at this time.
Returns:: result – A WSE_Array where each element if the maximum value of self and other.
Return type:: TYPE

min_with(other)

Computes the element wise minimum of self with other.

dt_Vx = (self.dx/2.1)/(self.max_abs_Vx + 1e-12)

self._wse.debug_message('FD_Stokes._calc_allowable_dt: conv_dt', active=debug_active)
conv_dt = dt_Vx.min_with(dt_Vy)
conv_dt = conv_dt.min_with(dt_Vz)
if self.check_abs_V:
    self.dtFinal[:,0,0] = self.dtFinal[:,0,0] * conv_dt
return conv_dt.min_with(self.dt_diff)

Parameters:: other (int, float) – A value to min with.
Returns:: neg_max.__neg__() – A WSE_Array where each element if the minimum value of self and other.
Return type:: WSE_Array

push_to_host(start_iteration, stop_iteration, modulo, ts)

Push the wafer data to the host.

with while_loop('time_march', 0.0, inf_loop_scalar) if infinite_loop else for_loop('time_march',
nt) as time_march:

    iter_counter[0] = iter_counter + 1
    T[2:-2,0,0].push_to_host(args.start_iteration_push,args.modulo_iteration_push)

Parameters:

start_iteration (int) – The start iteration to push data to host.
modulo (int) – The modulo to push data to host.

Return type:

None.

random()

fills a WSE_array with random numbers between 0 and 1.

# Create the initial Temperature field
init_np = np.random.rand(args.x, args.y, args.z)

# Instantiate the WSE Array objects needed
data = WSE_Array(initData=init_np)

data[1:-1, 0, 0] = data[1:-1, 0, 0].random()

Raises:: ValueError – Only valid on center.
Returns:: result – A WSE_array filled with random float values between 0 and 1.
Return type:: WSE_Array

reduce_dot_2_control(other, result_scalar)

Computes the dot product reduction between self and other and places it in a WSE_Global_Scalar on the control tile.

# Create the initial data
init_np = np.random.rand(10, 10, 10)
init_np2 = np.random.rand(10, 10, 10)

# Instantiate the WSE Array objects needed
data = WSE_Array(name='data', initData=init_np)
data2 = WSE_Array(name='data', initData=init_np2)

# Instantiate the WSE Global Scalar object
scalar = WSE_Global_Scalar(name='scalar', value=1)

# Compute the dot product of data with data2 and store in scalar on control tile
data[1:-1, 0, 0].reduce_dot_2_control(data2[1:-1, 0, 0], scalar)

Parameters:

other (WSE_Array) – The other Array object to do the dot product with.
result_scalar (WSE_Global_Scalar, int) – The WSE_Global_Scalar object to or scalar index to where to store the result.

Raises:

ValueError – Dot product slices are different lengths.

Return type:

None.

reduce_max_2_control(result_scalar)

Computes the max reduction of self and places it in a WSE_Global_Scalar on the control tile.

# Create the initial data
init_np = np.random.rand(10, 10, 10)

# Instantiate the WSE Array object
data = WSE_Array(name='data', initData=init_np)

# Instantiate the WSE Global Scalar object
scalar = WSE_Global_Scalar(name='scalar', value=1)

# Reduce the data to the scalar on the control tile
data[1:-1, 0, 0].reduce_max_2_control(scalar)

Parameters:: result_scalar (WSE_Global_Scalar, int) – The WSE_Global_Scalar object to or scalar index to where to store the result.
Return type:: None.

reduce_norm_2_control(result_scalar)

Computes the norm of self and places it in a WSE_Global_Scalar on the control tile.

#Create the initial data
init_np = np.random.rand(10, 10, 10)

#Instantiate the WSE Array object
data = WSE_Array(name='data', initData=init_np)

#Instantiate the WSE Global Scalar object
scalar = WSE_Global_Scalar(name='scalar', value=1)

#Compute the norm of data and store in scalar on the control tile
data[1:-1,0,0].reduce_norm_2_control(scalar)

Parameters:: result_scalar (WSE_Global_Scalar, int) – The WSE_Global_Scalar object or scalar index to where to store the result.
Return type:: None.

running_m_s_2d(store_index, n_height=None)

This function calculates a running mean and standard deviation using Welford’s Algorithm for two-dimensional problems set up to run many simulations at the same time. This is useful in statistical dynamics problems such as the Ising model.

To correctly use this function, many 2D simulations should be set up in a 3d tensor such that each simulation has its 2d axes along the vertical, N-S axis in the WFA (axis 1 in numpy initialization) and the other in the memory direction (axis 2 in numpy initialization). The horizontal, E-W axis in the WFA (axis 0 in numpy initialization) is used to fill many simultaneous 2D simulations.

This function will do a reduce mean along the 2d simulation axes and then calculate the running mean and standard deviation. The function is set up to overlap communication/computation of the reduction along the N-S tile direction and Wolford’s Algorithm calculation such that with sufficient work between successive calls to this function, the workers can be kept busy and not wait for the tile reduction and statistics update calculations. The only impact on solution progress with sufficient workload is the reduction along the memory axis which must be done by each worker.

Data ends up in the lower reduction row stored in the symbols running_means and running_stds.

with for_loop('statistical_iteration', 100):

    # put statistical calculations here and ensure the result
    # ends up in stats.  The mean and std will be stored in
    # the first position in the storage arrays

    stats[1:-1, 0, 0].running_m_s_2d(0)

Parameters:

store_index (int) – The position within the storage arrays to place the statistics.
n_height (int, optional) – The size of the dimension in the N-S direction to be reduced. If not supplied, the dimension of the N-S worker field is used. This allows partial worker reductions to be enabled with masking. The default is None.

Return type:

None.

scatter(index_array, length_scalar)

Scatters data from self into destination according to an index array with lengths specified in length scalar.

init_np = np.random.random(args.x, args.y, args.z)
init_ls = np.zeros((args.x, args.y))
init_np = init_np > 0.5

init_np = init_np.astype(np.float32)

# Instantiate the WSE Array objects needed
data = WSE_Array(name='data', initData=init_np)
data2 = WSE_Array(name='data2', initData=np.zeros_like(init_np))
data3 = WSE_Array(name='data3', initData=np.zeros_like(init_np))
data4 = WSE_Array(name='data4', initData=np.zeros_like(init_np))
count = WSE_Local_Scalar('count', initData=init_ls)

data2[1:-1, 0, 0], count[0, 0] = data[1:-1, 0, 0].where_true()
data3[1:-1, 0, 0] = data[1:-1, 0, 0].gather(data2[1:-1, 0, 0], count[0, 0])
data4[1:-1, 0, 0] = data3[1:-1, 0, 0].scatter(data2[1:-1, 0, 0], count[0, 0])

Parameters:

index_array (WSE_Array) – A WSE_Array containing the values of indices to scatter
length_scalar (WSE_Local_Scalar) – A WSE_Local_Scalar containing the number of elements to scatter

Raises:

ValueError – Only valid on center.

Returns:

result – The result of the scatter with values copied from self at the indices specified in index_array

Return type:

WSE_Array

set_value_at_index(index, local_scalar_value)

sets the value of an array at an index

array1 = WSE_Array()
ls1 = WSE_Local_Scalar()

array1.get_value_at_index(2, ls1)

Parameters:

index (int, WSE_Local_Scalar) – An integer or local scalar that defines the index into the array to get data from
local_scalar_value – The local scalar to store the value in

sqrt()

Computes the element wise square root of a tensor.

# Create the initial Temperature field
init_np = np.random.rand(args.x, args.y, args.z)

# Instantiate the WSE Array objects needed
data = WSE_Array(initData=init_np)

data[1:-1, 0, 0] = data[1:-1, 0, 0].sqrt()

Raises:: ValueError – Only valid on center.
Returns:: result – The square root of the WSE_Array.
Return type:: WSE_Array

where_true()

Fills a WSE_Array with coded index information where self is not zero and fills a WSE_Local_Scalar with coded information about the length of valid index values.

init_np = np.random.random(args.x, args.y, args.z)
init_ls = np.zeros((args.x, args.y))
init_np = init_np > 0.5

init_np = init_np.astype(np.float32)

# Instantiate the WSE Array objects needed
data = WSE_Array(name='data', initData=init_np)
data2 = WSE_Array(name='data2', initData=np.zeros_like(init_np))
count = WSE_Local_Scalar('count', initData=init_ls)

data2[1:-1, 0, 0], count[0, 0] = data[1:-1, 0, 0].where_true()

Raises:

ValueError – Only valid on center.

Returns:

result_array (WSE_Array) – The nature exponent of the WSE_Array.
result_scalar (WSE_Local_Scalar) – The nature exponent of the WSE_Array.

WSE Array: Advanced Performance Optimization

For clarity, the discussion in this section is not related to solution accuracy, but is related to maximizing performance when possible.

The memory system of the WSE is set up with multiple banks as is common in many architectures. The WSE allows up to 128 bytes of read and 64 bytes of writes per cycle through a 7 stage pipeline. If the processing pathway will allow it, an instruction can be conducted in Simultaneous Instruction Multiple Data (SIMD). The WSE allows for SIMD 2 in single precision and SIMD 4 in half. However, obtaining the performance of SIMD 2/4 requires that data be accessed on correct banks. In short, a memory bank can only be accessed once per cycle which places restrictions on the placement of data within the memory space. For example, addition operations can be done in SIMD 2 in single precision. The nature of the high-performance operation necessitates that S0 and S1 be separated by 4 banks.

To address this, the WFA maintains eight words (4 SP values) at the head of each local contiguous memory space so that banks can be properly managed. The precompiler in the WFA is designed to identify and eliminate bank conflicts in the graph when dealing with temporaries in mathematical expressions, but does not (yet) have the freedom to adjust the banking for user defined arrays. Thus, the WFA provides a few methods to ensure banking correctness that will maximize performance.

In the example below, temp and spatial_grad are user defined arrays. That is, they are instantiated without the _isWorkingVector keyword option set to True. Do not manually set this option, only discussed for illustration and understanding. Working vectors are recycled in expressions and user data will be lost if manually set.

Here, the final expression spatial_grad[1:-1,0,0] = spatial_grad[1:-1,0,0] + temp[1:-1,0,0] is an addition (or as equally could be a subtraction). To maximize performance, the data in spatial_grad[1:-1,0,0] must be separated from the data in temp[1:-1,0,0] by four banks. To force this to happen, the temp._next_offset = temp[1:-1,0,0]._find_offset_for_add(spatial_grad[1:-1,0,0]) statement forces the bank for temp to align with spatial_grad[1:-1,0,0] at the next write into temp[1:-1,0,0]. The write into temp happens in temp[1:-1,0,0] = self.dHx[1:-1,0,0] - self.dHx[1:-1,1,0]

# Set the bank alignment of temp[1:-1,0,0] to be compatible with spatial_grad[1:-1,0,0]
# at the next write into temp[1:-1,0,0] for an addition operation
temp._next_offset = temp[1:-1,0,0]._find_offset_for_add(spatial_grad[1:-1,0,0])

# do an operation that writes into temp
temp[1:-1,0,0] = self.dHx[1:-1,0,0] - self.dHx[1:-1,1,0]

# temp is not properly aligned with spatial_grid to maximize performance
spatial_grad[1:-1,0,0] = spatial_grad[1:-1,0,0] + temp[1:-1,0,0]

Similarly, a multiply will not execute at maximum SIMD 1 performance unless SO and S1 are on different banks. The WFA provides a _find_offset_for_mul that will ensure that operands for a multiply operation are on different banks.

Performance Limitations

It is not always possible to ensure correct banking. This often happens with additions with offset memory references. For example, it is common to add top and bottom cells together from the same array.

sum[1:-1,0,0] = x[:2,0,0] + x[2:,0,0]

This configuration will always run in SIMD1 because the banking does not allow for SIMD2, thus this will run at 1 add per cycle.

Future WFA Precompiler Improvements

It is certainly possible to develop a better compiler that will manage banks more transparently. This is one of the goals for a new version of the WFA.