Parallel programing for economic models using GPUs

We want to test how fast we can compute an economic model using GPU computing tools.

To do so we benchmark our code with a well known economic problem: a consumption savings model. We build off the model and codes outlined in A Practical Guide to Parallel Computing in Macroeconomics by Jesús Fernández-Villaverde and David Zarruk Valencia, repository here.

Headline results

We benchmark three python parallel programming methods:

  1. CPU parallization using numpy package and vectorized code;
  2. CPU parallization using numba package and non-vectorized code;
  3. DPU parallization using numba package and non-vectorized code; and

Cupy is a numpy-like package that automatically parralizes vector operation using GPUs. For this reason we benchmark it against a similar vector implementation using numpy. We also benchmark against numba, which was the fasted non-GPU based python pacakge in our tests. Numba is a just-in-time compiler that can get speeds close to C in python.

Headline results:

  1. Non-parallel python (numba): 13.7 seconds;
  2. Numpy vectorized: 10.8 seconds;
  3. Numba non-vectorized code: 3 seconds;
  4. Cupy vectorized: 0.04 seconds.

Cupy is almost 75x faster than our best alternative using Numba (~3 seconds)! It is also a significant speed-up relative to the best time reposted by Fernández-Villaverde and Zarruk: 1.42 seconds in a GPU implementation using Open ACC and C++.

We ran this on a machine with a Intel i7-9750H CPU with 6cores/12threads, and a Nvidia GeForce GTX 1650 GPU with 896 CUDA cores.

Initialization of the model

The code to set parameters and model objects is taken directly ffrom

In [51]:
#--------------------------------#
#         House-keeping          #
#--------------------------------#

from numba import jit,  njit, prange, int64, float64, vectorize, guvectorize
from numba.experimental import jitclass
import numpy
import numpy as np
import math
import time
from scipy.stats import norm
from collections import OrderedDict
import sys

#added by brian
from numba import cuda
import cupy as cp

import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 14})
plt.rcParams.update({'lines.linewidth': 3.0})
plt.style.use('default')

#--------------------------------#
#         Initialization         #
#--------------------------------#

# Number of workers

# Grid for x
nx            = 1500;
xmin          = 0.1;
xmax          = 4.0;

# Grid for e: parameters for Tauchen
ne            = 15;
ssigma_eps    = 0.02058;
llambda_eps   = 0.99;
m             = 1.5;

# Utility function
ssigma        = 2;
bbeta         = 0.97;
T             = 10;

# Prices
r             = 0.07;
w             = 5;

# Initialize the grid for X
xgrid = numpy.zeros(nx)

# Initialize the grid for E and the transition probability matrix
egrid = numpy.zeros(ne)
P     = numpy.zeros((ne, ne))




#--------------------------------#
#         Grid creation          #
#--------------------------------#

# Function to construct the grid for capital (x)
size = nx;
xstep = (xmax - xmin) /(size - 1);
it = 0;
for i in range(0,nx):
    xgrid[i] = xmin + it*xstep;
    it = it+1;


# Function to construct the grid for productivity (e) using Tauchen (1986)
size = ne;
ssigma_y = math.sqrt(math.pow(ssigma_eps, 2) / (1 - math.pow(llambda_eps,2)));
estep = 2*ssigma_y*m / (size-1);
it = 0;
for i in range(0,ne):
    egrid[i] = (-m*math.sqrt(math.pow(ssigma_eps, 2) / (1 - math.pow(llambda_eps,2))) + it*estep);
    it = it+1;


# Function to construct the transition probability matrix for productivity (P) using Tauchen (1986)
mm = egrid[1] - egrid[0];
for j in range(0,ne):
    for k in range(0,ne):
        if (k == 0):
            P[j, k] = norm.cdf((egrid[k] - llambda_eps*egrid[j] + (mm/2))/ssigma_eps);
        elif (k == ne-1):
            P[j, k] = 1 - norm.cdf((egrid[k] - llambda_eps*egrid[j] - (mm/2))/ssigma_eps);
        else:
            P[j, k] = norm.cdf((egrid[k] - llambda_eps*egrid[j] + (mm/2))/ssigma_eps) - norm.cdf((egrid[k] - llambda_eps*egrid[j] - (mm/2))/ssigma_eps);


# Exponential of the grid e
for i in range(0,ne):
    egrid[i] = math.exp(egrid[i]);
In [52]:
#--------------------------------#
#     Structure and function     #
#--------------------------------#

# Value function
VV = math.pow(-10.0, 5);

specs = OrderedDict()
specs['ind'] = int64
specs['ne'] = int64
specs['nx'] = int64
specs['T'] = int64
specs['age'] = int64
specs['P'] = float64[:,:]
specs['xgrid'] = float64[:]
specs['egrid'] = float64[:]
specs['ssigma'] = float64
specs['bbeta'] = float64
specs['w'] = float64
specs['r'] = float64
specs['V'] = float64[:,:,:]


# Data structure of state and exogenous variables
@jitclass(specs)
class modelState(object):
    def __init__(self,ind,ne,nx,T,age,P,xgrid,egrid,ssigma,bbeta,w,r,V):
        self.ind        = ind
        self.ne         = ne
        self.nx         = nx
        self.T          = T
        self.age        = age
        self.P          = P
        self.xgrid      = xgrid
        self.egrid      = egrid
        self.ssigma     = ssigma
        self.bbeta      = bbeta
        self.w          = w
        self.r          = r
        self.V          = V
# Function that returns value for a given state
# ind: a unique state that corresponds to a pair (ie,ix)
@njit
def value_func(states):

    ind = states.ind
    ne = states.ne
    nx = states.nx
    T = states.T
    age = states.age
    P = states.P
    xgrid = states.xgrid
    egrid = states.egrid
    ssigma = states.ssigma
    bbeta = states.bbeta
    w = states.w
    r = states.r
    V = states.V

    ix = int(math.floor(ind/ne));
    ie = int(math.floor(ind%ne));

    VV = math.pow(-10.0, 3)
    for ixp in range(0,nx):
        expected = 0.0;
        if(age < T-1):
            for iep in range(0,ne):
                expected = expected + P[ie, iep]*V[age+1, ixp, iep]

        cons  = (1 + r)*xgrid[ix] + egrid[ie]*w - xgrid[ixp];

        utility = math.pow(cons, (1-ssigma))/(1-ssigma) + bbeta*expected;

        if(cons <= 0):
            utility = math.pow(-10.0,5);
        
        if(utility >= VV):
            VV = utility;

        utility = 0.0;

    return[VV];

@njit(parallel=True)
def compute(age, V):

    for ind in prange(0,nx*ne):

        states = modelState(ind, ne, nx, T, age, P, xgrid, egrid, ssigma, bbeta, w, r, V)

        ix = int(math.floor(ind/ne));
        ie = int(math.floor(ind%ne));
            
        V[age, ix, ie] = value_func(states)[0];

    return(V)

No parallel programming

In [54]:
start3 = time.time()
# Initialize value function V
V     = numpy.zeros((T, nx, ne))

for age in range(T-1, -1, -1):
    V = compute_nonparrallel(age, V)

    finish2 = time.time() - start3
    print("Age: ", age+1, ". Time: ", round(finish2, 4), " seconds.")


finish_nonpar = time.time() - start3
print("TOTAL ELAPSED TIME (not parrallelized): ", round(finish_nonpar , 4), " seconds. \n")
Age:  10 . Time:  1.0383  seconds.
Age:  9 . Time:  2.4395  seconds.
Age:  8 . Time:  3.8552  seconds.
Age:  7 . Time:  5.2854  seconds.
Age:  6 . Time:  6.7314  seconds.
Age:  5 . Time:  8.164  seconds.
Age:  4 . Time:  9.5816  seconds.
Age:  3 . Time:  10.982  seconds.
Age:  2 . Time:  12.3605  seconds.
Age:  1 . Time:  13.732  seconds.
TOTAL ELAPSED TIME (not parrallelized):  13.734  seconds. 

CPU parralel with Numba

In [56]:
start = time.time()
# Initialize value function V
V     = numpy.zeros((T, nx, ne))

for age in range(T-1, -1, -1):
    V = compute(age, V)

    finish = time.time() - start
    print("Age: ", age+1, ". Time: ", round(finish, 4), " seconds.")


finish = time.time() - start
print("TOTAL ELAPSED TIME: ", round(finish, 4), " seconds. \n")


#---------------------#
#     Some checks     #
#---------------------#

print(" - - - - - - - - - - - - - - - - - - - - - \n")
print("The first entries of the value function: \n")

for i in range(0,3):
	print(round(V[0, 0, i], 5))

print(" \n")
Age:  10 . Time:  0.2563  seconds.
Age:  9 . Time:  0.5881  seconds.
Age:  8 . Time:  0.9043  seconds.
Age:  7 . Time:  1.1925  seconds.
Age:  6 . Time:  1.4723  seconds.
Age:  5 . Time:  1.7735  seconds.
Age:  4 . Time:  2.0831  seconds.
Age:  3 . Time:  2.3928  seconds.
Age:  2 . Time:  2.6691  seconds.
Age:  1 . Time:  2.9593  seconds.
TOTAL ELAPSED TIME:  2.9603  seconds. 

 - - - - - - - - - - - - - - - - - - - - - 

The first entries of the value function: 

-2.11762
-2.07729
-2.02366
 

CPU parallel with vectorizing Numpy

In [57]:
# Bellman equation with assets x as endog state
#
#   v(x,e,t) = max_{x'} { u((1+r)x+wy-x') + beta * E[ v(x',y',t+1)|y] }
#
nstates=nx*ne

V = np.zeros((nx*ne,T))
b = np.zeros((nx*ne,T))
xgrid_vec = xgrid[np.newaxis,:]

# felicity as function of state & choice (nx x ne)
#   preset consumption, and penalize violation of budget constraint

#testing dimentions
#temp1=(1+r)*np.kron(np.ones((ne,1)),xgrid[:,np.newaxis])                   
#temp2 = w* np.kron(egrid[:,np.newaxis],np.ones((nx,1)))
#print(temp1.shape,temp2.shape) 
#temp3 = -np.ones((nstates,1))*xgrid
#print(temp1.shape,temp2.shape,temp3.shape)                    
#cons =temp1 +temp2 +temp3

cons = (1+r)*np.kron(np.ones((ne,1)),xgrid[:,np.newaxis])  +w* np.kron(egrid[:,np.newaxis],np.ones((nx,1))) -np.ones((nstates,1))*xgrid
#print(cons.shape)
util=(-10**(10))*np.ones((nstates,nx))
#print(util.shape)
util[cons>0]=(cons[cons>0]**(1-ssigma))/(1-ssigma)
#print(util.shape)
In [58]:
tempV=np.zeros(nstates)
longetrans=np.kron(P,np.ones((1,nx)))
picker=np.kron(np.ones((ne,1)),np.eye(nx))

#print(picker.shape,longetrans.shape, tempV.shape)
#(tempV[:,np.newaxis] @ np.ones((1,nx))).shape
In [59]:
start = time.time()

for t in range(T-1, -1, -1):
    # Bellman eqn
    print(t)
    smallcont= longetrans @ (picker* (tempV[:,np.newaxis] @ np.ones((1,nx)) ) )
    cont=np.kron(smallcont,np.ones((nx,1)))
    v_before_max = util + bbeta #* cont
    #print(v_before_max.shape)
    tempV= np.copy(np.amax(v_before_max,axis=1))
    
    policyind = np.argmax(util + bbeta * cont,axis=1)
    #print(tempV.shape,V.shape)
    V[:,t]= np.copy(tempV)
    b[:,t]=xgrid[policyind]
    
    finish = time.time() - start
    print(['Age: ', t, '. Time: ', round(finish, 3), ' seconds.'])
    
finish_nonpar = time.time() - start
print("TOTAL ELAPSED TIME (not parrallelized): ", round(finish_nonpar , 4), " seconds. \n")    

print( " - - - - - - - - - - - - - - - - - - - - - \n")
print( "The first entries of the value function: \n")

for i in range(0,3):
	print(round(V[i, 0], 5))

print(" \n")
9
['Age: ', 9, '. Time: ', 1.015, ' seconds.']
8
['Age: ', 8, '. Time: ', 2.066, ' seconds.']
7
['Age: ', 7, '. Time: ', 3.147, ' seconds.']
6
['Age: ', 6, '. Time: ', 4.271, ' seconds.']
5
['Age: ', 5, '. Time: ', 5.355, ' seconds.']
4
['Age: ', 4, '. Time: ', 6.502, ' seconds.']
3
['Age: ', 3, '. Time: ', 7.586, ' seconds.']
2
['Age: ', 2, '. Time: ', 8.611, ' seconds.']
1
['Age: ', 1, '. Time: ', 9.777, ' seconds.']
0
['Age: ', 0, '. Time: ', 10.835, ' seconds.']
TOTAL ELAPSED TIME (not parrallelized):  10.8352  seconds. 

 - - - - - - - - - - - - - - - - - - - - - 

The first entries of the value function: 

0.72151
0.72168
0.72185
 

In [60]:
plt.plot(b[:,1])
plt.title('Policy function')
#save result to compare
b_non_gpu = b[0:100,1]

GPU parallel with vectorization and Cupy

Convert vectorized numpy code to cupy. This is easy, replace np with cp everywhere. We just need to ensure all the vectors are initiaized with cupy.

In [61]:
P=cp.array(P)
xgrid=cp.array(xgrid)
egrid=cp.array(egrid)

#store value & policy functions
V = cp.zeros((nx*ne,T))
b = cp.zeros((nx*ne,T))

nstates=nx*ne
In [62]:
#
#   v(x,e,t) = max_{x'} { u((1+r)x+wy-x') + beta * E[ v(x',y',t+1)|y] }
#

V = cp.zeros((nx*ne,T))
b = cp.zeros((nx*ne,T))

# felicity as function of state & choice (nx x ne)
#   preset consumption, and penalize violation of budget constraint

cons = (1+r)*cp.kron(cp.ones((ne,1)),xgrid[:,cp.newaxis])  +w* cp.kron(egrid[:,cp.newaxis],cp.ones((nx,1))) -cp.ones((nstates,1))*xgrid
#print(cons.shape)
util=(-10**(10))*cp.ones((nstates,nx))
#print(util.shape)
util[cons>0]=(cons[cons>0]**(1-ssigma))/(1-ssigma)
#print(util.shape)
In [63]:
tempV=cp.zeros(nstates)
longetrans=cp.kron(P,cp.ones((1,nx)))
picker=cp.kron(cp.ones((ne,1)),cp.eye(nx))
#print(picker.shape,longetrans.shape, tempV.shape)

#(tempV[:,np.newaxis] @ np.ones((1,nx))).shape
In [64]:
start = time.time()

for t in range(T-1, -1, -1):
    # Bellman eqn
    print(t)
    smallcont= longetrans @ (picker* (tempV[:,cp.newaxis] @ cp.ones((1,nx)) ) )
    cont=cp.kron(smallcont,cp.ones((nx,1)))
    v_before_max = util + bbeta #* cont
    #print(v_before_max.shape)
    
    tempV= cp.copy(cp.amax(v_before_max,axis=1))
    policyind = cp.argmax(util + bbeta * cont,axis=1)
    
    #print(tempV.shape,V.shape)
    V[:,t]= cp.copy(tempV)
    b[:,t]=xgrid[policyind]
    
    finish = time.time() - start
    print(['Age: ', t, '. Time: ', round(finish, 3), ' seconds.'])
    
finish_nonpar = time.time() - start
print("TOTAL ELAPSED TIME (cupy parrallelized): ", round(finish_nonpar , 4), " seconds. \n")    
9
['Age: ', 9, '. Time: ', 0.039, ' seconds.']
8
['Age: ', 8, '. Time: ', 0.044, ' seconds.']
7
['Age: ', 7, '. Time: ', 0.049, ' seconds.']
6
['Age: ', 6, '. Time: ', 0.053, ' seconds.']
5
['Age: ', 5, '. Time: ', 0.059, ' seconds.']
4
['Age: ', 4, '. Time: ', 0.065, ' seconds.']
3
['Age: ', 3, '. Time: ', 0.072, ' seconds.']
2
['Age: ', 2, '. Time: ', 0.076, ' seconds.']
1
['Age: ', 1, '. Time: ', 0.08, ' seconds.']
0
['Age: ', 0, '. Time: ', 0.084, ' seconds.']
TOTAL ELAPSED TIME (cupy parrallelized):  0.0839  seconds. 

Confirming we have the right answer

We print the first ten elements of the policy function, and then the difference between them

In [65]:
print(b_non_gpu[:10])
print(b[:10,1])
[0.12081388 0.12081388 0.12341561 0.12341561 0.12601734 0.12601734
 0.12861908 0.12861908 0.13122081 0.13122081]
[0.12081388 0.12081388 0.12341561 0.12341561 0.12601734 0.12601734
 0.12861908 0.12861908 0.13122081 0.13122081]
In [66]:
print('Difference = ', cp.array(b_non_gpu[:10])-b[:10,1] )
Difference =  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
In [67]:
print(r'Speedup: (vs numpy vectorized)',  round(10/0.07,3))
print(r'Speedup: (vs numba)', round(3.24/0.043,3))
Speedup: (vs numpy vectorized) 142.857
Speedup: (vs numba) 75.349

Plotting comparison

In [75]:
fig = plt.figure(figsize=(9,5))
ax = fig.add_axes([0,0,1,1])
langs = ['Numba (no parallel)', 'Numpy (CPU, vectorized)', 'Numba (CPU, non-vectorized)' , 'Cupy (GPU, vectorized)']
speed = [13.7, 10.8425,3.2096,0.2606]
ax.bar(langs,speed, color='gray' )
ax.set_ylabel('Time, seconds')
ax.set_title('Comparison of speed of parallel programming packages')

fig.savefig('parallel_speeds.png')
plt.show()
In [ ]: