Artificial neural networks are computational models inspired by biological nervous systems, capable of approximating functions that depend on a large number of inputs. A network is defined by a connectivity structure and a set of weights between interconnected processing units ("neurons"). Neural networks "learn" a given task by tuning the set of weights under an optimization procedure.

Let's create a feedforward neural network with DiffSharp and implement the backpropagation algorithm for training it. As mentioned before, backpropagation is just a special case of reverse mode AD.

We start by defining our neural network structure. The network will consist of several layers of neurons. Each neuron takes an input vector $\mathbf{x}$ and computes the activation (output) $a = \sigma\left(\sum_i w_i x_i + b\right)$, where the $w_i$ are synapse weights associated with each input, $b$ is a bias, and $\sigma$ is an activation function representing the rate of action potential firing in the neuron.

For a long time, the conventional choice of activation function was the sigmoid $\sigma(z) = 1 / (1 + e^{-z})$, because of its simple derivative and gain control properties. More recently, the hyperbolic tangent $\tanh$ and the rectified linear unit $\operatorname{ReLU}(z) = \max(0, z)$ have become more popular choices due to their convergence and performance characteristics.

Now let's write the network evaluation code, together with a function for creating a given network configuration and initializing the weights and biases with small random values; see the sketches below. In practice, proper weight initialization has been demonstrated to have an important effect on training convergence, and good initialization schemes depend on the network structure and the type of activation functions used.

Network evaluation is implemented using linear algebra. We have a weight matrix $\mathbf{W}_l$ holding the weights of all neurons in layer $l$, where the element $w_{ij}$ is the weight between the $j$-th neuron in layer $l-1$ and the $i$-th neuron in layer $l$. We also have a bias vector $\mathbf{b}_l$ holding the biases of all neurons in layer $l$ (one bias per neuron). Evaluating the network is then just a matter of computing the activation $\mathbf{a}_l = \sigma(\mathbf{W}_l \mathbf{a}_{l-1} + \mathbf{b}_l)$ for each layer, with $\sigma$ applied elementwise, and passing the activation vector of each layer as the input to the next, until we get the network output as the output vector of the last layer.
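To make the activation concrete, here is a minimal sketch of the sigmoid, assuming the DiffSharp 0.x `DiffSharp.AD.Float64` API, where `D` is the AD-enabled scalar type and mixed `float`/`D` arithmetic goes through DiffSharp's operator overloads:

```fsharp
open DiffSharp.AD.Float64

// Sigmoid activation σ(z) = 1 / (1 + e^(-z)) on the AD scalar type D.
// Mixing float literals with D values relies on DiffSharp's overloads.
let sigmoid (z:D) = 1. / (1. + exp -z)
```

Elementwise primitives such as `exp` (and likewise `tanh`) are overloaded for `D`, so derivatives of the activation are available automatically.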
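Next, a sketch of the network representation and the creation function, again under DiffSharp 0.x assumptions: `DV` and `DM` are the AD-enabled vector and matrix types, with `DV.init` and `DM.init` as elementwise constructors. The uniform initialization in $[-0.5, 0.5)$ is just the simple "small random values" scheme mentioned above, not a tuned initialization:

```fsharp
let rnd = System.Random()

// A layer of neurons
type Layer =
    {mutable W : DM   // weight matrix, one row of weights per neuron
     mutable b : DV}  // bias vector, one bias per neuron

// A feedforward network of neuron layers
type Network =
    {layers : Layer[]}

// Create a network from an array of layer sizes, e.g. [|2; 3; 1|]
// gives 2 inputs, a hidden layer of 3 neurons, and 1 output neuron.
// Weights and biases start as small random values in [-0.5, 0.5).
let createNetwork (l:int[]) =
    {layers = Array.init (l.Length - 1) (fun i ->
        {W = DM.init l.[i + 1] l.[i] (fun _ _ -> D (-0.5 + rnd.NextDouble()))
         b = DV.init l.[i + 1] (fun _ -> D (-0.5 + rnd.NextDouble()))})}
```

The weight and bias fields are `mutable` so that a training loop can update them in place.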
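Finally, a sketch of evaluation under the same assumptions: `runLayer` computes $\sigma(\mathbf{W}_l \mathbf{a}_{l-1} + \mathbf{b}_l)$ for a single layer, with `DV.map` applying the activation elementwise, and `runNetwork` folds the input through the layers:

```fsharp
// Evaluate one layer: a_l = σ(W_l a_(l-1) + b_l), σ applied elementwise
let runLayer (x:DV) (l:Layer) =
    l.W * x + l.b |> DV.map sigmoid

// Evaluate the network by feeding each layer's output to the next
let runNetwork (x:DV) (n:Network) =
    Array.fold runLayer x n.layers
```

For example, `runNetwork (toDV [0.5; 1.0]) (createNetwork [|2; 3; 1|])` would evaluate a freshly initialized 2-input, 1-output network on one input vector (assuming `toDV` builds a `DV` from a float list).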