ANN = computational model inspired by biological nervous systems.
FFNN = feed-forward, as opposed to recurrent, ANNs
BkP = backpropagation is just a special case of reverse-mode automatic differentiation (RMAD)
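My mental model of RMAD in a few lines of Python (a sketch of my own, not Trask's or guru's code): each Var records its parents and the local partial derivatives, and backward() walks the graph in reverse topological order accumulating adjoints. Backprop is exactly this, specialized to layered nets.

```python
import math

class Var:
    """Scalar node in a computation graph for reverse-mode AD."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # sequence of (parent_var, local_partial)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def sigmoid(self):
        s = 1.0 / (1.0 + math.exp(-self.value))
        return Var(s, [(self, s * (1.0 - s))])   # local partial: sigma'(z)

    def backward(self):
        # build reverse topological order so each node's adjoint is
        # complete before it is propagated to its parents
        topo, visited = [], set()
        def build(v):
            if id(v) not in visited:
                visited.add(id(v))
                for p, _ in v.parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0          # seed the output adjoint
        for v in reversed(topo):
            for p, partial in v.parents:
                p.grad += v.grad * partial

w, x = Var(2.0), Var(3.0)
a = (w * x).sigmoid()            # a = sigma(w*x)
a.backward()
# w.grad now holds d a / d w = sigma'(w*x) * x
```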
;; That is where I am still stuck, how is Trask
a = sigma(wx) ;; guru has not buried the bias term in w, treating it as a separate parameter b: a = sigma(wx + b)
For a long time the conventional choice of activation function was the sigmoid sigma(z) = 1/(1 + e^(-z)), thanks to its simple derivative sigma'(z) = sigma(z)(1 - sigma(z)) and its gain-control properties. More recently the hyperbolic tangent tanh(z) and the rectified linear unit ReLU(z) = max(0, z) have become the popular choices because of their convergence and performance characteristics.
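The three activations above, as a quick numpy sketch (my own, for reference):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # the "simple derivative" property: sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh(z):
    return np.tanh(z)

def relu(z):
    return np.maximum(0.0, z)

# sigmoid saturates toward 0/1, tanh toward -1/+1, relu is identity for z > 0
z = np.linspace(-4.0, 4.0, 9)
```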
weights and biases initialized randomly ;; WHY, what happens when you don't? If the dynsys is truly dissipative, it
should not matter.
In practice, proper weight initialization has a well-documented effect on training convergence, and the right scheme depends on the network architecture and on the activation functions used.
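One concrete answer to the WHY above, sketched on a hypothetical 2-3-1 tanh net with a 0.5*(yhat-y)^2 loss (my example): if every weight starts at the same value, every hidden unit computes the same activation and receives the same gradient, so gradient descent can never break the symmetry and the units stay clones; all-zeros is worse still, since the gradient vanishes entirely.

```python
import numpy as np

def grads(W1, W2, x, y):
    """Hand-derived gradients for a 2-3-1 net: tanh hidden layer, linear output."""
    h = np.tanh(W1 @ x)                        # hidden activations
    err = W2 @ h - y                           # dLoss/dyhat for 0.5*(yhat-y)^2
    dW2 = err * h
    dh = err * W2
    dW1 = ((1.0 - h**2) * dh)[:, None] @ x[None, :]   # tanh'(z) = 1 - tanh(z)^2
    return dW1, dW2

x, y = np.array([1.0, 2.0]), 1.0

# constant init: every row of dW1 is identical -> units never differentiate
W1c, W2c = np.full((3, 2), 0.5), np.full(3, 0.5)
dW1_const, _ = grads(W1c, W2c, x, y)

# random init: rows differ, symmetry is broken from step one
rng = np.random.default_rng(0)
dW1_rand, _ = grads(rng.normal(0, 0.1, (3, 2)), rng.normal(0, 0.1, 3), x, y)
```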
ACTIVATION = ?? is it a_i = w_i a_{i-1} + b_i or is it sigma of that???? ;; convention: the affine part z_i = w_i a_{i-1} + b_i is the PRE-activation, and a_i = sigma(z_i) is the activation
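The standard convention, as code (my sketch): keep two names per layer, z for the affine pre-activation and a for its image under sigma.

```python
import numpy as np

def forward(weights, biases, a0, sigma=np.tanh):
    """Feed-forward pass: a0 is the input, returns the last layer's activation."""
    a = a0
    for W, b in zip(weights, biases):
        z = W @ a + b      # z_i = W_i a_{i-1} + b_i  (pre-activation)
        a = sigma(z)       # a_i = sigma(z_i)         (activation)
    return a
```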
LOSS ;; the energy function whose gradient is intuitively meaningful .... my take on it
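To make "loss as an energy function" concrete, a minimal sketch with mean squared error (my choice of loss, nothing from guru here): the scalar it returns is the energy, and its gradient with respect to the prediction is the error signal that backprop starts from.

```python
import numpy as np

def mse(yhat, y):
    """Energy: 0.5 * mean of squared errors."""
    return 0.5 * np.mean((yhat - y) ** 2)

def mse_grad(yhat, y):
    """d(mse)/d(yhat): points uphill; descend along its negative."""
    return (yhat - y) / yhat.size
```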
guru calls this business of extra labelling TAGGING, which keeps you from going around in circles when
using Leibniz notation: dx/dy != (dy/dx)^(-1) in either meaning of inverse, despite the fractional notation.
The last comment in RMDAD.png says tagging avoids PERTURBATION CONFUSION,
and a bit of browsing reveals that this is a complicated topic, mostly
having to do with FMAD ?! and not worth worrying about at this moment.
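Still, a toy demonstration of perturbation confusion with untagged dual numbers (my example, not from RMDAD.png): nesting D inside D reuses the same epsilon, so the outer perturbation leaks into the inner derivative and the answer comes out wrong. Tagging would give each D call its own epsilon so the two perturbations cannot mix.

```python
class Dual:
    """Untagged dual number: value plus a single shared perturbation eps."""
    def __init__(self, re, eps=0.0):
        self.re, self.eps = re, eps

    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.re + o.re, self.eps + o.eps)
    __radd__ = __add__

    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.re * o.re, self.re * o.eps + self.eps * o.re)
    __rmul__ = __mul__

def D(f, x):
    # naive FMAD: every nested call reuses the same epsilon -- the bug
    return f(Dual(x, 1.0)).eps

# un-nested, the naive version is fine: d/dx x^2 at 3 is 6
simple = D(lambda x: x * x, 3.0)

# nested: d/dx [ x * d/dy (x + y) ] at x=1. The inner derivative is 1,
# so the whole thing is d/dx x = 1. The naive version returns 2 instead,
# because the outer epsilon leaks into the inner derivative.
naive = D(lambda x: x * D(lambda y: x + y, 1.0), 1.0)
```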