Activation functions for Neural Networks

4 min read · Jan 3, 2022


In a neural network, a single neuron actually executes two tasks:

  • first, using its weight matrix and bias, it applies a linear transformation that maps the inputs to a single value;
  • then, it applies a specific activation function, transforming that value into the output of the neuron.

The activation function can be any function defined on ℝ, but in practice a handful of specific functions are used in most cases.

The purpose of an activation function is to add non-linearity to the neural network. If we applied no function at all (or, equivalently, the identity function f(x) = x), it is easy to show that the output would be a linear combination of the inputs, no matter how many hidden layers and neurons are involved.
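This collapse is easy to verify numerically. The sketch below (with arbitrary illustrative weights, not values from the article) chains two layers without any activation function and shows that the result equals a single linear map:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation function in between.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)

# Forward pass through the two linear layers.
h = W1 @ x + b1
y = W2 @ h + b2

# The composition collapses into one linear layer W x + b.
W = W2 @ W1
b = W2 @ b1 + b2
y_single = W @ x + b

assert np.allclose(y, y_single)
```

However many such layers we stack, the network can only represent a linear function of its inputs.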

1. Linear functions

Any function f(x) = ax + b adds no non-linearity to the network; a stack of such layers collapses to a single linear model, equivalent to one perceptron.

2. Binary function

This is the most obvious way to add non-linearity to the system. The binary step function decides whether a neuron should be activated or not.

It is mainly used in the perceptron network.

The limitations of the binary step function:

  • It cannot provide multi-value outputs — for example, it cannot be used for multi-class classification problems.
  • The gradient of the step function is zero, which causes a hindrance in the backpropagation process.
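A minimal sketch of the step function and the gradient problem it causes (the function name is mine, not from the article):

```python
import numpy as np

def binary_step(x):
    # Outputs 1 when the pre-activation is >= 0, otherwise 0.
    return np.where(x >= 0, 1.0, 0.0)

z = np.array([-2.0, -0.5, 0.0, 1.5])
print(binary_step(z))  # [0. 0. 1. 1.]

# The derivative is zero everywhere (and undefined at 0), so no
# learning signal can flow back through this function.
```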

3. Sigmoid function

This is one of the most common activation functions used in multi-layer neural networks:

  • It is commonly used in models where we have to predict a probability as an output. Since a probability exists only in the range 0 to 1, sigmoid is the right choice because of its range.
  • The function is differentiable and provides a smooth gradient, preventing jumps in output values. This is reflected in the S-shape of the sigmoid curve.

This is the function used in Logistic Regression, which can be replicated by a network of a single neuron.

The downside of this function is that:

  • For values greater than 3 or less than -3, the function has very small gradients. As the gradient approaches zero, the network ceases to learn: this is the vanishing gradient problem.
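The vanishing gradient is easy to see from the derivative, sigmoid(x)·(1 − sigmoid(x)), which peaks at 0.25 and decays quickly away from zero. A small sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s(x) * (1 - s(x)), maximal at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid_grad(0.0))  # 0.25, the largest the gradient ever gets
print(sigmoid_grad(5.0))  # ~0.0066, already nearly vanished
```

In a deep network these small factors multiply across layers, which is why early layers stop receiving useful gradients.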

4. Tanh function

The tanh function is very similar to the sigmoid/logistic activation function:

  • The output of the tanh activation function is zero-centered, so we can easily map the output values as strongly negative, neutral, or strongly positive.
  • It is usually used in the hidden layers of a neural network, as its values lie between -1 and 1; the mean of the hidden-layer activations therefore comes out at or near 0, which helps center the data and makes learning easier for the next layer.

As a downside :

  • it also suffers from vanishing gradients, just like the sigmoid activation function.

In practice, the tanh non-linearity is often preferred to the sigmoid non-linearity.
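A short sketch of both properties, zero-centering and the vanishing gradient (1 − tanh(x)²):

```python
import numpy as np

z = np.linspace(-3, 3, 7)

# tanh squashes into (-1, 1) and is zero-centered, unlike sigmoid:
# a symmetric batch of inputs yields a zero-mean output.
out = np.tanh(z)
assert abs(out.mean()) < 1e-12

# Like sigmoid, its gradient 1 - tanh(x)^2 vanishes for large |x|.
grad = 1.0 - out ** 2
print(grad[0])  # gradient at x = -3 is already below 0.01
```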

5. ReLU function

ReLU stands for Rectified Linear Unit.

Although it looks like a linear function, ReLU is differentiable (everywhere except at 0) and allows backpropagation while remaining computationally cheap.

The advantages of using ReLU as an activation function are as follows:

  • Since only a subset of neurons is activated at any time, the ReLU function is far more computationally efficient than the sigmoid and tanh functions.
  • ReLU accelerates the convergence of gradient descent towards a minimum of the loss function thanks to its linear, non-saturating behavior for positive inputs.

The drawback of ReLU :

  • All negative input values become zero immediately, which reduces the model’s ability to fit the data properly: neurons whose pre-activation stays negative receive zero gradient and can become “dead”, never activating again.
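A minimal sketch of ReLU and its (sub)gradient, which makes the dead-neuron problem visible (the function name is mine, not from the article):

```python
import numpy as np

def relu(x):
    # max(0, x): linear for positive inputs, zero for negative ones.
    return np.maximum(0.0, x)

z = np.array([-3.0, -0.1, 0.0, 2.0])
print(relu(z))  # [0. 0. 0. 2.]

# Subgradient: 1 for positive inputs, 0 for negative ones. A neuron
# whose pre-activation is always negative gets zero gradient forever.
grad = (z > 0).astype(float)
print(grad)  # [0. 0. 0. 1.]
```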

6. SoftMax function

The SoftMax function can be described as a combination of multiple sigmoids.

It calculates relative probabilities: similar to the sigmoid/logistic activation function, SoftMax returns a probability for each class.

It is most commonly used as an activation function for the last layer of the neural network in the case of multi-class classification.
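A minimal sketch of SoftMax over a vector of class scores (the logit values are arbitrary illustrative numbers):

```python
import numpy as np

def softmax(z):
    # Subtracting the max is a standard numerical-stability trick;
    # it does not change the result.
    e = np.exp(z - z.max())
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])  # arbitrary class scores
probs = softmax(logits)
print(probs)  # roughly [0.659 0.242 0.099]

# The outputs are non-negative and sum to 1: a probability
# distribution over the classes.
assert np.isclose(probs.sum(), 1.0)
```

The predicted class is then simply the index of the largest probability.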

Choice of function for Hidden Layers

In modern networks, ReLU is the usual default choice for hidden layers; tanh and sigmoid are mostly found in recurrent architectures.

Choice of function for Output Layer

There are perhaps three activation functions you may want to consider for use in the output layer:

  • Linear
  • Logistic (Sigmoid)
  • Softmax

This is not an exhaustive list of activation functions used for output layers, but they are the most commonly used.
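The rule of thumb above can be summarized as a simple lookup; the task names and the mapping are my own summary of the list, not code from the article:

```python
# Common output-layer activation per task type (a rule of thumb,
# not an exhaustive mapping).
OUTPUT_ACTIVATION = {
    "regression": "linear",                   # unbounded real-valued target
    "binary_classification": "sigmoid",       # single probability in (0, 1)
    "multiclass_classification": "softmax",   # distribution over classes
}

print(OUTPUT_ACTIVATION["multiclass_classification"])  # softmax
```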