Simple Neural Network
Interactive visualization of a feedforward neural network for binary classification. Click on the plot to add data points!
Interactive Demo
How to use: Click on the left plot to add data points (left-click for class 0, blue; right-click for class 1, red), then click "Train Network" to watch the neural network learn the decision boundary!
Plots: Decision Boundary and Training Loss
Gradient Magnitude Over Training
What is gradient magnitude?
The gradient magnitude (L2 norm of all gradients) shows how much the network's weights are being updated. Large gradients mean the network is learning rapidly, while small gradients indicate the network is converging to a solution. Watching gradients helps detect issues like vanishing gradients (too small) or exploding gradients (too large).
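A minimal sketch of how such a number could be computed, assuming the gradients are stored in NumPy arrays named dW1, db1, dW2, db2 (illustrative names, not taken from the demo's source):

```python
import numpy as np

def gradient_magnitude(grads):
    """L2 norm of all gradients, treated as one flattened vector."""
    return float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))

# Hypothetical gradients matching the 2 -> 4 -> 1 architecture described below
dW1, db1 = np.random.randn(2, 4), np.random.randn(4)
dW2, db2 = np.random.randn(4, 1), np.random.randn(1)
print(gradient_magnitude([dW1, db1, dW2, db2]))
```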
Network Architecture
Input Layer (2) → Hidden Layer (4) → Output Layer (1)
Total Parameters: (2×4 + 4) + (4×1 + 1) = 17 (weights and biases across both layers)
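A minimal NumPy sketch of this parameterization (the names W1, b1, W2, b2 and the small random initialization are assumptions, not the demo's actual source):

```python
import numpy as np

rng = np.random.default_rng(0)

# 2 inputs -> 4 hidden neurons -> 1 output
W1 = rng.normal(scale=0.5, size=(2, 4))  # 8 input-to-hidden weights
b1 = np.zeros(4)                         # 4 hidden biases
W2 = rng.normal(scale=0.5, size=(4, 1))  # 4 hidden-to-output weights
b2 = np.zeros(1)                         # 1 output bias

print(sum(p.size for p in (W1, b1, W2, b2)))  # 17
```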
Training Statistics
• Current Loss: -
• Accuracy: -
• Epochs Trained: 0
• Learning Rate: 0.1
Algorithm Overview
Goal
A neural network learns complex patterns by stacking layers of artificial neurons. Each neuron applies a weighted sum followed by a non-linear activation function. Through backpropagation, the network adjusts its weights to minimize prediction errors. This simple architecture (2 inputs → 4 hidden neurons → 1 output) can learn non-linear decision boundaries that linear models cannot.
Input
• Training data: samples (x₁, x₂, y), where (x₁, x₂) are the 2D coordinates of a point
• x - 2D feature vector (input coordinates)
• y - binary class label (0 or 1)
Output
• Trained weights and biases for all layers
• Prediction function: probability of class 1
• Performance metrics: Loss (Binary Cross-Entropy) and Accuracy (see the sketch after this list)
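Accuracy here is presumably the fraction of points whose predicted probability falls on the correct side of 0.5; a minimal sketch of that assumption:

```python
import numpy as np

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples where the thresholded probability matches the 0/1 label."""
    return float(np.mean((y_prob >= threshold).astype(int) == y_true))

print(accuracy(np.array([1, 0, 1, 0]), np.array([0.8, 0.3, 0.4, 0.1])))  # 0.75
```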
Mathematical Formulas
Forward Propagation
$$\mathbf{h} = \sigma(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1)$$
$$\hat{y} = \sigma(\mathbf{W}_2 \mathbf{h} + b_2)$$
where:
• \(\mathbf{x} \in \mathbb{R}^2\) - input vector (2D)
• \(\mathbf{W}_1 \in \mathbb{R}^{2 \times 4}\) - weights from input to hidden layer
• \(\mathbf{b}_1 \in \mathbb{R}^4\) - hidden layer biases
• \(\sigma(z) = \frac{1}{1+e^{-z}}\) - sigmoid activation function
• \(\mathbf{h} \in \mathbb{R}^4\) - hidden layer activations
• \(\mathbf{W}_2 \in \mathbb{R}^{4 \times 1}\) - weights from hidden to output layer
• \(b_2 \in \mathbb{R}\) - output bias (scalar)
• \(\hat{y} \in (0, 1)\) - predicted probability
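A minimal forward-pass sketch under these definitions (the row-vector convention x @ W1 is an assumption; the demo may store the weights transposed):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """x: shape (2,). Returns hidden activations h and predicted probability y_hat."""
    h = sigmoid(x @ W1 + b1)      # shape (4,)
    y_hat = sigmoid(h @ W2 + b2)  # shape (1,), probability of class 1
    return h, y_hat
```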
Activation Function (Sigmoid)
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Properties:
• Maps any input to range (0, 1)
• Smooth and differentiable everywhere
• Derivative: \(\sigma'(z) = \sigma(z) \cdot (1 - \sigma(z))\)
• Used for both hidden and output layers in this demo
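A quick numeric check of these properties:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid(0.0), sigmoid_prime(0.0))    # 0.5, 0.25 (derivative peaks at z = 0)
print(sigmoid(10.0), sigmoid_prime(10.0))  # ~1.0, ~4.5e-5 (saturation: gradient nearly vanishes)
```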
Loss Function (Binary Cross-Entropy)
$$\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right]$$
where:
• \(n\) - number of training samples
• \(y_i \in \{0, 1\}\) - true label for sample \(i\)
• \(\hat{y}_i \in (0, 1)\) - predicted probability for sample \(i\)
• Penalizes confident wrong predictions heavily
• Lower loss = better predictions
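A sketch of this loss in NumPy (the clipping by eps is an assumption added to avoid log(0); the demo may guard against that differently):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)))

y_true = np.array([1, 0, 1])
y_pred = np.array([0.9, 0.2, 0.6])
print(binary_cross_entropy(y_true, y_pred))  # ~0.28
```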
Backpropagation & Weight Update
$$\mathbf{W} \leftarrow \mathbf{W} - \alpha \frac{\partial \mathcal{L}}{\partial \mathbf{W}}$$
$$\mathbf{b} \leftarrow \mathbf{b} - \alpha \frac{\partial \mathcal{L}}{\partial \mathbf{b}}$$
where:
• \(\alpha\) - learning rate (step size, typically 0.01-0.5)
• \(\frac{\partial \mathcal{L}}{\partial \mathbf{W}}\) - gradient of loss with respect to weights
• \(\frac{\partial \mathcal{L}}{\partial \mathbf{b}}\) - gradient of loss with respect to biases
• Gradients computed via chain rule (backpropagation)
• Process repeats for multiple epochs until convergence
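A minimal full-batch training step for this 2 → 4 → 1 network, combining the forward pass, backpropagation, and the update rule above (a sketch under the same row-vector layout as before, not the demo's actual source):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(X, y, W1, b1, W2, b2, alpha=0.1):
    """One full-batch gradient-descent step. X: (n, 2), y: (n, 1) with 0/1 labels."""
    n = X.shape[0]

    # Forward pass
    H = sigmoid(X @ W1 + b1)      # (n, 4) hidden activations
    Y_hat = sigmoid(H @ W2 + b2)  # (n, 1) predicted probabilities

    # Backward pass (chain rule); sigmoid output + BCE gives the simple error term below
    dZ2 = (Y_hat - y) / n             # gradient at the output pre-activation
    dW2 = H.T @ dZ2                   # (4, 1)
    db2 = dZ2.sum(axis=0)             # (1,)
    dZ1 = (dZ2 @ W2.T) * H * (1 - H)  # (n, 4)
    dW1 = X.T @ dZ1                   # (2, 4)
    db1 = dZ1.sum(axis=0)             # (4,)

    # Gradient-descent updates
    W1 -= alpha * dW1; b1 -= alpha * db1
    W2 -= alpha * dW2; b2 -= alpha * db2
    return W1, b1, W2, b2
```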
Key Concepts
Why Neural Networks? Unlike linear models, neural networks can learn non-linear patterns. The hidden layer creates new feature representations that make complex patterns separable.
Activation Functions: Non-linear activations (like sigmoid) are crucial. Without them, stacking layers would be equivalent to a single linear transformation.
Backpropagation: The algorithm computes gradients efficiently by applying the chain rule backward through the network, updating weights to reduce loss.
Learning Rate: Controls how big the weight updates are. Too high causes instability, too low makes training very slow.
Try the demos: Circular data shows how NNs handle radial patterns. XOR data demonstrates a classic problem that requires non-linearity!
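To see the XOR point outside the demo, here is a toy run reusing sigmoid and train_step from the sketches above (the four corner points are the standard XOR setup; the epoch count, learning rate, and initialization scale are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR labels: not linearly separable

W1 = rng.normal(scale=1.0, size=(2, 4)); b1 = np.zeros(4)
W2 = rng.normal(scale=1.0, size=(4, 1)); b2 = np.zeros(1)

for _ in range(20000):
    W1, b1, W2, b2 = train_step(X, y, W1, b1, W2, b2, alpha=0.5)

H = sigmoid(X @ W1 + b1)
print(sigmoid(H @ W2 + b2).round(2))  # should approach [0, 1, 1, 0] once training converges
```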