Simple Neural Network
Interactive visualization of a feedforward neural network for binary classification. Click on the plot to add data points!
Interactive Demo
How to use: Click on the left plot to add data points (left-click for class 0, blue; right-click for class 1, red), then click "Train Network" to watch the neural network learn the decision boundary!
Plots: Decision Boundary and Training Loss
Gradient Magnitude Over Training
What is gradient magnitude?
The gradient magnitude (L2 norm of all gradients) shows how much the network's weights are being updated. Large gradients mean the network is learning rapidly, while small gradients indicate the network is converging to a solution. Watching gradients helps detect issues like vanishing gradients (too small) or exploding gradients (too large).
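A minimal sketch of how such a number could be computed, assuming the gradients are stored in NumPy arrays named dW1, db1, dW2, db2 (illustrative names, not taken from the demo's source):

```python
import numpy as np

def gradient_magnitude(grads):
    """L2 norm of all gradients, treated as one flattened vector."""
    return float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))

# Hypothetical gradients matching the 2 -> 4 -> 1 architecture described below
dW1, db1 = np.random.randn(2, 4), np.random.randn(4)
dW2, db2 = np.random.randn(4, 1), np.random.randn(1)
print(gradient_magnitude([dW1, db1, dW2, db2]))
```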
Network Architecture
Input Layer (2) → Hidden Layer (4) → Output Layer (1)
Total Parameters: (2×4 + 4) + (4×1 + 1) = 17 (weights and biases across both layers)
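A minimal NumPy sketch of this parameterization (the names W1, b1, W2, b2 and the small random initialization are assumptions, not the demo's actual source):

```python
import numpy as np

rng = np.random.default_rng(0)

# 2 inputs -> 4 hidden neurons -> 1 output
W1 = rng.normal(scale=0.5, size=(2, 4))  # 8 input-to-hidden weights
b1 = np.zeros(4)                         # 4 hidden biases
W2 = rng.normal(scale=0.5, size=(4, 1))  # 4 hidden-to-output weights
b2 = np.zeros(1)                         # 1 output bias

print(sum(p.size for p in (W1, b1, W2, b2)))  # 17
```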
Training Statistics
• Current Loss: -
• Accuracy: -
• Epochs Trained: 0
• Learning Rate: 0.1
Algorithm Overview
Goal
A neural network learns complex patterns by stacking layers of artificial neurons. Each neuron applies a weighted sum followed by a non-linear activation function. Through backpropagation, the network adjusts its weights to minimize prediction errors. This simple architecture (2 inputs → 4 hidden neurons → 1 output) can learn non-linear decision boundaries that linear models cannot.
Input
• Training data: samples (x₁, x₂, y), where (x₁, x₂) are the 2D coordinates of a point
• x - 2D feature vector (input coordinates)
• y - binary class label (0 or 1)
Output
• Trained weights and biases for all layers
• Prediction function: probability of class 1
• Performance metrics: Loss (Binary Cross-Entropy) and Accuracy (see the sketch after this list)
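Accuracy here is presumably the fraction of points whose predicted probability falls on the correct side of 0.5; a minimal sketch of that assumption:

```python
import numpy as np

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples where the thresholded probability matches the 0/1 label."""
    return float(np.mean((y_prob >= threshold).astype(int) == y_true))

print(accuracy(np.array([1, 0, 1, 0]), np.array([0.8, 0.3, 0.4, 0.1])))  # 0.75
```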
Mathematical Formulas
Forward Propagation
$$\mathbf{h} = \sigma(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1)$$
$$\hat{y} = \sigma(\mathbf{W}_2 \mathbf{h} + b_2)$$
where:
• \(\mathbf{x} \in \mathbb{R}^2\) - input vector (2D)
• \(\mathbf{W}_1 \in \mathbb{R}^{2 \times 4}\) - weights from input to hidden layer
• \(\mathbf{b}_1 \in \mathbb{R}^4\) - hidden layer biases
• \(\sigma(z) = \frac{1}{1+e^{-z}}\) - sigmoid activation function
• \(\mathbf{h} \in \mathbb{R}^4\) - hidden layer activations
• \(\mathbf{W}_2 \in \mathbb{R}^{4 \times 1}\) - weights from hidden to output layer
• \(b_2 \in \mathbb{R}\) - output bias (scalar)
• \(\hat{y} \in (0, 1)\) - predicted probability
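A minimal forward-pass sketch under these definitions (the row-vector convention x @ W1 is an assumption; the demo may store the weights transposed):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """x: shape (2,). Returns hidden activations h and predicted probability y_hat."""
    h = sigmoid(x @ W1 + b1)      # shape (4,)
    y_hat = sigmoid(h @ W2 + b2)  # shape (1,), probability of class 1
    return h, y_hat
```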
Activation Function (Sigmoid)
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Properties:
• Maps any input to range (0, 1)
• Smooth and differentiable everywhere
• Derivative: \(\sigma'(z) = \sigma(z) \cdot (1 - \sigma(z))\)
• Used for both hidden and output layers in this demo
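A quick numeric check of these properties:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

print(sigmoid(0.0), sigmoid_prime(0.0))    # 0.5, 0.25 (derivative peaks at z = 0)
print(sigmoid(10.0), sigmoid_prime(10.0))  # ~1.0, ~4.5e-5 (saturation: gradient nearly vanishes)
```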
Loss Function (Binary Cross-Entropy)
$$\mathcal{L} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i) \right]$$
where:
• \(n\) - number of training samples
• \(y_i \in \{0, 1\}\) - true label for sample \(i\)
• \(\hat{y}_i \in (0, 1)\) - predicted probability for sample \(i\)
• Penalizes confident wrong predictions heavily
• Lower loss = better predictions
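A sketch of this loss in NumPy (the clipping by eps is an assumption added to avoid log(0); the demo may guard against that differently):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)))

y_true = np.array([1, 0, 1])
y_pred = np.array([0.9, 0.2, 0.6])
print(binary_cross_entropy(y_true, y_pred))  # ~0.28
```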
Backpropagation & Weight Update
$$\mathbf{W} \leftarrow \mathbf{W} - \alpha \frac{\partial \mathcal{L}}{\partial \mathbf{W}}$$
$$\mathbf{b} \leftarrow \mathbf{b} - \alpha \frac{\partial \mathcal{L}}{\partial \mathbf{b}}$$
where:
• \(\alpha\) - learning rate (step size, typically 0.01-0.5)
• \(\frac{\partial \mathcal{L}}{\partial \mathbf{W}}\) - gradient of loss with respect to weights
• \(\frac{\partial \mathcal{L}}{\partial \mathbf{b}}\) - gradient of loss with respect to biases
• Gradients computed via chain rule (backpropagation)
• Process repeats for multiple epochs until convergence
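A minimal full-batch training step for this 2 → 4 → 1 network, combining the forward pass, backpropagation, and the update rule above (a sketch under the same row-vector layout as before, not the demo's actual source):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(X, y, W1, b1, W2, b2, alpha=0.1):
    """One full-batch gradient-descent step. X: (n, 2), y: (n, 1) with 0/1 labels."""
    n = X.shape[0]

    # Forward pass
    H = sigmoid(X @ W1 + b1)      # (n, 4) hidden activations
    Y_hat = sigmoid(H @ W2 + b2)  # (n, 1) predicted probabilities

    # Backward pass (chain rule); sigmoid output + BCE gives the simple error term below
    dZ2 = (Y_hat - y) / n             # gradient at the output pre-activation
    dW2 = H.T @ dZ2                   # (4, 1)
    db2 = dZ2.sum(axis=0)             # (1,)
    dZ1 = (dZ2 @ W2.T) * H * (1 - H)  # (n, 4)
    dW1 = X.T @ dZ1                   # (2, 4)
    db1 = dZ1.sum(axis=0)             # (4,)

    # Gradient-descent updates
    W1 -= alpha * dW1; b1 -= alpha * db1
    W2 -= alpha * dW2; b2 -= alpha * db2
    return W1, b1, W2, b2
```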
Key Concepts
Why Neural Networks? Unlike linear models, neural networks can learn non-linear patterns. The hidden layer creates new feature representations that make complex patterns separable.
Activation Functions: Non-linear activations (like sigmoid) are crucial. Without them, stacking layers would be equivalent to a single linear transformation.
Backpropagation: The algorithm computes gradients efficiently by applying the chain rule backward through the network, updating weights to reduce loss.
Learning Rate: Controls how big the weight updates are. Too high causes instability, too low makes training very slow.
Try the demos: Circular data shows how NNs handle radial patterns. XOR data demonstrates a classic problem that requires non-linearity!
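To see the XOR point outside the demo, here is a toy run reusing sigmoid and train_step from the sketches above (the four corner points are the standard XOR setup; the epoch count, learning rate, and initialization scale are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR labels: not linearly separable

W1 = rng.normal(scale=1.0, size=(2, 4)); b1 = np.zeros(4)
W2 = rng.normal(scale=1.0, size=(4, 1)); b2 = np.zeros(1)

for _ in range(20000):
    W1, b1, W2, b2 = train_step(X, y, W1, b1, W2, b2, alpha=0.5)

H = sigmoid(X @ W1 + b1)
print(sigmoid(H @ W2 + b2).round(2))  # should approach [0, 1, 1, 0] once training converges
```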