I need to write an SGD perceptron for digit recognition according to these guidelines:
- The dataset is the MNIST database with 60,000 training samples.
- Pick a random input vector, calculate net_j (the weighted sum) and find y_j = f(net_j), where f(x) is the activation function. The w_0j are offset (bias) weights, with x_0 = 1.
- Find ε = 0.5 * Σ_j (d_j - y_j)², where d is the desired vector (a zero vector with a 1 at the index of the desired digit, I think), built from y_train. I also believe ε is what is called the loss function.
- If ε < ε_threshold, break the loop.
- Adjust the weights: w_ij += -η * δ_j * x_i, where δ_j = -(d_j - y_j) ∙ f'(net_j), η is the learning rate, and the ∙ operator is the inner product, as far as I understand.
- Repeat steps 2-5 until the loop breaks (see the sketch right after this list for how I read steps 2-4).
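To check my own reading, here is a minimal sketch of how I understand steps 2-4 (one forward pass and the error ε) for a single sample; the names forward, W, b and d are mine, not from the guidelines:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def forward(x, W, b, d):
    # x: flattened 784-long input, W: 784x10 weight matrix, b: 10 bias weights, d: one-hot desired vector
    net = np.dot(x, W) + b            # net_j = sum_i w_ij * x_i + w_0j (x_0 = 1 is handled by b)
    y = sigmoid(net)                  # y_j = f(net_j)
    eps = 0.5 * np.sum((d - y) ** 2)  # ε = 0.5 * Σ_j (d_j - y_j)²
    return net, y, eps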
The problem is that the perceptron does not learn, whatever values of η, ε_threshold and offset (the random initial weight range) I set. I get similar results after every vector, for example:
desired digit from y_train: 3 (index 3 in following arrays)
weighted sum for each j: [-2.39 -2.21 -2.254 -2.49 -2.41 -2.16 -2.17 -2.37 -2.35 -2.32]
error vector (desired-y_pred): [-0.083 -0.098 -0.095 0.924 -0.082 -0.102 -0.101 -0.085 -0.086 -0.088]
This means the weights make all outputs look roughly equally probable, even though the error vector tries to adjust them in the right direction.
I use the sigmoid activation function σ(x), whose derivative is σ(x) * (1 - σ(x)).
Here is my code:
import random
import numpy as np
from keras.datasets import mnist

(X_train, y_train), (X_test, y_test) = mnist.load_data()

# Normalize data to [0, 1]
X_train = X_train / 255.0
X_test = X_test / 255.0

# Constants from data
X_vector_len = 28 * 28
y_vector_len = 10

# Learning parameters
learning_rate = 0.05
offset = 0.03  # Random initial weight offset
e_threshold = 0.01

def activation_function(x):
    # Sigmoid σ(x)
    return 1 / (1 + np.exp(-x))

def activation_derivative(x):
    # σ'(x) = σ(x) * (1 - σ(x))
    return activation_function(x) * (1 - activation_function(x))
def train(X_train, y_train):
    # Weight initialization: uniform in [-offset, offset]
    weights = np.random.rand(X_vector_len, y_vector_len) * offset * 2 - offset
    bias = np.random.rand(y_vector_len) * offset * 2 - offset
    epochs = 0  # Counts single-sample updates
    while True:
        # Getting X and D vectors
        s = random.randrange(len(X_train))
        sample = X_train[s].flatten()  # 28x28 image -> 784-long vector
        desired = np.zeros(y_vector_len, dtype=np.float64)
        desired[y_train[s]] = 1  # One-hot desired vector
        net = np.dot(sample, weights) + bias  # Weighted sum net_j
        y_pred = activation_function(net)
        e_vec = desired - y_pred
        e = np.sum(e_vec ** 2) / 2  # Error ε for the current vector
        if e < e_threshold:
            break
        # Inner product: δ collapses to a single number (see my question below)
        gradient = -e_vec.dot(activation_derivative(net))
        for j in range(y_vector_len):
            for i in range(X_vector_len):
                weights[i, j] -= learning_rate * gradient * float(sample[i])
            bias[j] -= learning_rate * gradient
        epochs += 1
    return weights, bias, epochs
def test(X_test, y_test, weights, bias):
    correct_predictions = 0
    for j, label in enumerate(y_test):
        sample = X_test[j].flatten()  # Flatten the test image to a 784-long vector
        net = np.dot(sample, weights) + bias
        print(f"Test {j+1}: Desired {label}, prediction: {net.argmax()} ({net})")
        if label == net.argmax():
            correct_predictions += 1
    return correct_predictions / len(X_test)
if __name__ == "__main__":
    weights, bias, epochs = train(X_train, y_train)
    print(f"Model trained for {epochs} epochs")
    accuracy = test(X_test, y_test, weights, bias)
    print(f"Accuracy for test selection: {accuracy * 100:.2f}%")
Searching for mistakes in the formulas and reading some implementations and math theory about SGD didn't help me. It only raised more questions, because the implementations differ:
- Why does the neural network not learn, even though the error vector has coefficients with the right signs (positive for the desired digit, negative for the other digits)?
- The gradient δ is a single number, since it is an inner product, so what is δ_j? If I am supposed to use δ for every δ_j, doesn't that mix the weights of the different outputs? The value of error_j = d_j - y_j definitely must affect specific outputs. I tried both δ and the elementwise δ_j = -(d_j - y_j) * f'(net_j) (regular product), with the same results.
- The sigmoid function has the range (0, 1), so the components of the error vector are either 1 - σ(x) or 0 - σ(x), i.e. they lie in (-1, 1). Am I right that the error vector is one of the coefficients in the weight update, and that it must be positive for the desired output and negative for the others, while the other coefficients are always non-negative? But how, then, do other activation functions with different ranges work, e.g. f(x) = x or f(x) = arctan(x)?
- I have used the unit step function as the activation function in another algorithm; is it applicable for SGD?
- Can I use NumPy functions to adjust the weights somehow? I need to multiply all w_ij values by x_i along the rows (see the sketch right after this list).
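For the last two points, here is what I mean by the elementwise δ_j and a vectorized update, written against the variables in my train() above (just a sketch of how I would expect it to look, not something I have verified):

# Elementwise delta: one δ_j per output, shape (10,)
delta = -(desired - y_pred) * activation_derivative(net)
# Vectorized w_ij += -η * δ_j * x_i over all i, j:
# np.outer(sample, delta) has shape (784, 10), the same as weights
weights -= learning_rate * np.outer(sample, delta)
bias -= learning_rate * delta  # bias is treated as a weight with x_0 = 1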