The Modularity of Deep Learning

When I first discovered the field of machine learning, I found it largely approachable and easy to delve into. Of course, there were complex calculus equations and mathematical derivations behind the scenes, but if you wanted to, you could skip all of that and focus on the high-level picture. In fact, Andrew Ng's world-renowned machine learning course encouraged skipping the derivations in favor of a more intuitive understanding. Those who wanted the math could stick around for it, but it was not an integral part of the course. This freedom in technical depth lets any beginner build an intuitive understanding of the concepts without feeling pressured to learn every fine detail.

Lego Analogy

People (especially the media) like to describe building neural networks (NNs) as snapping together Lego bricks to create something original out of small pieces. For the most part, this analogy describes the way we create neural networks quite well. A deep learning framework like TensorFlow equips you with various layers, but it's up to you how to structure your network for a specific task. This facet of the machine learning process is a big part of why the field is so approachable to anyone with some programming experience. If you were really ambitious, you could even create a "deep learning Lego set" that a child might learn to manipulate (someone please make this, thank you).

But what this analogy misses is that those Lego blocks are constantly communicating with each other in the form of losses and gradients to achieve a common goal. The entire learning process relies on this communication, so that the first layer in the network receives information about how to alter its weights to better equip the last layer in making its predictions.
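To see that communication directly, here's a minimal sketch of my own using TensorFlow 2's Keras API (the toy model, data, and shapes are made up purely for illustration, not from the original post): the loss is computed after the last layer, yet we can ask for its gradient with respect to the first layer's weights.

import tensorflow as tf

# Toy two-layer network and fake data, purely for illustration
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation='relu', input_shape=(8,)),
    tf.keras.layers.Dense(1)
])
x = tf.random.normal((4, 8))     # dummy batch of 4 examples
y = tf.random.normal((4, 1))     # dummy targets

with tf.GradientTape() as tape:
    preds = model(x)
    loss = tf.reduce_mean(tf.keras.losses.mse(y, preds))

# One gradient per trainable tensor -- including the very first layer's
# weights, even though the loss is computed at the very end of the network.
grads = tape.gradient(loss, model.trainable_variables)
print([g.shape for g in grads])

Every layer's weights receive a gradient from a loss they never "see" directly, which is exactly the inter-block communication the Lego analogy leaves out.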

Playing with Keras

The massively popular framework Keras shows the power of translating this Lego-block analogy into real code. Ninety percent of people who do deep learning will benefit from the framework's easy-to-implement Sequential model. The code below (together with the usual imports and data preparation) will have you performing better on the MNIST dataset than most systems from the late 20th century could manage.

# Taken from the Keras team's GitHub examples
import keras
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D

# In the full example, batch_size, epochs, num_classes, input_shape,
# and the (x_train, y_train), (x_test, y_test) arrays are defined
# earlier, when the MNIST data is loaded and reshaped.

# Stack the layers one "Lego brick" at a time
model = Sequential()
model.add(Conv2D(32, kernel_size=(3, 3),
                 activation='relu',
                 input_shape=input_shape))
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(num_classes, activation='softmax'))

# Wire the layers to a loss and an optimizer...
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])

# ...and train
model.fit(x_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=1,
          validation_data=(x_test, y_test))

Every time you invoke model.add(), you're setting up thousands of weights and computations with a single line of code. The same NN written in pure NumPy (a powerful math library for Python) would easily take hundreds more lines to create. The fact that deep learning has reached this high level of abstraction speaks to its usability and adaptability.
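To make that concrete, here's a rough hand-written NumPy sketch (my own illustration, not the Keras source) of just the weight setup and forward pass hidden inside the single Dense(128, activation='relu') line above; the 9216 input size assumes MNIST's 28x28 images flowing through the convolutional and pooling layers of that example.

import numpy as np

rng = np.random.default_rng(0)
in_features, out_features = 9216, 128   # 9216 = 12 * 12 * 64 from the conv/pool stack
W = rng.normal(0, np.sqrt(2.0 / in_features), (in_features, out_features))  # ~1.2M weights
b = np.zeros(out_features)

def dense_relu(x):
    # forward pass only -- the backward pass would be many more lines
    return np.maximum(0, x @ W + b)

And that's one layer, forward pass only, with no gradients, no optimizer state, and no training loop.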

Modularity of Learning and Backpropagation

To elaborate on that inter-layer communication idea, consider the various pieces that play into training a neural network. Arguably the most fundamental idea in machine learning is minimizing some loss function so that you better model, or "fit," your data. Even then, there are tons of different loss functions you might use, most of which are compatible with most network architectures. For example, you could use either the classic mean-squared-error loss or the cross-entropy loss for the same NN. Instead of having to alter your whole program for that one change, all you have to do is change what your loss function returns.
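For instance, reusing the model from the snippet above, switching from cross-entropy to mean-squared error is a one-argument change (a sketch, assuming the same variables are in scope):

# Same architecture, different objective -- only the loss argument changes
model.compile(loss=keras.losses.mean_squared_error,   # was categorical_crossentropy
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])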

The same principle applies to optimizers, which actually update the weights of the network based on those gradients. Most people first learn the basic stochastic gradient descent (SGD) update. Quite simply, it takes a single, "downhill," fixed-length step in the direction that looks like it will improve performance the next time the network sees that same training example. As you keep learning, you encounter optimizers like SGD with momentum, which evolves into RMSprop, which in turn inspires many others. But they all still try to take the best step they can so the algorithm learns, some more efficiently and usefully than others.
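As a rough sketch of what those update rules look like (my own illustration, with made-up hyperparameter values), here are vanilla SGD and SGD with momentum written out in NumPy:

import numpy as np

lr, momentum = 0.01, 0.9                      # hypothetical hyperparameters

def sgd_step(w, grad):
    # vanilla SGD: one fixed-length step "downhill" along the gradient
    return w - lr * grad

def sgd_momentum_step(w, grad, velocity):
    # momentum: keep a running velocity so consistent gradient directions speed up
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

w, velocity = np.zeros(3), np.zeros(3)        # dummy weights
grad = np.array([0.5, -1.0, 0.2])             # pretend gradient from backprop
w, velocity = sgd_momentum_step(w, grad, velocity)

In Keras, swapping between these strategies is again just a matter of passing a different optimizer to model.compile().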

The way these optimizers receive those gradients boils down to a fundamental rule of calculus: the chain rule. The idea of backpropagation, popularized by Geoffrey Hinton, piggy-backs off that simple rule to calculate the gradients passed into optimizers. When I first learned about the true inner workings of backprop through Stanford's CS231n course online, I was blown away by how simple the concept really is. For those of you interested, I encourage you to watch that lecture too.

[Image] My man Andrej Karpathy delivering his masterful lecture on backprop, circa 2016
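For a taste of how simple it is, here's the kind of tiny worked example that course walks through (recreated by me, not copied from the lecture): run f = (x + y) * z forward, then multiply local derivatives backward with the chain rule.

# Forward pass through a tiny expression: f = (x + y) * z
x, y, z = -2.0, 5.0, -4.0
q = x + y            # q = 3
f = q * z            # f = -12

# Backward pass: multiply local derivatives together (the chain rule)
df_dq = z            # d(q*z)/dq = z = -4
df_dz = q            # d(q*z)/dz = q = 3
df_dx = df_dq * 1.0  # dq/dx = 1, so df/dx = -4
df_dy = df_dq * 1.0  # dq/dy = 1, so df/dy = -4

An optimizer like the ones sketched above would then nudge x, y, and z in the direction opposite these gradients. Chain thousands of these little local computations together and you have backprop through an entire network.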

Modularity Across Use-Cases

Before the era of deep learning, it was uncommon to see techniques from natural language processing applied to an image classification task. The fields simply didn't line up in many cases (pixels vs. words). When powerful neural networks were introduced to these fields, though, that situation completely flipped. Now, models like attention-based transformers, originally built for text processing, are learning how to generate images. If you really think about it, this product of deep learning isn't surprising. Most problems that you can frame as minimizing some loss function can be approached from a deep learning perspective. We're now seeing fairly old ideas like reinforcement learning suddenly become viable options for solving complex environments. Once again, the losses, optimizers, training methods, and neural network architectures can remain largely similar across these fields.

What This Means

The number of research papers published in the field is rapidly increasing, in tandem with the number of people becoming interested in deep learning. The modularity of deep learning means that an innovation in one aspect of the learning process can impact methods used in many others. This only accelerates the rate of growth in AI, making previously impractical ideas possible with newly discovered methods of training. But many of us aren't at the forefront of AI research; we just want to train an existing model on our own datasets.

Thankfully, this property of deep learning helps us lowly folk too. Most purposes only require a high-level Keras implementation anyway. But if you want more customization, you're welcome to swap out any of those pieces. I really wrote this post to show you how cool deep learning is, so if I've done that, then that's awesome. ¯\_(ツ)_/¯

If you've enjoyed this, take a look at my YouTube channel! 😀
