A brief intro to deep learning and ANNs
Artificial neural networks have been a useful machine learning model since as early as the 1940s. Similar to a biological neural network, an ANN has a number of layers which can be combined in a number of ways:
- Feedforward networks. These networks are uni-directional and propagate their activations from the first to the last layer in a feedforward manner.
- Convolutional networks. Inspired by the visual cortex, they apply a filter over the whole input (usually suitable for image data).
- Recurrent networks. These networks resemble feedback systems: they combine sequential propagation with loops. They're often used in NLP applications.
A neural network may contain several layers and hence can be classified as either shallow or deep. Although there's no strict definition of how many layers make a network deep, deep networks usually stack many hidden layers. Deep learning has become a de facto choice in artificial intelligence.
What are the advantages of deep learning?
Now, a natural question arises: why all the buzz about deep learning? And it’s a valid question. I still remember getting almost irked by this term back in 2017/18 due to its overuse, and it took me some time before I truly appreciated the power of deep learning.
Deep learning has a number of advantages:
1. Representation learning
While traditional machine learning models rely on hand-crafted features (feature extraction), deep learning takes on the representation learning responsibility by itself.
2. Performance scalability
Traditional ML algorithms like SVM, decision trees, or even shallow multilayer perceptrons (MLPs) continue to improve their performance with more (training) data. However, their performance reaches a “flat region” (no further increase in accuracy) beyond some data size. Deep models, on the other hand, can scale remarkably well on big data as we can train our models even on terabytes of data.
The classical ML wisdom says that we need to have a simpler model (with an optimum/lower number of parameters) to avoid overfitting. DL models seem to defy this classical wisdom, and here, “the more, the better” seems to work pretty well not only on the training data but even for unseen examples.
3. Non-convex optimization
A deep model with dozens of layers will inevitably have a non-convex loss function. In other words, we can have a number of local minima in the loss function’s landscape and it would be pretty hard to train them. But here, it again defies conventional wisdom as most deep models converge pretty well.
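To make the idea of several local minima concrete, here is a minimal sketch of plain gradient descent on a one-dimensional non-convex function. The function f, the learning rate, and the two starting points are illustrative assumptions and have nothing to do with a real network; the sketch only shows that different starting points can settle in different minima.

```python
def f(x):
    # A toy non-convex function with two separate minima (illustrative only)
    return x**4 - 3 * x**2 + x


def grad_f(x):
    # Analytical derivative of f
    return 4 * x**3 - 6 * x + 1


def gradient_descent(x, lr=0.01, steps=2000):
    # Repeatedly step against the gradient
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x


left = gradient_descent(-2.0)   # settles in the minimum on the negative side
right = gradient_descent(2.0)   # settles in the minimum on the positive side

print(f"start=-2.0 -> x={left:.4f}, f(x)={f(left):.4f}")
print(f"start=+2.0 -> x={right:.4f}, f(x)={f(right):.4f}")
```

Both runs converge, but only the first one ends in the lower of the two minima. Deep models have vastly more dimensions, which is one reason why their behavior differs from this toy picture.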
4. Libraries support
There are lots of libraries in Python for deep learning – TensorFlow, PyTorch, Flax, and Keras, to name but a few.
In this article, we're going to explore Keras. If you want to know more about TensorFlow and PyTorch and how they compare, you might like to read my article, PyTorch vs. TensorFlow: which is best for deep learning?
5. GPU support
Any GPU with CUDA support (NVIDIA GPUs) can run the simple operations that deep learning relies on. Hence, we can make use of GPGPU (general-purpose GPU) computing to save training (and/or inference) time.
Usually, deep models are trained using SGD-based optimizers (plain gradient descent or a variant like Adam) together with the backpropagation algorithm, which computes the gradients layer by layer in reverse order (from the output to the input layer). Since the loss depends on the parameters of the inner (hidden) layers only indirectly, through the layers that follow them, we use the classical chain rule from calculus to compute those derivatives.
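As a small illustration of the chain rule at work, the sketch below backpropagates through a made-up network with one hidden unit, y_hat = w2 * tanh(w1 * x), and a squared-error loss. The network, the numbers, and the function names are assumptions chosen for readability; a finite-difference check is included to confirm that the hand-derived gradients are right.

```python
import math


def forward(w1, w2, x):
    h = math.tanh(w1 * x)      # hidden activation
    y_hat = w2 * h             # network output
    return h, y_hat


def loss(w1, w2, x, y):
    _, y_hat = forward(w1, w2, x)
    return (y_hat - y) ** 2


def gradients(w1, w2, x, y):
    # Backward pass: start at the loss and walk back towards the input
    h, y_hat = forward(w1, w2, x)
    dL_dyhat = 2 * (y_hat - y)
    dL_dw2 = dL_dyhat * h               # output-layer weight
    dL_dh = dL_dyhat * w2               # gradient flowing into the hidden unit
    dh_dw1 = (1 - h**2) * x             # derivative of tanh(w1 * x) w.r.t. w1
    dL_dw1 = dL_dh * dh_dw1             # chain rule for the inner weight
    return dL_dw1, dL_dw2


w1, w2, x, y = 0.5, -1.5, 2.0, 1.0
analytic_w1, analytic_w2 = gradients(w1, w2, x, y)

# Numerical check with central differences
eps = 1e-6
numeric_w1 = (loss(w1 + eps, w2, x, y) - loss(w1 - eps, w2, x, y)) / (2 * eps)
numeric_w2 = (loss(w1, w2 + eps, x, y) - loss(w1, w2 - eps, x, y)) / (2 * eps)

print(analytic_w1, numeric_w1)
print(analytic_w2, numeric_w2)
```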
That was a bit of theory. Now, let’s get down to business and talk about Keras.
What is Keras?
Keras was introduced in 2015 as a front-end deep-learning library by François Chollet and his team at Google. Keras’ philosophy is simply “Deep learning for humans.” It's further described on its website as follows:
Keras is an API designed for human beings, not machines. Keras follows best practices for reducing cognitive load: it offers consistent & simple APIs, it minimizes the number of user actions required for common use cases, and it provides clear & actionable error messages. Keras also gives the highest priority to crafting great documentation and developer guides.
What I like most about this is the mention of cognitive load. Despite working with PyTorch, JAX, and TensorFlow for a long time, it still feels like making a neural network can be daunting. Keras makes it far simpler, allowing us to stay focused on the design and not get bogged down in too many programming details.
But that's enough preamble. You need to see it to believe it:
    cnnModel = keras_core.Sequential(
        [
            inputLayer,
            convLayer1,
            poolLayer1,
            convLayer2,
            poolLayer2,
            flatten,
            dropOut,
            outputLayer,
        ]
    )
That’s it. Assuming we've already defined these layers, we've defined a CNN model with two convolutional layers (each followed by pooling), a dropout, and a fully connected output layer by simply stacking them together as a list of layers. Sounds exciting? Let’s explore further with a quick overview of Keras.
How to set up Keras
We can install Keras for Jupyter Notebook with pip as:
!pip install keras
If you've already installed TensorFlow, Keras will have come bundled with it as a complementary library. We can simply import it with import keras.
Creating the first Keras model
Keras provides end-to-end ML pipeline support, but covering all its aspects would take a lot of time, so I'll keep things concise here. We can divide the Keras pipeline roughly into:
- Data processing
- Model creation
- Hyperparameter tuning
We can have data from pretty diverse sources. It can be a collection of text transcripts for some NLP project, raw images for a computer vision task, or just a CSV file. Each type of data comes with its own challenges. Keras works with three types of data:
- NumPy arrays
- Python generators
- TensorFlow Dataset objects
The last option is an optimized one, as it uses TensorFlow’s optimized Dataset feature. It’s especially useful in managing the computational resources at hand (GPU and CPU). We can convert raw data into a Dataset with utilities such as:
- keras.utils.image_dataset_from_directory: It’s pretty useful for (supervised) computer vision tasks. All we have to do is segregate the images into different folders according to their classes. Keras will automatically convert them into a Dataset with the respective labels.
- keras.utils.text_dataset_from_directory: We can do the same for NLP tasks. Text files can be placed in the respective class folders, and it will build a Dataset from them.
I tried it by creating a couple of class folders (classB being one of them), both nested within the main SampleImagesDataSet folder:
    import keras

    dataset1 = keras.utils.image_dataset_from_directory('./SampleImagesDataSet')
2023-09-17 Found 3 files belonging to 2 classes.
We can also specify the batch size (which will come in handy later on during the training) via the batch_size argument. If we look into the dataset variable, it shows us:
<_BatchDataset element_spec=(TensorSpec(shape=(None, 256, 256, 3), dtype=tf.float32, name=None), TensorSpec(shape=(None,), dtype=tf.int32, name=None))>
Simple, isn’t it? Having processed the data, let’s move on to the next step.
Keras uses two types of models:
- Sequential is pretty straightforward and allows us to make a neural network by simply stacking the layers on top of each other with respective parameters. The output of each layer becomes the input of the succeeding one.
- Functional is much more flexible and provides us the leverage to make many advanced/complex designs, like layers with multiple inputs/outputs. As its name suggests, it also facilitates the functional programming paradigm.
Keras offers some pretty basic functions for building these models, for example:
- Input layer
Input(), as its name suggests, is used to define the input layer. It takes the shape of a single input sample (be it an image or any other type of data) as its argument.
    from keras import layers

    inputLayer = layers.Input(shape=(256, 256, 3))
- Convolution layer
Conv2D() is quite an important function used to define the convolutional layer. Its arguments are:
Number of filters: Rather than relying on a single filter, we can define a number of filters. Each filter has the same size but they're applied (and they learn) independently of each other.
kernel_size: Usually, we define this as an odd number (you're free to define any filter size you like), such as 3 × 3, 5 × 5, etc. Here, MNIST images are already pretty small, so 3 × 3 will work.
activation: The activation function to use. Usually, we use ReLU for the intermediate layers. Please feel free to try others, too.
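To see what kernel_size means in practice, here is a plain-Python sketch of a single filter sliding over a small image with no padding and a stride of 1. The helper conv2d_valid is a hypothetical name and not part of Keras, and the 5 × 5 image and all-ones kernel are made-up inputs; like most deep learning libraries, the sketch does not flip the kernel.

```python
def conv2d_valid(image, kernel):
    """Slide a square kernel over a 2D image (no padding, stride 1)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Weighted sum of the k x k patch whose top-left corner is (i, j)
            total = sum(
                image[i + a][j + b] * kernel[a][b]
                for a in range(k)
                for b in range(k)
            )
            row.append(total)
        output.append(row)
    return output


# Toy 5 x 5 "image" with pixel values 0..24 and a 3 x 3 all-ones kernel
image = [[r * 5 + c for c in range(5)] for r in range(5)]
kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

feature_map = conv2d_valid(image, kernel)
print(len(feature_map), len(feature_map[0]))  # 3 3
```

Each spatial dimension shrinks by kernel_size - 1, so a 5 × 5 input becomes 3 × 3 here, and a 28 × 28 MNIST image becomes 26 × 26 under a 3 × 3 kernel.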
Similarly, there are other useful functions available. By combining them, we get our model:
    cnnModel = keras.Sequential(
        [
            layers.Input(shape=(28, 28, 1)),  # MNIST images
            layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
            layers.MaxPooling2D(pool_size=(2, 2)),
            layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
            layers.MaxPooling2D(pool_size=(2, 2)),
            layers.Flatten(),
            layers.Dropout(0.5),
            layers.Dense(10),  # 10 digit classes (raw scores, no softmax applied)
        ]
    )
A curious reader can always check how the output shapes and parameter counts evolve through the layers by calling cnnModel.summary():
**Output**
    Model: "sequential"
    _________________________________________________________________
     Layer (type)                    Output Shape              Param #
    =================================================================
     conv2d (Conv2D)                 (None, 26, 26, 32)        320
     max_pooling2d (MaxPooling2D)    (None, 13, 13, 32)        0
     conv2d_1 (Conv2D)               (None, 11, 11, 64)        18496
     max_pooling2d_1 (MaxPooling2D)  (None, 5, 5, 64)          0
     flatten (Flatten)               (None, 1600)              0
     dropout (Dropout)               (None, 1600)              0
     dense (Dense)                   (None, 10)                16010
    =================================================================
    Total params: 34826 (136.04 KB)
    Trainable params: 34826 (136.04 KB)
    Non-trainable params: 0 (0.00 Byte)
    _________________________________________________________________
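The numbers in this summary can be reproduced by hand. The sketch below redoes the arithmetic in plain Python, assuming unpadded convolutions with a stride of 1 and non-overlapping 2 × 2 pooling (which is what the shapes above imply); the helper names are hypothetical and not part of Keras.

```python
def conv_out(size, kernel):
    # Unpadded convolution, stride 1
    return size - kernel + 1


def pool_out(size, pool):
    # Non-overlapping pooling; any leftover border is dropped
    return size // pool


def conv_params(kernel, in_channels, filters):
    # One kernel x kernel x in_channels weight block plus one bias per filter
    return (kernel * kernel * in_channels + 1) * filters


def dense_params(inputs, units):
    # One weight per input plus one bias, for every unit
    return (inputs + 1) * units


size = 28
size = conv_out(size, 3)    # 26
size = pool_out(size, 2)    # 13
size = conv_out(size, 3)    # 11
size = pool_out(size, 2)    # 5
flat = size * size * 64     # 1600 values after Flatten

params = {
    "conv2d": conv_params(3, 1, 32),      # 320
    "conv2d_1": conv_params(3, 32, 64),   # 18496
    "dense": dense_params(flat, 10),      # 16010
}
total = sum(params.values())              # 34826
print(flat, params, total)
```

Pooling, Flatten, and Dropout contribute no parameters, which is why they show 0 in the summary.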
Having defined the model’s architecture, we can now optimize/train our model. Training is nothing but finding the values of the parameters leading to the minimization of the loss function (or optimization, in other words). Hence, it's important to select the optimizer and loss function carefully.
An optimizer further depends on some relevant attributes, commonly known as hyperparameters, like learning rate, momentum, etc.
To specify all this information, we use compile(). This function requires:
- Optimizer function
- Respective hyperparameters (if any)
- Loss function
For example, we can compile the aforementioned model with Adam and cross-entropy loss by passing them to cnnModel.compile() as its optimizer and loss arguments.
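As a rough illustration of what the cross-entropy loss measures, the sketch below computes it by hand for a single sample with a one-hot label. The function is a simplified stand-in written for this article rather than Keras's own implementation, and the predicted probabilities are made up.

```python
import math


def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # -sum(true * log(predicted)); eps guards against log(0)
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))


# One-hot label for the digit 2 (10 classes)
label = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]

# A model that guesses uniformly across all 10 classes
uniform = [0.1] * 10

# A model that puts most of its probability on the correct class
confident = [0.01] * 10
confident[2] = 0.91

loss_uniform = categorical_crossentropy(label, uniform)
loss_confident = categorical_crossentropy(label, confident)
print(loss_uniform, loss_confident)
```

A uniform guess over 10 classes costs ln(10), roughly 2.30, so that value is a handy baseline when reading training logs for a 10-class problem.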
Great! But we aren’t done yet. As a final step, we need to specify the dataset with some relevant hyperparameters (like batch size or number of epochs). For that, we will call fit(). It will take:
- Data samples (X)
- Respective labels (Y)
- Relevant hyperparameters
#cnnModel.fit() #It won't work as we didn't specify the actual dataset yet.
The above code won’t work as the dataset is placed locally on my system. So, I would highly encourage you to make a dataset yourself (even a few images for each class will do as a starter). In case you just want to run it and make a dataset later on, we can use a publicly available dataset, like MNIST, here.
MNIST is already available in the Keras datasets. We just have to make sure to:
- Convert the labels into one-hot encoding. Keras provides to_categorical for this.
- Reshape the MNIST images into 28 × 28 × 1 rather than the intrinsic 28 × 28. The reason is that Keras (like most DL libraries) usually expects images as either 4D arrays (for batched input) or 3D arrays (a single image). We will use NumPy’s expand_dims for this purpose.
    import keras
    import numpy as np
    from keras.utils import to_categorical

    (xTrain, yTrain), (xTest, yTest) = keras.datasets.mnist.load_data()
    yTrain = to_categorical(yTrain, 10)  # one-hot encode the 10 digit classes
    xTrain = np.expand_dims(xTrain, -1)  # (60000, 28, 28) -> (60000, 28, 28, 1)
    cnnModel.fit(xTrain, yTrain, batch_size=128, epochs=3)
    Epoch 1/3
    469/469 [==============================] - 10s 22ms/step - loss: 2.9006
    Epoch 2/3
    469/469 [==============================] - 9s 19ms/step - loss: 3.4552
    Epoch 3/3
    469/469 [==============================] - 9s 20ms/step - loss: 3.3820
    <keras.src.callbacks.History at 0x7f81ca9c9bd0>
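For readers who want to see what the two preprocessing steps do to a single sample, here is a plain-Python sketch. The helpers to_one_hot and add_channel_axis are hypothetical names that only mimic the effect of to_categorical and np.expand_dims(x, -1) on one label and one image; they aren't the library implementations.

```python
def to_one_hot(label, num_classes):
    # Vector of zeros with a single 1.0 at the label's index
    vector = [0.0] * num_classes
    vector[label] = 1.0
    return vector


def add_channel_axis(image):
    # (height, width) -> (height, width, 1): wrap every pixel in its own list
    return [[[pixel] for pixel in row] for row in image]


encoded = to_one_hot(3, 10)
print(encoded)  # [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

tiny_image = [[0, 255], [128, 64]]       # a made-up 2 x 2 grayscale image
with_channel = add_channel_axis(tiny_image)
print(with_channel)  # [[[0], [255]], [[128], [64]]]
```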
As an ML engineer, you would have realized how difficult it is to find the correct set of hyperparameters (which is an optimization problem in itself). Keras provides some help through its keras_tuner library.
Before using it, we need to get ourselves familiar with the HyperParameters class and some of its methods:
Choice(): Here, we provide a set of possible hyperparameter values to choose from. We provide the name and candidate values of the hyperparameter, followed by some optional arguments.
For example, we can optimize the learning rate as:
optimizedLearningRate = hp.Choice('learning_rate', values=[0.001, 0.003, 0.0001, 0.0003])
In other scenarios, we may have a large search space, in which case specifying the values explicitly using Choice() won’t be the right idea.
Int() is useful here as it takes the minimum and maximum of the range (and, optionally, a step) and lets the tuner pick values from within it. For example, we can search for the best number of filters for our second layer (in the model above) as:
optimizedFiltersCount = hp.Int('filters', min_value=32, max_value=512, step=32)
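As a rough sketch of the search space this call describes, the snippet below lists the candidate values in plain Python. It assumes that max_value is inclusive and that candidates sit on the step grid starting at min_value, which is my reading of the keras_tuner documentation; int_candidates is a hypothetical helper, not a keras_tuner function.

```python
def int_candidates(min_value, max_value, step):
    # Assumption: both ends are inclusive and values advance by `step`
    return list(range(min_value, max_value + 1, step))


candidates = int_candidates(32, 512, 32)
print(len(candidates))                  # 16
print(candidates[0], candidates[-1])    # 32 512
```

Writing out 16 values with Choice() would still be manageable, but the same range with a step of 1 would hold 481 values, which is where Int() becomes the practical option.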
We can go even crazier by looking for floating-point values within a range as well. For that, we can use Float() in lieu of Int().
To make it all useful, we'll redefine the model above using hyperparameter optimization. First, we need to import keras_tuner.
The import can often throw an error. So, in case it does, please upgrade it using pip:
!pip install keras-tuner --upgrade
Once imported successfully, we can redefine our model. In order to use the hyperparameter optimization/search, we need to redefine our model within a function (defined by us), taking a hyperparameter object as its input.
    def OptimizedModel(hp):
        model = keras.Sequential()
        model.add(layers.Input(shape=(28, 28, 1)))
        model.add(layers.Conv2D(32, kernel_size=(3, 3), activation="relu"))
        model.add(layers.MaxPooling2D(pool_size=(2, 2)))

        # Now, we will try hyperparam tuning by finding the 2nd Conv's number of filters using hp.Int().
        optimizedFiltersCount = hp.Int('filters', min_value=16, max_value=96, step=16)
        model.add(layers.Conv2D(filters=optimizedFiltersCount, kernel_size=(3, 3), activation="relu"))

        # The rest of the model is the same
        model.add(layers.MaxPooling2D(pool_size=(2, 2)))
        model.add(layers.Flatten())
        model.add(layers.Dropout(0.5))
        # Note: Dense(10) outputs raw scores (logits), while categorical_crossentropy
        # expects probabilities by default (from_logits=False); consider activation="softmax".
        model.add(layers.Dense(10))

        # Similarly, we will try hyperparam tuning for the learning rate as well
        optimizedLearningRate = hp.Choice('learning_rate', values=[0.001, 0.003, 0.0001, 0.0003])
        model.compile(
            optimizer=keras.optimizers.legacy.Adam(learning_rate=optimizedLearningRate),
            loss=keras.losses.categorical_crossentropy,
            metrics=['accuracy'],
        )

        # Finally, we can return from the function.
        return model
Having defined OptimizedModel, we can now initialize a hyperparameter tuner/optimizer.
keras_tuner provides us with a number of search algorithms, like:
- Random Search
- Bayesian Optimization
- Grid Search
Usually, ML courses don’t cover hyperparameter tuning, so curious readers are invited to read more about these algorithms. We'll use a random search tuner here. A tuner takes some arguments, like:
hypermodel – the function that builds the model from a hyperparameter object (OptimizedModel here).
objective – whether it’s accuracy, validation accuracy, or some other metric we want to optimize for.
max_trials – the maximum number of trials (hyperparameter combinations) to try during the search.
Let’s initialize and see it in action:
    hyperParamTuner = keras_tuner.RandomSearch(
        hypermodel=OptimizedModel,
        objective="accuracy",
        max_trials=7,
        executions_per_trial=1,
        overwrite=True,
        directory="./",
    )
hyperParamTuner.search(xTrain, yTrain, epochs=3)
    Trial 2 Complete [00h 00m 44s]
    Best accuracy So Far: None
    Total elapsed time: 00h 01m 32s

    Search: Running Trial #3

    Value             |Best Value So Far |Hyperparameter
    96                |64                |filters
    0.0003            |0.003             |learning_rate

    Epoch 1/3
    1875/1875 [==============================] - 18s 9ms/step - loss: 10.1364
    Epoch 2/3
    1875/1875 [==============================] - 19s 10ms/step - loss: 10.6086
    Epoch 3/3
    1875/1875 [==============================] - 18s 10ms/step - loss: 10.6009
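To show the idea behind a random search tuner without the training cost, here is a minimal plain-Python sketch over the same two hyperparameters. It is not how keras_tuner is implemented: toy_score is a hypothetical stand-in for "train the model and return its accuracy", and this simple version may sample the same combination twice.

```python
import random

# Same candidate values as in OptimizedModel above
SEARCH_SPACE = {
    "filters": list(range(16, 96 + 1, 16)),
    "learning_rate": [0.001, 0.003, 0.0001, 0.0003],
}


def toy_score(config):
    # Hypothetical objective: higher is better, peaking at 64 filters and lr=0.001
    return (
        -abs(config["filters"] - 64) / 64
        - abs(config["learning_rate"] - 0.001) * 100
    )


def random_search(space, score_fn, max_trials, seed=0):
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(max_trials):
        # Draw one value per hyperparameter, independently and uniformly
        config = {name: rng.choice(values) for name, values in space.items()}
        score = score_fn(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score


best_config, best_score = random_search(SEARCH_SPACE, toy_score, max_trials=7)
print(best_config, best_score)
```

With 6 filter counts and 4 learning rates, there are 24 combinations in total, so 7 trials cover only part of the space. That trade-off between search cost and coverage is what max_trials controls.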
That was fun. Now, let’s proceed further to see what else Keras has in store for us.
What’s new in Keras 3.0
Keras 3.0 is scheduled to launch this fall. While the exact launch date is still unclear, we can already use its beta version, Keras Core.
We can simply import it with import keras_core. On import, it reports the active backend:
    Using TensorFlow backend
Keras used to be pretty common back in 2017/18, so if you're switching back to Keras (or even if you're totally new to it), there are some cool features on offer, like:
- Seamless support of PyTorch, TensorFlow, and JAX
- Ops support
- The ability to combine Keras and backend code
- Using diverse data pipelines
- Support for functional programming
More backends - beyond TensorFlow
The major reason why Keras fell out of favor with the community since 2019-2020 was the gradual rise of PyTorch and JAX. Keras, on the other hand, had become a TensorFlow-only wrapper.
Keras developers realized the need of the hour and now we have support for all three leading DL frameworks. Personally, it came as a surprise to me that Keras did so a good two to three years after JAX had established its presence among the research community, even though JAX is a Google product itself. Nevertheless, all is well that ends well, and now we have support for not only TensorFlow but also PyTorch and JAX in the Keras core.
    import os

    # Set the backend before importing keras_core
    os.environ["KERAS_BACKEND"] = "<jax or torch>"
TensorFlow works by building computational graphs. An Operation is a node in a TensorFlow graph that takes tensors as input and produces an output. It’s pretty similar to the normal operators (can be as simple as arithmetic operators) but the main difference is both operands and the output are tensors here.
Inspired by this concept, Keras has implemented Ops in Keras core. The majority of the Ops are normal NumPy operations, though it supports some advanced procedures as well. Although it's a custom implementation of the NumPy API, both the function names and the arguments are the same.
We can import ops as:
    from keras_core import ops
As we can confirm, ops functionality is the same as its NumPy counterparts.
    x = ops.linspace(0, 2, 20)
    x
    <tf.Tensor: shape=(20,), dtype=float64, numpy=
    array([0.        , 0.10526316, 0.21052632, 0.31578947, 0.42105263,
           0.52631579, 0.63157895, 0.73684211, 0.84210526, 0.94736842,
           1.05263158, 1.15789474, 1.26315789, 1.36842105, 1.47368421,
           1.57894737, 1.68421053, 1.78947368, 1.89473684, 2.        ])>
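The values above are easy to verify by hand. Here is a small plain-Python sketch of what linspace computes, namely evenly spaced points with both endpoints included; the function is a simplified stand-in written for this check, not the NumPy or Keras implementation.

```python
def linspace(start, stop, num):
    # num evenly spaced points from start to stop, endpoints included
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]


values = linspace(0, 2, 20)
print(round(values[1], 8))   # 0.10526316, i.e. a spacing of 2/19
print(len(values))           # 20
```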
We can use ops in the Keras model by using the Lambda layer.
Combining Keras and native code
Keras core allows us to use Keras’ intrinsic functions with low-level libraries like PyTorch, TensorFlow, or JAX. It allows a lot more power and enables developers to switch across the libraries seamlessly. For example, the code here combines both Keras and native PyTorch code:
    import keras_core
    from keras_core import layers
    from torch import nn


    class HybridCNN(nn.Module):
        def __init__(self):
            super().__init__()
            # A Keras Sequential model held as an attribute of the PyTorch module
            self.model = keras_core.Sequential(
                [
                    layers.Input(shape=(28, 28, 1)),
                    layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
                    layers.MaxPooling2D(pool_size=(2, 2)),
                    layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
                    layers.MaxPooling2D(pool_size=(2, 2)),
                    layers.Flatten(),
                    layers.Dropout(0.5),
                    layers.Dense(10),
                ]
            )

        def forward(self, x):
            # Delegate the forward pass to the Keras model
            return self.model(x)
Have you noticed the beauty of the code above? It combines both PyTorch and Keras code, yes. But it does so without using a torch backend as well. It’s pretty cool and makes sure that our code is usable across the DL community.
Diverse data pipelines
Things become even more interesting for data pipelines - the core component of any DL (or even ML) project. Unlike classical Keras, we can now feed our models from a TensorFlow Dataset, a PyTorch DataLoader, NumPy arrays, Pandas data frames, or Keras core’s own dataset class.
Functional programming support
Functional programming is a pretty simple and cool paradigm that uses deterministic functions without any side effects. A side effect refers to the modification of values beyond the scope of the function or any I/O operation (including a mere print() statement). These functions are also known as pure functions.
Since pure functions are deterministic, they'll always give the same output for the same inputs, regardless of the environment.
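A tiny sketch makes the difference visible. Both functions below are made-up examples: the first one reads and modifies a list defined outside of it, so the same input gives different outputs, while the second depends only on its arguments.

```python
call_log = []  # state that lives outside the functions


def impure_scale(x):
    call_log.append(x)            # side effect: mutates outer state
    return x * len(call_log)      # result depends on that hidden state


def pure_scale(x, factor):
    return x * factor             # depends only on its arguments


first = impure_scale(10)    # 10
second = impure_scale(10)   # 20: same input, different output

third = pure_scale(10, 2)   # 20
fourth = pure_scale(10, 2)  # 20: same input, same output
print(first, second, third, fourth)
```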
Keras, on the other hand, uses stateful functions (which may keep some records beyond the function’s scope). Luckily, now Keras core supports the pure/stateless counterparts as well. They're especially useful for JAX development as it's based on the functional programming model.
Neural network architecture
In layers, we can use the method stateless_call(). It's a (stateless) alternative to __call__(). Since it's free of side effects, it can be integrated seamlessly into a functional programming framework like JAX.
Similarly, optimizers can use stateless_apply() to mimic the apply() function in a stateless manner.
For metrics, we have stateless_result() as a side-effect-free implementation of result().
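The pattern behind these stateless methods can be sketched without Keras at all: instead of storing variables on the object, the caller passes the current variables in and gets the updated ones back. The running-mean metric below is a hypothetical plain-Python illustration of that pattern; its class and function names echo the Keras ones but the signatures are simplified assumptions, not the Keras API.

```python
class RunningMean:
    """Stateful version: the totals live on the object."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update_state(self, value):
        self.total += value   # mutates the object
        self.count += 1

    def result(self):
        return self.total / self.count


def stateless_update_state(variables, value):
    # Takes the current variables and returns new ones; nothing is mutated
    total, count = variables
    return (total + value, count + 1)


def stateless_result(variables):
    total, count = variables
    return total / count


samples = [2.0, 4.0, 6.0]

metric = RunningMean()
for sample in samples:
    metric.update_state(sample)
stateful_mean = metric.result()

initial_variables = (0.0, 0)
variables = initial_variables
for sample in samples:
    variables = stateless_update_state(variables, sample)
stateless_mean = stateless_result(variables)

print(stateful_mean, stateless_mean)  # 4.0 4.0
```

Because the stateless functions never touch anything outside their arguments, they fit naturally into function transformations such as those JAX relies on.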
Keras: deep learning for humans
Keras is a high-level DL library with a number of applications, from object detection to generative modeling. It has been adopted by a number of famous companies like Adobe, Twitter, Tesla, IBM, and Salesforce. With increased support in the upcoming version, Keras is set to lead the DL engineer’s arsenal of tools.
This post serves as both an introduction and a thorough review of Keras’s features. I hope it will be pretty helpful for an aspiring Keras engineer. In the end, it really is deep learning for humans.