Deep learning with Keras

How to create your first Keras model for deep learning, and what's new in Keras 3.0


We’re Apify, and our mission is to make the web more programmable. This article about deep learning with Keras was inspired by our work on getting better data for deep learning models. Check us out.

A brief intro to deep learning and ANNs

Artificial neural networks have been a useful machine learning model since as early as the 1940s. Similar to a biological neural network, an ANN has a number of layers which can be combined in a number of ways:

  • Feedforward networks. These networks are uni-directional and propagate their activations from the first to the last layer in a feedforward manner.
  • Convolutional networks. Inspired by the visual cortex, they apply a filter over the whole input (usually suitable for image data).
  • Recurrent networks. These networks resemble feedback systems: they combine sequential propagation with loops, so earlier outputs can influence later steps. They're often used in NLP applications.

A neural network may contain several layers, and depending on how many it has, it's classed as either shallow or deep. There's no strict definition of how many layers make a network deep, but deep networks typically stack many hidden layers. Deep learning has become the de facto choice for many artificial intelligence tasks.

What are the advantages of deep learning?

Now, a natural question arises: why all the buzz about deep learning? And it’s a valid question. I still remember getting almost irked by this term back in 2017/18 due to its overuse, and it took me some time before I truly appreciated the power of deep learning.

Deep learning has a number of advantages:

1. Representation learning

While traditional machine learning models rely on hand-crafted features (feature extraction), deep learning takes on the representation learning responsibility by itself.

2. Performance scalability

Traditional ML algorithms like SVM, decision trees, or even shallow multilayer perceptrons (MLPs) continue to improve their performance with more (training) data. However, their performance reaches a “flat region” (no further increase in accuracy) beyond some data size. Deep models, on the other hand, can scale remarkably well on big data as we can train our models even on terabytes of data.

3. Generalization

The classical ML wisdom says that we need to have a simpler model (with an optimum/lower number of parameters) to avoid overfitting. DL models seem to defy this classical wisdom, and here, “the more, the better” seems to work pretty well not only on the training data but even for unseen examples.

4. Non-convex optimization

A deep model with dozens of layers will inevitably have a non-convex loss function. In other words, the loss function’s landscape can contain a number of local minima, which should make such models pretty hard to train. But here, deep learning again defies conventional wisdom, as most deep models converge pretty well.

5. Libraries support

There are lots of libraries in Python for deep learning – TensorFlow, PyTorch, Flax, and Keras, to name but a few.

In this article, we're going to explore Keras. If you want to know more about TensorFlow and PyTorch and how they compare, you might like to read my article, PyTorch vs. TensorFlow: which is best for deep learning?

6. GPU support

Deep learning’s core operations are simple and highly parallel, so any GPU with CUDA support (NVIDIA GPUs) can run them. Hence, we can make use of GPGPU (general-purpose GPU) computing to cut training (and/or inference) time.

Usually, deep models are trained using SGD-based optimizers together with the backpropagation algorithm. Backpropagation computes the gradients of the loss layer by layer in reverse order (from the output to the input layer), and gradient descent (or a variant like Adam) uses them to update the weights. Since the loss doesn't depend directly on the parameters of the inner layers (the hidden layers before the output), we use the chain rule from classical calculus to obtain their derivatives.
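To make the chain rule a bit more concrete, here's a minimal sketch in plain Python. It isn't how Keras implements backpropagation; it's just a toy network with one hidden unit, and the names (w1, w2, loss_and_grads, numeric_grad_w1) are made up for this illustration. The gradient for the hidden weight is built step by step from the output backwards and then checked against a finite-difference estimate.

```python
import math

def forward(w1, w2, x):
    """Tiny network: x -> hidden h = tanh(w1 * x) -> output y = w2 * h."""
    h = math.tanh(w1 * x)
    y = w2 * h
    return h, y

def loss_and_grads(w1, w2, x, target):
    """Squared-error loss and its gradients, obtained with the chain rule."""
    h, y = forward(w1, w2, x)
    loss = (y - target) ** 2
    dloss_dy = 2.0 * (y - target)      # output: direct derivative of the loss
    dloss_dw2 = dloss_dy * h           # since dy/dw2 = h
    dloss_dh = dloss_dy * w2           # propagate back through y = w2 * h
    dh_dw1 = (1.0 - h ** 2) * x        # derivative of tanh(w1 * x) w.r.t. w1
    dloss_dw1 = dloss_dh * dh_dw1      # the chain rule reaches the hidden layer
    return loss, dloss_dw1, dloss_dw2

def numeric_grad_w1(w1, w2, x, target, eps=1e-6):
    """Central finite difference, used only to sanity-check the chain rule."""
    loss_plus = loss_and_grads(w1 + eps, w2, x, target)[0]
    loss_minus = loss_and_grads(w1 - eps, w2, x, target)[0]
    return (loss_plus - loss_minus) / (2 * eps)

loss, grad_w1, grad_w2 = loss_and_grads(w1=0.5, w2=-1.5, x=2.0, target=1.0)
print(grad_w1, numeric_grad_w1(0.5, -1.5, 2.0, 1.0))
```

The two printed values should agree to several decimal places, which is a handy way to check any hand-derived gradient.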

That was a bit of theory. Now, let’s get down to business and talk about Keras.

Deep learning with Keras: artificial neural networks

What is Keras?

Keras was introduced in 2015 as a front-end deep-learning library by François Chollet and his team at Google. Keras’ philosophy is simply “Deep learning for humans”. It's further described on its website as follows:

Keras is an API designed for human beings, not machines. Keras follows best practices for reducing cognitive load: it offers consistent & simple APIs, it minimizes the number of user actions required for common use cases, and it provides clear & actionable error messages. Keras also gives the highest priority to crafting great documentation and developer guides.

What I like most about this is the mention of cognitive load. Despite working with PyTorch, JAX, and TensorFlow for a long time, it still feels like making a neural network can be daunting. Keras makes it far simpler, allowing us to stay focused on the design and not get bogged down in too many programming details.

But that's enough preamble. You need to see it to believe it:

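cnnModel = keras_core.Sequential([denseLayer, dropoutLayer])  # denseLayer and dropoutLayer are placeholder names for layers defined elsewhere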

That’s it. Assuming we've already defined these layers, we've defined a model of two layers – a fully connected layer and a dropout – by stacking them together simply as a list. Sounds exciting? Let’s explore further with a quick overview of Keras.

Note: Before we begin, it's worth noting that we'll use Jupyter notebooks throughout the blog as the coding reference.

How to set up Keras

We can install Keras for Jupyter Notebook with pip as:

!pip install keras

If you've already installed TensorFlow, Keras will have been installed along with it. We can simply import it:

import keras

Creating the first Keras model

Keras provides end-to-end ML pipeline support, but covering all its aspects requires a lot of time, so I'll keep them concise here. We can divide the Keras pipeline roughly into:

  • Data processing
  • Model creation
  • Optimization/training
  • Hyperparameter tuning

Data processing

We can have data from pretty diverse sources. It can be a collection of text transcripts for some NLP project, raw images for a computer vision task, or just a CSV file. Each type of data comes with its own challenges. Keras works with three types of data:

  • NumPy arrays
  • Python generators
  • TensorFlow Dataset objects (tf.data.Dataset)

The last option is the most efficient one, as it uses TensorFlow’s optimized Dataset pipeline. It’s especially useful for managing the computational resources at hand (GPU and CPU). We can convert raw data into a Dataset with:

  • keras.utils.image_dataset_from_directory: It’s pretty useful for (supervised) computer vision tasks. All we have to do is sort the images into different folders according to their classes. Keras will automatically convert them into a Dataset with the respective labels.
  • keras.utils.text_dataset_from_directory: We can do the same for NLP tasks. Text files can be placed in per-class folders in the same way, and Keras will build a Dataset from them.

I tried it by creating a couple of folders, classA and classB, both nested within the main SampleImagesDataSet folder.

import keras

dataset1 = keras.utils.image_dataset_from_directory('./SampleImagesDataSet')

Found 3 files belonging to 2 classes.
Note: In order to reproduce it, you can make any two (or n folders) on your local system and name them as per your desired classes. The purpose here is just to provide a template; you can place any images inside these folders to reproduce them.

We can also set the batch size (which will come in handy later during training) through the batch_size argument. If we inspect the dataset1 variable, it shows us:

<_BatchDataset element_spec=(TensorSpec(shape=(None, 256, 256, 3), dtype=tf.float32, name=None), TensorSpec(shape=(None,), dtype=tf.int32, name=None))>

Simple, isn’t it? Having processed the data, let’s move on to the next step.

Model creation

Keras offers two main ways to build a model:

  1. Sequential
  2. Functional

Sequential is pretty straightforward and allows us to make a neural network by simply stacking the layers on top of each other with respective parameters. The output of each layer becomes the input of the succeeding one.

Note: You saw a sequential model a bit earlier in the introduction, where we made a CNN model by stacking different types of layers on top of each other.

Functional is much more flexible and lets us build advanced/complex designs, like models with multiple inputs/outputs. As its name suggests, a model is built by calling layers like functions on the outputs of other layers.
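As a rough illustration of the Functional style, here's a minimal sketch of a model with two inputs, something a plain Sequential stack can't express. The input names (imageInput, extraInput), the 4-feature side input, and the layer sizes are arbitrary choices for this example, not part of the models used elsewhere in this article.

```python
import keras
from keras import layers

# Two inputs (hypothetical): an image and a small vector of extra features.
imageInput = keras.Input(shape=(28, 28, 1), name="image")
extraInput = keras.Input(shape=(4,), name="extra_features")

# Each layer is called like a function on the output of the previous one.
x = layers.Conv2D(8, kernel_size=(3, 3), activation="relu")(imageInput)
x = layers.MaxPooling2D(pool_size=(2, 2))(x)
x = layers.Flatten()(x)

# Merge the image branch with the extra features before the classifier.
merged = layers.Concatenate()([x, extraInput])
outputs = layers.Dense(10, activation="softmax")(merged)

functionalModel = keras.Model(inputs=[imageInput, extraInput], outputs=outputs)
functionalModel.summary()
```

The model is defined by its inputs and outputs; Keras works out the graph of layers in between.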

Sequential API

As described above, the Sequential API builds a network by stacking layers on top of each other. It has some pretty basic building blocks, for example:

  • Input layer

Input(), as its name suggests, is used to define the input layer. It takes the shape of a single input sample (be it an image or any other type of data) as its argument.

from keras import layers
inputLayer = layers.Input(shape=(256,256,3))
Caution: Don’t pass the input itself (image, etc.) as an input here. That’s a job for later at the time of optimization. Right now, we're just defining the model’s architecture.
  • Convolution layer

Conv2D() is an important layer, used to define a convolutional layer. Its main arguments are:

  • filters: The number of filters (output channels). Rather than relying on a single filter, we define several. Each filter has the same size, but they're applied (and they learn) independently of each other, so the layer can pick up several different patterns.
  • kernel_size: Usually, we define this as an odd number (you're free to define any filter size you like), such as 3 × 3, 5 × 5, etc. Here, MNIST images are already pretty small, so 3 × 3 will work.
  • activation: The activation function to use. Usually, we use ReLU for the intermediate layers. Please feel free to try others, too.

Similarly, there are other useful functions available. By combining them, we get our model:

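cnnModel = keras.Sequential(
    [
        layers.Input(shape=(28, 28, 1)),        #MNIST images
        layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),                    #rate is an assumption; the summary only shows the layer
        layers.Dense(10, activation="softmax"),
    ]
)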

A curious reader can always check how the output shapes and parameter counts evolve through the layers in the following way:

cnnModel.summary()

Model: "sequential"
 Layer (type)                Output Shape              Param #
 conv2d (Conv2D)             (None, 26, 26, 32)        320

 max_pooling2d (MaxPooling2  (None, 13, 13, 32)        0

 conv2d_1 (Conv2D)           (None, 11, 11, 64)        18496

 max_pooling2d_1 (MaxPoolin  (None, 5, 5, 64)          0

 flatten (Flatten)           (None, 1600)              0

 dropout (Dropout)           (None, 1600)              0

 dense (Dense)               (None, 10)                16010

Total params: 34826 (136.04 KB)
Trainable params: 34826 (136.04 KB)
Non-trainable params: 0 (0.00 Byte)
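If you're wondering where the numbers in this summary come from, they can be reproduced with a little arithmetic. The sketch below is plain Python (the helper names are made up for this example) and assumes the Keras defaults: no padding and stride 1 for Conv2D, and a pooling stride equal to the pool size.

```python
def conv2d_output(height, width, kernel, filters):
    """Output shape of an unpadded, stride-1 convolution."""
    return height - kernel + 1, width - kernel + 1, filters

def conv2d_params(kernel, in_channels, filters):
    """Each filter has kernel * kernel * in_channels weights plus one bias."""
    return (kernel * kernel * in_channels + 1) * filters

def maxpool_output(height, width, channels, pool):
    """Non-overlapping pooling floors each spatial dimension."""
    return height // pool, width // pool, channels

shape = conv2d_output(28, 28, kernel=3, filters=32)               # (26, 26, 32)
p1 = conv2d_params(kernel=3, in_channels=1, filters=32)           # 320
shape = maxpool_output(*shape, pool=2)                            # (13, 13, 32)
p2 = conv2d_params(kernel=3, in_channels=shape[2], filters=64)    # 18496
shape = conv2d_output(shape[0], shape[1], kernel=3, filters=64)   # (11, 11, 64)
shape = maxpool_output(*shape, pool=2)                            # (5, 5, 64)
flat = shape[0] * shape[1] * shape[2]                             # 1600
p3 = (flat + 1) * 10                                              # 16010 (Dense: weights + biases)
total = p1 + p2 + p3                                              # 34826

print(shape, flat, total)
```

Flatten and Dropout add no parameters, which is why only the two convolutions and the final dense layer contribute to the total.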


Having defined the model’s architecture, we can now optimize/train our model. Training is nothing but finding the values of the parameters leading to the minimization of the loss function (or optimization, in other words). Hence, it's important to select the optimizer and loss function carefully.

An optimizer further depends on some relevant attributes, commonly known as hyperparameters, like learning rate, momentum, etc.

To specify all this information, we use compile(). This function requires:

  • Optimizer function
  • Respective hyperparameters (if any)
  • Loss function

For example, we'll compile the aforementioned model with Adam and cross-entropy loss as:

cnnModel.compile(optimizer=keras.optimizers.Adam(learning_rate=0.003),
                 loss=keras.losses.CategoricalCrossentropy())
Note: This may throw some warnings on the new M2 processors, as we observed.

Great! But we aren’t done yet. As a final step, we need to specify the dataset with some relevant hyperparameters (like batch size or number of epochs). For that, we will call fit(). It will take:

  • Data samples (X)
  • Respective labels (Y)
  • Relevant hyperparameters
Note: If we have a Dataset object, it will automatically fetch the respective labels. Also, the batch size is already specified in the data processing function, as we saw earlier.
#It won't work as we didn't specify the actual dataset yet.

The above code won’t work as the dataset is placed locally on my system. So, I would highly encourage you to make a dataset yourself (even a few images for each class will do as a starter). In case you just want to run it and make a dataset later on, we can use a publicly available dataset, like MNIST, here.

MNIST is already available in the Keras datasets. We just have to make sure to:

  • Convert the labels into one-hot encoding. Keras provides to_categorical() for that.
  • Reshape the MNIST images to 28 × 28 × 1 rather than the intrinsic 28 × 28. The reason is that Keras (like most DL libraries) expects images with an explicit channel dimension: 4D for batched input or 3D for a single image. We'll use NumPy’s expand_dims for this purpose.
from keras.utils import to_categorical
import numpy as np

(xTrain, yTrain), (xTest, yTest) = keras.datasets.mnist.load_data()

yTrain = to_categorical(yTrain, 10)   # one-hot encode the 10 digit classes
xTrain = np.expand_dims(xTrain, -1)   # (60000, 28, 28) -> (60000, 28, 28, 1)
cnnModel.fit(xTrain, yTrain, batch_size=128, epochs=3)
Epoch 1/3
469/469 [==============================] - 10s 22ms/step - loss: 2.9006
Epoch 2/3
469/469 [==============================] - 9s 19ms/step - loss: 3.4552
Epoch 3/3
469/469 [==============================] - 9s 20ms/step - loss: 3.3820

<keras.src.callbacks.History at 0x7f81ca9c9bd0>
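If the two preprocessing steps feel a bit abstract, here's a small NumPy-only sketch of what they do to the data. The arrays are tiny stand-ins made up for this example (three blank "images" and three labels), and np.eye is used here only to mimic the effect of one-hot encoding; it's not what to_categorical() does internally.

```python
import numpy as np

# Hypothetical stand-ins for MNIST: 3 grayscale images and their digit labels.
images = np.zeros((3, 28, 28), dtype=np.uint8)
labels = np.array([7, 0, 3])

# One-hot encoding: row i is all zeros except for a 1 at index labels[i].
oneHot = np.eye(10, dtype=np.float32)[labels]

# Add a trailing channel axis: (3, 28, 28) -> (3, 28, 28, 1).
batched = np.expand_dims(images, -1)

print(oneHot.shape, batched.shape)
print(oneHot[0])
```

The first row of oneHot has its single 1 in position 7, matching the first label.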

Hyperparameter tuning

As an ML engineer, you'll have realized how difficult it is to find the right set of hyperparameters (which is an optimization problem in itself). Keras provides some help through KerasTuner.

Before using it, we need to get familiar with the HyperParameters class and some of its methods:


With Choice(), we provide a set of possible hyperparameter values to choose from. We provide the name and the candidate values of the hyperparameter, followed by some optional arguments.

For example, we can optimize the learning rate as:

optimizedLearningRate = hp.Choice('learning_rate', values=[0.001, 0.003, 0.0001, 0.0003])
Note: We'll assume the HyperParameters object is named hp unless specified otherwise.


In other scenarios, we may have a large search space, in which case listing every value explicitly with Choice() isn't practical. Int() is useful here: it takes the minimum and maximum of an integer range (plus an optional step) and lets the tuner pick values from it. For example, we can search for the best number of filters for our second layer (in the model above) as:

optimizedFiltersCount = hp.Int('filters', min_value=32, max_value=512, step=32)


We can go even crazier by looking for floating-point values within a range as well. For that, we can use Float() in lieu of Int().
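To get a feel for how quickly these search spaces grow, here's a small plain-Python sketch that lists the candidate values described by the two examples above. It only mimics the idea; int_range is a helper invented for this illustration and isn't part of KerasTuner, and it assumes the upper bound is included in the range.

```python
from itertools import product

def int_range(min_value, max_value, step):
    """Candidate values of an integer hyperparameter, bounds included."""
    return list(range(min_value, max_value + 1, step))

# Same ranges as the hp.Int() and hp.Choice() examples above.
filtersChoices = int_range(32, 512, 32)
learningRateChoices = [0.001, 0.003, 0.0001, 0.0003]

# Every combination an exhaustive (grid) search would have to train.
grid = list(product(filtersChoices, learningRateChoices))
print(len(filtersChoices), len(learningRateChoices), len(grid))
```

Even with just two hyperparameters, an exhaustive search means training dozens of models, which is why sampling strategies such as random search are popular.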

To make it all useful, we'll redefine the model above using hyperparameter optimization.

import keras_tuner

It can often throw an error. So, in case it does, please upgrade it using pip:

!pip install keras-tuner --upgrade

Once imported successfully, we can redefine our model. In order to use the hyperparameter optimization/search, we need to redefine our model within a function (defined by us), taking a hyperparameter object as its input.

def OptimizedModel(hp):
    model = keras.Sequential()
    model.add(layers.Input(shape=(28, 28, 1)))
    model.add(layers.Conv2D(32, kernel_size=(3, 3), activation="relu"))
    model.add(layers.MaxPooling2D(pool_size=(2, 2)))
    #Now, we will try hyperparam tuning by finding the 2nd Conv's number of filters using hp.Int().
    optimizedFiltersCount = hp.Int('filters', min_value=16, max_value=96, step=16)
    model.add(layers.Conv2D(filters=optimizedFiltersCount, kernel_size=(3, 3), activation="relu"))
    # The rest of the model is the same
    model.add(layers.MaxPooling2D(pool_size=(2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dropout(0.5))  #dropout rate is an assumption
    model.add(layers.Dense(10, activation="softmax"))
    #Similarly, we will try hyperparam tuning for the learning rate as well
    optimizedLearningRate = hp.Choice('learning_rate', values=[0.001, 0.003, 0.0001, 0.0003])
    #Compile with the tuned learning rate; tracking accuracy gives the tuner a metric to compare.
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=optimizedLearningRate),
                  loss=keras.losses.CategoricalCrossentropy(),
                  metrics=["accuracy"])
    #Finally, we can return from the function.
    return model

Having defined the OptimizedModel, now we can initialize a hyperparameter tuner/optimizer. keras_tuner provides us with a number of search algorithms, like:

  • Random Search
  • Bayesian Optimization
  • Grid Search

Usually, ML courses don’t cover hyperparameter tuning, so curious readers are invited to read more about these algorithms. We'll use a random search tuner here. A tuner takes some arguments, like:

  • hypermodel – the function that builds the model from a hyperparameter object (OptimizedModel here).
  • objective – the metric to optimize, whether it’s accuracy, validation accuracy, or something else.
  • max_trials – the maximum number of hyperparameter combinations (trials) to try.
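Before handing things over to the tuner, it may help to see the idea behind random search in a few lines of plain Python. This is only a conceptual sketch, not KerasTuner's implementation: toy_score is an invented stand-in for "train the model and return its accuracy", and random_search is a hypothetical helper.

```python
import random

def toy_score(filters, learning_rate):
    """Hypothetical stand-in for training a model and returning its score."""
    return -abs(filters - 64) / 64 - abs(learning_rate - 0.001) * 100

def random_search(max_trials, seed=0):
    """Sample max_trials random configurations and keep the best one."""
    rng = random.Random(seed)
    best = None
    for _ in range(max_trials):
        config = {
            "filters": rng.choice(range(16, 97, 16)),
            "learning_rate": rng.choice([0.001, 0.003, 0.0001, 0.0003]),
        }
        score = toy_score(**config)
        if best is None or score > best[0]:
            best = (score, config)
    return best

bestScore, bestConfig = random_search(max_trials=5)
print(bestScore, bestConfig)
```

Raising max_trials can only keep or improve the best score found, at the cost of more training runs.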

Let’s initialize and see it in action:

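hyperParamTuner = keras_tuner.RandomSearch(
    hypermodel=OptimizedModel, objective="accuracy", max_trials=3)  # max_trials value is an assumption
hyperParamTuner.search(xTrain, yTrain, epochs=3)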
Trial 2 Complete [00h 00m 44s]

Best accuracy So Far: None
Total elapsed time: 00h 01m 32s

Search: Running Trial #3

Value             |Best Value So Far |Hyperparameter
96                |64                |filters
0.0003            |0.003             |learning_rate

Epoch 1/3
1875/1875 [==============================] - 18s 9ms/step - loss: 10.1364
Epoch 2/3
1875/1875 [==============================] - 19s 10ms/step - loss: 10.6086
Epoch 3/3
1875/1875 [==============================] - 18s 10ms/step - loss: 10.6009

That was fun. Now, let’s proceed further to see what else Keras has in store for us.

What's new in Keras 3.0?


Keras 3.0 is scheduled to launch this fall. While the exact launch date is still unclear, we can already use its beta version, Keras Core.

We can simply import it:

import keras_core
Using TensorFlow backend
Note: We'll keep using the terms Keras 3.0 and Keras core interchangeably. The term ‘Keras’ (without any suffix) will refer to the earlier/classical versions.

Keras used to be pretty common back in 2017/18, so if you're switching back to Keras (or even if you're totally new to it), there are some cool features on offer, like:

  • Seamless support of PyTorch, TensorFlow, and JAX
  • Ops support
  • The ability to combine Keras and backend code
  • Using diverse data pipelines
  • Support for functional programming

More backends - beyond TensorFlow

The major reason why Keras fell out of favor with the community since 2019-2020 was the gradual rise of PyTorch and JAX. Keras, on the other hand, had become a TensorFlow-only wrapper.

Note: Plenty of you would have noticed that Keras and TensorFlow come together. Installing either of the libraries automatically led to the other’s installation as well.

Keras developers recognized the need of the hour, and now we have support for all three leading DL frameworks. Personally, it came as a surprise to me that Keras did so a good two to three years after JAX had established itself in the research community, given that JAX is a Google product itself. Nevertheless, all is well that ends well, and now we have support for not only TensorFlow but also PyTorch and JAX in Keras core.

Note: To use a backend other than the default TensorFlow, please set the respective backend first as:
import os
os.environ["KERAS_BACKEND"] = "<jax or torch>"
Follow it by importing Keras core. If we've already imported Keras core and set the backend afterwards, it won't take effect; we'll have to rerun the code in the correct order.


TensorFlow works by building computational graphs. An Operation is a node in a TensorFlow graph that takes tensors as input and produces an output. It’s pretty similar to the normal operators (can be as simple as arithmetic operators) but the main difference is both operands and the output are tensors here.

Inspired by this concept, Keras has implemented Ops in Keras core. The majority of the Ops mirror standard NumPy operations, though it supports some more advanced procedures as well. Although it's a custom implementation of the NumPy API, both the function names and the arguments are the same.

We can import ops from keras_core:

from keras_core import ops

As we can confirm, ops functionality is the same as its NumPy counterparts.

x = ops.linspace(0,2,20)
<tf.Tensor: shape=(20,), dtype=float64, numpy=
array([0.        , 0.10526316, 0.21052632, 0.31578947, 0.42105263,
       0.52631579, 0.63157895, 0.73684211, 0.84210526, 0.94736842,
       1.05263158, 1.15789474, 1.26315789, 1.36842105, 1.47368421,
       1.57894737, 1.68421053, 1.78947368, 1.89473684, 2.        ])>

We can use ops in the Keras model by using the Lambda layer.

Combining Keras and native code

Keras core allows us to use Keras’ intrinsic functions with low-level libraries like PyTorch, TensorFlow, or JAX. It allows a lot more power and enables developers to switch across the libraries seamlessly. For example, the code here combines both Keras and native PyTorch code:

import keras_core
from torch import nn
from keras_core import layers

class HybridCNN(nn.Module):
    def __init__(self):
        super().__init__()  # initialize nn.Module before assigning attributes
        self.model = keras_core.Sequential(
            [
                layers.Input(shape=(28, 28, 1)),
                layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
                layers.MaxPooling2D(pool_size=(2, 2)),
                layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
                layers.MaxPooling2D(pool_size=(2, 2)),
            ]
        )

    def forward(self, x):
        return self.model(x)

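Have you noticed the beauty of the code above? It combines PyTorch and Keras code in a single class. One caveat: for the Keras layers' weights to be tracked and trained as part of the PyTorch module, Keras generally needs to be running on the torch backend (set as shown earlier). It’s pretty cool and keeps our code portable across the DL community.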

Diverse data pipelines

Things become even more interesting for data pipelines, the core component of any DL (or even ML) project. Unlike classical Keras, Keras core models can be trained on a TensorFlow tf.data.Dataset, a PyTorch DataLoader, NumPy arrays, pandas DataFrames, or Keras core’s own PyDataset objects, regardless of the backend.

Functional programming support

Functional programming is a pretty simple and cool paradigm that uses deterministic functions without any side effects. A side effect refers to the modification of values beyond the scope of the function or any I/O operation (including the mere print() statement). These functions are also known as pure functions.

Since pure functions are deterministic, they'll always give the same output for the same inputs, regardless of the environment.

Keras, on the other hand, uses stateful functions (which may keep some records beyond the function’s scope). Luckily, now Keras core supports the pure/stateless counterparts as well. They're especially useful for JAX development as it's based on the functional programming model.
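To see the difference between the two styles, here's a small plain-Python sketch of a running mean written both ways. None of this is Keras API; StatefulMean, pure_update, and pure_result are names invented for the illustration. The stateful version updates values stored on the object, while the stateless version receives the state as an argument and returns a new one.

```python
class StatefulMean:
    """Keeps running totals on the object; each update is a side effect."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, value):
        self.total += value
        self.count += 1

    def result(self):
        return self.total / self.count


def pure_update(state, value):
    """Pure counterpart: takes the state in and returns a new state."""
    total, count = state
    return (total + value, count + 1)


def pure_result(state):
    total, count = state
    return total / count


stateful = StatefulMean()
state = (0.0, 0)
for v in [1.0, 2.0, 3.0]:
    stateful.update(v)              # mutates the object in place
    state = pure_update(state, v)   # leaves the old state untouched

print(stateful.result(), pure_result(state))
```

Both versions compute the same mean, but only the second can be called repeatedly with the same arguments and always return the same answer, which is what functional frameworks rely on.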

Neural network architecture

In layers, we can use the method stateless_call(). It's a (stateless) alternative to __call__(). Since it's free of side effects, it can be integrated seamlessly into a functional programming framework like JAX.


Similarly, optimizers can use stateless_apply() to mimic the apply() function in a stateless manner.


For metrics, we have stateless_result() as a side-effect-free implementation of result().

Keras: deep learning for humans


Keras is a high-level DL library with a number of applications, from object detection to generative modeling. It has been adopted by a number of famous companies like Adobe, Twitter, Tesla, IBM, and Salesforce. With increased support in the upcoming version, Keras is set to lead the DL engineer’s arsenal of tools.

This post serves as both an introduction and a thorough review of Keras’s features. I hope it will be pretty helpful for an aspiring Keras engineer. In the end, it really is deep learning for humans.

If you need data for your models, you might be interested in web scraping methods for collecting data for AI.
