TensorFlow in Practice Learning Notes

Learning notes for the Coursera specialization TensorFlow in Practice, offered by deeplearning.ai.

  • Course 1: Introduction to TensorFlow for AI, ML and DL
  • Course 2: Convolutional Neural Networks in TensorFlow
  • Course 3: Natural Language Processing in TensorFlow
  • Course 4: Sequences, Time Series and Prediction

C1W1: A New Programming Paradigm


New programming paradigm

                          Input            Output
Traditional Programming   Rules, Data      Answers
Machine Learning          Answers, Data    Rules


How to fit a line

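import numpy as np
import tensorflow as tf
from tensorflow import keras

# A single dense unit learns y = w * x + b
model = tf.keras.Sequential([keras.layers.Dense(units=1, input_shape=[1])])
model.compile(optimizer='sgd', loss='mean_squared_error')

# Six samples of the relationship y = 2x - 1
xs = np.array([-1.0,  0.0, 1.0, 2.0, 3.0, 4.0], dtype=float)
ys = np.array([-3.0, -1.0, 1.0, 3.0, 5.0, 7.0], dtype=float)
model.fit(xs, ys, epochs=500)

# Predict y for x = 10; the exact relationship would give 19
print(model.predict(np.array([[10.0]])))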

The predicted value for x = 10 is not exactly 19.0 but a little under. This is because neural networks deal with probabilities: given the data that we fed the NN, it calculated that there is a very high probability that the relationship between X and Y is Y = 2X - 1, but with only 6 data points it can't know for sure. As a result, the prediction for 10 is very close to 19, but not necessarily 19.
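One way to see where the small gap comes from is to run the same kind of training by hand. The following is a minimal NumPy sketch, assuming plain full-batch gradient descent on the mean squared error, a learning rate of 0.01 and weights initialized to zero. The names `w`, `b` and `learning_rate` are illustrative, and this is not how Keras is implemented internally.

```python
import numpy as np

xs = np.array([-1.0, 0.0, 1.0, 2.0, 3.0, 4.0])
ys = np.array([-3.0, -1.0, 1.0, 3.0, 5.0, 7.0])

w, b = 0.0, 0.0          # assumed starting point
learning_rate = 0.01     # assumed step size

for _ in range(500):
    error = (w * xs + b) - ys
    # Gradients of the mean squared error with respect to w and b
    grad_w = 2.0 * np.mean(error * xs)
    grad_b = 2.0 * np.mean(error)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

prediction = w * 10.0 + b
print(w, b, prediction)
```

After a finite number of steps, `w` and `b` are close to 2 and -1 but have not converged exactly, so the prediction for 10 lands near 19 rather than on it.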

C1W2: Introduction to Computer Vision


Why are the labels numbers instead of words

Using a number is a first step in avoiding bias: labelling items with words in a specific language would exclude people who don't speak that language. You can learn more about bias and techniques to avoid it here.

What is cross entropy (CE)

$$CE = - \sum_{i=0}^{C - 1} y_i \cdot \log( f(\vec{x_i}) )$$


  • $C$: the number of classes
  • $\vec{x_i}$: the feature vector of the example $i$
  • $y_i$: the label of the example $i$
  • $f$: the learned prediction function, which takes the feature vector $\vec{x_i}$ and returns the probability of being class $y_i$

When $C = 2$

$$CE = - \big[ y_i \cdot \log( p_i ) + (1 - y_i) \cdot \log( 1 - p_i ) \big]$$

Difference between categorical_crossentropy and sparse_categorical_crossentropy

  • If your targets are one-hot encoded, use categorical_crossentropy.
    Examples of one-hot encodings for three classes: [1, 0, 0], [0, 1, 0], [0, 0, 1]
  • But if your targets are integers, use sparse_categorical_crossentropy.
    Examples of integer encodings for the same three classes (for the sake of completeness): 0, 1, 2
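The two losses compute the same quantity and differ only in how the targets are encoded. The following NumPy sketch illustrates this under the assumption that the model already outputs class probabilities; the function names mirror the Keras loss names but are hand-written here for illustration, not taken from the library.

```python
import numpy as np

def categorical_crossentropy(one_hot_targets, probs):
    # Sum over classes; only the true class contributes because the rest are 0
    return -np.sum(one_hot_targets * np.log(probs), axis=1)

def sparse_categorical_crossentropy(int_targets, probs):
    # Pick the predicted probability of the true class directly by index
    return -np.log(probs[np.arange(len(int_targets)), int_targets])

# Predicted probabilities for two examples over three classes (made-up values)
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
int_targets = np.array([0, 1])
one_hot_targets = np.eye(3)[int_targets]

loss_one_hot = categorical_crossentropy(one_hot_targets, probs)
loss_sparse = sparse_categorical_crossentropy(int_targets, probs)
print(loss_one_hot, loss_sparse)
```

Both calls return the same per-example losses; the sparse variant simply avoids materializing the one-hot matrix.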


import tensorflow as tf

# Early stopping: end training once the accuracy passes 60%
class myCallback(tf.keras.callbacks.Callback):
  def on_epoch_end(self, epoch, logs=None):
    logs = logs or {}
    # The key is 'accuracy' because of metrics=['accuracy'] in compile below
    if logs.get('accuracy', 0) > 0.6:
      print("\nReached 60% accuracy so cancelling training!")
      self.model.stop_training = True

callbacks = myCallback()

mnist = tf.keras.datasets.fashion_mnist
(training_images, training_labels), (test_images, test_labels) = mnist.load_data()
# Data normalization: scale pixel values from [0, 255] to [0, 1]
training_images = training_images / 255.0
test_images = test_images / 255.0
model = tf.keras.models.Sequential([tf.keras.layers.Flatten(),
                                    tf.keras.layers.Dense(128, activation=tf.nn.relu),
                                    tf.keras.layers.Dense(10, activation=tf.nn.softmax)])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(training_images, training_labels, epochs=5, callbacks=[callbacks])
model.evaluate(test_images, test_labels)

C1W3: Enhancing Vision with Convolutional Neural Networks


Convolution Layer

Each kernel acts as an edge detector, which suits computer vision well: it is often features highlighted in this way that distinguish one item from another. The amount of information needed is then much smaller, because you train only on the highlighted features.

MaxPooling Layer

The convolution layer is followed by a MaxPooling layer, which compresses the image while maintaining the content of the features that were highlighted by the convolution.

Why CNN works

A CNN tries different filters on the image and learns which ones work on the training data. As a result, when it works, much less information passes through the network, but because the network isolates and identifies features, you can also get increased accuracy.
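The effect of one filter followed by pooling can be sketched in plain NumPy. This is a minimal illustration, assuming a hand-picked vertical-edge kernel and a toy 6x6 image; a real Conv2D layer learns its kernels during training, and the helper names `conv2d_valid` and `max_pool2d` are made up for this sketch.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image without padding (stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size block."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

# Toy image: dark left half, bright right half, so there is one vertical edge
image = np.zeros((6, 6))
image[:, 3:] = 1.0

# Hand-picked kernel that responds to dark-to-bright transitions
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

feature_map = np.maximum(conv2d_valid(image, kernel), 0.0)  # ReLU
pooled = max_pool2d(feature_map)
print(feature_map)
print(pooled)
```

The feature map is non-zero only around the edge, and pooling halves each spatial dimension while keeping that response.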



# Reshape to a 4D tensor (batch, height, width, channels); Conv2D expects a channel axis
training_images = training_images.reshape(60000, 28, 28, 1)
training_images = training_images / 255.0
test_images = test_images.reshape(10000, 28, 28, 1)
test_images = test_images / 255.0

# 2-convolution-layer NN
model = tf.keras.models.Sequential([
  # default: strides = 1, padding = 'valid'
  tf.keras.layers.Conv2D(64, (3,3), activation='relu', input_shape=(28, 28, 1)),
  # default: strides = None (same as pool_size), padding = 'valid'
  tf.keras.layers.MaxPooling2D(2, 2),
  tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
  tf.keras.layers.MaxPooling2D(2, 2),
  # Flatten the feature maps before the dense layers
  tf.keras.layers.Flatten(),
  tf.keras.layers.Dense(128, activation='relu'),
  tf.keras.layers.Dense(10, activation='softmax')
])
model.summary()
_________________________________________________________________ || 
Layer (type)                 Output Shape              Param #    || Comments
================================================================= || 
conv2d (Conv2D)              (None, 26, 26, 64)        640        || = 64 x (3 x 3 x 1 + 1)
_________________________________________________________________ || 
max_pooling2d (MaxPooling2D) (None, 13, 13, 64)        0          || 
_________________________________________________________________ || 
conv2d_1 (Conv2D)            (None, 11, 11, 64)        36928      || = 64 x (3 x 3 x 64 + 1)
_________________________________________________________________ || 
max_pooling2d_1 (MaxPooling2 (None, 5, 5, 64)          0          || 
_________________________________________________________________ || 
flatten_1 (Flatten)          (None, 1600)              0          || 
_________________________________________________________________ || 
dense_2 (Dense)              (None, 128)               204928     || = 128 x (1600 + 1)
_________________________________________________________________ || 
dense_3 (Dense)              (None, 10)                1290       || = 10 * (128 + 1)
================================================================= || 
Total params: 243,786
Trainable params: 243,786
Non-trainable params: 0

How to compute output size

Convolution layer

$$(n + 2p - f + 1) \times (n + 2p - f + 1)$$

MaxPooling layer

$$\left\lfloor \frac{height - f}{s} + 1 \right\rfloor \times \left\lfloor \frac{width - f}{s} + 1 \right\rfloor$$

  • $n$: input size
  • $p$: padding size
  • $f$: filter size
  • $s$: stride (the convolution formula above assumes $s = 1$)

Two kinds of padding:

  • Valid: no padding

$$p = 0$$

  • Same: pads the input so that the output has the same length as the original input

$$n + 2p - f + 1 = n \implies p = (f - 1) / 2$$

where $f$ is almost always an odd number

How to compute number of parameters

$$NF \times (f \times f \times NC_{input} + 1)$$

  • $NF$: number of filters
  • $NC_{input}$: number of input channels
  • Each filter has a bias term
  • Convolutions Over Volume
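The formulas above can be checked against the model summary of the 2-convolution-layer network. The sketch below uses hypothetical helper names; the stride argument in `conv_output_size` is a generalization that is not part of the formula in these notes, and with the default stride of 1 it reduces to that formula.

```python
def conv_output_size(n, f, p=0, s=1):
    # n: input size, f: filter size, p: padding, s: stride
    return (n + 2 * p - f) // s + 1

def pool_output_size(n, f, s=None):
    # Keras pooling uses stride = pool size when no stride is given
    s = f if s is None else s
    return (n - f) // s + 1

def conv_params(num_filters, f, in_channels):
    # Each filter has f * f * in_channels weights plus one bias
    return num_filters * (f * f * in_channels + 1)

def dense_params(units, inputs):
    # Each unit has one weight per input plus one bias
    return units * (inputs + 1)

size = 28
size = conv_output_size(size, 3)   # first Conv2D
size = pool_output_size(size, 2)   # first MaxPooling2D
size = conv_output_size(size, 3)   # second Conv2D
size = pool_output_size(size, 2)   # second MaxPooling2D
flattened = size * size * 64

total_params = (conv_params(64, 3, 1) + conv_params(64, 3, 64)
                + dense_params(128, flattened) + dense_params(10, 128))
print(size, flattened, total_params)
```

The intermediate sizes follow the chain 28, 26, 13, 11, 5, and the parameter total matches the 243,786 reported in the summary.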

Visualizing the Convolutions and Pooling

Each row represents an item. There are 3 shoe images here.
The 4 columns represent the outputs of the first 4 layers (conv2d, max_pooling2d, conv2d_1, max_pooling2d_1).
We can see what items of the same kind have in common.

C1W4: Using Real-world Images



  • ImageDataGenerator can flow images from a directory and perform operations such as resizing them on the fly.
  • You can point it at a directory, and the names of its sub-directories automatically become the labels.
|-- training
|   |-- horse
|   |   |-- 1.jpg
|   |   |-- 2.jpg
|   |   `-- 3.jpg
|   `-- human
|       |-- 1.jpg
|       |-- 2.jpg
|       `-- 3.jpg
`-- validation
    |-- horse
    |   |-- 1.jpg
    |   |-- 2.jpg
    |   `-- 3.jpg
    `-- human
        |-- 1.jpg
        |-- 2.jpg
        `-- 3.jpg

If you point ImageDataGenerator at the training directory, it will generate a stream of images labelled horse or human.


Why mini-batch

For large neural networks with very large and highly redundant training sets, it is nearly always best to use mini-batch learning.

  • The mini-batches may need to be quite big when adapting fancy methods.
  • Big mini-batches are more computationally efficient.


  • Momentum
  • RMSProp
  • Adam



import tensorflow as tf
from tensorflow.keras.optimizers import RMSprop

model = tf.keras.models.Sequential([
    # Note the input shape is the desired size of the image 300x300 with 3 bytes color
    # This is the first convolution
    tf.keras.layers.Conv2D(16, (3,3), activation='relu', input_shape=(300, 300, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),
    # The second convolution
    tf.keras.layers.Conv2D(32, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    # The third convolution
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    # The fourth convolution
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    # The fifth convolution
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    # Flatten the results to feed into a DNN
    tf.keras.layers.Flatten(),
    # 512 neuron hidden layer
    tf.keras.layers.Dense(512, activation='relu'),
    # Only 1 output neuron. It will contain a value from 0-1 where 0 for 1 class ('horses') and 1 for the other ('humans')
    tf.keras.layers.Dense(1, activation='sigmoid')
])

# Train our model with the binary_crossentropy loss, 
# because it's a binary classification problem and our final activation is a sigmoid.
# [More details](http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf)
model.compile(loss='binary_crossentropy',
              optimizer=RMSprop(learning_rate=0.001),  # learning rate is an assumed value, not given in these notes
              metrics=['accuracy'])

Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 298, 298, 16)      448       
max_pooling2d (MaxPooling2D) (None, 149, 149, 16)      0         
conv2d_1 (Conv2D)            (None, 147, 147, 32)      4640      
max_pooling2d_1 (MaxPooling2 (None, 73, 73, 32)        0         
conv2d_2 (Conv2D)            (None, 71, 71, 64)        18496     
max_pooling2d_2 (MaxPooling2 (None, 35, 35, 64)        0         
conv2d_3 (Conv2D)            (None, 33, 33, 64)        36928     
max_pooling2d_3 (MaxPooling2 (None, 16, 16, 64)        0         
conv2d_4 (Conv2D)            (None, 14, 14, 64)        36928     
max_pooling2d_4 (MaxPooling2 (None, 7, 7, 64)          0         
flatten (Flatten)            (None, 3136)              0         
dense (Dense)                (None, 512)               1606144   
dense_1 (Dense)              (None, 1)                 513       
Total params: 1,704,097
Trainable params: 1,704,097
Non-trainable params: 0

The convolutions reduce the shape from 90,000 pixels (300 x 300) per color channel down to 3,136 values (7 x 7 x 64).


from tensorflow.keras.preprocessing.image import ImageDataGenerator

# All images will be rescaled by 1./255
train_datagen = ImageDataGenerator(rescale=1/255)
validation_datagen = ImageDataGenerator(rescale=1/255)

# Flow training images in batches of 128 using train_datagen generator
train_generator = train_datagen.flow_from_directory(
        '/tmp/horse-or-human/',  # This is the source directory for training images
        target_size=(300, 300),  # All images will be resized to 300x300
        batch_size=128, # number of images for each batch
        # Since we use binary_crossentropy loss, we need binary labels
        class_mode='binary')

# Flow validation images in batches of 32 using validation_datagen generator
validation_generator = validation_datagen.flow_from_directory(
        '/tmp/validation-horse-or-human/',  # This is the source directory for validation images
        target_size=(300, 300),  # All images will be resized to 300x300
        batch_size=32, # number of images for each batch
        # Since we use binary_crossentropy loss, we need binary labels
        class_mode='binary')

# fit_generator is the API used in the course; newer TensorFlow releases
# accept generators directly in model.fit
history = model.fit_generator(
      train_generator,
      steps_per_epoch=8, # number of batches for each epoch during training
      validation_data = validation_generator,
      validation_steps=8) # number of batches for each epoch during validation

Visualizing Intermediate Representations

As you can see we go from the raw pixels of the images to increasingly abstract and compact representations. The representations downstream start highlighting what the network pays attention to, and they show fewer and fewer features being “activated”; most are set to zero. This is called “sparsity.” Representation sparsity is a key feature of deep learning.

These representations carry increasingly less information about the original pixels of the image, but increasingly refined information about the class of the image. You can think of a convnet (or a deep network in general) as an information distillation pipeline.

C2W1: Exploring a Larger Dataset



import os
import random
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.preprocessing.image import img_to_array, load_img

# Let's define a new Model that will take an image as input, and will output
# intermediate representations for all layers in the previous model after
# the first.
successive_outputs = [layer.output for layer in model.layers[1:]]

#visualization_model = Model(img_input, successive_outputs)
visualization_model = tf.keras.models.Model(inputs = model.input, outputs = successive_outputs)

# Let's prepare a random input image of a cat or dog from the training set.
cat_img_files = [os.path.join(train_cats_dir, f) for f in train_cat_fnames]
dog_img_files = [os.path.join(train_dogs_dir, f) for f in train_dog_fnames]

img_path = random.choice(cat_img_files + dog_img_files)
img = load_img(img_path, target_size=(150, 150))  # this is a PIL image

x   = img_to_array(img)                           # Numpy array with shape (150, 150, 3)
x   = x.reshape((1,) + x.shape)                   # Numpy array with shape (1, 150, 150, 3)

# Rescale by 1/255
x /= 255.0

# Let's run our image through our network, thus obtaining all
# intermediate representations for this image.
successive_feature_maps = visualization_model.predict(x)

# These are the names of the layers, so we can have them as part of our plot.
# Skip the first layer so that the names line up with successive_outputs.
layer_names = [layer.name for layer in model.layers[1:]]

# -----------------------------------------------------------------------
# Now let's display our representations
# -----------------------------------------------------------------------
for layer_name, feature_map in zip(layer_names, successive_feature_maps):
  if len(feature_map.shape) == 4:
    # Just do this for the conv / maxpool layers, not the fully-connected layers
    n_features = feature_map.shape[-1]  # number of features in the feature map
    size       = feature_map.shape[ 1]  # feature map shape (1, size, size, n_features)
    # We will tile our images in this matrix
    display_grid = np.zeros((size, size * n_features))
    # Postprocess the feature to be visually palatable
    for i in range(n_features):
      x  = feature_map[0, :, :, i]
      x -= x.mean()
      x /= x.std ()
      x *=  64
      x += 128
      x  = np.clip(x, 0, 255).astype('uint8')
      display_grid[:, i * size : (i + 1) * size] = x # Tile each filter into a horizontal grid

    # Display the grid

    scale = 20. / n_features
    plt.figure( figsize=(scale * n_features, scale) )
    plt.title ( layer_name )
    plt.grid  ( False )
    plt.imshow( display_grid, aspect='auto', cmap='viridis' ) 

C2W2: Augmentation: A technique to avoid overfitting


Image augmentation

  • Image augmentation implementation in Keras: https://keras.io/preprocessing/image/

  • The image generator library lets you load the images into memory, process them, and then stream them to the neural network as the training set we will ultimately learn on. The preprocessing doesn't require you to edit your raw images, nor does it amend them for you on-disk. It does it in-memory as it's performing the training, allowing you to experiment without impacting your dataset.

  • As we start training, we'll initially see that the accuracy is lower than with the non-augmented version. This is because of the random effects of the different image processing that's being done. As it runs for a few more epochs, you'll see the accuracy slowly climbing.

  • Image augmentation introduces a random element to the training images, but if the validation set doesn't have the same randomness, its results can fluctuate. You don't just need a broad set of images for training, you also need them for testing, or the image augmentation won't help you very much (which does NOT mean that you should augment your validation set, see below).

  • The validation dataset should not be augmented: the validation set is used to estimate how your method works on real world data, thus it should only contain real world data. Adding augmented data will not improve the accuracy of the validation. It will at best say something about how well your method responds to the data augmentation, and at worst ruin the validation results and interpretability, because the validation accuracy is no longer a good proxy for the accuracy on new unseen data if you augment the validation data.
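The in-memory nature of augmentation can be illustrated without Keras. The following is a rough NumPy sketch, assuming a single-channel image and only two transformations (a random horizontal flip and a small horizontal shift that wraps around). The function `augment` and its `max_shift` parameter are hypothetical, and Keras's ImageDataGenerator offers more transformations and fills shifted regions differently.

```python
import numpy as np

def augment(image, rng, max_shift=2):
    """Return a randomly flipped and shifted copy; the input is left untouched."""
    augmented = image.copy()
    if rng.random() < 0.5:
        augmented = augmented[:, ::-1]                 # horizontal flip
    shift = int(rng.integers(-max_shift, max_shift + 1))
    augmented = np.roll(augmented, shift, axis=1)      # horizontal shift, wrapping around
    return augmented

rng = np.random.default_rng(0)
original = np.arange(16, dtype=float).reshape(4, 4)
backup = original.copy()

# Each call produces a fresh variant, as would happen batch by batch during training
batch = np.stack([augment(original, rng) for _ in range(8)])
print(batch.shape)
```

The source array is never modified, which mirrors the point above that augmentation does not touch the images on disk.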


train_datagen = ImageDataGenerator(

C2W3: Transfer Learning


What is transfer learning

You can take an existing model, freeze many of its layers to prevent them from being retrained, and effectively 'remember' the convolutions it was trained on to fit images. You then add your own DNN underneath, so that you can retrain on your images using the convolutions from the other model.

Why dropout can do the regularization

The idea behind Dropouts is that they remove a random number of neurons in your neural network. This works very well for two reasons:

  • The first is that neighboring neurons often end up with similar weights, which can lead to overfitting, so dropping some out at random can remove this.

  • The second is that a neuron can often over-weigh the input from a neuron in the previous layer and over-specialize as a result. With dropout, it cannot rely on any single input, because any input may be randomly dropped; instead it spreads out its weights, which shrinks them.
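The mechanism can be sketched in NumPy. This is a minimal illustration of the commonly used "inverted dropout" formulation, assuming a drop rate of 0.2; the function name `dropout` and its arguments are chosen for this sketch and are not the Keras layer itself.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Zero out a random fraction `rate` of activations during training."""
    if not training or rate == 0.0:
        return activations
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob
    # Scale the survivors so the expected activation stays the same
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
activations = np.ones((1000, 100))
dropped = dropout(activations, rate=0.2, rng=rng)
print(dropped.mean(), (dropped == 0).mean())
```

Roughly 20% of the values become zero and the rest are scaled up, so a downstream neuron cannot count on any particular input being present. At inference time (`training=False`) the activations pass through unchanged.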


from tensorflow.keras import layers
from tensorflow.keras import Model
from tensorflow.keras.optimizers import RMSprop

from tensorflow.keras.applications.inception_v3 import InceptionV3

local_weights_file = '/tmp/inception_v3_weights_tf_dim_ordering_tf_kernels_notop.h5'

pre_trained_model = InceptionV3(input_shape = (150, 150, 3), 
                                include_top = False,  # whether to include the fully-connected layer at the top of the network.
                                weights = None) # one of None (random initialization) or 'imagenet' (pre-training on ImageNet).

# Load the pre-trained weights from the local file; without this the model stays randomly initialized
pre_trained_model.load_weights(local_weights_file)

for layer in pre_trained_model.layers:
  layer.trainable = False

last_layer = pre_trained_model.get_layer('mixed7')
last_output = last_layer.output

# Flatten the output layer to 1 dimension
x = layers.Flatten()(last_output)
# Add a fully connected layer with 1,024 hidden units and ReLU activation
x = layers.Dense(1024, activation='relu')(x)
# Add a dropout rate of 0.2
x = layers.Dropout(0.2)(x)                  
# Add a final sigmoid layer for classification
x = layers.Dense  (1, activation='sigmoid')(x)           

model = Model( pre_trained_model.input, x) 

model.compile(optimizer = RMSprop(learning_rate=0.0001), 
              loss = 'binary_crossentropy', 
              metrics = ['acc'])

C2W4: Multiclass Classification


  • Use CGI to generate images for Rock, Paper, Scissors


train_generator = training_datagen.flow_from_directory(

# Same for validation

model = tf.keras.models.Sequential([
    # Convolution layers
    # ...
    # Flatten the results to feed into a DNN
    tf.keras.layers.Flatten(),
    # 512 neuron hidden layer
    tf.keras.layers.Dense(512, activation='relu'),
    # 3 nodes with softmax
    tf.keras.layers.Dense(3, activation='softmax')
])

Another way of using fit_generator API via (images, labels), instead of via directory

# Use integer division: the step counts must be whole numbers
history = model.fit_generator(train_datagen.flow(training_images, training_labels, batch_size=32),
                              steps_per_epoch=len(training_images) // 32,
                              validation_data=validation_datagen.flow(testing_images, testing_labels, batch_size=32),
                              validation_steps=len(testing_images) // 32)

C3W1: Sentiment in text


from tensorflow.keras.preprocessing.text import Tokenizer
sentences = [
  'I love my dog',
  'I love my cat'
]
tokenizer = Tokenizer(num_words = 100, oov_token='<OOV>')
# Build the word index from the sentences
tokenizer.fit_on_texts(sentences)

word_index = tokenizer.word_index
print(word_index)


  • If the number of distinct words is bigger than num_words, the tokenizer keeps only the most frequent words when encoding (num_words = 100 here, so roughly the top 100 words by volume)
  • num_words is optional. If it is not set, the tokenizer uses all the words in the sentences
  • oov_token is used for words that aren't in the word index
  • Punctuation such as commas is removed, and words are split on spaces
  • Tokens are case insensitive: the text is converted to lower case
  • word_index is sorted by commonality, with the most frequent word first
sequences = tokenizer.texts_to_sequences(sentences)

If you train a neural network on a corpus of texts, and the text has a word index generated from it, then when you want to do inference with the trained model, you'll have to encode the text that you want to infer on with the same word index, otherwise it would be meaningless.

test_seq = tokenizer.texts_to_sequences(test_data)

New words which are not in the index will be lost in the sequences.
In that case:

  • We need a very broad corpus
  • We need to put a special value for unknown words: Tokenizer(num_words = 100, oov_token="<OOV>")
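To make the role of the OOV token concrete, here is a simplified pure-Python sketch of the idea. It assumes whitespace splitting and lowercasing only (no punctuation handling), reserves index 1 for the OOV token, and ranks the remaining words by frequency. The helper names `build_word_index` and `texts_to_sequences` are written for this sketch and are not the Keras implementation.

```python
from collections import Counter

def build_word_index(sentences, oov_token='<OOV>'):
    """Map each word to an integer, most frequent first; index 1 is the OOV token."""
    counts = Counter(word for s in sentences for word in s.lower().split())
    index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        index[word] = rank
    return index

def texts_to_sequences(texts, word_index, oov_token='<OOV>'):
    """Encode texts with an existing word index; unknown words map to the OOV index."""
    oov_index = word_index[oov_token]
    return [[word_index.get(word, oov_index) for word in t.lower().split()]
            for t in texts]

word_index = build_word_index(['I love my dog', 'I love my cat'])
test_seq = texts_to_sequences(['I love my manatee'], word_index)
print(word_index)
print(test_seq)
```

The unseen word "manatee" is encoded with the OOV index instead of being dropped, so the encoded sentence keeps its original length.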
from tensorflow.keras.preprocessing.sequence import pad_sequences
padded = pad_sequences(sequences)

By default, leading zeros are added so that every sequence is as long as the longest one.

padded = pad_sequences(sequences, padding='post', truncating='post', maxlen=5)


  • If you only want your sentences to have a maximum of five words, you can say maxlen=5
  • Sentences longer than maxlen lose information from the beginning by default
  • If you want to lose from the end instead, you can do so with the truncating parameter
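The padding and truncating rules above can be written out for a single sequence in plain Python. This is an illustrative re-implementation under the defaults described in these notes ('pre' for both padding and truncating); the function `pad_sequence` is hypothetical and only mimics the behaviour of pad_sequences for one list.

```python
def pad_sequence(seq, maxlen, padding='pre', truncating='pre', value=0):
    """Pad or truncate one sequence to exactly maxlen items."""
    seq = list(seq)
    if len(seq) > maxlen:
        # 'pre' drops items from the beginning, 'post' drops them from the end
        seq = seq[-maxlen:] if truncating == 'pre' else seq[:maxlen]
    pad = [value] * (maxlen - len(seq))
    return pad + seq if padding == 'pre' else seq + pad

short = [1, 2, 3]
long = [1, 2, 3, 4, 5, 6, 7]

print(pad_sequence(short, 5))                                    # zeros in front
print(pad_sequence(short, 5, padding='post'))                    # zeros at the end
print(pad_sequence(long, 5))                                     # loses the beginning
print(pad_sequence(long, 5, padding='post', truncating='post'))  # loses the end
```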

C3W2: Word Embeddings


Why subwords work poorly

Not only do the meanings of the words matter, but also the sequence in which they are found.
Subwords on their own are meaningless, and our neural network does not take the order of the words into account.
This is where RNNs come into play.


Check TF version

import tensorflow as tf
print(tf.__version__)


  • Use Python 3
  • If the TensorFlow version is 1.x, call tf.enable_eager_execution(); eager execution is the default in TensorFlow 2.x

Download imdb_reviews via tensorflow-datasets

!pip install -q tensorflow-datasets
import tensorflow_datasets as tfds
imdb, info = tfds.load("imdb_reviews", with_info=True, as_supervised=True)
train_data, test_data = imdb['train'], imdb['test']


Prepare dataset

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

vocab_size = 10000
embedding_dim = 16
max_length = 120
trunc_type = 'post'
padding_type = 'post'
oov_tok = '<OOV>'

# train_sentences is a list of strings
tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(train_sentences)  # build the word index from the training text only
word_index = tokenizer.word_index
train_sequences = tokenizer.texts_to_sequences(train_sentences)
# padding arguments reconstructed from the settings defined above
train_padded = pad_sequences(train_sequences, padding=padding_type,
                             truncating=trunc_type, maxlen=max_length)
# validation_sentences is a list of strings
validation_sequences = tokenizer.texts_to_sequences(validation_sentences)
validation_padded = pad_sequences(validation_sequences, padding=padding_type,
                                  truncating=trunc_type, maxlen=max_length)
# train_labels and validation_labels are lists of strings
label_tokenizer = Tokenizer()
label_tokenizer.fit_on_texts(train_labels)
training_label_seq = np.array(label_tokenizer.texts_to_sequences(train_labels))
validation_label_seq = np.array(label_tokenizer.texts_to_sequences(validation_labels))


  • The number of unique labels is always very small, so there is no need to set num_words and oov_token
  • Once the labels are parsed into a list, the list must be converted into a NumPy array, which is required by the tf.keras APIs used below

Train word embedding label

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    # tf.keras.layers.Flatten(),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(6, activation='softmax'),
])

model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

num_epochs = 30
history = model.fit(train_padded, training_label_seq,
                    epochs=num_epochs,
                    validation_data=(validation_padded, validation_label_seq))


  • Flatten(): more parameters => more accurate
  • GlobalAveragePooling1D: fewer parameters => less accurate but still good
  • GlobalAveragePooling1D averages across the sequence of vectors to flatten it out
  • Check out the model summaries below
With Flatten:
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 120, 16)           160000    
flatten (Flatten)            (None, 1920)              0         
dense (Dense)                (None, 6)                 11526     
dense_1 (Dense)              (None, 1)                 7         
Total params: 171,533
Trainable params: 171,533
Non-trainable params: 0

With GlobalAveragePooling1D:
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 120, 16)           160000    
global_average_pooling1d (Gl (None, 16)                0         
dense_2 (Dense)              (None, 6)                 102       
dense_3 (Dense)              (None, 1)                 7         
Total params: 160,109
Trainable params: 160,109
Non-trainable params: 0

As shown in the figure above, here is how this network works:

  1. Each word in an input sequence is treated as a one-hot encoded vector, which is why the Embedding layer takes vocab_size as a parameter.
  2. Each one-hot vector passes through the same embedding layer and is transformed into a 16-dim vector. For a sequence, we have 120 such vectors.
  3. Instead of flattening these 120 vectors, we take their average, so the output is still a 16-dim vector.
  4. The following 2 dense layers are straightforward.

Global Average Pooling (GAP) is generally a better choice than a Flatten layer in the structure above, because it needs fewer weights, which provides some regularization and also accelerates training.
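The steps above, and the parameter counts in the two summaries, can be reproduced with a small NumPy sketch. It assumes a randomly initialized embedding matrix and a random token sequence purely for illustration; the variable names are chosen here and the real layer learns its matrix during training.

```python
import numpy as np

vocab_size, embedding_dim, max_length = 10000, 16, 120
rng = np.random.default_rng(0)

# Stand-in for the learned embedding weights
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))
sequence = rng.integers(0, vocab_size, size=max_length)

# An embedding lookup is a row selection ...
embedded = embedding_matrix[sequence]            # shape (120, 16)

# ... which equals multiplying a one-hot vector by the matrix
one_hot = np.zeros(vocab_size)
one_hot[sequence[0]] = 1.0
same_as_lookup = np.allclose(one_hot @ embedding_matrix, embedded[0])

flattened = embedded.reshape(-1)                 # Flatten: 120 * 16 values
averaged = embedded.mean(axis=0)                 # GAP: average over the 120 positions

# Parameters of the following Dense(6) layer: units * (inputs + 1 bias)
dense_units = 6
flatten_params = dense_units * (flattened.size + 1)
gap_params = dense_units * (averaged.size + 1)
print(flattened.size, averaged.size, flatten_params, gap_params)
```

The Dense(6) layer needs 11,526 parameters after Flatten but only 102 after averaging, which matches the two summaries.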

Word embedding visualization

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_sentence(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

e = model.layers[0]
weights = e.get_weights()[0]
print(weights.shape) # shape: (vocab_size, embedding_dim)

import io

out_v = io.open('vecs.tsv', 'w', encoding='utf-8')
out_m = io.open('meta.tsv', 'w', encoding='utf-8')
# Index 0 is reserved for padding, so start from 1
for word_num in range(1, vocab_size):
  word = reverse_word_index[word_num]
  embeddings = weights[word_num]
  out_m.write(word + "\n")
  out_v.write('\t'.join([str(x) for x in embeddings]) + "\n")
# Close the files so that everything is flushed to disk
out_v.close()
out_m.close()


C3W3: Sequence models


  • In terms of loss and accuracy curves, the 2-layer LSTM is smoother.
  • LSTM is more likely to overfit than the Flatten and averaging layers.
  • This week we tried bidirectional LSTM, bidirectional GRU and Conv1D models. All of them overfit. One natural cause is that there are words which are out of vocabulary: they cannot be learned during training, which leads to over-fitting.

Model comparison

IMDB Subwords 8K

Training takes too long to run in Colab, so no plots.

# Single bidirectional LSTM layer
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
Model: "sequential"
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 64)          523840    
bidirectional (Bidirectional (None, 128)               66048     
dense (Dense)                (None, 64)                8256      
dense_1 (Dense)              (None, 1)                 65        
Total params: 598,209
Trainable params: 598,209
Non-trainable params: 0
# Two stacked bidirectional LSTM layers
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
Model: "sequential"
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 64)          523840    
bidirectional (Bidirectional (None, None, 128)         66048     
bidirectional_1 (Bidirection (None, 64)                41216     
dense (Dense)                (None, 64)                4160      
dense_1 (Dense)              (None, 1)                 65        
Total params: 635,329
Trainable params: 635,329
Non-trainable params: 0
# 1D convolution followed by global average pooling
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(tokenizer.vocab_size, 64),
    tf.keras.layers.Conv1D(128, 5, activation='relu'),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
Model: "sequential"
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, None, 64)          523840    
conv1d (Conv1D)              (None, None, 128)         41088     
global_average_pooling1d (Gl (None, 128)               0         
dense (Dense)                (None, 64)                8256      
dense_1 (Dense)              (None, 1)                 65        
Total params: 573,249
Trainable params: 573,249
Non-trainable params: 0


# Bidirectional LSTM
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
Model: "sequential"
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 120, 16)           16000     
bidirectional (Bidirectional (None, 64)                12544     
dense (Dense)                (None, 24)                1560      
dense_1 (Dense)              (None, 1)                 25        
Total params: 30,129
Trainable params: 30,129
Non-trainable params: 0
# 1D convolution followed by global max pooling
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.Conv1D(128, 5, activation='relu'),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(24, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
Model: "sequential"
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 120, 16)           16000     
conv1d (Conv1D)              (None, 116, 128)          10368     
global_max_pooling1d (Global (None, 128)               0         
dense (Dense)                (None, 24)                3096      
dense_1 (Dense)              (None, 1)                 25        
Total params: 29,489
Trainable params: 29,489
Non-trainable params: 0
                  Bidirectional LSTM    1D Convolutional Layer
Time per epoch    85s                   3s


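model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size+1, embedding_dim, input_length=max_length, weights=[embeddings_matrix], trainable=False),
    tf.keras.layers.Dropout(0.2),  # rate is an assumed value, not given in these notes
    tf.keras.layers.Conv1D(64, 5, activation='relu'),
    tf.keras.layers.MaxPooling1D(pool_size=4),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

num_epochs = 50
# training_labels is an assumed name for the training targets
history = model.fit(training_sequences, training_labels,
                    epochs=num_epochs,
                    validation_data=(test_sequences, test_labels))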
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 16, 100)           13802600  
dropout (Dropout)            (None, 16, 100)           0         
conv1d (Conv1D)              (None, 12, 64)            32064     
max_pooling1d (MaxPooling1D) (None, 3, 64)             0         
lstm (LSTM)                  (None, 64)                33024     
dense (Dense)                (None, 1)                 65        
Total params: 13,867,753
Trainable params: 65,153
Non-trainable params: 13,802,600

Applying regularization techniques like dropout can reduce overfitting. We can see from the figures below that the validation loss no longer increases sharply!

Figure panels: Without Dropout, With Dropout

C3W4: Sequence models and literature


When you have a very large body of text with many unique words, word-based prediction does not work well. The number of unique words in the collection is very large, and the algorithm generates millions of sequences, each with a one-hot label as wide as the vocabulary. The labels alone would require many terabytes of RAM.

A better approach is character-based prediction. The number of unique characters in a corpus is far smaller than the number of unique words, at least in English, so the labels stay small. The same principles used to predict words apply to predicting characters.
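To get a feel for the sizes involved, here is a rough back-of-the-envelope sketch. The sequence count and vocabulary sizes are hypothetical numbers picked only to show the orders of magnitude, and the labels are assumed to be stored as 4-byte floats.

```python
def one_hot_label_bytes(num_sequences, num_classes, bytes_per_value=4):
    """Memory needed to store one one-hot label vector per training sequence."""
    return num_sequences * num_classes * bytes_per_value

# Hypothetical sizes, chosen only for illustration
num_sequences = 1_000_000
word_vocab_size = 100_000   # assumed number of unique words
char_vocab_size = 100       # assumed number of unique characters

word_bytes = one_hot_label_bytes(num_sequences, word_vocab_size)
char_bytes = one_hot_label_bytes(num_sequences, char_vocab_size)
print(f"word-level labels: {word_bytes / 1e9:.1f} GB")
print(f"char-level labels: {char_bytes / 1e9:.1f} GB")
```

With the same number of sequences, the label storage shrinks by the ratio of the two vocabulary sizes.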


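import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

corpus = data.lower().split("\n")

# fit the tokenizer on the corpus to build the word index
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1 # add 1 because index 0 is reserved for padding

# create input sequences using list of tokens
input_sequences = []
for line in corpus:
	token_list = tokenizer.texts_to_sequences([line])[0]
	for i in range(1, len(token_list)):
		# every prefix of the line with at least two tokens becomes one sequence
		n_gram_sequence = token_list[:i+1]
		input_sequences.append(n_gram_sequence)

# pad sequences at the front so that the label is always the last token
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

# create predictors and label
predictors, label = input_sequences[:,:-1],input_sequences[:,-1]

# one-hot encode the label
label = tf.keras.utils.to_categorical(label, num_classes=total_words)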

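from tensorflow.keras import regularizers
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense
from tensorflow.keras.models import Sequential

model = Sequential()
# input_length: minus 1 since the last word is the label
model.add(Embedding(total_words, 100, input_length=max_sequence_len-1)) 
model.add(Bidirectional(LSTM(150, return_sequences = True)))
# Dropout and the second LSTM are restored from the model summary below;
# the dropout rate of 0.2 is an assumption, since the summary does not show it
model.add(Dropout(0.2))
model.add(LSTM(100))
# units must be an integer, hence the floor division
model.add(Dense(total_words//2, activation='relu', kernel_regularizer=regularizers.l2(0.01)))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

history = model.fit(predictors, label, epochs=100, verbose=1)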
Model: "sequential_1"
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 10, 100)           321100    
bidirectional (Bidirectional (None, 10, 300)           301200    
dropout (Dropout)            (None, 10, 300)           0         
lstm_1 (LSTM)                (None, 100)               160400    
dense (Dense)                (None, 1605)              162105    
dense_1 (Dense)              (None, 3211)              5156866   
Total params: 6,101,671
Trainable params: 6,101,671
Non-trainable params: 0
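
The way the training sequences for this model are built (every prefix of a line, padded at the front, last token used as the label) can be traced by hand. The sketch below uses plain Python lists instead of the Keras tokenizer and pad_sequences; the token ids and the helper names make_prefix_sequences and pad_pre are made up for illustration.

```python
def make_prefix_sequences(token_list):
    """Return every prefix of token_list that has at least two tokens."""
    return [token_list[:i + 1] for i in range(1, len(token_list))]

def pad_pre(sequences, pad_value=0):
    """Left-pad every sequence with pad_value to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [[pad_value] * (max_len - len(seq)) + seq for seq in sequences]

tokens = [4, 2, 66, 8]  # hypothetical token ids for one line of text
padded = pad_pre(make_prefix_sequences(tokens))

# all tokens but the last are the predictors, the last token is the label
predictors = [row[:-1] for row in padded]
labels = [row[-1] for row in padded]

for x, y in zip(predictors, labels):
    print(x, "->", y)
```

Padding at the front keeps the label in the last column of every row, which is why a single slice is enough to separate predictors from labels.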

C4W1: Sequences and Prediction


Imputation: fill in data from the past or fill in missing data
Trends: upward or downward
Seasonalities: repeated patterns
Autocorrelation: the series is correlated with a delayed copy of itself (the delay is called the lag)
Noise: random / occasional values
Combination of all the above
Non-stationary time series: the behavior changes over time, so the model should be trained on a limited time window
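
As a small illustration of how these components combine, here is a sketch that builds a synthetic daily series from a trend, a yearly seasonality and Gaussian noise. The function names, the sine-shaped season and all the constants are assumptions made for this example, not the course's own generator.

```python
import math
import random

def trend(time, slope=0.0):
    """Linear upward (slope > 0) or downward (slope < 0) trend."""
    return slope * time

def seasonality(time, period, amplitude=1.0):
    """A pattern that repeats exactly every `period` steps (a sine wave here)."""
    return amplitude * math.sin(2 * math.pi * (time % period) / period)

def noise(rng, level=1.0):
    """Random Gaussian noise with standard deviation `level`."""
    return rng.gauss(0.0, level)

rng = random.Random(42)  # fixed seed so the series is reproducible
baseline = 10.0
series = [
    baseline + trend(t, slope=0.05) + seasonality(t, period=365, amplitude=40.0) + noise(rng, 5.0)
    for t in range(4 * 365)
]
print(len(series), "time steps")
```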

Split training period, validation period, test period

  • Fixed partition:
    If the test period is the most recent data, it carries the strongest signal about the future, so the model should also be trained on it; otherwise the model may not be optimal. It is therefore quite common to use just a training period and a validation period for model training, with the test set lying in the future

  • Roll-forward partition:
    At each iteration, we train the model on a training period and use it to forecast the following day, or the following week, in the validation period. It can be seen as doing fixed partitioning a number of times, continually refining the model each time
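
Roll-forward partitioning can be sketched as a generator of index ranges. This is a minimal sketch assuming an expanding training period; the function name roll_forward_splits and its parameters are hypothetical.

```python
def roll_forward_splits(n_samples, initial_train_size, horizon):
    """Yield (train_indices, validation_indices) pairs.

    Each split trains on everything before `train_end` and validates on the
    next `horizon` steps, then the training period grows by `horizon`.
    """
    train_end = initial_train_size
    while train_end + horizon <= n_samples:
        yield range(0, train_end), range(train_end, train_end + horizon)
        train_end += horizon

for train_idx, valid_idx in roll_forward_splits(n_samples=10, initial_train_size=4, horizon=2):
    print("train:", list(train_idx), "validate:", list(valid_idx))
```

Every validation range lies strictly after its training range, so the model never sees the future it is evaluated on.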


mse = np.square(errors).mean()
mae = np.abs(errors).mean()

MSE penalizes large errors more than MAE does.
If large errors are potentially dangerous and cost you much more than smaller errors, then you may prefer the MSE. But if your gain or loss is just proportional to the size of the error, then the MAE may be better.
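
A tiny worked example makes the difference concrete. The two error lists below are made up: both have the same MAE, but the one with a single large error has a much higher MSE.

```python
def mse(errors):
    """Mean squared error."""
    return sum(e * e for e in errors) / len(errors)

def mae(errors):
    """Mean absolute error."""
    return sum(abs(e) for e in errors) / len(errors)

small_errors = [1.0, -1.0, 1.0, -1.0]   # four small misses
one_outlier = [0.0, 0.0, 0.0, 4.0]      # three perfect forecasts and one large miss

print("MAE:", mae(small_errors), mae(one_outlier))   # 1.0 1.0
print("MSE:", mse(small_errors), mse(one_outlier))   # 1.0 4.0
```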

Moving average and differencing

  1. Use differencing to cancel out the seasonality and trends
  2. Use a moving average to forecast the differenced time series
  3. Apply a moving average to the past time series to smooth it
  4. Add the smoothed difference back to the smoothed past time series
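
The four steps above can be sketched for a single forecast as follows. This is a simplified sketch in plain Python, not the course notebook's code: the function name, its parameters and the noise-free toy series (a linear trend plus a pattern of period 4) are all assumptions.

```python
def forecast_with_differencing(series, t, period, diff_window, past_window=1):
    """Forecast series[t] using only values before t.

    Assumes t - diff_window - period >= 0 and past_window // 2 < period.
    """
    # Steps 1-2: difference against one period earlier, then average
    # the last diff_window differences (a trailing moving average)
    diffs = [series[i] - series[i - period] for i in range(t - diff_window, t)]
    smoothed_diff = sum(diffs) / diff_window

    # Step 3: smooth the past series with a centered window around t - period
    half = past_window // 2
    past = series[t - period - half : t - period + half + 1]
    smoothed_past = sum(past) / len(past)

    # Step 4: add the smoothed difference back to the smoothed past value
    return smoothed_past + smoothed_diff

pattern = [0.0, 10.0, 0.0, -10.0]                       # seasonal pattern, period 4
series = [0.5 * t + pattern[t % 4] for t in range(12)]  # known values for t = 0..11

forecast = forecast_with_differencing(series, t=12, period=4, diff_window=4)
actual = 0.5 * 12 + pattern[12 % 4]
print(forecast, actual)
```

On this noise-free series the differencing removes the seasonality completely, so the forecast matches the true next value; on real data the moving averages are what keep the noise from being added back twice.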

Trailing windows and centered windows

Moving averages using centered windows can be more accurate than using trailing windows. But we can’t use centered windows to smooth present values since we don’t know future values. However, to smooth past values we can afford to use centered windows.
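
To see why a centered window can be more accurate, compare both kinds of window on a simple ramp. The helper names below are hypothetical; the point is only that a trailing average lags behind a trend while a centered one does not.

```python
def trailing_average(series, t, window_size):
    """Mean of the window_size values ending at t (inclusive); uses no future values."""
    window = series[t - window_size + 1 : t + 1]
    return sum(window) / len(window)

def centered_average(series, t, half_width):
    """Mean of the values from t - half_width to t + half_width; needs future values."""
    window = series[t - half_width : t + half_width + 1]
    return sum(window) / len(window)

series = [float(v) for v in range(10)]  # a linear ramp: series[t] == t

print(trailing_average(series, t=5, window_size=3))  # 4.0, lags behind the true value 5.0
print(centered_average(series, t=5, half_width=1))   # 5.0
```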


import numpy as np
from tensorflow import keras

def moving_average_forecast(series, window_size):
  """Forecasts the mean of the last few values.
     If window_size=1, then this is equivalent to naive forecast"""
  forecast = []
  for time in range(len(series) - window_size):
    forecast.append(series[time:time + window_size].mean())
  return np.array(forecast)

# x_valid and naive_forecast are assumed to be defined earlier
# (the validation series and its naive forecast)
print(keras.metrics.mean_squared_error(x_valid, naive_forecast).numpy())
print(keras.metrics.mean_absolute_error(x_valid, naive_forecast).numpy())

C4W2: Deep Neural Networks for Time Series


Preparing feature and labels

dataset = tf.data.Dataset.range(10)
dataset = dataset.window(5, shift=1, drop_remainder=True)
dataset = dataset.flat_map(lambda window: window.batch(5))
dataset = dataset.map(lambda window: (window[:-1], window[-1:]))
dataset = dataset.shuffle(buffer_size=10)
dataset = dataset.batch(2).prefetch(1)
for x,y in dataset:
  print("x = ", x.numpy())
  print("y = ", y.numpy())
  • On line 3, each window is an instance of class tensorflow.python.data.ops.dataset_ops._VariantDataset containing 5 elements. We need to convert it into a tensor, so we batch it by 5 elements. This is why we have window.batch(5)
  • On line 5, shuffle fills a buffer with buffer_size elements, then randomly samples elements from this buffer, replacing the selected elements with new elements. For perfect shuffling, a buffer size greater than or equal to the full size of the dataset is required; the downside is that it takes a long time. If you don’t care about perfect shuffling, choosing a small buffer size will speed things up. You can even set buffer_size to 1, in which case no shuffling happens
  • On line 6, according to the tensorflow doc:
    The tf.data API provides a software pipelining mechanism through the tf.data.Dataset.prefetch transformation, which can be used to decouple the time data is produced from the time it is consumed. In particular, the transformation uses a background thread and an internal buffer to prefetch elements from the input dataset ahead of the time they are requested. Thus, to achieve the pipelining effect illustrated above, you can add prefetch(1) as the final transformation to your dataset pipeline (or prefetch(n) if a single training step consumes n elements).
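
To see what the windowing and the feature/label split produce before shuffling and batching, here is a plain-Python sketch that mimics those steps on the same range of 10 values. It does not use tf.data; make_windows is a hypothetical helper that imitates window(5, shift=1, drop_remainder=True).

```python
def make_windows(values, size, shift=1):
    """Return all full windows of `size` values, moving `shift` steps each time."""
    return [values[i:i + size] for i in range(0, len(values) - size + 1, shift)]

windows = make_windows(list(range(10)), size=5)

# all values but the last are the features, the last value is the label
pairs = [(window[:-1], window[-1:]) for window in windows]

for x, y in pairs:
    print("x = ", x, "y = ", y)
```

Only full windows are kept, so 10 values yield 6 windows of 5 elements each.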

Sequence Bias

Sequence bias is when the order of things can impact the selection of things. For example, if I were to ask you your favorite TV show, and listed “Game of Thrones”, “Killing Eve”, “Travellers” and “Doctor Who” in that order, you’re probably more likely to select “Game of Thrones” as you are familiar with it and it’s the first thing you see, even if you like it no more than the other TV shows. When training on a dataset, we don’t want the order of the examples to impact the training in a similar way, so it’s good to shuffle them.

Find the best learning rate

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf

# multiply the learning rate by 10 every 20 epochs, starting from 1e-8
lr_schedule = tf.keras.callbacks.LearningRateScheduler(
    lambda epoch: 1e-8 * 10**(epoch / 20))
optimizer = tf.keras.optimizers.SGD(learning_rate=1e-8, momentum=0.9)
model.compile(loss="mse", optimizer=optimizer)
history = model.fit(dataset, epochs=100, callbacks=[lr_schedule], verbose=0)
# plot the loss per epoch against the learning rate per epoch
lrs = 1e-8 * (10 ** (np.arange(100) / 20))
plt.semilogx(lrs, history.history["loss"])
plt.axis([1e-8, 1e-3, 0, 300])

Here, the best learning rate is around 7e-6, because it is the lowest point of the curve where it’s still relatively stable.
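
The schedule itself is easy to inspect without training anything. The sketch below recomputes the learning rate used at a given epoch and, going the other way, the epoch at which a given learning rate is reached; the helper names are made up for this note.

```python
import math

def scheduled_lr(epoch, initial_lr=1e-8, epochs_per_decade=20):
    """Learning rate at `epoch`: it grows by a factor of 10 every 20 epochs."""
    return initial_lr * 10 ** (epoch / epochs_per_decade)

def epoch_for_lr(lr, initial_lr=1e-8, epochs_per_decade=20):
    """Inverse of scheduled_lr: the (fractional) epoch at which `lr` is reached."""
    return epochs_per_decade * math.log10(lr / initial_lr)

for epoch in (0, 20, 40, 60, 80, 100):
    print(epoch, f"{scheduled_lr(epoch):.1e}")

print(f"7e-6 is reached near epoch {epoch_for_lr(7e-6):.1f}")
```

So the 100 epochs sweep the learning rate over five orders of magnitude, from 1e-8 to about 1e-3, which is the range shown on the x axis of the plot.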

C4W3: Recurrent Neural Networks for Time Series


For a numeric series, values closer to the target in the series might have a greater impact on it than those further away.

In some cases, you might want to input a sequence, but you don’t want to output one; you just want a single vector for each instance in the batch. This is typically called a sequence-to-vector RNN. In reality, all you do is ignore all of the outputs except the last one. When using Keras in TensorFlow, this is the default behavior.

If you want the recurrent layer to output a sequence, you have to specify return_sequences=True when creating the layer. You’ll need to do this when you stack one RNN layer on top of another.
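
The difference between the two modes can be shown with a toy recurrent cell in plain Python. This is only a sketch of the idea, with a single scalar state and made-up weights, not how Keras implements its RNN layers.

```python
import math

def simple_rnn(inputs, w_x=0.5, w_h=0.8, return_sequences=False):
    """Scalar recurrent cell: h_t = tanh(w_x * x_t + w_h * h_{t-1}), with h_0 = 0.

    The weights w_x and w_h are arbitrary values chosen for illustration.
    """
    h = 0.0
    outputs = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        outputs.append(h)
    # sequence-to-vector keeps only the last output, sequence-to-sequence keeps all
    return outputs if return_sequences else outputs[-1]

sequence = [1.0, 2.0, 3.0]
print(simple_rnn(sequence))                         # a single value
print(simple_rnn(sequence, return_sequences=True))  # one value per time step
```

A second recurrent layer stacked on top needs one input per time step, which is why the layer below it has to return the whole sequence.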

[Huber loss](https://en.wikipedia.org/wiki/Huber_loss)
The Huber function is a loss function that’s less sensitive to outliers and as this data can get a little bit noisy, it’s worth giving it a shot.
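
For a single error value, the Huber loss is quadratic for small errors and linear for large ones. Here is a minimal sketch of the standard definition; delta=1.0 is chosen here as the threshold between the two regimes.

```python
def huber(error, delta=1.0):
    """Huber loss of one error value: quadratic up to delta, linear beyond it."""
    abs_error = abs(error)
    if abs_error <= delta:
        return 0.5 * error ** 2
    return delta * (abs_error - 0.5 * delta)

for e in (0.5, 1.0, 3.0):
    print(e, huber(e), 0.5 * e ** 2)  # error, Huber loss, half squared error
```

For an error of 3.0 the Huber loss is 2.5 while half the squared error is 4.5, so a single outlier pulls on the model much less.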


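dataset = windowed_dataset(x_train, window_size, batch_size, shuffle_buffer_size)

model = tf.keras.models.Sequential([
  # add a feature dimension: (batch, time) -> (batch, time, 1)
  tf.keras.layers.Lambda(lambda x: tf.expand_dims(x, axis=-1)),
  tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32, return_sequences=True)),
  # The next two layers are assumed, since the original listing is truncated here:
  # a second recurrent layer that returns only its last output, then one forecast value
  tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
  tf.keras.layers.Dense(1),
  tf.keras.layers.Lambda(lambda x: x * 100.0)
])

model.compile(loss="mse", optimizer=tf.keras.optimizers.SGD(learning_rate=1e-5, momentum=0.9),metrics=["mae"])
history = model.fit(dataset,epochs=500,verbose=1)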

The last Lambda layer scales up the outputs by 100, which helps training. The default activation function in the RNN layers is tanh, the hyperbolic tangent, which outputs values between -1 and 1. Since the time series values are usually in the tens, such as the 40s, 50s, 60s, and 70s, scaling up the outputs to the same ballpark can help with learning.

C4W4: Real-world time series data


model = tf.keras.models.Sequential([
  tf.keras.layers.Conv1D(filters=32, kernel_size=5,
                      strides=1, padding="causal",
                      input_shape=[None, 1]),
  tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32, return_sequences=True)),
  tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32, return_sequences=True)),
  # Dense(1) is assumed (the original listing is truncated): one forecast value per time step
  tf.keras.layers.Dense(1),
  tf.keras.layers.Lambda(lambda x: x * 200)
])

Setting padding="causal" simply pads the layer’s input with zeros at the front, so that we can also predict the values of early time steps in the window.

A good explanation is [here](https://theblog.github.io/post/convolution-in-autoregressive-neural-networks/).
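
Causal padding can be sketched for a single filter in plain Python. This is an illustration of the idea, not the Keras implementation; the function name and the toy kernel are made up.

```python
def causal_conv1d(series, kernel):
    """1D convolution with causal padding.

    The input is padded with len(kernel) - 1 zeros at the front, so the output
    has the same length as the input and output[t] depends only on series[:t + 1].
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(series)
    return [
        sum(w * x for w, x in zip(kernel, padded[t:t + k]))
        for t in range(len(series))
    ]

series = [1.0, 2.0, 3.0, 4.0]
kernel = [1.0, 1.0, 1.0]  # sums the current value and the two before it

print(causal_conv1d(series, kernel))  # [1.0, 3.0, 6.0, 9.0]
```

The first outputs are computed from the zero padding plus the earliest values, which is what makes predictions for early time steps possible without looking at future values.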
