First, note that video classification is better suited to a convolutional RNN than to an LSTM or any other plain RNN cell, just as a CNN is better suited to image classification than an MLP.

Plain RNN cells (e.g. LSTM, GRU) expect inputs with shape (samples, timesteps, channels). Since your inputs have shape (samples, timesteps, width, height, channels), you should use tf.keras.layers.ConvLSTM2D instead.

The following example code shows how to build a model for your video classification task:
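
import tensorflow as tf
from tensorflow.keras import models, layers

timesteps = 60
width = 192
height = 192
channels = 1
action_num = 5  # number of action classes

model = models.Sequential(
    [
        layers.Input(shape=(timesteps, width, height, channels)),
        # Blocks 1 and 2 keep the time axis (return_sequences=True) and pool only spatially
        layers.ConvLSTM2D(filters=64, kernel_size=(3, 3), padding="same", return_sequences=True, dropout=0.1, recurrent_dropout=0.1),
        layers.MaxPooling3D(pool_size=(1, 2, 2), strides=(1, 2, 2), padding="same"),
        layers.BatchNormalization(),
        layers.ConvLSTM2D(filters=32, kernel_size=(3, 3), padding="same", return_sequences=True, dropout=0.1, recurrent_dropout=0.1),
        layers.MaxPooling3D(pool_size=(1, 2, 2), strides=(1, 2, 2), padding="same"),
        layers.BatchNormalization(),
        # Block 3 returns only the last timestep, so the output becomes a 2D feature map
        layers.ConvLSTM2D(filters=16, kernel_size=(3, 3), padding="same", return_sequences=False, dropout=0.1, recurrent_dropout=0.1),
        layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2), padding="same"),
        layers.BatchNormalization(),
        layers.Flatten(),
        layers.Dense(256, activation='relu'),
        layers.Dense(action_num, activation='softmax')
    ]
)
model.summary()  # prints the layer table shown below
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
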
Model: "sequential"
_________________________________________________________________
Layer (type)                   Output Shape               Param #
=================================================================
conv_lst_m2d (ConvLSTM2D)      (None, 60, 192, 192, 64)   150016
max_pooling3d (MaxPooling3D)   (None, 60, 96, 96, 64)     0
batch_normalization (BatchNo   (None, 60, 96, 96, 64)     256
conv_lst_m2d_1 (ConvLSTM2D)    (None, 60, 96, 96, 32)     110720
max_pooling3d_1 (MaxPooling3   (None, 60, 48, 48, 32)     0
batch_normalization_1 (Batch   (None, 60, 48, 48, 32)     128
conv_lst_m2d_2 (ConvLSTM2D)    (None, 48, 48, 16)         27712
max_pooling2d (MaxPooling2D)   (None, 24, 24, 16)         0
batch_normalization_2 (Batch   (None, 24, 24, 16)         64
flatten (Flatten)              (None, 9216)               0
dense (Dense)                  (None, 256)                2359552
dense_1 (Dense)                (None, 5)                  1285
=================================================================
Total params: 2,649,733
Trainable params: 2,649,509
Non-trainable params: 224

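If you want to sanity-check the numbers in the summary, the arithmetic can be reproduced by hand. The sketch below is only an illustration: the helper names convlstm2d_params and pooled_size are made up for this example (they are not Keras functions), and it assumes the standard ConvLSTM2D layout of four gates, each with an input kernel, a recurrent kernel and a bias.

```python
import math


def convlstm2d_params(in_channels, filters, kernel=(3, 3)):
    """Hypothetical helper: weights of one ConvLSTM2D layer.

    Four gates, each with a kernel over (input + recurrent) channels plus one bias per filter.
    """
    kh, kw = kernel
    return 4 * filters * ((in_channels + filters) * kh * kw + 1)


def pooled_size(size, stride=2):
    """Hypothetical helper: spatial size after pooling with padding="same" (rounds up)."""
    return math.ceil(size / stride)


# Three ConvLSTM2D layers: 1 -> 64 -> 32 -> 16 channels
conv_params = [
    convlstm2d_params(1, 64),
    convlstm2d_params(64, 32),
    convlstm2d_params(32, 16),
]
print(conv_params)  # [150016, 110720, 27712]

# Three 2x2 spatial poolings: 192 -> 96 -> 48 -> 24
side = 192
for _ in range(3):
    side = pooled_size(side)
flat_units = side * side * 16  # what Flatten produces
print(side, flat_units)  # 24 9216

# Dense layers: weights + biases
dense_params = flat_units * 256 + 256
output_params = 256 * 5 + 5

# BatchNormalization: 4 values per channel, half of them non-trainable
bn_params = [4 * c for c in (64, 32, 16)]

total = sum(conv_params) + sum(bn_params) + dense_params + output_params
non_trainable = sum(bn_params) // 2
print(total, non_trainable)  # 2649733 224
```

Most of the parameters sit in the first Dense layer, so if the model is too large, reducing the flattened size (more pooling or fewer filters in the last block) is the first place to look.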
Note that you must reorder your data to the shape (samples, timesteps, width, height, channels) before feeding it to the model above. Do this by moving axes (e.g. np.moveaxis), not with np.reshape, which only reinterprets the flat buffer and would scramble the frames. In your case the shape should be (120, 60, 192, 192, 1). You can then split your 120 videos into batches and feed them to the model.
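
To make the difference between moving axes and reshaping concrete, here is a minimal sketch. The source layout (samples, width, height, timesteps) is purely an assumption for illustration, since I don't know how your frames are actually stored, and the sizes are kept tiny so it runs instantly; adapt the axis indices to your real layout.

```python
import numpy as np

# Assumed (hypothetical) source layout: (samples, width, height, timesteps).
# Tiny sizes for the demo; your real data would use 120, 192, 192 and 60.
samples, width, height, timesteps = 4, 6, 5, 3
raw = np.arange(samples * width * height * timesteps, dtype=np.float32)
raw = raw.reshape(samples, width, height, timesteps)

# Correct: move the time axis to position 1, then append a channel axis of size 1.
videos = np.moveaxis(raw, -1, 1)[..., np.newaxis]
print(videos.shape)  # (4, 3, 6, 5, 1)

# Frame t of sample s is still the same picture as in the source array.
assert np.array_equal(videos[1, 2, :, :, 0], raw[1, :, :, 2])

# Wrong: reshape produces the same shape but mixes pixels from different frames.
scrambled = raw.reshape(samples, timesteps, width, height, 1)
print(np.array_equal(scrambled, videos))  # False

# Split the videos into batches by slicing along the samples axis.
batch_size = 2
batches = [videos[i:i + batch_size] for i in range(0, len(videos), batch_size)]
print(len(batches), batches[0].shape)  # 2 (2, 3, 6, 5, 1)
```

In practice you usually don't need to split manually: passing the whole array to model.fit with a batch_size argument lets Keras do the batching for you. With 60 frames of 192x192 per video, a small batch size is probably necessary to fit in memory.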