First you should know, method of solving video classification task is better suit for Convolutional RNN than LSTM or any RNN Cell, just as CNN is better suit for image classification task than MLP
Those RNN cell (e.g LSTM, GRU) is expect inputs with shape (samples, timesteps, channels)
, since you are deal inputs with shape (samples, timesteps, width, height, channels)
, so you should using tf.keras.layers.ConvLSTM2D instead
Following example code will show you how to build a model that can deal your video classification task:
import tensorflow as tf
from tensorflow.keras import models, layers
timesteps = 60
width = 192
height = 192
channels = 1
action_num = 5
model = models.Sequential(
[
layers.Input(
shape=(timesteps, width, height, channels)
),
layers.ConvLSTM2D(
filters=64, kernel_size=(3, 3), padding="same", return_sequences=True, dropout=0.1, recurrent_dropout=0.1
),
layers.MaxPool3D(
pool_size=(1, 2, 2), strides=(1, 2, 2), padding="same"
),
layers.BatchNormalization(),
layers.ConvLSTM2D(
filters=32, kernel_size=(3, 3), padding="same", return_sequences=True, dropout=0.1, recurrent_dropout=0.1
),
layers.MaxPool3D(
pool_size=(1, 2, 2), strides=(1, 2, 2), padding="same"
),
layers.BatchNormalization(),
layers.ConvLSTM2D(
filters=16, kernel_size=(3, 3), padding="same", return_sequences=False, dropout=0.1, recurrent_dropout=0.1
),
layers.MaxPool2D(
pool_size=(2, 2), strides=(2, 2), padding="same"
),
layers.BatchNormalization(),
layers.Flatten(),
layers.Dense(256, activation='relu'),
layers.Dense(action_num, activation='softmax')
]
)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
Outputs:
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
conv_lst_m2d (ConvLSTM2D) (None, 60, 192, 192, 64) 150016
_________________________________________________________________
max_pooling3d (MaxPooling3D) (None, 60, 96, 96, 64) 0
_________________________________________________________________
batch_normalization (BatchNo (None, 60, 96, 96, 64) 256
_________________________________________________________________
conv_lst_m2d_1 (ConvLSTM2D) (None, 60, 96, 96, 32) 110720
_________________________________________________________________
max_pooling3d_1 (MaxPooling3 (None, 60, 48, 48, 32) 0
_________________________________________________________________
batch_normalization_1 (Batch (None, 60, 48, 48, 32) 128
_________________________________________________________________
conv_lst_m2d_2 (ConvLSTM2D) (None, 48, 48, 16) 27712
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 24, 24, 16) 0
_________________________________________________________________
batch_normalization_2 (Batch (None, 24, 24, 16) 64
_________________________________________________________________
flatten (Flatten) (None, 9216) 0
_________________________________________________________________
dense (Dense) (None, 256) 2359552
_________________________________________________________________
dense_1 (Dense) (None, 5) 1285
=================================================================
Total params: 2,649,733
Trainable params: 2,649,509
Non-trainable params: 224
_________________________________________________________________
Beware you should reorder your data to the shape (samples, timesteps, width, height, channels)
before feed in above model (i.e not like np.reshape
, but like np.moveaxis
), in your case the shape should be (120, 60, 192, 192, 1)
, then you can split your 120
video to batchs and feed to model