keras - LSTM for Video Input

I am a newbie trying out LSTMs.

I am using an LSTM to determine the action type (5 different actions, such as running, dancing, etc.). My input is 60 frames per action, and there are roughly 120 such videos.

train_x.shape = (120, 192, 192, 60)

where 120 is the number of sample videos for training, 192x192 is the frame size, and 60 is the number of frames per video.

train_y.shape = (120, 5), one-hot encoded: [1 0 0 0 0], ..., [0 0 0 0 1]
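
For reference, a one-hot label array like this can be produced from integer class labels with tf.keras.utils.to_categorical; a minimal sketch with made-up labels:

import numpy as np
import tensorflow as tf

labels = np.random.randint(0, 5, size=120)  # made-up integer action labels
train_y = tf.keras.utils.to_categorical(labels, num_classes=5)
print(train_y.shape)  # (120, 5)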

I am not clear on how to pass 3D parameters (timesteps and features) to the LSTM:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

model = Sequential()
model.add(LSTM(100, input_shape=(train_x.shape[1], train_x.shape[2])))
model.add(Dropout(0.5))
model.add(Dense(100, activation='relu'))
model.add(Dense(len(uniquesegments), activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(train_x, train_y, epochs=100, batch_size=batch_size, verbose=1)

I get the following error:

Input 0 of layer sequential is incompatible with the layer: expected ndim=3, found ndim=4. Full shape received: (None, 192, 192, 60)

The training data is built roughly like this (the per-frame logic is elided):

import numpy as np

training_list = []
for video in videos:                    # 120 videos
    frames = []
    for frame in video:                 # 60 frames per video
        # ... per-frame logic ...
        frames.append(frame)
    frames = np.array(frames)           # (60, 192, 192)
    frames = np.rollaxis(frames, 0, 3)  # (60, 192, 192) -> (192, 192, 60)
    training_list.append(frames)
train_x = np.array(training_list)       # final shape: (120, 192, 192, 60)

Question from: https://stackoverflow.com/questions/65879627/lstm-for-video-input


1 Answer


First, you should know that video classification is better suited to a convolutional RNN than to an LSTM or any other plain RNN cell, just as a CNN is better suited to image classification than an MLP.

RNN cells (e.g. LSTM, GRU) expect inputs of shape (samples, timesteps, features). Since you are dealing with inputs of shape (samples, timesteps, width, height, channels), you should use tf.keras.layers.ConvLSTM2D instead.
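
For comparison, a plain LSTM would require flattening each 192x192 frame into a single feature vector first, which throws away the spatial structure. A minimal sketch of that baseline (layer sizes here are illustrative):

import tensorflow as tf
from tensorflow.keras import models, layers

# each frame flattened to a 192*192 feature vector per timestep
lstm_model = models.Sequential([
    layers.Input(shape=(60, 192, 192, 1)),
    layers.Reshape((60, 192 * 192)),  # (timesteps, features)
    layers.LSTM(100),
    layers.Dense(5, activation='softmax'),
])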

The following example code shows how to build a model for your video classification task:

import tensorflow as tf
from tensorflow.keras import models, layers

timesteps = 60   # frames per video
width = 192
height = 192
channels = 1     # grayscale frames
action_num = 5   # number of action classes

model = models.Sequential(
    [
        layers.Input(
            shape=(timesteps, width, height, channels)
        ),
        layers.ConvLSTM2D(
            filters=64, kernel_size=(3, 3), padding="same", return_sequences=True, dropout=0.1, recurrent_dropout=0.1
        ),
        layers.MaxPool3D(
            pool_size=(1, 2, 2), strides=(1, 2, 2), padding="same"
        ),
        layers.BatchNormalization(),
        layers.ConvLSTM2D(
            filters=32, kernel_size=(3, 3), padding="same", return_sequences=True, dropout=0.1, recurrent_dropout=0.1
        ),
        layers.MaxPool3D(
            pool_size=(1, 2, 2), strides=(1, 2, 2), padding="same"
        ),
        layers.BatchNormalization(),
        layers.ConvLSTM2D(
            filters=16, kernel_size=(3, 3), padding="same", return_sequences=False, dropout=0.1, recurrent_dropout=0.1
        ),
        layers.MaxPool2D(
            pool_size=(2, 2), strides=(2, 2), padding="same"
        ),
        layers.BatchNormalization(),
        layers.Flatten(),
        layers.Dense(256, activation='relu'),
        layers.Dense(action_num, activation='softmax')
    ]
)

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Outputs:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv_lst_m2d (ConvLSTM2D)    (None, 60, 192, 192, 64)  150016    
_________________________________________________________________
max_pooling3d (MaxPooling3D) (None, 60, 96, 96, 64)    0         
_________________________________________________________________
batch_normalization (BatchNo (None, 60, 96, 96, 64)    256       
_________________________________________________________________
conv_lst_m2d_1 (ConvLSTM2D)  (None, 60, 96, 96, 32)    110720    
_________________________________________________________________
max_pooling3d_1 (MaxPooling3 (None, 60, 48, 48, 32)    0         
_________________________________________________________________
batch_normalization_1 (Batch (None, 60, 48, 48, 32)    128       
_________________________________________________________________
conv_lst_m2d_2 (ConvLSTM2D)  (None, 48, 48, 16)        27712     
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 24, 24, 16)        0         
_________________________________________________________________
batch_normalization_2 (Batch (None, 24, 24, 16)        64        
_________________________________________________________________
flatten (Flatten)            (None, 9216)              0         
_________________________________________________________________
dense (Dense)                (None, 256)               2359552   
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 1285      
=================================================================
Total params: 2,649,733
Trainable params: 2,649,509
Non-trainable params: 224
_________________________________________________________________

Note that you must reorder your data to the shape (samples, timesteps, width, height, channels) before feeding it into the model above (i.e. by moving axes, as with np.moveaxis, not by reshaping with np.reshape). In your case the target shape is (120, 60, 192, 192, 1); you can then split your 120 videos into batches and feed them to the model.
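
A minimal sketch of that reordering (the arrays here are random stand-ins for your real data, and the batch size is illustrative):

import numpy as np

train_x = np.random.rand(120, 192, 192, 60).astype("float32")  # stand-in data
train_y = np.eye(5)[np.random.randint(0, 5, size=120)]         # stand-in one-hot labels

# move the frame axis to position 1: (120, 192, 192, 60) -> (120, 60, 192, 192)
train_x = np.moveaxis(train_x, -1, 1)

# add a trailing channel axis for grayscale frames: -> (120, 60, 192, 192, 1)
train_x = train_x[..., np.newaxis]

print(train_x.shape)  # (120, 60, 192, 192, 1)

# model is the ConvLSTM2D model built above
model.fit(train_x, train_y, epochs=10, batch_size=8)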

