I built an autoencoder for image data. Here is a simplified version:
from tensorflow.keras.layers import Input, Conv2D, Flatten, Dense, Reshape, TimeDistributed
from tensorflow.keras.models import Model

wx = 28
hx = 28
n_channel = 3
latent_dim = 10
n_filter = 16

# encoder: image -> latent vector
x_in = Input((wx, hx, n_channel))
xe = Conv2D(filters=n_filter, kernel_size=(3, 3))(x_in)
xe = Flatten()(xe)
xe = Dense(latent_dim)(xe)
im_encoder = Model(x_in, xe)

# decoder: latent vector -> image
l_in = Input((latent_dim,))
xd = Dense(n_filter*wx*hx)(l_in)
xd = Reshape((wx, hx, n_filter))(xd)  # channels_last, matching the input layout
xd = Conv2D(filters=n_channel, kernel_size=(3, 3), padding='same')(xd)  # 'same' keeps the wx x hx output size
im_decoder = Model(l_in, xd)

x_out = im_decoder(im_encoder(x_in))
im_model = Model(x_in, x_out)
This works fine.
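A minimal training sketch for the image model; the dummy data, optimizer, loss and epochs below are only placeholders, not my actual settings:

import numpy as np

x_train = np.random.rand(100, wx, hx, n_channel).astype('float32')  # placeholder image data
im_model.compile(optimizer='adam', loss='mse')
im_model.fit(x_train, x_train, epochs=10, batch_size=32)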
Now I want to apply this architecture to video data with a fixed number of frames (n_frames) per sample, so the input shape is (n_samples, n_frames, wx, hx, n_channel).
My first idea, a TimeDistributed layer followed by a dense layer, gives bad results:
n_frames = 8

# video encoder: apply the image encoder per frame, then compress across frames
x_in = Input((n_frames, wx, hx, n_channel))
xe = TimeDistributed(im_encoder)(x_in)   # (n_frames, latent_dim)
xe = Flatten()(xe)
xe = Dense(latent_dim)(xe)
vi_encoder = Model(x_in, xe)

# video decoder: expand to per-frame latents, then decode each frame
l_in = Input((latent_dim,))
xd = Dense(n_frames*latent_dim)(l_in)
xd = Reshape((n_frames, latent_dim))(xd)
xd = TimeDistributed(im_decoder)(xd)
vi_decoder = Model(l_in, xd)

x_out = vi_decoder(vi_encoder(x_in))
vi_model = Model(x_in, x_out)
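The video model is trained the same way, on arrays of shape (n_samples, n_frames, wx, hx, n_channel); again a rough sketch with placeholder data and settings:

import numpy as np

x_train_video = np.random.rand(100, n_frames, wx, hx, n_channel).astype('float32')  # placeholder video data
vi_model.compile(optimizer='adam', loss='mse')
vi_model.fit(x_train_video, x_train_video, epochs=10, batch_size=8)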
Why does this not work well? Should I include LSTM layers (something like the sketch below), or what would be a good approach?
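For reference, this is the kind of LSTM-based variant I have in mind; it is only a rough, untested sketch, and the RepeatVector step, the layer sizes and the _lstm names are assumptions on my part:

from tensorflow.keras.layers import LSTM, RepeatVector

# encoder: per-frame image encoder, then an LSTM compresses the frame sequence into one latent vector
x_in = Input((n_frames, wx, hx, n_channel))
xe = TimeDistributed(im_encoder)(x_in)    # (n_frames, latent_dim)
xe = LSTM(latent_dim)(xe)                 # (latent_dim,)
vi_encoder_lstm = Model(x_in, xe)

# decoder: repeat the clip latent for every frame, unroll with an LSTM, decode each frame
l_in = Input((latent_dim,))
xd = RepeatVector(n_frames)(l_in)         # (n_frames, latent_dim)
xd = LSTM(latent_dim, return_sequences=True)(xd)
xd = TimeDistributed(im_decoder)(xd)
vi_decoder_lstm = Model(l_in, xd)

x_out = vi_decoder_lstm(vi_encoder_lstm(x_in))
vi_model_lstm = Model(x_in, x_out)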
question from:
https://stackoverflow.com/questions/65907912/what-is-a-good-approach-for-an-autoencoder-on-video-data