Classifying video presents unique challenges for machine learning models. As I’ve covered in my previous posts, video has the added (and interesting) property of temporal features in addition to the spatial features present in 2D images. While this additional information gives us more to work with, it also requires different network architectures and often adds larger memory and computational demands.

A few design decisions up front:

- We won’t use any optical flow images. This reduces model complexity, training time, and a whole whack load of hyperparameters we don’t have to worry about.
- Every video will be subsampled down to 40 frames, so a 41-frame video and a 500-frame video will both be reduced to 40 frames, with the 500-frame video essentially being fast-forwarded.
- We won’t do much preprocessing. A common preprocessing step for video classification is subtracting the mean, but we’ll keep the frames pretty raw from start to finish.
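The subsampling step above can be sketched with evenly spaced frame indices. This is a minimal illustration, not the repo's actual code; the helper name `subsample_frames` is mine:

```python
import numpy as np

def subsample_frames(frames, target=40):
    """Evenly subsample a sequence of frames down to `target` frames.

    Videos longer than `target` frames are effectively fast-forwarded;
    a 41-frame video and a 500-frame video both come out at 40 frames.
    """
    # Evenly spaced indices from the first frame to the last
    idx = np.linspace(0, len(frames) - 1, target).astype(int)
    return [frames[i] for i in idx]
```

Note that for videos only slightly longer than 40 frames, `linspace` will repeat some indices, which simply duplicates a few frames.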
Features
- Classify one frame at a time with a ConvNet
- Extract features from each frame with a ConvNet, passing the sequence to an RNN, in a separate network
- Use a time-distributed ConvNet, passing the features to an RNN, much like #2 but all in one network (this is the lrcn network in the code)
- Extract features from each frame with a ConvNet and pass the sequence to an MLP
- Use a 3D convolutional network (the code includes two versions of the 3D ConvNet to choose from)
- This code requires that you have Keras 2 and TensorFlow 1 or greater installed
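For comparison with the LRCN approach, a 3D convolutional network convolves over time and space jointly rather than treating frames independently. A minimal sketch, again with assumed sizes and an illustrative function name `build_conv3d`:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv3D, MaxPooling3D, Flatten, Dense

def build_conv3d(num_classes=10, frames=40, height=80, width=80, channels=3):
    """Minimal 3D ConvNet: the kernel spans (time, height, width),
    so temporal patterns are learned directly by the convolution."""
    model = Sequential([
        Conv3D(16, (3, 3, 3), activation='relu',
               input_shape=(frames, height, width, channels)),
        MaxPooling3D((2, 2, 2)),
        Flatten(),
        Dense(num_classes, activation='softmax'),
    ])
    return model
```

The trade-off versus the ConvNet-plus-RNN variants is that 3D kernels add a time dimension to every filter, which is part of why these models tend to carry the larger memory and computational demands mentioned above.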