Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks
28 Oct 2017
Paper: IEEE
Key idea:
We take advantage of the complementarity of CNNs, LSTMs and DNNs by combining them into one unified architecture.
Background knowledge:
- log-mel filterbank feature
- LSTMP - LSTM with Recurrent Projection Layer:
The cell output (n_c units) is projected down to n_r units before being fed back recurrently and passed to the output layer. By setting n_r < n_c we can increase the model memory (n_c) and still control the number of parameters in the recurrent connections and output layer (see the sketch after this list).
- context dependent acoustic modeling
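A quick way to see the parameter saving: PyTorch's nn.LSTM exposes the same recurrent projection through proj_size. This is only an illustrative sketch; the input size of 256 is an arbitrary placeholder, not a value from the paper.

# Plain LSTM (recurrent/output width n_c = 832) vs. LSTM with a recurrent
# projection down to n_r = 512, matching the paper's 832-cell / 512-projection layers.
import torch.nn as nn

plain = nn.LSTM(input_size=256, hidden_size=832)
lstmp = nn.LSTM(input_size=256, hidden_size=832, proj_size=512)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(plain), count(lstmp))  # the projected variant has noticeably fewer parameters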
Motivation:
- Higher-level modeling of x_t can help to disentangle underlying factors of variation within the input, which should then make it easier to learn temporal structure between successive time steps.
- If factors of variation in the hidden states could be reduced, then the hidden state of the model could summarize the history of previous inputs more efficiently. In turn, this could make the output easier to predict. Reducing variation in the hidden states can be modeled by having DNN layers after the LSTM layers.
Input
For each time step t:
- Input [x_{t-l}, ..., x_t, ..., x_{t+r}], where each x is a 40-dimensional log-mel filterbank feature: l contextual vectors at the left and r contextual vectors at the right of the current frame.
Output
For each time step t:
- Output y_t, the context-dependent state label.
- The output state label is delayed by 5 frames (the label emitted at time t is the one for frame t-5), see the sketch below.
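A minimal NumPy sketch of how the stacked inputs and delayed targets could be assembled; the edge padding and the handling of the first few delayed frames are assumptions for illustration, not details from the paper.

import numpy as np

def stack_context(logmel, l=10, r=0):
    # logmel: (T, 40) log-mel frames; returns (T, 40, l+r+1) windows
    # [x_{t-l}, ..., x_{t+r}] as frequency x time-context images for the CNN.
    T, F = logmel.shape
    padded = np.pad(logmel, ((l, r), (0, 0)), mode="edge")  # repeat edge frames (assumption)
    return np.stack([padded[t:t + l + r + 1].T for t in range(T)])

def delay_targets(labels, delay=5):
    # The target emitted at time t is the state label of frame t - delay,
    # so 5 frames of future input are seen before each prediction.
    return np.concatenate([np.repeat(labels[:1], delay), labels[:-delay]])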
Network:
# Keras-style pseudocode; LSTMP and Unknown_add are placeholders, not real Keras layers.
context = Input(shape=(40, l + r + 1))                  # [x_{t-l}, ..., x_{t+r}]: frequency x time-context
H = Conv2D(256, kernel_size=(9, 9))(context)            # filter (9,9) on frequency-time
H = MaxPooling2D(pool_size=(3, 1), strides=(3, 1))(H)   # pooling by 3 on frequency only
H = Conv2D(256, kernel_size=(4, 3))(H)
H_cnn = TimeDistributed(Linear(256))(H)                 # linear layer to reduce the CNN output dimension; parameters shared over time
x_t = Input(shape=(40,))                                # short-term feature: the current frame only
H = Unknown_add(x_t, H_cnn)                             # both x_t and the CNN feature are fed to the LSTM; the exact combination (concat vs. add) is not specified
H = LSTMP(832, recurrent_projection_units=512, truncated_bptt_steps=20, return_sequences=True)(H)
H = LSTMP(832, recurrent_projection_units=512, truncated_bptt_steps=20, return_sequences=False)(H)
H = Unknown_add(H, H_cnn)                               # the CNN feature also skips ahead to the DNN layers (multi-scale addition)
H = Linear(1024)(H)
H = Linear(1024)(H)
Output = Softmax(num_cd_states)(H)                      # context-dependent state posteriors, trained with cross-entropy
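Since the paper ships no code, here is a minimal runnable PyTorch sketch of the architecture as summarized above. It is not the authors' implementation: the concatenation used for both Unknown_add steps, the ReLU activations, the flattening of the CNN output, and the default num_cd_states are assumptions made for illustration.

import torch
import torch.nn as nn

class CLDNN(nn.Module):
    def __init__(self, n_mel=40, l=10, r=0, num_cd_states=10000):
        super().__init__()
        ctx = l + r + 1                                  # context frames fed to the CNN (11 for l=10, r=0)
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 256, kernel_size=(9, 9)),       # (freq, time): (40, 11) -> (32, 3)
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=(3, 1)),            # pool by 3 on frequency only -> (10, 3)
            nn.Conv2d(256, 256, kernel_size=(4, 3)),     # -> (7, 1)
            nn.ReLU(),
        )
        with torch.no_grad():                            # infer the flattened CNN output size
            cnn_flat = self.cnn(torch.zeros(1, 1, n_mel, ctx)).numel()
        self.cnn_proj = nn.Linear(cnn_flat, 256)         # linear dimensionality reduction after the CNN
        # two stacked LSTM layers with 832 cells and a 512-unit recurrent projection
        self.lstm = nn.LSTM(input_size=256 + n_mel, hidden_size=832,
                            num_layers=2, proj_size=512, batch_first=True)
        self.dnn = nn.Sequential(
            nn.Linear(512 + 256, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, num_cd_states),              # CD-state logits; softmax is folded into the loss
        )

    def forward(self, context, x_t):
        # context: (batch, T, 1, n_mel, l+r+1) stacked log-mel windows
        # x_t:     (batch, T, n_mel) short-term feature for each frame
        B, T = context.shape[:2]
        h = self.cnn(context.reshape(B * T, *context.shape[2:]))
        h_cnn = self.cnn_proj(h.reshape(B * T, -1)).reshape(B, T, 256)
        h, _ = self.lstm(torch.cat([x_t, h_cnn], dim=-1))   # short-term + CNN feature into the LSTM
        h = torch.cat([h, h_cnn], dim=-1)                   # CNN feature also passed on to the DNN
        return self.dnn(h)                                  # (batch, T, num_cd_states)

# e.g. logits = CLDNN()(torch.randn(2, 20, 1, 40, 11), torch.randn(2, 20, 40))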
- r = 0
- Loss: cross-entropy
- Optimizer: (distributed) asynchronous stochastic gradient descent (ASGD)
- Initialization: Unit variance Gaussian for LSTM, glorot normal/uniform for CNN and DNN
- Learning rate: exponential decay
- Sequence training is applied on the larger data sets.
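The paper trains with distributed ASGD; a single-worker stand-in using plain SGD, cross-entropy loss, and an exponentially decaying learning rate could look like the sketch below. It reuses the CLDNN class from the Network section; all hyperparameter values and the synthetic batch are placeholders.

import torch
import torch.nn as nn

model = CLDNN(num_cd_states=500)                         # CLDNN sketch from the Network section
opt = torch.optim.SGD(model.parameters(), lr=0.01)       # lr value is a placeholder
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.95)
loss_fn = nn.CrossEntropyLoss()

# one synthetic minibatch: 2 utterances of 20 frames each
context = torch.randn(2, 20, 1, 40, 11)                  # stacked log-mel windows
x_t = torch.randn(2, 20, 40)                             # short-term features
targets = torch.randint(0, 500, (2, 20))                 # CD-state labels, already delayed by 5 frames

logits = model(context, x_t)                             # (2, 20, 500)
loss = loss_fn(logits.reshape(-1, logits.shape[-1]), targets.reshape(-1))
opt.zero_grad(); loss.backward(); opt.step()
sched.step()                                             # in practice, decayed on an epoch/step schedule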
Experiments:
- From Table 2, a larger left context of l=20 hurts performance; l=10 is used.
- From Table 3, unrolling 30 time steps degrades WER.
- From Table 4, it is beneficial to use DNN layers to transform the output of the LSTM layer to a space that is more discriminative and easier to predict output targets.
- From Table 5, the combined CLDNN performs better than the individual models.
- From Table 6, CLDNN benefits from the better weight initialization (uniform initialization).
- From Table 7, feeding the short-term feature x_t into the LSTM alongside the CNN features performs slightly better, although the CNN features alone are already a sufficient input to the LSTM.
- From Tables 8 & 9, the CLDNN also outperforms the LSTM on the larger datasets.
Pros & Cons:
- Pros:
- Good results (a 4-6% relative WER improvement over the LSTM, the strongest of the individual models)
- Good intuition
- Cons:
- The paper is not very detailed: it does not fully explain the network (e.g., the multi-scale addition and the input features). This is troublesome since no code is provided.
- The baselines are relatively simple.