Deep Paper Pool, really deep.

The Elephant in the Room

Paper: arXiv
Code:
Published: 9 Aug 2018

Key Idea:

We showcase a family of common failures of state-of-the-art object detectors, which we call “object transplanting”.

Background knowledge:

Object transplanting

Putting (“transplanting”) an object from one image into a new location in another image.

Experiment 1:

  • Using a state-of-the-art object detection method (Faster R-CNN [9] with a NASNet backbone [20])
  • Image of a living room from the Microsoft COCO object detection benchmark
  • Transplant an “elephant” (referred to as T) into this image at various locations
  • Several interesting phenomena emerge:
    1. Detection is not stable: the object may occasionally become undetected or be detected with sharp changes in confidence
    2. The reported identity of the object T is not consistent (e.g., “chair” in Fig. 1(f)): the object may be detected as a variety of different classes depending on location
    3. The object causes non-local effects: objects non-overlapping with T can switch identity or bounding box, or disappear altogether.

Detailed Analysis of these 3 findings:

  • Dataset: 2017 version of the MS-COCO dataset
  • Models: models from the TensorFlow Object Detection API (Table II)
  • Test Image Generation: picking a random pair of images I, J and transplanting a random object from image J into image I (see the sketch after this list)
    • Randomly pick a pair of images I and J
    • Randomly select an object in J and “transplant” it to a series of locations in image I
    • Detection result: bounding box coordinates, detection score and object category
    • Confidently detected object classes:
  • Matching Detections:
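
As a rough illustration of the transplanting procedure above, here is a minimal NumPy sketch; `detector` is a stand-in for any detection model (e.g., one loaded from the TensorFlow Object Detection API) and is an assumption of this sketch, not the paper's code.

import numpy as np

def transplant(dst, patch, mask, x, y):
    """Paste a masked object crop `patch` into a copy of `dst` at (x, y)."""
    out = dst.copy()
    h, w = patch.shape[:2]
    region = out[y:y + h, x:x + w]     # view into the copy
    region[mask] = patch[mask]         # overwrite only the object pixels
    return out

def sweep(dst, patch, mask, detector, step=40):
    """Run the detector with the object transplanted at a grid of locations."""
    results = {}
    for y in range(0, dst.shape[0] - patch.shape[0], step):
        for x in range(0, dst.shape[1] - patch.shape[1], step):
            results[(x, y)] = detector(transplant(dst, patch, mask, x, y))
    return results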

Metric learning with spectral graph convolutions on brain connectivity networks

Paper: NeuroImage
Code: (old version) GitHub (TensorFlow)
Published: Dec 2017

Key Idea:

We propose to learn a graph similarity metric using a siamese graph convolutional neural network (s-GCN) in a supervised setting.

Background knowledge:

Siamese neural networks

A Siamese neural network is a class of neural network architectures that contain two or more identical subnetworks. “Identical” here means they have the same configuration with the same parameters and weights; parameter updates are mirrored across the subnetworks. Siamese NNs are popular for tasks that involve finding the similarity or a relationship between two comparable things.

Degree matrix or diagonal degree matrix

The degree matrix is a diagonal matrix which contains information about the degree of each vertex—that is, the number of edges attached to each vertex.

Adjacency matrix

An adjacency matrix is a square matrix used to represent a finite graph. The elements of the matrix indicate whether pairs of vertices are adjacent or not in the graph.

Laplacian matrix / Symmetric normalized Laplacian

The graph Laplacian is L = D - A, where D is the degree matrix and A is the adjacency matrix. The symmetric normalized Laplacian is L_sym = I - D^(-1/2) A D^(-1/2); its eigendecomposition provides the graph Fourier basis used by spectral graph convolutions.

Chebyshev polynomials of the first kind

Defined by the recurrence T_0(x) = 1, T_1(x) = x, T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x); they let spectral filters be approximated without an explicit eigendecomposition.
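
For orientation, a hedged sketch (notation mine) of how these two definitions combine in ChebNet-style spectral graph convolution, where the filter is a K-th order Chebyshev polynomial of the rescaled Laplacian:

\tilde{L} = \frac{2}{\lambda_{\max}} L_{\mathrm{sym}} - I_N, \qquad
g_\theta(L)\, x \approx \sum_{k=0}^{K-1} \theta_k \, T_k(\tilde{L})\, x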

Methodology:

Check this blog

Dataset & preprocess:

  • Dataset: Autism Brain Imaging Data Exchange (ABIDE) & UK Biobank (UKB)
  • Preprocessing pipeline:
    • ABIDE: Configurable Pipeline for the Analysis of Connectomes (C-PAC)
  
  Including:
    * skull stripping
    * slice timing correction
    * motion correction
    * global mean intensity normalisation
    * nuisance signal regression
    * band-pass filtering (0.01-0.1 Hz)
    * registration of fMRI images to standard anatomical space (MNI152)
  
  • UKB: preprocessed as described in Miller2016

  • ROI:
    • ABIDE: Harvard-Oxford (HO) atlas (R = 110 cortical and subcortical ROIs)
      • Extract the mean time series for each ROI
      • Normalise to zero mean and unit variance
    • UKB: R = 55 ROIs (100 spatially independent components, of which 55 are non-artefactual; Miller2016)
  • Numbers:
    • ABIDE:
      • Subjects: N = 871 (from 20 different imaging sites; these 871 met the imaging quality and phenotypic information criteria)
      • ASD patients: 403
      • Healthy controls: 468
      • Sites: 20
    • UKB:
      • Subjects: N = 2500
      • Male: 1181
      • Female: 1319
  

Network detail:

  • Task: measure the similarity between two graphs
  • Graph:
    • Vertices: each ROI is represented by a node
    • Input feature: for each ROI, the input feature is the corresponding row of the correlation matrix for that ROI
    • Edges & weights:
      • Type 1: spatial graph (edge weights from the spatial distance between ROIs)
      • Type 2: functional graph (edge weights from the mean functional connectivity)
      • Edges are determined by k-NN (k-nearest neighbours), with k = 10
  • Network Structure:
    1. GCN (graph convolutional layers):
      1. 2 layers with 64 features each (weights shared across the Siamese branches)
      2. K = 3: each convolution takes input from nodes at most K steps away
    2. FC:
      1. One output with sigmoid activation
      2. A binary feature is introduced at the FC layer indicating whether the subject pair were scanned at the same site or not
      3. Dropout 0.2/0.5 (ABIDE/UKB) on the FC layer
  • Loss function (see the sketch after this list):
    • global loss:

    It maximises the mean similarity between embeddings belonging to the same class, minimises the mean similarity between embeddings belonging to different classes, and minimises the variance of pairwise similarities for both matching and non-matching pairs of graphs.

    • constrained variance loss:
      Compared to the global loss, it adds a threshold a to the variance terms, so variance is only penalised when it exceeds a.
  • Training details:
    • Adam optimizer: 0.001 learning rate and 0.0005/0.05 (ABIDE/UKB) regularization
    • Loss function: margin m = 1.0, weight lambda = 1.0, a = m/2
    • Mini-batch size: 200
  • Train and test:
    • ABIDE:
      1. 871 subjects total: 720 train, 151 test.
      2. The training set forms 21802 matching and 21398 non-matching graph pairs; the test set forms 5631 matching and 5694 non-matching pairs.
      3. All graphs are fed to the network the same number of times to avoid biases.
      4. Subjects from all 20 sites are included in both the training and test sets.
    • UKB:
      1. 5-fold cross-validation
      2. 2500 subjects total: 2000 train, 500 test
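
A minimal NumPy sketch of losses with the structure described above, using the hyperparameters listed (m = 1.0, lambda = 1.0, a = m/2). This is my reconstruction from the prose (matching-pair similarities pushed above non-matching ones by margin m, with variances penalised); the paper's exact formulation may differ.

import numpy as np

def global_loss(sim, y, m=1.0, lam=1.0):
    """sim: pairwise similarities; y: 1 = matching pair, 0 = non-matching."""
    pos, neg = sim[y == 1], sim[y == 0]
    var_term = pos.var() + neg.var()                    # keep both distributions tight
    gap_term = max(0.0, m - (pos.mean() - neg.mean()))  # separate the two means by m
    return var_term + lam * gap_term

def constrained_variance_loss(sim, y, m=1.0, lam=1.0, a=0.5):
    """Variant with a threshold a (= m/2): variance is only penalised above a."""
    pos, neg = sim[y == 1], sim[y == 0]
    var_term = max(0.0, pos.var() - a) + max(0.0, neg.var() - a)
    return var_term + lam * max(0.0, m - (pos.mean() - neg.mean()))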

LSTM Time and Frequency Recurrence for Automatic Speech Recognition

Paper: [Microsoft](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/04/spectral_LSTM.pdf)

Key idea:

We propose an extension to LSTMs that performs the recurrence in frequency as well as in time.

Background knowledge:

  • LSTMP - LSTM with Recurrent Projection Layer:

    By setting n_r < n_c we can increase the model memory (n_c, the number of memory cells) and still be able to control the number of parameters in the recurrent connections and output layer; n_r is the number of units in the recurrent projection layer. See the sketch below.
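
A minimal PyTorch sketch of the idea: nn.LSTM's proj_size argument implements this recurrent projection (the 832/512 sizes are borrowed from the CLDNN paper below, purely for illustration).

import torch
import torch.nn as nn

# n_c = 832 memory cells projected down to n_r = 512 recurrent/output units
lstmp = nn.LSTM(input_size=40, hidden_size=832, proj_size=512, batch_first=True)
x = torch.randn(8, 20, 40)   # (batch, time, features)
out, _ = lstmp(x)
print(out.shape)             # (8, 20, 512): recurrence and output use the projected state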

Motivation:

  • Our model is inspired by the way people read spectrograms.
  • In standard systems, the log-filter-bank features are treated as independent of one another, i.e. switching the positions of two filter-banks won’t affect the performance of the DNN or LSTM.
  • However, this is not the case when a human reads a spectrogram: a human relies on patterns that evolve in both time and frequency to predict phonemes. Switching the positions of two filter-banks would destroy the frequency-wise patterns.

Network:

  • Input:
    • size: (40, 11)
    • 40: 40-dimensional log-mel filterbank features
    • 11: 11 frames, 1 center frame and 5 contextual frames on each side
  • Output:
    • 1812 tied-triphone states (senones)
  • Structure:
    • For each input at time t, apply the F-LSTM across the 40 frequency bands
    • The output of the F-LSTM at time t is fed into one T-LSTM cell; the T-LSTM recurs over all time steps t

F-LSTM:

  • For each time step t:
    • Divide the N log-filter-banks at the current time into M overlapping chunks, each containing B log-filter-banks, where C is the number of log-filter-banks overlapping between adjacent chunks
    • The M overlapping chunks are the inputs to M unrolled F-LSTM steps, producing M output vectors
    • Merge (concatenate?) these outputs into a vector h, which is the input to the T-LSTM cell at time t (see the sketch below)
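
A minimal PyTorch sketch of the frequency chunking, with hypothetical sizes B = 8 and C = 4 (the paper's exact B and C values are not recorded in these notes):

import torch
import torch.nn as nn

N, B, C = 40, 8, 4           # filterbanks, chunk size, overlap (B, C hypothetical)
step = B - C
f_lstm = nn.LSTM(input_size=B, hidden_size=24, batch_first=True)  # 24 cells, best in Table 2

frame = torch.randn(1, N)                              # one time frame of log-filterbanks
chunks = frame.unfold(dimension=1, size=B, step=step)  # (1, M, B) overlapping chunks
out, _ = f_lstm(chunks)                                # F-LSTM recurs over the M chunks
h = out.reshape(1, -1)                                 # concatenated outputs -> T-LSTM input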

Comparison:

  • Against CLDNN:
    • The two approaches both aim to achieve invariance to input distortions, but the pattern detectors in the CNN maintain a constant dimensionality, while the F-LSTM can perform a general frequency warping.
  • Against Multidimensional RNN:
    • To summarize, the T-F-LSTM works on each dimension of the multidimensional space separately and is simpler, while the multidimensional RNN works jointly on the multidimensional space with more powerful modeling.

Experiments:

  • Table 1, F-T-LSTM is better than T-LSTM.
  • Table 2, F-LSTM with 24 cells performs best.
  • Table 3, stacking multiple time frames into the input of the F-LSTM does not improve the performance.

Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks

Paper: IEEE

Key idea:

We take advantage of the complementarity of CNNs, LSTMs and DNNs by combining them into one unified architecture.

Background knowledge:

Motivation:

  • Higher-level modeling of x_t can help to disentangle underlying factors of variation within the input, which should then make it easier to learn the temporal structure between successive time steps.
  • If factors of variation in the hidden states could be reduced, then the hidden state of the model could summarize the history of previous inputs more efficiently. In turn, this could make the output easier to predict. Reducing variation in the hidden states can be modeled by having DNN layers after the LSTM layers.

Input

For each input at time t:

  • Input
    • each frame is a 40-dimensional log-mel filterbank feature
    • l contextual vectors to the left
    • r contextual vectors to the right

Output

For each output at time t:

  • Output
    • the output state label is delayed by 5 frames, so information about future frames can help predict the current frame

Network:

# Pseudo-Keras reconstruction; `unknown_add` marks the multi-scale combination the
# paper does not fully specify (see Cons below), and LSTMP denotes an LSTM with a
# recurrent projection layer as described above.
inp = Input(shape=(40, l + r + 1))                      # frequency x time
h = Conv2D(256, kernel_size=(9, 9))(inp)                # filter (9,9) on frequency-time
h = MaxPool2D(pool_size=(3, 1), strides=(3, 1))(h)      # pooling by 3 on frequency only
h = Conv2D(256, kernel_size=(4, 3))(h)
h_cnn = TimeDistributed(Dense(256))(h)                  # linear layer, parameters shared across time
x_t = Input(shape=(40, 1))                              # short-term feature at time t
h = unknown_add(x_t, h_cnn)                             # combination not specified in the paper
h = LSTMP(832, projection_units=512, truncated_bptt_steps=20, return_sequences=True)(h)
h = LSTMP(832, projection_units=512, truncated_bptt_steps=20, return_sequences=False)(h)
h = unknown_add(h, h_cnn)                               # second multi-scale connection
h = Dense(1024)(h)
h = Dense(1024)(h)
  • r = 0
  • Loss: cross-entropy
  • Optimizer: (distributed) asynchronous stochastic gradient descent (ASGD)
  • Initialization: Unit variance Gaussian for LSTM, glorot normal/uniform for CNN and DNN
  • Learning rates: exponential decay
  • Sequence training is used on the larger data sets.

Experiments:

  • From Table 2, a larger context of 20 hurts performance; l = 10 is used.
  • From Table 3, unrolling 30 time steps degrades WER.
  • From Table 4, it is beneficial to use DNN layers to transform the output of the LSTM layer to a space that is more discriminative and easier to predict output targets.
  • From Table 5, CLDNN performs better.
  • From Table 6, CLDNN benefits from the better weight initialization (uniform initialization).
  • From Table 7, passing the short-term feature x_t to the LSTM gives better performance; otherwise, CNN features alone as the LSTM input are sufficient.
  • From Tables 8 & 9, the CLDNN also outperforms the LSTM on larger datasets.

Pros & Cons:

  • Pros:
    • Good results and performance
    • Good intuition
  • Cons:
    • The paper is not very detailed: it does not fully explain the network (the multi-scale addition and the input features). This is troublesome since no code is provided.
    • The baselines are relatively simple.

Mastering the game of Go with deep neural networks and tree search

Paper: Nature
For the full paper, please Google it yourself.

Key idea:

We introduce a new approach to computer Go that uses ‘value networks’ to evaluate board positions and ‘policy networks’ to select moves.

Background knowledge:

Previous approach:

  • Depth of the search may be reduced by position evaluation: truncating the search tree at state s and replacing the subtree below s by an approximate value function that predicts the outcome from state s.
  • The breadth of the search may be reduced by sampling actions from a policy that is a probability distribution over possible moves a in position s.
  • Monte Carlo tree search (MCTS) uses Monte Carlo rollouts to estimate the value of each state in a search tree.
  • However, prior work has been limited to shallow policies or value functions based on a linear combination of input features.

This paper approach:

  • We use these neural networks to reduce the effective depth and breadth of the search tree: evaluating positions using a value network, and sampling actions using a policy network.
  • Structure (Check Figure 1 in paper):
    1. A Supervised Learning (SL) policy network trained directly on expert human moves
    2. A Reinforcement Learning (RL) policy network that improves the SL policy network by optimising the final outcome of games of self-play.
    3. Value network that predicts the winner of games played by the RL policy network against itself.
    4. MCTS with policy and value network

Notation:

  • s: state (current state of the Go game, check Extended Data Table 2 in paper)
  • a: moves (the legal moves)

Supervised Learning (SL) policy network

  • Task: predicting the probability of human experts’ next moves a given the current state s
  • Input: state s
  • Output: probability distribution of all legal moves a
  • Structure: formed by convolutional layers and rectifier nonlinearities, with a softmax as the final output (see the sketch below).
  • Detailed network: 13 layers
  • Data: 30 million state-action pairs from the KGS Go server.
  • Performance: 57.0% accuracy using all input features, 55.7% using only the raw board position and move history as inputs.
  • Time consumption: 3 ms per move
  • Extra: a faster rollout policy reaches 24.2% accuracy while needing only ~2 μs per move.
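
A schematic PyTorch sketch of a 13-layer convolutional policy of this shape, assuming 48 input feature planes on a 19x19 board; the filter sizes and counts here are illustrative, not the paper's exact configuration.

import torch.nn as nn

policy = nn.Sequential(
    nn.Conv2d(48, 192, kernel_size=5, padding=2), nn.ReLU(),      # layer 1
    *[m for _ in range(11)                                        # layers 2-12
      for m in (nn.Conv2d(192, 192, kernel_size=3, padding=1), nn.ReLU())],
    nn.Conv2d(192, 1, kernel_size=1),                             # layer 13: per-point logit
    nn.Flatten(),                                                 # 361 move logits
    nn.Softmax(dim=1),                                            # distribution over moves
)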

Reinforcement Learning (RL) policy network

  • Objective: improve the policy network by policy-gradient reinforcement learning (RL)
  • Network: the RL policy network is identical in structure to the SL policy network
  • Training procedure (see the toy sketch below):
    1. The RL policy network weights are initialised to the same values as the SL policy network weights.
    2. Games of Go are played between the current policy network and a randomly selected previous iteration of the policy network.
    3. At the end of the game, the network is updated for all moves t = 1:T in the direction that increases log p(a_t|s_t), scaled by the outcome z of the game (+1 for winning, -1 for losing).
  • Performance: 80% winning rate against the SL policy network, and 85% against the strongest open-source Go program, Pachi (a previous state-of-the-art supervised CNN achieved only an 11% winning rate against Pachi)
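
A toy REINFORCE sketch for a linear softmax policy p(a|s) = softmax(W s), standing in for the deep policy network and the distributed self-play setup (illustrative only):

import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def policy_gradient_update(W, game, z, lr=0.01):
    """game: list of (state_vector, action_index); z: +1 for a win, -1 for a loss."""
    for s, a in game:
        p = softmax(W @ s)
        grad = -np.outer(p, s)   # d log p(a|s) / dW, expectation part
        grad[a] += s             # indicator term for the action actually taken
        W += lr * z * grad       # reinforce moves from won games, suppress losses
    return W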

Value network

  • Objective: position evaluation of any given state s
  • Task: estimate a value function that predicts the outcome from position s of games played using policy p for both players
  • Input: state-outcome pairs (s, z)
  • Output: estimated value of the position
  • Network: a value network with a single prediction output, similar in architecture to the two policy networks
  • Training: use state-outcome pairs (s, z) from both the KGS dataset and a self-play data set, with SGD minimising the MSE between the predicted value and the corresponding outcome z
  • Performance: 0.226 and 0.234 MSE on the training and test sets respectively; close to the RL policy network's evaluation accuracy, but with 15,000 times less computation.

MCTS (Searching) with policy and value networks

  • Objective: Determine next move action
  • Input: state s
  • Output: estimated optimal move a
  • Algorithm: the network runs many simulations for the same input state s; each simulation traverses the tree from the root state (the input state s). For one simulation (Figure 3):
    1. Selection: at a node with children, branch to the child maximising Q(s, a) + u(s, a), where the exploration bonus u(s, a) is proportional to P(s, a) / (1 + N(s, a)) (see the sketch below).
    2. Expansion: a leaf node may be expanded by the SL policy network, and the output probabilities are stored as prior probabilities P(s, a) for each action.
    3. Evaluation: at the end of a simulation, the leaf node s_L is evaluated in two ways: by the value network v(s_L), and by running a rollout to the end of the game with the fast rollout policy, giving outcome z_L; these are combined into a leaf evaluation V(s_L) = (1 - lambda) * v(s_L) + lambda * z_L.
    4. Backup: update the statistics N(s, a) = sum_i 1(s, a, i) and Q(s, a) = (1 / N(s, a)) * sum_i 1(s, a, i) * V(s_L^i), where 1(s, a, i) indicates whether edge (s, a) was traversed during the i-th simulation and s_L^i is the leaf node from the i-th simulation.
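
A toy sketch of the selection rule in step 1, assuming per-edge statistics N (visit count), Q (mean action value) and a prior P from the SL policy network; c_puct is an exploration constant, and the paper's exact bonus also scales with the parent node's total visit count, approximated here:

import math

def select_action(edges, c_puct=5.0):
    """edges: dict mapping action -> {'N': visits, 'Q': value, 'P': prior}."""
    total_n = sum(e['N'] for e in edges.values())
    def score(e):
        u = c_puct * e['P'] * math.sqrt(total_n) / (1 + e['N'])   # exploration bonus
        return e['Q'] + u
    return max(edges, key=lambda a: score(edges[a]))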

Performance

  • Against several other programs:
    • 5 s computation time per move
    • AlphaGo (single-machine) won 494 out of 495 games
    • With four handicap stones, AlphaGo won 77%, 86% and 99% of games against Crazy Stone, Zen and Pachi respectively.
    • AlphaGo (distributed) won 100% of games against other programs and 77% against AlphaGo (single-machine)
  • Variants of AlphaGo that evaluate positions using just the value network or just rollouts were also tested; the mixed evaluation gives the best results.
  • Against professional player Fan Hui: AlphaGo won 5 games to 0