Deep Paper Pool, really deep.

The Elephant in the Room

Paper: arXiv
Code:
Published: 9 Aug 2018

Key Idea:

We showcase a family of common failures of state-of-the-art object detectors, which we call “object transplanting”.

Background knowledge:

Object transplanting

Putting (“transplanting”) an object from one image into a new location in another image.

Experiment 1:

  • Using a state-of-the-art object detection method (Faster R-CNN [9] with a NASNet backbone [20])
  • Image of a living room from the Microsoft COCO object detection benchmark
  • Transplant an “elephant” (referred to as T) into this image at various locations
  • Several interesting phenomena emerge:
    1. Detection is not stable: the object may occasionally become undetected or be detected with sharp changes in confidence
    2. The reported identity of the object T is not consistent (e.g., “chair” in Fig. 1(f)): the object may be detected as a variety of different classes depending on location
    3. The object causes non-local effects: objects non-overlapping with T can switch identity or bounding box, or disappear altogether.

Detailed Analysis of these 3 findings:

  • Dataset: 2017 version of the MS-COCO dataset
  • Models: models from the TensorFlow Object Detection API (Table II)
  • Test Image Generation: picking a random pair of images I, J and transplanting a random object from image J into image I (see the sketch after this list)
    • Randomly pick a pair of images I and J
    • Randomly select an object in J and “transplant” it to a series of locations in image I
    • Detection result: bounding box coordinates, detection score and object category
    • Confidently detected object classes:
  • Matching Detections:
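
As a rough illustration of the transplanting procedure above, here is a minimal NumPy sketch; `detector` is a stand-in for any detection model (e.g., one loaded from the TensorFlow Object Detection API) and is an assumption of this sketch, not the paper's code.

import numpy as np

def transplant(dst, patch, mask, x, y):
    """Paste a masked object crop `patch` into a copy of `dst` at (x, y)."""
    out = dst.copy()
    h, w = patch.shape[:2]
    region = out[y:y + h, x:x + w]     # view into the copy
    region[mask] = patch[mask]         # overwrite only the object pixels
    return out

def sweep(dst, patch, mask, detector, step=40):
    """Run the detector with the object transplanted at a grid of locations."""
    results = {}
    for y in range(0, dst.shape[0] - patch.shape[0], step):
        for x in range(0, dst.shape[1] - patch.shape[1], step):
            results[(x, y)] = detector(transplant(dst, patch, mask, x, y))
    return results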

Metric learning with spectral graph convolutions on brain connectivity networks

Paper: NeuroImage
Code: (old version) GitHub (TensorFlow)
Published: Dec 2017

Key Idea:

We propose to learn a graph similarity metric using a siamese graph convolutional neural network (s-GCN) in a supervised setting.

Background knowledge:

Siamese neural networks

A Siamese neural network is a class of neural network architectures that contain two or more identical subnetworks. “Identical” here means they have the same configuration with the same parameters and weights; parameter updates are mirrored across the subnetworks. Siamese NNs are popular for tasks that involve finding the similarity or a relationship between two comparable things.

Degree matrix or diagonal degree matrix

The degree matrix is a diagonal matrix which contains information about the degree of each vertex—that is, the number of edges attached to each vertex.

Adjacency matrix

An adjacency matrix is a square matrix used to represent a finite graph. The elements of the matrix indicate whether pairs of vertices are adjacent or not in the graph.

Laplacian matrix / Symmetric normalized Laplacian

The graph Laplacian is L = D - A, where D is the degree matrix and A is the adjacency matrix. The symmetric normalized Laplacian is L_sym = I - D^(-1/2) A D^(-1/2); its eigendecomposition provides the graph Fourier basis used by spectral graph convolutions.

Chebyshev polynomials of the first kind

Defined by the recurrence T_0(x) = 1, T_1(x) = x, T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x); they let spectral filters be approximated without an explicit eigendecomposition.
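
For orientation, a hedged sketch (notation mine) of how these two definitions combine in ChebNet-style spectral graph convolution, where the filter is a K-th order Chebyshev polynomial of the rescaled Laplacian:

\tilde{L} = \frac{2}{\lambda_{\max}} L_{\mathrm{sym}} - I_N, \qquad
g_\theta(L)\, x \approx \sum_{k=0}^{K-1} \theta_k \, T_k(\tilde{L})\, x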

Methodology:

Check this blog

Dataset & preprocess:

  • Dataset: Autism Brain Imaging Data Exchange (ABIDE) & UK Biobank (UKB)
  • Preprocessing pipeline:
    • ABIDE: Configurable Pipeline for the Analysis of Connectomes (C-PAC)
  
  Including:
    * skull stripping
    * slice timing correction
    * motion correction
    * global mean intensity normalisation
    * nuisance signal regression
    * band-pass filtering (0.01-0.1 Hz)
    * registration of fMRI images to standard anatomical space (MNI152)
  
  • UKB: preprocessed as described in Miller2016

  • ROI:
    • ABIDE: Harvard-Oxford (HO) atlas (R = 110 cortical and subcortical ROIs)
      • Extract the mean time series for each ROI
      • Normalise to zero mean and unit variance
    • UKB: R = 55 ROIs (100 spatially independent components, of which 55 are non-artefactual; Miller2016)
  • Numbers:
    • ABIDE:
      • Subjects: N = 871 (from 20 different imaging sites; these 871 met the imaging quality and phenotypic information criteria)
      • ASD patients: 403
      • Healthy controls: 468
      • Sites: 20
    • UKB:
      • Subjects: N = 2500
      • Male: 1181
      • Female: 1319
  

Network detail:

  • Task: measure the similarity between two graphs
  • Graph:
    • Vertices: each ROI is represented by a node
    • Input feature: for each ROI, the input feature is the corresponding row of the correlation matrix for that ROI
    • Edges & weights:
      • Type 1: spatial graph (edge weights from the spatial distance between ROIs)
      • Type 2: functional graph (edge weights from the mean functional connectivity)
      • Edges are determined by k-NN (k-nearest neighbours), with k = 10
  • Network Structure:
    1. GCN (graph convolutional layers):
      1. 2 layers with 64 features each (weights shared across the Siamese branches)
      2. K = 3: each convolution takes input from nodes at most K steps away
    2. FC:
      1. One output with sigmoid activation
      2. A binary feature is introduced at the FC layer indicating whether the subject pair were scanned at the same site or not
      3. Dropout 0.2/0.5 (ABIDE/UKB) on the FC layer
  • Loss function (see the sketch after this list):
    • global loss:

    It maximises the mean similarity between embeddings belonging to the same class, minimises the mean similarity between embeddings belonging to different classes, and minimises the variance of pairwise similarities for both matching and non-matching pairs of graphs.

    • constrained variance loss:
      Compared to the global loss, it adds a threshold a to the variance terms, so variance is only penalised when it exceeds a.
  • Training details:
    • Adam optimizer: 0.001 learning rate and 0.0005/0.05 (ABIDE/UKB) regularization
    • Loss function: margin m = 1.0, weight lambda = 1.0, a = m/2
    • Mini-batch size: 200
  • Train and test:
    • ABIDE:
      1. 871 subjects total: 720 train, 151 test.
      2. The training set forms 21802 matching and 21398 non-matching graph pairs; the test set forms 5631 matching and 5694 non-matching pairs.
      3. All graphs are fed to the network the same number of times to avoid biases.
      4. Subjects from all 20 sites are included in both the training and test sets.
    • UKB:
      1. 5-fold cross-validation
      2. 2500 subjects total: 2000 train, 500 test
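
A minimal NumPy sketch of losses with the structure described above, using the hyperparameters listed (m = 1.0, lambda = 1.0, a = m/2). This is my reconstruction from the prose (matching-pair similarities pushed above non-matching ones by margin m, with variances penalised); the paper's exact formulation may differ.

import numpy as np

def global_loss(sim, y, m=1.0, lam=1.0):
    """sim: pairwise similarities; y: 1 = matching pair, 0 = non-matching."""
    pos, neg = sim[y == 1], sim[y == 0]
    var_term = pos.var() + neg.var()                    # keep both distributions tight
    gap_term = max(0.0, m - (pos.mean() - neg.mean()))  # separate the two means by m
    return var_term + lam * gap_term

def constrained_variance_loss(sim, y, m=1.0, lam=1.0, a=0.5):
    """Variant with a threshold a (= m/2): variance is only penalised above a."""
    pos, neg = sim[y == 1], sim[y == 0]
    var_term = max(0.0, pos.var() - a) + max(0.0, neg.var() - a)
    return var_term + lam * max(0.0, m - (pos.mean() - neg.mean()))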

LSTM Time and Frequency Recurrence for Automatic Speech Recognition

Paper: [Microsoft](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/04/spectral_LSTM.pdf)

Key idea:

We propose an extension to LSTMs that performs the recurrence in frequency as well as in time.

Background knowledge:

  • LSTMP - LSTM with Recurrent Projection Layer:

    By setting n_r < n_c we can increase the model memory (n_c, the number of memory cells) and still be able to control the number of parameters in the recurrent connections and output layer; n_r is the number of units in the recurrent projection layer. See the sketch below.
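
A minimal PyTorch sketch of the idea: nn.LSTM's proj_size argument implements this recurrent projection (the 832/512 sizes are borrowed from the CLDNN paper below, purely for illustration).

import torch
import torch.nn as nn

# n_c = 832 memory cells projected down to n_r = 512 recurrent/output units
lstmp = nn.LSTM(input_size=40, hidden_size=832, proj_size=512, batch_first=True)
x = torch.randn(8, 20, 40)   # (batch, time, features)
out, _ = lstmp(x)
print(out.shape)             # (8, 20, 512): recurrence and output use the projected state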

Motivation:

  • Our model is inspired by the way people read spectrograms.
  • In standard systems, the log-filter-bank features are treated as independent of one another, i.e. switching the positions of two filter-banks won’t affect the performance of the DNN or LSTM.
  • However, this is not the case when a human reads a spectrogram: a human relies on patterns that evolve in both time and frequency to predict phonemes. Switching the positions of two filter-banks would destroy the frequency-wise patterns.

Network:

  • Input:
    • size: (40, 11)
    • 40: 40-dimensional log-mel filterbank features
    • 11: 11 frames, 1 center frame and 5 contextual frames on each side
  • Output:
    • 1812 tied-triphone states (senones)
  • Structure:
    • For each input at time t, apply the F-LSTM across the 40 frequency bands
    • The output of the F-LSTM at time t is fed into one T-LSTM cell; the T-LSTM recurs over all time steps t

F-LSTM:

  • For each time step t:
    • Divide the N log-filter-banks at the current time into M overlapping chunks, each containing B log-filter-banks, where C is the number of log-filter-banks overlapping between adjacent chunks
    • The M overlapping chunks are the inputs to M unrolled F-LSTM steps, producing M output vectors
    • Merge (concatenate?) these outputs into a vector h, which is the input to the T-LSTM cell at time t (see the sketch below)
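
A minimal PyTorch sketch of the frequency chunking, with hypothetical sizes B = 8 and C = 4 (the paper's exact B and C values are not recorded in these notes):

import torch
import torch.nn as nn

N, B, C = 40, 8, 4           # filterbanks, chunk size, overlap (B, C hypothetical)
step = B - C
f_lstm = nn.LSTM(input_size=B, hidden_size=24, batch_first=True)  # 24 cells, best in Table 2

frame = torch.randn(1, N)                              # one time frame of log-filterbanks
chunks = frame.unfold(dimension=1, size=B, step=step)  # (1, M, B) overlapping chunks
out, _ = f_lstm(chunks)                                # F-LSTM recurs over the M chunks
h = out.reshape(1, -1)                                 # concatenated outputs -> T-LSTM input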

Comparison:

  • Against CLDNN:
    • The two approaches both aim to achieve invariance to input distortions, but the pattern detectors in the CNN maintain a constant dimensionality, while the F-LSTM can perform a general frequency warping.
  • Against Multidimensional RNN:
    • To summarize, the T-F-LSTM works on each dimension of the multidimensional space separately and is simpler, while the multidimensional RNN works jointly on the multidimensional space with more powerful modeling.

Experiments:

  • Table 1, F-T-LSTM is better than T-LSTM.
  • Table 2, F-LSTM with 24 cells performs best.
  • Table 3, stacking multiple time frames into the input of the F-LSTM does not improve the performance.

Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks

Paper: IEEE

Key idea:

We take advantage of the complementarity of CNNs, LSTMs and DNNs by combining them into one unified architecture.

Background knowledge:

Motivation:

  • Higher-level modeling of x_t can help to disentangle underlying factors of variation within the input, which should then make it easier to learn the temporal structure between successive time steps.
  • If factors of variation in the hidden states could be reduced, then the hidden state of the model could summarize the history of previous inputs more efficiently. In turn, this could make the output easier to predict. Reducing variation in the hidden states can be modeled by having DNN layers after the LSTM layers.

Input

For each input at time t:

  • Input
    • each frame is a 40-dimensional log-mel filterbank feature
    • l contextual vectors to the left
    • r contextual vectors to the right

Output

For each output at time t:

  • Output
    • the output state label is delayed by 5 frames, so information about future frames can help predict the current frame

Network:

# Pseudo-Keras reconstruction; `unknown_add` marks the multi-scale combination the
# paper does not fully specify (see Cons below), and LSTMP denotes an LSTM with a
# recurrent projection layer as described above.
inp = Input(shape=(40, l + r + 1))                      # frequency x time
h = Conv2D(256, kernel_size=(9, 9))(inp)                # filter (9,9) on frequency-time
h = MaxPool2D(pool_size=(3, 1), strides=(3, 1))(h)      # pooling by 3 on frequency only
h = Conv2D(256, kernel_size=(4, 3))(h)
h_cnn = TimeDistributed(Dense(256))(h)                  # linear layer, parameters shared across time
x_t = Input(shape=(40, 1))                              # short-term feature at time t
h = unknown_add(x_t, h_cnn)                             # combination not specified in the paper
h = LSTMP(832, projection_units=512, truncated_bptt_steps=20, return_sequences=True)(h)
h = LSTMP(832, projection_units=512, truncated_bptt_steps=20, return_sequences=False)(h)
h = unknown_add(h, h_cnn)                               # second multi-scale connection
h = Dense(1024)(h)
h = Dense(1024)(h)
  • r = 0
  • Loss: cross-entropy
  • Optimizer: (distributed) asynchronous stochastic gradient descent (ASGD)
  • Initialization: Unit variance Gaussian for LSTM, glorot normal/uniform for CNN and DNN
  • Learning rates: exponential decay
  • Sequence training is used on the larger data sets.

Experiments:

  • From Table 2, a larger context of 20 hurts performance; l = 10 is used.
  • From Table 3, unrolling 30 time steps degrades WER.
  • From Table 4, it is beneficial to use DNN layers to transform the output of the LSTM layer to a space that is more discriminative and easier to predict output targets.
  • From Table 5, CLDNN performs better.
  • From Table 6, CLDNN benefits from the better weight initialization (uniform initialization).
  • From Table 7, passing the short-term feature x_t to the LSTM gives better performance; otherwise, CNN features alone as the LSTM input are sufficient.
  • From Tables 8 & 9, the CLDNN also outperforms the LSTM on larger datasets.

Pros & Cons:

  • Pros:
    • Good results and performance
    • Good intuition
  • Cons:
    • The paper is not very detailed: it does not fully explain the network (the multi-scale addition and the input features). This is troublesome since no code is provided.
    • The baselines are relatively simple.

Mastering the game of Go with deep neural networks and tree search

Paper: Nature
For the full paper, please Google it yourself.

Key idea:

We introduce a new approach to computer Go that uses ‘value networks’ to evaluate board positions and ‘policy networks’ to select moves.

Background knowledge:

Previous approach:

  • Depth of the search may be reduced by position evaluation: truncating the search tree at state s and replacing the subtree below s by an approximate value function that predicts the outcome from state s.
  • The breadth of the search may be reduced by sampling actions from a policy that is a probability distribution over possible moves a in position s.
  • Monte Carlo tree search (MCTS) uses Monte Carlo rollouts to estimate the value of each state in a search tree.
  • However, prior work has been limited to shallow policies or value functions based on a linear combination of input features.

This paper approach:

  • We use these neural networks to reduce the effective depth and breadth of the search tree: evaluating positions using a value network, and sampling actions using a policy network.
  • Structure (Check Figure 1 in paper):
    1. A Supervised Learning (SL) policy network trained directly on expert human moves
    2. A Reinforcement Learning (RL) policy network that improves the SL policy network by optimising the final outcome of games of self-play.
    3. Value network that predicts the winner of games played by the RL policy network against itself.
    4. MCTS with policy and value network

Notation:

  • s: state (current state of the Go game, check Extended Data Table 2 in paper)
  • a: moves (the legal moves)

Supervised Learning (SL) policy network

  • Task: predicting the probability of human experts’ next moves a given the current state s
  • Input: state s
  • Output: probability distribution of all legal moves a
  • Structure: formed by convolutional layers and rectifier nonlinearities, with a softmax as the final output (see the sketch below).
  • Detailed network: 13 layers
  • Data: 30 million state-action pairs from the KGS Go server.
  • Performance: 57.0% accuracy using all input features, 55.7% using only the raw board position and move history as inputs.
  • Time consumption: 3 ms per move
  • Extra: a faster rollout policy reaches 24.2% accuracy while needing only ~2 μs per move.
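
A schematic PyTorch sketch of a 13-layer convolutional policy of this shape, assuming 48 input feature planes on a 19x19 board; the filter sizes and counts here are illustrative, not the paper's exact configuration.

import torch.nn as nn

policy = nn.Sequential(
    nn.Conv2d(48, 192, kernel_size=5, padding=2), nn.ReLU(),      # layer 1
    *[m for _ in range(11)                                        # layers 2-12
      for m in (nn.Conv2d(192, 192, kernel_size=3, padding=1), nn.ReLU())],
    nn.Conv2d(192, 1, kernel_size=1),                             # layer 13: per-point logit
    nn.Flatten(),                                                 # 361 move logits
    nn.Softmax(dim=1),                                            # distribution over moves
)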

Reinforcement Learning (RL) policy network

  • Objective: improve the policy network by policy-gradient reinforcement learning (RL)
  • Network: the RL policy network is identical in structure to the SL policy network
  • Training procedure (see the toy sketch below):
    1. The RL policy network weights are initialised to the same values as the SL policy network weights.
    2. Games of Go are played between the current policy network and a randomly selected previous iteration of the policy network.
    3. At the end of the game, the network is updated for all moves t = 1:T in the direction that increases log p(a_t|s_t), scaled by the outcome z of the game (+1 for winning, -1 for losing).
  • Performance: 80% winning rate against the SL policy network, and 85% against the strongest open-source Go program, Pachi (a previous state-of-the-art supervised CNN achieved only an 11% winning rate against Pachi)
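
A toy REINFORCE sketch for a linear softmax policy p(a|s) = softmax(W s), standing in for the deep policy network and the distributed self-play setup (illustrative only):

import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def policy_gradient_update(W, game, z, lr=0.01):
    """game: list of (state_vector, action_index); z: +1 for a win, -1 for a loss."""
    for s, a in game:
        p = softmax(W @ s)
        grad = -np.outer(p, s)   # d log p(a|s) / dW, expectation part
        grad[a] += s             # indicator term for the action actually taken
        W += lr * z * grad       # reinforce moves from won games, suppress losses
    return W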

Value network

  • Objective: position evaluation of any given state s
  • Task: estimate a value function that predicts the outcome from position s of games played using policy p for both players
  • Input: state-outcome pairs (s, z)
  • Output: estimated value of the position
  • Network: a value network with a single prediction output, similar in architecture to the two policy networks
  • Training: use state-outcome pairs (s, z) from both the KGS dataset and a self-play data set, with SGD minimising the MSE between the predicted value and the corresponding outcome z
  • Performance: 0.226 and 0.234 MSE on the training and test sets respectively; close to the RL policy network's evaluation accuracy, but with 15,000 times less computation.

MCTS (Searching) with policy and value networks

  • Objective: Determine next move action
  • Input: state s
  • Output: estimated optimal move a
  • Algorithm: the network runs many simulations for the same input state s; each simulation traverses the tree from the root state (the input state s). For one simulation (Figure 3):
    1. Selection: at a node with children, branch to the child maximising Q(s, a) + u(s, a), where the exploration bonus u(s, a) is proportional to P(s, a) / (1 + N(s, a)) (see the sketch below).
    2. Expansion: a leaf node may be expanded by the SL policy network, and the output probabilities are stored as prior probabilities P(s, a) for each action.
    3. Evaluation: at the end of a simulation, the leaf node s_L is evaluated in two ways: by the value network v(s_L), and by running a rollout to the end of the game with the fast rollout policy, giving outcome z_L; these are combined into a leaf evaluation V(s_L) = (1 - lambda) * v(s_L) + lambda * z_L.
    4. Backup: update the statistics N(s, a) = sum_i 1(s, a, i) and Q(s, a) = (1 / N(s, a)) * sum_i 1(s, a, i) * V(s_L^i), where 1(s, a, i) indicates whether edge (s, a) was traversed during the i-th simulation and s_L^i is the leaf node from the i-th simulation.
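
A toy sketch of the selection rule in step 1, assuming per-edge statistics N (visit count), Q (mean action value) and a prior P from the SL policy network; c_puct is an exploration constant, and the paper's exact bonus also scales with the parent node's total visit count, approximated here:

import math

def select_action(edges, c_puct=5.0):
    """edges: dict mapping action -> {'N': visits, 'Q': value, 'P': prior}."""
    total_n = sum(e['N'] for e in edges.values())
    def score(e):
        u = c_puct * e['P'] * math.sqrt(total_n) / (1 + e['N'])   # exploration bonus
        return e['Q'] + u
    return max(edges, key=lambda a: score(edges[a]))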

Performance

  • Against several other programs:
    • 5 s computation time per move
    • AlphaGo (single-machine) won 494 out of 495 games
    • With four handicap stones, AlphaGo won 77%, 86% and 99% of games against Crazy Stone, Zen and Pachi respectively.
    • AlphaGo (distributed) won 100% of games against other programs and 77% against AlphaGo (single-machine)
  • Variants of AlphaGo that evaluate positions using just the value network or just rollouts were also tested; the mixed evaluation gives the best results.
  • Against professional player Fan Hui: AlphaGo won 5 games to 0