Note on optimization

The most widely used optimization method in machine learning is undoubtedly stochastic gradient descent (SGD). It is extremely straightforward and easy to implement for optimizing neural networks. However, the choice of SGD's parameters can have a large impact on its convergence rate. To make SGD more robust and efficient, researchers have tried various extensions such as momentum, weight decay, and others.
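As a minimal sketch of how these extensions change the plain SGD update, here is one step of SGD with classical momentum and L2 weight decay (the function name and all hyperparameter values are illustrative, not from any particular library):

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9, weight_decay=1e-4):
    """One SGD update with classical momentum and L2 weight decay.

    Weight decay folds lambda * w into the gradient; momentum accumulates
    an exponentially weighted velocity that smooths the update direction.
    """
    g = grad + weight_decay * w           # L2 weight decay
    velocity = momentum * velocity - lr * g
    return w + velocity, velocity

# Toy problem: minimize f(w) = 0.5 * ||w||^2, whose gradient is w itself.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(200):
    w, v = sgd_momentum_step(w, grad=w, velocity=v, lr=0.1)
print(np.linalg.norm(w))  # the norm shrinks toward 0
```

With momentum, the iterates oscillate around the minimum before settling, which is exactly why these parameters interact with the convergence rate.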

Continue reading “Note on optimization”

Accelerating computation – hardware aspect

With the development of computer hardware, it is increasingly important to take hardware into consideration when trying to write highly efficient code. The speed of a program is determined not only by the code itself, but also by how well the hardware supports it. This blog is about how to make the best use of your computer's hardware. Continue reading “Accelerating computation – hardware aspect”

Featured

Deep learning with attention

Humans make use of existing knowledge when learning new things. That is also what this post tries to do: by incorporating recurrent neural networks, we decompose the learning of a big task into a sequence of smaller tasks.

We started this research with simple 3D convolutional neural networks. The following figure illustrates what we did during this period. The upper-left subplot shows ModelNet, the labeled geometry database we are using for training. These raw models are stored as mesh files, so in order to feed the geometries into a 3D-CNN, we first need to convert them into binary voxels, as the upper-right subplot shows. As in all other existing literature, the voxel models we use here have a resolution of only 32 in each dimension. In the beginning, we tried the simplest 3D-CNN structure, which is also shown in the upper-right figure. This structure is very similar to LeNet and it works surprisingly well on ModelNet, as the first subplot in the bottom row shows: it achieved 91% accuracy, comparable to the best result in 2015. As with 2D-CNNs, people have tried ResNet and Inception structures to improve the accuracy, but I wanted to do something different.
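The mesh-to-voxel conversion can be sketched as follows. This is a simplification assuming the mesh has already been sampled into a point cloud (the `points` array below is a toy placeholder, not our actual pipeline): each point is normalized into the unit cube and marks the occupancy cell it falls into.

```python
import numpy as np

def voxelize(points, resolution=32):
    """Convert an (N, 3) point cloud into a binary occupancy grid."""
    pts = points - points.min(axis=0)           # shift to the positive octant
    pts = pts / (pts.max() + 1e-9)              # uniform scale into [0, 1]
    idx = np.minimum((pts * resolution).astype(int), resolution - 1)
    grid = np.zeros((resolution,) * 3, dtype=np.uint8)
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1
    return grid

# Toy example: points along the main diagonal fill the diagonal voxels.
points = np.linspace(0.0, 1.0, 100)[:, None] * np.ones((1, 3))
grid = voxelize(points)
print(grid.shape, grid.sum())  # (32, 32, 32) with 32 diagonal voxels filled
```

A production pipeline would instead rasterize the mesh surface directly (e.g. with a solid-fill pass), but the resulting 32³ binary tensor is the same kind of input the network sees.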

I believe increasing the resolution of the converted voxel data can help further improve the accuracy of a 3D-CNN. However, the first problem I ran into is the high computation cost due to the curse of dimensionality. A comparison of the complexity of a typical 2D-CNN and a 3D-CNN is shown in the last two subplots in the bottom row of the figure below. Even a slight increase in voxel resolution results in a huge increase in computation cost. This is the problem I was trying to solve.
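To make the scaling concrete, here is a back-of-the-envelope multiply count for a single "same"-padded convolution layer with one input and one output channel (the kernel size of 3 is an assumption for illustration):

```python
def conv_mults(resolution, kernel=3, dims=2):
    """Multiplications for one single-channel 'same'-padded conv layer:
    kernel ** dims multiplies at each of resolution ** dims output sites."""
    return (kernel ** dims) * (resolution ** dims)

for r in (32, 64):
    print(r, conv_mults(r, dims=2), conv_mults(r, dims=3))
# Doubling the resolution costs 4x in 2D but 8x in 3D,
# and at r = 32 the 3D layer is already ~100x the cost of the 2D layer.
```

Channel counts and layer depth multiply this further, which is why even a modest jump from 32³ voxels is prohibitive.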

[Figure: ModelNet pipeline and 2D/3D-CNN computation cost comparison]

After some literature reading, I thought an attention mechanism might be a way to solve the computation-cost problem. What we expect is that the network stares at only a small portion of the geometry at each time step, and then decides what to look at next based on the information from the previous time step. This is also the natural way animals explore an environment that is too big to be perceived at once. With this approach, only the first several most important parts of the geometry need to be analyzed by the network, which can therefore be done at a much higher resolution.

Inspired by DRAW: A Recurrent Neural Network For Image Generation, we also used a recurrent neural network to incorporate the attention mechanism. The left subplot of the following figure shows the network structure we are using. The reader module is borrowed from the DRAW algorithm; it extracts small image patches, as the upper-right subplot shows. For now, we have only tested this network on 2D images, which are shown in the lower-right subplot. The boxes mark where the attention is during the RNN iterations. The probability of the correct prediction increases asymptotically and converges to a fairly high value after only four iterations.
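For readers unfamiliar with the DRAW reader, here is a minimal NumPy sketch of its read operation: a grid of 1-D Gaussian filters along each axis extracts a soft N×N glimpse as `F_y @ image @ F_x.T`. The fixed `stride` and `sigma` below are simplifications; in DRAW they are produced by the network at each step.

```python
import numpy as np

def filterbank(grid_size, img_size, center, stride, sigma):
    """grid_size Gaussian filters over one image axis.

    Filter centers mu are spaced by `stride` around `center`; each row
    is a normalized 1-D Gaussian over the pixel positions of that axis.
    """
    i = np.arange(grid_size)
    mu = center + (i - grid_size / 2 + 0.5) * stride
    a = np.arange(img_size)
    F = np.exp(-((a[None, :] - mu[:, None]) ** 2) / (2 * sigma ** 2))
    return F / (F.sum(axis=1, keepdims=True) + 1e-9)  # rows sum to 1

def read_patch(image, center_xy, stride, sigma, N=8):
    """Soft-attention read in the style of DRAW: patch = F_y @ image @ F_x.T."""
    H, W = image.shape
    Fy = filterbank(N, H, center_xy[1], stride, sigma)
    Fx = filterbank(N, W, center_xy[0], stride, sigma)
    return Fy @ image @ Fx.T                          # (N, N) glimpse

image = np.zeros((28, 28))
image[10:18, 10:18] = 1.0   # bright square in the middle
patch = read_patch(image, center_xy=(14, 14), stride=1.0, sigma=0.5)
print(patch.shape)          # (8, 8) glimpse centered on the square
```

Note that the two matrix products touch every pixel of the full image for every glimpse, which is exactly the cost issue discussed next.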

However, another computation problem arises here. The reader extracts the patch by constructing attention matrices and applying them to the original image, a standard approach usually referred to as “soft attention”. I didn't even try this network on 3D voxels, because I found the attention operation to be extremely computation-intensive even for 2D images. There is also a so-called “hard attention” mechanism that uses reinforcement learning to decide the location of the next attention, but it is non-differentiable and I'm not sure how well it would work for a 3D-CNN in terms of both accuracy and computational cost. I would appreciate any comments on hard attention, or suggestions for other attention mechanisms I could use.

[Figure: attention network structure, extracted patches, and 2D results]

Prior to the network above, we also tested the network structure shown below. However, the result in this case is very strange and I don't know why. The reader here is the same as in the previous network. To me, there are only two differences from the network above: first, this network has no convolutions, due to an implementation issue; second, the connections between time steps are slightly different.

[Figure: the second network structure]

The result of the second network on MNIST is shown below. The red boxes in the upper-right subplot are the attentions; this network is trained with 8 attention windows. Each individual box is also plotted in the second row, and the last row is a direct accumulation of all previous attentions. The classification result of this network is visualized in the upper-right subplot, where the x axis is the RNN iteration number and the y axis the class; brighter colors indicate higher predicted probability. In contrast to the asymptotic result of the previous network, the probability predictions of this network jump around constantly. Strangest of all, despite the classification probability not being asymptotic, the correct classification is still obtained at the end of the RNN iterations. This phenomenon persisted across different numbers of RNN iterations.

[Figure: results of the second network on MNIST]

Continue reading “Deep learning with attention”