- Google’s BBR fixes TCP’s dirty little secret – Tom Limoncelli’s EverythingSysadmin Blog
- A computer was asked to predict which start-ups would be successful. The results were astonishing
- Donald Trump’s Ghostwriter Tells All | The New Yorker
- The Mystery of Ezra Cohen-Watnick – The Atlantic
- The Trump Administration Just Made it Easier for Law Enforcement to Take Your Property – Mother Jones
- The new Detroit’s fatal flaw – The Washington Post
- This Man Used His Inherited Fortune To Fund The Racist Right
- Too much surveillance makes us less free. It also makes us less safe. – The Washington Post
- A Lost Cat’s Reincarnation, in Masahisa Fukase’s “Afterword” | The New Yorker
- Everywhere You Look, We’ve Downgraded Real Problems Into Mere ‘Issues’ – The New York Times
- How to Mail Your Own Potato – YouTube
- DenseNet/models at master · liuzhuang13/DenseNet
Memory Efficient Implementation of DenseNets
The standard (original) implementation of DenseNet with recursive concatenation is very memory-inefficient. This can be an obstacle when we need to train DenseNets on high-resolution images (such as for object detection and localization tasks) or on devices with limited memory.
In theory, DenseNet should use memory more efficiently than other networks, because one of its key features is that it encourages feature reuse. The fact that DenseNet is "memory hungry" in practice is simply an artifact of the implementation. In particular, the culprit is the recursive concatenation, which re-allocates memory for all previous outputs at each layer. Consider a dense block with N layers: the first layer's output has N copies in memory, the second layer's output has (N-1) copies, and so on, leading to a quadratic increase (1+2+…+N) in memory consumption as the network depth grows.
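A rough back-of-the-envelope sketch of that quadratic growth, as a toy NumPy analogue rather than the repository's Torch code (the growth rate and feature-map size are assumed values):

```python
import numpy as np

# Toy illustration (not the actual Torch implementation): each "layer" of a
# dense block concatenates all previous feature maps into a freshly allocated
# tensor, so the bytes allocated across a block grow quadratically with depth.
GROWTH_RATE = 32          # channels added per layer (assumed value)
H = W = 56                # spatial size of the feature maps (assumed value)

def naive_block_bytes(num_layers):
    """Total bytes allocated for concatenated inputs in a naive dense block."""
    total = 0
    features = [np.zeros((GROWTH_RATE, H, W), dtype=np.float32)]
    for _ in range(num_layers):
        concat = np.concatenate(features, axis=0)   # re-copies every prior output
        total += concat.nbytes
        features.append(np.zeros((GROWTH_RATE, H, W), dtype=np.float32))
    return total

for n in (6, 12, 24):
    print(n, "layers ->", naive_block_bytes(n) / 2**20, "MiB of concat buffers")
```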
Using optnet (-optMemory 1) or shareGradInput (-optMemory 2), we can significantly reduce the run-time memory footprint of the standard implementation (with recursive concatenation). However, the memory consumption is still a quadratic function of depth.
We implement a customized densely connected layer (largely motivated by Tongcheng's Caffe implementation of memory-efficient DenseNet), which uses shared buffers to store the concatenated outputs and gradients, dramatically reducing the memory footprint of DenseNet during training. The mode -optMemory 3 activates shareGradInput together with shared output buffers, while the mode -optMemory 4 additionally shares the memory used to store the output of the Batch Normalization layer before each 1×1 convolution. The latter makes memory consumption linear in network depth, but introduces a training-time overhead because those Batch Normalization layers must be re-forwarded during the backward pass.
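The shared-buffer idea can be sketched very roughly in NumPy. This is only an illustrative analogue of what the -optMemory modes do in Torch, not the repository's code; `fake_layer`, the growth rate, and the buffer sizes are all assumptions:

```python
import numpy as np

# Sketch of the shared-buffer idea: the whole dense block writes into one
# preallocated buffer, and "concatenation" becomes a view instead of a copy.
GROWTH_RATE, H, W = 32, 56, 56
NUM_LAYERS = 12
INIT_CHANNELS = GROWTH_RATE

# One allocation for the entire block's concatenated features.
shared = np.zeros((INIT_CHANNELS + NUM_LAYERS * GROWTH_RATE, H, W), dtype=np.float32)
offset = INIT_CHANNELS   # channels already filled (the block's input)

def fake_layer(x):
    """Stand-in for BN -> ReLU -> conv producing GROWTH_RATE new channels."""
    return np.ones((GROWTH_RATE,) + x.shape[1:], dtype=np.float32)

for _ in range(NUM_LAYERS):
    concat_input = shared[:offset]                       # a view: no new memory
    new_features = fake_layer(concat_input)
    shared[offset:offset + GROWTH_RATE] = new_features   # write into the shared buffer
    offset += GROWTH_RATE

print("buffer holds", offset, "channels; allocated once, reused by every layer")
```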
- DenseNet/efficient_densenet_techreport.pdf at master · liuzhuang13/DenseNet
- The Atlantic is ‘most vital when America is most fractured.’ Good thing it soars today. – The Washington Post
- Twitter
RT @djrothkopf: A world being transformed by science and a White House without a scientist in it. Death knell for US leadership.
- Technology Is Biased Too. How Do We Fix It? | FiveThirtyEight
- Michael Chabon: ‘I have a socialist approach to my regrets’ | Life and style | The Guardian
- [1412.6980] Adam: A Method for Stochastic Optimization
We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to implement, is computationally efficient, has low memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms that inspired Adam are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.
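For reference, a minimal sketch of the Adam update rule described in the paper, using the commonly cited default hyper-parameters; the `gradient_fn` argument and the toy quadratic objective below are just illustrative assumptions:

```python
import numpy as np

def adam(gradient_fn, theta, steps=1000, alpha=0.001,
         beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam on a parameter vector given a function returning its gradient."""
    m = np.zeros_like(theta)   # first-moment (mean) estimate
    v = np.zeros_like(theta)   # second-moment (uncentered variance) estimate
    for t in range(1, steps + 1):
        g = gradient_fn(theta)
        m = beta1 * m + (1 - beta1) * g          # update biased first moment
        v = beta2 * v + (1 - beta2) * g * g      # update biased second moment
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy usage: minimize f(x) = ||x - 3||^2, whose gradient is 2 * (x - 3).
theta = adam(lambda x: 2.0 * (x - 3.0), np.zeros(5), steps=2000, alpha=0.01)
print(theta)   # converges to approximately [3, 3, 3, 3, 3]
```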
- A Gentle Guide to Using Batch Normalization in Tensorflow – Rui Shu
- batch normalization | Francis’s standard
- Installing Emacs on OS X – WikEmacs
- [1707.02968] Revisiting Unreasonable Effectiveness of Data in Deep Learning Era
The success of deep learning in vision can be attributed to: (a) models with high capacity; (b) increased computational power; and (c) the availability of large-scale labeled data. Since 2012, there have been significant advances in the representation capabilities of models and the computational capabilities of GPUs. But the size of the biggest dataset has surprisingly remained constant. What will happen if we increase the dataset size by 10x or 100x? This paper takes a step towards clearing the clouds of mystery surrounding the relationship between 'enormous data' and deep learning. By exploiting the JFT-300M dataset, which has more than 375M noisy labels for 300M images, we investigate how the performance of current vision tasks would change if this data was used for representation learning. Our paper delivers some surprising (and some expected) findings. First, we find that performance on vision tasks still increases linearly with the order of magnitude of the training data size. Second, we show that representation learning (or pre-training) still holds a lot of promise: one can improve performance on any vision task simply by training a better base model. Finally, as expected, we present new state-of-the-art results for different vision tasks including image classification, object detection, semantic segmentation, and human pose estimation. Our sincere hope is that this inspires the vision community not to undervalue the data and to develop collective efforts in building larger datasets.
- Training an object detector using Cloud Machine Learning Engine | Google Cloud Big Data and Machine Learning Blog | Google Cloud Platform
- Capacity and Trainability in Recurrent Neural Networks
Two potential bottlenecks on the expressiveness of recurrent neural networks (RNNs) are their ability to store information about the task in their parameters, and to store information about the input history in their units. We show experimentally that all common RNN architectures achieve nearly the same per-task and per-unit capacity bounds with careful training, for a variety of tasks and stacking depths. They can store an amount of task information which is linear in the number of parameters, and is approximately 5 bits per parameter. They can additionally store approximately one real number from their input history per hidden unit. We further find that for several tasks it is the per-task parameter capacity bound that determines performance. These results suggest that many previous results comparing RNN architectures are driven primarily by differences in training effectiveness, rather than differences in capacity. Supporting this observation, we compare training difficulty for several architectures, and show that vanilla RNNs are far more difficult to train, yet have slightly higher capacity. Finally, we propose two novel RNN architectures, one of which is easier to train than the LSTM or GRU for deeply stacked architectures.
- Google Brain Team – Research at Google
- A hacker stole $31M of Ether—how it happened and what it means for Ethereum
- Dear tech dudes, stop being so dumb about women | TechCrunch
- U.N. Brought Cholera to Haiti. Now It Is Fumbling Its Effort to Atone. – NYTimes.com
"U.N. Brought Cholera to Haiti. Now It Is Fumbling Its Effort to Atone."
- Twitter
"U.N. Brought Cholera to Haiti. Now It Is Fumbling Its Effort to Atone."
Digest powered by RSS Digest