Commit bf37649e authored by wnma3mz

complete 17th

Parent 7b6a264f
# A Tour of Gotchas When Implementing Deep Q Networks with Keras and OpenAI Gym
Original article: [A Tour of Gotchas When Implementing Deep Q Networks with Keras and OpenAI Gym](http://srome.github.io/A-Tour-Of-Gotchas-When-Implementing-Deep-Q-Networks-With-Keras-And-OpenAi-Gym/?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com)
Starting with the Google DeepMind paper, there has been a lot of new attention around training models to play video games. You, the data scientist/engineer/enthusiast, may not work in reinforcement learning but probably are interested in teaching neural networks to play video games. Who isn’t? With that in mind, here’s a list of nuances that should jumpstart your own implementation.
The lessons below were gleaned from working on my [own implementation](http://www.github.com/srome/ExPyDQN) of the [Nature](http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html) paper. They are aimed at people who work with data but may run into issues with some of the non-standard approaches used in the reinforcement learning community compared with typical supervised learning use cases. I will address both technical details of the neural networks' parameters and the libraries involved. This post assumes limited knowledge of the Nature paper, in particular with regard to the basic notation used around Q learning. My implementation was written to be instructive by relying mostly on Keras and Gym; I avoided theano/tensorflow-specific tricks (e.g., theano's [disconnected gradient](https://github.com/Theano/theano/blob/52903f8267cff316fc669e207eac4e2ecae952a6/theano/gradient.py#L2002-L2021)) to keep the focus on the main components.
# Learning Rate and Loss Functions
If you have looked at [some](https://github.com/spragunr/deep_q_rl/blob/master/deep_q_rl/q_network.py) of the [implementations](https://github.com/matthiasplappert/keras-rl/blob/master/rl/agents/dqn.py) out there, you'll see there's usually an option to either sum the loss function over a minibatch or take a mean. Most loss functions you hear about in machine learning start with the word "mean" or at least take a mean over a mini batch. Let's talk about what a summed loss really means in the context of your learning rate.
A typical gradient descent update looks like this:
$$\theta_{t+1} \leftarrow \theta_t - \lambda \nabla(\ell_{\theta_t})$$
where $\theta_t$ is your learner's weights at time $t$, $\lambda$ is the learning rate, and $\ell_{\theta_t}$ is your loss function, which depends on $\theta_t$. I will suppress the $\theta_t$ dependence of the loss function moving forward. Let us define $\ell$ as the loss function that sums over the mini batch and $\hat{\ell}$ as the loss function that takes the mean over the mini batch. For a fixed mini batch of size $m$, notice:
$$\hat{\ell} = \frac{1}{m}\ell$$
and so mathematically we have
$$\nabla(\hat{\ell}) = \frac{1}{m}\nabla(\ell).$$
What does this tell us? If you train two models, each with learning rate $\lambda$, one on the summed loss and one on the mean loss, the two will behave very differently because the magnitude of the updates will differ by a factor of $m$! In theory, with a small enough learning rate, you can account for the size of the mini batch and recover the behavior of the mean version of the loss from the summed version. However, other coefficients in the loss, such as the regularization term, would then need adjustment as well. Sticking with the mean version leads to standard coefficients which work well in many situations. Using a summed loss rather than a mean is therefore atypical in other data science applications, but the option is frequently present in reinforcement learning, so be aware that you may need to adjust the learning rate. With that said, my implementation uses the mean version and a learning rate of 0.00025.
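To see the factor of $m$ concretely, here is a small NumPy sketch (my own illustration, not from the post) comparing the gradient of a summed squared error with its mean counterpart for a toy linear model; all names and values are made up.

```
import numpy as np

np.random.seed(0)
m = 32                        # mini batch size
X = np.random.randn(m, 4)     # toy inputs
y = np.random.randn(m)        # toy targets
w = np.zeros(4)               # toy linear model weights

residual = X.dot(w) - y       # per-example error

# Gradient of the summed loss: sum_i (x_i . w - y_i)^2
grad_sum = 2.0 * X.T.dot(residual)

# Gradient of the mean loss: (1/m) * sum_i (x_i . w - y_i)^2
grad_mean = grad_sum / m

# The updates differ by exactly a factor of m, so a learning rate tuned
# for one version of the loss is off by m for the other.
print(np.allclose(grad_sum, m * grad_mean))  # True
```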
# Stochasticity, Frame Skipping, and Gym Defaults
Many (good) blog posts on reinforcement learning show a model being trained on "(Game Name)-v0", and so you might decide to use that ROM as you've seen it done before. So far so good. Then you read the various papers and see a technique called "frame skipping", where you stack $n$ output screens from the emulator into a single $n$-by-length-by-width image and pass that to the model, and so you implement it thinking everything is going according to plan. It's not. Depending on your version of Gym, you can run into trouble.
In older versions of Gym, the Atari environment *randomly* [repeated your action for 2-4 steps](https://github.com/openai/gym/blob/bde0de609d3645e76728b3b8fc2f3bf210187b27/gym/envs/atari/atari_env.py#L69-L71) and returned the resulting frame. In code, this is what that means for your frame-skip implementation:
```
for k in range(frame_skip):
    obs, reward, is_terminal, info = env.step(action)  # 2-4 emulator steps occur with the given action
```
If you implement frame skipping with n=4 on top of this, your learner may effectively be seeing every 8-16 (or more!) frames rather than every 4. You can imagine the impact on performance. Thankfully, this has since been made [optional](https://github.com/openai/gym/blob/master/gym/envs/atari/atari_env.py#L75-L80) via new ROMs, which we will mention in a second. However, there is another setting to be aware of, and that's repeat_action_probability. For "(Game Name)-v0" ROMs, this is on by default. It is the probability that the game will ignore a new action and repeat the previous action at each time step. To remove both the frame skip and the repeat action probability, use the "(Game Name)NoFrameskip-v4" ROMs. A full accounting of these settings can be found [here](https://github.com/openai/gym/blob/5cb12296274020db9bb6378ce54276b31e7002da/gym/envs/__init__.py#L298-L376).
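As a concrete sketch (mine, not the post's code, and assuming the classic Gym API where step returns a 4-tuple), this is how you might create the deterministic ROM and do the frame skipping yourself:

```
import gym

env = gym.make('PongNoFrameskip-v4')  # no built-in frame skip, no sticky actions
frame_skip = 4

obs = env.reset()
action = env.action_space.sample()    # your agent's chosen action would go here

# Repeat the chosen action ourselves so the learner sees every 4th frame,
# accumulating the reward from the skipped frames (see the reward section below).
total_reward = 0.0
for _ in range(frame_skip):
    obs, reward, is_terminal, info = env.step(action)
    total_reward += reward
    if is_terminal:
        break
```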
I would be remiss not to point out that Gym is not doing this just to ruin your DQN. There is a legitimate reason for it, but the setting can lead to *unending* frustration when your neural network is ramming itself into the boundary of the stage rather than hitting the ping pong pixel-square. The reason is to introduce stochasticity into the environment. Otherwise, the game is deterministic and your network simply memorizes a series of steps like a dance. When using a NoFrameskip ROM, you have to introduce your own stochasticity to avoid said dance. The Nature paper (and many libraries) do this with the "null op max" setting. At the beginning of each episode (i.e., a round of a game of Pong), the agent performs a series of $k$ consecutive null operations (action=0 for the Atari emulator in Gym), where $k$ is an integer sampled uniformly from $[0, \text{null op max}]$. It can be implemented by the following pseudo-code at the start of an episode:
```
obs = env.reset()
# Perform a sequence of null operations to make the game stochastic
for k in range(np.random.randint(0, null_op_max + 1)):  # k sampled uniformly from [0, null_op_max]
    obs, reward, is_terminal, info = env.step(0)  # action 0 is the null op
```
# Gradient Clipping, Error Clipping, Reward Clipping
There are several different types of clipping going on in the Nature paper, and each can easily be confused and implemented incorrectly. In fact, if you thought error clipping and gradient clipping were different, you’re already confused!
## What is Gradient Clipping?
The Nature paper states that it is helpful to "clip the error term". The community seems to have rejected "error clipping" in favor of the term "gradient clipping". Without knowing the background, the term can be ambiguous. In both cases, there is actually no clipping involved of the loss function, the error, or the gradient. Really, they are choosing a loss function whose gradient does not grow with the size of the error past a certain region, thus limiting the size of the gradient update for large errors. In particular, when the absolute error is greater than 1, they switch from a squared loss to an absolute-value loss. Why? Let's look at the derivative!
This is the derivative of a mean-squared-error-like loss:
$$\frac{d}{dx}\,\frac{1}{2}(x-y)^2 = (x-y)$$
Now compare that to the derivative of an absolute-value-like loss when $x-y > 0$:
$$\frac{d}{dx}(x-y) = 1$$
If we think of the value of the loss function, or of the error, as $x-y$, we can see one gradient update would contain the factor $x-y$ while the other does not. There isn't really a good, catchy phrase to describe the above mathematical trick that is more representative than "gradient clipping".
The standard way to accomplish this is to use the [Huber loss](https://en.wikipedia.org/wiki/Huber_loss) function, defined as follows:
$$f(x) = \begin{cases} \frac{1}{2}x^2 & \text{if } |x| \le \delta, \\ \delta\left(|x| - \frac{1}{2}\delta\right) & \text{otherwise.} \end{cases}$$
There is a common trick to implement this so that symbolic mathematics libraries like theano and tensorflow can take the derivative more easily, without the use of a switch statement. That trick is outlined and coded below and is commonly employed in most implementations.
The function you actually code is as follows: let $q = \min(|x|, \delta)$. Then
$$g(x) = \frac{q^2}{2} + \delta(|x| - q).$$
When $|x| \le \delta$, plugging into the formula shows that we recover $g(x) = \frac{1}{2}x^2$. Otherwise, when $|x| > \delta$,
$$g(x) = \frac{\delta^2}{2} + \delta(|x| - \delta) = \delta|x| - \frac{1}{2}\delta^2 = \delta\left(|x| - \frac{1}{2}\delta\right).$$
So $g = f$. This is coded below and plotted to show the function is continuous with a continuous derivative. The Nature paper uses $\delta = 1$.
```
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
def huber_loss(x, clip_delta=1):
    error = np.abs(x)
    quadratic_part = np.minimum(error, clip_delta)
    return 0.5 * np.square(quadratic_part) + clip_delta * (error - quadratic_part)
f=np.vectorize(huber_loss)
x = np.linspace(-5,5,100)
plt.plot(x, f(x))
```
![png](http://srome.github.io/images/dqn_impl/output_8_1.png)
As you can see, the slope of the graph (the derivative) is "clipped" to never be bigger in magnitude than 1. An interesting point is that the second derivative is not continuous:
$$f''(x) = \begin{cases} 1 & \text{if } |x| \le \delta, \\ 0 & \text{otherwise.} \end{cases}$$
This could cause problems when using second-order methods for gradient descent, which is why some suggest the pseudo-Huber loss function, a smooth approximation to the Huber loss. However, the Huber loss is sufficient for our goals.
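If you want to use the same trick as a loss inside Keras, a minimal sketch with the Keras backend looks like the following; the function name is mine, and it assumes `y_true` and `y_pred` are the usual loss-function arguments supplied by `model.compile`.

```
from keras import backend as K

def huber_loss_keras(y_true, y_pred, clip_delta=1.0):
    # Same switch-free formulation as above, written with backend ops so it
    # can be passed directly to model.compile(loss=huber_loss_keras).
    error = K.abs(y_true - y_pred)
    quadratic_part = K.minimum(error, clip_delta)
    return K.mean(0.5 * K.square(quadratic_part)
                  + clip_delta * (error - quadratic_part), axis=-1)
```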
## What Reward Do I Clip?
This is actually a much subtler question when you introduce frame skipping into the equation. Do you take the last reward returned from the emulator? But what if there was a reward during the skipped frames? When you're so far down in the weeds of neural network implementation, it's surprising when the answer comes from the basic [Q learning](https://en.wikipedia.org/wiki/Q-learning) framework. In Q learning, one uses the total reward since the last action, and this is what they did in their paper as well. This means that even on frames you skipped, you need to save the observed rewards and aggregate them as the "reward" for a given transition $(s_t, a_t, r_t, s_{t+1})$. This cumulative reward is what you clip. In my implementation, you can see this in the step function of the TrainingEnvironment class. The pseudo-code is below:
```
total_reward = 0
for k in range(frame_skip):
    obs, reward, is_terminal, info = env.step(action)  # 1 step in the emulator
    total_reward += reward
```
This total reward is what is stored in the experience replay memory.
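The clipping itself is then a one-liner; the Nature paper clips the accumulated reward to the interval [-1, 1] (a sketch continuing the pseudo-code above):

```
import numpy as np

clipped_reward = np.clip(total_reward, -1.0, 1.0)  # this is what goes into replay memory
```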
# Memory Considerations
The "experience replay memory" is a store from which training examples $(s_t, a_t, r_t, s_{t+1})$ are sampled to break the time correlation of the data points, which stops the network's learning from diverging. The original paper mentions a replay memory of a million examples $(s_t, a_t, r_t, s_{t+1})$. Both the $s_t$'s and the $s_{t+1}$'s are $n$-by-length-by-width images, so you can imagine the memory requirements can become very large. This is one of the cases where knowing more about programming and types while working in Python can be useful. By default, Gym returns images as numpy arrays with the datatype int8. If you do any processing of the images, it's likely that your images will end up as float32. So it's important to store your images as int8 for space reasons, and to apply any transformations necessary for the neural network (like scaling) at training time rather than before storing the states in the replay memory.
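As a back-of-the-envelope illustration (my own arithmetic, assuming the Nature paper's 84x84 preprocessed frames stacked 4 deep), the dtype alone changes the replay memory footprint by a factor of four:

```
import numpy as np

frames_per_state, height, width = 4, 84, 84   # Nature-style preprocessed state
memory_size = 10**6

bytes_per_state_int8 = frames_per_state * height * width * np.dtype(np.int8).itemsize
bytes_per_state_float32 = frames_per_state * height * width * np.dtype(np.float32).itemsize

# Storing both s_t and s_{t+1} for every example in the replay memory:
print(2 * memory_size * bytes_per_state_int8 / 2.0 ** 30)     # ~52.6 GiB as int8
print(2 * memory_size * bytes_per_state_float32 / 2.0 ** 30)  # ~210 GiB as float32
```

Numbers like these are why, on a typical workstation, you may want a smaller memory or a more compact storage scheme than the full stacks sketched here.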
It is also helpful to pre-allocate this replay memory. In my implementation, I do this as follows:
```
def __init__(self, size, image_size, phi_length, minibatch_size):
    self._memory_state = np.zeros(shape=(size, phi_length, image_size[0], image_size[1]), dtype=np.int8)
    self._memory_future_state = np.zeros(shape=(size, phi_length, image_size[0], image_size[1]), dtype=np.int8)
    self._rewards = np.zeros(shape=(size, 1), dtype=np.float32)
    self._is_terminal = np.zeros(shape=(size, 1), dtype=np.bool)
    self._actions = np.zeros(shape=(size, 1), dtype=np.int8)
```
Of course, phi_length $= n$ from our previous discussion, the number of screens from the emulator stacked together to form a state.
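To show how those arrays get used, here is a minimal sketch of a circular store-and-sample wrapper around them; the class and method names are mine and only illustrate the idea, not the post's actual code.

```
import numpy as np

class ReplayMemory:
    def __init__(self, size, image_size, phi_length, minibatch_size):
        self._memory_state = np.zeros(shape=(size, phi_length, image_size[0], image_size[1]), dtype=np.int8)
        self._memory_future_state = np.zeros(shape=(size, phi_length, image_size[0], image_size[1]), dtype=np.int8)
        self._rewards = np.zeros(shape=(size, 1), dtype=np.float32)
        self._is_terminal = np.zeros(shape=(size, 1), dtype=np.bool_)
        self._actions = np.zeros(shape=(size, 1), dtype=np.int8)
        self._size = size
        self._minibatch_size = minibatch_size
        self._next = 0    # next slot to overwrite
        self._count = 0   # number of slots currently filled

    def add(self, state, action, reward, future_state, is_terminal):
        i = self._next
        self._memory_state[i] = state               # stored as int8; scaling happens at training time
        self._memory_future_state[i] = future_state
        self._actions[i] = action
        self._rewards[i] = reward                   # the clipped, accumulated reward
        self._is_terminal[i] = is_terminal
        self._next = (i + 1) % self._size           # wrap around, overwriting the oldest example
        self._count = min(self._count + 1, self._size)

    def sample(self):
        # Uniform random indices break the time correlation between examples.
        idx = np.random.randint(0, self._count, size=self._minibatch_size)
        return (self._memory_state[idx], self._actions[idx], self._rewards[idx],
                self._memory_future_state[idx], self._is_terminal[idx])
```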
# Debugging In Circles
With so many moving parts and parameters, there's a lot that can go wrong in an implementation. My recommendation is to set your starting parameters to the same values as in the original paper to eliminate one source of error. Most of the changes from the original NIPS paper to the Nature paper were made to standardize learning parameters and performance across different Atari games. Many quirks arise from game to game, and most of the techniques in the Nature paper guard against them. For example, the consecutive max over frames is used to deal with screen flicker, which can make objects disappear under certain frame-skip settings. Once you have set the parameters, here are a few things to check when your network is not learning well (or at all):
- Check whether your Q-value output jumps in size between batch updates; this means your gradient updates are large. Take a look at your learning rate and investigate your gradient to find the problem.
- Look at the states you are sending to the neural network. Does the frame skipping/consecutive max seem to be correct? Are your future states different from your current states? Do you see a logical progression of images? If not, you may have some memory reference issues, which can happen when you try to conserve memory in Python.
- Verify that your fixed target network's weights are actually fixed. In Python, it's easy to accidentally make the "fixed" network point to the same object in memory, and then your fixed target network isn't actually fixed! (See the sketch after this list.)
- If you find your implementation is not working on a certain game, test your code on a simpler game like Pong. Your implementation may be fine, but you don’t have enough of the cutting-edge advances to learn harder games! Explore Double DQN, Dueling Q Networks, and prioritized experience replay.
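For the target-network point above, here is a minimal Keras sketch (assuming a compiled Q-network called `q_model`; the variable names are mine) of keeping a genuinely separate copy rather than a second reference to the same object:

```
from keras.models import clone_model

# clone_model builds new layers and new weight tensors but does not copy the
# weight values, so sync them explicitly after cloning.
target_model = clone_model(q_model)
target_model.set_weights(q_model.get_weights())

# ...and periodically re-sync (the Nature paper re-syncs every fixed number
# of steps, on the order of 10,000):
target_model.set_weights(q_model.get_weights())
```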
# Conclusion
Implementing the DeepMind paper is a rewarding first step into modern advancements in reinforcement learning. If you work mostly in a more traditional supervised learning setting, many of the common ideas and tricks may seem foreign at first. Deep Q Networks are just neural networks after all, and many of the techniques used to stabilize their learning can be applied to traditional uses as well. The most interesting point to me is the topic of regularization. If you noticed, we did not use Dropout or $L^2$ or $L^1$ regularization, the traditional approaches to stabilizing the training of neural networks. Instead, the same goal is accomplished through how the mini batches are selected (experience replay) and how future rewards are defined (using a fixed target network). That is one of the most exciting and novel theoretical contributions of the original Nature paper, and it has continued to be iterated upon and refined in subsequent papers. When you want to further improve your implementation, you should investigate the next iteration of these techniques: [prioritized experience replay](https://arxiv.org/abs/1511.05952), [Double DQN](https://arxiv.org/abs/1509.06461), and [Dueling Q Networks](https://arxiv.org/abs/1511.06581). The (mostly) current standard comes from a modification that allows asynchronous learning, called [A3C](https://arxiv.org/abs/1602.01783).
Written on July 26, 2017
\ No newline at end of file
# Scientists Can Read a Bird’s Brain and Predict Its Next Song
Original article: [Scientists Can Read a Bird’s Brain and Predict Its Next Song](https://www.technologyreview.com/s/609032/scientists-can-read-a-birds-brain-and-predict-its-next-song/?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com)
## Next up, predicting human speech with a brain-computer interface.
Entrepreneurs in Silicon Valley this year set themselves an audacious new goal: creating a brain-reading device that would allow people to effortlessly send texts with their thoughts.
In April, Elon Musk announced a secretive new brain-interface company called [Neuralink](https://www.technologyreview.com/s/604254/with-neuralink-elon-musk-promises-human-to-human-telepathy-dont-believe-it/). Days later, Facebook CEO Mark Zuckerberg declared that “direct brain interfaces [are] going to, eventually, let you communicate only with your mind.” The company says it has [60 engineers](https://www.technologyreview.com/s/604229/facebooks-sci-fi-plan-for-typing-with-your-mind-and-hearing-with-your-skin/) working on the problem.
It’s an ambitious quest—and there are reasons to think it won’t happen anytime soon. But for at least one small, orange-beaked bird, the zebra finch, the dream just became a lot closer to reality.
That’s thanks to some nifty work by Timothy Gentner and his students at the University of California, San Diego, who built a brain-to-tweet interface that figures out the song a finch is going to sing a fraction of a second before it does so.
“We decode realistic synthetic birdsong directly from neural activity,” the scientists announced in a new report published on the website [bioRxiv](https://www.biorxiv.org/content/early/2017/09/27/193987). The team, which includes Argentinian birdsong expert Ezequiel Arneodo, calls the system the first prototype of “a decoder of complex, natural communication signals from neural activity.” A similar approach could fuel advances towards a human thought-to-text interface, the researchers say.
Scientists say they can predict the song of a finch directly from its brain activity.
A songbird’s brain is none too large. But its vocalizations are similar to human speech in ways that make these birds a favorite of scientists studying memory and cognition. Their songs are complex. And, like human language, they’re learned. The zebra finch learns its call from an older bird.
Makoto Fukushima, a fellow at the National Institutes of Health who has used brain interfaces to study the simpler grunts and coos made by monkeys, says the richer range of birdsong is why the new results have “important implications for application in human speech.”
Current brain interfaces tried in humans mostly track neural signals that reflect a person’s imagined arm movements, which can be co-opted to move a robot or direct a cursor to very slowly peck out letters. So the idea of a helmet or brain implant that can effortlessly pick up what you’re trying to say remains pretty far from being realized.
But it’s not strictly impossible, as the new study shows. The team at UCSD used silicon electrodes in awake birds to measure the electrical chatter of neurons in part of the brain called the sensory-motor nucleus, where “commands that shape the production of learned song” originate.
The experiment employed neural-network software, a type of machine learning. The researchers fed into the program both the pattern of neural firing and the actual song that resulted, with its stops and starts and changing frequencies. The idea was to train their software to match one to the other, in what they termed “neural-to-song spectrum mappings.”
The team’s main innovation was to simplify the brain-to-tweet translation by incorporating a physical model of how finches make noise. Birds don’t have vocal cords as people do; instead, they shoot air over a vibrating surface in their throat, called a syrinx. Think of how you can make a high-pitched whine by putting two pieces of paper together and blowing at the edge.
The final result, say the authors: “We decode realistic synthetic birdsong directly from neural activity.” In their report, the team says it can predict what the bird will sing about 30 milliseconds before it does so.
You can listen to results yourself in the audio below. Keep in mind that the zebra finch is no nightingale. Its song is more like a staccato quacking.
The same song, as predicted from neural recordings inside the finch’s brain.
Songbirds are already an important research model. At Elon Musk’s Neuralink, bird scientists were among the first key hires. And UCSD’s trick of focusing on detecting the muscle movements behind speech may also be a key development.
Facebook has said it hopes people will be able to type directly from their brains at 100 words per minute, privately sending texts whenever they want. A device able to read the commands your brain sends out to muscles while you are engaged in subvocal utterances (silent speech) is probably a lot more realistic than one that reads “thoughts.”
Gentner and his team hope their finches will help make it possible. “We have demonstrated a [brain-machine interface] for a complex communication signal, using an animal model for human speech,” they write. They add that “our approach also provides a valuable proving ground for biomedical speech-prosthetic devices.”
In other words, we’re a little closer to texting from our brains.
\ No newline at end of file
# TensorFlow* Optimizations on Modern Intel® Architecture
Original article: [TensorFlow* Optimizations on Modern Intel® Architecture](https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com)
**Intel: Elmoustapha Ould-Ahmed-Vall, Mahmoud Abuzaina, Md Faijul Amin, Jayaram Bobba, Roman S Dubtsov, Evarist M Fomenko, Mukesh Gangadhar, Niranjan Hasabnis, Jing Huang, Deepthi Karkada, Young Jin Kim, Srihari Makineni, Dmitri Mishura, Karthik Raman, AG Ramesh, Vivek V Rane, Michael Riera, Dmitry Sergeev, Vamsi Sripathi, Bhavani Subramanian, Lakshay Tokas, Antonio C Valles**
**Google: Andy Davis, Toby Boyd, Megan Kacholia, Rasmus Larsen, Rajat Monga, Thiru Palanisamy, Vijay Vasudevan, Yao Zhang**
TensorFlow* is a leading deep learning and machine learning framework, which makes it important for Intel and Google to ensure that it is able to extract maximum performance from Intel’s hardware offering. This paper introduces the Artificial Intelligence (AI) community to TensorFlow optimizations on Intel® Xeon® and Intel® Xeon Phi™ processor based platforms. These optimizations are the fruit of a close collaboration between Intel and Google engineers announced last year by Intel’s Diane Bryant and Google’s Diane Green at the first Intel AI Day.
We describe the various performance challenges that we encountered during this optimization exercise and the solutions adopted. We also report our performance improvements on a sample of common neural network models. These optimizations can result in orders of magnitude higher performance. For example, our measurements show up to 70x higher performance for training and up to 85x higher performance for inference on the Intel® Xeon Phi™ processor 7250. These Intel® Xeon® processor E5 v4 (BDW) and Intel Xeon Phi processor 7250 based platforms lay the foundation for next generation products from Intel. In particular, users can expect to see improved performance on Intel Xeon Scalable processors.
Optimizing deep learning models performance on modern CPUs presents a number of challenges not very different from those seen when optimizing other performance-sensitive applications in High Performance Computing (HPC):
1. Code refactoring is needed to take advantage of modern vector instructions. This means ensuring that all the key primitives, such as convolution, matrix multiplication, and batch normalization, are vectorized to the latest SIMD instructions (AVX2 for Intel Xeon processors and AVX512 for Intel Xeon Phi processors).
2. Maximum performance requires paying special attention to using all the available cores efficiently. Again this means looking at parallelization within a given layer or operation as well as parallelization across layers.
3. As much as possible, data has to be available when the execution units need it. This means balanced use of prefetching, cache blocking techniques and data formats that promote spatial and temporal locality.
To meet these requirements, Intel developed a number of optimized deep learning primitives that can be used inside the different deep learning frameworks to ensure that we implement common building blocks efficiently. In addition to matrix multiplication and convolution, these building blocks include:
- Direct batched convolution
- Inner product
- Pooling: maximum, minimum, average
- Normalization: local response normalization across channels (LRN), batch normalization
- Activation: rectified linear unit (ReLU)
- Data manipulation: multi-dimensional transposition (conversion), split, concat, sum and scale.
Refer to this [article](https://software.intel.com/en-us/articles/introducing-dnn-primitives-in-intelr-mkl) for more details on these Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) optimized primitives.
In TensorFlow, we implemented Intel-optimized versions of operations to make sure that these operations can leverage Intel MKL-DNN primitives wherever possible. While this is a necessary step to enable scalable performance on Intel® architecture, we also had to implement a number of other optimizations. In particular, Intel MKL uses a different layout than the default layout in TensorFlow for performance reasons. We needed to ensure that the overhead of conversion between the two formats is kept to a minimum. We also wanted to ensure that data scientists and other TensorFlow users don’t have to change their existing neural network models to take advantage of these optimizations.
![img](https://software.intel.com/sites/default/files/managed/55/5d/tensorflow-optimizations-img-01.png)
## Graph Optimizations
We introduced a number of graph optimization passes to:
1. Replace default TensorFlow operations with Intel optimized versions when running on CPU. This ensures that users can run their existing Python programs and realize the performance gains without changes to their neural network model.
2. Eliminate unnecessary and costly data layout conversions.
3. Fuse multiple operations together to enable efficient cache reuse on CPU.
4. Handle intermediate states that allow for faster backpropagation.
These graph optimizations enable greater performance without introducing any additional burden on TensorFlow programmers. Data layout optimization is a key performance optimization. Oftentimes, the native TensorFlow data format is not the most efficient data layout for certain tensor operations on CPUs. In such cases, we insert a data layout conversion operation from TensorFlow’s native format to an internal format, perform the operation on CPU, and convert the operation output back to the TensorFlow format. However, these conversions introduce a performance overhead and should be minimized. Our data layout optimization identifies sub-graphs that can be entirely executed using Intel MKL optimized operations and eliminates the conversions within the operations in the sub-graph. Automatically inserted conversion nodes take care of data layout conversions at the boundaries of the sub-graph. Another key optimization is the fusion pass, which automatically fuses operations that can be run efficiently as a single Intel MKL operation.
## Other Optimizations
We have also tweaked a number of TensorFlow framework components to enable the highest CPU performance for various deep learning models. We developed a custom pool allocator based on the existing pool allocator in TensorFlow. Our custom pool allocator ensures that TensorFlow and Intel MKL share the same memory pools (using the Intel MKL imalloc functionality) and that we don’t return memory prematurely to the operating system, thus avoiding costly page misses and page clears. In addition, we carefully tuned the multiple threading libraries involved (pthreads used by TensorFlow and OpenMP used by Intel MKL) to coexist and not compete against each other for CPU resources.
## Performance Experiments
The optimizations discussed above resulted in dramatic performance improvements on both Intel Xeon and Intel Xeon Phi platforms. To illustrate the performance gains, we report below our best known methods (or BKMs) together with baseline and optimized performance numbers for three common [ConvNet benchmarks](https://github.com/soumith/convnet-benchmarks).
1. The following parameters are important for performance on Intel Xeon (codename Broadwell) and Intel Xeon Phi (codename Knights Landing) processors and we recommend tuning them for your specific neural network model and platform. We have carefully tuned these parameters to gain maximum performance for convnet-benchmarks on both Intel Xeon and Intel Xeon Phi processors.
1. Data format: we suggest that users specify the NCHW format for their specific neural network model to get maximum performance. TensorFlow’s default NHWC format is not the most efficient data layout for the CPU, and it results in some additional conversion overhead.
2. Inter-op / intra-op: we also suggest that data scientists and users experiment with the intra-op and inter-op parameters in TensorFlow to find the optimal setting for each model and CPU platform (see the example sketch after this list). These settings impact parallelism within one layer as well as across layers.
3. Batch size: batch size is another important parameter that impacts both the available parallelism to utilize all the cores as well as working set size and memory performance in general.
4. OMP_NUM_THREADS: maximum performance requires using all the available cores efficiently. This setting is especially important for performance on Intel Xeon Phi processors since it controls the level of hyperthreading (1 to 4).
5. Transpose in Matrix multiplication: for some matrix sizes, transposing the second input matrix b provides better performance (better cache reuse) in Matmul layer. This is the case for all the Matmul operations used in the three models below. Users should experiment with this setting for other matrix sizes.
6. KMP_BLOCKTIME: users should experiment with various settings for how much time each thread should wait after completing the execution of a parallel region, in milliseconds.
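As a rough illustration of how these knobs fit together (a sketch with made-up values, not a recommended configuration; tune them for your own model and platform), the threading settings go into the TensorFlow 1.x session config and the OpenMP variables into the environment:

```
import os
import tensorflow as tf

# Set the OpenMP/MKL environment before TensorFlow and MKL spin up their thread pools.
os.environ["OMP_NUM_THREADS"] = "22"   # e.g., one thread per physical core
os.environ["KMP_BLOCKTIME"] = "30"     # ms a thread waits after finishing a parallel region

config = tf.ConfigProto(
    inter_op_parallelism_threads=2,    # parallelism across independent operations
    intra_op_parallelism_threads=22)   # parallelism within a single operation

with tf.Session(config=config) as sess:
    # Build the graph here, preferring channels-first data where supported,
    # e.g. tf.layers.conv2d(..., data_format='channels_first') for NCHW.
    pass
```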
#### Example settings on Intel® Xeon® processor (codename Broadwell - 2 Sockets - 22 Cores)
![img](https://software.intel.com/sites/default/files/managed/55/5d/tensorflow-optimizations-img-02.png)
#### Example settings on Intel® Xeon Phi™ processor (codename Knights Landing - 68 Cores)
![img](https://software.intel.com/sites/default/files/managed/55/5d/tensorflow-optimizations-img-03.png)
1. Performance results on Intel® Xeon® processor (codename Broadwell – 2 Sockets – 22 Cores)
![img](https://software.intel.com/sites/default/files/managed/55/5d/tensorflow-optimizations-img-04.png)
2. Performance results on Intel® Xeon Phi™ processor (codename Knights Landing – 68 cores)
![img](https://software.intel.com/sites/default/files/managed/55/5d/tensorflow-optimizations-img-05.png)
3. Performance results with different batch sizes on Intel® Xeon® processor (codename Broadwell) and Intel® Xeon Phi™ processor (codename Knights Landing) - Training
![img](https://software.intel.com/sites/default/files/managed/55/5d/tensorflow-optimizations-img-06.png)
![img](https://software.intel.com/sites/default/files/managed/55/5d/tensorflow-optimizations-img-07.png)
![img](https://software.intel.com/sites/default/files/managed/55/5d/tensorflow-optimizations-img-08.png)
## Installing TensorFlow with CPU Optimizations
You can either install pre-built binary packages with pip or conda by following the directions in [Intel Optimized TensorFlow Wheel Now Available](https://software.intel.com/en-us/articles/intel-optimized-tensorflow-wheel-now-available#), or you can build from sources following the directions below:
1. Run "./configure" from the TensorFlow source directory, and it will download latest Intel MKL for machine learning automatically in tensorflow/third_party/mkl/mklml if you select the options to use Intel MKL.
2. Execute the following commands to create a pip package that can be used to install the optimized TensorFlow build.
- PATH can be changed to point to a specific version of GCC compiler:
export PATH=/PATH/gcc/bin:$PATH
- LD_LIBRARY_PATH can also be changed to point to the new GLIBC:
export LD_LIBRARY_PATH=/PATH/gcc/lib64:$LD_LIBRARY_PATH
- Build for best performance on Intel Xeon and Intel Xeon Phi processors:
bazel build --config=mkl --copt="-DEIGEN_USE_VML" -c opt //tensorflow/tools/pip_package:build_pip_package
3. Install the optimized TensorFlow wheel
1. bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/path_to_save_wheel
2. pip install --upgrade --user ~/path_to_save_wheel/wheel_name.whl
## System Configuration
![img](https://software.intel.com/sites/default/files/managed/55/5d/tensorflow-optimizations-img-09.png)
## What It Means for AI
Optimizing TensorFlow means deep learning applications built using this widely available and widely applied framework can now run much faster on Intel processors to increase flexibility, accessibility, and scale. The Intel Xeon Phi processor, for example, is designed to scale out in a near-linear fashion across cores and nodes to dramatically reduce the time to train machine learning models. And TensorFlow can now scale with future performance advancements as we continue enhancing the performance of Intel processors to handle even bigger and more challenging AI workloads.
The collaboration between Intel and Google to optimize TensorFlow is part of ongoing efforts to make AI more accessible to developers and data scientists, and to enable AI applications to run wherever they’re needed on any kind of device—from the edge to the cloud. Intel believes this is the key to creating the next-generation of AI algorithms and models to solve the most pressing problems in business, science, engineering, medicine, and society.
This collaboration already resulted in dramatic performance improvements on leading Intel Xeon and Intel Xeon Phi processor-based platforms. These improvements are now readily available through [Google’s TensorFlow GitHub repository](https://github.com/tensorflow/tensorflow.git). We are asking the AI community to give these optimizations a try and are looking forward to feedback and contributions that build on them.
\ No newline at end of file
# TensorFlow* Optimizations on Modern Intel® Architecture
Original article: [TensorFlow* Optimizations on Modern Intel® Architecture](https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com)
Intel: **Elmoustapha Ould-Ahmed-Vall, Mahmoud Abuzaina, Md Faijul Amin, Jayaram Bobba, Roman S Dubtsov, Evarist M Fomenko, Mukesh Gangadhar, Niranjan Hasabnis, Jing Huang, Deepthi Karkada, Young Jin Kim, Srihari Makineni, Dmitri Mishura, Karthik Raman, AG Ramesh, Vivek V Rane, Michael Riera, Dmitry Sergeev, Vamsi Sripathi, Bhavani Subramanian, Lakshay Tokas, Antonio C Valles**
Google: **Andy Davis, Toby Boyd, Megan Kacholia, Rasmus Larsen, Rajat Monga, Thiru Palanisamy, Vijay Vasudevan, Yao Zhang**
TensorFlow is a leading deep learning and machine learning framework, which makes it important for Intel and Google to ensure that it extracts maximum performance from Intel's hardware. This article introduces the AI community to the TensorFlow optimizations on Intel® Xeon® and Intel® Xeon Phi™ processor based platforms. These optimizations are the fruit of a close collaboration between Intel and Google engineers, announced last year by Intel's Diane Bryant and Google's Diane Green at the first Intel AI Day.
We describe the various performance challenges encountered during this optimization work and the solutions adopted. We also report performance improvements on a set of common neural network models. These optimizations can bring orders of magnitude higher performance; for example, our measurements on the Intel® Xeon Phi™ processor 7250 show up to 70x higher performance for training and up to 85x higher performance for inference. These platforms based on the Intel® Xeon® processor E5 v4 (BDW) and the Intel Xeon Phi processor 7250 lay the foundation for Intel's next generation products; in particular, users can expect improved performance on Intel Xeon Scalable processors.
Optimizing the performance of deep learning models on modern CPUs presents challenges not very different from those seen when optimizing other performance-sensitive applications in High Performance Computing (HPC):
1. Code refactoring is needed to take advantage of modern vector instructions. This means ensuring that all the key primitives, such as convolution, matrix multiplication, and batch normalization, are vectorized to the latest SIMD instructions (AVX2 for Intel Xeon processors and AVX512 for Intel Xeon Phi processors).
2. Maximum performance requires paying special attention to using all the available cores efficiently. Again, this means looking at parallelization within a given layer or operation as well as parallelization across layers.
3. As much as possible, data must be available when the execution units need it. This means balanced use of prefetching, cache blocking techniques, and data formats that promote spatial and temporal locality.
To meet these requirements, Intel developed a number of optimized deep learning primitives that can be used inside the different deep learning frameworks to ensure that the common building blocks are implemented efficiently. In addition to matrix multiplication and convolution, these building blocks include:
* Direct batched convolution
* Inner product
* Pooling: maximum, minimum, average
* Normalization: local response normalization across channels (LRN), batch normalization
* Activation: rectified linear unit (ReLU)
* Data manipulation: multi-dimensional transposition (conversion), split, concat, sum, and scale
Refer to [this article](https://software.intel.com/en-us/articles/introducing-dnn-primitives-in-intelr-mkl) for more details on these Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) optimized primitives.
In TensorFlow, we implemented Intel-optimized versions of operations to make sure that these operations can leverage Intel MKL-DNN primitives wherever possible. While this is a necessary step for scalable performance on Intel® architecture, we also had to implement a number of other optimizations. In particular, Intel MKL uses a different layout than TensorFlow's default layout for performance reasons. We needed to keep the overhead of converting between the two formats to a minimum. We also wanted to ensure that data scientists and other TensorFlow users do not have to change their existing neural network models to take advantage of these optimizations.
[![img](https://camo.githubusercontent.com/67b6aaf6a45f07faefab3fbfbe8e6766521bcecb/68747470733a2f2f736f6674776172652e696e74656c2e636f6d2f73697465732f64656661756c742f66696c65732f6d616e616765642f35352f35642f74656e736f72666c6f772d6f7074696d697a6174696f6e732d696d672d30312e706e67)](https://camo.githubusercontent.com/67b6aaf6a45f07faefab3fbfbe8e6766521bcecb/68747470733a2f2f736f6674776172652e696e74656c2e636f6d2f73697465732f64656661756c742f66696c65732f6d616e616765642f35352f35642f74656e736f72666c6f772d6f7074696d697a6174696f6e732d696d672d30312e706e67)
## Graph Optimizations
We introduced a number of graph optimization passes to:
1. Replace the default TensorFlow operations with Intel-optimized versions when running on CPU. This ensures that users can run their existing Python programs and realize the performance gains without changing their neural network model.
2. Eliminate unnecessary and costly data layout conversions.
3. Fuse multiple operations together to enable efficient cache reuse on CPU.
4. Handle intermediate states that allow for faster backpropagation.
These graph optimizations improve performance without placing any additional burden on TensorFlow programmers. Data layout optimization is a key performance optimization. Often, TensorFlow's native data format is not the most efficient layout for certain tensor operations on CPUs. In such cases, we insert a conversion operation from TensorFlow's native format to an internal format, perform the operation on CPU, and convert the output back to the TensorFlow format. However, these conversions introduce a performance overhead and should be minimized. Our data layout optimization identifies sub-graphs that can be executed entirely with Intel MKL optimized operations and eliminates the conversions inside those sub-graphs. Automatically inserted conversion nodes take care of data layout conversions at the boundaries of the sub-graph. Another key optimization is the fusion pass, which automatically fuses operations that can be run efficiently as a single Intel MKL operation.
## Other Optimizations
We have also tuned a number of TensorFlow framework components to enable the highest CPU performance for various deep learning models. We developed a custom pool allocator based on the existing pool allocator in TensorFlow. Our custom pool allocator ensures that TensorFlow and Intel MKL share the same memory pools (using the Intel MKL imalloc functionality) and that memory is not returned to the operating system prematurely, avoiding costly page misses and page clears. In addition, we carefully tuned the multiple threading libraries (pthreads used by TensorFlow and OpenMP used by Intel MKL) to coexist rather than compete with each other for CPU resources.
## Performance Experiments
The optimizations described above brought significant performance improvements on both Intel Xeon and Intel Xeon Phi platforms. To illustrate the gains, we report below our best known methods (or BKMs) together with baseline and optimized performance numbers for three common [ConvNet benchmarks](https://github.com/soumith/convnet-benchmarks).
1. The following parameters are important for performance on Intel Xeon (codename Broadwell) and Intel Xeon Phi (codename Knights Landing) processors, and we recommend tuning them for your specific neural network model and platform. We carefully tuned these parameters to obtain maximum performance for the convnet-benchmarks on both Intel Xeon and Intel Xeon Phi processors.
   i. Data format: we suggest that users specify the NCHW format for their specific neural network model to get maximum performance. TensorFlow's default NHWC format is not the most efficient data layout for the CPU and results in some additional conversion overhead.
   ii. Inter-op / intra-op: we also suggest that data scientists and users experiment with TensorFlow's intra-op and inter-op parameters to find the optimal setting for each model and CPU platform. These settings affect parallelism within a layer as well as across layers.
   iii. Batch size: batch size is another important parameter that affects both the parallelism available to utilize all the cores and the working set size and memory performance in general.
   iv. OMP_NUM_THREADS: maximum performance requires using all the available cores efficiently. This setting is especially important for performance on Intel Xeon Phi processors since it controls the level of hyperthreading (1 to 4).
   v. Transpose in matrix multiplication: for some matrix sizes, transposing the second input matrix b provides better performance (better cache reuse) in the Matmul layer. This is the case for all the Matmul operations used in the three models below. Users should experiment with this setting for other matrix sizes.
   vi. KMP_BLOCKTIME: users should experiment with various settings for how long each thread should wait, in milliseconds, after completing the execution of a parallel region.
#### Example settings on Intel® Xeon® processor (codename Broadwell - 2 sockets - 22 cores)
[![img](https://camo.githubusercontent.com/7f5fe519021aeaf84974be52feadc041c4b432c2/68747470733a2f2f736f6674776172652e696e74656c2e636f6d2f73697465732f64656661756c742f66696c65732f6d616e616765642f35352f35642f74656e736f72666c6f772d6f7074696d697a6174696f6e732d696d672d30322e706e67)](https://camo.githubusercontent.com/7f5fe519021aeaf84974be52feadc041c4b432c2/68747470733a2f2f736f6674776172652e696e74656c2e636f6d2f73697465732f64656661756c742f66696c65732f6d616e616765642f35352f35642f74656e736f72666c6f772d6f7074696d697a6174696f6e732d696d672d30322e706e67)
#### Example settings on Intel® Xeon Phi™ processor (codename Knights Landing - 68 cores)
[![img](https://camo.githubusercontent.com/d31983dcf4f45a2a32102cdc76719d02452e0ab6/68747470733a2f2f736f6674776172652e696e74656c2e636f6d2f73697465732f64656661756c742f66696c65732f6d616e616765642f35352f35642f74656e736f72666c6f772d6f7074696d697a6174696f6e732d696d672d30332e706e67)](https://camo.githubusercontent.com/d31983dcf4f45a2a32102cdc76719d02452e0ab6/68747470733a2f2f736f6674776172652e696e74656c2e636f6d2f73697465732f64656661756c742f66696c65732f6d616e616765642f35352f35642f74656e736f72666c6f772d6f7074696d697a6174696f6e732d696d672d30332e706e67)
1. Performance results on Intel® Xeon® processor (codename Broadwell - 2 sockets - 22 cores)
[![img](https://camo.githubusercontent.com/6939fb3814a15680379fe44464ec3664b7453229/68747470733a2f2f736f6674776172652e696e74656c2e636f6d2f73697465732f64656661756c742f66696c65732f6d616e616765642f35352f35642f74656e736f72666c6f772d6f7074696d697a6174696f6e732d696d672d30342e706e67)](https://camo.githubusercontent.com/6939fb3814a15680379fe44464ec3664b7453229/68747470733a2f2f736f6674776172652e696e74656c2e636f6d2f73697465732f64656661756c742f66696c65732f6d616e616765642f35352f35642f74656e736f72666c6f772d6f7074696d697a6174696f6e732d696d672d30342e706e67)
2. Performance results on Intel® Xeon Phi™ processor (codename Knights Landing - 68 cores)
[![img](https://camo.githubusercontent.com/7e2f91683296bb12afb8bcdedefdcf80d6a530b2/68747470733a2f2f736f6674776172652e696e74656c2e636f6d2f73697465732f64656661756c742f66696c65732f6d616e616765642f35352f35642f74656e736f72666c6f772d6f7074696d697a6174696f6e732d696d672d30352e706e67)](https://camo.githubusercontent.com/7e2f91683296bb12afb8bcdedefdcf80d6a530b2/68747470733a2f2f736f6674776172652e696e74656c2e636f6d2f73697465732f64656661756c742f66696c65732f6d616e616765642f35352f35642f74656e736f72666c6f772d6f7074696d697a6174696f6e732d696d672d30352e706e67)
3. Performance results with different batch sizes on Intel® Xeon® processor (codename Broadwell) and Intel® Xeon Phi™ processor (codename Knights Landing) - Training
[![img](https://camo.githubusercontent.com/149e1b63dffd5ae62e7f9a30ad2d81caa0bb844a/68747470733a2f2f736f6674776172652e696e74656c2e636f6d2f73697465732f64656661756c742f66696c65732f6d616e616765642f35352f35642f74656e736f72666c6f772d6f7074696d697a6174696f6e732d696d672d30362e706e67)](https://camo.githubusercontent.com/149e1b63dffd5ae62e7f9a30ad2d81caa0bb844a/68747470733a2f2f736f6674776172652e696e74656c2e636f6d2f73697465732f64656661756c742f66696c65732f6d616e616765642f35352f35642f74656e736f72666c6f772d6f7074696d697a6174696f6e732d696d672d30362e706e67)
[![img](https://camo.githubusercontent.com/82254e49c8c9e1988a9764f72595abc6526499d7/68747470733a2f2f736f6674776172652e696e74656c2e636f6d2f73697465732f64656661756c742f66696c65732f6d616e616765642f35352f35642f74656e736f72666c6f772d6f7074696d697a6174696f6e732d696d672d30372e706e67)](https://camo.githubusercontent.com/82254e49c8c9e1988a9764f72595abc6526499d7/68747470733a2f2f736f6674776172652e696e74656c2e636f6d2f73697465732f64656661756c742f66696c65732f6d616e616765642f35352f35642f74656e736f72666c6f772d6f7074696d697a6174696f6e732d696d672d30372e706e67)
[![img](https://camo.githubusercontent.com/630c1e41069e39d33866d1cc76d54109cfdf2a86/68747470733a2f2f736f6674776172652e696e74656c2e636f6d2f73697465732f64656661756c742f66696c65732f6d616e616765642f35352f35642f74656e736f72666c6f772d6f7074696d697a6174696f6e732d696d672d30382e706e67)](https://camo.githubusercontent.com/630c1e41069e39d33866d1cc76d54109cfdf2a86/68747470733a2f2f736f6674776172652e696e74656c2e636f6d2f73697465732f64656661756c742f66696c65732f6d616e616765642f35352f35642f74656e736f72666c6f772d6f7074696d697a6174696f6e732d696d672d30382e706e67)
## Installing TensorFlow with CPU Optimizations
You can either install pre-built binary packages with pip or conda by following the directions in [Intel Optimized TensorFlow Wheel Now Available](https://software.intel.com/en-us/articles/intel-optimized-tensorflow-wheel-now-available#), or you can build from sources as follows:
1. Run "./configure" from the TensorFlow source directory. If you select the option to use Intel MKL, it will automatically download the latest Intel MKL for machine learning into tensorflow/third_party/mkl/mklml.
2. Execute the following commands to create a pip package that can be used to install the optimized TensorFlow build.
   - Set PATH to point to the desired version of the GCC compiler: export PATH=/PATH/gcc/bin:$PATH
   - Set LD_LIBRARY_PATH to point to the new GLIBC: export LD_LIBRARY_PATH=/PATH/gcc/lib64:$LD_LIBRARY_PATH
   - Build for best performance on Intel Xeon and Intel Xeon Phi processors: bazel build --config=mkl --copt="-DEIGEN_USE_VML" -c opt //tensorflow/tools/pip_package:build_pip_package
3. Install the optimized TensorFlow wheel:
   i. bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/path_to_save_wheel
      pip install --upgrade --user ~/path_to_save_wheel/wheel_name.whl
## System Configuration
[![img](https://camo.githubusercontent.com/6509ddfa2fae9b2a8e699706f1a3a473cead56f8/68747470733a2f2f736f6674776172652e696e74656c2e636f6d2f73697465732f64656661756c742f66696c65732f6d616e616765642f35352f35642f74656e736f72666c6f772d6f7074696d697a6174696f6e732d696d672d30392e706e67)](https://camo.githubusercontent.com/6509ddfa2fae9b2a8e699706f1a3a473cead56f8/68747470733a2f2f736f6674776172652e696e74656c2e636f6d2f73697465732f64656661756c742f66696c65732f6d616e616765642f35352f35642f74656e736f72666c6f772d6f7074696d697a6174696f6e732d696d672d30392e706e67)
## What It Means for AI
Optimizing TensorFlow means that deep learning applications built with this widely used framework can now run much faster on Intel processors, increasing flexibility, accessibility, and scale. The Intel Xeon Phi processor, for example, is designed to scale out in a near-linear fashion across cores and nodes to dramatically reduce the time needed to train machine learning models. As we continue to improve the performance of Intel processors to handle even bigger and more challenging AI workloads, TensorFlow can scale with those future performance advancements.
The collaboration between Intel and Google to optimize TensorFlow is part of ongoing efforts to make AI more accessible to developers and data scientists, and to enable AI applications to run wherever they are needed, on any kind of device, from the edge to the cloud. Intel believes this is the key to creating the next generation of AI algorithms and models that solve the most pressing problems in business, science, engineering, medicine, and society.
This collaboration has already produced dramatic performance improvements on leading Intel Xeon and Intel Xeon Phi processor-based platforms. These improvements are now readily available through [Google's TensorFlow GitHub repository](https://github.com/tensorflow/tensorflow.git). We ask the AI community to try these optimizations and look forward to feedback and contributions that build on them.
\ No newline at end of file
# A Tour of Gotchas When Implementing Deep Q Networks with Keras and OpenAI Gym
Original article: [A Tour of Gotchas When Implementing Deep Q Networks with Keras and OpenAI Gym](http://srome.github.io/A-Tour-Of-Gotchas-When-Implementing-Deep-Q-Networks-With-Keras-And-OpenAi-Gym/?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com)
Starting with the Google DeepMind paper, training models to play video games has received a lot of new attention. You, the data scientist/engineer/enthusiast, may not work in reinforcement learning but are probably interested in teaching neural networks to play video games. Who isn't? With that in mind, here is a list of nuances that should jumpstart your own implementation.
The lessons below come from my own [implementation](http://www.github.com/srome/ExPyDQN) of the [Nature](http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html) paper. They are aimed at people who work with data but may run into issues with some of the non-standard approaches used in the reinforcement learning community compared with typical supervised learning use cases. I will cover both technical details of the neural networks' parameters and the libraries involved. This post assumes only limited knowledge of the Nature paper, in particular of the basic notation used around Q learning. My implementation relies mostly on Keras and Gym to stay instructive, and I avoided theano/tensorflow-specific tricks (such as theano's [disconnected gradient](https://github.com/Theano/theano/blob/52903f8267cff316fc669e207eac4e2ecae952a6/theano/gradient.py#L2002-L2021)) to keep the focus on the main components.
## Learning Rate and Loss Functions
If you have looked at [some](https://github.com/spragunr/deep_q_rl/blob/master/deep_q_rl/q_network.py) of the [implementations](https://github.com/matthiasplappert/keras-rl/blob/master/rl/agents/dqn.py) out there, you'll see there is usually an option to either sum the loss function over a mini batch or take a mean. Most loss functions you hear about in machine learning start with the word "mean" or at least take a mean over a mini batch. Let's talk about what a summed loss really means in the context of your learning rate.
A typical gradient descent update looks like this:
$$\theta_{t+1} \leftarrow \theta_t - \lambda \nabla(\ell_{\theta_t})$$
where $\theta_t$ is the learner's weights at time $t$, $\lambda$ is the learning rate, and $\ell_{\theta_t}$ is a loss function that depends on $\theta_t$. I will suppress the $\theta_t$ dependence of the loss function from here on. Let us define $\ell$ as the loss function that sums over the mini batch and $\hat{\ell}$ as the loss function that takes the mean over the mini batch. For a fixed mini batch of size $m$, notice:
$$\hat{\ell} = \frac{1}{m}\ell$$
and so mathematically we have
$$\nabla(\hat{\ell}) = \frac{1}{m}\nabla(\ell).$$
What does this tell us? If you train two models, each with learning rate $\lambda$, these two variants will behave very differently because the magnitude of the updates will differ by a factor of $m$. In theory, with a small enough learning rate, you can account for the mini batch size and recover the behavior of the mean version of the loss from the summed version. However, other coefficients in the loss, such as the regularization term, would then need adjusting as well. Sticking with the mean version gives standard coefficients that work well in many situations. Using a summed loss rather than a mean is therefore unusual in other data science applications, but the option appears frequently in reinforcement learning, so be aware that you may need to adjust the learning rate. With that said, my implementation uses the mean version and a learning rate of 0.00025.
## Stochasticity, Frame Skipping, and Gym Defaults
Many (good) blog posts on reinforcement learning show a model being trained on "(Game Name)-v0", so you might decide to use that ROM as you've seen it done before. So far so good. Then you read the various papers, see a technique called "frame skipping", where you stack $n$ output screens from the emulator into a single $n$-by-length-by-width image and pass it to the model, and you implement it thinking everything is going according to plan. It's not. Depending on your version of Gym, you can run into trouble.
In older versions of Gym, the Atari environment *randomly* [repeated your action for 2-4 steps](https://github.com/openai/gym/blob/bde0de609d3645e76728b3b8fc2f3bf210187b27/gym/envs/atari/atari_env.py#L69-L71) and returned the resulting frame. In code, this is what that means for your frame-skip implementation:
```
for k in range(frame_skip):
    obs, reward, is_terminal, info = env.step(action)  # 2-4 emulator steps occur with the given action
```
If you implement frame skipping with n=4 on top of this, your learner may effectively be seeing every 8-16 (or more!) frames rather than every 4. You can imagine the impact on performance. Thankfully, this has since been made [optional](https://github.com/openai/gym/blob/master/gym/envs/atari/atari_env.py#L75-L80) via new ROMs, which we will mention in a second. However, there is another setting to be aware of, and that's repeat_action_probability. For "(Game Name)-v0" ROMs, it is on by default. This is the probability that the game will ignore a new action and repeat the previous action at each time step. To remove both the frame skip and the repeat action probability, use the "(Game Name)NoFrameskip-v4" ROMs. A full accounting of these settings can be found [here](https://github.com/openai/gym/blob/5cb12296274020db9bb6378ce54276b31e7002da/gym/envs/__init__.py#L298-L376).
I would be remiss not to point out that Gym is not doing this just to ruin your DQN. There is a legitimate reason for it, but this setting can lead to unending frustration when your neural network is ramming itself into the boundary of the stage rather than hitting the ping pong pixel-square. The reason is to introduce stochasticity into the environment. Otherwise, the game is deterministic and your network simply memorizes a series of steps like a dance. When using a NoFrameskip ROM, you have to introduce your own stochasticity to avoid said dance. The Nature paper (and many libraries) do this with the "null op max" setting. At the beginning of each episode (e.g., a round of Pong), the agent performs a series of $k$ consecutive null operations (action=0 for the Atari emulator in Gym), where $k$ is an integer sampled uniformly from $[0, \text{null op max}]$. It can be implemented by the following pseudo-code at the start of an episode:
```
obs = env.reset()
# Perform a sequence of null operations to make the game stochastic
for k in range(np.random.randint(0, null_op_max + 1)):  # k sampled uniformly from [0, null_op_max]
    obs, reward, is_terminal, info = env.step(0)  # action 0 is the null op
```
## Gradient Clipping, Error Clipping, Reward Clipping
There are several different kinds of clipping going on in the Nature paper, and each can easily be confused and implemented incorrectly. In fact, if you thought error clipping and gradient clipping were different, you are already confused!
### What is Gradient Clipping?
The Nature paper states that it is helpful to "clip the error term". The community seems to have rejected "error clipping" in favor of the term "gradient clipping". Without knowing the background, the term is ambiguous. In both cases, there is no actual clipping of the loss function, the error, or the gradient. Really, they choose a loss function whose gradient does not grow with the size of the error past a certain region, which limits the size of the gradient update for large errors. In particular, when the absolute error is greater than 1, they switch from a squared loss to an absolute-value loss. Why? Let's look at the derivative!
This is the derivative of a mean-squared-error-like loss:
$$\frac{d}{dx}\,\frac{1}{2}(x-y)^2 = (x-y)$$
Now compare it to the derivative of an absolute-value-like loss when $x-y > 0$:
$$\frac{d}{dx}(x-y) = 1$$
If we think of the value of the loss function, or of the error, as $x-y$, we can see that one gradient update contains the factor $x-y$ while the other does not. There isn't really a good, catchy phrase for this mathematical trick that is more representative than "gradient clipping".
The standard way to accomplish this is to use the [Huber loss](https://en.wikipedia.org/wiki/Huber_loss) function, defined as follows:
$$f(x) = \begin{cases} \frac{1}{2}x^2 & \text{if } |x| \le \delta, \\ \delta\left(|x| - \frac{1}{2}\delta\right) & \text{otherwise.} \end{cases}$$
There is a common trick to implement this so that symbolic mathematics libraries like theano and tensorflow can take the derivative more easily, without a switch statement. That trick is outlined and coded below and is used in most implementations.
The function you actually code is as follows: let $q = \min(|x|, \delta)$. Then
$$g(x) = \frac{q^2}{2} + \delta(|x| - q).$$
When $|x| \le \delta$, plugging into the formula shows that we recover $g(x) = \frac{1}{2}x^2$. Otherwise, when $|x| > \delta$,
$$g(x) = \frac{\delta^2}{2} + \delta(|x| - \delta) = \delta|x| - \frac{1}{2}\delta^2 = \delta\left(|x| - \frac{1}{2}\delta\right).$$
So $g = f$. This is coded below and plotted to show that the function is continuous with a continuous derivative. The Nature paper uses $\delta = 1$.
```
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
def huber_loss(x, clip_delta=1):
    error = np.abs(x)
    quadratic_part = np.minimum(error, clip_delta)
    return 0.5 * np.square(quadratic_part) + clip_delta * (error - quadratic_part)
f=np.vectorize(huber_loss)
x = np.linspace(-5,5,100)
plt.plot(x, f(x))
```
[![png](https://camo.githubusercontent.com/7e386e09e50344f2ead0d5125cbc1d3609b0fb5d/687474703a2f2f73726f6d652e6769746875622e696f2f696d616765732f64716e5f696d706c2f6f75747075745f385f312e706e67)](https://camo.githubusercontent.com/7e386e09e50344f2ead0d5125cbc1d3609b0fb5d/687474703a2f2f73726f6d652e6769746875622e696f2f696d616765732f64716e5f696d706c2f6f75747075745f385f312e706e67)
As you can see, the slope of the graph (the derivative) is "clipped" so that its magnitude never exceeds 1. An interesting point is that the second derivative is not continuous:
$$f''(x) = \begin{cases} 1 & \text{if } |x| \le \delta, \\ 0 & \text{otherwise.} \end{cases}$$
This could cause problems when using second-order methods for gradient descent, which is why some suggest the pseudo-Huber loss function, a smooth approximation to the Huber loss. However, the Huber loss is sufficient for our goals.
### What Reward Do I Clip?
This is actually a much subtler question once you introduce frame skipping. Do you take the last reward returned from the emulator? But what if there was a reward during the skipped frames? When you're so deep in the weeds of neural network implementation, it's surprising when the answer comes from the basic [Q learning](https://en.wikipedia.org/wiki/Q-learning) framework. In Q learning, one uses the total reward since the last action, and that is what they did in the paper as well. This means that even on the frames you skipped, you need to save the observed rewards and aggregate them as the "reward" for a given transition $(s_t, a_t, r_t, s_{t+1})$. This cumulative reward is what you clip. In my implementation, you can see this in the step function of the TrainingEnvironment class. The pseudo-code is below:
```
total_reward = 0
for k in range(frame_skip):
    obs, reward, is_terminal, info = env.step(action)  # 1 step in the emulator
    total_reward += reward
```
This total reward is what gets stored in the experience replay memory.
## Memory Considerations
The "experience replay memory" is a store from which training examples $(s_t, a_t, r_t, s_{t+1})$ are sampled to break the time correlation of the data points, which stops the network's learning from diverging. The original paper mentions a replay memory of a million examples $(s_t, a_t, r_t, s_{t+1})$. Both the $s_t$'s and the $s_{t+1}$'s are $n$-by-length-by-width images, so you can imagine that the memory requirements become very large. This is one of the cases where knowing more about programming and types in Python is useful. By default, Gym returns images as numpy arrays with the datatype int8. If you do any processing of the images, they will likely end up as float32. So it is important to make sure your images are stored as int8 to save space, and that any transformations needed for the neural network (such as scaling) are done at training time rather than before storing the states in the replay memory.
It is also helpful to pre-allocate the replay memory. In my implementation, I do this as follows:
```
def __init__(self, size, image_size, phi_length, minibatch_size):
    self._memory_state = np.zeros(shape=(size, phi_length, image_size[0], image_size[1]), dtype=np.int8)
    self._memory_future_state = np.zeros(shape=(size, phi_length, image_size[0], image_size[1]), dtype=np.int8)
    self._rewards = np.zeros(shape=(size, 1), dtype=np.float32)
    self._is_terminal = np.zeros(shape=(size, 1), dtype=np.bool)
    self._actions = np.zeros(shape=(size, 1), dtype=np.int8)
```
Of course, phi_length $= n$ from our previous discussion, the number of emulator screens stacked together to form a state.
## Debugging in Circles
With so many moving parts and parameters, a lot can go wrong in an implementation. My recommendation is to start with the same parameters as the original paper to eliminate one source of error. Most of the changes from the original NIPS paper to the Nature paper were made to standardize learning parameters and performance across different Atari games. Many quirks arise from game to game, and most of the techniques in the Nature paper guard against them. For example, the consecutive max over frames is used to handle screen flicker, which can make objects disappear under certain frame-skip settings. Once the parameters are set, here are a few things to check when your network is not learning well (or at all):
1. Check whether your Q-value output jumps in size between batch updates; this means your gradient updates are large. Look at your learning rate and investigate your gradient to find the problem.
2. Look at the states you are sending to the neural network. Does the frame skipping/consecutive max look correct? Are your future states different from your current states? Do you see a logical progression of images? If not, you may have some memory reference issues, which can happen when you try to conserve memory in Python.
3. Verify that your fixed target network's weights are actually fixed. In Python, it is easy to accidentally make the fixed network point to the same object in memory, and then your fixed target network isn't actually fixed!
4. If your implementation does not work on a certain game, test your code on a simpler game like Pong. Your implementation may be fine, but you may not have enough of the cutting-edge advances to learn harder games! Explore Double DQN, Dueling Q Networks, and prioritized experience replay.
## Conclusion
Implementing the DeepMind paper is a rewarding first step into modern advancements in reinforcement learning. If you work mostly in a more traditional supervised learning setting, many of the ideas and tricks may seem foreign at first. Deep Q Networks are just neural networks after all, and many of the techniques used to stabilize their learning can be applied to traditional uses as well. The most interesting point to me is the topic of regularization. If you noticed, we did not use Dropout or $L^2$ or $L^1$ regularization, the traditional approaches to stabilizing the training of neural networks. Instead, the same goal is accomplished through how the mini batches are selected (experience replay) and how future rewards are defined (using a fixed target network). That is one of the most exciting and novel theoretical contributions of the original Nature paper, and it has continued to be iterated upon and refined in subsequent papers. When you want to further improve your implementation, you should investigate the next iteration of these techniques: [prioritized experience replay](https://arxiv.org/abs/1511.05952), [Double DQN](https://arxiv.org/abs/1509.06461), and [Dueling Q Networks](https://arxiv.org/abs/1511.06581). The (mostly) current standard comes from a modification that allows asynchronous learning, called [A3C](https://arxiv.org/abs/1602.01783).
Written on July 26, 2017
\ No newline at end of file
# Scientists Can Read a Bird's Brain and Predict Its Next Song
Original article: [Scientists Can Read a Bird's Brain and Predict Its Next Song](https://www.technologyreview.com/s/609032/scientists-can-read-a-birds-brain-and-predict-its-next-song/?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com)
## Next up, predicting human speech with a brain-computer interface.
Entrepreneurs in Silicon Valley this year set themselves an audacious new goal: creating a brain-reading device that would allow people to effortlessly send texts with their thoughts.
In April, Elon Musk announced a secretive new brain-interface company called [Neuralink](https://www.technologyreview.com/s/604254/with-neuralink-elon-musk-promises-human-to-human-telepathy-dont-believe-it/). Days later, Facebook CEO Mark Zuckerberg declared that "direct brain interfaces [are] going to, eventually, let you communicate only with your mind." The company says it has [60 engineers](https://www.technologyreview.com/s/604229/facebooks-sci-fi-plan-for-typing-with-your-mind-and-hearing-with-your-skin/) working on the problem.
It's an ambitious quest, and there are reasons to think it won't happen anytime soon. But for at least one small, orange-beaked bird, the zebra finch, the dream just became a lot closer to reality.
That's thanks to some nifty work by Timothy Gentner and his students at the University of California, San Diego, who built a brain-to-tweet interface that figures out the song a finch is going to sing a fraction of a second before it does so.
"We decode realistic synthetic birdsong directly from neural activity," the scientists announced in a new report published on the website [bioRxiv](https://www.biorxiv.org/content/early/2017/09/27/193987). The team, which includes Argentinian birdsong expert Ezequiel Arneodo, calls the system the first prototype of "a decoder of complex, natural communication signals from neural activity." A similar approach could fuel advances towards a human thought-to-text interface, the researchers say.
Scientists say they can predict the song of a finch directly from its brain activity.
A songbird's brain is none too large, but its vocalizations are similar to human speech in ways that make these birds a favorite of scientists studying memory and cognition. Their songs are complex, and, like human language, they are learned. The zebra finch learns its call from an older bird.
Makoto Fukushima, a fellow at the National Institutes of Health who has used brain interfaces to study the simpler grunts and coos made by monkeys, says the richer range of birdsong is why the new results have "important implications for application in human speech."
Current brain interfaces tried in humans mostly track neural signals that reflect a person's imagined arm movements, which can be co-opted to move a robot or direct a cursor to very slowly peck out letters. So the idea of a helmet or brain implant that can effortlessly pick up what you are trying to say remains far from being realized.
But it's not strictly impossible, as the new study shows. The team at UCSD used silicon electrodes in awake birds to measure the electrical chatter of neurons in a part of the brain called the sensory-motor nucleus, where "commands that shape the production of learned song" originate.
The experiment employed neural-network software, a type of machine learning. The researchers fed the program both the pattern of neural firing and the actual song that resulted, with its stops and starts and changing frequencies. The idea was to train the software to match one to the other, in what they termed "neural-to-song spectrum mappings."
The team's main innovation was to simplify the brain-to-tweet translation by incorporating a physical model of how finches make noise. Birds don't have vocal cords as people do; instead, they shoot air over a vibrating surface in their throat, called a syrinx. Think of how you can make a high-pitched whine by putting two pieces of paper together and blowing at the edge.
The final result, say the authors: "We decode realistic synthetic birdsong directly from neural activity." In their report, the team says it can predict what the bird will sing about 30 milliseconds before it does so.
You can listen to the results yourself in the audio below. Keep in mind that the zebra finch is no nightingale; its song is more like a staccato quacking.
The same song, as predicted from neural recordings inside the finch's brain.
Songbirds are already an important research model. At Elon Musk's Neuralink, bird scientists were among the first key hires. And UCSD's trick of focusing on detecting the muscle movements behind speech may also be a key development.
Facebook has said it hopes people will be able to type directly from their brains at 100 words per minute, privately sending texts whenever they want. A device able to read the commands your brain sends to muscles while you are engaged in subvocal utterances (silent speech) is probably a lot more realistic than one that reads "thoughts."
Gentner and his team hope their finches will help make it possible. "We have demonstrated a [brain-machine interface] for a complex communication signal, using an animal model for human speech," they write. They add that "our approach also provides a valuable proving ground for biomedical speech-prosthetic devices." In other words, we're a little closer to texting from our brains.
\ No newline at end of file
@@ -15,26 +15,26 @@
## Translation Contributors
| Date | Translator | Proofreader |
| --------------------------------------------------- | -------------------------------------------------------- | ---- |
| [2017/09/25 Issue 1](https://hackcv.com/daily/p/1/) | [@wnma](https://github.com/wnma3mz) | |
| [2017/10/04 Issue 2](https://hackcv.com/daily/p/2/) | [@doordiey](https://github.com/doordiey) | |
| [2017/10/05 Issue 3](https://hackcv.com/daily/p/3/) | [@Arron206](https://github.com/Arron206) | |
| [2017/10/06 Issue 4](https://hackcv.com/daily/p/4/) | [@mllove](https://github.com/mllove) | |
| [2017/10/07 Issue 5](https://hackcv.com/daily/p/5/) | [@wnma](https://github.com/wnma3mz) | |
| [2017/10/08 Issue 6](https://hackcv.com/daily/p/6/) | [@doordiey](https://github.com/doordiey) | |
| [2017/10/09 Issue 7](https://hackcv.com/daily/p/7/) | [@mllove](https://github.com/mllove) | |
| [2017/10/10 Issue 8](https://hackcv.com/daily/p/8/) | [@AlexdanerZe](https://github.com/AlexdanerZe) | |
| [2017/10/11 Issue 9](https://hackcv.com/daily/p/9/) | [@exqlnet](https://github.com/exqlnet) | |
| [2017/10/11 Issue 10](https://hackcv.com/daily/p/10/) | [@aboutmydreams](https://github.com/aboutmydreams) | |
| [2017/10/13 Issue 11](https://hackcv.com/daily/p/11/) | [@wnma](https://github.com/wnma3mz) | |
| [2017/10/14 Issue 12](https://hackcv.com/daily/p/12/) | [@wnma](https://github.com/wnma3mz) | |
| [2017/10/15 Issue 13](https://hackcv.com/daily/p/13/) | | |
| [2017/10/16 Issue 14](https://hackcv.com/daily/p/14/) | [@pickonecat](https://github.com/pickonecat) (not yet finished) | |
| [2017/10/17 Issue 15](https://hackcv.com/daily/p/15/) | | |
| [2017/10/18 Issue 16](https://hackcv.com/daily/p/16/) | | |
| [2017/10/19 Issue 17](https://hackcv.com/daily/p/17/) | | |
| [2017/10/20 Issue 18](https://hackcv.com/daily/p/18/) | | |
| Date | Translator | Proofreader |
| --------------------------------------------------- | ------------------------------------------------------------ | ---- |
| [2017/09/25 Issue 1](https://hackcv.com/daily/p/1/) | [@wnma](https://github.com/wnma3mz) | |
| [2017/10/04 Issue 2](https://hackcv.com/daily/p/2/) | [@doordiey](https://github.com/doordiey) | |
| [2017/10/05 Issue 3](https://hackcv.com/daily/p/3/) | [@Arron206](https://github.com/Arron206) | |
| [2017/10/06 Issue 4](https://hackcv.com/daily/p/4/) | [@mllove](https://github.com/mllove) | |
| [2017/10/07 Issue 5](https://hackcv.com/daily/p/5/) | [@wnma](https://github.com/wnma3mz) | |
| [2017/10/08 Issue 6](https://hackcv.com/daily/p/6/) | [@doordiey](https://github.com/doordiey) | |
| [2017/10/09 Issue 7](https://hackcv.com/daily/p/7/) | [@mllove](https://github.com/mllove) | |
| [2017/10/10 Issue 8](https://hackcv.com/daily/p/8/) | [@AlexdanerZe](https://github.com/AlexdanerZe) | |
| [2017/10/11 Issue 9](https://hackcv.com/daily/p/9/) | [@exqlnet](https://github.com/exqlnet) | |
| [2017/10/11 Issue 10](https://hackcv.com/daily/p/10/) | [@aboutmydreams](https://github.com/aboutmydreams) | |
| [2017/10/13 Issue 11](https://hackcv.com/daily/p/11/) | [@wnma](https://github.com/wnma3mz) | |
| [2017/10/14 Issue 12](https://hackcv.com/daily/p/12/) | [@wnma](https://github.com/wnma3mz) | |
| [2017/10/15 Issue 13](https://hackcv.com/daily/p/13/) | | |
| [2017/10/16 Issue 14](https://hackcv.com/daily/p/14/) | [@pickonecat](https://github.com/pickonecat) (not yet finished) | |
| [2017/10/17 Issue 15](https://hackcv.com/daily/p/15/) | | |
| [2017/10/18 Issue 16](https://hackcv.com/daily/p/16/) | | |
| [2017/10/19 Issue 17](https://hackcv.com/daily/p/17/) | [@lbllol365](https://github.com/lbllol365?tdsourcetag=s_pctim_aiomsg) | |
| [2017/10/20 Issue 18](https://hackcv.com/daily/p/18/) | | |
## Contribution Guide