# A Tour of Gotchas When Implementing Deep Q Networks with Keras and OpenAI Gym
Original link: [A Tour of Gotchas When Implementing Deep Q Networks with Keras and OpenAI Gym](http://srome.github.io/A-Tour-Of-Gotchas-When-Implementing-Deep-Q-Networks-With-Keras-And-OpenAi-Gym/?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com)
Starting with the Google DeepMind paper, there has been a lot of new attention around training models to play video games. You, the data scientist/engineer/enthusiast, may not work in reinforcement learning but probably are interested in teaching neural networks to play video games. Who isn’t? With that in mind, here’s a list of nuances that should jumpstart your own implementation.
The lessons below were gleaned from working on my [own implementation](http://www.github.com/srome/ExPyDQN) of the [Nature](http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html) paper. They are aimed at people who work with data but may run into issues with some of the non-standard approaches used in the reinforcement learning community compared with typical supervised learning use cases. I will address both technical details of the parameters of the neural networks and the libraries involved. This post assumes limited knowledge of the Nature paper, particularly around the basic notation used in Q learning. My implementation was written to be instructive by relying mostly on Keras and Gym, and I avoided theano/tensorflow-specific tricks (e.g., theano's [disconnected gradient](https://github.com/Theano/theano/blob/52903f8267cff316fc669e207eac4e2ecae952a6/theano/gradient.py#L2002-L2021)) to keep the focus on the main components.
# Learning Rate and Loss Functions
If you have looked at [some](https://github.com/spragunr/deep_q_rl/blob/master/deep_q_rl/q_network.py) of the [implementations](https://github.com/matthiasplappert/keras-rl/blob/master/rl/agents/dqn.py) out there, you'll see there's usually an option between summing the loss function over a minibatch or taking a mean. Most loss functions you hear about in machine learning start with the word "mean" or at least take a mean over the minibatch. Let's talk about what a summed loss really means in the context of your learning rate.
A typical gradient descent update looks like this:
$$\theta_{t+1} \leftarrow \theta_t - \lambda \nabla(\ell_{\theta_t})$$
where $\theta_t$ is your learner's weights at time $t$, $\lambda$ is the learning rate, and $\ell_{\theta_t}$ is your loss function, which depends on $\theta_t$. I will suppress the $\theta_t$ dependence of the loss function moving forward. Let us define $\ell$ as the loss function summed over the minibatch and $\hat{\ell}$ as the loss function taking the mean over the minibatch. For a fixed minibatch of size $m$, notice:
$$\hat{\ell} = \frac{1}{m}\,\ell$$
and so mathematically we have
$$\nabla(\hat{\ell}) = \frac{1}{m}\nabla(\ell).$$
What does this tell us? If you train two models, each with learning rate $\lambda$, the two variants will have very different behavior: the magnitude of the updates will be off by a factor of $m$! In theory, with a small enough learning rate, you can account for the size of the minibatch and recover the behavior of the mean version of the loss from the summed version. However, other coefficients in the loss, such as those for regularization, would then need adjustment as well. Sticking with the mean version leads to standard coefficients which work well in many situations. It is atypical in other data science applications to use a summed loss function rather than a mean, but the option is frequently present in reinforcement learning, so be aware that you may need to adjust the learning rate. With that said, my implementation uses the mean version and a learning rate of .00025.
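As a quick sanity check (a standalone numpy sketch, not from the original implementation), you can verify that the gradient of the summed loss is exactly $m$ times the gradient of the mean loss for the same minibatch:

```
import numpy as np

m = 32                                   # minibatch size
pred = np.random.rand(m)                 # toy predictions
target = np.random.rand(m)               # toy targets

# Gradients of the squared error with respect to the predictions
grad_sum = 2 * (pred - target)           # from the summed loss: sum((pred - target)**2)
grad_mean = 2 * (pred - target) / m      # from the mean loss:   mean((pred - target)**2)

assert np.allclose(grad_sum, m * grad_mean)  # updates differ by exactly a factor of m
```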
# Stochasticity, Frame Skipping, and Gym Defaults
Many (good) blog posts on reinforcement learning show a model being trained using "(Game Name)-v0", so you might decide to use that ROM as you've seen it done before. So far so good. Then you read the various papers, see a technique called "frame skipping", where you stack $n$ output screens from the emulator into a single $n$-by-length-by-width image and pass that to the model, and you implement it thinking everything is going according to plan. It's not. Depending on your version of Gym, you can run into trouble.
In older versions of Gym, the Atari environment *randomly* [repeated your action for 2-4 steps](https://github.com/openai/gym/blob/bde0de609d3645e76728b3b8fc2f3bf210187b27/gym/envs/atari/atari_env.py#L69-L71) and returned the resulting frame. In terms of your frame skip implementation, this is what happens:
```
for k in range(frame_skip):
    # in older Gym versions, 2-4 emulator steps occur here with the given action
    obs, reward, is_terminal, info = env.step(action)
```
If you implement frame skipping with n=4, it's possible that your learner is seeing only every 8th-16th (or later!) frame rather than every 4th. You can imagine the impact on performance. Thankfully, this has since been made [optional](https://github.com/openai/gym/blob/master/gym/envs/atari/atari_env.py#L75-L80) via new ROMs, which we will mention in a second. However, there is another setting to be aware of: repeat_action_probability. For "(Game Name)-v0" ROMs, this is on by default. It is the probability that the game will ignore a new action and repeat the previous action at each time step. To remove both the frame skip and the repeat action probability, use the "(Game Name)NoFrameskip-v4" ROMs. A full description of these settings can be found [here](https://github.com/openai/gym/blob/5cb12296274020db9bb6378ce54276b31e7002da/gym/envs/__init__.py#L298-L376).
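For example, assuming a reasonably recent Gym install, selecting the deterministic ROM is just a matter of the environment id (Pong here is only an illustration):

```
import gym

# "-v0" ROMs randomly repeat each action for 2-4 emulator steps and have a
# nonzero repeat_action_probability; the NoFrameskip-v4 ROMs disable both,
# so your own frame-skip loop sees exactly the frames you expect.
env = gym.make("PongNoFrameskip-v4")
obs = env.reset()
```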
I would be remiss not to point out that Gym is not doing this just to ruin your DQN. There is a legitimate reason for it, but the setting can lead to *unending* frustration when your neural network is ramming itself into the boundary of the stage rather than hitting the ping pong pixel-square. The reason is to introduce stochasticity into the environment. Otherwise, the game would be deterministic and your network would simply be memorizing a series of steps like a dance. When using a NoFrameskip ROM, you have to introduce your own stochasticity to avoid said dance. The Nature paper (and many libraries) do this via the "null op max" setting. At the beginning of each episode (i.e., a round for a game of Pong), the agent performs a series of $k$ consecutive null operations (action=0 for the Atari emulator in Gym), where $k$ is an integer sampled uniformly from $[0, \text{null op max}]$. It can be implemented by the following pseudo-code at the start of an episode:
```
obs = env.reset()
# Perform a null operation to make game stochastic
for k in range(np.random.randint(0, null_op_max + 1)):  # k sampled uniformly from [0, null_op_max]
    obs, reward, is_terminal, info = env.step(0)  # action 0 is the null op
```

# Error Clipping, Gradient Clipping, and Reward Clipping

There are several different types of clipping going on in the Nature paper, and each can easily be confused and implemented incorrectly. In fact, if you thought error clipping and gradient clipping were different, you're already confused!
## What is Gradient Clipping?
The Nature paper states it is helpful to "clip the error term". The community seems to have rejected "error clipping" in favor of the term "gradient clipping". Without knowing the background, the term can be ambiguous. In both instances, there is actually no clipping of the loss function, of the error, or of the gradient. Really, they are choosing a loss function whose gradient does not grow with the size of the error past a certain region, thus limiting the size of the gradient update for large errors. In particular, when the magnitude of the error is greater than 1, they switch to an absolute value-like loss. Why? Let's look at the derivative!
This term represents the derivative of a mean squared error-like loss function:

$$\frac{d}{dx}\,\frac{1}{2}(x-y)^2 = (x-y)$$
Now compare that to the term resulting from an absolute value-like function when $x-y>0$:

$$\frac{d}{dx}(x-y) = 1$$
If we think of the error as $x-y$, we can see one gradient update would contain $x-y$ while the other does not. There isn't really a good, catchy phrase to describe the above mathematical trick that is more representative than "gradient clipping".
The standard way to accomplish this is to use the [Huber loss](https://en.wikipedia.org/wiki/Huber_loss) function, defined as follows:
$$f(x) = \begin{cases} \frac{1}{2}x^2 & \text{if } |x| \le \delta, \\ \delta\left(|x| - \frac{1}{2}\delta\right) & \text{otherwise.} \end{cases}$$
There is a common trick to implement this so that symbolic mathematics libraries like theano and tensorflow can take the derivative more easily, without a switch statement. That trick is outlined and coded below and is employed in most implementations.

The function you actually code is as follows: let $q = \min(|x|, \delta)$. Then

$$f(x) = \frac{1}{2}q^2 + \delta\left(|x| - q\right).$$
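A minimal sketch of this trick as a Keras loss function (the function name and the mean reduction over the batch are my choices; the original implementation may differ):

```
import keras.backend as K

def huber_loss(y_true, y_pred, delta=1.0):
    error = y_true - y_pred
    q = K.minimum(K.abs(error), delta)      # q = min(|x|, delta)
    # 0.5*q^2 is the quadratic part, delta*(|x| - q) the linear part;
    # no switch statement is needed, so the symbolic gradient is straightforward.
    return K.mean(0.5 * K.square(q) + delta * (K.abs(error) - q))
```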
As you can see, the slope (the derivative) is "clipped" so that it is never bigger in magnitude than $\delta$ (which is 1 in the Nature paper's setup). An interesting point is that the second derivative is not continuous:
$$f''(x) = \begin{cases} 1 & \text{if } |x| \le \delta, \\ 0 & \text{otherwise.} \end{cases}$$
This could cause problems for second order methods of gradient descent, which is why some suggest a pseudo-Huber loss function, a smooth approximation of the Huber loss. However, the Huber loss is sufficient for our goals.
## What Reward Do I Clip?
This is actually a much subtler question when you introduce frame skipping into the equation. Do you take the last reward returned from the emulator? But what if there was a reward during the skipped frames? When you're so far down in the weeds of neural network implementation, it's surprising when the answer comes from the basic [Q learning](https://en.wikipedia.org/wiki/Q-learning) framework. In Q learning, one uses the total reward received since the last action, and this is what the paper did as well. This means that even on frames you skipped, you need to record the observed reward and aggregate the rewards into the "reward" for a given transition $(s_t, a_t, r_t, s_{t+1})$. This cumulative reward is what you clip. In my implementation, you can see this in the step function of the TrainingEnvironment class. The pseudo code is below:
```
total_reward = 0
for k in range(frame_skip):
    obs, reward, is_terminal, info = env.step(action)  # one step in the emulator
    total_reward += reward
```
This total reward is what is stored in the experience replay memory.
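The clipping step itself is then a one-liner. A sketch following the Nature paper's clip to $[-1, 1]$ (some implementations use np.sign instead, which is equivalent when rewards are integer score changes):

```
import numpy as np

clipped_reward = np.clip(total_reward, -1.0, 1.0)  # this is what goes into the replay memory
```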
# Memory Consideration
The "experience replay memory" is a store from which training examples $(s_t, a_t, r_t, s_{t+1})$ are sampled in order to break the time correlation of the data points, which keeps the network's learning from diverging. The original paper uses a replay memory of a million examples. Both the $s_t$'s and the $s_{t+1}$'s are $n$-by-length-by-width images, so you can imagine that the memory requirements become very large. This is one of the cases where knowing more about programming and types while working in Python can be useful. By default, Gym returns screens as numpy arrays of dtype uint8. If you do any processing of the images, it's likely that your images will end up as float32. So it's important to make sure your images are stored as uint8 for space reasons, and that any transformations needed by the neural network (like scaling) are done at training time rather than before storing the states in the replay memory.
It is also helpful to pre-allocate this replay memory; my implementation does this when the memory is constructed.
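A minimal sketch of what such pre-allocation might look like (the array names, memory_size, and screen dimensions are illustrative placeholders, not the exact code from my repository):

```
import numpy as np

memory_size = 1000000                    # number of stored transitions
phi_length, height, width = 4, 84, 84    # n stacked screens of 84x84

# Store individual screens as uint8; a state is reconstructed at sample time
# by stacking phi_length consecutive screens, which avoids storing each
# screen phi_length times.
screens   = np.zeros((memory_size, height, width), dtype=np.uint8)
actions   = np.zeros(memory_size, dtype=np.int32)
rewards   = np.zeros(memory_size, dtype=np.float32)
terminals = np.zeros(memory_size, dtype=np.bool_)
```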
Of course, phi_length is the $n$ from our previous discussion: the number of screens from the emulator stacked together to form a state.
# Debugging In Circles
With so many moving parts and parameters, there's a lot that can go wrong in an implementation. My recommendation is to start with the same parameters as the original paper to rule out one source of error. Most of the changes from the original NIPS paper to the Nature paper were made to standardize learning parameters and performance across different Atari games. There are many quirks that arise from game to game, and most of the techniques in the Nature paper guard against them. For example, the consecutive max over frames is used to deal with screen flickering, which can cause objects to disappear under certain frame skip settings. Once you have set the parameters, here are a few things to check when your network is not learning well (or at all):
- Check whether your Q-value output jumps in size between batch updates; this means your gradient update is large. Take a look at your learning rate and investigate your gradient to find the problem.
- Look at the states you are sending to the neural network. Does the frame skipping/consecutive max seem to be correct? Are your future states different from your current states? Do you see a logical progression of images? If not, you may have some memory reference issues, which can happen when you try to conserve memory in Python.
- Verify that your fixed target network's weights are actually fixed. In Python, it's easy to accidentally make the fixed network point to the same objects in memory, and then your fixed target network isn't actually fixed! (See the sketch after this list.)
- If you find your implementation is not working on a certain game, test your code on a simpler game like Pong. Your implementation may be fine, but you don’t have enough of the cutting-edge advances to learn harder games! Explore Double DQN, Dueling Q Networks, and prioritized experience replay.
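Regarding the fixed target network point above, here is a minimal Keras sketch of giving the target network its own copy of the weights (model is assumed to be your online Q-network, and a reasonably recent Keras is assumed for clone_model):

```
from keras.models import clone_model

# Build a structurally identical network, then copy the weight *values*;
# sharing layer objects would silently un-fix the target network.
target_model = clone_model(model)
target_model.set_weights(model.get_weights())

# Every C parameter updates, refresh the frozen copy the same way:
# target_model.set_weights(model.get_weights())
```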
# Conclusion
Implementing the DeepMind paper is a rewarding first step into modern advancements in reinforcement learning. If you work mostly in a more traditional supervised learning setting, many of the common ideas and tricks may seem foreign at first. Deep Q Networks are just neural networks after all, and many of the techniques used to stabilize learning can be applied to traditional uses as well. The most interesting point to me is the topic of regularization. If you noticed, we did not use Dropout or $L_2$ or $L_1$ penalties, the traditional approaches to stabilize training for neural networks. Instead, the same goal of regularization is accomplished through how the mini batches are selected (experience replay) and how future rewards are defined (using a fixed target network). That is one of the most exciting and novel theoretical contributions of the original Nature paper, and it has continued to be iterated upon and refined in subsequent papers. When you want to further improve your implementation, you should investigate the next iteration of these techniques: [prioritized experience replay](https://arxiv.org/abs/1511.05952), [Double DQN](https://arxiv.org/abs/1509.06461), and [Dueling Q Networks](https://arxiv.org/abs/1511.06581). The (mostly) current standard comes from a modification that allows asynchronous learning, called [A3C](https://arxiv.org/abs/1602.01783).
# Scientists Can Read a Bird’s Brain and Predict Its Next Song
Original link: [Scientists Can Read a Bird’s Brain and Predict Its Next Song](https://www.technologyreview.com/s/609032/scientists-can-read-a-birds-brain-and-predict-its-next-song/?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com)
## Next up, predicting human speech with a brain-computer interface.
Entrepreneurs in Silicon Valley this year set themselves an audacious new goal: creating a brain-reading device that would allow people to effortlessly send texts with their thoughts.
In April, Elon Musk announced a secretive new brain-interface company called [Neuralink](https://www.technologyreview.com/s/604254/with-neuralink-elon-musk-promises-human-to-human-telepathy-dont-believe-it/). Days later, Facebook CEO Mark Zuckerberg declared that “direct brain interfaces [are] going to, eventually, let you communicate only with your mind.” The company says it has [60 engineers](https://www.technologyreview.com/s/604229/facebooks-sci-fi-plan-for-typing-with-your-mind-and-hearing-with-your-skin/) working on the problem.
It’s an ambitious quest—and there are reasons to think it won’t happen anytime soon. But for at least one small, orange-beaked bird, the zebra finch, the dream just became a lot closer to reality.
That’s thanks to some nifty work by Timothy Gentner and his students at the University of California, San Diego, who built a brain-to-tweet interface that figures out the song a finch is going to sing a fraction of a second before it does so.
“We decode realistic synthetic birdsong directly from neural activity,” the scientists announced in a new report published on the website [bioRxiv](https://www.biorxiv.org/content/early/2017/09/27/193987). The team, which includes Argentinian birdsong expert Ezequiel Arneodo, calls the system the first prototype of “a decoder of complex, natural communication signals from neural activity.” A similar approach could fuel advances towards a human thought-to-text interface, the researchers say.
Scientists say they can predict the song of a finch directly from its brain activity.
A songbird’s brain is none too large. But its vocalizations are similar to human speech in ways that make these birds a favorite of scientists studying memory and cognition. Their songs are complex. And, like human language, they’re learned. The zebra finch learns its call from an older bird.
Makoto Fukushima, a fellow at the National Institutes of Health who has used brain interfaces to study the simpler grunts and coos made by monkeys, says the richer range of birdsong is why the new results have “important implications for application in human speech.”
Current brain interfaces tried in humans mostly track neural signals that reflect a person’s imagined arm movements, which can be coopted to move a robot or direct a cursor to very slowly peck out letters. So the idea of a helmet or brain implant that can effortlessly pick up what you’re trying to say remains pretty far from being realized.
But it’s not strictly impossible, as the new study shows. The team at UCSD used silicon electrodes in awake birds to measure the electrical chatter of neurons in part of the brain called the sensory-motor nucleus, where “commands that shape the production of learned song” originate.
The experiment employed neural-network software, a type of machine learning. The researchers fed into the program both the pattern of neural firing and the actual song that resulted, with its stops and starts and changing frequencies. The idea was to train their software to match one to the other, in what they termed “neural-to-song spectrum mappings.”
The team’s main innovation was to simplify the brain-to-tweet translation by incorporating a physical model of how finches make noise. Birds don’t have vocal cords as people do; instead, they shoot air over a vibrating surface in their throat, called a syrinx. Think of how you can make a high-pitched whine by putting two pieces of paper together and blowing at the edge.
The final result, say the authors: “We decode realistic synthetic birdsong directly from neural activity.” In their report, the team says it can predict what the bird will sing about 30 milliseconds before it does so.
You can listen to results yourself in the audio below. Keep in mind that the zebra finch is no nightingale. Its song is more like a staccato quacking.
The same song, as predicted from neural recordings inside the finch’s brain.
Songbirds are already an important research model. At Elon Musk’s Neuralink, bird scientists were among the first key hires. And UCSD’s trick of focusing on detecting the muscle movements behind speech may also be a key development.
Facebook has said it hopes people will be able to type directly from their brains at 100 words per minute, privately sending texts whenever they want. A device able to read the commands your brain sends out to muscles while you are engaged in subvocal utterances (silent speech) is probably a lot more realistic than one that reads “thoughts.”
Gentner and his team hope their finches will help make it possible. “We have demonstrated a [brain-machine interface] for a complex communication signal, using an animal model for human speech,” they write. They add that “our approach also provides a valuable proving ground for biomedical speech-prosthetic devices.”
In other words, we’re a little closer to texting from our brains.
# TensorFlow* Optimizations on Modern Intel® Architecture
Original link: [TensorFlow* Optimizations on Modern Intel® Architecture](https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture?from=hackcv&hmsr=hackcv.com&utm_medium=hackcv.com&utm_source=hackcv.com)
**Intel: Elmoustapha Ould-Ahmed-Vall, Mahmoud Abuzaina, Md Faijul Amin, Jayaram Bobba, Roman S Dubtsov, Evarist M Fomenko, Mukesh Gangadhar, Niranjan Hasabnis, Jing Huang, Deepthi Karkada, Young Jin Kim, Srihari Makineni, Dmitri Mishura, Karthik Raman, AG Ramesh, Vivek V Rane, Michael Riera, Dmitry Sergeev, Vamsi Sripathi, Bhavani Subramanian, Lakshay Tokas, Antonio C Valles**
**Google: Andy Davis, Toby Boyd, Megan Kacholia, Rasmus Larsen, Rajat Monga, Thiru Palanisamy, Vijay Vasudevan, Yao Zhang**
TensorFlow* is a leading deep learning and machine learning framework, which makes it important for Intel and Google to ensure that it is able to extract maximum performance from Intel’s hardware offering. This paper introduces the Artificial Intelligence (AI) community to TensorFlow optimizations on Intel® Xeon® and Intel® Xeon Phi™ processor based platforms. These optimizations are the fruit of a close collaboration between Intel and Google engineers announced last year by Intel’s Diane Bryant and Google’s Diane Green at the first Intel AI Day.
We describe the various performance challenges we encountered during this optimization exercise and the solutions adopted. We also report performance improvements on a sample of common neural network models. These optimizations can result in orders of magnitude higher performance. For example, our measurements show up to 70x higher performance for training and up to 85x higher performance for inference on the Intel® Xeon Phi™ processor 7250. Although measured on Intel® Xeon® processor E5 v4 (BDW) and Intel Xeon Phi processor 7250 based platforms, these optimizations lay the foundation for next generation products from Intel. In particular, users are expected to see improved performance on Intel Xeon Scalable processors.
Optimizing deep learning models performance on modern CPUs presents a number of challenges not very different from those seen when optimizing other performance-sensitive applications in High Performance Computing (HPC):
1. Code refactoring needed to take advantage of modern vector instructions. This means ensuring that all the key primitives, such as convolution, matrix multiplication, and batch normalization are vectorized to the latest SIMD instructions (AVX2 for Intel Xeon processors and AVX512 for Intel Xeon Phi processors).
2. Maximum performance requires paying special attention to using all the available cores efficiently. Again this means looking at parallelization within a given layer or operation as well as parallelization across layers.
3. As much as possible, data has to be available when the execution units need it. This means balanced use of prefetching, cache blocking techniques and data formats that promote spatial and temporal locality.
To meet these requirements, Intel developed a number of optimized deep learning primitives that can be used inside the different deep learning frameworks to ensure that we implement common building blocks efficiently. In addition to matrix multiplication and convolution, these building blocks include:
- Direct batched convolution
- Inner product
- Pooling: maximum, minimum, average
- Normalization: local response normalization across channels (LRN), batch normalization
- Activation: rectified linear unit (ReLU)
- Data manipulation: multi-dimensional transposition (conversion), split, concat, sum and scale.
Refer to this [article](https://software.intel.com/en-us/articles/introducing-dnn-primitives-in-intelr-mkl) for more details on these Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) optimized primitives.
In TensorFlow, we implemented Intel-optimized versions of operations to make sure that these operations can leverage Intel MKL-DNN primitives wherever possible. While this is a necessary step to enable scalable performance on Intel® architecture, we also had to implement a number of other optimizations. In particular, Intel MKL uses a different layout than TensorFlow's default layout for performance reasons. We needed to ensure that the overhead of converting between the two formats is kept to a minimum. We also wanted to ensure that data scientists and other TensorFlow users don't have to change their existing neural network models to take advantage of these optimizations.
We introduced a number of graph optimization passes to:
1. Replace default TensorFlow operations with Intel optimized versions when running on CPU. This ensures that users can run their existing Python programs and realize the performance gains without changes to their neural network model.
2. Eliminate unnecessary and costly data layout conversions.
3. Fuse multiple operations together to enable efficient cache reuse on CPU.
4. Handle intermediate states that allow for faster backpropagation.
These graph optimizations enable greater performance without introducing any additional burden on TensorFlow programmers. Data layout optimization is a key performance optimization. Oftentimes, the native TensorFlow data format is not the most efficient data layout for certain tensor operations on CPUs. In such cases, we insert a data layout conversion operation from TensorFlow's native format to an internal format, perform the operation on CPU, and convert the operation's output back to the TensorFlow format. However, these conversions introduce a performance overhead and should be minimized. Our data layout optimization identifies sub-graphs that can be entirely executed using Intel MKL optimized operations and eliminates the conversions within the operations in the sub-graph. Automatically inserted conversion nodes take care of data layout conversions at the boundaries of the sub-graph. Another key optimization is the fusion pass that automatically fuses operations that can be run efficiently as a single Intel MKL operation.
## Other Optimizations
We have also tweaked a number of TensorFlow framework components to enable the highest CPU performance for various deep learning models. We developed a custom pool allocator based on the existing pool allocator in TensorFlow. Our custom pool allocator ensures that both TensorFlow and Intel MKL share the same memory pools (using the Intel MKL imalloc functionality) and that we don't return memory prematurely to the operating system, thus avoiding costly page misses and page clears. In addition, we carefully tuned multiple threading libraries (pthreads used by TensorFlow and OpenMP used by Intel MKL) to coexist and not compete with each other for CPU resources.
## Performance Experiments
Our optimizations such as the ones discussed above resulted in dramatic performance improvements on both Intel Xeon and Intel Xeon Phi platforms. To illustrate the performance gains we report below our best known methods (or BKMs) together with baseline and optimized performance numbers for three common [ConvNet benchmarks](https://github.com/soumith/convnet-benchmarks).
The following parameters are important for performance on Intel Xeon (codename Broadwell) and Intel Xeon Phi (codename Knights Landing) processors, and we recommend tuning them for your specific neural network model and platform. We carefully tuned these parameters to gain maximum performance for convnet-benchmarks on both Intel Xeon and Intel Xeon Phi processors; a short sketch of where each knob is set follows the list.
1. Data format: we suggest that users specify the NCHW format for their specific neural network model to get maximum performance. TensorFlow's default NHWC format is not the most efficient data layout for CPU and it results in some additional conversion overhead.
2. Inter-op / intra-op: we also suggest that data scientists and users experiment with the intra-op and inter-op parameters in TensorFlow to find the optimal setting for each model and CPU platform. These settings impact parallelism within one layer as well as across layers.
3. Batch size: batch size is another important parameter that impacts both the available parallelism to utilize all the cores as well as working set size and memory performance in general.
4. OMP_NUM_THREADS: maximum performance requires using all the available cores efficiently. This setting is especially important for performance on Intel Xeon Phi processors since it controls the level of hyperthreading (1 to 4).
5. Transpose in matrix multiplication: for some matrix sizes, transposing the second input matrix b provides better performance (better cache reuse) in the MatMul layer. This is the case for all the MatMul operations used in the three models below. Users should experiment with this setting for other matrix sizes.
6. KMP_BLOCKTIME: users should experiment with various settings for how much time each thread should wait after completing the execution of a parallel region, in milliseconds.
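A minimal sketch of where these knobs live, assuming TensorFlow 1.x; the values below are placeholders to show the mechanism, not tuned recommendations:

```
import os
import tensorflow as tf

# OpenMP settings used by Intel MKL (set before the first TensorFlow op runs)
os.environ["OMP_NUM_THREADS"] = "16"   # use all available physical cores
os.environ["KMP_BLOCKTIME"] = "1"      # ms a thread waits after a parallel region

# inter_op = parallelism across independent ops/layers,
# intra_op = parallelism within a single op (e.g., one convolution)
config = tf.ConfigProto(
    inter_op_parallelism_threads=2,
    intra_op_parallelism_threads=16,
)

with tf.Session(config=config) as sess:
    # Build the model here, passing data_format="NCHW" to ops that support it
    # (e.g., tf.nn.conv2d) instead of the default NHWC.
    pass
```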
#### Example settings on Intel® Xeon® processor (codename Broadwell - 2 Sockets - 22 Cores)
Performance results with different batch sizes on Intel® Xeon® processor (codename Broadwell) and Intel® Xeon Phi™ processor (codename Knights Landing) - Training
You can either install pre-built binary packages with pip or conda by following the directions in [Intel Optimized TensorFlow Wheel Now Available](https://software.intel.com/en-us/articles/intel-optimized-tensorflow-wheel-now-available#), or you can build from sources following the directions below:
1. Run "./configure" from the TensorFlow source directory, and it will download latest Intel MKL for machine learning automatically in tensorflow/third_party/mkl/mklml if you select the options to use Intel MKL.
2. Execute the following commands to create a pip package that can be used to install the optimized TensorFlow build.
- PATH can be changed to point to a specific version of the GCC compiler:
export PATH=/PATH/gcc/bin:$PATH
- LD_LIBRARY_PATH can also be changed to point to a new GLIBC (the path below is a placeholder):
export LD_LIBRARY_PATH=/PATH/glibc/lib:$LD_LIBRARY_PATH
Optimizing TensorFlow means deep learning applications built using this widely available and widely applied framework can now run much faster on Intel processors to increase flexibility, accessibility, and scale. The Intel Xeon Phi processor, for example, is designed to scale out in a near-linear fashion across cores and nodes to dramatically reduce the time to train machine learning models. And TensorFlow can now scale with future performance advancements as we continue enhancing the performance of Intel processors to handle even bigger and more challenging AI workloads.
The collaboration between Intel and Google to optimize TensorFlow is part of ongoing efforts to make AI more accessible to developers and data scientists, and to enable AI applications to run wherever they’re needed on any kind of device—from the edge to the cloud. Intel believes this is the key to creating the next-generation of AI algorithms and models to solve the most pressing problems in business, science, engineering, medicine, and society.
This collaboration already resulted in dramatic performance improvements on leading Intel Xeon and Intel Xeon Phi processor-based platforms. These improvements are now readily available through [Google’s TensorFlow GitHub repository](https://github.com/tensorflow/tensorflow.git). We are asking the AI community to give these optimizations a try and are looking forward to feedback and contributions that build on them.