Reinforcement Learning

Deep reinforcement learning (RL) has an ever-increasing number of success stories, ranging from realistic simulated environments to robotics and games. Experience Replay (ER) enhances RL algorithms by using information collected in past policy iterations to compute updates for the current policy. ER has become one of the mainstay techniques for improving the sample efficiency of off-policy deep RL.

ER recalls experiences from past iterations to compute gradient estimates for the current policy, increasing data efficiency. However, the accuracy of such updates may deteriorate when the policy diverges from past behaviors, which can undermine the performance of ER. Many algorithms mitigate this issue by tuning hyper-parameters to slow down policy changes. An alternative is to actively enforce similarity between the current policy and the experiences in the replay memory. We introduce Remember and Forget Experience Replay (ReF-ER), a novel method that can enhance RL algorithms with parameterized policies. ReF-ER (1) skips gradients computed from experiences that are too unlikely under the current policy and (2) regulates policy changes within a trust region of the replayed behaviors. We couple ReF-ER with Q-learning, deterministic policy gradient and off-policy gradient methods. We find that ReF-ER consistently improves the performance of continuous-action, off-policy RL on fully observable benchmarks and partially observable flow control problems.
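
The sketch below illustrates these two rules: an importance weight is used to "forget" far-policy samples, and a Kullback-Leibler penalty keeps the policy within a trust region of the replayed behaviors. This is a minimal sketch in PyTorch, not the paper's implementation: the policy network, the replay batch layout, and the simple importance-weighted surrogate are illustrative placeholders, while the names c_max, D, beta and eta follow the paper's notation.

```python
# Minimal ReF-ER sketch, assuming a Gaussian policy and a replay batch that
# stores the behavior policy's mean/std alongside states, actions and returns.
import torch
from torch.distributions import Normal, kl_divergence

c_max, D, eta = 4.0, 0.1, 1e-4   # trust-region width, far-policy tolerance, annealing rate

def refer_loss(policy_net, batch, beta):
    mu = Normal(batch["beh_mean"], batch["beh_std"])   # replayed behavior policy
    mean, std = policy_net(batch["states"])            # placeholder policy network
    pi = Normal(mean, std)                             # current policy

    # Rule 1 ("forget"): per-sample importance weight; mask far-policy samples.
    rho = torch.exp(pi.log_prob(batch["actions"]).sum(-1)
                    - mu.log_prob(batch["actions"]).sum(-1))
    near = ((rho > 1.0 / c_max) & (rho < c_max)).float()

    # Off-policy gradient term: a simple importance-weighted surrogate here,
    # standing in for the Q-learning / DPG / off-policy PG objectives coupled
    # with ReF-ER in the paper.
    surrogate = -(rho * batch["returns"]) * near

    # Rule 2 ("remember"): penalty keeping pi inside a trust region of mu.
    penalty = kl_divergence(mu, pi).sum(-1)

    loss = (beta * surrogate + (1.0 - beta) * penalty).mean()
    far_fraction = 1.0 - near.mean().item()
    return loss, far_fraction

def update_beta(beta, far_fraction):
    # Anneal beta so the fraction of far-policy samples stays near D:
    # too many far-policy samples -> smaller beta -> stronger penalty.
    if far_fraction > D:
        return (1.0 - eta) * beta
    return (1.0 - eta) * beta + eta
```

In use, `refer_loss` would be evaluated on each replayed mini-batch and `update_beta` called after every gradient step, so the strength of the trust-region penalty adapts to how far the policy has drifted from the stored behaviors.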

Contours of the vorticity field (red and blue for anticlockwise and clockwise rotation, respectively) for a 2D flow control problem: the D-section cylinder moves leftward, and the agent is marked by A, with the applied control force and torque highlighted. On the right, the returns obtained by V-Racer (red), ACER (purple), DDPG with ER (blue) and DDPG with ReF-ER (green). V-Racer outperforms all other methods in this fluid mechanics application, and the ReF-ER extension improves the performance of DDPG. (ACER and DDPG are other state-of-the-art RL methods.)

We can use reinforcement learning algorithms to study the behavior of agents in fluid mechanics applications, where obtaining data from interactions with the environment is extremely costly and data-efficient methods such as V-Racer are imperative. In the following video, V-Racer trains a fish agent swimming behind a cylinder, which learns to swim so as to optimize its efficiency (reward). One of the strategies the agent learns is to slalom between the vortices, harvesting energy from the flow.

2019

  • G. Novati, L. Mahadevan, and P. Koumoutsakos, “Controlled gliding and perching through deep-reinforcement-learning,” Phys. Rev. Fluids, vol. 4, iss. 9, 2019.

BibTeX

@article{novati2019b,
author = {Guido Novati and L. Mahadevan and Petros Koumoutsakos},
doi = {10.1103/physrevfluids.4.093902},
journal = {{Phys. Rev. Fluids}},
month = {sep},
number = {9},
publisher = {American Physical Society ({APS})},
title = {Controlled gliding and perching through deep-reinforcement-learning},
url = {https://cse-lab.seas.harvard.edu/files/cse-lab/files/novati2019b.pdf},
volume = {4},
year = {2019}
}

Abstract

Controlled gliding is one of the most energetically efficient modes of transportation for natural and human powered fliers. Here we demonstrate that gliding and landing strategies with different optimality criteria can be identified through deep-reinforcement-learning without explicit knowledge of the underlying physics. We combine a two-dimensional model of a controlled elliptical body with deep-reinforcement-learning (D-RL) to achieve gliding with either minimum energy expenditure, or fastest time of arrival, at a predetermined location. In both cases the gliding trajectories are smooth, although energy/time optimal strategies are distinguished by small/high frequency actuations. We examine the effects of the ellipse’s shape and weight on the optimal policies for controlled gliding. We find that the model-free reinforcement learning leads to more robust gliding than model-based optimal control strategies with a modest additional computational cost. We also demonstrate that the gliders with D-RL can generalize their strategies to reach the target location from previously unseen starting positions. The model-free character and robustness of D-RL suggests a promising framework for developing robotic devices capable of exploiting complex flow environments.
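
To make the two optimality criteria concrete, a per-step reward for such a glider could be shaped as in the hypothetical sketch below; the scaling constants and the terminal bonus are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical reward shaping for the two gliding objectives described above.
# `energy_spent` is the actuation energy used during the step and `dt` its
# duration; the terminal bonus for reaching the target is an assumption.
def step_reward(energy_spent, dt, reached_target, criterion="energy"):
    if criterion == "energy":
        r = -energy_spent      # minimum-energy gliding: penalize actuation effort
    else:
        r = -dt                # fastest-arrival gliding: penalize elapsed time
    if reached_target:
        r += 10.0              # bonus for arriving at the predetermined location
    return r
```
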
  • G. Novati and P. Koumoutsakos, “Remember and forget for experience replay,” in Proceedings of the 36th international conference on machine learning, 2019.

BibTeX

@inproceedings{novati2019a,
author = {Novati, Guido and Koumoutsakos, Petros},
booktitle = {Proceedings of the 36th International Conference on Machine Learning},
title = {Remember and Forget for Experience Replay},
url = {https://cse-lab.seas.harvard.edu/files/cse-lab/files/novati2019a.pdf},
year = {2019}
}

Abstract

Experience replay (ER) is a fundamental component of off-policy deep reinforcement learning (RL). ER recalls experiences from past iterations to compute gradient estimates for the current policy, increasing data-efficiency. However, the accuracy of such updates may deteriorate when the policy diverges from past behaviors and can undermine the performance of ER. Many algorithms mitigate this issue by tuning hyper-parameters to slow down policy changes. An alternative is to actively enforce the similarity between policy and the experiences in the replay memory. We introduce Remember and Forget Experience Replay (ReF-ER), a novel method that can enhance RL algorithms with parameterized policies. ReF-ER (1) skips gradients computed from experiences that are too unlikely with the current policy and (2) regulates policy changes within a trust region of the replayed behaviors. We couple ReF-ER with Q-learning, deterministic policy gradient and off-policy gradient methods. We find that ReF-ER consistently improves the performance of continuous-action, off-policy RL on fully observable benchmarks and partially observable flow control problems.

2017

  • G. Novati, S. Verma, D. Alexeev, D. Rossinelli, W. M. van Rees, and P. Koumoutsakos, “Synchronisation through learning for two self-propelled swimmers,” Bioinspir. Biomim., vol. 12, iss. 3, p. 36001, 2017.

BibTeX

@article{novati2017a,
author = {Guido Novati and Siddhartha Verma and Dmitry Alexeev and Diego Rossinelli and Wim M van Rees and Petros Koumoutsakos},
doi = {10.1088/1748-3190/aa6311},
journal = {{Bioinspir. Biomim.}},
month = {mar},
number = {3},
pages = {036001},
publisher = {{IOP} Publishing},
title = {Synchronisation through learning for two self-propelled swimmers},
url = {https://cse-lab.seas.harvard.edu/files/cse-lab/files/novati2017a.pdf},
volume = {12},
year = {2017}
}

Abstract

The coordinated motion by multiple swimmers is a fundamental component in fish schooling. The flow field induced by the motion of each self-propelled swimmer implies non-linear hydrodynamic interactions among the members of a group. How do swimmers compensate for such hydrodynamic interactions in coordinated patterns? We provide an answer to this riddle through simulations of two self-propelled, fish-like bodies that employ a learning algorithm to synchronise their swimming patterns. We distinguish between learned motion patterns and the commonly used a-priori specified movements that are imposed on the swimmers without feedback from their hydrodynamic interactions. First, we demonstrate that two rigid bodies executing pre-specified motions, with an alternating leader and follower, can result in substantial drag reduction and intermittent thrust generation. In turn, we study two self-propelled swimmers arranged in a leader-follower configuration, with a-priori specified body deformations. These two self-propelled swimmers do not sustain their tandem configuration. The follower experiences either an increase or decrease in swimming speed, depending on the initial conditions, while the swimming of the leader remains largely unaffected. This indicates that a-priori specified patterns are not sufficient to sustain synchronised swimming. We then examine a tandem of swimmers where the leader has a steady gait and the follower learns to synchronise its motion, to overcome the forces induced by the leader's vortex wake. The follower employs reinforcement learning to adapt its swimming kinematics so as to minimize its lateral deviations from the leader's path. Swimming in such a sustained synchronised tandem yields up to a 30% reduction in energy expenditure for the follower, in addition to a 20% increase in its swimming efficiency. The present results show that two self-propelled swimmers can be synchronised by adapting their motion patterns to compensate for flow-structure interactions. Moreover, swimmers can exploit the vortical structures of their flow field so that synchronised swimming is energetically beneficial.
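
The follower's learning objective described above can be made concrete with a simple reward signal. The sketch below is a hypothetical formulation that rewards staying close to the leader's path; the normalization by body length is an illustrative assumption rather than the paper's exact definition.

```python
# Hypothetical per-step reward for the follower: the smaller its lateral
# deviation from the leader's path, the larger the reward.
def follower_reward(lateral_deviation: float, body_length: float) -> float:
    return 1.0 - abs(lateral_deviation) / body_length
```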